public inbox for libc-alpha@sourceware.org
* [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
@ 2022-06-15  0:25 Noah Goldstein
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15  0:25 UTC (permalink / raw)
  To: libc-alpha

Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that `rep_movsb_stop_threshold` (which is used
to redirect control flow to the non-temporal case) will honor any
user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold).
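
As a quick illustration, a minimal standalone sketch of the ordering
bug (not glibc code; get_user_tunable is a hypothetical stand-in for
TUNABLE_GET and all values are made up): any value derived from
`non_temporal_threshold` before the tunables are read can never see
the user override.

  #include <stdio.h>

  /* Hypothetical stand-in for TUNABLE_GET: pretend the user set
     glibc.cpu.x86_non_temporal_threshold to 1 MiB.  */
  static unsigned long
  get_user_tunable (void)
  {
    return 1UL << 20;
  }

  int
  main (void)
  {
    unsigned long non_temporal_threshold = 0xc0000; /* computed default */

    /* Old (buggy) order: derive the stop threshold first...  */
    unsigned long stop_old = non_temporal_threshold;

    /* ...then apply the user tunable, too late to matter.  */
    unsigned long user = get_user_tunable ();
    if (user != 0)
      non_temporal_threshold = user;

    /* Fixed order: derive the stop threshold after the tunables.  */
    unsigned long stop_new = non_temporal_threshold;

    printf ("old order: %#lx, new order: %#lx\n", stop_old, stop_new);
    return 0;
  }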
---
 sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index f64a2fb0ba..cc3b840f9c 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
     rep_movsb_threshold = 2112;
 
-  unsigned long int rep_movsb_stop_threshold;
-  /* ERMS feature is implemented from AMD Zen3 architecture and it is
-     performing poorly for data above L2 cache size. Henceforth, adding
-     an upper bound threshold parameter to limit the usage of Enhanced
-     REP MOVSB operations and setting its value to L2 cache size.  */
-  if (cpu_features->basic.kind == arch_kind_amd)
-    rep_movsb_stop_threshold = core;
-  /* Setting the upper bound of ERMS to the computed value of
-     non-temporal threshold for architectures other than AMD.  */
-  else
-    rep_movsb_stop_threshold = non_temporal_threshold;
-
   /* The default threshold to use Enhanced REP STOSB.  */
   unsigned long int rep_stosb_threshold = 2048;
 
@@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 			   SIZE_MAX);
 #endif
 
+  unsigned long int rep_movsb_stop_threshold;
+  /* ERMS feature is implemented from AMD Zen3 architecture and it is
+     performing poorly for data above L2 cache size. Henceforth, adding
+     an upper bound threshold parameter to limit the usage of Enhanced
+     REP MOVSB operations and setting its value to L2 cache size.  */
+  if (cpu_features->basic.kind == arch_kind_amd)
+    rep_movsb_stop_threshold = core;
+  /* Setting the upper bound of ERMS to the computed value of
+     non-temporal threshold for architectures other than AMD.  */
+  else
+    rep_movsb_stop_threshold = non_temporal_threshold;
+
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
-- 
2.34.1



* [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
@ 2022-06-15  0:25 ` Noah Goldstein
  2022-06-15  1:07   ` H.J. Lu
                     ` (3 more replies)
  2022-06-15  0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
  2022-06-15  1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
  2 siblings, 4 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15  0:25 UTC (permalink / raw)
  To: libc-alpha

1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.
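
For reference, a standalone C sketch (not the glibc sources; the
constants are made up) of the arithmetic behind the reworked
L(large_memcpy_4x) bounds check: comparing the copy size against
`threshold << LOG_4X_MEMCPY_THRESH` is equivalent to the old
`(size >> LOG_4X_MEMCPY_THRESH) >= threshold` form whenever the left
shift cannot overflow, which is why the tunable gains an upper bound.

  #include <assert.h>
  #include <stdint.h>

  #define LOG_4X 4 /* mirrors LOG_4X_MEMCPY_THRESH */

  int
  main (void)
  {
    uint64_t threshold = 0xc0000; /* example threshold value */
    uint64_t size = 0xc00000;     /* example copy size */

    /* Old form: shift the size right, compare to the threshold.  */
    int old_check = (size >> LOG_4X) >= threshold;

    /* New form: shift the threshold left, compare to the size.
       Equivalent whenever threshold <= UINT64_MAX >> LOG_4X, so
       the shift cannot wrap.  */
    assert (threshold <= (UINT64_MAX >> LOG_4X));
    int new_check = size >= (threshold << LOG_4X);

    assert (old_check == new_check);
    return 0;
  }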
---
 manual/tunables.texi                          |  2 +-
 sysdeps/x86/dl-cacheinfo.h                    |  8 +++--
 .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..a66152d9cc 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
     shared = tunable_size;
 
   tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
-  /* NB: Ignore the default value 0.  */
-  if (tunable_size != 0)
+  /* NB: Ignore the default value 0.  Saturate very large values at
+     LONG_MAX >> 3.  */
+  if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
     non_temporal_threshold = tunable_size;
+  /* Saturate huge arguments.  */
+  else if (tunable_size != 0)
+    non_temporal_threshold = LONG_MAX >> 3;
 
   tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
   if (tunable_size > minimum_rep_movsb_threshold)
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
 # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
 #endif
 
-/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get
+   the bound for memcpy_large_4x.  This is essentially used to
+   indicate that the copy is far beyond the scope of L3 (assuming
+   no user config of x86_non_temporal_threshold) and that a more
+   aggressively unrolled loop should be used.  NB: before
+   increasing this value, also update the initialization of
+   x86_non_temporal_threshold.  */
 #ifndef LOG_4X_MEMCPY_THRESH
 # define LOG_4X_MEMCPY_THRESH 4
 #endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
 	.p2align 4,, 10
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
 L(large_memcpy_2x_check):
-	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
-	jb	L(more_8x_vec_check)
+	/* Entry from L(large_memcpy_2x) has a redundant load of
+	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+	   is only used for the non-erms memmove which is generally less
+	   common.  */
 L(large_memcpy_2x):
+	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
+	cmp	%R11_LP, %RDX_LP
+	jb	L(more_8x_vec_check)
 	/* To reach this point it is impossible for dst > src and
 	   overlap. Remaining to check is src > dst and overlap. rcx
 	   already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
 	/* ecx contains -(dst - src). not ecx will return dst - src - 1
 	   which works for testing aliasing.  */
 	notl	%ecx
+	movq	%rdx, %r10
 	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
 	jz	L(large_memcpy_4x)
 
-	movq	%rdx, %r10
-	shrq	$LOG_4X_MEMCPY_THRESH, %r10
-	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
+	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+	 */
+	shlq	$LOG_4X_MEMCPY_THRESH, %r11
+	cmp	%r11, %rdx
 	jae	L(large_memcpy_4x)
 
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 2 - 1), %edx
 	/* r10 stores outer loop counter.  */
-	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	shrq	$(LOG_PAGE_SIZE + 1), %r10
 	/* Copy 4x VEC at a time from 2 pages.  */
 	.p2align 4
 L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
 
 	.p2align 4
 L(large_memcpy_4x):
-	movq	%rdx, %r10
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 4 - 1), %edx
 	/* r10 stores outer loop counter.  */
-- 
2.34.1



* [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
  2022-06-15  0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
@ 2022-06-15  0:25 ` Noah Goldstein
  2022-06-15  1:08   ` H.J. Lu
  2022-06-15  1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
  2 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15  0:25 UTC (permalink / raw)
  To: libc-alpha

This has been missing since the ifuncs were added.

The performance of SSE4.2 is preferable to SSE2.

Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
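
For reference, a small standalone sketch of how such a geometric mean
of ratios is computed (the per-benchmark ratios below are made up, not
the actual measurements; build with -lm); values below 1.0 favor
SSE4.2.

  #include <math.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Hypothetical per-benchmark runtime ratios (SSE4.2 / SSE2).  */
    double ratios[] = { 0.88, 0.93, 0.91, 0.90 };
    size_t n = sizeof ratios / sizeof ratios[0];

    /* Geometric mean == exp of the mean of the logs, i.e. the
       nth root of the product of the ratios.  */
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
      log_sum += log (ratios[i]);

    printf ("geometric mean: %.3f\n", exp (log_sum / n));
    return 0;
  }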
---
 sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
index a248c2a6e6..9c1677724c 100644
--- a/sysdeps/x86_64/multiarch/strcmp.c
+++ b/sysdeps/x86_64/multiarch/strcmp.c
@@ -28,6 +28,7 @@
 
 extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
 extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
@@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
 	return OPTIMIZE (avx2);
     }
 
+  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
+    return OPTIMIZE (sse42);
+
   if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
     return OPTIMIZE (sse2_unaligned);
 
-- 
2.34.1



* Re: [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
  2022-06-15  0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
  2022-06-15  0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
@ 2022-06-15  1:02 ` H.J. Lu
  2022-07-14  2:53   ` Sunil Pandey
  2 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15  1:02 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Move the setting of `rep_movsb_stop_threshold` to after the tunables
> have been collected so that `rep_movsb_stop_threshold` (which is used
> to redirect control flow to the non-temporal case) will honor any
> user value for `non_temporal_threshold` (set using
> glibc.cpu.x86_non_temporal_threshold).
> ---
>  sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index f64a2fb0ba..cc3b840f9c 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
>      rep_movsb_threshold = 2112;
>
> -  unsigned long int rep_movsb_stop_threshold;
> -  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> -     performing poorly for data above L2 cache size. Henceforth, adding
> -     an upper bound threshold parameter to limit the usage of Enhanced
> -     REP MOVSB operations and setting its value to L2 cache size.  */
> -  if (cpu_features->basic.kind == arch_kind_amd)
> -    rep_movsb_stop_threshold = core;
> -  /* Setting the upper bound of ERMS to the computed value of
> -     non-temporal threshold for architectures other than AMD.  */
> -  else
> -    rep_movsb_stop_threshold = non_temporal_threshold;
> -
>    /* The default threshold to use Enhanced REP STOSB.  */
>    unsigned long int rep_stosb_threshold = 2048;
>
> @@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>                            SIZE_MAX);
>  #endif
>
> +  unsigned long int rep_movsb_stop_threshold;
> +  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> +     performing poorly for data above L2 cache size. Henceforth, adding
> +     an upper bound threshold parameter to limit the usage of Enhanced
> +     REP MOVSB operations and setting its value to L2 cache size.  */
> +  if (cpu_features->basic.kind == arch_kind_amd)
> +    rep_movsb_stop_threshold = core;
> +  /* Setting the upper bound of ERMS to the computed value of
> +     non-temporal threshold for architectures other than AMD.  */
> +  else
> +    rep_movsb_stop_threshold = non_temporal_threshold;
> +
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
>    cpu_features->non_temporal_threshold = non_temporal_threshold;
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.


* Re: [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
@ 2022-06-15  1:07   ` H.J. Lu
  2022-06-15  3:57     ` Noah Goldstein
  2022-06-15  3:57   ` [PATCH v2] " Noah Goldstein
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15  1:07 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
>    Previously was using `__x86_rep_movsb_threshold` and should
>    have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
>    the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
>    more clearly.
> ---
>  manual/tunables.texi                          |  2 +-
>  sysdeps/x86/dl-cacheinfo.h                    |  8 +++--
>  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
>  3 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..a66152d9cc 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>      shared = tunable_size;
>
>    tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
> -  /* NB: Ignore the default value 0.  */
> -  if (tunable_size != 0)
> +  /* NB: Ignore the default value 0.  Saturate very large values at
> +     LONG_MAX >> 3.  */
> +  if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
>      non_temporal_threshold = tunable_size;
> +  /* Saturate huge arguments.  */
> +  else if (tunable_size != 0)
> +    non_temporal_threshold = LONG_MAX >> 3;
>
>    tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
>    if (tunable_size > minimum_rep_movsb_threshold)

Please update

 TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
                           0, SIZE_MAX);

instead.

> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
>  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
>  #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> +/* Amount to shift __x86_shared_non_temporal_threshold by to get
> +   the bound for memcpy_large_4x.  This is essentially used to
> +   indicate that the copy is far beyond the scope of L3 (assuming
> +   no user config of x86_non_temporal_threshold) and that a more
> +   aggressively unrolled loop should be used.  NB: before
> +   increasing this value, also update the initialization of
> +   x86_non_temporal_threshold.  */
>  #ifndef LOG_4X_MEMCPY_THRESH
>  # define LOG_4X_MEMCPY_THRESH 4
>  #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
>         .p2align 4,, 10
>  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
>  L(large_memcpy_2x_check):
> -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> -       jb      L(more_8x_vec_check)
> +       /* Entry from L(large_memcpy_2x) has a redundant load of
> +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> +          is only used for the non-erms memmove which is generally less
> +          common.  */
>  L(large_memcpy_2x):
> +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> +       cmp     %R11_LP, %RDX_LP
> +       jb      L(more_8x_vec_check)
>         /* To reach this point it is impossible for dst > src and
>            overlap. Remaining to check is src > dst and overlap. rcx
>            already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
>         /* ecx contains -(dst - src). not ecx will return dst - src - 1
>            which works for testing aliasing.  */
>         notl    %ecx
> +       movq    %rdx, %r10
>         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
>         jz      L(large_memcpy_4x)
>
> -       movq    %rdx, %r10
> -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> +        */
> +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> +       cmp     %r11, %rdx
>         jae     L(large_memcpy_4x)
>
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 2 - 1), %edx
>         /* r10 stores outer loop counter.  */
> -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> +       shrq    $(LOG_PAGE_SIZE + 1), %r10
>         /* Copy 4x VEC at a time from 2 pages.  */
>         .p2align 4
>  L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
>         .p2align 4
>  L(large_memcpy_4x):
> -       movq    %rdx, %r10
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 4 - 1), %edx
>         /* r10 stores outer loop counter.  */
> --
> 2.34.1
>


-- 
H.J.


* Re: [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
  2022-06-15  0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
@ 2022-06-15  1:08   ` H.J. Lu
  2022-07-14  2:54     ` Sunil Pandey
  0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15  1:08 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> This has been missing since the ifuncs were added.
>
> The performance of SSE4.2 is preferable to SSE2.
>
> Measured on Tigerlake with N = 20 runs.
> Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
> ---
>  sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
> index a248c2a6e6..9c1677724c 100644
> --- a/sysdeps/x86_64/multiarch/strcmp.c
> +++ b/sysdeps/x86_64/multiarch/strcmp.c
> @@ -28,6 +28,7 @@
>
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
>  extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> @@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
>         return OPTIMIZE (avx2);
>      }
>
> +  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> +      && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> +    return OPTIMIZE (sse42);
> +
>    if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
>      return OPTIMIZE (sse2_unaligned);
>
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.


* [PATCH v2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
  2022-06-15  1:07   ` H.J. Lu
@ 2022-06-15  3:57   ` Noah Goldstein
  2022-06-15 14:52     ` H.J. Lu
  2022-06-15 15:12   ` [PATCH v3] " Noah Goldstein
  2022-06-15 17:41   ` [PATCH v4 1/2] " Noah Goldstein
  3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15  3:57 UTC (permalink / raw)
  To: libc-alpha

1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.
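
As an aside, a standalone sketch of why the new upper bound is
SIZE_MAX >> 4 (the clamping helper below is only an assumed
approximation of what TUNABLE_SET_WITH_BOUNDS does): capping the
tunable there guarantees that shifting the threshold left by
LOG_4X_MEMCPY_THRESH (4) cannot wrap.

  #include <stdint.h>
  #include <stdio.h>

  #define LOG_4X 4 /* mirrors LOG_4X_MEMCPY_THRESH */

  /* Assumed clamping behavior, in the spirit of
     TUNABLE_SET_WITH_BOUNDS.  */
  static size_t
  clamp_tunable (size_t value, size_t min, size_t max)
  {
    if (value < min)
      return min;
    if (value > max)
      return max;
    return value;
  }

  int
  main (void)
  {
    size_t requested = SIZE_MAX; /* a huge user-supplied value */
    size_t threshold = clamp_tunable (requested, 0, SIZE_MAX >> LOG_4X);

    /* The 4x bound can now be formed without overflow.  */
    size_t bound_4x = threshold << LOG_4X;
    printf ("threshold=%#zx bound_4x=%#zx\n", threshold, bound_4x);
    return 0;
  }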
---
 manual/tunables.texi                          |  2 +-
 sysdeps/x86/dl-cacheinfo.h                    |  2 +-
 .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..858ff8a135 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
 # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
 #endif
 
-/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get
+   the bound for memcpy_large_4x.  This is essentially used to
+   indicate that the copy is far beyond the scope of L3 (assuming
+   no user config of x86_non_temporal_threshold) and that a more
+   aggressively unrolled loop should be used.  NB: before
+   increasing this value, also update the initialization of
+   x86_non_temporal_threshold.  */
 #ifndef LOG_4X_MEMCPY_THRESH
 # define LOG_4X_MEMCPY_THRESH 4
 #endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
 	.p2align 4,, 10
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
 L(large_memcpy_2x_check):
-	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
-	jb	L(more_8x_vec_check)
+	/* Entry from L(large_memcpy_2x) has a redundant load of
+	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+	   is only used for the non-erms memmove which is generally less
+	   common.  */
 L(large_memcpy_2x):
+	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
+	cmp	%R11_LP, %RDX_LP
+	jb	L(more_8x_vec_check)
 	/* To reach this point it is impossible for dst > src and
 	   overlap. Remaining to check is src > dst and overlap. rcx
 	   already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
 	/* ecx contains -(dst - src). not ecx will return dst - src - 1
 	   which works for testing aliasing.  */
 	notl	%ecx
+	movq	%rdx, %r10
 	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
 	jz	L(large_memcpy_4x)
 
-	movq	%rdx, %r10
-	shrq	$LOG_4X_MEMCPY_THRESH, %r10
-	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
+	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+	 */
+	shlq	$LOG_4X_MEMCPY_THRESH, %r11
+	cmp	%r11, %rdx
 	jae	L(large_memcpy_4x)
 
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 2 - 1), %edx
 	/* r10 stores outer loop counter.  */
-	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	shrq	$(LOG_PAGE_SIZE + 1), %r10
 	/* Copy 4x VEC at a time from 2 pages.  */
 	.p2align 4
 L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
 
 	.p2align 4
 L(large_memcpy_4x):
-	movq	%rdx, %r10
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 4 - 1), %edx
 	/* r10 stores outer loop counter.  */
-- 
2.34.1



* Re: [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  1:07   ` H.J. Lu
@ 2022-06-15  3:57     ` Noah Goldstein
  0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15  3:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Tue, Jun 14, 2022 at 6:08 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> >    Previously was using `__x86_rep_movsb_threshold` and should
> >    have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> >    the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> >    more clearly.
> > ---
> >  manual/tunables.texi                          |  2 +-
> >  sysdeps/x86/dl-cacheinfo.h                    |  8 +++--
> >  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
> >  3 files changed, 28 insertions(+), 11 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..a66152d9cc 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >      shared = tunable_size;
> >
> >    tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
> > -  /* NB: Ignore the default value 0.  */
> > -  if (tunable_size != 0)
> > +  /* NB: Ignore the default value 0.  Saturate very large values at
> > +     LONG_MAX >> 3.  */
> > +  if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
> >      non_temporal_threshold = tunable_size;
> > +  /* Saturate huge arguments.  */
> > +  else if (tunable_size != 0)
> > +    non_temporal_threshold = LONG_MAX >> 3;
> >
> >    tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> >    if (tunable_size > minimum_rep_movsb_threshold)
>
> Please update
>
>  TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
>                            0, SIZE_MAX);
>
> instead.

Fixed in V2.
>
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> >  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> >  #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by to get
> > +   the bound for memcpy_large_4x.  This is essentially used to
> > +   indicate that the copy is far beyond the scope of L3 (assuming
> > +   no user config of x86_non_temporal_threshold) and that a more
> > +   aggressively unrolled loop should be used.  NB: before
> > +   increasing this value, also update the initialization of
> > +   x86_non_temporal_threshold.  */
> >  #ifndef LOG_4X_MEMCPY_THRESH
> >  # define LOG_4X_MEMCPY_THRESH 4
> >  #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> >         .p2align 4,, 10
> >  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> >  L(large_memcpy_2x_check):
> > -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> > -       jb      L(more_8x_vec_check)
> > +       /* Entry from L(large_memcpy_2x) has a redundant load of
> > +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > +          is only used for the non-erms memmove which is generally less
> > +          common.  */
> >  L(large_memcpy_2x):
> > +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > +       cmp     %R11_LP, %RDX_LP
> > +       jb      L(more_8x_vec_check)
> >         /* To reach this point it is impossible for dst > src and
> >            overlap. Remaining to check is src > dst and overlap. rcx
> >            already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> >         /* ecx contains -(dst - src). not ecx will return dst - src - 1
> >            which works for testing aliasing.  */
> >         notl    %ecx
> > +       movq    %rdx, %r10
> >         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> >         jz      L(large_memcpy_4x)
> >
> > -       movq    %rdx, %r10
> > -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> > -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> > +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> > +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > +        */
> > +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> > +       cmp     %r11, %rdx
> >         jae     L(large_memcpy_4x)
> >
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 2 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > +       shrq    $(LOG_PAGE_SIZE + 1), %r10
> >         /* Copy 4x VEC at a time from 2 pages.  */
> >         .p2align 4
> >  L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> >         .p2align 4
> >  L(large_memcpy_4x):
> > -       movq    %rdx, %r10
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 4 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.


* Re: [PATCH v2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  3:57   ` [PATCH v2] " Noah Goldstein
@ 2022-06-15 14:52     ` H.J. Lu
  2022-06-15 15:13       ` Noah Goldstein
  0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 14:52 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Tue, Jun 14, 2022 at 8:57 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
>    Previously was using `__x86_rep_movsb_threshold` and should
>    have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
>    the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
>    more clearly.
> ---
>  manual/tunables.texi                          |  2 +-
>  sysdeps/x86/dl-cacheinfo.h                    |  2 +-
>  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
>  3 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..858ff8a135 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0, SIZE_MAX >> 4);

Please add a comment to describe where >> 4 comes from.

>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
>  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
>  #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> +/* Amount to shift __x86_shared_non_temporal_threshold by to get
> +   the bound for memcpy_large_4x.  This is essentially used to
> +   indicate that the copy is far beyond the scope of L3 (assuming
> +   no user config of x86_non_temporal_threshold) and that a more
> +   aggressively unrolled loop should be used.  NB: before
> +   increasing this value, also update the initialization of
> +   x86_non_temporal_threshold.  */
>  #ifndef LOG_4X_MEMCPY_THRESH
>  # define LOG_4X_MEMCPY_THRESH 4
>  #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
>         .p2align 4,, 10
>  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
>  L(large_memcpy_2x_check):
> -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> -       jb      L(more_8x_vec_check)
> +       /* Entry from L(large_memcpy_2x) has a redundant load of
> +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> +          is only used for the non-erms memmove which is generally less
> +          common.  */
>  L(large_memcpy_2x):
> +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> +       cmp     %R11_LP, %RDX_LP
> +       jb      L(more_8x_vec_check)
>         /* To reach this point it is impossible for dst > src and
>            overlap. Remaining to check is src > dst and overlap. rcx
>            already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
>         /* ecx contains -(dst - src). not ecx will return dst - src - 1
>            which works for testing aliasing.  */
>         notl    %ecx
> +       movq    %rdx, %r10
>         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
>         jz      L(large_memcpy_4x)
>
> -       movq    %rdx, %r10
> -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> +        */
> +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> +       cmp     %r11, %rdx
>         jae     L(large_memcpy_4x)
>
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 2 - 1), %edx
>         /* r10 stores outer loop counter.  */
> -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> +       shrq    $(LOG_PAGE_SIZE + 1), %r10
>         /* Copy 4x VEC at a time from 2 pages.  */
>         .p2align 4
>  L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
>         .p2align 4
>  L(large_memcpy_4x):
> -       movq    %rdx, %r10
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 4 - 1), %edx
>         /* r10 stores outer loop counter.  */
> --
> 2.34.1
>


-- 
H.J.


* [PATCH v3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
  2022-06-15  1:07   ` H.J. Lu
  2022-06-15  3:57   ` [PATCH v2] " Noah Goldstein
@ 2022-06-15 15:12   ` Noah Goldstein
  2022-06-15 16:48     ` H.J. Lu
  2022-06-15 17:41   ` [PATCH v4 1/2] " Noah Goldstein
  3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 15:12 UTC (permalink / raw)
  To: libc-alpha

1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.
---
 manual/tunables.texi                          |  2 +-
 sysdeps/x86/dl-cacheinfo.h                    |  6 +++-
 .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..f94ff2df43 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+     if that operation cannot overflow.  Note that the '>> 4' also reflects the
+     bound in the manual.  */
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
 # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
 #endif
 
-/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get
+   the bound for memcpy_large_4x.  This is essentially used to
+   indicate that the copy is far beyond the scope of L3 (assuming
+   no user config of x86_non_temporal_threshold) and that a more
+   aggressively unrolled loop should be used.  NB: before
+   increasing this value, also update the initialization of
+   x86_non_temporal_threshold.  */
 #ifndef LOG_4X_MEMCPY_THRESH
 # define LOG_4X_MEMCPY_THRESH 4
 #endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
 	.p2align 4,, 10
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
 L(large_memcpy_2x_check):
-	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
-	jb	L(more_8x_vec_check)
+	/* Entry from L(large_memcpy_2x) has a redundant load of
+	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+	   is only used for the non-erms memmove which is generally less
+	   common.  */
 L(large_memcpy_2x):
+	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
+	cmp	%R11_LP, %RDX_LP
+	jb	L(more_8x_vec_check)
 	/* To reach this point it is impossible for dst > src and
 	   overlap. Remaining to check is src > dst and overlap. rcx
 	   already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
 	/* ecx contains -(dst - src). not ecx will return dst - src - 1
 	   which works for testing aliasing.  */
 	notl	%ecx
+	movq	%rdx, %r10
 	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
 	jz	L(large_memcpy_4x)
 
-	movq	%rdx, %r10
-	shrq	$LOG_4X_MEMCPY_THRESH, %r10
-	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
+	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+	 */
+	shlq	$LOG_4X_MEMCPY_THRESH, %r11
+	cmp	%r11, %rdx
 	jae	L(large_memcpy_4x)
 
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 2 - 1), %edx
 	/* r10 stores outer loop counter.  */
-	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	shrq	$(LOG_PAGE_SIZE + 1), %r10
 	/* Copy 4x VEC at a time from 2 pages.  */
 	.p2align 4
 L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
 
 	.p2align 4
 L(large_memcpy_4x):
-	movq	%rdx, %r10
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 4 - 1), %edx
 	/* r10 stores outer loop counter.  */
-- 
2.34.1



* Re: [PATCH v2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15 14:52     ` H.J. Lu
@ 2022-06-15 15:13       ` Noah Goldstein
  0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 15:13 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 7:52 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 14, 2022 at 8:57 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> >    Previously was using `__x86_rep_movsb_threshold` and should
> >    have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> >    the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> >    more clearly.
> > ---
> >  manual/tunables.texi                          |  2 +-
> >  sysdeps/x86/dl-cacheinfo.h                    |  2 +-
> >  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
> >  3 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..858ff8a135 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > -                          0, SIZE_MAX);
> > +                          0, SIZE_MAX >> 4);
>
> Please add a comment to describe where >> 4 comes from.
Fixed in V3.
>
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> >                            minimum_rep_movsb_threshold, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> >  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> >  #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by to get
> > +   the bound for memcpy_large_4x.  This is essentially used to
> > +   indicate that the copy is far beyond the scope of L3 (assuming
> > +   no user config of x86_non_temporal_threshold) and that a more
> > +   aggressively unrolled loop should be used.  NB: before
> > +   increasing this value, also update the initialization of
> > +   x86_non_temporal_threshold.  */
> >  #ifndef LOG_4X_MEMCPY_THRESH
> >  # define LOG_4X_MEMCPY_THRESH 4
> >  #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> >         .p2align 4,, 10
> >  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> >  L(large_memcpy_2x_check):
> > -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> > -       jb      L(more_8x_vec_check)
> > +       /* Entry from L(large_memcpy_2x) has a redundant load of
> > +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > +          is only used for the non-erms memmove which is generally less
> > +          common.  */
> >  L(large_memcpy_2x):
> > +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > +       cmp     %R11_LP, %RDX_LP
> > +       jb      L(more_8x_vec_check)
> >         /* To reach this point it is impossible for dst > src and
> >            overlap. Remaining to check is src > dst and overlap. rcx
> >            already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> >         /* ecx contains -(dst - src). not ecx will return dst - src - 1
> >            which works for testing aliasing.  */
> >         notl    %ecx
> > +       movq    %rdx, %r10
> >         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> >         jz      L(large_memcpy_4x)
> >
> > -       movq    %rdx, %r10
> > -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> > -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> > +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> > +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > +        */
> > +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> > +       cmp     %r11, %rdx
> >         jae     L(large_memcpy_4x)
> >
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 2 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > +       shrq    $(LOG_PAGE_SIZE + 1), %r10
> >         /* Copy 4x VEC at a time from 2 pages.  */
> >         .p2align 4
> >  L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> >         .p2align 4
> >  L(large_memcpy_4x):
> > -       movq    %rdx, %r10
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 4 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.


* Re: [PATCH v3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15 15:12   ` [PATCH v3] " Noah Goldstein
@ 2022-06-15 16:48     ` H.J. Lu
  2022-06-15 17:44       ` Noah Goldstein
  0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 16:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 8:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
>    Previously was using `__x86_rep_movsb_threshold` and should
>    have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
>    the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
>    more clearly.
> ---
>  manual/tunables.texi                          |  2 +-
>  sysdeps/x86/dl-cacheinfo.h                    |  6 +++-
>  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
>  3 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..f94ff2df43 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> +     if that operation cannot overflow.  Note that the '>> 4' also reflects the
> +     bound in the manual.  */
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0, SIZE_MAX >> 4);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,

To help backport, please break this patch into 2 patches and
make the memmove-vec-unaligned-erms.S change a separate
one.

Thanks.

> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
>  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
>  #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> +/* Amount to shift __x86_shared_non_temporal_threshold by to get
> +   the bound for memcpy_large_4x.  This is essentially used to
> +   indicate that the copy is far beyond the scope of L3 (assuming
> +   no user config of x86_non_temporal_threshold) and that a more
> +   aggressively unrolled loop should be used.  NB: before
> +   increasing this value, also update the initialization of
> +   x86_non_temporal_threshold.  */
>  #ifndef LOG_4X_MEMCPY_THRESH
>  # define LOG_4X_MEMCPY_THRESH 4
>  #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
>         .p2align 4,, 10
>  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
>  L(large_memcpy_2x_check):
> -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> -       jb      L(more_8x_vec_check)
> +       /* Entry from L(large_memcpy_2x) has a redundant load of
> +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> +          is only used for the non-erms memmove which is generally less
> +          common.  */
>  L(large_memcpy_2x):
> +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> +       cmp     %R11_LP, %RDX_LP
> +       jb      L(more_8x_vec_check)
>         /* To reach this point it is impossible for dst > src and
>            overlap. Remaining to check is src > dst and overlap. rcx
>            already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
>         /* ecx contains -(dst - src). not ecx will return dst - src - 1
>            which works for testing aliasing.  */
>         notl    %ecx
> +       movq    %rdx, %r10
>         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
>         jz      L(large_memcpy_4x)
>
> -       movq    %rdx, %r10
> -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> +        */
> +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> +       cmp     %r11, %rdx
>         jae     L(large_memcpy_4x)
>
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 2 - 1), %edx
>         /* r10 stores outer loop counter.  */
> -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> +       shrq    $(LOG_PAGE_SIZE + 1), %r10
>         /* Copy 4x VEC at a time from 2 pages.  */
>         .p2align 4
>  L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
>         .p2align 4
>  L(large_memcpy_4x):
> -       movq    %rdx, %r10
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 4 - 1), %edx
>         /* r10 stores outer loop counter.  */
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
                     ` (2 preceding siblings ...)
  2022-06-15 15:12   ` [PATCH v3] " Noah Goldstein
@ 2022-06-15 17:41   ` Noah Goldstein
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
  2022-06-15 18:22     ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
  3 siblings, 2 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:41 UTC (permalink / raw)
  To: libc-alpha

1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
   Previously was using `__x86_rep_movsb_threshold` and should
   have been using `__x86_shared_non_temporal_threshold`.

2. Avoid reloading __x86_shared_non_temporal_threshold before
   the L(large_memcpy_4x) bounds check.

3. Document the second bounds check for L(large_memcpy_4x)
   more clearly.
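
For reference, the resulting dispatch boils down to the following C
sketch (a simplified model of the assembly; `len` and `pages_alias`
are illustrative stand-ins for rdx and the PAGE_SIZE aliasing test,
not glibc identifiers):

    #include <stddef.h>
    #include <stdbool.h>

    #define LOG_4X_MEMCPY_THRESH 4

    /* Stand-in for __x86_shared_non_temporal_threshold.  */
    extern size_t non_temporal_threshold;

    enum path { MORE_8X_VEC, LARGE_MEMCPY_2X, LARGE_MEMCPY_4X };

    static enum path
    select_path (size_t len, bool pages_alias)
    {
      /* Fixed L(large_memcpy_2x) entry check: below the non-temporal
         threshold, stay on the L(more_8x_vec_check) path.  */
      if (len < non_temporal_threshold)
        return MORE_8X_VEC;
      /* src and dst alias closely mod PAGE_SIZE: go to 4x directly.  */
      if (pages_alias)
        return LARGE_MEMCPY_4X;
      /* Second bounds check: the threshold (already in a register) is
         shifted left instead of shifting len right, which avoids
         reloading __x86_shared_non_temporal_threshold.  */
      if (len >= (non_temporal_threshold << LOG_4X_MEMCPY_THRESH))
        return LARGE_MEMCPY_4X;
      return LARGE_MEMCPY_2X;
    }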
---
 .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
 # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
 #endif
 
-/* Amount to shift rdx by to compare for memcpy_large_4x.  */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get the
+   bound for memcpy_large_4x. This is essentially used to
+   indicate that the copy is far beyond the scope of L3
+   (assuming no user config x86_non_temporal_threshold) and to
+   use a more aggressively unrolled loop.  NB: before
+   increasing the value also update initialization of
+   x86_non_temporal_threshold.  */
 #ifndef LOG_4X_MEMCPY_THRESH
 # define LOG_4X_MEMCPY_THRESH 4
 #endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
 	.p2align 4,, 10
 #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
 L(large_memcpy_2x_check):
-	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
-	jb	L(more_8x_vec_check)
+	/* Entry from L(large_memcpy_2x) has a redundant load of
+	   __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+	   is only used for the non-erms memmove which is generally less
+	   common.  */
 L(large_memcpy_2x):
+	mov	__x86_shared_non_temporal_threshold(%rip), %R11_LP
+	cmp	%R11_LP, %RDX_LP
+	jb	L(more_8x_vec_check)
 	/* To reach this point it is impossible for dst > src and
 	   overlap. Remaining to check is src > dst and overlap. rcx
 	   already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
 	/* ecx contains -(dst - src). not ecx will return dst - src - 1
 	   which works for testing aliasing.  */
 	notl	%ecx
+	movq	%rdx, %r10
 	testl	$(PAGE_SIZE - VEC_SIZE * 8), %ecx
 	jz	L(large_memcpy_4x)
 
-	movq	%rdx, %r10
-	shrq	$LOG_4X_MEMCPY_THRESH, %r10
-	cmp	__x86_shared_non_temporal_threshold(%rip), %r10
+	/* r11 has __x86_shared_non_temporal_threshold.  Shift it left
+	   by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+	 */
+	shlq	$LOG_4X_MEMCPY_THRESH, %r11
+	cmp	%r11, %rdx
 	jae	L(large_memcpy_4x)
 
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 2 - 1), %edx
 	/* r10 stores outer loop counter.  */
-	shrq	$((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+	shrq	$(LOG_PAGE_SIZE + 1), %r10
 	/* Copy 4x VEC at a time from 2 pages.  */
 	.p2align 4
 L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
 
 	.p2align 4
 L(large_memcpy_4x):
-	movq	%rdx, %r10
 	/* edx will store remainder size for copying tail.  */
 	andl	$(PAGE_SIZE * 4 - 1), %edx
 	/* r10 stores outer loop counter.  */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 17:41   ` [PATCH v4 1/2] " Noah Goldstein
@ 2022-06-15 17:41     ` Noah Goldstein
  2022-06-15 18:22       ` H.J. Lu
                         ` (3 more replies)
  2022-06-15 18:22     ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
  1 sibling, 4 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:41 UTC (permalink / raw)
  To: libc-alpha

The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
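
Concretely, the overflow the upper-bound guards against looks like
this (illustrative C only, with a made-up function name; not the
actual glibc code):

    #include <stddef.h>
    #include <stdint.h>

    #define LOG_4X_MEMCPY_THRESH 4

    /* The L(large_memcpy_4x) cutoff is the tunable shifted left.
       With the tunable capped at SIZE_MAX >> 4 this cannot wrap;
       without the cap a huge user-supplied value could overflow and
       yield a spuriously small cutoff.  */
    static size_t
    large_memcpy_4x_cutoff (size_t non_temporal_threshold)
    {
      return non_temporal_threshold << LOG_4X_MEMCPY_THRESH;
    }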
---
 manual/tunables.texi       | 2 +-
 sysdeps/x86/dl-cacheinfo.h | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..f94ff2df43 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+     if that operation cannot overflow.  Note that the '>> 4' also reflects the bound
+     in the manual.  */
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3] x86: Cleanup bounds checking in large memcpy case
  2022-06-15 16:48     ` H.J. Lu
@ 2022-06-15 17:44       ` Noah Goldstein
  0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:44 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 9:49 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 8:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> >    Previously was using `__x86_rep_movsb_threshold` and should
> >    have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> >    the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> >    more clearly.
> > ---
> >  manual/tunables.texi                          |  2 +-
> >  sysdeps/x86/dl-cacheinfo.h                    |  6 +++-
> >  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
> >  3 files changed, 27 insertions(+), 10 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..f94ff2df43 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> >    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> > +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > +     if that operation cannot overflow.  Note that the '>> 4' also reflects the bound
> > +     in the manual.  */
> >    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > -                          0, SIZE_MAX);
> > +                          0, SIZE_MAX >> 4);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> >                            minimum_rep_movsb_threshold, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
>
> To help backport, please break this patch into 2 patches and
> make the memmove-vec-unaligned-erms.S change a separate
> one.

Done in V4.

Note there has been a lower bound missing since 2.34 that might also
need to be backported.

Added it in the second patch. I can split that one too (since upper
bound is not a correctness issue) if it does in fact need to be
backported.
>
> Thanks.
>
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> >  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> >  #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by to get the
> > +   bound for memcpy_large_4x. This is essentially used to
> > +   indicate that the copy is far beyond the scope of L3
> > +   (assuming no user config x86_non_temporal_threshold) and to
> > +   use a more aggressively unrolled loop.  NB: before
> > +   increasing the value also update initialization of
> > +   x86_non_temporal_threshold.  */
> >  #ifndef LOG_4X_MEMCPY_THRESH
> >  # define LOG_4X_MEMCPY_THRESH 4
> >  #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> >         .p2align 4,, 10
> >  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> >  L(large_memcpy_2x_check):
> > -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> > -       jb      L(more_8x_vec_check)
> > +       /* Entry from L(large_memcpy_2x) has a redundant load of
> > +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > +          is only used for the non-erms memmove which is generally less
> > +          common.  */
> >  L(large_memcpy_2x):
> > +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > +       cmp     %R11_LP, %RDX_LP
> > +       jb      L(more_8x_vec_check)
> >         /* To reach this point it is impossible for dst > src and
> >            overlap. Remaining to check is src > dst and overlap. rcx
> >            already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> >         /* ecx contains -(dst - src). not ecx will return dst - src - 1
> >            which works for testing aliasing.  */
> >         notl    %ecx
> > +       movq    %rdx, %r10
> >         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> >         jz      L(large_memcpy_4x)
> >
> > -       movq    %rdx, %r10
> > -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> > -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> > +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> > +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > +        */
> > +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> > +       cmp     %r11, %rdx
> >         jae     L(large_memcpy_4x)
> >
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 2 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > +       shrq    $(LOG_PAGE_SIZE + 1), %r10
> >         /* Copy 4x VEC at a time from 2 pages.  */
> >         .p2align 4
> >  L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> >         .p2align 4
> >  L(large_memcpy_4x):
> > -       movq    %rdx, %r10
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 4 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
@ 2022-06-15 18:22       ` H.J. Lu
  2022-06-15 18:33         ` Noah Goldstein
  2022-06-15 18:32       ` [PATCH v5 " Noah Goldstein
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:22 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memcpy_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> left-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
>  manual/tunables.texi       | 2 +-
>  sysdeps/x86/dl-cacheinfo.h | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..f94ff2df43 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > +     if that operation cannot overflow.  Note that the '>> 4' also reflects the bound
> +     in the manual.  */
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0, SIZE_MAX >> 4);

You didn't change the lower bound.

>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15 17:41   ` [PATCH v4 1/2] " Noah Goldstein
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
@ 2022-06-15 18:22     ` H.J. Lu
  2022-07-14  2:57       ` Sunil Pandey
  1 sibling, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:22 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
>    Previously was using `__x86_rep_movsb_threshold` and should
>    have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
>    the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
>    more clearly.
> ---
>  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
>  1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
>  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
>  #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by to get the
> > +   bound for memcpy_large_4x. This is essentially used to
> +   indicate that the copy is far beyond the scope of L3
> +   (assuming no user config x86_non_temporal_threshold) and to
> +   use a more aggressively unrolled loop.  NB: before
> +   increasing the value also update initialization of
> +   x86_non_temporal_threshold.  */
>  #ifndef LOG_4X_MEMCPY_THRESH
>  # define LOG_4X_MEMCPY_THRESH 4
>  #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
>         .p2align 4,, 10
>  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
>  L(large_memcpy_2x_check):
> -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> -       jb      L(more_8x_vec_check)
> +       /* Entry from L(large_memcpy_2x) has a redundant load of
> +          __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > +          is only used for the non-erms memmove which is generally less
> +          common.  */
>  L(large_memcpy_2x):
> +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> +       cmp     %R11_LP, %RDX_LP
> +       jb      L(more_8x_vec_check)
>         /* To reach this point it is impossible for dst > src and
>            overlap. Remaining to check is src > dst and overlap. rcx
>            already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
>         /* ecx contains -(dst - src). not ecx will return dst - src - 1
>            which works for testing aliasing.  */
>         notl    %ecx
> +       movq    %rdx, %r10
>         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
>         jz      L(large_memcpy_4x)
>
> -       movq    %rdx, %r10
> -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it left
> +          by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> +        */
> +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> +       cmp     %r11, %rdx
>         jae     L(large_memcpy_4x)
>
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 2 - 1), %edx
>         /* r10 stores outer loop counter.  */
> -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> +       shrq    $(LOG_PAGE_SIZE + 1), %r10
>         /* Copy 4x VEC at a time from 2 pages.  */
>         .p2align 4
>  L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
>         .p2align 4
>  L(large_memcpy_4x):
> -       movq    %rdx, %r10
>         /* edx will store remainder size for copying tail.  */
>         andl    $(PAGE_SIZE * 4 - 1), %edx
>         /* r10 stores outer loop counter.  */
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v5 2/2] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
  2022-06-15 18:22       ` H.J. Lu
@ 2022-06-15 18:32       ` Noah Goldstein
  2022-06-15 18:43         ` H.J. Lu
  2022-06-15 19:52       ` [PATCH v6 2/3] " Noah Goldstein
  2022-06-15 20:34       ` [PATCH v7 " Noah Goldstein
  3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 18:32 UTC (permalink / raw)
  To: libc-alpha

The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
---
 manual/tunables.texi       | 2 +-
 sysdeps/x86/dl-cacheinfo.h | 7 ++++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..a420ed6206 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x20000, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..b4ff385ae1 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+     if that operation cannot overflow. 0x20000 (131072) because the
+     L(large_memcpy_4x) case aggressively unrolls the loop.  Both values are
+     reflected in the manual.  */
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0x20000, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 18:22       ` H.J. Lu
@ 2022-06-15 18:33         ` Noah Goldstein
  0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 18:33 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 11:22 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memcpy_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > left-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> >  manual/tunables.texi       | 2 +-
> >  sysdeps/x86/dl-cacheinfo.h | 6 +++++-
> >  2 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..f94ff2df43 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> >    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> > +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > +     if that operation cannot overflow.  Note that the '>> 4' also reflects the bound
> > +     in the manual.  */
> >    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > -                          0, SIZE_MAX);
> > +                          0, SIZE_MAX >> 4);
>
> You didn't change the lower bound.

Fixed in V5.
>
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> >                            minimum_rep_movsb_threshold, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v5 2/2] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 18:32       ` [PATCH v5 " Noah Goldstein
@ 2022-06-15 18:43         ` H.J. Lu
  0 siblings, 0 replies; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:43 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 11:32 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memcpy_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> left-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
>  manual/tunables.texi       | 2 +-
>  sysdeps/x86/dl-cacheinfo.h | 7 ++++++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..a420ed6206 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x20000, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..b4ff385ae1 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> +     if that operation cannot overflow. 0x20000 (131072) because the
> +     L(large_memcpy_4x) case aggressively unrolls the loop.  Both values are

How is 0x20000 computed?  Shouldn't it depend on vector size?

> +     reflected in the manual.  */
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0x20000, SIZE_MAX >> 4);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
  2022-06-15 18:22       ` H.J. Lu
  2022-06-15 18:32       ` [PATCH v5 " Noah Goldstein
@ 2022-06-15 19:52       ` Noah Goldstein
  2022-06-15 20:27         ` H.J. Lu
  2022-06-15 20:34       ` [PATCH v7 " Noah Goldstein
  3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 19:52 UTC (permalink / raw)
  To: libc-alpha

The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
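
The lower-bound arithmetic can be sanity-checked in a few lines of C
(assuming the usual 4096-byte x86 page and 64-byte cache line; this
is an illustration, not glibc code):

    #include <assert.h>

    #define PAGE_SIZE  4096
    #define CACHE_LINE 64

    int
    main (void)
    {
      /* 64 bytes to reach cache alignment plus one full iteration of
         the 4x PAGE_SIZE unrolled loop: 64 + 4 * 4096 = 16448.  */
      assert (CACHE_LINE + 4 * PAGE_SIZE == 0x4040);
      return 0;
    }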
---
 manual/tunables.texi       | 2 +-
 sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..2c076019ae 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..c493956259 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
+     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
+     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
+     reflected in the manual.  */
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0x20000, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 19:52       ` [PATCH v6 2/3] " Noah Goldstein
@ 2022-06-15 20:27         ` H.J. Lu
  2022-06-15 20:35           ` Noah Goldstein
  0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 20:27 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 12:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memcpy_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> left-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
>  manual/tunables.texi       | 2 +-
>  sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..2c076019ae 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..c493956259 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> +     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> +     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
> +     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
> +     reflected in the manual.  */
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0x20000, SIZE_MAX >> 4);

The lower bound should be 0x4040.

>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
                         ` (2 preceding siblings ...)
  2022-06-15 19:52       ` [PATCH v6 2/3] " Noah Goldstein
@ 2022-06-15 20:34       ` Noah Goldstein
  2022-06-15 20:48         ` H.J. Lu
  3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 20:34 UTC (permalink / raw)
  To: libc-alpha

The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.

The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.

The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.

The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
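
In practice the bounds act through the tunables framework: a value
set with e.g. GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=0x1000
that falls outside [0x4040, SIZE_MAX >> 4] is presumably ignored in
favor of the computed default rather than clamped; this is an
assumption about the generic tunables bounds handling, not something
this patch changes.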
---
 manual/tunables.texi       | 2 +-
 sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..2c076019ae 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
 glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
 glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..e9f3382108 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
+     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
+     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
+     reflected in the manual.  */
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
-			   0, SIZE_MAX);
+			   0x4040, SIZE_MAX >> 4);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
-- 
2.34.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 20:27         ` H.J. Lu
@ 2022-06-15 20:35           ` Noah Goldstein
  0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 20:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 1:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 12:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memcpy_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > left-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> >  manual/tunables.texi       | 2 +-
> >  sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..2c076019ae 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..c493956259 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> >    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> > +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > +     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > +     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
> > +     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
> > +     reflected in the manual.  */
> >    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > -                          0, SIZE_MAX);
> > +                          0x20000, SIZE_MAX >> 4);
>
> The lower bound should be 0x4040.

Done in V7.
>
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> >                            minimum_rep_movsb_threshold, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 20:34       ` [PATCH v7 " Noah Goldstein
@ 2022-06-15 20:48         ` H.J. Lu
  2022-07-14  2:55           ` Sunil Pandey
  0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 20:48 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Wed, Jun 15, 2022 at 1:34 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memcpy_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> left-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
>  manual/tunables.texi       | 2 +-
>  sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..2c076019ae 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
>  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..e9f3382108 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
>    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> +     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> +     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
> +     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
> +     reflected in the manual.  */
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> -                          0, SIZE_MAX);
> +                          0x4040, SIZE_MAX >> 4);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>

LGTM.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
  2022-06-15  1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
@ 2022-07-14  2:53   ` Sunil Pandey
  0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14  2:53 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library

On Tue, Jun 14, 2022 at 6:03 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Move the setting of `rep_movsb_stop_threshold` to after the tunables
> > have been collected so that the `rep_movsb_stop_threshold` (which
> > is used to redirect control flow to the non_temporal case) will
> > use any user value for `non_temporal_threshold` (set using
> > glibc.cpu.x86_non_temporal_threshold)
> > ---
> >  sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> >
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index f64a2fb0ba..cc3b840f9c 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> >      rep_movsb_threshold = 2112;
> >
> > -  unsigned long int rep_movsb_stop_threshold;
> > -  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > -     performing poorly for data above L2 cache size. Hence, adding
> > -     an upper bound threshold parameter to limit the usage of Enhanced
> > -     REP MOVSB operations and setting its value to L2 cache size.  */
> > -  if (cpu_features->basic.kind == arch_kind_amd)
> > -    rep_movsb_stop_threshold = core;
> > -  /* Setting the upper bound of ERMS to the computed value of
> > -     non-temporal threshold for architectures other than AMD.  */
> > -  else
> > -    rep_movsb_stop_threshold = non_temporal_threshold;
> > -
> >    /* The default threshold to use Enhanced REP STOSB.  */
> >    unsigned long int rep_stosb_threshold = 2048;
> >
> > @@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >                            SIZE_MAX);
> >  #endif
> >
> > +  unsigned long int rep_movsb_stop_threshold;
> > +  /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > +     performing poorly for data above L2 cache size. Hence, adding
> > +     an upper bound threshold parameter to limit the usage of Enhanced
> > +     REP MOVSB operations and setting its value to L2 cache size.  */
> > +  if (cpu_features->basic.kind == arch_kind_amd)
> > +    rep_movsb_stop_threshold = core;
> > +  /* Setting the upper bound of ERMS to the computed value of
> > +     non-temporal threshold for architectures other than AMD.  */
> > +  else
> > +    rep_movsb_stop_threshold = non_temporal_threshold;
> > +
> >    cpu_features->data_cache_size = data;
> >    cpu_features->shared_cache_size = shared;
> >    cpu_features->non_temporal_threshold = non_temporal_threshold;
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
  2022-06-15  1:08   ` H.J. Lu
@ 2022-07-14  2:54     ` Sunil Pandey
  0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14  2:54 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library

On Tue, Jun 14, 2022 at 6:09 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > This has been missing since the ifuncs were added.
> >
> > The performance of SSE4.2 is preferable to SSE2.
> >
> > Measured on Tigerlake with N = 20 runs.
> > Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
> > ---
> >  sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
> > index a248c2a6e6..9c1677724c 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp.c
> > +++ b/sysdeps/x86_64/multiarch/strcmp.c
> > @@ -28,6 +28,7 @@
> >
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
> > +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> > @@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
> >         return OPTIMIZE (avx2);
> >      }
> >
> > +  if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> > +      && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> > +    return OPTIMIZE (sse42);
> > +
> >    if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
> >      return OPTIMIZE (sse2_unaligned);
> >
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
  2022-06-15 20:48         ` H.J. Lu
@ 2022-07-14  2:55           ` Sunil Pandey
  0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14  2:55 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library

On Wed, Jun 15, 2022 at 1:49 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jun 15, 2022 at 1:34 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memcpy_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > left-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> >  manual/tunables.texi       | 2 +-
> >  sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..2c076019ae 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> >  glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> >  glibc.cpu.x86_shstk:
> >  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> >  glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..e9f3382108 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> >    TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > +  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
> > +     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > +     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > +     L(large_memcpy_4x) loops need 64 bytes to cache-align and enough space for
> > +     at least 1 iteration of the 4x PAGE_SIZE unrolled loop.  Both values are
> > +     reflected in the manual.  */
> >    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > -                          0, SIZE_MAX);
> > +                          0x4040, SIZE_MAX >> 4);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> >                            minimum_rep_movsb_threshold, SIZE_MAX);
> >    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
  2022-06-15 18:22     ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
@ 2022-07-14  2:57       ` Sunil Pandey
  0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14  2:57 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library

On Wed, Jun 15, 2022 at 11:24 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > > 1. Fix the incorrect lower-bound threshold in L(large_memcpy_2x).
> > >    It was previously using `__x86_rep_movsb_threshold` and should
> > >    have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> >    the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> >    more clearly.
> > ---
> >  .../multiarch/memmove-vec-unaligned-erms.S    | 29 ++++++++++++++-----
> >  1 file changed, 21 insertions(+), 8 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> >  # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> >  #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x.  */
> > > +/* Amount to shift __x86_shared_non_temporal_threshold by for
> > > +   the bound for memcpy_large_4x.  This is essentially used to
> > > +   indicate that the copy is far beyond the scope of L3
> > > +   (assuming no user config of x86_non_temporal_threshold) and
> > > +   that a more aggressively unrolled loop should be used.  NB:
> > > +   before increasing the value also update the initialization
> > > +   of x86_non_temporal_threshold.  */
> >  #ifndef LOG_4X_MEMCPY_THRESH
> >  # define LOG_4X_MEMCPY_THRESH 4
> >  #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> >         .p2align 4,, 10
> >  #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> >  L(large_memcpy_2x_check):
> > -       cmp     __x86_rep_movsb_threshold(%rip), %RDX_LP
> > -       jb      L(more_8x_vec_check)
> > > +       /* Entry from L(large_memcpy_2x) has a redundant load of
> > > +          __x86_shared_non_temporal_threshold(%rip).  L(large_memcpy_2x)
> > > +          is only used for the non-erms memmove, which is generally
> > > +          less common.  */
> >  L(large_memcpy_2x):
> > +       mov     __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > +       cmp     %R11_LP, %RDX_LP
> > +       jb      L(more_8x_vec_check)
> >         /* To reach this point it is impossible for dst > src and
> >            overlap. Remaining to check is src > dst and overlap. rcx
> >            already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> >         /* ecx contains -(dst - src). not ecx will return dst - src - 1
> >            which works for testing aliasing.  */
> >         notl    %ecx
> > +       movq    %rdx, %r10
> >         testl   $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> >         jz      L(large_memcpy_4x)
> >
> > -       movq    %rdx, %r10
> > -       shrq    $LOG_4X_MEMCPY_THRESH, %r10
> > -       cmp     __x86_shared_non_temporal_threshold(%rip), %r10
> > > +       /* r11 has __x86_shared_non_temporal_threshold.  Shift it
> > > +          left by LOG_4X_MEMCPY_THRESH to get the L(large_memcpy_4x)
> > > +          threshold.  */
> > +       shlq    $LOG_4X_MEMCPY_THRESH, %r11
> > +       cmp     %r11, %rdx
> >         jae     L(large_memcpy_4x)
> >
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 2 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > -       shrq    $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > +       shrq    $(LOG_PAGE_SIZE + 1), %r10
> >         /* Copy 4x VEC at a time from 2 pages.  */
> >         .p2align 4
> >  L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> >         .p2align 4
> >  L(large_memcpy_4x):
> > -       movq    %rdx, %r10
> >         /* edx will store remainder size for copying tail.  */
> >         andl    $(PAGE_SIZE * 4 - 1), %edx
> >         /* r10 stores outer loop counter.  */
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 29+ messages in thread
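
The control flow that this patch rearranges can be summarized in C.
The following is a rough sketch: `n` stands for the copy size (rdx),
`thresh` for __x86_shared_non_temporal_threshold, and `pages_alias`
for the page-aliasing test done on ecx; the function and variable
names are invented for illustration and do not exist in glibc:

    #include <stddef.h>

    #define LOG_4X_MEMCPY_THRESH 4
    #define LOG_PAGE_SIZE 12

    enum large_copy { MORE_8X_VEC, LARGE_MEMCPY_2X, LARGE_MEMCPY_4X };

    /* Which large-copy strategy the reworked assembly selects.  */
    static enum large_copy
    pick_large_copy (size_t n, size_t thresh, int pages_alias)
    {
      /* The corrected lower-bound check: compare against the
         non-temporal threshold, not __x86_rep_movsb_threshold.  */
      if (n < thresh)
        return MORE_8X_VEC;

      /* Use the 4x loop if src and dst would alias in the 2x loop,
         or if the copy is at least 16x the threshold.  The threshold
         is shifted left once (thresh << 4) instead of shifting the
         size right, which avoids reloading it from memory.  */
      if (pages_alias || n >= (thresh << LOG_4X_MEMCPY_THRESH))
        return LARGE_MEMCPY_4X;

      /* 2x loop: with the size kept unshifted in a register, the
         outer-loop counter is n >> (LOG_PAGE_SIZE + 1), i.e. the
         number of 2-page blocks, instead of the old
         n >> ((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH).  */
      return LARGE_MEMCPY_2X;
    }

    int
    main (void)
    {
      /* Example: a 1 MiB copy with a 768 KiB threshold and no
         aliasing stays on the 2x path.  */
      return pick_large_copy (1024 * 1024, 768 * 1024, 0)
             == LARGE_MEMCPY_2X ? 0 : 1;
    }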

end of thread, other threads:[~2022-07-14  2:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-15  0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
2022-06-15  0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
2022-06-15  1:07   ` H.J. Lu
2022-06-15  3:57     ` Noah Goldstein
2022-06-15  3:57   ` [PATCH v2] " Noah Goldstein
2022-06-15 14:52     ` H.J. Lu
2022-06-15 15:13       ` Noah Goldstein
2022-06-15 15:12   ` [PATCH v3] " Noah Goldstein
2022-06-15 16:48     ` H.J. Lu
2022-06-15 17:44       ` Noah Goldstein
2022-06-15 17:41   ` [PATCH v4 1/2] " Noah Goldstein
2022-06-15 17:41     ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
2022-06-15 18:22       ` H.J. Lu
2022-06-15 18:33         ` Noah Goldstein
2022-06-15 18:32       ` [PATCH v5 " Noah Goldstein
2022-06-15 18:43         ` H.J. Lu
2022-06-15 19:52       ` [PATCH v6 2/3] " Noah Goldstein
2022-06-15 20:27         ` H.J. Lu
2022-06-15 20:35           ` Noah Goldstein
2022-06-15 20:34       ` [PATCH v7 " Noah Goldstein
2022-06-15 20:48         ` H.J. Lu
2022-07-14  2:55           ` Sunil Pandey
2022-06-15 18:22     ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
2022-07-14  2:57       ` Sunil Pandey
2022-06-15  0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
2022-06-15  1:08   ` H.J. Lu
2022-07-14  2:54     ` Sunil Pandey
2022-06-15  1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
2022-07-14  2:53   ` Sunil Pandey
