* [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
@ 2022-06-15 0:25 Noah Goldstein
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 0:25 UTC (permalink / raw)
To: libc-alpha
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that `rep_movsb_stop_threshold` (which is used
to redirect control flow to the non_temporal case) picks up any
user-provided value for `non_temporal_threshold` (set via
glibc.cpu.x86_non_temporal_threshold).
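As a rough illustration (a simplified C sketch, not the literal
dl_init_cacheinfo code; the default-value formula shown is only an
example), the ordering problem looks like this:

    /* Default non-temporal threshold derived from cache sizes
       (illustrative formula).  */
    unsigned long int non_temporal_threshold = shared * 3 / 4;

    /* BEFORE this patch the stop threshold was derived here, from
       the default value only:
         unsigned long int rep_movsb_stop_threshold
           = non_temporal_threshold;  */

    /* Tunables are read afterwards and may override the default.  */
    long int tunable_size
      = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
    if (tunable_size != 0)
      non_temporal_threshold = tunable_size;

    /* AFTER this patch the stop threshold is derived here, so a
       user-provided glibc.cpu.x86_non_temporal_threshold is
       honored.  */
    unsigned long int rep_movsb_stop_threshold = non_temporal_threshold;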
---
sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index f64a2fb0ba..cc3b840f9c 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
rep_movsb_threshold = 2112;
- unsigned long int rep_movsb_stop_threshold;
- /* ERMS feature is implemented from AMD Zen3 architecture and it is
- performing poorly for data above L2 cache size. Henceforth, adding
- an upper bound threshold parameter to limit the usage of Enhanced
- REP MOVSB operations and setting its value to L2 cache size. */
- if (cpu_features->basic.kind == arch_kind_amd)
- rep_movsb_stop_threshold = core;
- /* Setting the upper bound of ERMS to the computed value of
- non-temporal threshold for architectures other than AMD. */
- else
- rep_movsb_stop_threshold = non_temporal_threshold;
-
/* The default threshold to use Enhanced REP STOSB. */
unsigned long int rep_stosb_threshold = 2048;
@@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
SIZE_MAX);
#endif
+ unsigned long int rep_movsb_stop_threshold;
+ /* ERMS feature is implemented from AMD Zen3 architecture and it is
+ performing poorly for data above L2 cache size. Henceforth, adding
+ an upper bound threshold parameter to limit the usage of Enhanced
+ REP MOVSB operations and setting its value to L2 cache size. */
+ if (cpu_features->basic.kind == arch_kind_amd)
+ rep_movsb_stop_threshold = core;
+ /* Setting the upper bound of ERMS to the computed value of
+ non-temporal threshold for architectures other than AMD. */
+ else
+ rep_movsb_stop_threshold = non_temporal_threshold;
+
cpu_features->data_cache_size = data;
cpu_features->shared_cache_size = shared;
cpu_features->non_temporal_threshold = non_temporal_threshold;
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
@ 2022-06-15 0:25 ` Noah Goldstein
2022-06-15 1:07 ` H.J. Lu
` (3 more replies)
2022-06-15 0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
2022-06-15 1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
2 siblings, 4 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 0:25 UTC (permalink / raw)
To: libc-alpha
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
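In rough C terms (an illustrative model only; the real logic is the
assembly below and also depends on the page aliasing of src and dst),
the intended bounds checks are:

    /* Hypothetical helper for illustration; not part of glibc.  */
    #define LOG_4X_MEMCPY_THRESH 4   /* Matches the assembly default.  */

    enum copy_path { PATH_8X_VEC, PATH_LARGE_2X, PATH_LARGE_4X };

    static enum copy_path
    pick_large_copy_path (unsigned long int size,
                          unsigned long int non_temporal_threshold)
    {
      if (size < non_temporal_threshold)
        return PATH_8X_VEC;       /* L(more_8x_vec_check)  */
      if (size < (non_temporal_threshold << LOG_4X_MEMCPY_THRESH))
        return PATH_LARGE_2X;     /* L(large_memcpy_2x)  */
      return PATH_LARGE_4X;       /* L(large_memcpy_4x)  */
    }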
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 8 +++--
.../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
3 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..a66152d9cc 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
shared = tunable_size;
tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
- /* NB: Ignore the default value 0. */
- if (tunable_size != 0)
+ /* NB: Ignore the default value 0. Saturate very large values at
+ LONG_MAX >> 3. */
+ if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
non_temporal_threshold = tunable_size;
+ /* Saturate huge arguments. */
+ else if (tunable_size != 0)
+ non_temporal_threshold = LONG_MAX >> 3;
tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
if (tunable_size > minimum_rep_movsb_threshold)
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
#endif
-/* Amount to shift rdx by to compare for memcpy_large_4x. */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get the
+ bound for memcpy_large_4x. This is essentially used to
+ indicate that the copy is far beyond the scope of L3
+ (assuming no user config x86_non_temporal_threshold) and to
+ use a more aggressively unrolled loop. NB: before
+ increasing the value also update initialization of
+ x86_non_temporal_threshold. */
#ifndef LOG_4X_MEMCPY_THRESH
# define LOG_4X_MEMCPY_THRESH 4
#endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
.p2align 4,, 10
#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
L(large_memcpy_2x_check):
- cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
- jb L(more_8x_vec_check)
+ /* Entry from L(large_memcpy_2x) has a redundant load of
+ __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+ is only used for the non-erms memmove which is generally less
+ common. */
L(large_memcpy_2x):
+ mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
+ cmp %R11_LP, %RDX_LP
+ jb L(more_8x_vec_check)
/* To reach this point it is impossible for dst > src and
overlap. Remaining to check is src > dst and overlap. rcx
already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
/* ecx contains -(dst - src). not ecx will return dst - src - 1
which works for testing aliasing. */
notl %ecx
+ movq %rdx, %r10
testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
jz L(large_memcpy_4x)
- movq %rdx, %r10
- shrq $LOG_4X_MEMCPY_THRESH, %r10
- cmp __x86_shared_non_temporal_threshold(%rip), %r10
+ /* r11 has __x86_shared_non_temporal_threshold. Shift it left
+ by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+ */
+ shlq $LOG_4X_MEMCPY_THRESH, %r11
+ cmp %r11, %rdx
jae L(large_memcpy_4x)
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 2 - 1), %edx
/* r10 stores outer loop counter. */
- shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+ shrq $(LOG_PAGE_SIZE + 1), %r10
/* Copy 4x VEC at a time from 2 pages. */
.p2align 4
L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
.p2align 4
L(large_memcpy_4x):
- movq %rdx, %r10
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 4 - 1), %edx
/* r10 stores outer loop counter. */
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
2022-06-15 0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
@ 2022-06-15 0:25 ` Noah Goldstein
2022-06-15 1:08 ` H.J. Lu
2022-06-15 1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
2 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 0:25 UTC (permalink / raw)
To: libc-alpha
This has been missing since the ifuncs were added.
The performance of SSE4.2 is preferable to SSE2.
Measured on Tigerlake with N = 20 runs.
Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
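(For reference, the geometric mean over per-benchmark ratios of the
form SSE4.2-time / SSE2-time can be computed as sketched below; the
benchmark harness and inputs themselves are not part of this patch.)

    #include <math.h>
    #include <stddef.h>

    /* Geometric mean of n ratios; a result below 1.0 means the
       SSE4.2 variant was faster on average.  */
    static double
    geometric_mean (const double *ratios, size_t n)
    {
      double log_sum = 0.0;
      for (size_t i = 0; i < n; i++)
        log_sum += log (ratios[i]);
      return exp (log_sum / (double) n);
    }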
---
sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
index a248c2a6e6..9c1677724c 100644
--- a/sysdeps/x86_64/multiarch/strcmp.c
+++ b/sysdeps/x86_64/multiarch/strcmp.c
@@ -28,6 +28,7 @@
extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
+extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
@@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
return OPTIMIZE (avx2);
}
+ if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
+ && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
+ return OPTIMIZE (sse42);
+
if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
return OPTIMIZE (sse2_unaligned);
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
2022-06-15 0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
2022-06-15 0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
@ 2022-06-15 1:02 ` H.J. Lu
2022-07-14 2:53 ` Sunil Pandey
2 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 1:02 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Move the setting of `rep_movsb_stop_threshold` to after the tunables
> have been collected so that the `rep_movsb_stop_threshold` (which
> is used to redirect control flow to the non_temporal case) will
> use any user value for `non_temporal_threshold` (set using
> glibc.cpu.x86_non_temporal_threshold)
> ---
> sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index f64a2fb0ba..cc3b840f9c 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> rep_movsb_threshold = 2112;
>
> - unsigned long int rep_movsb_stop_threshold;
> - /* ERMS feature is implemented from AMD Zen3 architecture and it is
> - performing poorly for data above L2 cache size. Henceforth, adding
> - an upper bound threshold parameter to limit the usage of Enhanced
> - REP MOVSB operations and setting its value to L2 cache size. */
> - if (cpu_features->basic.kind == arch_kind_amd)
> - rep_movsb_stop_threshold = core;
> - /* Setting the upper bound of ERMS to the computed value of
> - non-temporal threshold for architectures other than AMD. */
> - else
> - rep_movsb_stop_threshold = non_temporal_threshold;
> -
> /* The default threshold to use Enhanced REP STOSB. */
> unsigned long int rep_stosb_threshold = 2048;
>
> @@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> SIZE_MAX);
> #endif
>
> + unsigned long int rep_movsb_stop_threshold;
> + /* ERMS feature is implemented from AMD Zen3 architecture and it is
> + performing poorly for data above L2 cache size. Henceforth, adding
> + an upper bound threshold parameter to limit the usage of Enhanced
> + REP MOVSB operations and setting its value to L2 cache size. */
> + if (cpu_features->basic.kind == arch_kind_amd)
> + rep_movsb_stop_threshold = core;
> + /* Setting the upper bound of ERMS to the computed value of
> + non-temporal threshold for architectures other than AMD. */
> + else
> + rep_movsb_stop_threshold = non_temporal_threshold;
> +
> cpu_features->data_cache_size = data;
> cpu_features->shared_cache_size = shared;
> cpu_features->non_temporal_threshold = non_temporal_threshold;
> --
> 2.34.1
>
LGTM.
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
@ 2022-06-15 1:07 ` H.J. Lu
2022-06-15 3:57 ` Noah Goldstein
2022-06-15 3:57 ` [PATCH v2] " Noah Goldstein
` (2 subsequent siblings)
3 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 1:07 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> Previously was using `__x86_rep_movsb_threshold` and should
> have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
> the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
> more clearly.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 8 +++--
> .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> 3 files changed, 28 insertions(+), 11 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..a66152d9cc 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> shared = tunable_size;
>
> tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
> - /* NB: Ignore the default value 0. */
> - if (tunable_size != 0)
> + /* NB: Ignore the default value 0. Saturate very large values at
> + LONG_MAX >> 4. */
> + if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
> non_temporal_threshold = tunable_size;
> + /* Saturate huge arguments. */
> + else if (tunable_size != 0)
> + non_temporal_threshold = LONG_MAX >> 3;
>
> tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> if (tunable_size > minimum_rep_movsb_threshold)
Please update
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
0, SIZE_MAX);
instead.
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
> # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> +/* Amount to shift __x86_shared_non_temporal_threshold by for
> + bound for memcpy_large_4x. This is essentially use to to
> + indicate that the copy is far beyond the scope of L3
> + (assuming no user config x86_non_temporal_threshold) and to
> + use a more aggressively unrolled loop. NB: before
> + increasing the value also update initialization of
> + x86_non_temporal_threshold. */
> #ifndef LOG_4X_MEMCPY_THRESH
> # define LOG_4X_MEMCPY_THRESH 4
> #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> .p2align 4,, 10
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> L(large_memcpy_2x_check):
> - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> - jb L(more_8x_vec_check)
> + /* Entry from L(large_memcpy_2x) has a redundant load of
> + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> + is only use for the non-erms memmove which is generally less
> + common. */
> L(large_memcpy_2x):
> + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> + cmp %R11_LP, %RDX_LP
> + jb L(more_8x_vec_check)
> /* To reach this point it is impossible for dst > src and
> overlap. Remaining to check is src > dst and overlap. rcx
> already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> /* ecx contains -(dst - src). not ecx will return dst - src - 1
> which works for testing aliasing. */
> notl %ecx
> + movq %rdx, %r10
> testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> jz L(large_memcpy_4x)
>
> - movq %rdx, %r10
> - shrq $LOG_4X_MEMCPY_THRESH, %r10
> - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> + */
> + shlq $LOG_4X_MEMCPY_THRESH, %r11
> + cmp %r11, %rdx
> jae L(large_memcpy_4x)
>
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 2 - 1), %edx
> /* r10 stores outer loop counter. */
> - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> + shrq $(LOG_PAGE_SIZE + 1), %r10
> /* Copy 4x VEC at a time from 2 pages. */
> .p2align 4
> L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
> .p2align 4
> L(large_memcpy_4x):
> - movq %rdx, %r10
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 4 - 1), %edx
> /* r10 stores outer loop counter. */
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
2022-06-15 0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
@ 2022-06-15 1:08 ` H.J. Lu
2022-07-14 2:54 ` Sunil Pandey
0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 1:08 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> This has been missing since the the ifuncs where added.
>
> The performance of SSE4.2 is preferable to to SSE2.
>
> Measured on Tigerlake with N = 20 runs.
> Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
> ---
> sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
> index a248c2a6e6..9c1677724c 100644
> --- a/sysdeps/x86_64/multiarch/strcmp.c
> +++ b/sysdeps/x86_64/multiarch/strcmp.c
> @@ -28,6 +28,7 @@
>
> extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
> extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
> extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> @@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
> return OPTIMIZE (avx2);
> }
>
> + if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> + && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> + return OPTIMIZE (sse42);
> +
> if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
> return OPTIMIZE (sse2_unaligned);
>
> --
> 2.34.1
>
LGTM.
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
2022-06-15 1:07 ` H.J. Lu
@ 2022-06-15 3:57 ` Noah Goldstein
2022-06-15 14:52 ` H.J. Lu
2022-06-15 15:12 ` [PATCH v3] " Noah Goldstein
2022-06-15 17:41 ` [PATCH v4 1/2] " Noah Goldstein
3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 3:57 UTC (permalink / raw)
To: libc-alpha
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 2 +-
.../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
3 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..858ff8a135 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
#endif
-/* Amount to shift rdx by to compare for memcpy_large_4x. */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get the
+ bound for memcpy_large_4x. This is essentially used to
+ indicate that the copy is far beyond the scope of L3
+ (assuming no user config x86_non_temporal_threshold) and to
+ use a more aggressively unrolled loop. NB: before
+ increasing the value also update initialization of
+ x86_non_temporal_threshold. */
#ifndef LOG_4X_MEMCPY_THRESH
# define LOG_4X_MEMCPY_THRESH 4
#endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
.p2align 4,, 10
#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
L(large_memcpy_2x_check):
- cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
- jb L(more_8x_vec_check)
+ /* Entry from L(large_memcpy_2x) has a redundant load of
+ __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+ is only used for the non-erms memmove which is generally less
+ common. */
L(large_memcpy_2x):
+ mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
+ cmp %R11_LP, %RDX_LP
+ jb L(more_8x_vec_check)
/* To reach this point it is impossible for dst > src and
overlap. Remaining to check is src > dst and overlap. rcx
already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
/* ecx contains -(dst - src). not ecx will return dst - src - 1
which works for testing aliasing. */
notl %ecx
+ movq %rdx, %r10
testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
jz L(large_memcpy_4x)
- movq %rdx, %r10
- shrq $LOG_4X_MEMCPY_THRESH, %r10
- cmp __x86_shared_non_temporal_threshold(%rip), %r10
+ /* r11 has __x86_shared_non_temporal_threshold. Shift it left
+ by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+ */
+ shlq $LOG_4X_MEMCPY_THRESH, %r11
+ cmp %r11, %rdx
jae L(large_memcpy_4x)
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 2 - 1), %edx
/* r10 stores outer loop counter. */
- shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+ shrq $(LOG_PAGE_SIZE + 1), %r10
/* Copy 4x VEC at a time from 2 pages. */
.p2align 4
L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
.p2align 4
L(large_memcpy_4x):
- movq %rdx, %r10
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 4 - 1), %edx
/* r10 stores outer loop counter. */
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 1:07 ` H.J. Lu
@ 2022-06-15 3:57 ` Noah Goldstein
0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 3:57 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell
On Tue, Jun 14, 2022 at 6:08 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> > Previously was using `__x86_rep_movsb_threshold` and should
> > have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> > the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> > more clearly.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 8 +++--
> > .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> > 3 files changed, 28 insertions(+), 11 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..a66152d9cc 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -915,9 +915,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > shared = tunable_size;
> >
> > tunable_size = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
> > - /* NB: Ignore the default value 0. */
> > - if (tunable_size != 0)
> > + /* NB: Ignore the default value 0. Saturate very large values at
> > + LONG_MAX >> 4. */
> > + if (tunable_size != 0 && tunable_size <= (LONG_MAX >> 3))
> > non_temporal_threshold = tunable_size;
> > + /* Saturate huge arguments. */
> > + else if (tunable_size != 0)
> > + non_temporal_threshold = LONG_MAX >> 3;
> >
> > tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> > if (tunable_size > minimum_rep_movsb_threshold)
>
> Please update
>
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> 0, SIZE_MAX);
>
> instead.
Fixed in V2.
>
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> > # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> > #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by for
> > + bound for memcpy_large_4x. This is essentially use to to
> > + indicate that the copy is far beyond the scope of L3
> > + (assuming no user config x86_non_temporal_threshold) and to
> > + use a more aggressively unrolled loop. NB: before
> > + increasing the value also update initialization of
> > + x86_non_temporal_threshold. */
> > #ifndef LOG_4X_MEMCPY_THRESH
> > # define LOG_4X_MEMCPY_THRESH 4
> > #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> > .p2align 4,, 10
> > #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > L(large_memcpy_2x_check):
> > - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > - jb L(more_8x_vec_check)
> > + /* Entry from L(large_memcpy_2x) has a redundant load of
> > + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > + is only use for the non-erms memmove which is generally less
> > + common. */
> > L(large_memcpy_2x):
> > + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > + cmp %R11_LP, %RDX_LP
> > + jb L(more_8x_vec_check)
> > /* To reach this point it is impossible for dst > src and
> > overlap. Remaining to check is src > dst and overlap. rcx
> > already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> > /* ecx contains -(dst - src). not ecx will return dst - src - 1
> > which works for testing aliasing. */
> > notl %ecx
> > + movq %rdx, %r10
> > testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> > jz L(large_memcpy_4x)
> >
> > - movq %rdx, %r10
> > - shrq $LOG_4X_MEMCPY_THRESH, %r10
> > - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> > + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> > + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > + */
> > + shlq $LOG_4X_MEMCPY_THRESH, %r11
> > + cmp %r11, %rdx
> > jae L(large_memcpy_4x)
> >
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 2 - 1), %edx
> > /* r10 stores outer loop counter. */
> > - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > + shrq $(LOG_PAGE_SIZE + 1), %r10
> > /* Copy 4x VEC at a time from 2 pages. */
> > .p2align 4
> > L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> > .p2align 4
> > L(large_memcpy_4x):
> > - movq %rdx, %r10
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 4 - 1), %edx
> > /* r10 stores outer loop counter. */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 3:57 ` [PATCH v2] " Noah Goldstein
@ 2022-06-15 14:52 ` H.J. Lu
2022-06-15 15:13 ` Noah Goldstein
0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 14:52 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Tue, Jun 14, 2022 at 8:57 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> Previously was using `__x86_rep_movsb_threshold` and should
> have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
> the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
> more clearly.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 2 +-
> .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> 3 files changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..858ff8a135 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0, SIZE_MAX >> 4);
Please add a comment to describe where >> 4 comes from.
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
> # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> +/* Amount to shift __x86_shared_non_temporal_threshold by for
> + bound for memcpy_large_4x. This is essentially use to to
> + indicate that the copy is far beyond the scope of L3
> + (assuming no user config x86_non_temporal_threshold) and to
> + use a more aggressively unrolled loop. NB: before
> + increasing the value also update initialization of
> + x86_non_temporal_threshold. */
> #ifndef LOG_4X_MEMCPY_THRESH
> # define LOG_4X_MEMCPY_THRESH 4
> #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> .p2align 4,, 10
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> L(large_memcpy_2x_check):
> - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> - jb L(more_8x_vec_check)
> + /* Entry from L(large_memcpy_2x) has a redundant load of
> + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> + is only use for the non-erms memmove which is generally less
> + common. */
> L(large_memcpy_2x):
> + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> + cmp %R11_LP, %RDX_LP
> + jb L(more_8x_vec_check)
> /* To reach this point it is impossible for dst > src and
> overlap. Remaining to check is src > dst and overlap. rcx
> already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> /* ecx contains -(dst - src). not ecx will return dst - src - 1
> which works for testing aliasing. */
> notl %ecx
> + movq %rdx, %r10
> testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> jz L(large_memcpy_4x)
>
> - movq %rdx, %r10
> - shrq $LOG_4X_MEMCPY_THRESH, %r10
> - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> + */
> + shlq $LOG_4X_MEMCPY_THRESH, %r11
> + cmp %r11, %rdx
> jae L(large_memcpy_4x)
>
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 2 - 1), %edx
> /* r10 stores outer loop counter. */
> - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> + shrq $(LOG_PAGE_SIZE + 1), %r10
> /* Copy 4x VEC at a time from 2 pages. */
> .p2align 4
> L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
> .p2align 4
> L(large_memcpy_4x):
> - movq %rdx, %r10
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 4 - 1), %edx
> /* r10 stores outer loop counter. */
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
2022-06-15 1:07 ` H.J. Lu
2022-06-15 3:57 ` [PATCH v2] " Noah Goldstein
@ 2022-06-15 15:12 ` Noah Goldstein
2022-06-15 16:48 ` H.J. Lu
2022-06-15 17:41 ` [PATCH v4 1/2] " Noah Goldstein
3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 15:12 UTC (permalink / raw)
To: libc-alpha
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 6 +++-
.../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
3 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..f94ff2df43 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+ /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+ 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+ if that operation cannot overflow. Note that the '>> 4' also reflects the
+ bound in the manual. */
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
#endif
-/* Amount to shift rdx by to compare for memcpy_large_4x. */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get the
+ bound for memcpy_large_4x. This is essentially used to
+ indicate that the copy is far beyond the scope of L3
+ (assuming no user config x86_non_temporal_threshold) and to
+ use a more aggressively unrolled loop. NB: before
+ increasing the value also update initialization of
+ x86_non_temporal_threshold. */
#ifndef LOG_4X_MEMCPY_THRESH
# define LOG_4X_MEMCPY_THRESH 4
#endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
.p2align 4,, 10
#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
L(large_memcpy_2x_check):
- cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
- jb L(more_8x_vec_check)
+ /* Entry from L(large_memcpy_2x) has a redundant load of
+ __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+ is only used for the non-erms memmove which is generally less
+ common. */
L(large_memcpy_2x):
+ mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
+ cmp %R11_LP, %RDX_LP
+ jb L(more_8x_vec_check)
/* To reach this point it is impossible for dst > src and
overlap. Remaining to check is src > dst and overlap. rcx
already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
/* ecx contains -(dst - src). not ecx will return dst - src - 1
which works for testing aliasing. */
notl %ecx
+ movq %rdx, %r10
testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
jz L(large_memcpy_4x)
- movq %rdx, %r10
- shrq $LOG_4X_MEMCPY_THRESH, %r10
- cmp __x86_shared_non_temporal_threshold(%rip), %r10
+ /* r11 has __x86_shared_non_temporal_threshold. Shift it left
+ by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+ */
+ shlq $LOG_4X_MEMCPY_THRESH, %r11
+ cmp %r11, %rdx
jae L(large_memcpy_4x)
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 2 - 1), %edx
/* r10 stores outer loop counter. */
- shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+ shrq $(LOG_PAGE_SIZE + 1), %r10
/* Copy 4x VEC at a time from 2 pages. */
.p2align 4
L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
.p2align 4
L(large_memcpy_4x):
- movq %rdx, %r10
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 4 - 1), %edx
/* r10 stores outer loop counter. */
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 14:52 ` H.J. Lu
@ 2022-06-15 15:13 ` Noah Goldstein
0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 15:13 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 7:52 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Jun 14, 2022 at 8:57 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> > Previously was using `__x86_rep_movsb_threshold` and should
> > have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> > the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> > more clearly.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 2 +-
> > .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> > 3 files changed, 23 insertions(+), 10 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..858ff8a135 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -932,7 +932,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > - 0, SIZE_MAX);
> > + 0, SIZE_MAX >> 4);
>
> Please add a comment to describe where >> 4 comes from.
Fixed in V3.
>
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> > minimum_rep_movsb_threshold, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> > # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> > #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by for
> > + bound for memcpy_large_4x. This is essentially use to to
> > + indicate that the copy is far beyond the scope of L3
> > + (assuming no user config x86_non_temporal_threshold) and to
> > + use a more aggressively unrolled loop. NB: before
> > + increasing the value also update initialization of
> > + x86_non_temporal_threshold. */
> > #ifndef LOG_4X_MEMCPY_THRESH
> > # define LOG_4X_MEMCPY_THRESH 4
> > #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> > .p2align 4,, 10
> > #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > L(large_memcpy_2x_check):
> > - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > - jb L(more_8x_vec_check)
> > + /* Entry from L(large_memcpy_2x) has a redundant load of
> > + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > + is only use for the non-erms memmove which is generally less
> > + common. */
> > L(large_memcpy_2x):
> > + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > + cmp %R11_LP, %RDX_LP
> > + jb L(more_8x_vec_check)
> > /* To reach this point it is impossible for dst > src and
> > overlap. Remaining to check is src > dst and overlap. rcx
> > already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> > /* ecx contains -(dst - src). not ecx will return dst - src - 1
> > which works for testing aliasing. */
> > notl %ecx
> > + movq %rdx, %r10
> > testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> > jz L(large_memcpy_4x)
> >
> > - movq %rdx, %r10
> > - shrq $LOG_4X_MEMCPY_THRESH, %r10
> > - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> > + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> > + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > + */
> > + shlq $LOG_4X_MEMCPY_THRESH, %r11
> > + cmp %r11, %rdx
> > jae L(large_memcpy_4x)
> >
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 2 - 1), %edx
> > /* r10 stores outer loop counter. */
> > - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > + shrq $(LOG_PAGE_SIZE + 1), %r10
> > /* Copy 4x VEC at a time from 2 pages. */
> > .p2align 4
> > L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> > .p2align 4
> > L(large_memcpy_4x):
> > - movq %rdx, %r10
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 4 - 1), %edx
> > /* r10 stores outer loop counter. */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 15:12 ` [PATCH v3] " Noah Goldstein
@ 2022-06-15 16:48 ` H.J. Lu
2022-06-15 17:44 ` Noah Goldstein
0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 16:48 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 8:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> Previously was using `__x86_rep_movsb_threshold` and should
> have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
> the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
> more clearly.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 6 +++-
> .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> 3 files changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..f94ff2df43 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> + if that operation cannot overflow. Not the '>> 4' also reflect the bound
> + in the manual. */
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0, SIZE_MAX >> 4);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
To help backport, please break this patch into 2 patches and
make the memmove-vec-unaligned-erms.S change a separate
one.
Thanks.
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
> # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> +/* Amount to shift __x86_shared_non_temporal_threshold by for
> + bound for memcpy_large_4x. This is essentially use to to
> + indicate that the copy is far beyond the scope of L3
> + (assuming no user config x86_non_temporal_threshold) and to
> + use a more aggressively unrolled loop. NB: before
> + increasing the value also update initialization of
> + x86_non_temporal_threshold. */
> #ifndef LOG_4X_MEMCPY_THRESH
> # define LOG_4X_MEMCPY_THRESH 4
> #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> .p2align 4,, 10
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> L(large_memcpy_2x_check):
> - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> - jb L(more_8x_vec_check)
> + /* Entry from L(large_memcpy_2x) has a redundant load of
> + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> + is only use for the non-erms memmove which is generally less
> + common. */
> L(large_memcpy_2x):
> + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> + cmp %R11_LP, %RDX_LP
> + jb L(more_8x_vec_check)
> /* To reach this point it is impossible for dst > src and
> overlap. Remaining to check is src > dst and overlap. rcx
> already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> /* ecx contains -(dst - src). not ecx will return dst - src - 1
> which works for testing aliasing. */
> notl %ecx
> + movq %rdx, %r10
> testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> jz L(large_memcpy_4x)
>
> - movq %rdx, %r10
> - shrq $LOG_4X_MEMCPY_THRESH, %r10
> - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> + */
> + shlq $LOG_4X_MEMCPY_THRESH, %r11
> + cmp %r11, %rdx
> jae L(large_memcpy_4x)
>
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 2 - 1), %edx
> /* r10 stores outer loop counter. */
> - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> + shrq $(LOG_PAGE_SIZE + 1), %r10
> /* Copy 4x VEC at a time from 2 pages. */
> .p2align 4
> L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
> .p2align 4
> L(large_memcpy_4x):
> - movq %rdx, %r10
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 4 - 1), %edx
> /* r10 stores outer loop counter. */
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
` (2 preceding siblings ...)
2022-06-15 15:12 ` [PATCH v3] " Noah Goldstein
@ 2022-06-15 17:41 ` Noah Goldstein
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
2022-06-15 18:22 ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
3 siblings, 2 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:41 UTC (permalink / raw)
To: libc-alpha
1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
Previously was using `__x86_rep_movsb_threshold` and should
have been using `__x86_shared_non_temporal_threshold`.
2. Avoid reloading __x86_shared_non_temporal_threshold before
the L(large_memcpy_4x) bounds check.
3. Document the second bounds check for L(large_memcpy_4x)
more clearly.
---
.../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index af51177d5d..d1518b8bab 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -118,7 +118,13 @@
# define LARGE_LOAD_SIZE (VEC_SIZE * 4)
#endif
-/* Amount to shift rdx by to compare for memcpy_large_4x. */
+/* Amount to shift __x86_shared_non_temporal_threshold by to get the
+ bound for memcpy_large_4x. This is essentially used to
+ indicate that the copy is far beyond the scope of L3
+ (assuming no user config x86_non_temporal_threshold) and to
+ use a more aggressively unrolled loop. NB: before
+ increasing the value also update initialization of
+ x86_non_temporal_threshold. */
#ifndef LOG_4X_MEMCPY_THRESH
# define LOG_4X_MEMCPY_THRESH 4
#endif
@@ -724,9 +730,14 @@ L(skip_short_movsb_check):
.p2align 4,, 10
#if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
L(large_memcpy_2x_check):
- cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
- jb L(more_8x_vec_check)
+ /* Entry from L(large_memcpy_2x) has a redundant load of
+ __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
+ is only used for the non-erms memmove which is generally less
+ common. */
L(large_memcpy_2x):
+ mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
+ cmp %R11_LP, %RDX_LP
+ jb L(more_8x_vec_check)
/* To reach this point it is impossible for dst > src and
overlap. Remaining to check is src > dst and overlap. rcx
already contains dst - src. Negate rcx to get src - dst. If
@@ -774,18 +785,21 @@ L(large_memcpy_2x):
/* ecx contains -(dst - src). not ecx will return dst - src - 1
which works for testing aliasing. */
notl %ecx
+ movq %rdx, %r10
testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
jz L(large_memcpy_4x)
- movq %rdx, %r10
- shrq $LOG_4X_MEMCPY_THRESH, %r10
- cmp __x86_shared_non_temporal_threshold(%rip), %r10
+ /* r11 has __x86_shared_non_temporal_threshold. Shift it left
+ by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
+ */
+ shlq $LOG_4X_MEMCPY_THRESH, %r11
+ cmp %r11, %rdx
jae L(large_memcpy_4x)
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 2 - 1), %edx
/* r10 stores outer loop counter. */
- shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
+ shrq $(LOG_PAGE_SIZE + 1), %r10
/* Copy 4x VEC at a time from 2 pages. */
.p2align 4
L(loop_large_memcpy_2x_outer):
@@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
.p2align 4
L(large_memcpy_4x):
- movq %rdx, %r10
/* edx will store remainder size for copying tail. */
andl $(PAGE_SIZE * 4 - 1), %edx
/* r10 stores outer loop counter. */
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
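The rewritten L(large_memcpy_4x) size check above trades a shift of the
length for a shift of the threshold: (len >> 4) >= threshold and
len >= (threshold << 4) pick the same sizes as long as the scaled
threshold cannot wrap. A minimal C sketch of the two forms (hypothetical
helper names, not taken from the patch):

    #include <stddef.h>

    #define LOG_4X_MEMCPY_THRESH 4

    /* Old form: scale the copy length down, then compare against the
       unscaled non-temporal threshold.  */
    static int
    use_large_memcpy_4x_old (size_t len, size_t non_temporal_threshold)
    {
      return (len >> LOG_4X_MEMCPY_THRESH) >= non_temporal_threshold;
    }

    /* New form: the threshold is already in a register, so scale it up
       once and compare the unmodified length.  This selects the same
       sizes as the old form provided the shift cannot wrap, i.e.
       provided non_temporal_threshold <= SIZE_MAX >> 4.  */
    static int
    use_large_memcpy_4x_new (size_t len, size_t non_temporal_threshold)
    {
      return len >= (non_temporal_threshold << LOG_4X_MEMCPY_THRESH);
    }

    int
    main (void)
    {
      /* Both forms agree for any length once the threshold is bounded;
         0xc0000 is the documented default threshold.  */
      return use_large_memcpy_4x_old (1 << 20, 0xc0000)
             == use_large_memcpy_4x_new (1 << 20, 0xc0000) ? 0 : 1;
    }

The wrap condition is exactly why the follow-up patch caps
glibc.cpu.x86_non_temporal_threshold at SIZE_MAX >> 4.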
* [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 17:41 ` [PATCH v4 1/2] " Noah Goldstein
@ 2022-06-15 17:41 ` Noah Goldstein
2022-06-15 18:22 ` H.J. Lu
` (3 more replies)
2022-06-15 18:22 ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
1 sibling, 4 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:41 UTC (permalink / raw)
To: libc-alpha
The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 6 +++++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..49daf3eb4a 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..f94ff2df43 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+ /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+ 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+ if that operation cannot overflow. Note that the '>> 4' also reflects the
+ bound in the manual. */
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
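Assuming a 64-bit size_t, the new maximum in the manual/tunables.texi
hunk above is just SIZE_MAX shifted right by LOG_4X_MEMCPY_THRESH. A
standalone sanity check, not part of the patch:

    #include <assert.h>
    #include <stdint.h>

    int
    main (void)
    {
      /* Matches the 0x0fff... maximum shown in the tunables.texi hunk,
         assuming a 64-bit size_t.  */
      assert ((SIZE_MAX >> 4) == 0x0fffffffffffffffULL);
      /* Any accepted threshold can be scaled by LOG_4X_MEMCPY_THRESH (4)
         without wrapping.  */
      assert (((SIZE_MAX >> 4) << 4) <= SIZE_MAX);
      return 0;
    }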
* Re: [PATCH v3] x86: Cleanup bounds checking in large memcpy case
2022-06-15 16:48 ` H.J. Lu
@ 2022-06-15 17:44 ` Noah Goldstein
0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 17:44 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 9:49 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 8:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> > Previously was using `__x86_rep_movsb_threshold` and should
> > have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> > the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> > more clearly.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 6 +++-
> > .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> > 3 files changed, 27 insertions(+), 10 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..f94ff2df43 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > + if that operation cannot overflow. Not the '>> 4' also reflect the bound
> > + in the manual. */
> > TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > - 0, SIZE_MAX);
> > + 0, SIZE_MAX >> 4);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> > minimum_rep_movsb_threshold, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
>
> To help backport, please break this patch into 2 patches and
> make the memmove-vec-unaligned-erms.S change a separate
> one.
Done in V4.
Note there has been a lower bound missing since 2.34 that might also
need to be backported.
Added it in the second patch. I can split that one too (since upper
bound is not a correctness issue) if it does in fact need to be
backported.
>
> Thanks.
>
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> > # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> > #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by for
> > + bound for memcpy_large_4x. This is essentially use to to
> > + indicate that the copy is far beyond the scope of L3
> > + (assuming no user config x86_non_temporal_threshold) and to
> > + use a more aggressively unrolled loop. NB: before
> > + increasing the value also update initialization of
> > + x86_non_temporal_threshold. */
> > #ifndef LOG_4X_MEMCPY_THRESH
> > # define LOG_4X_MEMCPY_THRESH 4
> > #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> > .p2align 4,, 10
> > #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > L(large_memcpy_2x_check):
> > - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > - jb L(more_8x_vec_check)
> > + /* Entry from L(large_memcpy_2x) has a redundant load of
> > + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > + is only use for the non-erms memmove which is generally less
> > + common. */
> > L(large_memcpy_2x):
> > + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > + cmp %R11_LP, %RDX_LP
> > + jb L(more_8x_vec_check)
> > /* To reach this point it is impossible for dst > src and
> > overlap. Remaining to check is src > dst and overlap. rcx
> > already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> > /* ecx contains -(dst - src). not ecx will return dst - src - 1
> > which works for testing aliasing. */
> > notl %ecx
> > + movq %rdx, %r10
> > testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> > jz L(large_memcpy_4x)
> >
> > - movq %rdx, %r10
> > - shrq $LOG_4X_MEMCPY_THRESH, %r10
> > - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> > + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> > + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > + */
> > + shlq $LOG_4X_MEMCPY_THRESH, %r11
> > + cmp %r11, %rdx
> > jae L(large_memcpy_4x)
> >
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 2 - 1), %edx
> > /* r10 stores outer loop counter. */
> > - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > + shrq $(LOG_PAGE_SIZE + 1), %r10
> > /* Copy 4x VEC at a time from 2 pages. */
> > .p2align 4
> > L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> > .p2align 4
> > L(large_memcpy_4x):
> > - movq %rdx, %r10
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 4 - 1), %edx
> > /* r10 stores outer loop counter. */
> > --
> > 2.34.1
> >
>
>
> --
> H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
@ 2022-06-15 18:22 ` H.J. Lu
2022-06-15 18:33 ` Noah Goldstein
2022-06-15 18:32 ` [PATCH v5 " Noah Goldstein
` (2 subsequent siblings)
3 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:22 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memset_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> right-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 6 +++++-
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..49daf3eb4a 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..f94ff2df43 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> + if that operation cannot overflow. Not the '>> 4' also reflect the bound
> + in the manual. */
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0, SIZE_MAX >> 4);
You didn't change the lower bound.
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 17:41 ` [PATCH v4 1/2] " Noah Goldstein
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
@ 2022-06-15 18:22 ` H.J. Lu
2022-07-14 2:57 ` Sunil Pandey
1 sibling, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:22 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> Previously was using `__x86_rep_movsb_threshold` and should
> have been using `__x86_shared_non_temporal_threshold`.
>
> 2. Avoid reloading __x86_shared_non_temporal_threshold before
> the L(large_memcpy_4x) bounds check.
>
> 3. Document the second bounds check for L(large_memcpy_4x)
> more clearly.
> ---
> .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> 1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> index af51177d5d..d1518b8bab 100644
> --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> @@ -118,7 +118,13 @@
> # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> #endif
>
> -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> +/* Amount to shift __x86_shared_non_temporal_threshold by for
> + bound for memcpy_large_4x. This is essentially use to to
> + indicate that the copy is far beyond the scope of L3
> + (assuming no user config x86_non_temporal_threshold) and to
> + use a more aggressively unrolled loop. NB: before
> + increasing the value also update initialization of
> + x86_non_temporal_threshold. */
> #ifndef LOG_4X_MEMCPY_THRESH
> # define LOG_4X_MEMCPY_THRESH 4
> #endif
> @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> .p2align 4,, 10
> #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> L(large_memcpy_2x_check):
> - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> - jb L(more_8x_vec_check)
> + /* Entry from L(large_memcpy_2x) has a redundant load of
> + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> + is only use for the non-erms memmove which is generally less
> + common. */
> L(large_memcpy_2x):
> + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> + cmp %R11_LP, %RDX_LP
> + jb L(more_8x_vec_check)
> /* To reach this point it is impossible for dst > src and
> overlap. Remaining to check is src > dst and overlap. rcx
> already contains dst - src. Negate rcx to get src - dst. If
> @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> /* ecx contains -(dst - src). not ecx will return dst - src - 1
> which works for testing aliasing. */
> notl %ecx
> + movq %rdx, %r10
> testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> jz L(large_memcpy_4x)
>
> - movq %rdx, %r10
> - shrq $LOG_4X_MEMCPY_THRESH, %r10
> - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> + */
> + shlq $LOG_4X_MEMCPY_THRESH, %r11
> + cmp %r11, %rdx
> jae L(large_memcpy_4x)
>
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 2 - 1), %edx
> /* r10 stores outer loop counter. */
> - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> + shrq $(LOG_PAGE_SIZE + 1), %r10
> /* Copy 4x VEC at a time from 2 pages. */
> .p2align 4
> L(loop_large_memcpy_2x_outer):
> @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
>
> .p2align 4
> L(large_memcpy_4x):
> - movq %rdx, %r10
> /* edx will store remainder size for copying tail. */
> andl $(PAGE_SIZE * 4 - 1), %edx
> /* r10 stores outer loop counter. */
> --
> 2.34.1
>
LGTM.
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5 2/2] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
2022-06-15 18:22 ` H.J. Lu
@ 2022-06-15 18:32 ` Noah Goldstein
2022-06-15 18:43 ` H.J. Lu
2022-06-15 19:52 ` [PATCH v6 2/3] " Noah Goldstein
2022-06-15 20:34 ` [PATCH v7 " Noah Goldstein
3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 18:32 UTC (permalink / raw)
To: libc-alpha
The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 7 ++++++-
2 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..a420ed6206 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x20000, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..b4ff385ae1 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+ /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+ 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+ if that operation cannot overflow. 0x20000 (131072) because the
+ L(large_memcpy_4x) case aggressively unrolls the loop. Both values are
+ reflected in the manual. */
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0x20000, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 18:22 ` H.J. Lu
@ 2022-06-15 18:33 ` Noah Goldstein
0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 18:33 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 11:22 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memset_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > right-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 6 +++++-
> > 2 files changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..49daf3eb4a 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..f94ff2df43 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,12 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > + if that operation cannot overflow. Not the '>> 4' also reflect the bound
> > + in the manual. */
> > TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > - 0, SIZE_MAX);
> > + 0, SIZE_MAX >> 4);
>
> You didn't change the lower bound.
Fixed in V5.
>
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> > minimum_rep_movsb_threshold, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
>
> --
> H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5 2/2] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 18:32 ` [PATCH v5 " Noah Goldstein
@ 2022-06-15 18:43 ` H.J. Lu
0 siblings, 0 replies; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 18:43 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 11:32 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (131072) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memset_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> right-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 7 ++++++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..a420ed6206 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x20000, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..b4ff385ae1 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> + if that operation cannot overflow. 0x20000 (131072) because the
> + L(large_memset_4x) case aggressively unrolls the loop. Both values are
How is 0x20000 computed? Shouldn't it depend on vector size?
> + reflected in the manual. */
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0x20000, SIZE_MAX >> 4);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
2022-06-15 18:22 ` H.J. Lu
2022-06-15 18:32 ` [PATCH v5 " Noah Goldstein
@ 2022-06-15 19:52 ` Noah Goldstein
2022-06-15 20:27 ` H.J. Lu
2022-06-15 20:34 ` [PATCH v7 " Noah Goldstein
3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 19:52 UTC (permalink / raw)
To: libc-alpha
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..2c076019ae 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..c493956259 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+ /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+ 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+ if that operation cannot overflow. Minimum of 0x4040 (16448) because the
+ L(large_memcpy_4x) loops need 64 bytes to cache align and enough space for
+ at least 1 iteration of the 4x PAGE_SIZE unrolled loop. Both values are
+ reflected in the manual. */
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0x20000, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 19:52 ` [PATCH v6 2/3] " Noah Goldstein
@ 2022-06-15 20:27 ` H.J. Lu
2022-06-15 20:35 ` Noah Goldstein
0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 20:27 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 12:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memset_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> right-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..2c076019ae 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..c493956259 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> + if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> + L(large_memset_4x) loops need 64-byte to cache align and enough space for
> + at least 1 iteration of 4x PAGE_SIZE unrolled loop. Both values are
> + reflected in the manual. */
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0x20000, SIZE_MAX >> 4);
The lower bound should be 0x4040.
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
` (2 preceding siblings ...)
2022-06-15 19:52 ` [PATCH v6 2/3] " Noah Goldstein
@ 2022-06-15 20:34 ` Noah Goldstein
2022-06-15 20:48 ` H.J. Lu
3 siblings, 1 reply; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 20:34 UTC (permalink / raw)
To: libc-alpha
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memcpy_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
left-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4), which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
---
manual/tunables.texi | 2 +-
sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
2 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1482412078..2c076019ae 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
-glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
+glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
glibc.cpu.x86_shstk:
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index cc3b840f9c..e9f3382108 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
+ /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms left-shifts the value of
+ 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
+ if that operation cannot overflow. Minimum of 0x4040 (16448) because the
+ L(large_memcpy_4x) loops need 64 bytes to cache align and enough space for
+ at least 1 iteration of the 4x PAGE_SIZE unrolled loop. Both values are
+ reflected in the manual. */
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
- 0, SIZE_MAX);
+ 0x4040, SIZE_MAX >> 4);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
--
2.34.1
^ permalink raw reply [flat|nested] 29+ messages in thread
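Taking the new comment's numbers at face value, the 0x4040 lower bound
is one iteration of the 4x PAGE_SIZE unrolled loop plus a cache line of
alignment slack. A small sketch, assuming PAGE_SIZE is 4096:

    #include <assert.h>

    int
    main (void)
    {
      const unsigned long page_size = 4096;   /* assumed PAGE_SIZE */
      const unsigned long cache_line = 64;    /* alignment slack */

      /* One iteration of the 4x PAGE_SIZE unrolled loop plus up to a
         cache line needed to reach 64-byte alignment.  */
      assert (4 * page_size + cache_line == 16448);
      assert (16448 == 0x4040);
      return 0;
    }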
* Re: [PATCH v6 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 20:27 ` H.J. Lu
@ 2022-06-15 20:35 ` Noah Goldstein
0 siblings, 0 replies; 29+ messages in thread
From: Noah Goldstein @ 2022-06-15 20:35 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 1:27 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Wed, Jun 15, 2022 at 12:52 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memset_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > right-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> > 2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..2c076019ae 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..c493956259 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > + if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > + L(large_memset_4x) loops need 64-byte to cache align and enough space for
> > + at least 1 iteration of 4x PAGE_SIZE unrolled loop. Both values are
> > + reflected in the manual. */
> > TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > - 0, SIZE_MAX);
> > + 0x20000, SIZE_MAX >> 4);
>
> The lower bound should be 0x4040.
Done in V7.
>
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> > minimum_rep_movsb_threshold, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
>
> --
> H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 20:34 ` [PATCH v7 " Noah Goldstein
@ 2022-06-15 20:48 ` H.J. Lu
2022-07-14 2:55 ` Sunil Pandey
0 siblings, 1 reply; 29+ messages in thread
From: H.J. Lu @ 2022-06-15 20:48 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Wed, Jun 15, 2022 at 1:34 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> by memmove-vec-unaligned-erms.
>
> The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> the loop aggressively in the L(large_memset_4x) case.
>
> The upper-bound is needed because memmove-vec-unaligned-erms
> right-shifts the value of `x86_non_temporal_threshold` by
> LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
>
> The lack of lower-bound can be a correctness issue. The lack of
> upper-bound cannot.
> ---
> manual/tunables.texi | 2 +-
> sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1482412078..2c076019ae 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index cc3b840f9c..e9f3382108 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>
> TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> + if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> + L(large_memset_4x) loops need 64-byte to cache align and enough space for
> + at least 1 iteration of 4x PAGE_SIZE unrolled loop. Both values are
> + reflected in the manual. */
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> - 0, SIZE_MAX);
> + 0x4040, SIZE_MAX >> 4);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> --
> 2.34.1
>
LGTM.
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold`
2022-06-15 1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
@ 2022-07-14 2:53 ` Sunil Pandey
0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14 2:53 UTC (permalink / raw)
To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library
On Tue, Jun 14, 2022 at 6:03 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Move the setting of `rep_movsb_stop_threshold` to after the tunables
> > have been collected so that the `rep_movsb_stop_threshold` (which
> > is used to redirect control flow to the non_temporal case) will
> > use any user value for `non_temporal_threshold` (set using
> > glibc.cpu.x86_non_temporal_threshold)
> > ---
> > sysdeps/x86/dl-cacheinfo.h | 24 ++++++++++++------------
> > 1 file changed, 12 insertions(+), 12 deletions(-)
> >
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index f64a2fb0ba..cc3b840f9c 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -898,18 +898,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> > rep_movsb_threshold = 2112;
> >
> > - unsigned long int rep_movsb_stop_threshold;
> > - /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > - performing poorly for data above L2 cache size. Henceforth, adding
> > - an upper bound threshold parameter to limit the usage of Enhanced
> > - REP MOVSB operations and setting its value to L2 cache size. */
> > - if (cpu_features->basic.kind == arch_kind_amd)
> > - rep_movsb_stop_threshold = core;
> > - /* Setting the upper bound of ERMS to the computed value of
> > - non-temporal threshold for architectures other than AMD. */
> > - else
> > - rep_movsb_stop_threshold = non_temporal_threshold;
> > -
> > /* The default threshold to use Enhanced REP STOSB. */
> > unsigned long int rep_stosb_threshold = 2048;
> >
> > @@ -951,6 +939,18 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> > SIZE_MAX);
> > #endif
> >
> > + unsigned long int rep_movsb_stop_threshold;
> > + /* ERMS feature is implemented from AMD Zen3 architecture and it is
> > + performing poorly for data above L2 cache size. Henceforth, adding
> > + an upper bound threshold parameter to limit the usage of Enhanced
> > + REP MOVSB operations and setting its value to L2 cache size. */
> > + if (cpu_features->basic.kind == arch_kind_amd)
> > + rep_movsb_stop_threshold = core;
> > + /* Setting the upper bound of ERMS to the computed value of
> > + non-temporal threshold for architectures other than AMD. */
> > + else
> > + rep_movsb_stop_threshold = non_temporal_threshold;
> > +
> > cpu_features->data_cache_size = data;
> > cpu_features->shared_cache_size = shared;
> > cpu_features->non_temporal_threshold = non_temporal_threshold;
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 29+ messages in thread
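The ordering issue fixed by the 1/3 patch quoted above reduces to which
assignment happens first. The sketch below uses hypothetical stand-ins
(not glibc code) for the default computation and the tunable read, and
shows the non-AMD path where rep_movsb_stop_threshold tracks the
non-temporal threshold:

    #include <stdio.h>

    /* Hypothetical stand-ins: 0xc0000 is the documented default for
       glibc.cpu.x86_non_temporal_threshold, 0x100000 plays the role of
       a user-supplied tunable value.  */
    static unsigned long
    default_non_temporal_threshold (void)
    {
      return 0xc0000;
    }

    static unsigned long
    apply_tunable (unsigned long current)
    {
      (void) current;
      return 0x100000;
    }

    int
    main (void)
    {
      /* Old order: the stop threshold is fixed before the tunable is
         read, so the user value never reaches it.  */
      unsigned long nt = default_non_temporal_threshold ();
      unsigned long stop_old = nt;
      nt = apply_tunable (nt);

      /* New order (the fix): read the tunable first, then derive the
         stop threshold from it.  */
      unsigned long nt2 = default_non_temporal_threshold ();
      nt2 = apply_tunable (nt2);
      unsigned long stop_new = nt2;

      printf ("old order: rep_movsb_stop_threshold = %#lx\n", stop_old);
      printf ("new order: rep_movsb_stop_threshold = %#lx\n", stop_new);
      return 0;
    }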
* Re: [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc
2022-06-15 1:08 ` H.J. Lu
@ 2022-07-14 2:54 ` Sunil Pandey
0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14 2:54 UTC (permalink / raw)
To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library
On Tue, Jun 14, 2022 at 6:09 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Tue, Jun 14, 2022 at 5:25 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > This has been missing since the the ifuncs where added.
> >
> > The performance of SSE4.2 is preferable to to SSE2.
> >
> > Measured on Tigerlake with N = 20 runs.
> > Geometric Mean of all benchmarks SSE4.2 / SSE2: 0.906
> > ---
> > sysdeps/x86_64/multiarch/strcmp.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
> > index a248c2a6e6..9c1677724c 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp.c
> > +++ b/sysdeps/x86_64/multiarch/strcmp.c
> > @@ -28,6 +28,7 @@
> >
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
> > +extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> > @@ -52,6 +53,10 @@ IFUNC_SELECTOR (void)
> > return OPTIMIZE (avx2);
> > }
> >
> > + if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> > + && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> > + return OPTIMIZE (sse42);
> > +
> > if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
> > return OPTIMIZE (sse2_unaligned);
> >
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v7 2/3] x86: Add bounds `x86_non_temporal_threshold`
2022-06-15 20:48 ` H.J. Lu
@ 2022-07-14 2:55 ` Sunil Pandey
0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14 2:55 UTC (permalink / raw)
To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library
On Wed, Jun 15, 2022 at 1:49 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jun 15, 2022 at 1:34 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
> > by memmove-vec-unaligned-erms.
> >
> > The lower-bound is needed because memmove-vec-unaligned-erms unrolls
> > the loop aggressively in the L(large_memset_4x) case.
> >
> > The upper-bound is needed because memmove-vec-unaligned-erms
> > right-shifts the value of `x86_non_temporal_threshold` by
> > LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
> >
> > The lack of lower-bound can be a correctness issue. The lack of
> > upper-bound cannot.
> > ---
> > manual/tunables.texi | 2 +-
> > sysdeps/x86/dl-cacheinfo.h | 8 +++++++-
> > 2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/manual/tunables.texi b/manual/tunables.texi
> > index 1482412078..2c076019ae 100644
> > --- a/manual/tunables.texi
> > +++ b/manual/tunables.texi
> > @@ -47,7 +47,7 @@ glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.elision.skip_lock_busy: 3 (min: -2147483648, max: 2147483647)
> > glibc.malloc.top_pad: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> > -glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x0, max: 0xffffffffffffffff)
> > +glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0x0fffffffffffffff)
> > glibc.cpu.x86_shstk:
> > glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.malloc.mmap_max: 0 (min: -2147483648, max: 2147483647)
> > diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> > index cc3b840f9c..e9f3382108 100644
> > --- a/sysdeps/x86/dl-cacheinfo.h
> > +++ b/sysdeps/x86/dl-cacheinfo.h
> > @@ -931,8 +931,14 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> >
> > TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
> > + /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
> > + 'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
> > + if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> > + L(large_memset_4x) loops need 64-byte to cache align and enough space for
> > + at least 1 iteration of 4x PAGE_SIZE unrolled loop. Both values are
> > + reflected in the manual. */
> > TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> > - 0, SIZE_MAX);
> > + 0x4040, SIZE_MAX >> 4);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> > minimum_rep_movsb_threshold, SIZE_MAX);
> > TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case
2022-06-15 18:22 ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
@ 2022-07-14 2:57 ` Sunil Pandey
0 siblings, 0 replies; 29+ messages in thread
From: Sunil Pandey @ 2022-07-14 2:57 UTC (permalink / raw)
To: H.J. Lu; +Cc: Noah Goldstein, GNU C Library
On Wed, Jun 15, 2022 at 11:24 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Jun 15, 2022 at 10:41 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > 1. Fix incorrect lower-bound threshold in L(large_memcpy_2x).
> > Previously was using `__x86_rep_movsb_threshold` and should
> > have been using `__x86_shared_non_temporal_threshold`.
> >
> > 2. Avoid reloading __x86_shared_non_temporal_threshold before
> > the L(large_memcpy_4x) bounds check.
> >
> > 3. Document the second bounds check for L(large_memcpy_4x)
> > more clearly.
> > ---
> > .../multiarch/memmove-vec-unaligned-erms.S | 29 ++++++++++++++-----
> > 1 file changed, 21 insertions(+), 8 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > index af51177d5d..d1518b8bab 100644
> > --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
> > @@ -118,7 +118,13 @@
> > # define LARGE_LOAD_SIZE (VEC_SIZE * 4)
> > #endif
> >
> > -/* Amount to shift rdx by to compare for memcpy_large_4x. */
> > +/* Amount to shift __x86_shared_non_temporal_threshold by for
> > + bound for memcpy_large_4x. This is essentially use to to
> > + indicate that the copy is far beyond the scope of L3
> > + (assuming no user config x86_non_temporal_threshold) and to
> > + use a more aggressively unrolled loop. NB: before
> > + increasing the value also update initialization of
> > + x86_non_temporal_threshold. */
> > #ifndef LOG_4X_MEMCPY_THRESH
> > # define LOG_4X_MEMCPY_THRESH 4
> > #endif
> > @@ -724,9 +730,14 @@ L(skip_short_movsb_check):
> > .p2align 4,, 10
> > #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
> > L(large_memcpy_2x_check):
> > - cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
> > - jb L(more_8x_vec_check)
> > + /* Entry from L(large_memcpy_2x) has a redundant load of
> > + __x86_shared_non_temporal_threshold(%rip). L(large_memcpy_2x)
> > + is only use for the non-erms memmove which is generally less
> > + common. */
> > L(large_memcpy_2x):
> > + mov __x86_shared_non_temporal_threshold(%rip), %R11_LP
> > + cmp %R11_LP, %RDX_LP
> > + jb L(more_8x_vec_check)
> > /* To reach this point it is impossible for dst > src and
> > overlap. Remaining to check is src > dst and overlap. rcx
> > already contains dst - src. Negate rcx to get src - dst. If
> > @@ -774,18 +785,21 @@ L(large_memcpy_2x):
> > /* ecx contains -(dst - src). not ecx will return dst - src - 1
> > which works for testing aliasing. */
> > notl %ecx
> > + movq %rdx, %r10
> > testl $(PAGE_SIZE - VEC_SIZE * 8), %ecx
> > jz L(large_memcpy_4x)
> >
> > - movq %rdx, %r10
> > - shrq $LOG_4X_MEMCPY_THRESH, %r10
> > - cmp __x86_shared_non_temporal_threshold(%rip), %r10
> > + /* r11 has __x86_shared_non_temporal_threshold. Shift it left
> > + by LOG_4X_MEMCPY_THRESH to get L(large_memcpy_4x) threshold.
> > + */
> > + shlq $LOG_4X_MEMCPY_THRESH, %r11
> > + cmp %r11, %rdx
> > jae L(large_memcpy_4x)
> >
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 2 - 1), %edx
> > /* r10 stores outer loop counter. */
> > - shrq $((LOG_PAGE_SIZE + 1) - LOG_4X_MEMCPY_THRESH), %r10
> > + shrq $(LOG_PAGE_SIZE + 1), %r10
> > /* Copy 4x VEC at a time from 2 pages. */
> > .p2align 4
> > L(loop_large_memcpy_2x_outer):
> > @@ -850,7 +864,6 @@ L(large_memcpy_2x_end):
> >
> > .p2align 4
> > L(large_memcpy_4x):
> > - movq %rdx, %r10
> > /* edx will store remainder size for copying tail. */
> > andl $(PAGE_SIZE * 4 - 1), %edx
> > /* r10 stores outer loop counter. */
> > --
> > 2.34.1
> >
>
> LGTM.
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 29+ messages in thread
Thread overview: 29+ messages
2022-06-15 0:25 [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` Noah Goldstein
2022-06-15 0:25 ` [PATCH v1 2/3] x86: Cleanup bounds checking in large memcpy case Noah Goldstein
2022-06-15 1:07 ` H.J. Lu
2022-06-15 3:57 ` Noah Goldstein
2022-06-15 3:57 ` [PATCH v2] " Noah Goldstein
2022-06-15 14:52 ` H.J. Lu
2022-06-15 15:13 ` Noah Goldstein
2022-06-15 15:12 ` [PATCH v3] " Noah Goldstein
2022-06-15 16:48 ` H.J. Lu
2022-06-15 17:44 ` Noah Goldstein
2022-06-15 17:41 ` [PATCH v4 1/2] " Noah Goldstein
2022-06-15 17:41 ` [PATCH v4 2/2] x86: Add bounds `x86_non_temporal_threshold` Noah Goldstein
2022-06-15 18:22 ` H.J. Lu
2022-06-15 18:33 ` Noah Goldstein
2022-06-15 18:32 ` [PATCH v5 " Noah Goldstein
2022-06-15 18:43 ` H.J. Lu
2022-06-15 19:52 ` [PATCH v6 2/3] " Noah Goldstein
2022-06-15 20:27 ` H.J. Lu
2022-06-15 20:35 ` Noah Goldstein
2022-06-15 20:34 ` [PATCH v7 " Noah Goldstein
2022-06-15 20:48 ` H.J. Lu
2022-07-14 2:55 ` Sunil Pandey
2022-06-15 18:22 ` [PATCH v4 1/2] x86: Cleanup bounds checking in large memcpy case H.J. Lu
2022-07-14 2:57 ` Sunil Pandey
2022-06-15 0:25 ` [PATCH v1 3/3] x86: Add sse42 implementation to strcmp's ifunc Noah Goldstein
2022-06-15 1:08 ` H.J. Lu
2022-07-14 2:54 ` Sunil Pandey
2022-06-15 1:02 ` [PATCH v1 1/3] x86: Fix misordered logic for setting `rep_movsb_stop_threshold` H.J. Lu
2022-07-14 2:53 ` Sunil Pandey