* [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
@ 2024-05-19 0:43 Noah Goldstein
2024-05-20 17:14 ` H.J. Lu
` (3 more replies)
0 siblings, 4 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-19 0:43 UTC (permalink / raw)
To: libc-alpha; +Cc: goldstein.w.n, hjl.tools
Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2 : 0.986
Geometric Mean across the suite New / Old SSE2 : 0.979
Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch.
---
.../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
1 file changed, 91 insertions(+), 58 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
2. If size is less than VEC, use integer register stores.
3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
- 5. On machines ERMS feature, if size is greater or equal than
- __x86_rep_stosb_threshold then REP STOSB will be used.
- 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
- 4 VEC stores and store 4 * VEC at a time until done. */
+ 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+ 4 VEC stores and store 4 * VEC at a time until done.
+ 6. On machines ERMS feature, if size is range
+ [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+ then REP STOSB will be used.
+ 7. If size >= __x86_shared_non_temporal_threshold, use a
+ non-temporal stores. */
#include <sysdep.h>
@@ -147,6 +150,41 @@ L(entry_from_wmemset):
VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
VMOVU %VMM(0), (%rdi)
VZEROUPPER_RETURN
+
+ /* If have AVX512 mask instructions put L(less_vec) close to
+ entry as it doesn't take much space and is likely a hot target. */
+#ifdef USE_LESS_VEC_MASK_STORE
+ /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
+ .p2align 6,, 47
+ .p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+ /* Less than 1 VEC. */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+# error Unsupported VEC_SIZE!
+# endif
+ /* Clear high bits from edi. Only keeping bits relevant to page
+ cross check. Note that we are using rax which is set in
+ MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
+ andl $(PAGE_SIZE - 1), %edi
+ /* Check if VEC_SIZE store cross page. Mask stores suffer
+ serious performance degradation when it has to fault suppress. */
+ cmpl $(PAGE_SIZE - VEC_SIZE), %edi
+ /* This is generally considered a cold target. */
+ ja L(cross_page)
+# if VEC_SIZE > 32
+ movq $-1, %rcx
+ bzhiq %rdx, %rcx, %rcx
+ kmovq %rcx, %k1
+# else
+ movl $-1, %ecx
+ bzhil %edx, %ecx, %ecx
+ kmovd %ecx, %k1
+# endif
+ vmovdqu8 %VMM(0), (%rax){%k1}
+ VZEROUPPER_RETURN
+#endif
+
#if defined USE_MULTIARCH && IS_IN (libc)
END (MEMSET_SYMBOL (__memset, unaligned))
@@ -185,54 +223,6 @@ L(last_2x_vec):
#endif
VZEROUPPER_RETURN
- /* If have AVX512 mask instructions put L(less_vec) close to
- entry as it doesn't take much space and is likely a hot target.
- */
-#ifdef USE_LESS_VEC_MASK_STORE
- .p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
- /* Less than 1 VEC. */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-# error Unsupported VEC_SIZE!
-# endif
- /* Clear high bits from edi. Only keeping bits relevant to page
- cross check. Note that we are using rax which is set in
- MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
- andl $(PAGE_SIZE - 1), %edi
- /* Check if VEC_SIZE store cross page. Mask stores suffer
- serious performance degradation when it has to fault suppress.
- */
- cmpl $(PAGE_SIZE - VEC_SIZE), %edi
- /* This is generally considered a cold target. */
- ja L(cross_page)
-# if VEC_SIZE > 32
- movq $-1, %rcx
- bzhiq %rdx, %rcx, %rcx
- kmovq %rcx, %k1
-# else
- movl $-1, %ecx
- bzhil %edx, %ecx, %ecx
- kmovd %ecx, %k1
-# endif
- vmovdqu8 %VMM(0), (%rax){%k1}
- VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
- /* Include L(stosb_local) here if including L(less_vec) between
- L(stosb_more_2x_vec) and ENTRY. This is to cache align the
- L(stosb_more_2x_vec) target. */
- .p2align 4,, 10
-L(stosb_local):
- movzbl %sil, %eax
- mov %RDX_LP, %RCX_LP
- mov %RDI_LP, %RDX_LP
- rep stosb
- mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
-# endif
-#endif
-
#if defined USE_MULTIARCH && IS_IN (libc)
.p2align 4
L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
ret
#endif
- .p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+ .p2align 4
+#else
+ .p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
range for 2-byte jump encoding. */
L(stosb_local):
+ cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ jae L(nt_memset)
movzbl %sil, %eax
mov %RDX_LP, %RCX_LP
mov %RDI_LP, %RDX_LP
rep stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+ /* Use xchg to save 1-byte (this helps align targets below). */
+ xchg %RDX_LP, %RAX_LP
+# else
mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
# endif
+ VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
/* Define L(less_vec) only if not otherwise defined. */
- .p2align 4
+ .p2align 4,, 12
L(less_vec):
/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
xmm). This is only does anything for AVX2. */
@@ -423,4 +425,35 @@ L(between_2_3):
movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
#endif
ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+ /* Force align so the loop doesn't cross a cache-line. */
+ .p2align 4
+# endif
+ .p2align 4,, 7
+ /* Memset using non-temporal stores. */
+L(nt_memset):
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
+ leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
+ /* Align DST. */
+ orq $(VEC_SIZE * 1 - 1), %rdi
+ incq %rdi
+ .p2align 4,, 7
+L(nt_loop):
+ VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
+ subq $(VEC_SIZE * -4), %rdi
+ cmpq %rdx, %rdi
+ jb L(nt_loop)
+ sfence
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
+ VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
--
2.34.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-19 0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
@ 2024-05-20 17:14 ` H.J. Lu
2024-05-20 19:16 ` Adhemerval Zanella Netto
` (2 subsequent siblings)
3 siblings, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2024-05-20 17:14 UTC (permalink / raw)
To: Noah Goldstein; +Cc: libc-alpha
On Sat, May 18, 2024 at 5:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2 : 0.986
> Geometric Mean across the suite New / Old SSE2 : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2 : 0.928
> Geometric Mean across the suite New / Old SSE2 : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
> .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> 1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
> 2. If size is less than VEC, use integer register stores.
> 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> - 5. On machines ERMS feature, if size is greater or equal than
> - __x86_rep_stosb_threshold then REP STOSB will be used.
> - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> - 4 VEC stores and store 4 * VEC at a time until done. */
> + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> + 4 VEC stores and store 4 * VEC at a time until done.
> + 6. On machines ERMS feature, if size is range
> + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + then REP STOSB will be used.
> + 7. If size >= __x86_shared_non_temporal_threshold, use a
> + non-temporal stores. */
>
> #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> VMOVU %VMM(0), (%rdi)
> VZEROUPPER_RETURN
> +
> + /* If have AVX512 mask instructions put L(less_vec) close to
> + entry as it doesn't take much space and is likely a hot target. */
> +#ifdef USE_LESS_VEC_MASK_STORE
> + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> + .p2align 6,, 47
> + .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> + /* Less than 1 VEC. */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +# error Unsupported VEC_SIZE!
> +# endif
> + /* Clear high bits from edi. Only keeping bits relevant to page
> + cross check. Note that we are using rax which is set in
> + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> + andl $(PAGE_SIZE - 1), %edi
> + /* Check if VEC_SIZE store cross page. Mask stores suffer
> + serious performance degradation when it has to fault suppress. */
> + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> + /* This is generally considered a cold target. */
> + ja L(cross_page)
> +# if VEC_SIZE > 32
> + movq $-1, %rcx
> + bzhiq %rdx, %rcx, %rcx
> + kmovq %rcx, %k1
> +# else
> + movl $-1, %ecx
> + bzhil %edx, %ecx, %ecx
> + kmovd %ecx, %k1
> +# endif
> + vmovdqu8 %VMM(0), (%rax){%k1}
> + VZEROUPPER_RETURN
> +#endif
> +
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
> #endif
> VZEROUPPER_RETURN
>
> - /* If have AVX512 mask instructions put L(less_vec) close to
> - entry as it doesn't take much space and is likely a hot target.
> - */
> -#ifdef USE_LESS_VEC_MASK_STORE
> - .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> - /* Less than 1 VEC. */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -# error Unsupported VEC_SIZE!
> -# endif
> - /* Clear high bits from edi. Only keeping bits relevant to page
> - cross check. Note that we are using rax which is set in
> - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> - andl $(PAGE_SIZE - 1), %edi
> - /* Check if VEC_SIZE store cross page. Mask stores suffer
> - serious performance degradation when it has to fault suppress.
> - */
> - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> - /* This is generally considered a cold target. */
> - ja L(cross_page)
> -# if VEC_SIZE > 32
> - movq $-1, %rcx
> - bzhiq %rdx, %rcx, %rcx
> - kmovq %rcx, %k1
> -# else
> - movl $-1, %ecx
> - bzhil %edx, %ecx, %ecx
> - kmovd %ecx, %k1
> -# endif
> - vmovdqu8 %VMM(0), (%rax){%k1}
> - VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> - /* Include L(stosb_local) here if including L(less_vec) between
> - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> - L(stosb_more_2x_vec) target. */
> - .p2align 4,, 10
> -L(stosb_local):
> - movzbl %sil, %eax
> - mov %RDX_LP, %RCX_LP
> - mov %RDI_LP, %RDX_LP
> - rep stosb
> - mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> -# endif
> -#endif
> -
> #if defined USE_MULTIARCH && IS_IN (libc)
> .p2align 4
> L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
> ret
> #endif
>
> - .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> + .p2align 4
> +#else
> + .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> mov %RDI_LP, %RDX_LP
> rep stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> + /* Use xchg to save 1-byte (this helps align targets below). */
> + xchg %RDX_LP, %RAX_LP
> +# else
> mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> # endif
> + VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
> /* Define L(less_vec) only if not otherwise defined. */
> - .p2align 4
> + .p2align 4,, 12
> L(less_vec):
> /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> xmm). This is only does anything for AVX2. */
> @@ -423,4 +425,35 @@ L(between_2_3):
> movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> #endif
> ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> + /* Force align so the loop doesn't cross a cache-line. */
> + .p2align 4
> +# endif
> + .p2align 4,, 7
> + /* Memset using non-temporal stores. */
> +L(nt_memset):
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Align DST. */
> + orq $(VEC_SIZE * 1 - 1), %rdi
> + incq %rdi
> + .p2align 4,, 7
> +L(nt_loop):
> + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> + subq $(VEC_SIZE * -4), %rdi
> + cmpq %rdx, %rdi
> + jb L(nt_loop)
> + sfence
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> + VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
LGTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-19 0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
2024-05-20 17:14 ` H.J. Lu
@ 2024-05-20 19:16 ` Adhemerval Zanella Netto
2024-05-20 22:47 ` Noah Goldstein
2024-05-20 19:37 ` Zack Weinberg
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
3 siblings, 1 reply; 21+ messages in thread
From: Adhemerval Zanella Netto @ 2024-05-20 19:16 UTC (permalink / raw)
To: Noah Goldstein, libc-alpha; +Cc: hjl.tools
On 18/05/24 21:43, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
Btw this datasheet is not accessible without extra steps.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
Any chance on how this play in newer AMD chips? I am trying to avoid
another regression like BZ#30994 and BZ#30995 (this one I would like
to fix, however I don't have access to a Zen4 machine to check for
results).
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2 : 0.986
> Geometric Mean across the suite New / Old SSE2 : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2 : 0.928
> Geometric Mean across the suite New / Old SSE2 : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
> .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> 1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
> 2. If size is less than VEC, use integer register stores.
> 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> - 5. On machines ERMS feature, if size is greater or equal than
> - __x86_rep_stosb_threshold then REP STOSB will be used.
> - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> - 4 VEC stores and store 4 * VEC at a time until done. */
> + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> + 4 VEC stores and store 4 * VEC at a time until done.
> + 6. On machines ERMS feature, if size is range
> + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + then REP STOSB will be used.
> + 7. If size >= __x86_shared_non_temporal_threshold, use a
> + non-temporal stores. */
>
> #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> VMOVU %VMM(0), (%rdi)
> VZEROUPPER_RETURN
> +
> + /* If have AVX512 mask instructions put L(less_vec) close to
> + entry as it doesn't take much space and is likely a hot target. */
> +#ifdef USE_LESS_VEC_MASK_STORE
> + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> + .p2align 6,, 47
> + .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> + /* Less than 1 VEC. */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +# error Unsupported VEC_SIZE!
> +# endif
> + /* Clear high bits from edi. Only keeping bits relevant to page
> + cross check. Note that we are using rax which is set in
> + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> + andl $(PAGE_SIZE - 1), %edi
> + /* Check if VEC_SIZE store cross page. Mask stores suffer
> + serious performance degradation when it has to fault suppress. */
> + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> + /* This is generally considered a cold target. */
> + ja L(cross_page)
> +# if VEC_SIZE > 32
> + movq $-1, %rcx
> + bzhiq %rdx, %rcx, %rcx
> + kmovq %rcx, %k1
> +# else
> + movl $-1, %ecx
> + bzhil %edx, %ecx, %ecx
> + kmovd %ecx, %k1
> +# endif
> + vmovdqu8 %VMM(0), (%rax){%k1}
> + VZEROUPPER_RETURN
> +#endif
> +
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
> #endif
> VZEROUPPER_RETURN
>
> - /* If have AVX512 mask instructions put L(less_vec) close to
> - entry as it doesn't take much space and is likely a hot target.
> - */
> -#ifdef USE_LESS_VEC_MASK_STORE
> - .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> - /* Less than 1 VEC. */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -# error Unsupported VEC_SIZE!
> -# endif
> - /* Clear high bits from edi. Only keeping bits relevant to page
> - cross check. Note that we are using rax which is set in
> - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> - andl $(PAGE_SIZE - 1), %edi
> - /* Check if VEC_SIZE store cross page. Mask stores suffer
> - serious performance degradation when it has to fault suppress.
> - */
> - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> - /* This is generally considered a cold target. */
> - ja L(cross_page)
> -# if VEC_SIZE > 32
> - movq $-1, %rcx
> - bzhiq %rdx, %rcx, %rcx
> - kmovq %rcx, %k1
> -# else
> - movl $-1, %ecx
> - bzhil %edx, %ecx, %ecx
> - kmovd %ecx, %k1
> -# endif
> - vmovdqu8 %VMM(0), (%rax){%k1}
> - VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> - /* Include L(stosb_local) here if including L(less_vec) between
> - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> - L(stosb_more_2x_vec) target. */
> - .p2align 4,, 10
> -L(stosb_local):
> - movzbl %sil, %eax
> - mov %RDX_LP, %RCX_LP
> - mov %RDI_LP, %RDX_LP
> - rep stosb
> - mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> -# endif
> -#endif
> -
> #if defined USE_MULTIARCH && IS_IN (libc)
> .p2align 4
> L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
> ret
> #endif
>
> - .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> + .p2align 4
> +#else
> + .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> mov %RDI_LP, %RDX_LP
> rep stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> + /* Use xchg to save 1-byte (this helps align targets below). */
> + xchg %RDX_LP, %RAX_LP
> +# else
> mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> # endif
> + VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
> /* Define L(less_vec) only if not otherwise defined. */
> - .p2align 4
> + .p2align 4,, 12
> L(less_vec):
> /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> xmm). This is only does anything for AVX2. */
> @@ -423,4 +425,35 @@ L(between_2_3):
> movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> #endif
> ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> + /* Force align so the loop doesn't cross a cache-line. */
> + .p2align 4
> +# endif
> + .p2align 4,, 7
> + /* Memset using non-temporal stores. */
> +L(nt_memset):
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Align DST. */
> + orq $(VEC_SIZE * 1 - 1), %rdi
> + incq %rdi
> + .p2align 4,, 7
> +L(nt_loop):
> + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> + subq $(VEC_SIZE * -4), %rdi
> + cmpq %rdx, %rdi
> + jb L(nt_loop)
> + sfence
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> + VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-19 0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
2024-05-20 17:14 ` H.J. Lu
2024-05-20 19:16 ` Adhemerval Zanella Netto
@ 2024-05-20 19:37 ` Zack Weinberg
2024-05-20 22:48 ` Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
3 siblings, 1 reply; 21+ messages in thread
From: Zack Weinberg @ 2024-05-20 19:37 UTC (permalink / raw)
To: Noah Goldstein, GNU libc development; +Cc: H . J . Lu
On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a few MBs)
> memsets. See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
Did your tests read the region that was memset afterward, or did they
just do the memset? It *probably* doesn't matter at "above a few MBs"
but I'm worried that this new implementation might look good on
microbenchmarks but make real programs *slower* because it is no longer
priming the cache for the memset region.
zw
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-20 19:16 ` Adhemerval Zanella Netto
@ 2024-05-20 22:47 ` Noah Goldstein
2024-05-20 22:50 ` Noah Goldstein
0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:47 UTC (permalink / raw)
To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools
On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 18/05/24 21:43, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a
> > few MBs) memsets.
> > See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Btw this datasheet is not accessible without extra steps.
>
Bh sorry, put the benchmark data in a repo. See data here:
https://github.com/goldsteinn/memset-benchmarks/tree/master/results
The ICX.html and SKX.html have the data from the above link.
> >
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. Historically, these numbers would not have been so good
> > because of the zero-over-zero writeback optimization that `rep stosb`
> > is able to do. But, the zero-over-zero writeback optimization has been
> > removed as a potential side-channel attack, so there is no longer any
> > good reason to only rely on `rep stosb` for large memsets. On the flip
> > size, non-temporal writes can avoid data in their RFO requests saving
> > memory bandwidth.
>
> Any chance on how this play in newer AMD chips? I am trying to avoid
> another regression like BZ#30994 and BZ#30995 (this one I would like
> to fix, however I don't have access to a Zen4 machine to check for
> results).
I didn't and don't have access to any of the newer AMD chips.
The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
has a README w/ steps if anyone wants to test it.
>
> >
> > All of the other changes to the file are to re-organize the
> > code-blocks to maintain "good" alignment given the new code added in
> > the `L(stosb_local)` case.
> >
> > The results from running the GLIBC memset benchmarks on TGL-client for
> > N=20 runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.979
> > Geometric Mean across the suite New / Old EXEX512: 0.979
> > Geometric Mean across the suite New / Old AVX2 : 0.986
> > Geometric Mean across the suite New / Old SSE2 : 0.979
> >
> > Most of the cases are essentially unchanged, this is mostly to show
> > that adding the non-temporal case didn't add any regressions to the
> > other cases.
> >
> > The results on the memset-large benchmark suite on TGL-client for N=20
> > runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.926
> > Geometric Mean across the suite New / Old EXEX512: 0.925
> > Geometric Mean across the suite New / Old AVX2 : 0.928
> > Geometric Mean across the suite New / Old SSE2 : 0.924
> >
> > So roughly a 7.5% speedup. This is lower than what we see on servers
> > (likely because clients typically have faster single-core bandwidth so
> > saving bandwidth on RFOs is less impactful), but still advantageous.
> >
> > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > ---
> > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> > 1 file changed, 91 insertions(+), 58 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > index 97839a2248..637caadb40 100644
> > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > @@ -21,10 +21,13 @@
> > 2. If size is less than VEC, use integer register stores.
> > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > - 5. On machines ERMS feature, if size is greater or equal than
> > - __x86_rep_stosb_threshold then REP STOSB will be used.
> > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > - 4 VEC stores and store 4 * VEC at a time until done. */
> > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > + 4 VEC stores and store 4 * VEC at a time until done.
> > + 6. On machines ERMS feature, if size is range
> > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > + then REP STOSB will be used.
> > + 7. If size >= __x86_shared_non_temporal_threshold, use a
> > + non-temporal stores. */
> >
> > #include <sysdep.h>
> >
> > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > VMOVU %VMM(0), (%rdi)
> > VZEROUPPER_RETURN
> > +
> > + /* If have AVX512 mask instructions put L(less_vec) close to
> > + entry as it doesn't take much space and is likely a hot target. */
> > +#ifdef USE_LESS_VEC_MASK_STORE
> > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> > + .p2align 6,, 47
> > + .p2align 4
> > +L(less_vec):
> > +L(less_vec_from_wmemset):
> > + /* Less than 1 VEC. */
> > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > +# error Unsupported VEC_SIZE!
> > +# endif
> > + /* Clear high bits from edi. Only keeping bits relevant to page
> > + cross check. Note that we are using rax which is set in
> > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > + andl $(PAGE_SIZE - 1), %edi
> > + /* Check if VEC_SIZE store cross page. Mask stores suffer
> > + serious performance degradation when it has to fault suppress. */
> > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > + /* This is generally considered a cold target. */
> > + ja L(cross_page)
> > +# if VEC_SIZE > 32
> > + movq $-1, %rcx
> > + bzhiq %rdx, %rcx, %rcx
> > + kmovq %rcx, %k1
> > +# else
> > + movl $-1, %ecx
> > + bzhil %edx, %ecx, %ecx
> > + kmovd %ecx, %k1
> > +# endif
> > + vmovdqu8 %VMM(0), (%rax){%k1}
> > + VZEROUPPER_RETURN
> > +#endif
> > +
> > #if defined USE_MULTIARCH && IS_IN (libc)
> > END (MEMSET_SYMBOL (__memset, unaligned))
> >
> > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > #endif
> > VZEROUPPER_RETURN
> >
> > - /* If have AVX512 mask instructions put L(less_vec) close to
> > - entry as it doesn't take much space and is likely a hot target.
> > - */
> > -#ifdef USE_LESS_VEC_MASK_STORE
> > - .p2align 4,, 10
> > -L(less_vec):
> > -L(less_vec_from_wmemset):
> > - /* Less than 1 VEC. */
> > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > -# error Unsupported VEC_SIZE!
> > -# endif
> > - /* Clear high bits from edi. Only keeping bits relevant to page
> > - cross check. Note that we are using rax which is set in
> > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > - andl $(PAGE_SIZE - 1), %edi
> > - /* Check if VEC_SIZE store cross page. Mask stores suffer
> > - serious performance degradation when it has to fault suppress.
> > - */
> > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > - /* This is generally considered a cold target. */
> > - ja L(cross_page)
> > -# if VEC_SIZE > 32
> > - movq $-1, %rcx
> > - bzhiq %rdx, %rcx, %rcx
> > - kmovq %rcx, %k1
> > -# else
> > - movl $-1, %ecx
> > - bzhil %edx, %ecx, %ecx
> > - kmovd %ecx, %k1
> > -# endif
> > - vmovdqu8 %VMM(0), (%rax){%k1}
> > - VZEROUPPER_RETURN
> > -
> > -# if defined USE_MULTIARCH && IS_IN (libc)
> > - /* Include L(stosb_local) here if including L(less_vec) between
> > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > - L(stosb_more_2x_vec) target. */
> > - .p2align 4,, 10
> > -L(stosb_local):
> > - movzbl %sil, %eax
> > - mov %RDX_LP, %RCX_LP
> > - mov %RDI_LP, %RDX_LP
> > - rep stosb
> > - mov %RDX_LP, %RAX_LP
> > - VZEROUPPER_RETURN
> > -# endif
> > -#endif
> > -
> > #if defined USE_MULTIARCH && IS_IN (libc)
> > .p2align 4
> > L(stosb_more_2x_vec):
> > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > ret
> > #endif
> >
> > - .p2align 4,, 10
> > -#ifndef USE_LESS_VEC_MASK_STORE
> > -# if defined USE_MULTIARCH && IS_IN (libc)
> > +#ifdef USE_WITH_AVX2
> > + .p2align 4
> > +#else
> > + .p2align 4,, 4
> > +#endif
> > +
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > range for 2-byte jump encoding. */
> > L(stosb_local):
> > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > + jae L(nt_memset)
> > movzbl %sil, %eax
> > mov %RDX_LP, %RCX_LP
> > mov %RDI_LP, %RDX_LP
> > rep stosb
> > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > + /* Use xchg to save 1-byte (this helps align targets below). */
> > + xchg %RDX_LP, %RAX_LP
> > +# else
> > mov %RDX_LP, %RAX_LP
> > - VZEROUPPER_RETURN
> > # endif
> > + VZEROUPPER_RETURN
> > +#endif
> > +#ifndef USE_LESS_VEC_MASK_STORE
> > /* Define L(less_vec) only if not otherwise defined. */
> > - .p2align 4
> > + .p2align 4,, 12
> > L(less_vec):
> > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > xmm). This is only does anything for AVX2. */
> > @@ -423,4 +425,35 @@ L(between_2_3):
> > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > #endif
> > ret
> > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > +
> > +#if defined USE_MULTIARCH && IS_IN (libc)
> > +# ifdef USE_WITH_AVX512
> > + /* Force align so the loop doesn't cross a cache-line. */
> > + .p2align 4
> > +# endif
> > + .p2align 4,, 7
> > + /* Memset using non-temporal stores. */
> > +L(nt_memset):
> > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > + /* Align DST. */
> > + orq $(VEC_SIZE * 1 - 1), %rdi
> > + incq %rdi
> > + .p2align 4,, 7
> > +L(nt_loop):
> > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> > + subq $(VEC_SIZE * -4), %rdi
> > + cmpq %rdx, %rdi
> > + jb L(nt_loop)
> > + sfence
> > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> > + VZEROUPPER_RETURN
> > +#endif
> > +
> > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-20 19:37 ` Zack Weinberg
@ 2024-05-20 22:48 ` Noah Goldstein
0 siblings, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:48 UTC (permalink / raw)
To: Zack Weinberg; +Cc: GNU libc development, H . J . Lu
On Mon, May 20, 2024 at 2:37 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a few MBs)
> > memsets. See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Did your tests read the region that was memset afterward, or did they
> just do the memset? It *probably* doesn't matter at "above a few MBs"
> but I'm worried that this new implementation might look good on
> microbenchmarks but make real programs *slower* because it is no longer
> priming the cache for the memset region.
Yes, See all results tagged with `reused=1`. Results are still good.
Data available at:
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/SKX.html
and/or
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/ICX.html
>
> zw
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-20 22:47 ` Noah Goldstein
@ 2024-05-20 22:50 ` Noah Goldstein
2024-05-23 16:46 ` Noah Goldstein
0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:50 UTC (permalink / raw)
To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools
On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > notably worse than non-temporal stores for large (above a
> > > few MBs) memsets.
> > > See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > For data using different stategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
> >
>
> Bh sorry, put the benchmark data in a repo. See data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
> > >
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX. Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do. But, the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > size, non-temporal writes can avoid data in their RFO requests saving
> > > memory bandwidth.
> >
> > Any chance on how this play in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.
What we could do is use a seperate tunable for memset NT threshold and just
make it SIZE_MAX for AMD.
> >
> > >
> > > All of the other changes to the file are to re-organize the
> > > code-blocks to maintain "good" alignment given the new code added in
> > > the `L(stosb_local)` case.
> > >
> > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > N=20 runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > Geometric Mean across the suite New / Old AVX2 : 0.986
> > > Geometric Mean across the suite New / Old SSE2 : 0.979
> > >
> > > Most of the cases are essentially unchanged, this is mostly to show
> > > that adding the non-temporal case didn't add any regressions to the
> > > other cases.
> > >
> > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > Geometric Mean across the suite New / Old AVX2 : 0.928
> > > Geometric Mean across the suite New / Old SSE2 : 0.924
> > >
> > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > (likely because clients typically have faster single-core bandwidth so
> > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > >
> > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > ---
> > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> > > 1 file changed, 91 insertions(+), 58 deletions(-)
> > >
> > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > index 97839a2248..637caadb40 100644
> > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > @@ -21,10 +21,13 @@
> > > 2. If size is less than VEC, use integer register stores.
> > > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > - 5. On machines ERMS feature, if size is greater or equal than
> > > - __x86_rep_stosb_threshold then REP STOSB will be used.
> > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > - 4 VEC stores and store 4 * VEC at a time until done. */
> > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > + 4 VEC stores and store 4 * VEC at a time until done.
> > > + 6. On machines ERMS feature, if size is range
> > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > + then REP STOSB will be used.
> > > + 7. If size >= __x86_shared_non_temporal_threshold, use a
> > > + non-temporal stores. */
> > >
> > > #include <sysdep.h>
> > >
> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > VMOVU %VMM(0), (%rdi)
> > > VZEROUPPER_RETURN
> > > +
> > > + /* If have AVX512 mask instructions put L(less_vec) close to
> > > + entry as it doesn't take much space and is likely a hot target. */
> > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> > > + .p2align 6,, 47
> > > + .p2align 4
> > > +L(less_vec):
> > > +L(less_vec_from_wmemset):
> > > + /* Less than 1 VEC. */
> > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > +# error Unsupported VEC_SIZE!
> > > +# endif
> > > + /* Clear high bits from edi. Only keeping bits relevant to page
> > > + cross check. Note that we are using rax which is set in
> > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > + andl $(PAGE_SIZE - 1), %edi
> > > + /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > + serious performance degradation when it has to fault suppress. */
> > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > + /* This is generally considered a cold target. */
> > > + ja L(cross_page)
> > > +# if VEC_SIZE > 32
> > > + movq $-1, %rcx
> > > + bzhiq %rdx, %rcx, %rcx
> > > + kmovq %rcx, %k1
> > > +# else
> > > + movl $-1, %ecx
> > > + bzhil %edx, %ecx, %ecx
> > > + kmovd %ecx, %k1
> > > +# endif
> > > + vmovdqu8 %VMM(0), (%rax){%k1}
> > > + VZEROUPPER_RETURN
> > > +#endif
> > > +
> > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > END (MEMSET_SYMBOL (__memset, unaligned))
> > >
> > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > #endif
> > > VZEROUPPER_RETURN
> > >
> > > - /* If have AVX512 mask instructions put L(less_vec) close to
> > > - entry as it doesn't take much space and is likely a hot target.
> > > - */
> > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > - .p2align 4,, 10
> > > -L(less_vec):
> > > -L(less_vec_from_wmemset):
> > > - /* Less than 1 VEC. */
> > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > -# error Unsupported VEC_SIZE!
> > > -# endif
> > > - /* Clear high bits from edi. Only keeping bits relevant to page
> > > - cross check. Note that we are using rax which is set in
> > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > - andl $(PAGE_SIZE - 1), %edi
> > > - /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > - serious performance degradation when it has to fault suppress.
> > > - */
> > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > - /* This is generally considered a cold target. */
> > > - ja L(cross_page)
> > > -# if VEC_SIZE > 32
> > > - movq $-1, %rcx
> > > - bzhiq %rdx, %rcx, %rcx
> > > - kmovq %rcx, %k1
> > > -# else
> > > - movl $-1, %ecx
> > > - bzhil %edx, %ecx, %ecx
> > > - kmovd %ecx, %k1
> > > -# endif
> > > - vmovdqu8 %VMM(0), (%rax){%k1}
> > > - VZEROUPPER_RETURN
> > > -
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > - /* Include L(stosb_local) here if including L(less_vec) between
> > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > - L(stosb_more_2x_vec) target. */
> > > - .p2align 4,, 10
> > > -L(stosb_local):
> > > - movzbl %sil, %eax
> > > - mov %RDX_LP, %RCX_LP
> > > - mov %RDI_LP, %RDX_LP
> > > - rep stosb
> > > - mov %RDX_LP, %RAX_LP
> > > - VZEROUPPER_RETURN
> > > -# endif
> > > -#endif
> > > -
> > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > .p2align 4
> > > L(stosb_more_2x_vec):
> > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > ret
> > > #endif
> > >
> > > - .p2align 4,, 10
> > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > +#ifdef USE_WITH_AVX2
> > > + .p2align 4
> > > +#else
> > > + .p2align 4,, 4
> > > +#endif
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > range for 2-byte jump encoding. */
> > > L(stosb_local):
> > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > + jae L(nt_memset)
> > > movzbl %sil, %eax
> > > mov %RDX_LP, %RCX_LP
> > > mov %RDI_LP, %RDX_LP
> > > rep stosb
> > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > + /* Use xchg to save 1-byte (this helps align targets below). */
> > > + xchg %RDX_LP, %RAX_LP
> > > +# else
> > > mov %RDX_LP, %RAX_LP
> > > - VZEROUPPER_RETURN
> > > # endif
> > > + VZEROUPPER_RETURN
> > > +#endif
> > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > /* Define L(less_vec) only if not otherwise defined. */
> > > - .p2align 4
> > > + .p2align 4,, 12
> > > L(less_vec):
> > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > xmm). This is only does anything for AVX2. */
> > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > #endif
> > > ret
> > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > +
> > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > +# ifdef USE_WITH_AVX512
> > > + /* Force align so the loop doesn't cross a cache-line. */
> > > + .p2align 4
> > > +# endif
> > > + .p2align 4,, 7
> > > + /* Memset using non-temporal stores. */
> > > +L(nt_memset):
> > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > + /* Align DST. */
> > > + orq $(VEC_SIZE * 1 - 1), %rdi
> > > + incq %rdi
> > > + .p2align 4,, 7
> > > +L(nt_loop):
> > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > + subq $(VEC_SIZE * -4), %rdi
> > > + cmpq %rdx, %rdi
> > > + jb L(nt_loop)
> > > + sfence
> > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > + VZEROUPPER_RETURN
> > > +#endif
> > > +
> > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-20 22:50 ` Noah Goldstein
@ 2024-05-23 16:46 ` Noah Goldstein
2024-05-23 16:49 ` H.J. Lu
0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-23 16:46 UTC (permalink / raw)
To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools
On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > <adhemerval.zanella@linaro.org> wrote:
> > >
> > >
> > >
> > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > notably worse than non-temporal stores for large (above a
> > > > few MBs) memsets.
> > > > See:
> > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > For data using different stategies for large memset on ICX and SKX.
> > >
> > > Btw this datasheet is not accessible without extra steps.
> > >
> >
> > Bh sorry, put the benchmark data in a repo. See data here:
> > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >
> > The ICX.html and SKX.html have the data from the above link.
> > > >
> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > on SKX. Historically, these numbers would not have been so good
> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > removed as a potential side-channel attack, so there is no longer any
> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > memory bandwidth.
> > >
> > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > to fix, however I don't have access to a Zen4 machine to check for
> > > results).
> >
> > I didn't and don't have access to any of the newer AMD chips.
> >
> > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > has a README w/ steps if anyone wants to test it.
>
> What we could do is use a seperate tunable for memset NT threshold and just
> make it SIZE_MAX for AMD.
Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
specific?
> > >
> > > >
> > > > All of the other changes to the file are to re-organize the
> > > > code-blocks to maintain "good" alignment given the new code added in
> > > > the `L(stosb_local)` case.
> > > >
> > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > N=20 runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > Geometric Mean across the suite New / Old AVX2 : 0.986
> > > > Geometric Mean across the suite New / Old SSE2 : 0.979
> > > >
> > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > that adding the non-temporal case didn't add any regressions to the
> > > > other cases.
> > > >
> > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > Geometric Mean across the suite New / Old AVX2 : 0.928
> > > > Geometric Mean across the suite New / Old SSE2 : 0.924
> > > >
> > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > (likely because clients typically have faster single-core bandwidth so
> > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > >
> > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > ---
> > > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> > > > 1 file changed, 91 insertions(+), 58 deletions(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > index 97839a2248..637caadb40 100644
> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > @@ -21,10 +21,13 @@
> > > > 2. If size is less than VEC, use integer register stores.
> > > > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > - 5. On machines ERMS feature, if size is greater or equal than
> > > > - __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > - 4 VEC stores and store 4 * VEC at a time until done. */
> > > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > + 4 VEC stores and store 4 * VEC at a time until done.
> > > > + 6. On machines ERMS feature, if size is range
> > > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > + then REP STOSB will be used.
> > > > + 7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > + non-temporal stores. */
> > > >
> > > > #include <sysdep.h>
> > > >
> > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > > VMOVU %VMM(0), (%rdi)
> > > > VZEROUPPER_RETURN
> > > > +
> > > > + /* If have AVX512 mask instructions put L(less_vec) close to
> > > > + entry as it doesn't take much space and is likely a hot target. */
> > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> > > > + .p2align 6,, 47
> > > > + .p2align 4
> > > > +L(less_vec):
> > > > +L(less_vec_from_wmemset):
> > > > + /* Less than 1 VEC. */
> > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > +# error Unsupported VEC_SIZE!
> > > > +# endif
> > > > + /* Clear high bits from edi. Only keeping bits relevant to page
> > > > + cross check. Note that we are using rax which is set in
> > > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > > + andl $(PAGE_SIZE - 1), %edi
> > > > + /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > + serious performance degradation when it has to fault suppress. */
> > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > > + /* This is generally considered a cold target. */
> > > > + ja L(cross_page)
> > > > +# if VEC_SIZE > 32
> > > > + movq $-1, %rcx
> > > > + bzhiq %rdx, %rcx, %rcx
> > > > + kmovq %rcx, %k1
> > > > +# else
> > > > + movl $-1, %ecx
> > > > + bzhil %edx, %ecx, %ecx
> > > > + kmovd %ecx, %k1
> > > > +# endif
> > > > + vmovdqu8 %VMM(0), (%rax){%k1}
> > > > + VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > > END (MEMSET_SYMBOL (__memset, unaligned))
> > > >
> > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > > #endif
> > > > VZEROUPPER_RETURN
> > > >
> > > > - /* If have AVX512 mask instructions put L(less_vec) close to
> > > > - entry as it doesn't take much space and is likely a hot target.
> > > > - */
> > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > - .p2align 4,, 10
> > > > -L(less_vec):
> > > > -L(less_vec_from_wmemset):
> > > > - /* Less than 1 VEC. */
> > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > -# error Unsupported VEC_SIZE!
> > > > -# endif
> > > > - /* Clear high bits from edi. Only keeping bits relevant to page
> > > > - cross check. Note that we are using rax which is set in
> > > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > > - andl $(PAGE_SIZE - 1), %edi
> > > > - /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > - serious performance degradation when it has to fault suppress.
> > > > - */
> > > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > > - /* This is generally considered a cold target. */
> > > > - ja L(cross_page)
> > > > -# if VEC_SIZE > 32
> > > > - movq $-1, %rcx
> > > > - bzhiq %rdx, %rcx, %rcx
> > > > - kmovq %rcx, %k1
> > > > -# else
> > > > - movl $-1, %ecx
> > > > - bzhil %edx, %ecx, %ecx
> > > > - kmovd %ecx, %k1
> > > > -# endif
> > > > - vmovdqu8 %VMM(0), (%rax){%k1}
> > > > - VZEROUPPER_RETURN
> > > > -
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > - /* Include L(stosb_local) here if including L(less_vec) between
> > > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > - L(stosb_more_2x_vec) target. */
> > > > - .p2align 4,, 10
> > > > -L(stosb_local):
> > > > - movzbl %sil, %eax
> > > > - mov %RDX_LP, %RCX_LP
> > > > - mov %RDI_LP, %RDX_LP
> > > > - rep stosb
> > > > - mov %RDX_LP, %RAX_LP
> > > > - VZEROUPPER_RETURN
> > > > -# endif
> > > > -#endif
> > > > -
> > > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > > .p2align 4
> > > > L(stosb_more_2x_vec):
> > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > > ret
> > > > #endif
> > > >
> > > > - .p2align 4,, 10
> > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > +#ifdef USE_WITH_AVX2
> > > > + .p2align 4
> > > > +#else
> > > > + .p2align 4,, 4
> > > > +#endif
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > > range for 2-byte jump encoding. */
> > > > L(stosb_local):
> > > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > + jae L(nt_memset)
> > > > movzbl %sil, %eax
> > > > mov %RDX_LP, %RCX_LP
> > > > mov %RDI_LP, %RDX_LP
> > > > rep stosb
> > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > + /* Use xchg to save 1-byte (this helps align targets below). */
> > > > + xchg %RDX_LP, %RAX_LP
> > > > +# else
> > > > mov %RDX_LP, %RAX_LP
> > > > - VZEROUPPER_RETURN
> > > > # endif
> > > > + VZEROUPPER_RETURN
> > > > +#endif
> > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > > /* Define L(less_vec) only if not otherwise defined. */
> > > > - .p2align 4
> > > > + .p2align 4,, 12
> > > > L(less_vec):
> > > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > > xmm). This is only does anything for AVX2. */
> > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > > #endif
> > > > ret
> > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > +# ifdef USE_WITH_AVX512
> > > > + /* Force align so the loop doesn't cross a cache-line. */
> > > > + .p2align 4
> > > > +# endif
> > > > + .p2align 4,, 7
> > > > + /* Memset using non-temporal stores. */
> > > > +L(nt_memset):
> > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > + /* Align DST. */
> > > > + orq $(VEC_SIZE * 1 - 1), %rdi
> > > > + incq %rdi
> > > > + .p2align 4,, 7
> > > > +L(nt_loop):
> > > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > + subq $(VEC_SIZE * -4), %rdi
> > > > + cmpq %rdx, %rdi
> > > > + jb L(nt_loop)
> > > > + sfence
> > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > + VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-23 16:46 ` Noah Goldstein
@ 2024-05-23 16:49 ` H.J. Lu
2024-05-23 17:18 ` Adhemerval Zanella Netto
0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2024-05-23 16:49 UTC (permalink / raw)
To: Noah Goldstein; +Cc: Adhemerval Zanella Netto, libc-alpha
On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > > <adhemerval.zanella@linaro.org> wrote:
> > > >
> > > >
> > > >
> > > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > > notably worse than non-temporal stores for large (above a
> > > > > few MBs) memsets.
> > > > > See:
> > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > > For data using different stategies for large memset on ICX and SKX.
> > > >
> > > > Btw this datasheet is not accessible without extra steps.
> > > >
> > >
> > > Bh sorry, put the benchmark data in a repo. See data here:
> > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> > >
> > > The ICX.html and SKX.html have the data from the above link.
> > > > >
> > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > > on SKX. Historically, these numbers would not have been so good
> > > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > > removed as a potential side-channel attack, so there is no longer any
> > > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > > memory bandwidth.
> > > >
> > > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > > to fix, however I don't have access to a Zen4 machine to check for
> > > > results).
> > >
> > > I didn't and don't have access to any of the newer AMD chips.
> > >
> > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > > has a README w/ steps if anyone wants to test it.
> >
> > What we could do is use a seperate tunable for memset NT threshold and just
> > make it SIZE_MAX for AMD.
>
> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> specific?
Please make it Intel specific.
Thanks.
> > > >
> > > > >
> > > > > All of the other changes to the file are to re-organize the
> > > > > code-blocks to maintain "good" alignment given the new code added in
> > > > > the `L(stosb_local)` case.
> > > > >
> > > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > > N=20 runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > > Geometric Mean across the suite New / Old AVX2 : 0.986
> > > > > Geometric Mean across the suite New / Old SSE2 : 0.979
> > > > >
> > > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > > that adding the non-temporal case didn't add any regressions to the
> > > > > other cases.
> > > > >
> > > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > > runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > > Geometric Mean across the suite New / Old AVX2 : 0.928
> > > > > Geometric Mean across the suite New / Old SSE2 : 0.924
> > > > >
> > > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > > (likely because clients typically have faster single-core bandwidth so
> > > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > > >
> > > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > > ---
> > > > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> > > > > 1 file changed, 91 insertions(+), 58 deletions(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > index 97839a2248..637caadb40 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > @@ -21,10 +21,13 @@
> > > > > 2. If size is less than VEC, use integer register stores.
> > > > > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > > - 5. On machines ERMS feature, if size is greater or equal than
> > > > > - __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > > - 4 VEC stores and store 4 * VEC at a time until done. */
> > > > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > > + 4 VEC stores and store 4 * VEC at a time until done.
> > > > > + 6. On machines ERMS feature, if size is range
> > > > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > > + then REP STOSB will be used.
> > > > > + 7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > > + non-temporal stores. */
> > > > >
> > > > > #include <sysdep.h>
> > > > >
> > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > > > VMOVU %VMM(0), (%rdi)
> > > > > VZEROUPPER_RETURN
> > > > > +
> > > > > + /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > + entry as it doesn't take much space and is likely a hot target. */
> > > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> > > > > + .p2align 6,, 47
> > > > > + .p2align 4
> > > > > +L(less_vec):
> > > > > +L(less_vec_from_wmemset):
> > > > > + /* Less than 1 VEC. */
> > > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > +# error Unsupported VEC_SIZE!
> > > > > +# endif
> > > > > + /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > + cross check. Note that we are using rax which is set in
> > > > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > > > + andl $(PAGE_SIZE - 1), %edi
> > > > > + /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > + serious performance degradation when it has to fault suppress. */
> > > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > + /* This is generally considered a cold target. */
> > > > > + ja L(cross_page)
> > > > > +# if VEC_SIZE > 32
> > > > > + movq $-1, %rcx
> > > > > + bzhiq %rdx, %rcx, %rcx
> > > > > + kmovq %rcx, %k1
> > > > > +# else
> > > > > + movl $-1, %ecx
> > > > > + bzhil %edx, %ecx, %ecx
> > > > > + kmovd %ecx, %k1
> > > > > +# endif
> > > > > + vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > + VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > > > END (MEMSET_SYMBOL (__memset, unaligned))
> > > > >
> > > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > > > #endif
> > > > > VZEROUPPER_RETURN
> > > > >
> > > > > - /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > - entry as it doesn't take much space and is likely a hot target.
> > > > > - */
> > > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > > - .p2align 4,, 10
> > > > > -L(less_vec):
> > > > > -L(less_vec_from_wmemset):
> > > > > - /* Less than 1 VEC. */
> > > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > -# error Unsupported VEC_SIZE!
> > > > > -# endif
> > > > > - /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > - cross check. Note that we are using rax which is set in
> > > > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> > > > > - andl $(PAGE_SIZE - 1), %edi
> > > > > - /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > - serious performance degradation when it has to fault suppress.
> > > > > - */
> > > > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > - /* This is generally considered a cold target. */
> > > > > - ja L(cross_page)
> > > > > -# if VEC_SIZE > 32
> > > > > - movq $-1, %rcx
> > > > > - bzhiq %rdx, %rcx, %rcx
> > > > > - kmovq %rcx, %k1
> > > > > -# else
> > > > > - movl $-1, %ecx
> > > > > - bzhil %edx, %ecx, %ecx
> > > > > - kmovd %ecx, %k1
> > > > > -# endif
> > > > > - vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > - VZEROUPPER_RETURN
> > > > > -
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > - /* Include L(stosb_local) here if including L(less_vec) between
> > > > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > > - L(stosb_more_2x_vec) target. */
> > > > > - .p2align 4,, 10
> > > > > -L(stosb_local):
> > > > > - movzbl %sil, %eax
> > > > > - mov %RDX_LP, %RCX_LP
> > > > > - mov %RDI_LP, %RDX_LP
> > > > > - rep stosb
> > > > > - mov %RDX_LP, %RAX_LP
> > > > > - VZEROUPPER_RETURN
> > > > > -# endif
> > > > > -#endif
> > > > > -
> > > > > #if defined USE_MULTIARCH && IS_IN (libc)
> > > > > .p2align 4
> > > > > L(stosb_more_2x_vec):
> > > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > > > ret
> > > > > #endif
> > > > >
> > > > > - .p2align 4,, 10
> > > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +#ifdef USE_WITH_AVX2
> > > > > + .p2align 4
> > > > > +#else
> > > > > + .p2align 4,, 4
> > > > > +#endif
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > > > range for 2-byte jump encoding. */
> > > > > L(stosb_local):
> > > > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > + jae L(nt_memset)
> > > > > movzbl %sil, %eax
> > > > > mov %RDX_LP, %RCX_LP
> > > > > mov %RDI_LP, %RDX_LP
> > > > > rep stosb
> > > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > > + /* Use xchg to save 1-byte (this helps align targets below). */
> > > > > + xchg %RDX_LP, %RAX_LP
> > > > > +# else
> > > > > mov %RDX_LP, %RAX_LP
> > > > > - VZEROUPPER_RETURN
> > > > > # endif
> > > > > + VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > > > /* Define L(less_vec) only if not otherwise defined. */
> > > > > - .p2align 4
> > > > > + .p2align 4,, 12
> > > > > L(less_vec):
> > > > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > > > xmm). This is only does anything for AVX2. */
> > > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > > > #endif
> > > > > ret
> > > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +# ifdef USE_WITH_AVX512
> > > > > + /* Force align so the loop doesn't cross a cache-line. */
> > > > > + .p2align 4
> > > > > +# endif
> > > > > + .p2align 4,, 7
> > > > > + /* Memset using non-temporal stores. */
> > > > > +L(nt_memset):
> > > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > > + /* Align DST. */
> > > > > + orq $(VEC_SIZE * 1 - 1), %rdi
> > > > > + incq %rdi
> > > > > + .p2align 4,, 7
> > > > > +L(nt_loop):
> > > > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > > + subq $(VEC_SIZE * -4), %rdi
> > > > > + cmpq %rdx, %rdi
> > > > > + jb L(nt_loop)
> > > > > + sfence
> > > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > > + VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
--
H.J.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-23 16:49 ` H.J. Lu
@ 2024-05-23 17:18 ` Adhemerval Zanella Netto
2024-05-24 17:39 ` Noah Goldstein
2024-05-30 20:02 ` Noah Goldstein
0 siblings, 2 replies; 21+ messages in thread
From: Adhemerval Zanella Netto @ 2024-05-23 17:18 UTC (permalink / raw)
To: H.J. Lu, Noah Goldstein; +Cc: libc-alpha
On 23/05/24 13:49, H.J. Lu wrote:
> On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>
>>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>>
>>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 18/05/24 21:43, Noah Goldstein wrote:
>>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
>>>>>> notably worse than non-temporal stores for large (above a
>>>>>> few MBs) memsets.
>>>>>> See:
>>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
>>>>>> For data using different stategies for large memset on ICX and SKX.
>>>>>
>>>>> Btw this datasheet is not accessible without extra steps.
>>>>>
>>>>
>>>> Bh sorry, put the benchmark data in a repo. See data here:
>>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>>>>
>>>> The ICX.html and SKX.html have the data from the above link.
>>>>>>
>>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>>>>>> on SKX. Historically, these numbers would not have been so good
>>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
>>>>>> is able to do. But, the zero-over-zero writeback optimization has been
>>>>>> removed as a potential side-channel attack, so there is no longer any
>>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
>>>>>> size, non-temporal writes can avoid data in their RFO requests saving
>>>>>> memory bandwidth.
>>>>>
>>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
>>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
>>>>> to fix, however I don't have access to a Zen4 machine to check for
>>>>> results).
>>>>
>>>> I didn't and don't have access to any of the newer AMD chips.
>>>>
>>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
>>>> has a README w/ steps if anyone wants to test it.
>>>
>>> What we could do is use a seperate tunable for memset NT threshold and just
>>> make it SIZE_MAX for AMD.
>>
>> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
>> specific?
>
> Please make it Intel specific.
>
> Thanks.
Yes, please make it Intel specific for now. I will try to take a look at least
with a Zen3 core that I have access
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-19 0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
` (2 preceding siblings ...)
2024-05-20 19:37 ` Zack Weinberg
@ 2024-05-24 17:38 ` Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
2024-05-29 22:52 ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu
3 siblings, 2 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:38 UTC (permalink / raw)
To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos
Previously we use `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
For data using different stategies for large memset on ICX and SKX.
Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
size, non-temporal writes can avoid data in their RFO requests saving
memory bandwidth.
All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.
The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:
Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2 : 0.986
Geometric Mean across the suite New / Old SSE2 : 0.979
Most of the cases are essentially unchanged, this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.
The results on the memset-large benchmark suite on TGL-client for N=20
runs:
Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2 : 0.928
Geometric Mean across the suite New / Old SSE2 : 0.924
So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.
Full test-suite passes on x86_64 w/ and w/o multiarch.
---
.../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
1 file changed, 91 insertions(+), 58 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
2. If size is less than VEC, use integer register stores.
3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
- 5. On machines ERMS feature, if size is greater or equal than
- __x86_rep_stosb_threshold then REP STOSB will be used.
- 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
- 4 VEC stores and store 4 * VEC at a time until done. */
+ 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+ 4 VEC stores and store 4 * VEC at a time until done.
+ 6. On machines ERMS feature, if size is range
+ [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+ then REP STOSB will be used.
+ 7. If size >= __x86_shared_non_temporal_threshold, use a
+ non-temporal stores. */
#include <sysdep.h>
@@ -147,6 +150,41 @@ L(entry_from_wmemset):
VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
VMOVU %VMM(0), (%rdi)
VZEROUPPER_RETURN
+
+ /* If have AVX512 mask instructions put L(less_vec) close to
+ entry as it doesn't take much space and is likely a hot target. */
+#ifdef USE_LESS_VEC_MASK_STORE
+ /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
+ .p2align 6,, 47
+ .p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+ /* Less than 1 VEC. */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+# error Unsupported VEC_SIZE!
+# endif
+ /* Clear high bits from edi. Only keeping bits relevant to page
+ cross check. Note that we are using rax which is set in
+ MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
+ andl $(PAGE_SIZE - 1), %edi
+ /* Check if VEC_SIZE store cross page. Mask stores suffer
+ serious performance degradation when it has to fault suppress. */
+ cmpl $(PAGE_SIZE - VEC_SIZE), %edi
+ /* This is generally considered a cold target. */
+ ja L(cross_page)
+# if VEC_SIZE > 32
+ movq $-1, %rcx
+ bzhiq %rdx, %rcx, %rcx
+ kmovq %rcx, %k1
+# else
+ movl $-1, %ecx
+ bzhil %edx, %ecx, %ecx
+ kmovd %ecx, %k1
+# endif
+ vmovdqu8 %VMM(0), (%rax){%k1}
+ VZEROUPPER_RETURN
+#endif
+
#if defined USE_MULTIARCH && IS_IN (libc)
END (MEMSET_SYMBOL (__memset, unaligned))
@@ -185,54 +223,6 @@ L(last_2x_vec):
#endif
VZEROUPPER_RETURN
- /* If have AVX512 mask instructions put L(less_vec) close to
- entry as it doesn't take much space and is likely a hot target.
- */
-#ifdef USE_LESS_VEC_MASK_STORE
- .p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
- /* Less than 1 VEC. */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-# error Unsupported VEC_SIZE!
-# endif
- /* Clear high bits from edi. Only keeping bits relevant to page
- cross check. Note that we are using rax which is set in
- MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
- andl $(PAGE_SIZE - 1), %edi
- /* Check if VEC_SIZE store cross page. Mask stores suffer
- serious performance degradation when it has to fault suppress.
- */
- cmpl $(PAGE_SIZE - VEC_SIZE), %edi
- /* This is generally considered a cold target. */
- ja L(cross_page)
-# if VEC_SIZE > 32
- movq $-1, %rcx
- bzhiq %rdx, %rcx, %rcx
- kmovq %rcx, %k1
-# else
- movl $-1, %ecx
- bzhil %edx, %ecx, %ecx
- kmovd %ecx, %k1
-# endif
- vmovdqu8 %VMM(0), (%rax){%k1}
- VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
- /* Include L(stosb_local) here if including L(less_vec) between
- L(stosb_more_2x_vec) and ENTRY. This is to cache align the
- L(stosb_more_2x_vec) target. */
- .p2align 4,, 10
-L(stosb_local):
- movzbl %sil, %eax
- mov %RDX_LP, %RCX_LP
- mov %RDI_LP, %RDX_LP
- rep stosb
- mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
-# endif
-#endif
-
#if defined USE_MULTIARCH && IS_IN (libc)
.p2align 4
L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
ret
#endif
- .p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+ .p2align 4
+#else
+ .p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
range for 2-byte jump encoding. */
L(stosb_local):
+ cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ jae L(nt_memset)
movzbl %sil, %eax
mov %RDX_LP, %RCX_LP
mov %RDI_LP, %RDX_LP
rep stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+ /* Use xchg to save 1-byte (this helps align targets below). */
+ xchg %RDX_LP, %RAX_LP
+# else
mov %RDX_LP, %RAX_LP
- VZEROUPPER_RETURN
# endif
+ VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
/* Define L(less_vec) only if not otherwise defined. */
- .p2align 4
+ .p2align 4,, 12
L(less_vec):
/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
xmm). This is only does anything for AVX2. */
@@ -423,4 +425,35 @@ L(between_2_3):
movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
#endif
ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+ /* Force align so the loop doesn't cross a cache-line. */
+ .p2align 4
+# endif
+ .p2align 4,, 7
+ /* Memset using non-temporal stores. */
+L(nt_memset):
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
+ leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
+ /* Align DST. */
+ orq $(VEC_SIZE * 1 - 1), %rdi
+ incq %rdi
+ .p2align 4,, 7
+L(nt_loop):
+ VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
+ VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
+ subq $(VEC_SIZE * -4), %rdi
+ cmpq %rdx, %rdi
+ jb L(nt_loop)
+ sfence
+ VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
+ VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
+ VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
--
2.34.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
@ 2024-05-24 17:38 ` Noah Goldstein
2024-05-29 16:19 ` Noah Goldstein
2024-05-29 22:53 ` H.J. Lu
2024-05-29 22:52 ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu
1 sibling, 2 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:38 UTC (permalink / raw)
To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos
The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.
This patch add `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non Intel vendors as the only
benchmarks showing its benefit have been on Intel hardware.
---
manual/tunables.texi | 16 +++++++++++++++-
sysdeps/x86/cacheinfo.h | 8 +++++++-
sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
sysdeps/x86/dl-tunables.list | 3 +++
sysdeps/x86/include/cpu-features.h | 4 +++-
.../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
7 files changed, 49 insertions(+), 6 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index baaf751721..8dd02d8149 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
+glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
glibc.cpu.x86_shstk:
glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
@@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
@cindex shared_cache_size tunables
@cindex tunables, shared_cache_size
@cindex non_temporal_threshold tunables
-@cindex tunables, non_temporal_threshold
+@cindex memset_non_temporal_threshold tunables
+@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
@deftp {Tunable namespace} glibc.cpu
Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
@@ -574,6 +576,18 @@ like memmove and memcpy.
This tunable is specific to i386 and x86-64.
@end deftp
+@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
+The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
+the user to set threshold in bytes for non temporal store in
+memset. Non temporal stores give a hint to the hardware to move data
+directly to memory without displacing other data from the cache. This
+tunable is used by some platforms to determine when to use non
+temporal stores memset.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+
@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
set threshold in bytes to start using "rep movsb". The value must be
diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
index ab73556772..83491607c7 100644
--- a/sysdeps/x86/cacheinfo.h
+++ b/sysdeps/x86/cacheinfo.h
@@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
-/* Threshold to use non temporal store. */
+/* Threshold to use non temporal store in memmove. */
long int __x86_shared_non_temporal_threshold attribute_hidden;
+/* Threshold to use non temporal store in memset. */
+long int __x86_memset_non_temporal_threshold attribute_hidden;
+
/* Threshold to use Enhanced REP MOVSB. */
long int __x86_rep_movsb_threshold attribute_hidden = 2048;
@@ -77,6 +80,9 @@ init_cacheinfo (void)
__x86_shared_non_temporal_threshold
= cpu_features->non_temporal_threshold;
+ __x86_memset_non_temporal_threshold
+ = cpu_features->memset_non_temporal_threshold;
+
__x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
__x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
__x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 5a98f70364..d375a7cba6 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
rep_movsb_threshold = 2112;
+ /* Non-temporal stores in memset have only been tested on Intel hardware.
+ Until we benchmark data on other x86 processor, disable non-temporal
+ stores in memset. */
+ unsigned long int memset_non_temporal_threshold = SIZE_MAX;
+ if (cpu_features->basic.kind == arch_kind_intel)
+ memset_non_temporal_threshold = non_temporal_threshold;
+
/* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
cases slower than the vectorized path (and for some alignments,
it is really slow, check BZ #30994). */
@@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
&& tunable_size <= maximum_non_temporal_threshold)
non_temporal_threshold = tunable_size;
+ tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
+ if (tunable_size > minimum_non_temporal_threshold
+ && tunable_size <= maximum_non_temporal_threshold)
+ memset_non_temporal_threshold = tunable_size;
+
tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
if (tunable_size > minimum_rep_movsb_threshold)
rep_movsb_threshold = tunable_size;
@@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
minimum_non_temporal_threshold,
maximum_non_temporal_threshold);
+ TUNABLE_SET_WITH_BOUNDS (
+ x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
+ minimum_non_temporal_threshold, maximum_non_temporal_threshold);
TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
minimum_rep_movsb_threshold, SIZE_MAX);
TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
@@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
cpu_features->data_cache_size = data;
cpu_features->shared_cache_size = shared;
cpu_features->non_temporal_threshold = non_temporal_threshold;
+ cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
cpu_features->rep_movsb_threshold = rep_movsb_threshold;
cpu_features->rep_stosb_threshold = rep_stosb_threshold;
cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index ceafde9481..49eeb5f70a 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
cpu_features->shared_cache_size);
print_cpu_features_value ("non_temporal_threshold",
cpu_features->non_temporal_threshold);
+ print_cpu_features_value ("memset_non_temporal_threshold",
+ cpu_features->memset_non_temporal_threshold);
print_cpu_features_value ("rep_movsb_threshold",
cpu_features->rep_movsb_threshold);
print_cpu_features_value ("rep_movsb_stop_threshold",
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 7d82da0dec..a0a1299592 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,9 @@ glibc {
x86_non_temporal_threshold {
type: SIZE_T
}
+ x86_memset_non_temporal_threshold {
+ type: SIZE_T
+ }
x86_rep_movsb_threshold {
type: SIZE_T
# Since there is overhead to set up REP MOVSB operation, REP
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index cd7bd27cf3..aaae44f0e1 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -944,8 +944,10 @@ struct cpu_features
/* Shared cache size for use in memory and string routines, typically
L2 or L3 size. */
unsigned long int shared_cache_size;
- /* Threshold to use non temporal store. */
+ /* Threshold to use non temporal store in memmove. */
unsigned long int non_temporal_threshold;
+ /* Threshold to use non temporal store in memset. */
+ unsigned long int memset_non_temporal_threshold;
/* Threshold to use "rep movsb". */
unsigned long int rep_movsb_threshold;
/* Threshold to stop using "rep movsb". */
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 637caadb40..88bf08e4f4 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -24,9 +24,9 @@
5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
4 VEC stores and store 4 * VEC at a time until done.
6. On machines ERMS feature, if size is range
- [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+ [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
then REP STOSB will be used.
- 7. If size >= __x86_shared_non_temporal_threshold, use a
+ 7. If size >= __x86_memset_non_temporal_threshold, use a
non-temporal stores. */
#include <sysdep.h>
@@ -318,7 +318,7 @@ L(return_vzeroupper):
/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
range for 2-byte jump encoding. */
L(stosb_local):
- cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ cmp __x86_memset_non_temporal_threshold(%rip), %RDX_LP
jae L(nt_memset)
movzbl %sil, %eax
mov %RDX_LP, %RCX_LP
--
2.34.1
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-23 17:18 ` Adhemerval Zanella Netto
@ 2024-05-24 17:39 ` Noah Goldstein
2024-05-30 20:02 ` Noah Goldstein
1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:39 UTC (permalink / raw)
To: Adhemerval Zanella Netto; +Cc: H.J. Lu, libc-alpha
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access
Done, split into two patches, the tunable one is the second in the series.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-05-24 17:38 ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
@ 2024-05-29 16:19 ` Noah Goldstein
2024-05-29 22:53 ` H.J. Lu
1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-29 16:19 UTC (permalink / raw)
To: libc-alpha; +Cc: hjl.tools, carlos
On Fri, May 24, 2024 at 12:38 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The tuning for non-temporal stores for memset vs memcpy is not always
> the same. This includes both the exact value and whether non-temporal
> stores are profitable at all for a given arch.
>
> This patch add `x86_memset_non_temporal_threshold`. Currently we
> disable non-temporal stores for non Intel vendors as the only
> benchmarks showing its benefit have been on Intel hardware.
> ---
> manual/tunables.texi | 16 +++++++++++++++-
> sysdeps/x86/cacheinfo.h | 8 +++++++-
> sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> sysdeps/x86/dl-tunables.list | 3 +++
> sysdeps/x86/include/cpu-features.h | 4 +++-
> .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> 7 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index baaf751721..8dd02d8149 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
> glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> +glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> @@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
> @cindex shared_cache_size tunables
> @cindex tunables, shared_cache_size
> @cindex non_temporal_threshold tunables
> -@cindex tunables, non_temporal_threshold
> +@cindex memset_non_temporal_threshold tunables
> +@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
>
> @deftp {Tunable namespace} glibc.cpu
> Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
> @@ -574,6 +576,18 @@ like memmove and memcpy.
> This tunable is specific to i386 and x86-64.
> @end deftp
>
> +@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
> +The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
> +the user to set threshold in bytes for non temporal store in
> +memset. Non temporal stores give a hint to the hardware to move data
> +directly to memory without displacing other data from the cache. This
> +tunable is used by some platforms to determine when to use non
> +temporal stores memset.
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +
> @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
> The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
> set threshold in bytes to start using "rep movsb". The value must be
> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> index ab73556772..83491607c7 100644
> --- a/sysdeps/x86/cacheinfo.h
> +++ b/sysdeps/x86/cacheinfo.h
> @@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
> long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
> long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
>
> -/* Threshold to use non temporal store. */
> +/* Threshold to use non temporal store in memmove. */
> long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use non temporal store in memset. */
> +long int __x86_memset_non_temporal_threshold attribute_hidden;
> +
> /* Threshold to use Enhanced REP MOVSB. */
> long int __x86_rep_movsb_threshold attribute_hidden = 2048;
>
> @@ -77,6 +80,9 @@ init_cacheinfo (void)
> __x86_shared_non_temporal_threshold
> = cpu_features->non_temporal_threshold;
>
> + __x86_memset_non_temporal_threshold
> + = cpu_features->memset_non_temporal_threshold;
> +
> __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
> __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
> __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 5a98f70364..d375a7cba6 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> rep_movsb_threshold = 2112;
>
> + /* Non-temporal stores in memset have only been tested on Intel hardware.
> + Until we benchmark data on other x86 processor, disable non-temporal
> + stores in memset. */
> + unsigned long int memset_non_temporal_threshold = SIZE_MAX;
> + if (cpu_features->basic.kind == arch_kind_intel)
> + memset_non_temporal_threshold = non_temporal_threshold;
> +
> /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
> cases slower than the vectorized path (and for some alignments,
> it is really slow, check BZ #30994). */
> @@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> && tunable_size <= maximum_non_temporal_threshold)
> non_temporal_threshold = tunable_size;
>
> + tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
> + if (tunable_size > minimum_non_temporal_threshold
> + && tunable_size <= maximum_non_temporal_threshold)
> + memset_non_temporal_threshold = tunable_size;
> +
> tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> if (tunable_size > minimum_rep_movsb_threshold)
> rep_movsb_threshold = tunable_size;
> @@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> minimum_non_temporal_threshold,
> maximum_non_temporal_threshold);
> + TUNABLE_SET_WITH_BOUNDS (
> + x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
> + minimum_non_temporal_threshold, maximum_non_temporal_threshold);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> @@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> cpu_features->data_cache_size = data;
> cpu_features->shared_cache_size = shared;
> cpu_features->non_temporal_threshold = non_temporal_threshold;
> + cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
> cpu_features->rep_movsb_threshold = rep_movsb_threshold;
> cpu_features->rep_stosb_threshold = rep_stosb_threshold;
> cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
> index ceafde9481..49eeb5f70a 100644
> --- a/sysdeps/x86/dl-diagnostics-cpu.c
> +++ b/sysdeps/x86/dl-diagnostics-cpu.c
> @@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
> cpu_features->shared_cache_size);
> print_cpu_features_value ("non_temporal_threshold",
> cpu_features->non_temporal_threshold);
> + print_cpu_features_value ("memset_non_temporal_threshold",
> + cpu_features->memset_non_temporal_threshold);
> print_cpu_features_value ("rep_movsb_threshold",
> cpu_features->rep_movsb_threshold);
> print_cpu_features_value ("rep_movsb_stop_threshold",
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 7d82da0dec..a0a1299592 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,9 @@ glibc {
> x86_non_temporal_threshold {
> type: SIZE_T
> }
> + x86_memset_non_temporal_threshold {
> + type: SIZE_T
> + }
> x86_rep_movsb_threshold {
> type: SIZE_T
> # Since there is overhead to set up REP MOVSB operation, REP
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index cd7bd27cf3..aaae44f0e1 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -944,8 +944,10 @@ struct cpu_features
> /* Shared cache size for use in memory and string routines, typically
> L2 or L3 size. */
> unsigned long int shared_cache_size;
> - /* Threshold to use non temporal store. */
> + /* Threshold to use non temporal store in memmove. */
> unsigned long int non_temporal_threshold;
> + /* Threshold to use non temporal store in memset. */
> + unsigned long int memset_non_temporal_threshold;
> /* Threshold to use "rep movsb". */
> unsigned long int rep_movsb_threshold;
> /* Threshold to stop using "rep movsb". */
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 637caadb40..88bf08e4f4 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -24,9 +24,9 @@
> 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> 4 VEC stores and store 4 * VEC at a time until done.
> 6. On machines ERMS feature, if size is range
> - [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
> then REP STOSB will be used.
> - 7. If size >= __x86_shared_non_temporal_threshold, use a
> + 7. If size >= __x86_memset_non_temporal_threshold, use a
> non-temporal stores. */
>
> #include <sysdep.h>
> @@ -318,7 +318,7 @@ L(return_vzeroupper):
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + cmp __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> --
> 2.34.1
>
ping
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
@ 2024-05-29 22:52 ` H.J. Lu
1 sibling, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2024-05-29 22:52 UTC (permalink / raw)
To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell
On Fri, May 24, 2024 at 10:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2 : 0.986
> Geometric Mean across the suite New / Old SSE2 : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2 : 0.928
> Geometric Mean across the suite New / Old SSE2 : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
> .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
> 1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
> 2. If size is less than VEC, use integer register stores.
> 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> - 5. On machines ERMS feature, if size is greater or equal than
> - __x86_rep_stosb_threshold then REP STOSB will be used.
> - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> - 4 VEC stores and store 4 * VEC at a time until done. */
> + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> + 4 VEC stores and store 4 * VEC at a time until done.
> + 6. On machines ERMS feature, if size is range
> + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + then REP STOSB will be used.
> + 7. If size >= __x86_shared_non_temporal_threshold, use a
> + non-temporal stores. */
>
> #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx)
> VMOVU %VMM(0), (%rdi)
> VZEROUPPER_RETURN
> +
> + /* If have AVX512 mask instructions put L(less_vec) close to
> + entry as it doesn't take much space and is likely a hot target. */
> +#ifdef USE_LESS_VEC_MASK_STORE
> + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
> + .p2align 6,, 47
> + .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> + /* Less than 1 VEC. */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +# error Unsupported VEC_SIZE!
> +# endif
> + /* Clear high bits from edi. Only keeping bits relevant to page
> + cross check. Note that we are using rax which is set in
> + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> + andl $(PAGE_SIZE - 1), %edi
> + /* Check if VEC_SIZE store cross page. Mask stores suffer
> + serious performance degradation when it has to fault suppress. */
> + cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> + /* This is generally considered a cold target. */
> + ja L(cross_page)
> +# if VEC_SIZE > 32
> + movq $-1, %rcx
> + bzhiq %rdx, %rcx, %rcx
> + kmovq %rcx, %k1
> +# else
> + movl $-1, %ecx
> + bzhil %edx, %ecx, %ecx
> + kmovd %ecx, %k1
> +# endif
> + vmovdqu8 %VMM(0), (%rax){%k1}
> + VZEROUPPER_RETURN
> +#endif
> +
> #if defined USE_MULTIARCH && IS_IN (libc)
> END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
> #endif
> VZEROUPPER_RETURN
>
> - /* If have AVX512 mask instructions put L(less_vec) close to
> - entry as it doesn't take much space and is likely a hot target.
> - */
> -#ifdef USE_LESS_VEC_MASK_STORE
> - .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> - /* Less than 1 VEC. */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -# error Unsupported VEC_SIZE!
> -# endif
> - /* Clear high bits from edi. Only keeping bits relevant to page
> - cross check. Note that we are using rax which is set in
> - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
> - andl $(PAGE_SIZE - 1), %edi
> - /* Check if VEC_SIZE store cross page. Mask stores suffer
> - serious performance degradation when it has to fault suppress.
> - */
> - cmpl $(PAGE_SIZE - VEC_SIZE), %edi
> - /* This is generally considered a cold target. */
> - ja L(cross_page)
> -# if VEC_SIZE > 32
> - movq $-1, %rcx
> - bzhiq %rdx, %rcx, %rcx
> - kmovq %rcx, %k1
> -# else
> - movl $-1, %ecx
> - bzhil %edx, %ecx, %ecx
> - kmovd %ecx, %k1
> -# endif
> - vmovdqu8 %VMM(0), (%rax){%k1}
> - VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> - /* Include L(stosb_local) here if including L(less_vec) between
> - L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> - L(stosb_more_2x_vec) target. */
> - .p2align 4,, 10
> -L(stosb_local):
> - movzbl %sil, %eax
> - mov %RDX_LP, %RCX_LP
> - mov %RDI_LP, %RDX_LP
> - rep stosb
> - mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> -# endif
> -#endif
> -
> #if defined USE_MULTIARCH && IS_IN (libc)
> .p2align 4
> L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
> ret
> #endif
>
> - .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> + .p2align 4
> +#else
> + .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> mov %RDI_LP, %RDX_LP
> rep stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> + /* Use xchg to save 1-byte (this helps align targets below). */
> + xchg %RDX_LP, %RAX_LP
> +# else
> mov %RDX_LP, %RAX_LP
> - VZEROUPPER_RETURN
> # endif
> + VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
> /* Define L(less_vec) only if not otherwise defined. */
> - .p2align 4
> + .p2align 4,, 12
> L(less_vec):
> /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> xmm). This is only does anything for AVX2. */
> @@ -423,4 +425,35 @@ L(between_2_3):
> movb %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> #endif
> ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> + /* Force align so the loop doesn't cross a cache-line. */
> + .p2align 4
> +# endif
> + .p2align 4,, 7
> + /* Memset using non-temporal stores. */
> +L(nt_memset):
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi)
> + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> + /* Align DST. */
> + orq $(VEC_SIZE * 1 - 1), %rdi
> + incq %rdi
> + .p2align 4,, 7
> +L(nt_loop):
> + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi)
> + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi)
> + subq $(VEC_SIZE * -4), %rdi
> + cmpq %rdx, %rdi
> + jb L(nt_loop)
> + sfence
> + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx)
> + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx)
> + VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>
LGTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-05-24 17:38 ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
2024-05-29 16:19 ` Noah Goldstein
@ 2024-05-29 22:53 ` H.J. Lu
2024-06-14 10:40 ` Borislav Petkov
1 sibling, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2024-05-29 22:53 UTC (permalink / raw)
To: Noah Goldstein; +Cc: libc-alpha, carlos
On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The tuning for non-temporal stores for memset vs memcpy is not always
> the same. This includes both the exact value and whether non-temporal
> stores are profitable at all for a given arch.
>
> This patch add `x86_memset_non_temporal_threshold`. Currently we
> disable non-temporal stores for non Intel vendors as the only
> benchmarks showing its benefit have been on Intel hardware.
> ---
> manual/tunables.texi | 16 +++++++++++++++-
> sysdeps/x86/cacheinfo.h | 8 +++++++-
> sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> sysdeps/x86/dl-tunables.list | 3 +++
> sysdeps/x86/include/cpu-features.h | 4 +++-
> .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> 7 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index baaf751721..8dd02d8149 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
> glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
> glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> +glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> glibc.cpu.x86_shstk:
> glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> @@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
> @cindex shared_cache_size tunables
> @cindex tunables, shared_cache_size
> @cindex non_temporal_threshold tunables
> -@cindex tunables, non_temporal_threshold
> +@cindex memset_non_temporal_threshold tunables
> +@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
>
> @deftp {Tunable namespace} glibc.cpu
> Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
> @@ -574,6 +576,18 @@ like memmove and memcpy.
> This tunable is specific to i386 and x86-64.
> @end deftp
>
> +@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
> +The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
> +the user to set threshold in bytes for non temporal store in
> +memset. Non temporal stores give a hint to the hardware to move data
> +directly to memory without displacing other data from the cache. This
> +tunable is used by some platforms to determine when to use non
> +temporal stores memset.
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +
> @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
> The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
> set threshold in bytes to start using "rep movsb". The value must be
> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> index ab73556772..83491607c7 100644
> --- a/sysdeps/x86/cacheinfo.h
> +++ b/sysdeps/x86/cacheinfo.h
> @@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
> long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
> long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
>
> -/* Threshold to use non temporal store. */
> +/* Threshold to use non temporal store in memmove. */
> long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use non temporal store in memset. */
> +long int __x86_memset_non_temporal_threshold attribute_hidden;
> +
> /* Threshold to use Enhanced REP MOVSB. */
> long int __x86_rep_movsb_threshold attribute_hidden = 2048;
>
> @@ -77,6 +80,9 @@ init_cacheinfo (void)
> __x86_shared_non_temporal_threshold
> = cpu_features->non_temporal_threshold;
>
> + __x86_memset_non_temporal_threshold
> + = cpu_features->memset_non_temporal_threshold;
> +
> __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
> __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
> __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 5a98f70364..d375a7cba6 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
> rep_movsb_threshold = 2112;
>
> + /* Non-temporal stores in memset have only been tested on Intel hardware.
> + Until we benchmark data on other x86 processor, disable non-temporal
> + stores in memset. */
> + unsigned long int memset_non_temporal_threshold = SIZE_MAX;
> + if (cpu_features->basic.kind == arch_kind_intel)
> + memset_non_temporal_threshold = non_temporal_threshold;
> +
> /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
> cases slower than the vectorized path (and for some alignments,
> it is really slow, check BZ #30994). */
> @@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> && tunable_size <= maximum_non_temporal_threshold)
> non_temporal_threshold = tunable_size;
>
> + tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
> + if (tunable_size > minimum_non_temporal_threshold
> + && tunable_size <= maximum_non_temporal_threshold)
> + memset_non_temporal_threshold = tunable_size;
> +
> tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
> if (tunable_size > minimum_rep_movsb_threshold)
> rep_movsb_threshold = tunable_size;
> @@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
> minimum_non_temporal_threshold,
> maximum_non_temporal_threshold);
> + TUNABLE_SET_WITH_BOUNDS (
> + x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
> + minimum_non_temporal_threshold, maximum_non_temporal_threshold);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
> minimum_rep_movsb_threshold, SIZE_MAX);
> TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> @@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
> cpu_features->data_cache_size = data;
> cpu_features->shared_cache_size = shared;
> cpu_features->non_temporal_threshold = non_temporal_threshold;
> + cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
> cpu_features->rep_movsb_threshold = rep_movsb_threshold;
> cpu_features->rep_stosb_threshold = rep_stosb_threshold;
> cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
> index ceafde9481..49eeb5f70a 100644
> --- a/sysdeps/x86/dl-diagnostics-cpu.c
> +++ b/sysdeps/x86/dl-diagnostics-cpu.c
> @@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
> cpu_features->shared_cache_size);
> print_cpu_features_value ("non_temporal_threshold",
> cpu_features->non_temporal_threshold);
> + print_cpu_features_value ("memset_non_temporal_threshold",
> + cpu_features->memset_non_temporal_threshold);
> print_cpu_features_value ("rep_movsb_threshold",
> cpu_features->rep_movsb_threshold);
> print_cpu_features_value ("rep_movsb_stop_threshold",
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 7d82da0dec..a0a1299592 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,9 @@ glibc {
> x86_non_temporal_threshold {
> type: SIZE_T
> }
> + x86_memset_non_temporal_threshold {
> + type: SIZE_T
> + }
> x86_rep_movsb_threshold {
> type: SIZE_T
> # Since there is overhead to set up REP MOVSB operation, REP
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index cd7bd27cf3..aaae44f0e1 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -944,8 +944,10 @@ struct cpu_features
> /* Shared cache size for use in memory and string routines, typically
> L2 or L3 size. */
> unsigned long int shared_cache_size;
> - /* Threshold to use non temporal store. */
> + /* Threshold to use non temporal store in memmove. */
> unsigned long int non_temporal_threshold;
> + /* Threshold to use non temporal store in memset. */
> + unsigned long int memset_non_temporal_threshold;
> /* Threshold to use "rep movsb". */
> unsigned long int rep_movsb_threshold;
> /* Threshold to stop using "rep movsb". */
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 637caadb40..88bf08e4f4 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -24,9 +24,9 @@
> 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> 4 VEC stores and store 4 * VEC at a time until done.
> 6. On machines ERMS feature, if size is range
> - [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> + [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
> then REP STOSB will be used.
> - 7. If size >= __x86_shared_non_temporal_threshold, use a
> + 7. If size >= __x86_memset_non_temporal_threshold, use a
> non-temporal stores. */
>
> #include <sysdep.h>
> @@ -318,7 +318,7 @@ L(return_vzeroupper):
> /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> range for 2-byte jump encoding. */
> L(stosb_local):
> - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> + cmp __x86_memset_non_temporal_threshold(%rip), %RDX_LP
> jae L(nt_memset)
> movzbl %sil, %eax
> mov %RDX_LP, %RCX_LP
> --
> 2.34.1
>
LGTM.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Thanks.
--
H.J.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
2024-05-23 17:18 ` Adhemerval Zanella Netto
2024-05-24 17:39 ` Noah Goldstein
@ 2024-05-30 20:02 ` Noah Goldstein
1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-30 20:02 UTC (permalink / raw)
To: GNU C Library; +Cc: H.J. Lu, Adhemerval Zanella, Joe Damato
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access
Adding Joe into the thread. He has access to some AMD machines
and will run the benchmarks for us.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-05-29 22:53 ` H.J. Lu
@ 2024-06-14 10:40 ` Borislav Petkov
2024-06-14 16:39 ` Noah Goldstein
0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2024-06-14 10:40 UTC (permalink / raw)
To: H.J. Lu, Noah Goldstein; +Cc: libc-alpha
Hi,
I'm not subscribed to the glibc list - pls CC me directly on replies.
On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> On Fri, May 24, 2024 at 10:39?AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The tuning for non-temporal stores for memset vs memcpy is not always
> > the same. This includes both the exact value and whether non-temporal
> > stores are profitable at all for a given arch.
> >
> > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > disable non-temporal stores for non Intel vendors as the only
> > benchmarks showing its benefit have been on Intel hardware.
> > ---
> > manual/tunables.texi | 16 +++++++++++++++-
> > sysdeps/x86/cacheinfo.h | 8 +++++++-
> > sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> > sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> > sysdeps/x86/dl-tunables.list | 3 +++
> > sysdeps/x86/include/cpu-features.h | 4 +++-
> > .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> > 7 files changed, 49 insertions(+), 6 deletions(-)
...
> > + /* Non-temporal stores in memset have only been tested on Intel hardware.
> > + Until we benchmark data on other x86 processor, disable non-temporal
> > + stores in memset. */
Well, something's fishy here:
$ ./elf/ld.so --list-tunables | grep threshold
glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
^^^^^^^^^
on glibc-2.39.9000-300-g54c1efdac55b from git.
That's on a AMD Zen1 so I'd expect that memset NT threshold to be
0xffffffffffffffff by default...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-06-14 10:40 ` Borislav Petkov
@ 2024-06-14 16:39 ` Noah Goldstein
2024-06-14 18:01 ` Borislav Petkov
0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-06-14 16:39 UTC (permalink / raw)
To: Borislav Petkov; +Cc: H.J. Lu, libc-alpha
On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Hi,
>
> I'm not subscribed to the glibc list - pls CC me directly on replies.
>
> On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > On Fri, May 24, 2024 at 10:39?AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > the same. This includes both the exact value and whether non-temporal
> > > stores are profitable at all for a given arch.
> > >
> > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > disable non-temporal stores for non Intel vendors as the only
> > > benchmarks showing its benefit have been on Intel hardware.
> > > ---
> > > manual/tunables.texi | 16 +++++++++++++++-
> > > sysdeps/x86/cacheinfo.h | 8 +++++++-
> > > sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> > > sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> > > sysdeps/x86/dl-tunables.list | 3 +++
> > > sysdeps/x86/include/cpu-features.h | 4 +++-
> > > .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> > > 7 files changed, 49 insertions(+), 6 deletions(-)
>
> ...
>
> > > + /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > + Until we benchmark data on other x86 processor, disable non-temporal
> > > + stores in memset. */
>
> Well, something's fishy here:
>
> $ ./elf/ld.so --list-tunables | grep threshold
> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> ^^^^^^^^^
>
> on glibc-2.39.9000-300-g54c1efdac55b from git.
>
> That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> 0xffffffffffffffff by default...
>
> Thx.
>
Thanks for bringing this up, looking into it.
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-06-14 16:39 ` Noah Goldstein
@ 2024-06-14 18:01 ` Borislav Petkov
2024-06-14 18:02 ` Noah Goldstein
0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2024-06-14 18:01 UTC (permalink / raw)
To: Noah Goldstein; +Cc: H.J. Lu, libc-alpha, Michael Matz
On Fri, Jun 14, 2024 at 11:39:07AM -0500, Noah Goldstein wrote:
> On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > Hi,
> >
> > I'm not subscribed to the glibc list - pls CC me directly on replies.
> >
> > On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > > On Fri, May 24, 2024 at 10:39?AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > > the same. This includes both the exact value and whether non-temporal
> > > > stores are profitable at all for a given arch.
> > > >
> > > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > > disable non-temporal stores for non Intel vendors as the only
> > > > benchmarks showing its benefit have been on Intel hardware.
> > > > ---
> > > > manual/tunables.texi | 16 +++++++++++++++-
> > > > sysdeps/x86/cacheinfo.h | 8 +++++++-
> > > > sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> > > > sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> > > > sysdeps/x86/dl-tunables.list | 3 +++
> > > > sysdeps/x86/include/cpu-features.h | 4 +++-
> > > > .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> > > > 7 files changed, 49 insertions(+), 6 deletions(-)
> >
> > ...
> >
> > > > + /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > > + Until we benchmark data on other x86 processor, disable non-temporal
> > > > + stores in memset. */
> >
> > Well, something's fishy here:
> >
> > $ ./elf/ld.so --list-tunables | grep threshold
> > glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> > glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> > glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> > glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > ^^^^^^^^^
> >
> > on glibc-2.39.9000-300-g54c1efdac55b from git.
> >
> > That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> > 0xffffffffffffffff by default...
> >
> > Thx.
> >
>
> Thanks for bringing this up, looking into it.
Thx, so Michael did debug it yesterday to the ranges mismatching:
diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 147cc4cf23f5..ecf3c1d3736e 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -110,8 +110,11 @@ do_tunable_update_val (tunable_t *cur, const tunable_val_t *valp,
/* Bail out if the bounds are not valid. */
if (tunable_val_lt (val, min, unsigned_cmp)
- || tunable_val_lt (max, val, unsigned_cmp))
+ || tunable_val_lt (max, val, unsigned_cmp)) {
+ _dl_printf("bail out due to: 0x%lx, min: 0x%lx, max: 0x%lx\n",
+ val, min, max);
return;
+ }
cur->val.numval = val;
cur->type.min = min;
$ ./elf/ld.so --list-tunables | grep -E "(threshold|bail)"
dl_init_cacheinfo: memset_non_temporal_threshold: 0xffffffffffffffff
dl_init_cacheinfo: memset_non_temporal_threshold, tunable_size: 0xffffffffffffffff
bail out due to: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
^^^^^^^
dl_init_cacheinfo: memset_non_temporal_threshold, tunable set: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
but you guys probably should do the right fix here.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
2024-06-14 18:01 ` Borislav Petkov
@ 2024-06-14 18:02 ` Noah Goldstein
0 siblings, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-06-14 18:02 UTC (permalink / raw)
To: Borislav Petkov; +Cc: H.J. Lu, libc-alpha, Michael Matz
On Fri, Jun 14, 2024 at 1:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Jun 14, 2024 at 11:39:07AM -0500, Noah Goldstein wrote:
> > On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > Hi,
> > >
> > > I'm not subscribed to the glibc list - pls CC me directly on replies.
> > >
> > > On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > > > On Fri, May 24, 2024 at 10:39?AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > > > the same. This includes both the exact value and whether non-temporal
> > > > > stores are profitable at all for a given arch.
> > > > >
> > > > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > > > disable non-temporal stores for non Intel vendors as the only
> > > > > benchmarks showing its benefit have been on Intel hardware.
> > > > > ---
> > > > > manual/tunables.texi | 16 +++++++++++++++-
> > > > > sysdeps/x86/cacheinfo.h | 8 +++++++-
> > > > > sysdeps/x86/dl-cacheinfo.h | 16 ++++++++++++++++
> > > > > sysdeps/x86/dl-diagnostics-cpu.c | 2 ++
> > > > > sysdeps/x86/dl-tunables.list | 3 +++
> > > > > sysdeps/x86/include/cpu-features.h | 4 +++-
> > > > > .../x86_64/multiarch/memset-vec-unaligned-erms.S | 6 +++---
> > > > > 7 files changed, 49 insertions(+), 6 deletions(-)
> > >
> > > ...
> > >
> > > > > + /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > > > + Until we benchmark data on other x86 processor, disable non-temporal
> > > > > + stores in memset. */
> > >
> > > Well, something's fishy here:
> > >
> > > $ ./elf/ld.so --list-tunables | grep threshold
> > > glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> > > glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > > ^^^^^^^^^
> > >
> > > on glibc-2.39.9000-300-g54c1efdac55b from git.
> > >
> > > That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> > > 0xffffffffffffffff by default...
> > >
> > > Thx.
> > >
> >
> > Thanks for bringing this up, looking into it.
>
> Thx, so Michael did debug it yesterday to the ranges mismatching:
>
> diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
> index 147cc4cf23f5..ecf3c1d3736e 100644
> --- a/elf/dl-tunables.c
> +++ b/elf/dl-tunables.c
> @@ -110,8 +110,11 @@ do_tunable_update_val (tunable_t *cur, const tunable_val_t *valp,
>
> /* Bail out if the bounds are not valid. */
> if (tunable_val_lt (val, min, unsigned_cmp)
> - || tunable_val_lt (max, val, unsigned_cmp))
> + || tunable_val_lt (max, val, unsigned_cmp)) {
> + _dl_printf("bail out due to: 0x%lx, min: 0x%lx, max: 0x%lx\n",
> + val, min, max);
> return;
> + }
>
> cur->val.numval = val;
> cur->type.min = min;
>
> $ ./elf/ld.so --list-tunables | grep -E "(threshold|bail)"
> dl_init_cacheinfo: memset_non_temporal_threshold: 0xffffffffffffffff
> dl_init_cacheinfo: memset_non_temporal_threshold, tunable_size: 0xffffffffffffffff
> bail out due to: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
> ^^^^^^^
>
> dl_init_cacheinfo: memset_non_temporal_threshold, tunable set: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
> glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>
> but you guys probably should do the right fix here.
Just posted the fix, you should be CCd on it.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2024-06-14 18:02 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-19 0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
2024-05-20 17:14 ` H.J. Lu
2024-05-20 19:16 ` Adhemerval Zanella Netto
2024-05-20 22:47 ` Noah Goldstein
2024-05-20 22:50 ` Noah Goldstein
2024-05-23 16:46 ` Noah Goldstein
2024-05-23 16:49 ` H.J. Lu
2024-05-23 17:18 ` Adhemerval Zanella Netto
2024-05-24 17:39 ` Noah Goldstein
2024-05-30 20:02 ` Noah Goldstein
2024-05-20 19:37 ` Zack Weinberg
2024-05-20 22:48 ` Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
2024-05-29 16:19 ` Noah Goldstein
2024-05-29 22:53 ` H.J. Lu
2024-06-14 10:40 ` Borislav Petkov
2024-06-14 16:39 ` Noah Goldstein
2024-06-14 18:01 ` Borislav Petkov
2024-06-14 18:02 ` Noah Goldstein
2024-05-29 22:52 ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).