public inbox for libc-alpha@sourceware.org
* [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
@ 2024-05-19  0:43 Noah Goldstein
  2024-05-20 17:14 ` H.J. Lu
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-19  0:43 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools

Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data comparing different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But the zero-over-zero writeback optimization has been
removed because it is a potential side channel, so there is no longer
any good reason to rely solely on `rep stosb` for large memsets. On the
flip side, non-temporal writes can avoid fetching data for their RFO
requests, saving memory bandwidth.
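
As a rough illustration of the new large-size path (a minimal C sketch
using SSE2 intrinsics, illustrative only and not the glibc code; it
assumes the length is already well above 4x the vector size and that
the threshold check has already been done):

#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Same shape as the new L(nt_memset) path: one unaligned head store,
   align the pointer, stream 4x16 bytes per iteration, sfence, then
   finish the last 64 bytes with regular unaligned stores.  */
static void *
memset_nt_sketch (void *dst, int c, size_t n)
{
  char *p = dst;
  __m128i v = _mm_set1_epi8 ((char) c);

  /* Head: plain unaligned store of the first 16 bytes.  */
  _mm_storeu_si128 ((__m128i *) p, v);

  /* The last 4x16 bytes are handled after the loop.  */
  char *tail = p + n - 4 * 16;

  /* Round p up to the next 16-byte boundary.  */
  p = (char *) (((uintptr_t) p | 15) + 1);

  /* Streaming (non-temporal) stores: no RFO for the destination lines.  */
  while (p < tail)
    {
      _mm_stream_si128 ((__m128i *) (p + 0), v);
      _mm_stream_si128 ((__m128i *) (p + 16), v);
      _mm_stream_si128 ((__m128i *) (p + 32), v);
      _mm_stream_si128 ((__m128i *) (p + 48), v);
      p += 4 * 16;
    }

  /* Order the weakly-ordered streaming stores before returning.  */
  _mm_sfence ();

  /* Tail: 4 unaligned stores covering the final 64 bytes.  */
  _mm_storeu_si128 ((__m128i *) (tail + 0), v);
  _mm_storeu_si128 ((__m128i *) (tail + 16), v);
  _mm_storeu_si128 ((__m128i *) (tail + 32), v);
  _mm_storeu_si128 ((__m128i *) (tail + 48), v);
  return dst;
}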

All of the other changes to the file re-organize the code blocks to
maintain "good" alignment given the new code added in the
`L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't introduce regressions in the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
---
 .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+	  then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+	  non-temporal stores.  */
 
 #include <sysdep.h>
 
@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))
 
@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN
 
-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif
 
-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+    /* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+    /* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
-- 
2.34.1



* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-19  0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
@ 2024-05-20 17:14 ` H.J. Lu
  2024-05-20 19:16 ` Adhemerval Zanella Netto
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2024-05-20 17:14 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha

On Sat, May 18, 2024 at 5:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
>
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-19  0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
  2024-05-20 17:14 ` H.J. Lu
@ 2024-05-20 19:16 ` Adhemerval Zanella Netto
  2024-05-20 22:47   ` Noah Goldstein
  2024-05-20 19:37 ` Zack Weinberg
  2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
  3 siblings, 1 reply; 21+ messages in thread
From: Adhemerval Zanella Netto @ 2024-05-20 19:16 UTC (permalink / raw)
  To: Noah Goldstein, libc-alpha; +Cc: hjl.tools



On 18/05/24 21:43, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.

Btw, this spreadsheet is not accessible without extra steps.

> 
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> size, non-temporal writes can avoid data in their RFO requests saving
> memory bandwidth.

Any chance you know how this plays out on newer AMD chips? I am trying
to avoid another regression like BZ#30994 and BZ#30995 (this one I
would like to fix; however, I don't have access to a Zen4 machine to
check the results).

> 
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
> 
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
> 
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
> 
> Most of the cases are essentially unchanged, this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
> 
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
> 
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
> 
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
> 
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
> 


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-19  0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
  2024-05-20 17:14 ` H.J. Lu
  2024-05-20 19:16 ` Adhemerval Zanella Netto
@ 2024-05-20 19:37 ` Zack Weinberg
  2024-05-20 22:48   ` Noah Goldstein
  2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
  3 siblings, 1 reply; 21+ messages in thread
From: Zack Weinberg @ 2024-05-20 19:37 UTC (permalink / raw)
  To: Noah Goldstein, GNU libc development; +Cc: H . J . Lu

On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> Previously we use `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a few MBs)
> memsets. See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different stategies for large memset on ICX and SKX.

Did your tests read the region that was memset afterward, or did they
just do the memset?  It *probably* doesn't matter at "above a few MBs"
but I'm worried that this new implementation might look good on
microbenchmarks but make real programs *slower* because it is no longer
priming the cache for the memset region.

zw


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-20 19:16 ` Adhemerval Zanella Netto
@ 2024-05-20 22:47   ` Noah Goldstein
  2024-05-20 22:50     ` Noah Goldstein
  0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:47 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools

On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 18/05/24 21:43, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a
> > few MBs) memsets.
> > See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Btw this datasheet is not accessible without extra steps.
>

Oh sorry, I put the benchmark data in a repo. See the data here:
https://github.com/goldsteinn/memset-benchmarks/tree/master/results

The ICX.html and SKX.html files have the data from the above link.
> >
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. Historically, these numbers would not have been so good
> > because of the zero-over-zero writeback optimization that `rep stosb`
> > is able to do. But, the zero-over-zero writeback optimization has been
> > removed as a potential side-channel attack, so there is no longer any
> > good reason to only rely on `rep stosb` for large memsets. On the flip
> > size, non-temporal writes can avoid data in their RFO requests saving
> > memory bandwidth.
>
> Any chance on how this play in newer AMD chips? I am trying to avoid
> another regression like BZ#30994 and BZ#30995 (this one I would like
> to fix, however I don't have access to a Zen4 machine to check for
> results).

I didn't and don't have access to any of the newer AMD chips.

The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
has a README w/ steps if anyone wants to test it.
>
> >
> > All of the other changes to the file are to re-organize the
> > code-blocks to maintain "good" alignment given the new code added in
> > the `L(stosb_local)` case.
> >
> > The results from running the GLIBC memset benchmarks on TGL-client for
> > N=20 runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.979
> > Geometric Mean across the suite New / Old EXEX512: 0.979
> > Geometric Mean across the suite New / Old AVX2   : 0.986
> > Geometric Mean across the suite New / Old SSE2   : 0.979
> >
> > Most of the cases are essentially unchanged, this is mostly to show
> > that adding the non-temporal case didn't add any regressions to the
> > other cases.
> >
> > The results on the memset-large benchmark suite on TGL-client for N=20
> > runs:
> >
> > Geometric Mean across the suite New / Old EXEX256: 0.926
> > Geometric Mean across the suite New / Old EXEX512: 0.925
> > Geometric Mean across the suite New / Old AVX2   : 0.928
> > Geometric Mean across the suite New / Old SSE2   : 0.924
> >
> > So roughly a 7.5% speedup. This is lower than what we see on servers
> > (likely because clients typically have faster single-core bandwidth so
> > saving bandwidth on RFOs is less impactful), but still advantageous.
> >
> > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > ---
> >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> >  1 file changed, 91 insertions(+), 58 deletions(-)
> >


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-20 19:37 ` Zack Weinberg
@ 2024-05-20 22:48   ` Noah Goldstein
  0 siblings, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:48 UTC (permalink / raw)
  To: Zack Weinberg; +Cc: GNU libc development, H . J . Lu

On Mon, May 20, 2024 at 2:37 PM Zack Weinberg <zack@owlfolio.org> wrote:
>
> On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote:
> > Previously we use `rep stosb` for all medium/large memsets. This is
> > notably worse than non-temporal stores for large (above a few MBs)
> > memsets. See:
> > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > For data using different stategies for large memset on ICX and SKX.
>
> Did your tests read the region that was memset afterward, or did they
> just do the memset?  It *probably* doesn't matter at "above a few MBs"
> but I'm worried that this new implementation might look good on
> microbenchmarks but make real programs *slower* because it is no longer
> priming the cache for the memset region.

Yes, see all results tagged with `reused=1`; the results are still good.
Data available at:
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/SKX.html
and/or
https://github.com/goldsteinn/memset-benchmarks/blob/master/results/ICX.html
>
> zw


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-20 22:47   ` Noah Goldstein
@ 2024-05-20 22:50     ` Noah Goldstein
  2024-05-23 16:46       ` Noah Goldstein
  0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-20 22:50 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools

On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> <adhemerval.zanella@linaro.org> wrote:
> >
> >
> >
> > On 18/05/24 21:43, Noah Goldstein wrote:
> > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > notably worse than non-temporal stores for large (above a
> > > few MBs) memsets.
> > > See:
> > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > For data using different stategies for large memset on ICX and SKX.
> >
> > Btw this datasheet is not accessible without extra steps.
> >
>
> Bh sorry, put the benchmark data in a repo. See data here:
> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>
> The ICX.html and SKX.html have the data from the above link.
> > >
> > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > on SKX. Historically, these numbers would not have been so good
> > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > is able to do. But, the zero-over-zero writeback optimization has been
> > > removed as a potential side-channel attack, so there is no longer any
> > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > size, non-temporal writes can avoid data in their RFO requests saving
> > > memory bandwidth.
> >
> > Any chance on how this play in newer AMD chips? I am trying to avoid
> > another regression like BZ#30994 and BZ#30995 (this one I would like
> > to fix, however I don't have access to a Zen4 machine to check for
> > results).
>
> I didn't and don't have access to any of the newer AMD chips.
>
> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> has a README w/ steps if anyone wants to test it.

What we could do is use a separate tunable for the memset NT threshold
and just make it SIZE_MAX for AMD.
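
Roughly the idea (hypothetical names just to sketch it, not the actual
glibc tunable/cacheinfo plumbing):

#include <stdint.h>
#include <stdbool.h>

/* Sketch only: give memset its own NT threshold and default it to
   SIZE_MAX on non-Intel CPUs so the new NT path is never taken there,
   while Intel keeps using the shared non-temporal threshold.  */
static uint64_t memset_non_temporal_threshold;

static void
init_memset_nt_threshold (bool cpu_is_intel,
                          uint64_t shared_non_temporal_threshold)
{
  if (cpu_is_intel)
    memset_non_temporal_threshold = shared_non_temporal_threshold;
  else
    memset_non_temporal_threshold = UINT64_MAX; /* effectively SIZE_MAX */
}
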
> >
> > >
> > > All of the other changes to the file are to re-organize the
> > > code-blocks to maintain "good" alignment given the new code added in
> > > the `L(stosb_local)` case.
> > >
> > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > N=20 runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > >
> > > Most of the cases are essentially unchanged, this is mostly to show
> > > that adding the non-temporal case didn't add any regressions to the
> > > other cases.
> > >
> > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > runs:
> > >
> > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > >
> > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > (likely because clients typically have faster single-core bandwidth so
> > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > >
> > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > ---
> > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > >


* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-20 22:50     ` Noah Goldstein
@ 2024-05-23 16:46       ` Noah Goldstein
  2024-05-23 16:49         ` H.J. Lu
  0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-05-23 16:46 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: libc-alpha, hjl.tools

On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > <adhemerval.zanella@linaro.org> wrote:
> > >
> > >
> > >
> > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > notably worse than non-temporal stores for large (above a
> > > > few MBs) memsets.
> > > > See:
> > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > For data using different stategies for large memset on ICX and SKX.
> > >
> > > Btw this datasheet is not accessible without extra steps.
> > >
> >
> > Bh sorry, put the benchmark data in a repo. See data here:
> > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >
> > The ICX.html and SKX.html have the data from the above link.
> > > >
> > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > on SKX. Historically, these numbers would not have been so good
> > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > removed as a potential side-channel attack, so there is no longer any
> > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > memory bandwidth.
> > >
> > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > to fix, however I don't have access to a Zen4 machine to check for
> > > results).
> >
> > I didn't and don't have access to any of the newer AMD chips.
> >
> > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > has a README w/ steps if anyone wants to test it.
>
> What we could do is use a seperate tunable for memset NT threshold and just
> make it SIZE_MAX for AMD.

Adhemerval, is this patch okay to go in like that, or do you want to
keep it Intel-specific?
> > >
> > > >
> > > > All of the other changes to the file are to re-organize the
> > > > code-blocks to maintain "good" alignment given the new code added in
> > > > the `L(stosb_local)` case.
> > > >
> > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > N=20 runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > > >
> > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > that adding the non-temporal case didn't add any regressions to the
> > > > other cases.
> > > >
> > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > runs:
> > > >
> > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > > >
> > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > (likely because clients typically have faster single-core bandwidth so
> > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > >
> > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > ---
> > > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > > >
> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > index 97839a2248..637caadb40 100644
> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > @@ -21,10 +21,13 @@
> > > >     2. If size is less than VEC, use integer register stores.
> > > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > -   5. On machines ERMS feature, if size is greater or equal than
> > > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > > +   6. On machines ERMS feature, if size is range
> > > > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > +       then REP STOSB will be used.
> > > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > +       non-temporal stores.  */
> > > >
> > > >  #include <sysdep.h>
> > > >
> > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > >       VMOVU   %VMM(0), (%rdi)
> > > >       VZEROUPPER_RETURN
> > > > +
> > > > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > +        entry as it doesn't take much space and is likely a hot target.  */
> > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > > +     .p2align 6,, 47
> > > > +     .p2align 4
> > > > +L(less_vec):
> > > > +L(less_vec_from_wmemset):
> > > > +     /* Less than 1 VEC.  */
> > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > +#  error Unsupported VEC_SIZE!
> > > > +# endif
> > > > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > +        cross check. Note that we are using rax which is set in
> > > > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > +     andl    $(PAGE_SIZE - 1), %edi
> > > > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > +        serious performance degradation when it has to fault suppress.  */
> > > > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > +     /* This is generally considered a cold target.  */
> > > > +     ja      L(cross_page)
> > > > +# if VEC_SIZE > 32
> > > > +     movq    $-1, %rcx
> > > > +     bzhiq   %rdx, %rcx, %rcx
> > > > +     kmovq   %rcx, %k1
> > > > +# else
> > > > +     movl    $-1, %ecx
> > > > +     bzhil   %edx, %ecx, %ecx
> > > > +     kmovd   %ecx, %k1
> > > > +# endif
> > > > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > >  END (MEMSET_SYMBOL (__memset, unaligned))
> > > >
> > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > >  #endif
> > > >       VZEROUPPER_RETURN
> > > >
> > > > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > -        entry as it doesn't take much space and is likely a hot target.
> > > > -      */
> > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > -     .p2align 4,, 10
> > > > -L(less_vec):
> > > > -L(less_vec_from_wmemset):
> > > > -     /* Less than 1 VEC.  */
> > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > -#  error Unsupported VEC_SIZE!
> > > > -# endif
> > > > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > -        cross check. Note that we are using rax which is set in
> > > > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > -     andl    $(PAGE_SIZE - 1), %edi
> > > > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > -        serious performance degradation when it has to fault suppress.
> > > > -      */
> > > > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > -     /* This is generally considered a cold target.  */
> > > > -     ja      L(cross_page)
> > > > -# if VEC_SIZE > 32
> > > > -     movq    $-1, %rcx
> > > > -     bzhiq   %rdx, %rcx, %rcx
> > > > -     kmovq   %rcx, %k1
> > > > -# else
> > > > -     movl    $-1, %ecx
> > > > -     bzhil   %edx, %ecx, %ecx
> > > > -     kmovd   %ecx, %k1
> > > > -# endif
> > > > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > -     VZEROUPPER_RETURN
> > > > -
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > -     /* Include L(stosb_local) here if including L(less_vec) between
> > > > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > -        L(stosb_more_2x_vec) target.  */
> > > > -     .p2align 4,, 10
> > > > -L(stosb_local):
> > > > -     movzbl  %sil, %eax
> > > > -     mov     %RDX_LP, %RCX_LP
> > > > -     mov     %RDI_LP, %RDX_LP
> > > > -     rep     stosb
> > > > -     mov     %RDX_LP, %RAX_LP
> > > > -     VZEROUPPER_RETURN
> > > > -# endif
> > > > -#endif
> > > > -
> > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > >       .p2align 4
> > > >  L(stosb_more_2x_vec):
> > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > >       ret
> > > >  #endif
> > > >
> > > > -     .p2align 4,, 10
> > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > +#ifdef USE_WITH_AVX2
> > > > +     .p2align 4
> > > > +#else
> > > > +     .p2align 4,, 4
> > > > +#endif
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > >          range for 2-byte jump encoding.  */
> > > >  L(stosb_local):
> > > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > +     jae     L(nt_memset)
> > > >       movzbl  %sil, %eax
> > > >       mov     %RDX_LP, %RCX_LP
> > > >       mov     %RDI_LP, %RDX_LP
> > > >       rep     stosb
> > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > > > +     xchg    %RDX_LP, %RAX_LP
> > > > +# else
> > > >       mov     %RDX_LP, %RAX_LP
> > > > -     VZEROUPPER_RETURN
> > > >  # endif
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > >       /* Define L(less_vec) only if not otherwise defined.  */
> > > > -     .p2align 4
> > > > +     .p2align 4,, 12
> > > >  L(less_vec):
> > > >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > >          xmm). This is only does anything for AVX2.  */
> > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > >  #endif
> > > >       ret
> > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > +
> > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > +# ifdef USE_WITH_AVX512
> > > > +     /* Force align so the loop doesn't cross a cache-line.  */
> > > > +     .p2align 4
> > > > +# endif
> > > > +     .p2align 4,, 7
> > > > +    /* Memset using non-temporal stores.  */
> > > > +L(nt_memset):
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > +    /* Align DST.  */
> > > > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > > > +     incq    %rdi
> > > > +     .p2align 4,, 7
> > > > +L(nt_loop):
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > +     subq    $(VEC_SIZE * -4), %rdi
> > > > +     cmpq    %rdx, %rdi
> > > > +     jb      L(nt_loop)
> > > > +     sfence
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > +     VZEROUPPER_RETURN
> > > > +#endif
> > > > +
> > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-23 16:46       ` Noah Goldstein
@ 2024-05-23 16:49         ` H.J. Lu
  2024-05-23 17:18           ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2024-05-23 16:49 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Adhemerval Zanella Netto, libc-alpha

On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> > > <adhemerval.zanella@linaro.org> wrote:
> > > >
> > > >
> > > >
> > > > On 18/05/24 21:43, Noah Goldstein wrote:
> > > > > Previously we use `rep stosb` for all medium/large memsets. This is
> > > > > notably worse than non-temporal stores for large (above a
> > > > > few MBs) memsets.
> > > > > See:
> > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> > > > > For data using different stategies for large memset on ICX and SKX.
> > > >
> > > > Btw this datasheet is not accessible without extra steps.
> > > >
> > >
> > > Bh sorry, put the benchmark data in a repo. See data here:
> > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> > >
> > > The ICX.html and SKX.html have the data from the above link.
> > > > >
> > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > > > > on SKX. Historically, these numbers would not have been so good
> > > > > because of the zero-over-zero writeback optimization that `rep stosb`
> > > > > is able to do. But, the zero-over-zero writeback optimization has been
> > > > > removed as a potential side-channel attack, so there is no longer any
> > > > > good reason to only rely on `rep stosb` for large memsets. On the flip
> > > > > size, non-temporal writes can avoid data in their RFO requests saving
> > > > > memory bandwidth.
> > > >
> > > > Any chance on how this play in newer AMD chips? I am trying to avoid
> > > > another regression like BZ#30994 and BZ#30995 (this one I would like
> > > > to fix, however I don't have access to a Zen4 machine to check for
> > > > results).
> > >
> > > I didn't and don't have access to any of the newer AMD chips.
> > >
> > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> > > has a README w/ steps if anyone wants to test it.
> >
> > What we could do is use a seperate tunable for memset NT threshold and just
> > make it SIZE_MAX for AMD.
>
> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> specific?

Please make it Intel specific.

Thanks.

> > > >
> > > > >
> > > > > All of the other changes to the file are to re-organize the
> > > > > code-blocks to maintain "good" alignment given the new code added in
> > > > > the `L(stosb_local)` case.
> > > > >
> > > > > The results from running the GLIBC memset benchmarks on TGL-client for
> > > > > N=20 runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.979
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.979
> > > > > Geometric Mean across the suite New / Old AVX2   : 0.986
> > > > > Geometric Mean across the suite New / Old SSE2   : 0.979
> > > > >
> > > > > Most of the cases are essentially unchanged, this is mostly to show
> > > > > that adding the non-temporal case didn't add any regressions to the
> > > > > other cases.
> > > > >
> > > > > The results on the memset-large benchmark suite on TGL-client for N=20
> > > > > runs:
> > > > >
> > > > > Geometric Mean across the suite New / Old EXEX256: 0.926
> > > > > Geometric Mean across the suite New / Old EXEX512: 0.925
> > > > > Geometric Mean across the suite New / Old AVX2   : 0.928
> > > > > Geometric Mean across the suite New / Old SSE2   : 0.924
> > > > >
> > > > > So roughly a 7.5% speedup. This is lower than what we see on servers
> > > > > (likely because clients typically have faster single-core bandwidth so
> > > > > saving bandwidth on RFOs is less impactful), but still advantageous.
> > > > >
> > > > > Full test-suite passes on x86_64 w/ and w/o multiarch.
> > > > > ---
> > > > >  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
> > > > >  1 file changed, 91 insertions(+), 58 deletions(-)
> > > > >
> > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > index 97839a2248..637caadb40 100644
> > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> > > > > @@ -21,10 +21,13 @@
> > > > >     2. If size is less than VEC, use integer register stores.
> > > > >     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
> > > > >     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> > > > > -   5. On machines ERMS feature, if size is greater or equal than
> > > > > -      __x86_rep_stosb_threshold then REP STOSB will be used.
> > > > > -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> > > > > -      4 VEC stores and store 4 * VEC at a time until done.  */
> > > > > +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> > > > > +      4 VEC stores and store 4 * VEC at a time until done.
> > > > > +   6. On machines ERMS feature, if size is range
> > > > > +       [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> > > > > +       then REP STOSB will be used.
> > > > > +   7. If size >= __x86_shared_non_temporal_threshold, use a
> > > > > +       non-temporal stores.  */
> > > > >
> > > > >  #include <sysdep.h>
> > > > >
> > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset):
> > > > >       VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
> > > > >       VMOVU   %VMM(0), (%rdi)
> > > > >       VZEROUPPER_RETURN
> > > > > +
> > > > > +     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > +        entry as it doesn't take much space and is likely a hot target.  */
> > > > > +#ifdef USE_LESS_VEC_MASK_STORE
> > > > > +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> > > > > +     .p2align 6,, 47
> > > > > +     .p2align 4
> > > > > +L(less_vec):
> > > > > +L(less_vec_from_wmemset):
> > > > > +     /* Less than 1 VEC.  */
> > > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > +#  error Unsupported VEC_SIZE!
> > > > > +# endif
> > > > > +     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > +        cross check. Note that we are using rax which is set in
> > > > > +        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > > +     andl    $(PAGE_SIZE - 1), %edi
> > > > > +     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > +        serious performance degradation when it has to fault suppress.  */
> > > > > +     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > +     /* This is generally considered a cold target.  */
> > > > > +     ja      L(cross_page)
> > > > > +# if VEC_SIZE > 32
> > > > > +     movq    $-1, %rcx
> > > > > +     bzhiq   %rdx, %rcx, %rcx
> > > > > +     kmovq   %rcx, %k1
> > > > > +# else
> > > > > +     movl    $-1, %ecx
> > > > > +     bzhil   %edx, %ecx, %ecx
> > > > > +     kmovd   %ecx, %k1
> > > > > +# endif
> > > > > +     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > > >  END (MEMSET_SYMBOL (__memset, unaligned))
> > > > >
> > > > > @@ -185,54 +223,6 @@ L(last_2x_vec):
> > > > >  #endif
> > > > >       VZEROUPPER_RETURN
> > > > >
> > > > > -     /* If have AVX512 mask instructions put L(less_vec) close to
> > > > > -        entry as it doesn't take much space and is likely a hot target.
> > > > > -      */
> > > > > -#ifdef USE_LESS_VEC_MASK_STORE
> > > > > -     .p2align 4,, 10
> > > > > -L(less_vec):
> > > > > -L(less_vec_from_wmemset):
> > > > > -     /* Less than 1 VEC.  */
> > > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> > > > > -#  error Unsupported VEC_SIZE!
> > > > > -# endif
> > > > > -     /* Clear high bits from edi. Only keeping bits relevant to page
> > > > > -        cross check. Note that we are using rax which is set in
> > > > > -        MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> > > > > -     andl    $(PAGE_SIZE - 1), %edi
> > > > > -     /* Check if VEC_SIZE store cross page. Mask stores suffer
> > > > > -        serious performance degradation when it has to fault suppress.
> > > > > -      */
> > > > > -     cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> > > > > -     /* This is generally considered a cold target.  */
> > > > > -     ja      L(cross_page)
> > > > > -# if VEC_SIZE > 32
> > > > > -     movq    $-1, %rcx
> > > > > -     bzhiq   %rdx, %rcx, %rcx
> > > > > -     kmovq   %rcx, %k1
> > > > > -# else
> > > > > -     movl    $-1, %ecx
> > > > > -     bzhil   %edx, %ecx, %ecx
> > > > > -     kmovd   %ecx, %k1
> > > > > -# endif
> > > > > -     vmovdqu8 %VMM(0), (%rax){%k1}
> > > > > -     VZEROUPPER_RETURN
> > > > > -
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > -     /* Include L(stosb_local) here if including L(less_vec) between
> > > > > -        L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> > > > > -        L(stosb_more_2x_vec) target.  */
> > > > > -     .p2align 4,, 10
> > > > > -L(stosb_local):
> > > > > -     movzbl  %sil, %eax
> > > > > -     mov     %RDX_LP, %RCX_LP
> > > > > -     mov     %RDI_LP, %RDX_LP
> > > > > -     rep     stosb
> > > > > -     mov     %RDX_LP, %RAX_LP
> > > > > -     VZEROUPPER_RETURN
> > > > > -# endif
> > > > > -#endif
> > > > > -
> > > > >  #if defined USE_MULTIARCH && IS_IN (libc)
> > > > >       .p2align 4
> > > > >  L(stosb_more_2x_vec):
> > > > > @@ -318,21 +308,33 @@ L(return_vzeroupper):
> > > > >       ret
> > > > >  #endif
> > > > >
> > > > > -     .p2align 4,, 10
> > > > > -#ifndef USE_LESS_VEC_MASK_STORE
> > > > > -# if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +#ifdef USE_WITH_AVX2
> > > > > +     .p2align 4
> > > > > +#else
> > > > > +     .p2align 4,, 4
> > > > > +#endif
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > >       /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
> > > > >          range for 2-byte jump encoding.  */
> > > > >  L(stosb_local):
> > > > > +     cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> > > > > +     jae     L(nt_memset)
> > > > >       movzbl  %sil, %eax
> > > > >       mov     %RDX_LP, %RCX_LP
> > > > >       mov     %RDI_LP, %RDX_LP
> > > > >       rep     stosb
> > > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> > > > > +     /* Use xchg to save 1-byte (this helps align targets below).  */
> > > > > +     xchg    %RDX_LP, %RAX_LP
> > > > > +# else
> > > > >       mov     %RDX_LP, %RAX_LP
> > > > > -     VZEROUPPER_RETURN
> > > > >  # endif
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +#ifndef USE_LESS_VEC_MASK_STORE
> > > > >       /* Define L(less_vec) only if not otherwise defined.  */
> > > > > -     .p2align 4
> > > > > +     .p2align 4,, 12
> > > > >  L(less_vec):
> > > > >       /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
> > > > >          xmm). This is only does anything for AVX2.  */
> > > > > @@ -423,4 +425,35 @@ L(between_2_3):
> > > > >       movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
> > > > >  #endif
> > > > >       ret
> > > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> > > > > +
> > > > > +#if defined USE_MULTIARCH && IS_IN (libc)
> > > > > +# ifdef USE_WITH_AVX512
> > > > > +     /* Force align so the loop doesn't cross a cache-line.  */
> > > > > +     .p2align 4
> > > > > +# endif
> > > > > +     .p2align 4,, 7
> > > > > +    /* Memset using non-temporal stores.  */
> > > > > +L(nt_memset):
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > +     leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> > > > > +    /* Align DST.  */
> > > > > +     orq     $(VEC_SIZE * 1 - 1), %rdi
> > > > > +     incq    %rdi
> > > > > +     .p2align 4,, 7
> > > > > +L(nt_loop):
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> > > > > +     VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> > > > > +     subq    $(VEC_SIZE * -4), %rdi
> > > > > +     cmpq    %rdx, %rdi
> > > > > +     jb      L(nt_loop)
> > > > > +     sfence
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> > > > > +     VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> > > > > +     VZEROUPPER_RETURN
> > > > > +#endif
> > > > > +
> > > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))



-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-23 16:49         ` H.J. Lu
@ 2024-05-23 17:18           ` Adhemerval Zanella Netto
  2024-05-24 17:39             ` Noah Goldstein
  2024-05-30 20:02             ` Noah Goldstein
  0 siblings, 2 replies; 21+ messages in thread
From: Adhemerval Zanella Netto @ 2024-05-23 17:18 UTC (permalink / raw)
  To: H.J. Lu, Noah Goldstein; +Cc: libc-alpha



On 23/05/24 13:49, H.J. Lu wrote:
> On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>
>>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>>
>>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 18/05/24 21:43, Noah Goldstein wrote:
>>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
>>>>>> notably worse than non-temporal stores for large (above a
>>>>>> few MBs) memsets.
>>>>>> See:
>>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
>>>>>> For data using different stategies for large memset on ICX and SKX.
>>>>>
>>>>> Btw this datasheet is not accessible without extra steps.
>>>>>
>>>>
>>>> Bh sorry, put the benchmark data in a repo. See data here:
>>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>>>>
>>>> The ICX.html and SKX.html have the data from the above link.
>>>>>>
>>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>>>>>> on SKX. Historically, these numbers would not have been so good
>>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
>>>>>> is able to do. But, the zero-over-zero writeback optimization has been
>>>>>> removed as a potential side-channel attack, so there is no longer any
>>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
>>>>>> size, non-temporal writes can avoid data in their RFO requests saving
>>>>>> memory bandwidth.
>>>>>
>>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
>>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
>>>>> to fix, however I don't have access to a Zen4 machine to check for
>>>>> results).
>>>>
>>>> I didn't and don't have access to any of the newer AMD chips.
>>>>
>>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
>>>> has a README w/ steps if anyone wants to test it.
>>>
>>> What we could do is use a seperate tunable for memset NT threshold and just
>>> make it SIZE_MAX for AMD.
>>
>> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
>> specific?
> 
> Please make it Intel specific.
> 
> Thanks.

Yes, please make it Intel specific for now. I will try to take a look at least
with a Zen3 core that I have access to.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-19  0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
                   ` (2 preceding siblings ...)
  2024-05-20 19:37 ` Zack Weinberg
@ 2024-05-24 17:38 ` Noah Goldstein
  2024-05-24 17:38   ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
  2024-05-29 22:52   ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu
  3 siblings, 2 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:38 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a
few MBs) memsets.
See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data on the different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do. But, the zero-over-zero writeback optimization has been
removed as a potential side-channel attack, so there is no longer any
good reason to only rely on `rep stosb` for large memsets. On the flip
side, non-temporal writes can avoid the data read in their RFO requests,
saving memory bandwidth.

All of the other changes to the file are to re-organize the
code-blocks to maintain "good" alignment given the new code added in
the `L(stosb_local)` case.

The results from running the GLIBC memset benchmarks on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't add any regressions to the
other cases.

The results on the memset-large benchmark suite on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth so
saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
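
For readers who find the assembly below hard to follow, here is a rough
C sketch of what the new L(nt_memset) path does. This is purely an
illustration, not the committed code: it hard-codes a 16-byte VEC and
uses SSE2 intrinsics, and it assumes len is large (at least 4 * VEC,
which always holds once the non-temporal threshold is crossed).

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

static void *
nt_memset_sketch (void *dst, int c, size_t len)
{
  enum { VEC = 16 };
  __m128i v = _mm_set1_epi8 ((char) c);
  char *p = dst;

  /* Unaligned store covering the (possibly misaligned) head.  */
  _mm_storeu_si128 ((__m128i *) p, v);
  /* End pointer for the main loop; the last 4 * VEC bytes are
     handled separately after the loop.  */
  char *end = p + len - 4 * VEC;
  /* Round P up to the next VEC boundary (the orq/incq pair).  */
  p = (char *) ((((uintptr_t) p) | (VEC - 1)) + 1);

  while (p < end)
    {
      /* movntdq: weakly ordered non-temporal stores, VEC-aligned.  */
      _mm_stream_si128 ((__m128i *) (p + 0 * VEC), v);
      _mm_stream_si128 ((__m128i *) (p + 1 * VEC), v);
      _mm_stream_si128 ((__m128i *) (p + 2 * VEC), v);
      _mm_stream_si128 ((__m128i *) (p + 3 * VEC), v);
      p += 4 * VEC;
    }

  /* Order the non-temporal stores before the normal trailing stores.  */
  _mm_sfence ();
  /* Unaligned tail: the last 4 * VEC bytes (may overlap the loop).  */
  _mm_storeu_si128 ((__m128i *) (end + 0 * VEC), v);
  _mm_storeu_si128 ((__m128i *) (end + 1 * VEC), v);
  _mm_storeu_si128 ((__m128i *) (end + 2 * VEC), v);
  _mm_storeu_si128 ((__m128i *) (end + 3 * VEC), v);
  return dst;
}

The sfence is what lets the tail be written with ordinary stores even
though the body used weakly ordered non-temporal stores.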
---
 .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+	  then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+	  non-temporal stores.  */
 
 #include <sysdep.h>
 
@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))
 
@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN
 
-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif
 
-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+    /* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+    /* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
-- 
2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
@ 2024-05-24 17:38   ` Noah Goldstein
  2024-05-29 16:19     ` Noah Goldstein
  2024-05-29 22:53     ` H.J. Lu
  2024-05-29 22:52   ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu
  1 sibling, 2 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:38 UTC (permalink / raw)
  To: libc-alpha; +Cc: goldstein.w.n, hjl.tools, carlos

The tuning for non-temporal stores for memset vs memcpy is not always
the same. This includes both the exact value and whether non-temporal
stores are profitable at all for a given arch.

This patch adds `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non-Intel vendors, as the only
benchmarks showing a benefit have been on Intel hardware.
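
As a rough illustration of how the thresholds are meant to interact in
the large-size memset path (hypothetical helper names, not actual glibc
symbols; only the two threshold variables correspond to real internals,
and the small-size special cases are ignored):

#include <stddef.h>

extern size_t rep_stosb_threshold;   /* __x86_rep_stosb_threshold */
extern size_t memset_nt_threshold;   /* __x86_memset_non_temporal_threshold */

void *vec_memset (void *dst, int c, size_t len);   /* 4x VEC store loop */
void *erms_memset (void *dst, int c, size_t len);  /* rep stosb (ERMS)  */
void *nt_memset (void *dst, int c, size_t len);    /* non-temporal loop */

void *
memset_dispatch_sketch (void *dst, int c, size_t len)
{
  if (len < rep_stosb_threshold)
    return vec_memset (dst, c, len);
  if (len < memset_nt_threshold)
    return erms_memset (dst, c, len);
  return nt_memset (dst, c, len);
}

With the default of SIZE_MAX on non-Intel CPUs, the last branch is never
taken unless the user sets the
glibc.cpu.x86_memset_non_temporal_threshold tunable explicitly.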
---
 manual/tunables.texi                             | 16 +++++++++++++++-
 sysdeps/x86/cacheinfo.h                          |  8 +++++++-
 sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
 sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
 sysdeps/x86/dl-tunables.list                     |  3 +++
 sysdeps/x86/include/cpu-features.h               |  4 +++-
 .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
 7 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index baaf751721..8dd02d8149 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
 glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
 glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
+glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
@@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
 @cindex shared_cache_size tunables
 @cindex tunables, shared_cache_size
 @cindex non_temporal_threshold tunables
-@cindex tunables, non_temporal_threshold
+@cindex memset_non_temporal_threshold tunables
+@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
 
 @deftp {Tunable namespace} glibc.cpu
 Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
@@ -574,6 +576,18 @@ like memmove and memcpy.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
+The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
+the user to set threshold in bytes for non temporal store in
+memset. Non temporal stores give a hint to the hardware to move data
+directly to memory without displacing other data from the cache. This
+tunable is used by some platforms to determine when to use non
+temporal stores memset.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+
 @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
 The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep movsb".  The value must be
diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
index ab73556772..83491607c7 100644
--- a/sysdeps/x86/cacheinfo.h
+++ b/sysdeps/x86/cacheinfo.h
@@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
 long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
 long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
 
-/* Threshold to use non temporal store.  */
+/* Threshold to use non temporal store in memmove.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use non temporal store in memset.  */
+long int __x86_memset_non_temporal_threshold attribute_hidden;
+
 /* Threshold to use Enhanced REP MOVSB.  */
 long int __x86_rep_movsb_threshold attribute_hidden = 2048;
 
@@ -77,6 +80,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;
 
+  __x86_memset_non_temporal_threshold
+      = cpu_features->memset_non_temporal_threshold;
+
   __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
   __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
   __x86_rep_movsb_stop_threshold =  cpu_features->rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 5a98f70364..d375a7cba6 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
     rep_movsb_threshold = 2112;
 
+  /* Non-temporal stores in memset have only been tested on Intel hardware.
+     Until we benchmark data on other x86 processor, disable non-temporal
+     stores in memset. */
+  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
+  if (cpu_features->basic.kind == arch_kind_intel)
+      memset_non_temporal_threshold = non_temporal_threshold;
+
    /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
       cases slower than the vectorized path (and for some alignments,
       it is really slow, check BZ #30994).  */
@@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       && tunable_size <= maximum_non_temporal_threshold)
     non_temporal_threshold = tunable_size;
 
+  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
+  if (tunable_size > minimum_non_temporal_threshold
+      && tunable_size <= maximum_non_temporal_threshold)
+    memset_non_temporal_threshold = tunable_size;
+
   tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
   if (tunable_size > minimum_rep_movsb_threshold)
     rep_movsb_threshold = tunable_size;
@@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
 			   minimum_non_temporal_threshold,
 			   maximum_non_temporal_threshold);
+  TUNABLE_SET_WITH_BOUNDS (
+      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
+      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
 			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
@@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
   cpu_features->rep_movsb_threshold = rep_movsb_threshold;
   cpu_features->rep_stosb_threshold = rep_stosb_threshold;
   cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index ceafde9481..49eeb5f70a 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
                             cpu_features->shared_cache_size);
   print_cpu_features_value ("non_temporal_threshold",
                             cpu_features->non_temporal_threshold);
+  print_cpu_features_value ("memset_non_temporal_threshold",
+                            cpu_features->memset_non_temporal_threshold);
   print_cpu_features_value ("rep_movsb_threshold",
                             cpu_features->rep_movsb_threshold);
   print_cpu_features_value ("rep_movsb_stop_threshold",
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 7d82da0dec..a0a1299592 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,9 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_memset_non_temporal_threshold {
+      type: SIZE_T
+    }
     x86_rep_movsb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP MOVSB operation, REP
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index cd7bd27cf3..aaae44f0e1 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -944,8 +944,10 @@ struct cpu_features
   /* Shared cache size for use in memory and string routines, typically
      L2 or L3 size.  */
   unsigned long int shared_cache_size;
-  /* Threshold to use non temporal store.  */
+  /* Threshold to use non temporal store in memmove.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use non temporal store in memset.  */
+  unsigned long int memset_non_temporal_threshold;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 637caadb40..88bf08e4f4 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -24,9 +24,9 @@
    5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
       4 VEC stores and store 4 * VEC at a time until done.
    6. On machines ERMS feature, if size is range
-	  [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+	  [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
 	  then REP STOSB will be used.
-   7. If size >= __x86_shared_non_temporal_threshold, use a
+   7. If size >= __x86_memset_non_temporal_threshold, use a
 	  non-temporal stores.  */
 
 #include <sysdep.h>
@@ -318,7 +318,7 @@ L(return_vzeroupper):
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	cmp	__x86_memset_non_temporal_threshold(%rip), %RDX_LP
 	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
-- 
2.34.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-23 17:18           ` Adhemerval Zanella Netto
@ 2024-05-24 17:39             ` Noah Goldstein
  2024-05-30 20:02             ` Noah Goldstein
  1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-24 17:39 UTC (permalink / raw)
  To: Adhemerval Zanella Netto; +Cc: H.J. Lu, libc-alpha

On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access

Done; I split it into two patches, and the tunable one is the second in the series.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-05-24 17:38   ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
@ 2024-05-29 16:19     ` Noah Goldstein
  2024-05-29 22:53     ` H.J. Lu
  1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-29 16:19 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools, carlos

On Fri, May 24, 2024 at 12:38 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The tuning for non-temporal stores for memset vs memcpy is not always
> the same. This includes both the exact value and whether non-temporal
> stores are profitable at all for a given arch.
>
> This patch add `x86_memset_non_temporal_threshold`. Currently we
> disable non-temporal stores for non Intel vendors as the only
> benchmarks showing its benefit have been on Intel hardware.
> ---
>  manual/tunables.texi                             | 16 +++++++++++++++-
>  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
>  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
>  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
>  sysdeps/x86/dl-tunables.list                     |  3 +++
>  sysdeps/x86/include/cpu-features.h               |  4 +++-
>  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
>  7 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index baaf751721..8dd02d8149 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
>  glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
>  glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> +glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> @@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
>  @cindex shared_cache_size tunables
>  @cindex tunables, shared_cache_size
>  @cindex non_temporal_threshold tunables
> -@cindex tunables, non_temporal_threshold
> +@cindex memset_non_temporal_threshold tunables
> +@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
>
>  @deftp {Tunable namespace} glibc.cpu
>  Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
> @@ -574,6 +576,18 @@ like memmove and memcpy.
>  This tunable is specific to i386 and x86-64.
>  @end deftp
>
> +@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
> +The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
> +the user to set threshold in bytes for non temporal store in
> +memset. Non temporal stores give a hint to the hardware to move data
> +directly to memory without displacing other data from the cache. This
> +tunable is used by some platforms to determine when to use non
> +temporal stores memset.
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +
>  @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
>  The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
>  set threshold in bytes to start using "rep movsb".  The value must be
> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> index ab73556772..83491607c7 100644
> --- a/sysdeps/x86/cacheinfo.h
> +++ b/sysdeps/x86/cacheinfo.h
> @@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
>  long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
>  long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
>
> -/* Threshold to use non temporal store.  */
> +/* Threshold to use non temporal store in memmove.  */
>  long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use non temporal store in memset.  */
> +long int __x86_memset_non_temporal_threshold attribute_hidden;
> +
>  /* Threshold to use Enhanced REP MOVSB.  */
>  long int __x86_rep_movsb_threshold attribute_hidden = 2048;
>
> @@ -77,6 +80,9 @@ init_cacheinfo (void)
>    __x86_shared_non_temporal_threshold
>      = cpu_features->non_temporal_threshold;
>
> +  __x86_memset_non_temporal_threshold
> +      = cpu_features->memset_non_temporal_threshold;
> +
>    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
>    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
>    __x86_rep_movsb_stop_threshold =  cpu_features->rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 5a98f70364..d375a7cba6 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
>      rep_movsb_threshold = 2112;
>
> +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> +     Until we benchmark data on other x86 processor, disable non-temporal
> +     stores in memset. */
> +  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
> +  if (cpu_features->basic.kind == arch_kind_intel)
> +      memset_non_temporal_threshold = non_temporal_threshold;
> +
>     /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
>        cases slower than the vectorized path (and for some alignments,
>        it is really slow, check BZ #30994).  */
> @@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        && tunable_size <= maximum_non_temporal_threshold)
>      non_temporal_threshold = tunable_size;
>
> +  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
> +  if (tunable_size > minimum_non_temporal_threshold
> +      && tunable_size <= maximum_non_temporal_threshold)
> +    memset_non_temporal_threshold = tunable_size;
> +
>    tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
>    if (tunable_size > minimum_rep_movsb_threshold)
>      rep_movsb_threshold = tunable_size;
> @@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
>                            minimum_non_temporal_threshold,
>                            maximum_non_temporal_threshold);
> +  TUNABLE_SET_WITH_BOUNDS (
> +      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
> +      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> @@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
>    cpu_features->non_temporal_threshold = non_temporal_threshold;
> +  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
>    cpu_features->rep_movsb_threshold = rep_movsb_threshold;
>    cpu_features->rep_stosb_threshold = rep_stosb_threshold;
>    cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
> index ceafde9481..49eeb5f70a 100644
> --- a/sysdeps/x86/dl-diagnostics-cpu.c
> +++ b/sysdeps/x86/dl-diagnostics-cpu.c
> @@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
>                              cpu_features->shared_cache_size);
>    print_cpu_features_value ("non_temporal_threshold",
>                              cpu_features->non_temporal_threshold);
> +  print_cpu_features_value ("memset_non_temporal_threshold",
> +                            cpu_features->memset_non_temporal_threshold);
>    print_cpu_features_value ("rep_movsb_threshold",
>                              cpu_features->rep_movsb_threshold);
>    print_cpu_features_value ("rep_movsb_stop_threshold",
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 7d82da0dec..a0a1299592 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,9 @@ glibc {
>      x86_non_temporal_threshold {
>        type: SIZE_T
>      }
> +    x86_memset_non_temporal_threshold {
> +      type: SIZE_T
> +    }
>      x86_rep_movsb_threshold {
>        type: SIZE_T
>        # Since there is overhead to set up REP MOVSB operation, REP
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index cd7bd27cf3..aaae44f0e1 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -944,8 +944,10 @@ struct cpu_features
>    /* Shared cache size for use in memory and string routines, typically
>       L2 or L3 size.  */
>    unsigned long int shared_cache_size;
> -  /* Threshold to use non temporal store.  */
> +  /* Threshold to use non temporal store in memmove.  */
>    unsigned long int non_temporal_threshold;
> +  /* Threshold to use non temporal store in memset.  */
> +  unsigned long int memset_non_temporal_threshold;
>    /* Threshold to use "rep movsb".  */
>    unsigned long int rep_movsb_threshold;
>    /* Threshold to stop using "rep movsb".  */
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 637caadb40..88bf08e4f4 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -24,9 +24,9 @@
>     5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
>        4 VEC stores and store 4 * VEC at a time until done.
>     6. On machines ERMS feature, if size is range
> -         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
>           then REP STOSB will be used.
> -   7. If size >= __x86_shared_non_temporal_threshold, use a
> +   7. If size >= __x86_memset_non_temporal_threshold, use a
>           non-temporal stores.  */
>
>  #include <sysdep.h>
> @@ -318,7 +318,7 @@ L(return_vzeroupper):
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       cmp     __x86_memset_non_temporal_threshold(%rip), %RDX_LP
>         jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
> --
> 2.34.1
>
ping

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
  2024-05-24 17:38   ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
@ 2024-05-29 22:52   ` H.J. Lu
  1 sibling, 0 replies; 21+ messages in thread
From: H.J. Lu @ 2024-05-29 22:52 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: GNU C Library, Carlos O'Donell

On Fri, May 24, 2024 at 10:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we used `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a
> few MBs) memsets.
> See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> For data using different strategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But, the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to only rely on `rep stosb` for large memsets. On the flip
> side, non-temporal writes can avoid fetching data in their RFO requests,
> saving memory bandwidth.
>
> All of the other changes to the file are to re-organize the
> code-blocks to maintain "good" alignment given the new code added in
> the `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.979
> Geometric Mean across the suite New / Old EXEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
>
> Most of the cases are essentially unchanged; this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EXEX256: 0.926
> Geometric Mean across the suite New / Old EXEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth so
> saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S     | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
>
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 97839a2248..637caadb40 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -21,10 +21,13 @@
>     2. If size is less than VEC, use integer register stores.
>     3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
>     4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
> -   5. On machines ERMS feature, if size is greater or equal than
> -      __x86_rep_stosb_threshold then REP STOSB will be used.
> -   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
> -      4 VEC stores and store 4 * VEC at a time until done.  */
> +   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
> +      4 VEC stores and store 4 * VEC at a time until done.
> +   6. On machines ERMS feature, if size is range
> +         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         then REP STOSB will be used.
> +   7. If size >= __x86_shared_non_temporal_threshold, use a
> +         non-temporal stores.  */
>
>  #include <sysdep.h>
>
> @@ -147,6 +150,41 @@ L(entry_from_wmemset):
>         VMOVU   %VMM(0), -VEC_SIZE(%rdi,%rdx)
>         VMOVU   %VMM(0), (%rdi)
>         VZEROUPPER_RETURN
> +
> +       /* If have AVX512 mask instructions put L(less_vec) close to
> +          entry as it doesn't take much space and is likely a hot target.  */
> +#ifdef USE_LESS_VEC_MASK_STORE
> +    /* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
> +       .p2align 6,, 47
> +       .p2align 4
> +L(less_vec):
> +L(less_vec_from_wmemset):
> +       /* Less than 1 VEC.  */
> +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> +#  error Unsupported VEC_SIZE!
> +# endif
> +       /* Clear high bits from edi. Only keeping bits relevant to page
> +          cross check. Note that we are using rax which is set in
> +          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> +       andl    $(PAGE_SIZE - 1), %edi
> +       /* Check if VEC_SIZE store cross page. Mask stores suffer
> +          serious performance degradation when it has to fault suppress.  */
> +       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> +       /* This is generally considered a cold target.  */
> +       ja      L(cross_page)
> +# if VEC_SIZE > 32
> +       movq    $-1, %rcx
> +       bzhiq   %rdx, %rcx, %rcx
> +       kmovq   %rcx, %k1
> +# else
> +       movl    $-1, %ecx
> +       bzhil   %edx, %ecx, %ecx
> +       kmovd   %ecx, %k1
> +# endif
> +       vmovdqu8 %VMM(0), (%rax){%k1}
> +       VZEROUPPER_RETURN
> +#endif
> +
>  #if defined USE_MULTIARCH && IS_IN (libc)
>  END (MEMSET_SYMBOL (__memset, unaligned))
>
> @@ -185,54 +223,6 @@ L(last_2x_vec):
>  #endif
>         VZEROUPPER_RETURN
>
> -       /* If have AVX512 mask instructions put L(less_vec) close to
> -          entry as it doesn't take much space and is likely a hot target.
> -        */
> -#ifdef USE_LESS_VEC_MASK_STORE
> -       .p2align 4,, 10
> -L(less_vec):
> -L(less_vec_from_wmemset):
> -       /* Less than 1 VEC.  */
> -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
> -#  error Unsupported VEC_SIZE!
> -# endif
> -       /* Clear high bits from edi. Only keeping bits relevant to page
> -          cross check. Note that we are using rax which is set in
> -          MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
> -       andl    $(PAGE_SIZE - 1), %edi
> -       /* Check if VEC_SIZE store cross page. Mask stores suffer
> -          serious performance degradation when it has to fault suppress.
> -        */
> -       cmpl    $(PAGE_SIZE - VEC_SIZE), %edi
> -       /* This is generally considered a cold target.  */
> -       ja      L(cross_page)
> -# if VEC_SIZE > 32
> -       movq    $-1, %rcx
> -       bzhiq   %rdx, %rcx, %rcx
> -       kmovq   %rcx, %k1
> -# else
> -       movl    $-1, %ecx
> -       bzhil   %edx, %ecx, %ecx
> -       kmovd   %ecx, %k1
> -# endif
> -       vmovdqu8 %VMM(0), (%rax){%k1}
> -       VZEROUPPER_RETURN
> -
> -# if defined USE_MULTIARCH && IS_IN (libc)
> -       /* Include L(stosb_local) here if including L(less_vec) between
> -          L(stosb_more_2x_vec) and ENTRY. This is to cache align the
> -          L(stosb_more_2x_vec) target.  */
> -       .p2align 4,, 10
> -L(stosb_local):
> -       movzbl  %sil, %eax
> -       mov     %RDX_LP, %RCX_LP
> -       mov     %RDI_LP, %RDX_LP
> -       rep     stosb
> -       mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
> -# endif
> -#endif
> -
>  #if defined USE_MULTIARCH && IS_IN (libc)
>         .p2align 4
>  L(stosb_more_2x_vec):
> @@ -318,21 +308,33 @@ L(return_vzeroupper):
>         ret
>  #endif
>
> -       .p2align 4,, 10
> -#ifndef USE_LESS_VEC_MASK_STORE
> -# if defined USE_MULTIARCH && IS_IN (libc)
> +#ifdef USE_WITH_AVX2
> +       .p2align 4
> +#else
> +       .p2align 4,, 4
> +#endif
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> +       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
>         mov     %RDI_LP, %RDX_LP
>         rep     stosb
> +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
> +       /* Use xchg to save 1-byte (this helps align targets below).  */
> +       xchg    %RDX_LP, %RAX_LP
> +# else
>         mov     %RDX_LP, %RAX_LP
> -       VZEROUPPER_RETURN
>  # endif
> +       VZEROUPPER_RETURN
> +#endif
> +#ifndef USE_LESS_VEC_MASK_STORE
>         /* Define L(less_vec) only if not otherwise defined.  */
> -       .p2align 4
> +       .p2align 4,, 12
>  L(less_vec):
>         /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
>            xmm). This is only does anything for AVX2.  */
> @@ -423,4 +425,35 @@ L(between_2_3):
>         movb    %SET_REG8, -1(%LESS_VEC_REG, %rdx)
>  #endif
>         ret
> -END (MEMSET_SYMBOL (__memset, unaligned_erms))
> +
> +#if defined USE_MULTIARCH && IS_IN (libc)
> +# ifdef USE_WITH_AVX512
> +       /* Force align so the loop doesn't cross a cache-line.  */
> +       .p2align 4
> +# endif
> +       .p2align 4,, 7
> +    /* Memset using non-temporal stores.  */
> +L(nt_memset):
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       leaq    (VEC_SIZE * -4)(%rdi, %rdx), %rdx
> +    /* Align DST.  */
> +       orq     $(VEC_SIZE * 1 - 1), %rdi
> +       incq    %rdi
> +       .p2align 4,, 7
> +L(nt_loop):
> +       VMOVNT  %VMM(0), (VEC_SIZE * 0)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 1)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 2)(%rdi)
> +       VMOVNT  %VMM(0), (VEC_SIZE * 3)(%rdi)
> +       subq    $(VEC_SIZE * -4), %rdi
> +       cmpq    %rdx, %rdi
> +       jb      L(nt_loop)
> +       sfence
> +       VMOVU   %VMM(0), (VEC_SIZE * 0)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 1)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 2)(%rdx)
> +       VMOVU   %VMM(0), (VEC_SIZE * 3)(%rdx)
> +       VZEROUPPER_RETURN
> +#endif
> +
> +END(MEMSET_SYMBOL(__memset, unaligned_erms))
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread
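
The core of the patch quoted above is the new L(nt_memset) block. For
readers who prefer C over AT&T assembly, the strategy reduces to the
sketch below, written with SSE2 intrinsics for the VEC_SIZE == 16
flavour. This is only an illustration of the technique: the function
name and standalone framing are mine, not part of the patch, and the
real code is the VEC_SIZE-generic assembly above, reached only once
the size is already known to be at or above the non-temporal
threshold.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Rough C model of L(nt_memset): one unaligned head store, an aligned
   loop of 4 non-temporal vector stores, a fence, then 4 unaligned
   tail stores.  Assumes len is well above 4 * VEC_SIZE.  */
static void *
nt_memset_sketch (void *dst, int c, size_t len)
{
  enum { VEC_SIZE = 16 };
  const __m128i vc = _mm_set1_epi8 ((char) c);
  char *p = (char *) dst;

  /* Cover the possibly misaligned head with one unaligned store.  */
  _mm_storeu_si128 ((__m128i *) p, vc);

  /* The last 4 * VEC_SIZE bytes are written after the loop.  */
  char *tail = p + len - 4 * VEC_SIZE;

  /* Round the loop pointer up to the next VEC_SIZE boundary
     (the `orq'/`incq' pair in the assembly).  */
  p = (char *) (((uintptr_t) p | (VEC_SIZE - 1)) + 1);

  while (p < tail)
    {
      /* Non-temporal stores bypass the cache and avoid RFO traffic.  */
      _mm_stream_si128 ((__m128i *) (p + 0 * VEC_SIZE), vc);
      _mm_stream_si128 ((__m128i *) (p + 1 * VEC_SIZE), vc);
      _mm_stream_si128 ((__m128i *) (p + 2 * VEC_SIZE), vc);
      _mm_stream_si128 ((__m128i *) (p + 3 * VEC_SIZE), vc);
      p += 4 * VEC_SIZE;
    }

  /* NT stores are weakly ordered; fence before returning so callers
     keep the usual memset ordering guarantees.  */
  _mm_sfence ();

  /* Four unaligned stores covering the tail, as in the VMOVU block
     after L(nt_loop).  */
  _mm_storeu_si128 ((__m128i *) (tail + 0 * VEC_SIZE), vc);
  _mm_storeu_si128 ((__m128i *) (tail + 1 * VEC_SIZE), vc);
  _mm_storeu_si128 ((__m128i *) (tail + 2 * VEC_SIZE), vc);
  _mm_storeu_si128 ((__m128i *) (tail + 3 * VEC_SIZE), vc);
  return dst;
}

Note the deliberate overlap: the unaligned head and tail stores may
rewrite bytes also covered by the loop, which is cheaper than
computing exact remainders.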

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-05-24 17:38   ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
  2024-05-29 16:19     ` Noah Goldstein
@ 2024-05-29 22:53     ` H.J. Lu
  2024-06-14 10:40       ` Borislav Petkov
  1 sibling, 1 reply; 21+ messages in thread
From: H.J. Lu @ 2024-05-29 22:53 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: libc-alpha, carlos

On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> The tuning for non-temporal stores for memset vs memcpy is not always
> the same. This includes both the exact value and whether non-temporal
> stores are profitable at all for a given arch.
>
> This patch adds `x86_memset_non_temporal_threshold`. Currently we
> disable non-temporal stores for non-Intel vendors, as the only
> benchmarks showing its benefit have been on Intel hardware.
> ---
>  manual/tunables.texi                             | 16 +++++++++++++++-
>  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
>  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
>  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
>  sysdeps/x86/dl-tunables.list                     |  3 +++
>  sysdeps/x86/include/cpu-features.h               |  4 +++-
>  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
>  7 files changed, 49 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index baaf751721..8dd02d8149 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
>  glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
>  glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
> +glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
>  glibc.cpu.x86_shstk:
>  glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
>  glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
> @@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
>  @cindex shared_cache_size tunables
>  @cindex tunables, shared_cache_size
>  @cindex non_temporal_threshold tunables
> -@cindex tunables, non_temporal_threshold
> +@cindex memset_non_temporal_threshold tunables
> +@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold
>
>  @deftp {Tunable namespace} glibc.cpu
>  Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
> @@ -574,6 +576,18 @@ like memmove and memcpy.
>  This tunable is specific to i386 and x86-64.
>  @end deftp
>
> +@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
> +The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
> +the user to set threshold in bytes for non temporal store in
> +memset. Non temporal stores give a hint to the hardware to move data
> +directly to memory without displacing other data from the cache. This
> +tunable is used by some platforms to determine when to use non
> +temporal stores memset.
> +
> +This tunable is specific to i386 and x86-64.
> +@end deftp
> +
> +
>  @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
>  The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
>  set threshold in bytes to start using "rep movsb".  The value must be
> diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
> index ab73556772..83491607c7 100644
> --- a/sysdeps/x86/cacheinfo.h
> +++ b/sysdeps/x86/cacheinfo.h
> @@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
>  long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
>  long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;
>
> -/* Threshold to use non temporal store.  */
> +/* Threshold to use non temporal store in memmove.  */
>  long int __x86_shared_non_temporal_threshold attribute_hidden;
>
> +/* Threshold to use non temporal store in memset.  */
> +long int __x86_memset_non_temporal_threshold attribute_hidden;
> +
>  /* Threshold to use Enhanced REP MOVSB.  */
>  long int __x86_rep_movsb_threshold attribute_hidden = 2048;
>
> @@ -77,6 +80,9 @@ init_cacheinfo (void)
>    __x86_shared_non_temporal_threshold
>      = cpu_features->non_temporal_threshold;
>
> +  __x86_memset_non_temporal_threshold
> +      = cpu_features->memset_non_temporal_threshold;
> +
>    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
>    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
>    __x86_rep_movsb_stop_threshold =  cpu_features->rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 5a98f70364..d375a7cba6 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
>      rep_movsb_threshold = 2112;
>
> +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> +     Until we benchmark data on other x86 processor, disable non-temporal
> +     stores in memset. */
> +  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
> +  if (cpu_features->basic.kind == arch_kind_intel)
> +      memset_non_temporal_threshold = non_temporal_threshold;
> +
>     /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
>        cases slower than the vectorized path (and for some alignments,
>        it is really slow, check BZ #30994).  */
> @@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        && tunable_size <= maximum_non_temporal_threshold)
>      non_temporal_threshold = tunable_size;
>
> +  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
> +  if (tunable_size > minimum_non_temporal_threshold
> +      && tunable_size <= maximum_non_temporal_threshold)
> +    memset_non_temporal_threshold = tunable_size;
> +
>    tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
>    if (tunable_size > minimum_rep_movsb_threshold)
>      rep_movsb_threshold = tunable_size;
> @@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
>                            minimum_non_temporal_threshold,
>                            maximum_non_temporal_threshold);
> +  TUNABLE_SET_WITH_BOUNDS (
> +      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
> +      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
>                            minimum_rep_movsb_threshold, SIZE_MAX);
>    TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
> @@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->data_cache_size = data;
>    cpu_features->shared_cache_size = shared;
>    cpu_features->non_temporal_threshold = non_temporal_threshold;
> +  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
>    cpu_features->rep_movsb_threshold = rep_movsb_threshold;
>    cpu_features->rep_stosb_threshold = rep_stosb_threshold;
>    cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
> diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
> index ceafde9481..49eeb5f70a 100644
> --- a/sysdeps/x86/dl-diagnostics-cpu.c
> +++ b/sysdeps/x86/dl-diagnostics-cpu.c
> @@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
>                              cpu_features->shared_cache_size);
>    print_cpu_features_value ("non_temporal_threshold",
>                              cpu_features->non_temporal_threshold);
> +  print_cpu_features_value ("memset_non_temporal_threshold",
> +                            cpu_features->memset_non_temporal_threshold);
>    print_cpu_features_value ("rep_movsb_threshold",
>                              cpu_features->rep_movsb_threshold);
>    print_cpu_features_value ("rep_movsb_stop_threshold",
> diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
> index 7d82da0dec..a0a1299592 100644
> --- a/sysdeps/x86/dl-tunables.list
> +++ b/sysdeps/x86/dl-tunables.list
> @@ -30,6 +30,9 @@ glibc {
>      x86_non_temporal_threshold {
>        type: SIZE_T
>      }
> +    x86_memset_non_temporal_threshold {
> +      type: SIZE_T
> +    }
>      x86_rep_movsb_threshold {
>        type: SIZE_T
>        # Since there is overhead to set up REP MOVSB operation, REP
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index cd7bd27cf3..aaae44f0e1 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -944,8 +944,10 @@ struct cpu_features
>    /* Shared cache size for use in memory and string routines, typically
>       L2 or L3 size.  */
>    unsigned long int shared_cache_size;
> -  /* Threshold to use non temporal store.  */
> +  /* Threshold to use non temporal store in memmove.  */
>    unsigned long int non_temporal_threshold;
> +  /* Threshold to use non temporal store in memset.  */
> +  unsigned long int memset_non_temporal_threshold;
>    /* Threshold to use "rep movsb".  */
>    unsigned long int rep_movsb_threshold;
>    /* Threshold to stop using "rep movsb".  */
> diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> index 637caadb40..88bf08e4f4 100644
> --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
> @@ -24,9 +24,9 @@
>     5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
>        4 VEC stores and store 4 * VEC at a time until done.
>     6. On machines ERMS feature, if size is range
> -         [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
> +         [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
>           then REP STOSB will be used.
> -   7. If size >= __x86_shared_non_temporal_threshold, use a
> +   7. If size >= __x86_memset_non_temporal_threshold, use a
>           non-temporal stores.  */
>
>  #include <sysdep.h>
> @@ -318,7 +318,7 @@ L(return_vzeroupper):
>         /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
>            range for 2-byte jump encoding.  */
>  L(stosb_local):
> -       cmp     __x86_shared_non_temporal_threshold(%rip), %RDX_LP
> +       cmp     __x86_memset_non_temporal_threshold(%rip), %RDX_LP
>         jae     L(nt_memset)
>         movzbl  %sil, %eax
>         mov     %RDX_LP, %RCX_LP
> --
> 2.34.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 21+ messages in thread
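
Since the new threshold is exposed as a regular glibc tunable (see the
manual/tunables.texi and dl-tunables.list hunks above), it can be
overridden per process through the environment once this lands, e.g.:

  GLIBC_TUNABLES=glibc.cpu.x86_memset_non_temporal_threshold=0x600000 ./a.out

The value and binary name here are only placeholders; per the
dl-cacheinfo.h hunk, an override takes effect only when it falls in
(minimum_non_temporal_threshold, maximum_non_temporal_threshold].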

* Re: [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
  2024-05-23 17:18           ` Adhemerval Zanella Netto
  2024-05-24 17:39             ` Noah Goldstein
@ 2024-05-30 20:02             ` Noah Goldstein
  1 sibling, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-05-30 20:02 UTC (permalink / raw)
  To: GNU C Library; +Cc: H.J. Lu, Adhemerval Zanella, Joe Damato

On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access to.

Adding Joe into the thread. He has access to some AMD machines
and will run the benchmarks for us.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-05-29 22:53     ` H.J. Lu
@ 2024-06-14 10:40       ` Borislav Petkov
  2024-06-14 16:39         ` Noah Goldstein
  0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2024-06-14 10:40 UTC (permalink / raw)
  To: H.J. Lu, Noah Goldstein; +Cc: libc-alpha

Hi,

I'm not subscribed to the glibc list - pls CC me directly on replies.

On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The tuning for non-temporal stores for memset vs memcpy is not always
> > the same. This includes both the exact value and whether non-temporal
> > stores are profitable at all for a given arch.
> >
> > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > disable non-temporal stores for non Intel vendors as the only
> > benchmarks showing its benefit have been on Intel hardware.
> > ---
> >  manual/tunables.texi                             | 16 +++++++++++++++-
> >  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
> >  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
> >  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
> >  sysdeps/x86/dl-tunables.list                     |  3 +++
> >  sysdeps/x86/include/cpu-features.h               |  4 +++-
> >  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
> >  7 files changed, 49 insertions(+), 6 deletions(-)

...

> > +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> > +     Until we benchmark data on other x86 processor, disable non-temporal
> > +     stores in memset. */

Well, something's fishy here:

$ ./elf/ld.so --list-tunables | grep threshold
glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
					    ^^^^^^^^^

on glibc-2.39.9000-300-g54c1efdac55b from git.

That's on an AMD Zen1 so I'd expect that memset NT threshold to be
0xffffffffffffffff by default...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-06-14 10:40       ` Borislav Petkov
@ 2024-06-14 16:39         ` Noah Goldstein
  2024-06-14 18:01           ` Borislav Petkov
  0 siblings, 1 reply; 21+ messages in thread
From: Noah Goldstein @ 2024-06-14 16:39 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: H.J. Lu, libc-alpha

On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
>
> Hi,
>
> I'm not subscribed to the glibc list - pls CC me directly on replies.
>
> On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > the same. This includes both the exact value and whether non-temporal
> > > stores are profitable at all for a given arch.
> > >
> > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > disable non-temporal stores for non Intel vendors as the only
> > > benchmarks showing its benefit have been on Intel hardware.
> > > ---
> > >  manual/tunables.texi                             | 16 +++++++++++++++-
> > >  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
> > >  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
> > >  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
> > >  sysdeps/x86/dl-tunables.list                     |  3 +++
> > >  sysdeps/x86/include/cpu-features.h               |  4 +++-
> > >  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
> > >  7 files changed, 49 insertions(+), 6 deletions(-)
>
> ...
>
> > > +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > +     Until we benchmark data on other x86 processor, disable non-temporal
> > > +     stores in memset. */
>
> Well, something's fishy here:
>
> $ ./elf/ld.so --list-tunables | grep threshold
> glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>                                             ^^^^^^^^^
>
> on glibc-2.39.9000-300-g54c1efdac55b from git.
>
> That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> 0xffffffffffffffff by default...
>
> Thx.
>

Thanks for bringing this up, looking into it.
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-06-14 16:39         ` Noah Goldstein
@ 2024-06-14 18:01           ` Borislav Petkov
  2024-06-14 18:02             ` Noah Goldstein
  0 siblings, 1 reply; 21+ messages in thread
From: Borislav Petkov @ 2024-06-14 18:01 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: H.J. Lu, libc-alpha, Michael Matz

On Fri, Jun 14, 2024 at 11:39:07AM -0500, Noah Goldstein wrote:
> On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > Hi,
> >
> > I'm not subscribed to the glibc list - pls CC me directly on replies.
> >
> > On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > > On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > > the same. This includes both the exact value and whether non-temporal
> > > > stores are profitable at all for a given arch.
> > > >
> > > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > > disable non-temporal stores for non Intel vendors as the only
> > > > benchmarks showing its benefit have been on Intel hardware.
> > > > ---
> > > >  manual/tunables.texi                             | 16 +++++++++++++++-
> > > >  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
> > > >  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
> > > >  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
> > > >  sysdeps/x86/dl-tunables.list                     |  3 +++
> > > >  sysdeps/x86/include/cpu-features.h               |  4 +++-
> > > >  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
> > > >  7 files changed, 49 insertions(+), 6 deletions(-)
> >
> > ...
> >
> > > > +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > > +     Until we benchmark data on other x86 processor, disable non-temporal
> > > > +     stores in memset. */
> >
> > Well, something's fishy here:
> >
> > $ ./elf/ld.so --list-tunables | grep threshold
> > glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> > glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> > glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> > glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> >                                             ^^^^^^^^^
> >
> > on glibc-2.39.9000-300-g54c1efdac55b from git.
> >
> > That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> > 0xffffffffffffffff by default...
> >
> > Thx.
> >
> 
> Thanks for bringing this up, looking into it.

Thx, so Michael debugged it yesterday down to the ranges mismatching:

diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
index 147cc4cf23f5..ecf3c1d3736e 100644
--- a/elf/dl-tunables.c
+++ b/elf/dl-tunables.c
@@ -110,8 +110,11 @@ do_tunable_update_val (tunable_t *cur, const tunable_val_t *valp,
 
   /* Bail out if the bounds are not valid.  */
   if (tunable_val_lt (val, min, unsigned_cmp)
-      || tunable_val_lt (max, val, unsigned_cmp))
+      || tunable_val_lt (max, val, unsigned_cmp)) {
+         _dl_printf("bail out due to: 0x%lx, min: 0x%lx, max: 0x%lx\n",
+                    val, min, max);
     return;
+  }
 
   cur->val.numval = val;
   cur->type.min = min;

$ ./elf/ld.so --list-tunables | grep -E "(threshold|bail)"
dl_init_cacheinfo: memset_non_temporal_threshold: 0xffffffffffffffff
dl_init_cacheinfo: memset_non_temporal_threshold, tunable_size: 0xffffffffffffffff
bail out due to: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
^^^^^^^

dl_init_cacheinfo: memset_non_temporal_threshold, tunable set: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)

but you guys probably should do the right fix here.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 21+ messages in thread
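
The mismatch Michael's printout exposes can be reduced to a few lines:
the non-Intel default for the memset threshold is SIZE_MAX ("never use
NT stores"), which is larger than the maximum registered for the
tunable (0xfffffffffffffff above), so the bounds check silently drops
the update and --list-tunables ends up reporting 0. A toy model of
that interaction follows; it is purely illustrative (the helper is
hypothetical, not glibc's actual code), and the clamping shown at the
end is just one conceivable remedy, not necessarily the fix that was
eventually posted:

#include <stdint.h>
#include <stdio.h>

/* Shape of the "bail out if the bounds are not valid" check.  */
static void
set_with_bounds (uint64_t *cur, uint64_t val, uint64_t min, uint64_t max)
{
  if (val < min || val > max)
    return;                       /* update silently dropped */
  *cur = val;
}

int
main (void)
{
  uint64_t tunable = 0;                /* what --list-tunables reports */
  uint64_t def = UINT64_MAX;           /* non-Intel default: NT memset disabled */
  uint64_t min = 0x4040;               /* bounds seen in the printout above */
  uint64_t max = 0xfffffffffffffffULL;

  set_with_bounds (&tunable, def, min, max);
  printf ("%#llx\n", (unsigned long long) tunable);   /* prints 0 */

  /* Clamping the default to the registered maximum would satisfy the
     check -- shown only to make the failure mode concrete.  */
  set_with_bounds (&tunable, def > max ? max : def, min, max);
  printf ("%#llx\n", (unsigned long long) tunable);   /* prints 0xfffffffffffffff */
  return 0;
}

Note that in the dl-cacheinfo.h hunk the cpu_features field is
assigned from the local variable after the TUNABLE_SET_WITH_BOUNDS
call, so the threshold memset actually consults was presumably still
SIZE_MAX; the mismatch mainly corrupts what --list-tunables and the
dl-diagnostics output report.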

* Re: [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset
  2024-06-14 18:01           ` Borislav Petkov
@ 2024-06-14 18:02             ` Noah Goldstein
  0 siblings, 0 replies; 21+ messages in thread
From: Noah Goldstein @ 2024-06-14 18:02 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: H.J. Lu, libc-alpha, Michael Matz

On Fri, Jun 14, 2024 at 1:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Jun 14, 2024 at 11:39:07AM -0500, Noah Goldstein wrote:
> > On Fri, Jun 14, 2024 at 5:41 AM Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > Hi,
> > >
> > > I'm not subscribed to the glibc list - pls CC me directly on replies.
> > >
> > > On Wed, May 29, 2024 at 03:53:20PM -0700, H.J. Lu wrote:
> > > > On Fri, May 24, 2024 at 10:39 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > > >
> > > > > The tuning for non-temporal stores for memset vs memcpy is not always
> > > > > the same. This includes both the exact value and whether non-temporal
> > > > > stores are profitable at all for a given arch.
> > > > >
> > > > > This patch add `x86_memset_non_temporal_threshold`. Currently we
> > > > > disable non-temporal stores for non Intel vendors as the only
> > > > > benchmarks showing its benefit have been on Intel hardware.
> > > > > ---
> > > > >  manual/tunables.texi                             | 16 +++++++++++++++-
> > > > >  sysdeps/x86/cacheinfo.h                          |  8 +++++++-
> > > > >  sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
> > > > >  sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
> > > > >  sysdeps/x86/dl-tunables.list                     |  3 +++
> > > > >  sysdeps/x86/include/cpu-features.h               |  4 +++-
> > > > >  .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
> > > > >  7 files changed, 49 insertions(+), 6 deletions(-)
> > >
> > > ...
> > >
> > > > > +  /* Non-temporal stores in memset have only been tested on Intel hardware.
> > > > > +     Until we benchmark data on other x86 processor, disable non-temporal
> > > > > +     stores in memset. */
> > >
> > > Well, something's fishy here:
> > >
> > > $ ./elf/ld.so --list-tunables | grep threshold
> > > glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_rep_movsb_threshold: 0x600000 (min: 0x100, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_non_temporal_threshold: 0x600000 (min: 0x4040, max: 0xfffffffffffffff)
> > > glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_rep_stosb_threshold: 0xffffffffffffffff (min: 0x1, max: 0xffffffffffffffff)
> > > glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
> > >                                             ^^^^^^^^^
> > >
> > > on glibc-2.39.9000-300-g54c1efdac55b from git.
> > >
> > > That's on a AMD Zen1 so I'd expect that memset NT threshold to be
> > > 0xffffffffffffffff by default...
> > >
> > > Thx.
> > >
> >
> > Thanks for bringing this up, looking into it.
>
> Thx, so Michael did debug it yesterday to the ranges mismatching:
>
> diff --git a/elf/dl-tunables.c b/elf/dl-tunables.c
> index 147cc4cf23f5..ecf3c1d3736e 100644
> --- a/elf/dl-tunables.c
> +++ b/elf/dl-tunables.c
> @@ -110,8 +110,11 @@ do_tunable_update_val (tunable_t *cur, const tunable_val_t *valp,
>
>    /* Bail out if the bounds are not valid.  */
>    if (tunable_val_lt (val, min, unsigned_cmp)
> -      || tunable_val_lt (max, val, unsigned_cmp))
> +      || tunable_val_lt (max, val, unsigned_cmp)) {
> +         _dl_printf("bail out due to: 0x%lx, min: 0x%lx, max: 0x%lx\n",
> +                    val, min, max);
>      return;
> +  }
>
>    cur->val.numval = val;
>    cur->type.min = min;
>
> $ ./elf/ld.so --list-tunables | grep -E "(threshold|bail)"
> dl_init_cacheinfo: memset_non_temporal_threshold: 0xffffffffffffffff
> dl_init_cacheinfo: memset_non_temporal_threshold, tunable_size: 0xffffffffffffffff
> bail out due to: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
> ^^^^^^^
>
> dl_init_cacheinfo: memset_non_temporal_threshold, tunable set: 0xffffffffffffffff, min: 0x4040, max: 0xfffffffffffffff
> glibc.cpu.x86_memset_non_temporal_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
>
> but you guys probably should do the right fix here.

Just posted the fix, you should be CCd on it.
>
> Thx.
>
> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-06-14 18:02 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-19  0:43 [PATCH v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312] Noah Goldstein
2024-05-20 17:14 ` H.J. Lu
2024-05-20 19:16 ` Adhemerval Zanella Netto
2024-05-20 22:47   ` Noah Goldstein
2024-05-20 22:50     ` Noah Goldstein
2024-05-23 16:46       ` Noah Goldstein
2024-05-23 16:49         ` H.J. Lu
2024-05-23 17:18           ` Adhemerval Zanella Netto
2024-05-24 17:39             ` Noah Goldstein
2024-05-30 20:02             ` Noah Goldstein
2024-05-20 19:37 ` Zack Weinberg
2024-05-20 22:48   ` Noah Goldstein
2024-05-24 17:38 ` [PATCH v2 1/2] " Noah Goldstein
2024-05-24 17:38   ` [PATCH v2 2/2] x86: Add seperate non-temporal tunable for memset Noah Goldstein
2024-05-29 16:19     ` Noah Goldstein
2024-05-29 22:53     ` H.J. Lu
2024-06-14 10:40       ` Borislav Petkov
2024-06-14 16:39         ` Noah Goldstein
2024-06-14 18:01           ` Borislav Petkov
2024-06-14 18:02             ` Noah Goldstein
2024-05-29 22:52   ` [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312] H.J. Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).