[PATCH v1] x86: Improve memset-vec-unaligned-erms.S
From: Noah Goldstein @ 2021-05-20 18:44 UTC
To: libc-alpha

No bug. This commit makes a few small improvements to
memset-vec-unaligned-erms.S:

1) Only align to 64 bytes instead of 128. Either alignment performs
   equally well inside the loop, but 128 increases the odds of needing
   an extra iteration, which can be significant overhead for small
   sizes (see the sketch below).
2) Align some branch targets and the loop itself.
3) Remove one ALU instruction from the alignment computation.
4) Reorder the last 4x VEC stores so that they happen after the loop.
5) Move the check for length <= 8x VEC to before the alignment
   computation.

test-memset and test-wmemset are both passing.
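For reference, here is a minimal C model of the new alignment step
(my own illustrative sketch, not code from the patch; VEC_SIZE = 32
assumed, as in the avx2/evex builds):

#include <stdint.h>
#include <stdio.h>

#define VEC_SIZE 32

/* Model of `andq $-(VEC_SIZE * 2), %rdi; subq $-(VEC_SIZE * 4), %rdi':
   round DST down to 64, then step forward 128 past the 4x VEC already
   stored unaligned.  The result is 64-byte aligned and always lands in
   (DST + 64, DST + 128], so no byte is left unwritten.  */
static uintptr_t
loop_start_ptr (uintptr_t dst)
{
  return (dst & -(uintptr_t) (VEC_SIZE * 2)) + VEC_SIZE * 4;
}

int
main (void)
{
  for (uintptr_t dst = 4096; dst < 4096 + 128; dst++)
    {
      uintptr_t p = loop_start_ptr (dst);
      if (p % 64 != 0 || p <= dst + VEC_SIZE * 2 || p > dst + VEC_SIZE * 4)
        printf ("unexpected: dst=%lx p=%lx\n",
                (unsigned long) dst, (unsigned long) p);
    }
  puts ("ok");
  return 0;
}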

Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
---
Tests were run on the following CPUs:

Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html

Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html

Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html

All times are the geometric mean of N=50. The unit of time is
seconds.
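(For reference, a minimal sketch of how such a summary statistic can
be computed; this helper is my own, not glibc benchtests code:)

#include <math.h>
#include <stddef.h>

/* Geometric mean of n positive timings, computed in log space to
   avoid overflow/underflow from multiplying all 50 samples directly.  */
static double
geometric_mean (const double *times, size_t n)
{
  double log_sum = 0.0;
  for (size_t i = 0; i < n; i++)
    log_sum += log (times[i]);
  return exp (log_sum / (double) n);
}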

"Cur" refers to the current implementation
"New" refers to this patches implementation

Performance data is attached in memset-data.pdf.
Some notes on the numbers:

I only included numbers for the proper VEC_SIZE for the
corresponding CPU:

skl -> avx2
icl -> evex
tgl -> evex
The changes only affect sizes > 2 * VEC_SIZE. The performance
differences for sizes <= 2 * VEC_SIZE come from changes in alignment
after linking (i.e., ENTRY aligns to 16 bytes, but performance can be
affected by alignment % 64 or alignment % 4096) and generally affect
throughput only, not latency (i.e., with an lfence added to the
benchmark loop the deviations go away; a sketch of such a serialized
loop follows below). Generally I think they can be ignored (both the
positive and negative effects).
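A rough sketch of the serialized timing loop I mean (illustrative
only; the structure and names are my own, not glibc benchtests code):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence */

/* Without the lfence, independent memset calls overlap and the
   measurement reflects throughput, where frontend/code-alignment
   effects show up.  With the lfence, each call must complete before
   the next starts, and the deviations go away.  */
static uint64_t
time_memset (char *buf, size_t len, size_t iters, int serialize)
{
  uint64_t start = __rdtsc ();
  for (size_t i = 0; i < iters; i++)
    {
      memset (buf, 0xff, len);
      __asm__ volatile ("" : : "r" (buf) : "memory"); /* keep the call */
      if (serialize)
        _mm_lfence ();
    }
  return __rdtsc () - start;
}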

The interesting part of the data is in the medium size range [128,
1024], where the new implementation has a reasonable speedup. This is
especially pronounced when the more conservative alignment saves a
full loop iteration. The only significant exception is the
skylake-avx2-erms case for size = 416, alignment = 416, where the
current implementation is meaningfully faster. I am unsure of the
root cause. The skylake-avx2 case only performs a bit worse there,
which makes me think part of it is code-alignment related, though
compared to the speedup in other size/alignment configurations it is
still a trough. Despite this, I still think the numbers are overall
an improvement.

As well, due to aligning the loop (and possibly slightly more
efficient DSB behavior from replacing addq $(VEC_SIZE * 4) in the
loop with subq $-(VEC_SIZE * 4)), the non-erms cases often show a
slight improvement in the main loop for large sizes. The subq form
encodes -128 as a sign-extended imm8, while +128 needs a full imm32,
saving 3 bytes per instruction in the loop body.
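Concretely (a standalone demo of mine, not part of the patch; with
VEC_SIZE == 32, 4 * VEC_SIZE == 128):

/* Compile and inspect with e.g.: gcc -O2 -c imm.c && objdump -d imm.o  */

void
add_form (void)
{
  /* 48 81 c7 80 00 00 00	addq $0x80, %rdi   (7 bytes)  */
  __asm__ volatile ("addq $128, %%rdi" : : : "rdi");
}

void
sub_form (void)
{
  /* 48 83 ef 80		subq $-0x80, %rdi  (4 bytes)  */
  __asm__ volatile ("subq $-128, %%rdi" : : : "rdi");
}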

 .../multiarch/memset-vec-unaligned-erms.S     | 50 +++++++++++--------
 1 file changed, 28 insertions(+), 22 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 08cfa49bd1..ff196844a0 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	VMOVU	%VEC(0), (%rdi)
 	VZEROUPPER_RETURN
 
+	.p2align 4
 L(stosb_more_2x_vec):
 	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
+#else
+	.p2align 4
 #endif
 L(more_2x_vec):
-	cmpq  $(VEC_SIZE * 4), %rdx
-	ja	L(loop_start)
+	/* Stores to first 2x VEC before cmp as any path forward will
+	   require it.  */
 	VMOVU	%VEC(0), (%rdi)
 	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+	cmpq	$(VEC_SIZE * 4), %rdx
+	ja	L(loop_start)
 	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
 L(return):
 #if VEC_SIZE > 16
 	ZERO_UPPER_VEC_REGISTERS_RETURN
@@ -192,28 +197,29 @@ L(return):
 #endif
 
 L(loop_start):
-	leaq	(VEC_SIZE * 4)(%rdi), %rcx
-	VMOVU	%VEC(0), (%rdi)
-	andq	$-(VEC_SIZE * 4), %rcx
-	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
-	VMOVU	%VEC(0), VEC_SIZE(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 2)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx)
 	VMOVU	%VEC(0), (VEC_SIZE * 3)(%rdi)
-	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx)
-	addq	%rdi, %rdx
-	andq	$-(VEC_SIZE * 4), %rdx
-	cmpq	%rdx, %rcx
-	je	L(return)
+	cmpq	$(VEC_SIZE * 8), %rdx
+	jbe	L(loop_end)
+	andq	$-(VEC_SIZE * 2), %rdi
+	subq	$-(VEC_SIZE * 4), %rdi
+	leaq	-(VEC_SIZE * 4)(%rax, %rdx), %rcx
+	.p2align 4
 L(loop):
-	VMOVA	%VEC(0), (%rcx)
-	VMOVA	%VEC(0), VEC_SIZE(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rcx)
-	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rcx)
-	addq	$(VEC_SIZE * 4), %rcx
-	cmpq	%rcx, %rdx
-	jne	L(loop)
+	VMOVA	%VEC(0), (%rdi)
+	VMOVA	%VEC(0), VEC_SIZE(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 2)(%rdi)
+	VMOVA	%VEC(0), (VEC_SIZE * 3)(%rdi)
+	subq	$-(VEC_SIZE * 4), %rdi
+	cmpq	%rcx, %rdi
+	jb	L(loop)
+L(loop_end):
+	/* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN.
+	       rdx as length is also unchanged.  */
+	VMOVU	%VEC(0), -(VEC_SIZE * 4)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 3)(%rax, %rdx)
+	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rax, %rdx)
+	VMOVU	%VEC(0), -VEC_SIZE(%rax, %rdx)
 	VZEROUPPER_SHORT_RETURN
 
 	.p2align 4
-- 
2.25.1

