public inbox for libc-alpha@sourceware.org
* [PATCH 0/7] [BZ #19776] Improve x86-64 memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:36 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

This set of patches improves x86-64 memcpy-sse2-unaligned.S by

1. Removing dead code.
2. Setting RAX to the return value on entry, as sketched below.
3. Removing unnecessary code.
4. Adding entry points for __mempcpy_chk_sse2_unaligned,
__mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.
5. Enabling __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and
__memcpy_chk_sse2_unaligned.
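
Items 2 and 4 rely on the fact that memcpy and mempcpy differ only in
their return value, so once RAX holds the right result on entry a
single copy body can serve both.  A minimal C sketch of that contract
(reference semantics only, not the glibc assembly; ref_memcpy and
ref_mempcpy are illustrative names):

#include <stdio.h>
#include <string.h>

/* memcpy returns the destination; mempcpy returns the byte just past
   the copied region.  */
static void *
ref_memcpy (void *dest, const void *src, size_t n)
{
  memcpy (dest, src, n);
  return dest;                  /* RAX = dest, set on entry.  */
}

static void *
ref_mempcpy (void *dest, const void *src, size_t n)
{
  memcpy (dest, src, n);
  return (char *) dest + n;     /* RAX = dest + n, set on entry.  */
}

int
main (void)
{
  char buf[16];
  printf ("%d %d\n",
          ref_memcpy (buf, "abc", 3) == (void *) buf,
          ref_mempcpy (buf, "abc", 3) == (void *) (buf + 3));  /* 1 1 */
  return 0;
}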

bench-mempcpy shows:

Ivy Bridge:

                              simple_mempcpy __mempcpy_avx_unaligned __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2
Length  432, alignment 27/ 0:	1628.16	98.3906	73.5625	94.7344	67.1719	139.531
Length  432, alignment  0/27:	1627.84	148.891	80.625	98.8281	104.766	142.625
Length  432, alignment 27/27:	1631.03	90.5469	70.0312	69.5938	76.5469	123.969
Length  448, alignment  0/ 0:	1685.95	79.1875	65.1719	72.9062	70.1406	116.688
Length  448, alignment 28/ 0:	1685.84	89.4531	73.0156	99.5938	86.3594	138.203
Length  448, alignment  0/28:	1684.52	148.016	82.2812	94.8438	103.344	147.578
Length  448, alignment 28/28:	1684.42	86.4688	65.4062	70.0469	72.4688	123.422
Length  464, alignment  0/ 0:	1740.77	70.1406	66.2812	69.2656	71.25	118.234
Length  464, alignment 29/ 0:	1742.31	100.141	75.875	98.9219	83.375	145.594
Length  464, alignment  0/29:	1742.31	148.016	80.2969	107.766	102.031	154.531
Length  464, alignment 29/29:	1740.98	91.5469	64.8438	72.4531	71.4688	127.062
Length  480, alignment  0/ 0:	1967.2	76.875	66.625	71.1406	71.25	123.641
Length  480, alignment 30/ 0:	1799.02	94.3125	72.9062	103.797	80.2969	144.484
Length  480, alignment  0/30:	1797.47	148.453	82.6094	102.906	102.25	158.062
Length  480, alignment 30/30:	1799.02	90.8906	68.0469	69.5938	71.3594	124.844
Length  496, alignment  0/ 0:	1853.83	71.25	68.3906	71.9219	69.1406	123.422
Length  496, alignment 31/ 0:	1855.38	94.8438	74.3438	104.672	73.2344	148.125
Length  496, alignment  0/31:	1853.59	148.906	80.2969	109.297	114.703	163.016
Length  496, alignment 31/31:	1855.27	93.0781	71.4688	72.3438	84.2656	127.953
Length 4096, alignment  0/ 0:	14559.7	509.891	506.469	474.156	508.344	591.062

Nehalem:

                             simple_mempcpy __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2

Length  432, alignment 27/ 0:	113.25	50.9531	64.0312	39.1406	77.6719
Length  432, alignment  0/27:	130.688	45.7969	63.9844	89.1562	133.078
Length  432, alignment 27/27:	118.266	34.4531	36	40.9688	70.7812
Length  448, alignment  0/ 0:	98.2969	34.3594	37.5	39.5156	56.2969
Length  448, alignment 28/ 0:	115.641	51.7969	64.6406	44.6719	77.2031
Length  448, alignment  0/28:	143.297	46.7812	64.9688	88.1719	137.25
Length  448, alignment 28/28:	118.453	34.4531	36.7969	40.0312	70.3125
Length  464, alignment  0/ 0:	101.156	36.0938	37.125	39.4688	63.6562
Length  464, alignment 29/ 0:	118.594	52.6875	69.1406	43.6875	79.9219
Length  464, alignment  0/29:	133.922	46.4062	71.0156	88.5	142.922
Length  464, alignment 29/29:	126.047	36.1406	39.375	39.3281	71.3438
Length  480, alignment  0/ 0:	104.203	36.1406	38.2969	39.2812	59.5312
Length  480, alignment 30/ 0:	120.375	53.2969	69.8438	47.25	80.5312
Length  480, alignment  0/30:	150	47.0625	69.9844	87.4219	148.125
Length  480, alignment 30/30:	126.375	37.9219	37.6875	39.2812	70.8281
Length  496, alignment  0/ 0:	107.016	37.5	39.0938	39.5156	67.2656
Length  496, alignment 31/ 0:	119.719	169.078	71.4375	45.6562	79.4531
Length  496, alignment  0/31:	139.641	47.25	71.2969	101.953	155.062
Length  496, alignment 31/31:	123.844	39.8438	40.6406	45.75	70.5469
Length 4096, alignment  0/ 0:	749.203	245.859	249.609	253.172	292.078

H.J. Lu (7):
  Remove dead code from memcpy-sse2-unaligned.S
  Don't use RAX as scratch register
  Remove L(overlapping) from memcpy-sse2-unaligned.S
  Add entry points for __mempcpy_sse2_unaligned and _chk functions
  Enable __mempcpy_sse2_unaligned
  Enable __mempcpy_chk_sse2_unaligned
  Enable __memcpy_chk_sse2_unaligned

 sysdeps/x86_64/multiarch/ifunc-impl-list.c       |   6 ++
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 125 ++++++++---------------
 sysdeps/x86_64/multiarch/memcpy_chk.S            |  23 +++--
 sysdeps/x86_64/multiarch/mempcpy.S               |  19 ++--
 sysdeps/x86_64/multiarch/mempcpy_chk.S           |  19 ++--
 5 files changed, 85 insertions(+), 107 deletions(-)

-- 
2.5.0

* [PATCH 4/7] Add entry points for __mempcpy_sse2_unaligned and _chk functions
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

Add entry points for __mempcpy_chk_sse2_unaligned,
__mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.
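
In C terms the two checking entry points behave roughly like the model
below: the extra fourth argument is the destination buffer size, and
the copy is refused when it would overflow that buffer.  glibc jumps to
__chk_fail, which aborts with a fortification error; abort () stands in
for it here, and the *_model names are only illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *
memcpy_chk_model (void *dest, const void *src, size_t n, size_t destlen)
{
  if (destlen < n)              /* cmpq %rdx, %rcx; jb __chk_fail  */
    abort ();
  return memcpy (dest, src, n);
}

static void *
mempcpy_chk_model (void *dest, const void *src, size_t n, size_t destlen)
{
  if (destlen < n)
    abort ();
  return (char *) memcpy (dest, src, n) + n;    /* returns dest + n  */
}

int
main (void)
{
  char buf[16];
  memcpy_chk_model (buf, "ok", 3, sizeof buf);
  printf ("%s %td\n", buf,
          (char *) mempcpy_chk_model (buf, "ok", 3, sizeof buf) - buf);
  return 0;                     /* prints "ok 3"  */
}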

	[BZ #19776]
	* sysdeps/x86_64/multiarch/ifunc-impl-list.c
	(__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned,
	__mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned.
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
	(__mempcpy_chk_sse2_unaligned): New.
	(__mempcpy_sse2_unaligned): Likewise.
	(__memcpy_chk_sse2_unaligned): Likewise.
	(L(start)): New label.
---
 sysdeps/x86_64/multiarch/ifunc-impl-list.c       |  6 ++++++
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 20 ++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 188b6d3..47ca468 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -278,6 +278,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      HAS_CPU_FEATURE (SSSE3),
 			      __memcpy_chk_ssse3)
 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk, 1,
+			      __memcpy_chk_sse2_unaligned)
+	      IFUNC_IMPL_ADD (array, i, __memcpy_chk, 1,
 			      __memcpy_chk_sse2))
 
   /* Support sysdeps/x86_64/multiarch/memcpy.S.  */
@@ -314,6 +316,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      HAS_CPU_FEATURE (SSSE3),
 			      __mempcpy_chk_ssse3)
 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk, 1,
+			      __mempcpy_chk_sse2_unaligned)
+	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk, 1,
 			      __mempcpy_chk_sse2))
 
   /* Support sysdeps/x86_64/multiarch/mempcpy.S.  */
@@ -330,6 +334,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			      __mempcpy_ssse3_back)
 	      IFUNC_IMPL_ADD (array, i, mempcpy, HAS_CPU_FEATURE (SSSE3),
 			      __mempcpy_ssse3)
+	      IFUNC_IMPL_ADD (array, i, mempcpy, 1,
+			      __mempcpy_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, mempcpy, 1, __mempcpy_sse2))
 
   /* Support sysdeps/x86_64/multiarch/strncmp.S.  */
diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 335a498..947c50f 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -22,9 +22,29 @@
 
 #include "asm-syntax.h"
 
+# ifdef SHARED
+ENTRY (__mempcpy_chk_sse2_unaligned)
+	cmpq	%rdx, %rcx
+	jb	HIDDEN_JUMPTARGET (__chk_fail)
+END (__mempcpy_chk_sse2_unaligned)
+# endif
+
+ENTRY (__mempcpy_sse2_unaligned)
+	mov	%rdi, %rax
+	add	%rdx, %rax
+	jmp	L(start)
+END (__mempcpy_sse2_unaligned)
+
+# ifdef SHARED
+ENTRY (__memcpy_chk_sse2_unaligned)
+	cmpq	%rdx, %rcx
+	jb	HIDDEN_JUMPTARGET (__chk_fail)
+END (__memcpy_chk_sse2_unaligned)
+# endif
 
 ENTRY(__memcpy_sse2_unaligned)
 	movq	%rdi, %rax
+L(start):
 	testq	%rdx, %rdx
 	je	L(return)
 	cmpq	$16, %rdx
-- 
2.5.0

* [PATCH 1/7] Remove dead code from memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

memcpy-sse2-unaligned.S contains

ENTRY(__memcpy_sse2_unaligned)
   movq  %rsi, %rax
   leaq  (%rdx,%rdx), %rcx
   subq  %rdi, %rax
   subq  %rdx, %rax
   cmpq  %rcx, %rax
   jb L(overlapping)

When the branch to L(overlapping) is taken, the subsequent branch

   cmpq  %rsi, %rdi
   jae   .L3

is never taken, so the backward byte-copy loop at .L3 is dead and can
be removed.

	[BZ #19776]
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
---
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index c450983..7207753 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -90,8 +90,6 @@ L(loop):
 	jne	L(loop)
 	jmp	L(return)
 L(overlapping):
-	cmpq	%rsi, %rdi
-	jae	.L3
 	testq	%rdx, %rdx
 	.p2align 4,,5
 	je	L(return)
@@ -146,15 +144,6 @@ L(less_16):
 	movzwl	-2(%rsi,%rdx), %eax
 	movw	%ax, -2(%rdi,%rdx)
 	jmp	L(return)
-.L3:
-	leaq	-1(%rdx), %rax
-	.p2align 4,,10
-	.p2align 4
-.L11:
-	movzbl	(%rsi,%rax), %edx
-	movb	%dl, (%rdi,%rax)
-	subq	$1, %rax
-	jmp	.L11
 L(between_9_16):
 	movq	(%rsi), %rax
 	movq	%rax, (%rdi)
-- 
2.5.0

* [PATCH 2/7] Don't use RAX as scratch register
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

To prepare for sharing code with mempcpy, stop using RAX as a scratch
register so that RAX can be set to the return value on entry and left
untouched by the copy body.
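
In the x86-64 SysV ABI the integer/pointer return value travels in RAX,
so fixing RAX at entry works as long as the rest of the routine never
clobbers it; RCX and R8-R11 remain available as call-clobbered scratch
registers.  A small, purely illustrative GCC extended-asm probe of that
convention (not part of the patch):

#include <stdio.h>

/* The "=a" constraint pins the output to RAX, mirroring the
   "movq %rdi, %rax" that now fixes the return value on entry.  */
static void *
return_in_rax (void *p)
{
  void *out;
  __asm__ ("movq %1, %0" : "=a" (out) : "r" (p));
  return out;
}

int
main (void)
{
  int x;
  printf ("%d\n", return_in_rax (&x) == (void *) &x);   /* prints 1 */
  return 0;
}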

	[BZ #19776]
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use
	RAX as scratch register.
---
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 77 ++++++++++++------------
 1 file changed, 37 insertions(+), 40 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 7207753..19d8aa6 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -24,11 +24,12 @@
 
 
 ENTRY(__memcpy_sse2_unaligned)
-	movq	%rsi, %rax
+	movq	%rdi, %rax
+	movq	%rsi, %r11
 	leaq	(%rdx,%rdx), %rcx
-	subq	%rdi, %rax
-	subq	%rdx, %rax
-	cmpq	%rcx, %rax
+	subq	%rdi, %r11
+	subq	%rdx, %r11
+	cmpq	%rcx, %r11
 	jb	L(overlapping)
 	cmpq	$16, %rdx
 	jbe	L(less_16)
@@ -39,7 +40,6 @@ ENTRY(__memcpy_sse2_unaligned)
 	movdqu	%xmm8, -16(%rdi,%rdx)
 	ja	.L31
 L(return):
-	movq	%rdi, %rax
 	ret
 	.p2align 4,,10
 	.p2align 4
@@ -64,16 +64,16 @@ L(return):
 	addq	%rdi, %rdx
 	andq	$-64, %rdx
 	andq	$-64, %rcx
-	movq	%rcx, %rax
-	subq	%rdi, %rax
-	addq	%rax, %rsi
+	movq	%rcx, %r11
+	subq	%rdi, %r11
+	addq	%r11, %rsi
 	cmpq	%rdx, %rcx
 	je	L(return)
 	movq	%rsi, %r10
 	subq	%rcx, %r10
 	leaq	16(%r10), %r9
 	leaq	32(%r10), %r8
-	leaq	48(%r10), %rax
+	leaq	48(%r10), %r11
 	.p2align 4,,10
 	.p2align 4
 L(loop):
@@ -83,12 +83,12 @@ L(loop):
 	movdqa	%xmm8, 16(%rcx)
 	movdqu	(%rcx,%r8), %xmm8
 	movdqa	%xmm8, 32(%rcx)
-	movdqu	(%rcx,%rax), %xmm8
+	movdqu	(%rcx,%r11), %xmm8
 	movdqa	%xmm8, 48(%rcx)
 	addq	$64, %rcx
 	cmpq	%rcx, %rdx
 	jne	L(loop)
-	jmp	L(return)
+	ret
 L(overlapping):
 	testq	%rdx, %rdx
 	.p2align 4,,5
@@ -97,8 +97,8 @@ L(overlapping):
 	leaq	16(%rsi), %rcx
 	leaq	16(%rdi), %r8
 	shrq	$4, %r9
-	movq	%r9, %rax
-	salq	$4, %rax
+	movq	%r9, %r11
+	salq	$4, %r11
 	cmpq	%rcx, %rdi
 	setae	%cl
 	cmpq	%r8, %rsi
@@ -107,9 +107,9 @@ L(overlapping):
 	cmpq	$15, %rdx
 	seta	%r8b
 	testb	%r8b, %cl
-	je	.L16
-	testq	%rax, %rax
-	je	.L16
+	je	.L21
+	testq	%r11, %r11
+	je	.L21
 	xorl	%ecx, %ecx
 	xorl	%r8d, %r8d
 .L7:
@@ -119,15 +119,15 @@ L(overlapping):
 	addq	$16, %rcx
 	cmpq	%r8, %r9
 	ja	.L7
-	cmpq	%rax, %rdx
+	cmpq	%r11, %rdx
 	je	L(return)
 .L21:
-	movzbl	(%rsi,%rax), %ecx
-	movb	%cl, (%rdi,%rax)
-	addq	$1, %rax
-	cmpq	%rax, %rdx
+	movzbl	(%rsi,%r11), %ecx
+	movb	%cl, (%rdi,%r11)
+	addq	$1, %r11
+	cmpq	%r11, %rdx
 	ja	.L21
-	jmp	L(return)
+	ret
 L(less_16):
 	testb	$24, %dl
 	jne	L(between_9_16)
@@ -137,28 +137,25 @@ L(less_16):
 	testq	%rdx, %rdx
 	.p2align 4,,2
 	je	L(return)
-	movzbl	(%rsi), %eax
+	movzbl	(%rsi), %ecx
 	testb	$2, %dl
-	movb	%al, (%rdi)
+	movb	%cl, (%rdi)
 	je	L(return)
-	movzwl	-2(%rsi,%rdx), %eax
-	movw	%ax, -2(%rdi,%rdx)
-	jmp	L(return)
+	movzwl	-2(%rsi,%rdx), %ecx
+	movw	%cx, -2(%rdi,%rdx)
+	ret
 L(between_9_16):
-	movq	(%rsi), %rax
-	movq	%rax, (%rdi)
-	movq	-8(%rsi,%rdx), %rax
-	movq	%rax, -8(%rdi,%rdx)
-	jmp	L(return)
-.L16:
-	xorl	%eax, %eax
-	jmp	.L21
+	movq	(%rsi), %rcx
+	movq	%rcx, (%rdi)
+	movq	-8(%rsi,%rdx), %rcx
+	movq	%rcx, -8(%rdi,%rdx)
+	ret
 L(between_5_8):
-	movl	(%rsi), %eax
-	movl	%eax, (%rdi)
-	movl	-4(%rsi,%rdx), %eax
-	movl	%eax, -4(%rdi,%rdx)
-	jmp	L(return)
+	movl	(%rsi), %ecx
+	movl	%ecx, (%rdi)
+	movl	-4(%rsi,%rdx), %ecx
+	movl	%ecx, -4(%rdi,%rdx)
+	ret
 END(__memcpy_sse2_unaligned)
 
 #endif
-- 
2.5.0

* [PATCH 6/7] Enable __mempcpy_chk_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned. The new
selection order is:

1. __mempcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_chk_sse2 if SSSE3 isn't available.
4. __mempcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_chk_ssse3 otherwise.
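
Written out in C, the new dispatch in the hunk below is just the
following if/else chain (the struct and flag names are stand-ins for
the HAS_ARCH_FEATURE/HAS_CPU_FEATURE tests, and the strings name the
real implementations; this models the selection only, not the ifunc
machinery):

#include <stdio.h>

struct cpu
{
  int avx_fast_unaligned_load;
  int fast_unaligned_load;
  int ssse3;
  int fast_copy_backward;
};

static const char *
select_mempcpy_chk (const struct cpu *c)
{
  if (c->avx_fast_unaligned_load)
    return "__mempcpy_chk_avx_unaligned";
  if (c->fast_unaligned_load)
    return "__mempcpy_chk_sse2_unaligned";
  if (!c->ssse3)
    return "__mempcpy_chk_sse2";
  if (c->fast_copy_backward)
    return "__mempcpy_chk_ssse3_back";
  return "__mempcpy_chk_ssse3";
}

int
main (void)
{
  /* Hypothetical flag values, for illustration only.  */
  struct cpu example = { 0, 1, 1, 0 };
  puts (select_mempcpy_chk (&example));  /* __mempcpy_chk_sse2_unaligned */
  return 0;
}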

	[BZ #19776]
	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
	Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.
---
 sysdeps/x86_64/multiarch/mempcpy_chk.S | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6e8a89d..bec37bc 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -35,19 +35,22 @@ ENTRY(__mempcpy_chk)
 	jz	1f
 	HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
 	jz	1f
-	leaq    __mempcpy_chk_avx512_no_vzeroupper(%rip), %rax
+	lea	__mempcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
 	ret
 #endif
-1:	leaq	__mempcpy_chk_sse2(%rip), %rax
+1:	lea	__mempcpy_chk_avx_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	jnz	2f
+	lea	__mempcpy_chk_sse2_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+	jnz	2f
+	lea	__mempcpy_chk_sse2(%rip), %RAX_LP
 	HAS_CPU_FEATURE (SSSE3)
 	jz	2f
-	leaq	__mempcpy_chk_ssse3(%rip), %rax
+	lea	__mempcpy_chk_ssse3_back(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Copy_Backward)
-	jz	2f
-	leaq	__mempcpy_chk_ssse3_back(%rip), %rax
-	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	jz	2f
-	leaq	__mempcpy_chk_avx_unaligned(%rip), %rax
+	jnz	2f
+	lea	__mempcpy_chk_ssse3(%rip), %RAX_LP
 2:	ret
 END(__mempcpy_chk)
 # else
-- 
2.5.0

* [PATCH 7/7] Enable __memcpy_chk_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned. The new
selection order is:

1. __memcpy_chk_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_chk_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __memcpy_chk_sse2 if SSSE3 isn't available.
4. __memcpy_chk_ssse3_back if Fast_Copy_Backward bit is set.
5. __memcpy_chk_ssse3 otherwise.
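
__memcpy_chk is normally reached through fortified calls rather than
called by hand.  A small example of the kind of call that lands in the
implementation selected by this patch; it relies on the GCC/Clang
builtins __builtin___memcpy_chk and __builtin_object_size, and
compiling an ordinary memcpy with -O2 -D_FORTIFY_SOURCE=2 generates the
same thing when the destination size is known:

#include <string.h>

static char dst[8];

void
copy_checked (const void *src, size_t n)
{
  /* The fourth argument is the known size of dst; the __memcpy_chk
     picked by the selector calls __chk_fail if n exceeds it.  */
  __builtin___memcpy_chk (dst, src, n, __builtin_object_size (dst, 0));
}

int
main (void)
{
  copy_checked ("1234567", 8);  /* fits in dst, so no __chk_fail  */
  return 0;
}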

	[BZ #19776]
	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Check
	Fast_Unaligned_Load to enable __memcpy_chk_sse2_unaligned.
---
 sysdeps/x86_64/multiarch/memcpy_chk.S | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 648217e..c009211 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -32,22 +32,25 @@ ENTRY(__memcpy_chk)
 	LOAD_RTLD_GLOBAL_RO_RDX
 #ifdef HAVE_AVX512_ASM_SUPPORT
 	HAS_ARCH_FEATURE (AVX512F_Usable)
-	jz      1f
+	jz	1f
 	HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
-	jz      1f
-	leaq    __memcpy_chk_avx512_no_vzeroupper(%rip), %rax
+	jz	1f
+	lea	__memcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
 	ret
 #endif
-1:	leaq	__memcpy_chk_sse2(%rip), %rax
+1:	lea	__memcpy_chk_avx_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	jnz	2f
+	lea	__memcpy_chk_sse2_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+	jnz	2f
+	lea	__memcpy_chk_sse2(%rip), %RAX_LP
 	HAS_CPU_FEATURE (SSSE3)
 	jz	2f
-	leaq	__memcpy_chk_ssse3(%rip), %rax
+	lea	__memcpy_chk_ssse3_back(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Copy_Backward)
-	jz	2f
-	leaq	__memcpy_chk_ssse3_back(%rip), %rax
-	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	jz  2f
-	leaq    __memcpy_chk_avx_unaligned(%rip), %rax
+	jnz	2f
+	lea	__memcpy_chk_ssse3(%rip), %RAX_LP
 2:	ret
 END(__memcpy_chk)
 # else
-- 
2.5.0

* [PATCH 5/7] Enable __mempcpy_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned.  The new
selection order is:

1. __mempcpy_avx_unaligned if AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3 otherwise.
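
As background, mempcpy is the GNU extension that returns the end of the
copied region, which is what makes chained appends cheap and why it
deserves its own fast entry point.  A small usage example (requires
_GNU_SOURCE):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char buf[32];
  char *p = buf;

  /* Each call returns the byte just past what it wrote, so pieces can
     be appended without recomputing lengths.  */
  p = mempcpy (p, "bench-", 6);
  p = mempcpy (p, "mempcpy", 7);
  *p = '\0';
  puts (buf);                   /* prints "bench-mempcpy"  */
  return 0;
}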

	[BZ #19776]
	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
	Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.
---
 sysdeps/x86_64/multiarch/mempcpy.S | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index ed78623..1314d76 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -33,19 +33,22 @@ ENTRY(__mempcpy)
 	jz	1f
 	HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
 	jz	1f
-	leaq    __mempcpy_avx512_no_vzeroupper(%rip), %rax
+	lea	__mempcpy_avx512_no_vzeroupper(%rip), %RAX_LP
 	ret
 #endif
-1:	leaq	__mempcpy_sse2(%rip), %rax
+1:	lea	__mempcpy_avx_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+	jnz	2f
+	lea	__mempcpy_sse2_unaligned(%rip), %RAX_LP
+	HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+	jnz	2f
+	lea	__mempcpy_sse2(%rip), %RAX_LP
 	HAS_CPU_FEATURE (SSSE3)
 	jz	2f
-	leaq	__mempcpy_ssse3(%rip), %rax
+	lea	__mempcpy_ssse3_back(%rip), %RAX_LP
 	HAS_ARCH_FEATURE (Fast_Copy_Backward)
-	jz	2f
-	leaq	__mempcpy_ssse3_back(%rip), %rax
-	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
-	jz	2f
-	leaq	__mempcpy_avx_unaligned(%rip), %rax
+	jnz	2f
+	lea	__mempcpy_ssse3(%rip), %RAX_LP
 2:	ret
 END(__mempcpy)
 
-- 
2.5.0

* [PATCH 3/7] Remove L(overlapping) from memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:37 UTC
  To: libc-alpha; +Cc: Ondrej Bilka

Since memcpy doesn't need to check for overlapping source and
destination buffers, we can remove L(overlapping).
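
Overlapping buffers are undefined behaviour for memcpy in the first
place; callers that need overlap must use memmove, so the forward SSE2
copy can assume disjoint buffers.  For example:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  char buf[] = "abcdef";

  /* Shifting data within the same buffer overlaps, so memmove is the
     right call; memcpy (buf, buf + 2, 4) would be undefined behaviour,
     which is why __memcpy_sse2_unaligned need not detect it.  */
  memmove (buf, buf + 2, 4);
  buf[4] = '\0';
  puts (buf);                   /* prints "cdef"  */
  return 0;
}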

	[BZ #19776]
	* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
	(L(overlapping)): Removed.
---
 sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 47 +-----------------------
 1 file changed, 2 insertions(+), 45 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 19d8aa6..335a498 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -25,12 +25,8 @@
 
 ENTRY(__memcpy_sse2_unaligned)
 	movq	%rdi, %rax
-	movq	%rsi, %r11
-	leaq	(%rdx,%rdx), %rcx
-	subq	%rdi, %r11
-	subq	%rdx, %r11
-	cmpq	%rcx, %r11
-	jb	L(overlapping)
+	testq	%rdx, %rdx
+	je	L(return)
 	cmpq	$16, %rdx
 	jbe	L(less_16)
 	movdqu	(%rsi), %xmm8
@@ -89,45 +85,6 @@ L(loop):
 	cmpq	%rcx, %rdx
 	jne	L(loop)
 	ret
-L(overlapping):
-	testq	%rdx, %rdx
-	.p2align 4,,5
-	je	L(return)
-	movq	%rdx, %r9
-	leaq	16(%rsi), %rcx
-	leaq	16(%rdi), %r8
-	shrq	$4, %r9
-	movq	%r9, %r11
-	salq	$4, %r11
-	cmpq	%rcx, %rdi
-	setae	%cl
-	cmpq	%r8, %rsi
-	setae	%r8b
-	orl	%r8d, %ecx
-	cmpq	$15, %rdx
-	seta	%r8b
-	testb	%r8b, %cl
-	je	.L21
-	testq	%r11, %r11
-	je	.L21
-	xorl	%ecx, %ecx
-	xorl	%r8d, %r8d
-.L7:
-	movdqu	(%rsi,%rcx), %xmm8
-	addq	$1, %r8
-	movdqu	%xmm8, (%rdi,%rcx)
-	addq	$16, %rcx
-	cmpq	%r8, %r9
-	ja	.L7
-	cmpq	%r11, %rdx
-	je	L(return)
-.L21:
-	movzbl	(%rsi,%r11), %ecx
-	movb	%cl, (%rdi,%r11)
-	addq	$1, %r11
-	cmpq	%r11, %rdx
-	ja	.L21
-	ret
 L(less_16):
 	testb	$24, %dl
 	jne	L(between_9_16)
-- 
2.5.0
