* [PATCH 0/7] [BZ #19776] Improve x86-64 memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:36 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
This set of patches improves the x86-64 memcpy-sse2-unaligned.S
implementation by
1. Removing dead code.
2. Setting RAX to the return value at function entry, so that the copy
body can be shared with mempcpy.
3. Removing the unnecessary overlap check, L(overlapping).
4. Adding entry points for __mempcpy_chk_sse2_unaligned,
__mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.
5. Enabling __mempcpy_chk_sse2_unaligned, __mempcpy_sse2_unaligned and
__memcpy_chk_sse2_unaligned.
bench-mempcpy shows
Ivy Bridge:
simple_mempcpy __mempcpy_avx_unaligned __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2
Length 432, alignment 27/ 0: 1628.16 98.3906 73.5625 94.7344 67.1719 139.531
Length 432, alignment 0/27: 1627.84 148.891 80.625 98.8281 104.766 142.625
Length 432, alignment 27/27: 1631.03 90.5469 70.0312 69.5938 76.5469 123.969
Length 448, alignment 0/ 0: 1685.95 79.1875 65.1719 72.9062 70.1406 116.688
Length 448, alignment 28/ 0: 1685.84 89.4531 73.0156 99.5938 86.3594 138.203
Length 448, alignment 0/28: 1684.52 148.016 82.2812 94.8438 103.344 147.578
Length 448, alignment 28/28: 1684.42 86.4688 65.4062 70.0469 72.4688 123.422
Length 464, alignment 0/ 0: 1740.77 70.1406 66.2812 69.2656 71.25 118.234
Length 464, alignment 29/ 0: 1742.31 100.141 75.875 98.9219 83.375 145.594
Length 464, alignment 0/29: 1742.31 148.016 80.2969 107.766 102.031 154.531
Length 464, alignment 29/29: 1740.98 91.5469 64.8438 72.4531 71.4688 127.062
Length 480, alignment 0/ 0: 1967.2 76.875 66.625 71.1406 71.25 123.641
Length 480, alignment 30/ 0: 1799.02 94.3125 72.9062 103.797 80.2969 144.484
Length 480, alignment 0/30: 1797.47 148.453 82.6094 102.906 102.25 158.062
Length 480, alignment 30/30: 1799.02 90.8906 68.0469 69.5938 71.3594 124.844
Length 496, alignment 0/ 0: 1853.83 71.25 68.3906 71.9219 69.1406 123.422
Length 496, alignment 31/ 0: 1855.38 94.8438 74.3438 104.672 73.2344 148.125
Length 496, alignment 0/31: 1853.59 148.906 80.2969 109.297 114.703 163.016
Length 496, alignment 31/31: 1855.27 93.0781 71.4688 72.3438 84.2656 127.953
Length 4096, alignment 0/ 0: 14559.7 509.891 506.469 474.156 508.344 591.062
Nehalem:
simple_mempcpy __mempcpy_ssse3_back __mempcpy_ssse3 __mempcpy_sse2_unaligned __mempcpy_sse2
Length 432, alignment 27/ 0: 113.25 50.9531 64.0312 39.1406 77.6719
Length 432, alignment 0/27: 130.688 45.7969 63.9844 89.1562 133.078
Length 432, alignment 27/27: 118.266 34.4531 36 40.9688 70.7812
Length 448, alignment 0/ 0: 98.2969 34.3594 37.5 39.5156 56.2969
Length 448, alignment 28/ 0: 115.641 51.7969 64.6406 44.6719 77.2031
Length 448, alignment 0/28: 143.297 46.7812 64.9688 88.1719 137.25
Length 448, alignment 28/28: 118.453 34.4531 36.7969 40.0312 70.3125
Length 464, alignment 0/ 0: 101.156 36.0938 37.125 39.4688 63.6562
Length 464, alignment 29/ 0: 118.594 52.6875 69.1406 43.6875 79.9219
Length 464, alignment 0/29: 133.922 46.4062 71.0156 88.5 142.922
Length 464, alignment 29/29: 126.047 36.1406 39.375 39.3281 71.3438
Length 480, alignment 0/ 0: 104.203 36.1406 38.2969 39.2812 59.5312
Length 480, alignment 30/ 0: 120.375 53.2969 69.8438 47.25 80.5312
Length 480, alignment 0/30: 150 47.0625 69.9844 87.4219 148.125
Length 480, alignment 30/30: 126.375 37.9219 37.6875 39.2812 70.8281
Length 496, alignment 0/ 0: 107.016 37.5 39.0938 39.5156 67.2656
Length 496, alignment 31/ 0: 119.719 169.078 71.4375 45.6562 79.4531
Length 496, alignment 0/31: 139.641 47.25 71.2969 101.953 155.062
Length 496, alignment 31/31: 123.844 39.8438 40.6406 45.75 70.5469
Length 4096, alignment 0/ 0: 749.203 245.859 249.609 253.172 292.078
H.J. Lu (7):
Remove dead code from memcpy-sse2-unaligned.S
Don't use RAX as scratch register
Remove L(overlapping) from memcpy-sse2-unaligned.S
Add entry points for __mempcpy_sse2_unaligned and _chk functions
Enable __mempcpy_sse2_unaligned
Enable __mempcpy_chk_sse2_unaligned
Enable __memcpy_chk_sse2_unaligned
sysdeps/x86_64/multiarch/ifunc-impl-list.c | 6 ++
sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 125 ++++++++---------------
sysdeps/x86_64/multiarch/memcpy_chk.S | 23 +++--
sysdeps/x86_64/multiarch/mempcpy.S | 19 ++--
sysdeps/x86_64/multiarch/mempcpy_chk.S | 19 ++--
5 files changed, 85 insertions(+), 107 deletions(-)
--
2.5.0
* [PATCH 4/7] Add entry points for __mempcpy_sse2_unaligned and _chk functions
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
Add entry points for __mempcpy_chk_sse2_unaligned,
__mempcpy_sse2_unaligned and __memcpy_chk_sse2_unaligned.
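The _chk entry points compare the destination buffer size against the
copy length and fall through into the copy code when the check passes.
For context, these entry points are the targets of compiler-generated
fortified calls; a minimal C illustration (hypothetical helper, not
part of the patch):

#define _GNU_SOURCE
#include <string.h>

/* With -D_FORTIFY_SOURCE=2 and optimization, the mempcpy call below is
   compiled to __mempcpy_chk (buf, src, n, __builtin_object_size (buf, 0)),
   i.e. __mempcpy_chk (buf, src, n, 16).  The new entry point compares
   the object size (RCX) with the length (RDX) and jumps to __chk_fail
   when the copy would overflow the destination.  */
char *
fill_buf (const char *src, size_t n)
{
  static char buf[16];
  return mempcpy (buf, src, n);
}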
[BZ #19776]
* sysdeps/x86_64/multiarch/ifunc-impl-list.c
(__libc_ifunc_impl_list): Test __memcpy_chk_sse2_unaligned,
__mempcpy_chk_sse2_unaligned and __mempcpy_sse2_unaligned.
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
(__mempcpy_chk_sse2_unaligned): New.
(__mempcpy_sse2_unaligned): Likewise.
(__memcpy_chk_sse2_unaligned): Likewise.
(L(start)): New label.
---
sysdeps/x86_64/multiarch/ifunc-impl-list.c | 6 ++++++
sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 20 ++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index 188b6d3..47ca468 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -278,6 +278,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
HAS_CPU_FEATURE (SSSE3),
__memcpy_chk_ssse3)
IFUNC_IMPL_ADD (array, i, __memcpy_chk, 1,
+ __memcpy_chk_sse2_unaligned)
+ IFUNC_IMPL_ADD (array, i, __memcpy_chk, 1,
__memcpy_chk_sse2))
/* Support sysdeps/x86_64/multiarch/memcpy.S. */
@@ -314,6 +316,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
HAS_CPU_FEATURE (SSSE3),
__mempcpy_chk_ssse3)
IFUNC_IMPL_ADD (array, i, __mempcpy_chk, 1,
+ __mempcpy_chk_sse2_unaligned)
+ IFUNC_IMPL_ADD (array, i, __mempcpy_chk, 1,
__mempcpy_chk_sse2))
/* Support sysdeps/x86_64/multiarch/mempcpy.S. */
@@ -330,6 +334,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
__mempcpy_ssse3_back)
IFUNC_IMPL_ADD (array, i, mempcpy, HAS_CPU_FEATURE (SSSE3),
__mempcpy_ssse3)
+ IFUNC_IMPL_ADD (array, i, mempcpy, 1,
+ __mempcpy_sse2_unaligned)
IFUNC_IMPL_ADD (array, i, mempcpy, 1, __mempcpy_sse2))
/* Support sysdeps/x86_64/multiarch/strncmp.S. */
diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 335a498..947c50f 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -22,9 +22,29 @@
#include "asm-syntax.h"
+# ifdef SHARED
+ENTRY (__mempcpy_chk_sse2_unaligned)
+ cmpq %rdx, %rcx
+ jb HIDDEN_JUMPTARGET (__chk_fail)
+END (__mempcpy_chk_sse2_unaligned)
+# endif
+
+ENTRY (__mempcpy_sse2_unaligned)
+ mov %rdi, %rax
+ add %rdx, %rax
+ jmp L(start)
+END (__mempcpy_sse2_unaligned)
+
+# ifdef SHARED
+ENTRY (__memcpy_chk_sse2_unaligned)
+ cmpq %rdx, %rcx
+ jb HIDDEN_JUMPTARGET (__chk_fail)
+END (__memcpy_chk_sse2_unaligned)
+# endif
ENTRY(__memcpy_sse2_unaligned)
movq %rdi, %rax
+L(start):
testq %rdx, %rdx
je L(return)
cmpq $16, %rdx
--
2.5.0
* [PATCH 1/7] Remove dead code from memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
__memcpy_sse2_unaligned begins with:

ENTRY(__memcpy_sse2_unaligned)
	movq	%rsi, %rax
	leaq	(%rdx,%rdx), %rcx
	subq	%rdi, %rax
	subq	%rdx, %rax
	cmpq	%rcx, %rax
	jb	L(overlapping)

When that branch is taken, the check at L(overlapping),

	cmpq	%rsi, %rdi
	jae	.L3

can never branch, so .L3 and its copy loop are dead code that we can
remove.
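In C terms, the entry check and the removed test are (a hypothetical
rendering of the assembly above, not code from the patch):

#include <stdint.h>
#include <stddef.h>

static int
reaches_l3 (uintptr_t dst, uintptr_t src, size_t n)
{
  if (src - dst - n < 2 * n)	/* jb L(overlapping) */
    return dst >= src;		/* jae .L3 -- never true here */
  return 0;
}

Ignoring address-space wraparound, the first condition holds for n > 0
only when dst + n <= src < dst + 3 * n, which forces dst < src; for
n == 0 it never holds.  Hence the jae branch never fires.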
[BZ #19776]
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S (.L3): Removed.
(.L11): Likewise.
---
sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 11 -----------
1 file changed, 11 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index c450983..7207753 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -90,8 +90,6 @@ L(loop):
jne L(loop)
jmp L(return)
L(overlapping):
- cmpq %rsi, %rdi
- jae .L3
testq %rdx, %rdx
.p2align 4,,5
je L(return)
@@ -146,15 +144,6 @@ L(less_16):
movzwl -2(%rsi,%rdx), %eax
movw %ax, -2(%rdi,%rdx)
jmp L(return)
-.L3:
- leaq -1(%rdx), %rax
- .p2align 4,,10
- .p2align 4
-.L11:
- movzbl (%rsi,%rax), %edx
- movb %dl, (%rdi,%rax)
- subq $1, %rax
- jmp .L11
L(between_9_16):
movq (%rsi), %rax
movq %rax, (%rdi)
--
2.5.0
* [PATCH 2/7] Don't use RAX as scratch register
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
To prepare for sharing code with mempcpy, stop using RAX as a scratch
register so that RAX can be set to the return value at function entry
and left untouched on every path.
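With RAX holding the return value from the first instruction on, a
mempcpy entry point can later jump into the shared body.  A minimal
sketch of the layout this enables (taken from the later patches in
this series; the real code carries the usual ENTRY/END boilerplate):

ENTRY (__mempcpy_sse2_unaligned)
	mov	%rdi, %rax	/* mempcpy returns DST + LEN.  */
	add	%rdx, %rax
	jmp	L(start)
END (__mempcpy_sse2_unaligned)

ENTRY (__memcpy_sse2_unaligned)
	movq	%rdi, %rax	/* memcpy returns DST.  */
L(start):
	/* Shared copy body: uses RCX/R8-R11 as scratch, never RAX,
	   and every exit path ends with a plain ret.  */
	...
END (__memcpy_sse2_unaligned)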
[BZ #19776]
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: Don't use
RAX as scratch register.
---
sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 77 ++++++++++++------------
1 file changed, 37 insertions(+), 40 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 7207753..19d8aa6 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -24,11 +24,12 @@
ENTRY(__memcpy_sse2_unaligned)
- movq %rsi, %rax
+ movq %rdi, %rax
+ movq %rsi, %r11
leaq (%rdx,%rdx), %rcx
- subq %rdi, %rax
- subq %rdx, %rax
- cmpq %rcx, %rax
+ subq %rdi, %r11
+ subq %rdx, %r11
+ cmpq %rcx, %r11
jb L(overlapping)
cmpq $16, %rdx
jbe L(less_16)
@@ -39,7 +40,6 @@ ENTRY(__memcpy_sse2_unaligned)
movdqu %xmm8, -16(%rdi,%rdx)
ja .L31
L(return):
- movq %rdi, %rax
ret
.p2align 4,,10
.p2align 4
@@ -64,16 +64,16 @@ L(return):
addq %rdi, %rdx
andq $-64, %rdx
andq $-64, %rcx
- movq %rcx, %rax
- subq %rdi, %rax
- addq %rax, %rsi
+ movq %rcx, %r11
+ subq %rdi, %r11
+ addq %r11, %rsi
cmpq %rdx, %rcx
je L(return)
movq %rsi, %r10
subq %rcx, %r10
leaq 16(%r10), %r9
leaq 32(%r10), %r8
- leaq 48(%r10), %rax
+ leaq 48(%r10), %r11
.p2align 4,,10
.p2align 4
L(loop):
@@ -83,12 +83,12 @@ L(loop):
movdqa %xmm8, 16(%rcx)
movdqu (%rcx,%r8), %xmm8
movdqa %xmm8, 32(%rcx)
- movdqu (%rcx,%rax), %xmm8
+ movdqu (%rcx,%r11), %xmm8
movdqa %xmm8, 48(%rcx)
addq $64, %rcx
cmpq %rcx, %rdx
jne L(loop)
- jmp L(return)
+ ret
L(overlapping):
testq %rdx, %rdx
.p2align 4,,5
@@ -97,8 +97,8 @@ L(overlapping):
leaq 16(%rsi), %rcx
leaq 16(%rdi), %r8
shrq $4, %r9
- movq %r9, %rax
- salq $4, %rax
+ movq %r9, %r11
+ salq $4, %r11
cmpq %rcx, %rdi
setae %cl
cmpq %r8, %rsi
@@ -107,9 +107,9 @@ L(overlapping):
cmpq $15, %rdx
seta %r8b
testb %r8b, %cl
- je .L16
- testq %rax, %rax
- je .L16
+ je .L21
+ testq %r11, %r11
+ je .L21
xorl %ecx, %ecx
xorl %r8d, %r8d
.L7:
@@ -119,15 +119,15 @@ L(overlapping):
addq $16, %rcx
cmpq %r8, %r9
ja .L7
- cmpq %rax, %rdx
+ cmpq %r11, %rdx
je L(return)
.L21:
- movzbl (%rsi,%rax), %ecx
- movb %cl, (%rdi,%rax)
- addq $1, %rax
- cmpq %rax, %rdx
+ movzbl (%rsi,%r11), %ecx
+ movb %cl, (%rdi,%r11)
+ addq $1, %r11
+ cmpq %r11, %rdx
ja .L21
- jmp L(return)
+ ret
L(less_16):
testb $24, %dl
jne L(between_9_16)
@@ -137,28 +137,25 @@ L(less_16):
testq %rdx, %rdx
.p2align 4,,2
je L(return)
- movzbl (%rsi), %eax
+ movzbl (%rsi), %ecx
testb $2, %dl
- movb %al, (%rdi)
+ movb %cl, (%rdi)
je L(return)
- movzwl -2(%rsi,%rdx), %eax
- movw %ax, -2(%rdi,%rdx)
- jmp L(return)
+ movzwl -2(%rsi,%rdx), %ecx
+ movw %cx, -2(%rdi,%rdx)
+ ret
L(between_9_16):
- movq (%rsi), %rax
- movq %rax, (%rdi)
- movq -8(%rsi,%rdx), %rax
- movq %rax, -8(%rdi,%rdx)
- jmp L(return)
-.L16:
- xorl %eax, %eax
- jmp .L21
+ movq (%rsi), %rcx
+ movq %rcx, (%rdi)
+ movq -8(%rsi,%rdx), %rcx
+ movq %rcx, -8(%rdi,%rdx)
+ ret
L(between_5_8):
- movl (%rsi), %eax
- movl %eax, (%rdi)
- movl -4(%rsi,%rdx), %eax
- movl %eax, -4(%rdi,%rdx)
- jmp L(return)
+ movl (%rsi), %ecx
+ movl %ecx, (%rdi)
+ movl -4(%rsi,%rdx), %ecx
+ movl %ecx, -4(%rdi,%rdx)
+ ret
END(__memcpy_sse2_unaligned)
#endif
--
2.5.0
* [PATCH 6/7] Enable __mempcpy_chk_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
Check Fast_Unaligned_Load for __mempcpy_chk_sse2_unaligned.  The new
selection order is (see the C sketch after the list):
1. __mempcpy_chk_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_chk_sse2_unaligned if the Fast_Unaligned_Load bit is set.
3. __mempcpy_chk_sse2 if SSSE3 isn't available.
4. __mempcpy_chk_ssse3_back if the Fast_Copy_Backward bit is set.
5. __mempcpy_chk_ssse3 otherwise.
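Equivalently, in rough C pseudocode (illustrative only; the real
selector is the assembly in the diff below, and the AVX512 branch
visible there is omitted here):

static void *
select_mempcpy_chk (void)
{
  if (HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load))
    return __mempcpy_chk_avx_unaligned;
  if (HAS_ARCH_FEATURE (Fast_Unaligned_Load))
    return __mempcpy_chk_sse2_unaligned;
  if (!HAS_CPU_FEATURE (SSSE3))
    return __mempcpy_chk_sse2;
  if (HAS_ARCH_FEATURE (Fast_Copy_Backward))
    return __mempcpy_chk_ssse3_back;
  return __mempcpy_chk_ssse3;
}

The same ordering applies to the __mempcpy and __memcpy_chk selectors
in patches 5/7 and 7/7.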
[BZ #19776]
* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Check
Fast_Unaligned_Load to enable __mempcpy_chk_sse2_unaligned.
---
sysdeps/x86_64/multiarch/mempcpy_chk.S | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
index 6e8a89d..bec37bc 100644
--- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
@@ -35,19 +35,22 @@ ENTRY(__mempcpy_chk)
jz 1f
HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
jz 1f
- leaq __mempcpy_chk_avx512_no_vzeroupper(%rip), %rax
+ lea __mempcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
ret
#endif
-1: leaq __mempcpy_chk_sse2(%rip), %rax
+1: lea __mempcpy_chk_avx_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+ jnz 2f
+ lea __mempcpy_chk_sse2_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+ jnz 2f
+ lea __mempcpy_chk_sse2(%rip), %RAX_LP
HAS_CPU_FEATURE (SSSE3)
jz 2f
- leaq __mempcpy_chk_ssse3(%rip), %rax
+ lea __mempcpy_chk_ssse3_back(%rip), %RAX_LP
HAS_ARCH_FEATURE (Fast_Copy_Backward)
- jz 2f
- leaq __mempcpy_chk_ssse3_back(%rip), %rax
- HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
- jz 2f
- leaq __mempcpy_chk_avx_unaligned(%rip), %rax
+ jnz 2f
+ lea __mempcpy_chk_ssse3(%rip), %RAX_LP
2: ret
END(__mempcpy_chk)
# else
--
2.5.0
* [PATCH 7/7] Enable __memcpy_chk_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
Check Fast_Unaligned_Load for __memcpy_chk_sse2_unaligned.  The new
selection order is:
1. __memcpy_chk_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
2. __memcpy_chk_sse2_unaligned if the Fast_Unaligned_Load bit is set.
3. __memcpy_chk_sse2 if SSSE3 isn't available.
4. __memcpy_chk_ssse3_back if the Fast_Copy_Backward bit is set.
5. __memcpy_chk_ssse3 otherwise.
[BZ #19776]
* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Check
Fast_Unaligned_Load to enable __memcpy_chk_sse2_unaligned.
---
sysdeps/x86_64/multiarch/memcpy_chk.S | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
index 648217e..c009211 100644
--- a/sysdeps/x86_64/multiarch/memcpy_chk.S
+++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
@@ -32,22 +32,25 @@ ENTRY(__memcpy_chk)
LOAD_RTLD_GLOBAL_RO_RDX
#ifdef HAVE_AVX512_ASM_SUPPORT
HAS_ARCH_FEATURE (AVX512F_Usable)
- jz 1f
+ jz 1f
HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
- jz 1f
- leaq __memcpy_chk_avx512_no_vzeroupper(%rip), %rax
+ jz 1f
+ lea __memcpy_chk_avx512_no_vzeroupper(%rip), %RAX_LP
ret
#endif
-1: leaq __memcpy_chk_sse2(%rip), %rax
+1: lea __memcpy_chk_avx_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+ jnz 2f
+ lea __memcpy_chk_sse2_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+ jnz 2f
+ lea __memcpy_chk_sse2(%rip), %RAX_LP
HAS_CPU_FEATURE (SSSE3)
jz 2f
- leaq __memcpy_chk_ssse3(%rip), %rax
+ lea __memcpy_chk_ssse3_back(%rip), %RAX_LP
HAS_ARCH_FEATURE (Fast_Copy_Backward)
- jz 2f
- leaq __memcpy_chk_ssse3_back(%rip), %rax
- HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
- jz 2f
- leaq __memcpy_chk_avx_unaligned(%rip), %rax
+ jnz 2f
+ lea __memcpy_chk_ssse3(%rip), %RAX_LP
2: ret
END(__memcpy_chk)
# else
--
2.5.0
* [PATCH 5/7] Enable __mempcpy_sse2_unaligned
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
Check Fast_Unaligned_Load for __mempcpy_sse2_unaligned.  The new
selection order is:
1. __mempcpy_avx_unaligned if the AVX_Fast_Unaligned_Load bit is set.
2. __mempcpy_sse2_unaligned if the Fast_Unaligned_Load bit is set.
3. __mempcpy_sse2 if SSSE3 isn't available.
4. __mempcpy_ssse3_back if the Fast_Copy_Backward bit is set.
5. __mempcpy_ssse3 otherwise.
[BZ #19776]
* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Check
Fast_Unaligned_Load to enable __mempcpy_sse2_unaligned.
---
sysdeps/x86_64/multiarch/mempcpy.S | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
index ed78623..1314d76 100644
--- a/sysdeps/x86_64/multiarch/mempcpy.S
+++ b/sysdeps/x86_64/multiarch/mempcpy.S
@@ -33,19 +33,22 @@ ENTRY(__mempcpy)
jz 1f
HAS_ARCH_FEATURE (Prefer_No_VZEROUPPER)
jz 1f
- leaq __mempcpy_avx512_no_vzeroupper(%rip), %rax
+ lea __mempcpy_avx512_no_vzeroupper(%rip), %RAX_LP
ret
#endif
-1: leaq __mempcpy_sse2(%rip), %rax
+1: lea __mempcpy_avx_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
+ jnz 2f
+ lea __mempcpy_sse2_unaligned(%rip), %RAX_LP
+ HAS_ARCH_FEATURE (Fast_Unaligned_Load)
+ jnz 2f
+ lea __mempcpy_sse2(%rip), %RAX_LP
HAS_CPU_FEATURE (SSSE3)
jz 2f
- leaq __mempcpy_ssse3(%rip), %rax
+ lea __mempcpy_ssse3_back(%rip), %RAX_LP
HAS_ARCH_FEATURE (Fast_Copy_Backward)
- jz 2f
- leaq __mempcpy_ssse3_back(%rip), %rax
- HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
- jz 2f
- leaq __mempcpy_avx_unaligned(%rip), %rax
+ jnz 2f
+ lea __mempcpy_ssse3(%rip), %RAX_LP
2: ret
END(__mempcpy)
--
2.5.0
* [PATCH 3/7] Remove L(overlapping) from memcpy-sse2-unaligned.S
From: H.J. Lu @ 2016-03-07 17:37 UTC (permalink / raw)
To: libc-alpha; +Cc: Ondrej Bilka
Since memcpy is not required to support overlapping source and
destination buffers, we can remove L(overlapping) and its copy loops.
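ISO C leaves memcpy with overlapping objects undefined, so the check
guarded nothing that callers may rely on; a quick illustration (not
part of the patch):

#include <string.h>

void
shift_right (char *buf)		/* buf holds at least 11 live bytes.  */
{
  /* memcpy (buf + 1, buf, 10) would be undefined behavior: the
     regions overlap.  memmove is the defined interface for that.  */
  memmove (buf + 1, buf, 10);
}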
[BZ #19776]
* sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
(L(overlapping)): Removed.
---
sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S | 47 +-----------------------
1 file changed, 2 insertions(+), 45 deletions(-)
diff --git a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
index 19d8aa6..335a498 100644
--- a/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S
@@ -25,12 +25,8 @@
ENTRY(__memcpy_sse2_unaligned)
movq %rdi, %rax
- movq %rsi, %r11
- leaq (%rdx,%rdx), %rcx
- subq %rdi, %r11
- subq %rdx, %r11
- cmpq %rcx, %r11
- jb L(overlapping)
+ testq %rdx, %rdx
+ je L(return)
cmpq $16, %rdx
jbe L(less_16)
movdqu (%rsi), %xmm8
@@ -89,45 +85,6 @@ L(loop):
cmpq %rcx, %rdx
jne L(loop)
ret
-L(overlapping):
- testq %rdx, %rdx
- .p2align 4,,5
- je L(return)
- movq %rdx, %r9
- leaq 16(%rsi), %rcx
- leaq 16(%rdi), %r8
- shrq $4, %r9
- movq %r9, %r11
- salq $4, %r11
- cmpq %rcx, %rdi
- setae %cl
- cmpq %r8, %rsi
- setae %r8b
- orl %r8d, %ecx
- cmpq $15, %rdx
- seta %r8b
- testb %r8b, %cl
- je .L21
- testq %r11, %r11
- je .L21
- xorl %ecx, %ecx
- xorl %r8d, %r8d
-.L7:
- movdqu (%rsi,%rcx), %xmm8
- addq $1, %r8
- movdqu %xmm8, (%rdi,%rcx)
- addq $16, %rcx
- cmpq %r8, %r9
- ja .L7
- cmpq %r11, %rdx
- je L(return)
-.L21:
- movzbl (%rsi,%r11), %ecx
- movb %cl, (%rdi,%r11)
- addq $1, %r11
- cmpq %r11, %rdx
- ja .L21
- ret
L(less_16):
testb $24, %dl
jne L(between_9_16)
--
2.5.0