* [PATCH] powerpc: Optimized POWER8 strlen
@ 2016-03-28 19:44 Carlos Eduardo Seo
From: Carlos Eduardo Seo @ 2016-03-28 19:44 UTC (permalink / raw)
To: GLIBC; +Cc: Tulio Magno Quites Machado Filho, Steve Munroe
[-- Attachment #1: Type: text/plain, Size: 359 bytes --]
Vectorized implementation of strlen for POWER8. This adds significant
improvement for long strings (~3x). There will be a trade-off around the
64-byte length due to the alignment checks required to jump into the
vectorized loop.
Benchmark results are attached.
--
Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com
[-- Attachment #2: 0001-powerpc-Optimization-for-strlen-for-POWER8.patch --]
[-- Type: text/plain, Size: 14346 bytes --]
From 9a288b41a374f5ae555409b584ec00776b5e6771 Mon Sep 17 00:00:00 2001
From: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Date: Wed, 11 Nov 2015 17:31:28 -0200
Subject: [PATCH] powerpc: Optimization for strlen for POWER8.
This implementation takes advantage of vectorization to improve performance of
the loop over the current strlen implementation for POWER7.
2016-03-28 Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
* sysdeps/powerpc/powerpc64/multiarch/Makefile: Added __strlen_power8.
* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Added
__strlen_power8 entry.
* sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S: New file.
Implementation for POWER8.
* sysdeps/powerpc/powerpc64/multiarch/strlen.c: Added IFUNC selector
for __strlen_power8.
* sysdeps/powerpc/powerpc64/power8/strlen.S: New file.
Implementation for POWER8.
---
sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +-
.../powerpc/powerpc64/multiarch/ifunc-impl-list.c | 2 +
.../powerpc/powerpc64/multiarch/strlen-power8.S | 39 +++
sysdeps/powerpc/powerpc64/multiarch/strlen.c | 9 +-
sysdeps/powerpc/powerpc64/power8/strlen.S | 297 +++++++++++++++++++++
5 files changed, 345 insertions(+), 4 deletions(-)
create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
create mode 100644 sysdeps/powerpc/powerpc64/power8/strlen.S
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 3b0e3a0..f160120 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -19,7 +19,7 @@ sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
strcmp-power8 strcmp-power7 strcmp-ppc64 \
strcat-power8 strcat-power7 strcat-ppc64 \
memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
- strncpy-power8 strstr-power7 strstr-ppc64
+ strncpy-power8 strstr-power7 strstr-ppc64 strlen-power8
CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 11a8215..f1d44c7 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -101,6 +101,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
/* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c. */
IFUNC_IMPL (i, name, strlen,
+ IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07,
+ __strlen_power8)
IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX,
__strlen_power7)
IFUNC_IMPL_ADD (array, i, strlen, 1,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
new file mode 100644
index 0000000..686dc3d
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
@@ -0,0 +1,39 @@
+/* Optimized strlen implementation for POWER8.
+ Copyright (C) 2016 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+#undef EALIGN
+#define EALIGN(name, alignt, words) \
+ .section ".text"; \
+ ENTRY_2(__strlen_power8) \
+ .align ALIGNARG(alignt); \
+ EALIGN_W_##words; \
+ BODY_LABEL(__strlen_power8): \
+ cfi_startproc; \
+ LOCALENTRY(__strlen_power8)
+#undef END
+#define END(name) \
+ cfi_endproc; \
+ TRACEBACK(__strlen_power8) \
+ END_2(__strlen_power8)
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include <sysdeps/powerpc/powerpc64/power8/strlen.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
index 94501fd..609a87e 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
@@ -29,11 +29,14 @@ extern __typeof (__redirect_strlen) __libc_strlen;
extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
libc_ifunc (__libc_strlen,
- (hwcap & PPC_FEATURE_HAS_VSX)
- ? __strlen_power7
- : __strlen_ppc);
+ (hwcap2 & PPC_FEATURE2_ARCH_2_07)
+ ? __strlen_power8 :
+ (hwcap & PPC_FEATURE_HAS_VSX)
+ ? __strlen_power7
+ : __strlen_ppc);
#undef strlen
strong_alias (__libc_strlen, strlen)
diff --git a/sysdeps/powerpc/powerpc64/power8/strlen.S b/sysdeps/powerpc/powerpc64/power8/strlen.S
new file mode 100644
index 0000000..0142747
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power8/strlen.S
@@ -0,0 +1,297 @@
+/* Optimized strlen implementation for PowerPC64/POWER8 using a vectorized
+ loop.
+ Copyright (C) 2016 Free Software Foundation, Inc.
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library; if not, see
+ <http://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+/* TODO: change these to the actual instructions when the minimum required
+ binutils allows it. */
+#define MFVRD(r,v) .long (0x7c000067 | ((v)<<(32-11)) | ((r)<<(32-16)))
+#define VBPERMQ(t,a,b) .long (0x1000054c \
+ | ((t)<<(32-11)) \
+ | ((a)<<(32-16)) \
+ | ((b)<<(32-21)) )
+
+/* int [r3] strlen (char *s [r3]) */
+
+/* TODO: change this to .machine power8 when the minimum required binutils
+ allows it. */
+ .machine power7
+EALIGN (strlen, 4, 0)
+ CALL_MCOUNT 1
+ dcbt 0,r3
+ clrrdi r4,r3,3 /* Align the address to doubleword boundary. */
+ rlwinm r6,r3,3,26,28 /* Calculate padding. */
+ li r0,0 /* Doubleword with null chars to use
+ with cmpb. */
+ li r5,-1 /* MASK = 0xffffffffffffffff. */
+ ld r12,0(r4) /* Load doubleword from memory. */
+#ifdef __LITTLE_ENDIAN__
+ sld r5,r5,r6
+#else
+ srd r5,r5,r6 /* MASK = MASK >> padding. */
+#endif
+ orc r9,r12,r5 /* Mask bits that are not part of the string. */
+ cmpb r10,r9,r0 /* Check for null bytes in DWORD1. */
+ cmpdi cr7,r10,0 /* If r10 == 0, no nulls have been found. */
+ bne cr7,L(done)
+
+ /* For shorter strings (< 64 bytes), we will not use vector registers,
+ as the overhead isn't worth it. So, let's use GPRs instead. This
+ will be done the same way as we do in the POWER7 implementation.
+ Let's see if we are aligned to a quadword boundary. If so, we can
+ jump to the first (non-vectorized) loop. Otherwise, we have to
+ handle the next DWORD first. */
+ mtcrf 0x01,r4
+ mr r9,r4
+ addi r9,r9,8
+ bt 28,L(align64)
+
+ /* Handle the next 8 bytes so we are aligned to a quadword
+ boundary. */
+ ldu r5,8(r4)
+ cmpb r10,r5,r0
+ cmpdi cr7,r10,0
+ addi r9,r9,8
+ bne cr7,L(done)
+
+L(align64):
+ /* Proceed to the old (POWER7) implementation, checking two doublewords
+ per iteration. For the first 56 bytes, we will just check for null
+ characters. After that, we will also check if we are 64-byte aligned
+ so we can jump to the vectorized implementation. We will unroll
+ these loops to avoid excessive branching. */
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ /* Are we 64-byte aligned? If so, jump to the vectorized loop.
+ Note: aligning to 64 bytes will necessarily slow down performance for
+ strings around 64 bytes in length due to the extra comparisons
+ required to check alignment for the vectorized loop. This is a
+ necessary tradeoff we are willing to take in order to speed up the
+ calculation for larger strings. */
+ andi. r10,r9,63
+ beq cr0,L(preloop)
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ andi. r10,r9,63
+ beq cr0,L(preloop)
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ andi. r10,r9,63
+ beq cr0,L(preloop)
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+ bne cr7,L(dword_zero)
+
+ andi. r10,r9,63
+ beq cr0,L(preloop)
+ ld r6,8(r4)
+ ldu r5,16(r4)
+ cmpb r10,r6,r0
+ cmpb r11,r5,r0
+ or r5,r10,r11
+ cmpdi cr7,r5,0
+ addi r9,r9,16
+
+ /* At this point, we are necessarily 64-byte aligned. If no zeroes were
+ found, jump to the vectorized loop. */
+ beq cr7,L(preloop)
+
+L(dword_zero):
+ /* OK, one (or both) of the doublewords contains a null byte. Check
+ the first doubleword and decrement the address in case the first
+ doubleword really contains a null byte. */
+
+ cmpdi cr6,r10,0
+ addi r4,r4,-8
+ bne cr6,L(done)
+
+ /* The null byte must be in the second doubleword. Adjust the address
+ again and move the result of cmpb to r10 so we can calculate the
+ length. */
+
+ mr r10,r11
+ addi r4,r4,8
+
+ /* If the null byte was found in the non-vectorized code, compute the
+ final length. r10 has the output of the cmpb instruction, that is,
+ it contains 0xff in the same position as the null byte in the
+ original doubleword from the string. Use that to calculate the
+ length. */
+L(done):
+#ifdef __LITTLE_ENDIAN__
+ addi r9, r10,-1 /* Form a mask from trailing zeros. */
+ andc r9, r9,r10
+ popcntd r0, r9 /* Count the bits in the mask. */
+#else
+ cntlzd r0,r10 /* Count leading zeros before the match. */
+#endif
+ subf r5,r3,r4
+ srdi r0,r0,3 /* Convert leading/trailing zeros to bytes. */
+ add r3,r5,r0 /* Compute final length. */
+ blr
+
+ /* Vectorized implementation starts here. */
+ .p2align 4
+L(preloop):
+ /* Set up for the loop. */
+ mr r4,r9
+ li r7, 16 /* Load required offsets. */
+ li r8, 32
+ li r9, 48
+ li r12, 8
+ vxor v0,v0,v0 /* VR with null chars to use with
+ vcmpequb. */
+
+ /* Main loop to look for the end of the string. We will read in
+ 64-byte chunks. Align it to 32 bytes and unroll it 3 times to
+ leverage the icache performance. */
+ .p2align 5
+L(loop):
+ lvx v1,r4,r0 /* Load 4 quadwords. */
+ lvx v2,r4,r7
+ lvx v3,r4,r8
+ lvx v4,r4,r9
+ vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
+ vminub v6,v3,v4
+ vminub v7,v5,v6
+ vcmpequb. v7,v7,v0 /* Check for NULLs. */
+ addi r4,r4,64 /* Adjust address for the next iteration. */
+ bne cr6,L(vmx_zero)
+
+ lvx v1,r4,r0 /* Load 4 quadwords. */
+ lvx v2,r4,r7
+ lvx v3,r4,r8
+ lvx v4,r4,r9
+ vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
+ vminub v6,v3,v4
+ vminub v7,v5,v6
+ vcmpequb. v7,v7,v0 /* Check for NULLs. */
+ addi r4,r4,64 /* Adjust address for the next iteration. */
+ bne cr6,L(vmx_zero)
+
+ lvx v1,r4,r0 /* Load 4 quadwords. */
+ lvx v2,r4,r7
+ lvx v3,r4,r8
+ lvx v4,r4,r9
+ vminub v5,v1,v2 /* Compare and merge into one VR for speed. */
+ vminub v6,v3,v4
+ vminub v7,v5,v6
+ vcmpequb. v7,v7,v0 /* Check for NULLs. */
+ addi r4,r4,64 /* Adjust address for the next iteration. */
+ beq cr6,L(loop)
+
+L(vmx_zero):
+ /* OK, we found a null byte. Let's look for it in the current 64-byte
+ block and mark it in its corresponding VR. */
+ vcmpequb v1,v1,v0
+ vcmpequb v2,v2,v0
+ vcmpequb v3,v3,v0
+ vcmpequb v4,v4,v0
+
+ /* We will now 'compress' the result into a single doubleword, so it
+ can be moved to a GPR for the final calculation. First, we
+ generate an appropriate mask for vbpermq, so we can permute bits into
+ the first halfword. */
+ vspltisb v10,3
+ lvsl v11,r0,r0
+ vslb v10,v11,v10
+
+ /* Permute the first bit of each byte into bits 48-63. */
+ VBPERMQ(v1,v1,v10)
+ VBPERMQ(v2,v2,v10)
+ VBPERMQ(v3,v3,v10)
+ VBPERMQ(v4,v4,v10)
+
+ /* Shift each component into its correct position for merging. */
+#ifdef __LITTLE_ENDIAN__
+ vsldoi v2,v2,v2,2
+ vsldoi v3,v3,v3,4
+ vsldoi v4,v4,v4,6
+#else
+ vsldoi v1,v1,v1,6
+ vsldoi v2,v2,v2,4
+ vsldoi v3,v3,v3,2
+#endif
+
+ /* Merge the results and move to a GPR. */
+ vor v1,v2,v1
+ vor v2,v3,v4
+ vor v4,v1,v2
+ MFVRD(r10,v4)
+
+ /* Adjust address to the beginning of the current 64-byte block. */
+ addi r4,r4,-64
+
+#ifdef __LITTLE_ENDIAN__
+ addi r9, r10,-1 /* Form a mask from trailing zeros. */
+ andc r9, r9,r10
+ popcntd r0, r9 /* Count the bits in the mask. */
+#else
+ cntlzd r0,r10 /* Count leading zeros before the match. */
+#endif
+ subf r5,r3,r4
+ add r3,r5,r0 /* Compute final length. */
+ blr
+
+END (strlen)
+libc_hidden_builtin_def (strlen)
--
2.6.4 (Apple Git-63)
[-- Attachment #3: bench-strlen.txt --]
[-- Type: text/plain, Size: 3861 bytes --]
simple_STRLEN builtin_strlen __strlen_power8 __strlen_power7 __strlen_ppc
Length 1, alignment 1: 2.70312 3.125 2.20312 2.25 2.32812
Length 1, alignment 0: 2.54688 3.17188 2.07812 2.15625 2.28125
Length 2, alignment 2: 2.57812 3.21875 2.07812 2.28125 2.28125
Length 2, alignment 0: 2.45312 3.20312 2.25 2.35938 2.28125
Length 3, alignment 3: 2.07812 3.14062 2.21875 2.23438 2.45312
Length 3, alignment 0: 1.98438 3.21875 2.10938 2.17188 2.28125
Length 4, alignment 4: 2.5625 3.20312 2.04688 2.34375 2.42188
Length 4, alignment 0: 2.39062 3.1875 1.90625 2.26562 2.32812
Length 5, alignment 5: 5.54688 3.34375 1.98438 2.32812 2.57812
Length 5, alignment 0: 5.4375 3.10938 2.35938 2.35938 2.34375
Length 6, alignment 6: 3.28125 3.14062 2.29688 2.15625 2.4375
Length 6, alignment 0: 2.98438 3.20312 2.35938 2.4375 2.26562
Length 7, alignment 7: 3.59375 3.17188 2.28125 2.29688 2.46875
Length 7, alignment 0: 3.375 3.14062 2.01562 2.25 2.28125
Length 4, alignment 0: 2.42188 3.09375 2.03125 2.21875 2.25
Length 4, alignment 7: 2.28125 3.21875 1.9375 2.45312 2.4375
Length 4, alignment 2: 2.17188 3.10938 2.26562 2.20312 2.21875
Length 2, alignment 2: 2.40625 3.07812 2.35938 2.23438 2.25
Length 8, alignment 0: 3.875 3.15625 2.125 2.35938 2.40625
Length 8, alignment 7: 3.625 3.1875 2.125 2.10938 2.375
Length 8, alignment 3: 3.67188 3.125 2.04688 2.29688 2.32812
Length 5, alignment 3: 5.45312 3.15625 2.3125 2.375 2.29688
Length 16, alignment 0: 6.78125 3.82812 2.32812 2.42188 2.78125
Length 16, alignment 7: 6.5625 3.84375 2.39062 2.32812 2.70312
Length 16, alignment 4: 6.54688 3.90625 2.375 2.20312 2.73438
Length 10, alignment 4: 4.6875 3.3125 1.95312 2.4375 2.32812
Length 32, alignment 0: 15.3281 4 2.57812 3.1875 3.34375
Length 32, alignment 7: 15.125 3.75 2.60938 3.0625 3.21875
Length 32, alignment 5: 15.1562 3.89062 2.60938 2.89062 3.20312
Length 21, alignment 5: 11.2031 3.53125 2.40625 2.59375 2.85938
Length 64, alignment 0: 26.8281 5.76562 4.625 3.53125 4.26562
Length 64, alignment 7: 26.8906 5.42188 4.59375 3.5 4.25
Length 64, alignment 6: 27.6406 5.51562 4.5 3.32812 4.07812
Length 42, alignment 6: 18.9375 4.4375 3.1875 3.15625 3.875
Length 128, alignment 0: 49.625 6.20312 4.96875 5.01562 6.5625
Length 128, alignment 7: 49.4531 5.875 4.85938 4.84375 6.21875
Length 128, alignment 7: 49.625 5.75 4.9375 4.8125 6.34375
Length 85, alignment 7: 34.3281 5.59375 4.57812 3.92188 4.75
Length 256, alignment 0: 95.0156 7.29688 6.14062 8.5625 12.6562
Length 256, alignment 7: 95.0156 6.78125 5.82812 7.84375 12.4531
Length 256, alignment 8: 95.1094 6.90625 5.64062 8 12.8281
Length 170, alignment 8: 64.6406 5.89062 4.625 6.21875 10.0938
Length 512, alignment 0: 186 9.14062 8.15625 18.0469 20.4219
Length 512, alignment 7: 185.812 8.53125 7.875 18.0625 20.4844
Length 512, alignment 9: 186.078 8.78125 7.92188 17.9375 20.5938
Length 341, alignment 9: 125.328 7.21875 6.10938 13.2031 15.0469
Length 1024, alignment 0: 367.984 11.7969 10.9844 30.1406 35.7344
Length 1024, alignment 7: 367.797 11.5625 10.5469 30.2344 35.5938
Length 1024, alignment 10: 367.906 11.5781 10.375 30.0156 35.6406
Length 682, alignment 10: 246.438 9.4375 7.85938 22.25 25.8125
Length 2048, alignment 0: 731.922 22.9531 21.4375 54.3281 66.0625
Length 2048, alignment 7: 731.953 22.7969 21.4219 54.1406 66.0781
Length 2048, alignment 11: 731.922 22.75 21.125 53.5781 66.1562
Length 1365, alignment 11: 489.266 18.2344 16.5312 38.2656 46.2969
Length 4096, alignment 0: 1459.77 36.9375 35.5 101.938 126.438
Length 4096, alignment 7: 1459.62 36.8125 35.4219 101.25 126.781
Length 4096, alignment 12: 1459.8 36.7188 34.4688 101.594 126.5
Length 2730, alignment 12: 974.219 27.1094 25.375 69.8594 86.5469
* Re: [PATCH] powerpc: Optimized POWER8 strlen
From: Carlos Eduardo Seo @ 2016-04-04 15:12 UTC (permalink / raw)
To: libc-alpha
Ping?
On 3/28/16 4:44 PM, Carlos Eduardo Seo wrote:
>
> Vectorized implementation of strlen for POWER8. This adds significant
> improvement for long strings (~3x). There will be a trade-off around the
> 64-byte length due to the alignment checks required to jump into the
> vectorized loop.
>
> Benchmark results are attached.
>
--
Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com
* Re: [PATCH] powerpc: Optimized POWER8 strlen
From: Tulio Magno Quites Machado Filho @ 2016-04-07 17:32 UTC (permalink / raw)
To: Carlos Eduardo Seo, GLIBC; +Cc: Steve Munroe
Carlos Eduardo Seo <cseo@linux.vnet.ibm.com> writes:
> This implementation takes advantage of vectorization to improve performance of
> the loop over the current strlen implementation for POWER7.
>
> 2016-03-28 Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
>
> * sysdeps/powerpc/powerpc64/multiarch/Makefile: Added __strlen_power8.
You have to mention the variable you changed:
* sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): ...
LGTM with that change.
--
Tulio Magno