public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed
* [PATCH] powerpc: Optimized POWER8 strlen
@ 2016-03-28 19:44 Carlos Eduardo Seo
  2016-04-04 15:12 ` Carlos Eduardo Seo
  2016-04-07 17:32 ` Tulio Magno Quites Machado Filho
  0 siblings, 2 replies; 3+ messages in thread
From: Carlos Eduardo Seo @ 2016-03-28 19:44 UTC (permalink / raw)
  To: GLIBC; +Cc: Tulio Magno Quites Machado Filho, Steve Munroe

[-- Attachment #1: Type: text/plain, Size: 359 bytes --]


Vectorized implementation of strlen for POWER8. This adds significant 
improvement for long strings (~3x). There will be a trade-off around the 
64-byte length due to the alignment checks required to jump into the 
vectorized loop.

Benchmark results are attached.

-- 
Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com

[-- Attachment #2: 0001-powerpc-Optimization-for-strlen-for-POWER8.patch --]
[-- Type: text/plain, Size: 14346 bytes --]

From 9a288b41a374f5ae555409b584ec00776b5e6771 Mon Sep 17 00:00:00 2001
From: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Date: Wed, 11 Nov 2015 17:31:28 -0200
Subject: [PATCH] powerpc: Optimization for strlen for POWER8.

This implementation takes advantage of vectorization to improve performance of
the loop over the current strlen implementation for POWER7.

2016-03-28  Carlos Eduardo Seo  <cseo@linux.vnet.ibm.com>

	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Added __strlen_power8.
	* sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c: Added
	__strlen_power8 entry.
	* sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S: New file.
	Implementation for POWER8.
	* sysdeps/powerpc/powerpc64/multiarch/strlen.c: Added IFUNC selector
	for __strlen_power8.
	* sysdeps/powerpc/powerpc64/power8/strlen.S: New file.
	Implementation for POWER8.
---
 sysdeps/powerpc/powerpc64/multiarch/Makefile       |   2 +-
 .../powerpc/powerpc64/multiarch/ifunc-impl-list.c  |   2 +
 .../powerpc/powerpc64/multiarch/strlen-power8.S    |  39 +++
 sysdeps/powerpc/powerpc64/multiarch/strlen.c       |   9 +-
 sysdeps/powerpc/powerpc64/power8/strlen.S          | 297 +++++++++++++++++++++
 5 files changed, 345 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
 create mode 100644 sysdeps/powerpc/powerpc64/power8/strlen.S

diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index 3b0e3a0..f160120 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -19,7 +19,7 @@ sysdep_routines += memcpy-power7 memcpy-a2 memcpy-power6 memcpy-cell \
 		   strcmp-power8 strcmp-power7 strcmp-ppc64 \
 		   strcat-power8 strcat-power7 strcat-ppc64 \
 		   memmove-power7 memmove-ppc64 wordcopy-ppc64 bcopy-ppc64 \
-		   strncpy-power8 strstr-power7 strstr-ppc64
+		   strncpy-power8 strstr-power7 strstr-ppc64 strlen-power8
 
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
 CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 11a8215..f1d44c7 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -101,6 +101,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c.  */
   IFUNC_IMPL (i, name, strlen,
+	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07,
+			      __strlen_power8)
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX,
 			      __strlen_power7)
 	      IFUNC_IMPL_ADD (array, i, strlen, 1,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
new file mode 100644
index 0000000..686dc3d
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power8.S
@@ -0,0 +1,39 @@
+/* Optimized strlen implementation for POWER8.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#undef EALIGN
+#define EALIGN(name, alignt, words)				\
+  .section ".text";						\
+  ENTRY_2(__strlen_power8)					\
+  .align ALIGNARG(alignt);					\
+  EALIGN_W_##words;						\
+  BODY_LABEL(__strlen_power8):					\
+  cfi_startproc;						\
+  LOCALENTRY(__strlen_power8)
+#undef END
+#define END(name)						\
+  cfi_endproc;							\
+  TRACEBACK(__strlen_power8)					\
+  END_2(__strlen_power8)
+
+#undef libc_hidden_builtin_def
+#define libc_hidden_builtin_def(name)
+
+#include <sysdeps/powerpc/powerpc64/power8/strlen.S>
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
index 94501fd..609a87e 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
@@ -29,11 +29,14 @@ extern __typeof (__redirect_strlen) __libc_strlen;
 
 extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
 
 libc_ifunc (__libc_strlen,
-            (hwcap & PPC_FEATURE_HAS_VSX)
-            ? __strlen_power7
-            : __strlen_ppc);
+	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
+	    ? __strlen_power8 :
+	      (hwcap & PPC_FEATURE_HAS_VSX)
+	      ? __strlen_power7
+	      : __strlen_ppc);
 
 #undef strlen
 strong_alias (__libc_strlen, strlen)
diff --git a/sysdeps/powerpc/powerpc64/power8/strlen.S b/sysdeps/powerpc/powerpc64/power8/strlen.S
new file mode 100644
index 0000000..0142747
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/power8/strlen.S
@@ -0,0 +1,297 @@
+/* Optimized strlen implementation for PowerPC64/POWER8 using a vectorized
+   loop.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* TODO: change these to the actual instructions when the minimum required
+   binutils allows it.  */
+#define MFVRD(r,v)	.long (0x7c000067 | ((v)<<(32-11)) | ((r)<<(32-16)))
+#define VBPERMQ(t,a,b)	.long (0x1000054c \
+			       | ((t)<<(32-11))	\
+			       | ((a)<<(32-16))	\
+			       | ((b)<<(32-21)) )
+
+/* int [r3] strlen (char *s [r3])  */
+
+/* TODO: change this to .machine power8 when the minimum required binutils
+   allows it.  */
+	.machine  power7
+EALIGN (strlen, 4, 0)
+	CALL_MCOUNT 1
+	dcbt	0,r3
+	clrrdi	r4,r3,3	      /* Align the address to doubleword boundary.  */
+	rlwinm	r6,r3,3,26,28 /* Calculate padding.  */
+	li	r0,0	      /* Doubleword with null chars to use
+				 with cmpb.  */
+	li	r5,-1	      /* MASK = 0xffffffffffffffff.  */
+	ld	r12,0(r4)     /* Load doubleword from memory.  */
+#ifdef __LITTLE_ENDIAN__
+	sld	r5,r5,r6
+#else
+	srd	r5,r5,r6      /* MASK = MASK >> padding.  */
+#endif
+	orc	r9,r12,r5     /* Mask bits that are not part of the string.  */
+	cmpb	r10,r9,r0     /* Check for null bytes in DWORD1.  */
+	cmpdi	cr7,r10,0     /* If r10 == 0, no null's have been found.  */
+	bne	cr7,L(done)
+
+	/* For shorter strings (< 64 bytes), we will not use vector registers,
+	   as the overhead isn't worth it.  So, let's use GPRs instead.  This
+	   will be done the same way as we do in the POWER7 implementation.
+	   Let's see if we are aligned to a quadword boundary.  If so, we can
+	   jump to the first (non-vectorized) loop.  Otherwise, we have to
+	   handle the next DWORD first.  */
+	mtcrf	0x01,r4
+	mr	r9,r4
+	addi	r9,r9,8
+	bt	28,L(align64)
+
+	/* Handle the next 8 bytes so we are aligned to a quadword
+	   boundary.  */
+	ldu	r5,8(r4)
+	cmpb	r10,r5,r0
+	cmpdi	cr7,r10,0
+	addi	r9,r9,8
+	bne	cr7,L(done)
+
+L(align64):
+	/* Proceed to the old (POWER7) implementation, checking two doublewords
+	   per iteration.  For the first 56 bytes, we will just check for null
+	   characters.  After that, we will also check if we are 64-byte aligned
+	   so we can jump to the vectorized implementation.  We will unroll
+	   these loops to avoid excessive branching.  */
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	/* Are we 64-byte aligned? If so, jump to the vectorized loop.
+	   Note: aligning to 64-byte will necessarily slow down performance for
+	   strings around 64 bytes in length due to the extra comparisons
+	   required to check alignment for the vectorized loop.  This is a
+	   necessary tradeoff we are willing to take in order to speed up the
+	   calculation for larger strings.  */
+	andi.	r10,r9,63
+	beq	cr0,L(preloop)
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	andi.	r10,r9,63
+	beq	cr0,L(preloop)
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	andi.	r10,r9,63
+	beq	cr0,L(preloop)
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+	bne	cr7,L(dword_zero)
+
+	andi.	r10,r9,63
+	beq	cr0,L(preloop)
+	ld	r6,8(r4)
+	ldu	r5,16(r4)
+	cmpb	r10,r6,r0
+	cmpb	r11,r5,r0
+	or	r5,r10,r11
+	cmpdi	cr7,r5,0
+	addi	r9,r9,16
+
+	/* At this point, we are necessarily 64-byte aligned.  If no zeroes were
+	   found, jump to the vectorized loop.  */
+	beq	cr7,L(preloop)
+
+L(dword_zero):
+	/* OK, one (or both) of the doublewords contains a null byte.  Check
+	   the first doubleword and decrement the address in case the first
+	   doubleword really contains a null byte.  */
+
+	cmpdi	cr6,r10,0
+	addi	r4,r4,-8
+	bne	cr6,L(done)
+
+	/* The null byte must be in the second doubleword.  Adjust the address
+	   again and move the result of cmpb to r10 so we can calculate the
+	   length.  */
+
+	mr	r10,r11
+	addi	r4,r4,8
+
+	/* If the null byte was found in the non-vectorized code, compute the
+	   final length.  r10 has the output of the cmpb instruction, that is,
+	   it contains 0xff in the same position as the null byte in the
+	   original doubleword from the string.  Use that to calculate the
+	   length.  */
+L(done):
+#ifdef __LITTLE_ENDIAN__
+	addi	r9, r10,-1    /* Form a mask from trailing zeros.  */
+	andc	r9, r9,r10
+	popcntd	r0, r9	      /* Count the bits in the mask.  */
+#else
+	cntlzd	r0,r10	      /* Count leading zeros before the match.  */
+#endif
+	subf	r5,r3,r4
+	srdi	r0,r0,3	      /* Convert leading/trailing zeros to bytes.  */
+	add	r3,r5,r0      /* Compute final length.  */
+	blr
+
+	/* Vectorized implementation starts here.  */
+	.p2align  4
+L(preloop):
+	/* Set up for the loop.  */
+	mr	r4,r9
+	li	r7, 16	      /* Load required offsets.  */
+	li	r8, 32
+	li	r9, 48
+	li	r12, 8
+	vxor	v0,v0,v0      /* VR with null chars to use with
+				 vcmpequb.  */
+
+	/* Main loop to look for the end of the string.  We will read in
+	   64-byte chunks.  Align it to 32 bytes and unroll it 3 times to
+	   leverage the icache performance.  */
+	.p2align  5
+L(loop):
+	lvx	  v1,r4,r0  /* Load 4 quadwords.  */
+	lvx	  v2,r4,r7
+	lvx	  v3,r4,r8
+	lvx	  v4,r4,r9
+	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
+	vminub	  v6,v3,v4
+	vminub	  v7,v5,v6
+	vcmpequb. v7,v7,v0  /* Check for NULLs.  */
+	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
+	bne	  cr6,L(vmx_zero)
+
+	lvx	  v1,r4,r0  /* Load 4 quadwords.  */
+	lvx	  v2,r4,r7
+	lvx	  v3,r4,r8
+	lvx	  v4,r4,r9
+	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
+	vminub	  v6,v3,v4
+	vminub	  v7,v5,v6
+	vcmpequb. v7,v7,v0  /* Check for NULLs.  */
+	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
+	bne	  cr6,L(vmx_zero)
+
+	lvx	  v1,r4,r0  /* Load 4 quadwords.  */
+	lvx	  v2,r4,r7
+	lvx	  v3,r4,r8
+	lvx	  v4,r4,r9
+	vminub	  v5,v1,v2  /* Compare and merge into one VR for speed.  */
+	vminub	  v6,v3,v4
+	vminub	  v7,v5,v6
+	vcmpequb. v7,v7,v0  /* Check for NULLs.  */
+	addi	  r4,r4,64  /* Adjust address for the next iteration.  */
+	beq	  cr6,L(loop)
+
+L(vmx_zero):
+	/* OK, we found a null byte.  Let's look for it in the current 64-byte
+	   block and mark it in its corresponding VR.  */
+	vcmpequb  v1,v1,v0
+	vcmpequb  v2,v2,v0
+	vcmpequb  v3,v3,v0
+	vcmpequb  v4,v4,v0
+
+	/* We will now 'compress' the result into a single doubleword, so it
+	   can be moved to a GPR for the final calculation.  First, we
+	   generate an appropriate mask for vbpermq, so we can permute bits into
+	   the first halfword.  */
+	vspltisb  v10,3
+	lvsl	  v11,r0,r0
+	vslb	  v10,v11,v10
+
+	/* Permute the first bit of each byte into bits 48-63.  */
+	VBPERMQ(v1,v1,v10)
+	VBPERMQ(v2,v2,v10)
+	VBPERMQ(v3,v3,v10)
+	VBPERMQ(v4,v4,v10)
+
+	/* Shift each component into its correct position for merging.  */
+#ifdef __LITTLE_ENDIAN__
+	vsldoi  v2,v2,v2,2
+	vsldoi  v3,v3,v3,4
+	vsldoi  v4,v4,v4,6
+#else
+	vsldoi	v1,v1,v1,6
+	vsldoi	v2,v2,v2,4
+	vsldoi	v3,v3,v3,2
+#endif
+
+	/* Merge the results and move to a GPR.  */
+	vor	v1,v2,v1
+	vor	v2,v3,v4
+	vor	v4,v1,v2
+	MFVRD(r10,v4)
+
+	/* Adjust address to the beginning of the current 64-byte block.  */
+	addi	r4,r4,-64
+
+#ifdef __LITTLE_ENDIAN__
+	addi	r9, r10,-1    /* Form a mask from trailing zeros.  */
+	andc	r9, r9,r10
+	popcntd	r0, r9	      /* Count the bits in the mask.  */
+#else
+	cntlzd	r0,r10	      /* Count leading zeros before the match.  */
+#endif
+	subf	r5,r3,r4
+	add	r3,r5,r0      /* Compute final length.  */
+	blr
+
+END (strlen)
+libc_hidden_builtin_def (strlen)
-- 
2.6.4 (Apple Git-63)


[-- Attachment #3: bench-strlen.txt --]
[-- Type: text/plain, Size: 3861 bytes --]

                    	simple_STRLEN	builtin_strlen	__strlen_power8	__strlen_power7	__strlen_ppc
Length    1, alignment  1:	2.70312	3.125	2.20312	2.25	2.32812
Length    1, alignment  0:	2.54688	3.17188	2.07812	2.15625	2.28125
Length    2, alignment  2:	2.57812	3.21875	2.07812	2.28125	2.28125
Length    2, alignment  0:	2.45312	3.20312	2.25	2.35938	2.28125
Length    3, alignment  3:	2.07812	3.14062	2.21875	2.23438	2.45312
Length    3, alignment  0:	1.98438	3.21875	2.10938	2.17188	2.28125
Length    4, alignment  4:	2.5625	3.20312	2.04688	2.34375	2.42188
Length    4, alignment  0:	2.39062	3.1875	1.90625	2.26562	2.32812
Length    5, alignment  5:	5.54688	3.34375	1.98438	2.32812	2.57812
Length    5, alignment  0:	5.4375	3.10938	2.35938	2.35938	2.34375
Length    6, alignment  6:	3.28125	3.14062	2.29688	2.15625	2.4375
Length    6, alignment  0:	2.98438	3.20312	2.35938	2.4375	2.26562
Length    7, alignment  7:	3.59375	3.17188	2.28125	2.29688	2.46875
Length    7, alignment  0:	3.375	3.14062	2.01562	2.25	2.28125
Length    4, alignment  0:	2.42188	3.09375	2.03125	2.21875	2.25
Length    4, alignment  7:	2.28125	3.21875	1.9375	2.45312	2.4375
Length    4, alignment  2:	2.17188	3.10938	2.26562	2.20312	2.21875
Length    2, alignment  2:	2.40625	3.07812	2.35938	2.23438	2.25
Length    8, alignment  0:	3.875	3.15625	2.125	2.35938	2.40625
Length    8, alignment  7:	3.625	3.1875	2.125	2.10938	2.375
Length    8, alignment  3:	3.67188	3.125	2.04688	2.29688	2.32812
Length    5, alignment  3:	5.45312	3.15625	2.3125	2.375	2.29688
Length   16, alignment  0:	6.78125	3.82812	2.32812	2.42188	2.78125
Length   16, alignment  7:	6.5625	3.84375	2.39062	2.32812	2.70312
Length   16, alignment  4:	6.54688	3.90625	2.375	2.20312	2.73438
Length   10, alignment  4:	4.6875	3.3125	1.95312	2.4375	2.32812
Length   32, alignment  0:	15.3281	4	2.57812	3.1875	3.34375
Length   32, alignment  7:	15.125	3.75	2.60938	3.0625	3.21875
Length   32, alignment  5:	15.1562	3.89062	2.60938	2.89062	3.20312
Length   21, alignment  5:	11.2031	3.53125	2.40625	2.59375	2.85938
Length   64, alignment  0:	26.8281	5.76562	4.625	3.53125	4.26562
Length   64, alignment  7:	26.8906	5.42188	4.59375	3.5	4.25
Length   64, alignment  6:	27.6406	5.51562	4.5	3.32812	4.07812
Length   42, alignment  6:	18.9375	4.4375	3.1875	3.15625	3.875
Length  128, alignment  0:	49.625	6.20312	4.96875	5.01562	6.5625
Length  128, alignment  7:	49.4531	5.875	4.85938	4.84375	6.21875
Length  128, alignment  7:	49.625	5.75	4.9375	4.8125	6.34375
Length   85, alignment  7:	34.3281	5.59375	4.57812	3.92188	4.75
Length  256, alignment  0:	95.0156	7.29688	6.14062	8.5625	12.6562
Length  256, alignment  7:	95.0156	6.78125	5.82812	7.84375	12.4531
Length  256, alignment  8:	95.1094	6.90625	5.64062	8	12.8281
Length  170, alignment  8:	64.6406	5.89062	4.625	6.21875	10.0938
Length  512, alignment  0:	186	9.14062	8.15625	18.0469	20.4219
Length  512, alignment  7:	185.812	8.53125	7.875	18.0625	20.4844
Length  512, alignment  9:	186.078	8.78125	7.92188	17.9375	20.5938
Length  341, alignment  9:	125.328	7.21875	6.10938	13.2031	15.0469
Length 1024, alignment  0:	367.984	11.7969	10.9844	30.1406	35.7344
Length 1024, alignment  7:	367.797	11.5625	10.5469	30.2344	35.5938
Length 1024, alignment 10:	367.906	11.5781	10.375	30.0156	35.6406
Length  682, alignment 10:	246.438	9.4375	7.85938	22.25	25.8125
Length 2048, alignment  0:	731.922	22.9531	21.4375	54.3281	66.0625
Length 2048, alignment  7:	731.953	22.7969	21.4219	54.1406	66.0781
Length 2048, alignment 11:	731.922	22.75	21.125	53.5781	66.1562
Length 1365, alignment 11:	489.266	18.2344	16.5312	38.2656	46.2969
Length 4096, alignment  0:	1459.77	36.9375	35.5	101.938	126.438
Length 4096, alignment  7:	1459.62	36.8125	35.4219	101.25	126.781
Length 4096, alignment 12:	1459.8	36.7188	34.4688	101.594	126.5
Length 2730, alignment 12:	974.219	27.1094	25.375	69.8594	86.5469

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] powerpc: Optimized POWER8 strlen
  2016-03-28 19:44 [PATCH] powerpc: Optimized POWER8 strlen Carlos Eduardo Seo
@ 2016-04-04 15:12 ` Carlos Eduardo Seo
  2016-04-07 17:32 ` Tulio Magno Quites Machado Filho
  1 sibling, 0 replies; 3+ messages in thread
From: Carlos Eduardo Seo @ 2016-04-04 15:12 UTC (permalink / raw)
  To: libc-alpha

Ping?

On 3/28/16 4:44 PM, Carlos Eduardo Seo wrote:
>
> Vectorized implementation of strlen for POWER8. This adds significant
> improvement for long strings (~3x). There will be a trade-off around the
> 64-byte length due to the alignment checks required to jump into the
> vectorized loop.
>
> Benchmark results are attached.
>

-- 
Carlos Eduardo Seo
Software Engineer - Linux on Power Toolchain
cseo@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] powerpc: Optimized POWER8 strlen
  2016-03-28 19:44 [PATCH] powerpc: Optimized POWER8 strlen Carlos Eduardo Seo
  2016-04-04 15:12 ` Carlos Eduardo Seo
@ 2016-04-07 17:32 ` Tulio Magno Quites Machado Filho
  1 sibling, 0 replies; 3+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2016-04-07 17:32 UTC (permalink / raw)
  To: Carlos Eduardo Seo, GLIBC; +Cc: Steve Munroe

Carlos Eduardo Seo <cseo@linux.vnet.ibm.com> writes:

> This implementation takes advantage of vectorization to improve performance of
> the loop over the current strlen implementation for POWER7.
>
> 2016-03-28  Carlos Eduardo Seo  <cseo@linux.vnet.ibm.com>
>
> 	* sysdeps/powerpc/powerpc64/multiarch/Makefile: Added __strlen_power8.

You have to mention the variable you changed:

	* sysdeps/powerpc/powerpc64/multiarch/Makefile (sysdep_routines): ...

LGTM with that change.

-- 
Tulio Magno

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-04-07 17:32 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-28 19:44 [PATCH] powerpc: Optimized POWER8 strlen Carlos Eduardo Seo
2016-04-04 15:12 ` Carlos Eduardo Seo
2016-04-07 17:32 ` Tulio Magno Quites Machado Filho

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).