* [PATCH] powerpc64le: add optimized strlen for P9 @ 2020-05-21 19:10 Paul E. Murphy 2020-05-21 20:41 ` Lucas A. M. Magalhaes 2020-05-27 16:45 ` Paul A. Clarke 0 siblings, 2 replies; 7+ messages in thread From: Paul E. Murphy @ 2020-05-21 19:10 UTC (permalink / raw) To: libc-alpha, anton This is a followup to rawmemchr/strlen from Anton. I missed his original strlen patch, and likewise I wasn't happy with the 3-4% performance drop for larger strings which occurs around 2.5kB as the P8 vector loop is a bit faster. As noted, this is up to 50% faster for small strings, and about 1% faster for larger strings (I hazard to guess this some uarch difference between lxv and lvx). I guess this is a semi-V2 of the patch. Likewise, I need to double check binutils 2.26 supports the P9 insn used here. ---8<--- This started as a trivial change to Anton's rawmemchr. I got carried away. This is a hybrid between P8's asympotically faster 64B checks with extremely efficient small string checks e.g <64B (and sometimes a little bit more depending on alignment). The second trick is to align to 64B by running a 48B checking loop 16B at a time until we naturally align to 64B (i.e checking 48/96/144 bytes/iteration based on the alignment after the first 5 comparisons). This allieviates the need to check page boundaries. Finally, explicly use the P7 strlen with the runtime loader when building P9. We need to be cautious about vector/vsx extensions here on P9 only builds. --- .../powerpc/powerpc64/le/power9/rtld-strlen.S | 1 + sysdeps/powerpc/powerpc64/le/power9/strlen.S | 215 ++++++++++++++++++ sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +- .../powerpc64/multiarch/ifunc-impl-list.c | 4 + .../powerpc64/multiarch/strlen-power9.S | 2 + sysdeps/powerpc/powerpc64/multiarch/strlen.c | 5 + 6 files changed, 228 insertions(+), 1 deletion(-) create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S new file mode 100644 index 0000000000..e9d83323ac --- /dev/null +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S @@ -0,0 +1 @@ +#include <sysdeps/powerpc/powerpc64/power7/strlen.S> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S new file mode 100644 index 0000000000..084d6e31a8 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S @@ -0,0 +1,215 @@ + +/* Optimized rawmemchr implementation for PowerPC64/POWER9. + Copyright (C) 2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> + +#ifndef STRLEN +# define STRLEN __strlen +# define DEFINE_STRLEN_HIDDEN_DEF 1 +#endif + +/* Implements the function + + int [r3] strlen (void *s [r3]) + + The implementation can load bytes past a matching byte, but only + up to the next 16B or 64B boundary, so it never crosses a page. */ + +.machine power9 +ENTRY_TOCLESS (STRLEN, 4) + CALL_MCOUNT 2 + + mr r4, r3 + vspltisb v18, 0 + vspltisb v19, -1 + + neg r5,r3 + rldicl r9,r5,0,60 /* How many bytes to get source 16B aligned? */ + + + /* Align data and fill bytes not loaded with non matching char */ + lvx v0,0,r4 + lvsr v1,0,r4 + vperm v0,v19,v0,v1 + + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + beq cr6,L(aligned) + + vctzlsbb r3,v6 + blr + + /* Test 64B 16B at a time. The vector loop is costly for small strings. */ +L(aligned): + add r4,r4,r9 + + rldicl. r5, r4, 60, 62 /* Determine how many 48B loops we should run */ + + lxv v0+32,0(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail1) + + lxv v0+32,16(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail2) + + lxv v0+32,32(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail3) + + lxv v0+32,48(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail4) + addi r4, r4, 64 + + /* prep for weird constant generation of reduction */ + li r0, 0 + + /* Skip the alignment if not needed */ + beq L(loop_64b) + mtctr r5 + + /* Test 48B per iteration until 64B aligned */ + .p2align 5 +L(loop): + lxv v0+32,0(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail1) + + lxv v0+32,16(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail2) + + lxv v0+32,32(r4) + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ + bne cr6,L(tail3) + + addi r4,r4,48 + bdnz L(loop) + + .p2align 5 +L(loop_64b): + lxv v1+32, 0(r4) /* Load 4 quadwords. */ + lxv v2+32, 16(r4) + lxv v3+32, 32(r4) + lxv v4+32, 48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + bne cr6,L(vmx_zero) + + lxv v1+32, 0(r4) /* Load 4 quadwords. */ + lxv v2+32, 16(r4) + lxv v3+32, 32(r4) + lxv v4+32, 48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + bne cr6,L(vmx_zero) + + lxv v1+32, 0(r4) /* Load 4 quadwords. */ + lxv v2+32, 16(r4) + lxv v3+32, 32(r4) + lxv v4+32, 48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + beq cr6,L(loop_64b) + +L(vmx_zero): + /* OK, we found a null byte. Let's look for it in the current 64-byte + block and mark it in its corresponding VR. */ + vcmpequb v1,v1,v18 + vcmpequb v2,v2,v18 + vcmpequb v3,v3,v18 + vcmpequb v4,v4,v18 + + /* We will now 'compress' the result into a single doubleword, so it + can be moved to a GPR for the final calculation. First, we + generate an appropriate mask for vbpermq, so we can permute bits into + the first halfword. */ + vspltisb v10,3 + lvsl v11,r0,r0 + vslb v10,v11,v10 + + /* Permute the first bit of each byte into bits 48-63. 
*/ + vbpermq v1,v1,v10 + vbpermq v2,v2,v10 + vbpermq v3,v3,v10 + vbpermq v4,v4,v10 + + /* Shift each component into its correct position for merging. */ + vsldoi v2,v2,v2,2 + vsldoi v3,v3,v3,4 + vsldoi v4,v4,v4,6 + + /* Merge the results and move to a GPR. */ + vor v1,v2,v1 + vor v2,v3,v4 + vor v4,v1,v2 + mfvrd r10,v4 + + /* Adjust address to the begninning of the current 64-byte block. */ + addi r4,r4,-64 + + addi r9, r10,-1 /* Form a mask from trailing zeros. */ + andc r9, r9,r10 + popcntd r0, r9 /* Count the bits in the mask. */ + subf r5,r3,r4 + add r3,r5,r0 /* Compute final length. */ + blr + +L(tail1): + vctzlsbb r0,v6 + add r4,r4,r0 + subf r3,r3,r4 + blr + +L(tail2): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,16 + subf r3,r3,r4 + blr + +L(tail3): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,32 + subf r3,r3,r4 + blr + +L(tail4): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,48 + subf r3,r3,r4 + blr + +END (STRLEN) + +#ifdef DEFINE_STRLEN_HIDDEN_DEF +weak_alias (__strlen, strlen) +libc_hidden_builtin_def (strlen) +#endif diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile index fc2268f6b5..19acb6c64a 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile @@ -33,7 +33,7 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \ ifneq (,$(filter %le,$(config-machine))) sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \ - rawmemchr-power9 + rawmemchr-power9 strlen-power9 endif CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c index 59a227ee22..ea10b00417 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c @@ -111,6 +111,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c. */ IFUNC_IMPL (i, name, strlen, +#ifdef __LITTLE_ENDIAN__ + IFUNC_IMPL_ADD (array, i, strcpy, hwcap2 & PPC_FEATURE2_ARCH_3_00, + __strlen_power9) +#endif IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07, __strlen_power8) IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX, diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S new file mode 100644 index 0000000000..68c8d54b5f --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S @@ -0,0 +1,2 @@ +#define STRLEN __strlen_power9 +#include <sysdeps/powerpc/powerpc64/le/power9/strlen.S> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c index e587554221..cd9dc78a7c 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c @@ -30,8 +30,13 @@ extern __typeof (__redirect_strlen) __libc_strlen; extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden; extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden; extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden; +extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden; libc_ifunc (__libc_strlen, +# ifdef __LITTLE_ENDIAN__ + (hwcap2 & PPC_FEATURE2_ARCH_3_00) + ? __strlen_power9 : +# endif (hwcap2 & PPC_FEATURE2_ARCH_2_07) ? 
__strlen_power8 : (hwcap & PPC_FEATURE_HAS_VSX) -- 2.26.2 ^ permalink raw reply [flat|nested] 7+ messages in thread
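For readers following along, the control flow described in the patch above boils down to the following rough scalar C model. It is illustrative only: the names first_zero_16 and strlen_p9_model are invented here, and each 16B check stands in for one lxv/vcmpequb./vctzlsbb sequence of the real assembly.

#include <stddef.h>
#include <stdint.h>

/* Index of the first zero byte in the 16B block at P, or 16 if none.
   Stands in for one lxv/vcmpequb./vctzlsbb sequence.  */
static size_t
first_zero_16 (const unsigned char *p)
{
  for (size_t i = 0; i < 16; i++)
    if (p[i] == 0)
      return i;
  return 16;
}

size_t
strlen_p9_model (const char *s)
{
  const unsigned char *start = (const unsigned char *) s;
  const unsigned char *p = start;
  size_t off;

  /* Head: the assembly covers the bytes from S to the next 16B boundary
     with a single masked compare (lvx/lvsr/vperm fill the bytes before
     the string with 0xff); modeled here as a simple byte loop.  */
  while ((uintptr_t) p & 15)
    {
      if (*p == '\0')
        return (size_t) (p - start);
      p++;
    }

  /* Number of 48B iterations needed to reach 64B alignment
     (rldicl. r5,r4,60,62 in the assembly).  */
  size_t m = ((uintptr_t) p >> 4) & 3;

  /* Check the next 64B, 16B at a time (the four unrolled compares).  */
  for (int k = 0; k < 4; k++, p += 16)
    if ((off = first_zero_16 (p)) < 16)
      return (size_t) (p + off - start);

  /* 48B per iteration until 64B aligned: 16*M + 48*M is a multiple of
     64, so M iterations land exactly on a 64B boundary.  */
  while (m--)
    for (int k = 0; k < 3; k++, p += 16)
      if ((off = first_zero_16 (p)) < 16)
        return (size_t) (p + off - start);

  /* L(loop_64b): 64B per iteration, 64B aligned, so the vector loads
     never cross a page boundary.  */
  for (;;)
    for (int k = 0; k < 4; k++, p += 16)
      if ((off = first_zero_16 (p)) < 16)
        return (size_t) (p + off - start);
}

The design point is visible in the arithmetic: because the 48B loop runs ((p >> 4) & 3) times from a 16B-aligned address, it always lands on a 64B boundary, which is what lets the main loop omit page-boundary checks.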
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-05-21 19:10 [PATCH] powerpc64le: add optimized strlen for P9 Paul E. Murphy @ 2020-05-21 20:41 ` Lucas A. M. Magalhaes 2020-05-27 16:45 ` Paul A. Clarke 1 sibling, 0 replies; 7+ messages in thread From: Lucas A. M. Magalhaes @ 2020-05-21 20:41 UTC (permalink / raw) To: Paul E. Murphy, anton, libc-alpha Quoting Paul E. Murphy via Libc-alpha (2020-05-21 16:10:48) > This is a followup to rawmemchr/strlen from Anton. I missed > his original strlen patch, and likewise I wasn't happy with > the 3-4% performance drop for larger strings which occurs > around 2.5kB as the P8 vector loop is a bit faster. As noted, > this is up to 50% faster for small strings, and about 1% faster > for larger strings (I hazard to guess this some uarch difference > between lxv and lvx). > > I guess this is a semi-V2 of the patch. Likewise, I need to > double check binutils 2.26 supports the P9 insn used here. > > ---8<--- > > This started as a trivial change to Anton's rawmemchr. I got > carried away. This is a hybrid between P8's asympotically > faster 64B checks with extremely efficient small string checks > e.g <64B (and sometimes a little bit more depending on alignment). > > The second trick is to align to 64B by running a 48B checking loop > 16B at a time until we naturally align to 64B (i.e checking 48/96/144 > bytes/iteration based on the alignment after the first 5 comparisons). > This allieviates the need to check page boundaries. > > Finally, explicly use the P7 strlen with the runtime loader when building > P9. We need to be cautious about vector/vsx extensions here on P9 only > builds. > --- > .../powerpc/powerpc64/le/power9/rtld-strlen.S | 1 + > sysdeps/powerpc/powerpc64/le/power9/strlen.S | 215 ++++++++++++++++++ > sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +- > .../powerpc64/multiarch/ifunc-impl-list.c | 4 + > .../powerpc64/multiarch/strlen-power9.S | 2 + > sysdeps/powerpc/powerpc64/multiarch/strlen.c | 5 + > 6 files changed, 228 insertions(+), 1 deletion(-) > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S > > diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > new file mode 100644 > index 0000000000..e9d83323ac > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > @@ -0,0 +1 @@ > +#include <sysdeps/powerpc/powerpc64/power7/strlen.S> > diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > new file mode 100644 > index 0000000000..084d6e31a8 > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > @@ -0,0 +1,215 @@ > + > +/* Optimized rawmemchr implementation for PowerPC64/POWER9. s/rawmemchr/strlen Still trying to understand the rest of the patch though. =) --- Lucas A. M. Magalhães ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-05-21 19:10 [PATCH] powerpc64le: add optimized strlen for P9 Paul E. Murphy 2020-05-21 20:41 ` Lucas A. M. Magalhaes @ 2020-05-27 16:45 ` Paul A. Clarke 2020-05-29 16:26 ` Paul E Murphy 1 sibling, 1 reply; 7+ messages in thread From: Paul A. Clarke @ 2020-05-27 16:45 UTC (permalink / raw) To: Paul E. Murphy; +Cc: libc-alpha, anton On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote: > This is a followup to rawmemchr/strlen from Anton. I missed > his original strlen patch, and likewise I wasn't happy with > the 3-4% performance drop for larger strings which occurs > around 2.5kB as the P8 vector loop is a bit faster. As noted, > this is up to 50% faster for small strings, and about 1% faster > for larger strings (I hazard to guess this some uarch difference > between lxv and lvx). > > I guess this is a semi-V2 of the patch. Likewise, I need to > double check binutils 2.26 supports the P9 insn used here. > > ---8<--- > > This started as a trivial change to Anton's rawmemchr. I got > carried away. This is a hybrid between P8's asympotically > faster 64B checks with extremely efficient small string checks > e.g <64B (and sometimes a little bit more depending on alignment). > > The second trick is to align to 64B by running a 48B checking loop > 16B at a time until we naturally align to 64B (i.e checking 48/96/144 > bytes/iteration based on the alignment after the first 5 comparisons). > This allieviates the need to check page boundaries. > > Finally, explicly use the P7 strlen with the runtime loader when building > P9. We need to be cautious about vector/vsx extensions here on P9 only > builds. > --- > .../powerpc/powerpc64/le/power9/rtld-strlen.S | 1 + > sysdeps/powerpc/powerpc64/le/power9/strlen.S | 215 ++++++++++++++++++ > sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +- > .../powerpc64/multiarch/ifunc-impl-list.c | 4 + > .../powerpc64/multiarch/strlen-power9.S | 2 + > sysdeps/powerpc/powerpc64/multiarch/strlen.c | 5 + > 6 files changed, 228 insertions(+), 1 deletion(-) > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S > > diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > new file mode 100644 > index 0000000000..e9d83323ac > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > @@ -0,0 +1 @@ > +#include <sysdeps/powerpc/powerpc64/power7/strlen.S> > diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > new file mode 100644 > index 0000000000..084d6e31a8 > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > @@ -0,0 +1,215 @@ > + > +/* Optimized rawmemchr implementation for PowerPC64/POWER9. > + Copyright (C) 2020 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. */ > + > +#include <sysdep.h> > + > +#ifndef STRLEN > +# define STRLEN __strlen > +# define DEFINE_STRLEN_HIDDEN_DEF 1 > +#endif > + > +/* Implements the function > + > + int [r3] strlen (void *s [r3]) const void *s? > + > + The implementation can load bytes past a matching byte, but only > + up to the next 16B or 64B boundary, so it never crosses a page. */ > + > +.machine power9 > +ENTRY_TOCLESS (STRLEN, 4) > + CALL_MCOUNT 2 > + > + mr r4, r3 This can be moved later, and folded into the "add" below. In my experiments, it helped performance for tiny strings. extra space after comma. > + vspltisb v18, 0 > + vspltisb v19, -1 extra spaces after commas. > + > + neg r5,r3 > + rldicl r9,r5,0,60 /* How many bytes to get source 16B aligned? */ > + > + > + /* Align data and fill bytes not loaded with non matching char */ Missing '.' after 'char', but I suggest some different comments (subjective)... Consider: /* Load cache line containing beginning of string. */ > + lvx v0,0,r4 Consider: /* Create permute vector to shift into alignment. */ > + lvsr v1,0,r4 To move the "mr" above later, both of the above instructions would thus need to use "r3" instead of "r4". Consider: /* Shift into alignment, filling with 0xff. */ > + vperm v0,v19,v0,v1 > + > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + beq cr6,L(aligned) > + Consider: /* String ends within first cache line. Compute length. */ > + vctzlsbb r3,v6 > + blr > + > + /* Test 64B 16B at a time. The vector loop is costly for small strings. */ Consider: /* Test 64B 16B at a time. Postpone the vector loop ("loop", below), which is costly for small strings. */ > +L(aligned): > + add r4,r4,r9 And this can change to "add r4,r3,r9". > + > + rldicl. r5, r4, 60, 62 /* Determine how many 48B loops we should run */ /* Determine how many 48B loops we should run in "loop" below. 48B loops perform better than simpler 16B loops. */ extra spaces after commas Should this calculation be moved down, just before its use at "beq", or does it schedule better if left here? Since the result is not used until after the next 14 instructions, strings of these lengths are penalized. > + > + lxv v0+32,0(r4) Is the "+32" needed to accommodate a binutils that doesn't support VSX registers? > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + bne cr6,L(tail1) > + > + lxv v0+32,16(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + bne cr6,L(tail2) > + > + lxv v0+32,32(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + bne cr6,L(tail3) > + > + lxv v0+32,48(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + bne cr6,L(tail4) > + addi r4, r4, 64 extra spaces after commas. > + > + /* prep for weird constant generation of reduction */ Need leading capitalization ("Prep..."). But, maybe a better comment instead... /* Load a dummy aligned address (0) so that 'lvsl' produces a shift vector * of 0..15. */ > + li r0, 0 Extra space after ',' Would it be bad to move this just a little closer to where it is used much later? > + > + /* Skip the alignment if not needed */ need a '.' after 'needed'. (Above "rldicl." 
could be moved as late as here.) > + beq L(loop_64b) > + mtctr r5 > + > + /* Test 48B per iteration until 64B aligned */ need a '.' after 'aligned'. > + .p2align 5 > +L(loop): > + lxv v0+32,0(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ need a '.' after 'otherwise'. > + bne cr6,L(tail1) > + > + lxv v0+32,16(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ here, too. > + bne cr6,L(tail2) > + > + lxv v0+32,32(r4) > + vcmpequb. v6,v0,v18 /* 0xff if byte matches, 0x00 otherwise */ and here. > + bne cr6,L(tail3) > + > + addi r4,r4,48 > + bdnz L(loop) > + > + .p2align 5 > +L(loop_64b): > + lxv v1+32, 0(r4) /* Load 4 quadwords. */ > + lxv v2+32, 16(r4) > + lxv v3+32, 32(r4) > + lxv v4+32, 48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + bne cr6,L(vmx_zero) > + > + lxv v1+32, 0(r4) /* Load 4 quadwords. */ > + lxv v2+32, 16(r4) > + lxv v3+32, 32(r4) > + lxv v4+32, 48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + bne cr6,L(vmx_zero) > + > + lxv v1+32, 0(r4) /* Load 4 quadwords. */ > + lxv v2+32, 16(r4) > + lxv v3+32, 32(r4) > + lxv v4+32, 48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + beq cr6,L(loop_64b) Curious how much this loop unrolling helps, since it adds a fair bit of redundant code? > + > +L(vmx_zero): > + /* OK, we found a null byte. Let's look for it in the current 64-byte > + block and mark it in its corresponding VR. */ > + vcmpequb v1,v1,v18 > + vcmpequb v2,v2,v18 > + vcmpequb v3,v3,v18 > + vcmpequb v4,v4,v18 > + > + /* We will now 'compress' the result into a single doubleword, so it > + can be moved to a GPR for the final calculation. First, we > + generate an appropriate mask for vbpermq, so we can permute bits into > + the first halfword. */ I'm wondering (without having verified) if you can do something here akin to what's done in the "tail" sections below, using "vctzlsbb". > + vspltisb v10,3 > + lvsl v11,r0,r0 Second field should probably be "0" instead of "r0" ("v11,0,r0"). > + vslb v10,v11,v10 > + > + /* Permute the first bit of each byte into bits 48-63. */ > + vbpermq v1,v1,v10 > + vbpermq v2,v2,v10 > + vbpermq v3,v3,v10 > + vbpermq v4,v4,v10 > + > + /* Shift each component into its correct position for merging. */ > + vsldoi v2,v2,v2,2 > + vsldoi v3,v3,v3,4 > + vsldoi v4,v4,v4,6 > + > + /* Merge the results and move to a GPR. */ > + vor v1,v2,v1 > + vor v2,v3,v4 > + vor v4,v1,v2 > + mfvrd r10,v4 > + > + /* Adjust address to the begninning of the current 64-byte block. */ > + addi r4,r4,-64 > + > + addi r9, r10,-1 /* Form a mask from trailing zeros. */ > + andc r9, r9,r10 > + popcntd r0, r9 /* Count the bits in the mask. */ extra spaces after the first comma in the above 3 lines. > + subf r5,r3,r4 > + add r3,r5,r0 /* Compute final length. 
*/ > + blr > + > +L(tail1): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + subf r3,r3,r4 > + blr > + > +L(tail2): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,16 > + subf r3,r3,r4 > + blr > + > +L(tail3): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,32 > + subf r3,r3,r4 > + blr > + > +L(tail4): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,48 > + subf r3,r3,r4 > + blr > + > +END (STRLEN) (snip) PC ^ permalink raw reply [flat|nested] 7+ messages in thread
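One aside on the mask setup questioned above: with r0 holding 0, lvsl produces the bytes 0..15, and vslb by the splatted 3 turns that into 0, 8, 16, ..., 120 — the big-endian bit index of bit 0 of every byte, which is exactly the select vector vbpermq needs. A tiny C check of that arithmetic (hypothetical, not part of the patch):

#include <stdio.h>

int main (void)
{
  /* lvsl with an effective address of 0 yields the bytes 0..15;
     vslb by the splatted 3 shifts each left by 3, i.e. multiplies by 8,
     producing the bit index of bit 0 of each byte for vbpermq.  */
  unsigned char lvsl0[16], sel[16];
  for (int i = 0; i < 16; i++)
    {
      lvsl0[i] = (unsigned char) i;
      sel[i] = (unsigned char) (lvsl0[i] << 3);
    }
  for (int i = 0; i < 16; i++)
    printf ("%3d%c", sel[i], i == 15 ? '\n' : ' ');
  /* Prints: 0 8 16 24 ... 120.  */
  return 0;
}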
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-05-27 16:45 ` Paul A. Clarke @ 2020-05-29 16:26 ` Paul E Murphy 2020-06-03 20:44 ` Paul A. Clarke 0 siblings, 1 reply; 7+ messages in thread From: Paul E Murphy @ 2020-05-29 16:26 UTC (permalink / raw) To: Paul A. Clarke, Paul E. Murphy; +Cc: libc-alpha, anton [-- Attachment #1: Type: text/plain, Size: 2734 bytes --] V3 is attached with changes to formatting and a couple of simplifications as noted below. On 5/27/20 11:45 AM, Paul A. Clarke wrote: > On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote: >> +/* Implements the function >> + >> + int [r3] strlen (void *s [r3]) > > const void *s? Fixed, alongside folding away the mr r3,r4. Likewise, the basic GNU formatting requests, and removed some of the more redundant ones. Thank you for the suggested changes. >> + .p2align 5 >> +L(loop_64b): >> + lxv v1+32, 0(r4) /* Load 4 quadwords. */ >> + lxv v2+32, 16(r4) >> + lxv v3+32, 32(r4) >> + lxv v4+32, 48(r4) >> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ >> + vminub v6,v3,v4 >> + vminub v7,v5,v6 >> + vcmpequb. v7,v7,v18 /* Check for NULLs. */ >> + addi r4,r4,64 /* Adjust address for the next iteration. */ >> + bne cr6,L(vmx_zero) >> + >> + lxv v1+32, 0(r4) /* Load 4 quadwords. */ >> + lxv v2+32, 16(r4) >> + lxv v3+32, 32(r4) >> + lxv v4+32, 48(r4) >> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ >> + vminub v6,v3,v4 >> + vminub v7,v5,v6 >> + vcmpequb. v7,v7,v18 /* Check for NULLs. */ >> + addi r4,r4,64 /* Adjust address for the next iteration. */ >> + bne cr6,L(vmx_zero) >> + >> + lxv v1+32, 0(r4) /* Load 4 quadwords. */ >> + lxv v2+32, 16(r4) >> + lxv v3+32, 32(r4) >> + lxv v4+32, 48(r4) >> + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ >> + vminub v6,v3,v4 >> + vminub v7,v5,v6 >> + vcmpequb. v7,v7,v18 /* Check for NULLs. */ >> + addi r4,r4,64 /* Adjust address for the next iteration. */ >> + beq cr6,L(loop_64b) > > Curious how much this loop unrolling helps, since it adds a fair bit of > redundant code? It does seem to help a little bit, though maybe just an artifact of the benchsuite. > >> + >> +L(vmx_zero): >> + /* OK, we found a null byte. Let's look for it in the current 64-byte >> + block and mark it in its corresponding VR. */ >> + vcmpequb v1,v1,v18 >> + vcmpequb v2,v2,v18 >> + vcmpequb v3,v3,v18 >> + vcmpequb v4,v4,v18 >> + >> + /* We will now 'compress' the result into a single doubleword, so it >> + can be moved to a GPR for the final calculation. First, we >> + generate an appropriate mask for vbpermq, so we can permute bits into >> + the first halfword. */ > > I'm wondering (without having verified) if you can do something here akin to > what's done in the "tail" sections below, using "vctzlsbb". It does not help when the content spans more than 1 VR. I don't think there is much to improve for a 64b mask reduction. Though, we can save a couple cycles below using cnttzd (new in ISA 3.0). [-- Attachment #2: 0001-powerpc64le-add-optimized-strlen-for-P9.patch --] [-- Type: text/x-patch, Size: 10144 bytes --] From 86decdb4a1bea39cc34bb3320fc9e3ea934042f5 Mon Sep 17 00:00:00 2001 From: "Paul E. Murphy" <murphyp@linux.vnet.ibm.com> Date: Mon, 18 May 2020 11:16:06 -0500 Subject: [PATCH] powerpc64le: add optimized strlen for P9 This started as a trivial change to Anton's rawmemchr. I got carried away. 
This is a hybrid between P8's asympotically faster 64B checks with extremely efficient small string checks e.g <64B (and sometimes a little bit more depending on alignment). The second trick is to align to 64B by running a 48B checking loop 16B at a time until we naturally align to 64B (i.e checking 48/96/144 bytes/iteration based on the alignment after the first 5 comparisons). This allieviates the need to check page boundaries. Finally, explicly use the P7 strlen with the runtime loader when building P9. We need to be cautious about vector/vsx extensions here on P9 only builds. --- .../powerpc/powerpc64/le/power9/rtld-strlen.S | 1 + sysdeps/powerpc/powerpc64/le/power9/strlen.S | 213 ++++++++++++++++++ sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +- .../powerpc64/multiarch/ifunc-impl-list.c | 4 + .../powerpc64/multiarch/strlen-power9.S | 2 + sysdeps/powerpc/powerpc64/multiarch/strlen.c | 5 + 6 files changed, 226 insertions(+), 1 deletion(-) create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S new file mode 100644 index 0000000000..e9d83323ac --- /dev/null +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S @@ -0,0 +1 @@ +#include <sysdeps/powerpc/powerpc64/power7/strlen.S> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S new file mode 100644 index 0000000000..0b358ff128 --- /dev/null +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S @@ -0,0 +1,213 @@ +/* Optimized strlen implementation for PowerPC64/POWER9. + Copyright (C) 2020 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#include <sysdep.h> + +#ifndef STRLEN +# define STRLEN __strlen +# define DEFINE_STRLEN_HIDDEN_DEF 1 +#endif + +/* Implements the function + + int [r3] strlen (const void *s [r3]) + + The implementation can load bytes past a matching byte, but only + up to the next 64B boundary, so it never crosses a page. */ + +.machine power9 +ENTRY_TOCLESS (STRLEN, 4) + CALL_MCOUNT 2 + + vspltisb v18,0 + vspltisb v19,-1 + + neg r5,r3 + rldicl r9,r5,0,60 /* How many bytes to get source 16B aligned? */ + + + /* Align data and fill bytes not loaded with non matching char. */ + lvx v0,0,r3 + lvsr v1,0,r3 + vperm v0,v19,v0,v1 + + vcmpequb. v6,v0,v18 + beq cr6,L(aligned) + + vctzlsbb r3,v6 + blr + + /* Test 64B 16B at a time. The 64B vector loop is optimized for + longer strings. Likewise, we check a multiple of 64B to avoid + breaking the alignment calculation below. */ +L(aligned): + add r4,r3,r9 + rldicl. r5,r4,60,62 /* Determine the number of 48B loops needed for + alignment to 64B. And test for zero. 
*/ + + lxv v0+32,0(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail1) + + lxv v0+32,16(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail2) + + lxv v0+32,32(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail3) + + lxv v0+32,48(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail4) + addi r4,r4,64 + + /* Prep for weird constant generation of reduction. */ + li r0,0 + + /* Skip the alignment if already 64B aligned. */ + beq L(loop_64b) + mtctr r5 + + /* Test 48B per iteration until 64B aligned. */ + .p2align 5 +L(loop): + lxv v0+32,0(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail1) + + lxv v0+32,16(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail2) + + lxv v0+32,32(r4) + vcmpequb. v6,v0,v18 + bne cr6,L(tail3) + + addi r4,r4,48 + bdnz L(loop) + + .p2align 5 +L(loop_64b): + lxv v1+32,0(r4) /* Load 4 quadwords. */ + lxv v2+32,16(r4) + lxv v3+32,32(r4) + lxv v4+32,48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + bne cr6,L(vmx_zero) + + lxv v1+32,0(r4) /* Load 4 quadwords. */ + lxv v2+32,16(r4) + lxv v3+32,32(r4) + lxv v4+32,48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + bne cr6,L(vmx_zero) + + lxv v1+32,0(r4) /* Load 4 quadwords. */ + lxv v2+32,16(r4) + lxv v3+32,32(r4) + lxv v4+32,48(r4) + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ + vminub v6,v3,v4 + vminub v7,v5,v6 + vcmpequb. v7,v7,v18 /* Check for NULLs. */ + addi r4,r4,64 /* Adjust address for the next iteration. */ + beq cr6,L(loop_64b) + +L(vmx_zero): + /* OK, we found a null byte. Let's look for it in the current 64-byte + block and mark it in its corresponding VR. */ + vcmpequb v1,v1,v18 + vcmpequb v2,v2,v18 + vcmpequb v3,v3,v18 + vcmpequb v4,v4,v18 + + /* We will now 'compress' the result into a single doubleword, so it + can be moved to a GPR for the final calculation. First, we + generate an appropriate mask for vbpermq, so we can permute bits into + the first halfword. */ + vspltisb v10,3 + lvsl v11,0,r0 + vslb v10,v11,v10 + + /* Permute the first bit of each byte into bits 48-63. */ + vbpermq v1,v1,v10 + vbpermq v2,v2,v10 + vbpermq v3,v3,v10 + vbpermq v4,v4,v10 + + /* Shift each component into its correct position for merging. */ + vsldoi v2,v2,v2,2 + vsldoi v3,v3,v3,4 + vsldoi v4,v4,v4,6 + + /* Merge the results and move to a GPR. */ + vor v1,v2,v1 + vor v2,v3,v4 + vor v4,v1,v2 + mfvrd r10,v4 + + /* Adjust address to the begninning of the current 64-byte block. */ + addi r4,r4,-64 + + cnttzd r0,r10 /* Count trailing zeros before the match. */ + subf r5,r3,r4 + add r3,r5,r0 /* Compute final length. 
*/ + blr + +L(tail1): + vctzlsbb r0,v6 + add r4,r4,r0 + subf r3,r3,r4 + blr + +L(tail2): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,16 + subf r3,r3,r4 + blr + +L(tail3): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,32 + subf r3,r3,r4 + blr + +L(tail4): + vctzlsbb r0,v6 + add r4,r4,r0 + addi r4,r4,48 + subf r3,r3,r4 + blr + +END (STRLEN) + +#ifdef DEFINE_STRLEN_HIDDEN_DEF +weak_alias (__strlen, strlen) +libc_hidden_builtin_def (strlen) +#endif diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile index fc2268f6b5..19acb6c64a 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile @@ -33,7 +33,7 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \ ifneq (,$(filter %le,$(config-machine))) sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \ - rawmemchr-power9 + rawmemchr-power9 strlen-power9 endif CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c index 59a227ee22..ea10b00417 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c @@ -111,6 +111,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c. */ IFUNC_IMPL (i, name, strlen, +#ifdef __LITTLE_ENDIAN__ + IFUNC_IMPL_ADD (array, i, strcpy, hwcap2 & PPC_FEATURE2_ARCH_3_00, + __strlen_power9) +#endif IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07, __strlen_power8) IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX, diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S new file mode 100644 index 0000000000..68c8d54b5f --- /dev/null +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S @@ -0,0 +1,2 @@ +#define STRLEN __strlen_power9 +#include <sysdeps/powerpc/powerpc64/le/power9/strlen.S> diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c index e587554221..cd9dc78a7c 100644 --- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c +++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c @@ -30,8 +30,13 @@ extern __typeof (__redirect_strlen) __libc_strlen; extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden; extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden; extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden; +extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden; libc_ifunc (__libc_strlen, +# ifdef __LITTLE_ENDIAN__ + (hwcap2 & PPC_FEATURE2_ARCH_3_00) + ? __strlen_power9 : +# endif (hwcap2 & PPC_FEATURE2_ARCH_2_07) ? __strlen_power8 : (hwcap & PPC_FEATURE_HAS_VSX) -- 2.26.2 ^ permalink raw reply [flat|nested] 7+ messages in thread
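The switch from addi/andc/popcntd to cnttzd mentioned above rests on the identity popcount((m - 1) & ~m) == ctz(m) for nonzero m. A small stand-alone C model of the 64B reduction follows; mask plays the role of the doubleword left in r10 by the vbpermq/vsldoi/vor/mfvrd sequence, and the function and variable names are invented for illustration.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Offset of the first zero byte within a 64B block known to contain
   one.  Bit i of MASK is set iff byte i of the block is zero.  */
static unsigned
first_zero_in_block (const unsigned char block[64])
{
  uint64_t mask = 0;
  for (unsigned i = 0; i < 64; i++)
    if (block[i] == 0)
      mask |= (uint64_t) 1 << i;

  /* V1: build a mask of the trailing zeros, then population-count it
     (addi/andc/popcntd).  */
  unsigned via_popcnt = (unsigned) __builtin_popcountll ((mask - 1) & ~mask);
  /* V3: count the trailing zeros directly (cnttzd, new in ISA 3.0).  */
  unsigned via_cnttzd = (unsigned) __builtin_ctzll (mask);
  assert (via_popcnt == via_cnttzd);
  return via_cnttzd;
}

int main (void)
{
  unsigned char block[64];
  memset (block, 'x', sizeof block);
  block[37] = '\0';
  printf ("first zero byte at offset %u\n", first_zero_in_block (block));
  return 0;
}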
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-05-29 16:26 ` Paul E Murphy @ 2020-06-03 20:44 ` Paul A. Clarke 2020-06-04 13:55 ` Paul E Murphy 0 siblings, 1 reply; 7+ messages in thread From: Paul A. Clarke @ 2020-06-03 20:44 UTC (permalink / raw) To: Paul E Murphy; +Cc: Paul E. Murphy, libc-alpha, anton On Fri, May 29, 2020 at 11:26:14AM -0500, Paul E Murphy wrote: > > V3 is attached with changes to formatting and a couple of > simplifications as noted below. [snip] This version LGTM with a few nits below (and you were going to check the binutils support for the POWER9 instruction). > From 86decdb4a1bea39cc34bb3320fc9e3ea934042f5 Mon Sep 17 00:00:00 2001 > From: "Paul E. Murphy" <murphyp@linux.vnet.ibm.com> > Date: Mon, 18 May 2020 11:16:06 -0500 > Subject: [PATCH] powerpc64le: add optimized strlen for P9 > > This started as a trivial change to Anton's rawmemchr. I got > carried away. This is a hybrid between P8's asympotically > faster 64B checks with extremely efficient small string checks > e.g <64B (and sometimes a little bit more depending on alignment). > > The second trick is to align to 64B by running a 48B checking loop > 16B at a time until we naturally align to 64B (i.e checking 48/96/144 > bytes/iteration based on the alignment after the first 5 comparisons). > This allieviates the need to check page boundaries. > > Finally, explicly use the P7 strlen with the runtime loader when building > P9. We need to be cautious about vector/vsx extensions here on P9 only > builds. > --- > .../powerpc/powerpc64/le/power9/rtld-strlen.S | 1 + > sysdeps/powerpc/powerpc64/le/power9/strlen.S | 213 ++++++++++++++++++ > sysdeps/powerpc/powerpc64/multiarch/Makefile | 2 +- > .../powerpc64/multiarch/ifunc-impl-list.c | 4 + > .../powerpc64/multiarch/strlen-power9.S | 2 + > sysdeps/powerpc/powerpc64/multiarch/strlen.c | 5 + > 6 files changed, 226 insertions(+), 1 deletion(-) > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S > create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S > > diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > new file mode 100644 > index 0000000000..e9d83323ac > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S > @@ -0,0 +1 @@ > +#include <sysdeps/powerpc/powerpc64/power7/strlen.S> > diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > new file mode 100644 > index 0000000000..0b358ff128 > --- /dev/null > +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S > @@ -0,0 +1,213 @@ > +/* Optimized strlen implementation for PowerPC64/POWER9. > + Copyright (C) 2020 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <https://www.gnu.org/licenses/>. 
*/ > + > +#include <sysdep.h> > + > +#ifndef STRLEN > +# define STRLEN __strlen > +# define DEFINE_STRLEN_HIDDEN_DEF 1 > +#endif > + > +/* Implements the function > + > + int [r3] strlen (const void *s [r3]) > + > + The implementation can load bytes past a matching byte, but only > + up to the next 64B boundary, so it never crosses a page. */ > + > +.machine power9 > +ENTRY_TOCLESS (STRLEN, 4) > + CALL_MCOUNT 2 > + > + vspltisb v18,0 > + vspltisb v19,-1 > + > + neg r5,r3 > + rldicl r9,r5,0,60 /* How many bytes to get source 16B aligned? */ > + > + Extra blank line here. (Sorry, didn't see this the first time.) > + /* Align data and fill bytes not loaded with non matching char. */ > + lvx v0,0,r3 > + lvsr v1,0,r3 > + vperm v0,v19,v0,v1 > + > + vcmpequb. v6,v0,v18 > + beq cr6,L(aligned) > + Consider for before the next two instructions: /* String ends within first cache line. Compute and return length. */ > + vctzlsbb r3,v6 > + blr > + > + /* Test 64B 16B at a time. The 64B vector loop is optimized for > + longer strings. Likewise, we check a multiple of 64B to avoid > + breaking the alignment calculation below. */ > +L(aligned): > + add r4,r3,r9 > + rldicl. r5,r4,60,62 /* Determine the number of 48B loops needed for > + alignment to 64B. And test for zero. */ Would it be bad to move the "rldicl." down... > + > + lxv v0+32,0(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail1) > + > + lxv v0+32,16(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail2) > + > + lxv v0+32,32(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail3) > + > + lxv v0+32,48(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail4) ...to here, to avoid needlessly penalizing the cases above? > + addi r4,r4,64 > + > + /* Prep for weird constant generation of reduction. */ > + li r0,0 Still need a better comment here. Consider: /* Load a dummy aligned address (0) so that 'lvsl' produces a shift vector of 0..15. */ And this "li" instruction can be moved WAY down... > + > + /* Skip the alignment if already 64B aligned. */ > + beq L(loop_64b) > + mtctr r5 > + > + /* Test 48B per iteration until 64B aligned. */ > + .p2align 5 > +L(loop): > + lxv v0+32,0(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail1) > + > + lxv v0+32,16(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail2) > + > + lxv v0+32,32(r4) > + vcmpequb. v6,v0,v18 > + bne cr6,L(tail3) > + > + addi r4,r4,48 > + bdnz L(loop) > + > + .p2align 5 > +L(loop_64b): > + lxv v1+32,0(r4) /* Load 4 quadwords. */ > + lxv v2+32,16(r4) > + lxv v3+32,32(r4) > + lxv v4+32,48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + bne cr6,L(vmx_zero) > + > + lxv v1+32,0(r4) /* Load 4 quadwords. */ > + lxv v2+32,16(r4) > + lxv v3+32,32(r4) > + lxv v4+32,48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + bne cr6,L(vmx_zero) > + > + lxv v1+32,0(r4) /* Load 4 quadwords. */ > + lxv v2+32,16(r4) > + lxv v3+32,32(r4) > + lxv v4+32,48(r4) > + vminub v5,v1,v2 /* Compare and merge into one VR for speed. */ > + vminub v6,v3,v4 > + vminub v7,v5,v6 > + vcmpequb. v7,v7,v18 /* Check for NULLs. */ > + addi r4,r4,64 /* Adjust address for the next iteration. */ > + beq cr6,L(loop_64b) > + > +L(vmx_zero): ...to here, perhaps, to avoid penalizing shorter strings. (And be closer to its use.) 
> + /* OK, we found a null byte. Let's look for it in the current 64-byte > + block and mark it in its corresponding VR. */ > + vcmpequb v1,v1,v18 > + vcmpequb v2,v2,v18 > + vcmpequb v3,v3,v18 > + vcmpequb v4,v4,v18 > + > + /* We will now 'compress' the result into a single doubleword, so it > + can be moved to a GPR for the final calculation. First, we > + generate an appropriate mask for vbpermq, so we can permute bits into > + the first halfword. */ > + vspltisb v10,3 > + lvsl v11,0,r0 > + vslb v10,v11,v10 > + > + /* Permute the first bit of each byte into bits 48-63. */ > + vbpermq v1,v1,v10 > + vbpermq v2,v2,v10 > + vbpermq v3,v3,v10 > + vbpermq v4,v4,v10 > + > + /* Shift each component into its correct position for merging. */ > + vsldoi v2,v2,v2,2 > + vsldoi v3,v3,v3,4 > + vsldoi v4,v4,v4,6 > + > + /* Merge the results and move to a GPR. */ > + vor v1,v2,v1 > + vor v2,v3,v4 > + vor v4,v1,v2 > + mfvrd r10,v4 > + > + /* Adjust address to the begninning of the current 64-byte block. */ > + addi r4,r4,-64 > + > + cnttzd r0,r10 /* Count trailing zeros before the match. */ > + subf r5,r3,r4 > + add r3,r5,r0 /* Compute final length. */ > + blr > + > +L(tail1): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + subf r3,r3,r4 > + blr > + > +L(tail2): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,16 > + subf r3,r3,r4 > + blr > + > +L(tail3): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,32 > + subf r3,r3,r4 > + blr > + > +L(tail4): > + vctzlsbb r0,v6 > + add r4,r4,r0 > + addi r4,r4,48 > + subf r3,r3,r4 > + blr > + > +END (STRLEN) > + > +#ifdef DEFINE_STRLEN_HIDDEN_DEF > +weak_alias (__strlen, strlen) > +libc_hidden_builtin_def (strlen) > +#endif [snip] PC ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-06-03 20:44 ` Paul A. Clarke @ 2020-06-04 13:55 ` Paul E Murphy 2020-06-05 20:39 ` Paul E Murphy 0 siblings, 1 reply; 7+ messages in thread From: Paul E Murphy @ 2020-06-04 13:55 UTC (permalink / raw) To: Paul A. Clarke; +Cc: Paul E. Murphy, libc-alpha, anton On 6/3/20 3:44 PM, Paul A. Clarke wrote: >> + vctzlsbb r3,v6 >> + blr >> + >> + /* Test 64B 16B at a time. The 64B vector loop is optimized for >> + longer strings. Likewise, we check a multiple of 64B to avoid >> + breaking the alignment calculation below. */ >> +L(aligned): >> + add r4,r3,r9 >> + rldicl. r5,r4,60,62 /* Determine the number of 48B loops needed for >> + alignment to 64B. And test for zero. */ > > Would it be bad to move the "rldicl." down... > >> + >> + lxv v0+32,0(r4) >> + vcmpequb. v6,v0,v18 >> + bne cr6,L(tail1) >> + >> + lxv v0+32,16(r4) >> + vcmpequb. v6,v0,v18 >> + bne cr6,L(tail2) >> + >> + lxv v0+32,32(r4) >> + vcmpequb. v6,v0,v18 >> + bne cr6,L(tail3) >> + >> + lxv v0+32,48(r4) >> + vcmpequb. v6,v0,v18 >> + bne cr6,L(tail4) > > ...to here, to avoid needlessly penalizing the cases above? My thinking is the integer op is basically free while we're busy executing vector ops (similarly for li). Likewise, this may put enough space between the conditional branch below to ease prediction. I don't claim to have run this against a cycle accurate simulator, so it may not really matter. Similarly, binutils 2.26 should support all P9 instructions here and in Anton's other patches. I did check before pushing the rawmemchr/stpcpy. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] powerpc64le: add optimized strlen for P9 2020-06-04 13:55 ` Paul E Murphy @ 2020-06-05 20:39 ` Paul E Murphy 0 siblings, 0 replies; 7+ messages in thread From: Paul E Murphy @ 2020-06-05 20:39 UTC (permalink / raw) To: Paul A. Clarke; +Cc: Paul E. Murphy, libc-alpha, anton On 6/4/20 8:55 AM, Paul E Murphy wrote: > > > On 6/3/20 3:44 PM, Paul A. Clarke wrote: >>> + vctzlsbb r3,v6 >>> + blr >>> + >>> + /* Test 64B 16B at a time. The 64B vector loop is optimized for >>> + longer strings. Likewise, we check a multiple of 64B to avoid >>> + breaking the alignment calculation below. */ >>> +L(aligned): >>> + add r4,r3,r9 >>> + rldicl. r5,r4,60,62 /* Determine the number of 48B loops >>> needed for >>> + alignment to 64B. And test for >>> zero. */ >> >> Would it be bad to move the "rldicl." down... >> >>> + >>> + lxv v0+32,0(r4) >>> + vcmpequb. v6,v0,v18 >>> + bne cr6,L(tail1) >>> + >>> + lxv v0+32,16(r4) >>> + vcmpequb. v6,v0,v18 >>> + bne cr6,L(tail2) >>> + >>> + lxv v0+32,32(r4) >>> + vcmpequb. v6,v0,v18 >>> + bne cr6,L(tail3) >>> + >>> + lxv v0+32,48(r4) >>> + vcmpequb. v6,v0,v18 >>> + bne cr6,L(tail4) >> >> ...to here, to avoid needlessly penalizing the cases above? > > My thinking is the integer op is basically free while we're busy > executing vector ops (similarly for li). Likewise, this may put enough > space between the conditional branch below to ease prediction. I don't > claim to have run this against a cycle accurate simulator, so it may not > really matter. > > Similarly, binutils 2.26 should support all P9 instructions here and in > Anton's other patches. I did check before pushing the rawmemchr/stpcpy. And pushed with extra space removed and rewritten comment for li r0,0. Thank you for the feedback Paul. ^ permalink raw reply [flat|nested] 7+ messages in thread
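As a usage note (not part of the patch): whether the ifunc resolver above will pick the new routine on a given machine can be checked from userspace via AT_HWCAP2. PPC_FEATURE2_ARCH_3_00 normally comes from <bits/hwcap.h> via <sys/auxv.h>; the fallback define below is only a guard in case an older header lacks it.

#include <stdio.h>
#include <sys/auxv.h>

#ifndef PPC_FEATURE2_ARCH_3_00
# define PPC_FEATURE2_ARCH_3_00 0x00800000  /* From the kernel's cputable.  */
#endif

int main (void)
{
  unsigned long hwcap2 = getauxval (AT_HWCAP2);
  puts ((hwcap2 & PPC_FEATURE2_ARCH_3_00)
        ? "ISA 3.0 (POWER9 or later): ifunc selects __strlen_power9"
        : "pre-POWER9: ifunc falls back to the power8/power7/generic strlen");
  return 0;
}

On a POWER9 or later little-endian system this should report that __strlen_power9 is chosen; otherwise the resolver falls through to the power8, power7, or generic variants exactly as in the strlen.c hunk above.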