* Re: [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch
[not found] ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
@ 2022-05-12 19:31 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:31 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:37 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Thu, Mar 24, 2022 at 12:20 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Thu, Mar 24, 2022 at 1:53 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > Small code cleanup for size: -53 bytes.
> > > >
> > > > Add comment justifying using a branch to do NULL/non-null return.
> > >
> > >
> > > Do you have follow-up patches to improve its performance? We are
> > > backporting all x86-64 improvements to Intel release branches:
> > >
> > > https://gitlab.com/x86-glibc/glibc/-/wikis/home
> > >
> > > Patches without performance improvements are undesirable.
> >
> > No further changes planned at the moment; the code size savings
> > seem worth it for master, though. Also in favor of adding the comment,
> > as I think it's non-intuitive.
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 04/23] x86: Code cleanup in strchr-evex and comment justifying branch
[not found] ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com>
@ 2022-05-12 19:32 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:32 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 11:55 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Small code cleanup for size: -81 bytes.
> >
> > Add comment justifying using a branch to do NULL/non-null return.
> >
> > All string/memory tests pass and no regressions in benchtests.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .985
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > length, alignment, pos, rand, seek_char/branch, max_char/perc-zero, New Time / Old Time
> > 2048, 0, 32, 0, 23, 127, 0.878
> > 2048, 1, 32, 0, 23, 127, 0.88
> > 2048, 0, 64, 0, 23, 127, 0.997
> > 2048, 2, 64, 0, 23, 127, 1.001
> > 2048, 0, 128, 0, 23, 127, 0.973
> > 2048, 3, 128, 0, 23, 127, 0.971
> > 2048, 0, 256, 0, 23, 127, 0.976
> > 2048, 4, 256, 0, 23, 127, 0.973
> > 2048, 0, 512, 0, 23, 127, 1.001
> > 2048, 5, 512, 0, 23, 127, 1.004
> > 2048, 0, 1024, 0, 23, 127, 1.005
> > 2048, 6, 1024, 0, 23, 127, 1.007
> > 2048, 0, 2048, 0, 23, 127, 1.035
> > 2048, 7, 2048, 0, 23, 127, 1.03
> > 4096, 0, 32, 0, 23, 127, 0.889
> > 4096, 1, 32, 0, 23, 127, 0.891
> > 4096, 0, 64, 0, 23, 127, 1.012
> > 4096, 2, 64, 0, 23, 127, 1.017
> > 4096, 0, 128, 0, 23, 127, 0.975
> > 4096, 3, 128, 0, 23, 127, 0.974
> > 4096, 0, 256, 0, 23, 127, 0.974
> > 4096, 4, 256, 0, 23, 127, 0.972
> > 4096, 0, 512, 0, 23, 127, 1.002
> > 4096, 5, 512, 0, 23, 127, 1.016
> > 4096, 0, 1024, 0, 23, 127, 1.009
> > 4096, 6, 1024, 0, 23, 127, 1.008
> > 4096, 0, 2048, 0, 23, 127, 1.003
> > 4096, 7, 2048, 0, 23, 127, 1.004
> > 256, 1, 64, 0, 23, 127, 0.993
> > 256, 2, 64, 0, 23, 127, 0.999
> > 256, 3, 64, 0, 23, 127, 0.992
> > 256, 4, 64, 0, 23, 127, 0.99
> > 256, 5, 64, 0, 23, 127, 0.99
> > 256, 6, 64, 0, 23, 127, 0.994
> > 256, 7, 64, 0, 23, 127, 0.991
> > 512, 0, 256, 0, 23, 127, 0.971
> > 512, 16, 256, 0, 23, 127, 0.971
> > 512, 32, 256, 0, 23, 127, 1.005
> > 512, 48, 256, 0, 23, 127, 0.998
> > 512, 64, 256, 0, 23, 127, 1.001
> > 512, 80, 256, 0, 23, 127, 1.002
> > 512, 96, 256, 0, 23, 127, 1.005
> > 512, 112, 256, 0, 23, 127, 1.012
> > 1, 0, 0, 0, 23, 127, 1.024
> > 2, 0, 1, 0, 23, 127, 0.991
> > 3, 0, 2, 0, 23, 127, 0.997
> > 4, 0, 3, 0, 23, 127, 0.984
> > 5, 0, 4, 0, 23, 127, 0.993
> > 6, 0, 5, 0, 23, 127, 0.985
> > 7, 0, 6, 0, 23, 127, 0.979
> > 8, 0, 7, 0, 23, 127, 0.975
> > 9, 0, 8, 0, 23, 127, 0.965
> > 10, 0, 9, 0, 23, 127, 0.957
> > 11, 0, 10, 0, 23, 127, 0.979
> > 12, 0, 11, 0, 23, 127, 0.987
> > 13, 0, 12, 0, 23, 127, 1.023
> > 14, 0, 13, 0, 23, 127, 0.997
> > 15, 0, 14, 0, 23, 127, 0.983
> > 16, 0, 15, 0, 23, 127, 0.987
> > 17, 0, 16, 0, 23, 127, 0.993
> > 18, 0, 17, 0, 23, 127, 0.985
> > 19, 0, 18, 0, 23, 127, 0.999
> > 20, 0, 19, 0, 23, 127, 0.998
> > 21, 0, 20, 0, 23, 127, 0.983
> > 22, 0, 21, 0, 23, 127, 0.983
> > 23, 0, 22, 0, 23, 127, 1.002
> > 24, 0, 23, 0, 23, 127, 1.0
> > 25, 0, 24, 0, 23, 127, 1.002
> > 26, 0, 25, 0, 23, 127, 0.984
> > 27, 0, 26, 0, 23, 127, 0.994
> > 28, 0, 27, 0, 23, 127, 0.995
> > 29, 0, 28, 0, 23, 127, 1.017
> > 30, 0, 29, 0, 23, 127, 1.009
> > 31, 0, 30, 0, 23, 127, 1.001
> > 32, 0, 31, 0, 23, 127, 1.021
> > 2048, 0, 32, 0, 0, 127, 0.899
> > 2048, 1, 32, 0, 0, 127, 0.93
> > 2048, 0, 64, 0, 0, 127, 1.009
> > 2048, 2, 64, 0, 0, 127, 1.023
> > 2048, 0, 128, 0, 0, 127, 0.973
> > 2048, 3, 128, 0, 0, 127, 0.975
> > 2048, 0, 256, 0, 0, 127, 0.974
> > 2048, 4, 256, 0, 0, 127, 0.97
> > 2048, 0, 512, 0, 0, 127, 0.999
> > 2048, 5, 512, 0, 0, 127, 1.004
> > 2048, 0, 1024, 0, 0, 127, 1.008
> > 2048, 6, 1024, 0, 0, 127, 1.008
> > 2048, 0, 2048, 0, 0, 127, 0.996
> > 2048, 7, 2048, 0, 0, 127, 1.002
> > 4096, 0, 32, 0, 0, 127, 0.872
> > 4096, 1, 32, 0, 0, 127, 0.881
> > 4096, 0, 64, 0, 0, 127, 1.006
> > 4096, 2, 64, 0, 0, 127, 1.005
> > 4096, 0, 128, 0, 0, 127, 0.973
> > 4096, 3, 128, 0, 0, 127, 0.974
> > 4096, 0, 256, 0, 0, 127, 0.969
> > 4096, 4, 256, 0, 0, 127, 0.971
> > 4096, 0, 512, 0, 0, 127, 1.0
> > 4096, 5, 512, 0, 0, 127, 1.005
> > 4096, 0, 1024, 0, 0, 127, 1.007
> > 4096, 6, 1024, 0, 0, 127, 1.009
> > 4096, 0, 2048, 0, 0, 127, 1.005
> > 4096, 7, 2048, 0, 0, 127, 1.007
> > 256, 1, 64, 0, 0, 127, 0.994
> > 256, 2, 64, 0, 0, 127, 1.008
> > 256, 3, 64, 0, 0, 127, 1.019
> > 256, 4, 64, 0, 0, 127, 0.991
> > 256, 5, 64, 0, 0, 127, 0.992
> > 256, 6, 64, 0, 0, 127, 0.991
> > 256, 7, 64, 0, 0, 127, 0.988
> > 512, 0, 256, 0, 0, 127, 0.971
> > 512, 16, 256, 0, 0, 127, 0.967
> > 512, 32, 256, 0, 0, 127, 1.005
> > 512, 48, 256, 0, 0, 127, 1.001
> > 512, 64, 256, 0, 0, 127, 1.009
> > 512, 80, 256, 0, 0, 127, 1.008
> > 512, 96, 256, 0, 0, 127, 1.009
> > 512, 112, 256, 0, 0, 127, 1.016
> > 1, 0, 0, 0, 0, 127, 1.038
> > 2, 0, 1, 0, 0, 127, 1.009
> > 3, 0, 2, 0, 0, 127, 0.992
> > 4, 0, 3, 0, 0, 127, 1.004
> > 5, 0, 4, 0, 0, 127, 0.966
> > 6, 0, 5, 0, 0, 127, 0.968
> > 7, 0, 6, 0, 0, 127, 1.004
> > 8, 0, 7, 0, 0, 127, 0.99
> > 9, 0, 8, 0, 0, 127, 0.958
> > 10, 0, 9, 0, 0, 127, 0.96
> > 11, 0, 10, 0, 0, 127, 0.948
> > 12, 0, 11, 0, 0, 127, 0.984
> > 13, 0, 12, 0, 0, 127, 0.967
> > 14, 0, 13, 0, 0, 127, 0.993
> > 15, 0, 14, 0, 0, 127, 0.991
> > 16, 0, 15, 0, 0, 127, 1.0
> > 17, 0, 16, 0, 0, 127, 0.982
> > 18, 0, 17, 0, 0, 127, 0.977
> > 19, 0, 18, 0, 0, 127, 0.987
> > 20, 0, 19, 0, 0, 127, 0.978
> > 21, 0, 20, 0, 0, 127, 1.0
> > 22, 0, 21, 0, 0, 127, 0.99
> > 23, 0, 22, 0, 0, 127, 0.988
> > 24, 0, 23, 0, 0, 127, 0.997
> > 25, 0, 24, 0, 0, 127, 1.003
> > 26, 0, 25, 0, 0, 127, 1.004
> > 27, 0, 26, 0, 0, 127, 0.982
> > 28, 0, 27, 0, 0, 127, 0.972
> > 29, 0, 28, 0, 0, 127, 0.978
> > 30, 0, 29, 0, 0, 127, 0.992
> > 31, 0, 30, 0, 0, 127, 0.986
> > 32, 0, 31, 0, 0, 127, 1.0
> >
> > 16, 0, 15, 1, 1, 0, 0.997
> > 16, 0, 15, 1, 0, 0, 1.001
> > 16, 0, 15, 1, 1, 0.1, 0.984
> > 16, 0, 15, 1, 0, 0.1, 0.999
> > 16, 0, 15, 1, 1, 0.25, 0.929
> > 16, 0, 15, 1, 0, 0.25, 1.001
> > 16, 0, 15, 1, 1, 0.33, 0.892
> > 16, 0, 15, 1, 0, 0.33, 0.996
> > 16, 0, 15, 1, 1, 0.5, 0.897
> > 16, 0, 15, 1, 0, 0.5, 1.009
> > 16, 0, 15, 1, 1, 0.66, 0.882
> > 16, 0, 15, 1, 0, 0.66, 0.967
> > 16, 0, 15, 1, 1, 0.75, 0.919
> > 16, 0, 15, 1, 0, 0.75, 1.027
> > 16, 0, 15, 1, 1, 0.9, 0.949
> > 16, 0, 15, 1, 0, 0.9, 1.021
> > 16, 0, 15, 1, 1, 1, 0.998
> > 16, 0, 15, 1, 0, 1, 0.999
> >
> > sysdeps/x86_64/multiarch/strchr-evex.S | 146 ++++++++++++++-----------
> > 1 file changed, 80 insertions(+), 66 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S
> > index f62cd9d144..ec739fb8f9 100644
> > --- a/sysdeps/x86_64/multiarch/strchr-evex.S
> > +++ b/sysdeps/x86_64/multiarch/strchr-evex.S
> > @@ -30,6 +30,7 @@
> > # ifdef USE_AS_WCSCHR
> > # define VPBROADCAST vpbroadcastd
> > # define VPCMP vpcmpd
> > +# define VPTESTN vptestnmd
> > # define VPMINU vpminud
> > # define CHAR_REG esi
> > # define SHIFT_REG ecx
> > @@ -37,6 +38,7 @@
> > # else
> > # define VPBROADCAST vpbroadcastb
> > # define VPCMP vpcmpb
> > +# define VPTESTN vptestnmb
> > # define VPMINU vpminub
> > # define CHAR_REG sil
> > # define SHIFT_REG edx
> > @@ -61,13 +63,11 @@
> > # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
> >
> > .section .text.evex,"ax",@progbits
> > -ENTRY (STRCHR)
> > +ENTRY_P2ALIGN (STRCHR, 5)
> > /* Broadcast CHAR to YMM0. */
> > VPBROADCAST %esi, %YMM0
> > movl %edi, %eax
> > andl $(PAGE_SIZE - 1), %eax
> > - vpxorq %XMMZERO, %XMMZERO, %XMMZERO
> > -
> > /* Check if we cross page boundary with one vector load.
> > Otherwise it is safe to use an unaligned load. */
> > cmpl $(PAGE_SIZE - VEC_SIZE), %eax
> > @@ -81,49 +81,35 @@ ENTRY (STRCHR)
> > vpxorq %YMM1, %YMM0, %YMM2
> > VPMINU %YMM2, %YMM1, %YMM2
> > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */
> > - VPCMP $0, %YMMZERO, %YMM2, %k0
> > + VPTESTN %YMM2, %YMM2, %k0
> > kmovd %k0, %eax
> > testl %eax, %eax
> > jz L(aligned_more)
> > tzcntl %eax, %eax
> > +# ifndef USE_AS_STRCHRNUL
> > + /* Found CHAR or the null byte. */
> > + cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > + /* NB: Use a branch instead of cmovcc here. The expectation is
> > + that with strchr the user will branch based on the input being
> > + null. Since this branch is 100% predictive of the user's
> > + branch, a branch miss here should save what would otherwise be
> > + a branch miss in the user code. Beyond that, using a branch 1)
> > + saves code size and 2) is faster in highly predictable
> > + environments. */
> > + jne L(zero)
> > +# endif
> > # ifdef USE_AS_WCSCHR
> > /* NB: Multiply wchar_t count by 4 to get the number of bytes.
> > */
> > leaq (%rdi, %rax, CHAR_SIZE), %rax
> > # else
> > addq %rdi, %rax
> > -# endif
> > -# ifndef USE_AS_STRCHRNUL
> > - /* Found CHAR or the null byte. */
> > - cmp (%rax), %CHAR_REG
> > - jne L(zero)
> > # endif
> > ret
> >
> > - /* .p2align 5 helps keep performance more consistent if ENTRY()
> > - alignment % 32 was either 16 or 0. As well this makes the
> > - alignment % 32 of the loop_4x_vec fixed which makes tuning it
> > - easier. */
> > - .p2align 5
> > -L(first_vec_x3):
> > - tzcntl %eax, %eax
> > -# ifndef USE_AS_STRCHRNUL
> > - /* Found CHAR or the null byte. */
> > - cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > - jne L(zero)
> > -# endif
> > - /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > - bytes. */
> > - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
> > - ret
> >
> > -# ifndef USE_AS_STRCHRNUL
> > -L(zero):
> > - xorl %eax, %eax
> > - ret
> > -# endif
> >
> > - .p2align 4
> > + .p2align 4,, 10
> > L(first_vec_x4):
> > # ifndef USE_AS_STRCHRNUL
> > /* Check to see if first match was CHAR (k0) or null (k1). */
> > @@ -144,9 +130,18 @@ L(first_vec_x4):
> > leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
> > ret
> >
> > +# ifndef USE_AS_STRCHRNUL
> > +L(zero):
> > + xorl %eax, %eax
> > + ret
> > +# endif
> > +
> > +
> > .p2align 4
> > L(first_vec_x1):
> > - tzcntl %eax, %eax
> > + /* Use bsf here to save 1 byte, keeping the block in 1x
> > + fetch block. eax guaranteed non-zero. */
> > + bsfl %eax, %eax
> > # ifndef USE_AS_STRCHRNUL
> > /* Found CHAR or the null byte. */
> > cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > @@ -158,7 +153,7 @@ L(first_vec_x1):
> > leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax
> > ret
> >
> > - .p2align 4
> > + .p2align 4,, 10
> > L(first_vec_x2):
> > # ifndef USE_AS_STRCHRNUL
> > /* Check to see if first match was CHAR (k0) or null (k1). */
> > @@ -179,6 +174,21 @@ L(first_vec_x2):
> > leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
> > ret
> >
> > + .p2align 4,, 10
> > +L(first_vec_x3):
> > + /* Use bsf here to save 1 byte, keeping the block in 1x
> > + fetch block. eax guaranteed non-zero. */
> > + bsfl %eax, %eax
> > +# ifndef USE_AS_STRCHRNUL
> > + /* Found CHAR or the null byte. */
> > + cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > + jne L(zero)
> > +# endif
> > + /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > + bytes. */
> > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
> > + ret
> > +
> > .p2align 4
> > L(aligned_more):
> > /* Align data to VEC_SIZE. */
> > @@ -195,7 +205,7 @@ L(cross_page_continue):
> > vpxorq %YMM1, %YMM0, %YMM2
> > VPMINU %YMM2, %YMM1, %YMM2
> > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */
> > - VPCMP $0, %YMMZERO, %YMM2, %k0
> > + VPTESTN %YMM2, %YMM2, %k0
> > kmovd %k0, %eax
> > testl %eax, %eax
> > jnz L(first_vec_x1)
> > @@ -206,7 +216,7 @@ L(cross_page_continue):
> > /* Each bit in K0 represents a CHAR in YMM1. */
> > VPCMP $0, %YMM1, %YMM0, %k0
> > /* Each bit in K1 represents a CHAR in YMM1. */
> > - VPCMP $0, %YMM1, %YMMZERO, %k1
> > + VPTESTN %YMM1, %YMM1, %k1
> > kortestd %k0, %k1
> > jnz L(first_vec_x2)
> >
> > @@ -215,7 +225,7 @@ L(cross_page_continue):
> > vpxorq %YMM1, %YMM0, %YMM2
> > VPMINU %YMM2, %YMM1, %YMM2
> > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */
> > - VPCMP $0, %YMMZERO, %YMM2, %k0
> > + VPTESTN %YMM2, %YMM2, %k0
> > kmovd %k0, %eax
> > testl %eax, %eax
> > jnz L(first_vec_x3)
> > @@ -224,7 +234,7 @@ L(cross_page_continue):
> > /* Each bit in K0 represents a CHAR in YMM1. */
> > VPCMP $0, %YMM1, %YMM0, %k0
> > /* Each bit in K1 represents a CHAR in YMM1. */
> > - VPCMP $0, %YMM1, %YMMZERO, %k1
> > + VPTESTN %YMM1, %YMM1, %k1
> > kortestd %k0, %k1
> > jnz L(first_vec_x4)
> >
> > @@ -265,33 +275,33 @@ L(loop_4x_vec):
> > VPMINU %YMM3, %YMM4, %YMM4
> > VPMINU %YMM2, %YMM4, %YMM4{%k4}{z}
> >
> > - VPCMP $0, %YMMZERO, %YMM4, %k1
> > + VPTESTN %YMM4, %YMM4, %k1
> > kmovd %k1, %ecx
> > subq $-(VEC_SIZE * 4), %rdi
> > testl %ecx, %ecx
> > jz L(loop_4x_vec)
> >
> > - VPCMP $0, %YMMZERO, %YMM1, %k0
> > + VPTESTN %YMM1, %YMM1, %k0
> > kmovd %k0, %eax
> > testl %eax, %eax
> > jnz L(last_vec_x1)
> >
> > - VPCMP $0, %YMMZERO, %YMM2, %k0
> > + VPTESTN %YMM2, %YMM2, %k0
> > kmovd %k0, %eax
> > testl %eax, %eax
> > jnz L(last_vec_x2)
> >
> > - VPCMP $0, %YMMZERO, %YMM3, %k0
> > + VPTESTN %YMM3, %YMM3, %k0
> > kmovd %k0, %eax
> > /* Combine YMM3 matches (eax) with YMM4 matches (ecx). */
> > # ifdef USE_AS_WCSCHR
> > sall $8, %ecx
> > orl %ecx, %eax
> > - tzcntl %eax, %eax
> > + bsfl %eax, %eax
> > # else
> > salq $32, %rcx
> > orq %rcx, %rax
> > - tzcntq %rax, %rax
> > + bsfq %rax, %rax
> > # endif
> > # ifndef USE_AS_STRCHRNUL
> > /* Check if match was CHAR or null. */
> > @@ -303,28 +313,28 @@ L(loop_4x_vec):
> > leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
> > ret
> >
> > -# ifndef USE_AS_STRCHRNUL
> > -L(zero_end):
> > - xorl %eax, %eax
> > - ret
> > + .p2align 4,, 8
> > +L(last_vec_x1):
> > + bsfl %eax, %eax
> > +# ifdef USE_AS_WCSCHR
> > + /* NB: Multiply wchar_t count by 4 to get the number of bytes.
> > + */
> > + leaq (%rdi, %rax, CHAR_SIZE), %rax
> > +# else
> > + addq %rdi, %rax
> > # endif
> >
> > - .p2align 4
> > -L(last_vec_x1):
> > - tzcntl %eax, %eax
> > # ifndef USE_AS_STRCHRNUL
> > /* Check if match was null. */
> > - cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > + cmp (%rax), %CHAR_REG
> > jne L(zero_end)
> > # endif
> > - /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > - bytes. */
> > - leaq (%rdi, %rax, CHAR_SIZE), %rax
> > +
> > ret
> >
> > - .p2align 4
> > + .p2align 4,, 8
> > L(last_vec_x2):
> > - tzcntl %eax, %eax
> > + bsfl %eax, %eax
> > # ifndef USE_AS_STRCHRNUL
> > /* Check if match was null. */
> > cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > @@ -336,7 +346,7 @@ L(last_vec_x2):
> > ret
> >
> > /* Cold case for crossing page with first load. */
> > - .p2align 4
> > + .p2align 4,, 8
> > L(cross_page_boundary):
> > movq %rdi, %rdx
> > /* Align rdi. */
> > @@ -346,9 +356,9 @@ L(cross_page_boundary):
> > vpxorq %YMM1, %YMM0, %YMM2
> > VPMINU %YMM2, %YMM1, %YMM2
> > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */
> > - VPCMP $0, %YMMZERO, %YMM2, %k0
> > + VPTESTN %YMM2, %YMM2, %k0
> > kmovd %k0, %eax
> > - /* Remove the leading bits. */
> > + /* Remove the leading bits. */
> > # ifdef USE_AS_WCSCHR
> > movl %edx, %SHIFT_REG
> > /* NB: Divide shift count by 4 since each bit in K1 represent 4
> > @@ -360,20 +370,24 @@ L(cross_page_boundary):
> > /* If eax is zero continue. */
> > testl %eax, %eax
> > jz L(cross_page_continue)
> > - tzcntl %eax, %eax
> > -# ifndef USE_AS_STRCHRNUL
> > - /* Check to see if match was CHAR or null. */
> > - cmp (%rdx, %rax, CHAR_SIZE), %CHAR_REG
> > - jne L(zero_end)
> > -# endif
> > + bsfl %eax, %eax
> > +
> > # ifdef USE_AS_WCSCHR
> > /* NB: Multiply wchar_t count by 4 to get the number of
> > bytes. */
> > leaq (%rdx, %rax, CHAR_SIZE), %rax
> > # else
> > addq %rdx, %rax
> > +# endif
> > +# ifndef USE_AS_STRCHRNUL
> > + /* Check to see if match was CHAR or null. */
> > + cmp (%rax), %CHAR_REG
> > + je L(cross_page_ret)
> > +L(zero_end):
> > + xorl %eax, %eax
> > +L(cross_page_ret):
> > # endif
> > ret
> >
> > END (STRCHR)
> > -# endif
> > +#endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
* Re: [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c
[not found] ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com>
@ 2022-05-12 19:34 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:34 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 11:57 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
> > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary
> > sign extensions.
> >
> > geometric_mean(N=20) of all benchmarks that don't fall back on
> > sse2/strlen; New / Original: .928
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2, pos, New Time / Old Time
> > 0, 0, 0, 512, 1.207
> > 1, 0, 0, 512, 1.039
> > 1, 1, 0, 512, 0.997
> > 1, 0, 1, 512, 0.981
> > 1, 1, 1, 512, 0.977
> > 2, 0, 0, 512, 1.02
> > 2, 2, 0, 512, 0.979
> > 2, 0, 2, 512, 0.902
> > 2, 2, 2, 512, 0.958
> > 3, 0, 0, 512, 0.978
> > 3, 3, 0, 512, 0.988
> > 3, 0, 3, 512, 0.979
> > 3, 3, 3, 512, 0.955
> > 4, 0, 0, 512, 0.969
> > 4, 4, 0, 512, 0.991
> > 4, 0, 4, 512, 0.94
> > 4, 4, 4, 512, 0.958
> > 5, 0, 0, 512, 0.963
> > 5, 5, 0, 512, 1.004
> > 5, 0, 5, 512, 0.948
> > 5, 5, 5, 512, 0.971
> > 6, 0, 0, 512, 0.933
> > 6, 6, 0, 512, 1.007
> > 6, 0, 6, 512, 0.921
> > 6, 6, 6, 512, 0.969
> > 7, 0, 0, 512, 0.928
> > 7, 7, 0, 512, 0.976
> > 7, 0, 7, 512, 0.932
> > 7, 7, 7, 512, 0.995
> > 8, 0, 0, 512, 0.931
> > 8, 0, 8, 512, 0.766
> > 9, 0, 0, 512, 0.965
> > 9, 1, 0, 512, 0.999
> > 9, 0, 9, 512, 0.765
> > 9, 1, 9, 512, 0.97
> > 10, 0, 0, 512, 0.976
> > 10, 2, 0, 512, 0.991
> > 10, 0, 10, 512, 0.768
> > 10, 2, 10, 512, 0.926
> > 11, 0, 0, 512, 0.958
> > 11, 3, 0, 512, 1.006
> > 11, 0, 11, 512, 0.768
> > 11, 3, 11, 512, 0.908
> > 12, 0, 0, 512, 0.945
> > 12, 4, 0, 512, 0.896
> > 12, 0, 12, 512, 0.764
> > 12, 4, 12, 512, 0.785
> > 13, 0, 0, 512, 0.957
> > 13, 5, 0, 512, 1.019
> > 13, 0, 13, 512, 0.76
> > 13, 5, 13, 512, 0.785
> > 14, 0, 0, 512, 0.918
> > 14, 6, 0, 512, 1.004
> > 14, 0, 14, 512, 0.78
> > 14, 6, 14, 512, 0.711
> > 15, 0, 0, 512, 0.855
> > 15, 7, 0, 512, 0.985
> > 15, 0, 15, 512, 0.779
> > 15, 7, 15, 512, 0.772
> > 16, 0, 0, 512, 0.987
> > 16, 0, 16, 512, 0.99
> > 17, 0, 0, 512, 0.996
> > 17, 1, 0, 512, 0.979
> > 17, 0, 17, 512, 1.001
> > 17, 1, 17, 512, 1.03
> > 18, 0, 0, 512, 0.976
> > 18, 2, 0, 512, 0.989
> > 18, 0, 18, 512, 0.976
> > 18, 2, 18, 512, 0.992
> > 19, 0, 0, 512, 0.991
> > 19, 3, 0, 512, 0.988
> > 19, 0, 19, 512, 1.009
> > 19, 3, 19, 512, 1.018
> > 20, 0, 0, 512, 0.999
> > 20, 4, 0, 512, 1.005
> > 20, 0, 20, 512, 0.993
> > 20, 4, 20, 512, 0.983
> > 21, 0, 0, 512, 0.982
> > 21, 5, 0, 512, 0.988
> > 21, 0, 21, 512, 0.978
> > 21, 5, 21, 512, 0.984
> > 22, 0, 0, 512, 0.988
> > 22, 6, 0, 512, 0.979
> > 22, 0, 22, 512, 0.984
> > 22, 6, 22, 512, 0.983
> > 23, 0, 0, 512, 0.996
> > 23, 7, 0, 512, 0.998
> > 23, 0, 23, 512, 0.979
> > 23, 7, 23, 512, 0.987
> > 24, 0, 0, 512, 0.99
> > 24, 0, 24, 512, 0.979
> > 25, 0, 0, 512, 0.985
> > 25, 1, 0, 512, 0.988
> > 25, 0, 25, 512, 0.99
> > 25, 1, 25, 512, 0.986
> > 26, 0, 0, 512, 1.005
> > 26, 2, 0, 512, 0.995
> > 26, 0, 26, 512, 0.992
> > 26, 2, 26, 512, 0.983
> > 27, 0, 0, 512, 0.986
> > 27, 3, 0, 512, 0.978
> > 27, 0, 27, 512, 0.986
> > 27, 3, 27, 512, 0.973
> > 28, 0, 0, 512, 0.995
> > 28, 4, 0, 512, 0.993
> > 28, 0, 28, 512, 0.983
> > 28, 4, 28, 512, 1.005
> > 29, 0, 0, 512, 0.983
> > 29, 5, 0, 512, 0.982
> > 29, 0, 29, 512, 0.984
> > 29, 5, 29, 512, 1.005
> > 30, 0, 0, 512, 0.978
> > 30, 6, 0, 512, 0.985
> > 30, 0, 30, 512, 0.994
> > 30, 6, 30, 512, 0.993
> > 31, 0, 0, 512, 0.984
> > 31, 7, 0, 512, 0.983
> > 31, 0, 31, 512, 1.0
> > 31, 7, 31, 512, 1.031
> > 4, 0, 0, 32, 0.916
> > 4, 1, 0, 32, 0.952
> > 4, 0, 1, 32, 0.927
> > 4, 1, 1, 32, 0.969
> > 4, 0, 0, 64, 0.961
> > 4, 2, 0, 64, 0.955
> > 4, 0, 2, 64, 0.975
> > 4, 2, 2, 64, 0.972
> > 4, 0, 0, 128, 0.971
> > 4, 3, 0, 128, 0.982
> > 4, 0, 3, 128, 0.945
> > 4, 3, 3, 128, 0.971
> > 4, 0, 0, 256, 1.004
> > 4, 4, 0, 256, 0.966
> > 4, 0, 4, 256, 0.961
> > 4, 4, 4, 256, 0.971
> > 4, 5, 0, 512, 0.929
> > 4, 0, 5, 512, 0.969
> > 4, 5, 5, 512, 0.985
> > 4, 0, 0, 1024, 1.003
> > 4, 6, 0, 1024, 1.009
> > 4, 0, 6, 1024, 1.005
> > 4, 6, 6, 1024, 0.999
> > 4, 0, 0, 2048, 0.917
> > 4, 7, 0, 2048, 1.015
> > 4, 0, 7, 2048, 1.011
> > 4, 7, 7, 2048, 0.907
> > 10, 1, 0, 64, 0.964
> > 10, 1, 1, 64, 0.966
> > 10, 2, 0, 64, 0.953
> > 10, 2, 2, 64, 0.972
> > 10, 3, 0, 64, 0.962
> > 10, 3, 3, 64, 0.969
> > 10, 4, 0, 64, 0.957
> > 10, 4, 4, 64, 0.969
> > 10, 5, 0, 64, 0.961
> > 10, 5, 5, 64, 0.965
> > 10, 6, 0, 64, 0.949
> > 10, 6, 6, 64, 0.9
> > 10, 7, 0, 64, 0.957
> > 10, 7, 7, 64, 0.897
> > 6, 0, 0, 0, 0.991
> > 6, 0, 0, 1, 1.011
> > 6, 0, 1, 1, 0.939
> > 6, 0, 0, 2, 1.016
> > 6, 0, 2, 2, 0.94
> > 6, 0, 0, 3, 1.019
> > 6, 0, 3, 3, 0.941
> > 6, 0, 0, 4, 1.056
> > 6, 0, 4, 4, 0.884
> > 6, 0, 0, 5, 0.977
> > 6, 0, 5, 5, 0.934
> > 6, 0, 0, 6, 0.954
> > 6, 0, 6, 6, 0.93
> > 6, 0, 0, 7, 0.963
> > 6, 0, 7, 7, 0.916
> > 6, 0, 0, 8, 0.963
> > 6, 0, 8, 8, 0.945
> > 6, 0, 0, 9, 1.028
> > 6, 0, 9, 9, 0.942
> > 6, 0, 0, 10, 0.955
> > 6, 0, 10, 10, 0.831
> > 6, 0, 0, 11, 0.948
> > 6, 0, 11, 11, 0.82
> > 6, 0, 0, 12, 1.033
> > 6, 0, 12, 12, 0.873
> > 6, 0, 0, 13, 0.983
> > 6, 0, 13, 13, 0.852
> > 6, 0, 0, 14, 0.984
> > 6, 0, 14, 14, 0.853
> > 6, 0, 0, 15, 0.984
> > 6, 0, 15, 15, 0.882
> > 6, 0, 0, 16, 0.971
> > 6, 0, 16, 16, 0.958
> > 6, 0, 0, 17, 0.938
> > 6, 0, 17, 17, 0.947
> > 6, 0, 0, 18, 0.96
> > 6, 0, 18, 18, 0.938
> > 6, 0, 0, 19, 0.903
> > 6, 0, 19, 19, 0.943
> > 6, 0, 0, 20, 0.947
> > 6, 0, 20, 20, 0.951
> > 6, 0, 0, 21, 0.948
> > 6, 0, 21, 21, 0.96
> > 6, 0, 0, 22, 0.926
> > 6, 0, 22, 22, 0.951
> > 6, 0, 0, 23, 0.923
> > 6, 0, 23, 23, 0.959
> > 6, 0, 0, 24, 0.918
> > 6, 0, 24, 24, 0.952
> > 6, 0, 0, 25, 0.97
> > 6, 0, 25, 25, 0.952
> > 6, 0, 0, 26, 0.871
> > 6, 0, 26, 26, 0.869
> > 6, 0, 0, 27, 0.935
> > 6, 0, 27, 27, 0.836
> > 6, 0, 0, 28, 0.936
> > 6, 0, 28, 28, 0.857
> > 6, 0, 0, 29, 0.876
> > 6, 0, 29, 29, 0.859
> > 6, 0, 0, 30, 0.934
> > 6, 0, 30, 30, 0.857
> > 6, 0, 0, 31, 0.962
> > 6, 0, 31, 31, 0.86
> > 6, 0, 0, 32, 0.912
> > 6, 0, 32, 32, 0.94
> > 6, 0, 0, 33, 0.903
> > 6, 0, 33, 33, 0.968
> > 6, 0, 0, 34, 0.913
> > 6, 0, 34, 34, 0.896
> > 6, 0, 0, 35, 0.904
> > 6, 0, 35, 35, 0.913
> > 6, 0, 0, 36, 0.905
> > 6, 0, 36, 36, 0.907
> > 6, 0, 0, 37, 0.899
> > 6, 0, 37, 37, 0.9
> > 6, 0, 0, 38, 0.912
> > 6, 0, 38, 38, 0.919
> > 6, 0, 0, 39, 0.925
> > 6, 0, 39, 39, 0.927
> > 6, 0, 0, 40, 0.923
> > 6, 0, 40, 40, 0.972
> > 6, 0, 0, 41, 0.92
> > 6, 0, 41, 41, 0.966
> > 6, 0, 0, 42, 0.915
> > 6, 0, 42, 42, 0.834
> > 6, 0, 0, 43, 0.92
> > 6, 0, 43, 43, 0.856
> > 6, 0, 0, 44, 0.908
> > 6, 0, 44, 44, 0.858
> > 6, 0, 0, 45, 0.932
> > 6, 0, 45, 45, 0.847
> > 6, 0, 0, 46, 0.927
> > 6, 0, 46, 46, 0.859
> > 6, 0, 0, 47, 0.902
> > 6, 0, 47, 47, 0.855
> > 6, 0, 0, 48, 0.949
> > 6, 0, 48, 48, 0.934
> > 6, 0, 0, 49, 0.907
> > 6, 0, 49, 49, 0.943
> > 6, 0, 0, 50, 0.934
> > 6, 0, 50, 50, 0.943
> > 6, 0, 0, 51, 0.933
> > 6, 0, 51, 51, 0.939
> > 6, 0, 0, 52, 0.944
> > 6, 0, 52, 52, 0.944
> > 6, 0, 0, 53, 0.939
> > 6, 0, 53, 53, 0.938
> > 6, 0, 0, 54, 0.9
> > 6, 0, 54, 54, 0.923
> > 6, 0, 0, 55, 0.9
> > 6, 0, 55, 55, 0.927
> > 6, 0, 0, 56, 0.9
> > 6, 0, 56, 56, 0.917
> > 6, 0, 0, 57, 0.9
> > 6, 0, 57, 57, 0.916
> > 6, 0, 0, 58, 0.914
> > 6, 0, 58, 58, 0.784
> > 6, 0, 0, 59, 0.863
> > 6, 0, 59, 59, 0.846
> > 6, 0, 0, 60, 0.88
> > 6, 0, 60, 60, 0.827
> > 6, 0, 0, 61, 0.896
> > 6, 0, 61, 61, 0.847
> > 6, 0, 0, 62, 0.894
> > 6, 0, 62, 62, 0.865
> > 6, 0, 0, 63, 0.934
> > 6, 0, 63, 63, 0.866
> >
> > sysdeps/x86_64/multiarch/strcspn-c.c | 83 +++++++++++++---------------
> > 1 file changed, 37 insertions(+), 46 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcspn-c.c b/sysdeps/x86_64/multiarch/strcspn-c.c
> > index 013aebf797..c312fab8b1 100644
> > --- a/sysdeps/x86_64/multiarch/strcspn-c.c
> > +++ b/sysdeps/x86_64/multiarch/strcspn-c.c
> > @@ -84,83 +84,74 @@ STRCSPN_SSE42 (const char *s, const char *a)
> > RETURN (NULL, strlen (s));
> >
> > const char *aligned;
> > - __m128i mask;
> > - int offset = (int) ((size_t) a & 15);
> > + __m128i mask, maskz, zero;
> > + unsigned int maskz_bits;
> > + unsigned int offset = (unsigned int) ((size_t) a & 15);
> > + zero = _mm_set1_epi8 (0);
> > if (offset != 0)
> > {
> > /* Load masks. */
> > aligned = (const char *) ((size_t) a & -16L);
> > __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> > -
> > - mask = __m128i_shift_right (mask0, offset);
> > + maskz = _mm_cmpeq_epi8 (mask0, zero);
> >
> > /* Find where the NULL terminator is. */
> > - int length = _mm_cmpistri (mask, mask, 0x3a);
> > - if (length == 16 - offset)
> > - {
> > - /* There is no NULL terminator. */
> > - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> > - int index = _mm_cmpistri (mask1, mask1, 0x3a);
> > - length += index;
> > -
> > - /* Don't use SSE4.2 if the length of A > 16. */
> > - if (length > 16)
> > - return STRCSPN_SSE2 (s, a);
> > -
> > - if (index != 0)
> > - {
> > - /* Combine mask0 and mask1. We could play games with
> > - palignr, but frankly this data should be in L1 now
> > - so do the merge via an unaligned load. */
> > - mask = _mm_loadu_si128 ((__m128i *) a);
> > - }
> > - }
> > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > + if (maskz_bits != 0)
> > + {
> > + mask = __m128i_shift_right (mask0, offset);
> > + offset = (unsigned int) ((size_t) s & 15);
> > + if (offset)
> > + goto start_unaligned;
> > +
> > + aligned = s;
> > + goto start_loop;
> > + }
> > }
> > - else
> > - {
> > - /* A is aligned. */
> > - mask = _mm_load_si128 ((__m128i *) a);
> >
> > - /* Find where the NULL terminator is. */
> > - int length = _mm_cmpistri (mask, mask, 0x3a);
> > - if (length == 16)
> > - {
> > - /* There is no NULL terminator. Don't use SSE4.2 if the length
> > - of A > 16. */
> > - if (a[16] != 0)
> > - return STRCSPN_SSE2 (s, a);
> > - }
> > + /* A is aligned. */
> > + mask = _mm_loadu_si128 ((__m128i *) a);
> > + /* Find where the NULL terminator is. */
> > + maskz = _mm_cmpeq_epi8 (mask, zero);
> > + maskz_bits = _mm_movemask_epi8 (maskz);
> > + if (maskz_bits == 0)
> > + {
> > + /* There is no NULL terminator. Don't use SSE4.2 if the length
> > + of A > 16. */
> > + if (a[16] != 0)
> > + return STRCSPN_SSE2 (s, a);
> > }
> >
> > - offset = (int) ((size_t) s & 15);
> > + aligned = s;
> > + offset = (unsigned int) ((size_t) s & 15);
> > if (offset != 0)
> > {
> > + start_unaligned:
> > /* Check partial string. */
> > aligned = (const char *) ((size_t) s & -16L);
> > __m128i value = _mm_load_si128 ((__m128i *) aligned);
> >
> > value = __m128i_shift_right (value, offset);
> >
> > - int length = _mm_cmpistri (mask, value, 0x2);
> > + unsigned int length = _mm_cmpistri (mask, value, 0x2);
> > /* No need to check ZFlag since ZFlag is always 1. */
> > - int cflag = _mm_cmpistrc (mask, value, 0x2);
> > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x2);
> > if (cflag)
> > RETURN ((char *) (s + length), length);
> > /* Find where the NULL terminator is. */
> > - int index = _mm_cmpistri (value, value, 0x3a);
> > + unsigned int index = _mm_cmpistri (value, value, 0x3a);
> > if (index < 16 - offset)
> > RETURN (NULL, index);
> > aligned += 16;
> > }
> > - else
> > - aligned = s;
> >
> > +start_loop:
> > while (1)
> > {
> > __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > - int index = _mm_cmpistri (mask, value, 0x2);
> > - int cflag = _mm_cmpistrc (mask, value, 0x2);
> > - int zflag = _mm_cmpistrz (mask, value, 0x2);
> > + unsigned int index = _mm_cmpistri (mask, value, 0x2);
> > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x2);
> > + unsigned int zflag = _mm_cmpistrz (mask, value, 0x2);
> > if (cflag)
> > RETURN ((char *) (aligned + index), (size_t) (aligned + index - s));
> > if (zflag)
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
* Re: [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c
[not found] ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com>
@ 2022-05-12 19:39 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:39 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 11:58 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
> > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary
> > sign extensions.
> >
> > geometric_mean(N=20) of all benchmarks that don't fall back on
> > sse2; New / Original: .901
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2, pos, New Time / Old Time
> > 1, 0, 0, 512, 0.768
> > 1, 1, 0, 512, 0.666
> > 1, 0, 1, 512, 1.193
> > 1, 1, 1, 512, 0.872
> > 2, 0, 0, 512, 0.698
> > 2, 2, 0, 512, 0.687
> > 2, 0, 2, 512, 1.393
> > 2, 2, 2, 512, 0.944
> > 3, 0, 0, 512, 0.691
> > 3, 3, 0, 512, 0.676
> > 3, 0, 3, 512, 1.388
> > 3, 3, 3, 512, 0.948
> > 4, 0, 0, 512, 0.74
> > 4, 4, 0, 512, 0.678
> > 4, 0, 4, 512, 1.421
> > 4, 4, 4, 512, 0.943
> > 5, 0, 0, 512, 0.691
> > 5, 5, 0, 512, 0.675
> > 5, 0, 5, 512, 1.348
> > 5, 5, 5, 512, 0.952
> > 6, 0, 0, 512, 0.685
> > 6, 6, 0, 512, 0.67
> > 6, 0, 6, 512, 1.333
> > 6, 6, 6, 512, 0.95
> > 7, 0, 0, 512, 0.688
> > 7, 7, 0, 512, 0.675
> > 7, 0, 7, 512, 1.344
> > 7, 7, 7, 512, 0.919
> > 8, 0, 0, 512, 0.716
> > 8, 0, 8, 512, 0.935
> > 9, 0, 0, 512, 0.716
> > 9, 1, 0, 512, 0.712
> > 9, 0, 9, 512, 0.956
> > 9, 1, 9, 512, 0.992
> > 10, 0, 0, 512, 0.699
> > 10, 2, 0, 512, 0.68
> > 10, 0, 10, 512, 0.952
> > 10, 2, 10, 512, 0.932
> > 11, 0, 0, 512, 0.705
> > 11, 3, 0, 512, 0.685
> > 11, 0, 11, 512, 0.956
> > 11, 3, 11, 512, 0.927
> > 12, 0, 0, 512, 0.695
> > 12, 4, 0, 512, 0.675
> > 12, 0, 12, 512, 0.948
> > 12, 4, 12, 512, 0.928
> > 13, 0, 0, 512, 0.7
> > 13, 5, 0, 512, 0.678
> > 13, 0, 13, 512, 0.944
> > 13, 5, 13, 512, 0.931
> > 14, 0, 0, 512, 0.703
> > 14, 6, 0, 512, 0.678
> > 14, 0, 14, 512, 0.949
> > 14, 6, 14, 512, 0.93
> > 15, 0, 0, 512, 0.694
> > 15, 7, 0, 512, 0.678
> > 15, 0, 15, 512, 0.953
> > 15, 7, 15, 512, 0.924
> > 16, 0, 0, 512, 1.021
> > 16, 0, 16, 512, 1.067
> > 17, 0, 0, 512, 0.991
> > 17, 1, 0, 512, 0.984
> > 17, 0, 17, 512, 0.979
> > 17, 1, 17, 512, 0.993
> > 18, 0, 0, 512, 0.992
> > 18, 2, 0, 512, 1.008
> > 18, 0, 18, 512, 1.016
> > 18, 2, 18, 512, 0.993
> > 19, 0, 0, 512, 0.984
> > 19, 3, 0, 512, 0.985
> > 19, 0, 19, 512, 1.007
> > 19, 3, 19, 512, 1.006
> > 20, 0, 0, 512, 0.969
> > 20, 4, 0, 512, 0.968
> > 20, 0, 20, 512, 0.975
> > 20, 4, 20, 512, 0.975
> > 21, 0, 0, 512, 0.992
> > 21, 5, 0, 512, 0.992
> > 21, 0, 21, 512, 0.98
> > 21, 5, 21, 512, 0.97
> > 22, 0, 0, 512, 0.989
> > 22, 6, 0, 512, 0.987
> > 22, 0, 22, 512, 0.99
> > 22, 6, 22, 512, 0.985
> > 23, 0, 0, 512, 0.989
> > 23, 7, 0, 512, 0.98
> > 23, 0, 23, 512, 1.0
> > 23, 7, 23, 512, 0.993
> > 24, 0, 0, 512, 0.99
> > 24, 0, 24, 512, 0.998
> > 25, 0, 0, 512, 1.01
> > 25, 1, 0, 512, 1.0
> > 25, 0, 25, 512, 0.97
> > 25, 1, 25, 512, 0.967
> > 26, 0, 0, 512, 1.009
> > 26, 2, 0, 512, 0.986
> > 26, 0, 26, 512, 0.997
> > 26, 2, 26, 512, 0.993
> > 27, 0, 0, 512, 0.984
> > 27, 3, 0, 512, 0.997
> > 27, 0, 27, 512, 0.989
> > 27, 3, 27, 512, 0.976
> > 28, 0, 0, 512, 0.991
> > 28, 4, 0, 512, 1.003
> > 28, 0, 28, 512, 0.986
> > 28, 4, 28, 512, 0.989
> > 29, 0, 0, 512, 0.986
> > 29, 5, 0, 512, 0.985
> > 29, 0, 29, 512, 0.984
> > 29, 5, 29, 512, 0.977
> > 30, 0, 0, 512, 0.991
> > 30, 6, 0, 512, 0.987
> > 30, 0, 30, 512, 0.979
> > 30, 6, 30, 512, 0.974
> > 31, 0, 0, 512, 0.995
> > 31, 7, 0, 512, 0.995
> > 31, 0, 31, 512, 0.994
> > 31, 7, 31, 512, 0.984
> > 4, 0, 0, 32, 0.861
> > 4, 1, 0, 32, 0.864
> > 4, 0, 1, 32, 0.962
> > 4, 1, 1, 32, 0.967
> > 4, 0, 0, 64, 0.884
> > 4, 2, 0, 64, 0.818
> > 4, 0, 2, 64, 0.889
> > 4, 2, 2, 64, 0.918
> > 4, 0, 0, 128, 0.942
> > 4, 3, 0, 128, 0.884
> > 4, 0, 3, 128, 0.931
> > 4, 3, 3, 128, 0.883
> > 4, 0, 0, 256, 0.964
> > 4, 4, 0, 256, 0.922
> > 4, 0, 4, 256, 0.956
> > 4, 4, 4, 256, 0.93
> > 4, 5, 0, 512, 0.833
> > 4, 0, 5, 512, 1.027
> > 4, 5, 5, 512, 0.929
> > 4, 0, 0, 1024, 0.998
> > 4, 6, 0, 1024, 0.986
> > 4, 0, 6, 1024, 0.984
> > 4, 6, 6, 1024, 0.977
> > 4, 0, 0, 2048, 0.991
> > 4, 7, 0, 2048, 0.987
> > 4, 0, 7, 2048, 0.996
> > 4, 7, 7, 2048, 0.98
> > 10, 1, 0, 64, 0.826
> > 10, 1, 1, 64, 0.907
> > 10, 2, 0, 64, 0.829
> > 10, 2, 2, 64, 0.91
> > 10, 3, 0, 64, 0.83
> > 10, 3, 3, 64, 0.915
> > 10, 4, 0, 64, 0.83
> > 10, 4, 4, 64, 0.911
> > 10, 5, 0, 64, 0.828
> > 10, 5, 5, 64, 0.905
> > 10, 6, 0, 64, 0.828
> > 10, 6, 6, 64, 0.812
> > 10, 7, 0, 64, 0.83
> > 10, 7, 7, 64, 0.819
> > 6, 0, 0, 0, 1.261
> > 6, 0, 0, 1, 1.252
> > 6, 0, 1, 1, 0.845
> > 6, 0, 0, 2, 1.27
> > 6, 0, 2, 2, 0.85
> > 6, 0, 0, 3, 1.269
> > 6, 0, 3, 3, 0.845
> > 6, 0, 0, 4, 1.287
> > 6, 0, 4, 4, 0.852
> > 6, 0, 0, 5, 1.278
> > 6, 0, 5, 5, 0.851
> > 6, 0, 0, 6, 1.269
> > 6, 0, 6, 6, 0.841
> > 6, 0, 0, 7, 1.268
> > 6, 0, 7, 7, 0.851
> > 6, 0, 0, 8, 1.291
> > 6, 0, 8, 8, 0.837
> > 6, 0, 0, 9, 1.283
> > 6, 0, 9, 9, 0.831
> > 6, 0, 0, 10, 1.252
> > 6, 0, 10, 10, 0.997
> > 6, 0, 0, 11, 1.295
> > 6, 0, 11, 11, 1.046
> > 6, 0, 0, 12, 1.296
> > 6, 0, 12, 12, 1.038
> > 6, 0, 0, 13, 1.287
> > 6, 0, 13, 13, 1.082
> > 6, 0, 0, 14, 1.284
> > 6, 0, 14, 14, 1.001
> > 6, 0, 0, 15, 1.286
> > 6, 0, 15, 15, 1.002
> > 6, 0, 0, 16, 0.894
> > 6, 0, 16, 16, 0.874
> > 6, 0, 0, 17, 0.892
> > 6, 0, 17, 17, 0.974
> > 6, 0, 0, 18, 0.907
> > 6, 0, 18, 18, 0.993
> > 6, 0, 0, 19, 0.909
> > 6, 0, 19, 19, 0.99
> > 6, 0, 0, 20, 0.894
> > 6, 0, 20, 20, 0.978
> > 6, 0, 0, 21, 0.89
> > 6, 0, 21, 21, 0.958
> > 6, 0, 0, 22, 0.893
> > 6, 0, 22, 22, 0.99
> > 6, 0, 0, 23, 0.899
> > 6, 0, 23, 23, 0.986
> > 6, 0, 0, 24, 0.893
> > 6, 0, 24, 24, 0.989
> > 6, 0, 0, 25, 0.889
> > 6, 0, 25, 25, 0.982
> > 6, 0, 0, 26, 0.889
> > 6, 0, 26, 26, 0.852
> > 6, 0, 0, 27, 0.89
> > 6, 0, 27, 27, 0.832
> > 6, 0, 0, 28, 0.89
> > 6, 0, 28, 28, 0.831
> > 6, 0, 0, 29, 0.89
> > 6, 0, 29, 29, 0.838
> > 6, 0, 0, 30, 0.907
> > 6, 0, 30, 30, 0.833
> > 6, 0, 0, 31, 0.888
> > 6, 0, 31, 31, 0.837
> > 6, 0, 0, 32, 0.853
> > 6, 0, 32, 32, 0.828
> > 6, 0, 0, 33, 0.857
> > 6, 0, 33, 33, 0.947
> > 6, 0, 0, 34, 0.847
> > 6, 0, 34, 34, 0.954
> > 6, 0, 0, 35, 0.841
> > 6, 0, 35, 35, 0.94
> > 6, 0, 0, 36, 0.854
> > 6, 0, 36, 36, 0.958
> > 6, 0, 0, 37, 0.856
> > 6, 0, 37, 37, 0.957
> > 6, 0, 0, 38, 0.839
> > 6, 0, 38, 38, 0.962
> > 6, 0, 0, 39, 0.866
> > 6, 0, 39, 39, 0.945
> > 6, 0, 0, 40, 0.845
> > 6, 0, 40, 40, 0.961
> > 6, 0, 0, 41, 0.858
> > 6, 0, 41, 41, 0.961
> > 6, 0, 0, 42, 0.862
> > 6, 0, 42, 42, 0.825
> > 6, 0, 0, 43, 0.864
> > 6, 0, 43, 43, 0.82
> > 6, 0, 0, 44, 0.843
> > 6, 0, 44, 44, 0.81
> > 6, 0, 0, 45, 0.859
> > 6, 0, 45, 45, 0.816
> > 6, 0, 0, 46, 0.866
> > 6, 0, 46, 46, 0.81
> > 6, 0, 0, 47, 0.858
> > 6, 0, 47, 47, 0.807
> > 6, 0, 0, 48, 0.87
> > 6, 0, 48, 48, 0.87
> > 6, 0, 0, 49, 0.871
> > 6, 0, 49, 49, 0.874
> > 6, 0, 0, 50, 0.87
> > 6, 0, 50, 50, 0.881
> > 6, 0, 0, 51, 0.868
> > 6, 0, 51, 51, 0.875
> > 6, 0, 0, 52, 0.873
> > 6, 0, 52, 52, 0.871
> > 6, 0, 0, 53, 0.866
> > 6, 0, 53, 53, 0.882
> > 6, 0, 0, 54, 0.863
> > 6, 0, 54, 54, 0.876
> > 6, 0, 0, 55, 0.851
> > 6, 0, 55, 55, 0.871
> > 6, 0, 0, 56, 0.867
> > 6, 0, 56, 56, 0.888
> > 6, 0, 0, 57, 0.862
> > 6, 0, 57, 57, 0.899
> > 6, 0, 0, 58, 0.873
> > 6, 0, 58, 58, 0.798
> > 6, 0, 0, 59, 0.881
> > 6, 0, 59, 59, 0.785
> > 6, 0, 0, 60, 0.867
> > 6, 0, 60, 60, 0.797
> > 6, 0, 0, 61, 0.872
> > 6, 0, 61, 61, 0.791
> > 6, 0, 0, 62, 0.859
> > 6, 0, 62, 62, 0.79
> > 6, 0, 0, 63, 0.87
> > 6, 0, 63, 63, 0.796
> >
> > sysdeps/x86_64/multiarch/strspn-c.c | 86 +++++++++++++----------------
> > 1 file changed, 39 insertions(+), 47 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strspn-c.c b/sysdeps/x86_64/multiarch/strspn-c.c
> > index 8fb3aba64d..6124033ceb 100644
> > --- a/sysdeps/x86_64/multiarch/strspn-c.c
> > +++ b/sysdeps/x86_64/multiarch/strspn-c.c
> > @@ -62,81 +62,73 @@ __strspn_sse42 (const char *s, const char *a)
> > return 0;
> >
> > const char *aligned;
> > - __m128i mask;
> > - int offset = (int) ((size_t) a & 15);
> > + __m128i mask, maskz, zero;
> > + unsigned int maskz_bits;
> > + unsigned int offset = (int) ((size_t) a & 15);
> > + zero = _mm_set1_epi8 (0);
> > if (offset != 0)
> > {
> > /* Load masks. */
> > aligned = (const char *) ((size_t) a & -16L);
> > __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> > -
> > - mask = __m128i_shift_right (mask0, offset);
> > + maskz = _mm_cmpeq_epi8 (mask0, zero);
> >
> > /* Find where the NULL terminator is. */
> > - int length = _mm_cmpistri (mask, mask, 0x3a);
> > - if (length == 16 - offset)
> > - {
> > - /* There is no NULL terminator. */
> > - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> > - int index = _mm_cmpistri (mask1, mask1, 0x3a);
> > - length += index;
> > -
> > - /* Don't use SSE4.2 if the length of A > 16. */
> > - if (length > 16)
> > - return __strspn_sse2 (s, a);
> > -
> > - if (index != 0)
> > - {
> > - /* Combine mask0 and mask1. We could play games with
> > - palignr, but frankly this data should be in L1 now
> > - so do the merge via an unaligned load. */
> > - mask = _mm_loadu_si128 ((__m128i *) a);
> > - }
> > - }
> > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > + if (maskz_bits != 0)
> > + {
> > + mask = __m128i_shift_right (mask0, offset);
> > + offset = (unsigned int) ((size_t) s & 15);
> > + if (offset)
> > + goto start_unaligned;
> > +
> > + aligned = s;
> > + goto start_loop;
> > + }
> > }
> > - else
> > - {
> > - /* A is aligned. */
> > - mask = _mm_load_si128 ((__m128i *) a);
> >
> > - /* Find where the NULL terminator is. */
> > - int length = _mm_cmpistri (mask, mask, 0x3a);
> > - if (length == 16)
> > - {
> > - /* There is no NULL terminator. Don't use SSE4.2 if the length
> > - of A > 16. */
> > - if (a[16] != 0)
> > - return __strspn_sse2 (s, a);
> > - }
> > + /* A is aligned. */
> > + mask = _mm_loadu_si128 ((__m128i *) a);
> > +
> > + /* Find where the NULL terminator is. */
> > + maskz = _mm_cmpeq_epi8 (mask, zero);
> > + maskz_bits = _mm_movemask_epi8 (maskz);
> > + if (maskz_bits == 0)
> > + {
> > + /* There is no NULL terminator. Don't use SSE4.2 if the length
> > + of A > 16. */
> > + if (a[16] != 0)
> > + return __strspn_sse2 (s, a);
> > }
> > + aligned = s;
> > + offset = (unsigned int) ((size_t) s & 15);
> >
> > - offset = (int) ((size_t) s & 15);
> > if (offset != 0)
> > {
> > + start_unaligned:
> > /* Check partial string. */
> > aligned = (const char *) ((size_t) s & -16L);
> > __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > + __m128i adj_value = __m128i_shift_right (value, offset);
> >
> > - value = __m128i_shift_right (value, offset);
> > -
> > - int length = _mm_cmpistri (mask, value, 0x12);
> > + unsigned int length = _mm_cmpistri (mask, adj_value, 0x12);
> > /* No need to check CFlag since it is always 1. */
> > if (length < 16 - offset)
> > return length;
> > /* Find where the NULL terminator is. */
> > - int index = _mm_cmpistri (value, value, 0x3a);
> > - if (index < 16 - offset)
> > + maskz = _mm_cmpeq_epi8 (value, zero);
> > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > + if (maskz_bits != 0)
> > return length;
> > aligned += 16;
> > }
> > - else
> > - aligned = s;
> >
> > +start_loop:
> > while (1)
> > {
> > __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > - int index = _mm_cmpistri (mask, value, 0x12);
> > - int cflag = _mm_cmpistrc (mask, value, 0x12);
> > + unsigned int index = _mm_cmpistri (mask, value, 0x12);
> > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x12);
> > if (cflag)
> > return (size_t) (aligned + index - s);
> > aligned += 16;
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
* Re: [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation
[not found] ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com>
@ 2022-05-12 19:40 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:40 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 11:59 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .678
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2, pos, New Time / Old Time
> > 0, 0, 0, 512, 0.054
> > 1, 0, 0, 512, 0.055
> > 1, 1, 0, 512, 0.051
> > 1, 0, 1, 512, 0.054
> > 1, 1, 1, 512, 0.054
> > 2, 0, 0, 512, 0.861
> > 2, 2, 0, 512, 0.861
> > 2, 0, 2, 512, 0.861
> > 2, 2, 2, 512, 0.864
> > 3, 0, 0, 512, 0.854
> > 3, 3, 0, 512, 0.848
> > 3, 0, 3, 512, 0.845
> > 3, 3, 3, 512, 0.85
> > 4, 0, 0, 512, 0.851
> > 4, 4, 0, 512, 0.85
> > 4, 0, 4, 512, 0.852
> > 4, 4, 4, 512, 0.849
> > 5, 0, 0, 512, 0.938
> > 5, 5, 0, 512, 0.94
> > 5, 0, 5, 512, 0.864
> > 5, 5, 5, 512, 0.86
> > 6, 0, 0, 512, 0.858
> > 6, 6, 0, 512, 0.869
> > 6, 0, 6, 512, 0.847
> > 6, 6, 6, 512, 0.868
> > 7, 0, 0, 512, 0.867
> > 7, 7, 0, 512, 0.861
> > 7, 0, 7, 512, 0.864
> > 7, 7, 7, 512, 0.863
> > 8, 0, 0, 512, 0.884
> > 8, 0, 8, 512, 0.884
> > 9, 0, 0, 512, 0.886
> > 9, 1, 0, 512, 0.894
> > 9, 0, 9, 512, 0.889
> > 9, 1, 9, 512, 0.886
> > 10, 0, 0, 512, 0.859
> > 10, 2, 0, 512, 0.859
> > 10, 0, 10, 512, 0.862
> > 10, 2, 10, 512, 0.861
> > 11, 0, 0, 512, 0.846
> > 11, 3, 0, 512, 0.865
> > 11, 0, 11, 512, 0.859
> > 11, 3, 11, 512, 0.862
> > 12, 0, 0, 512, 0.858
> > 12, 4, 0, 512, 0.857
> > 12, 0, 12, 512, 0.964
> > 12, 4, 12, 512, 0.876
> > 13, 0, 0, 512, 0.827
> > 13, 5, 0, 512, 0.805
> > 13, 0, 13, 512, 0.821
> > 13, 5, 13, 512, 0.825
> > 14, 0, 0, 512, 0.786
> > 14, 6, 0, 512, 0.786
> > 14, 0, 14, 512, 0.803
> > 14, 6, 14, 512, 0.783
> > 15, 0, 0, 512, 0.778
> > 15, 7, 0, 512, 0.792
> > 15, 0, 15, 512, 0.796
> > 15, 7, 15, 512, 0.799
> > 16, 0, 0, 512, 0.803
> > 16, 0, 16, 512, 0.815
> > 17, 0, 0, 512, 0.812
> > 17, 1, 0, 512, 0.826
> > 17, 0, 17, 512, 0.803
> > 17, 1, 17, 512, 0.856
> > 18, 0, 0, 512, 0.801
> > 18, 2, 0, 512, 0.886
> > 18, 0, 18, 512, 0.805
> > 18, 2, 18, 512, 0.807
> > 19, 0, 0, 512, 0.814
> > 19, 3, 0, 512, 0.804
> > 19, 0, 19, 512, 0.813
> > 19, 3, 19, 512, 0.814
> > 20, 0, 0, 512, 0.885
> > 20, 4, 0, 512, 0.799
> > 20, 0, 20, 512, 0.826
> > 20, 4, 20, 512, 0.808
> > 21, 0, 0, 512, 0.816
> > 21, 5, 0, 512, 0.824
> > 21, 0, 21, 512, 0.819
> > 21, 5, 21, 512, 0.826
> > 22, 0, 0, 512, 0.814
> > 22, 6, 0, 512, 0.824
> > 22, 0, 22, 512, 0.81
> > 22, 6, 22, 512, 0.806
> > 23, 0, 0, 512, 0.825
> > 23, 7, 0, 512, 0.829
> > 23, 0, 23, 512, 0.809
> > 23, 7, 23, 512, 0.823
> > 24, 0, 0, 512, 0.829
> > 24, 0, 24, 512, 0.823
> > 25, 0, 0, 512, 0.864
> > 25, 1, 0, 512, 0.895
> > 25, 0, 25, 512, 0.88
> > 25, 1, 25, 512, 0.848
> > 26, 0, 0, 512, 0.903
> > 26, 2, 0, 512, 0.888
> > 26, 0, 26, 512, 0.894
> > 26, 2, 26, 512, 0.89
> > 27, 0, 0, 512, 0.914
> > 27, 3, 0, 512, 0.917
> > 27, 0, 27, 512, 0.902
> > 27, 3, 27, 512, 0.887
> > 28, 0, 0, 512, 0.887
> > 28, 4, 0, 512, 0.877
> > 28, 0, 28, 512, 0.893
> > 28, 4, 28, 512, 0.866
> > 29, 0, 0, 512, 0.885
> > 29, 5, 0, 512, 0.907
> > 29, 0, 29, 512, 0.894
> > 29, 5, 29, 512, 0.906
> > 30, 0, 0, 512, 0.88
> > 30, 6, 0, 512, 0.898
> > 30, 0, 30, 512, 0.9
> > 30, 6, 30, 512, 0.895
> > 31, 0, 0, 512, 0.893
> > 31, 7, 0, 512, 0.874
> > 31, 0, 31, 512, 0.894
> > 31, 7, 31, 512, 0.899
> > 4, 0, 0, 32, 0.618
> > 4, 1, 0, 32, 0.627
> > 4, 0, 1, 32, 0.625
> > 4, 1, 1, 32, 0.613
> > 4, 0, 0, 64, 0.913
> > 4, 2, 0, 64, 0.801
> > 4, 0, 2, 64, 0.759
> > 4, 2, 2, 64, 0.761
> > 4, 0, 0, 128, 0.822
> > 4, 3, 0, 128, 0.863
> > 4, 0, 3, 128, 0.867
> > 4, 3, 3, 128, 0.917
> > 4, 0, 0, 256, 0.816
> > 4, 4, 0, 256, 0.812
> > 4, 0, 4, 256, 0.803
> > 4, 4, 4, 256, 0.811
> > 4, 5, 0, 512, 0.848
> > 4, 0, 5, 512, 0.843
> > 4, 5, 5, 512, 0.857
> > 4, 0, 0, 1024, 0.886
> > 4, 6, 0, 1024, 0.887
> > 4, 0, 6, 1024, 0.881
> > 4, 6, 6, 1024, 0.873
> > 4, 0, 0, 2048, 0.892
> > 4, 7, 0, 2048, 0.894
> > 4, 0, 7, 2048, 0.89
> > 4, 7, 7, 2048, 0.874
> > 10, 1, 0, 64, 0.946
> > 10, 1, 1, 64, 0.81
> > 10, 2, 0, 64, 0.804
> > 10, 2, 2, 64, 0.82
> > 10, 3, 0, 64, 0.772
> > 10, 3, 3, 64, 0.772
> > 10, 4, 0, 64, 0.748
> > 10, 4, 4, 64, 0.751
> > 10, 5, 0, 64, 0.76
> > 10, 5, 5, 64, 0.76
> > 10, 6, 0, 64, 0.726
> > 10, 6, 6, 64, 0.718
> > 10, 7, 0, 64, 0.724
> > 10, 7, 7, 64, 0.72
> > 6, 0, 0, 0, 0.415
> > 6, 0, 0, 1, 0.423
> > 6, 0, 1, 1, 0.412
> > 6, 0, 0, 2, 0.433
> > 6, 0, 2, 2, 0.434
> > 6, 0, 0, 3, 0.427
> > 6, 0, 3, 3, 0.428
> > 6, 0, 0, 4, 0.465
> > 6, 0, 4, 4, 0.466
> > 6, 0, 0, 5, 0.463
> > 6, 0, 5, 5, 0.468
> > 6, 0, 0, 6, 0.435
> > 6, 0, 6, 6, 0.444
> > 6, 0, 0, 7, 0.41
> > 6, 0, 7, 7, 0.42
> > 6, 0, 0, 8, 0.474
> > 6, 0, 8, 8, 0.501
> > 6, 0, 0, 9, 0.471
> > 6, 0, 9, 9, 0.489
> > 6, 0, 0, 10, 0.462
> > 6, 0, 10, 10, 0.46
> > 6, 0, 0, 11, 0.459
> > 6, 0, 11, 11, 0.458
> > 6, 0, 0, 12, 0.516
> > 6, 0, 12, 12, 0.51
> > 6, 0, 0, 13, 0.494
> > 6, 0, 13, 13, 0.524
> > 6, 0, 0, 14, 0.486
> > 6, 0, 14, 14, 0.5
> > 6, 0, 0, 15, 0.48
> > 6, 0, 15, 15, 0.501
> > 6, 0, 0, 16, 0.54
> > 6, 0, 16, 16, 0.538
> > 6, 0, 0, 17, 0.503
> > 6, 0, 17, 17, 0.541
> > 6, 0, 0, 18, 0.537
> > 6, 0, 18, 18, 0.549
> > 6, 0, 0, 19, 0.527
> > 6, 0, 19, 19, 0.537
> > 6, 0, 0, 20, 0.539
> > 6, 0, 20, 20, 0.554
> > 6, 0, 0, 21, 0.558
> > 6, 0, 21, 21, 0.541
> > 6, 0, 0, 22, 0.546
> > 6, 0, 22, 22, 0.561
> > 6, 0, 0, 23, 0.54
> > 6, 0, 23, 23, 0.536
> > 6, 0, 0, 24, 0.565
> > 6, 0, 24, 24, 0.584
> > 6, 0, 0, 25, 0.563
> > 6, 0, 25, 25, 0.58
> > 6, 0, 0, 26, 0.555
> > 6, 0, 26, 26, 0.584
> > 6, 0, 0, 27, 0.569
> > 6, 0, 27, 27, 0.587
> > 6, 0, 0, 28, 0.612
> > 6, 0, 28, 28, 0.623
> > 6, 0, 0, 29, 0.604
> > 6, 0, 29, 29, 0.621
> > 6, 0, 0, 30, 0.59
> > 6, 0, 30, 30, 0.609
> > 6, 0, 0, 31, 0.577
> > 6, 0, 31, 31, 0.588
> > 6, 0, 0, 32, 0.621
> > 6, 0, 32, 32, 0.608
> > 6, 0, 0, 33, 0.601
> > 6, 0, 33, 33, 0.623
> > 6, 0, 0, 34, 0.614
> > 6, 0, 34, 34, 0.615
> > 6, 0, 0, 35, 0.598
> > 6, 0, 35, 35, 0.608
> > 6, 0, 0, 36, 0.626
> > 6, 0, 36, 36, 0.634
> > 6, 0, 0, 37, 0.62
> > 6, 0, 37, 37, 0.634
> > 6, 0, 0, 38, 0.612
> > 6, 0, 38, 38, 0.637
> > 6, 0, 0, 39, 0.627
> > 6, 0, 39, 39, 0.612
> > 6, 0, 0, 40, 0.661
> > 6, 0, 40, 40, 0.674
> > 6, 0, 0, 41, 0.633
> > 6, 0, 41, 41, 0.643
> > 6, 0, 0, 42, 0.634
> > 6, 0, 42, 42, 0.636
> > 6, 0, 0, 43, 0.619
> > 6, 0, 43, 43, 0.625
> > 6, 0, 0, 44, 0.654
> > 6, 0, 44, 44, 0.654
> > 6, 0, 0, 45, 0.647
> > 6, 0, 45, 45, 0.649
> > 6, 0, 0, 46, 0.651
> > 6, 0, 46, 46, 0.651
> > 6, 0, 0, 47, 0.646
> > 6, 0, 47, 47, 0.648
> > 6, 0, 0, 48, 0.662
> > 6, 0, 48, 48, 0.664
> > 6, 0, 0, 49, 0.68
> > 6, 0, 49, 49, 0.667
> > 6, 0, 0, 50, 0.654
> > 6, 0, 50, 50, 0.659
> > 6, 0, 0, 51, 0.638
> > 6, 0, 51, 51, 0.639
> > 6, 0, 0, 52, 0.665
> > 6, 0, 52, 52, 0.669
> > 6, 0, 0, 53, 0.658
> > 6, 0, 53, 53, 0.656
> > 6, 0, 0, 54, 0.669
> > 6, 0, 54, 54, 0.67
> > 6, 0, 0, 55, 0.668
> > 6, 0, 55, 55, 0.664
> > 6, 0, 0, 56, 0.701
> > 6, 0, 56, 56, 0.695
> > 6, 0, 0, 57, 0.687
> > 6, 0, 57, 57, 0.696
> > 6, 0, 0, 58, 0.693
> > 6, 0, 58, 58, 0.704
> > 6, 0, 0, 59, 0.695
> > 6, 0, 59, 59, 0.708
> > 6, 0, 0, 60, 0.708
> > 6, 0, 60, 60, 0.728
> > 6, 0, 0, 61, 0.708
> > 6, 0, 61, 61, 0.71
> > 6, 0, 0, 62, 0.715
> > 6, 0, 62, 62, 0.705
> > 6, 0, 0, 63, 0.677
> > 6, 0, 63, 63, 0.702
> >
> > .../{strcspn-sse2.S => strcspn-sse2.c} | 8 +-
> > sysdeps/x86_64/strcspn.S | 119 ------------------
> > 2 files changed, 4 insertions(+), 123 deletions(-)
> > rename sysdeps/x86_64/multiarch/{strcspn-sse2.S => strcspn-sse2.c} (85%)
> > delete mode 100644 sysdeps/x86_64/strcspn.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcspn-sse2.S b/sysdeps/x86_64/multiarch/strcspn-sse2.c
> > similarity index 85%
> > rename from sysdeps/x86_64/multiarch/strcspn-sse2.S
> > rename to sysdeps/x86_64/multiarch/strcspn-sse2.c
> > index f97e856e1f..3a04bb39fc 100644
> > --- a/sysdeps/x86_64/multiarch/strcspn-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strcspn-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strcspn optimized with SSE2.
> > +/* strcspn.
> > Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > This file is part of the GNU C Library.
> >
> > @@ -19,10 +19,10 @@
> > #if IS_IN (libc)
> >
> > # include <sysdep.h>
> > -# define strcspn __strcspn_sse2
> > +# define STRCSPN __strcspn_sse2
> >
> > # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strcspn)
> > +# define libc_hidden_builtin_def(STRCSPN)
> > #endif
> >
> > -#include <sysdeps/x86_64/strcspn.S>
> > +#include <string/strcspn.c>
> > diff --git a/sysdeps/x86_64/strcspn.S b/sysdeps/x86_64/strcspn.S
> > deleted file mode 100644
> > index f3cd86c606..0000000000
> > --- a/sysdeps/x86_64/strcspn.S
> > +++ /dev/null
> > @@ -1,119 +0,0 @@
> > -/* strcspn (str, ss) -- Return the length of the initial segment of STR
> > - which contains no characters from SS.
> > - For AMD x86-64.
> > - Copyright (C) 1994-2022 Free Software Foundation, Inc.
> > - This file is part of the GNU C Library.
> > -
> > - The GNU C Library is free software; you can redistribute it and/or
> > - modify it under the terms of the GNU Lesser General Public
> > - License as published by the Free Software Foundation; either
> > - version 2.1 of the License, or (at your option) any later version.
> > -
> > - The GNU C Library is distributed in the hope that it will be useful,
> > - but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > - Lesser General Public License for more details.
> > -
> > - You should have received a copy of the GNU Lesser General Public
> > - License along with the GNU C Library; if not, see
> > - <https://www.gnu.org/licenses/>. */
> > -
> > -#include <sysdep.h>
> > -#include "asm-syntax.h"
> > -
> > - .text
> > -ENTRY (strcspn)
> > -
> > - movq %rdi, %rdx /* Save SRC. */
> > -
> > - /* First we create a table with flags for all possible characters.
> > - For the ASCII (7bit/8bit) or ISO-8859-X character sets which are
> > - supported by the C string functions we have 256 characters.
> > - Before inserting marks for the stop characters we clear the whole
> > - table. */
> > - movq %rdi, %r8 /* Save value. */
> > - subq $256, %rsp /* Make space for 256 bytes. */
> > - cfi_adjust_cfa_offset(256)
> > - movl $32, %ecx /* 32*8 bytes = 256 bytes. */
> > - movq %rsp, %rdi
> > - xorl %eax, %eax /* We store 0s. */
> > - cld
> > - rep
> > - stosq
> > -
> > - movq %rsi, %rax /* Setup skipset. */
> > -
> > -/* For understanding the following code remember that %rcx == 0 now.
> > - Although all the following instruction only modify %cl we always
> > - have a correct zero-extended 64-bit value in %rcx. */
> > -
> > - .p2align 4
> > -L(2): movb (%rax), %cl /* get byte from skipset */
> > - testb %cl, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */
> > -
> > - movb 1(%rax), %cl /* get byte from skipset */
> > - testb $0xff, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */
> > -
> > - movb 2(%rax), %cl /* get byte from skipset */
> > - testb $0xff, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */
> > -
> > - movb 3(%rax), %cl /* get byte from skipset */
> > - addq $4, %rax /* increment skipset pointer */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */
> > - testb $0xff, %cl /* is NUL char? */
> > - jnz L(2) /* no => process next dword from skipset */
> > -
> > -L(1): leaq -4(%rdx), %rax /* prepare loop */
> > -
> > - /* We use a neat trick for the following loop. Normally we would
> > - have to test for two termination conditions
> > - 1. a character in the skipset was found
> > - and
> > - 2. the end of the string was found
> > - But as a sign that the character is in the skipset we store its
> > - value in the table. But the value of NUL is NUL so the loop
> > - terminates for NUL in every case. */
> > -
> > - .p2align 4
> > -L(3): addq $4, %rax /* adjust pointer for full loop round */
> > -
> > - movb (%rax), %cl /* get byte from string */
> > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - je L(4) /* yes => return */
> > -
> > - movb 1(%rax), %cl /* get byte from string */
> > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - je L(5) /* yes => return */
> > -
> > - movb 2(%rax), %cl /* get byte from string */
> > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jz L(6) /* yes => return */
> > -
> > - movb 3(%rax), %cl /* get byte from string */
> > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jne L(3) /* no => start loop again */
> > -
> > - incq %rax /* adjust pointer */
> > -L(6): incq %rax
> > -L(5): incq %rax
> > -
> > -L(4): addq $256, %rsp /* remove skipset */
> > - cfi_adjust_cfa_offset(-256)
> > -#ifdef USE_AS_STRPBRK
> > - xorl %edx,%edx
> > - orb %cl, %cl /* was last character NUL? */
> > - cmovzq %rdx, %rax /* Yes: return NULL */
> > -#else
> > - subq %rdx, %rax /* we have to return the number of valid
> > - characters, so compute distance to first
> > - non-valid character */
> > -#endif
> > - ret
> > -END (strcspn)
> > -libc_hidden_builtin_def (strcspn)
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
* Re: [PATCH v1 10/23] x86: Remove strpbrk-sse2.S and use the generic implementation
[not found] ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>
@ 2022-05-12 19:41 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:41 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster (see strcspn commit).
> >
> > All string/memory tests pass.
> > ---
> > .../x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c} | 9 ++++-----
> > sysdeps/x86_64/strpbrk.S | 3 ---
> > 2 files changed, 4 insertions(+), 8 deletions(-)
> > rename sysdeps/x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c} (84%)
> > delete mode 100644 sysdeps/x86_64/strpbrk.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strpbrk-sse2.S b/sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > similarity index 84%
> > rename from sysdeps/x86_64/multiarch/strpbrk-sse2.S
> > rename to sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > index d537b6c27b..d03214c4fb 100644
> > --- a/sysdeps/x86_64/multiarch/strpbrk-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strpbrk optimized with SSE2.
> > +/* strpbrk.
> > Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > This file is part of the GNU C Library.
> >
> > @@ -19,11 +19,10 @@
> > #if IS_IN (libc)
> >
> > # include <sysdep.h>
> > -# define strcspn __strpbrk_sse2
> > +# define STRPBRK __strpbrk_sse2
> >
> > # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strpbrk)
> > +# define libc_hidden_builtin_def(STRPBRK)
> > #endif
> >
> > -#define USE_AS_STRPBRK
> > -#include <sysdeps/x86_64/strcspn.S>
> > +#include <string/strpbrk.c>
> > diff --git a/sysdeps/x86_64/strpbrk.S b/sysdeps/x86_64/strpbrk.S
> > deleted file mode 100644
> > index 21888a5b92..0000000000
> > --- a/sysdeps/x86_64/strpbrk.S
> > +++ /dev/null
> > @@ -1,3 +0,0 @@
> > -#define strcspn strpbrk
> > -#define USE_AS_STRPBRK
> > -#include <sysdeps/x86_64/strcspn.S>
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
* Re: [PATCH v1 11/23] x86: Remove strspn-sse2.S and use the generic implementation
[not found] ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>
@ 2022-05-12 19:42 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:42 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .710
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2, pos, New Time / Old Time
> > 1, 0, 0, 512, 0.824
> > 1, 1, 0, 512, 1.018
> > 1, 0, 1, 512, 0.986
> > 1, 1, 1, 512, 1.092
> > 2, 0, 0, 512, 0.86
> > 2, 2, 0, 512, 0.868
> > 2, 0, 2, 512, 0.858
> > 2, 2, 2, 512, 0.857
> > 3, 0, 0, 512, 0.836
> > 3, 3, 0, 512, 0.849
> > 3, 0, 3, 512, 0.84
> > 3, 3, 3, 512, 0.85
> > 4, 0, 0, 512, 0.843
> > 4, 4, 0, 512, 0.837
> > 4, 0, 4, 512, 0.835
> > 4, 4, 4, 512, 0.846
> > 5, 0, 0, 512, 0.852
> > 5, 5, 0, 512, 0.848
> > 5, 0, 5, 512, 0.85
> > 5, 5, 5, 512, 0.85
> > 6, 0, 0, 512, 0.853
> > 6, 6, 0, 512, 0.855
> > 6, 0, 6, 512, 0.853
> > 6, 6, 6, 512, 0.853
> > 7, 0, 0, 512, 0.857
> > 7, 7, 0, 512, 0.861
> > 7, 0, 7, 512, 0.94
> > 7, 7, 7, 512, 0.856
> > 8, 0, 0, 512, 0.927
> > 8, 0, 8, 512, 0.965
> > 9, 0, 0, 512, 0.967
> > 9, 1, 0, 512, 0.976
> > 9, 0, 9, 512, 0.887
> > 9, 1, 9, 512, 0.881
> > 10, 0, 0, 512, 0.853
> > 10, 2, 0, 512, 0.846
> > 10, 0, 10, 512, 0.855
> > 10, 2, 10, 512, 0.849
> > 11, 0, 0, 512, 0.854
> > 11, 3, 0, 512, 0.855
> > 11, 0, 11, 512, 0.85
> > 11, 3, 11, 512, 0.854
> > 12, 0, 0, 512, 0.864
> > 12, 4, 0, 512, 0.864
> > 12, 0, 12, 512, 0.867
> > 12, 4, 12, 512, 0.87
> > 13, 0, 0, 512, 0.853
> > 13, 5, 0, 512, 0.841
> > 13, 0, 13, 512, 0.837
> > 13, 5, 13, 512, 0.85
> > 14, 0, 0, 512, 0.838
> > 14, 6, 0, 512, 0.842
> > 14, 0, 14, 512, 0.818
> > 14, 6, 14, 512, 0.845
> > 15, 0, 0, 512, 0.799
> > 15, 7, 0, 512, 0.847
> > 15, 0, 15, 512, 0.787
> > 15, 7, 15, 512, 0.84
> > 16, 0, 0, 512, 0.824
> > 16, 0, 16, 512, 0.827
> > 17, 0, 0, 512, 0.817
> > 17, 1, 0, 512, 0.823
> > 17, 0, 17, 512, 0.82
> > 17, 1, 17, 512, 0.814
> > 18, 0, 0, 512, 0.81
> > 18, 2, 0, 512, 0.833
> > 18, 0, 18, 512, 0.811
> > 18, 2, 18, 512, 0.842
> > 19, 0, 0, 512, 0.823
> > 19, 3, 0, 512, 0.818
> > 19, 0, 19, 512, 0.821
> > 19, 3, 19, 512, 0.824
> > 20, 0, 0, 512, 0.814
> > 20, 4, 0, 512, 0.818
> > 20, 0, 20, 512, 0.806
> > 20, 4, 20, 512, 0.802
> > 21, 0, 0, 512, 0.835
> > 21, 5, 0, 512, 0.839
> > 21, 0, 21, 512, 0.842
> > 21, 5, 21, 512, 0.82
> > 22, 0, 0, 512, 0.824
> > 22, 6, 0, 512, 0.831
> > 22, 0, 22, 512, 0.819
> > 22, 6, 22, 512, 0.824
> > 23, 0, 0, 512, 0.816
> > 23, 7, 0, 512, 0.856
> > 23, 0, 23, 512, 0.808
> > 23, 7, 23, 512, 0.848
> > 24, 0, 0, 512, 0.88
> > 24, 0, 24, 512, 0.846
> > 25, 0, 0, 512, 0.929
> > 25, 1, 0, 512, 0.917
> > 25, 0, 25, 512, 0.884
> > 25, 1, 25, 512, 0.859
> > 26, 0, 0, 512, 0.919
> > 26, 2, 0, 512, 0.867
> > 26, 0, 26, 512, 0.914
> > 26, 2, 26, 512, 0.845
> > 27, 0, 0, 512, 0.919
> > 27, 3, 0, 512, 0.864
> > 27, 0, 27, 512, 0.917
> > 27, 3, 27, 512, 0.847
> > 28, 0, 0, 512, 0.905
> > 28, 4, 0, 512, 0.896
> > 28, 0, 28, 512, 0.898
> > 28, 4, 28, 512, 0.871
> > 29, 0, 0, 512, 0.911
> > 29, 5, 0, 512, 0.91
> > 29, 0, 29, 512, 0.905
> > 29, 5, 29, 512, 0.884
> > 30, 0, 0, 512, 0.907
> > 30, 6, 0, 512, 0.802
> > 30, 0, 30, 512, 0.906
> > 30, 6, 30, 512, 0.818
> > 31, 0, 0, 512, 0.907
> > 31, 7, 0, 512, 0.821
> > 31, 0, 31, 512, 0.89
> > 31, 7, 31, 512, 0.787
> > 4, 0, 0, 32, 0.623
> > 4, 1, 0, 32, 0.606
> > 4, 0, 1, 32, 0.6
> > 4, 1, 1, 32, 0.603
> > 4, 0, 0, 64, 0.731
> > 4, 2, 0, 64, 0.733
> > 4, 0, 2, 64, 0.734
> > 4, 2, 2, 64, 0.755
> > 4, 0, 0, 128, 0.822
> > 4, 3, 0, 128, 0.873
> > 4, 0, 3, 128, 0.89
> > 4, 3, 3, 128, 0.907
> > 4, 0, 0, 256, 0.827
> > 4, 4, 0, 256, 0.811
> > 4, 0, 4, 256, 0.794
> > 4, 4, 4, 256, 0.814
> > 4, 5, 0, 512, 0.841
> > 4, 0, 5, 512, 0.831
> > 4, 5, 5, 512, 0.845
> > 4, 0, 0, 1024, 0.861
> > 4, 6, 0, 1024, 0.857
> > 4, 0, 6, 1024, 0.9
> > 4, 6, 6, 1024, 0.861
> > 4, 0, 0, 2048, 0.879
> > 4, 7, 0, 2048, 0.875
> > 4, 0, 7, 2048, 0.883
> > 4, 7, 7, 2048, 0.88
> > 10, 1, 0, 64, 0.747
> > 10, 1, 1, 64, 0.743
> > 10, 2, 0, 64, 0.732
> > 10, 2, 2, 64, 0.729
> > 10, 3, 0, 64, 0.747
> > 10, 3, 3, 64, 0.733
> > 10, 4, 0, 64, 0.74
> > 10, 4, 4, 64, 0.751
> > 10, 5, 0, 64, 0.735
> > 10, 5, 5, 64, 0.746
> > 10, 6, 0, 64, 0.735
> > 10, 6, 6, 64, 0.733
> > 10, 7, 0, 64, 0.734
> > 10, 7, 7, 64, 0.74
> > 6, 0, 0, 0, 0.377
> > 6, 0, 0, 1, 0.369
> > 6, 0, 1, 1, 0.383
> > 6, 0, 0, 2, 0.391
> > 6, 0, 2, 2, 0.394
> > 6, 0, 0, 3, 0.416
> > 6, 0, 3, 3, 0.411
> > 6, 0, 0, 4, 0.475
> > 6, 0, 4, 4, 0.483
> > 6, 0, 0, 5, 0.473
> > 6, 0, 5, 5, 0.476
> > 6, 0, 0, 6, 0.459
> > 6, 0, 6, 6, 0.445
> > 6, 0, 0, 7, 0.433
> > 6, 0, 7, 7, 0.432
> > 6, 0, 0, 8, 0.492
> > 6, 0, 8, 8, 0.494
> > 6, 0, 0, 9, 0.476
> > 6, 0, 9, 9, 0.483
> > 6, 0, 0, 10, 0.46
> > 6, 0, 10, 10, 0.476
> > 6, 0, 0, 11, 0.463
> > 6, 0, 11, 11, 0.463
> > 6, 0, 0, 12, 0.511
> > 6, 0, 12, 12, 0.515
> > 6, 0, 0, 13, 0.506
> > 6, 0, 13, 13, 0.536
> > 6, 0, 0, 14, 0.496
> > 6, 0, 14, 14, 0.484
> > 6, 0, 0, 15, 0.473
> > 6, 0, 15, 15, 0.475
> > 6, 0, 0, 16, 0.534
> > 6, 0, 16, 16, 0.534
> > 6, 0, 0, 17, 0.525
> > 6, 0, 17, 17, 0.523
> > 6, 0, 0, 18, 0.522
> > 6, 0, 18, 18, 0.524
> > 6, 0, 0, 19, 0.512
> > 6, 0, 19, 19, 0.514
> > 6, 0, 0, 20, 0.535
> > 6, 0, 20, 20, 0.54
> > 6, 0, 0, 21, 0.543
> > 6, 0, 21, 21, 0.536
> > 6, 0, 0, 22, 0.542
> > 6, 0, 22, 22, 0.542
> > 6, 0, 0, 23, 0.529
> > 6, 0, 23, 23, 0.53
> > 6, 0, 0, 24, 0.596
> > 6, 0, 24, 24, 0.589
> > 6, 0, 0, 25, 0.583
> > 6, 0, 25, 25, 0.58
> > 6, 0, 0, 26, 0.574
> > 6, 0, 26, 26, 0.58
> > 6, 0, 0, 27, 0.575
> > 6, 0, 27, 27, 0.558
> > 6, 0, 0, 28, 0.606
> > 6, 0, 28, 28, 0.606
> > 6, 0, 0, 29, 0.589
> > 6, 0, 29, 29, 0.595
> > 6, 0, 0, 30, 0.592
> > 6, 0, 30, 30, 0.585
> > 6, 0, 0, 31, 0.585
> > 6, 0, 31, 31, 0.579
> > 6, 0, 0, 32, 0.625
> > 6, 0, 32, 32, 0.615
> > 6, 0, 0, 33, 0.615
> > 6, 0, 33, 33, 0.61
> > 6, 0, 0, 34, 0.604
> > 6, 0, 34, 34, 0.6
> > 6, 0, 0, 35, 0.602
> > 6, 0, 35, 35, 0.608
> > 6, 0, 0, 36, 0.644
> > 6, 0, 36, 36, 0.644
> > 6, 0, 0, 37, 0.658
> > 6, 0, 37, 37, 0.651
> > 6, 0, 0, 38, 0.644
> > 6, 0, 38, 38, 0.649
> > 6, 0, 0, 39, 0.626
> > 6, 0, 39, 39, 0.632
> > 6, 0, 0, 40, 0.662
> > 6, 0, 40, 40, 0.661
> > 6, 0, 0, 41, 0.656
> > 6, 0, 41, 41, 0.655
> > 6, 0, 0, 42, 0.643
> > 6, 0, 42, 42, 0.637
> > 6, 0, 0, 43, 0.622
> > 6, 0, 43, 43, 0.628
> > 6, 0, 0, 44, 0.673
> > 6, 0, 44, 44, 0.687
> > 6, 0, 0, 45, 0.661
> > 6, 0, 45, 45, 0.659
> > 6, 0, 0, 46, 0.657
> > 6, 0, 46, 46, 0.653
> > 6, 0, 0, 47, 0.658
> > 6, 0, 47, 47, 0.65
> > 6, 0, 0, 48, 0.678
> > 6, 0, 48, 48, 0.683
> > 6, 0, 0, 49, 0.676
> > 6, 0, 49, 49, 0.661
> > 6, 0, 0, 50, 0.672
> > 6, 0, 50, 50, 0.662
> > 6, 0, 0, 51, 0.656
> > 6, 0, 51, 51, 0.659
> > 6, 0, 0, 52, 0.682
> > 6, 0, 52, 52, 0.686
> > 6, 0, 0, 53, 0.67
> > 6, 0, 53, 53, 0.674
> > 6, 0, 0, 54, 0.663
> > 6, 0, 54, 54, 0.675
> > 6, 0, 0, 55, 0.662
> > 6, 0, 55, 55, 0.665
> > 6, 0, 0, 56, 0.681
> > 6, 0, 56, 56, 0.697
> > 6, 0, 0, 57, 0.686
> > 6, 0, 57, 57, 0.687
> > 6, 0, 0, 58, 0.701
> > 6, 0, 58, 58, 0.693
> > 6, 0, 0, 59, 0.709
> > 6, 0, 59, 59, 0.698
> > 6, 0, 0, 60, 0.708
> > 6, 0, 60, 60, 0.708
> > 6, 0, 0, 61, 0.709
> > 6, 0, 61, 61, 0.716
> > 6, 0, 0, 62, 0.709
> > 6, 0, 62, 62, 0.707
> > 6, 0, 0, 63, 0.703
> > 6, 0, 63, 63, 0.716
> >
> > .../{strspn-sse2.S => strspn-sse2.c} | 8 +-
> > sysdeps/x86_64/strspn.S | 112 ------------------
> > 2 files changed, 4 insertions(+), 116 deletions(-)
> > rename sysdeps/x86_64/multiarch/{strspn-sse2.S => strspn-sse2.c} (86%)
> > delete mode 100644 sysdeps/x86_64/strspn.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strspn-sse2.S b/sysdeps/x86_64/multiarch/strspn-sse2.c
> > similarity index 86%
> > rename from sysdeps/x86_64/multiarch/strspn-sse2.S
> > rename to sysdeps/x86_64/multiarch/strspn-sse2.c
> > index e0a095f25a..61cc6cb0a5 100644
> > --- a/sysdeps/x86_64/multiarch/strspn-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strspn-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strspn optimized with SSE2.
> > +/* strspn.
> > Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > This file is part of the GNU C Library.
> >
> > @@ -19,10 +19,10 @@
> > #if IS_IN (libc)
> >
> > # include <sysdep.h>
> > -# define strspn __strspn_sse2
> > +# define STRSPN __strspn_sse2
> >
> > # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strspn)
> > +# define libc_hidden_builtin_def(STRSPN)
> > #endif
> >
> > -#include <sysdeps/x86_64/strspn.S>
> > +#include <string/strspn.c>
> > diff --git a/sysdeps/x86_64/strspn.S b/sysdeps/x86_64/strspn.S
> > deleted file mode 100644
> > index 61b76ee0a1..0000000000
> > --- a/sysdeps/x86_64/strspn.S
> > +++ /dev/null
> > @@ -1,112 +0,0 @@
> > -/* strspn (str, ss) -- Return the length of the initial segment of STR
> > - which contains only characters from SS.
> > - For AMD x86-64.
> > - Copyright (C) 1994-2022 Free Software Foundation, Inc.
> > - This file is part of the GNU C Library.
> > -
> > - The GNU C Library is free software; you can redistribute it and/or
> > - modify it under the terms of the GNU Lesser General Public
> > - License as published by the Free Software Foundation; either
> > - version 2.1 of the License, or (at your option) any later version.
> > -
> > - The GNU C Library is distributed in the hope that it will be useful,
> > - but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > - Lesser General Public License for more details.
> > -
> > - You should have received a copy of the GNU Lesser General Public
> > - License along with the GNU C Library; if not, see
> > - <https://www.gnu.org/licenses/>. */
> > -
> > -#include <sysdep.h>
> > -
> > - .text
> > -ENTRY (strspn)
> > -
> > - movq %rdi, %rdx /* Save SRC. */
> > -
> > - /* First we create a table with flags for all possible characters.
> > - For the ASCII (7bit/8bit) or ISO-8859-X character sets which are
> > - supported by the C string functions we have 256 characters.
> > - Before inserting marks for the stop characters we clear the whole
> > - table. */
> > - movq %rdi, %r8 /* Save value. */
> > - subq $256, %rsp /* Make space for 256 bytes. */
> > - cfi_adjust_cfa_offset(256)
> > - movl $32, %ecx /* 32*8 bytes = 256 bytes. */
> > - movq %rsp, %rdi
> > - xorl %eax, %eax /* We store 0s. */
> > - cld
> > - rep
> > - stosq
> > -
> > - movq %rsi, %rax /* Setup stopset. */
> > -
> > -/* For understanding the following code remember that %rcx == 0 now.
> > - Although all the following instruction only modify %cl we always
> > - have a correct zero-extended 64-bit value in %rcx. */
> > -
> > - .p2align 4
> > -L(2): movb (%rax), %cl /* get byte from stopset */
> > - testb %cl, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */
> > -
> > - movb 1(%rax), %cl /* get byte from stopset */
> > - testb $0xff, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */
> > -
> > - movb 2(%rax), %cl /* get byte from stopset */
> > - testb $0xff, %cl /* is NUL char? */
> > - jz L(1) /* yes => start compare loop */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */
> > -
> > - movb 3(%rax), %cl /* get byte from stopset */
> > - addq $4, %rax /* increment stopset pointer */
> > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */
> > - testb $0xff, %cl /* is NUL char? */
> > - jnz L(2) /* no => process next dword from stopset */
> > -
> > -L(1): leaq -4(%rdx), %rax /* prepare loop */
> > -
> > - /* We use a neat trick for the following loop. Normally we would
> > - have to test for two termination conditions
> > - 1. a character in the stopset was found
> > - and
> > - 2. the end of the string was found
> > - But as a sign that the character is in the stopset we store its
> > - value in the table. But the value of NUL is NUL so the loop
> > - terminates for NUL in every case. */
> > -
> > - .p2align 4
> > -L(3): addq $4, %rax /* adjust pointer for full loop round */
> > -
> > - movb (%rax), %cl /* get byte from string */
> > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jz L(4) /* no => return */
> > -
> > - movb 1(%rax), %cl /* get byte from string */
> > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jz L(5) /* no => return */
> > -
> > - movb 2(%rax), %cl /* get byte from string */
> > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jz L(6) /* no => return */
> > -
> > - movb 3(%rax), %cl /* get byte from string */
> > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */
> > - jnz L(3) /* yes => start loop again */
> > -
> > - incq %rax /* adjust pointer */
> > -L(6): incq %rax
> > -L(5): incq %rax
> > -
> > -L(4): addq $256, %rsp /* remove stopset */
> > - cfi_adjust_cfa_offset(-256)
> > - subq %rdx, %rax /* we have to return the number of valid
> > - characters, so compute distance to first
> > - non-valid character */
> > - ret
> > -END (strspn)
> > -libc_hidden_builtin_def (strspn)
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
[not found] ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>
@ 2022-05-12 19:44 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:44 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Slightly faster method of doing TOLOWER that saves an
> > instruction.
> >
> > Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
> > CET enabled, the fixed-size nop misaligned the entry to strcasecmp.
> >
> > geometric_mean(N=40) of all benchmarks New / Original: .894
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, New Time / Old Time
> > 1, 1, 1, 127, 0.903
> > 2, 2, 2, 127, 0.905
> > 3, 3, 3, 127, 0.877
> > 4, 4, 4, 127, 0.888
> > 5, 5, 5, 127, 0.901
> > 6, 6, 6, 127, 0.954
> > 7, 7, 7, 127, 0.932
> > 8, 0, 0, 127, 0.918
> > 9, 1, 1, 127, 0.914
> > 10, 2, 2, 127, 0.877
> > 11, 3, 3, 127, 0.909
> > 12, 4, 4, 127, 0.876
> > 13, 5, 5, 127, 0.886
> > 14, 6, 6, 127, 0.914
> > 15, 7, 7, 127, 0.939
> > 4, 0, 0, 127, 0.963
> > 4, 0, 0, 254, 0.943
> > 8, 0, 0, 254, 0.927
> > 16, 0, 0, 127, 0.876
> > 16, 0, 0, 254, 0.865
> > 32, 0, 0, 127, 0.865
> > 32, 0, 0, 254, 0.862
> > 64, 0, 0, 127, 0.863
> > 64, 0, 0, 254, 0.896
> > 128, 0, 0, 127, 0.885
> > 128, 0, 0, 254, 0.882
> > 256, 0, 0, 127, 0.87
> > 256, 0, 0, 254, 0.869
> > 512, 0, 0, 127, 0.832
> > 512, 0, 0, 254, 0.848
> > 1024, 0, 0, 127, 0.835
> > 1024, 0, 0, 254, 0.843
> > 16, 1, 2, 127, 0.914
> > 16, 2, 1, 254, 0.949
> > 32, 2, 4, 127, 0.955
> > 32, 4, 2, 254, 1.004
> > 64, 3, 6, 127, 0.844
> > 64, 6, 3, 254, 0.905
> > 128, 4, 0, 127, 0.889
> > 128, 0, 4, 254, 0.845
> > 256, 5, 2, 127, 0.929
> > 256, 2, 5, 254, 0.907
> > 512, 6, 4, 127, 0.837
> > 512, 4, 6, 254, 0.862
> > 1024, 7, 6, 127, 0.895
> > 1024, 6, 7, 254, 0.89
> >
> > sysdeps/x86_64/strcmp.S | 64 +++++++++++++++++++----------------------
> > 1 file changed, 29 insertions(+), 35 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/strcmp.S b/sysdeps/x86_64/strcmp.S
> > index e2ab59c555..99d8b36f1d 100644
> > --- a/sysdeps/x86_64/strcmp.S
> > +++ b/sysdeps/x86_64/strcmp.S
> > @@ -75,9 +75,8 @@ ENTRY2 (__strcasecmp)
> > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax
> > mov %fs:(%rax),%RDX_LP
> >
> > - // XXX 5 byte should be before the function
> > - /* 5-byte NOP. */
> > - .byte 0x0f,0x1f,0x44,0x00,0x00
> > + /* Either 1 or 5 bytes (depending on whether CET is enabled). */
> > + .p2align 4
> > END2 (__strcasecmp)
> > # ifndef NO_NOLOCALE_ALIAS
> > weak_alias (__strcasecmp, strcasecmp)
> > @@ -94,9 +93,8 @@ ENTRY2 (__strncasecmp)
> > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax
> > mov %fs:(%rax),%RCX_LP
> >
> > - // XXX 5 byte should be before the function
> > - /* 5-byte NOP. */
> > - .byte 0x0f,0x1f,0x44,0x00,0x00
> > + /* Either 1 or 5 bytes (depending on whether CET is enabled). */
> > + .p2align 4
> > END2 (__strncasecmp)
> > # ifndef NO_NOLOCALE_ALIAS
> > weak_alias (__strncasecmp, strncasecmp)
> > @@ -146,22 +144,22 @@ ENTRY (STRCMP)
> > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > .section .rodata.cst16,"aM",@progbits,16
> > .align 16
> > -.Lbelowupper:
> > - .quad 0x4040404040404040
> > - .quad 0x4040404040404040
> > -.Ltopupper:
> > - .quad 0x5b5b5b5b5b5b5b5b
> > - .quad 0x5b5b5b5b5b5b5b5b
> > -.Ltouppermask:
> > +.Llcase_min:
> > + .quad 0x3f3f3f3f3f3f3f3f
> > + .quad 0x3f3f3f3f3f3f3f3f
> > +.Llcase_max:
> > + .quad 0x9999999999999999
> > + .quad 0x9999999999999999
> > +.Lcase_add:
> > .quad 0x2020202020202020
> > .quad 0x2020202020202020
> > .previous
> > - movdqa .Lbelowupper(%rip), %xmm5
> > -# define UCLOW_reg %xmm5
> > - movdqa .Ltopupper(%rip), %xmm6
> > -# define UCHIGH_reg %xmm6
> > - movdqa .Ltouppermask(%rip), %xmm7
> > -# define LCQWORD_reg %xmm7
> > + movdqa .Llcase_min(%rip), %xmm5
> > +# define LCASE_MIN_reg %xmm5
> > + movdqa .Llcase_max(%rip), %xmm6
> > +# define LCASE_MAX_reg %xmm6
> > + movdqa .Lcase_add(%rip), %xmm7
> > +# define CASE_ADD_reg %xmm7
> > #endif
> > cmp $0x30, %ecx
> > ja LABEL(crosscache) /* rsi: 16-byte load will cross cache line */
> > @@ -172,22 +170,18 @@ ENTRY (STRCMP)
> > movhpd 8(%rdi), %xmm1
> > movhpd 8(%rsi), %xmm2
> > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > -# define TOLOWER(reg1, reg2) \
> > - movdqa reg1, %xmm8; \
> > - movdqa UCHIGH_reg, %xmm9; \
> > - movdqa reg2, %xmm10; \
> > - movdqa UCHIGH_reg, %xmm11; \
> > - pcmpgtb UCLOW_reg, %xmm8; \
> > - pcmpgtb reg1, %xmm9; \
> > - pcmpgtb UCLOW_reg, %xmm10; \
> > - pcmpgtb reg2, %xmm11; \
> > - pand %xmm9, %xmm8; \
> > - pand %xmm11, %xmm10; \
> > - pand LCQWORD_reg, %xmm8; \
> > - pand LCQWORD_reg, %xmm10; \
> > - por %xmm8, reg1; \
> > - por %xmm10, reg2
> > - TOLOWER (%xmm1, %xmm2)
> > +# define TOLOWER(reg1, reg2) \
> > + movdqa LCASE_MIN_reg, %xmm8; \
> > + movdqa LCASE_MIN_reg, %xmm9; \
> > + paddb reg1, %xmm8; \
> > + paddb reg2, %xmm9; \
> > + pcmpgtb LCASE_MAX_reg, %xmm8; \
> > + pcmpgtb LCASE_MAX_reg, %xmm9; \
> > + pandn CASE_ADD_reg, %xmm8; \
> > + pandn CASE_ADD_reg, %xmm9; \
> > + paddb %xmm8, reg1; \
> > + paddb %xmm9, reg2
> > + TOLOWER (%xmm1, %xmm2)
> > #else
> > # define TOLOWER(reg1, reg2)
> > #endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
[not found] ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>
@ 2022-05-12 19:45 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:45 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:02 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Slightly faster method of doing TOLOWER that saves an
> > instruction.
> >
> > Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
> > CET enabled, the fixed-size nop misaligned the entry to strcasecmp.
> >
> > geometric_mean(N=40) of all benchmarks New / Original: .920
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, New Time / Old Time
> > 1, 1, 1, 127, 0.914
> > 2, 2, 2, 127, 0.952
> > 3, 3, 3, 127, 0.924
> > 4, 4, 4, 127, 0.995
> > 5, 5, 5, 127, 0.985
> > 6, 6, 6, 127, 1.017
> > 7, 7, 7, 127, 1.031
> > 8, 0, 0, 127, 0.967
> > 9, 1, 1, 127, 0.969
> > 10, 2, 2, 127, 0.951
> > 11, 3, 3, 127, 0.938
> > 12, 4, 4, 127, 0.937
> > 13, 5, 5, 127, 0.967
> > 14, 6, 6, 127, 0.941
> > 15, 7, 7, 127, 0.951
> > 4, 0, 0, 127, 0.959
> > 4, 0, 0, 254, 0.98
> > 8, 0, 0, 254, 0.959
> > 16, 0, 0, 127, 0.895
> > 16, 0, 0, 254, 0.901
> > 32, 0, 0, 127, 0.85
> > 32, 0, 0, 254, 0.851
> > 64, 0, 0, 127, 0.897
> > 64, 0, 0, 254, 0.895
> > 128, 0, 0, 127, 0.944
> > 128, 0, 0, 254, 0.935
> > 256, 0, 0, 127, 0.922
> > 256, 0, 0, 254, 0.913
> > 512, 0, 0, 127, 0.921
> > 512, 0, 0, 254, 0.914
> > 1024, 0, 0, 127, 0.845
> > 1024, 0, 0, 254, 0.84
> > 16, 1, 2, 127, 0.923
> > 16, 2, 1, 254, 0.955
> > 32, 2, 4, 127, 0.979
> > 32, 4, 2, 254, 0.957
> > 64, 3, 6, 127, 0.866
> > 64, 6, 3, 254, 0.849
> > 128, 4, 0, 127, 0.882
> > 128, 0, 4, 254, 0.876
> > 256, 5, 2, 127, 0.877
> > 256, 2, 5, 254, 0.882
> > 512, 6, 4, 127, 0.822
> > 512, 4, 6, 254, 0.862
> > 1024, 7, 6, 127, 0.903
> > 1024, 6, 7, 254, 0.908
> >
> > sysdeps/x86_64/multiarch/strcmp-sse42.S | 83 +++++++++++--------------
> > 1 file changed, 35 insertions(+), 48 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > index 580feb90e9..7805ae9d41 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > @@ -88,9 +88,8 @@ ENTRY (GLABEL(__strcasecmp))
> > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax
> > mov %fs:(%rax),%RDX_LP
> >
> > - // XXX 5 byte should be before the function
> > - /* 5-byte NOP. */
> > - .byte 0x0f,0x1f,0x44,0x00,0x00
> > + /* Either 1 or 5 bytes (depending on whether CET is enabled). */
> > + .p2align 4
> > END (GLABEL(__strcasecmp))
> > /* FALLTHROUGH to strcasecmp_l. */
> > #endif
> > @@ -99,9 +98,8 @@ ENTRY (GLABEL(__strncasecmp))
> > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax
> > mov %fs:(%rax),%RCX_LP
> >
> > - // XXX 5 byte should be before the function
> > - /* 5-byte NOP. */
> > - .byte 0x0f,0x1f,0x44,0x00,0x00
> > + /* Either 1 or 5 bytes (depending on whether CET is enabled). */
> > + .p2align 4
> > END (GLABEL(__strncasecmp))
> > /* FALLTHROUGH to strncasecmp_l. */
> > #endif
> > @@ -169,27 +167,22 @@ STRCMP_SSE42:
> > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > .section .rodata.cst16,"aM",@progbits,16
> > .align 16
> > -LABEL(belowupper):
> > - .quad 0x4040404040404040
> > - .quad 0x4040404040404040
> > -LABEL(topupper):
> > -# ifdef USE_AVX
> > - .quad 0x5a5a5a5a5a5a5a5a
> > - .quad 0x5a5a5a5a5a5a5a5a
> > -# else
> > - .quad 0x5b5b5b5b5b5b5b5b
> > - .quad 0x5b5b5b5b5b5b5b5b
> > -# endif
> > -LABEL(touppermask):
> > +LABEL(lcase_min):
> > + .quad 0x3f3f3f3f3f3f3f3f
> > + .quad 0x3f3f3f3f3f3f3f3f
> > +LABEL(lcase_max):
> > + .quad 0x9999999999999999
> > + .quad 0x9999999999999999
> > +LABEL(case_add):
> > .quad 0x2020202020202020
> > .quad 0x2020202020202020
> > .previous
> > - movdqa LABEL(belowupper)(%rip), %xmm4
> > -# define UCLOW_reg %xmm4
> > - movdqa LABEL(topupper)(%rip), %xmm5
> > -# define UCHIGH_reg %xmm5
> > - movdqa LABEL(touppermask)(%rip), %xmm6
> > -# define LCQWORD_reg %xmm6
> > + movdqa LABEL(lcase_min)(%rip), %xmm4
> > +# define LCASE_MIN_reg %xmm4
> > + movdqa LABEL(lcase_max)(%rip), %xmm5
> > +# define LCASE_MAX_reg %xmm5
> > + movdqa LABEL(case_add)(%rip), %xmm6
> > +# define CASE_ADD_reg %xmm6
> > #endif
> > cmp $0x30, %ecx
> > ja LABEL(crosscache)/* rsi: 16-byte load will cross cache line */
> > @@ -200,32 +193,26 @@ LABEL(touppermask):
> > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > # ifdef USE_AVX
> > # define TOLOWER(reg1, reg2) \
> > - vpcmpgtb UCLOW_reg, reg1, %xmm7; \
> > - vpcmpgtb UCHIGH_reg, reg1, %xmm8; \
> > - vpcmpgtb UCLOW_reg, reg2, %xmm9; \
> > - vpcmpgtb UCHIGH_reg, reg2, %xmm10; \
> > - vpandn %xmm7, %xmm8, %xmm8; \
> > - vpandn %xmm9, %xmm10, %xmm10; \
> > - vpand LCQWORD_reg, %xmm8, %xmm8; \
> > - vpand LCQWORD_reg, %xmm10, %xmm10; \
> > - vpor reg1, %xmm8, reg1; \
> > - vpor reg2, %xmm10, reg2
> > + vpaddb LCASE_MIN_reg, reg1, %xmm7; \
> > + vpaddb LCASE_MIN_reg, reg2, %xmm8; \
> > + vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7; \
> > + vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8; \
> > + vpandn CASE_ADD_reg, %xmm7, %xmm7; \
> > + vpandn CASE_ADD_reg, %xmm8, %xmm8; \
> > + vpaddb %xmm7, reg1, reg1; \
> > + vpaddb %xmm8, reg2, reg2
> > # else
> > # define TOLOWER(reg1, reg2) \
> > - movdqa reg1, %xmm7; \
> > - movdqa UCHIGH_reg, %xmm8; \
> > - movdqa reg2, %xmm9; \
> > - movdqa UCHIGH_reg, %xmm10; \
> > - pcmpgtb UCLOW_reg, %xmm7; \
> > - pcmpgtb reg1, %xmm8; \
> > - pcmpgtb UCLOW_reg, %xmm9; \
> > - pcmpgtb reg2, %xmm10; \
> > - pand %xmm8, %xmm7; \
> > - pand %xmm10, %xmm9; \
> > - pand LCQWORD_reg, %xmm7; \
> > - pand LCQWORD_reg, %xmm9; \
> > - por %xmm7, reg1; \
> > - por %xmm9, reg2
> > + movdqa LCASE_MIN_reg, %xmm7; \
> > + movdqa LCASE_MIN_reg, %xmm8; \
> > + paddb reg1, %xmm7; \
> > + paddb reg2, %xmm8; \
> > + pcmpgtb LCASE_MAX_reg, %xmm7; \
> > + pcmpgtb LCASE_MAX_reg, %xmm8; \
> > + pandn CASE_ADD_reg, %xmm7; \
> > + pandn CASE_ADD_reg, %xmm8; \
> > + paddb %xmm7, reg1; \
> > + paddb %xmm8, reg2
> > # endif
> > TOLOWER (%xmm1, %xmm2)
> > #else
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v1 23/23] x86: Remove AVX str{n}casecmp
[not found] ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>
@ 2022-05-12 19:54 ` Sunil Pandey
0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:54 UTC (permalink / raw)
To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library
On Thu, Mar 24, 2022 at 12:09 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The rationale is:
> >
> > 1. SSE42 has nearly identical logic, so any benefit is minimal (3.4%
> > regression on Tigerlake using SSE42 versus AVX across the
> > benchtest suite).
> > 2. The AVX2 version covers the majority of targets that previously
> > preferred it.
> > 3. The targets where AVX would still be best (SnB and IVB) are
> > becoming outdated.
> >
> > All in all, the code-size saving is worth it.
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, AVX Time / SSE42 Time
> > 1, 1, 1, 127, 0.928
> > 2, 2, 2, 127, 0.934
> > 3, 3, 3, 127, 0.975
> > 4, 4, 4, 127, 0.96
> > 5, 5, 5, 127, 0.935
> > 6, 6, 6, 127, 0.929
> > 7, 7, 7, 127, 0.959
> > 8, 0, 0, 127, 0.955
> > 9, 1, 1, 127, 0.944
> > 10, 2, 2, 127, 0.975
> > 11, 3, 3, 127, 0.935
> > 12, 4, 4, 127, 0.931
> > 13, 5, 5, 127, 0.926
> > 14, 6, 6, 127, 0.901
> > 15, 7, 7, 127, 0.951
> > 4, 0, 0, 127, 0.958
> > 4, 0, 0, 254, 0.956
> > 8, 0, 0, 254, 0.977
> > 16, 0, 0, 127, 0.955
> > 16, 0, 0, 254, 0.953
> > 32, 0, 0, 127, 0.943
> > 32, 0, 0, 254, 0.941
> > 64, 0, 0, 127, 0.941
> > 64, 0, 0, 254, 0.955
> > 128, 0, 0, 127, 0.972
> > 128, 0, 0, 254, 0.975
> > 256, 0, 0, 127, 0.996
> > 256, 0, 0, 254, 0.993
> > 512, 0, 0, 127, 0.992
> > 512, 0, 0, 254, 0.986
> > 1024, 0, 0, 127, 0.994
> > 1024, 0, 0, 254, 0.993
> > 16, 1, 2, 127, 0.933
> > 16, 2, 1, 254, 0.953
> > 32, 2, 4, 127, 0.927
> > 32, 4, 2, 254, 0.986
> > 64, 3, 6, 127, 0.991
> > 64, 6, 3, 254, 1.014
> > 128, 4, 0, 127, 1.001
> > 128, 0, 4, 254, 0.991
> > 256, 5, 2, 127, 1.011
> > 256, 2, 5, 254, 1.013
> > 512, 6, 4, 127, 1.056
> > 512, 4, 6, 254, 0.916
> > 1024, 7, 6, 127, 1.059
> > 1024, 6, 7, 254, 1.043
> >
> > sysdeps/x86_64/multiarch/Makefile | 2 -
> > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 12 -
> > sysdeps/x86_64/multiarch/ifunc-strcasecmp.h | 4 -
> > sysdeps/x86_64/multiarch/strcasecmp_l-avx.S | 22 --
> > sysdeps/x86_64/multiarch/strcmp-sse42.S | 240 +++++++++-----------
> > sysdeps/x86_64/multiarch/strncase_l-avx.S | 22 --
> > 6 files changed, 105 insertions(+), 197 deletions(-)
> > delete mode 100644 sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> > delete mode 100644 sysdeps/x86_64/multiarch/strncase_l-avx.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> > index 35d80dc2ff..6507d1b7fa 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -54,7 +54,6 @@ sysdep_routines += \
> > stpncpy-evex \
> > stpncpy-sse2-unaligned \
> > stpncpy-ssse3 \
> > - strcasecmp_l-avx \
> > strcasecmp_l-avx2 \
> > strcasecmp_l-avx2-rtm \
> > strcasecmp_l-evex \
> > @@ -95,7 +94,6 @@ sysdep_routines += \
> > strlen-avx2-rtm \
> > strlen-evex \
> > strlen-sse2 \
> > - strncase_l-avx \
> > strncase_l-avx2 \
> > strncase_l-avx2-rtm \
> > strncase_l-evex \
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index f1a4d3dac2..40cc6cc49e 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -447,9 +447,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > (CPU_FEATURE_USABLE (AVX2)
> > && CPU_FEATURE_USABLE (RTM)),
> > __strcasecmp_avx2_rtm)
> > - IFUNC_IMPL_ADD (array, i, strcasecmp,
> > - CPU_FEATURE_USABLE (AVX),
> > - __strcasecmp_avx)
> > IFUNC_IMPL_ADD (array, i, strcasecmp,
> > CPU_FEATURE_USABLE (SSE4_2),
> > __strcasecmp_sse42)
> > @@ -471,9 +468,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > (CPU_FEATURE_USABLE (AVX2)
> > && CPU_FEATURE_USABLE (RTM)),
> > __strcasecmp_l_avx2_rtm)
> > - IFUNC_IMPL_ADD (array, i, strcasecmp_l,
> > - CPU_FEATURE_USABLE (AVX),
> > - __strcasecmp_l_avx)
> > IFUNC_IMPL_ADD (array, i, strcasecmp_l,
> > CPU_FEATURE_USABLE (SSE4_2),
> > __strcasecmp_l_sse42)
> > @@ -609,9 +603,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > (CPU_FEATURE_USABLE (AVX2)
> > && CPU_FEATURE_USABLE (RTM)),
> > __strncasecmp_avx2_rtm)
> > - IFUNC_IMPL_ADD (array, i, strncasecmp,
> > - CPU_FEATURE_USABLE (AVX),
> > - __strncasecmp_avx)
> > IFUNC_IMPL_ADD (array, i, strncasecmp,
> > CPU_FEATURE_USABLE (SSE4_2),
> > __strncasecmp_sse42)
> > @@ -634,9 +625,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > (CPU_FEATURE_USABLE (AVX2)
> > && CPU_FEATURE_USABLE (RTM)),
> > __strncasecmp_l_avx2_rtm)
> > - IFUNC_IMPL_ADD (array, i, strncasecmp_l,
> > - CPU_FEATURE_USABLE (AVX),
> > - __strncasecmp_l_avx)
> > IFUNC_IMPL_ADD (array, i, strncasecmp_l,
> > CPU_FEATURE_USABLE (SSE4_2),
> > __strncasecmp_l_sse42)
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > index bf0d146e7f..766539c241 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > +++ b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > @@ -22,7 +22,6 @@
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
> > -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
> > extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> > @@ -46,9 +45,6 @@ IFUNC_SELECTOR (void)
> > return OPTIMIZE (avx2);
> > }
> >
> > - if (CPU_FEATURE_USABLE_P (cpu_features, AVX))
> > - return OPTIMIZE (avx);
> > -
> > if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> > && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> > return OPTIMIZE (sse42);
> > diff --git a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S b/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> > deleted file mode 100644
> > index 7ec7c21b5a..0000000000
> > --- a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> > +++ /dev/null
> > @@ -1,22 +0,0 @@
> > -/* strcasecmp_l optimized with AVX.
> > - Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > - This file is part of the GNU C Library.
> > -
> > - The GNU C Library is free software; you can redistribute it and/or
> > - modify it under the terms of the GNU Lesser General Public
> > - License as published by the Free Software Foundation; either
> > - version 2.1 of the License, or (at your option) any later version.
> > -
> > - The GNU C Library is distributed in the hope that it will be useful,
> > - but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > - Lesser General Public License for more details.
> > -
> > - You should have received a copy of the GNU Lesser General Public
> > - License along with the GNU C Library; if not, see
> > - <https://www.gnu.org/licenses/>. */
> > -
> > -#define STRCMP_SSE42 __strcasecmp_l_avx
> > -#define USE_AVX 1
> > -#define USE_AS_STRCASECMP_L
> > -#include "strcmp-sse42.S"
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > index 7805ae9d41..a9178ad25c 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > @@ -41,13 +41,8 @@
> > # define UPDATE_STRNCMP_COUNTER
> > #endif
> >
> > -#ifdef USE_AVX
> > -# define SECTION avx
> > -# define GLABEL(l) l##_avx
> > -#else
> > -# define SECTION sse4.2
> > -# define GLABEL(l) l##_sse42
> > -#endif
> > +#define SECTION sse4.2
> > +#define GLABEL(l) l##_sse42
> >
> > #define LABEL(l) .L##l
> >
> > @@ -105,21 +100,7 @@ END (GLABEL(__strncasecmp))
> > #endif
> >
> >
> > -#ifdef USE_AVX
> > -# define movdqa vmovdqa
> > -# define movdqu vmovdqu
> > -# define pmovmskb vpmovmskb
> > -# define pcmpistri vpcmpistri
> > -# define psubb vpsubb
> > -# define pcmpeqb vpcmpeqb
> > -# define psrldq vpsrldq
> > -# define pslldq vpslldq
> > -# define palignr vpalignr
> > -# define pxor vpxor
> > -# define D(arg) arg, arg
> > -#else
> > -# define D(arg) arg
> > -#endif
> > +#define D(arg) arg
> >
> > STRCMP_SSE42:
> > cfi_startproc
> > @@ -191,18 +172,7 @@ LABEL(case_add):
> > movdqu (%rdi), %xmm1
> > movdqu (%rsi), %xmm2
> > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > -# ifdef USE_AVX
> > -# define TOLOWER(reg1, reg2) \
> > - vpaddb LCASE_MIN_reg, reg1, %xmm7; \
> > - vpaddb LCASE_MIN_reg, reg2, %xmm8; \
> > - vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7; \
> > - vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8; \
> > - vpandn CASE_ADD_reg, %xmm7, %xmm7; \
> > - vpandn CASE_ADD_reg, %xmm8, %xmm8; \
> > - vpaddb %xmm7, reg1, reg1; \
> > - vpaddb %xmm8, reg2, reg2
> > -# else
> > -# define TOLOWER(reg1, reg2) \
> > +# define TOLOWER(reg1, reg2) \
> > movdqa LCASE_MIN_reg, %xmm7; \
> > movdqa LCASE_MIN_reg, %xmm8; \
> > paddb reg1, %xmm7; \
> > @@ -213,15 +183,15 @@ LABEL(case_add):
> > pandn CASE_ADD_reg, %xmm8; \
> > paddb %xmm7, reg1; \
> > paddb %xmm8, reg2
> > -# endif
> > +
> > TOLOWER (%xmm1, %xmm2)
> > #else
> > # define TOLOWER(reg1, reg2)
> > #endif
> > - pxor %xmm0, D(%xmm0) /* clear %xmm0 for null char checks */
> > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */
> > - pcmpeqb %xmm2, D(%xmm1) /* compare first 16 bytes for equality */
> > - psubb %xmm0, D(%xmm1) /* packed sub of comparison results*/
> > + pxor %xmm0, %xmm0 /* clear %xmm0 for null char checks */
> > + pcmpeqb %xmm1, %xmm0 /* Any null chars? */
> > + pcmpeqb %xmm2, %xmm1 /* compare first 16 bytes for equality */
> > + psubb %xmm0, %xmm1 /* packed sub of comparison results*/
> > pmovmskb %xmm1, %edx
> > sub $0xffff, %edx /* if first 16 bytes are same, edx == 0xffff */
> > jnz LABEL(less16bytes)/* If not, find different value or null char */
> > @@ -245,7 +215,7 @@ LABEL(crosscache):
> > xor %r8d, %r8d
> > and $0xf, %ecx /* offset of rsi */
> > and $0xf, %eax /* offset of rdi */
> > - pxor %xmm0, D(%xmm0) /* clear %xmm0 for null char check */
> > + pxor %xmm0, %xmm0 /* clear %xmm0 for null char check */
> > cmp %eax, %ecx
> > je LABEL(ashr_0) /* rsi and rdi relative offset same */
> > ja LABEL(bigger)
> > @@ -259,7 +229,7 @@ LABEL(bigger):
> > sub %rcx, %r9
> > lea LABEL(unaligned_table)(%rip), %r10
> > movslq (%r10, %r9,4), %r9
> > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */
> > + pcmpeqb %xmm1, %xmm0 /* Any null chars? */
> > lea (%r10, %r9), %r10
> > _CET_NOTRACK jmp *%r10 /* jump to corresponding case */
> >
> > @@ -272,15 +242,15 @@ LABEL(bigger):
> > LABEL(ashr_0):
> >
> > movdqa (%rsi), %xmm1
> > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */
> > + pcmpeqb %xmm1, %xmm0 /* Any null chars? */
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > - pcmpeqb (%rdi), D(%xmm1) /* compare 16 bytes for equality */
> > + pcmpeqb (%rdi), %xmm1 /* compare 16 bytes for equality */
> > #else
> > movdqa (%rdi), %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm2, D(%xmm1) /* compare 16 bytes for equality */
> > + pcmpeqb %xmm2, %xmm1 /* compare 16 bytes for equality */
> > #endif
> > - psubb %xmm0, D(%xmm1) /* packed sub of comparison results*/
> > + psubb %xmm0, %xmm1 /* packed sub of comparison results*/
> > pmovmskb %xmm1, %r9d
> > shr %cl, %edx /* adjust 0xffff for offset */
> > shr %cl, %r9d /* adjust for 16-byte offset */
> > @@ -360,10 +330,10 @@ LABEL(ashr_0_exit_use):
> > */
> > .p2align 4
> > LABEL(ashr_1):
> > - pslldq $15, D(%xmm2) /* shift first string to align with second */
> > + pslldq $15, %xmm2 /* shift first string to align with second */
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2) /* compare 16 bytes for equality */
> > - psubb %xmm0, D(%xmm2) /* packed sub of comparison results*/
> > + pcmpeqb %xmm1, %xmm2 /* compare 16 bytes for equality */
> > + psubb %xmm0, %xmm2 /* packed sub of comparison results*/
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx /* adjust 0xffff for offset */
> > shr %cl, %r9d /* adjust for 16-byte offset */
> > @@ -391,7 +361,7 @@ LABEL(loop_ashr_1_use):
> >
> > LABEL(nibble_ashr_1_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $1, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $1, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -410,7 +380,7 @@ LABEL(nibble_ashr_1_restart_use):
> > jg LABEL(nibble_ashr_1_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $1, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $1, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -430,7 +400,7 @@ LABEL(nibble_ashr_1_restart_use):
> > LABEL(nibble_ashr_1_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $1, D(%xmm0)
> > + psrldq $1, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -448,10 +418,10 @@ LABEL(nibble_ashr_1_use):
> > */
> > .p2align 4
> > LABEL(ashr_2):
> > - pslldq $14, D(%xmm2)
> > + pslldq $14, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -479,7 +449,7 @@ LABEL(loop_ashr_2_use):
> >
> > LABEL(nibble_ashr_2_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $2, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $2, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -498,7 +468,7 @@ LABEL(nibble_ashr_2_restart_use):
> > jg LABEL(nibble_ashr_2_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $2, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $2, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -518,7 +488,7 @@ LABEL(nibble_ashr_2_restart_use):
> > LABEL(nibble_ashr_2_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $2, D(%xmm0)
> > + psrldq $2, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -536,10 +506,10 @@ LABEL(nibble_ashr_2_use):
> > */
> > .p2align 4
> > LABEL(ashr_3):
> > - pslldq $13, D(%xmm2)
> > + pslldq $13, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -567,7 +537,7 @@ LABEL(loop_ashr_3_use):
> >
> > LABEL(nibble_ashr_3_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $3, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $3, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -586,7 +556,7 @@ LABEL(nibble_ashr_3_restart_use):
> > jg LABEL(nibble_ashr_3_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $3, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $3, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -606,7 +576,7 @@ LABEL(nibble_ashr_3_restart_use):
> > LABEL(nibble_ashr_3_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $3, D(%xmm0)
> > + psrldq $3, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -624,10 +594,10 @@ LABEL(nibble_ashr_3_use):
> > */
> > .p2align 4
> > LABEL(ashr_4):
> > - pslldq $12, D(%xmm2)
> > + pslldq $12, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -656,7 +626,7 @@ LABEL(loop_ashr_4_use):
> >
> > LABEL(nibble_ashr_4_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $4, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $4, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -675,7 +645,7 @@ LABEL(nibble_ashr_4_restart_use):
> > jg LABEL(nibble_ashr_4_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $4, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $4, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -695,7 +665,7 @@ LABEL(nibble_ashr_4_restart_use):
> > LABEL(nibble_ashr_4_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $4, D(%xmm0)
> > + psrldq $4, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -713,10 +683,10 @@ LABEL(nibble_ashr_4_use):
> > */
> > .p2align 4
> > LABEL(ashr_5):
> > - pslldq $11, D(%xmm2)
> > + pslldq $11, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -745,7 +715,7 @@ LABEL(loop_ashr_5_use):
> >
> > LABEL(nibble_ashr_5_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $5, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $5, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -765,7 +735,7 @@ LABEL(nibble_ashr_5_restart_use):
> >
> > movdqa (%rdi, %rdx), %xmm0
> >
> > - palignr $5, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $5, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -785,7 +755,7 @@ LABEL(nibble_ashr_5_restart_use):
> > LABEL(nibble_ashr_5_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $5, D(%xmm0)
> > + psrldq $5, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -803,10 +773,10 @@ LABEL(nibble_ashr_5_use):
> > */
> > .p2align 4
> > LABEL(ashr_6):
> > - pslldq $10, D(%xmm2)
> > + pslldq $10, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -835,7 +805,7 @@ LABEL(loop_ashr_6_use):
> >
> > LABEL(nibble_ashr_6_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $6, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $6, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -854,7 +824,7 @@ LABEL(nibble_ashr_6_restart_use):
> > jg LABEL(nibble_ashr_6_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $6, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $6, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -874,7 +844,7 @@ LABEL(nibble_ashr_6_restart_use):
> > LABEL(nibble_ashr_6_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $6, D(%xmm0)
> > + psrldq $6, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -892,10 +862,10 @@ LABEL(nibble_ashr_6_use):
> > */
> > .p2align 4
> > LABEL(ashr_7):
> > - pslldq $9, D(%xmm2)
> > + pslldq $9, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -924,7 +894,7 @@ LABEL(loop_ashr_7_use):
> >
> > LABEL(nibble_ashr_7_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $7, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $7, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -943,7 +913,7 @@ LABEL(nibble_ashr_7_restart_use):
> > jg LABEL(nibble_ashr_7_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $7, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $7, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -963,7 +933,7 @@ LABEL(nibble_ashr_7_restart_use):
> > LABEL(nibble_ashr_7_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $7, D(%xmm0)
> > + psrldq $7, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -981,10 +951,10 @@ LABEL(nibble_ashr_7_use):
> > */
> > .p2align 4
> > LABEL(ashr_8):
> > - pslldq $8, D(%xmm2)
> > + pslldq $8, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1013,7 +983,7 @@ LABEL(loop_ashr_8_use):
> >
> > LABEL(nibble_ashr_8_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $8, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1032,7 +1002,7 @@ LABEL(nibble_ashr_8_restart_use):
> > jg LABEL(nibble_ashr_8_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $8, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1052,7 +1022,7 @@ LABEL(nibble_ashr_8_restart_use):
> > LABEL(nibble_ashr_8_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $8, D(%xmm0)
> > + psrldq $8, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1070,10 +1040,10 @@ LABEL(nibble_ashr_8_use):
> > */
> > .p2align 4
> > LABEL(ashr_9):
> > - pslldq $7, D(%xmm2)
> > + pslldq $7, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1103,7 +1073,7 @@ LABEL(loop_ashr_9_use):
> > LABEL(nibble_ashr_9_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> >
> > - palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $9, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1122,7 +1092,7 @@ LABEL(nibble_ashr_9_restart_use):
> > jg LABEL(nibble_ashr_9_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $9, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1142,7 +1112,7 @@ LABEL(nibble_ashr_9_restart_use):
> > LABEL(nibble_ashr_9_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $9, D(%xmm0)
> > + psrldq $9, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1160,10 +1130,10 @@ LABEL(nibble_ashr_9_use):
> > */
> > .p2align 4
> > LABEL(ashr_10):
> > - pslldq $6, D(%xmm2)
> > + pslldq $6, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1192,7 +1162,7 @@ LABEL(loop_ashr_10_use):
> >
> > LABEL(nibble_ashr_10_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $10, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1211,7 +1181,7 @@ LABEL(nibble_ashr_10_restart_use):
> > jg LABEL(nibble_ashr_10_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $10, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1231,7 +1201,7 @@ LABEL(nibble_ashr_10_restart_use):
> > LABEL(nibble_ashr_10_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $10, D(%xmm0)
> > + psrldq $10, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1249,10 +1219,10 @@ LABEL(nibble_ashr_10_use):
> > */
> > .p2align 4
> > LABEL(ashr_11):
> > - pslldq $5, D(%xmm2)
> > + pslldq $5, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1281,7 +1251,7 @@ LABEL(loop_ashr_11_use):
> >
> > LABEL(nibble_ashr_11_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $11, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1300,7 +1270,7 @@ LABEL(nibble_ashr_11_restart_use):
> > jg LABEL(nibble_ashr_11_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $11, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1320,7 +1290,7 @@ LABEL(nibble_ashr_11_restart_use):
> > LABEL(nibble_ashr_11_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $11, D(%xmm0)
> > + psrldq $11, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1338,10 +1308,10 @@ LABEL(nibble_ashr_11_use):
> > */
> > .p2align 4
> > LABEL(ashr_12):
> > - pslldq $4, D(%xmm2)
> > + pslldq $4, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1370,7 +1340,7 @@ LABEL(loop_ashr_12_use):
> >
> > LABEL(nibble_ashr_12_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $12, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1389,7 +1359,7 @@ LABEL(nibble_ashr_12_restart_use):
> > jg LABEL(nibble_ashr_12_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $12, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1409,7 +1379,7 @@ LABEL(nibble_ashr_12_restart_use):
> > LABEL(nibble_ashr_12_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $12, D(%xmm0)
> > + psrldq $12, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1427,10 +1397,10 @@ LABEL(nibble_ashr_12_use):
> > */
> > .p2align 4
> > LABEL(ashr_13):
> > - pslldq $3, D(%xmm2)
> > + pslldq $3, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1460,7 +1430,7 @@ LABEL(loop_ashr_13_use):
> >
> > LABEL(nibble_ashr_13_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $13, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1479,7 +1449,7 @@ LABEL(nibble_ashr_13_restart_use):
> > jg LABEL(nibble_ashr_13_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $13, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1499,7 +1469,7 @@ LABEL(nibble_ashr_13_restart_use):
> > LABEL(nibble_ashr_13_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $13, D(%xmm0)
> > + psrldq $13, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1517,10 +1487,10 @@ LABEL(nibble_ashr_13_use):
> > */
> > .p2align 4
> > LABEL(ashr_14):
> > - pslldq $2, D(%xmm2)
> > + pslldq $2, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1550,7 +1520,7 @@ LABEL(loop_ashr_14_use):
> >
> > LABEL(nibble_ashr_14_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $14, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1569,7 +1539,7 @@ LABEL(nibble_ashr_14_restart_use):
> > jg LABEL(nibble_ashr_14_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $14, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1589,7 +1559,7 @@ LABEL(nibble_ashr_14_restart_use):
> > LABEL(nibble_ashr_14_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $14, D(%xmm0)
> > + psrldq $14, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1607,10 +1577,10 @@ LABEL(nibble_ashr_14_use):
> > */
> > .p2align 4
> > LABEL(ashr_15):
> > - pslldq $1, D(%xmm2)
> > + pslldq $1, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1642,7 +1612,7 @@ LABEL(loop_ashr_15_use):
> >
> > LABEL(nibble_ashr_15_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $15, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1661,7 +1631,7 @@ LABEL(nibble_ashr_15_restart_use):
> > jg LABEL(nibble_ashr_15_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $15, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1681,7 +1651,7 @@ LABEL(nibble_ashr_15_restart_use):
> > LABEL(nibble_ashr_15_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $15, D(%xmm0)
> > + psrldq $15, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > diff --git a/sysdeps/x86_64/multiarch/strncase_l-avx.S b/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > deleted file mode 100644
> > index b51b86d223..0000000000
> > --- a/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > +++ /dev/null
> > @@ -1,22 +0,0 @@
> > -/* strncasecmp_l optimized with AVX.
> > - Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > - This file is part of the GNU C Library.
> > -
> > - The GNU C Library is free software; you can redistribute it and/or
> > - modify it under the terms of the GNU Lesser General Public
> > - License as published by the Free Software Foundation; either
> > - version 2.1 of the License, or (at your option) any later version.
> > -
> > - The GNU C Library is distributed in the hope that it will be useful,
> > - but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > - Lesser General Public License for more details.
> > -
> > - You should have received a copy of the GNU Lesser General Public
> > - License along with the GNU C Library; if not, see
> > - <https://www.gnu.org/licenses/>. */
> > -
> > -#define STRCMP_SSE42 __strncasecmp_l_avx
> > -#define USE_AVX 1
> > -#define USE_AS_STRNCASECMP_L
> > -#include "strcmp-sse42.S"
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.
I would like to backport this patch to release branches.
Any comments or objections?
--Sunil
End of thread (newest message: 2022-05-12 19:54 UTC)
Thread overview: 10+ messages
[not found] <20220323215734.3927131-1-goldstein.w.n@gmail.com>
[not found] ` <20220323215734.3927131-3-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOqQHH-20_czF-vtb_L_6MRBer=H9g3XpNBQLzcoSLZj+A@mail.gmail.com>
[not found] ` <CAFUsyfKfR3haCneczj0=ji+u3X_RsMNCXuOadytBrcaxgoEVTg@mail.gmail.com>
[not found] ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
2022-05-12 19:31 ` [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch Sunil Pandey
[not found] ` <20220323215734.3927131-4-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com>
2022-05-12 19:32 ` [PATCH v1 04/23] x86: Code cleanup in strchr-evex " Sunil Pandey
[not found] ` <20220323215734.3927131-7-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com>
2022-05-12 19:34 ` [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c Sunil Pandey
[not found] ` <20220323215734.3927131-8-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com>
2022-05-12 19:39 ` [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c Sunil Pandey
[not found] ` <20220323215734.3927131-9-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com>
2022-05-12 19:40 ` [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation Sunil Pandey
[not found] ` <20220323215734.3927131-10-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>
2022-05-12 19:41 ` [PATCH v1 10/23] x86: Remove strpbrk-sse2.S " Sunil Pandey
[not found] ` <20220323215734.3927131-11-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>
2022-05-12 19:42 ` [PATCH v1 11/23] x86: Remove strspn-sse2.S " Sunil Pandey
[not found] ` <20220323215734.3927131-17-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>
2022-05-12 19:44 ` [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S Sunil Pandey
[not found] ` <20220323215734.3927131-18-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>
2022-05-12 19:45 ` [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S Sunil Pandey
[not found] ` <20220323215734.3927131-23-goldstein.w.n@gmail.com>
[not found] ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>
2022-05-12 19:54 ` [PATCH v1 23/23] x86: Remove AVX str{n}casecmp Sunil Pandey