public inbox for libc-stable@sourceware.org
* Re: [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch
       [not found]       ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
@ 2022-05-12 19:31         ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:31 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:37 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Thu, Mar 24, 2022 at 12:20 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Thu, Mar 24, 2022 at 1:53 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > Small code cleanup for size: -53 bytes.
> > > >
> > > > Add comment justifying using a branch to do NULL/non-null return.
> > >
> > >
> > > Do you have followup patches to improve its performance?  We are
> > > backporting all x86-64 improvements to Intel release branches:
> > >
> > > https://gitlab.com/x86-glibc/glibc/-/wikis/home
> > >
> > > Patches without performance improvements are undesirable.
> >
> > No further changes planned at the moment; the code size savings
> > seem worth it for master, though. Also in favor of adding the comment,
> > as I think it's non-intuitive.
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
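
The rationale behind the "branch for the NULL/non-null return" comment is easiest to see from a typical caller: strchr users almost always branch on whether the result is NULL, so a branch inside strchr for the same decision is nearly perfectly correlated with the caller's branch. Below is a minimal, illustrative C sketch of that caller pattern (not glibc code; the function and inputs are hypothetical):

    #include <stdio.h>
    #include <string.h>

    /* Typical strchr caller: the user branches on the result being NULL.
       If strchr itself also uses a branch for the NULL/non-null return,
       that internal branch is predicted the same way as this one, so a
       miss inside strchr tends to replace, not add to, a miss here.  A
       cmovcc would avoid the internal branch but costs code size and is
       slower when the outcome is highly predictable.  */
    static void
    handle_line (const char *line)
    {
      const char *p = strchr (line, ':');
      if (p == NULL)                /* the user branch the comment refers to */
        puts ("no separator");
      else
        printf ("separator at offset %td\n", p - line);
    }

    int
    main (void)
    {
      handle_line ("key:value");
      handle_line ("no separator here");
      return 0;
    }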


* Re: [PATCH v1 04/23] x86: Code cleanup in strchr-evex and comment justifying branch
       [not found]   ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com>
@ 2022-05-12 19:32     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:32 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 11:55 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Small code cleanup for size: -81 bytes.
> >
> > Add comment justifying using a branch to do NULL/non-null return.
> >
> > All string/memory tests pass and no regressions in benchtests.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .985
> > ---
> > Geometric Mean of N=20 runs; all functions page aligned
> > length, alignment,  pos, rand, seek_char/branch, max_char/perc-zero, New Time / Old Time
> >   2048,         0,   32,    0,               23,                127,               0.878
> >   2048,         1,   32,    0,               23,                127,                0.88
> >   2048,         0,   64,    0,               23,                127,               0.997
> >   2048,         2,   64,    0,               23,                127,               1.001
> >   2048,         0,  128,    0,               23,                127,               0.973
> >   2048,         3,  128,    0,               23,                127,               0.971
> >   2048,         0,  256,    0,               23,                127,               0.976
> >   2048,         4,  256,    0,               23,                127,               0.973
> >   2048,         0,  512,    0,               23,                127,               1.001
> >   2048,         5,  512,    0,               23,                127,               1.004
> >   2048,         0, 1024,    0,               23,                127,               1.005
> >   2048,         6, 1024,    0,               23,                127,               1.007
> >   2048,         0, 2048,    0,               23,                127,               1.035
> >   2048,         7, 2048,    0,               23,                127,                1.03
> >   4096,         0,   32,    0,               23,                127,               0.889
> >   4096,         1,   32,    0,               23,                127,               0.891
> >   4096,         0,   64,    0,               23,                127,               1.012
> >   4096,         2,   64,    0,               23,                127,               1.017
> >   4096,         0,  128,    0,               23,                127,               0.975
> >   4096,         3,  128,    0,               23,                127,               0.974
> >   4096,         0,  256,    0,               23,                127,               0.974
> >   4096,         4,  256,    0,               23,                127,               0.972
> >   4096,         0,  512,    0,               23,                127,               1.002
> >   4096,         5,  512,    0,               23,                127,               1.016
> >   4096,         0, 1024,    0,               23,                127,               1.009
> >   4096,         6, 1024,    0,               23,                127,               1.008
> >   4096,         0, 2048,    0,               23,                127,               1.003
> >   4096,         7, 2048,    0,               23,                127,               1.004
> >    256,         1,   64,    0,               23,                127,               0.993
> >    256,         2,   64,    0,               23,                127,               0.999
> >    256,         3,   64,    0,               23,                127,               0.992
> >    256,         4,   64,    0,               23,                127,                0.99
> >    256,         5,   64,    0,               23,                127,                0.99
> >    256,         6,   64,    0,               23,                127,               0.994
> >    256,         7,   64,    0,               23,                127,               0.991
> >    512,         0,  256,    0,               23,                127,               0.971
> >    512,        16,  256,    0,               23,                127,               0.971
> >    512,        32,  256,    0,               23,                127,               1.005
> >    512,        48,  256,    0,               23,                127,               0.998
> >    512,        64,  256,    0,               23,                127,               1.001
> >    512,        80,  256,    0,               23,                127,               1.002
> >    512,        96,  256,    0,               23,                127,               1.005
> >    512,       112,  256,    0,               23,                127,               1.012
> >      1,         0,    0,    0,               23,                127,               1.024
> >      2,         0,    1,    0,               23,                127,               0.991
> >      3,         0,    2,    0,               23,                127,               0.997
> >      4,         0,    3,    0,               23,                127,               0.984
> >      5,         0,    4,    0,               23,                127,               0.993
> >      6,         0,    5,    0,               23,                127,               0.985
> >      7,         0,    6,    0,               23,                127,               0.979
> >      8,         0,    7,    0,               23,                127,               0.975
> >      9,         0,    8,    0,               23,                127,               0.965
> >     10,         0,    9,    0,               23,                127,               0.957
> >     11,         0,   10,    0,               23,                127,               0.979
> >     12,         0,   11,    0,               23,                127,               0.987
> >     13,         0,   12,    0,               23,                127,               1.023
> >     14,         0,   13,    0,               23,                127,               0.997
> >     15,         0,   14,    0,               23,                127,               0.983
> >     16,         0,   15,    0,               23,                127,               0.987
> >     17,         0,   16,    0,               23,                127,               0.993
> >     18,         0,   17,    0,               23,                127,               0.985
> >     19,         0,   18,    0,               23,                127,               0.999
> >     20,         0,   19,    0,               23,                127,               0.998
> >     21,         0,   20,    0,               23,                127,               0.983
> >     22,         0,   21,    0,               23,                127,               0.983
> >     23,         0,   22,    0,               23,                127,               1.002
> >     24,         0,   23,    0,               23,                127,                 1.0
> >     25,         0,   24,    0,               23,                127,               1.002
> >     26,         0,   25,    0,               23,                127,               0.984
> >     27,         0,   26,    0,               23,                127,               0.994
> >     28,         0,   27,    0,               23,                127,               0.995
> >     29,         0,   28,    0,               23,                127,               1.017
> >     30,         0,   29,    0,               23,                127,               1.009
> >     31,         0,   30,    0,               23,                127,               1.001
> >     32,         0,   31,    0,               23,                127,               1.021
> >   2048,         0,   32,    0,                0,                127,               0.899
> >   2048,         1,   32,    0,                0,                127,                0.93
> >   2048,         0,   64,    0,                0,                127,               1.009
> >   2048,         2,   64,    0,                0,                127,               1.023
> >   2048,         0,  128,    0,                0,                127,               0.973
> >   2048,         3,  128,    0,                0,                127,               0.975
> >   2048,         0,  256,    0,                0,                127,               0.974
> >   2048,         4,  256,    0,                0,                127,                0.97
> >   2048,         0,  512,    0,                0,                127,               0.999
> >   2048,         5,  512,    0,                0,                127,               1.004
> >   2048,         0, 1024,    0,                0,                127,               1.008
> >   2048,         6, 1024,    0,                0,                127,               1.008
> >   2048,         0, 2048,    0,                0,                127,               0.996
> >   2048,         7, 2048,    0,                0,                127,               1.002
> >   4096,         0,   32,    0,                0,                127,               0.872
> >   4096,         1,   32,    0,                0,                127,               0.881
> >   4096,         0,   64,    0,                0,                127,               1.006
> >   4096,         2,   64,    0,                0,                127,               1.005
> >   4096,         0,  128,    0,                0,                127,               0.973
> >   4096,         3,  128,    0,                0,                127,               0.974
> >   4096,         0,  256,    0,                0,                127,               0.969
> >   4096,         4,  256,    0,                0,                127,               0.971
> >   4096,         0,  512,    0,                0,                127,                 1.0
> >   4096,         5,  512,    0,                0,                127,               1.005
> >   4096,         0, 1024,    0,                0,                127,               1.007
> >   4096,         6, 1024,    0,                0,                127,               1.009
> >   4096,         0, 2048,    0,                0,                127,               1.005
> >   4096,         7, 2048,    0,                0,                127,               1.007
> >    256,         1,   64,    0,                0,                127,               0.994
> >    256,         2,   64,    0,                0,                127,               1.008
> >    256,         3,   64,    0,                0,                127,               1.019
> >    256,         4,   64,    0,                0,                127,               0.991
> >    256,         5,   64,    0,                0,                127,               0.992
> >    256,         6,   64,    0,                0,                127,               0.991
> >    256,         7,   64,    0,                0,                127,               0.988
> >    512,         0,  256,    0,                0,                127,               0.971
> >    512,        16,  256,    0,                0,                127,               0.967
> >    512,        32,  256,    0,                0,                127,               1.005
> >    512,        48,  256,    0,                0,                127,               1.001
> >    512,        64,  256,    0,                0,                127,               1.009
> >    512,        80,  256,    0,                0,                127,               1.008
> >    512,        96,  256,    0,                0,                127,               1.009
> >    512,       112,  256,    0,                0,                127,               1.016
> >      1,         0,    0,    0,                0,                127,               1.038
> >      2,         0,    1,    0,                0,                127,               1.009
> >      3,         0,    2,    0,                0,                127,               0.992
> >      4,         0,    3,    0,                0,                127,               1.004
> >      5,         0,    4,    0,                0,                127,               0.966
> >      6,         0,    5,    0,                0,                127,               0.968
> >      7,         0,    6,    0,                0,                127,               1.004
> >      8,         0,    7,    0,                0,                127,                0.99
> >      9,         0,    8,    0,                0,                127,               0.958
> >     10,         0,    9,    0,                0,                127,                0.96
> >     11,         0,   10,    0,                0,                127,               0.948
> >     12,         0,   11,    0,                0,                127,               0.984
> >     13,         0,   12,    0,                0,                127,               0.967
> >     14,         0,   13,    0,                0,                127,               0.993
> >     15,         0,   14,    0,                0,                127,               0.991
> >     16,         0,   15,    0,                0,                127,                 1.0
> >     17,         0,   16,    0,                0,                127,               0.982
> >     18,         0,   17,    0,                0,                127,               0.977
> >     19,         0,   18,    0,                0,                127,               0.987
> >     20,         0,   19,    0,                0,                127,               0.978
> >     21,         0,   20,    0,                0,                127,                 1.0
> >     22,         0,   21,    0,                0,                127,                0.99
> >     23,         0,   22,    0,                0,                127,               0.988
> >     24,         0,   23,    0,                0,                127,               0.997
> >     25,         0,   24,    0,                0,                127,               1.003
> >     26,         0,   25,    0,                0,                127,               1.004
> >     27,         0,   26,    0,                0,                127,               0.982
> >     28,         0,   27,    0,                0,                127,               0.972
> >     29,         0,   28,    0,                0,                127,               0.978
> >     30,         0,   29,    0,                0,                127,               0.992
> >     31,         0,   30,    0,                0,                127,               0.986
> >     32,         0,   31,    0,                0,                127,                 1.0
> >
> >     16,         0,   15,    1,                1,                  0,               0.997
> >     16,         0,   15,    1,                0,                  0,               1.001
> >     16,         0,   15,    1,                1,                0.1,               0.984
> >     16,         0,   15,    1,                0,                0.1,               0.999
> >     16,         0,   15,    1,                1,               0.25,               0.929
> >     16,         0,   15,    1,                0,               0.25,               1.001
> >     16,         0,   15,    1,                1,               0.33,               0.892
> >     16,         0,   15,    1,                0,               0.33,               0.996
> >     16,         0,   15,    1,                1,                0.5,               0.897
> >     16,         0,   15,    1,                0,                0.5,               1.009
> >     16,         0,   15,    1,                1,               0.66,               0.882
> >     16,         0,   15,    1,                0,               0.66,               0.967
> >     16,         0,   15,    1,                1,               0.75,               0.919
> >     16,         0,   15,    1,                0,               0.75,               1.027
> >     16,         0,   15,    1,                1,                0.9,               0.949
> >     16,         0,   15,    1,                0,                0.9,               1.021
> >     16,         0,   15,    1,                1,                  1,               0.998
> >     16,         0,   15,    1,                0,                  1,               0.999
> >
> >  sysdeps/x86_64/multiarch/strchr-evex.S | 146 ++++++++++++++-----------
> >  1 file changed, 80 insertions(+), 66 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S
> > index f62cd9d144..ec739fb8f9 100644
> > --- a/sysdeps/x86_64/multiarch/strchr-evex.S
> > +++ b/sysdeps/x86_64/multiarch/strchr-evex.S
> > @@ -30,6 +30,7 @@
> >  # ifdef USE_AS_WCSCHR
> >  #  define VPBROADCAST  vpbroadcastd
> >  #  define VPCMP                vpcmpd
> > +#  define VPTESTN      vptestnmd
> >  #  define VPMINU       vpminud
> >  #  define CHAR_REG     esi
> >  #  define SHIFT_REG    ecx
> > @@ -37,6 +38,7 @@
> >  # else
> >  #  define VPBROADCAST  vpbroadcastb
> >  #  define VPCMP                vpcmpb
> > +#  define VPTESTN      vptestnmb
> >  #  define VPMINU       vpminub
> >  #  define CHAR_REG     sil
> >  #  define SHIFT_REG    edx
> > @@ -61,13 +63,11 @@
> >  # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
> >
> >         .section .text.evex,"ax",@progbits
> > -ENTRY (STRCHR)
> > +ENTRY_P2ALIGN (STRCHR, 5)
> >         /* Broadcast CHAR to YMM0.      */
> >         VPBROADCAST     %esi, %YMM0
> >         movl    %edi, %eax
> >         andl    $(PAGE_SIZE - 1), %eax
> > -       vpxorq  %XMMZERO, %XMMZERO, %XMMZERO
> > -
> >         /* Check if we cross page boundary with one vector load.
> >            Otherwise it is safe to use an unaligned load.  */
> >         cmpl    $(PAGE_SIZE - VEC_SIZE), %eax
> > @@ -81,49 +81,35 @@ ENTRY (STRCHR)
> >         vpxorq  %YMM1, %YMM0, %YMM2
> >         VPMINU  %YMM2, %YMM1, %YMM2
> >         /* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
> > -       VPCMP   $0, %YMMZERO, %YMM2, %k0
> > +       VPTESTN %YMM2, %YMM2, %k0
> >         kmovd   %k0, %eax
> >         testl   %eax, %eax
> >         jz      L(aligned_more)
> >         tzcntl  %eax, %eax
> > +# ifndef USE_AS_STRCHRNUL
> > +       /* Found CHAR or the null byte.  */
> > +       cmp     (%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > +       /* NB: Use a branch instead of cmovcc here. The expectation is
> > +          that with strchr the user will branch based on the result
> > +          being null. Since this branch will be 100% correlated with
> > +          the user branch, a branch miss here should save what would
> > +          otherwise be a branch miss in the user code. Otherwise,
> > +          using a branch 1) saves code size and 2) is faster in
> > +          highly predictable environments.  */
> > +       jne     L(zero)
> > +# endif
> >  # ifdef USE_AS_WCSCHR
> >         /* NB: Multiply wchar_t count by 4 to get the number of bytes.
> >          */
> >         leaq    (%rdi, %rax, CHAR_SIZE), %rax
> >  # else
> >         addq    %rdi, %rax
> > -# endif
> > -# ifndef USE_AS_STRCHRNUL
> > -       /* Found CHAR or the null byte.  */
> > -       cmp     (%rax), %CHAR_REG
> > -       jne     L(zero)
> >  # endif
> >         ret
> >
> > -       /* .p2align 5 helps keep performance more consistent if ENTRY()
> > -          alignment % 32 was either 16 or 0. As well this makes the
> > -          alignment % 32 of the loop_4x_vec fixed which makes tuning it
> > -          easier.  */
> > -       .p2align 5
> > -L(first_vec_x3):
> > -       tzcntl  %eax, %eax
> > -# ifndef USE_AS_STRCHRNUL
> > -       /* Found CHAR or the null byte.  */
> > -       cmp     (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > -       jne     L(zero)
> > -# endif
> > -       /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > -          bytes.  */
> > -       leaq    (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
> > -       ret
> >
> > -# ifndef USE_AS_STRCHRNUL
> > -L(zero):
> > -       xorl    %eax, %eax
> > -       ret
> > -# endif
> >
> > -       .p2align 4
> > +       .p2align 4,, 10
> >  L(first_vec_x4):
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Check to see if first match was CHAR (k0) or null (k1).  */
> > @@ -144,9 +130,18 @@ L(first_vec_x4):
> >         leaq    (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
> >         ret
> >
> > +# ifndef USE_AS_STRCHRNUL
> > +L(zero):
> > +       xorl    %eax, %eax
> > +       ret
> > +# endif
> > +
> > +
> >         .p2align 4
> >  L(first_vec_x1):
> > -       tzcntl  %eax, %eax
> > +       /* Use bsf here to save 1 byte, keeping the block in 1x
> > +          fetch block. eax guaranteed non-zero.  */
> > +       bsfl    %eax, %eax
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Found CHAR or the null byte.  */
> >         cmp     (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > @@ -158,7 +153,7 @@ L(first_vec_x1):
> >         leaq    (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax
> >         ret
> >
> > -       .p2align 4
> > +       .p2align 4,, 10
> >  L(first_vec_x2):
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Check to see if first match was CHAR (k0) or null (k1).  */
> > @@ -179,6 +174,21 @@ L(first_vec_x2):
> >         leaq    (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
> >         ret
> >
> > +       .p2align 4,, 10
> > +L(first_vec_x3):
> > +       /* Use bsf here to save 1 byte, keeping the block in 1x
> > +          fetch block. eax guaranteed non-zero.  */
> > +       bsfl    %eax, %eax
> > +# ifndef USE_AS_STRCHRNUL
> > +       /* Found CHAR or the null byte.  */
> > +       cmp     (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > +       jne     L(zero)
> > +# endif
> > +       /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > +          bytes.  */
> > +       leaq    (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
> > +       ret
> > +
> >         .p2align 4
> >  L(aligned_more):
> >         /* Align data to VEC_SIZE.  */
> > @@ -195,7 +205,7 @@ L(cross_page_continue):
> >         vpxorq  %YMM1, %YMM0, %YMM2
> >         VPMINU  %YMM2, %YMM1, %YMM2
> >         /* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
> > -       VPCMP   $0, %YMMZERO, %YMM2, %k0
> > +       VPTESTN %YMM2, %YMM2, %k0
> >         kmovd   %k0, %eax
> >         testl   %eax, %eax
> >         jnz     L(first_vec_x1)
> > @@ -206,7 +216,7 @@ L(cross_page_continue):
> >         /* Each bit in K0 represents a CHAR in YMM1.  */
> >         VPCMP   $0, %YMM1, %YMM0, %k0
> >         /* Each bit in K1 represents a CHAR in YMM1.  */
> > -       VPCMP   $0, %YMM1, %YMMZERO, %k1
> > +       VPTESTN %YMM1, %YMM1, %k1
> >         kortestd        %k0, %k1
> >         jnz     L(first_vec_x2)
> >
> > @@ -215,7 +225,7 @@ L(cross_page_continue):
> >         vpxorq  %YMM1, %YMM0, %YMM2
> >         VPMINU  %YMM2, %YMM1, %YMM2
> >         /* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
> > -       VPCMP   $0, %YMMZERO, %YMM2, %k0
> > +       VPTESTN %YMM2, %YMM2, %k0
> >         kmovd   %k0, %eax
> >         testl   %eax, %eax
> >         jnz     L(first_vec_x3)
> > @@ -224,7 +234,7 @@ L(cross_page_continue):
> >         /* Each bit in K0 represents a CHAR in YMM1.  */
> >         VPCMP   $0, %YMM1, %YMM0, %k0
> >         /* Each bit in K1 represents a CHAR in YMM1.  */
> > -       VPCMP   $0, %YMM1, %YMMZERO, %k1
> > +       VPTESTN %YMM1, %YMM1, %k1
> >         kortestd        %k0, %k1
> >         jnz     L(first_vec_x4)
> >
> > @@ -265,33 +275,33 @@ L(loop_4x_vec):
> >         VPMINU  %YMM3, %YMM4, %YMM4
> >         VPMINU  %YMM2, %YMM4, %YMM4{%k4}{z}
> >
> > -       VPCMP   $0, %YMMZERO, %YMM4, %k1
> > +       VPTESTN %YMM4, %YMM4, %k1
> >         kmovd   %k1, %ecx
> >         subq    $-(VEC_SIZE * 4), %rdi
> >         testl   %ecx, %ecx
> >         jz      L(loop_4x_vec)
> >
> > -       VPCMP   $0, %YMMZERO, %YMM1, %k0
> > +       VPTESTN %YMM1, %YMM1, %k0
> >         kmovd   %k0, %eax
> >         testl   %eax, %eax
> >         jnz     L(last_vec_x1)
> >
> > -       VPCMP   $0, %YMMZERO, %YMM2, %k0
> > +       VPTESTN %YMM2, %YMM2, %k0
> >         kmovd   %k0, %eax
> >         testl   %eax, %eax
> >         jnz     L(last_vec_x2)
> >
> > -       VPCMP   $0, %YMMZERO, %YMM3, %k0
> > +       VPTESTN %YMM3, %YMM3, %k0
> >         kmovd   %k0, %eax
> >         /* Combine YMM3 matches (eax) with YMM4 matches (ecx).  */
> >  # ifdef USE_AS_WCSCHR
> >         sall    $8, %ecx
> >         orl     %ecx, %eax
> > -       tzcntl  %eax, %eax
> > +       bsfl    %eax, %eax
> >  # else
> >         salq    $32, %rcx
> >         orq     %rcx, %rax
> > -       tzcntq  %rax, %rax
> > +       bsfq    %rax, %rax
> >  # endif
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Check if match was CHAR or null.  */
> > @@ -303,28 +313,28 @@ L(loop_4x_vec):
> >         leaq    (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
> >         ret
> >
> > -# ifndef USE_AS_STRCHRNUL
> > -L(zero_end):
> > -       xorl    %eax, %eax
> > -       ret
> > +       .p2align 4,, 8
> > +L(last_vec_x1):
> > +       bsfl    %eax, %eax
> > +# ifdef USE_AS_WCSCHR
> > +       /* NB: Multiply wchar_t count by 4 to get the number of bytes.
> > +          */
> > +       leaq    (%rdi, %rax, CHAR_SIZE), %rax
> > +# else
> > +       addq    %rdi, %rax
> >  # endif
> >
> > -       .p2align 4
> > -L(last_vec_x1):
> > -       tzcntl  %eax, %eax
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Check if match was null.  */
> > -       cmp     (%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > +       cmp     (%rax), %CHAR_REG
> >         jne     L(zero_end)
> >  # endif
> > -       /* NB: Multiply sizeof char type (1 or 4) to get the number of
> > -          bytes.  */
> > -       leaq    (%rdi, %rax, CHAR_SIZE), %rax
> > +
> >         ret
> >
> > -       .p2align 4
> > +       .p2align 4,, 8
> >  L(last_vec_x2):
> > -       tzcntl  %eax, %eax
> > +       bsfl    %eax, %eax
> >  # ifndef USE_AS_STRCHRNUL
> >         /* Check if match was null.  */
> >         cmp     (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG
> > @@ -336,7 +346,7 @@ L(last_vec_x2):
> >         ret
> >
> >         /* Cold case for crossing page with first load.  */
> > -       .p2align 4
> > +       .p2align 4,, 8
> >  L(cross_page_boundary):
> >         movq    %rdi, %rdx
> >         /* Align rdi.  */
> > @@ -346,9 +356,9 @@ L(cross_page_boundary):
> >         vpxorq  %YMM1, %YMM0, %YMM2
> >         VPMINU  %YMM2, %YMM1, %YMM2
> >         /* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
> > -       VPCMP   $0, %YMMZERO, %YMM2, %k0
> > +       VPTESTN %YMM2, %YMM2, %k0
> >         kmovd   %k0, %eax
> > -       /* Remove the leading bits.      */
> > +       /* Remove the leading bits.  */
> >  # ifdef USE_AS_WCSCHR
> >         movl    %edx, %SHIFT_REG
> >         /* NB: Divide shift count by 4 since each bit in K1 represent 4
> > @@ -360,20 +370,24 @@ L(cross_page_boundary):
> >         /* If eax is zero continue.  */
> >         testl   %eax, %eax
> >         jz      L(cross_page_continue)
> > -       tzcntl  %eax, %eax
> > -# ifndef USE_AS_STRCHRNUL
> > -       /* Check to see if match was CHAR or null.  */
> > -       cmp     (%rdx, %rax, CHAR_SIZE), %CHAR_REG
> > -       jne     L(zero_end)
> > -# endif
> > +       bsfl    %eax, %eax
> > +
> >  # ifdef USE_AS_WCSCHR
> >         /* NB: Multiply wchar_t count by 4 to get the number of
> >            bytes.  */
> >         leaq    (%rdx, %rax, CHAR_SIZE), %rax
> >  # else
> >         addq    %rdx, %rax
> > +# endif
> > +# ifndef USE_AS_STRCHRNUL
> > +       /* Check to see if match was CHAR or null.  */
> > +       cmp     (%rax), %CHAR_REG
> > +       je      L(cross_page_ret)
> > +L(zero_end):
> > +       xorl    %eax, %eax
> > +L(cross_page_ret):
> >  # endif
> >         ret
> >
> >  END (STRCHR)
> > -# endif
> > +#endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
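
Two of the recurring changes in the patch above are worth unpacking. First, VPCMP $0 against an explicit zero register is replaced by VPTESTN of a register against itself: vptestnm{b,d} sets a mask bit where the AND of its two operands is zero, so testing a vector against itself flags exactly its zero elements and the dedicated zero register (and its vpxorq setup) can be dropped. The intrinsics sketch below is illustrative only, not taken from the patch (requires AVX512BW/VL; compile with, e.g., -mavx512bw -mavx512vl):

    #include <assert.h>
    #include <immintrin.h>

    int
    main (void)
    {
      char buf[32] = "hello";       /* zero bytes from index 5 onwards */
      __m256i v = _mm256_loadu_si256 ((const __m256i *) buf);

      /* Old form: compare every byte against an explicit zero vector.  */
      __mmask32 k_cmp = _mm256_cmpeq_epi8_mask (v, _mm256_setzero_si256 ());
      /* New form: vptestnmb of v against itself; bit i is set iff
         (v[i] & v[i]) == 0, i.e. iff byte i is zero.  */
      __mmask32 k_test = _mm256_testn_epi8_mask (v, v);

      assert (k_cmp == k_test);
      return 0;
    }

Second, tzcntl becomes bsfl on paths where the mask is known to be non-zero: tzcnt is encoded as rep bsf and is one byte longer, and for non-zero inputs the two give the same result, so bsf keeps those blocks inside a single instruction-fetch block (the "save 1 byte" comments in the diff).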


* Re: [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c
       [not found]   ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com>
@ 2022-05-12 19:34     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:34 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 11:57 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
> > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary
> > sign extensions.
> >
> > geometric_mean(N=20) of all benchmarks that don't fall back on
> > sse2/strlen; New / Original: .928
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean of N=20 runs; all functions page aligned
> > len, align1, align2,  pos, New Time / Old Time
> >   0,      0,      0,  512,               1.207
> >   1,      0,      0,  512,               1.039
> >   1,      1,      0,  512,               0.997
> >   1,      0,      1,  512,               0.981
> >   1,      1,      1,  512,               0.977
> >   2,      0,      0,  512,                1.02
> >   2,      2,      0,  512,               0.979
> >   2,      0,      2,  512,               0.902
> >   2,      2,      2,  512,               0.958
> >   3,      0,      0,  512,               0.978
> >   3,      3,      0,  512,               0.988
> >   3,      0,      3,  512,               0.979
> >   3,      3,      3,  512,               0.955
> >   4,      0,      0,  512,               0.969
> >   4,      4,      0,  512,               0.991
> >   4,      0,      4,  512,                0.94
> >   4,      4,      4,  512,               0.958
> >   5,      0,      0,  512,               0.963
> >   5,      5,      0,  512,               1.004
> >   5,      0,      5,  512,               0.948
> >   5,      5,      5,  512,               0.971
> >   6,      0,      0,  512,               0.933
> >   6,      6,      0,  512,               1.007
> >   6,      0,      6,  512,               0.921
> >   6,      6,      6,  512,               0.969
> >   7,      0,      0,  512,               0.928
> >   7,      7,      0,  512,               0.976
> >   7,      0,      7,  512,               0.932
> >   7,      7,      7,  512,               0.995
> >   8,      0,      0,  512,               0.931
> >   8,      0,      8,  512,               0.766
> >   9,      0,      0,  512,               0.965
> >   9,      1,      0,  512,               0.999
> >   9,      0,      9,  512,               0.765
> >   9,      1,      9,  512,                0.97
> >  10,      0,      0,  512,               0.976
> >  10,      2,      0,  512,               0.991
> >  10,      0,     10,  512,               0.768
> >  10,      2,     10,  512,               0.926
> >  11,      0,      0,  512,               0.958
> >  11,      3,      0,  512,               1.006
> >  11,      0,     11,  512,               0.768
> >  11,      3,     11,  512,               0.908
> >  12,      0,      0,  512,               0.945
> >  12,      4,      0,  512,               0.896
> >  12,      0,     12,  512,               0.764
> >  12,      4,     12,  512,               0.785
> >  13,      0,      0,  512,               0.957
> >  13,      5,      0,  512,               1.019
> >  13,      0,     13,  512,                0.76
> >  13,      5,     13,  512,               0.785
> >  14,      0,      0,  512,               0.918
> >  14,      6,      0,  512,               1.004
> >  14,      0,     14,  512,                0.78
> >  14,      6,     14,  512,               0.711
> >  15,      0,      0,  512,               0.855
> >  15,      7,      0,  512,               0.985
> >  15,      0,     15,  512,               0.779
> >  15,      7,     15,  512,               0.772
> >  16,      0,      0,  512,               0.987
> >  16,      0,     16,  512,                0.99
> >  17,      0,      0,  512,               0.996
> >  17,      1,      0,  512,               0.979
> >  17,      0,     17,  512,               1.001
> >  17,      1,     17,  512,                1.03
> >  18,      0,      0,  512,               0.976
> >  18,      2,      0,  512,               0.989
> >  18,      0,     18,  512,               0.976
> >  18,      2,     18,  512,               0.992
> >  19,      0,      0,  512,               0.991
> >  19,      3,      0,  512,               0.988
> >  19,      0,     19,  512,               1.009
> >  19,      3,     19,  512,               1.018
> >  20,      0,      0,  512,               0.999
> >  20,      4,      0,  512,               1.005
> >  20,      0,     20,  512,               0.993
> >  20,      4,     20,  512,               0.983
> >  21,      0,      0,  512,               0.982
> >  21,      5,      0,  512,               0.988
> >  21,      0,     21,  512,               0.978
> >  21,      5,     21,  512,               0.984
> >  22,      0,      0,  512,               0.988
> >  22,      6,      0,  512,               0.979
> >  22,      0,     22,  512,               0.984
> >  22,      6,     22,  512,               0.983
> >  23,      0,      0,  512,               0.996
> >  23,      7,      0,  512,               0.998
> >  23,      0,     23,  512,               0.979
> >  23,      7,     23,  512,               0.987
> >  24,      0,      0,  512,                0.99
> >  24,      0,     24,  512,               0.979
> >  25,      0,      0,  512,               0.985
> >  25,      1,      0,  512,               0.988
> >  25,      0,     25,  512,                0.99
> >  25,      1,     25,  512,               0.986
> >  26,      0,      0,  512,               1.005
> >  26,      2,      0,  512,               0.995
> >  26,      0,     26,  512,               0.992
> >  26,      2,     26,  512,               0.983
> >  27,      0,      0,  512,               0.986
> >  27,      3,      0,  512,               0.978
> >  27,      0,     27,  512,               0.986
> >  27,      3,     27,  512,               0.973
> >  28,      0,      0,  512,               0.995
> >  28,      4,      0,  512,               0.993
> >  28,      0,     28,  512,               0.983
> >  28,      4,     28,  512,               1.005
> >  29,      0,      0,  512,               0.983
> >  29,      5,      0,  512,               0.982
> >  29,      0,     29,  512,               0.984
> >  29,      5,     29,  512,               1.005
> >  30,      0,      0,  512,               0.978
> >  30,      6,      0,  512,               0.985
> >  30,      0,     30,  512,               0.994
> >  30,      6,     30,  512,               0.993
> >  31,      0,      0,  512,               0.984
> >  31,      7,      0,  512,               0.983
> >  31,      0,     31,  512,                 1.0
> >  31,      7,     31,  512,               1.031
> >   4,      0,      0,   32,               0.916
> >   4,      1,      0,   32,               0.952
> >   4,      0,      1,   32,               0.927
> >   4,      1,      1,   32,               0.969
> >   4,      0,      0,   64,               0.961
> >   4,      2,      0,   64,               0.955
> >   4,      0,      2,   64,               0.975
> >   4,      2,      2,   64,               0.972
> >   4,      0,      0,  128,               0.971
> >   4,      3,      0,  128,               0.982
> >   4,      0,      3,  128,               0.945
> >   4,      3,      3,  128,               0.971
> >   4,      0,      0,  256,               1.004
> >   4,      4,      0,  256,               0.966
> >   4,      0,      4,  256,               0.961
> >   4,      4,      4,  256,               0.971
> >   4,      5,      0,  512,               0.929
> >   4,      0,      5,  512,               0.969
> >   4,      5,      5,  512,               0.985
> >   4,      0,      0, 1024,               1.003
> >   4,      6,      0, 1024,               1.009
> >   4,      0,      6, 1024,               1.005
> >   4,      6,      6, 1024,               0.999
> >   4,      0,      0, 2048,               0.917
> >   4,      7,      0, 2048,               1.015
> >   4,      0,      7, 2048,               1.011
> >   4,      7,      7, 2048,               0.907
> >  10,      1,      0,   64,               0.964
> >  10,      1,      1,   64,               0.966
> >  10,      2,      0,   64,               0.953
> >  10,      2,      2,   64,               0.972
> >  10,      3,      0,   64,               0.962
> >  10,      3,      3,   64,               0.969
> >  10,      4,      0,   64,               0.957
> >  10,      4,      4,   64,               0.969
> >  10,      5,      0,   64,               0.961
> >  10,      5,      5,   64,               0.965
> >  10,      6,      0,   64,               0.949
> >  10,      6,      6,   64,                 0.9
> >  10,      7,      0,   64,               0.957
> >  10,      7,      7,   64,               0.897
> >   6,      0,      0,    0,               0.991
> >   6,      0,      0,    1,               1.011
> >   6,      0,      1,    1,               0.939
> >   6,      0,      0,    2,               1.016
> >   6,      0,      2,    2,                0.94
> >   6,      0,      0,    3,               1.019
> >   6,      0,      3,    3,               0.941
> >   6,      0,      0,    4,               1.056
> >   6,      0,      4,    4,               0.884
> >   6,      0,      0,    5,               0.977
> >   6,      0,      5,    5,               0.934
> >   6,      0,      0,    6,               0.954
> >   6,      0,      6,    6,                0.93
> >   6,      0,      0,    7,               0.963
> >   6,      0,      7,    7,               0.916
> >   6,      0,      0,    8,               0.963
> >   6,      0,      8,    8,               0.945
> >   6,      0,      0,    9,               1.028
> >   6,      0,      9,    9,               0.942
> >   6,      0,      0,   10,               0.955
> >   6,      0,     10,   10,               0.831
> >   6,      0,      0,   11,               0.948
> >   6,      0,     11,   11,                0.82
> >   6,      0,      0,   12,               1.033
> >   6,      0,     12,   12,               0.873
> >   6,      0,      0,   13,               0.983
> >   6,      0,     13,   13,               0.852
> >   6,      0,      0,   14,               0.984
> >   6,      0,     14,   14,               0.853
> >   6,      0,      0,   15,               0.984
> >   6,      0,     15,   15,               0.882
> >   6,      0,      0,   16,               0.971
> >   6,      0,     16,   16,               0.958
> >   6,      0,      0,   17,               0.938
> >   6,      0,     17,   17,               0.947
> >   6,      0,      0,   18,                0.96
> >   6,      0,     18,   18,               0.938
> >   6,      0,      0,   19,               0.903
> >   6,      0,     19,   19,               0.943
> >   6,      0,      0,   20,               0.947
> >   6,      0,     20,   20,               0.951
> >   6,      0,      0,   21,               0.948
> >   6,      0,     21,   21,                0.96
> >   6,      0,      0,   22,               0.926
> >   6,      0,     22,   22,               0.951
> >   6,      0,      0,   23,               0.923
> >   6,      0,     23,   23,               0.959
> >   6,      0,      0,   24,               0.918
> >   6,      0,     24,   24,               0.952
> >   6,      0,      0,   25,                0.97
> >   6,      0,     25,   25,               0.952
> >   6,      0,      0,   26,               0.871
> >   6,      0,     26,   26,               0.869
> >   6,      0,      0,   27,               0.935
> >   6,      0,     27,   27,               0.836
> >   6,      0,      0,   28,               0.936
> >   6,      0,     28,   28,               0.857
> >   6,      0,      0,   29,               0.876
> >   6,      0,     29,   29,               0.859
> >   6,      0,      0,   30,               0.934
> >   6,      0,     30,   30,               0.857
> >   6,      0,      0,   31,               0.962
> >   6,      0,     31,   31,                0.86
> >   6,      0,      0,   32,               0.912
> >   6,      0,     32,   32,                0.94
> >   6,      0,      0,   33,               0.903
> >   6,      0,     33,   33,               0.968
> >   6,      0,      0,   34,               0.913
> >   6,      0,     34,   34,               0.896
> >   6,      0,      0,   35,               0.904
> >   6,      0,     35,   35,               0.913
> >   6,      0,      0,   36,               0.905
> >   6,      0,     36,   36,               0.907
> >   6,      0,      0,   37,               0.899
> >   6,      0,     37,   37,                 0.9
> >   6,      0,      0,   38,               0.912
> >   6,      0,     38,   38,               0.919
> >   6,      0,      0,   39,               0.925
> >   6,      0,     39,   39,               0.927
> >   6,      0,      0,   40,               0.923
> >   6,      0,     40,   40,               0.972
> >   6,      0,      0,   41,                0.92
> >   6,      0,     41,   41,               0.966
> >   6,      0,      0,   42,               0.915
> >   6,      0,     42,   42,               0.834
> >   6,      0,      0,   43,                0.92
> >   6,      0,     43,   43,               0.856
> >   6,      0,      0,   44,               0.908
> >   6,      0,     44,   44,               0.858
> >   6,      0,      0,   45,               0.932
> >   6,      0,     45,   45,               0.847
> >   6,      0,      0,   46,               0.927
> >   6,      0,     46,   46,               0.859
> >   6,      0,      0,   47,               0.902
> >   6,      0,     47,   47,               0.855
> >   6,      0,      0,   48,               0.949
> >   6,      0,     48,   48,               0.934
> >   6,      0,      0,   49,               0.907
> >   6,      0,     49,   49,               0.943
> >   6,      0,      0,   50,               0.934
> >   6,      0,     50,   50,               0.943
> >   6,      0,      0,   51,               0.933
> >   6,      0,     51,   51,               0.939
> >   6,      0,      0,   52,               0.944
> >   6,      0,     52,   52,               0.944
> >   6,      0,      0,   53,               0.939
> >   6,      0,     53,   53,               0.938
> >   6,      0,      0,   54,                 0.9
> >   6,      0,     54,   54,               0.923
> >   6,      0,      0,   55,                 0.9
> >   6,      0,     55,   55,               0.927
> >   6,      0,      0,   56,                 0.9
> >   6,      0,     56,   56,               0.917
> >   6,      0,      0,   57,                 0.9
> >   6,      0,     57,   57,               0.916
> >   6,      0,      0,   58,               0.914
> >   6,      0,     58,   58,               0.784
> >   6,      0,      0,   59,               0.863
> >   6,      0,     59,   59,               0.846
> >   6,      0,      0,   60,                0.88
> >   6,      0,     60,   60,               0.827
> >   6,      0,      0,   61,               0.896
> >   6,      0,     61,   61,               0.847
> >   6,      0,      0,   62,               0.894
> >   6,      0,     62,   62,               0.865
> >   6,      0,      0,   63,               0.934
> >   6,      0,     63,   63,               0.866
> >
> >  sysdeps/x86_64/multiarch/strcspn-c.c | 83 +++++++++++++---------------
> >  1 file changed, 37 insertions(+), 46 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcspn-c.c b/sysdeps/x86_64/multiarch/strcspn-c.c
> > index 013aebf797..c312fab8b1 100644
> > --- a/sysdeps/x86_64/multiarch/strcspn-c.c
> > +++ b/sysdeps/x86_64/multiarch/strcspn-c.c
> > @@ -84,83 +84,74 @@ STRCSPN_SSE42 (const char *s, const char *a)
> >      RETURN (NULL, strlen (s));
> >
> >    const char *aligned;
> > -  __m128i mask;
> > -  int offset = (int) ((size_t) a & 15);
> > +  __m128i mask, maskz, zero;
> > +  unsigned int maskz_bits;
> > +  unsigned int offset = (unsigned int) ((size_t) a & 15);
> > +  zero = _mm_set1_epi8 (0);
> >    if (offset != 0)
> >      {
> >        /* Load masks.  */
> >        aligned = (const char *) ((size_t) a & -16L);
> >        __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> > -
> > -      mask = __m128i_shift_right (mask0, offset);
> > +      maskz = _mm_cmpeq_epi8 (mask0, zero);
> >
> >        /* Find where the NULL terminator is.  */
> > -      int length = _mm_cmpistri (mask, mask, 0x3a);
> > -      if (length == 16 - offset)
> > -       {
> > -         /* There is no NULL terminator.  */
> > -         __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> > -         int index = _mm_cmpistri (mask1, mask1, 0x3a);
> > -         length += index;
> > -
> > -         /* Don't use SSE4.2 if the length of A > 16.  */
> > -         if (length > 16)
> > -           return STRCSPN_SSE2 (s, a);
> > -
> > -         if (index != 0)
> > -           {
> > -             /* Combine mask0 and mask1.  We could play games with
> > -                palignr, but frankly this data should be in L1 now
> > -                so do the merge via an unaligned load.  */
> > -             mask = _mm_loadu_si128 ((__m128i *) a);
> > -           }
> > -       }
> > +      maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > +      if (maskz_bits != 0)
> > +        {
> > +          mask = __m128i_shift_right (mask0, offset);
> > +          offset = (unsigned int) ((size_t) s & 15);
> > +          if (offset)
> > +            goto start_unaligned;
> > +
> > +          aligned = s;
> > +          goto start_loop;
> > +        }
> >      }
> > -  else
> > -    {
> > -      /* A is aligned.  */
> > -      mask = _mm_load_si128 ((__m128i *) a);
> >
> > -      /* Find where the NULL terminator is.  */
> > -      int length = _mm_cmpistri (mask, mask, 0x3a);
> > -      if (length == 16)
> > -       {
> > -         /* There is no NULL terminator.  Don't use SSE4.2 if the length
> > -            of A > 16.  */
> > -         if (a[16] != 0)
> > -           return STRCSPN_SSE2 (s, a);
> > -       }
> > +  /* A is aligned.  */
> > +  mask = _mm_loadu_si128 ((__m128i *) a);
> > +  /* Find where the NULL terminator is.  */
> > +  maskz = _mm_cmpeq_epi8 (mask, zero);
> > +  maskz_bits = _mm_movemask_epi8 (maskz);
> > +  if (maskz_bits == 0)
> > +    {
> > +      /* There is no NULL terminator.  Don't use SSE4.2 if the length
> > +         of A > 16.  */
> > +      if (a[16] != 0)
> > +        return STRCSPN_SSE2 (s, a);
> >      }
> >
> > -  offset = (int) ((size_t) s & 15);
> > +  aligned = s;
> > +  offset = (unsigned int) ((size_t) s & 15);
> >    if (offset != 0)
> >      {
> > +    start_unaligned:
> >        /* Check partial string.  */
> >        aligned = (const char *) ((size_t) s & -16L);
> >        __m128i value = _mm_load_si128 ((__m128i *) aligned);
> >
> >        value = __m128i_shift_right (value, offset);
> >
> > -      int length = _mm_cmpistri (mask, value, 0x2);
> > +      unsigned int length = _mm_cmpistri (mask, value, 0x2);
> >        /* No need to check ZFlag since ZFlag is always 1.  */
> > -      int cflag = _mm_cmpistrc (mask, value, 0x2);
> > +      unsigned int cflag = _mm_cmpistrc (mask, value, 0x2);
> >        if (cflag)
> >         RETURN ((char *) (s + length), length);
> >        /* Find where the NULL terminator is.  */
> > -      int index = _mm_cmpistri (value, value, 0x3a);
> > +      unsigned int index = _mm_cmpistri (value, value, 0x3a);
> >        if (index < 16 - offset)
> >         RETURN (NULL, index);
> >        aligned += 16;
> >      }
> > -  else
> > -    aligned = s;
> >
> > +start_loop:
> >    while (1)
> >      {
> >        __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > -      int index = _mm_cmpistri (mask, value, 0x2);
> > -      int cflag = _mm_cmpistrc (mask, value, 0x2);
> > -      int zflag = _mm_cmpistrz (mask, value, 0x2);
> > +      unsigned int index = _mm_cmpistri (mask, value, 0x2);
> > +      unsigned int cflag = _mm_cmpistrc (mask, value, 0x2);
> > +      unsigned int zflag = _mm_cmpistrz (mask, value, 0x2);
> >        if (cflag)
> >         RETURN ((char *) (aligned + index), (size_t) (aligned + index - s));
> >        if (zflag)
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil
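
The core idea of the change above is to find the NUL terminator with a plain SSE2 byte compare plus movemask instead of _mm_cmpistri, and to keep offsets unsigned so 32-bit indices can be used as addresses without sign extension. A minimal sketch of that idiom, for illustration only (not the glibc code; first_zero_index is a hypothetical helper):

    #include <emmintrin.h>   /* SSE2: _mm_cmpeq_epi8, _mm_movemask_epi8 */
    #include <stdio.h>

    /* Return the index of the first zero byte in the 16 bytes at P, or 16
       if there is none.  Assumes the 16 bytes at P are readable (e.g. an
       aligned block that cannot cross a page boundary).  */
    static unsigned int
    first_zero_index (const char *p)
    {
      __m128i v = _mm_loadu_si128 ((const __m128i *) p);
      __m128i z = _mm_cmpeq_epi8 (v, _mm_setzero_si128 ());
      unsigned int bits = (unsigned int) _mm_movemask_epi8 (z);
      return bits ? (unsigned int) __builtin_ctz (bits) : 16;
    }

    int
    main (void)
    {
      char buf[16] = "abcdef";                  /* terminator at index 6 */
      printf ("%u\n", first_zero_index (buf));  /* prints 6 */
      return 0;
    }

Keeping the index unsigned matters because a 32-bit result written to a register is implicitly zero-extended on x86-64, so it can be added to a pointer directly, whereas a signed int offset generally needs an extra movsxd first.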


* Re: [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c
       [not found]   ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com>
@ 2022-05-12 19:39     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:39 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 11:58 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of
> > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary
> > sign extensions.
> >
> > geometric_mean(N=20) of all benchmarks that don't fall back on
> > sse2; New / Original: .901
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean of N=20 runs; all functions page aligned
> > len, align1, align2,  pos, New Time / Old Time
> >   1,      0,      0,  512,               0.768
> >   1,      1,      0,  512,               0.666
> >   1,      0,      1,  512,               1.193
> >   1,      1,      1,  512,               0.872
> >   2,      0,      0,  512,               0.698
> >   2,      2,      0,  512,               0.687
> >   2,      0,      2,  512,               1.393
> >   2,      2,      2,  512,               0.944
> >   3,      0,      0,  512,               0.691
> >   3,      3,      0,  512,               0.676
> >   3,      0,      3,  512,               1.388
> >   3,      3,      3,  512,               0.948
> >   4,      0,      0,  512,                0.74
> >   4,      4,      0,  512,               0.678
> >   4,      0,      4,  512,               1.421
> >   4,      4,      4,  512,               0.943
> >   5,      0,      0,  512,               0.691
> >   5,      5,      0,  512,               0.675
> >   5,      0,      5,  512,               1.348
> >   5,      5,      5,  512,               0.952
> >   6,      0,      0,  512,               0.685
> >   6,      6,      0,  512,                0.67
> >   6,      0,      6,  512,               1.333
> >   6,      6,      6,  512,                0.95
> >   7,      0,      0,  512,               0.688
> >   7,      7,      0,  512,               0.675
> >   7,      0,      7,  512,               1.344
> >   7,      7,      7,  512,               0.919
> >   8,      0,      0,  512,               0.716
> >   8,      0,      8,  512,               0.935
> >   9,      0,      0,  512,               0.716
> >   9,      1,      0,  512,               0.712
> >   9,      0,      9,  512,               0.956
> >   9,      1,      9,  512,               0.992
> >  10,      0,      0,  512,               0.699
> >  10,      2,      0,  512,                0.68
> >  10,      0,     10,  512,               0.952
> >  10,      2,     10,  512,               0.932
> >  11,      0,      0,  512,               0.705
> >  11,      3,      0,  512,               0.685
> >  11,      0,     11,  512,               0.956
> >  11,      3,     11,  512,               0.927
> >  12,      0,      0,  512,               0.695
> >  12,      4,      0,  512,               0.675
> >  12,      0,     12,  512,               0.948
> >  12,      4,     12,  512,               0.928
> >  13,      0,      0,  512,                 0.7
> >  13,      5,      0,  512,               0.678
> >  13,      0,     13,  512,               0.944
> >  13,      5,     13,  512,               0.931
> >  14,      0,      0,  512,               0.703
> >  14,      6,      0,  512,               0.678
> >  14,      0,     14,  512,               0.949
> >  14,      6,     14,  512,                0.93
> >  15,      0,      0,  512,               0.694
> >  15,      7,      0,  512,               0.678
> >  15,      0,     15,  512,               0.953
> >  15,      7,     15,  512,               0.924
> >  16,      0,      0,  512,               1.021
> >  16,      0,     16,  512,               1.067
> >  17,      0,      0,  512,               0.991
> >  17,      1,      0,  512,               0.984
> >  17,      0,     17,  512,               0.979
> >  17,      1,     17,  512,               0.993
> >  18,      0,      0,  512,               0.992
> >  18,      2,      0,  512,               1.008
> >  18,      0,     18,  512,               1.016
> >  18,      2,     18,  512,               0.993
> >  19,      0,      0,  512,               0.984
> >  19,      3,      0,  512,               0.985
> >  19,      0,     19,  512,               1.007
> >  19,      3,     19,  512,               1.006
> >  20,      0,      0,  512,               0.969
> >  20,      4,      0,  512,               0.968
> >  20,      0,     20,  512,               0.975
> >  20,      4,     20,  512,               0.975
> >  21,      0,      0,  512,               0.992
> >  21,      5,      0,  512,               0.992
> >  21,      0,     21,  512,                0.98
> >  21,      5,     21,  512,                0.97
> >  22,      0,      0,  512,               0.989
> >  22,      6,      0,  512,               0.987
> >  22,      0,     22,  512,                0.99
> >  22,      6,     22,  512,               0.985
> >  23,      0,      0,  512,               0.989
> >  23,      7,      0,  512,                0.98
> >  23,      0,     23,  512,                 1.0
> >  23,      7,     23,  512,               0.993
> >  24,      0,      0,  512,                0.99
> >  24,      0,     24,  512,               0.998
> >  25,      0,      0,  512,                1.01
> >  25,      1,      0,  512,                 1.0
> >  25,      0,     25,  512,                0.97
> >  25,      1,     25,  512,               0.967
> >  26,      0,      0,  512,               1.009
> >  26,      2,      0,  512,               0.986
> >  26,      0,     26,  512,               0.997
> >  26,      2,     26,  512,               0.993
> >  27,      0,      0,  512,               0.984
> >  27,      3,      0,  512,               0.997
> >  27,      0,     27,  512,               0.989
> >  27,      3,     27,  512,               0.976
> >  28,      0,      0,  512,               0.991
> >  28,      4,      0,  512,               1.003
> >  28,      0,     28,  512,               0.986
> >  28,      4,     28,  512,               0.989
> >  29,      0,      0,  512,               0.986
> >  29,      5,      0,  512,               0.985
> >  29,      0,     29,  512,               0.984
> >  29,      5,     29,  512,               0.977
> >  30,      0,      0,  512,               0.991
> >  30,      6,      0,  512,               0.987
> >  30,      0,     30,  512,               0.979
> >  30,      6,     30,  512,               0.974
> >  31,      0,      0,  512,               0.995
> >  31,      7,      0,  512,               0.995
> >  31,      0,     31,  512,               0.994
> >  31,      7,     31,  512,               0.984
> >   4,      0,      0,   32,               0.861
> >   4,      1,      0,   32,               0.864
> >   4,      0,      1,   32,               0.962
> >   4,      1,      1,   32,               0.967
> >   4,      0,      0,   64,               0.884
> >   4,      2,      0,   64,               0.818
> >   4,      0,      2,   64,               0.889
> >   4,      2,      2,   64,               0.918
> >   4,      0,      0,  128,               0.942
> >   4,      3,      0,  128,               0.884
> >   4,      0,      3,  128,               0.931
> >   4,      3,      3,  128,               0.883
> >   4,      0,      0,  256,               0.964
> >   4,      4,      0,  256,               0.922
> >   4,      0,      4,  256,               0.956
> >   4,      4,      4,  256,                0.93
> >   4,      5,      0,  512,               0.833
> >   4,      0,      5,  512,               1.027
> >   4,      5,      5,  512,               0.929
> >   4,      0,      0, 1024,               0.998
> >   4,      6,      0, 1024,               0.986
> >   4,      0,      6, 1024,               0.984
> >   4,      6,      6, 1024,               0.977
> >   4,      0,      0, 2048,               0.991
> >   4,      7,      0, 2048,               0.987
> >   4,      0,      7, 2048,               0.996
> >   4,      7,      7, 2048,                0.98
> >  10,      1,      0,   64,               0.826
> >  10,      1,      1,   64,               0.907
> >  10,      2,      0,   64,               0.829
> >  10,      2,      2,   64,                0.91
> >  10,      3,      0,   64,                0.83
> >  10,      3,      3,   64,               0.915
> >  10,      4,      0,   64,                0.83
> >  10,      4,      4,   64,               0.911
> >  10,      5,      0,   64,               0.828
> >  10,      5,      5,   64,               0.905
> >  10,      6,      0,   64,               0.828
> >  10,      6,      6,   64,               0.812
> >  10,      7,      0,   64,                0.83
> >  10,      7,      7,   64,               0.819
> >   6,      0,      0,    0,               1.261
> >   6,      0,      0,    1,               1.252
> >   6,      0,      1,    1,               0.845
> >   6,      0,      0,    2,                1.27
> >   6,      0,      2,    2,                0.85
> >   6,      0,      0,    3,               1.269
> >   6,      0,      3,    3,               0.845
> >   6,      0,      0,    4,               1.287
> >   6,      0,      4,    4,               0.852
> >   6,      0,      0,    5,               1.278
> >   6,      0,      5,    5,               0.851
> >   6,      0,      0,    6,               1.269
> >   6,      0,      6,    6,               0.841
> >   6,      0,      0,    7,               1.268
> >   6,      0,      7,    7,               0.851
> >   6,      0,      0,    8,               1.291
> >   6,      0,      8,    8,               0.837
> >   6,      0,      0,    9,               1.283
> >   6,      0,      9,    9,               0.831
> >   6,      0,      0,   10,               1.252
> >   6,      0,     10,   10,               0.997
> >   6,      0,      0,   11,               1.295
> >   6,      0,     11,   11,               1.046
> >   6,      0,      0,   12,               1.296
> >   6,      0,     12,   12,               1.038
> >   6,      0,      0,   13,               1.287
> >   6,      0,     13,   13,               1.082
> >   6,      0,      0,   14,               1.284
> >   6,      0,     14,   14,               1.001
> >   6,      0,      0,   15,               1.286
> >   6,      0,     15,   15,               1.002
> >   6,      0,      0,   16,               0.894
> >   6,      0,     16,   16,               0.874
> >   6,      0,      0,   17,               0.892
> >   6,      0,     17,   17,               0.974
> >   6,      0,      0,   18,               0.907
> >   6,      0,     18,   18,               0.993
> >   6,      0,      0,   19,               0.909
> >   6,      0,     19,   19,                0.99
> >   6,      0,      0,   20,               0.894
> >   6,      0,     20,   20,               0.978
> >   6,      0,      0,   21,                0.89
> >   6,      0,     21,   21,               0.958
> >   6,      0,      0,   22,               0.893
> >   6,      0,     22,   22,                0.99
> >   6,      0,      0,   23,               0.899
> >   6,      0,     23,   23,               0.986
> >   6,      0,      0,   24,               0.893
> >   6,      0,     24,   24,               0.989
> >   6,      0,      0,   25,               0.889
> >   6,      0,     25,   25,               0.982
> >   6,      0,      0,   26,               0.889
> >   6,      0,     26,   26,               0.852
> >   6,      0,      0,   27,                0.89
> >   6,      0,     27,   27,               0.832
> >   6,      0,      0,   28,                0.89
> >   6,      0,     28,   28,               0.831
> >   6,      0,      0,   29,                0.89
> >   6,      0,     29,   29,               0.838
> >   6,      0,      0,   30,               0.907
> >   6,      0,     30,   30,               0.833
> >   6,      0,      0,   31,               0.888
> >   6,      0,     31,   31,               0.837
> >   6,      0,      0,   32,               0.853
> >   6,      0,     32,   32,               0.828
> >   6,      0,      0,   33,               0.857
> >   6,      0,     33,   33,               0.947
> >   6,      0,      0,   34,               0.847
> >   6,      0,     34,   34,               0.954
> >   6,      0,      0,   35,               0.841
> >   6,      0,     35,   35,                0.94
> >   6,      0,      0,   36,               0.854
> >   6,      0,     36,   36,               0.958
> >   6,      0,      0,   37,               0.856
> >   6,      0,     37,   37,               0.957
> >   6,      0,      0,   38,               0.839
> >   6,      0,     38,   38,               0.962
> >   6,      0,      0,   39,               0.866
> >   6,      0,     39,   39,               0.945
> >   6,      0,      0,   40,               0.845
> >   6,      0,     40,   40,               0.961
> >   6,      0,      0,   41,               0.858
> >   6,      0,     41,   41,               0.961
> >   6,      0,      0,   42,               0.862
> >   6,      0,     42,   42,               0.825
> >   6,      0,      0,   43,               0.864
> >   6,      0,     43,   43,                0.82
> >   6,      0,      0,   44,               0.843
> >   6,      0,     44,   44,                0.81
> >   6,      0,      0,   45,               0.859
> >   6,      0,     45,   45,               0.816
> >   6,      0,      0,   46,               0.866
> >   6,      0,     46,   46,                0.81
> >   6,      0,      0,   47,               0.858
> >   6,      0,     47,   47,               0.807
> >   6,      0,      0,   48,                0.87
> >   6,      0,     48,   48,                0.87
> >   6,      0,      0,   49,               0.871
> >   6,      0,     49,   49,               0.874
> >   6,      0,      0,   50,                0.87
> >   6,      0,     50,   50,               0.881
> >   6,      0,      0,   51,               0.868
> >   6,      0,     51,   51,               0.875
> >   6,      0,      0,   52,               0.873
> >   6,      0,     52,   52,               0.871
> >   6,      0,      0,   53,               0.866
> >   6,      0,     53,   53,               0.882
> >   6,      0,      0,   54,               0.863
> >   6,      0,     54,   54,               0.876
> >   6,      0,      0,   55,               0.851
> >   6,      0,     55,   55,               0.871
> >   6,      0,      0,   56,               0.867
> >   6,      0,     56,   56,               0.888
> >   6,      0,      0,   57,               0.862
> >   6,      0,     57,   57,               0.899
> >   6,      0,      0,   58,               0.873
> >   6,      0,     58,   58,               0.798
> >   6,      0,      0,   59,               0.881
> >   6,      0,     59,   59,               0.785
> >   6,      0,      0,   60,               0.867
> >   6,      0,     60,   60,               0.797
> >   6,      0,      0,   61,               0.872
> >   6,      0,     61,   61,               0.791
> >   6,      0,      0,   62,               0.859
> >   6,      0,     62,   62,                0.79
> >   6,      0,      0,   63,                0.87
> >   6,      0,     63,   63,               0.796
> >
> >  sysdeps/x86_64/multiarch/strspn-c.c | 86 +++++++++++++----------------
> >  1 file changed, 39 insertions(+), 47 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strspn-c.c b/sysdeps/x86_64/multiarch/strspn-c.c
> > index 8fb3aba64d..6124033ceb 100644
> > --- a/sysdeps/x86_64/multiarch/strspn-c.c
> > +++ b/sysdeps/x86_64/multiarch/strspn-c.c
> > @@ -62,81 +62,73 @@ __strspn_sse42 (const char *s, const char *a)
> >      return 0;
> >
> >    const char *aligned;
> > -  __m128i mask;
> > -  int offset = (int) ((size_t) a & 15);
> > +  __m128i mask, maskz, zero;
> > +  unsigned int maskz_bits;
> > +  unsigned int offset = (int) ((size_t) a & 15);
> > +  zero = _mm_set1_epi8 (0);
> >    if (offset != 0)
> >      {
> >        /* Load masks.  */
> >        aligned = (const char *) ((size_t) a & -16L);
> >        __m128i mask0 = _mm_load_si128 ((__m128i *) aligned);
> > -
> > -      mask = __m128i_shift_right (mask0, offset);
> > +      maskz = _mm_cmpeq_epi8 (mask0, zero);
> >
> >        /* Find where the NULL terminator is.  */
> > -      int length = _mm_cmpistri (mask, mask, 0x3a);
> > -      if (length == 16 - offset)
> > -       {
> > -         /* There is no NULL terminator.  */
> > -         __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16));
> > -         int index = _mm_cmpistri (mask1, mask1, 0x3a);
> > -         length += index;
> > -
> > -         /* Don't use SSE4.2 if the length of A > 16.  */
> > -         if (length > 16)
> > -           return __strspn_sse2 (s, a);
> > -
> > -         if (index != 0)
> > -           {
> > -             /* Combine mask0 and mask1.  We could play games with
> > -                palignr, but frankly this data should be in L1 now
> > -                so do the merge via an unaligned load.  */
> > -             mask = _mm_loadu_si128 ((__m128i *) a);
> > -           }
> > -       }
> > +      maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > +      if (maskz_bits != 0)
> > +        {
> > +          mask = __m128i_shift_right (mask0, offset);
> > +          offset = (unsigned int) ((size_t) s & 15);
> > +          if (offset)
> > +            goto start_unaligned;
> > +
> > +          aligned = s;
> > +          goto start_loop;
> > +        }
> >      }
> > -  else
> > -    {
> > -      /* A is aligned.  */
> > -      mask = _mm_load_si128 ((__m128i *) a);
> >
> > -      /* Find where the NULL terminator is.  */
> > -      int length = _mm_cmpistri (mask, mask, 0x3a);
> > -      if (length == 16)
> > -       {
> > -         /* There is no NULL terminator.  Don't use SSE4.2 if the length
> > -            of A > 16.  */
> > -         if (a[16] != 0)
> > -           return __strspn_sse2 (s, a);
> > -       }
> > +  /* A is aligned.  */
> > +  mask = _mm_loadu_si128 ((__m128i *) a);
> > +
> > +  /* Find where the NULL terminator is.  */
> > +  maskz = _mm_cmpeq_epi8 (mask, zero);
> > +  maskz_bits = _mm_movemask_epi8 (maskz);
> > +  if (maskz_bits == 0)
> > +    {
> > +      /* There is no NULL terminator.  Don't use SSE4.2 if the length
> > +         of A > 16.  */
> > +      if (a[16] != 0)
> > +        return __strspn_sse2 (s, a);
> >      }
> > +  aligned = s;
> > +  offset = (unsigned int) ((size_t) s & 15);
> >
> > -  offset = (int) ((size_t) s & 15);
> >    if (offset != 0)
> >      {
> > +    start_unaligned:
> >        /* Check partial string.  */
> >        aligned = (const char *) ((size_t) s & -16L);
> >        __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > +      __m128i adj_value = __m128i_shift_right (value, offset);
> >
> > -      value = __m128i_shift_right (value, offset);
> > -
> > -      int length = _mm_cmpistri (mask, value, 0x12);
> > +      unsigned int length = _mm_cmpistri (mask, adj_value, 0x12);
> >        /* No need to check CFlag since it is always 1.  */
> >        if (length < 16 - offset)
> >         return length;
> >        /* Find where the NULL terminator is.  */
> > -      int index = _mm_cmpistri (value, value, 0x3a);
> > -      if (index < 16 - offset)
> > +      maskz = _mm_cmpeq_epi8 (value, zero);
> > +      maskz_bits = _mm_movemask_epi8 (maskz) >> offset;
> > +      if (maskz_bits != 0)
> >         return length;
> >        aligned += 16;
> >      }
> > -  else
> > -    aligned = s;
> >
> > +start_loop:
> >    while (1)
> >      {
> >        __m128i value = _mm_load_si128 ((__m128i *) aligned);
> > -      int index = _mm_cmpistri (mask, value, 0x12);
> > -      int cflag = _mm_cmpistrc (mask, value, 0x12);
> > +      unsigned int index = _mm_cmpistri (mask, value, 0x12);
> > +      unsigned int cflag = _mm_cmpistrc (mask, value, 0x12);
> >        if (cflag)
> >         return (size_t) (aligned + index - s);
> >        aligned += 16;
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread
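
The strspn-c.c change quoted above replaces the _mm_cmpistri self-comparisons that were used only to locate the NUL terminator with a plain SSE2 compare-and-movemask test. As a rough, self-contained sketch of that idiom (an illustration of the technique, not the glibc code itself; the helper name find_nul_in_block is made up, and __builtin_ctz assumes a GCC-compatible compiler):

#include <emmintrin.h>

/* Return the index of the first NUL byte in the 16-byte block at P
   (P must be 16-byte aligned), or 16 if the block contains no NUL.
   This mirrors the _mm_cmpeq_epi8/_mm_movemask_epi8 test the patch
   uses for terminator detection.  */
static inline unsigned int
find_nul_in_block (const char *p)
{
  __m128i zero = _mm_set1_epi8 (0);
  __m128i value = _mm_load_si128 ((const __m128i *) p);
  unsigned int bits = _mm_movemask_epi8 (_mm_cmpeq_epi8 (value, zero));
  /* Each set bit marks a zero byte; the lowest set bit is the first NUL.  */
  return bits != 0 ? (unsigned int) __builtin_ctz (bits) : 16;
}

In the patch itself, the movemask result (maskz_bits) is additionally shifted right by the misalignment offset so that zero-byte matches coming from bytes before the intended start of the data are ignored before the branch.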

* Re: [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation
       [not found]   ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com>
@ 2022-05-12 19:40     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:40 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 11:59 AM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .678
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2,  pos, New Time / Old Time
> >   0,      0,      0,  512,               0.054
> >   1,      0,      0,  512,               0.055
> >   1,      1,      0,  512,               0.051
> >   1,      0,      1,  512,               0.054
> >   1,      1,      1,  512,               0.054
> >   2,      0,      0,  512,               0.861
> >   2,      2,      0,  512,               0.861
> >   2,      0,      2,  512,               0.861
> >   2,      2,      2,  512,               0.864
> >   3,      0,      0,  512,               0.854
> >   3,      3,      0,  512,               0.848
> >   3,      0,      3,  512,               0.845
> >   3,      3,      3,  512,                0.85
> >   4,      0,      0,  512,               0.851
> >   4,      4,      0,  512,                0.85
> >   4,      0,      4,  512,               0.852
> >   4,      4,      4,  512,               0.849
> >   5,      0,      0,  512,               0.938
> >   5,      5,      0,  512,                0.94
> >   5,      0,      5,  512,               0.864
> >   5,      5,      5,  512,                0.86
> >   6,      0,      0,  512,               0.858
> >   6,      6,      0,  512,               0.869
> >   6,      0,      6,  512,               0.847
> >   6,      6,      6,  512,               0.868
> >   7,      0,      0,  512,               0.867
> >   7,      7,      0,  512,               0.861
> >   7,      0,      7,  512,               0.864
> >   7,      7,      7,  512,               0.863
> >   8,      0,      0,  512,               0.884
> >   8,      0,      8,  512,               0.884
> >   9,      0,      0,  512,               0.886
> >   9,      1,      0,  512,               0.894
> >   9,      0,      9,  512,               0.889
> >   9,      1,      9,  512,               0.886
> >  10,      0,      0,  512,               0.859
> >  10,      2,      0,  512,               0.859
> >  10,      0,     10,  512,               0.862
> >  10,      2,     10,  512,               0.861
> >  11,      0,      0,  512,               0.846
> >  11,      3,      0,  512,               0.865
> >  11,      0,     11,  512,               0.859
> >  11,      3,     11,  512,               0.862
> >  12,      0,      0,  512,               0.858
> >  12,      4,      0,  512,               0.857
> >  12,      0,     12,  512,               0.964
> >  12,      4,     12,  512,               0.876
> >  13,      0,      0,  512,               0.827
> >  13,      5,      0,  512,               0.805
> >  13,      0,     13,  512,               0.821
> >  13,      5,     13,  512,               0.825
> >  14,      0,      0,  512,               0.786
> >  14,      6,      0,  512,               0.786
> >  14,      0,     14,  512,               0.803
> >  14,      6,     14,  512,               0.783
> >  15,      0,      0,  512,               0.778
> >  15,      7,      0,  512,               0.792
> >  15,      0,     15,  512,               0.796
> >  15,      7,     15,  512,               0.799
> >  16,      0,      0,  512,               0.803
> >  16,      0,     16,  512,               0.815
> >  17,      0,      0,  512,               0.812
> >  17,      1,      0,  512,               0.826
> >  17,      0,     17,  512,               0.803
> >  17,      1,     17,  512,               0.856
> >  18,      0,      0,  512,               0.801
> >  18,      2,      0,  512,               0.886
> >  18,      0,     18,  512,               0.805
> >  18,      2,     18,  512,               0.807
> >  19,      0,      0,  512,               0.814
> >  19,      3,      0,  512,               0.804
> >  19,      0,     19,  512,               0.813
> >  19,      3,     19,  512,               0.814
> >  20,      0,      0,  512,               0.885
> >  20,      4,      0,  512,               0.799
> >  20,      0,     20,  512,               0.826
> >  20,      4,     20,  512,               0.808
> >  21,      0,      0,  512,               0.816
> >  21,      5,      0,  512,               0.824
> >  21,      0,     21,  512,               0.819
> >  21,      5,     21,  512,               0.826
> >  22,      0,      0,  512,               0.814
> >  22,      6,      0,  512,               0.824
> >  22,      0,     22,  512,                0.81
> >  22,      6,     22,  512,               0.806
> >  23,      0,      0,  512,               0.825
> >  23,      7,      0,  512,               0.829
> >  23,      0,     23,  512,               0.809
> >  23,      7,     23,  512,               0.823
> >  24,      0,      0,  512,               0.829
> >  24,      0,     24,  512,               0.823
> >  25,      0,      0,  512,               0.864
> >  25,      1,      0,  512,               0.895
> >  25,      0,     25,  512,                0.88
> >  25,      1,     25,  512,               0.848
> >  26,      0,      0,  512,               0.903
> >  26,      2,      0,  512,               0.888
> >  26,      0,     26,  512,               0.894
> >  26,      2,     26,  512,                0.89
> >  27,      0,      0,  512,               0.914
> >  27,      3,      0,  512,               0.917
> >  27,      0,     27,  512,               0.902
> >  27,      3,     27,  512,               0.887
> >  28,      0,      0,  512,               0.887
> >  28,      4,      0,  512,               0.877
> >  28,      0,     28,  512,               0.893
> >  28,      4,     28,  512,               0.866
> >  29,      0,      0,  512,               0.885
> >  29,      5,      0,  512,               0.907
> >  29,      0,     29,  512,               0.894
> >  29,      5,     29,  512,               0.906
> >  30,      0,      0,  512,                0.88
> >  30,      6,      0,  512,               0.898
> >  30,      0,     30,  512,                 0.9
> >  30,      6,     30,  512,               0.895
> >  31,      0,      0,  512,               0.893
> >  31,      7,      0,  512,               0.874
> >  31,      0,     31,  512,               0.894
> >  31,      7,     31,  512,               0.899
> >   4,      0,      0,   32,               0.618
> >   4,      1,      0,   32,               0.627
> >   4,      0,      1,   32,               0.625
> >   4,      1,      1,   32,               0.613
> >   4,      0,      0,   64,               0.913
> >   4,      2,      0,   64,               0.801
> >   4,      0,      2,   64,               0.759
> >   4,      2,      2,   64,               0.761
> >   4,      0,      0,  128,               0.822
> >   4,      3,      0,  128,               0.863
> >   4,      0,      3,  128,               0.867
> >   4,      3,      3,  128,               0.917
> >   4,      0,      0,  256,               0.816
> >   4,      4,      0,  256,               0.812
> >   4,      0,      4,  256,               0.803
> >   4,      4,      4,  256,               0.811
> >   4,      5,      0,  512,               0.848
> >   4,      0,      5,  512,               0.843
> >   4,      5,      5,  512,               0.857
> >   4,      0,      0, 1024,               0.886
> >   4,      6,      0, 1024,               0.887
> >   4,      0,      6, 1024,               0.881
> >   4,      6,      6, 1024,               0.873
> >   4,      0,      0, 2048,               0.892
> >   4,      7,      0, 2048,               0.894
> >   4,      0,      7, 2048,                0.89
> >   4,      7,      7, 2048,               0.874
> >  10,      1,      0,   64,               0.946
> >  10,      1,      1,   64,                0.81
> >  10,      2,      0,   64,               0.804
> >  10,      2,      2,   64,                0.82
> >  10,      3,      0,   64,               0.772
> >  10,      3,      3,   64,               0.772
> >  10,      4,      0,   64,               0.748
> >  10,      4,      4,   64,               0.751
> >  10,      5,      0,   64,                0.76
> >  10,      5,      5,   64,                0.76
> >  10,      6,      0,   64,               0.726
> >  10,      6,      6,   64,               0.718
> >  10,      7,      0,   64,               0.724
> >  10,      7,      7,   64,                0.72
> >   6,      0,      0,    0,               0.415
> >   6,      0,      0,    1,               0.423
> >   6,      0,      1,    1,               0.412
> >   6,      0,      0,    2,               0.433
> >   6,      0,      2,    2,               0.434
> >   6,      0,      0,    3,               0.427
> >   6,      0,      3,    3,               0.428
> >   6,      0,      0,    4,               0.465
> >   6,      0,      4,    4,               0.466
> >   6,      0,      0,    5,               0.463
> >   6,      0,      5,    5,               0.468
> >   6,      0,      0,    6,               0.435
> >   6,      0,      6,    6,               0.444
> >   6,      0,      0,    7,                0.41
> >   6,      0,      7,    7,                0.42
> >   6,      0,      0,    8,               0.474
> >   6,      0,      8,    8,               0.501
> >   6,      0,      0,    9,               0.471
> >   6,      0,      9,    9,               0.489
> >   6,      0,      0,   10,               0.462
> >   6,      0,     10,   10,                0.46
> >   6,      0,      0,   11,               0.459
> >   6,      0,     11,   11,               0.458
> >   6,      0,      0,   12,               0.516
> >   6,      0,     12,   12,                0.51
> >   6,      0,      0,   13,               0.494
> >   6,      0,     13,   13,               0.524
> >   6,      0,      0,   14,               0.486
> >   6,      0,     14,   14,                 0.5
> >   6,      0,      0,   15,                0.48
> >   6,      0,     15,   15,               0.501
> >   6,      0,      0,   16,                0.54
> >   6,      0,     16,   16,               0.538
> >   6,      0,      0,   17,               0.503
> >   6,      0,     17,   17,               0.541
> >   6,      0,      0,   18,               0.537
> >   6,      0,     18,   18,               0.549
> >   6,      0,      0,   19,               0.527
> >   6,      0,     19,   19,               0.537
> >   6,      0,      0,   20,               0.539
> >   6,      0,     20,   20,               0.554
> >   6,      0,      0,   21,               0.558
> >   6,      0,     21,   21,               0.541
> >   6,      0,      0,   22,               0.546
> >   6,      0,     22,   22,               0.561
> >   6,      0,      0,   23,                0.54
> >   6,      0,     23,   23,               0.536
> >   6,      0,      0,   24,               0.565
> >   6,      0,     24,   24,               0.584
> >   6,      0,      0,   25,               0.563
> >   6,      0,     25,   25,                0.58
> >   6,      0,      0,   26,               0.555
> >   6,      0,     26,   26,               0.584
> >   6,      0,      0,   27,               0.569
> >   6,      0,     27,   27,               0.587
> >   6,      0,      0,   28,               0.612
> >   6,      0,     28,   28,               0.623
> >   6,      0,      0,   29,               0.604
> >   6,      0,     29,   29,               0.621
> >   6,      0,      0,   30,                0.59
> >   6,      0,     30,   30,               0.609
> >   6,      0,      0,   31,               0.577
> >   6,      0,     31,   31,               0.588
> >   6,      0,      0,   32,               0.621
> >   6,      0,     32,   32,               0.608
> >   6,      0,      0,   33,               0.601
> >   6,      0,     33,   33,               0.623
> >   6,      0,      0,   34,               0.614
> >   6,      0,     34,   34,               0.615
> >   6,      0,      0,   35,               0.598
> >   6,      0,     35,   35,               0.608
> >   6,      0,      0,   36,               0.626
> >   6,      0,     36,   36,               0.634
> >   6,      0,      0,   37,                0.62
> >   6,      0,     37,   37,               0.634
> >   6,      0,      0,   38,               0.612
> >   6,      0,     38,   38,               0.637
> >   6,      0,      0,   39,               0.627
> >   6,      0,     39,   39,               0.612
> >   6,      0,      0,   40,               0.661
> >   6,      0,     40,   40,               0.674
> >   6,      0,      0,   41,               0.633
> >   6,      0,     41,   41,               0.643
> >   6,      0,      0,   42,               0.634
> >   6,      0,     42,   42,               0.636
> >   6,      0,      0,   43,               0.619
> >   6,      0,     43,   43,               0.625
> >   6,      0,      0,   44,               0.654
> >   6,      0,     44,   44,               0.654
> >   6,      0,      0,   45,               0.647
> >   6,      0,     45,   45,               0.649
> >   6,      0,      0,   46,               0.651
> >   6,      0,     46,   46,               0.651
> >   6,      0,      0,   47,               0.646
> >   6,      0,     47,   47,               0.648
> >   6,      0,      0,   48,               0.662
> >   6,      0,     48,   48,               0.664
> >   6,      0,      0,   49,                0.68
> >   6,      0,     49,   49,               0.667
> >   6,      0,      0,   50,               0.654
> >   6,      0,     50,   50,               0.659
> >   6,      0,      0,   51,               0.638
> >   6,      0,     51,   51,               0.639
> >   6,      0,      0,   52,               0.665
> >   6,      0,     52,   52,               0.669
> >   6,      0,      0,   53,               0.658
> >   6,      0,     53,   53,               0.656
> >   6,      0,      0,   54,               0.669
> >   6,      0,     54,   54,                0.67
> >   6,      0,      0,   55,               0.668
> >   6,      0,     55,   55,               0.664
> >   6,      0,      0,   56,               0.701
> >   6,      0,     56,   56,               0.695
> >   6,      0,      0,   57,               0.687
> >   6,      0,     57,   57,               0.696
> >   6,      0,      0,   58,               0.693
> >   6,      0,     58,   58,               0.704
> >   6,      0,      0,   59,               0.695
> >   6,      0,     59,   59,               0.708
> >   6,      0,      0,   60,               0.708
> >   6,      0,     60,   60,               0.728
> >   6,      0,      0,   61,               0.708
> >   6,      0,     61,   61,                0.71
> >   6,      0,      0,   62,               0.715
> >   6,      0,     62,   62,               0.705
> >   6,      0,      0,   63,               0.677
> >   6,      0,     63,   63,               0.702
> >
> >  .../{strcspn-sse2.S => strcspn-sse2.c}        |   8 +-
> >  sysdeps/x86_64/strcspn.S                      | 119 ------------------
> >  2 files changed, 4 insertions(+), 123 deletions(-)
> >  rename sysdeps/x86_64/multiarch/{strcspn-sse2.S => strcspn-sse2.c} (85%)
> >  delete mode 100644 sysdeps/x86_64/strcspn.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcspn-sse2.S b/sysdeps/x86_64/multiarch/strcspn-sse2.c
> > similarity index 85%
> > rename from sysdeps/x86_64/multiarch/strcspn-sse2.S
> > rename to sysdeps/x86_64/multiarch/strcspn-sse2.c
> > index f97e856e1f..3a04bb39fc 100644
> > --- a/sysdeps/x86_64/multiarch/strcspn-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strcspn-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strcspn optimized with SSE2.
> > +/* strcspn.
> >     Copyright (C) 2017-2022 Free Software Foundation, Inc.
> >     This file is part of the GNU C Library.
> >
> > @@ -19,10 +19,10 @@
> >  #if IS_IN (libc)
> >
> >  # include <sysdep.h>
> > -# define strcspn __strcspn_sse2
> > +# define STRCSPN __strcspn_sse2
> >
> >  # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strcspn)
> > +# define libc_hidden_builtin_def(STRCSPN)
> >  #endif
> >
> > -#include <sysdeps/x86_64/strcspn.S>
> > +#include <string/strcspn.c>
> > diff --git a/sysdeps/x86_64/strcspn.S b/sysdeps/x86_64/strcspn.S
> > deleted file mode 100644
> > index f3cd86c606..0000000000
> > --- a/sysdeps/x86_64/strcspn.S
> > +++ /dev/null
> > @@ -1,119 +0,0 @@
> > -/* strcspn (str, ss) -- Return the length of the initial segment of STR
> > -                       which contains no characters from SS.
> > -   For AMD x86-64.
> > -   Copyright (C) 1994-2022 Free Software Foundation, Inc.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <https://www.gnu.org/licenses/>.  */
> > -
> > -#include <sysdep.h>
> > -#include "asm-syntax.h"
> > -
> > -       .text
> > -ENTRY (strcspn)
> > -
> > -       movq %rdi, %rdx         /* Save SRC.  */
> > -
> > -       /* First we create a table with flags for all possible characters.
> > -          For the ASCII (7bit/8bit) or ISO-8859-X character sets which are
> > -          supported by the C string functions we have 256 characters.
> > -          Before inserting marks for the stop characters we clear the whole
> > -          table.  */
> > -       movq %rdi, %r8                  /* Save value.  */
> > -       subq $256, %rsp                 /* Make space for 256 bytes.  */
> > -       cfi_adjust_cfa_offset(256)
> > -       movl $32,  %ecx                 /* 32*8 bytes = 256 bytes.  */
> > -       movq %rsp, %rdi
> > -       xorl %eax, %eax                 /* We store 0s.  */
> > -       cld
> > -       rep
> > -       stosq
> > -
> > -       movq %rsi, %rax                 /* Setup skipset.  */
> > -
> > -/* For understanding the following code remember that %rcx == 0 now.
> > -   Although all the following instruction only modify %cl we always
> > -   have a correct zero-extended 64-bit value in %rcx.  */
> > -
> > -       .p2align 4
> > -L(2):  movb (%rax), %cl        /* get byte from skipset */
> > -       testb %cl, %cl          /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in skipset table */
> > -
> > -       movb 1(%rax), %cl       /* get byte from skipset */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in skipset table */
> > -
> > -       movb 2(%rax), %cl       /* get byte from skipset */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in skipset table */
> > -
> > -       movb 3(%rax), %cl       /* get byte from skipset */
> > -       addq $4, %rax           /* increment skipset pointer */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in skipset table */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jnz L(2)                /* no => process next dword from skipset */
> > -
> > -L(1):  leaq -4(%rdx), %rax     /* prepare loop */
> > -
> > -       /* We use a neat trick for the following loop.  Normally we would
> > -          have to test for two termination conditions
> > -          1. a character in the skipset was found
> > -          and
> > -          2. the end of the string was found
> > -          But as a sign that the character is in the skipset we store its
> > -          value in the table.  But the value of NUL is NUL so the loop
> > -          terminates for NUL in every case.  */
> > -
> > -       .p2align 4
> > -L(3):  addq $4, %rax           /* adjust pointer for full loop round */
> > -
> > -       movb (%rax), %cl        /* get byte from string */
> > -       cmpb %cl, (%rsp,%rcx)   /* is it contained in skipset? */
> > -       je L(4)                 /* yes => return */
> > -
> > -       movb 1(%rax), %cl       /* get byte from string */
> > -       cmpb %cl, (%rsp,%rcx)   /* is it contained in skipset? */
> > -       je L(5)                 /* yes => return */
> > -
> > -       movb 2(%rax), %cl       /* get byte from string */
> > -       cmpb %cl, (%rsp,%rcx)   /* is it contained in skipset? */
> > -       jz L(6)                 /* yes => return */
> > -
> > -       movb 3(%rax), %cl       /* get byte from string */
> > -       cmpb %cl, (%rsp,%rcx)   /* is it contained in skipset? */
> > -       jne L(3)                /* no => start loop again */
> > -
> > -       incq %rax               /* adjust pointer */
> > -L(6):  incq %rax
> > -L(5):  incq %rax
> > -
> > -L(4):  addq $256, %rsp         /* remove skipset */
> > -       cfi_adjust_cfa_offset(-256)
> > -#ifdef USE_AS_STRPBRK
> > -       xorl %edx,%edx
> > -       orb %cl, %cl            /* was last character NUL? */
> > -       cmovzq %rdx, %rax       /* Yes: return NULL */
> > -#else
> > -       subq %rdx, %rax         /* we have to return the number of valid
> > -                                  characters, so compute distance to first
> > -                                  non-valid character */
> > -#endif
> > -       ret
> > -END (strcspn)
> > -libc_hidden_builtin_def (strcspn)
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread
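
The comments in the removed strcspn.S describe the classic table-driven approach: fill a 256-entry table marking every byte of the reject set, then scan the string until a byte whose table slot matches is reached; because the table entry for NUL compares equal to NUL itself, the terminator ends the scan without a separate end-of-string test. A compact C sketch of the same idea (marking slot 0 explicitly instead of relying on that comparison trick; an illustration only, not the actual string/strcspn.c code, which is unrolled and tuned):

#include <stddef.h>

/* Table-driven strcspn: length of the initial segment of S containing
   no byte from REJECT.  Slot 0 stays set, so the NUL terminator stops
   the scan for free.  */
size_t
table_strcspn (const char *s, const char *reject)
{
  unsigned char table[256] = { 0 };
  table[0] = 1;                        /* NUL always terminates.  */
  for (const unsigned char *r = (const unsigned char *) reject; *r != '\0'; ++r)
    table[*r] = 1;

  const unsigned char *p = (const unsigned char *) s;
  while (table[*p] == 0)
    ++p;
  return (size_t) (p - (const unsigned char *) s);
}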

* Re: [PATCH v1 10/23] x86: Remove strpbrk-sse2.S and use the generic implementation
       [not found]   ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>
@ 2022-05-12 19:41     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:41 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster (see strcspn commit).
> >
> > All string/memory tests pass.
> > ---
> >  .../x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c}  | 9 ++++-----
> >  sysdeps/x86_64/strpbrk.S                                 | 3 ---
> >  2 files changed, 4 insertions(+), 8 deletions(-)
> >  rename sysdeps/x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c} (84%)
> >  delete mode 100644 sysdeps/x86_64/strpbrk.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strpbrk-sse2.S b/sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > similarity index 84%
> > rename from sysdeps/x86_64/multiarch/strpbrk-sse2.S
> > rename to sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > index d537b6c27b..d03214c4fb 100644
> > --- a/sysdeps/x86_64/multiarch/strpbrk-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strpbrk-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strpbrk optimized with SSE2.
> > +/* strpbrk.
> >     Copyright (C) 2017-2022 Free Software Foundation, Inc.
> >     This file is part of the GNU C Library.
> >
> > @@ -19,11 +19,10 @@
> >  #if IS_IN (libc)
> >
> >  # include <sysdep.h>
> > -# define strcspn __strpbrk_sse2
> > +# define STRPBRK __strpbrk_sse2
> >
> >  # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strpbrk)
> > +# define libc_hidden_builtin_def(STRPBRK)
> >  #endif
> >
> > -#define USE_AS_STRPBRK
> > -#include <sysdeps/x86_64/strcspn.S>
> > +#include <string/strpbrk.c>
> > diff --git a/sysdeps/x86_64/strpbrk.S b/sysdeps/x86_64/strpbrk.S
> > deleted file mode 100644
> > index 21888a5b92..0000000000
> > --- a/sysdeps/x86_64/strpbrk.S
> > +++ /dev/null
> > @@ -1,3 +0,0 @@
> > -#define strcspn strpbrk
> > -#define USE_AS_STRPBRK
> > -#include <sysdeps/x86_64/strcspn.S>
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread
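
The removed strpbrk.S was only a thin wrapper: it built strpbrk out of the strcspn code, with USE_AS_STRPBRK switching the epilogue from returning a length to returning a pointer (or NULL when the scan stopped at the terminator). A short sketch of that relationship in C (the wrapper name is made up for illustration):

#include <string.h>
#include <stddef.h>

/* strpbrk expressed in terms of strcspn: skip the prefix containing no
   byte from ACCEPT, then return a pointer to the match, or NULL if the
   scan stopped at the NUL terminator.  */
char *
strpbrk_via_strcspn (const char *s, const char *accept)
{
  s += strcspn (s, accept);
  return *s != '\0' ? (char *) s : NULL;
}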

* Re: [PATCH v1 11/23] x86: Remove strspn-sse2.S and use the generic implementation
       [not found]   ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>
@ 2022-05-12 19:42     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:42 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The generic implementation is faster.
> >
> > geometric_mean(N=20) of all benchmarks New / Original: .710
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=20 runs; All functions page aligned
> > len, align1, align2,  pos, New Time / Old Time
> >   1,      0,      0,  512,               0.824
> >   1,      1,      0,  512,               1.018
> >   1,      0,      1,  512,               0.986
> >   1,      1,      1,  512,               1.092
> >   2,      0,      0,  512,                0.86
> >   2,      2,      0,  512,               0.868
> >   2,      0,      2,  512,               0.858
> >   2,      2,      2,  512,               0.857
> >   3,      0,      0,  512,               0.836
> >   3,      3,      0,  512,               0.849
> >   3,      0,      3,  512,                0.84
> >   3,      3,      3,  512,                0.85
> >   4,      0,      0,  512,               0.843
> >   4,      4,      0,  512,               0.837
> >   4,      0,      4,  512,               0.835
> >   4,      4,      4,  512,               0.846
> >   5,      0,      0,  512,               0.852
> >   5,      5,      0,  512,               0.848
> >   5,      0,      5,  512,                0.85
> >   5,      5,      5,  512,                0.85
> >   6,      0,      0,  512,               0.853
> >   6,      6,      0,  512,               0.855
> >   6,      0,      6,  512,               0.853
> >   6,      6,      6,  512,               0.853
> >   7,      0,      0,  512,               0.857
> >   7,      7,      0,  512,               0.861
> >   7,      0,      7,  512,                0.94
> >   7,      7,      7,  512,               0.856
> >   8,      0,      0,  512,               0.927
> >   8,      0,      8,  512,               0.965
> >   9,      0,      0,  512,               0.967
> >   9,      1,      0,  512,               0.976
> >   9,      0,      9,  512,               0.887
> >   9,      1,      9,  512,               0.881
> >  10,      0,      0,  512,               0.853
> >  10,      2,      0,  512,               0.846
> >  10,      0,     10,  512,               0.855
> >  10,      2,     10,  512,               0.849
> >  11,      0,      0,  512,               0.854
> >  11,      3,      0,  512,               0.855
> >  11,      0,     11,  512,                0.85
> >  11,      3,     11,  512,               0.854
> >  12,      0,      0,  512,               0.864
> >  12,      4,      0,  512,               0.864
> >  12,      0,     12,  512,               0.867
> >  12,      4,     12,  512,                0.87
> >  13,      0,      0,  512,               0.853
> >  13,      5,      0,  512,               0.841
> >  13,      0,     13,  512,               0.837
> >  13,      5,     13,  512,                0.85
> >  14,      0,      0,  512,               0.838
> >  14,      6,      0,  512,               0.842
> >  14,      0,     14,  512,               0.818
> >  14,      6,     14,  512,               0.845
> >  15,      0,      0,  512,               0.799
> >  15,      7,      0,  512,               0.847
> >  15,      0,     15,  512,               0.787
> >  15,      7,     15,  512,                0.84
> >  16,      0,      0,  512,               0.824
> >  16,      0,     16,  512,               0.827
> >  17,      0,      0,  512,               0.817
> >  17,      1,      0,  512,               0.823
> >  17,      0,     17,  512,                0.82
> >  17,      1,     17,  512,               0.814
> >  18,      0,      0,  512,                0.81
> >  18,      2,      0,  512,               0.833
> >  18,      0,     18,  512,               0.811
> >  18,      2,     18,  512,               0.842
> >  19,      0,      0,  512,               0.823
> >  19,      3,      0,  512,               0.818
> >  19,      0,     19,  512,               0.821
> >  19,      3,     19,  512,               0.824
> >  20,      0,      0,  512,               0.814
> >  20,      4,      0,  512,               0.818
> >  20,      0,     20,  512,               0.806
> >  20,      4,     20,  512,               0.802
> >  21,      0,      0,  512,               0.835
> >  21,      5,      0,  512,               0.839
> >  21,      0,     21,  512,               0.842
> >  21,      5,     21,  512,                0.82
> >  22,      0,      0,  512,               0.824
> >  22,      6,      0,  512,               0.831
> >  22,      0,     22,  512,               0.819
> >  22,      6,     22,  512,               0.824
> >  23,      0,      0,  512,               0.816
> >  23,      7,      0,  512,               0.856
> >  23,      0,     23,  512,               0.808
> >  23,      7,     23,  512,               0.848
> >  24,      0,      0,  512,                0.88
> >  24,      0,     24,  512,               0.846
> >  25,      0,      0,  512,               0.929
> >  25,      1,      0,  512,               0.917
> >  25,      0,     25,  512,               0.884
> >  25,      1,     25,  512,               0.859
> >  26,      0,      0,  512,               0.919
> >  26,      2,      0,  512,               0.867
> >  26,      0,     26,  512,               0.914
> >  26,      2,     26,  512,               0.845
> >  27,      0,      0,  512,               0.919
> >  27,      3,      0,  512,               0.864
> >  27,      0,     27,  512,               0.917
> >  27,      3,     27,  512,               0.847
> >  28,      0,      0,  512,               0.905
> >  28,      4,      0,  512,               0.896
> >  28,      0,     28,  512,               0.898
> >  28,      4,     28,  512,               0.871
> >  29,      0,      0,  512,               0.911
> >  29,      5,      0,  512,                0.91
> >  29,      0,     29,  512,               0.905
> >  29,      5,     29,  512,               0.884
> >  30,      0,      0,  512,               0.907
> >  30,      6,      0,  512,               0.802
> >  30,      0,     30,  512,               0.906
> >  30,      6,     30,  512,               0.818
> >  31,      0,      0,  512,               0.907
> >  31,      7,      0,  512,               0.821
> >  31,      0,     31,  512,                0.89
> >  31,      7,     31,  512,               0.787
> >   4,      0,      0,   32,               0.623
> >   4,      1,      0,   32,               0.606
> >   4,      0,      1,   32,                 0.6
> >   4,      1,      1,   32,               0.603
> >   4,      0,      0,   64,               0.731
> >   4,      2,      0,   64,               0.733
> >   4,      0,      2,   64,               0.734
> >   4,      2,      2,   64,               0.755
> >   4,      0,      0,  128,               0.822
> >   4,      3,      0,  128,               0.873
> >   4,      0,      3,  128,                0.89
> >   4,      3,      3,  128,               0.907
> >   4,      0,      0,  256,               0.827
> >   4,      4,      0,  256,               0.811
> >   4,      0,      4,  256,               0.794
> >   4,      4,      4,  256,               0.814
> >   4,      5,      0,  512,               0.841
> >   4,      0,      5,  512,               0.831
> >   4,      5,      5,  512,               0.845
> >   4,      0,      0, 1024,               0.861
> >   4,      6,      0, 1024,               0.857
> >   4,      0,      6, 1024,                 0.9
> >   4,      6,      6, 1024,               0.861
> >   4,      0,      0, 2048,               0.879
> >   4,      7,      0, 2048,               0.875
> >   4,      0,      7, 2048,               0.883
> >   4,      7,      7, 2048,                0.88
> >  10,      1,      0,   64,               0.747
> >  10,      1,      1,   64,               0.743
> >  10,      2,      0,   64,               0.732
> >  10,      2,      2,   64,               0.729
> >  10,      3,      0,   64,               0.747
> >  10,      3,      3,   64,               0.733
> >  10,      4,      0,   64,                0.74
> >  10,      4,      4,   64,               0.751
> >  10,      5,      0,   64,               0.735
> >  10,      5,      5,   64,               0.746
> >  10,      6,      0,   64,               0.735
> >  10,      6,      6,   64,               0.733
> >  10,      7,      0,   64,               0.734
> >  10,      7,      7,   64,                0.74
> >   6,      0,      0,    0,               0.377
> >   6,      0,      0,    1,               0.369
> >   6,      0,      1,    1,               0.383
> >   6,      0,      0,    2,               0.391
> >   6,      0,      2,    2,               0.394
> >   6,      0,      0,    3,               0.416
> >   6,      0,      3,    3,               0.411
> >   6,      0,      0,    4,               0.475
> >   6,      0,      4,    4,               0.483
> >   6,      0,      0,    5,               0.473
> >   6,      0,      5,    5,               0.476
> >   6,      0,      0,    6,               0.459
> >   6,      0,      6,    6,               0.445
> >   6,      0,      0,    7,               0.433
> >   6,      0,      7,    7,               0.432
> >   6,      0,      0,    8,               0.492
> >   6,      0,      8,    8,               0.494
> >   6,      0,      0,    9,               0.476
> >   6,      0,      9,    9,               0.483
> >   6,      0,      0,   10,                0.46
> >   6,      0,     10,   10,               0.476
> >   6,      0,      0,   11,               0.463
> >   6,      0,     11,   11,               0.463
> >   6,      0,      0,   12,               0.511
> >   6,      0,     12,   12,               0.515
> >   6,      0,      0,   13,               0.506
> >   6,      0,     13,   13,               0.536
> >   6,      0,      0,   14,               0.496
> >   6,      0,     14,   14,               0.484
> >   6,      0,      0,   15,               0.473
> >   6,      0,     15,   15,               0.475
> >   6,      0,      0,   16,               0.534
> >   6,      0,     16,   16,               0.534
> >   6,      0,      0,   17,               0.525
> >   6,      0,     17,   17,               0.523
> >   6,      0,      0,   18,               0.522
> >   6,      0,     18,   18,               0.524
> >   6,      0,      0,   19,               0.512
> >   6,      0,     19,   19,               0.514
> >   6,      0,      0,   20,               0.535
> >   6,      0,     20,   20,                0.54
> >   6,      0,      0,   21,               0.543
> >   6,      0,     21,   21,               0.536
> >   6,      0,      0,   22,               0.542
> >   6,      0,     22,   22,               0.542
> >   6,      0,      0,   23,               0.529
> >   6,      0,     23,   23,                0.53
> >   6,      0,      0,   24,               0.596
> >   6,      0,     24,   24,               0.589
> >   6,      0,      0,   25,               0.583
> >   6,      0,     25,   25,                0.58
> >   6,      0,      0,   26,               0.574
> >   6,      0,     26,   26,                0.58
> >   6,      0,      0,   27,               0.575
> >   6,      0,     27,   27,               0.558
> >   6,      0,      0,   28,               0.606
> >   6,      0,     28,   28,               0.606
> >   6,      0,      0,   29,               0.589
> >   6,      0,     29,   29,               0.595
> >   6,      0,      0,   30,               0.592
> >   6,      0,     30,   30,               0.585
> >   6,      0,      0,   31,               0.585
> >   6,      0,     31,   31,               0.579
> >   6,      0,      0,   32,               0.625
> >   6,      0,     32,   32,               0.615
> >   6,      0,      0,   33,               0.615
> >   6,      0,     33,   33,                0.61
> >   6,      0,      0,   34,               0.604
> >   6,      0,     34,   34,                 0.6
> >   6,      0,      0,   35,               0.602
> >   6,      0,     35,   35,               0.608
> >   6,      0,      0,   36,               0.644
> >   6,      0,     36,   36,               0.644
> >   6,      0,      0,   37,               0.658
> >   6,      0,     37,   37,               0.651
> >   6,      0,      0,   38,               0.644
> >   6,      0,     38,   38,               0.649
> >   6,      0,      0,   39,               0.626
> >   6,      0,     39,   39,               0.632
> >   6,      0,      0,   40,               0.662
> >   6,      0,     40,   40,               0.661
> >   6,      0,      0,   41,               0.656
> >   6,      0,     41,   41,               0.655
> >   6,      0,      0,   42,               0.643
> >   6,      0,     42,   42,               0.637
> >   6,      0,      0,   43,               0.622
> >   6,      0,     43,   43,               0.628
> >   6,      0,      0,   44,               0.673
> >   6,      0,     44,   44,               0.687
> >   6,      0,      0,   45,               0.661
> >   6,      0,     45,   45,               0.659
> >   6,      0,      0,   46,               0.657
> >   6,      0,     46,   46,               0.653
> >   6,      0,      0,   47,               0.658
> >   6,      0,     47,   47,                0.65
> >   6,      0,      0,   48,               0.678
> >   6,      0,     48,   48,               0.683
> >   6,      0,      0,   49,               0.676
> >   6,      0,     49,   49,               0.661
> >   6,      0,      0,   50,               0.672
> >   6,      0,     50,   50,               0.662
> >   6,      0,      0,   51,               0.656
> >   6,      0,     51,   51,               0.659
> >   6,      0,      0,   52,               0.682
> >   6,      0,     52,   52,               0.686
> >   6,      0,      0,   53,                0.67
> >   6,      0,     53,   53,               0.674
> >   6,      0,      0,   54,               0.663
> >   6,      0,     54,   54,               0.675
> >   6,      0,      0,   55,               0.662
> >   6,      0,     55,   55,               0.665
> >   6,      0,      0,   56,               0.681
> >   6,      0,     56,   56,               0.697
> >   6,      0,      0,   57,               0.686
> >   6,      0,     57,   57,               0.687
> >   6,      0,      0,   58,               0.701
> >   6,      0,     58,   58,               0.693
> >   6,      0,      0,   59,               0.709
> >   6,      0,     59,   59,               0.698
> >   6,      0,      0,   60,               0.708
> >   6,      0,     60,   60,               0.708
> >   6,      0,      0,   61,               0.709
> >   6,      0,     61,   61,               0.716
> >   6,      0,      0,   62,               0.709
> >   6,      0,     62,   62,               0.707
> >   6,      0,      0,   63,               0.703
> >   6,      0,     63,   63,               0.716
> >
> >  .../{strspn-sse2.S => strspn-sse2.c}          |   8 +-
> >  sysdeps/x86_64/strspn.S                       | 112 ------------------
> >  2 files changed, 4 insertions(+), 116 deletions(-)
> >  rename sysdeps/x86_64/multiarch/{strspn-sse2.S => strspn-sse2.c} (86%)
> >  delete mode 100644 sysdeps/x86_64/strspn.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/strspn-sse2.S b/sysdeps/x86_64/multiarch/strspn-sse2.c
> > similarity index 86%
> > rename from sysdeps/x86_64/multiarch/strspn-sse2.S
> > rename to sysdeps/x86_64/multiarch/strspn-sse2.c
> > index e0a095f25a..61cc6cb0a5 100644
> > --- a/sysdeps/x86_64/multiarch/strspn-sse2.S
> > +++ b/sysdeps/x86_64/multiarch/strspn-sse2.c
> > @@ -1,4 +1,4 @@
> > -/* strspn optimized with SSE2.
> > +/* strspn.
> >     Copyright (C) 2017-2022 Free Software Foundation, Inc.
> >     This file is part of the GNU C Library.
> >
> > @@ -19,10 +19,10 @@
> >  #if IS_IN (libc)
> >
> >  # include <sysdep.h>
> > -# define strspn __strspn_sse2
> > +# define STRSPN __strspn_sse2
> >
> >  # undef libc_hidden_builtin_def
> > -# define libc_hidden_builtin_def(strspn)
> > +# define libc_hidden_builtin_def(STRSPN)
> >  #endif
> >
> > -#include <sysdeps/x86_64/strspn.S>
> > +#include <string/strspn.c>
> > diff --git a/sysdeps/x86_64/strspn.S b/sysdeps/x86_64/strspn.S
> > deleted file mode 100644
> > index 61b76ee0a1..0000000000
> > --- a/sysdeps/x86_64/strspn.S
> > +++ /dev/null
> > @@ -1,112 +0,0 @@
> > -/* strspn (str, ss) -- Return the length of the initial segment of STR
> > -                       which contains only characters from SS.
> > -   For AMD x86-64.
> > -   Copyright (C) 1994-2022 Free Software Foundation, Inc.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <https://www.gnu.org/licenses/>.  */
> > -
> > -#include <sysdep.h>
> > -
> > -       .text
> > -ENTRY (strspn)
> > -
> > -       movq %rdi, %rdx         /* Save SRC.  */
> > -
> > -       /* First we create a table with flags for all possible characters.
> > -          For the ASCII (7bit/8bit) or ISO-8859-X character sets which are
> > -          supported by the C string functions we have 256 characters.
> > -          Before inserting marks for the stop characters we clear the whole
> > -          table.  */
> > -       movq %rdi, %r8                  /* Save value.  */
> > -       subq $256, %rsp                 /* Make space for 256 bytes.  */
> > -       cfi_adjust_cfa_offset(256)
> > -       movl $32,  %ecx                 /* 32*8 bytes = 256 bytes.  */
> > -       movq %rsp, %rdi
> > -       xorl %eax, %eax                 /* We store 0s.  */
> > -       cld
> > -       rep
> > -       stosq
> > -
> > -       movq %rsi, %rax                 /* Setup stopset.  */
> > -
> > -/* For understanding the following code remember that %rcx == 0 now.
> > -   Although all the following instruction only modify %cl we always
> > -   have a correct zero-extended 64-bit value in %rcx.  */
> > -
> > -       .p2align 4
> > -L(2):  movb (%rax), %cl        /* get byte from stopset */
> > -       testb %cl, %cl          /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in stopset table */
> > -
> > -       movb 1(%rax), %cl       /* get byte from stopset */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in stopset table */
> > -
> > -       movb 2(%rax), %cl       /* get byte from stopset */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jz L(1)                 /* yes => start compare loop */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in stopset table */
> > -
> > -       movb 3(%rax), %cl       /* get byte from stopset */
> > -       addq $4, %rax           /* increment stopset pointer */
> > -       movb %cl, (%rsp,%rcx)   /* set corresponding byte in stopset table */
> > -       testb $0xff, %cl        /* is NUL char? */
> > -       jnz L(2)                /* no => process next dword from stopset */
> > -
> > -L(1):  leaq -4(%rdx), %rax     /* prepare loop */
> > -
> > -       /* We use a neat trick for the following loop.  Normally we would
> > -          have to test for two termination conditions
> > -          1. a character in the stopset was found
> > -          and
> > -          2. the end of the string was found
> > -          But as a sign that the character is in the stopset we store its
> > -          value in the table.  But the value of NUL is NUL so the loop
> > -          terminates for NUL in every case.  */
> > -
> > -       .p2align 4
> > -L(3):  addq $4, %rax           /* adjust pointer for full loop round */
> > -
> > -       movb (%rax), %cl        /* get byte from string */
> > -       testb %cl, (%rsp,%rcx)  /* is it contained in skipset? */
> > -       jz L(4)                 /* no => return */
> > -
> > -       movb 1(%rax), %cl       /* get byte from string */
> > -       testb %cl, (%rsp,%rcx)  /* is it contained in skipset? */
> > -       jz L(5)                 /* no => return */
> > -
> > -       movb 2(%rax), %cl       /* get byte from string */
> > -       testb %cl, (%rsp,%rcx)  /* is it contained in skipset? */
> > -       jz L(6)                 /* no => return */
> > -
> > -       movb 3(%rax), %cl       /* get byte from string */
> > -       testb %cl, (%rsp,%rcx)  /* is it contained in skipset? */
> > -       jnz L(3)                /* yes => start loop again */
> > -
> > -       incq %rax               /* adjust pointer */
> > -L(6):  incq %rax
> > -L(5):  incq %rax
> > -
> > -L(4):  addq $256, %rsp         /* remove stopset */
> > -       cfi_adjust_cfa_offset(-256)
> > -       subq %rdx, %rax         /* we have to return the number of valid
> > -                                  characters, so compute distance to first
> > -                                  non-valid character */
> > -       ret
> > -END (strspn)
> > -libc_hidden_builtin_def (strspn)
> > --
> > 2.25.1
> >
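
The deleted file's comments describe the classic table-based strspn: clear a
256-byte membership table on the stack, mark every byte of the accept set, and
then scan the string until a byte's table entry is zero; because slot 0 is
never marked, the terminating NUL stops the scan with no extra test.  Below is
a minimal C sketch of that idea (illustrative only -- the function name is made
up, and the generic string/strspn.c that now gets built instead is a separate,
portable implementation):

  #include <stddef.h>

  static size_t
  strspn_table_sketch (const char *s, const char *accept)
  {
    unsigned char table[256] = { 0 };   /* cleared table, like the rep stosq */
    const unsigned char *a = (const unsigned char *) accept;
    const unsigned char *p = (const unsigned char *) s;

    /* Mark the accept-set bytes; table[0] stays 0, so the scan below also
       terminates at the NUL that ends S.  */
    while (*a != '\0')
      table[*a++] = 1;

    while (table[*p] != 0)
      ++p;

    return (size_t) (p - (const unsigned char *) s);
  }
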
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S
       [not found]   ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>
@ 2022-05-12 19:44     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:44 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Slightly faster method of doing TOLOWER that saves an
> > instruction.
> >
> > Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
> > CET enabled this misaligned the entry to strcasecmp.
> >
> > geometric_mean(N=40) of all benchmarks New / Original: .894
> >
> > All string/memory tests pass.
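
A per-byte C sketch of the range trick the new TOLOWER macro applies to each
lane of the SSE registers (constants match the .Llcase_min/.Llcase_max/
.Lcase_add data added in the diff below; the function name is illustrative):

  #include <stdint.h>

  /* Adding 0x3f maps 'A'..'Z' (0x41..0x5a) to 0x80..0x99, the only byte
     values that are NOT signed-greater-than 0x99.  pandn then keeps the
     0x20 case bias exactly for those bytes, and a final add lowercases
     them; every other byte gets a bias of 0 and is left unchanged.  */
  static inline uint8_t
  tolower_byte_sketch (uint8_t c)
  {
    int8_t  shifted = (int8_t) (c + 0x3f);                /* paddb  lcase_min  */
    int     gt      = shifted > (int8_t) 0x99 ? -1 : 0;   /* pcmpgtb lcase_max */
    uint8_t bias    = (uint8_t) (~gt & 0x20);             /* pandn  case_add   */
    return (uint8_t) (c + bias);                          /* paddb             */
  }
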
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, New Time / Old Time
> >      1,      1,      1,      127,               0.903
> >      2,      2,      2,      127,               0.905
> >      3,      3,      3,      127,               0.877
> >      4,      4,      4,      127,               0.888
> >      5,      5,      5,      127,               0.901
> >      6,      6,      6,      127,               0.954
> >      7,      7,      7,      127,               0.932
> >      8,      0,      0,      127,               0.918
> >      9,      1,      1,      127,               0.914
> >     10,      2,      2,      127,               0.877
> >     11,      3,      3,      127,               0.909
> >     12,      4,      4,      127,               0.876
> >     13,      5,      5,      127,               0.886
> >     14,      6,      6,      127,               0.914
> >     15,      7,      7,      127,               0.939
> >      4,      0,      0,      127,               0.963
> >      4,      0,      0,      254,               0.943
> >      8,      0,      0,      254,               0.927
> >     16,      0,      0,      127,               0.876
> >     16,      0,      0,      254,               0.865
> >     32,      0,      0,      127,               0.865
> >     32,      0,      0,      254,               0.862
> >     64,      0,      0,      127,               0.863
> >     64,      0,      0,      254,               0.896
> >    128,      0,      0,      127,               0.885
> >    128,      0,      0,      254,               0.882
> >    256,      0,      0,      127,                0.87
> >    256,      0,      0,      254,               0.869
> >    512,      0,      0,      127,               0.832
> >    512,      0,      0,      254,               0.848
> >   1024,      0,      0,      127,               0.835
> >   1024,      0,      0,      254,               0.843
> >     16,      1,      2,      127,               0.914
> >     16,      2,      1,      254,               0.949
> >     32,      2,      4,      127,               0.955
> >     32,      4,      2,      254,               1.004
> >     64,      3,      6,      127,               0.844
> >     64,      6,      3,      254,               0.905
> >    128,      4,      0,      127,               0.889
> >    128,      0,      4,      254,               0.845
> >    256,      5,      2,      127,               0.929
> >    256,      2,      5,      254,               0.907
> >    512,      6,      4,      127,               0.837
> >    512,      4,      6,      254,               0.862
> >   1024,      7,      6,      127,               0.895
> >   1024,      6,      7,      254,                0.89
> >
> >  sysdeps/x86_64/strcmp.S | 64 +++++++++++++++++++----------------------
> >  1 file changed, 29 insertions(+), 35 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/strcmp.S b/sysdeps/x86_64/strcmp.S
> > index e2ab59c555..99d8b36f1d 100644
> > --- a/sysdeps/x86_64/strcmp.S
> > +++ b/sysdeps/x86_64/strcmp.S
> > @@ -75,9 +75,8 @@ ENTRY2 (__strcasecmp)
> >         movq    __libc_tsd_LOCALE@gottpoff(%rip),%rax
> >         mov     %fs:(%rax),%RDX_LP
> >
> > -       // XXX 5 byte should be before the function
> > -       /* 5-byte NOP.  */
> > -       .byte   0x0f,0x1f,0x44,0x00,0x00
> > +       /* Either 1 or 5 bytes (dependeing if CET is enabled).  */
> > +       .p2align 4
> >  END2 (__strcasecmp)
> >  # ifndef NO_NOLOCALE_ALIAS
> >  weak_alias (__strcasecmp, strcasecmp)
> > @@ -94,9 +93,8 @@ ENTRY2 (__strncasecmp)
> >         movq    __libc_tsd_LOCALE@gottpoff(%rip),%rax
> >         mov     %fs:(%rax),%RCX_LP
> >
> > -       // XXX 5 byte should be before the function
> > -       /* 5-byte NOP.  */
> > -       .byte   0x0f,0x1f,0x44,0x00,0x00
> > +       /* Either 1 or 5 bytes (dependeing if CET is enabled).  */
> > +       .p2align 4
> >  END2 (__strncasecmp)
> >  # ifndef NO_NOLOCALE_ALIAS
> >  weak_alias (__strncasecmp, strncasecmp)
> > @@ -146,22 +144,22 @@ ENTRY (STRCMP)
> >  #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> >         .section .rodata.cst16,"aM",@progbits,16
> >         .align 16
> > -.Lbelowupper:
> > -       .quad   0x4040404040404040
> > -       .quad   0x4040404040404040
> > -.Ltopupper:
> > -       .quad   0x5b5b5b5b5b5b5b5b
> > -       .quad   0x5b5b5b5b5b5b5b5b
> > -.Ltouppermask:
> > +.Llcase_min:
> > +       .quad   0x3f3f3f3f3f3f3f3f
> > +       .quad   0x3f3f3f3f3f3f3f3f
> > +.Llcase_max:
> > +       .quad   0x9999999999999999
> > +       .quad   0x9999999999999999
> > +.Lcase_add:
> >         .quad   0x2020202020202020
> >         .quad   0x2020202020202020
> >         .previous
> > -       movdqa  .Lbelowupper(%rip), %xmm5
> > -# define UCLOW_reg %xmm5
> > -       movdqa  .Ltopupper(%rip), %xmm6
> > -# define UCHIGH_reg %xmm6
> > -       movdqa  .Ltouppermask(%rip), %xmm7
> > -# define LCQWORD_reg %xmm7
> > +       movdqa  .Llcase_min(%rip), %xmm5
> > +# define LCASE_MIN_reg %xmm5
> > +       movdqa  .Llcase_max(%rip), %xmm6
> > +# define LCASE_MAX_reg %xmm6
> > +       movdqa  .Lcase_add(%rip), %xmm7
> > +# define CASE_ADD_reg %xmm7
> >  #endif
> >         cmp     $0x30, %ecx
> >         ja      LABEL(crosscache)       /* rsi: 16-byte load will cross cache line */
> > @@ -172,22 +170,18 @@ ENTRY (STRCMP)
> >         movhpd  8(%rdi), %xmm1
> >         movhpd  8(%rsi), %xmm2
> >  #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > -# define TOLOWER(reg1, reg2) \
> > -       movdqa  reg1, %xmm8;                                    \
> > -       movdqa  UCHIGH_reg, %xmm9;                              \
> > -       movdqa  reg2, %xmm10;                                   \
> > -       movdqa  UCHIGH_reg, %xmm11;                             \
> > -       pcmpgtb UCLOW_reg, %xmm8;                               \
> > -       pcmpgtb reg1, %xmm9;                                    \
> > -       pcmpgtb UCLOW_reg, %xmm10;                              \
> > -       pcmpgtb reg2, %xmm11;                                   \
> > -       pand    %xmm9, %xmm8;                                   \
> > -       pand    %xmm11, %xmm10;                                 \
> > -       pand    LCQWORD_reg, %xmm8;                             \
> > -       pand    LCQWORD_reg, %xmm10;                            \
> > -       por     %xmm8, reg1;                                    \
> > -       por     %xmm10, reg2
> > -       TOLOWER (%xmm1, %xmm2)
> > +#  define TOLOWER(reg1, reg2) \
> > +       movdqa  LCASE_MIN_reg, %xmm8;                                   \
> > +       movdqa  LCASE_MIN_reg, %xmm9;                                   \
> > +       paddb   reg1, %xmm8;                                    \
> > +       paddb   reg2, %xmm9;                                    \
> > +       pcmpgtb LCASE_MAX_reg, %xmm8;                           \
> > +       pcmpgtb LCASE_MAX_reg, %xmm9;                           \
> > +       pandn   CASE_ADD_reg, %xmm8;                                    \
> > +       pandn   CASE_ADD_reg, %xmm9;                                    \
> > +       paddb   %xmm8, reg1;                                    \
> > +       paddb   %xmm9, reg2
> > +       TOLOWER (%xmm1, %xmm2)
> >  #else
> >  # define TOLOWER(reg1, reg2)
> >  #endif
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S
       [not found]   ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>
@ 2022-05-12 19:45     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:45 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:02 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Slightly faster method of doing TOLOWER that saves an
> > instruction.
> >
> > Also replace the hard-coded 5-byte nop with .p2align 4. On builds with
> > CET enabled this misaligned the entry to strcasecmp.
> >
> > geometric_mean(N=40) of all benchmarks New / Original: .920
> >
> > All string/memory tests pass.
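
The packed form of the same trick, written as SSE2 intrinsics, gives a rough
model of the rewritten non-AVX TOLOWER macro (a sketch for illustration only;
the real macro works on two string registers at once inside the strcmp loop
and keeps the three constants in fixed xmm registers):

  #include <emmintrin.h>

  static inline __m128i
  tolower16_sketch (__m128i v)
  {
    const __m128i lcase_min = _mm_set1_epi8 (0x3f);
    const __m128i lcase_max = _mm_set1_epi8 ((char) 0x99);
    const __m128i case_add  = _mm_set1_epi8 (0x20);

    __m128i t = _mm_add_epi8 (v, lcase_min);   /* paddb: 'A'..'Z' -> 0x80..0x99 */
    t = _mm_cmpgt_epi8 (t, lcase_max);         /* pcmpgtb: 0 only for 'A'..'Z'  */
    t = _mm_andnot_si128 (t, case_add);        /* pandn: keep 0x20 where 0      */
    return _mm_add_epi8 (v, t);                /* paddb: add the case bias      */
  }
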
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, New Time / Old Time
> >      1,      1,      1,      127,               0.914
> >      2,      2,      2,      127,               0.952
> >      3,      3,      3,      127,               0.924
> >      4,      4,      4,      127,               0.995
> >      5,      5,      5,      127,               0.985
> >      6,      6,      6,      127,               1.017
> >      7,      7,      7,      127,               1.031
> >      8,      0,      0,      127,               0.967
> >      9,      1,      1,      127,               0.969
> >     10,      2,      2,      127,               0.951
> >     11,      3,      3,      127,               0.938
> >     12,      4,      4,      127,               0.937
> >     13,      5,      5,      127,               0.967
> >     14,      6,      6,      127,               0.941
> >     15,      7,      7,      127,               0.951
> >      4,      0,      0,      127,               0.959
> >      4,      0,      0,      254,                0.98
> >      8,      0,      0,      254,               0.959
> >     16,      0,      0,      127,               0.895
> >     16,      0,      0,      254,               0.901
> >     32,      0,      0,      127,                0.85
> >     32,      0,      0,      254,               0.851
> >     64,      0,      0,      127,               0.897
> >     64,      0,      0,      254,               0.895
> >    128,      0,      0,      127,               0.944
> >    128,      0,      0,      254,               0.935
> >    256,      0,      0,      127,               0.922
> >    256,      0,      0,      254,               0.913
> >    512,      0,      0,      127,               0.921
> >    512,      0,      0,      254,               0.914
> >   1024,      0,      0,      127,               0.845
> >   1024,      0,      0,      254,                0.84
> >     16,      1,      2,      127,               0.923
> >     16,      2,      1,      254,               0.955
> >     32,      2,      4,      127,               0.979
> >     32,      4,      2,      254,               0.957
> >     64,      3,      6,      127,               0.866
> >     64,      6,      3,      254,               0.849
> >    128,      4,      0,      127,               0.882
> >    128,      0,      4,      254,               0.876
> >    256,      5,      2,      127,               0.877
> >    256,      2,      5,      254,               0.882
> >    512,      6,      4,      127,               0.822
> >    512,      4,      6,      254,               0.862
> >   1024,      7,      6,      127,               0.903
> >   1024,      6,      7,      254,               0.908
> >
> >  sysdeps/x86_64/multiarch/strcmp-sse42.S | 83 +++++++++++--------------
> >  1 file changed, 35 insertions(+), 48 deletions(-)
> >
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > index 580feb90e9..7805ae9d41 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > @@ -88,9 +88,8 @@ ENTRY (GLABEL(__strcasecmp))
> >         movq    __libc_tsd_LOCALE@gottpoff(%rip),%rax
> >         mov     %fs:(%rax),%RDX_LP
> >
> > -       // XXX 5 byte should be before the function
> > -       /* 5-byte NOP.  */
> > -       .byte   0x0f,0x1f,0x44,0x00,0x00
> > +       /* Either 1 or 5 bytes (dependeing if CET is enabled).  */
> > +       .p2align 4
> >  END (GLABEL(__strcasecmp))
> >         /* FALLTHROUGH to strcasecmp_l.  */
> >  #endif
> > @@ -99,9 +98,8 @@ ENTRY (GLABEL(__strncasecmp))
> >         movq    __libc_tsd_LOCALE@gottpoff(%rip),%rax
> >         mov     %fs:(%rax),%RCX_LP
> >
> > -       // XXX 5 byte should be before the function
> > -       /* 5-byte NOP.  */
> > -       .byte   0x0f,0x1f,0x44,0x00,0x00
> > +       /* Either 1 or 5 bytes (dependeing if CET is enabled).  */
> > +       .p2align 4
> >  END (GLABEL(__strncasecmp))
> >         /* FALLTHROUGH to strncasecmp_l.  */
> >  #endif
> > @@ -169,27 +167,22 @@ STRCMP_SSE42:
> >  #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> >         .section .rodata.cst16,"aM",@progbits,16
> >         .align 16
> > -LABEL(belowupper):
> > -       .quad   0x4040404040404040
> > -       .quad   0x4040404040404040
> > -LABEL(topupper):
> > -# ifdef USE_AVX
> > -       .quad   0x5a5a5a5a5a5a5a5a
> > -       .quad   0x5a5a5a5a5a5a5a5a
> > -# else
> > -       .quad   0x5b5b5b5b5b5b5b5b
> > -       .quad   0x5b5b5b5b5b5b5b5b
> > -# endif
> > -LABEL(touppermask):
> > +LABEL(lcase_min):
> > +       .quad   0x3f3f3f3f3f3f3f3f
> > +       .quad   0x3f3f3f3f3f3f3f3f
> > +LABEL(lcase_max):
> > +       .quad   0x9999999999999999
> > +       .quad   0x9999999999999999
> > +LABEL(case_add):
> >         .quad   0x2020202020202020
> >         .quad   0x2020202020202020
> >         .previous
> > -       movdqa  LABEL(belowupper)(%rip), %xmm4
> > -# define UCLOW_reg %xmm4
> > -       movdqa  LABEL(topupper)(%rip), %xmm5
> > -# define UCHIGH_reg %xmm5
> > -       movdqa  LABEL(touppermask)(%rip), %xmm6
> > -# define LCQWORD_reg %xmm6
> > +       movdqa  LABEL(lcase_min)(%rip), %xmm4
> > +# define LCASE_MIN_reg %xmm4
> > +       movdqa  LABEL(lcase_max)(%rip), %xmm5
> > +# define LCASE_MAX_reg %xmm5
> > +       movdqa  LABEL(case_add)(%rip), %xmm6
> > +# define CASE_ADD_reg %xmm6
> >  #endif
> >         cmp     $0x30, %ecx
> >         ja      LABEL(crosscache)/* rsi: 16-byte load will cross cache line */
> > @@ -200,32 +193,26 @@ LABEL(touppermask):
> >  #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> >  # ifdef USE_AVX
> >  #  define TOLOWER(reg1, reg2) \
> > -       vpcmpgtb UCLOW_reg, reg1, %xmm7;                        \
> > -       vpcmpgtb UCHIGH_reg, reg1, %xmm8;                       \
> > -       vpcmpgtb UCLOW_reg, reg2, %xmm9;                        \
> > -       vpcmpgtb UCHIGH_reg, reg2, %xmm10;                      \
> > -       vpandn  %xmm7, %xmm8, %xmm8;                                    \
> > -       vpandn  %xmm9, %xmm10, %xmm10;                                  \
> > -       vpand   LCQWORD_reg, %xmm8, %xmm8;                              \
> > -       vpand   LCQWORD_reg, %xmm10, %xmm10;                            \
> > -       vpor    reg1, %xmm8, reg1;                                      \
> > -       vpor    reg2, %xmm10, reg2
> > +       vpaddb  LCASE_MIN_reg, reg1, %xmm7;                                     \
> > +       vpaddb  LCASE_MIN_reg, reg2, %xmm8;                                     \
> > +       vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7;                                   \
> > +       vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8;                                   \
> > +       vpandn  CASE_ADD_reg, %xmm7, %xmm7;                                     \
> > +       vpandn  CASE_ADD_reg, %xmm8, %xmm8;                                     \
> > +       vpaddb  %xmm7, reg1, reg1;                                      \
> > +       vpaddb  %xmm8, reg2, reg2
> >  # else
> >  #  define TOLOWER(reg1, reg2) \
> > -       movdqa  reg1, %xmm7;                                    \
> > -       movdqa  UCHIGH_reg, %xmm8;                              \
> > -       movdqa  reg2, %xmm9;                                    \
> > -       movdqa  UCHIGH_reg, %xmm10;                             \
> > -       pcmpgtb UCLOW_reg, %xmm7;                               \
> > -       pcmpgtb reg1, %xmm8;                                    \
> > -       pcmpgtb UCLOW_reg, %xmm9;                               \
> > -       pcmpgtb reg2, %xmm10;                                   \
> > -       pand    %xmm8, %xmm7;                                   \
> > -       pand    %xmm10, %xmm9;                                  \
> > -       pand    LCQWORD_reg, %xmm7;                             \
> > -       pand    LCQWORD_reg, %xmm9;                             \
> > -       por     %xmm7, reg1;                                    \
> > -       por     %xmm9, reg2
> > +       movdqa  LCASE_MIN_reg, %xmm7;                                   \
> > +       movdqa  LCASE_MIN_reg, %xmm8;                                   \
> > +       paddb   reg1, %xmm7;                                    \
> > +       paddb   reg2, %xmm8;                                    \
> > +       pcmpgtb LCASE_MAX_reg, %xmm7;                           \
> > +       pcmpgtb LCASE_MAX_reg, %xmm8;                           \
> > +       pandn   CASE_ADD_reg, %xmm7;                                    \
> > +       pandn   CASE_ADD_reg, %xmm8;                                    \
> > +       paddb   %xmm7, reg1;                                    \
> > +       paddb   %xmm8, reg2
> >  # endif
> >         TOLOWER (%xmm1, %xmm2)
> >  #else
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v1 23/23] x86: Remove AVX str{n}casecmp
       [not found]   ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>
@ 2022-05-12 19:54     ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:54 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:09 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Wed, Mar 23, 2022 at 3:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > The rationale is:
> >
> > 1. SSE42 has nearly identical logic so any benefit is minimal (3.4%
> >    regression on Tigerlake using SSE42 versus AVX across the
> >    benchtest suite).
> > 2. The AVX2 version covers the majority of targets that previously
> >    preferred it.
> > 3. The targets where AVX would still be best (SnB and IVB) are
> >    becoming outdated.
> >
> > All in all, saving the code size is worth it.
> >
> > All string/memory tests pass.
> > ---
> > Geometric Mean N=40 runs; All functions page aligned
> > length, align1, align2, max_char, AVX Time / SSE42 Time
> >      1,      1,      1,      127,                 0.928
> >      2,      2,      2,      127,                 0.934
> >      3,      3,      3,      127,                 0.975
> >      4,      4,      4,      127,                  0.96
> >      5,      5,      5,      127,                 0.935
> >      6,      6,      6,      127,                 0.929
> >      7,      7,      7,      127,                 0.959
> >      8,      0,      0,      127,                 0.955
> >      9,      1,      1,      127,                 0.944
> >     10,      2,      2,      127,                 0.975
> >     11,      3,      3,      127,                 0.935
> >     12,      4,      4,      127,                 0.931
> >     13,      5,      5,      127,                 0.926
> >     14,      6,      6,      127,                 0.901
> >     15,      7,      7,      127,                 0.951
> >      4,      0,      0,      127,                 0.958
> >      4,      0,      0,      254,                 0.956
> >      8,      0,      0,      254,                 0.977
> >     16,      0,      0,      127,                 0.955
> >     16,      0,      0,      254,                 0.953
> >     32,      0,      0,      127,                 0.943
> >     32,      0,      0,      254,                 0.941
> >     64,      0,      0,      127,                 0.941
> >     64,      0,      0,      254,                 0.955
> >    128,      0,      0,      127,                 0.972
> >    128,      0,      0,      254,                 0.975
> >    256,      0,      0,      127,                 0.996
> >    256,      0,      0,      254,                 0.993
> >    512,      0,      0,      127,                 0.992
> >    512,      0,      0,      254,                 0.986
> >   1024,      0,      0,      127,                 0.994
> >   1024,      0,      0,      254,                 0.993
> >     16,      1,      2,      127,                 0.933
> >     16,      2,      1,      254,                 0.953
> >     32,      2,      4,      127,                 0.927
> >     32,      4,      2,      254,                 0.986
> >     64,      3,      6,      127,                 0.991
> >     64,      6,      3,      254,                 1.014
> >    128,      4,      0,      127,                 1.001
> >    128,      0,      4,      254,                 0.991
> >    256,      5,      2,      127,                 1.011
> >    256,      2,      5,      254,                 1.013
> >    512,      6,      4,      127,                 1.056
> >    512,      4,      6,      254,                 0.916
> >   1024,      7,      6,      127,                 1.059
> >   1024,      6,      7,      254,                 1.043
> >
> >  sysdeps/x86_64/multiarch/Makefile           |   2 -
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c  |  12 -
> >  sysdeps/x86_64/multiarch/ifunc-strcasecmp.h |   4 -
> >  sysdeps/x86_64/multiarch/strcasecmp_l-avx.S |  22 --
> >  sysdeps/x86_64/multiarch/strcmp-sse42.S     | 240 +++++++++-----------
> >  sysdeps/x86_64/multiarch/strncase_l-avx.S   |  22 --
> >  6 files changed, 105 insertions(+), 197 deletions(-)
> >  delete mode 100644 sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> >  delete mode 100644 sysdeps/x86_64/multiarch/strncase_l-avx.S
> >
> > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> > index 35d80dc2ff..6507d1b7fa 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -54,7 +54,6 @@ sysdep_routines += \
> >    stpncpy-evex \
> >    stpncpy-sse2-unaligned \
> >    stpncpy-ssse3 \
> > -  strcasecmp_l-avx \
> >    strcasecmp_l-avx2 \
> >    strcasecmp_l-avx2-rtm \
> >    strcasecmp_l-evex \
> > @@ -95,7 +94,6 @@ sysdep_routines += \
> >    strlen-avx2-rtm \
> >    strlen-evex \
> >    strlen-sse2 \
> > -  strncase_l-avx \
> >    strncase_l-avx2 \
> >    strncase_l-avx2-rtm \
> >    strncase_l-evex \
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index f1a4d3dac2..40cc6cc49e 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -447,9 +447,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                               (CPU_FEATURE_USABLE (AVX2)
> >                                && CPU_FEATURE_USABLE (RTM)),
> >                               __strcasecmp_avx2_rtm)
> > -             IFUNC_IMPL_ADD (array, i, strcasecmp,
> > -                             CPU_FEATURE_USABLE (AVX),
> > -                             __strcasecmp_avx)
> >               IFUNC_IMPL_ADD (array, i, strcasecmp,
> >                               CPU_FEATURE_USABLE (SSE4_2),
> >                               __strcasecmp_sse42)
> > @@ -471,9 +468,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                               (CPU_FEATURE_USABLE (AVX2)
> >                                && CPU_FEATURE_USABLE (RTM)),
> >                               __strcasecmp_l_avx2_rtm)
> > -             IFUNC_IMPL_ADD (array, i, strcasecmp_l,
> > -                             CPU_FEATURE_USABLE (AVX),
> > -                             __strcasecmp_l_avx)
> >               IFUNC_IMPL_ADD (array, i, strcasecmp_l,
> >                               CPU_FEATURE_USABLE (SSE4_2),
> >                               __strcasecmp_l_sse42)
> > @@ -609,9 +603,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                               (CPU_FEATURE_USABLE (AVX2)
> >                                && CPU_FEATURE_USABLE (RTM)),
> >                               __strncasecmp_avx2_rtm)
> > -             IFUNC_IMPL_ADD (array, i, strncasecmp,
> > -                             CPU_FEATURE_USABLE (AVX),
> > -                             __strncasecmp_avx)
> >               IFUNC_IMPL_ADD (array, i, strncasecmp,
> >                               CPU_FEATURE_USABLE (SSE4_2),
> >                               __strncasecmp_sse42)
> > @@ -634,9 +625,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                               (CPU_FEATURE_USABLE (AVX2)
> >                                && CPU_FEATURE_USABLE (RTM)),
> >                               __strncasecmp_l_avx2_rtm)
> > -             IFUNC_IMPL_ADD (array, i, strncasecmp_l,
> > -                             CPU_FEATURE_USABLE (AVX),
> > -                             __strncasecmp_l_avx)
> >               IFUNC_IMPL_ADD (array, i, strncasecmp_l,
> >                               CPU_FEATURE_USABLE (SSE4_2),
> >                               __strncasecmp_l_sse42)
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > index bf0d146e7f..766539c241 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > +++ b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h
> > @@ -22,7 +22,6 @@
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
> > -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
> >  extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
> > @@ -46,9 +45,6 @@ IFUNC_SELECTOR (void)
> >          return OPTIMIZE (avx2);
> >      }
> >
> > -  if (CPU_FEATURE_USABLE_P (cpu_features, AVX))
> > -    return OPTIMIZE (avx);
> > -
> >    if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2)
> >        && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
> >      return OPTIMIZE (sse42);
> > diff --git a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S b/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> > deleted file mode 100644
> > index 7ec7c21b5a..0000000000
> > --- a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S
> > +++ /dev/null
> > @@ -1,22 +0,0 @@
> > -/* strcasecmp_l optimized with AVX.
> > -   Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <https://www.gnu.org/licenses/>.  */
> > -
> > -#define STRCMP_SSE42 __strcasecmp_l_avx
> > -#define USE_AVX 1
> > -#define USE_AS_STRCASECMP_L
> > -#include "strcmp-sse42.S"
> > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > index 7805ae9d41..a9178ad25c 100644
> > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S
> > @@ -41,13 +41,8 @@
> >  # define UPDATE_STRNCMP_COUNTER
> >  #endif
> >
> > -#ifdef USE_AVX
> > -# define SECTION       avx
> > -# define GLABEL(l)     l##_avx
> > -#else
> > -# define SECTION       sse4.2
> > -# define GLABEL(l)     l##_sse42
> > -#endif
> > +#define SECTION        sse4.2
> > +#define GLABEL(l)      l##_sse42
> >
> >  #define LABEL(l)       .L##l
> >
> > @@ -105,21 +100,7 @@ END (GLABEL(__strncasecmp))
> >  #endif
> >
> >
> > -#ifdef USE_AVX
> > -# define movdqa vmovdqa
> > -# define movdqu vmovdqu
> > -# define pmovmskb vpmovmskb
> > -# define pcmpistri vpcmpistri
> > -# define psubb vpsubb
> > -# define pcmpeqb vpcmpeqb
> > -# define psrldq vpsrldq
> > -# define pslldq vpslldq
> > -# define palignr vpalignr
> > -# define pxor vpxor
> > -# define D(arg) arg, arg
> > -#else
> > -# define D(arg) arg
> > -#endif
> > +#define arg arg
> >
> >  STRCMP_SSE42:
> >         cfi_startproc
> > @@ -191,18 +172,7 @@ LABEL(case_add):
> >         movdqu  (%rdi), %xmm1
> >         movdqu  (%rsi), %xmm2
> >  #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L
> > -# ifdef USE_AVX
> > -#  define TOLOWER(reg1, reg2) \
> > -       vpaddb  LCASE_MIN_reg, reg1, %xmm7;                                     \
> > -       vpaddb  LCASE_MIN_reg, reg2, %xmm8;                                     \
> > -       vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7;                                   \
> > -       vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8;                                   \
> > -       vpandn  CASE_ADD_reg, %xmm7, %xmm7;                                     \
> > -       vpandn  CASE_ADD_reg, %xmm8, %xmm8;                                     \
> > -       vpaddb  %xmm7, reg1, reg1;                                      \
> > -       vpaddb  %xmm8, reg2, reg2
> > -# else
> > -#  define TOLOWER(reg1, reg2) \
> > +# define TOLOWER(reg1, reg2) \
> >         movdqa  LCASE_MIN_reg, %xmm7;                                   \
> >         movdqa  LCASE_MIN_reg, %xmm8;                                   \
> >         paddb   reg1, %xmm7;                                    \
> > @@ -213,15 +183,15 @@ LABEL(case_add):
> >         pandn   CASE_ADD_reg, %xmm8;                                    \
> >         paddb   %xmm7, reg1;                                    \
> >         paddb   %xmm8, reg2
> > -# endif
> > +
> >         TOLOWER (%xmm1, %xmm2)
> >  #else
> >  # define TOLOWER(reg1, reg2)
> >  #endif
> > -       pxor    %xmm0, D(%xmm0)         /* clear %xmm0 for null char checks */
> > -       pcmpeqb %xmm1, D(%xmm0)         /* Any null chars? */
> > -       pcmpeqb %xmm2, D(%xmm1)         /* compare first 16 bytes for equality */
> > -       psubb   %xmm0, D(%xmm1)         /* packed sub of comparison results*/
> > +       pxor    %xmm0, %xmm0            /* clear %xmm0 for null char checks */
> > +       pcmpeqb %xmm1, %xmm0            /* Any null chars? */
> > +       pcmpeqb %xmm2, %xmm1            /* compare first 16 bytes for equality */
> > +       psubb   %xmm0, %xmm1            /* packed sub of comparison results*/
> >         pmovmskb %xmm1, %edx
> >         sub     $0xffff, %edx           /* if first 16 bytes are same, edx == 0xffff */
> >         jnz     LABEL(less16bytes)/* If not, find different value or null char */
> > @@ -245,7 +215,7 @@ LABEL(crosscache):
> >         xor     %r8d, %r8d
> >         and     $0xf, %ecx              /* offset of rsi */
> >         and     $0xf, %eax              /* offset of rdi */
> > -       pxor    %xmm0, D(%xmm0)         /* clear %xmm0 for null char check */
> > +       pxor    %xmm0, %xmm0            /* clear %xmm0 for null char check */
> >         cmp     %eax, %ecx
> >         je      LABEL(ashr_0)           /* rsi and rdi relative offset same */
> >         ja      LABEL(bigger)
> > @@ -259,7 +229,7 @@ LABEL(bigger):
> >         sub     %rcx, %r9
> >         lea     LABEL(unaligned_table)(%rip), %r10
> >         movslq  (%r10, %r9,4), %r9
> > -       pcmpeqb %xmm1, D(%xmm0)         /* Any null chars? */
> > +       pcmpeqb %xmm1, %xmm0            /* Any null chars? */
> >         lea     (%r10, %r9), %r10
> >         _CET_NOTRACK jmp *%r10          /* jump to corresponding case */
> >
> > @@ -272,15 +242,15 @@ LABEL(bigger):
> >  LABEL(ashr_0):
> >
> >         movdqa  (%rsi), %xmm1
> > -       pcmpeqb %xmm1, D(%xmm0)         /* Any null chars? */
> > +       pcmpeqb %xmm1, %xmm0            /* Any null chars? */
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > -       pcmpeqb (%rdi), D(%xmm1)        /* compare 16 bytes for equality */
> > +       pcmpeqb (%rdi), %xmm1           /* compare 16 bytes for equality */
> >  #else
> >         movdqa  (%rdi), %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm2, D(%xmm1)         /* compare 16 bytes for equality */
> > +       pcmpeqb %xmm2, %xmm1            /* compare 16 bytes for equality */
> >  #endif
> > -       psubb   %xmm0, D(%xmm1)         /* packed sub of comparison results*/
> > +       psubb   %xmm0, %xmm1            /* packed sub of comparison results*/
> >         pmovmskb %xmm1, %r9d
> >         shr     %cl, %edx               /* adjust 0xffff for offset */
> >         shr     %cl, %r9d               /* adjust for 16-byte offset */
> > @@ -360,10 +330,10 @@ LABEL(ashr_0_exit_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_1):
> > -       pslldq  $15, D(%xmm2)           /* shift first string to align with second */
> > +       pslldq  $15, %xmm2              /* shift first string to align with second */
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)         /* compare 16 bytes for equality */
> > -       psubb   %xmm0, D(%xmm2)         /* packed sub of comparison results*/
> > +       pcmpeqb %xmm1, %xmm2            /* compare 16 bytes for equality */
> > +       psubb   %xmm0, %xmm2            /* packed sub of comparison results*/
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx               /* adjust 0xffff for offset */
> >         shr     %cl, %r9d               /* adjust for 16-byte offset */
> > @@ -391,7 +361,7 @@ LABEL(loop_ashr_1_use):
> >
> >  LABEL(nibble_ashr_1_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $1, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $1, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -410,7 +380,7 @@ LABEL(nibble_ashr_1_restart_use):
> >         jg      LABEL(nibble_ashr_1_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $1, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $1, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -430,7 +400,7 @@ LABEL(nibble_ashr_1_restart_use):
> >  LABEL(nibble_ashr_1_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $1, D(%xmm0)
> > +       psrldq  $1, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -448,10 +418,10 @@ LABEL(nibble_ashr_1_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_2):
> > -       pslldq  $14, D(%xmm2)
> > +       pslldq  $14, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -479,7 +449,7 @@ LABEL(loop_ashr_2_use):
> >
> >  LABEL(nibble_ashr_2_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $2, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $2, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -498,7 +468,7 @@ LABEL(nibble_ashr_2_restart_use):
> >         jg      LABEL(nibble_ashr_2_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $2, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $2, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -518,7 +488,7 @@ LABEL(nibble_ashr_2_restart_use):
> >  LABEL(nibble_ashr_2_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $2, D(%xmm0)
> > +       psrldq  $2, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -536,10 +506,10 @@ LABEL(nibble_ashr_2_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_3):
> > -       pslldq  $13, D(%xmm2)
> > +       pslldq  $13, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -567,7 +537,7 @@ LABEL(loop_ashr_3_use):
> >
> >  LABEL(nibble_ashr_3_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $3, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $3, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -586,7 +556,7 @@ LABEL(nibble_ashr_3_restart_use):
> >         jg      LABEL(nibble_ashr_3_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $3, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $3, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -606,7 +576,7 @@ LABEL(nibble_ashr_3_restart_use):
> >  LABEL(nibble_ashr_3_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $3, D(%xmm0)
> > +       psrldq  $3, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -624,10 +594,10 @@ LABEL(nibble_ashr_3_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_4):
> > -       pslldq  $12, D(%xmm2)
> > +       pslldq  $12, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -656,7 +626,7 @@ LABEL(loop_ashr_4_use):
> >
> >  LABEL(nibble_ashr_4_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $4, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $4, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -675,7 +645,7 @@ LABEL(nibble_ashr_4_restart_use):
> >         jg      LABEL(nibble_ashr_4_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $4, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $4, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -695,7 +665,7 @@ LABEL(nibble_ashr_4_restart_use):
> >  LABEL(nibble_ashr_4_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $4, D(%xmm0)
> > +       psrldq  $4, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -713,10 +683,10 @@ LABEL(nibble_ashr_4_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_5):
> > -       pslldq  $11, D(%xmm2)
> > +       pslldq  $11, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -745,7 +715,7 @@ LABEL(loop_ashr_5_use):
> >
> >  LABEL(nibble_ashr_5_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $5, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $5, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -765,7 +735,7 @@ LABEL(nibble_ashr_5_restart_use):
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> >
> > -       palignr $5, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $5, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -785,7 +755,7 @@ LABEL(nibble_ashr_5_restart_use):
> >  LABEL(nibble_ashr_5_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $5, D(%xmm0)
> > +       psrldq  $5, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -803,10 +773,10 @@ LABEL(nibble_ashr_5_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_6):
> > -       pslldq  $10, D(%xmm2)
> > +       pslldq  $10, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -835,7 +805,7 @@ LABEL(loop_ashr_6_use):
> >
> >  LABEL(nibble_ashr_6_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $6, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $6, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -854,7 +824,7 @@ LABEL(nibble_ashr_6_restart_use):
> >         jg      LABEL(nibble_ashr_6_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $6, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $6, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -874,7 +844,7 @@ LABEL(nibble_ashr_6_restart_use):
> >  LABEL(nibble_ashr_6_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $6, D(%xmm0)
> > +       psrldq  $6, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -892,10 +862,10 @@ LABEL(nibble_ashr_6_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_7):
> > -       pslldq  $9, D(%xmm2)
> > +       pslldq  $9, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -924,7 +894,7 @@ LABEL(loop_ashr_7_use):
> >
> >  LABEL(nibble_ashr_7_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $7, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $7, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -943,7 +913,7 @@ LABEL(nibble_ashr_7_restart_use):
> >         jg      LABEL(nibble_ashr_7_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $7, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $7, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri       $0x1a,(%rsi,%rdx), %xmm0
> >  #else
> > @@ -963,7 +933,7 @@ LABEL(nibble_ashr_7_restart_use):
> >  LABEL(nibble_ashr_7_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $7, D(%xmm0)
> > +       psrldq  $7, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -981,10 +951,10 @@ LABEL(nibble_ashr_7_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_8):
> > -       pslldq  $8, D(%xmm2)
> > +       pslldq  $8, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1013,7 +983,7 @@ LABEL(loop_ashr_8_use):
> >
> >  LABEL(nibble_ashr_8_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $8, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1032,7 +1002,7 @@ LABEL(nibble_ashr_8_restart_use):
> >         jg      LABEL(nibble_ashr_8_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $8, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1052,7 +1022,7 @@ LABEL(nibble_ashr_8_restart_use):
> >  LABEL(nibble_ashr_8_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $8, D(%xmm0)
> > +       psrldq  $8, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1070,10 +1040,10 @@ LABEL(nibble_ashr_8_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_9):
> > -       pslldq  $7, D(%xmm2)
> > +       pslldq  $7, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1103,7 +1073,7 @@ LABEL(loop_ashr_9_use):
> >  LABEL(nibble_ashr_9_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> >
> > -       palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $9, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1122,7 +1092,7 @@ LABEL(nibble_ashr_9_restart_use):
> >         jg      LABEL(nibble_ashr_9_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $9, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1142,7 +1112,7 @@ LABEL(nibble_ashr_9_restart_use):
> >  LABEL(nibble_ashr_9_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $9, D(%xmm0)
> > +       psrldq  $9, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1160,10 +1130,10 @@ LABEL(nibble_ashr_9_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_10):
> > -       pslldq  $6, D(%xmm2)
> > +       pslldq  $6, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1192,7 +1162,7 @@ LABEL(loop_ashr_10_use):
> >
> >  LABEL(nibble_ashr_10_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $10, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1211,7 +1181,7 @@ LABEL(nibble_ashr_10_restart_use):
> >         jg      LABEL(nibble_ashr_10_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $10, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1231,7 +1201,7 @@ LABEL(nibble_ashr_10_restart_use):
> >  LABEL(nibble_ashr_10_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $10, D(%xmm0)
> > +       psrldq  $10, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1249,10 +1219,10 @@ LABEL(nibble_ashr_10_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_11):
> > -       pslldq  $5, D(%xmm2)
> > +       pslldq  $5, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1281,7 +1251,7 @@ LABEL(loop_ashr_11_use):
> >
> >  LABEL(nibble_ashr_11_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $11, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1300,7 +1270,7 @@ LABEL(nibble_ashr_11_restart_use):
> >         jg      LABEL(nibble_ashr_11_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $11, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1320,7 +1290,7 @@ LABEL(nibble_ashr_11_restart_use):
> >  LABEL(nibble_ashr_11_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $11, D(%xmm0)
> > +       psrldq  $11, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1338,10 +1308,10 @@ LABEL(nibble_ashr_11_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_12):
> > -       pslldq  $4, D(%xmm2)
> > +       pslldq  $4, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1370,7 +1340,7 @@ LABEL(loop_ashr_12_use):
> >
> >  LABEL(nibble_ashr_12_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $12, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1389,7 +1359,7 @@ LABEL(nibble_ashr_12_restart_use):
> >         jg      LABEL(nibble_ashr_12_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $12, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1409,7 +1379,7 @@ LABEL(nibble_ashr_12_restart_use):
> >  LABEL(nibble_ashr_12_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $12, D(%xmm0)
> > +       psrldq  $12, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1427,10 +1397,10 @@ LABEL(nibble_ashr_12_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_13):
> > -       pslldq  $3, D(%xmm2)
> > +       pslldq  $3, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1460,7 +1430,7 @@ LABEL(loop_ashr_13_use):
> >
> >  LABEL(nibble_ashr_13_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $13, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1479,7 +1449,7 @@ LABEL(nibble_ashr_13_restart_use):
> >         jg      LABEL(nibble_ashr_13_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $13, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1499,7 +1469,7 @@ LABEL(nibble_ashr_13_restart_use):
> >  LABEL(nibble_ashr_13_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $13, D(%xmm0)
> > +       psrldq  $13, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1517,10 +1487,10 @@ LABEL(nibble_ashr_13_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_14):
> > -       pslldq  $2, D(%xmm2)
> > +       pslldq  $2, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1550,7 +1520,7 @@ LABEL(loop_ashr_14_use):
> >
> >  LABEL(nibble_ashr_14_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $14, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1569,7 +1539,7 @@ LABEL(nibble_ashr_14_restart_use):
> >         jg      LABEL(nibble_ashr_14_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $14, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1589,7 +1559,7 @@ LABEL(nibble_ashr_14_restart_use):
> >  LABEL(nibble_ashr_14_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $14, D(%xmm0)
> > +       psrldq  $14, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > @@ -1607,10 +1577,10 @@ LABEL(nibble_ashr_14_use):
> >   */
> >         .p2align 4
> >  LABEL(ashr_15):
> > -       pslldq  $1, D(%xmm2)
> > +       pslldq  $1, %xmm2
> >         TOLOWER (%xmm1, %xmm2)
> > -       pcmpeqb %xmm1, D(%xmm2)
> > -       psubb   %xmm0, D(%xmm2)
> > +       pcmpeqb %xmm1, %xmm2
> > +       psubb   %xmm0, %xmm2
> >         pmovmskb %xmm2, %r9d
> >         shr     %cl, %edx
> >         shr     %cl, %r9d
> > @@ -1642,7 +1612,7 @@ LABEL(loop_ashr_15_use):
> >
> >  LABEL(nibble_ashr_15_restart_use):
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $15, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1661,7 +1631,7 @@ LABEL(nibble_ashr_15_restart_use):
> >         jg      LABEL(nibble_ashr_15_use)
> >
> >         movdqa  (%rdi, %rdx), %xmm0
> > -       palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > +       palignr $15, -16(%rdi, %rdx), %xmm0
> >  #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> >         pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> >  #else
> > @@ -1681,7 +1651,7 @@ LABEL(nibble_ashr_15_restart_use):
> >  LABEL(nibble_ashr_15_use):
> >         sub     $0x1000, %r10
> >         movdqa  -16(%rdi, %rdx), %xmm0
> > -       psrldq  $15, D(%xmm0)
> > +       psrldq  $15, %xmm0
> >         pcmpistri      $0x3a,%xmm0, %xmm0
> >  #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> >         cmp     %r11, %rcx
> > diff --git a/sysdeps/x86_64/multiarch/strncase_l-avx.S b/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > deleted file mode 100644
> > index b51b86d223..0000000000
> > --- a/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > +++ /dev/null
> > @@ -1,22 +0,0 @@
> > -/* strncasecmp_l optimized with AVX.
> > -   Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <https://www.gnu.org/licenses/>.  */
> > -
> > -#define STRCMP_SSE42 __strncasecmp_l_avx
> > -#define USE_AVX 1
> > -#define USE_AS_STRNCASECMP_L
> > -#include "strcmp-sse42.S"
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-05-12 19:54 UTC | newest]

Thread overview: 10+ messages
     [not found] <20220323215734.3927131-1-goldstein.w.n@gmail.com>
     [not found] ` <20220323215734.3927131-3-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOqQHH-20_czF-vtb_L_6MRBer=H9g3XpNBQLzcoSLZj+A@mail.gmail.com>
     [not found]     ` <CAFUsyfKfR3haCneczj0=ji+u3X_RsMNCXuOadytBrcaxgoEVTg@mail.gmail.com>
     [not found]       ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
2022-05-12 19:31         ` [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch Sunil Pandey
     [not found] ` <20220323215734.3927131-4-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com>
2022-05-12 19:32     ` [PATCH v1 04/23] x86: Code cleanup in strchr-evex " Sunil Pandey
     [not found] ` <20220323215734.3927131-7-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com>
2022-05-12 19:34     ` [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c Sunil Pandey
     [not found] ` <20220323215734.3927131-8-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com>
2022-05-12 19:39     ` [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c Sunil Pandey
     [not found] ` <20220323215734.3927131-9-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com>
2022-05-12 19:40     ` [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation Sunil Pandey
     [not found] ` <20220323215734.3927131-10-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>
2022-05-12 19:41     ` [PATCH v1 10/23] x86: Remove strpbrk-sse2.S " Sunil Pandey
     [not found] ` <20220323215734.3927131-11-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>
2022-05-12 19:42     ` [PATCH v1 11/23] x86: Remove strspn-sse2.S " Sunil Pandey
     [not found] ` <20220323215734.3927131-17-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>
2022-05-12 19:44     ` [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S Sunil Pandey
     [not found] ` <20220323215734.3927131-18-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>
2022-05-12 19:45     ` [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S Sunil Pandey
     [not found] ` <20220323215734.3927131-23-goldstein.w.n@gmail.com>
     [not found]   ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>
2022-05-12 19:54     ` [PATCH v1 23/23] x86: Remove AVX str{n}casecmp Sunil Pandey
