* Re: [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch
  [not found] ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
@ 2022-05-12 19:31   ` Sunil Pandey
  0 siblings, 0 replies; 10+ messages in thread
From: Sunil Pandey @ 2022-05-12 19:31 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library

On Thu, Mar 24, 2022 at 12:37 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote:
>
> On Thu, Mar 24, 2022 at 12:20 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Thu, Mar 24, 2022 at 1:53 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > > >
> > > > Small code cleanup for size: -53 bytes.
> > > >
> > > > Add comment justifying using a branch to do NULL/non-null return.
> > > >
> > >
> > > Do you have followup patches to improve its performance? We are
> > > backporting all x86-64 improvements to Intel release branches:
> > >
> > > https://gitlab.com/x86-glibc/glibc/-/wikis/home
> > >
> > > Patches without performance improvements are undesirable.
> >
> > No further changes planned at the moment, code size saves
> > seem worth it for master though. Also in favor of adding the comment
> > as I think its non-intuitive.
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread
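Background for readers outside the thread: the comment that this patch adds argues that a branch on the NULL/non-null result inside strchr tends to predict the same way as the branch the caller performs on the return value. A minimal C sketch of that assumed caller pattern follows; it is illustrative only, not glibc code.

    #include <stdio.h>
    #include <string.h>

    /* Typical use of strchr: the return value is tested against NULL
       right away, so a branch inside strchr that resolves the same way
       as this user-level branch is cheap when predictable, and it is
       smaller than the cmovcc sequence it replaces.  */
    static void
    report (const char *s, int c)
    {
      const char *p = strchr (s, c);
      if (p == NULL)
        puts ("no match");
      else
        printf ("match at offset %td\n", p - s);
    }

When user code branches on the result anyway, a mispredict inside strchr simply takes the place of the mispredict the caller would have suffered, while the predictable case saves code size.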
* Re: [PATCH v1 04/23] x86: Code cleanup in strchr-evex and comment justifying branch [not found] ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com> @ 2022-05-12 19:32 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:32 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 11:55 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 2:58 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Small code cleanup for size: -81 bytes. > > > > Add comment justifying using a branch to do NULL/non-null return. > > > > All string/memory tests pass and no regressions in benchtests. > > > > geometric_mean(N=20) of all benchmarks New / Original: .985 > > --- > > Geomtric Mean N=20 runs; All functions page aligned > > length, alignment, pos, rand, seek_char/branch, max_char/perc-zero, New Time / Old Time > > 2048, 0, 32, 0, 23, 127, 0.878 > > 2048, 1, 32, 0, 23, 127, 0.88 > > 2048, 0, 64, 0, 23, 127, 0.997 > > 2048, 2, 64, 0, 23, 127, 1.001 > > 2048, 0, 128, 0, 23, 127, 0.973 > > 2048, 3, 128, 0, 23, 127, 0.971 > > 2048, 0, 256, 0, 23, 127, 0.976 > > 2048, 4, 256, 0, 23, 127, 0.973 > > 2048, 0, 512, 0, 23, 127, 1.001 > > 2048, 5, 512, 0, 23, 127, 1.004 > > 2048, 0, 1024, 0, 23, 127, 1.005 > > 2048, 6, 1024, 0, 23, 127, 1.007 > > 2048, 0, 2048, 0, 23, 127, 1.035 > > 2048, 7, 2048, 0, 23, 127, 1.03 > > 4096, 0, 32, 0, 23, 127, 0.889 > > 4096, 1, 32, 0, 23, 127, 0.891 > > 4096, 0, 64, 0, 23, 127, 1.012 > > 4096, 2, 64, 0, 23, 127, 1.017 > > 4096, 0, 128, 0, 23, 127, 0.975 > > 4096, 3, 128, 0, 23, 127, 0.974 > > 4096, 0, 256, 0, 23, 127, 0.974 > > 4096, 4, 256, 0, 23, 127, 0.972 > > 4096, 0, 512, 0, 23, 127, 1.002 > > 4096, 5, 512, 0, 23, 127, 1.016 > > 4096, 0, 1024, 0, 23, 127, 1.009 > > 4096, 6, 1024, 0, 23, 127, 1.008 > > 4096, 0, 2048, 0, 23, 127, 1.003 > > 4096, 7, 2048, 0, 23, 127, 1.004 > > 256, 1, 64, 0, 23, 127, 0.993 > > 256, 2, 64, 0, 23, 127, 0.999 > > 256, 3, 64, 0, 23, 127, 0.992 > > 256, 4, 64, 0, 23, 127, 0.99 > > 256, 5, 64, 0, 23, 127, 0.99 > > 256, 6, 64, 0, 23, 127, 0.994 > > 256, 7, 64, 0, 23, 127, 0.991 > > 512, 0, 256, 0, 23, 127, 0.971 > > 512, 16, 256, 0, 23, 127, 0.971 > > 512, 32, 256, 0, 23, 127, 1.005 > > 512, 48, 256, 0, 23, 127, 0.998 > > 512, 64, 256, 0, 23, 127, 1.001 > > 512, 80, 256, 0, 23, 127, 1.002 > > 512, 96, 256, 0, 23, 127, 1.005 > > 512, 112, 256, 0, 23, 127, 1.012 > > 1, 0, 0, 0, 23, 127, 1.024 > > 2, 0, 1, 0, 23, 127, 0.991 > > 3, 0, 2, 0, 23, 127, 0.997 > > 4, 0, 3, 0, 23, 127, 0.984 > > 5, 0, 4, 0, 23, 127, 0.993 > > 6, 0, 5, 0, 23, 127, 0.985 > > 7, 0, 6, 0, 23, 127, 0.979 > > 8, 0, 7, 0, 23, 127, 0.975 > > 9, 0, 8, 0, 23, 127, 0.965 > > 10, 0, 9, 0, 23, 127, 0.957 > > 11, 0, 10, 0, 23, 127, 0.979 > > 12, 0, 11, 0, 23, 127, 0.987 > > 13, 0, 12, 0, 23, 127, 1.023 > > 14, 0, 13, 0, 23, 127, 0.997 > > 15, 0, 14, 0, 23, 127, 0.983 > > 16, 0, 15, 0, 23, 127, 0.987 > > 17, 0, 16, 0, 23, 127, 0.993 > > 18, 0, 17, 0, 23, 127, 0.985 > > 19, 0, 18, 0, 23, 127, 0.999 > > 20, 0, 19, 0, 23, 127, 0.998 > > 21, 0, 20, 0, 23, 127, 0.983 > > 22, 0, 21, 0, 23, 127, 0.983 > > 23, 0, 22, 0, 23, 127, 1.002 > > 24, 0, 23, 0, 23, 127, 1.0 > > 25, 0, 24, 0, 23, 127, 1.002 > > 26, 0, 25, 0, 23, 127, 0.984 > > 27, 0, 26, 0, 23, 127, 0.994 > > 28, 0, 27, 0, 23, 127, 0.995 > > 29, 0, 28, 0, 23, 127, 1.017 > > 30, 0, 29, 0, 23, 127, 1.009 > > 31, 0, 30, 0, 23, 127, 1.001 > > 32, 0, 31, 0, 23, 127, 1.021 > > 2048, 0, 
32, 0, 0, 127, 0.899 > > 2048, 1, 32, 0, 0, 127, 0.93 > > 2048, 0, 64, 0, 0, 127, 1.009 > > 2048, 2, 64, 0, 0, 127, 1.023 > > 2048, 0, 128, 0, 0, 127, 0.973 > > 2048, 3, 128, 0, 0, 127, 0.975 > > 2048, 0, 256, 0, 0, 127, 0.974 > > 2048, 4, 256, 0, 0, 127, 0.97 > > 2048, 0, 512, 0, 0, 127, 0.999 > > 2048, 5, 512, 0, 0, 127, 1.004 > > 2048, 0, 1024, 0, 0, 127, 1.008 > > 2048, 6, 1024, 0, 0, 127, 1.008 > > 2048, 0, 2048, 0, 0, 127, 0.996 > > 2048, 7, 2048, 0, 0, 127, 1.002 > > 4096, 0, 32, 0, 0, 127, 0.872 > > 4096, 1, 32, 0, 0, 127, 0.881 > > 4096, 0, 64, 0, 0, 127, 1.006 > > 4096, 2, 64, 0, 0, 127, 1.005 > > 4096, 0, 128, 0, 0, 127, 0.973 > > 4096, 3, 128, 0, 0, 127, 0.974 > > 4096, 0, 256, 0, 0, 127, 0.969 > > 4096, 4, 256, 0, 0, 127, 0.971 > > 4096, 0, 512, 0, 0, 127, 1.0 > > 4096, 5, 512, 0, 0, 127, 1.005 > > 4096, 0, 1024, 0, 0, 127, 1.007 > > 4096, 6, 1024, 0, 0, 127, 1.009 > > 4096, 0, 2048, 0, 0, 127, 1.005 > > 4096, 7, 2048, 0, 0, 127, 1.007 > > 256, 1, 64, 0, 0, 127, 0.994 > > 256, 2, 64, 0, 0, 127, 1.008 > > 256, 3, 64, 0, 0, 127, 1.019 > > 256, 4, 64, 0, 0, 127, 0.991 > > 256, 5, 64, 0, 0, 127, 0.992 > > 256, 6, 64, 0, 0, 127, 0.991 > > 256, 7, 64, 0, 0, 127, 0.988 > > 512, 0, 256, 0, 0, 127, 0.971 > > 512, 16, 256, 0, 0, 127, 0.967 > > 512, 32, 256, 0, 0, 127, 1.005 > > 512, 48, 256, 0, 0, 127, 1.001 > > 512, 64, 256, 0, 0, 127, 1.009 > > 512, 80, 256, 0, 0, 127, 1.008 > > 512, 96, 256, 0, 0, 127, 1.009 > > 512, 112, 256, 0, 0, 127, 1.016 > > 1, 0, 0, 0, 0, 127, 1.038 > > 2, 0, 1, 0, 0, 127, 1.009 > > 3, 0, 2, 0, 0, 127, 0.992 > > 4, 0, 3, 0, 0, 127, 1.004 > > 5, 0, 4, 0, 0, 127, 0.966 > > 6, 0, 5, 0, 0, 127, 0.968 > > 7, 0, 6, 0, 0, 127, 1.004 > > 8, 0, 7, 0, 0, 127, 0.99 > > 9, 0, 8, 0, 0, 127, 0.958 > > 10, 0, 9, 0, 0, 127, 0.96 > > 11, 0, 10, 0, 0, 127, 0.948 > > 12, 0, 11, 0, 0, 127, 0.984 > > 13, 0, 12, 0, 0, 127, 0.967 > > 14, 0, 13, 0, 0, 127, 0.993 > > 15, 0, 14, 0, 0, 127, 0.991 > > 16, 0, 15, 0, 0, 127, 1.0 > > 17, 0, 16, 0, 0, 127, 0.982 > > 18, 0, 17, 0, 0, 127, 0.977 > > 19, 0, 18, 0, 0, 127, 0.987 > > 20, 0, 19, 0, 0, 127, 0.978 > > 21, 0, 20, 0, 0, 127, 1.0 > > 22, 0, 21, 0, 0, 127, 0.99 > > 23, 0, 22, 0, 0, 127, 0.988 > > 24, 0, 23, 0, 0, 127, 0.997 > > 25, 0, 24, 0, 0, 127, 1.003 > > 26, 0, 25, 0, 0, 127, 1.004 > > 27, 0, 26, 0, 0, 127, 0.982 > > 28, 0, 27, 0, 0, 127, 0.972 > > 29, 0, 28, 0, 0, 127, 0.978 > > 30, 0, 29, 0, 0, 127, 0.992 > > 31, 0, 30, 0, 0, 127, 0.986 > > 32, 0, 31, 0, 0, 127, 1.0 > > > > 16, 0, 15, 1, 1, 0, 0.997 > > 16, 0, 15, 1, 0, 0, 1.001 > > 16, 0, 15, 1, 1, 0.1, 0.984 > > 16, 0, 15, 1, 0, 0.1, 0.999 > > 16, 0, 15, 1, 1, 0.25, 0.929 > > 16, 0, 15, 1, 0, 0.25, 1.001 > > 16, 0, 15, 1, 1, 0.33, 0.892 > > 16, 0, 15, 1, 0, 0.33, 0.996 > > 16, 0, 15, 1, 1, 0.5, 0.897 > > 16, 0, 15, 1, 0, 0.5, 1.009 > > 16, 0, 15, 1, 1, 0.66, 0.882 > > 16, 0, 15, 1, 0, 0.66, 0.967 > > 16, 0, 15, 1, 1, 0.75, 0.919 > > 16, 0, 15, 1, 0, 0.75, 1.027 > > 16, 0, 15, 1, 1, 0.9, 0.949 > > 16, 0, 15, 1, 0, 0.9, 1.021 > > 16, 0, 15, 1, 1, 1, 0.998 > > 16, 0, 15, 1, 0, 1, 0.999 > > > > sysdeps/x86_64/multiarch/strchr-evex.S | 146 ++++++++++++++----------- > > 1 file changed, 80 insertions(+), 66 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S > > index f62cd9d144..ec739fb8f9 100644 > > --- a/sysdeps/x86_64/multiarch/strchr-evex.S > > +++ b/sysdeps/x86_64/multiarch/strchr-evex.S > > @@ -30,6 +30,7 @@ > > # ifdef USE_AS_WCSCHR > > # define VPBROADCAST vpbroadcastd > > # define VPCMP vpcmpd > > +# define 
VPTESTN vptestnmd > > # define VPMINU vpminud > > # define CHAR_REG esi > > # define SHIFT_REG ecx > > @@ -37,6 +38,7 @@ > > # else > > # define VPBROADCAST vpbroadcastb > > # define VPCMP vpcmpb > > +# define VPTESTN vptestnmb > > # define VPMINU vpminub > > # define CHAR_REG sil > > # define SHIFT_REG edx > > @@ -61,13 +63,11 @@ > > # define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) > > > > .section .text.evex,"ax",@progbits > > -ENTRY (STRCHR) > > +ENTRY_P2ALIGN (STRCHR, 5) > > /* Broadcast CHAR to YMM0. */ > > VPBROADCAST %esi, %YMM0 > > movl %edi, %eax > > andl $(PAGE_SIZE - 1), %eax > > - vpxorq %XMMZERO, %XMMZERO, %XMMZERO > > - > > /* Check if we cross page boundary with one vector load. > > Otherwise it is safe to use an unaligned load. */ > > cmpl $(PAGE_SIZE - VEC_SIZE), %eax > > @@ -81,49 +81,35 @@ ENTRY (STRCHR) > > vpxorq %YMM1, %YMM0, %YMM2 > > VPMINU %YMM2, %YMM1, %YMM2 > > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ > > - VPCMP $0, %YMMZERO, %YMM2, %k0 > > + VPTESTN %YMM2, %YMM2, %k0 > > kmovd %k0, %eax > > testl %eax, %eax > > jz L(aligned_more) > > tzcntl %eax, %eax > > +# ifndef USE_AS_STRCHRNUL > > + /* Found CHAR or the null byte. */ > > + cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG > > + /* NB: Use a branch instead of cmovcc here. The expectation is > > + that with strchr the user will branch based on input being > > + null. Since this branch will be 100% predictive of the user > > + branch a branch miss here should save what otherwise would > > + be branch miss in the user code. Otherwise using a branch 1) > > + saves code size and 2) is faster in highly predictable > > + environments. */ > > + jne L(zero) > > +# endif > > # ifdef USE_AS_WCSCHR > > /* NB: Multiply wchar_t count by 4 to get the number of bytes. > > */ > > leaq (%rdi, %rax, CHAR_SIZE), %rax > > # else > > addq %rdi, %rax > > -# endif > > -# ifndef USE_AS_STRCHRNUL > > - /* Found CHAR or the null byte. */ > > - cmp (%rax), %CHAR_REG > > - jne L(zero) > > # endif > > ret > > > > - /* .p2align 5 helps keep performance more consistent if ENTRY() > > - alignment % 32 was either 16 or 0. As well this makes the > > - alignment % 32 of the loop_4x_vec fixed which makes tuning it > > - easier. */ > > - .p2align 5 > > -L(first_vec_x3): > > - tzcntl %eax, %eax > > -# ifndef USE_AS_STRCHRNUL > > - /* Found CHAR or the null byte. */ > > - cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG > > - jne L(zero) > > -# endif > > - /* NB: Multiply sizeof char type (1 or 4) to get the number of > > - bytes. */ > > - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > - ret > > > > -# ifndef USE_AS_STRCHRNUL > > -L(zero): > > - xorl %eax, %eax > > - ret > > -# endif > > > > - .p2align 4 > > + .p2align 4,, 10 > > L(first_vec_x4): > > # ifndef USE_AS_STRCHRNUL > > /* Check to see if first match was CHAR (k0) or null (k1). */ > > @@ -144,9 +130,18 @@ L(first_vec_x4): > > leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > > > +# ifndef USE_AS_STRCHRNUL > > +L(zero): > > + xorl %eax, %eax > > + ret > > +# endif > > + > > + > > .p2align 4 > > L(first_vec_x1): > > - tzcntl %eax, %eax > > + /* Use bsf here to save 1-byte keeping keeping the block in 1x > > + fetch block. eax guranteed non-zero. */ > > + bsfl %eax, %eax > > # ifndef USE_AS_STRCHRNUL > > /* Found CHAR or the null byte. 
*/ > > cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG > > @@ -158,7 +153,7 @@ L(first_vec_x1): > > leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > > > - .p2align 4 > > + .p2align 4,, 10 > > L(first_vec_x2): > > # ifndef USE_AS_STRCHRNUL > > /* Check to see if first match was CHAR (k0) or null (k1). */ > > @@ -179,6 +174,21 @@ L(first_vec_x2): > > leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > > > + .p2align 4,, 10 > > +L(first_vec_x3): > > + /* Use bsf here to save 1-byte keeping keeping the block in 1x > > + fetch block. eax guranteed non-zero. */ > > + bsfl %eax, %eax > > +# ifndef USE_AS_STRCHRNUL > > + /* Found CHAR or the null byte. */ > > + cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG > > + jne L(zero) > > +# endif > > + /* NB: Multiply sizeof char type (1 or 4) to get the number of > > + bytes. */ > > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > + ret > > + > > .p2align 4 > > L(aligned_more): > > /* Align data to VEC_SIZE. */ > > @@ -195,7 +205,7 @@ L(cross_page_continue): > > vpxorq %YMM1, %YMM0, %YMM2 > > VPMINU %YMM2, %YMM1, %YMM2 > > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ > > - VPCMP $0, %YMMZERO, %YMM2, %k0 > > + VPTESTN %YMM2, %YMM2, %k0 > > kmovd %k0, %eax > > testl %eax, %eax > > jnz L(first_vec_x1) > > @@ -206,7 +216,7 @@ L(cross_page_continue): > > /* Each bit in K0 represents a CHAR in YMM1. */ > > VPCMP $0, %YMM1, %YMM0, %k0 > > /* Each bit in K1 represents a CHAR in YMM1. */ > > - VPCMP $0, %YMM1, %YMMZERO, %k1 > > + VPTESTN %YMM1, %YMM1, %k1 > > kortestd %k0, %k1 > > jnz L(first_vec_x2) > > > > @@ -215,7 +225,7 @@ L(cross_page_continue): > > vpxorq %YMM1, %YMM0, %YMM2 > > VPMINU %YMM2, %YMM1, %YMM2 > > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ > > - VPCMP $0, %YMMZERO, %YMM2, %k0 > > + VPTESTN %YMM2, %YMM2, %k0 > > kmovd %k0, %eax > > testl %eax, %eax > > jnz L(first_vec_x3) > > @@ -224,7 +234,7 @@ L(cross_page_continue): > > /* Each bit in K0 represents a CHAR in YMM1. */ > > VPCMP $0, %YMM1, %YMM0, %k0 > > /* Each bit in K1 represents a CHAR in YMM1. */ > > - VPCMP $0, %YMM1, %YMMZERO, %k1 > > + VPTESTN %YMM1, %YMM1, %k1 > > kortestd %k0, %k1 > > jnz L(first_vec_x4) > > > > @@ -265,33 +275,33 @@ L(loop_4x_vec): > > VPMINU %YMM3, %YMM4, %YMM4 > > VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} > > > > - VPCMP $0, %YMMZERO, %YMM4, %k1 > > + VPTESTN %YMM4, %YMM4, %k1 > > kmovd %k1, %ecx > > subq $-(VEC_SIZE * 4), %rdi > > testl %ecx, %ecx > > jz L(loop_4x_vec) > > > > - VPCMP $0, %YMMZERO, %YMM1, %k0 > > + VPTESTN %YMM1, %YMM1, %k0 > > kmovd %k0, %eax > > testl %eax, %eax > > jnz L(last_vec_x1) > > > > - VPCMP $0, %YMMZERO, %YMM2, %k0 > > + VPTESTN %YMM2, %YMM2, %k0 > > kmovd %k0, %eax > > testl %eax, %eax > > jnz L(last_vec_x2) > > > > - VPCMP $0, %YMMZERO, %YMM3, %k0 > > + VPTESTN %YMM3, %YMM3, %k0 > > kmovd %k0, %eax > > /* Combine YMM3 matches (eax) with YMM4 matches (ecx). */ > > # ifdef USE_AS_WCSCHR > > sall $8, %ecx > > orl %ecx, %eax > > - tzcntl %eax, %eax > > + bsfl %eax, %eax > > # else > > salq $32, %rcx > > orq %rcx, %rax > > - tzcntq %rax, %rax > > + bsfq %rax, %rax > > # endif > > # ifndef USE_AS_STRCHRNUL > > /* Check if match was CHAR or null. 
*/ > > @@ -303,28 +313,28 @@ L(loop_4x_vec): > > leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > ret > > > > -# ifndef USE_AS_STRCHRNUL > > -L(zero_end): > > - xorl %eax, %eax > > - ret > > + .p2align 4,, 8 > > +L(last_vec_x1): > > + bsfl %eax, %eax > > +# ifdef USE_AS_WCSCHR > > + /* NB: Multiply wchar_t count by 4 to get the number of bytes. > > + */ > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + addq %rdi, %rax > > # endif > > > > - .p2align 4 > > -L(last_vec_x1): > > - tzcntl %eax, %eax > > # ifndef USE_AS_STRCHRNUL > > /* Check if match was null. */ > > - cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG > > + cmp (%rax), %CHAR_REG > > jne L(zero_end) > > # endif > > - /* NB: Multiply sizeof char type (1 or 4) to get the number of > > - bytes. */ > > - leaq (%rdi, %rax, CHAR_SIZE), %rax > > + > > ret > > > > - .p2align 4 > > + .p2align 4,, 8 > > L(last_vec_x2): > > - tzcntl %eax, %eax > > + bsfl %eax, %eax > > # ifndef USE_AS_STRCHRNUL > > /* Check if match was null. */ > > cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG > > @@ -336,7 +346,7 @@ L(last_vec_x2): > > ret > > > > /* Cold case for crossing page with first load. */ > > - .p2align 4 > > + .p2align 4,, 8 > > L(cross_page_boundary): > > movq %rdi, %rdx > > /* Align rdi. */ > > @@ -346,9 +356,9 @@ L(cross_page_boundary): > > vpxorq %YMM1, %YMM0, %YMM2 > > VPMINU %YMM2, %YMM1, %YMM2 > > /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ > > - VPCMP $0, %YMMZERO, %YMM2, %k0 > > + VPTESTN %YMM2, %YMM2, %k0 > > kmovd %k0, %eax > > - /* Remove the leading bits. */ > > + /* Remove the leading bits. */ > > # ifdef USE_AS_WCSCHR > > movl %edx, %SHIFT_REG > > /* NB: Divide shift count by 4 since each bit in K1 represent 4 > > @@ -360,20 +370,24 @@ L(cross_page_boundary): > > /* If eax is zero continue. */ > > testl %eax, %eax > > jz L(cross_page_continue) > > - tzcntl %eax, %eax > > -# ifndef USE_AS_STRCHRNUL > > - /* Check to see if match was CHAR or null. */ > > - cmp (%rdx, %rax, CHAR_SIZE), %CHAR_REG > > - jne L(zero_end) > > -# endif > > + bsfl %eax, %eax > > + > > # ifdef USE_AS_WCSCHR > > /* NB: Multiply wchar_t count by 4 to get the number of > > bytes. */ > > leaq (%rdx, %rax, CHAR_SIZE), %rax > > # else > > addq %rdx, %rax > > +# endif > > +# ifndef USE_AS_STRCHRNUL > > + /* Check to see if match was CHAR or null. */ > > + cmp (%rax), %CHAR_REG > > + je L(cross_page_ret) > > +L(zero_end): > > + xorl %eax, %eax > > +L(cross_page_ret): > > # endif > > ret > > > > END (STRCHR) > > -# endif > > +#endif > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
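A side note on the VPTESTN substitution in the patch above: vptestnmb of a register with itself sets a mask bit exactly where a byte is zero, so it replaces the VPCMP-against-zero form and lets the zeroed YMMZERO register be dropped. A rough intrinsics sketch of the equivalence, assuming AVX512BW/AVX512VL and using invented helper names:

    #include <immintrin.h>

    /* vptestnmb k, x, x: bit i of the mask is set iff byte i of x is
       zero, the same result as comparing x against an all-zero vector.  */
    static inline __mmask32
    zero_mask_testn (__m256i x)
    {
      return _mm256_testn_epi8_mask (x, x);
    }

    /* The form the old code used: an explicit compare against zero,
       which needs a zeroed register as the second operand.  */
    static inline __mmask32
    zero_mask_cmp (__m256i x)
    {
      return _mm256_cmpeq_epi8_mask (x, _mm256_setzero_si256 ());
    }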
* Re: [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c [not found] ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com> @ 2022-05-12 19:34 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:34 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 11:57 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of > > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary > > sign extensions. > > > > geometric_mean(N=20) of all benchmarks that dont fallback on > > sse2/strlen; New / Original: .928 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=20 runs; All functions page aligned > > len, align1, align2, pos, New Time / Old Time > > 0, 0, 0, 512, 1.207 > > 1, 0, 0, 512, 1.039 > > 1, 1, 0, 512, 0.997 > > 1, 0, 1, 512, 0.981 > > 1, 1, 1, 512, 0.977 > > 2, 0, 0, 512, 1.02 > > 2, 2, 0, 512, 0.979 > > 2, 0, 2, 512, 0.902 > > 2, 2, 2, 512, 0.958 > > 3, 0, 0, 512, 0.978 > > 3, 3, 0, 512, 0.988 > > 3, 0, 3, 512, 0.979 > > 3, 3, 3, 512, 0.955 > > 4, 0, 0, 512, 0.969 > > 4, 4, 0, 512, 0.991 > > 4, 0, 4, 512, 0.94 > > 4, 4, 4, 512, 0.958 > > 5, 0, 0, 512, 0.963 > > 5, 5, 0, 512, 1.004 > > 5, 0, 5, 512, 0.948 > > 5, 5, 5, 512, 0.971 > > 6, 0, 0, 512, 0.933 > > 6, 6, 0, 512, 1.007 > > 6, 0, 6, 512, 0.921 > > 6, 6, 6, 512, 0.969 > > 7, 0, 0, 512, 0.928 > > 7, 7, 0, 512, 0.976 > > 7, 0, 7, 512, 0.932 > > 7, 7, 7, 512, 0.995 > > 8, 0, 0, 512, 0.931 > > 8, 0, 8, 512, 0.766 > > 9, 0, 0, 512, 0.965 > > 9, 1, 0, 512, 0.999 > > 9, 0, 9, 512, 0.765 > > 9, 1, 9, 512, 0.97 > > 10, 0, 0, 512, 0.976 > > 10, 2, 0, 512, 0.991 > > 10, 0, 10, 512, 0.768 > > 10, 2, 10, 512, 0.926 > > 11, 0, 0, 512, 0.958 > > 11, 3, 0, 512, 1.006 > > 11, 0, 11, 512, 0.768 > > 11, 3, 11, 512, 0.908 > > 12, 0, 0, 512, 0.945 > > 12, 4, 0, 512, 0.896 > > 12, 0, 12, 512, 0.764 > > 12, 4, 12, 512, 0.785 > > 13, 0, 0, 512, 0.957 > > 13, 5, 0, 512, 1.019 > > 13, 0, 13, 512, 0.76 > > 13, 5, 13, 512, 0.785 > > 14, 0, 0, 512, 0.918 > > 14, 6, 0, 512, 1.004 > > 14, 0, 14, 512, 0.78 > > 14, 6, 14, 512, 0.711 > > 15, 0, 0, 512, 0.855 > > 15, 7, 0, 512, 0.985 > > 15, 0, 15, 512, 0.779 > > 15, 7, 15, 512, 0.772 > > 16, 0, 0, 512, 0.987 > > 16, 0, 16, 512, 0.99 > > 17, 0, 0, 512, 0.996 > > 17, 1, 0, 512, 0.979 > > 17, 0, 17, 512, 1.001 > > 17, 1, 17, 512, 1.03 > > 18, 0, 0, 512, 0.976 > > 18, 2, 0, 512, 0.989 > > 18, 0, 18, 512, 0.976 > > 18, 2, 18, 512, 0.992 > > 19, 0, 0, 512, 0.991 > > 19, 3, 0, 512, 0.988 > > 19, 0, 19, 512, 1.009 > > 19, 3, 19, 512, 1.018 > > 20, 0, 0, 512, 0.999 > > 20, 4, 0, 512, 1.005 > > 20, 0, 20, 512, 0.993 > > 20, 4, 20, 512, 0.983 > > 21, 0, 0, 512, 0.982 > > 21, 5, 0, 512, 0.988 > > 21, 0, 21, 512, 0.978 > > 21, 5, 21, 512, 0.984 > > 22, 0, 0, 512, 0.988 > > 22, 6, 0, 512, 0.979 > > 22, 0, 22, 512, 0.984 > > 22, 6, 22, 512, 0.983 > > 23, 0, 0, 512, 0.996 > > 23, 7, 0, 512, 0.998 > > 23, 0, 23, 512, 0.979 > > 23, 7, 23, 512, 0.987 > > 24, 0, 0, 512, 0.99 > > 24, 0, 24, 512, 0.979 > > 25, 0, 0, 512, 0.985 > > 25, 1, 0, 512, 0.988 > > 25, 0, 25, 512, 0.99 > > 25, 1, 25, 512, 0.986 > > 26, 0, 0, 512, 1.005 > > 26, 2, 0, 512, 0.995 > > 26, 0, 26, 512, 0.992 > > 26, 2, 26, 512, 0.983 > > 27, 0, 0, 512, 0.986 > > 27, 3, 0, 512, 0.978 > > 27, 0, 27, 512, 0.986 > > 27, 3, 27, 512, 0.973 > > 28, 0, 0, 512, 
0.995 > > 28, 4, 0, 512, 0.993 > > 28, 0, 28, 512, 0.983 > > 28, 4, 28, 512, 1.005 > > 29, 0, 0, 512, 0.983 > > 29, 5, 0, 512, 0.982 > > 29, 0, 29, 512, 0.984 > > 29, 5, 29, 512, 1.005 > > 30, 0, 0, 512, 0.978 > > 30, 6, 0, 512, 0.985 > > 30, 0, 30, 512, 0.994 > > 30, 6, 30, 512, 0.993 > > 31, 0, 0, 512, 0.984 > > 31, 7, 0, 512, 0.983 > > 31, 0, 31, 512, 1.0 > > 31, 7, 31, 512, 1.031 > > 4, 0, 0, 32, 0.916 > > 4, 1, 0, 32, 0.952 > > 4, 0, 1, 32, 0.927 > > 4, 1, 1, 32, 0.969 > > 4, 0, 0, 64, 0.961 > > 4, 2, 0, 64, 0.955 > > 4, 0, 2, 64, 0.975 > > 4, 2, 2, 64, 0.972 > > 4, 0, 0, 128, 0.971 > > 4, 3, 0, 128, 0.982 > > 4, 0, 3, 128, 0.945 > > 4, 3, 3, 128, 0.971 > > 4, 0, 0, 256, 1.004 > > 4, 4, 0, 256, 0.966 > > 4, 0, 4, 256, 0.961 > > 4, 4, 4, 256, 0.971 > > 4, 5, 0, 512, 0.929 > > 4, 0, 5, 512, 0.969 > > 4, 5, 5, 512, 0.985 > > 4, 0, 0, 1024, 1.003 > > 4, 6, 0, 1024, 1.009 > > 4, 0, 6, 1024, 1.005 > > 4, 6, 6, 1024, 0.999 > > 4, 0, 0, 2048, 0.917 > > 4, 7, 0, 2048, 1.015 > > 4, 0, 7, 2048, 1.011 > > 4, 7, 7, 2048, 0.907 > > 10, 1, 0, 64, 0.964 > > 10, 1, 1, 64, 0.966 > > 10, 2, 0, 64, 0.953 > > 10, 2, 2, 64, 0.972 > > 10, 3, 0, 64, 0.962 > > 10, 3, 3, 64, 0.969 > > 10, 4, 0, 64, 0.957 > > 10, 4, 4, 64, 0.969 > > 10, 5, 0, 64, 0.961 > > 10, 5, 5, 64, 0.965 > > 10, 6, 0, 64, 0.949 > > 10, 6, 6, 64, 0.9 > > 10, 7, 0, 64, 0.957 > > 10, 7, 7, 64, 0.897 > > 6, 0, 0, 0, 0.991 > > 6, 0, 0, 1, 1.011 > > 6, 0, 1, 1, 0.939 > > 6, 0, 0, 2, 1.016 > > 6, 0, 2, 2, 0.94 > > 6, 0, 0, 3, 1.019 > > 6, 0, 3, 3, 0.941 > > 6, 0, 0, 4, 1.056 > > 6, 0, 4, 4, 0.884 > > 6, 0, 0, 5, 0.977 > > 6, 0, 5, 5, 0.934 > > 6, 0, 0, 6, 0.954 > > 6, 0, 6, 6, 0.93 > > 6, 0, 0, 7, 0.963 > > 6, 0, 7, 7, 0.916 > > 6, 0, 0, 8, 0.963 > > 6, 0, 8, 8, 0.945 > > 6, 0, 0, 9, 1.028 > > 6, 0, 9, 9, 0.942 > > 6, 0, 0, 10, 0.955 > > 6, 0, 10, 10, 0.831 > > 6, 0, 0, 11, 0.948 > > 6, 0, 11, 11, 0.82 > > 6, 0, 0, 12, 1.033 > > 6, 0, 12, 12, 0.873 > > 6, 0, 0, 13, 0.983 > > 6, 0, 13, 13, 0.852 > > 6, 0, 0, 14, 0.984 > > 6, 0, 14, 14, 0.853 > > 6, 0, 0, 15, 0.984 > > 6, 0, 15, 15, 0.882 > > 6, 0, 0, 16, 0.971 > > 6, 0, 16, 16, 0.958 > > 6, 0, 0, 17, 0.938 > > 6, 0, 17, 17, 0.947 > > 6, 0, 0, 18, 0.96 > > 6, 0, 18, 18, 0.938 > > 6, 0, 0, 19, 0.903 > > 6, 0, 19, 19, 0.943 > > 6, 0, 0, 20, 0.947 > > 6, 0, 20, 20, 0.951 > > 6, 0, 0, 21, 0.948 > > 6, 0, 21, 21, 0.96 > > 6, 0, 0, 22, 0.926 > > 6, 0, 22, 22, 0.951 > > 6, 0, 0, 23, 0.923 > > 6, 0, 23, 23, 0.959 > > 6, 0, 0, 24, 0.918 > > 6, 0, 24, 24, 0.952 > > 6, 0, 0, 25, 0.97 > > 6, 0, 25, 25, 0.952 > > 6, 0, 0, 26, 0.871 > > 6, 0, 26, 26, 0.869 > > 6, 0, 0, 27, 0.935 > > 6, 0, 27, 27, 0.836 > > 6, 0, 0, 28, 0.936 > > 6, 0, 28, 28, 0.857 > > 6, 0, 0, 29, 0.876 > > 6, 0, 29, 29, 0.859 > > 6, 0, 0, 30, 0.934 > > 6, 0, 30, 30, 0.857 > > 6, 0, 0, 31, 0.962 > > 6, 0, 31, 31, 0.86 > > 6, 0, 0, 32, 0.912 > > 6, 0, 32, 32, 0.94 > > 6, 0, 0, 33, 0.903 > > 6, 0, 33, 33, 0.968 > > 6, 0, 0, 34, 0.913 > > 6, 0, 34, 34, 0.896 > > 6, 0, 0, 35, 0.904 > > 6, 0, 35, 35, 0.913 > > 6, 0, 0, 36, 0.905 > > 6, 0, 36, 36, 0.907 > > 6, 0, 0, 37, 0.899 > > 6, 0, 37, 37, 0.9 > > 6, 0, 0, 38, 0.912 > > 6, 0, 38, 38, 0.919 > > 6, 0, 0, 39, 0.925 > > 6, 0, 39, 39, 0.927 > > 6, 0, 0, 40, 0.923 > > 6, 0, 40, 40, 0.972 > > 6, 0, 0, 41, 0.92 > > 6, 0, 41, 41, 0.966 > > 6, 0, 0, 42, 0.915 > > 6, 0, 42, 42, 0.834 > > 6, 0, 0, 43, 0.92 > > 6, 0, 43, 43, 0.856 > > 6, 0, 0, 44, 0.908 > > 6, 0, 44, 44, 0.858 > > 6, 0, 0, 45, 0.932 > > 6, 0, 45, 45, 0.847 > > 6, 0, 0, 46, 0.927 > > 6, 0, 46, 46, 0.859 > > 6, 0, 0, 47, 0.902 > > 6, 0, 47, 47, 
0.855 > > 6, 0, 0, 48, 0.949 > > 6, 0, 48, 48, 0.934 > > 6, 0, 0, 49, 0.907 > > 6, 0, 49, 49, 0.943 > > 6, 0, 0, 50, 0.934 > > 6, 0, 50, 50, 0.943 > > 6, 0, 0, 51, 0.933 > > 6, 0, 51, 51, 0.939 > > 6, 0, 0, 52, 0.944 > > 6, 0, 52, 52, 0.944 > > 6, 0, 0, 53, 0.939 > > 6, 0, 53, 53, 0.938 > > 6, 0, 0, 54, 0.9 > > 6, 0, 54, 54, 0.923 > > 6, 0, 0, 55, 0.9 > > 6, 0, 55, 55, 0.927 > > 6, 0, 0, 56, 0.9 > > 6, 0, 56, 56, 0.917 > > 6, 0, 0, 57, 0.9 > > 6, 0, 57, 57, 0.916 > > 6, 0, 0, 58, 0.914 > > 6, 0, 58, 58, 0.784 > > 6, 0, 0, 59, 0.863 > > 6, 0, 59, 59, 0.846 > > 6, 0, 0, 60, 0.88 > > 6, 0, 60, 60, 0.827 > > 6, 0, 0, 61, 0.896 > > 6, 0, 61, 61, 0.847 > > 6, 0, 0, 62, 0.894 > > 6, 0, 62, 62, 0.865 > > 6, 0, 0, 63, 0.934 > > 6, 0, 63, 63, 0.866 > > > > sysdeps/x86_64/multiarch/strcspn-c.c | 83 +++++++++++++--------------- > > 1 file changed, 37 insertions(+), 46 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/strcspn-c.c b/sysdeps/x86_64/multiarch/strcspn-c.c > > index 013aebf797..c312fab8b1 100644 > > --- a/sysdeps/x86_64/multiarch/strcspn-c.c > > +++ b/sysdeps/x86_64/multiarch/strcspn-c.c > > @@ -84,83 +84,74 @@ STRCSPN_SSE42 (const char *s, const char *a) > > RETURN (NULL, strlen (s)); > > > > const char *aligned; > > - __m128i mask; > > - int offset = (int) ((size_t) a & 15); > > + __m128i mask, maskz, zero; > > + unsigned int maskz_bits; > > + unsigned int offset = (unsigned int) ((size_t) a & 15); > > + zero = _mm_set1_epi8 (0); > > if (offset != 0) > > { > > /* Load masks. */ > > aligned = (const char *) ((size_t) a & -16L); > > __m128i mask0 = _mm_load_si128 ((__m128i *) aligned); > > - > > - mask = __m128i_shift_right (mask0, offset); > > + maskz = _mm_cmpeq_epi8 (mask0, zero); > > > > /* Find where the NULL terminator is. */ > > - int length = _mm_cmpistri (mask, mask, 0x3a); > > - if (length == 16 - offset) > > - { > > - /* There is no NULL terminator. */ > > - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16)); > > - int index = _mm_cmpistri (mask1, mask1, 0x3a); > > - length += index; > > - > > - /* Don't use SSE4.2 if the length of A > 16. */ > > - if (length > 16) > > - return STRCSPN_SSE2 (s, a); > > - > > - if (index != 0) > > - { > > - /* Combine mask0 and mask1. We could play games with > > - palignr, but frankly this data should be in L1 now > > - so do the merge via an unaligned load. */ > > - mask = _mm_loadu_si128 ((__m128i *) a); > > - } > > - } > > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset; > > + if (maskz_bits != 0) > > + { > > + mask = __m128i_shift_right (mask0, offset); > > + offset = (unsigned int) ((size_t) s & 15); > > + if (offset) > > + goto start_unaligned; > > + > > + aligned = s; > > + goto start_loop; > > + } > > } > > - else > > - { > > - /* A is aligned. */ > > - mask = _mm_load_si128 ((__m128i *) a); > > > > - /* Find where the NULL terminator is. */ > > - int length = _mm_cmpistri (mask, mask, 0x3a); > > - if (length == 16) > > - { > > - /* There is no NULL terminator. Don't use SSE4.2 if the length > > - of A > 16. */ > > - if (a[16] != 0) > > - return STRCSPN_SSE2 (s, a); > > - } > > + /* A is aligned. */ > > + mask = _mm_loadu_si128 ((__m128i *) a); > > + /* Find where the NULL terminator is. */ > > + maskz = _mm_cmpeq_epi8 (mask, zero); > > + maskz_bits = _mm_movemask_epi8 (maskz); > > + if (maskz_bits == 0) > > + { > > + /* There is no NULL terminator. Don't use SSE4.2 if the length > > + of A > 16. 
*/ > > + if (a[16] != 0) > > + return STRCSPN_SSE2 (s, a); > > } > > > > - offset = (int) ((size_t) s & 15); > > + aligned = s; > > + offset = (unsigned int) ((size_t) s & 15); > > if (offset != 0) > > { > > + start_unaligned: > > /* Check partial string. */ > > aligned = (const char *) ((size_t) s & -16L); > > __m128i value = _mm_load_si128 ((__m128i *) aligned); > > > > value = __m128i_shift_right (value, offset); > > > > - int length = _mm_cmpistri (mask, value, 0x2); > > + unsigned int length = _mm_cmpistri (mask, value, 0x2); > > /* No need to check ZFlag since ZFlag is always 1. */ > > - int cflag = _mm_cmpistrc (mask, value, 0x2); > > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x2); > > if (cflag) > > RETURN ((char *) (s + length), length); > > /* Find where the NULL terminator is. */ > > - int index = _mm_cmpistri (value, value, 0x3a); > > + unsigned int index = _mm_cmpistri (value, value, 0x3a); > > if (index < 16 - offset) > > RETURN (NULL, index); > > aligned += 16; > > } > > - else > > - aligned = s; > > > > +start_loop: > > while (1) > > { > > __m128i value = _mm_load_si128 ((__m128i *) aligned); > > - int index = _mm_cmpistri (mask, value, 0x2); > > - int cflag = _mm_cmpistrc (mask, value, 0x2); > > - int zflag = _mm_cmpistrz (mask, value, 0x2); > > + unsigned int index = _mm_cmpistri (mask, value, 0x2); > > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x2); > > + unsigned int zflag = _mm_cmpistrz (mask, value, 0x2); > > if (cflag) > > RETURN ((char *) (aligned + index), (size_t) (aligned + index - s)); > > if (zflag) > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
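The central change in strcspn-c.c above is to detect the NUL terminator with SSE2 _mm_cmpeq_epi8 plus _mm_movemask_epi8 instead of an SSE4.2 _mm_cmpistri self-compare, and to keep the offset unsigned so it feeds address arithmetic without sign extension. A small sketch of that detection step, under the same no-page-cross assumption the real code relies on (the helper name is made up):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Load the aligned 16-byte block containing P, compare every byte
       against zero, and shift the resulting bitmask so that bit 0
       corresponds to P itself.  The result is nonzero iff a NUL byte
       occurs among the first (16 - offset) bytes at P.  The aligned
       load cannot cross a page boundary.  */
    static inline unsigned int
    nul_mask_from (const char *p)
    {
      unsigned int offset = (unsigned int) ((size_t) p & 15);
      const __m128i *block = (const __m128i *) ((size_t) p & -16L);
      __m128i v = _mm_load_si128 (block);
      __m128i z = _mm_cmpeq_epi8 (v, _mm_setzero_si128 ());
      return ((unsigned int) _mm_movemask_epi8 (z)) >> offset;
    }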
* Re: [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c [not found] ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com> @ 2022-05-12 19:39 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:39 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 11:58 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 2:59 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Use _mm_cmpeq_epi8 and _mm_movemask_epi8 to get strlen instead of > > _mm_cmpistri. Also change offset to unsigned to avoid unnecessary > > sign extensions. > > > > geometric_mean(N=20) of all benchmarks that dont fallback on > > sse2; New / Original: .901 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=20 runs; All functions page aligned > > len, align1, align2, pos, New Time / Old Time > > 1, 0, 0, 512, 0.768 > > 1, 1, 0, 512, 0.666 > > 1, 0, 1, 512, 1.193 > > 1, 1, 1, 512, 0.872 > > 2, 0, 0, 512, 0.698 > > 2, 2, 0, 512, 0.687 > > 2, 0, 2, 512, 1.393 > > 2, 2, 2, 512, 0.944 > > 3, 0, 0, 512, 0.691 > > 3, 3, 0, 512, 0.676 > > 3, 0, 3, 512, 1.388 > > 3, 3, 3, 512, 0.948 > > 4, 0, 0, 512, 0.74 > > 4, 4, 0, 512, 0.678 > > 4, 0, 4, 512, 1.421 > > 4, 4, 4, 512, 0.943 > > 5, 0, 0, 512, 0.691 > > 5, 5, 0, 512, 0.675 > > 5, 0, 5, 512, 1.348 > > 5, 5, 5, 512, 0.952 > > 6, 0, 0, 512, 0.685 > > 6, 6, 0, 512, 0.67 > > 6, 0, 6, 512, 1.333 > > 6, 6, 6, 512, 0.95 > > 7, 0, 0, 512, 0.688 > > 7, 7, 0, 512, 0.675 > > 7, 0, 7, 512, 1.344 > > 7, 7, 7, 512, 0.919 > > 8, 0, 0, 512, 0.716 > > 8, 0, 8, 512, 0.935 > > 9, 0, 0, 512, 0.716 > > 9, 1, 0, 512, 0.712 > > 9, 0, 9, 512, 0.956 > > 9, 1, 9, 512, 0.992 > > 10, 0, 0, 512, 0.699 > > 10, 2, 0, 512, 0.68 > > 10, 0, 10, 512, 0.952 > > 10, 2, 10, 512, 0.932 > > 11, 0, 0, 512, 0.705 > > 11, 3, 0, 512, 0.685 > > 11, 0, 11, 512, 0.956 > > 11, 3, 11, 512, 0.927 > > 12, 0, 0, 512, 0.695 > > 12, 4, 0, 512, 0.675 > > 12, 0, 12, 512, 0.948 > > 12, 4, 12, 512, 0.928 > > 13, 0, 0, 512, 0.7 > > 13, 5, 0, 512, 0.678 > > 13, 0, 13, 512, 0.944 > > 13, 5, 13, 512, 0.931 > > 14, 0, 0, 512, 0.703 > > 14, 6, 0, 512, 0.678 > > 14, 0, 14, 512, 0.949 > > 14, 6, 14, 512, 0.93 > > 15, 0, 0, 512, 0.694 > > 15, 7, 0, 512, 0.678 > > 15, 0, 15, 512, 0.953 > > 15, 7, 15, 512, 0.924 > > 16, 0, 0, 512, 1.021 > > 16, 0, 16, 512, 1.067 > > 17, 0, 0, 512, 0.991 > > 17, 1, 0, 512, 0.984 > > 17, 0, 17, 512, 0.979 > > 17, 1, 17, 512, 0.993 > > 18, 0, 0, 512, 0.992 > > 18, 2, 0, 512, 1.008 > > 18, 0, 18, 512, 1.016 > > 18, 2, 18, 512, 0.993 > > 19, 0, 0, 512, 0.984 > > 19, 3, 0, 512, 0.985 > > 19, 0, 19, 512, 1.007 > > 19, 3, 19, 512, 1.006 > > 20, 0, 0, 512, 0.969 > > 20, 4, 0, 512, 0.968 > > 20, 0, 20, 512, 0.975 > > 20, 4, 20, 512, 0.975 > > 21, 0, 0, 512, 0.992 > > 21, 5, 0, 512, 0.992 > > 21, 0, 21, 512, 0.98 > > 21, 5, 21, 512, 0.97 > > 22, 0, 0, 512, 0.989 > > 22, 6, 0, 512, 0.987 > > 22, 0, 22, 512, 0.99 > > 22, 6, 22, 512, 0.985 > > 23, 0, 0, 512, 0.989 > > 23, 7, 0, 512, 0.98 > > 23, 0, 23, 512, 1.0 > > 23, 7, 23, 512, 0.993 > > 24, 0, 0, 512, 0.99 > > 24, 0, 24, 512, 0.998 > > 25, 0, 0, 512, 1.01 > > 25, 1, 0, 512, 1.0 > > 25, 0, 25, 512, 0.97 > > 25, 1, 25, 512, 0.967 > > 26, 0, 0, 512, 1.009 > > 26, 2, 0, 512, 0.986 > > 26, 0, 26, 512, 0.997 > > 26, 2, 26, 512, 0.993 > > 27, 0, 0, 512, 0.984 > > 27, 3, 0, 512, 0.997 > > 27, 0, 27, 512, 0.989 > > 27, 3, 27, 512, 0.976 > > 28, 0, 0, 512, 0.991 > > 28, 4, 0, 512, 1.003 > > 28, 0, 28, 512, 0.986 
> > 28, 4, 28, 512, 0.989 > > 29, 0, 0, 512, 0.986 > > 29, 5, 0, 512, 0.985 > > 29, 0, 29, 512, 0.984 > > 29, 5, 29, 512, 0.977 > > 30, 0, 0, 512, 0.991 > > 30, 6, 0, 512, 0.987 > > 30, 0, 30, 512, 0.979 > > 30, 6, 30, 512, 0.974 > > 31, 0, 0, 512, 0.995 > > 31, 7, 0, 512, 0.995 > > 31, 0, 31, 512, 0.994 > > 31, 7, 31, 512, 0.984 > > 4, 0, 0, 32, 0.861 > > 4, 1, 0, 32, 0.864 > > 4, 0, 1, 32, 0.962 > > 4, 1, 1, 32, 0.967 > > 4, 0, 0, 64, 0.884 > > 4, 2, 0, 64, 0.818 > > 4, 0, 2, 64, 0.889 > > 4, 2, 2, 64, 0.918 > > 4, 0, 0, 128, 0.942 > > 4, 3, 0, 128, 0.884 > > 4, 0, 3, 128, 0.931 > > 4, 3, 3, 128, 0.883 > > 4, 0, 0, 256, 0.964 > > 4, 4, 0, 256, 0.922 > > 4, 0, 4, 256, 0.956 > > 4, 4, 4, 256, 0.93 > > 4, 5, 0, 512, 0.833 > > 4, 0, 5, 512, 1.027 > > 4, 5, 5, 512, 0.929 > > 4, 0, 0, 1024, 0.998 > > 4, 6, 0, 1024, 0.986 > > 4, 0, 6, 1024, 0.984 > > 4, 6, 6, 1024, 0.977 > > 4, 0, 0, 2048, 0.991 > > 4, 7, 0, 2048, 0.987 > > 4, 0, 7, 2048, 0.996 > > 4, 7, 7, 2048, 0.98 > > 10, 1, 0, 64, 0.826 > > 10, 1, 1, 64, 0.907 > > 10, 2, 0, 64, 0.829 > > 10, 2, 2, 64, 0.91 > > 10, 3, 0, 64, 0.83 > > 10, 3, 3, 64, 0.915 > > 10, 4, 0, 64, 0.83 > > 10, 4, 4, 64, 0.911 > > 10, 5, 0, 64, 0.828 > > 10, 5, 5, 64, 0.905 > > 10, 6, 0, 64, 0.828 > > 10, 6, 6, 64, 0.812 > > 10, 7, 0, 64, 0.83 > > 10, 7, 7, 64, 0.819 > > 6, 0, 0, 0, 1.261 > > 6, 0, 0, 1, 1.252 > > 6, 0, 1, 1, 0.845 > > 6, 0, 0, 2, 1.27 > > 6, 0, 2, 2, 0.85 > > 6, 0, 0, 3, 1.269 > > 6, 0, 3, 3, 0.845 > > 6, 0, 0, 4, 1.287 > > 6, 0, 4, 4, 0.852 > > 6, 0, 0, 5, 1.278 > > 6, 0, 5, 5, 0.851 > > 6, 0, 0, 6, 1.269 > > 6, 0, 6, 6, 0.841 > > 6, 0, 0, 7, 1.268 > > 6, 0, 7, 7, 0.851 > > 6, 0, 0, 8, 1.291 > > 6, 0, 8, 8, 0.837 > > 6, 0, 0, 9, 1.283 > > 6, 0, 9, 9, 0.831 > > 6, 0, 0, 10, 1.252 > > 6, 0, 10, 10, 0.997 > > 6, 0, 0, 11, 1.295 > > 6, 0, 11, 11, 1.046 > > 6, 0, 0, 12, 1.296 > > 6, 0, 12, 12, 1.038 > > 6, 0, 0, 13, 1.287 > > 6, 0, 13, 13, 1.082 > > 6, 0, 0, 14, 1.284 > > 6, 0, 14, 14, 1.001 > > 6, 0, 0, 15, 1.286 > > 6, 0, 15, 15, 1.002 > > 6, 0, 0, 16, 0.894 > > 6, 0, 16, 16, 0.874 > > 6, 0, 0, 17, 0.892 > > 6, 0, 17, 17, 0.974 > > 6, 0, 0, 18, 0.907 > > 6, 0, 18, 18, 0.993 > > 6, 0, 0, 19, 0.909 > > 6, 0, 19, 19, 0.99 > > 6, 0, 0, 20, 0.894 > > 6, 0, 20, 20, 0.978 > > 6, 0, 0, 21, 0.89 > > 6, 0, 21, 21, 0.958 > > 6, 0, 0, 22, 0.893 > > 6, 0, 22, 22, 0.99 > > 6, 0, 0, 23, 0.899 > > 6, 0, 23, 23, 0.986 > > 6, 0, 0, 24, 0.893 > > 6, 0, 24, 24, 0.989 > > 6, 0, 0, 25, 0.889 > > 6, 0, 25, 25, 0.982 > > 6, 0, 0, 26, 0.889 > > 6, 0, 26, 26, 0.852 > > 6, 0, 0, 27, 0.89 > > 6, 0, 27, 27, 0.832 > > 6, 0, 0, 28, 0.89 > > 6, 0, 28, 28, 0.831 > > 6, 0, 0, 29, 0.89 > > 6, 0, 29, 29, 0.838 > > 6, 0, 0, 30, 0.907 > > 6, 0, 30, 30, 0.833 > > 6, 0, 0, 31, 0.888 > > 6, 0, 31, 31, 0.837 > > 6, 0, 0, 32, 0.853 > > 6, 0, 32, 32, 0.828 > > 6, 0, 0, 33, 0.857 > > 6, 0, 33, 33, 0.947 > > 6, 0, 0, 34, 0.847 > > 6, 0, 34, 34, 0.954 > > 6, 0, 0, 35, 0.841 > > 6, 0, 35, 35, 0.94 > > 6, 0, 0, 36, 0.854 > > 6, 0, 36, 36, 0.958 > > 6, 0, 0, 37, 0.856 > > 6, 0, 37, 37, 0.957 > > 6, 0, 0, 38, 0.839 > > 6, 0, 38, 38, 0.962 > > 6, 0, 0, 39, 0.866 > > 6, 0, 39, 39, 0.945 > > 6, 0, 0, 40, 0.845 > > 6, 0, 40, 40, 0.961 > > 6, 0, 0, 41, 0.858 > > 6, 0, 41, 41, 0.961 > > 6, 0, 0, 42, 0.862 > > 6, 0, 42, 42, 0.825 > > 6, 0, 0, 43, 0.864 > > 6, 0, 43, 43, 0.82 > > 6, 0, 0, 44, 0.843 > > 6, 0, 44, 44, 0.81 > > 6, 0, 0, 45, 0.859 > > 6, 0, 45, 45, 0.816 > > 6, 0, 0, 46, 0.866 > > 6, 0, 46, 46, 0.81 > > 6, 0, 0, 47, 0.858 > > 6, 0, 47, 47, 0.807 > > 6, 0, 0, 48, 0.87 > > 6, 0, 48, 48, 0.87 > > 6, 
0, 0, 49, 0.871 > > 6, 0, 49, 49, 0.874 > > 6, 0, 0, 50, 0.87 > > 6, 0, 50, 50, 0.881 > > 6, 0, 0, 51, 0.868 > > 6, 0, 51, 51, 0.875 > > 6, 0, 0, 52, 0.873 > > 6, 0, 52, 52, 0.871 > > 6, 0, 0, 53, 0.866 > > 6, 0, 53, 53, 0.882 > > 6, 0, 0, 54, 0.863 > > 6, 0, 54, 54, 0.876 > > 6, 0, 0, 55, 0.851 > > 6, 0, 55, 55, 0.871 > > 6, 0, 0, 56, 0.867 > > 6, 0, 56, 56, 0.888 > > 6, 0, 0, 57, 0.862 > > 6, 0, 57, 57, 0.899 > > 6, 0, 0, 58, 0.873 > > 6, 0, 58, 58, 0.798 > > 6, 0, 0, 59, 0.881 > > 6, 0, 59, 59, 0.785 > > 6, 0, 0, 60, 0.867 > > 6, 0, 60, 60, 0.797 > > 6, 0, 0, 61, 0.872 > > 6, 0, 61, 61, 0.791 > > 6, 0, 0, 62, 0.859 > > 6, 0, 62, 62, 0.79 > > 6, 0, 0, 63, 0.87 > > 6, 0, 63, 63, 0.796 > > > > sysdeps/x86_64/multiarch/strspn-c.c | 86 +++++++++++++---------------- > > 1 file changed, 39 insertions(+), 47 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/strspn-c.c b/sysdeps/x86_64/multiarch/strspn-c.c > > index 8fb3aba64d..6124033ceb 100644 > > --- a/sysdeps/x86_64/multiarch/strspn-c.c > > +++ b/sysdeps/x86_64/multiarch/strspn-c.c > > @@ -62,81 +62,73 @@ __strspn_sse42 (const char *s, const char *a) > > return 0; > > > > const char *aligned; > > - __m128i mask; > > - int offset = (int) ((size_t) a & 15); > > + __m128i mask, maskz, zero; > > + unsigned int maskz_bits; > > + unsigned int offset = (int) ((size_t) a & 15); > > + zero = _mm_set1_epi8 (0); > > if (offset != 0) > > { > > /* Load masks. */ > > aligned = (const char *) ((size_t) a & -16L); > > __m128i mask0 = _mm_load_si128 ((__m128i *) aligned); > > - > > - mask = __m128i_shift_right (mask0, offset); > > + maskz = _mm_cmpeq_epi8 (mask0, zero); > > > > /* Find where the NULL terminator is. */ > > - int length = _mm_cmpistri (mask, mask, 0x3a); > > - if (length == 16 - offset) > > - { > > - /* There is no NULL terminator. */ > > - __m128i mask1 = _mm_load_si128 ((__m128i *) (aligned + 16)); > > - int index = _mm_cmpistri (mask1, mask1, 0x3a); > > - length += index; > > - > > - /* Don't use SSE4.2 if the length of A > 16. */ > > - if (length > 16) > > - return __strspn_sse2 (s, a); > > - > > - if (index != 0) > > - { > > - /* Combine mask0 and mask1. We could play games with > > - palignr, but frankly this data should be in L1 now > > - so do the merge via an unaligned load. */ > > - mask = _mm_loadu_si128 ((__m128i *) a); > > - } > > - } > > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset; > > + if (maskz_bits != 0) > > + { > > + mask = __m128i_shift_right (mask0, offset); > > + offset = (unsigned int) ((size_t) s & 15); > > + if (offset) > > + goto start_unaligned; > > + > > + aligned = s; > > + goto start_loop; > > + } > > } > > - else > > - { > > - /* A is aligned. */ > > - mask = _mm_load_si128 ((__m128i *) a); > > > > - /* Find where the NULL terminator is. */ > > - int length = _mm_cmpistri (mask, mask, 0x3a); > > - if (length == 16) > > - { > > - /* There is no NULL terminator. Don't use SSE4.2 if the length > > - of A > 16. */ > > - if (a[16] != 0) > > - return __strspn_sse2 (s, a); > > - } > > + /* A is aligned. */ > > + mask = _mm_loadu_si128 ((__m128i *) a); > > + > > + /* Find where the NULL terminator is. */ > > + maskz = _mm_cmpeq_epi8 (mask, zero); > > + maskz_bits = _mm_movemask_epi8 (maskz); > > + if (maskz_bits == 0) > > + { > > + /* There is no NULL terminator. Don't use SSE4.2 if the length > > + of A > 16. 
*/ > > + if (a[16] != 0) > > + return __strspn_sse2 (s, a); > > } > > + aligned = s; > > + offset = (unsigned int) ((size_t) s & 15); > > > > - offset = (int) ((size_t) s & 15); > > if (offset != 0) > > { > > + start_unaligned: > > /* Check partial string. */ > > aligned = (const char *) ((size_t) s & -16L); > > __m128i value = _mm_load_si128 ((__m128i *) aligned); > > + __m128i adj_value = __m128i_shift_right (value, offset); > > > > - value = __m128i_shift_right (value, offset); > > - > > - int length = _mm_cmpistri (mask, value, 0x12); > > + unsigned int length = _mm_cmpistri (mask, adj_value, 0x12); > > /* No need to check CFlag since it is always 1. */ > > if (length < 16 - offset) > > return length; > > /* Find where the NULL terminator is. */ > > - int index = _mm_cmpistri (value, value, 0x3a); > > - if (index < 16 - offset) > > + maskz = _mm_cmpeq_epi8 (value, zero); > > + maskz_bits = _mm_movemask_epi8 (maskz) >> offset; > > + if (maskz_bits != 0) > > return length; > > aligned += 16; > > } > > - else > > - aligned = s; > > > > +start_loop: > > while (1) > > { > > __m128i value = _mm_load_si128 ((__m128i *) aligned); > > - int index = _mm_cmpistri (mask, value, 0x12); > > - int cflag = _mm_cmpistrc (mask, value, 0x12); > > + unsigned int index = _mm_cmpistri (mask, value, 0x12); > > + unsigned int cflag = _mm_cmpistrc (mask, value, 0x12); > > if (cflag) > > return (size_t) (aligned + index - s); > > aligned += 16; > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
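strspn-c.c gets the same treatment: the NUL mask computed with _mm_cmpeq_epi8/_mm_movemask_epi8 decides whether the accept set, terminator included, fits in one 16-byte block, and only then is the PCMPISTRI loop used. A hedged sketch of that dispatch check, assuming a 16-byte-aligned set pointer and an invented helper name:

    #include <emmintrin.h>

    /* Nonzero iff the 16 bytes at A contain the NUL terminator, i.e.
       the accept set (terminator included) fits in this one block and
       the SSE4.2 loop can be entered without a further length check.  */
    static inline int
    set_fits_in_one_block (const char *a)
    {
      __m128i set   = _mm_load_si128 ((const __m128i *) a);
      __m128i zeros = _mm_cmpeq_epi8 (set, _mm_setzero_si128 ());
      return _mm_movemask_epi8 (zeros) != 0;
    }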
* Re: [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation [not found] ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com> @ 2022-05-12 19:40 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:40 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 11:59 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The generic implementation is faster. > > > > geometric_mean(N=20) of all benchmarks New / Original: .678 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=20 runs; All functions page aligned > > len, align1, align2, pos, New Time / Old Time > > 0, 0, 0, 512, 0.054 > > 1, 0, 0, 512, 0.055 > > 1, 1, 0, 512, 0.051 > > 1, 0, 1, 512, 0.054 > > 1, 1, 1, 512, 0.054 > > 2, 0, 0, 512, 0.861 > > 2, 2, 0, 512, 0.861 > > 2, 0, 2, 512, 0.861 > > 2, 2, 2, 512, 0.864 > > 3, 0, 0, 512, 0.854 > > 3, 3, 0, 512, 0.848 > > 3, 0, 3, 512, 0.845 > > 3, 3, 3, 512, 0.85 > > 4, 0, 0, 512, 0.851 > > 4, 4, 0, 512, 0.85 > > 4, 0, 4, 512, 0.852 > > 4, 4, 4, 512, 0.849 > > 5, 0, 0, 512, 0.938 > > 5, 5, 0, 512, 0.94 > > 5, 0, 5, 512, 0.864 > > 5, 5, 5, 512, 0.86 > > 6, 0, 0, 512, 0.858 > > 6, 6, 0, 512, 0.869 > > 6, 0, 6, 512, 0.847 > > 6, 6, 6, 512, 0.868 > > 7, 0, 0, 512, 0.867 > > 7, 7, 0, 512, 0.861 > > 7, 0, 7, 512, 0.864 > > 7, 7, 7, 512, 0.863 > > 8, 0, 0, 512, 0.884 > > 8, 0, 8, 512, 0.884 > > 9, 0, 0, 512, 0.886 > > 9, 1, 0, 512, 0.894 > > 9, 0, 9, 512, 0.889 > > 9, 1, 9, 512, 0.886 > > 10, 0, 0, 512, 0.859 > > 10, 2, 0, 512, 0.859 > > 10, 0, 10, 512, 0.862 > > 10, 2, 10, 512, 0.861 > > 11, 0, 0, 512, 0.846 > > 11, 3, 0, 512, 0.865 > > 11, 0, 11, 512, 0.859 > > 11, 3, 11, 512, 0.862 > > 12, 0, 0, 512, 0.858 > > 12, 4, 0, 512, 0.857 > > 12, 0, 12, 512, 0.964 > > 12, 4, 12, 512, 0.876 > > 13, 0, 0, 512, 0.827 > > 13, 5, 0, 512, 0.805 > > 13, 0, 13, 512, 0.821 > > 13, 5, 13, 512, 0.825 > > 14, 0, 0, 512, 0.786 > > 14, 6, 0, 512, 0.786 > > 14, 0, 14, 512, 0.803 > > 14, 6, 14, 512, 0.783 > > 15, 0, 0, 512, 0.778 > > 15, 7, 0, 512, 0.792 > > 15, 0, 15, 512, 0.796 > > 15, 7, 15, 512, 0.799 > > 16, 0, 0, 512, 0.803 > > 16, 0, 16, 512, 0.815 > > 17, 0, 0, 512, 0.812 > > 17, 1, 0, 512, 0.826 > > 17, 0, 17, 512, 0.803 > > 17, 1, 17, 512, 0.856 > > 18, 0, 0, 512, 0.801 > > 18, 2, 0, 512, 0.886 > > 18, 0, 18, 512, 0.805 > > 18, 2, 18, 512, 0.807 > > 19, 0, 0, 512, 0.814 > > 19, 3, 0, 512, 0.804 > > 19, 0, 19, 512, 0.813 > > 19, 3, 19, 512, 0.814 > > 20, 0, 0, 512, 0.885 > > 20, 4, 0, 512, 0.799 > > 20, 0, 20, 512, 0.826 > > 20, 4, 20, 512, 0.808 > > 21, 0, 0, 512, 0.816 > > 21, 5, 0, 512, 0.824 > > 21, 0, 21, 512, 0.819 > > 21, 5, 21, 512, 0.826 > > 22, 0, 0, 512, 0.814 > > 22, 6, 0, 512, 0.824 > > 22, 0, 22, 512, 0.81 > > 22, 6, 22, 512, 0.806 > > 23, 0, 0, 512, 0.825 > > 23, 7, 0, 512, 0.829 > > 23, 0, 23, 512, 0.809 > > 23, 7, 23, 512, 0.823 > > 24, 0, 0, 512, 0.829 > > 24, 0, 24, 512, 0.823 > > 25, 0, 0, 512, 0.864 > > 25, 1, 0, 512, 0.895 > > 25, 0, 25, 512, 0.88 > > 25, 1, 25, 512, 0.848 > > 26, 0, 0, 512, 0.903 > > 26, 2, 0, 512, 0.888 > > 26, 0, 26, 512, 0.894 > > 26, 2, 26, 512, 0.89 > > 27, 0, 0, 512, 0.914 > > 27, 3, 0, 512, 0.917 > > 27, 0, 27, 512, 0.902 > > 27, 3, 27, 512, 0.887 > > 28, 0, 0, 512, 0.887 > > 28, 4, 0, 512, 0.877 > > 28, 0, 28, 512, 0.893 > > 28, 4, 28, 512, 0.866 > > 29, 0, 0, 512, 0.885 > > 29, 5, 0, 512, 0.907 > > 29, 0, 
29, 512, 0.894 > > 29, 5, 29, 512, 0.906 > > 30, 0, 0, 512, 0.88 > > 30, 6, 0, 512, 0.898 > > 30, 0, 30, 512, 0.9 > > 30, 6, 30, 512, 0.895 > > 31, 0, 0, 512, 0.893 > > 31, 7, 0, 512, 0.874 > > 31, 0, 31, 512, 0.894 > > 31, 7, 31, 512, 0.899 > > 4, 0, 0, 32, 0.618 > > 4, 1, 0, 32, 0.627 > > 4, 0, 1, 32, 0.625 > > 4, 1, 1, 32, 0.613 > > 4, 0, 0, 64, 0.913 > > 4, 2, 0, 64, 0.801 > > 4, 0, 2, 64, 0.759 > > 4, 2, 2, 64, 0.761 > > 4, 0, 0, 128, 0.822 > > 4, 3, 0, 128, 0.863 > > 4, 0, 3, 128, 0.867 > > 4, 3, 3, 128, 0.917 > > 4, 0, 0, 256, 0.816 > > 4, 4, 0, 256, 0.812 > > 4, 0, 4, 256, 0.803 > > 4, 4, 4, 256, 0.811 > > 4, 5, 0, 512, 0.848 > > 4, 0, 5, 512, 0.843 > > 4, 5, 5, 512, 0.857 > > 4, 0, 0, 1024, 0.886 > > 4, 6, 0, 1024, 0.887 > > 4, 0, 6, 1024, 0.881 > > 4, 6, 6, 1024, 0.873 > > 4, 0, 0, 2048, 0.892 > > 4, 7, 0, 2048, 0.894 > > 4, 0, 7, 2048, 0.89 > > 4, 7, 7, 2048, 0.874 > > 10, 1, 0, 64, 0.946 > > 10, 1, 1, 64, 0.81 > > 10, 2, 0, 64, 0.804 > > 10, 2, 2, 64, 0.82 > > 10, 3, 0, 64, 0.772 > > 10, 3, 3, 64, 0.772 > > 10, 4, 0, 64, 0.748 > > 10, 4, 4, 64, 0.751 > > 10, 5, 0, 64, 0.76 > > 10, 5, 5, 64, 0.76 > > 10, 6, 0, 64, 0.726 > > 10, 6, 6, 64, 0.718 > > 10, 7, 0, 64, 0.724 > > 10, 7, 7, 64, 0.72 > > 6, 0, 0, 0, 0.415 > > 6, 0, 0, 1, 0.423 > > 6, 0, 1, 1, 0.412 > > 6, 0, 0, 2, 0.433 > > 6, 0, 2, 2, 0.434 > > 6, 0, 0, 3, 0.427 > > 6, 0, 3, 3, 0.428 > > 6, 0, 0, 4, 0.465 > > 6, 0, 4, 4, 0.466 > > 6, 0, 0, 5, 0.463 > > 6, 0, 5, 5, 0.468 > > 6, 0, 0, 6, 0.435 > > 6, 0, 6, 6, 0.444 > > 6, 0, 0, 7, 0.41 > > 6, 0, 7, 7, 0.42 > > 6, 0, 0, 8, 0.474 > > 6, 0, 8, 8, 0.501 > > 6, 0, 0, 9, 0.471 > > 6, 0, 9, 9, 0.489 > > 6, 0, 0, 10, 0.462 > > 6, 0, 10, 10, 0.46 > > 6, 0, 0, 11, 0.459 > > 6, 0, 11, 11, 0.458 > > 6, 0, 0, 12, 0.516 > > 6, 0, 12, 12, 0.51 > > 6, 0, 0, 13, 0.494 > > 6, 0, 13, 13, 0.524 > > 6, 0, 0, 14, 0.486 > > 6, 0, 14, 14, 0.5 > > 6, 0, 0, 15, 0.48 > > 6, 0, 15, 15, 0.501 > > 6, 0, 0, 16, 0.54 > > 6, 0, 16, 16, 0.538 > > 6, 0, 0, 17, 0.503 > > 6, 0, 17, 17, 0.541 > > 6, 0, 0, 18, 0.537 > > 6, 0, 18, 18, 0.549 > > 6, 0, 0, 19, 0.527 > > 6, 0, 19, 19, 0.537 > > 6, 0, 0, 20, 0.539 > > 6, 0, 20, 20, 0.554 > > 6, 0, 0, 21, 0.558 > > 6, 0, 21, 21, 0.541 > > 6, 0, 0, 22, 0.546 > > 6, 0, 22, 22, 0.561 > > 6, 0, 0, 23, 0.54 > > 6, 0, 23, 23, 0.536 > > 6, 0, 0, 24, 0.565 > > 6, 0, 24, 24, 0.584 > > 6, 0, 0, 25, 0.563 > > 6, 0, 25, 25, 0.58 > > 6, 0, 0, 26, 0.555 > > 6, 0, 26, 26, 0.584 > > 6, 0, 0, 27, 0.569 > > 6, 0, 27, 27, 0.587 > > 6, 0, 0, 28, 0.612 > > 6, 0, 28, 28, 0.623 > > 6, 0, 0, 29, 0.604 > > 6, 0, 29, 29, 0.621 > > 6, 0, 0, 30, 0.59 > > 6, 0, 30, 30, 0.609 > > 6, 0, 0, 31, 0.577 > > 6, 0, 31, 31, 0.588 > > 6, 0, 0, 32, 0.621 > > 6, 0, 32, 32, 0.608 > > 6, 0, 0, 33, 0.601 > > 6, 0, 33, 33, 0.623 > > 6, 0, 0, 34, 0.614 > > 6, 0, 34, 34, 0.615 > > 6, 0, 0, 35, 0.598 > > 6, 0, 35, 35, 0.608 > > 6, 0, 0, 36, 0.626 > > 6, 0, 36, 36, 0.634 > > 6, 0, 0, 37, 0.62 > > 6, 0, 37, 37, 0.634 > > 6, 0, 0, 38, 0.612 > > 6, 0, 38, 38, 0.637 > > 6, 0, 0, 39, 0.627 > > 6, 0, 39, 39, 0.612 > > 6, 0, 0, 40, 0.661 > > 6, 0, 40, 40, 0.674 > > 6, 0, 0, 41, 0.633 > > 6, 0, 41, 41, 0.643 > > 6, 0, 0, 42, 0.634 > > 6, 0, 42, 42, 0.636 > > 6, 0, 0, 43, 0.619 > > 6, 0, 43, 43, 0.625 > > 6, 0, 0, 44, 0.654 > > 6, 0, 44, 44, 0.654 > > 6, 0, 0, 45, 0.647 > > 6, 0, 45, 45, 0.649 > > 6, 0, 0, 46, 0.651 > > 6, 0, 46, 46, 0.651 > > 6, 0, 0, 47, 0.646 > > 6, 0, 47, 47, 0.648 > > 6, 0, 0, 48, 0.662 > > 6, 0, 48, 48, 0.664 > > 6, 0, 0, 49, 0.68 > > 6, 0, 49, 49, 0.667 > > 6, 0, 0, 50, 0.654 > > 6, 0, 50, 50, 0.659 > 
> 6, 0, 0, 51, 0.638 > > 6, 0, 51, 51, 0.639 > > 6, 0, 0, 52, 0.665 > > 6, 0, 52, 52, 0.669 > > 6, 0, 0, 53, 0.658 > > 6, 0, 53, 53, 0.656 > > 6, 0, 0, 54, 0.669 > > 6, 0, 54, 54, 0.67 > > 6, 0, 0, 55, 0.668 > > 6, 0, 55, 55, 0.664 > > 6, 0, 0, 56, 0.701 > > 6, 0, 56, 56, 0.695 > > 6, 0, 0, 57, 0.687 > > 6, 0, 57, 57, 0.696 > > 6, 0, 0, 58, 0.693 > > 6, 0, 58, 58, 0.704 > > 6, 0, 0, 59, 0.695 > > 6, 0, 59, 59, 0.708 > > 6, 0, 0, 60, 0.708 > > 6, 0, 60, 60, 0.728 > > 6, 0, 0, 61, 0.708 > > 6, 0, 61, 61, 0.71 > > 6, 0, 0, 62, 0.715 > > 6, 0, 62, 62, 0.705 > > 6, 0, 0, 63, 0.677 > > 6, 0, 63, 63, 0.702 > > > > .../{strcspn-sse2.S => strcspn-sse2.c} | 8 +- > > sysdeps/x86_64/strcspn.S | 119 ------------------ > > 2 files changed, 4 insertions(+), 123 deletions(-) > > rename sysdeps/x86_64/multiarch/{strcspn-sse2.S => strcspn-sse2.c} (85%) > > delete mode 100644 sysdeps/x86_64/strcspn.S > > > > diff --git a/sysdeps/x86_64/multiarch/strcspn-sse2.S b/sysdeps/x86_64/multiarch/strcspn-sse2.c > > similarity index 85% > > rename from sysdeps/x86_64/multiarch/strcspn-sse2.S > > rename to sysdeps/x86_64/multiarch/strcspn-sse2.c > > index f97e856e1f..3a04bb39fc 100644 > > --- a/sysdeps/x86_64/multiarch/strcspn-sse2.S > > +++ b/sysdeps/x86_64/multiarch/strcspn-sse2.c > > @@ -1,4 +1,4 @@ > > -/* strcspn optimized with SSE2. > > +/* strcspn. > > Copyright (C) 2017-2022 Free Software Foundation, Inc. > > This file is part of the GNU C Library. > > > > @@ -19,10 +19,10 @@ > > #if IS_IN (libc) > > > > # include <sysdep.h> > > -# define strcspn __strcspn_sse2 > > +# define STRCSPN __strcspn_sse2 > > > > # undef libc_hidden_builtin_def > > -# define libc_hidden_builtin_def(strcspn) > > +# define libc_hidden_builtin_def(STRCSPN) > > #endif > > > > -#include <sysdeps/x86_64/strcspn.S> > > +#include <string/strcspn.c> > > diff --git a/sysdeps/x86_64/strcspn.S b/sysdeps/x86_64/strcspn.S > > deleted file mode 100644 > > index f3cd86c606..0000000000 > > --- a/sysdeps/x86_64/strcspn.S > > +++ /dev/null > > @@ -1,119 +0,0 @@ > > -/* strcspn (str, ss) -- Return the length of the initial segment of STR > > - which contains no characters from SS. > > - For AMD x86-64. > > - Copyright (C) 1994-2022 Free Software Foundation, Inc. > > - This file is part of the GNU C Library. > > - > > - The GNU C Library is free software; you can redistribute it and/or > > - modify it under the terms of the GNU Lesser General Public > > - License as published by the Free Software Foundation; either > > - version 2.1 of the License, or (at your option) any later version. > > - > > - The GNU C Library is distributed in the hope that it will be useful, > > - but WITHOUT ANY WARRANTY; without even the implied warranty of > > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > - Lesser General Public License for more details. > > - > > - You should have received a copy of the GNU Lesser General Public > > - License along with the GNU C Library; if not, see > > - <https://www.gnu.org/licenses/>. */ > > - > > -#include <sysdep.h> > > -#include "asm-syntax.h" > > - > > - .text > > -ENTRY (strcspn) > > - > > - movq %rdi, %rdx /* Save SRC. */ > > - > > - /* First we create a table with flags for all possible characters. > > - For the ASCII (7bit/8bit) or ISO-8859-X character sets which are > > - supported by the C string functions we have 256 characters. > > - Before inserting marks for the stop characters we clear the whole > > - table. */ > > - movq %rdi, %r8 /* Save value. */ > > - subq $256, %rsp /* Make space for 256 bytes. 
*/ > > - cfi_adjust_cfa_offset(256) > > - movl $32, %ecx /* 32*8 bytes = 256 bytes. */ > > - movq %rsp, %rdi > > - xorl %eax, %eax /* We store 0s. */ > > - cld > > - rep > > - stosq > > - > > - movq %rsi, %rax /* Setup skipset. */ > > - > > -/* For understanding the following code remember that %rcx == 0 now. > > - Although all the following instruction only modify %cl we always > > - have a correct zero-extended 64-bit value in %rcx. */ > > - > > - .p2align 4 > > -L(2): movb (%rax), %cl /* get byte from skipset */ > > - testb %cl, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */ > > - > > - movb 1(%rax), %cl /* get byte from skipset */ > > - testb $0xff, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */ > > - > > - movb 2(%rax), %cl /* get byte from skipset */ > > - testb $0xff, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */ > > - > > - movb 3(%rax), %cl /* get byte from skipset */ > > - addq $4, %rax /* increment skipset pointer */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in skipset table */ > > - testb $0xff, %cl /* is NUL char? */ > > - jnz L(2) /* no => process next dword from skipset */ > > - > > -L(1): leaq -4(%rdx), %rax /* prepare loop */ > > - > > - /* We use a neat trick for the following loop. Normally we would > > - have to test for two termination conditions > > - 1. a character in the skipset was found > > - and > > - 2. the end of the string was found > > - But as a sign that the character is in the skipset we store its > > - value in the table. But the value of NUL is NUL so the loop > > - terminates for NUL in every case. */ > > - > > - .p2align 4 > > -L(3): addq $4, %rax /* adjust pointer for full loop round */ > > - > > - movb (%rax), %cl /* get byte from string */ > > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - je L(4) /* yes => return */ > > - > > - movb 1(%rax), %cl /* get byte from string */ > > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - je L(5) /* yes => return */ > > - > > - movb 2(%rax), %cl /* get byte from string */ > > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jz L(6) /* yes => return */ > > - > > - movb 3(%rax), %cl /* get byte from string */ > > - cmpb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jne L(3) /* no => start loop again */ > > - > > - incq %rax /* adjust pointer */ > > -L(6): incq %rax > > -L(5): incq %rax > > - > > -L(4): addq $256, %rsp /* remove skipset */ > > - cfi_adjust_cfa_offset(-256) > > -#ifdef USE_AS_STRPBRK > > - xorl %edx,%edx > > - orb %cl, %cl /* was last character NUL? */ > > - cmovzq %rdx, %rax /* Yes: return NULL */ > > -#else > > - subq %rdx, %rax /* we have to return the number of valid > > - characters, so compute distance to first > > - non-valid character */ > > -#endif > > - ret > > -END (strcspn) > > -libc_hidden_builtin_def (strcspn) > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
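For context on why the hand-written assembly could be retired: both the deleted strcspn.S and the generic C code it now defers to are built around a 256-entry membership table for the reject set, roughly as in this simplified sketch (not the actual glibc source):

    #include <stddef.h>

    /* Build a byte-indexed table marking every reject character, then
       scan S until a marked byte is hit.  Marking slot 0 makes the NUL
       terminator stop the scan with no separate end-of-string test,
       the same trick the deleted assembly documented.  */
    static size_t
    strcspn_table (const char *s, const char *reject)
    {
      unsigned char table[256] = { 0 };
      table[0] = 1;
      for (const unsigned char *r = (const unsigned char *) reject; *r != '\0'; ++r)
        table[*r] = 1;

      const unsigned char *p = (const unsigned char *) s;
      while (!table[*p])
        ++p;
      return (size_t) (p - (const unsigned char *) s);
    }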
[parent not found: <20220323215734.3927131-10-goldstein.w.n@gmail.com>]
[parent not found: <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>]
* Re: [PATCH v1 10/23] x86: Remove strpbrk-sse2.S and use the generic implementation [not found] ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com> @ 2022-05-12 19:41 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:41 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:00 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The generic implementation is faster (see strcspn commit). > > > > All string/memory tests pass. > > --- > > .../x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c} | 9 ++++----- > > sysdeps/x86_64/strpbrk.S | 3 --- > > 2 files changed, 4 insertions(+), 8 deletions(-) > > rename sysdeps/x86_64/multiarch/{strpbrk-sse2.S => strpbrk-sse2.c} (84%) > > delete mode 100644 sysdeps/x86_64/strpbrk.S > > > > diff --git a/sysdeps/x86_64/multiarch/strpbrk-sse2.S b/sysdeps/x86_64/multiarch/strpbrk-sse2.c > > similarity index 84% > > rename from sysdeps/x86_64/multiarch/strpbrk-sse2.S > > rename to sysdeps/x86_64/multiarch/strpbrk-sse2.c > > index d537b6c27b..d03214c4fb 100644 > > --- a/sysdeps/x86_64/multiarch/strpbrk-sse2.S > > +++ b/sysdeps/x86_64/multiarch/strpbrk-sse2.c > > @@ -1,4 +1,4 @@ > > -/* strpbrk optimized with SSE2. > > +/* strpbrk. > > Copyright (C) 2017-2022 Free Software Foundation, Inc. > > This file is part of the GNU C Library. > > > > @@ -19,11 +19,10 @@ > > #if IS_IN (libc) > > > > # include <sysdep.h> > > -# define strcspn __strpbrk_sse2 > > +# define STRPBRK __strpbrk_sse2 > > > > # undef libc_hidden_builtin_def > > -# define libc_hidden_builtin_def(strpbrk) > > +# define libc_hidden_builtin_def(STRPBRK) > > #endif > > > > -#define USE_AS_STRPBRK > > -#include <sysdeps/x86_64/strcspn.S> > > +#include <string/strpbrk.c> > > diff --git a/sysdeps/x86_64/strpbrk.S b/sysdeps/x86_64/strpbrk.S > > deleted file mode 100644 > > index 21888a5b92..0000000000 > > --- a/sysdeps/x86_64/strpbrk.S > > +++ /dev/null > > @@ -1,3 +0,0 @@ > > -#define strcspn strpbrk > > -#define USE_AS_STRPBRK > > -#include <sysdeps/x86_64/strcspn.S> > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
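strpbrk performs the same scan as strcspn and differs only in what it returns, which is what the USE_AS_STRPBRK block in the removed strcspn.S handled with a cmov. Expressed in C on top of strcspn, purely as an illustration (the _sketch name is ours, not a glibc symbol):

#include <stddef.h>
#include <string.h>

/* Illustration only: strpbrk as a thin wrapper over strcspn, returning a
   pointer to the first matching byte, or NULL if the string ends first.  */
char *
strpbrk_sketch (const char *str, const char *accept)
{
  str += strcspn (str, accept);
  return *str != '\0' ? (char *) str : NULL;
}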
[parent not found: <20220323215734.3927131-11-goldstein.w.n@gmail.com>]
[parent not found: <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>]
* Re: [PATCH v1 11/23] x86: Remove strspn-sse2.S and use the generic implementation [not found] ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com> @ 2022-05-12 19:42 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:42 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 12:00 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The generic implementation is faster. > > > > geometric_mean(N=20) of all benchmarks New / Original: .710 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=20 runs; All functions page aligned > > len, align1, align2, pos, New Time / Old Time > > 1, 0, 0, 512, 0.824 > > 1, 1, 0, 512, 1.018 > > 1, 0, 1, 512, 0.986 > > 1, 1, 1, 512, 1.092 > > 2, 0, 0, 512, 0.86 > > 2, 2, 0, 512, 0.868 > > 2, 0, 2, 512, 0.858 > > 2, 2, 2, 512, 0.857 > > 3, 0, 0, 512, 0.836 > > 3, 3, 0, 512, 0.849 > > 3, 0, 3, 512, 0.84 > > 3, 3, 3, 512, 0.85 > > 4, 0, 0, 512, 0.843 > > 4, 4, 0, 512, 0.837 > > 4, 0, 4, 512, 0.835 > > 4, 4, 4, 512, 0.846 > > 5, 0, 0, 512, 0.852 > > 5, 5, 0, 512, 0.848 > > 5, 0, 5, 512, 0.85 > > 5, 5, 5, 512, 0.85 > > 6, 0, 0, 512, 0.853 > > 6, 6, 0, 512, 0.855 > > 6, 0, 6, 512, 0.853 > > 6, 6, 6, 512, 0.853 > > 7, 0, 0, 512, 0.857 > > 7, 7, 0, 512, 0.861 > > 7, 0, 7, 512, 0.94 > > 7, 7, 7, 512, 0.856 > > 8, 0, 0, 512, 0.927 > > 8, 0, 8, 512, 0.965 > > 9, 0, 0, 512, 0.967 > > 9, 1, 0, 512, 0.976 > > 9, 0, 9, 512, 0.887 > > 9, 1, 9, 512, 0.881 > > 10, 0, 0, 512, 0.853 > > 10, 2, 0, 512, 0.846 > > 10, 0, 10, 512, 0.855 > > 10, 2, 10, 512, 0.849 > > 11, 0, 0, 512, 0.854 > > 11, 3, 0, 512, 0.855 > > 11, 0, 11, 512, 0.85 > > 11, 3, 11, 512, 0.854 > > 12, 0, 0, 512, 0.864 > > 12, 4, 0, 512, 0.864 > > 12, 0, 12, 512, 0.867 > > 12, 4, 12, 512, 0.87 > > 13, 0, 0, 512, 0.853 > > 13, 5, 0, 512, 0.841 > > 13, 0, 13, 512, 0.837 > > 13, 5, 13, 512, 0.85 > > 14, 0, 0, 512, 0.838 > > 14, 6, 0, 512, 0.842 > > 14, 0, 14, 512, 0.818 > > 14, 6, 14, 512, 0.845 > > 15, 0, 0, 512, 0.799 > > 15, 7, 0, 512, 0.847 > > 15, 0, 15, 512, 0.787 > > 15, 7, 15, 512, 0.84 > > 16, 0, 0, 512, 0.824 > > 16, 0, 16, 512, 0.827 > > 17, 0, 0, 512, 0.817 > > 17, 1, 0, 512, 0.823 > > 17, 0, 17, 512, 0.82 > > 17, 1, 17, 512, 0.814 > > 18, 0, 0, 512, 0.81 > > 18, 2, 0, 512, 0.833 > > 18, 0, 18, 512, 0.811 > > 18, 2, 18, 512, 0.842 > > 19, 0, 0, 512, 0.823 > > 19, 3, 0, 512, 0.818 > > 19, 0, 19, 512, 0.821 > > 19, 3, 19, 512, 0.824 > > 20, 0, 0, 512, 0.814 > > 20, 4, 0, 512, 0.818 > > 20, 0, 20, 512, 0.806 > > 20, 4, 20, 512, 0.802 > > 21, 0, 0, 512, 0.835 > > 21, 5, 0, 512, 0.839 > > 21, 0, 21, 512, 0.842 > > 21, 5, 21, 512, 0.82 > > 22, 0, 0, 512, 0.824 > > 22, 6, 0, 512, 0.831 > > 22, 0, 22, 512, 0.819 > > 22, 6, 22, 512, 0.824 > > 23, 0, 0, 512, 0.816 > > 23, 7, 0, 512, 0.856 > > 23, 0, 23, 512, 0.808 > > 23, 7, 23, 512, 0.848 > > 24, 0, 0, 512, 0.88 > > 24, 0, 24, 512, 0.846 > > 25, 0, 0, 512, 0.929 > > 25, 1, 0, 512, 0.917 > > 25, 0, 25, 512, 0.884 > > 25, 1, 25, 512, 0.859 > > 26, 0, 0, 512, 0.919 > > 26, 2, 0, 512, 0.867 > > 26, 0, 26, 512, 0.914 > > 26, 2, 26, 512, 0.845 > > 27, 0, 0, 512, 0.919 > > 27, 3, 0, 512, 0.864 > > 27, 0, 27, 512, 0.917 > > 27, 3, 27, 512, 0.847 > > 28, 0, 0, 512, 0.905 > > 28, 4, 0, 512, 0.896 > > 28, 0, 28, 512, 0.898 > > 28, 4, 28, 512, 0.871 > > 29, 0, 0, 512, 0.911 > > 29, 5, 0, 512, 0.91 > > 29, 0, 29, 512, 0.905 > > 29, 5, 29, 512, 
0.884 > > 30, 0, 0, 512, 0.907 > > 30, 6, 0, 512, 0.802 > > 30, 0, 30, 512, 0.906 > > 30, 6, 30, 512, 0.818 > > 31, 0, 0, 512, 0.907 > > 31, 7, 0, 512, 0.821 > > 31, 0, 31, 512, 0.89 > > 31, 7, 31, 512, 0.787 > > 4, 0, 0, 32, 0.623 > > 4, 1, 0, 32, 0.606 > > 4, 0, 1, 32, 0.6 > > 4, 1, 1, 32, 0.603 > > 4, 0, 0, 64, 0.731 > > 4, 2, 0, 64, 0.733 > > 4, 0, 2, 64, 0.734 > > 4, 2, 2, 64, 0.755 > > 4, 0, 0, 128, 0.822 > > 4, 3, 0, 128, 0.873 > > 4, 0, 3, 128, 0.89 > > 4, 3, 3, 128, 0.907 > > 4, 0, 0, 256, 0.827 > > 4, 4, 0, 256, 0.811 > > 4, 0, 4, 256, 0.794 > > 4, 4, 4, 256, 0.814 > > 4, 5, 0, 512, 0.841 > > 4, 0, 5, 512, 0.831 > > 4, 5, 5, 512, 0.845 > > 4, 0, 0, 1024, 0.861 > > 4, 6, 0, 1024, 0.857 > > 4, 0, 6, 1024, 0.9 > > 4, 6, 6, 1024, 0.861 > > 4, 0, 0, 2048, 0.879 > > 4, 7, 0, 2048, 0.875 > > 4, 0, 7, 2048, 0.883 > > 4, 7, 7, 2048, 0.88 > > 10, 1, 0, 64, 0.747 > > 10, 1, 1, 64, 0.743 > > 10, 2, 0, 64, 0.732 > > 10, 2, 2, 64, 0.729 > > 10, 3, 0, 64, 0.747 > > 10, 3, 3, 64, 0.733 > > 10, 4, 0, 64, 0.74 > > 10, 4, 4, 64, 0.751 > > 10, 5, 0, 64, 0.735 > > 10, 5, 5, 64, 0.746 > > 10, 6, 0, 64, 0.735 > > 10, 6, 6, 64, 0.733 > > 10, 7, 0, 64, 0.734 > > 10, 7, 7, 64, 0.74 > > 6, 0, 0, 0, 0.377 > > 6, 0, 0, 1, 0.369 > > 6, 0, 1, 1, 0.383 > > 6, 0, 0, 2, 0.391 > > 6, 0, 2, 2, 0.394 > > 6, 0, 0, 3, 0.416 > > 6, 0, 3, 3, 0.411 > > 6, 0, 0, 4, 0.475 > > 6, 0, 4, 4, 0.483 > > 6, 0, 0, 5, 0.473 > > 6, 0, 5, 5, 0.476 > > 6, 0, 0, 6, 0.459 > > 6, 0, 6, 6, 0.445 > > 6, 0, 0, 7, 0.433 > > 6, 0, 7, 7, 0.432 > > 6, 0, 0, 8, 0.492 > > 6, 0, 8, 8, 0.494 > > 6, 0, 0, 9, 0.476 > > 6, 0, 9, 9, 0.483 > > 6, 0, 0, 10, 0.46 > > 6, 0, 10, 10, 0.476 > > 6, 0, 0, 11, 0.463 > > 6, 0, 11, 11, 0.463 > > 6, 0, 0, 12, 0.511 > > 6, 0, 12, 12, 0.515 > > 6, 0, 0, 13, 0.506 > > 6, 0, 13, 13, 0.536 > > 6, 0, 0, 14, 0.496 > > 6, 0, 14, 14, 0.484 > > 6, 0, 0, 15, 0.473 > > 6, 0, 15, 15, 0.475 > > 6, 0, 0, 16, 0.534 > > 6, 0, 16, 16, 0.534 > > 6, 0, 0, 17, 0.525 > > 6, 0, 17, 17, 0.523 > > 6, 0, 0, 18, 0.522 > > 6, 0, 18, 18, 0.524 > > 6, 0, 0, 19, 0.512 > > 6, 0, 19, 19, 0.514 > > 6, 0, 0, 20, 0.535 > > 6, 0, 20, 20, 0.54 > > 6, 0, 0, 21, 0.543 > > 6, 0, 21, 21, 0.536 > > 6, 0, 0, 22, 0.542 > > 6, 0, 22, 22, 0.542 > > 6, 0, 0, 23, 0.529 > > 6, 0, 23, 23, 0.53 > > 6, 0, 0, 24, 0.596 > > 6, 0, 24, 24, 0.589 > > 6, 0, 0, 25, 0.583 > > 6, 0, 25, 25, 0.58 > > 6, 0, 0, 26, 0.574 > > 6, 0, 26, 26, 0.58 > > 6, 0, 0, 27, 0.575 > > 6, 0, 27, 27, 0.558 > > 6, 0, 0, 28, 0.606 > > 6, 0, 28, 28, 0.606 > > 6, 0, 0, 29, 0.589 > > 6, 0, 29, 29, 0.595 > > 6, 0, 0, 30, 0.592 > > 6, 0, 30, 30, 0.585 > > 6, 0, 0, 31, 0.585 > > 6, 0, 31, 31, 0.579 > > 6, 0, 0, 32, 0.625 > > 6, 0, 32, 32, 0.615 > > 6, 0, 0, 33, 0.615 > > 6, 0, 33, 33, 0.61 > > 6, 0, 0, 34, 0.604 > > 6, 0, 34, 34, 0.6 > > 6, 0, 0, 35, 0.602 > > 6, 0, 35, 35, 0.608 > > 6, 0, 0, 36, 0.644 > > 6, 0, 36, 36, 0.644 > > 6, 0, 0, 37, 0.658 > > 6, 0, 37, 37, 0.651 > > 6, 0, 0, 38, 0.644 > > 6, 0, 38, 38, 0.649 > > 6, 0, 0, 39, 0.626 > > 6, 0, 39, 39, 0.632 > > 6, 0, 0, 40, 0.662 > > 6, 0, 40, 40, 0.661 > > 6, 0, 0, 41, 0.656 > > 6, 0, 41, 41, 0.655 > > 6, 0, 0, 42, 0.643 > > 6, 0, 42, 42, 0.637 > > 6, 0, 0, 43, 0.622 > > 6, 0, 43, 43, 0.628 > > 6, 0, 0, 44, 0.673 > > 6, 0, 44, 44, 0.687 > > 6, 0, 0, 45, 0.661 > > 6, 0, 45, 45, 0.659 > > 6, 0, 0, 46, 0.657 > > 6, 0, 46, 46, 0.653 > > 6, 0, 0, 47, 0.658 > > 6, 0, 47, 47, 0.65 > > 6, 0, 0, 48, 0.678 > > 6, 0, 48, 48, 0.683 > > 6, 0, 0, 49, 0.676 > > 6, 0, 49, 49, 0.661 > > 6, 0, 0, 50, 0.672 > > 6, 0, 50, 50, 0.662 > > 6, 0, 0, 51, 0.656 > > 6, 0, 
51, 51, 0.659 > > 6, 0, 0, 52, 0.682 > > 6, 0, 52, 52, 0.686 > > 6, 0, 0, 53, 0.67 > > 6, 0, 53, 53, 0.674 > > 6, 0, 0, 54, 0.663 > > 6, 0, 54, 54, 0.675 > > 6, 0, 0, 55, 0.662 > > 6, 0, 55, 55, 0.665 > > 6, 0, 0, 56, 0.681 > > 6, 0, 56, 56, 0.697 > > 6, 0, 0, 57, 0.686 > > 6, 0, 57, 57, 0.687 > > 6, 0, 0, 58, 0.701 > > 6, 0, 58, 58, 0.693 > > 6, 0, 0, 59, 0.709 > > 6, 0, 59, 59, 0.698 > > 6, 0, 0, 60, 0.708 > > 6, 0, 60, 60, 0.708 > > 6, 0, 0, 61, 0.709 > > 6, 0, 61, 61, 0.716 > > 6, 0, 0, 62, 0.709 > > 6, 0, 62, 62, 0.707 > > 6, 0, 0, 63, 0.703 > > 6, 0, 63, 63, 0.716 > > > > .../{strspn-sse2.S => strspn-sse2.c} | 8 +- > > sysdeps/x86_64/strspn.S | 112 ------------------ > > 2 files changed, 4 insertions(+), 116 deletions(-) > > rename sysdeps/x86_64/multiarch/{strspn-sse2.S => strspn-sse2.c} (86%) > > delete mode 100644 sysdeps/x86_64/strspn.S > > > > diff --git a/sysdeps/x86_64/multiarch/strspn-sse2.S b/sysdeps/x86_64/multiarch/strspn-sse2.c > > similarity index 86% > > rename from sysdeps/x86_64/multiarch/strspn-sse2.S > > rename to sysdeps/x86_64/multiarch/strspn-sse2.c > > index e0a095f25a..61cc6cb0a5 100644 > > --- a/sysdeps/x86_64/multiarch/strspn-sse2.S > > +++ b/sysdeps/x86_64/multiarch/strspn-sse2.c > > @@ -1,4 +1,4 @@ > > -/* strspn optimized with SSE2. > > +/* strspn. > > Copyright (C) 2017-2022 Free Software Foundation, Inc. > > This file is part of the GNU C Library. > > > > @@ -19,10 +19,10 @@ > > #if IS_IN (libc) > > > > # include <sysdep.h> > > -# define strspn __strspn_sse2 > > +# define STRSPN __strspn_sse2 > > > > # undef libc_hidden_builtin_def > > -# define libc_hidden_builtin_def(strspn) > > +# define libc_hidden_builtin_def(STRSPN) > > #endif > > > > -#include <sysdeps/x86_64/strspn.S> > > +#include <string/strspn.c> > > diff --git a/sysdeps/x86_64/strspn.S b/sysdeps/x86_64/strspn.S > > deleted file mode 100644 > > index 61b76ee0a1..0000000000 > > --- a/sysdeps/x86_64/strspn.S > > +++ /dev/null > > @@ -1,112 +0,0 @@ > > -/* strspn (str, ss) -- Return the length of the initial segment of STR > > - which contains only characters from SS. > > - For AMD x86-64. > > - Copyright (C) 1994-2022 Free Software Foundation, Inc. > > - This file is part of the GNU C Library. > > - > > - The GNU C Library is free software; you can redistribute it and/or > > - modify it under the terms of the GNU Lesser General Public > > - License as published by the Free Software Foundation; either > > - version 2.1 of the License, or (at your option) any later version. > > - > > - The GNU C Library is distributed in the hope that it will be useful, > > - but WITHOUT ANY WARRANTY; without even the implied warranty of > > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > - Lesser General Public License for more details. > > - > > - You should have received a copy of the GNU Lesser General Public > > - License along with the GNU C Library; if not, see > > - <https://www.gnu.org/licenses/>. */ > > - > > -#include <sysdep.h> > > - > > - .text > > -ENTRY (strspn) > > - > > - movq %rdi, %rdx /* Save SRC. */ > > - > > - /* First we create a table with flags for all possible characters. > > - For the ASCII (7bit/8bit) or ISO-8859-X character sets which are > > - supported by the C string functions we have 256 characters. > > - Before inserting marks for the stop characters we clear the whole > > - table. */ > > - movq %rdi, %r8 /* Save value. */ > > - subq $256, %rsp /* Make space for 256 bytes. */ > > - cfi_adjust_cfa_offset(256) > > - movl $32, %ecx /* 32*8 bytes = 256 bytes. 
*/ > > - movq %rsp, %rdi > > - xorl %eax, %eax /* We store 0s. */ > > - cld > > - rep > > - stosq > > - > > - movq %rsi, %rax /* Setup stopset. */ > > - > > -/* For understanding the following code remember that %rcx == 0 now. > > - Although all the following instruction only modify %cl we always > > - have a correct zero-extended 64-bit value in %rcx. */ > > - > > - .p2align 4 > > -L(2): movb (%rax), %cl /* get byte from stopset */ > > - testb %cl, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */ > > - > > - movb 1(%rax), %cl /* get byte from stopset */ > > - testb $0xff, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */ > > - > > - movb 2(%rax), %cl /* get byte from stopset */ > > - testb $0xff, %cl /* is NUL char? */ > > - jz L(1) /* yes => start compare loop */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */ > > - > > - movb 3(%rax), %cl /* get byte from stopset */ > > - addq $4, %rax /* increment stopset pointer */ > > - movb %cl, (%rsp,%rcx) /* set corresponding byte in stopset table */ > > - testb $0xff, %cl /* is NUL char? */ > > - jnz L(2) /* no => process next dword from stopset */ > > - > > -L(1): leaq -4(%rdx), %rax /* prepare loop */ > > - > > - /* We use a neat trick for the following loop. Normally we would > > - have to test for two termination conditions > > - 1. a character in the stopset was found > > - and > > - 2. the end of the string was found > > - But as a sign that the character is in the stopset we store its > > - value in the table. But the value of NUL is NUL so the loop > > - terminates for NUL in every case. */ > > - > > - .p2align 4 > > -L(3): addq $4, %rax /* adjust pointer for full loop round */ > > - > > - movb (%rax), %cl /* get byte from string */ > > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jz L(4) /* no => return */ > > - > > - movb 1(%rax), %cl /* get byte from string */ > > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jz L(5) /* no => return */ > > - > > - movb 2(%rax), %cl /* get byte from string */ > > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jz L(6) /* no => return */ > > - > > - movb 3(%rax), %cl /* get byte from string */ > > - testb %cl, (%rsp,%rcx) /* is it contained in skipset? */ > > - jnz L(3) /* yes => start loop again */ > > - > > - incq %rax /* adjust pointer */ > > -L(6): incq %rax > > -L(5): incq %rax > > - > > -L(4): addq $256, %rsp /* remove stopset */ > > - cfi_adjust_cfa_offset(-256) > > - subq %rdx, %rax /* we have to return the number of valid > > - characters, so compute distance to first > > - non-valid character */ > > - ret > > -END (strspn) > > -libc_hidden_builtin_def (strspn) > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
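strspn inverts the test: the accept set is marked in the table and the scan continues while the current byte is marked, so the unmarked NUL stops it, which is exactly the trick the deleted assembly's comments describe. A minimal C sketch, again illustration only and not the glibc source (the _sketch name is ours):

#include <stddef.h>

/* Sketch of the table-based strspn scan.  NUL is left unmarked, so the
   loop stops either at a byte outside the accept set or at the end of
   the string.  */
size_t
strspn_sketch (const char *str, const char *accept)
{
  unsigned char table[256] = { 0 };
  const unsigned char *s = (const unsigned char *) accept;

  for (; *s != '\0'; ++s)
    table[*s] = 1;

  for (s = (const unsigned char *) str; table[*s]; ++s)
    ;
  return (size_t) ((const char *) s - str);
}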
[parent not found: <20220323215734.3927131-17-goldstein.w.n@gmail.com>]
[parent not found: <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>]
* Re: [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S [not found] ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com> @ 2022-05-12 19:44 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:44 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:01 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Slightly faster method of doing TOLOWER that saves an > > instruction. > > > > Also replace the hard coded 5-byte no with .p2align 4. On builds with > > CET enabled this misaligned entry to strcasecmp. > > > > geometric_mean(N=40) of all benchmarks New / Original: .894 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=40 runs; All functions page aligned > > length, align1, align2, max_char, New Time / Old Time > > 1, 1, 1, 127, 0.903 > > 2, 2, 2, 127, 0.905 > > 3, 3, 3, 127, 0.877 > > 4, 4, 4, 127, 0.888 > > 5, 5, 5, 127, 0.901 > > 6, 6, 6, 127, 0.954 > > 7, 7, 7, 127, 0.932 > > 8, 0, 0, 127, 0.918 > > 9, 1, 1, 127, 0.914 > > 10, 2, 2, 127, 0.877 > > 11, 3, 3, 127, 0.909 > > 12, 4, 4, 127, 0.876 > > 13, 5, 5, 127, 0.886 > > 14, 6, 6, 127, 0.914 > > 15, 7, 7, 127, 0.939 > > 4, 0, 0, 127, 0.963 > > 4, 0, 0, 254, 0.943 > > 8, 0, 0, 254, 0.927 > > 16, 0, 0, 127, 0.876 > > 16, 0, 0, 254, 0.865 > > 32, 0, 0, 127, 0.865 > > 32, 0, 0, 254, 0.862 > > 64, 0, 0, 127, 0.863 > > 64, 0, 0, 254, 0.896 > > 128, 0, 0, 127, 0.885 > > 128, 0, 0, 254, 0.882 > > 256, 0, 0, 127, 0.87 > > 256, 0, 0, 254, 0.869 > > 512, 0, 0, 127, 0.832 > > 512, 0, 0, 254, 0.848 > > 1024, 0, 0, 127, 0.835 > > 1024, 0, 0, 254, 0.843 > > 16, 1, 2, 127, 0.914 > > 16, 2, 1, 254, 0.949 > > 32, 2, 4, 127, 0.955 > > 32, 4, 2, 254, 1.004 > > 64, 3, 6, 127, 0.844 > > 64, 6, 3, 254, 0.905 > > 128, 4, 0, 127, 0.889 > > 128, 0, 4, 254, 0.845 > > 256, 5, 2, 127, 0.929 > > 256, 2, 5, 254, 0.907 > > 512, 6, 4, 127, 0.837 > > 512, 4, 6, 254, 0.862 > > 1024, 7, 6, 127, 0.895 > > 1024, 6, 7, 254, 0.89 > > > > sysdeps/x86_64/strcmp.S | 64 +++++++++++++++++++---------------------- > > 1 file changed, 29 insertions(+), 35 deletions(-) > > > > diff --git a/sysdeps/x86_64/strcmp.S b/sysdeps/x86_64/strcmp.S > > index e2ab59c555..99d8b36f1d 100644 > > --- a/sysdeps/x86_64/strcmp.S > > +++ b/sysdeps/x86_64/strcmp.S > > @@ -75,9 +75,8 @@ ENTRY2 (__strcasecmp) > > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax > > mov %fs:(%rax),%RDX_LP > > > > - // XXX 5 byte should be before the function > > - /* 5-byte NOP. */ > > - .byte 0x0f,0x1f,0x44,0x00,0x00 > > + /* Either 1 or 5 bytes (dependeing if CET is enabled). */ > > + .p2align 4 > > END2 (__strcasecmp) > > # ifndef NO_NOLOCALE_ALIAS > > weak_alias (__strcasecmp, strcasecmp) > > @@ -94,9 +93,8 @@ ENTRY2 (__strncasecmp) > > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax > > mov %fs:(%rax),%RCX_LP > > > > - // XXX 5 byte should be before the function > > - /* 5-byte NOP. */ > > - .byte 0x0f,0x1f,0x44,0x00,0x00 > > + /* Either 1 or 5 bytes (dependeing if CET is enabled). 
*/ > > + .p2align 4 > > END2 (__strncasecmp) > > # ifndef NO_NOLOCALE_ALIAS > > weak_alias (__strncasecmp, strncasecmp) > > @@ -146,22 +144,22 @@ ENTRY (STRCMP) > > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L > > .section .rodata.cst16,"aM",@progbits,16 > > .align 16 > > -.Lbelowupper: > > - .quad 0x4040404040404040 > > - .quad 0x4040404040404040 > > -.Ltopupper: > > - .quad 0x5b5b5b5b5b5b5b5b > > - .quad 0x5b5b5b5b5b5b5b5b > > -.Ltouppermask: > > +.Llcase_min: > > + .quad 0x3f3f3f3f3f3f3f3f > > + .quad 0x3f3f3f3f3f3f3f3f > > +.Llcase_max: > > + .quad 0x9999999999999999 > > + .quad 0x9999999999999999 > > +.Lcase_add: > > .quad 0x2020202020202020 > > .quad 0x2020202020202020 > > .previous > > - movdqa .Lbelowupper(%rip), %xmm5 > > -# define UCLOW_reg %xmm5 > > - movdqa .Ltopupper(%rip), %xmm6 > > -# define UCHIGH_reg %xmm6 > > - movdqa .Ltouppermask(%rip), %xmm7 > > -# define LCQWORD_reg %xmm7 > > + movdqa .Llcase_min(%rip), %xmm5 > > +# define LCASE_MIN_reg %xmm5 > > + movdqa .Llcase_max(%rip), %xmm6 > > +# define LCASE_MAX_reg %xmm6 > > + movdqa .Lcase_add(%rip), %xmm7 > > +# define CASE_ADD_reg %xmm7 > > #endif > > cmp $0x30, %ecx > > ja LABEL(crosscache) /* rsi: 16-byte load will cross cache line */ > > @@ -172,22 +170,18 @@ ENTRY (STRCMP) > > movhpd 8(%rdi), %xmm1 > > movhpd 8(%rsi), %xmm2 > > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L > > -# define TOLOWER(reg1, reg2) \ > > - movdqa reg1, %xmm8; \ > > - movdqa UCHIGH_reg, %xmm9; \ > > - movdqa reg2, %xmm10; \ > > - movdqa UCHIGH_reg, %xmm11; \ > > - pcmpgtb UCLOW_reg, %xmm8; \ > > - pcmpgtb reg1, %xmm9; \ > > - pcmpgtb UCLOW_reg, %xmm10; \ > > - pcmpgtb reg2, %xmm11; \ > > - pand %xmm9, %xmm8; \ > > - pand %xmm11, %xmm10; \ > > - pand LCQWORD_reg, %xmm8; \ > > - pand LCQWORD_reg, %xmm10; \ > > - por %xmm8, reg1; \ > > - por %xmm10, reg2 > > - TOLOWER (%xmm1, %xmm2) > > +# define TOLOWER(reg1, reg2) \ > > + movdqa LCASE_MIN_reg, %xmm8; \ > > + movdqa LCASE_MIN_reg, %xmm9; \ > > + paddb reg1, %xmm8; \ > > + paddb reg2, %xmm9; \ > > + pcmpgtb LCASE_MAX_reg, %xmm8; \ > > + pcmpgtb LCASE_MAX_reg, %xmm9; \ > > + pandn CASE_ADD_reg, %xmm8; \ > > + pandn CASE_ADD_reg, %xmm9; \ > > + paddb %xmm8, reg1; \ > > + paddb %xmm9, reg2 > > + TOLOWER (%xmm1, %xmm2) > > #else > > # define TOLOWER(reg1, reg2) > > #endif > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
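The saving in the new TOLOWER comes from replacing the two range compares plus masking with a single bias-and-compare per input vector: adding 0x3f moves 'A'..'Z' (0x41..0x5a) into the signed byte range 0x80..0x99, so one signed compare against 0x99 isolates the uppercase bytes, and pandn then selects the 0x20 case bit only for them. A scalar model of one byte using the constants introduced above (illustration only; the function name is ours, and the narrowing casts rely on the usual two's-complement conversion to mirror pcmpgtb's signed compare):

#include <stdint.h>

/* Per-byte model of the paddb/pcmpgtb/pandn/paddb sequence.  */
static uint8_t
tolower_branchless (uint8_t c)
{
  uint8_t biased = (uint8_t) (c + 0x3f);              /* paddb   0x3f */
  uint8_t not_upper =
    ((int8_t) biased > (int8_t) 0x99) ? 0xff : 0x00;  /* pcmpgtb 0x99 */
  uint8_t case_bit = (uint8_t) (~not_upper & 0x20);   /* pandn   0x20 */
  return (uint8_t) (c + case_bit);                    /* paddb        */
}

Only bytes 0x41..0x5a end up with a biased value that is not greater than (signed) 0x99, so every other byte, including 'a'..'z' and bytes above 0x7f, passes through unchanged.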
[parent not found: <20220323215734.3927131-18-goldstein.w.n@gmail.com>]
[parent not found: <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>]
* Re: [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S [not found] ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com> @ 2022-05-12 19:45 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:45 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 12:05 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:02 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > Slightly faster method of doing TOLOWER that saves an > > instruction. > > > > Also replace the hard coded 5-byte no with .p2align 4. On builds with > > CET enabled this misaligned entry to strcasecmp. > > > > geometric_mean(N=40) of all benchmarks New / Original: .920 > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=40 runs; All functions page aligned > > length, align1, align2, max_char, New Time / Old Time > > 1, 1, 1, 127, 0.914 > > 2, 2, 2, 127, 0.952 > > 3, 3, 3, 127, 0.924 > > 4, 4, 4, 127, 0.995 > > 5, 5, 5, 127, 0.985 > > 6, 6, 6, 127, 1.017 > > 7, 7, 7, 127, 1.031 > > 8, 0, 0, 127, 0.967 > > 9, 1, 1, 127, 0.969 > > 10, 2, 2, 127, 0.951 > > 11, 3, 3, 127, 0.938 > > 12, 4, 4, 127, 0.937 > > 13, 5, 5, 127, 0.967 > > 14, 6, 6, 127, 0.941 > > 15, 7, 7, 127, 0.951 > > 4, 0, 0, 127, 0.959 > > 4, 0, 0, 254, 0.98 > > 8, 0, 0, 254, 0.959 > > 16, 0, 0, 127, 0.895 > > 16, 0, 0, 254, 0.901 > > 32, 0, 0, 127, 0.85 > > 32, 0, 0, 254, 0.851 > > 64, 0, 0, 127, 0.897 > > 64, 0, 0, 254, 0.895 > > 128, 0, 0, 127, 0.944 > > 128, 0, 0, 254, 0.935 > > 256, 0, 0, 127, 0.922 > > 256, 0, 0, 254, 0.913 > > 512, 0, 0, 127, 0.921 > > 512, 0, 0, 254, 0.914 > > 1024, 0, 0, 127, 0.845 > > 1024, 0, 0, 254, 0.84 > > 16, 1, 2, 127, 0.923 > > 16, 2, 1, 254, 0.955 > > 32, 2, 4, 127, 0.979 > > 32, 4, 2, 254, 0.957 > > 64, 3, 6, 127, 0.866 > > 64, 6, 3, 254, 0.849 > > 128, 4, 0, 127, 0.882 > > 128, 0, 4, 254, 0.876 > > 256, 5, 2, 127, 0.877 > > 256, 2, 5, 254, 0.882 > > 512, 6, 4, 127, 0.822 > > 512, 4, 6, 254, 0.862 > > 1024, 7, 6, 127, 0.903 > > 1024, 6, 7, 254, 0.908 > > > > sysdeps/x86_64/multiarch/strcmp-sse42.S | 83 +++++++++++-------------- > > 1 file changed, 35 insertions(+), 48 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S > > index 580feb90e9..7805ae9d41 100644 > > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S > > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S > > @@ -88,9 +88,8 @@ ENTRY (GLABEL(__strcasecmp)) > > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax > > mov %fs:(%rax),%RDX_LP > > > > - // XXX 5 byte should be before the function > > - /* 5-byte NOP. */ > > - .byte 0x0f,0x1f,0x44,0x00,0x00 > > + /* Either 1 or 5 bytes (dependeing if CET is enabled). */ > > + .p2align 4 > > END (GLABEL(__strcasecmp)) > > /* FALLTHROUGH to strcasecmp_l. */ > > #endif > > @@ -99,9 +98,8 @@ ENTRY (GLABEL(__strncasecmp)) > > movq __libc_tsd_LOCALE@gottpoff(%rip),%rax > > mov %fs:(%rax),%RCX_LP > > > > - // XXX 5 byte should be before the function > > - /* 5-byte NOP. */ > > - .byte 0x0f,0x1f,0x44,0x00,0x00 > > + /* Either 1 or 5 bytes (dependeing if CET is enabled). */ > > + .p2align 4 > > END (GLABEL(__strncasecmp)) > > /* FALLTHROUGH to strncasecmp_l. 
*/ > > #endif > > @@ -169,27 +167,22 @@ STRCMP_SSE42: > > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L > > .section .rodata.cst16,"aM",@progbits,16 > > .align 16 > > -LABEL(belowupper): > > - .quad 0x4040404040404040 > > - .quad 0x4040404040404040 > > -LABEL(topupper): > > -# ifdef USE_AVX > > - .quad 0x5a5a5a5a5a5a5a5a > > - .quad 0x5a5a5a5a5a5a5a5a > > -# else > > - .quad 0x5b5b5b5b5b5b5b5b > > - .quad 0x5b5b5b5b5b5b5b5b > > -# endif > > -LABEL(touppermask): > > +LABEL(lcase_min): > > + .quad 0x3f3f3f3f3f3f3f3f > > + .quad 0x3f3f3f3f3f3f3f3f > > +LABEL(lcase_max): > > + .quad 0x9999999999999999 > > + .quad 0x9999999999999999 > > +LABEL(case_add): > > .quad 0x2020202020202020 > > .quad 0x2020202020202020 > > .previous > > - movdqa LABEL(belowupper)(%rip), %xmm4 > > -# define UCLOW_reg %xmm4 > > - movdqa LABEL(topupper)(%rip), %xmm5 > > -# define UCHIGH_reg %xmm5 > > - movdqa LABEL(touppermask)(%rip), %xmm6 > > -# define LCQWORD_reg %xmm6 > > + movdqa LABEL(lcase_min)(%rip), %xmm4 > > +# define LCASE_MIN_reg %xmm4 > > + movdqa LABEL(lcase_max)(%rip), %xmm5 > > +# define LCASE_MAX_reg %xmm5 > > + movdqa LABEL(case_add)(%rip), %xmm6 > > +# define CASE_ADD_reg %xmm6 > > #endif > > cmp $0x30, %ecx > > ja LABEL(crosscache)/* rsi: 16-byte load will cross cache line */ > > @@ -200,32 +193,26 @@ LABEL(touppermask): > > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L > > # ifdef USE_AVX > > # define TOLOWER(reg1, reg2) \ > > - vpcmpgtb UCLOW_reg, reg1, %xmm7; \ > > - vpcmpgtb UCHIGH_reg, reg1, %xmm8; \ > > - vpcmpgtb UCLOW_reg, reg2, %xmm9; \ > > - vpcmpgtb UCHIGH_reg, reg2, %xmm10; \ > > - vpandn %xmm7, %xmm8, %xmm8; \ > > - vpandn %xmm9, %xmm10, %xmm10; \ > > - vpand LCQWORD_reg, %xmm8, %xmm8; \ > > - vpand LCQWORD_reg, %xmm10, %xmm10; \ > > - vpor reg1, %xmm8, reg1; \ > > - vpor reg2, %xmm10, reg2 > > + vpaddb LCASE_MIN_reg, reg1, %xmm7; \ > > + vpaddb LCASE_MIN_reg, reg2, %xmm8; \ > > + vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7; \ > > + vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8; \ > > + vpandn CASE_ADD_reg, %xmm7, %xmm7; \ > > + vpandn CASE_ADD_reg, %xmm8, %xmm8; \ > > + vpaddb %xmm7, reg1, reg1; \ > > + vpaddb %xmm8, reg2, reg2 > > # else > > # define TOLOWER(reg1, reg2) \ > > - movdqa reg1, %xmm7; \ > > - movdqa UCHIGH_reg, %xmm8; \ > > - movdqa reg2, %xmm9; \ > > - movdqa UCHIGH_reg, %xmm10; \ > > - pcmpgtb UCLOW_reg, %xmm7; \ > > - pcmpgtb reg1, %xmm8; \ > > - pcmpgtb UCLOW_reg, %xmm9; \ > > - pcmpgtb reg2, %xmm10; \ > > - pand %xmm8, %xmm7; \ > > - pand %xmm10, %xmm9; \ > > - pand LCQWORD_reg, %xmm7; \ > > - pand LCQWORD_reg, %xmm9; \ > > - por %xmm7, reg1; \ > > - por %xmm9, reg2 > > + movdqa LCASE_MIN_reg, %xmm7; \ > > + movdqa LCASE_MIN_reg, %xmm8; \ > > + paddb reg1, %xmm7; \ > > + paddb reg2, %xmm8; \ > > + pcmpgtb LCASE_MAX_reg, %xmm7; \ > > + pcmpgtb LCASE_MAX_reg, %xmm8; \ > > + pandn CASE_ADD_reg, %xmm7; \ > > + pandn CASE_ADD_reg, %xmm8; \ > > + paddb %xmm7, reg1; \ > > + paddb %xmm8, reg2 > > # endif > > TOLOWER (%xmm1, %xmm2) > > #else > > -- > > 2.25.1 > > > > LGTM. > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil ^ permalink raw reply [flat|nested] 10+ messages in thread
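Since the same constants now drive both the SSE form and the VEX-encoded form of the macro in strcmp-sse42.S, a quick exhaustive check over all 256 byte values is enough to confirm that the bias trick matches the plain uppercase range test. A small stand-alone harness, not part of the patch (the narrowing casts again model pcmpgtb's signed byte compare):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  for (unsigned v = 0; v < 256; v++)
    {
      uint8_t c = (uint8_t) v;
      uint8_t expected = (c >= 'A' && c <= 'Z') ? (uint8_t) (c + 0x20) : c;

      uint8_t biased = (uint8_t) (c + 0x3f);
      uint8_t not_upper = ((int8_t) biased > (int8_t) 0x99) ? 0xff : 0x00;
      uint8_t got = (uint8_t) (c + (uint8_t) (~not_upper & 0x20));

      assert (got == expected);
    }
  puts ("bias-based TOLOWER matches the range test for all byte values");
  return 0;
}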
[parent not found: <20220323215734.3927131-23-goldstein.w.n@gmail.com>]
[parent not found: <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>]
* Re: [PATCH v1 23/23] x86: Remove AVX str{n}casecmp [not found] ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com> @ 2022-05-12 19:54 ` Sunil Pandey 0 siblings, 0 replies; 10+ messages in thread From: Sunil Pandey @ 2022-05-12 19:54 UTC (permalink / raw) To: H.J. Lu, Libc-stable Mailing List; +Cc: Noah Goldstein, GNU C Library On Thu, Mar 24, 2022 at 12:09 PM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Wed, Mar 23, 2022 at 3:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > The rational is: > > > > 1. SSE42 has nearly identical logic so any benefit is minimal (3.4% > > regression on Tigerlake using SSE42 versus AVX across the > > benchtest suite). > > 2. AVX2 version covers the majority of targets that previously > > prefered it. > > 3. The targets where AVX would still be best (SnB and IVB) are > > becoming outdated. > > > > All in all the saving the code size is worth it. > > > > All string/memory tests pass. > > --- > > Geomtric Mean N=40 runs; All functions page aligned > > length, align1, align2, max_char, AVX Time / SSE42 Time > > 1, 1, 1, 127, 0.928 > > 2, 2, 2, 127, 0.934 > > 3, 3, 3, 127, 0.975 > > 4, 4, 4, 127, 0.96 > > 5, 5, 5, 127, 0.935 > > 6, 6, 6, 127, 0.929 > > 7, 7, 7, 127, 0.959 > > 8, 0, 0, 127, 0.955 > > 9, 1, 1, 127, 0.944 > > 10, 2, 2, 127, 0.975 > > 11, 3, 3, 127, 0.935 > > 12, 4, 4, 127, 0.931 > > 13, 5, 5, 127, 0.926 > > 14, 6, 6, 127, 0.901 > > 15, 7, 7, 127, 0.951 > > 4, 0, 0, 127, 0.958 > > 4, 0, 0, 254, 0.956 > > 8, 0, 0, 254, 0.977 > > 16, 0, 0, 127, 0.955 > > 16, 0, 0, 254, 0.953 > > 32, 0, 0, 127, 0.943 > > 32, 0, 0, 254, 0.941 > > 64, 0, 0, 127, 0.941 > > 64, 0, 0, 254, 0.955 > > 128, 0, 0, 127, 0.972 > > 128, 0, 0, 254, 0.975 > > 256, 0, 0, 127, 0.996 > > 256, 0, 0, 254, 0.993 > > 512, 0, 0, 127, 0.992 > > 512, 0, 0, 254, 0.986 > > 1024, 0, 0, 127, 0.994 > > 1024, 0, 0, 254, 0.993 > > 16, 1, 2, 127, 0.933 > > 16, 2, 1, 254, 0.953 > > 32, 2, 4, 127, 0.927 > > 32, 4, 2, 254, 0.986 > > 64, 3, 6, 127, 0.991 > > 64, 6, 3, 254, 1.014 > > 128, 4, 0, 127, 1.001 > > 128, 0, 4, 254, 0.991 > > 256, 5, 2, 127, 1.011 > > 256, 2, 5, 254, 1.013 > > 512, 6, 4, 127, 1.056 > > 512, 4, 6, 254, 0.916 > > 1024, 7, 6, 127, 1.059 > > 1024, 6, 7, 254, 1.043 > > > > sysdeps/x86_64/multiarch/Makefile | 2 - > > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 12 - > > sysdeps/x86_64/multiarch/ifunc-strcasecmp.h | 4 - > > sysdeps/x86_64/multiarch/strcasecmp_l-avx.S | 22 -- > > sysdeps/x86_64/multiarch/strcmp-sse42.S | 240 +++++++++----------- > > sysdeps/x86_64/multiarch/strncase_l-avx.S | 22 -- > > 6 files changed, 105 insertions(+), 197 deletions(-) > > delete mode 100644 sysdeps/x86_64/multiarch/strcasecmp_l-avx.S > > delete mode 100644 sysdeps/x86_64/multiarch/strncase_l-avx.S > > > > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile > > index 35d80dc2ff..6507d1b7fa 100644 > > --- a/sysdeps/x86_64/multiarch/Makefile > > +++ b/sysdeps/x86_64/multiarch/Makefile > > @@ -54,7 +54,6 @@ sysdep_routines += \ > > stpncpy-evex \ > > stpncpy-sse2-unaligned \ > > stpncpy-ssse3 \ > > - strcasecmp_l-avx \ > > strcasecmp_l-avx2 \ > > strcasecmp_l-avx2-rtm \ > > strcasecmp_l-evex \ > > @@ -95,7 +94,6 @@ sysdep_routines += \ > > strlen-avx2-rtm \ > > strlen-evex \ > > strlen-sse2 \ > > - strncase_l-avx \ > > strncase_l-avx2 \ > > strncase_l-avx2-rtm \ > > strncase_l-evex \ > > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > index f1a4d3dac2..40cc6cc49e 
100644 > > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > @@ -447,9 +447,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > (CPU_FEATURE_USABLE (AVX2) > > && CPU_FEATURE_USABLE (RTM)), > > __strcasecmp_avx2_rtm) > > - IFUNC_IMPL_ADD (array, i, strcasecmp, > > - CPU_FEATURE_USABLE (AVX), > > - __strcasecmp_avx) > > IFUNC_IMPL_ADD (array, i, strcasecmp, > > CPU_FEATURE_USABLE (SSE4_2), > > __strcasecmp_sse42) > > @@ -471,9 +468,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > (CPU_FEATURE_USABLE (AVX2) > > && CPU_FEATURE_USABLE (RTM)), > > __strcasecmp_l_avx2_rtm) > > - IFUNC_IMPL_ADD (array, i, strcasecmp_l, > > - CPU_FEATURE_USABLE (AVX), > > - __strcasecmp_l_avx) > > IFUNC_IMPL_ADD (array, i, strcasecmp_l, > > CPU_FEATURE_USABLE (SSE4_2), > > __strcasecmp_l_sse42) > > @@ -609,9 +603,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > (CPU_FEATURE_USABLE (AVX2) > > && CPU_FEATURE_USABLE (RTM)), > > __strncasecmp_avx2_rtm) > > - IFUNC_IMPL_ADD (array, i, strncasecmp, > > - CPU_FEATURE_USABLE (AVX), > > - __strncasecmp_avx) > > IFUNC_IMPL_ADD (array, i, strncasecmp, > > CPU_FEATURE_USABLE (SSE4_2), > > __strncasecmp_sse42) > > @@ -634,9 +625,6 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > (CPU_FEATURE_USABLE (AVX2) > > && CPU_FEATURE_USABLE (RTM)), > > __strncasecmp_l_avx2_rtm) > > - IFUNC_IMPL_ADD (array, i, strncasecmp_l, > > - CPU_FEATURE_USABLE (AVX), > > - __strncasecmp_l_avx) > > IFUNC_IMPL_ADD (array, i, strncasecmp_l, > > CPU_FEATURE_USABLE (SSE4_2), > > __strncasecmp_l_sse42) > > diff --git a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h > > index bf0d146e7f..766539c241 100644 > > --- a/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h > > +++ b/sysdeps/x86_64/multiarch/ifunc-strcasecmp.h > > @@ -22,7 +22,6 @@ > > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden; > > extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden; > > extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden; > > -extern __typeof (REDIRECT_NAME) OPTIMIZE (avx) attribute_hidden; > > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden; > > extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden; > > extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden; > > @@ -46,9 +45,6 @@ IFUNC_SELECTOR (void) > > return OPTIMIZE (avx2); > > } > > > > - if (CPU_FEATURE_USABLE_P (cpu_features, AVX)) > > - return OPTIMIZE (avx); > > - > > if (CPU_FEATURE_USABLE_P (cpu_features, SSE4_2) > > && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2)) > > return OPTIMIZE (sse42); > > diff --git a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S b/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S > > deleted file mode 100644 > > index 7ec7c21b5a..0000000000 > > --- a/sysdeps/x86_64/multiarch/strcasecmp_l-avx.S > > +++ /dev/null > > @@ -1,22 +0,0 @@ > > -/* strcasecmp_l optimized with AVX. > > - Copyright (C) 2017-2022 Free Software Foundation, Inc. > > - This file is part of the GNU C Library. > > - > > - The GNU C Library is free software; you can redistribute it and/or > > - modify it under the terms of the GNU Lesser General Public > > - License as published by the Free Software Foundation; either > > - version 2.1 of the License, or (at your option) any later version. 
> > - > > - The GNU C Library is distributed in the hope that it will be useful, > > - but WITHOUT ANY WARRANTY; without even the implied warranty of > > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > - Lesser General Public License for more details. > > - > > - You should have received a copy of the GNU Lesser General Public > > - License along with the GNU C Library; if not, see > > - <https://www.gnu.org/licenses/>. */ > > - > > -#define STRCMP_SSE42 __strcasecmp_l_avx > > -#define USE_AVX 1 > > -#define USE_AS_STRCASECMP_L > > -#include "strcmp-sse42.S" > > diff --git a/sysdeps/x86_64/multiarch/strcmp-sse42.S b/sysdeps/x86_64/multiarch/strcmp-sse42.S > > index 7805ae9d41..a9178ad25c 100644 > > --- a/sysdeps/x86_64/multiarch/strcmp-sse42.S > > +++ b/sysdeps/x86_64/multiarch/strcmp-sse42.S > > @@ -41,13 +41,8 @@ > > # define UPDATE_STRNCMP_COUNTER > > #endif > > > > -#ifdef USE_AVX > > -# define SECTION avx > > -# define GLABEL(l) l##_avx > > -#else > > -# define SECTION sse4.2 > > -# define GLABEL(l) l##_sse42 > > -#endif > > +#define SECTION sse4.2 > > +#define GLABEL(l) l##_sse42 > > > > #define LABEL(l) .L##l > > > > @@ -105,21 +100,7 @@ END (GLABEL(__strncasecmp)) > > #endif > > > > > > -#ifdef USE_AVX > > -# define movdqa vmovdqa > > -# define movdqu vmovdqu > > -# define pmovmskb vpmovmskb > > -# define pcmpistri vpcmpistri > > -# define psubb vpsubb > > -# define pcmpeqb vpcmpeqb > > -# define psrldq vpsrldq > > -# define pslldq vpslldq > > -# define palignr vpalignr > > -# define pxor vpxor > > -# define D(arg) arg, arg > > -#else > > -# define D(arg) arg > > -#endif > > +#define arg arg > > > > STRCMP_SSE42: > > cfi_startproc > > @@ -191,18 +172,7 @@ LABEL(case_add): > > movdqu (%rdi), %xmm1 > > movdqu (%rsi), %xmm2 > > #if defined USE_AS_STRCASECMP_L || defined USE_AS_STRNCASECMP_L > > -# ifdef USE_AVX > > -# define TOLOWER(reg1, reg2) \ > > - vpaddb LCASE_MIN_reg, reg1, %xmm7; \ > > - vpaddb LCASE_MIN_reg, reg2, %xmm8; \ > > - vpcmpgtb LCASE_MAX_reg, %xmm7, %xmm7; \ > > - vpcmpgtb LCASE_MAX_reg, %xmm8, %xmm8; \ > > - vpandn CASE_ADD_reg, %xmm7, %xmm7; \ > > - vpandn CASE_ADD_reg, %xmm8, %xmm8; \ > > - vpaddb %xmm7, reg1, reg1; \ > > - vpaddb %xmm8, reg2, reg2 > > -# else > > -# define TOLOWER(reg1, reg2) \ > > +# define TOLOWER(reg1, reg2) \ > > movdqa LCASE_MIN_reg, %xmm7; \ > > movdqa LCASE_MIN_reg, %xmm8; \ > > paddb reg1, %xmm7; \ > > @@ -213,15 +183,15 @@ LABEL(case_add): > > pandn CASE_ADD_reg, %xmm8; \ > > paddb %xmm7, reg1; \ > > paddb %xmm8, reg2 > > -# endif > > + > > TOLOWER (%xmm1, %xmm2) > > #else > > # define TOLOWER(reg1, reg2) > > #endif > > - pxor %xmm0, D(%xmm0) /* clear %xmm0 for null char checks */ > > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */ > > - pcmpeqb %xmm2, D(%xmm1) /* compare first 16 bytes for equality */ > > - psubb %xmm0, D(%xmm1) /* packed sub of comparison results*/ > > + pxor %xmm0, %xmm0 /* clear %xmm0 for null char checks */ > > + pcmpeqb %xmm1, %xmm0 /* Any null chars? 
*/ > > + pcmpeqb %xmm2, %xmm1 /* compare first 16 bytes for equality */ > > + psubb %xmm0, %xmm1 /* packed sub of comparison results*/ > > pmovmskb %xmm1, %edx > > sub $0xffff, %edx /* if first 16 bytes are same, edx == 0xffff */ > > jnz LABEL(less16bytes)/* If not, find different value or null char */ > > @@ -245,7 +215,7 @@ LABEL(crosscache): > > xor %r8d, %r8d > > and $0xf, %ecx /* offset of rsi */ > > and $0xf, %eax /* offset of rdi */ > > - pxor %xmm0, D(%xmm0) /* clear %xmm0 for null char check */ > > + pxor %xmm0, %xmm0 /* clear %xmm0 for null char check */ > > cmp %eax, %ecx > > je LABEL(ashr_0) /* rsi and rdi relative offset same */ > > ja LABEL(bigger) > > @@ -259,7 +229,7 @@ LABEL(bigger): > > sub %rcx, %r9 > > lea LABEL(unaligned_table)(%rip), %r10 > > movslq (%r10, %r9,4), %r9 > > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */ > > + pcmpeqb %xmm1, %xmm0 /* Any null chars? */ > > lea (%r10, %r9), %r10 > > _CET_NOTRACK jmp *%r10 /* jump to corresponding case */ > > > > @@ -272,15 +242,15 @@ LABEL(bigger): > > LABEL(ashr_0): > > > > movdqa (%rsi), %xmm1 > > - pcmpeqb %xmm1, D(%xmm0) /* Any null chars? */ > > + pcmpeqb %xmm1, %xmm0 /* Any null chars? */ > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > - pcmpeqb (%rdi), D(%xmm1) /* compare 16 bytes for equality */ > > + pcmpeqb (%rdi), %xmm1 /* compare 16 bytes for equality */ > > #else > > movdqa (%rdi), %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm2, D(%xmm1) /* compare 16 bytes for equality */ > > + pcmpeqb %xmm2, %xmm1 /* compare 16 bytes for equality */ > > #endif > > - psubb %xmm0, D(%xmm1) /* packed sub of comparison results*/ > > + psubb %xmm0, %xmm1 /* packed sub of comparison results*/ > > pmovmskb %xmm1, %r9d > > shr %cl, %edx /* adjust 0xffff for offset */ > > shr %cl, %r9d /* adjust for 16-byte offset */ > > @@ -360,10 +330,10 @@ LABEL(ashr_0_exit_use): > > */ > > .p2align 4 > > LABEL(ashr_1): > > - pslldq $15, D(%xmm2) /* shift first string to align with second */ > > + pslldq $15, %xmm2 /* shift first string to align with second */ > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) /* compare 16 bytes for equality */ > > - psubb %xmm0, D(%xmm2) /* packed sub of comparison results*/ > > + pcmpeqb %xmm1, %xmm2 /* compare 16 bytes for equality */ > > + psubb %xmm0, %xmm2 /* packed sub of comparison results*/ > > pmovmskb %xmm2, %r9d > > shr %cl, %edx /* adjust 0xffff for offset */ > > shr %cl, %r9d /* adjust for 16-byte offset */ > > @@ -391,7 +361,7 @@ LABEL(loop_ashr_1_use): > > > > LABEL(nibble_ashr_1_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $1, -16(%rdi, %rdx), D(%xmm0) > > + palignr $1, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -410,7 +380,7 @@ LABEL(nibble_ashr_1_restart_use): > > jg LABEL(nibble_ashr_1_use) > > > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $1, -16(%rdi, %rdx), D(%xmm0) > > + palignr $1, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -430,7 +400,7 @@ LABEL(nibble_ashr_1_restart_use): > > LABEL(nibble_ashr_1_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $1, D(%xmm0) > > + psrldq $1, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -448,10 +418,10 @@ LABEL(nibble_ashr_1_use): > > */ > > .p2align 4 > > LABEL(ashr_2): > > - pslldq $14, 
D(%xmm2) > > + pslldq $14, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -479,7 +449,7 @@ LABEL(loop_ashr_2_use): > > > > LABEL(nibble_ashr_2_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $2, -16(%rdi, %rdx), D(%xmm0) > > + palignr $2, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -498,7 +468,7 @@ LABEL(nibble_ashr_2_restart_use): > > jg LABEL(nibble_ashr_2_use) > > > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $2, -16(%rdi, %rdx), D(%xmm0) > > + palignr $2, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -518,7 +488,7 @@ LABEL(nibble_ashr_2_restart_use): > > LABEL(nibble_ashr_2_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $2, D(%xmm0) > > + psrldq $2, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -536,10 +506,10 @@ LABEL(nibble_ashr_2_use): > > */ > > .p2align 4 > > LABEL(ashr_3): > > - pslldq $13, D(%xmm2) > > + pslldq $13, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -567,7 +537,7 @@ LABEL(loop_ashr_3_use): > > > > LABEL(nibble_ashr_3_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $3, -16(%rdi, %rdx), D(%xmm0) > > + palignr $3, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -586,7 +556,7 @@ LABEL(nibble_ashr_3_restart_use): > > jg LABEL(nibble_ashr_3_use) > > > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $3, -16(%rdi, %rdx), D(%xmm0) > > + palignr $3, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -606,7 +576,7 @@ LABEL(nibble_ashr_3_restart_use): > > LABEL(nibble_ashr_3_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $3, D(%xmm0) > > + psrldq $3, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -624,10 +594,10 @@ LABEL(nibble_ashr_3_use): > > */ > > .p2align 4 > > LABEL(ashr_4): > > - pslldq $12, D(%xmm2) > > + pslldq $12, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -656,7 +626,7 @@ LABEL(loop_ashr_4_use): > > > > LABEL(nibble_ashr_4_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $4, -16(%rdi, %rdx), D(%xmm0) > > + palignr $4, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -675,7 +645,7 @@ LABEL(nibble_ashr_4_restart_use): > > jg LABEL(nibble_ashr_4_use) > > > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $4, -16(%rdi, %rdx), D(%xmm0) > > + palignr $4, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -695,7 +665,7 @@ LABEL(nibble_ashr_4_restart_use): > > 
LABEL(nibble_ashr_4_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $4, D(%xmm0) > > + psrldq $4, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -713,10 +683,10 @@ LABEL(nibble_ashr_4_use): > > */ > > .p2align 4 > > LABEL(ashr_5): > > - pslldq $11, D(%xmm2) > > + pslldq $11, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -745,7 +715,7 @@ LABEL(loop_ashr_5_use): > > > > LABEL(nibble_ashr_5_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $5, -16(%rdi, %rdx), D(%xmm0) > > + palignr $5, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -765,7 +735,7 @@ LABEL(nibble_ashr_5_restart_use): > > > > movdqa (%rdi, %rdx), %xmm0 > > > > - palignr $5, -16(%rdi, %rdx), D(%xmm0) > > + palignr $5, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -785,7 +755,7 @@ LABEL(nibble_ashr_5_restart_use): > > LABEL(nibble_ashr_5_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $5, D(%xmm0) > > + psrldq $5, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -803,10 +773,10 @@ LABEL(nibble_ashr_5_use): > > */ > > .p2align 4 > > LABEL(ashr_6): > > - pslldq $10, D(%xmm2) > > + pslldq $10, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -835,7 +805,7 @@ LABEL(loop_ashr_6_use): > > > > LABEL(nibble_ashr_6_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $6, -16(%rdi, %rdx), D(%xmm0) > > + palignr $6, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -854,7 +824,7 @@ LABEL(nibble_ashr_6_restart_use): > > jg LABEL(nibble_ashr_6_use) > > > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $6, -16(%rdi, %rdx), D(%xmm0) > > + palignr $6, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -874,7 +844,7 @@ LABEL(nibble_ashr_6_restart_use): > > LABEL(nibble_ashr_6_use): > > sub $0x1000, %r10 > > movdqa -16(%rdi, %rdx), %xmm0 > > - psrldq $6, D(%xmm0) > > + psrldq $6, %xmm0 > > pcmpistri $0x3a,%xmm0, %xmm0 > > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L > > cmp %r11, %rcx > > @@ -892,10 +862,10 @@ LABEL(nibble_ashr_6_use): > > */ > > .p2align 4 > > LABEL(ashr_7): > > - pslldq $9, D(%xmm2) > > + pslldq $9, %xmm2 > > TOLOWER (%xmm1, %xmm2) > > - pcmpeqb %xmm1, D(%xmm2) > > - psubb %xmm0, D(%xmm2) > > + pcmpeqb %xmm1, %xmm2 > > + psubb %xmm0, %xmm2 > > pmovmskb %xmm2, %r9d > > shr %cl, %edx > > shr %cl, %r9d > > @@ -924,7 +894,7 @@ LABEL(loop_ashr_7_use): > > > > LABEL(nibble_ashr_7_restart_use): > > movdqa (%rdi, %rdx), %xmm0 > > - palignr $7, -16(%rdi, %rdx), D(%xmm0) > > + palignr $7, -16(%rdi, %rdx), %xmm0 > > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L > > pcmpistri $0x1a,(%rsi,%rdx), %xmm0 > > #else > > @@ -943,7 +913,7 @@ LABEL(nibble_ashr_7_restart_use): > > 
jg LABEL(nibble_ashr_7_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $7, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $7, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a,(%rsi,%rdx), %xmm0
> > #else
> > @@ -963,7 +933,7 @@ LABEL(nibble_ashr_7_restart_use):
> > LABEL(nibble_ashr_7_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $7, D(%xmm0)
> > + psrldq $7, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -981,10 +951,10 @@ LABEL(nibble_ashr_7_use):
> > */
> > .p2align 4
> > LABEL(ashr_8):
> > - pslldq $8, D(%xmm2)
> > + pslldq $8, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1013,7 +983,7 @@ LABEL(loop_ashr_8_use):
> >
> > LABEL(nibble_ashr_8_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $8, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1032,7 +1002,7 @@ LABEL(nibble_ashr_8_restart_use):
> > jg LABEL(nibble_ashr_8_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $8, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $8, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1052,7 +1022,7 @@ LABEL(nibble_ashr_8_restart_use):
> > LABEL(nibble_ashr_8_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $8, D(%xmm0)
> > + psrldq $8, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1070,10 +1040,10 @@ LABEL(nibble_ashr_8_use):
> > */
> > .p2align 4
> > LABEL(ashr_9):
> > - pslldq $7, D(%xmm2)
> > + pslldq $7, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1103,7 +1073,7 @@ LABEL(loop_ashr_9_use):
> >
> > LABEL(nibble_ashr_9_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> >
> > - palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $9, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1122,7 +1092,7 @@ LABEL(nibble_ashr_9_restart_use):
> > jg LABEL(nibble_ashr_9_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $9, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $9, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1142,7 +1112,7 @@ LABEL(nibble_ashr_9_restart_use):
> > LABEL(nibble_ashr_9_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $9, D(%xmm0)
> > + psrldq $9, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1160,10 +1130,10 @@ LABEL(nibble_ashr_9_use):
> > */
> > .p2align 4
> > LABEL(ashr_10):
> > - pslldq $6, D(%xmm2)
> > + pslldq $6, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1192,7 +1162,7 @@ LABEL(loop_ashr_10_use):
> >
> > LABEL(nibble_ashr_10_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $10, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1211,7 +1181,7 @@ LABEL(nibble_ashr_10_restart_use):
> > jg LABEL(nibble_ashr_10_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $10, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $10, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1231,7 +1201,7 @@ LABEL(nibble_ashr_10_restart_use):
> > LABEL(nibble_ashr_10_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $10, D(%xmm0)
> > + psrldq $10, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1249,10 +1219,10 @@ LABEL(nibble_ashr_10_use):
> > */
> > .p2align 4
> > LABEL(ashr_11):
> > - pslldq $5, D(%xmm2)
> > + pslldq $5, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1281,7 +1251,7 @@ LABEL(loop_ashr_11_use):
> >
> > LABEL(nibble_ashr_11_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $11, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1300,7 +1270,7 @@ LABEL(nibble_ashr_11_restart_use):
> > jg LABEL(nibble_ashr_11_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $11, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $11, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1320,7 +1290,7 @@ LABEL(nibble_ashr_11_restart_use):
> > LABEL(nibble_ashr_11_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $11, D(%xmm0)
> > + psrldq $11, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1338,10 +1308,10 @@ LABEL(nibble_ashr_11_use):
> > */
> > .p2align 4
> > LABEL(ashr_12):
> > - pslldq $4, D(%xmm2)
> > + pslldq $4, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1370,7 +1340,7 @@ LABEL(loop_ashr_12_use):
> >
> > LABEL(nibble_ashr_12_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $12, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1389,7 +1359,7 @@ LABEL(nibble_ashr_12_restart_use):
> > jg LABEL(nibble_ashr_12_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $12, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $12, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1409,7 +1379,7 @@ LABEL(nibble_ashr_12_restart_use):
> > LABEL(nibble_ashr_12_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $12, D(%xmm0)
> > + psrldq $12, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1427,10 +1397,10 @@ LABEL(nibble_ashr_12_use):
> > */
> > .p2align 4
> > LABEL(ashr_13):
> > - pslldq $3, D(%xmm2)
> > + pslldq $3, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1460,7 +1430,7 @@ LABEL(loop_ashr_13_use):
> >
> > LABEL(nibble_ashr_13_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $13, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1479,7 +1449,7 @@ LABEL(nibble_ashr_13_restart_use):
> > jg LABEL(nibble_ashr_13_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $13, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $13, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1499,7 +1469,7 @@ LABEL(nibble_ashr_13_restart_use):
> > LABEL(nibble_ashr_13_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $13, D(%xmm0)
> > + psrldq $13, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1517,10 +1487,10 @@ LABEL(nibble_ashr_13_use):
> > */
> > .p2align 4
> > LABEL(ashr_14):
> > - pslldq $2, D(%xmm2)
> > + pslldq $2, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1550,7 +1520,7 @@ LABEL(loop_ashr_14_use):
> >
> > LABEL(nibble_ashr_14_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $14, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1569,7 +1539,7 @@ LABEL(nibble_ashr_14_restart_use):
> > jg LABEL(nibble_ashr_14_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $14, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $14, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1589,7 +1559,7 @@ LABEL(nibble_ashr_14_restart_use):
> > LABEL(nibble_ashr_14_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $14, D(%xmm0)
> > + psrldq $14, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > @@ -1607,10 +1577,10 @@ LABEL(nibble_ashr_14_use):
> > */
> > .p2align 4
> > LABEL(ashr_15):
> > - pslldq $1, D(%xmm2)
> > + pslldq $1, %xmm2
> > TOLOWER (%xmm1, %xmm2)
> > - pcmpeqb %xmm1, D(%xmm2)
> > - psubb %xmm0, D(%xmm2)
> > + pcmpeqb %xmm1, %xmm2
> > + psubb %xmm0, %xmm2
> > pmovmskb %xmm2, %r9d
> > shr %cl, %edx
> > shr %cl, %r9d
> > @@ -1642,7 +1612,7 @@ LABEL(loop_ashr_15_use):
> >
> > LABEL(nibble_ashr_15_restart_use):
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $15, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1661,7 +1631,7 @@ LABEL(nibble_ashr_15_restart_use):
> > jg LABEL(nibble_ashr_15_use)
> >
> > movdqa (%rdi, %rdx), %xmm0
> > - palignr $15, -16(%rdi, %rdx), D(%xmm0)
> > + palignr $15, -16(%rdi, %rdx), %xmm0
> > #if !defined USE_AS_STRCASECMP_L && !defined USE_AS_STRNCASECMP_L
> > pcmpistri $0x1a, (%rsi,%rdx), %xmm0
> > #else
> > @@ -1681,7 +1651,7 @@ LABEL(nibble_ashr_15_restart_use):
> > LABEL(nibble_ashr_15_use):
> > sub $0x1000, %r10
> > movdqa -16(%rdi, %rdx), %xmm0
> > - psrldq $15, D(%xmm0)
> > + psrldq $15, %xmm0
> > pcmpistri $0x3a,%xmm0, %xmm0
> > #if defined USE_AS_STRNCMP || defined USE_AS_STRNCASECMP_L
> > cmp %r11, %rcx
> > diff --git a/sysdeps/x86_64/multiarch/strncase_l-avx.S b/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > deleted file mode 100644
> > index b51b86d223..0000000000
> > --- a/sysdeps/x86_64/multiarch/strncase_l-avx.S
> > +++ /dev/null
> > @@ -1,22 +0,0 @@
> > -/* strncasecmp_l optimized with AVX.
> > - Copyright (C) 2017-2022 Free Software Foundation, Inc.
> > - This file is part of the GNU C Library.
> > -
> > - The GNU C Library is free software; you can redistribute it and/or
> > - modify it under the terms of the GNU Lesser General Public
> > - License as published by the Free Software Foundation; either
> > - version 2.1 of the License, or (at your option) any later version.
> > -
> > - The GNU C Library is distributed in the hope that it will be useful,
> > - but WITHOUT ANY WARRANTY; without even the implied warranty of
> > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > - Lesser General Public License for more details.
> > -
> > - You should have received a copy of the GNU Lesser General Public
> > - License along with the GNU C Library; if not, see
> > - <https://www.gnu.org/licenses/>. */
> > -
> > -#define STRCMP_SSE42 __strncasecmp_l_avx
> > -#define USE_AVX 1
> > -#define USE_AS_STRNCASECMP_L
> > -#include "strcmp-sse42.S"
> > --
> > 2.25.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches.
Any comments or objections?

--Sunil

^ permalink raw reply	[flat|nested] 10+ messages in thread
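The patch quoted above removes only one of glibc's several strcasecmp/strncasecmp implementations; the C interface that applications call is unchanged, and the multiarch dispatch simply selects among the remaining variants at run time. A minimal caller, shown purely as an illustration (the arguments and the expected output are not taken from the patch):

    #include <stdio.h>
    #include <strings.h>   /* strncasecmp */

    int
    main (void)
    {
      /* Case-insensitive compare of the first 5 bytes: prints "equal"
         regardless of which strncasecmp implementation glibc selects.  */
      if (strncasecmp ("GLIBC", "glibc-2.35", 5) == 0)
        puts ("equal");
      else
        puts ("different");
      return 0;
    }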
Thread overview: 10+ messages

[not found] <20220323215734.3927131-1-goldstein.w.n@gmail.com>
[not found] ` <20220323215734.3927131-3-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOqQHH-20_czF-vtb_L_6MRBer=H9g3XpNBQLzcoSLZj+A@mail.gmail.com>
[not found]     ` <CAFUsyfKfR3haCneczj0=ji+u3X_RsMNCXuOadytBrcaxgoEVTg@mail.gmail.com>
[not found]       ` <CAMe9rOqRGcLn3tvQSANaSydOM8RRQ2cY0PxBOHDu=iK88j=XUg@mail.gmail.com>
2022-05-12 19:31         ` [PATCH v1 03/23] x86: Code cleanup in strchr-avx2 and comment justifying branch Sunil Pandey
[not found] ` <20220323215734.3927131-4-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOraZjeAXy8GgdNqUb94y+0TUwbjWKJU7RixESgRYw1o7A@mail.gmail.com>
2022-05-12 19:32     ` [PATCH v1 04/23] x86: Code cleanup in strchr-evex " Sunil Pandey
[not found] ` <20220323215734.3927131-7-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOpSiZaO+mkq8OqwTHS__JgUD4LQQShMpjrgyGdZSPwUsA@mail.gmail.com>
2022-05-12 19:34     ` [PATCH v1 07/23] x86: Optimize strcspn and strpbrk in strcspn-c.c Sunil Pandey
[not found] ` <20220323215734.3927131-8-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOo7oks15kPUaZqd=Z1J1Xe==FJoECTNU5mBca9WTHgf1w@mail.gmail.com>
2022-05-12 19:39     ` [PATCH v1 08/23] x86: Optimize strspn in strspn-c.c Sunil Pandey
[not found] ` <20220323215734.3927131-9-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOrQ_zOdL-n-iiYpzLf+RxD_DBR51yEnGpRKB0zj4m31SQ@mail.gmail.com>
2022-05-12 19:40     ` [PATCH v1 09/23] x86: Remove strcspn-sse2.S and use the generic implementation Sunil Pandey
[not found] ` <20220323215734.3927131-10-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOqn8rZNfisVTmSKP9iWH1N26D--dncq1=MMgo-Hh-oR_Q@mail.gmail.com>
2022-05-12 19:41     ` [PATCH v1 10/23] x86: Remove strpbrk-sse2.S " Sunil Pandey
[not found] ` <20220323215734.3927131-11-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOp89e_1T9+i0W3=R3XR8DHp_Ua72x+poB6HQvE1q6b0MQ@mail.gmail.com>
2022-05-12 19:42     ` [PATCH v1 11/23] x86: Remove strspn-sse2.S " Sunil Pandey
[not found] ` <20220323215734.3927131-17-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOqkZtA9gE87TiqkHg+_rTZY4dqXO74_LykBwvihNO0YJA@mail.gmail.com>
2022-05-12 19:44     ` [PATCH v1 17/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp.S Sunil Pandey
[not found] ` <20220323215734.3927131-18-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOo-MzhNRiuyFhHpHKanbu50_OPr_Gaof9Yt16tJRwjYFA@mail.gmail.com>
2022-05-12 19:45     ` [PATCH v1 18/23] x86: Optimize str{n}casecmp TOLOWER logic in strcmp-sse42.S Sunil Pandey
[not found] ` <20220323215734.3927131-23-goldstein.w.n@gmail.com>
[not found]   ` <CAMe9rOpzEL=V1OmUFJuScNetUc3mgMqYeqcqiD9aK+tBTN_sxQ@mail.gmail.com>
2022-05-12 19:54     ` [PATCH v1 23/23] x86: Remove AVX str{n}casecmp Sunil Pandey