From: Noah Goldstein
Date: Fri, 4 Nov 2022 13:21:21 -0700
Subject: Re: [PATCH v2 3/4] x86: Optimize and shrink st{r|p}{n}{cat|cpy}-avx2 functions
To: "H.J. Lu"
Cc: libc-alpha@sourceware.org, carlos@systemhalted.org

On Fri, Nov 4, 2022 at 9:45 AM H.J. Lu wrote:
>
> On Fri, Nov 4, 2022 at 1:20 AM Noah Goldstein wrote:
> >
> > Optimizations are:
> > 1. Use more overlapping stores to avoid branches.
> > 2. Reduce how unrolled the aligning copies are (this is more of a
> >    code-size save, it's a negative for some sizes in terms of
> >    perf).
> > 3. For st{r|p}n{cat|cpy} re-order the branches to minimize the
> >    number that are taken.
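
To make (1) concrete for reviewers: the new small-size paths (for
example L(copy_8_15) below) cover a whole range of lengths with two
possibly overlapping moves instead of one branch per size.  Roughly,
in C (an illustrative sketch only, not code from the patch, and
copy_8_to_16 is a made-up name):

    /* Illustrative sketch: copy exactly `len' bytes for
       8 <= len <= 16 with no branches, using one 8-byte move for the
       head and one for the tail.  The two stores overlap whenever
       len < 16, which is harmless.  */
    #include <stdint.h>
    #include <string.h>

    static void
    copy_8_to_16 (char *dst, const char *src, size_t len)
    {
      uint64_t head, tail;
      memcpy (&head, src, 8);            /* first 8 bytes of src.  */
      memcpy (&tail, src + len - 8, 8);  /* last 8 bytes of src.  */
      memcpy (dst, &head, 8);
      memcpy (dst + len - 8, &tail, 8);  /* may overlap the head store.  */
    }

The assembly paths do essentially the same thing with vmovd/vmovq and
the -(8 - CHAR_SIZE)(%rsi, %rdx) style addressing, with the null
terminator folded into the tail move.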
> > > > Performance Changes: > > > > Times are from N = 10 runs of the benchmark suite and are > > reported as geometric mean of all ratios of > > New Implementation / Old Implementation. > > > > strcat-avx2 -> 0.998 > > strcpy-avx2 -> 0.937 > > stpcpy-avx2 -> 0.971 > > > > strncpy-avx2 -> 0.793 > > stpncpy-avx2 -> 0.775 > > > > strncat-avx2 -> 0.962 > > > > Code Size Changes: > > function -> Bytes New / Bytes Old -> Ratio > > > > strcat-avx2 -> 685 / 1639 -> 0.418 > > strcpy-avx2 -> 560 / 903 -> 0.620 > > stpcpy-avx2 -> 592 / 939 -> 0.630 > > > > strncpy-avx2 -> 1176 / 2390 -> 0.492 > > stpncpy-avx2 -> 1268 / 2438 -> 0.520 > > > > strncat-avx2 -> 1042 / 2563 -> 0.407 > > > > Notes: > > 1. Because of the significant difference between the > > implementations they are split into three files. > > > > strcpy-evex.S -> strcpy, stpcpy, strcat > > strncpy-evex.S -> strncpy > > strncat-evex.S > strncat > > > > I couldn't find a way to merge them without making the > > ifdefs incredibly difficult to follow. > > > > 2. All implementations can be made evex512 by including > > "x86-evex512-vecs.h" at the top. > > These comments are wrong for AVX2 implementations. Sorry, fixed in V3. > > > Full check passes on x86-64 and build succeeds for all ISA levels w/ > > and w/o multiarch. > > > > Fix avx2 > > Strayed comments? > Yes. Fixed in V3. > > --- > > sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S | 6 +- > > sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S | 7 +- > > sysdeps/x86_64/multiarch/stpncpy-avx2.S | 5 +- > > sysdeps/x86_64/multiarch/strcat-avx2-rtm.S | 13 +- > > sysdeps/x86_64/multiarch/strcat-avx2.S | 268 +--- > > sysdeps/x86_64/multiarch/strcat-strlen-avx2.S | 76 + > > sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S | 13 +- > > sysdeps/x86_64/multiarch/strcpy-avx2.S | 1236 +++++------------ > > sysdeps/x86_64/multiarch/strncat-avx2-rtm.S | 6 +- > > sysdeps/x86_64/multiarch/strncat-avx2.S | 424 +++++- > > sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S | 6 +- > > sysdeps/x86_64/multiarch/strncpy-avx2.S | 740 +++++++++- > > sysdeps/x86_64/multiarch/x86-avx-vecs.h | 5 +- > > sysdeps/x86_64/multiarch/x86-avx2-rtm-vecs.h | 26 + > > sysdeps/x86_64/multiarch/x86-avx2-vecs.h | 27 + > > 15 files changed, 1624 insertions(+), 1234 deletions(-) > > create mode 100644 sysdeps/x86_64/multiarch/strcat-strlen-avx2.S > > create mode 100644 sysdeps/x86_64/multiarch/x86-avx2-rtm-vecs.h > > create mode 100644 sysdeps/x86_64/multiarch/x86-avx2-vecs.h > > > > diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S > > index 2b9c07a59f..189a288053 100644 > > --- a/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S > > @@ -1,3 +1,3 @@ > > -#define USE_AS_STPCPY > > -#define STRCPY __stpcpy_avx2_rtm > > -#include "strcpy-avx2-rtm.S" > > +#define STPCPY __stpcpy_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > +#include "stpcpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S > > index 60a2ccfe53..1b252985e7 100644 > > --- a/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S > > @@ -1,4 +1,3 @@ > > -#define USE_AS_STPCPY > > -#define USE_AS_STRNCPY > > -#define STRCPY __stpncpy_avx2_rtm > > -#include "strcpy-avx2-rtm.S" > > +#define STPNCPY __stpncpy_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > +#include "stpncpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/stpncpy-avx2.S b/sysdeps/x86_64/multiarch/stpncpy-avx2.S > > index 
b2f8c19143..a46a8edbe2 100644 > > --- a/sysdeps/x86_64/multiarch/stpncpy-avx2.S > > +++ b/sysdeps/x86_64/multiarch/stpncpy-avx2.S > > @@ -3,6 +3,5 @@ > > #endif > > > > #define USE_AS_STPCPY > > -#define USE_AS_STRNCPY > > -#define STRCPY STPNCPY > > -#include "strcpy-avx2.S" > > +#define STRNCPY STPNCPY > > +#include "strncpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S b/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S > > index 637fb557c4..94d51d10bd 100644 > > --- a/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S > > @@ -1,12 +1,3 @@ > > -#ifndef STRCAT > > -# define STRCAT __strcat_avx2_rtm > > -#endif > > - > > -#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > - ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > - > > -#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > > - > > -#define SECTION(p) p##.avx.rtm > > - > > +#define STRCAT __strcat_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > #include "strcat-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strcat-avx2.S b/sysdeps/x86_64/multiarch/strcat-avx2.S > > index d9b7fb2a43..3f914fa342 100644 > > --- a/sysdeps/x86_64/multiarch/strcat-avx2.S > > +++ b/sysdeps/x86_64/multiarch/strcat-avx2.S > > @@ -16,266 +16,10 @@ > > License along with the GNU C Library; if not, see > > . */ > > > > -#include > > - > > -#if ISA_SHOULD_BUILD (3) > > - > > - > > -# include > > - > > -# ifndef STRCAT > > -# define STRCAT __strcat_avx2 > > -# endif > > - > > -# define USE_AS_STRCAT > > - > > -/* Number of bytes in a vector register */ > > -# define VEC_SIZE 32 > > - > > -# ifndef SECTION > > -# define SECTION(p) p##.avx > > -# endif > > - > > - .section SECTION(.text),"ax",@progbits > > -ENTRY (STRCAT) > > - mov %rdi, %r9 > > -# ifdef USE_AS_STRNCAT > > - mov %rdx, %r8 > > -# endif > > - > > - xor %eax, %eax > > - mov %edi, %ecx > > - and $((VEC_SIZE * 4) - 1), %ecx > > - vpxor %xmm6, %xmm6, %xmm6 > > - cmp $(VEC_SIZE * 3), %ecx > > - ja L(fourth_vector_boundary) > > - vpcmpeqb (%rdi), %ymm6, %ymm0 > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_first_vector) > > - mov %rdi, %rax > > - and $-VEC_SIZE, %rax > > - jmp L(align_vec_size_start) > > -L(fourth_vector_boundary): > > - mov %rdi, %rax > > - and $-VEC_SIZE, %rax > > - vpcmpeqb (%rax), %ymm6, %ymm0 > > - mov $-1, %r10d > > - sub %rax, %rcx > > - shl %cl, %r10d > > - vpmovmskb %ymm0, %edx > > - and %r10d, %edx > > - jnz L(exit) > > - > > -L(align_vec_size_start): > > - vpcmpeqb VEC_SIZE(%rax), %ymm6, %ymm0 > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_second_vector) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rax), %ymm6, %ymm1 > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_third_vector) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rax), %ymm6, %ymm2 > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fourth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 4)(%rax), %ymm6, %ymm3 > > - vpmovmskb %ymm3, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fifth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 5)(%rax), %ymm6, %ymm0 > > - add $(VEC_SIZE * 4), %rax > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_second_vector) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rax), %ymm6, %ymm1 > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_third_vector) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rax), %ymm6, %ymm2 > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fourth_vector) > > - > > - vpcmpeqb (VEC_SIZE 
* 4)(%rax), %ymm6, %ymm3 > > - vpmovmskb %ymm3, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fifth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 5)(%rax), %ymm6, %ymm0 > > - add $(VEC_SIZE * 4), %rax > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_second_vector) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rax), %ymm6, %ymm1 > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_third_vector) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rax), %ymm6, %ymm2 > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fourth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 4)(%rax), %ymm6, %ymm3 > > - vpmovmskb %ymm3, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fifth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 5)(%rax), %ymm6, %ymm0 > > - add $(VEC_SIZE * 4), %rax > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_second_vector) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rax), %ymm6, %ymm1 > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_third_vector) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rax), %ymm6, %ymm2 > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fourth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 4)(%rax), %ymm6, %ymm3 > > - vpmovmskb %ymm3, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fifth_vector) > > - > > - test $((VEC_SIZE * 4) - 1), %rax > > - jz L(align_four_vec_loop) > > - > > - vpcmpeqb (VEC_SIZE * 5)(%rax), %ymm6, %ymm0 > > - add $(VEC_SIZE * 5), %rax > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit) > > - > > - test $((VEC_SIZE * 4) - 1), %rax > > - jz L(align_four_vec_loop) > > - > > - vpcmpeqb VEC_SIZE(%rax), %ymm6, %ymm1 > > - add $VEC_SIZE, %rax > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit) > > - > > - test $((VEC_SIZE * 4) - 1), %rax > > - jz L(align_four_vec_loop) > > - > > - vpcmpeqb VEC_SIZE(%rax), %ymm6, %ymm2 > > - add $VEC_SIZE, %rax > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit) > > - > > - test $((VEC_SIZE * 4) - 1), %rax > > - jz L(align_four_vec_loop) > > - > > - vpcmpeqb VEC_SIZE(%rax), %ymm6, %ymm3 > > - add $VEC_SIZE, %rax > > - vpmovmskb %ymm3, %edx > > - test %edx, %edx > > - jnz L(exit) > > - > > - add $VEC_SIZE, %rax > > - > > - .p2align 4 > > -L(align_four_vec_loop): > > - vmovaps (%rax), %ymm4 > > - vpminub VEC_SIZE(%rax), %ymm4, %ymm4 > > - vmovaps (VEC_SIZE * 2)(%rax), %ymm5 > > - vpminub (VEC_SIZE * 3)(%rax), %ymm5, %ymm5 > > - add $(VEC_SIZE * 4), %rax > > - vpminub %ymm4, %ymm5, %ymm5 > > - vpcmpeqb %ymm5, %ymm6, %ymm5 > > - vpmovmskb %ymm5, %edx > > - test %edx, %edx > > - jz L(align_four_vec_loop) > > - > > - vpcmpeqb -(VEC_SIZE * 4)(%rax), %ymm6, %ymm0 > > - sub $(VEC_SIZE * 5), %rax > > - vpmovmskb %ymm0, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_second_vector) > > - > > - vpcmpeqb (VEC_SIZE * 2)(%rax), %ymm6, %ymm1 > > - vpmovmskb %ymm1, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_third_vector) > > - > > - vpcmpeqb (VEC_SIZE * 3)(%rax), %ymm6, %ymm2 > > - vpmovmskb %ymm2, %edx > > - test %edx, %edx > > - jnz L(exit_null_on_fourth_vector) > > - > > - vpcmpeqb (VEC_SIZE * 4)(%rax), %ymm6, %ymm3 > > - vpmovmskb %ymm3, %edx > > - sub %rdi, %rax > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - add $(VEC_SIZE * 4), %rax > > - jmp L(StartStrcpyPart) > > - > > - .p2align 4 > > -L(exit): > > - sub %rdi, %rax > > -L(exit_null_on_first_vector): > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - jmp L(StartStrcpyPart) > > - > > - .p2align 4 > > 
-L(exit_null_on_second_vector): > > - sub %rdi, %rax > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - add $VEC_SIZE, %rax > > - jmp L(StartStrcpyPart) > > - > > - .p2align 4 > > -L(exit_null_on_third_vector): > > - sub %rdi, %rax > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - add $(VEC_SIZE * 2), %rax > > - jmp L(StartStrcpyPart) > > - > > - .p2align 4 > > -L(exit_null_on_fourth_vector): > > - sub %rdi, %rax > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - add $(VEC_SIZE * 3), %rax > > - jmp L(StartStrcpyPart) > > - > > - .p2align 4 > > -L(exit_null_on_fifth_vector): > > - sub %rdi, %rax > > - bsf %rdx, %rdx > > - add %rdx, %rax > > - add $(VEC_SIZE * 4), %rax > > - > > - .p2align 4 > > -L(StartStrcpyPart): > > - lea (%r9, %rax), %rdi > > - mov %rsi, %rcx > > - mov %r9, %rax /* save result */ > > - > > -# ifdef USE_AS_STRNCAT > > - test %r8, %r8 > > - jz L(ExitZero) > > -# define USE_AS_STRNCPY > > -# endif > > - > > -# include "strcpy-avx2.S" > > +#ifndef STRCAT > > +# define STRCAT __strcat_avx2 > > #endif > > + > > +#define USE_AS_STRCAT > > +#define STRCPY STRCAT > > +#include "strcpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strcat-strlen-avx2.S b/sysdeps/x86_64/multiarch/strcat-strlen-avx2.S > > new file mode 100644 > > index 0000000000..128a45b6ff > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/strcat-strlen-avx2.S > > Missing copyright notice. Fixed in V3. > > > @@ -0,0 +1,76 @@ > > + /* Simple strlen implementation that ends at L(strcat_strlen_done). */ > > + movq %rdi, %r8 > > + andq $(VEC_SIZE * -1), %r8 > > + VPCMPEQ (%r8), %VZERO, %VMM(0) > > > > + vpmovmskb %VMM(0), %ecx > > + shrxl %edi, %ecx, %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v0) > > + > > + VPCMPEQ VEC_SIZE(%r8), %VZERO, %VMM(0) > > + vpmovmskb %VMM(0), %ecx > > + leaq (VEC_SIZE)(%r8), %rdi > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v0) > > + > > + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VZERO, %VMM(0) > > + vpmovmskb %VMM(0), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v1) > > + > > + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %VMM(0) > > + vpmovmskb %VMM(0), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v2) > > + > > + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %VMM(0) > > + vpmovmskb %VMM(0), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v3) > > + > > + orq $(VEC_SIZE * 4 - 1), %rdi > > + .p2align 4,, 8 > > +L(loop_2x_vec): > > + VMOVA (VEC_SIZE * 0 + 1)(%rdi), %VMM(0) > > + VPMIN (VEC_SIZE * 1 + 1)(%rdi), %VMM(0), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 1)(%rdi), %VMM(2) > > + VPMIN (VEC_SIZE * 3 + 1)(%rdi), %VMM(2), %VMM(3) > > + VPMIN %VMM(1), %VMM(3), %VMM(3) > > + VPCMPEQ %VMM(3), %VZERO, %VMM(3) > > + vpmovmskb %VMM(3), %r8d > > + subq $(VEC_SIZE * -4), %rdi > > + testl %r8d, %r8d > > + jz L(loop_2x_vec) > > + > > + addq $(VEC_SIZE * -4 + 1), %rdi > > + > > + VPCMPEQ %VMM(0), %VZERO, %VMM(0) > > + vpmovmskb %VMM(0), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v0) > > + > > + VPCMPEQ %VMM(1), %VZERO, %VMM(1) > > + vpmovmskb %VMM(1), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v1) > > + > > + VPCMPEQ %VMM(2), %VZERO, %VMM(2) > > + vpmovmskb %VMM(2), %ecx > > + testl %ecx, %ecx > > + jnz L(bsf_and_done_v2) > > + > > + movl %r8d, %ecx > > +L(bsf_and_done_v3): > > + addq $VEC_SIZE, %rdi > > +L(bsf_and_done_v2): > > + bsfl %ecx, %ecx > > + leaq (VEC_SIZE * 2)(%rdi, %rcx), %rdi > > + jmp L(strcat_strlen_done) > > + > > + .p2align 4,, 4 > > +L(bsf_and_done_v1): > > + addq $VEC_SIZE, %rdi > > +L(bsf_and_done_v0): > > + bsfl %ecx, %ecx > > + addq %rcx, %rdi > > 
+L(strcat_strlen_done): > > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S > > index c2c581ecf7..fe80ffd265 100644 > > --- a/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S > > @@ -1,12 +1,3 @@ > > -#ifndef STRCPY > > -# define STRCPY __strcpy_avx2_rtm > > -#endif > > - > > -#define ZERO_UPPER_VEC_REGISTERS_RETURN \ > > - ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST > > - > > -#define VZEROUPPER_RETURN jmp L(return_vzeroupper) > > - > > -#define SECTION(p) p##.avx.rtm > > - > > +#define STRCPY __strcpy_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > #include "strcpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S > > index c725834929..b87a1722d5 100644 > > --- a/sysdeps/x86_64/multiarch/strcpy-avx2.S > > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S > > @@ -20,984 +20,378 @@ > > > > #if ISA_SHOULD_BUILD (3) > > > > +# include > > > > -# ifndef USE_AS_STRCAT > > -# include > > - > > -# ifndef STRCPY > > -# define STRCPY __strcpy_avx2 > > -# endif > > - > > -# endif > > - > > -/* Number of bytes in a vector register */ > > # ifndef VEC_SIZE > > -# define VEC_SIZE 32 > > -# endif > > - > > -# ifndef VZEROUPPER > > -# define VZEROUPPER vzeroupper > > -# endif > > - > > -# ifndef SECTION > > -# define SECTION(p) p##.avx > > -# endif > > - > > -/* zero register */ > > -#define xmmZ xmm0 > > -#define ymmZ ymm0 > > - > > -/* mask register */ > > -#define ymmM ymm1 > > - > > -# ifndef USE_AS_STRCAT > > - > > - .section SECTION(.text),"ax",@progbits > > -ENTRY (STRCPY) > > -# ifdef USE_AS_STRNCPY > > - mov %RDX_LP, %R8_LP > > - test %R8_LP, %R8_LP > > - jz L(ExitZero) > > -# endif > > - mov %rsi, %rcx > > -# ifndef USE_AS_STPCPY > > - mov %rdi, %rax /* save result */ > > -# endif > > - > > +# include "x86-avx2-vecs.h" > > # endif > > > > - vpxor %xmmZ, %xmmZ, %xmmZ > > - > > - and $((VEC_SIZE * 4) - 1), %ecx > > - cmp $(VEC_SIZE * 2), %ecx > > - jbe L(SourceStringAlignmentLessTwoVecSize) > > - > > - and $-VEC_SIZE, %rsi > > - and $(VEC_SIZE - 1), %ecx > > - > > - vpcmpeqb (%rsi), %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - shr %cl, %rdx > > - > > -# ifdef USE_AS_STRNCPY > > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > - mov $VEC_SIZE, %r10 > > - sub %rcx, %r10 > > - cmp %r10, %r8 > > -# else > > - mov $(VEC_SIZE + 1), %r10 > > - sub %rcx, %r10 > > - cmp %r10, %r8 > > -# endif > > - jbe L(CopyVecSizeTailCase2OrCase3) > > +# ifndef STRCPY > > +# define STRCPY __strcpy_avx2 > > # endif > > - test %edx, %edx > > - jnz L(CopyVecSizeTail) > > > > - vpcmpeqb VEC_SIZE(%rsi), %ymmZ, %ymm2 > > - vpmovmskb %ymm2, %edx > > + /* Use movsb in page cross case to save code size. 
*/ > > +# define USE_MOVSB_IN_PAGE_CROSS 1 > > > > -# ifdef USE_AS_STRNCPY > > - add $VEC_SIZE, %r10 > > - cmp %r10, %r8 > > - jbe L(CopyTwoVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > - jnz L(CopyTwoVecSize) > > - > > - vmovdqu (%rsi, %rcx), %ymm2 /* copy VEC_SIZE bytes */ > > - vmovdqu %ymm2, (%rdi) > > - > > -/* If source address alignment != destination address alignment */ > > - .p2align 4 > > -L(UnalignVecSizeBoth): > > - sub %rcx, %rdi > > -# ifdef USE_AS_STRNCPY > > - add %rcx, %r8 > > - sbb %rcx, %rcx > > - or %rcx, %r8 > > -# endif > > - mov $VEC_SIZE, %rcx > > - vmovdqa (%rsi, %rcx), %ymm2 > > - vmovdqu %ymm2, (%rdi, %rcx) > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm2 > > - vpcmpeqb %ymm2, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $(VEC_SIZE * 3), %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec2) > > +# ifdef USE_AS_WCSCPY > > +# define VPCMPEQ vpcmpeqd > > +# define VPMIN vpminud > > +# define CHAR_SIZE 4 > > # else > > - jnz L(CopyVecSize) > > +# define VPCMPEQ vpcmpeqb > > +# define VPMIN vpminub > > +# define CHAR_SIZE 1 > > # endif > > > > - vmovdqu %ymm2, (%rdi, %rcx) > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm3 > > - vpcmpeqb %ymm3, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec3) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > +# define PAGE_SIZE 4096 > > > > - vmovdqu %ymm3, (%rdi, %rcx) > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm4 > > - vpcmpeqb %ymm4, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec4) > > +# ifdef USE_AS_STPCPY > > +# define END_REG rax > > # else > > - jnz L(CopyVecSize) > > +# define END_REG rdi, %rdx > > # endif > > > > - vmovdqu %ymm4, (%rdi, %rcx) > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm2 > > - vpcmpeqb %ymm2, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec2) > > +# ifdef USE_AS_STRCAT > > +# define PAGE_ALIGN_REG ecx > > # else > > - jnz L(CopyVecSize) > > +# define PAGE_ALIGN_REG eax > > # endif > > > > - vmovdqu %ymm2, (%rdi, %rcx) > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm2 > > - vpcmpeqb %ymm2, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > -# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec2) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > +# define VZERO VMM(7) > > +# define VZERO_128 VMM_128(7) > > > > - vmovdqa VEC_SIZE(%rsi, %rcx), %ymm3 > > - vmovdqu %ymm2, (%rdi, %rcx) > > - vpcmpeqb %ymm3, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $VEC_SIZE, %rcx > > -# ifdef USE_AS_STRNCPY > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > 
-# endif > > - test %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec3) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > + .section SECTION(.text), "ax", @progbits > > +ENTRY(STRCPY) > > + vpxor %VZERO_128, %VZERO_128, %VZERO_128 > > > > - vmovdqu %ymm3, (%rdi, %rcx) > > - mov %rsi, %rdx > > - lea VEC_SIZE(%rsi, %rcx), %rsi > > - and $-(VEC_SIZE * 4), %rsi > > - sub %rsi, %rdx > > - sub %rdx, %rdi > > -# ifdef USE_AS_STRNCPY > > - lea (VEC_SIZE * 8)(%r8, %rdx), %r8 > > -# endif > > -L(UnalignedFourVecSizeLoop): > > - vmovdqa (%rsi), %ymm4 > > - vmovdqa VEC_SIZE(%rsi), %ymm5 > > - vmovdqa (VEC_SIZE * 2)(%rsi), %ymm6 > > - vmovdqa (VEC_SIZE * 3)(%rsi), %ymm7 > > - vpminub %ymm5, %ymm4, %ymm2 > > - vpminub %ymm7, %ymm6, %ymm3 > > - vpminub %ymm2, %ymm3, %ymm3 > > - vpcmpeqb %ymmM, %ymm3, %ymm3 > > - vpmovmskb %ymm3, %edx > > -# ifdef USE_AS_STRNCPY > > - sub $(VEC_SIZE * 4), %r8 > > - jbe L(UnalignedLeaveCase2OrCase3) > > -# endif > > - test %edx, %edx > > - jnz L(UnalignedFourVecSizeLeave) > > - > > -L(UnalignedFourVecSizeLoop_start): > > - add $(VEC_SIZE * 4), %rdi > > - add $(VEC_SIZE * 4), %rsi > > - vmovdqu %ymm4, -(VEC_SIZE * 4)(%rdi) > > - vmovdqa (%rsi), %ymm4 > > - vmovdqu %ymm5, -(VEC_SIZE * 3)(%rdi) > > - vmovdqa VEC_SIZE(%rsi), %ymm5 > > - vpminub %ymm5, %ymm4, %ymm2 > > - vmovdqu %ymm6, -(VEC_SIZE * 2)(%rdi) > > - vmovdqa (VEC_SIZE * 2)(%rsi), %ymm6 > > - vmovdqu %ymm7, -VEC_SIZE(%rdi) > > - vmovdqa (VEC_SIZE * 3)(%rsi), %ymm7 > > - vpminub %ymm7, %ymm6, %ymm3 > > - vpminub %ymm2, %ymm3, %ymm3 > > - vpcmpeqb %ymmM, %ymm3, %ymm3 > > - vpmovmskb %ymm3, %edx > > -# ifdef USE_AS_STRNCPY > > - sub $(VEC_SIZE * 4), %r8 > > - jbe L(UnalignedLeaveCase2OrCase3) > > -# endif > > - test %edx, %edx > > - jz L(UnalignedFourVecSizeLoop_start) > > - > > -L(UnalignedFourVecSizeLeave): > > - vpcmpeqb %ymm4, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - test %edx, %edx > > - jnz L(CopyVecSizeUnaligned_0) > > - > > - vpcmpeqb %ymm5, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %ecx > > - test %ecx, %ecx > > - jnz L(CopyVecSizeUnaligned_16) > > - > > - vpcmpeqb %ymm6, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - test %edx, %edx > > - jnz L(CopyVecSizeUnaligned_32) > > - > > - vpcmpeqb %ymm7, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %ecx > > - bsf %ecx, %edx > > - vmovdqu %ymm4, (%rdi) > > - vmovdqu %ymm5, VEC_SIZE(%rdi) > > - vmovdqu %ymm6, (VEC_SIZE * 2)(%rdi) > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > -# ifdef USE_AS_STPCPY > > - lea (VEC_SIZE * 3)(%rdi, %rdx), %rax > > -# endif > > - vmovdqu %ymm7, (VEC_SIZE * 3)(%rdi) > > - add $(VEC_SIZE - 1), %r8 > > - sub %rdx, %r8 > > - lea ((VEC_SIZE * 3) + 1)(%rdi, %rdx), %rdi > > - jmp L(StrncpyFillTailWithZero) > > -# else > > - add $(VEC_SIZE * 3), %rsi > > - add $(VEC_SIZE * 3), %rdi > > - jmp L(CopyVecSizeExit) > > +# ifdef USE_AS_STRCAT > > + movq %rdi, %rax > > +# include "strcat-strlen-avx2.S" > > # endif > > > > -/* If source address alignment == destination address alignment */ > > - > > -L(SourceStringAlignmentLessTwoVecSize): > > - vmovdqu (%rsi), %ymm3 > > - vmovdqu VEC_SIZE(%rsi), %ymm2 > > - vpcmpeqb %ymm3, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - > > -# ifdef USE_AS_STRNCPY > > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > - cmp $VEC_SIZE, %r8 > > -# else > > - cmp $(VEC_SIZE + 1), %r8 > > -# endif > > - jbe L(CopyVecSizeTail1Case2OrCase3) > > + movl %esi, %PAGE_ALIGN_REG > > + andl $(PAGE_SIZE - 1), %PAGE_ALIGN_REG > > + cmpl $(PAGE_SIZE - VEC_SIZE), 
%PAGE_ALIGN_REG > > + ja L(page_cross) > > +L(page_cross_continue): > > +# if !defined USE_AS_STPCPY && !defined USE_AS_STRCAT > > + movq %rdi, %rax > > # endif > > - test %edx, %edx > > - jnz L(CopyVecSizeTail1) > > - > > - vmovdqu %ymm3, (%rdi) > > - vpcmpeqb %ymm2, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - > > -# ifdef USE_AS_STRNCPY > > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > - cmp $(VEC_SIZE * 2), %r8 > > -# else > > - cmp $((VEC_SIZE * 2) + 1), %r8 > > -# endif > > - jbe L(CopyTwoVecSize1Case2OrCase3) > > -# endif > > - test %edx, %edx > > - jnz L(CopyTwoVecSize1) > > - > > - and $-VEC_SIZE, %rsi > > - and $(VEC_SIZE - 1), %ecx > > - jmp L(UnalignVecSizeBoth) > > + VMOVU (%rsi), %VMM(0) > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > > > -/*------End of main part with loops---------------------*/ > > + testl %ecx, %ecx > > + jz L(more_1x_vec) > > > > -/* Case1 */ > > + /* No longer need ymm registers so just vzeroupper so it doesn't > > + need to be duplicated at each return statement. */ > > + COND_VZEROUPPER > > > > -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT) > > - .p2align 4 > > -L(CopyVecSize): > > - add %rcx, %rdi > > -# endif > > -L(CopyVecSizeTail): > > - add %rcx, %rsi > > -L(CopyVecSizeTail1): > > - bsf %edx, %edx > > -L(CopyVecSizeExit): > > - cmp $32, %edx > > - jae L(Exit32_63) > > - cmp $16, %edx > > - jae L(Exit16_31) > > - cmp $8, %edx > > - jae L(Exit8_15) > > - cmp $4, %edx > > - jae L(Exit4_7) > > - cmp $3, %edx > > - je L(Exit3) > > - cmp $1, %edx > > - ja L(Exit2) > > - je L(Exit1) > > - movb $0, (%rdi) > > + xorl %edx, %edx > > + bsfl %ecx, %edx > > # ifdef USE_AS_STPCPY > > - lea (%rdi), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub $1, %r8 > > - lea 1(%rdi), %rdi > > - jnz L(StrncpyFillTailWithZero) > > -# endif > > -L(return_vzeroupper): > > - ZERO_UPPER_VEC_REGISTERS_RETURN > > - > > - .p2align 4 > > -L(CopyTwoVecSize1): > > - add $VEC_SIZE, %rsi > > - add $VEC_SIZE, %rdi > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub $VEC_SIZE, %r8 > > -# endif > > - jmp L(CopyVecSizeTail1) > > - > > - .p2align 4 > > -L(CopyTwoVecSize): > > - bsf %edx, %edx > > - add %rcx, %rsi > > - add $VEC_SIZE, %edx > > - sub %ecx, %edx > > - jmp L(CopyVecSizeExit) > > - > > - .p2align 4 > > -L(CopyVecSizeUnaligned_0): > > - bsf %edx, %edx > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > - vmovdqu %ymm4, (%rdi) > > - add $((VEC_SIZE * 4) - 1), %r8 > > - sub %rdx, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - jmp L(StrncpyFillTailWithZero) > > -# else > > - jmp L(CopyVecSizeExit) > > -# endif > > - > > - .p2align 4 > > -L(CopyVecSizeUnaligned_16): > > - bsf %ecx, %edx > > - vmovdqu %ymm4, (%rdi) > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > -# ifdef USE_AS_STPCPY > > - lea VEC_SIZE(%rdi, %rdx), %rax > > -# endif > > - vmovdqu %ymm5, VEC_SIZE(%rdi) > > - add $((VEC_SIZE * 3) - 1), %r8 > > - sub %rdx, %r8 > > - lea (VEC_SIZE + 1)(%rdi, %rdx), %rdi > > - jmp L(StrncpyFillTailWithZero) > > + leaq (%rdi, %rdx), %rax > > +# endif > > + > > + /* Use mask bits in rcx to detect which copy we need. If the low > > + mask is zero then there must be a bit set in the upper half. > > + I.e if ecx != 0 and cx == 0, then match must be upper 16 > > + bits so we use L(copy_16_31). 
*/ > > + testw %cx, %cx > > + jz L(copy_16_31) > > + > > + testb %cl, %cl > > + jz L(copy_8_15) > > +# ifdef USE_AS_WCSCPY > > + vmovd %xmm0, (%rdi) > > + movl $0, (%END_REG) > > + ret > > # else > > - add $VEC_SIZE, %rsi > > - add $VEC_SIZE, %rdi > > - jmp L(CopyVecSizeExit) > > -# endif > > - > > - .p2align 4 > > -L(CopyVecSizeUnaligned_32): > > - bsf %edx, %edx > > - vmovdqu %ymm4, (%rdi) > > - vmovdqu %ymm5, VEC_SIZE(%rdi) > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > -# ifdef USE_AS_STPCPY > > - lea (VEC_SIZE * 2)(%rdi, %rdx), %rax > > -# endif > > - vmovdqu %ymm6, (VEC_SIZE * 2)(%rdi) > > - add $((VEC_SIZE * 2) - 1), %r8 > > - sub %rdx, %r8 > > - lea ((VEC_SIZE * 2) + 1)(%rdi, %rdx), %rdi > > - jmp L(StrncpyFillTailWithZero) > > + testb $0x7, %cl > > + jz L(copy_4_7) > > + > > + testl %edx, %edx > > + jz L(set_null_term) > > + vmovd %xmm0, %ecx > > + movw %cx, (%rdi) > > + > > + .p2align 4,, 2 > > +L(set_null_term): > > + movb $0, (%END_REG) > > + ret > > + > > + .p2align 4,, 12 > > +L(copy_4_7): > > + movl -3(%rsi, %rdx), %ecx > > + vmovd %xmm0, (%rdi) > > + movl %ecx, -3(%END_REG) > > + ret > > +# endif > > + > > + .p2align 4,, 10 > > +L(copy_16_31): > > + VMOVU -(16 - CHAR_SIZE)(%rsi, %rdx), %xmm1 > > + VMOVU %xmm0, (%rdi) > > + VMOVU %xmm1, -(16 - CHAR_SIZE)(%END_REG) > > + ret > > + > > + .p2align 4,, 10 > > +L(copy_8_15): > > +# ifdef USE_AS_WCSCPY > > + movl -(8 - CHAR_SIZE)(%rsi, %rdx), %ecx > > # else > > - add $(VEC_SIZE * 2), %rsi > > - add $(VEC_SIZE * 2), %rdi > > - jmp L(CopyVecSizeExit) > > -# endif > > - > > -# ifdef USE_AS_STRNCPY > > -# ifndef USE_AS_STRCAT > > - .p2align 4 > > -L(CopyVecSizeUnalignedVec6): > > - vmovdqu %ymm6, (%rdi, %rcx) > > - jmp L(CopyVecSizeVecExit) > > - > > - .p2align 4 > > -L(CopyVecSizeUnalignedVec5): > > - vmovdqu %ymm5, (%rdi, %rcx) > > - jmp L(CopyVecSizeVecExit) > > - > > - .p2align 4 > > -L(CopyVecSizeUnalignedVec4): > > - vmovdqu %ymm4, (%rdi, %rcx) > > - jmp L(CopyVecSizeVecExit) > > - > > - .p2align 4 > > -L(CopyVecSizeUnalignedVec3): > > - vmovdqu %ymm3, (%rdi, %rcx) > > - jmp L(CopyVecSizeVecExit) > > -# endif > > - > > -/* Case2 */ > > - > > - .p2align 4 > > -L(CopyVecSizeCase2): > > - add $VEC_SIZE, %r8 > > - add %rcx, %rdi > > - add %rcx, %rsi > > - bsf %edx, %edx > > - cmp %r8d, %edx > > - jb L(CopyVecSizeExit) > > - jmp L(StrncpyExit) > > - > > - .p2align 4 > > -L(CopyTwoVecSizeCase2): > > - add %rcx, %rsi > > - bsf %edx, %edx > > - add $VEC_SIZE, %edx > > - sub %ecx, %edx > > - cmp %r8d, %edx > > - jb L(CopyVecSizeExit) > > - jmp L(StrncpyExit) > > - > > -L(CopyVecSizeTailCase2): > > - add %rcx, %rsi > > - bsf %edx, %edx > > - cmp %r8d, %edx > > - jb L(CopyVecSizeExit) > > - jmp L(StrncpyExit) > > - > > -L(CopyVecSizeTail1Case2): > > - bsf %edx, %edx > > - cmp %r8d, %edx > > - jb L(CopyVecSizeExit) > > - jmp L(StrncpyExit) > > - > > -/* Case2 or Case3, Case3 */ > > - > > - .p2align 4 > > -L(CopyVecSizeCase2OrCase3): > > - test %rdx, %rdx > > - jnz L(CopyVecSizeCase2) > > -L(CopyVecSizeCase3): > > - add $VEC_SIZE, %r8 > > - add %rcx, %rdi > > - add %rcx, %rsi > > - jmp L(StrncpyExit) > > - > > - .p2align 4 > > -L(CopyTwoVecSizeCase2OrCase3): > > - test %rdx, %rdx > > - jnz L(CopyTwoVecSizeCase2) > > - add %rcx, %rsi > > - jmp L(StrncpyExit) > > - > > - .p2align 4 > > -L(CopyVecSizeTailCase2OrCase3): > > - test %rdx, %rdx > > - jnz L(CopyVecSizeTailCase2) > > - add %rcx, %rsi > > - jmp L(StrncpyExit) > > - > > - .p2align 4 > > -L(CopyTwoVecSize1Case2OrCase3): > > - add $VEC_SIZE, %rdi > > - add $VEC_SIZE, 
%rsi > > - sub $VEC_SIZE, %r8 > > -L(CopyVecSizeTail1Case2OrCase3): > > - test %rdx, %rdx > > - jnz L(CopyVecSizeTail1Case2) > > - jmp L(StrncpyExit) > > -# endif > > - > > -/*------------End labels regarding with copying 1-VEC_SIZE bytes--and 1-(VEC_SIZE*2) bytes----*/ > > - > > - .p2align 4 > > -L(Exit1): > > - movzwl (%rsi), %edx > > - mov %dx, (%rdi) > > -# ifdef USE_AS_STPCPY > > - lea 1(%rdi), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub $2, %r8 > > - lea 2(%rdi), %rdi > > - jnz L(StrncpyFillTailWithZero) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Exit2): > > - movzwl (%rsi), %ecx > > - mov %cx, (%rdi) > > - movb $0, 2(%rdi) > > -# ifdef USE_AS_STPCPY > > - lea 2(%rdi), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub $3, %r8 > > - lea 3(%rdi), %rdi > > - jnz L(StrncpyFillTailWithZero) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Exit3): > > - mov (%rsi), %edx > > - mov %edx, (%rdi) > > + movq -(8 - CHAR_SIZE)(%rsi, %rdx), %rcx > > +# endif > > + vmovq %xmm0, (%rdi) > > + movq %rcx, -(8 - CHAR_SIZE)(%END_REG) > > + ret > > + > > + > > + .p2align 4,, 8 > > +L(more_1x_vec): > > +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > + VMOVU %VMM(0), (%rdi) > > +# endif > > + subq %rsi, %rdi > > + orq $(VEC_SIZE - 1), %rsi > > + addq %rsi, %rdi > > + VMOVA 1(%rsi), %VMM(1) > > + > > + /* Try and order stores after as many loads as is reasonable to > > + avoid potential false dependencies. */ > > +# if !defined USE_AS_STPCPY && !defined USE_AS_STRCAT > > + VMOVU %VMM(0), (%rax) > > +# endif > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > + > > + VMOVA (VEC_SIZE + 1)(%rsi), %VMM(2) > > + VMOVU %VMM(1), 1(%rdi) > > + > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > + VMOVA (VEC_SIZE * 2 + 1)(%rsi), %VMM(3) > > + VMOVU %VMM(2), (VEC_SIZE + 1)(%rdi) > > + > > + VPCMPEQ %VMM(3), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > + > > + VMOVA (VEC_SIZE * 3 + 1)(%rsi), %VMM(4) > > + VMOVU %VMM(3), (VEC_SIZE * 2 + 1)(%rdi) > > + VPCMPEQ %VMM(4), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %edx > > + testl %edx, %edx > > + jnz L(ret_vec_x4) > > + > > + VMOVU %VMM(4), (VEC_SIZE * 3 + 1)(%rdi) > > + > > + /* Subtract rsi from rdi before aligning. Adding back rsi will > > + get proper rdi (dst) for new src. */ > > + subq %rsi, %rdi > > + incq %rsi > > + orq $(VEC_SIZE * 4 - 1), %rsi > > + > > + /* Do first half of loop ahead of time so loop can just start by > > + storing. 
*/ > > + VMOVA (VEC_SIZE * 0 + 1)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 1)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 1)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 1)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %edx > > + addq %rsi, %rdi > > + > > + testl %edx, %edx > > + jnz L(loop_4x_done) > > + > > + .p2align 4,, 11 > > +L(loop_4x_vec): > > + > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 1)(%rdi) > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 1)(%rdi) > > + subq $(VEC_SIZE * -4), %rsi > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 1)(%rdi) > > + VMOVU %VMM(3), (VEC_SIZE * 3 + 1)(%rdi) > > + > > + > > + VMOVA (VEC_SIZE * 0 + 1)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 1)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 1)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 1)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + > > + vpmovmskb %VMM(6), %edx > > + subq $(VEC_SIZE * -4), %rdi > > + testl %edx, %edx > > + jz L(loop_4x_vec) > > + > > +L(loop_4x_done): > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 1)(%rdi) > > + > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 1)(%rdi) > > + > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 1)(%rdi) > > +L(ret_vec_x4): > > + bsfl %edx, %edx > > + VMOVU ((VEC_SIZE * 3 + 1)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rdx), %VMM(1) > > + VMOVU %VMM(1), ((VEC_SIZE * 3 + 1)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > # ifdef USE_AS_STPCPY > > - lea 3(%rdi), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub $4, %r8 > > - lea 4(%rdi), %rdi > > - jnz L(StrncpyFillTailWithZero) > > + leaq (VEC_SIZE * 3 + 1)(%rdx, %rdi), %rax > > # endif > > +L(return_end): > > VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(Exit4_7): > > - mov (%rsi), %ecx > > - mov %ecx, (%rdi) > > - mov -3(%rsi, %rdx), %ecx > > - mov %ecx, -3(%rdi, %rdx) > > + .p2align 4,, 8 > > +L(ret_vec_x1): > > + bsfl %ecx, %ecx > > + VMOVU (1 -(VEC_SIZE - CHAR_SIZE))(%rsi, %rcx), %VMM(1) > > + VMOVU %VMM(1), (1 -(VEC_SIZE - CHAR_SIZE))(%rdi, %rcx) > > # ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub %rdx, %r8 > > - sub $1, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - jnz L(StrncpyFillTailWithZero) > > + leaq 1(%rcx, %rdi), %rax > > # endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Exit8_15): > > - mov (%rsi), %rcx > > - mov -7(%rsi, %rdx), %r9 > > - mov %rcx, (%rdi) > > - mov %r9, -7(%rdi, %rdx) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub %rdx, %r8 > > - sub $1, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - jnz L(StrncpyFillTailWithZero) > > -# endif > > - VZEROUPPER_RETURN > > +L(return_vzeroupper): > > + ZERO_UPPER_VEC_REGISTERS_RETURN > > > > - .p2align 4 > > -L(Exit16_31): > > - vmovdqu (%rsi), %xmm2 > > - vmovdqu -15(%rsi, %rdx), %xmm3 > > - vmovdqu %xmm2, (%rdi) > > - vmovdqu %xmm3, -15(%rdi, %rdx) > > + .p2align 4,, 8 > > +L(ret_vec_x2): > > + 
bsfl %ecx, %ecx > > + VMOVU ((VEC_SIZE + 1)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rcx), %VMM(1) > > + VMOVU %VMM(1), ((VEC_SIZE + 1)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rcx) > > # ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub %rdx, %r8 > > - sub $1, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - jnz L(StrncpyFillTailWithZero) > > + leaq (VEC_SIZE * 1 + 1)(%rcx, %rdi), %rax > > # endif > > VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(Exit32_63): > > - vmovdqu (%rsi), %ymm2 > > - vmovdqu -31(%rsi, %rdx), %ymm3 > > - vmovdqu %ymm2, (%rdi) > > - vmovdqu %ymm3, -31(%rdi, %rdx) > > + .p2align 4,, 8 > > +L(ret_vec_x3): > > + bsfl %ecx, %ecx > > + VMOVU ((VEC_SIZE * 2 + 1)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rcx), %VMM(1) > > + VMOVU %VMM(1), ((VEC_SIZE * 2 + 1)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rcx) > > # ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > - sub %rdx, %r8 > > - sub $1, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - jnz L(StrncpyFillTailWithZero) > > + leaq (VEC_SIZE * 2 + 1)(%rcx, %rdi), %rax > > # endif > > VZEROUPPER_RETURN > > > > -# ifdef USE_AS_STRNCPY > > > > - .p2align 4 > > -L(StrncpyExit1): > > - movzbl (%rsi), %edx > > - mov %dl, (%rdi) > > + .p2align 4,, 4 > > +L(page_cross): > > + movq %rsi, %rcx > > + andq $(VEC_SIZE * -1), %rcx > > + > > + VPCMPEQ (%rcx), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + shrxl %esi, %ecx, %ecx > > +# if USE_MOVSB_IN_PAGE_CROSS > > + /* Optimizing more aggressively for space as this is very cold > > + code. This saves 2x cache lines. */ > > + > > + /* This adds once to the later result which will get correct > > + copy bounds. NB: this can never zero-out a non-zero RCX as > > + to be in the page cross case rsi cannot be aligned and we > > + already right-shift rcx by the misalignment. 
*/ > > + shll $CHAR_SIZE, %ecx > > + jz L(page_cross_continue) > > + bsfl %ecx, %ecx > > +# if !defined USE_AS_STPCPY && !defined USE_AS_STRCAT > > + movq %rdi, %rax > > +# endif > > + rep movsb > > # ifdef USE_AS_STPCPY > > - lea 1(%rdi), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, 1(%rdi) > > + leaq -CHAR_SIZE(%rdi), %rax > > # endif > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(StrncpyExit2): > > - movzwl (%rsi), %edx > > - mov %dx, (%rdi) > > -# ifdef USE_AS_STPCPY > > - lea 2(%rdi), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, 2(%rdi) > > -# endif > > VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(StrncpyExit3_4): > > - movzwl (%rsi), %ecx > > - movzwl -2(%rsi, %r8), %edx > > - mov %cx, (%rdi) > > - mov %dx, -2(%rdi, %r8) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %r8), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi, %r8) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(StrncpyExit5_8): > > - mov (%rsi), %ecx > > - mov -4(%rsi, %r8), %edx > > - mov %ecx, (%rdi) > > - mov %edx, -4(%rdi, %r8) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %r8), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi, %r8) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(StrncpyExit9_16): > > - mov (%rsi), %rcx > > - mov -8(%rsi, %r8), %rdx > > - mov %rcx, (%rdi) > > - mov %rdx, -8(%rdi, %r8) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %r8), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi, %r8) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(StrncpyExit17_32): > > - vmovdqu (%rsi), %xmm2 > > - vmovdqu -16(%rsi, %r8), %xmm3 > > - vmovdqu %xmm2, (%rdi) > > - vmovdqu %xmm3, -16(%rdi, %r8) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %r8), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi, %r8) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(StrncpyExit33_64): > > - /* 0/32, 31/16 */ > > - vmovdqu (%rsi), %ymm2 > > - vmovdqu -VEC_SIZE(%rsi, %r8), %ymm3 > > - vmovdqu %ymm2, (%rdi) > > - vmovdqu %ymm3, -VEC_SIZE(%rdi, %r8) > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %r8), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi, %r8) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(StrncpyExit65): > > - /* 0/32, 32/32, 64/1 */ > > - vmovdqu (%rsi), %ymm2 > > - vmovdqu 32(%rsi), %ymm3 > > - mov 64(%rsi), %cl > > - vmovdqu %ymm2, (%rdi) > > - vmovdqu %ymm3, 32(%rdi) > > - mov %cl, 64(%rdi) > > -# ifdef USE_AS_STPCPY > > - lea 65(%rdi), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, 65(%rdi) > > -# endif > > - VZEROUPPER_RETURN > > +# else > > + testl %ecx, %ecx > > + jz L(page_cross_continue) > > > > + /* Traditional copy case, essentially same as used in non-page- > > + cross case but since we can't reuse VMM(0) we need twice as > > + many loads from rsi. 
*/ > > # ifndef USE_AS_STRCAT > > - > > - .p2align 4 > > -L(Fill1): > > - mov %dl, (%rdi) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Fill2): > > - mov %dx, (%rdi) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Fill3_4): > > - mov %dx, (%rdi) > > - mov %dx, -2(%rdi, %r8) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Fill5_8): > > - mov %edx, (%rdi) > > - mov %edx, -4(%rdi, %r8) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Fill9_16): > > - mov %rdx, (%rdi) > > - mov %rdx, -8(%rdi, %r8) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(Fill17_32): > > - vmovdqu %xmmZ, (%rdi) > > - vmovdqu %xmmZ, -16(%rdi, %r8) > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(CopyVecSizeUnalignedVec2): > > - vmovdqu %ymm2, (%rdi, %rcx) > > - > > - .p2align 4 > > -L(CopyVecSizeVecExit): > > - bsf %edx, %edx > > - add $(VEC_SIZE - 1), %r8 > > - add %rcx, %rdi > > -# ifdef USE_AS_STPCPY > > - lea (%rdi, %rdx), %rax > > -# endif > > - sub %rdx, %r8 > > - lea 1(%rdi, %rdx), %rdi > > - > > - .p2align 4 > > -L(StrncpyFillTailWithZero): > > - xor %edx, %edx > > - sub $VEC_SIZE, %r8 > > - jbe L(StrncpyFillExit) > > - > > - vmovdqu %ymmZ, (%rdi) > > - add $VEC_SIZE, %rdi > > - > > - mov %rdi, %rsi > > - and $(VEC_SIZE - 1), %esi > > - sub %rsi, %rdi > > - add %rsi, %r8 > > - sub $(VEC_SIZE * 4), %r8 > > - jb L(StrncpyFillLessFourVecSize) > > - > > -L(StrncpyFillLoopVmovdqa): > > - vmovdqa %ymmZ, (%rdi) > > - vmovdqa %ymmZ, VEC_SIZE(%rdi) > > - vmovdqa %ymmZ, (VEC_SIZE * 2)(%rdi) > > - vmovdqa %ymmZ, (VEC_SIZE * 3)(%rdi) > > - add $(VEC_SIZE * 4), %rdi > > - sub $(VEC_SIZE * 4), %r8 > > - jae L(StrncpyFillLoopVmovdqa) > > - > > -L(StrncpyFillLessFourVecSize): > > - add $(VEC_SIZE * 2), %r8 > > - jl L(StrncpyFillLessTwoVecSize) > > - vmovdqa %ymmZ, (%rdi) > > - vmovdqa %ymmZ, VEC_SIZE(%rdi) > > - add $(VEC_SIZE * 2), %rdi > > - sub $VEC_SIZE, %r8 > > - jl L(StrncpyFillExit) > > - vmovdqa %ymmZ, (%rdi) > > - add $VEC_SIZE, %rdi > > - jmp L(Fill) > > - > > - .p2align 4 > > -L(StrncpyFillLessTwoVecSize): > > - add $VEC_SIZE, %r8 > > - jl L(StrncpyFillExit) > > - vmovdqa %ymmZ, (%rdi) > > - add $VEC_SIZE, %rdi > > - jmp L(Fill) > > - > > - .p2align 4 > > -L(StrncpyFillExit): > > - add $VEC_SIZE, %r8 > > -L(Fill): > > - cmp $17, %r8d > > - jae L(Fill17_32) > > - cmp $9, %r8d > > - jae L(Fill9_16) > > - cmp $5, %r8d > > - jae L(Fill5_8) > > - cmp $3, %r8d > > - jae L(Fill3_4) > > - cmp $1, %r8d > > - ja L(Fill2) > > - je L(Fill1) > > - VZEROUPPER_RETURN > > - > > -/* end of ifndef USE_AS_STRCAT */ > > + xorl %edx, %edx > > # endif > > - > > - .p2align 4 > > -L(UnalignedLeaveCase2OrCase3): > > - test %rdx, %rdx > > - jnz L(UnalignedFourVecSizeLeaveCase2) > > -L(UnalignedFourVecSizeLeaveCase3): > > - lea (VEC_SIZE * 4)(%r8), %rcx > > - and $-VEC_SIZE, %rcx > > - add $(VEC_SIZE * 3), %r8 > > - jl L(CopyVecSizeCase3) > > - vmovdqu %ymm4, (%rdi) > > - sub $VEC_SIZE, %r8 > > - jb L(CopyVecSizeCase3) > > - vmovdqu %ymm5, VEC_SIZE(%rdi) > > - sub $VEC_SIZE, %r8 > > - jb L(CopyVecSizeCase3) > > - vmovdqu %ymm6, (VEC_SIZE * 2)(%rdi) > > - sub $VEC_SIZE, %r8 > > - jb L(CopyVecSizeCase3) > > - vmovdqu %ymm7, (VEC_SIZE * 3)(%rdi) > > + bsfl %ecx, %edx > > # ifdef USE_AS_STPCPY > > - lea (VEC_SIZE * 4)(%rdi), %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (VEC_SIZE * 4)(%rdi) > > + leaq (%rdi, %rdx), %rax > > +# elif !defined USE_AS_STRCAT > > + movq %rdi, %rax > > # endif > > - VZEROUPPER_RETURN > > > > - .p2align 4 > > -L(UnalignedFourVecSizeLeaveCase2): > > - xor %ecx, %ecx > > - 
vpcmpeqb %ymm4, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - add $(VEC_SIZE * 3), %r8 > > - jle L(CopyVecSizeCase2OrCase3) > > - test %edx, %edx > > -# ifndef USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec4) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > - vpcmpeqb %ymm5, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - vmovdqu %ymm4, (%rdi) > > - add $VEC_SIZE, %rcx > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > - test %edx, %edx > > -# ifndef USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec5) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > + /* vzeroupper early to avoid duplicating at each return. */ > > + COND_VZEROUPPER > > > > - vpcmpeqb %ymm6, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - vmovdqu %ymm5, VEC_SIZE(%rdi) > > - add $VEC_SIZE, %rcx > > - sub $VEC_SIZE, %r8 > > - jbe L(CopyVecSizeCase2OrCase3) > > - test %edx, %edx > > -# ifndef USE_AS_STRCAT > > - jnz L(CopyVecSizeUnalignedVec6) > > -# else > > - jnz L(CopyVecSize) > > -# endif > > + testw %cx, %cx > > + jz L(page_cross_copy_16_31) > > > > - vpcmpeqb %ymm7, %ymmZ, %ymmM > > - vpmovmskb %ymmM, %edx > > - vmovdqu %ymm6, (VEC_SIZE * 2)(%rdi) > > - lea VEC_SIZE(%rdi, %rcx), %rdi > > - lea VEC_SIZE(%rsi, %rcx), %rsi > > - bsf %edx, %edx > > - cmp %r8d, %edx > > - jb L(CopyVecSizeExit) > > -L(StrncpyExit): > > - cmp $65, %r8d > > - je L(StrncpyExit65) > > - cmp $33, %r8d > > - jae L(StrncpyExit33_64) > > - cmp $17, %r8d > > - jae L(StrncpyExit17_32) > > - cmp $9, %r8d > > - jae L(StrncpyExit9_16) > > - cmp $5, %r8d > > - jae L(StrncpyExit5_8) > > - cmp $3, %r8d > > - jae L(StrncpyExit3_4) > > - cmp $1, %r8d > > - ja L(StrncpyExit2) > > - je L(StrncpyExit1) > > -# ifdef USE_AS_STPCPY > > - mov %rdi, %rax > > -# endif > > -# ifdef USE_AS_STRCAT > > - movb $0, (%rdi) > > -# endif > > - VZEROUPPER_RETURN > > - > > - .p2align 4 > > -L(ExitZero): > > -# ifndef USE_AS_STRCAT > > - mov %rdi, %rax > > -# endif > > - VZEROUPPER_RETURN > > + testb %cl, %cl > > + jz L(page_cross_copy_8_15) > > > > -# endif > > + testl $0x7, %cl > > + jz L(page_cross_copy_4_7) > > > > -# ifndef USE_AS_STRCAT > > -END (STRCPY) > > -# else > > -END (STRCAT) > > -# endif > > + testl %edx, %edx > > + jz L(page_cross_set_null_term) > > + movzwl (%rsi), %ecx > > + movw %cx, (%rdi) > > +L(page_cross_set_null_term): > > + movb $0, (%END_REG) > > + ret > > + > > + .p2align 4,, 4 > > +L(page_cross_copy_4_7): > > + movl (%rsi), %ecx > > + movl -3(%rsi, %rdx), %esi > > + movl %ecx, (%rdi) > > + movl %esi, -3(%END_REG) > > + ret > > + > > + .p2align 4,, 4 > > +L(page_cross_copy_8_15): > > + movq (%rsi), %rcx > > + movq -7(%rsi, %rdx), %rsi > > + movq %rcx, (%rdi) > > + movq %rsi, -7(%END_REG) > > + ret > > + > > + > > + .p2align 4,, 3 > > +L(page_cross_copy_16_31): > > + VMOVU (%rsi), %xmm0 > > + VMOVU -15(%rsi, %rdx), %xmm1 > > + VMOVU %xmm0, (%rdi) > > + VMOVU %xmm1, -15(%END_REG) > > + ret > > +# endif > > + > > +END(STRCPY) > > #endif > > diff --git a/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S b/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S > > index 0dcea18dbb..2bbdbb91ab 100644 > > --- a/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S > > @@ -1,3 +1,3 @@ > > -#define USE_AS_STRNCAT > > -#define STRCAT __strncat_avx2_rtm > > -#include "strcat-avx2-rtm.S" > > +#define STRNCAT __strncat_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > +#include "strncat-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strncat-avx2.S b/sysdeps/x86_64/multiarch/strncat-avx2.S > > index 
52ecbca943..547cef9486 100644 > > --- a/sysdeps/x86_64/multiarch/strncat-avx2.S > > +++ b/sysdeps/x86_64/multiarch/strncat-avx2.S > > @@ -1,7 +1,419 @@ > > -#ifndef STRNCAT > > -# define STRNCAT __strncat_avx2 > > -#endif > > +/* strncat with AVX2 > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + . */ > > + > > +#include > > + > > +#if ISA_SHOULD_BUILD (3) > > + > > +# include > > + > > +# ifndef VEC_SIZE > > +# include "x86-avx2-vecs.h" > > +# endif > > + > > +# ifndef STRNCAT > > +# define STRNCAT __strncat_avx2 > > +# endif > > + > > +# ifdef USE_AS_WCSCPY > > +# define movNULL movl > > +# define VPCMPEQ vpcmpeqd > > +# define VPMIN vpminud > > +# define CHAR_SIZE 4 > > +# else > > +# define movNULL movb > > +# define VPCMPEQ vpcmpeqb > > +# define VPMIN vpminub > > +# define CHAR_SIZE 1 > > +# endif > > + > > +# include "strncpy-or-cat-overflow-def.h" > > + > > +# define PAGE_SIZE 4096 > > + > > +# define VZERO VMM(7) > > +# define VZERO_128 VMM_128(7) > > + > > + .section SECTION(.text), "ax", @progbits > > +ENTRY(STRNCAT) > > + /* Filter zero length strings and very long strings. Zero > > + length strings just return, very long strings are handled by > > + using the non-length variant {wcs|str}cat. */ > > + movq %rdi, %rax > > +# ifdef USE_AS_WCSCPY > > + leaq -1(%rdx), %rcx > > + shr $56, %rcx > > + jnz L(zero_len) > > + salq $2, %rdx > > +# else > > + test %rdx, %rdx > > + jl L(zero_len) > > +# endif > > + vpxor %VZERO_128, %VZERO_128, %VZERO_128 > > + > > +# include "strcat-strlen-avx2.S" > > + > > + movl %esi, %ecx > > + andl $(PAGE_SIZE - 1), %ecx > > + cmpl $(PAGE_SIZE - VEC_SIZE), %ecx > > + ja L(page_cross) > > +L(page_cross_continue): > > + VMOVU (%rsi), %VMM(0) > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + tzcnt %ecx, %r8d > > + cmpq %r8, %rdx > > + jbe L(less_1x_vec) > > + > > + testl %ecx, %ecx > > + jz L(more_1x_vec) > > + > > + /* Hoist this to save code size. */ > > + > > + movl %r8d, %edx > > + > > +L(less_1x_vec): > > + COND_VZEROUPPER > > + > > + cmpl $16, %edx > > + jae L(copy_16_31) > > + cmpl $8, %edx > > + jae L(copy_8_15) > > + > > + > > +# ifdef USE_AS_WCSCPY > > + vmovd %VMM_128(0), (%rdi) > > + movNULL $0, (%rdi, %rdx) > > + ret > > +# else > > + cmpl $4, %edx > > + jae L(copy_4_7) > > + > > + movzbl (%rsi), %ecx > > + cmpl $1, %edx > > + jbe L(set_null_term) > > + > > + /* NB: make this `vmovw` if support for AVX512-FP16 is added. 
> > + */ > > + movzwl 1(%rsi), %esi > > + movw %si, 1(%rdi) > > + > > + .p2align 4,, 1 > > +L(set_null_term): > > + movb %cl, (%rdi) > > + movNULL $0, (%rdi, %rdx) > > + ret > > + > > + .p2align 4,, 11 > > +L(copy_4_7): > > + movl -(4)(%rsi, %rdx), %ecx > > + vmovd %xmm0, (%rdi) > > + movl %ecx, -(4)(%rdi, %rdx) > > + movNULL $0, (%rdi, %rdx) > > + ret > > +# endif > > + > > + > > + .p2align 4,, 10 > > +L(copy_16_31): > > + VMOVU -(16)(%rsi, %rdx), %xmm1 > > + VMOVU %xmm0, (%rdi) > > + VMOVU %xmm1, -(16)(%rdi, %rdx) > > + movNULL $0, (%rdi, %rdx) > > + ret > > + > > + .p2align 4,, 10 > > +L(copy_8_15): > > + movq -(8)(%rsi, %rdx), %rcx > > + vmovq %xmm0, (%rdi) > > + movq %rcx, -(8)(%rdi, %rdx) > > + movNULL $0, (%rdi, %rdx) > > + ret > > + > > + .p2align 4,, 8 > > + .p2align 6,, 14 > > +L(more_1x_vec): > > + VMOVU %VMM(0), (%rdi) > > + > > + /* Align rsi (src) and just rdx/rdi (length/dst). */ > > + addq %rsi, %rdx > > + subq %rsi, %rdi > > + orq $(VEC_SIZE - 1), %rsi > > + incq %rsi > > + addq %rsi, %rdi > > +L(loop_last_4x_vec): > > + subq %rsi, %rdx > > + VMOVA 0(%rsi), %VMM(1) > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + cmpq $(VEC_SIZE * 2), %rdx > > + ja L(more_2x_vec) > > +L(last_2x_vec): > > + tzcnt %ecx, %ecx > > + cmpl %ecx, %edx > > + jbe L(ret_vec_x1_len) > > + > > + cmpl $VEC_SIZE, %ecx > > + jnz L(ret_vec_x1) > > + > > + VMOVA (VEC_SIZE * 1)(%rsi), %VMM(2) > > + VMOVU %VMM(1), (%rdi) > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + addl $-VEC_SIZE, %edx > > + bzhil %edx, %ecx, %r8d > > + jz L(ret_vec_x2_len) > > +L(ret_vec_x2): > > + bsfl %ecx, %edx > > +L(ret_vec_x2_len): > > + VMOVU (%rsi, %rdx), %VMM(0) > > + movNULL $0, (VEC_SIZE)(%rdi, %rdx) > > + VMOVU %VMM(0), (%rdi, %rdx) > > +L(return_vzeroupper): > > + ZERO_UPPER_VEC_REGISTERS_RETURN > > + > > + > > + .p2align 4,, 12 > > +L(ret_vec_x1_len): > > + movl %edx, %ecx > > +L(ret_vec_x1): > > + VMOVU -(VEC_SIZE)(%rsi, %rcx), %VMM(1) > > + movNULL $0, (%rdi, %rcx) > > + VMOVU %VMM(1), -VEC_SIZE(%rdi, %rcx) > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 8 > > +L(last_4x_vec): > > + subq $-(VEC_SIZE * 4), %rsi > > + VMOVA 0(%rsi), %VMM(1) > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + subq $-(VEC_SIZE * 4), %rdi > > + addl $-(VEC_SIZE * 4), %edx > > + cmpl $(VEC_SIZE * 2), %edx > > + jbe L(last_2x_vec) > > + .p2align 4,, 8 > > +L(more_2x_vec): > > + /* L(ret_vec_x1) expects ecx to have position of first match so > > + test with bsf. */ > > + bsfl %ecx, %ecx > > + jnz L(ret_vec_x1) > > + > > + VMOVA (VEC_SIZE * 1)(%rsi), %VMM(2) > > + VMOVU %VMM(1), (%rdi) > > + > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + > > > > -#define USE_AS_STRNCAT > > -#define STRCAT STRNCAT > > -#include "strcat-avx2.S" > > + VMOVA (VEC_SIZE * 2)(%rsi), %VMM(3) > > + VMOVU %VMM(2), (VEC_SIZE * 1)(%rdi) > > + > > + VPCMPEQ %VMM(3), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + /* Check if length is greater than 4x VEC. 
*/ > > + cmpq $(VEC_SIZE * 4), %rdx > > + ja L(more_4x_vec) > > + > > + addl $(VEC_SIZE * -2), %edx > > + > > + tzcnt %ecx, %ecx > > + cmpl %ecx, %edx > > + jbe L(ret_vec_x3_len) > > + > > + cmpl $VEC_SIZE, %ecx > > + jnz L(ret_vec_x3) > > + > > + VMOVA (VEC_SIZE * 3 + 0)(%rsi), %VMM(4) > > + VMOVU %VMM(3), (VEC_SIZE * 2 + 0)(%rdi) > > + VPCMPEQ %VMM(4), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + addl $-VEC_SIZE, %edx > > + bzhil %edx, %ecx, %r8d > > + jz L(ret_vec_x4_len) > > +L(ret_vec_x4): > > + bsfl %ecx, %edx > > +L(ret_vec_x4_len): > > + VMOVU (VEC_SIZE * 2)(%rsi, %rdx), %VMM(0) > > + movNULL $0, (VEC_SIZE * 3)(%rdi, %rdx) > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdi, %rdx) > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 4 > > +L(ret_vec_x3_len): > > + movl %edx, %ecx > > +L(ret_vec_x3): > > + VMOVU (VEC_SIZE)(%rsi, %rcx), %VMM(0) > > + movNULL $0, (VEC_SIZE * 2)(%rdi, %rcx) > > + VMOVU %VMM(0), (VEC_SIZE)(%rdi, %rcx) > > + VZEROUPPER_RETURN > > + > > + > > + .p2align 4,, 8 > > +L(more_4x_vec): > > + bsfl %ecx, %ecx > > + jnz L(ret_vec_x3) > > + > > + VMOVA (VEC_SIZE * 3)(%rsi), %VMM(4) > > + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi) > > + VPCMPEQ %VMM(4), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(ret_vec_x4) > > + > > + VMOVU %VMM(4), (VEC_SIZE * 3)(%rdi) > > + > > + > > + /* Recheck length before aligning. */ > > + cmpq $(VEC_SIZE * 8), %rdx > > + jbe L(last_4x_vec) > > + > > + /* Align rsi (src) and just rdx/rdi (length/dst). */ > > + addq %rsi, %rdx > > + subq %rsi, %rdi > > + subq $-(VEC_SIZE * 4), %rsi > > + andq $(VEC_SIZE * -4), %rsi > > + > > + /* Do first half of loop ahead of time so loop can just start by > > + storing. */ > > + VMOVA (VEC_SIZE * 0 + 0)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 0)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 0)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 0)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %r8d > > + addq %rsi, %rdi > > + testl %r8d, %r8d > > + jnz L(loop_4x_done) > > + > > + /* Use r9 for end of region before handling last 4x VEC > > + specially. */ > > + leaq -(VEC_SIZE * 4)(%rdx), %r9 > > + > > + .p2align 4,, 11 > > +L(loop_4x_vec): > > + > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 0)(%rdi) > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 0)(%rdi) > > + subq $(VEC_SIZE * -4), %rsi > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 0)(%rdi) > > + VMOVU %VMM(3), (VEC_SIZE * 3 + 0)(%rdi) > > + > > + subq $(VEC_SIZE * -4), %rdi > > + cmpq %rsi, %r9 > > + jbe L(loop_last_4x_vec) > > + > > + VMOVA (VEC_SIZE * 0 + 0)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 0)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 0)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 0)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + > > + vpmovmskb %VMM(6), %r8d > > + > > + testl %r8d, %r8d > > + jz L(loop_4x_vec) > > + > > +L(loop_4x_done): > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + /* L(ret_vec_x1) expects ecx to have position of first match so > > + test with bsf. 
*/ > > + bsfl %ecx, %ecx > > + jnz L(ret_vec_x1) > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 0)(%rdi) > > + > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x2) > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 0)(%rdi) > > + > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + bsfl %ecx, %ecx > > + jnz L(ret_vec_x3) > > + > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 0)(%rdi) > > + bsfl %r8d, %r8d > > + VMOVU (VEC_SIZE * 2 + CHAR_SIZE)(%rsi, %r8), %VMM(1) > > + VMOVU %VMM(1), (VEC_SIZE * 2 + CHAR_SIZE)(%rdi, %r8) > > + VZEROUPPER_RETURN > > + > > + > > + > > + .p2align 4,, 4 > > +L(page_cross): > > + movq %rsi, %r8 > > + andq $(VEC_SIZE * -1), %r8 > > + > > + VPCMPEQ (%r8), %VZERO, %VMM(6) > > + > > + vpmovmskb %VMM(6), %ecx > > + shrxl %esi, %ecx, %ecx > > + > > + subl %esi, %r8d > > + andl $(VEC_SIZE - 1), %r8d > > + cmpq %r8, %rdx > > + jb L(page_cross_small) > > + > > + /* Optimizing more aggressively for space as this is very cold > > + code. This saves 2x cache lines. */ > > + > > + /* This adds once to the later result which will get correct > > + copy bounds. NB: this can never zero-out a non-zero RCX as > > + to be in the page cross case rsi cannot be aligned and we > > + already right-shift rcx by the misalignment. */ > > + shll $CHAR_SIZE, %ecx > > + jz L(page_cross_continue) > > + bsfl %ecx, %ecx > > + rep movsb > > + VZEROUPPER_RETURN > > + > > +L(page_cross_small): > > + tzcntl %ecx, %ecx > > + jz L(page_cross_setz) > > + cmpl %edx, %ecx > > + cmova %edx, %ecx > > + rep movsb > > +L(page_cross_setz): > > + movNULL $0, (%rdi) > > + VZEROUPPER_RETURN > > +L(zero_len): > > +# ifdef USE_AS_WCSCPY > > + test %rdx, %rdx > > +# endif > > + jnz OVERFLOW_STRCAT > > + ret > > + > > + > > +END(STRNCAT) > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S > > index 79e7083299..b582a4a7a1 100644 > > --- a/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S > > +++ b/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S > > @@ -1,3 +1,3 @@ > > -#define USE_AS_STRNCPY > > -#define STRCPY __strncpy_avx2_rtm > > -#include "strcpy-avx2-rtm.S" > > +#define STRNCPY __strncpy_avx2_rtm > > +#include "x86-avx2-rtm-vecs.h" > > +#include "strncpy-avx2.S" > > diff --git a/sysdeps/x86_64/multiarch/strncpy-avx2.S b/sysdeps/x86_64/multiarch/strncpy-avx2.S > > index ce634e94fa..d1b25b7a42 100644 > > --- a/sysdeps/x86_64/multiarch/strncpy-avx2.S > > +++ b/sysdeps/x86_64/multiarch/strncpy-avx2.S > > @@ -1,7 +1,735 @@ > > -#ifndef STRNCPY > > -# define STRNCPY __strncpy_avx2 > > -#endif > > +/* strncpy with AVX2 > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + . 
*/ > > + > > +#include > > + > > +#if ISA_SHOULD_BUILD (3) > > + > > +# include > > + > > + > > +# ifndef VEC_SIZE > > +# include "x86-avx2-vecs.h" > > +# endif > > + > > +# ifndef STRNCPY > > +# define STRNCPY __strncpy_avx2 > > +# endif > > + > > + > > +# ifdef USE_AS_WCSCPY > > +# define VPCMPEQ vpcmpeqd > > +# define VPMIN vpminud > > +# define CHAR_SIZE 4 > > +# else > > +# define VPCMPEQ vpcmpeqb > > +# define VPMIN vpminub > > +# define CHAR_SIZE 1 > > +# endif > > + > > +# include "strncpy-or-cat-overflow-def.h" > > + > > +# define PAGE_SIZE 4096 > > + > > +# define VZERO VMM(7) > > +# define VZERO_128 VMM_128(7) > > + > > + > > + .section SECTION(.text), "ax", @progbits > > +ENTRY(STRNCPY) > > + /* Filter zero length strings and very long strings. Zero > > + length strings just return, very long strings are handled by > > + just running rep stos{b|l} to zero set (which will almost > > + certainly segfault), if that succeeds then just calling > > + OVERFLOW_STRCPY (strcpy, stpcpy, wcscpy, wcpcpy). */ > > +# ifdef USE_AS_WCSCPY > > + decq %rdx > > + movq %rdx, %rax > > + /* 56 is end of max supported address space. */ > > + shr $56, %rax > > + jnz L(zero_len) > > + salq $2, %rdx > > +# else > > + decq %rdx > > + /* `dec` can macrofuse with `jl`. If the flag needs to become > > + `jb` replace `dec` with `sub`. */ > > + jl L(zero_len) > > +# endif > > + > > + vpxor %VZERO_128, %VZERO_128, %VZERO_128 > > + movl %esi, %eax > > + andl $(PAGE_SIZE - 1), %eax > > + cmpl $(PAGE_SIZE - VEC_SIZE), %eax > > + ja L(page_cross) > > + > > +L(page_cross_continue): > > + VMOVU (%rsi), %VMM(0) > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + /* If no STPCPY just save end ahead of time. */ > > +# ifndef USE_AS_STPCPY > > + movq %rdi, %rax > > +# elif defined USE_AS_WCSCPY > > + /* Clear dependency as nearly all return code for wcpncpy uses > > + `setc %al`. */ > > + xorl %eax, %eax > > +# endif > > + > > + cmpq $(VEC_SIZE - CHAR_SIZE), %rdx > > + /* `jb` because length rdx is now length - CHAR_SIZE. */ > > + jbe L(less_1x_vec) > > + > > + /* This may overset but thats fine because we still need to zero > > + fill. */ > > + VMOVU %VMM(0), (%rdi) > > + > > + testl %ecx, %ecx > > + jnz L(zfill) > > + > > + /* Align. */ > > + addq %rsi, %rdx > > + subq %rsi, %rdi > > + orq $(VEC_SIZE - 1), %rsi > > + incq %rsi > > +L(last_4x_vec): > > + addq %rsi, %rdi > > +L(loop_last_4x_vec): > > + subq %rsi, %rdx > > + > > + > > + VMOVA 0(%rsi), %VMM(1) > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + cmpq $(VEC_SIZE * 2), %rdx > > + jae L(more_2x_vec) > > + > > + cmpl $(VEC_SIZE), %edx > > + jb L(ret_vec_x1_len) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x1) > > + > > + VPCMPEQ VEC_SIZE(%rsi), %VZERO, %VMM(6) > > + VMOVU %VMM(1), (%rdi) > > + vpmovmskb %VMM(6), %ecx > > + shlq $VEC_SIZE, %rcx > > +L(ret_vec_x1_len): > > + tzcntq %rcx, %rcx > > + cmpl %ecx, %edx > > + jbe L(ret_vec_x1_len_no_zfill) > > + /* Fall through (expectation) is copy len < buffer len. */ > > + VMOVU %VZERO, ((0)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > +L(ret_vec_x1_len_no_zfill_mov): > > + movl %ecx, %edx > > +# ifdef USE_AS_STPCPY > > + /* clear flags. 
*/ > > + xorl %ecx, %ecx > > +# endif > > +L(ret_vec_x1_len_no_zfill): > > + VMOVU ((0)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rdx), %VMM(1) > > + VMOVU %VMM(1), ((0)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > +# ifdef USE_AS_STPCPY > > +# ifdef USE_AS_WCSCPY > > + setc %al > > + addq %rdx, %rdi > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + movl %edx, %eax > > + adcq %rdi, %rax > > +# endif > > +# endif > > +L(return_vzeroupper): > > + ZERO_UPPER_VEC_REGISTERS_RETURN > > + > > + .p2align 4,, 6 > > +L(ret_vec_x1): > > + bsfl %ecx, %ecx > > + VMOVU %VZERO, ((0)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > + subl %ecx, %edx > > + /* Check if we need to reload/store. */ > > + cmpl $VEC_SIZE, %edx > > + jb L(ret_vec_x1_len_no_zfill_mov) > > + /* Otherwise safe to just store directly. */ > > + VMOVU %VMM(1), (%rdi) > > + VMOVU %VZERO, (%rdi, %rcx) > > +# ifdef USE_AS_STPCPY > > + leaq (%rdi, %rcx), %rax > > +# endif > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 12 > > +L(more_2x_vec): > > + VMOVU %VMM(1), (%rdi) > > + testl %ecx, %ecx > > + /* Must fill at least 2x VEC. */ > > + jnz L(zfill_vec1) > > + > > + VMOVA VEC_SIZE(%rsi), %VMM(2) > > + VMOVU %VMM(2), VEC_SIZE(%rdi) > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + /* Must fill at least 1x VEC. */ > > + jnz L(zfill_vec2) > > + > > + VMOVA (VEC_SIZE * 2)(%rsi), %VMM(3) > > + VPCMPEQ %VMM(3), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + > > + /* Check if len is more 4x VEC. -CHAR_SIZE because rdx is len - > > + CHAR_SIZE. */ > > + cmpq $(VEC_SIZE * 4 - CHAR_SIZE), %rdx > > + ja L(more_4x_vec) > > + > > + subl $(VEC_SIZE * 3), %edx > > + jb L(ret_vec_x3_len) > > + > > + testl %ecx, %ecx > > + jnz L(ret_vec_x3) > > + > > + VPCMPEQ (VEC_SIZE * 3)(%rsi), %VZERO, %VMM(6) > > + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi) > > + vpmovmskb %VMM(6), %ecx > > + tzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jbe L(ret_vec_x4_len_no_zfill) > > + /* Fall through (expectation) is copy len < buffer len. */ > > + VMOVU %VZERO, ((VEC_SIZE * 3)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > + movl %ecx, %edx > > +L(ret_vec_x4_len_no_zfill): > > + VMOVU ((VEC_SIZE * 3)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rdx), %VMM(1) > > + VMOVU %VMM(1), ((VEC_SIZE * 3)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > +# ifdef USE_AS_STPCPY > > +# ifdef USE_AS_WCSCPY > > + setc %al > > + addq %rdx, %rdi > > + leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + leal (VEC_SIZE * 3 + 0)(%edx), %eax > > + adcq %rdi, %rax > > +# endif > > +# endif > > + VZEROUPPER_RETURN > > + > > + > > +L(ret_vec_x3_len): > > + addl $(VEC_SIZE * 1), %edx > > + tzcntl %ecx, %ecx > > + cmpl %ecx, %edx > > + jbe L(ret_vec_x3_len_no_zfill) > > + /* Fall through (expectation) is copy len < buffer len. */ > > + VMOVU %VZERO, ((VEC_SIZE * 2)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > +L(ret_vec_x3_len_no_zfill_mov): > > + movl %ecx, %edx > > +# ifdef USE_AS_STPCPY > > + /* clear flags. 
*/ > > + xorl %ecx, %ecx > > +# endif > > + .p2align 4,, 4 > > +L(ret_vec_x3_len_no_zfill): > > + VMOVU ((VEC_SIZE * 2)-(VEC_SIZE - CHAR_SIZE))(%rsi, %rdx), %VMM(1) > > + VMOVU %VMM(1), ((VEC_SIZE * 2)-(VEC_SIZE - CHAR_SIZE))(%rdi, %rdx) > > +# ifdef USE_AS_STPCPY > > +# ifdef USE_AS_WCSCPY > > + setc %al > > + addq %rdx, %rdi > > + leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + leal (VEC_SIZE * 2 + 0)(%rdx), %eax > > + adcq %rdi, %rax > > +# endif > > +# endif > > + VZEROUPPER_RETURN > > + > > + > > + .p2align 4,, 8 > > +L(ret_vec_x3): > > + bsfl %ecx, %ecx > > + VMOVU %VZERO, (VEC_SIZE * 3 +(-(VEC_SIZE - CHAR_SIZE)))(%rdi, %rdx) > > + subl %ecx, %edx > > + jl L(ret_vec_x3_len_no_zfill_mov) > > + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi) > > + VMOVU %VZERO, (VEC_SIZE * 2)(%rdi, %rcx) > > +# ifdef USE_AS_STPCPY > > + leaq (VEC_SIZE * 2)(%rdi, %rcx), %rax > > +# endif > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 8 > > +L(more_4x_vec): > > + > > + VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi) > > + testl %ecx, %ecx > > + jnz L(zfill_vec3) > > + > > + VMOVA (VEC_SIZE * 3)(%rsi), %VMM(4) > > + VMOVU %VMM(4), (VEC_SIZE * 3)(%rdi) > > + VPCMPEQ %VMM(4), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(zfill_vec4) > > + > > + movq %rdx, %rcx > > + addq %rsi, %rdx > > + subq %rsi, %rdi > > + subq $-(VEC_SIZE * 4), %rsi > > + /* Recheck length before aligning. */ > > + cmpq $(VEC_SIZE * 8 - CHAR_SIZE), %rcx > > + jbe L(last_4x_vec) > > + > > + andq $(VEC_SIZE * -4), %rsi > > + > > + /* Do first half of loop ahead of time so loop can just start by > > + storing. */ > > + VMOVA (VEC_SIZE * 0 + 0)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 0)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 0)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 0)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %r8d > > + addq %rsi, %rdi > > + testl %r8d, %r8d > > + jnz L(loop_4x_done) > > + > > + /* Use r9 as end register. 
*/ > > + leaq -(VEC_SIZE * 4 - CHAR_SIZE)(%rdx), %r9 > > > > -#define USE_AS_STRNCPY > > -#define STRCPY STRNCPY > > -#include "strcpy-avx2.S" > > + .p2align 4,, 11 > > +L(loop_4x_vec): > > + > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 0)(%rdi) > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 0)(%rdi) > > + subq $(VEC_SIZE * -4), %rsi > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 0)(%rdi) > > + VMOVU %VMM(3), (VEC_SIZE * 3 + 0)(%rdi) > > + > > + subq $(VEC_SIZE * -4), %rdi > > + cmpq %rsi, %r9 > > + jbe L(loop_last_4x_vec) > > + > > + VMOVA (VEC_SIZE * 0 + 0)(%rsi), %VMM(0) > > + VMOVA (VEC_SIZE * 1 + 0)(%rsi), %VMM(1) > > + VMOVA (VEC_SIZE * 2 + 0)(%rsi), %VMM(2) > > + VMOVA (VEC_SIZE * 3 + 0)(%rsi), %VMM(3) > > + > > + VPMIN %VMM(0), %VMM(1), %VMM(4) > > + VPMIN %VMM(2), %VMM(3), %VMM(6) > > + VPMIN %VMM(4), %VMM(6), %VMM(6) > > + VPCMPEQ %VMM(6), %VZERO, %VMM(6) > > + > > + vpmovmskb %VMM(6), %r8d > > + > > + testl %r8d, %r8d > > + jz L(loop_4x_vec) > > + > > +L(loop_4x_done): > > + subq %rsi, %rdx > > + VMOVU %VMM(0), (VEC_SIZE * 0 + 0)(%rdi) > > + VPCMPEQ %VMM(0), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(zfill_vec1) > > + > > + VMOVU %VMM(1), (VEC_SIZE * 1 + 0)(%rdi) > > + VPCMPEQ %VMM(1), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(zfill_vec2) > > + > > + VMOVU %VMM(2), (VEC_SIZE * 2 + 0)(%rdi) > > + VPCMPEQ %VMM(2), %VZERO, %VMM(6) > > + vpmovmskb %VMM(6), %ecx > > + testl %ecx, %ecx > > + jnz L(zfill_vec3) > > + > > + VMOVU %VMM(3), (VEC_SIZE * 3 + 0)(%rdi) > > + movl %r8d, %ecx > > + > > + // Zfill more.... > > + > > + .p2align 4,, 4 > > +L(zfill_vec4): > > + addq $(VEC_SIZE * 2), %rdi > > + subq $(VEC_SIZE * 2), %rdx > > +L(zfill_vec2): > > + shlq $VEC_SIZE, %rcx > > +L(zfill): > > + bsfq %rcx, %rcx > > + subq %rcx, %rdx > > + addq %rcx, %rdi > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > +L(zfill_from_page_cross): > > + cmpq $VEC_SIZE, %rdx > > + jb L(zfill_less_vec_vzeroupper) > > + > > +L(zfill_more_1x_vec): > > + VMOVU %VZERO, CHAR_SIZE(%rdi) > > + VMOVU %VZERO, (CHAR_SIZE - VEC_SIZE)(%rdi, %rdx) > > + cmpq $(VEC_SIZE * 2), %rdx > > + jae L(zfill_more_2x_vec) > > +L(zfill_done0): > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 8 > > +L(zfill_vec3): > > + addq $(VEC_SIZE * 2), %rdi > > + subq $(VEC_SIZE * 2), %rdx > > + .p2align 4,, 2 > > +L(zfill_vec1): > > + bsfl %ecx, %ecx > > + addq %rcx, %rdi > > + subq %rcx, %rdx > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > + /* zfill from vec1/vec3 must have to set at least 2x VECS. 
*/ > > + > > + VMOVU %VZERO, CHAR_SIZE(%rdi) > > + VMOVU %VZERO, (CHAR_SIZE - VEC_SIZE)(%rdi, %rdx) > > + cmpq $(VEC_SIZE * 2), %rdx > > + jb L(zfill_done0) > > +L(zfill_more_2x_vec): > > + VMOVU %VZERO, (CHAR_SIZE - VEC_SIZE * 2)(%rdi, %rdx) > > + VMOVU %VZERO, (VEC_SIZE + CHAR_SIZE)(%rdi) > > + subq $(VEC_SIZE * 4 - CHAR_SIZE), %rdx > > + jbe L(zfill_done) > > + > > + addq %rdi, %rdx > > + VMOVU %VZERO, (VEC_SIZE * 2 + CHAR_SIZE)(%rdi) > > + VMOVU %VZERO, (VEC_SIZE * 3 + CHAR_SIZE)(%rdi) > > + > > + > > + VMOVU %VZERO, (VEC_SIZE * 0 + 0)(%rdx) > > + VMOVU %VZERO, (VEC_SIZE * 1 + 0)(%rdx) > > + > > + subq $-(VEC_SIZE * 4 + CHAR_SIZE), %rdi > > + cmpq %rdi, %rdx > > + jbe L(zfill_done) > > + > > + andq $-(VEC_SIZE), %rdi > > + .p2align 4,, 12 > > +L(zfill_loop_4x_vec): > > + VMOVA %VZERO, (VEC_SIZE * 0)(%rdi) > > + VMOVA %VZERO, (VEC_SIZE * 1)(%rdi) > > + VMOVA %VZERO, (VEC_SIZE * 2)(%rdi) > > + VMOVA %VZERO, (VEC_SIZE * 3)(%rdi) > > + subq $-(VEC_SIZE * 4), %rdi > > + cmpq %rdi, %rdx > > + ja L(zfill_loop_4x_vec) > > +L(zfill_done): > > + VZEROUPPER_RETURN > > + > > + > > + .p2align 4,, 8 > > +L(copy_1x): > > + VMOVU %VMM(0), (%rdi) > > + testl %ecx, %ecx > > + jz L(ret_32_32) > > +L(zfill_less_vec): > > + bsfl %ecx, %ecx > > +L(zfill_less_vec_no_bsf): > > + subq %rcx, %rdx > > + addq %rcx, %rdi > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > +L(zfill_less_vec_vzeroupper): > > + COND_VZEROUPPER > > + /* We are taking advantage of the fact that to be here we must > > + be writing null-term as (%rdi, %rcx) we have a byte of lee- > > + way for overwriting. */ > > + cmpl $16, %edx > > + jb L(zfill_less_16) > > + VMOVU %VZERO_128, (%rdi) > > + VMOVU %VZERO_128, -(16 - CHAR_SIZE)(%rdi, %rdx) > > + ret > > +# ifdef USE_AS_STPCPY > > +L(ret_32_32): > > + leaq CHAR_SIZE(%rdi, %rdx), %rax > > + VZEROUPPER_RETURN > > +# endif > > + > > + .p2align 4,, 4 > > +L(copy_16_31): > > + /* Overfill to avoid branches. */ > > + vmovdqu -(16 - CHAR_SIZE)(%rsi, %rdx), %xmm1 > > + vmovdqu %xmm0, (%rdi) > > + vmovdqu %xmm1, -(16 - CHAR_SIZE)(%rdi, %rdx) > > + cmpl %ecx, %edx > > + ja L(zfill_less_vec_no_bsf) > > +# ifndef USE_AS_STPCPY > > +L(ret_32_32): > > +# else > > +# ifdef USE_AS_WCSCPY > > + setc %al > > + addq %rdx, %rdi > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > +# else > > + movl %edx, %eax > > + adcq %rdi, %rax > > +# endif > > +# endif > > + VZEROUPPER_RETURN > > + > > + .p2align 4,, 4 > > +L(copy_8_15): > > + /* Overfill to avoid branches. */ > > + movq -(8 - CHAR_SIZE)(%rsi, %rdx), %rsi > > + vmovq %xmm0, (%rdi) > > + movq %rsi, -(8 - CHAR_SIZE)(%rdi, %rdx) > > + cmpl %ecx, %edx > > + jbe L(ret_8_15) > > + subq %rcx, %rdx > > + addq %rcx, %rdi > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > + .p2align 4,, 8 > > +L(zfill_less_16): > > + xorl %ecx, %ecx > > + cmpl $8, %edx > > + jb L(zfill_less_8) > > + movq %rcx, (%rdi) > > + movq %rcx, -(8 - CHAR_SIZE)(%rdi, %rdx) > > +# ifndef USE_AS_STPCPY > > +L(ret_8_15): > > +# endif > > + ret > > + > > + > > + .p2align 4,, 8 > > +L(less_1x_vec): > > + /* Reuse flag from `cmp $VEC_SIZE, %rdx`. The idea is many > > + buffer sizes are aligned conventially. 
*/ > > + je L(copy_1x) > > + > > + tzcntl %ecx, %ecx > > + cmpl $16, %edx > > + jae L(copy_16_31) > > + > > + COND_VZEROUPPER > > + cmpl $8, %edx > > + jae L(copy_8_15) > > +# ifdef USE_AS_WCSCPY > > + testl %ecx, %ecx > > + jz L(zfill_less_8_set_ret) > > + > > + movl (%rsi, %rdx), %esi > > + vmovd %xmm0, (%rdi) > > + movl %esi, (%rdi, %rdx) > > + > > +# ifdef USE_AS_STPCPY > > + cmpl %ecx, %edx > > +L(ret_8_15): > > + setc %al > > + addq %rdx, %rdi > > + leaq (%rdi, %rax, CHAR_SIZE), %rax > > +# endif > > + ret > > +L(zfill_less_8_set_ret): > > + xorl %ecx, %ecx > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > +L(zfill_less_8): > > + movl %ecx, (%rdi) > > + movl %ecx, (%rdi, %rdx) > > + ret > > + > > +# else > > + cmpl $3, %edx > > + jb L(copy_0_3) > > + /* Overfill to avoid branches. */ > > + movl -3(%rsi, %rdx), %esi > > + vmovd %xmm0, (%rdi) > > + movl %esi, -3(%rdi, %rdx) > > + cmpl %ecx, %edx > > + jbe L(ret_4_7) > > + subq %rcx, %rdx > > + addq %rcx, %rdi > > +# ifdef USE_AS_STPCPY > > + movq %rdi, %rax > > +# endif > > + xorl %ecx, %ecx > > + .p2align 4,, 8 > > +L(zfill_less_8): > > + cmpl $3, %edx > > + jb L(zfill_less_3) > > + movl %ecx, (%rdi) > > + movl %ecx, -3(%rdi, %rdx) > > +# ifdef USE_AS_STPCPY > > + ret > > +# endif > > + > > +L(ret_4_7): > > +# ifdef USE_AS_STPCPY > > +L(ret_8_15): > > + movl %edx, %eax > > + adcq %rdi, %rax > > +# endif > > + ret > > + > > + .p2align 4,, 4 > > +L(zfill_less_3): > > + testl %edx, %edx > > + jz L(zfill_1) > > + movw %cx, (%rdi) > > +L(zfill_1): > > + movb %cl, (%rdi, %rdx) > > + ret > > + > > + .p2align 4,, 8 > > +L(copy_0_3): > > + vmovd %xmm0, %r8d > > + testl %edx, %edx > > + jz L(copy_1) > > + movw %r8w, (%rdi) > > + cmpl %ecx, %edx > > + ja L(zfill_from_1) > > + movzbl (%rsi, %rdx), %r8d > > +# ifdef USE_AS_STPCPY > > + movl %edx, %eax > > + adcq %rdi, %rax > > + movb %r8b, (%rdi, %rdx) > > + ret > > +# endif > > + > > +L(copy_1): > > +# ifdef USE_AS_STPCPY > > + movl %edx, %eax > > + cmpl %ecx, %edx > > + adcq %rdi, %rax > > +# endif > > +# ifdef USE_AS_WCSCPY > > + vmovd %xmm0, (%rdi) > > +# else > > + movb %r8b, (%rdi, %rdx) > > +# endif > > + ret > > +# endif > > + > > + .p2align 4,, 2 > > +L(zero_len): > > + movq %rdi, %rax > > + ret > > +# ifndef USE_AS_WCSCPY > > + .p2align 4,, 8 > > +L(zfill_from_1): > > +# ifdef USE_AS_STPCPY > > + leaq (%rdi, %rcx), %rax > > +# endif > > + movw $0, -1(%rdi, %rdx) > > + ret > > +# endif > > + > > + .p2align 4,, 4 > > + .p2align 6,, 8 > > +L(page_cross): > > + movq %rsi, %rax > > + andq $(VEC_SIZE * -1), %rax > > + > > + VPCMPEQ (%rax), %VZERO, %VMM(6) > > + > > + vpmovmskb %VMM(6), %ecx > > + shrxl %esi, %ecx, %ecx > > + > > + subl %esi, %eax > > + andl $(VEC_SIZE - 1), %eax > > + cmpq %rax, %rdx > > + jb L(page_cross_small) > > + /* Optimizing more aggressively for space as this is very cold > > + code. This saves 2x cache lines. */ > > + > > + /* If rcx is non-zero then continue. */ > > + shl $CHAR_SIZE, %ecx > > + jz L(page_cross_continue) > > + bsf %ecx, %ecx > > + > > + subq %rcx, %rdx > > +# ifdef USE_AS_STPCPY > > + leaq -CHAR_SIZE(%rdi, %rcx), %rax > > +# else > > + movq %rdi, %rax > > +# endif > > + > > + rep movsb > > +# ifdef USE_AS_WCSCPY > > + movl $0, (%rdi) > > +# else > > + movb $0, (%rdi) > > +# endif > > + jmp L(zfill_from_page_cross) > > + > > +L(page_cross_small): > > + tzcntl %ecx, %ecx > > + xorl %eax, %eax > > + cmpl %ecx, %edx > > + jbe L(page_cross_copy_only) > > + > > + /* Do a zfill of the tail before copying. 
*/ > > + movq %rdi, %r9 > > + movl %ecx, %r8d > > + > > + subl %ecx, %edx > > + leaq CHAR_SIZE(%rdi, %rcx), %rdi > > + movl %edx, %ecx > > + rep stosb > > + movq %r9, %rdi > > + movl %r8d, %edx > > +L(page_cross_copy_only): > > + leal CHAR_SIZE(%rdx), %ecx > > +# ifdef USE_AS_STPCPY > > +# ifdef USE_AS_WCSCPY > > + setc %al > > + addq %rdi, %rdx > > + leaq (%rdx, %rax, CHAR_SIZE), %rax > > +# else > > + movl %edx, %eax > > + adcq %rdi, %rax > > +# endif > > +# else > > + movq %rdi, %rax > > +# endif > > + rep movsb > > + ret > > + > > + > > +L(best_effort_strncpy): > > + movq %rdx, %rcx > > + xorl %eax, %eax > > + movq %rdi, %r8 > > + /* The length is >= 2^63. We very much so expect to segfault at > > + rep stos. If that doesn't happen then just strcpy to finish. > > + */ > > +# ifdef USE_AS_WCSCPY > > + rep stosl > > +# else > > + rep stosb > > +# endif > > + movq %r8, %rdi > > + jmp OVERFLOW_STRCPY > > +END(STRNCPY) > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/x86-avx-vecs.h b/sysdeps/x86_64/multiarch/x86-avx-vecs.h > > index dca1089060..01bead1435 100644 > > --- a/sysdeps/x86_64/multiarch/x86-avx-vecs.h > > +++ b/sysdeps/x86_64/multiarch/x86-avx-vecs.h > > @@ -27,7 +27,10 @@ > > #define VEC_SIZE 32 > > #include "x86-vec-macros.h" > > > > -#define USE_WITH_AVX 1 > > +#ifndef USE_WITH_AVX2 > > +# define USE_WITH_AVX 1 > > Is this checked somewhere? > > > +#endif > > + > > #define SECTION(p) p##.avx > > > > /* 4-byte mov instructions with AVX2. */ > > diff --git a/sysdeps/x86_64/multiarch/x86-avx2-rtm-vecs.h b/sysdeps/x86_64/multiarch/x86-avx2-rtm-vecs.h > > new file mode 100644 > > index 0000000000..a5966701ec > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/x86-avx2-rtm-vecs.h > > @@ -0,0 +1,26 @@ > > +/* Common config for AVX2-RTM VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + . */ > > + > > +#ifndef _X86_AVX2_RTM_VECS_H > > +#define _X86_AVX2_RTM_VECS_H 1 > > + > > +#define USE_WITH_AVX2 1 > > +#include "x86-avx-rtm-vecs.h" > > + > > +#endif > > diff --git a/sysdeps/x86_64/multiarch/x86-avx2-vecs.h b/sysdeps/x86_64/multiarch/x86-avx2-vecs.h > > new file mode 100644 > > index 0000000000..16d7ae5147 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/x86-avx2-vecs.h > > @@ -0,0 +1,27 @@ > > +/* Common config for AVX2 VECs > > + All versions must be listed in ifunc-impl-list.c. > > + Copyright (C) 2022 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. 
> > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + . */ > > + > > +#ifndef _X86_AVX2_VECS_H > > +#define _X86_AVX2_VECS_H 1 > > + > > +#define USE_WITH_AVX2 1 > > Is this checked somewhere? > > > + > > +#include "x86-avx-vecs.h" > > + > > +#endif > > -- > > 2.34.1 > > > > > -- > H.J.
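
For anyone following the VPMIN trick in the 4x-vector copy loops quoted above, the zero-terminator check is roughly equivalent to the following C intrinsics sketch (illustrative only, not part of the patch; the helper name any_zero_in_4x is made up for this example):

    #include <immintrin.h>

    /* Nonzero iff any of the four 32-byte vectors contains a zero byte.
       Mirrors the VPMIN/VPCMPEQ/vpmovmskb sequence in the loops: the
       unsigned byte-wise minimum of all four vectors has a zero lane
       iff at least one input byte was zero, so a single compare against
       zero covers 4x VEC_SIZE bytes of source per iteration.  */
    static inline int
    any_zero_in_4x (__m256i v0, __m256i v1, __m256i v2, __m256i v3)
    {
      __m256i m01 = _mm256_min_epu8 (v0, v1);   /* VPMIN %VMM(0), %VMM(1) */
      __m256i m23 = _mm256_min_epu8 (v2, v3);   /* VPMIN %VMM(2), %VMM(3) */
      __m256i m = _mm256_min_epu8 (m01, m23);   /* min of the partial mins */
      __m256i eq = _mm256_cmpeq_epi8 (m, _mm256_setzero_si256 ());
      return _mm256_movemask_epi8 (eq);         /* one mask bit per byte */
    }

The wide-character builds swap in the dword forms (_mm256_min_epu32 / _mm256_cmpeq_epi32, matching the vpminud/vpcmpeqd definitions of VPMIN/VPCMPEQ under USE_AS_WCSCPY), but the idea is the same.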