From: Noah Goldstein
Date: Sun, 2 Jul 2023 12:03:07 -0500
Subject: Re: [PATCH] x86_64: Implement AVX2 version of strlcpy/wcslcpy function
To: Sunil K Pandey
Cc: libc-alpha@sourceware.org, hjl.tools@gmail.com
In-Reply-To: <20230630204812.2059831-1-skpgkp2@gmail.com>

On Fri, Jun 30, 2023 at 3:48 PM Sunil K Pandey via Libc-alpha wrote:
>
> This patch optimizes the strlcpy/wcslcpy string functions for AVX2.
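For readers new to these functions: strlcpy (dst, src, size) copies at
most size - 1 characters, always NUL-terminates when size is nonzero,
and returns strlen (src) so the caller can detect truncation.  A minimal
C sketch of that contract (for reference only -- this is not the patch's
implementation, and `strlcpy_ref' is a made-up name):

    #include <string.h>

    size_t
    strlcpy_ref (char *dst, const char *src, size_t size)
    {
      size_t len = strlen (src);   /* Return value: full source length.  */
      if (size != 0)
        {
          size_t ncopy = len < size ? len : size - 1;
          memcpy (dst, src, ncopy); /* Copy at most size - 1 characters.  */
          dst[ncopy] = '\0';        /* Always NUL-terminate.  */
        }
      return len;
    }

wcslcpy behaves the same way over wchar_t, with size counted in wide
characters.  The AVX2 code below implements this contract with vector
loads and stores.
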
> ---
>  sysdeps/x86_64/multiarch/Makefile          |   4 +
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c |  18 +
>  sysdeps/x86_64/multiarch/ifunc-strlcpy.h   |  34 ++
>  sysdeps/x86_64/multiarch/strlcpy-avx2.S    | 446 +++++++++++++++++++++
>  sysdeps/x86_64/multiarch/strlcpy-generic.c |  25 ++
>  sysdeps/x86_64/multiarch/strlcpy.c         |  36 ++
>  sysdeps/x86_64/multiarch/wcslcpy-avx2.S    |   4 +
>  sysdeps/x86_64/multiarch/wcslcpy-generic.c |  25 ++
>  sysdeps/x86_64/multiarch/wcslcpy.c         |  35 ++
>  9 files changed, 627 insertions(+)
>  create mode 100644 sysdeps/x86_64/multiarch/ifunc-strlcpy.h
>  create mode 100644 sysdeps/x86_64/multiarch/strlcpy-avx2.S
>  create mode 100644 sysdeps/x86_64/multiarch/strlcpy-generic.c
>  create mode 100644 sysdeps/x86_64/multiarch/strlcpy.c
>  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy-avx2.S
>  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy-generic.c
>  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy.c
>
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index e1e894c963..7e3fc081df 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -82,6 +82,8 @@ sysdep_routines += \
>    strcpy-sse2 \
>    strcpy-sse2-unaligned \
>    strcspn-sse4 \
> +  strlcpy-avx2 \
> +  strlcpy-generic \
>    strlen-avx2 \
>    strlen-avx2-rtm \
>    strlen-evex \
> @@ -153,6 +155,8 @@ sysdep_routines += \
>    wcscpy-evex \
>    wcscpy-generic \
>    wcscpy-ssse3 \
> +  wcslcpy-avx2 \
> +  wcslcpy-generic \
>    wcslen-avx2 \
>    wcslen-avx2-rtm \
>    wcslen-evex \
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index 5427ff1907..9928dee187 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -751,6 +751,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>                                1,
>                                __strncat_sse2_unaligned))
>
> +  /* Support sysdeps/x86_64/multiarch/strlcpy.c.  */
> +  IFUNC_IMPL (i, name, strlcpy,
> +              X86_IFUNC_IMPL_ADD_V3 (array, i, strlcpy,
> +                                     CPU_FEATURE_USABLE (AVX2),
> +                                     __strlcpy_avx2)
> +              X86_IFUNC_IMPL_ADD_V1 (array, i, strlcpy,
> +                                     1,
> +                                     __strlcpy_generic))
> +
>    /* Support sysdeps/x86_64/multiarch/strncpy.c.  */
>    IFUNC_IMPL (i, name, strncpy,
>                X86_IFUNC_IMPL_ADD_V4 (array, i, strncpy,
> @@ -917,6 +926,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>                                1,
>                                __wcscpy_generic))
>
> +  /* Support sysdeps/x86_64/multiarch/wcslcpy.c.  */
> +  IFUNC_IMPL (i, name, wcslcpy,
> +              X86_IFUNC_IMPL_ADD_V3 (array, i, wcslcpy,
> +                                     CPU_FEATURE_USABLE (AVX2),
> +                                     __wcslcpy_avx2)
> +              X86_IFUNC_IMPL_ADD_V1 (array, i, wcslcpy,
> +                                     1,
> +                                     __wcslcpy_generic))
> +
>    /* Support sysdeps/x86_64/multiarch/wcsncpy.c.  */
>    IFUNC_IMPL (i, name, wcsncpy,
>                X86_IFUNC_IMPL_ADD_V4 (array, i, wcsncpy,
> diff --git a/sysdeps/x86_64/multiarch/ifunc-strlcpy.h b/sysdeps/x86_64/multiarch/ifunc-strlcpy.h
> new file mode 100644
> index 0000000000..982a30d15b
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/ifunc-strlcpy.h
> @@ -0,0 +1,34 @@
> +/* Common definition for ifunc selections.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <init-arch.h>
> +
> +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> +extern __typeof (REDIRECT_NAME) OPTIMIZE (generic) attribute_hidden;
> +
> +static inline void *
> +IFUNC_SELECTOR (void)
> +{
> +  const struct cpu_features *cpu_features = __get_cpu_features ();
> +
> +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> +    return OPTIMIZE (avx2);
> +
> +  return OPTIMIZE (generic);
> +}
> diff --git a/sysdeps/x86_64/multiarch/strlcpy-avx2.S b/sysdeps/x86_64/multiarch/strlcpy-avx2.S
> new file mode 100644
> index 0000000000..cf54b1e990
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strlcpy-avx2.S
> @@ -0,0 +1,446 @@
> +/* Strlcpy/wcslcpy optimized with AVX2.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <isa-level.h>
> +
> +#if ISA_SHOULD_BUILD (3)
> +
> +# include <sysdep.h>
> +
> +# ifndef VEC_SIZE
> +#  include "x86-avx-vecs.h"
> +# endif
> +
> +# ifndef STRLCPY
> +#  define STRLCPY  __strlcpy_avx2
> +# endif
> +
> +
> +# ifdef USE_AS_WCSLCPY
> +#  define CHAR_SIZE  4
> +#  define MOVU       movl
> +#  define VPCMPEQ    vpcmpeqd
> +#  define VPMINU     vpminud
> +# else
> +#  define CHAR_SIZE  1
> +#  define MOVU       movb
> +#  define VPCMPEQ    vpcmpeqb
> +#  define VPMINU     vpminub
> +# endif
> +
> +# define PMOVMSK       vpmovmskb
> +# define PAGE_SIZE     4096
> +# define VEC_SIZE      32
> +# define CHAR_PER_VEC  (VEC_SIZE / CHAR_SIZE)
> +
> +	.section SECTION(.text),"ax",@progbits
> +/* Aligning the entry point to 64 bytes provides better performance
> +   for strings up to one vector in length.  */
> +ENTRY_P2ALIGN (STRLCPY, 6)
> +# ifdef __ILP32__
> +	/* Clear the upper 32 bits.  */
> +	movl	%edx, %edx
> +# endif
> +
> +	/* Zero out a vector register for the end-of-string comparison.  */
> +	vpxor	%VMM(0), %VMM(0), %VMM(0)
> +	/* Save the source pointer for the return-value calculation.  */
> +	mov	%rsi, %r8
> +	mov	%esi, %eax
> +	sall	$20, %eax
> +	cmpl	$((PAGE_SIZE - (VEC_SIZE)) << 20), %eax
> +	ja	L(page_cross)
> +
> +L(page_cross_continue):
> +	/* Load the first vector.  */
> +	VMOVU	(%rsi), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(2)
> +	PMOVMSK	%VMM(2), %eax
> +	test	%eax, %eax
> +	jnz	L(ret_vec_x1)
> +
> +	test	%rdx, %rdx
> +	jz	L(continue_second_vector)
> +
> +	/* Check whether we can copy a full vector.  */
> +	cmp	$CHAR_PER_VEC, %rdx
> +	jbe	L(page_cross_small_vec_copy)
> +	/* Copy the first vector.  */
> +	VMOVU	%VMM(1), (%rdi)
> +	sub	$CHAR_PER_VEC, %rdx
> +
> +L(continue_second_vector):
> +	/* Align the RSI pointer and adjust RDI by the same offset.  */
> +	mov	%rsi, %rax
> +	and	$-VEC_SIZE, %rsi
> +	sub	%rsi, %rax
> +	sub	%rax, %rdi
> +
> +	/* Check whether N characters have already been copied,
> +	   i.e. RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(skip_copy_alignment_fix)
> +
> +	/* Adjust RDX for the copy alignment fix.  */
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rax
> +# endif
> +	add	%rax, %rdx
> +
> +L(skip_copy_alignment_fix):
> +	/* Load the second vector.  */
> +	VMOVA	(VEC_SIZE * 1)(%rsi), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(2)
> +	vptest	%VMM(2), %VMM(2)
> +	jnz	L(ret_vec_x2)
> +
> +	/* Skip the copy if RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(continue_third_vector)
> +
> +	/* Jump if below or equal (instead of below) because the last
> +	   character copied must be the NUL terminator.  */
> +	cmp	$CHAR_PER_VEC, %rdx
> +	jbe	L(partial_copy_second_vector)
> +
> +	sub	$CHAR_PER_VEC, %rdx
> +	/* Copy the second vector.  */
> +	VMOVU	%VMM(1), (VEC_SIZE * 1)(%rdi)
> +
> +L(continue_third_vector):
> +	/* Load the third vector.  */
> +	VMOVA	(VEC_SIZE * 2)(%rsi), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(2)
> +	vptest	%VMM(2), %VMM(2)
> +	jnz	L(ret_vec_x3)
> +
> +	/* Skip the copy if RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(continue_fourth_vector)
> +
> +	cmp	$CHAR_PER_VEC, %rdx
> +	jbe	L(partial_copy_third_vector)
> +
> +	sub	$CHAR_PER_VEC, %rdx
> +	/* Copy the third vector.  */
> +	VMOVU	%VMM(1), (VEC_SIZE * 2)(%rdi)
> +
> +L(continue_fourth_vector):
> +	/* Load the fourth vector.  */
> +	VMOVA	(VEC_SIZE * 3)(%rsi), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(2)
> +	vptest	%VMM(2), %VMM(2)
> +	jnz	L(ret_vec_x4)
> +
> +	/* Skip the copy if RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(loop_4x_align)
> +
> +	cmp	$CHAR_PER_VEC, %rdx
> +	jbe	L(partial_copy_fourth_vector)
> +
> +	sub	$CHAR_PER_VEC, %rdx
> +	/* Copy the fourth vector.  */
> +	VMOVU	%VMM(1), (VEC_SIZE * 3)(%rdi)
> +
> +
> +L(loop_4x_align):
> +	/* Jump straight to the loop if RSI is already 4-vector aligned.  */
> +	test	$(VEC_SIZE * 4 - 1), %esi
> +	jz	L(loop_4x_read)
> +
> +	mov	%rsi, %rcx
> +
> +	/* Align RSI to 4x vector.  */
> +	and	$(VEC_SIZE * -4), %rsi
> +	sub	%rsi, %rcx
> +
> +	/* Adjust RDI for the RSI alignment fix.  */
> +	sub	%rcx, %rdi
> +
> +	/* Jump to the loop if RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(loop_4x_read)
> +
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rcx
> +# endif
> +
> +	/* Adjust RDX for the RSI alignment fix.  */
> +	add	%rcx, %rdx
> +	jmp	L(loop_4x_read)
> +
> +	.p2align 4,,6
> +L(loop_4x_vec):
> +	/* Skip the copy if RDX is 0.  */
> +	test	%rdx, %rdx
> +	jz	L(loop_partial_copy_return)
> +	cmp	$(CHAR_PER_VEC * 4), %rdx
> +	jbe	L(loop_partial_copy)
> +	VMOVU	%VMM(1), (VEC_SIZE * 4)(%rdi)
> +	VMOVU	%VMM(2), (VEC_SIZE * 5)(%rdi)
> +	VMOVU	%VMM(3), (VEC_SIZE * 6)(%rdi)
> +	VMOVU	%VMM(4), (VEC_SIZE * 7)(%rdi)
> +	sub	$(CHAR_PER_VEC * 4), %rdx
> +
> +L(loop_partial_copy_return):
> +	sub	$(VEC_SIZE * -4), %rsi
> +	sub	$(VEC_SIZE * -4), %rdi
> +
> +L(loop_4x_read):
> +	VMOVA	(VEC_SIZE * 4)(%rsi), %VMM(1)
> +	VMOVA	(VEC_SIZE * 5)(%rsi), %VMM(2)
> +	VMOVA	(VEC_SIZE * 6)(%rsi), %VMM(3)
> +	VMOVA	(VEC_SIZE * 7)(%rsi), %VMM(4)
> +	VPMINU	%VMM(1), %VMM(2), %VMM(5)
> +	VPMINU	%VMM(3), %VMM(4), %VMM(6)
> +	VPMINU	%VMM(5), %VMM(6), %VMM(7)
> +	VPCMPEQ	%VMM(0), %VMM(7), %VMM(7)
> +	vptest	%VMM(7), %VMM(7)
> +
> +	jz	L(loop_4x_vec)
> +
> +	/* Check whether the string ends in the first or the second
> +	   vector.  */
> +	lea	(VEC_SIZE * 4)(%rsi), %rax
> +	sub	%r8, %rax
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rax
> +# endif
> +	xor	%r10, %r10
> +	VPCMPEQ	%VMM(0), %VMM(5), %VMM(6)
> +	vptest	%VMM(6), %VMM(6)
> +	jnz	L(endloop)
> +	sub	$(CHAR_PER_VEC * -2), %rax
> +	mov	$(CHAR_PER_VEC * 2), %r10
> +	VMOVA	%VMM(3), %VMM(1)
> +	VMOVA	%VMM(4), %VMM(2)
> +
> +L(endloop):
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(2), %VMM(2)
> +	PMOVMSK	%VMM(1), %rcx
> +	PMOVMSK	%VMM(2), %r9
> +	shlq	$32, %r9
> +	orq	%r9, %rcx
> +	bsf	%rcx, %rcx
> +	/* Shift RCX right by 2; VPMOVMSK only has a byte version.  */
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rcx
> +# endif
> +	/* At this point RAX holds the length to return.  */
> +	add	%rcx, %rax
> +	test	%rdx, %rdx
> +	jz	L(ret)
> +
> +	/* Add 1 to account for the NUL character in the RDX
> +	   comparison.  */
> +	lea	1(%r10, %rcx), %rcx
> +	cmp	%rdx, %rcx
> +	cmovb	%rcx, %rdx
> +
> +L(loop_partial_copy):
> +	cmp	$(CHAR_PER_VEC * 2), %rdx
> +	jbe	L(loop_partial_first_half)
> +	/* Reload the first two vectors.  */
> +	VMOVA	(VEC_SIZE * 4)(%rsi), %VMM(1)
> +	VMOVA	(VEC_SIZE * 5)(%rsi), %VMM(2)
> +	VMOVU	%VMM(1), (VEC_SIZE * 4)(%rdi)
> +	VMOVU	%VMM(2), (VEC_SIZE * 5)(%rdi)
> +
> +L(loop_partial_first_half):
> +	/* Go back two vectors from the end and use overlapping copies:
> +	   (VEC_SIZE * 4 - VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE)
> +	   (VEC_SIZE * 4 - VEC_SIZE * 1)(%rsi, %rdx, CHAR_SIZE)
> +	 */
> +	VMOVU	(VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE), %VMM(3)
> +	VMOVU	(VEC_SIZE * 3)(%rsi, %rdx, CHAR_SIZE), %VMM(4)
> +	VMOVU	%VMM(3), (VEC_SIZE * 2)(%rdi, %rdx, CHAR_SIZE)
> +	VMOVU	%VMM(4), (VEC_SIZE * 3)(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, (VEC_SIZE * 4 - CHAR_SIZE)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%rdx, %rdx
> +	vptest	%VMM(7), %VMM(7)
> +	jz	L(loop_partial_copy_return)
> +	ret
> +
> +	.p2align 4
> +L(page_cross):
> +	mov	%rsi, %rcx
> +	mov	%rsi, %r11
> +	and	$-VEC_SIZE, %r11
> +	and	$(VEC_SIZE - 1), %rcx
> +	VMOVA	(%r11), %VMM(1)
> +	VPCMPEQ	%VMM(0), %VMM(1), %VMM(2)
> +	PMOVMSK	%VMM(2), %eax
> +	shr	%cl, %eax
> +	jz	L(page_cross_continue)
> +
> +L(ret_vec_x1):
> +	bsf	%eax, %eax
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %eax
> +# endif
> +	/* Increment by 1 to account for the NUL character.  */
> +	lea	1(%eax), %ecx
> +	cmp	%rdx, %rcx
> +	cmovb	%rcx, %rdx
> +	test	%rdx, %rdx
> +	jz	L(ret)
> +
> +L(page_cross_small_vec_copy):
> +	cmp	$(16 / CHAR_SIZE), %rdx
> +	jbe	L(copy_8_byte_scalar)
> +	VMOVU	(%rsi), %VMM_128(1)
> +	VMOVU	-16(%rsi, %rdx, CHAR_SIZE), %VMM_128(3)
> +	VMOVU	%VMM_128(1), (%rdi)
> +	VMOVU	%VMM_128(3), -16(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%rdx, %rdx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_second_vector)
> +	ret
> +
> +L(copy_8_byte_scalar):
> +	cmp	$(8 / CHAR_SIZE), %rdx
> +	jbe	L(copy_4_byte_scalar)
> +	movq	(%rsi), %r10
> +	movq	-8(%rsi, %rdx, CHAR_SIZE), %r11
> +	movq	%r10, (%rdi)
> +	movq	%r11, -8(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_second_vector)
> +	ret
> +
> +L(copy_4_byte_scalar):
> +# ifndef USE_AS_WCSLCPY
> +	cmp	$4, %rdx
> +	jbe	L(copy_2_byte_scalar)
> +# endif
> +	movl	(%rsi), %r10d
> +	movl	-4(%rsi, %rdx, CHAR_SIZE), %r11d
> +	movl	%r10d, (%rdi)
> +	movl	%r11d, -4(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_second_vector)
> +	ret
> +
> +# ifndef USE_AS_WCSLCPY
> +L(copy_2_byte_scalar):
> +	cmp	$2, %rdx
> +	jbe	L(copy_1_byte_scalar)
> +	movw	(%rsi), %r10w
> +	movw	-(CHAR_SIZE * 3)(%rsi, %rdx, CHAR_SIZE), %r11w
> +	movw	%r10w, (%rdi)
> +	movw	%r11w, -(CHAR_SIZE * 3)(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_second_vector)
> +	ret
> +
> +L(copy_1_byte_scalar):
> +	MOVU	(%rsi), %r10b
> +	MOVU	%r10b, (%rdi)
> +	MOVU	$0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_second_vector)
> +	ret
> +# endif
> +
> +L(ret_vec_x2):
> +	PMOVMSK	%VMM(2), %rax
> +	bsf	%rax, %rcx
> +	/* Calculate the return value.  */
> +	lea	VEC_SIZE(%rsi, %rcx), %rax
> +	sub	%r8, %rax
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rax
> +	shr	$2, %rcx
> +# endif
> +	inc	%rcx
> +	test	%rdx, %rdx
> +	jz	L(ret)
> +	cmp	%rdx, %rcx
> +	cmovb	%rcx, %rdx
> +
> +L(partial_copy_second_vector):
> +	VMOVU	(%rsi, %rdx, CHAR_SIZE), %VMM(1)
> +	VMOVU	%VMM(1), (%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, (VEC_SIZE - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_third_vector)
> +
> +L(ret):
> +	ret
> +
> +L(ret_vec_x3):
> +	PMOVMSK	%VMM(2), %rax
> +	bsf	%rax, %rcx
> +	/* Calculate the return value.  */
> +	lea	(VEC_SIZE * 2)(%rsi, %rcx), %rax
> +	sub	%r8, %rax
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rax
> +	shr	$2, %rcx
> +# endif
> +	inc	%rcx
> +	test	%rdx, %rdx
> +	jz	L(ret)
> +	cmp	%rdx, %rcx
> +	cmovb	%rcx, %rdx
> +
> +L(partial_copy_third_vector):
> +	VMOVU	(VEC_SIZE)(%rsi, %rdx, CHAR_SIZE), %VMM(1)
> +	VMOVU	%VMM(1), (VEC_SIZE)(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, ((VEC_SIZE * 2) - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_fourth_vector)
> +	ret
> +
> +L(ret_vec_x4):
> +	PMOVMSK	%VMM(2), %rax
> +	bsf	%rax, %rcx
> +	/* Calculate the return value.  */
> +	lea	(VEC_SIZE * 3)(%rsi, %rcx), %rax
> +	sub	%r8, %rax
> +# ifdef USE_AS_WCSLCPY
> +	shr	$2, %rax
> +	shr	$2, %rcx
> +# endif
> +	inc	%rcx
> +	test	%rdx, %rdx
> +	jz	L(ret)
> +	cmp	%rdx, %rcx
> +	cmovb	%rcx, %rdx
> +
> +L(partial_copy_fourth_vector):
> +	VMOVU	(VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE), %VMM(1)
> +	VMOVU	%VMM(1), (VEC_SIZE * 2)(%rdi, %rdx, CHAR_SIZE)
> +	MOVU	$0, ((VEC_SIZE * 3) - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> +	xor	%edx, %edx
> +	vptest	%VMM(2), %VMM(2)
> +	jz	L(continue_fourth_vector)
> +	ret
> +
> +END (STRLCPY)

Is strlcpy/strlcat integrable with the existing strncat implementation?
I had figured they would fit in the same file.

> +#endif
> diff --git a/sysdeps/x86_64/multiarch/strlcpy-generic.c b/sysdeps/x86_64/multiarch/strlcpy-generic.c
> new file mode 100644
> index 0000000000..eee3b7b086
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strlcpy-generic.c
> @@ -0,0 +1,25 @@
> +/* strlcpy generic.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +
> +#include <isa-level.h>
> +#if ISA_SHOULD_BUILD (1)
> +# define __strlcpy __strlcpy_generic
> +# include <string/strlcpy.c>
> +
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/strlcpy.c b/sysdeps/x86_64/multiarch/strlcpy.c
> new file mode 100644
> index 0000000000..ded41fbcfb
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strlcpy.c
> @@ -0,0 +1,36 @@
> +/* Multiple versions of strlcpy.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Define multiple versions only for the definition in libc.  */
> +#if IS_IN (libc)
> +# define __strlcpy __redirect_strlcpy
> +# include <string.h>
> +# undef __strlcpy
> +
> +# define SYMBOL_NAME strlcpy
> +# include "ifunc-strlcpy.h"
> +
> +libc_ifunc_redirected (__redirect_strlcpy, __strlcpy, IFUNC_SELECTOR ());
> +weak_alias (__strlcpy, strlcpy)
> +
> +# ifdef SHARED
> +__hidden_ver1 (__strlcpy, __GI___strlcpy, __redirect_strlcpy)
> +  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (strlcpy);
> +# endif
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/wcslcpy-avx2.S b/sysdeps/x86_64/multiarch/wcslcpy-avx2.S
> new file mode 100644
> index 0000000000..dafc20ded0
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/wcslcpy-avx2.S
> @@ -0,0 +1,4 @@
> +#define STRLCPY __wcslcpy_avx2
> +#define USE_AS_WCSLCPY 1
> +
> +#include "strlcpy-avx2.S"
> diff --git a/sysdeps/x86_64/multiarch/wcslcpy-generic.c b/sysdeps/x86_64/multiarch/wcslcpy-generic.c
> new file mode 100644
> index 0000000000..ffd3c0e846
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/wcslcpy-generic.c
> @@ -0,0 +1,25 @@
> +/* wcslcpy generic.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +
> +#include <isa-level.h>
> +#if ISA_SHOULD_BUILD (1)
> +# define __wcslcpy __wcslcpy_generic
> +# include <wcsmbs/wcslcpy.c>
> +
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/wcslcpy.c b/sysdeps/x86_64/multiarch/wcslcpy.c
> new file mode 100644
> index 0000000000..371ef9626c
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/wcslcpy.c
> @@ -0,0 +1,35 @@
> +/* Multiple versions of wcslcpy.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2023 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +/* Define multiple versions only for the definition in libc.  */
> +#if IS_IN (libc)
> +# define __wcslcpy __redirect_wcslcpy
> +# include <wchar.h>
> +# undef __wcslcpy
> +
> +# define SYMBOL_NAME wcslcpy
> +# include "ifunc-strlcpy.h"
> +
> +libc_ifunc_redirected (__redirect_wcslcpy, __wcslcpy, IFUNC_SELECTOR ());
> +weak_alias (__wcslcpy, wcslcpy)
> +# ifdef SHARED
> +__hidden_ver1 (__wcslcpy, __GI___wcslcpy, __redirect_wcslcpy)
> +  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (wcslcpy);
> +# endif
> +#endif
> --
> 2.38.1
>
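
A side note on the entry-point page-cross test, since it is easy to
misread: `sall $20, %eax' moves the 12 page-offset bits of the source
pointer to the top of the register, so a single unsigned compare decides
whether a full 32-byte load would touch the next page.  A minimal C
sketch of the same arithmetic (illustration only; `page_cross' is a
hypothetical helper name, not part of the patch):

    #include <stdbool.h>
    #include <stdint.h>

    static bool
    page_cross (uintptr_t src)
    {
      /* Shift the low 12 bits (offset within a 4096-byte page) into
         bits 20..31, mirroring "mov %esi, %eax; sall $20, %eax".  */
      uint32_t shifted = (uint32_t) src << 20;
      /* "ja L(page_cross)": taken iff offset > PAGE_SIZE - VEC_SIZE,
         i.e. a 32-byte load starting at SRC would cross a page.  */
      return shifted > ((4096u - 32u) << 20);
    }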