From: Sunil Pandey
Date: Sun, 2 Jul 2023 11:37:51 -0700
Subject: Re: [PATCH] x86_64: Implement AVX2 version of strlcpy/wcslcpy function
To: Noah Goldstein
Cc: libc-alpha@sourceware.org, hjl.tools@gmail.com
References: <20230630204812.2059831-1-skpgkp2@gmail.com>

On Sun, Jul 2, 2023 at 10:03 AM Noah Goldstein wrote:

> On Fri, Jun 30, 2023 at 3:48 PM Sunil K Pandey via Libc-alpha wrote:
> >
> > This patch optimizes strlcpy/wcslcpy string functions for AVX2.
> > ---
> >  sysdeps/x86_64/multiarch/Makefile          |   4 +
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c |  18 +
> >  sysdeps/x86_64/multiarch/ifunc-strlcpy.h   |  34 ++
> >  sysdeps/x86_64/multiarch/strlcpy-avx2.S    | 446 +++++++++++++++++++++
> >  sysdeps/x86_64/multiarch/strlcpy-generic.c |  25 ++
> >  sysdeps/x86_64/multiarch/strlcpy.c         |  36 ++
> >  sysdeps/x86_64/multiarch/wcslcpy-avx2.S    |   4 +
> >  sysdeps/x86_64/multiarch/wcslcpy-generic.c |  25 ++
> >  sysdeps/x86_64/multiarch/wcslcpy.c         |  35 ++
> >  9 files changed, 627 insertions(+)
> >  create mode 100644 sysdeps/x86_64/multiarch/ifunc-strlcpy.h
> >  create mode 100644 sysdeps/x86_64/multiarch/strlcpy-avx2.S
> >  create mode 100644 sysdeps/x86_64/multiarch/strlcpy-generic.c
> >  create mode 100644 sysdeps/x86_64/multiarch/strlcpy.c
> >  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy-avx2.S
> >  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy-generic.c
> >  create mode 100644 sysdeps/x86_64/multiarch/wcslcpy.c
> >
> > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> > index e1e894c963..7e3fc081df 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -82,6 +82,8 @@ sysdep_routines += \
> >    strcpy-sse2 \
> >    strcpy-sse2-unaligned \
> >    strcspn-sse4 \
> > +  strlcpy-avx2 \
> > +  strlcpy-generic \
> >    strlen-avx2 \
> >    strlen-avx2-rtm \
> >    strlen-evex \
> > @@ -153,6 +155,8 @@ sysdep_routines += \
> >    wcscpy-evex \
> >    wcscpy-generic \
> >    wcscpy-ssse3 \
> > +  wcslcpy-avx2 \
> > +  wcslcpy-generic \
> >    wcslen-avx2 \
> >    wcslen-avx2-rtm \
> >    wcslen-evex \
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index 5427ff1907..9928dee187 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -751,6 +751,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                                       1,
> >                                       __strncat_sse2_unaligned))
> >
> > +  /* Support sysdeps/x86_64/multiarch/strlcpy.c.  */
> > +  IFUNC_IMPL (i, name, strlcpy,
> > +              X86_IFUNC_IMPL_ADD_V3 (array, i, strlcpy,
> > +                                     CPU_FEATURE_USABLE (AVX2),
> > +                                     __strlcpy_avx2)
> > +              X86_IFUNC_IMPL_ADD_V1 (array, i, strlcpy,
> > +                                     1,
> > +                                     __strlcpy_generic))
> > +
> >    /* Support sysdeps/x86_64/multiarch/strncpy.c.  */
> >    IFUNC_IMPL (i, name, strncpy,
> >                X86_IFUNC_IMPL_ADD_V4 (array, i, strncpy,
> > @@ -917,6 +926,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                                       1,
> >                                       __wcscpy_generic))
> >
> > +  /* Support sysdeps/x86_64/multiarch/wcslcpy.c.  */
> > +  IFUNC_IMPL (i, name, wcslcpy,
> > +              X86_IFUNC_IMPL_ADD_V3 (array, i, wcslcpy,
> > +                                     CPU_FEATURE_USABLE (AVX2),
> > +                                     __wcslcpy_avx2)
> > +              X86_IFUNC_IMPL_ADD_V1 (array, i, wcslcpy,
> > +                                     1,
> > +                                     __wcslcpy_generic))
> > +
> >    /* Support sysdeps/x86_64/multiarch/wcsncpy.c.  */
> >    IFUNC_IMPL (i, name, wcsncpy,
> >                X86_IFUNC_IMPL_ADD_V4 (array, i, wcsncpy,
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-strlcpy.h b/sysdeps/x86_64/multiarch/ifunc-strlcpy.h
> > new file mode 100644
> > index 0000000000..982a30d15b
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/ifunc-strlcpy.h
> > @@ -0,0 +1,34 @@
> > +/* Common definition for ifunc selections.
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +#include
> > +
> > +extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
> > +extern __typeof (REDIRECT_NAME) OPTIMIZE (generic) attribute_hidden;
> > +
> > +static inline void *
> > +IFUNC_SELECTOR (void)
> > +{
> > +  const struct cpu_features *cpu_features = __get_cpu_features ();
> > +
> > +  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2))
> > +    return OPTIMIZE (avx2);
> > +
> > +  return OPTIMIZE (generic);
> > +}
> > diff --git a/sysdeps/x86_64/multiarch/strlcpy-avx2.S b/sysdeps/x86_64/multiarch/strlcpy-avx2.S
> > new file mode 100644
> > index 0000000000..cf54b1e990
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/strlcpy-avx2.S
> > @@ -0,0 +1,446 @@
> > +/* Strlcpy/wcslcpy optimized with AVX2.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +#include
> > +
> > +#if ISA_SHOULD_BUILD (3)
> > +
> > +# include
> > +
> > +# ifndef VEC_SIZE
> > +#  include "x86-avx-vecs.h"
> > +# endif
> > +
> > +# ifndef STRLCPY
> > +#  define STRLCPY __strlcpy_avx2
> > +# endif
> > +
> > +
> > +# ifdef USE_AS_WCSLCPY
> > +#  define CHAR_SIZE 4
> > +#  define MOVU movl
> > +#  define VPCMPEQ vpcmpeqd
> > +#  define VPMINU vpminud
> > +# else
> > +#  define CHAR_SIZE 1
> > +#  define MOVU movb
> > +#  define VPCMPEQ vpcmpeqb
> > +#  define VPMINU vpminub
> > +# endif
> > +
> > +# define PMOVMSK vpmovmskb
> > +# define PAGE_SIZE 4096
> > +# define VEC_SIZE 32
> > +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
> > +
> > +        .section SECTION(.text),"ax",@progbits
> > +/* Aligning entry point to 64 byte, provides better performance for
> > +   one vector length string.  */
> > +
> > +ENTRY_P2ALIGN (STRLCPY, 6)
> > +# ifdef __ILP32__
> > +        /* Clear the upper 32 bits.  */
> > +        movl %edx, %edx
> > +# endif
> > +
> > +        /* Zero out vector register for end of string comparison.  */
> > +        vpxor %VMM(0), %VMM(0), %VMM(0)
> > +        /* Save source pointer for return calculation.  */
> > +        mov %rsi, %r8
> > +        mov %esi, %eax
> > +        sall $20, %eax
> > +        cmpl $((PAGE_SIZE - (VEC_SIZE)) << 20), %eax
> > +        ja L(page_cross)
> > +
> > +L(page_cross_continue):
> > +        /* Load first vector.  */
> > +        VMOVU (%rsi), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(2)
> > +        PMOVMSK %VMM(2), %eax
> > +        test %eax, %eax
> > +        jnz L(ret_vec_x1)
> > +
> > +        test %rdx, %rdx
> > +        jz L(continue_second_vector)
> > +
> > +        /* Check whether we can copy full vector.  */
> > +        cmp $CHAR_PER_VEC, %rdx
> > +        jbe L(page_cross_small_vec_copy)
> > +        /* Copy first vector.  */
> > +        VMOVU %VMM(1), (%rdi)
> > +        sub $CHAR_PER_VEC, %rdx
> > +
> > +L(continue_second_vector):
> > +        /* Align RSI pointer and adjust RDI based on offset.  */
> > +        mov %rsi, %rax
> > +        and $-VEC_SIZE, %rsi
> > +        sub %rsi, %rax
> > +        sub %rax, %rdi
> > +
> > +        /* Check if string already copied N char, and RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(skip_copy_alignment_fix)
> > +
> > +        /* Adjust RDX for copy alignment fix.  */
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rax
> > +# endif
> > +        add %rax, %rdx
> > +
> > +L(skip_copy_alignment_fix):
> > +        /* Load second vector.  */
> > +        VMOVA (VEC_SIZE * 1)(%rsi), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(2)
> > +        vptest %VMM(2), %VMM(2)
> > +        jnz L(ret_vec_x2)
> > +
> > +        /* Skip copy if RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(continue_third_vector)
> > +
> > +        /* Jump below/equal(instead of below) used here, because last
> > +           copy character must be NULL.  */
> > +        cmp $CHAR_PER_VEC, %rdx
> > +        jbe L(partial_copy_second_vector)
> > +
> > +        sub $CHAR_PER_VEC, %rdx
> > +        /* Copy second vector.  */
> > +        VMOVU %VMM(1), (VEC_SIZE * 1)(%rdi)
> > +
> > +L(continue_third_vector):
> > +        /* Load third vector.  */
> > +        VMOVA (VEC_SIZE * 2)(%rsi), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(2)
> > +        vptest %VMM(2), %VMM(2)
> > +        jnz L(ret_vec_x3)
> > +
> > +        /* Skip copy if RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(continue_fourth_vector)
> > +
> > +        cmp $CHAR_PER_VEC, %rdx
> > +        jbe L(partial_copy_third_vector)
> > +
> > +        sub $CHAR_PER_VEC, %rdx
> > +        /* Copy third vector.  */
> > +        VMOVU %VMM(1), (VEC_SIZE * 2)(%rdi)
> > +
> > +L(continue_fourth_vector):
> > +        /* Load fourth vector.  */
> > +        VMOVA (VEC_SIZE * 3)(%rsi), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(2)
> > +        vptest %VMM(2), %VMM(2)
> > +        jnz L(ret_vec_x4)
> > +
> > +        /* Skip copy if RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(loop_4x_align)
> > +
> > +        cmp $CHAR_PER_VEC, %rdx
> > +        jbe L(partial_copy_fourth_vector)
> > +
> > +        sub $CHAR_PER_VEC, %rdx
> > +        /* Copy fourth vector.  */
> > +        VMOVU %VMM(1), (VEC_SIZE * 3)(%rdi)
> > +
> > +
> > +L(loop_4x_align):
> > +        /* Jump to loop if RSI is already 4 vector align.  */
> > +        test $(VEC_SIZE * 4 - 1), %esi
> > +        jz L(loop_4x_read)
> > +
> > +        mov %rsi, %rcx
> > +
> > +        /* Align RSI to 4x vector.  */
> > +        and $(VEC_SIZE * -4), %rsi
> > +        sub %rsi, %rcx
> > +
> > +        /* Adjust RDI for RSI alignment fix.  */
> > +        sub %rcx, %rdi
> > +
> > +        /* Jump to loop if RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(loop_4x_read)
> > +
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rcx
> > +# endif
> > +
> > +        /* Adjust RDX for RSI alignment fix.  */
> > +        add %rcx, %rdx
> > +        jmp L(loop_4x_read)
> > +
> > +        .p2align 4,,6
> > +L(loop_4x_vec):
> > +        /* Skip copy if RDX is 0.  */
> > +        test %rdx, %rdx
> > +        jz L(loop_partial_copy_return)
> > +        cmp $(CHAR_PER_VEC * 4), %rdx
> > +        jbe L(loop_partial_copy)
> > +        VMOVU %VMM(1), (VEC_SIZE * 4)(%rdi)
> > +        VMOVU %VMM(2), (VEC_SIZE * 5)(%rdi)
> > +        VMOVU %VMM(3), (VEC_SIZE * 6)(%rdi)
> > +        VMOVU %VMM(4), (VEC_SIZE * 7)(%rdi)
> > +        sub $(CHAR_PER_VEC * 4), %rdx
> > +
> > +L(loop_partial_copy_return):
> > +        sub $(VEC_SIZE * -4), %rsi
> > +        sub $(VEC_SIZE * -4), %rdi
> > +
> > +L(loop_4x_read):
> > +        VMOVA (VEC_SIZE * 4)(%rsi), %VMM(1)
> > +        VMOVA (VEC_SIZE * 5)(%rsi), %VMM(2)
> > +        VMOVA (VEC_SIZE * 6)(%rsi), %VMM(3)
> > +        VMOVA (VEC_SIZE * 7)(%rsi), %VMM(4)
> > +        VPMINU %VMM(1), %VMM(2), %VMM(5)
> > +        VPMINU %VMM(3), %VMM(4), %VMM(6)
> > +        VPMINU %VMM(5), %VMM(6), %VMM(7)
> > +        VPCMPEQ %VMM(0), %VMM(7), %VMM(7)
> > +        vptest %VMM(7), %VMM(7)
> > +
> > +        jz L(loop_4x_vec)
> > +
> > +        /* Check if string ends in first vector or second vector.  */
> > +        lea (VEC_SIZE * 4)(%rsi), %rax
> > +        sub %r8, %rax
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rax
> > +# endif
> > +        xor %r10, %r10
> > +        VPCMPEQ %VMM(0), %VMM(5), %VMM(6)
> > +        vptest %VMM(6), %VMM(6)
> > +        jnz L(endloop)
> > +        sub $(CHAR_PER_VEC * -2), %rax
> > +        mov $(CHAR_PER_VEC * 2), %r10
> > +        VMOVA %VMM(3), %VMM(1)
> > +        VMOVA %VMM(4), %VMM(2)
> > +
> > +L(endloop):
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(2), %VMM(2)
> > +        PMOVMSK %VMM(1), %rcx
> > +        PMOVMSK %VMM(2), %r9
> > +        shlq $32, %r9
> > +        orq %r9, %rcx
> > +        bsf %rcx, %rcx
> > +        /* Shift RCX by 2, VPMOVMSK has only byte version.  */
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rcx
> > +# endif
> > +        /* At this point RAX has length to return.  */
> > +        add %rcx, %rax
> > +        test %rdx, %rdx
> > +        jz L(ret)
> > +
> > +        /* Add 1 to account for NULL character in RDX comparison.  */
> > +        lea 1(%r10, %rcx), %rcx
> > +        cmp %rdx, %rcx
> > +        cmovb %rcx, %rdx
> > +
> > +L(loop_partial_copy):
> > +        cmp $(CHAR_PER_VEC * 2), %rdx
> > +        jbe L(loop_partial_first_half)
> > +        /* Reload first 2 vector.  */
> > +        VMOVA (VEC_SIZE * 4)(%rsi), %VMM(1)
> > +        VMOVA (VEC_SIZE * 5)(%rsi), %VMM(2)
> > +        VMOVU %VMM(1), (VEC_SIZE * 4)(%rdi)
> > +        VMOVU %VMM(2), (VEC_SIZE * 5)(%rdi)
> > +
> > +L(loop_partial_first_half):
> > +        /* Go back 2 vector from last and use overlapping copy.
> > +           (VEC_SIZE * 4 - VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE)
> > +           (VEC_SIZE * 4 - VEC_SIZE * 1)(%rsi, %rdx, CHAR_SIZE)
> > +         */
> > +        VMOVU (VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE), %VMM(3)
> > +        VMOVU (VEC_SIZE * 3)(%rsi, %rdx, CHAR_SIZE), %VMM(4)
> > +        VMOVU %VMM(3), (VEC_SIZE * 2)(%rdi, %rdx, CHAR_SIZE)
> > +        VMOVU %VMM(4), (VEC_SIZE * 3)(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, (VEC_SIZE * 4 - CHAR_SIZE)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %rdx, %rdx
> > +        vptest %VMM(7), %VMM(7)
> > +        jz L(loop_partial_copy_return)
> > +        ret
> > +
> > +        .p2align 4
> > +L(page_cross):
> > +        mov %rsi, %rcx
> > +        mov %rsi, %r11
> > +        and $-VEC_SIZE, %r11
> > +        and $(VEC_SIZE - 1), %rcx
> > +        VMOVA (%r11), %VMM(1)
> > +        VPCMPEQ %VMM(0), %VMM(1), %VMM(2)
> > +        PMOVMSK %VMM(2), %eax
> > +        shr %cl, %eax
> > +        jz L(page_cross_continue)
> > +
> > +L(ret_vec_x1):
> > +        bsf %eax, %eax
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %eax
> > +# endif
> > +        /* Increment by 1 to account for NULL char.  */
> > +        lea 1(%eax), %ecx
> > +        cmp %rdx, %rcx
> > +        cmovb %rcx, %rdx
> > +        test %rdx, %rdx
> > +        jz L(ret)
> > +
> > +L(page_cross_small_vec_copy):
> > +        cmp $(16 / CHAR_SIZE), %rdx
> > +        jbe L(copy_8_byte_scalar)
> > +        VMOVU (%rsi), %VMM_128(1)
> > +        VMOVU -16(%rsi, %rdx, CHAR_SIZE), %VMM_128(3)
> > +        VMOVU %VMM_128(1), (%rdi)
> > +        VMOVU %VMM_128(3), -16(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %rdx, %rdx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_second_vector)
> > +        ret
> > +
> > +L(copy_8_byte_scalar):
> > +        cmp $(8 / CHAR_SIZE), %rdx
> > +        jbe L(copy_4_byte_scalar)
> > +        movq (%rsi), %r10
> > +        movq -8(%rsi, %rdx, CHAR_SIZE), %r11
> > +        movq %r10, (%rdi)
> > +        movq %r11, -8(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_second_vector)
> > +        ret
> > +
> > +L(copy_4_byte_scalar):
> > +# ifndef USE_AS_WCSLCPY
> > +        cmp $4, %rdx
> > +        jbe L(copy_2_byte_scalar)
> > +# endif
> > +        movl (%rsi), %r10d
> > +        movl -4(%rsi, %rdx, CHAR_SIZE), %r11d
> > +        movl %r10d, (%rdi)
> > +        movl %r11d, -4(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_second_vector)
> > +        ret
> > +
> > +# ifndef USE_AS_WCSLCPY
> > +L(copy_2_byte_scalar):
> > +        cmp $2, %rdx
> > +        jbe L(copy_1_byte_scalar)
> > +        movw (%rsi), %r10w
> > +        movw -(CHAR_SIZE * 3)(%rsi, %rdx, CHAR_SIZE), %r11w
> > +        movw %r10w, (%rdi)
> > +        movw %r11w, -(CHAR_SIZE * 3)(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_second_vector)
> > +        ret
> > +
> > +L(copy_1_byte_scalar):
> > +        MOVU (%rsi), %r10b
> > +        MOVU %r10b, (%rdi)
> > +        MOVU $0, -(CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_second_vector)
> > +        ret
> > +# endif
> > +
> > +L(ret_vec_x2):
> > +        PMOVMSK %VMM(2), %rax
> > +        bsf %rax, %rcx
> > +        /* Calculate return value.  */
> > +        lea VEC_SIZE(%rsi, %rcx), %rax
> > +        sub %r8, %rax
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rax
> > +        shr $2, %rcx
> > +# endif
> > +        inc %rcx
> > +        test %rdx, %rdx
> > +        jz L(ret)
> > +        cmp %rdx, %rcx
> > +        cmovb %rcx, %rdx
> > +
> > +L(partial_copy_second_vector):
> > +        VMOVU (%rsi, %rdx, CHAR_SIZE), %VMM(1)
> > +        VMOVU %VMM(1), (%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, (VEC_SIZE - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_third_vector)
> > +
> > +L(ret):
> > +        ret
> > +
> > +L(ret_vec_x3):
> > +        PMOVMSK %VMM(2), %rax
> > +        bsf %rax, %rcx
> > +        /* Calculate return value.  */
> > +        lea (VEC_SIZE * 2)(%rsi, %rcx), %rax
> > +        sub %r8, %rax
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rax
> > +        shr $2, %rcx
> > +# endif
> > +        inc %rcx
> > +        test %rdx, %rdx
> > +        jz L(ret)
> > +        cmp %rdx, %rcx
> > +        cmovb %rcx, %rdx
> > +
> > +L(partial_copy_third_vector):
> > +        VMOVU (VEC_SIZE)(%rsi, %rdx, CHAR_SIZE), %VMM(1)
> > +        VMOVU %VMM(1), (VEC_SIZE)(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, ((VEC_SIZE * 2) - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_fourth_vector)
> > +        ret
> > +
> > +L(ret_vec_x4):
> > +        PMOVMSK %VMM(2), %rax
> > +        bsf %rax, %rcx
> > +        /* Calculate return value.  */
> > +        lea (VEC_SIZE * 3)(%rsi, %rcx), %rax
> > +        sub %r8, %rax
> > +# ifdef USE_AS_WCSLCPY
> > +        shr $2, %rax
> > +        shr $2, %rcx
> > +# endif
> > +        inc %rcx
> > +        test %rdx, %rdx
> > +        jz L(ret)
> > +        cmp %rdx, %rcx
> > +        cmovb %rcx, %rdx
> > +
> > +L(partial_copy_fourth_vector):
> > +        VMOVU (VEC_SIZE * 2)(%rsi, %rdx, CHAR_SIZE), %VMM(1)
> > +        VMOVU %VMM(1), (VEC_SIZE * 2)(%rdi, %rdx, CHAR_SIZE)
> > +        MOVU $0, ((VEC_SIZE * 3) - CHAR_SIZE * 1)(%rdi, %rdx, CHAR_SIZE)
> > +        xor %edx, %edx
> > +        vptest %VMM(2), %VMM(2)
> > +        jz L(continue_fourth_vector)
> > +        ret
> > +
> > +END (STRLCPY)
>
> Is strlcpy/strlcat integratable with existing strncat impl? Had figured
> they would fit in the same file.
>

Hi Noah,

It may not be a good idea to put strlcpy/strlcat in the existing
strncpy/strncat impl file, as the strlcpy/strlcat functions are associated
with the GLIBC_2.38 ABI.

--Sunil

> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/strlcpy-generic.c b/sysdeps/x86_64/multiarch/strlcpy-generic.c
> > new file mode 100644
> > index 0000000000..eee3b7b086
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/strlcpy-generic.c
> > @@ -0,0 +1,25 @@
> > +/* strlcpy generic.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +
> > +#include
> > +#if ISA_SHOULD_BUILD (1)
> > +# define __strlcpy __strlcpy_generic
> > +# include
> > +
> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/strlcpy.c b/sysdeps/x86_64/multiarch/strlcpy.c
> > new file mode 100644
> > index 0000000000..ded41fbcfb
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/strlcpy.c
> > @@ -0,0 +1,36 @@
> > +/* Multiple versions of strlcpy.
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +/* Define multiple versions only for the definition in libc.  */
> > +#if IS_IN (libc)
> > +# define __strlcpy __redirect_strlcpy
> > +# include
> > +# undef __strlcpy
> > +
> > +# define SYMBOL_NAME strlcpy
> > +# include "ifunc-strlcpy.h"
> > +
> > +libc_ifunc_redirected (__redirect_strlcpy, __strlcpy, IFUNC_SELECTOR ());
> > +weak_alias (__strlcpy, strlcpy)
> > +
> > +# ifdef SHARED
> > +__hidden_ver1 (__strlcpy, __GI___strlcpy, __redirect_strlcpy)
> > +  __attribute__ ((visibility ("hidden"))) __attribute_copy__ (strlcpy);
> > +# endif
> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/wcslcpy-avx2.S b/sysdeps/x86_64/multiarch/wcslcpy-avx2.S
> > new file mode 100644
> > index 0000000000..dafc20ded0
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/wcslcpy-avx2.S
> > @@ -0,0 +1,4 @@
> > +#define STRLCPY __wcslcpy_avx2
> > +#define USE_AS_WCSLCPY 1
> > +
> > +#include "strlcpy-avx2.S"
> > diff --git a/sysdeps/x86_64/multiarch/wcslcpy-generic.c b/sysdeps/x86_64/multiarch/wcslcpy-generic.c
> > new file mode 100644
> > index 0000000000..ffd3c0e846
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/wcslcpy-generic.c
> > @@ -0,0 +1,25 @@
> > +/* wcslcpy generic.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +
> > +#include
> > +#if ISA_SHOULD_BUILD (1)
> > +# define __wcslcpy __wcslcpy_generic
> > +# include
> > +
> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/wcslcpy.c b/sysdeps/x86_64/multiarch/wcslcpy.c
> > new file mode 100644
> > index 0000000000..371ef9626c
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/wcslcpy.c
> > @@ -0,0 +1,35 @@
> > +/* Multiple versions of wcslcpy.
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   .  */
> > +
> > +/* Define multiple versions only for the definition in libc.  */
> > +#if IS_IN (libc)
> > +# define __wcslcpy __redirect_wcslcpy
> > +# include
> > +# undef __wcslcpy
> > +
> > +# define SYMBOL_NAME wcslcpy
> > +# include "ifunc-strlcpy.h"
> > +
> > +libc_ifunc_redirected (__redirect_wcslcpy, __wcslcpy, IFUNC_SELECTOR ());
> > +weak_alias (__wcslcpy, wcslcpy)
> > +# ifdef SHARED
> > +__hidden_ver1 (__wcslcpy, __GI___wcslcpy, __redirect_wcslcpy)
> > +  __attribute__((visibility ("hidden"))) __attribute_copy__ (wcslcpy);
> > +# endif
> > +#endif
> > --
> > 2.38.1
> >
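
For reference while reading the quoted assembly, the contract it has to meet can be sketched in portable C. This is only a minimal model under stated assumptions, not the glibc implementation: `ref_strlcpy` is a hypothetical name, and the behaviour encoded here (copy at most `size - 1` bytes, always NUL-terminate when `size` is non-zero, return the full source length) is the usual BSD `strlcpy` contract that the patch comments about the return length and the terminating character appear to follow.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical reference model of strlcpy semantics; not glibc code.  */
static size_t
ref_strlcpy (char *dst, const char *src, size_t size)
{
  size_t src_len = strlen (src);

  if (size != 0)
    {
      /* Copy at most size - 1 bytes so the terminator always fits.  */
      size_t n = src_len < size - 1 ? src_len : size - 1;
      memcpy (dst, src, n);
      dst[n] = '\0';
    }

  /* The return value is the source length, even when truncated.  */
  return src_len;
}
```

A caller can detect truncation by checking whether the return value is greater than or equal to `size`.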
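
The entry sequence in the quoted patch (`sall $20, %eax` followed by an unsigned compare against `(PAGE_SIZE - VEC_SIZE) << 20`) decides whether an unaligned 32-byte load from the source could run onto the next page. The following C sketch mirrors that arithmetic as I read it; `load_crosses_page` is a hypothetical helper name, and the constants are copied from the patch's `PAGE_SIZE` and `VEC_SIZE` definitions.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define VEC_SIZE 32

/* Hypothetical helper mirroring the shift-and-compare check.
   Shifting the low 32 address bits left by 20 keeps only the 12-bit
   page offset, moved to the top of the register, so one unsigned
   compare tests "page offset > PAGE_SIZE - VEC_SIZE".  */
static bool
load_crosses_page (uintptr_t addr)
{
  uint32_t shifted = (uint32_t) addr << 20;
  return shifted > ((uint32_t) (PAGE_SIZE - VEC_SIZE) << 20);
}
```

An offset of exactly `PAGE_SIZE - VEC_SIZE` still fits in the page, which is why the comparison is strict (`ja` in the assembly).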