From: Sergei Lewis
Date: Thu, 2 Feb 2023 10:02:35 +0000
Subject: Re: [PATCH 2/2] riscv: vectorised mem* and str* functions
To: Adhemerval Zanella Netto
Cc: libc-alpha@sourceware.org
In-Reply-To: <87479d1a-abf3-b564-8613-2a48d26527b5@linaro.org>
References: <20230201095232.15942-1-slewis@rivosinc.com>
 <20230201095232.15942-2-slewis@rivosinc.com>
 <87479d1a-abf3-b564-8613-2a48d26527b5@linaro.org>

Thank you very much for the detailed review!

> > +#ifndef __riscv_strict_align
>
> Would this be defined by compiler as predefine macro or is it just a debug
> switch? If the later, I think it would be better to remove it.
The intent is to make use of the gcc feature currently in flight here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610115.html
so the library can detect, at build time, that the build environment has been
configured to avoid unaligned access.

> It is really worth to add a strrchr optimization? The generic implementation
> already calls strchr (which should be optimized).

The performance win is actually quite significant. Consider searching for the
first space in a piece of text compared to the last space: reusing strchr in a
loop, as the generic implementation does, causes a branch every few bytes of
input and essentially destroys the benefit gained from vectorisation. Worse,
vectorised functions typically have a few instructions of setup overhead before
the real work begins, and those would likewise be repeated every few bytes of
input. (A rough sketch of the generic loop is below.)

> I wonder if we could adapt the generic implementation, so riscv only reimplements
> the vectorized search instead of all the boilerplace to generate the table and
> early tests.

The issue is that the table looks different for different implementations, and
possibly even for different cases within the same implementation. For example,
some of the existing implementations use a 256-byte table with one byte per
character rather than the 256-bit bitfield I build here. Going forward we would
potentially want such a path for riscv as well, selecting between the two based
on the length of the character set: common use in parsing tends to produce very
small character sets, but for a large set - or potentially always, depending on
the architecture - indexed loads/stores become faster than the bitfield
approach I use here. (A sketch contrasting the two table layouts is below.)

I am integrating all the other feedback, and will also work with your changes
to the generic implementations - it looks like there is quite a bit of
potential to reduce and simplify my changeset once yours goes in.
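To spell out the __riscv_strict_align point: the sketch below is only how I
expect the predefine to be consumed once the gcc patch lands - the macro name
is provisional until then, and may_use_unaligned_words is just an illustrative
helper name, not anything in the patch itself.

    /* Provisional sketch: __riscv_strict_align is the predefine proposed in
       the linked gcc patch and may still change before it is merged.  The
       point is that the unaligned fast paths compile away entirely on
       strict-alignment configurations.  */
    #include <stddef.h>

    static inline int
    may_use_unaligned_words (size_t len)
    {
    #ifdef __riscv_strict_align
      (void) len;
      return 0;                     /* e.g. built with -mstrict-align: bytewise only */
    #else
      return len >= sizeof (long);  /* enough data for a word-at-a-time pass */
    #endif
    }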
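To make the strrchr point concrete, the generic implementation is roughly the
loop below (a simplified C sketch, not the actual glibc source; the function
name generic_strrchr is mine). Every occurrence of the character re-enters
strchr, with its call overhead and a hard-to-predict branch, which is exactly
what the vectorised version avoids by continuing the scan in-register.

    #include <string.h>

    /* Roughly what the generic strrchr does: repeat strchr until it fails.
       A dense character means a call and a branch every few bytes of input.  */
    char *
    generic_strrchr (const char *s, int c)
    {
      const char *found = NULL;

      if ((char) c == '\0')
        return strchr (s, '\0');

      for (const char *p = strchr (s, c); p != NULL; p = strchr (s, c))
        {
          found = p;
          s = p + 1;
        }
      return (char *) found;
    }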
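And to illustrate the two table shapes I mean for strspn/strcspn: the 256-bit
bitfield this patch builds (in vector registers rather than memory) versus the
256-byte one-byte-per-character table some ports use. This is only a C model of
the data layout; build_bitfield and build_byte_table are illustrative names.

    #include <stdint.h>
    #include <string.h>

    /* 256-bit bitfield: 32 bytes, one bit per possible byte value.
       Membership tests need a shift and a mask, but the whole table fits in
       one or two vector registers.  */
    static void
    build_bitfield (uint8_t set[32], const unsigned char *accept)
    {
      memset (set, 0, 32);
      for (; *accept != '\0'; ++accept)
        set[*accept >> 3] |= (uint8_t) (1u << (*accept & 7));
    }

    /* 256-byte table: one byte per character.  Lookups become plain indexed
       loads, which can win for large character sets.  */
    static void
    build_byte_table (uint8_t set[256], const unsigned char *accept)
    {
      memset (set, 0, 256);
      for (; *accept != '\0'; ++accept)
        set[*accept] = 1;
    }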
On Wed, Feb 1, 2023 at 5:38 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:

> On 01/02/23 06:52, Sergei Lewis wrote:
> > Initial implementations of memchr, memcmp, memcpy, memmove, memset, strchr,
> > strcmp, strcpy, strlen, strncmp, strncpy, strnlen, strrchr, strspn
> > targeting the riscv "V" extension, version 1.0
> >
> > The vectorised implementations assume VLENB of at least 128 and at least 32
> > registers (as mandated by the "V" extension spec). They also assume that
> > VLENB is a power of two which is no larger than the page size, and (as
> > vectorised code in glibc for other platforms does) that it is safe to read
> > past null terminators / buffer ends provided one does not cross a page
> > boundary.
> >
> > Signed-off-by: Sergei Lewis
>
> Some comments that might be useful since I am working the generic
> implementations below.
>
> Also, I think it should be splitted with one implementation per patch, unless the
> implementation is tied together (as for strchr/strchrnul for instance). Does
> the vectorized routine only work for rv64?
>
> > ---
> >  sysdeps/riscv/rv64/rvv/Implies     |   2 +
> >  sysdeps/riscv/rv64/rvv/memchr.S    | 127 +++++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memcmp.S    |  93 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memcpy.S    | 154 +++++++++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memmove.c   |  22 ++++
> >  sysdeps/riscv/rv64/rvv/memset.S    |  89 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strchr.S    |  92 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strchrnul.c |  22 ++++
> >  sysdeps/riscv/rv64/rvv/strcmp.S    | 108 +++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strcpy.S    |  72 +++++++++++
> >  sysdeps/riscv/rv64/rvv/strcspn.c   |  22 ++++
> >  sysdeps/riscv/rv64/rvv/strlen.S    |  67 ++++++++++
> >  sysdeps/riscv/rv64/rvv/strncmp.S   | 104 ++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strncpy.S   |  96 +++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strnlen.S   |  81 +++++++++++++
> >  sysdeps/riscv/rv64/rvv/strrchr.S   |  88 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strspn.S    | 189 +++++++++++++++++++++++++++++
> >  17 files changed, 1428 insertions(+)
> >  create mode 100644 sysdeps/riscv/rv64/rvv/Implies
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memcmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memcpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memmove.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memset.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strchrnul.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcspn.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strlen.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strncmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strncpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strnlen.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strrchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strspn.S
> >
> > diff --git a/sysdeps/riscv/rv64/rvv/Implies b/sysdeps/riscv/rv64/rvv/Implies
> > new file mode 100644
> > index 0000000000..b07b4cb906
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/Implies
> > @@ -0,0 +1,2 @@
> > +riscv/rv64/rvd
> > +
> > diff --git a/sysdeps/riscv/rv64/rvv/memchr.S b/sysdeps/riscv/rv64/rvv/memchr.S
> > new file mode 100644
> > index 0000000000..a7e32b8f25
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/memchr.S
> > @@ -0,0 +1,127 @@
> > +
>
> Spurious new line at the start. We also require a brief comment describing
> the file contents for newer files.
>
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
>
> Not sure 2012 range fits here.
>
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + .
*/ > > + > > + > > +#include > > + > > + > > +/* Optimised memchr for riscv with vector extension > > + * Assumptions: > > + * - cpu becomes bandwidth limited at or before > > + * 2 vector register sized read/write operations > > + * + 2 scalar operations > > + * + conditional branch > > + */ > > + > > +.globl memchr > > +.type memchr,@function > > + > > +.align 2 > > +memchr: > > We have the ENTRY macro for that. > > > + beqz a2, .Lnot_found > > Maybe use the L macro here for local labels; > > > + csrr t1, vlenb > > + bgeu a2, t1, .Lvector_path /* only use vector path if we're > scanning > > + at least vlenb bytes */ > > + > > +#ifndef __riscv_strict_align > > Would this be defined by compiler as predefine macro or is it just a debug > switch? If the later, I think it would be better to remove it. > > > + li a3, 8 > > + blt a2, a3, .Lbytewise > > + > > + li t1, 0x0101010101010101 > > + slli a4, t1, 7 /* a4 = 0x8080808080808080 */ > > + mul t2, a1, t1 /* entirety of t2 is now repeats of target > character; > > + assume mul is at worst no worse than > 3*(shift+OR), > > + otherwise do that instead */ > > + > > +/* > > + * strategy: > > + * t4 = ((*a0) ^ t2) > > + * - now t4 contains zero bytes if and only if next word of memory > > + * had target character at those positions > > + * > > + * t4 = ((t4-0x0101010101010101) & ~t4) & 0x8080808080808080 > > + * - all nonzero bytes of t4 become 0; zero bytes become 0x80 > > + * > > + * if t4 is nonzero, find the index of the byte within it, add to a0 > and return > > + * otherwise, loop > > + */ > > + > > +1: > > + ld t4, (a0) /* t4 = load next 8 bytes */ > > + xor t4, t4, t2 > > + sub t5, t4, t1 > > + not t4, t4 > > + and t4, t5, t4 > > + and t4, t4, a4 > > + bnez t4, .Lbytewise /* could use ctzw, mod+lookup or just binary > chop > > + to locate byte of interest in t4 but > profiling > > + shows these approaches are at best no > better */ > > + addi a2, a2, -8 > > + addi a0, a0, 8 > > + bgeu a2, a3, 1b > > + beqz a2, .Lnot_found > > +#endif // __riscv_strict_align > > + > > +/* too little data for a dword. mask calculation and branch mispredict > costs > > + make checking a word not worthwhile. degrade to bytewise search. */ > > + > > +.Lbytewise: > > + add t2, a0, a2 > > + > > +1: > > + lb t1, (a0) > > + beq t1, a1, .Lfound > > + addi a0, a0, 1 > > + blt a0, t2, 1b > > + > > +.Lnot_found: > > + mv a0, zero > > +.Lfound: > > + ret > > + > > +.Lvector_path: > > + vsetvli t2, a2, e8, m2, ta, ma > > + > > +1: > > + vle8.v v2, (a0) > > + vmseq.vx v0, v2, a1 > > + vfirst.m t3, v0 > > + bgez t3, .Lvec_found > > + add a0, a0, t2 > > + sub a2, a2, t2 > > + bge a2, t2, 1b > > + bnez a2, 2f > > + mv a0, zero > > + ret > > + > > +2: > > + vsetvli t2, a2, e8, m2, ta, ma > > + vle8.v v2, (a0) > > + vmseq.vx v0, v2, a1 > > + vfirst.m t3, v0 > > + bgez t3, .Lvec_found > > + mv a0, zero > > + ret > > + > > +.Lvec_found: > > + add a0, a0, t3 > > + ret > > + > > +.size memchr, .-memchr > > +libc_hidden_builtin_def (memchr) > > \ No newline at end of file > > Please add a newline. > > > diff --git a/sysdeps/riscv/rv64/rvv/strcpy.S > b/sysdeps/riscv/rv64/rvv/strcpy.S > > new file mode 100644 > > index 0000000000..b21909d66f > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strcpy.S > > @@ -0,0 +1,72 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. 
> > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . */ > > + > > + > > You can add a optimize stpcpy and use to implement strcpy on top of that > (as my generic proposal does [1]). ARMv6 does something similar [2] > > [1] > https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-12-adhemerval.zanella@linaro.org/ > [2] > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/arm/armv6/strcpy.S;h=e9f63a56c1c605a21b05f7ac21412585b0705171;hb=HEAD > > > +#include > > + > > +.globl strcpy > > +.type strcpy,@function > > + > > +/* > > + * optimized strcpy for riscv with vector extension > > + * assumptions: > > + * - vlenb is a power of 2 > > + * - page size >= 2*vlenb > > + */ > > + > > +.align 2 > > +strcpy: > > + mv t0, a0 /* copy dest so we can > return it */ > > + > > + csrr t1, vlenb /* find vlenb*2 */ > > + add t1, t1, t1 > > + > > + addi t2, t1, -1 /* mask unaligned part of > ptr */ > > + and t2, a1, t2 > > + beqz t2, .Laligned > > + > > + sub t2, t1, t2 /* search enough to align > ptr */ > > + vsetvli t2, t2, e8, m2, tu, mu > > + vle8.v v2, (a1) > > + vmseq.vx v4, v2, zero > > + vmsif.m v0, v4 /* copy but not past null */ > > + vfirst.m t3, v4 > > + vse8.v v2, (t0), v0.t > > + bgez t3, .Ldone > > + add t0, t0, t2 > > + add a1, a1, t2 > > + > > +.Laligned: > > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per > pass */ > > + > > +1: > > + vle8.v v2, (a1) > > + add a1, a1, t1 > > + vmseq.vx v4, v2, zero > > + vmsif.m v0, v4 > > + vfirst.m t3, v4 > > + vse8.v v2, (t0), v0.t > > + add t0, t0, t1 > > + bltz t3, 1b > > + > > +.Ldone: > > + ret > > + > > +.size strcpy, .-strcpy > > +libc_hidden_builtin_def (strcpy) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strcspn.c > b/sysdeps/riscv/rv64/rvv/strcspn.c > > new file mode 100644 > > index 0000000000..f0595a72fb > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strcspn.c > > @@ -0,0 +1,22 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . 
*/ > > + > > + > > +/* strcspn is implemented in strspn.S > > + */ > > diff --git a/sysdeps/riscv/rv64/rvv/strlen.S > b/sysdeps/riscv/rv64/rvv/strlen.S > > new file mode 100644 > > index 0000000000..c77d500693 > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strlen.S > > @@ -0,0 +1,67 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . */ > > + > > + > > +#include > > + > > +.globl strlen > > +.type strlen,@function > > + > > +/* > > + * optimized strlen for riscv with vector extension > > + * assumptions: > > + * - vlenb is a power of 2 > > + * - page size >= 2*vlenb > > + */ > > + > > +.align 2 > > +strlen: > > + mv t4, a0 /* copy of buffer start */ > > + csrr t1, vlenb /* find vlenb*2 */ > > + add t1, t1, t1 > > + addi t2, t1, -1 /* mask off unaligned part of > ptr */ > > + and t2, a0, t2 > > + beqz t2, .Laligned > > + > > + sub t2, t1, t2 /* search fwd to align ptr */ > > + vsetvli t2, t2, e8, m2, ta, ma > > + vle8.v v2, (a0) > > + vmseq.vx v0, v2, zero > > + vfirst.m t3, v0 > > + bgez t3, .Lfound > > + add a0, a0, t2 > > + > > +.Laligned: > > + vsetvli zero, t1, e8, m2, ta, ma /* search 2*vlenb bytes per > pass */ > > + add t4, t4, t1 > > + > > +1: > > + vle8.v v2, (a0) > > + add a0, a0, t1 > > + vmseq.vx v0, v2, zero > > + vfirst.m t3, v0 > > + bltz t3, 1b > > + > > +.Lfound: /* found the 0; subtract > */ > > + sub a0, a0, t4 /* buffer start from current > ptr */ > > + add a0, a0, t3 /* and add offset into > fetched */ > > + ret /* data to get length */ > > + > > +.size strlen, .-strlen > > +libc_hidden_builtin_def (strlen) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strncmp.S > b/sysdeps/riscv/rv64/rvv/strncmp.S > > new file mode 100644 > > index 0000000000..863e5cb525 > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strncmp.S > > @@ -0,0 +1,104 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . 
*/ > > + > > + > > +#include > > + > > +.globl strncmp > > +.type strncmp,@function > > + > > +.align 2 > > + > > +/* as strcmp, but with added checks on a2 (max count) > > + */ > > + > > +strncmp: > > + csrr t1, vlenb /* find vlenb*2 */ > > + add t1, t1, t1 > > + blt a2, t1, .Ltail /* degrade if max < vlenb*2 > */ > > + vsetvli zero, t1, e8, m2, ta, mu > > + vid.v v30 > > + addi t2, t1, -1 /* mask unaligned part of > ptr */ > > + and t6, a0, t2 /* unaligned part of lhs */ > > + and t5, a1, t2 /* unaligned part of rhs */ > > + sub t6, t1, t6 /* safe count to read from > lhs */ > > + sub t5, t1, t5 /* same, rhs */ > > + vmsltu.vx v28, v30, t6 /* mask for first part of > lhs */ > > + vmsltu.vx v26, v30, t5 /* mask for first part of > rhs */ > > + vmv.v.x v16, zero > > + vmv.v.x v18, zero > > + > > + > > +1: blt a2, t1, .Ltail > > + vmv.v.v v0, v28 /* lhs mask */ > > + vle8.v v2, (a0), v0.t /* masked load from lhs */ > > + vmseq.vx v16, v2, zero, v0.t /* check loaded bytes for > null */ > > + vmv.v.v v0, v26 /* rhs mask */ > > + vfirst.m t2, v16 /* get lhs check result */ > > + bgez t2, .Ltail /* can we safely check rest > */ > > + vle8.v v4, (a1), v0.t /* masked load from > rhs */ > > + vmseq.vx v18, v4, zero, v0.t /* check partial rhs > */ > > + vmnot.m v0, v28 /* mask for rest of lhs */ > > + vfirst.m t3, v18 /* get check result */ > > + bltz t3, 2f /* test it */ > > + bge t3, t6, .Ltail > > + > > + vmsleu.vx v0, v30, t3 /* select rest of string + > null */ > > + vmsne.vv v0, v2, v4, v0.t /* compare */ > > + vfirst.m t3, v0 > > + bgez t3, 3f > > + mv a0, zero > > + ret > > +3: add a0, a0, t3 > > + add a1, a1, t3 > > + lbu t0, (a0) > > + lbu t1, (a1) > > +.Ldiff: > > + sub a0, t0, t1 > > + ret > > + > > + /* ...no null terminator in first part of lhs or rhs */ > > +2: vle8.v v2, (a0), v0.t /* load rest of lhs */ > > + vmnot.m v0, v26 /* mask for rest of rhs */ > > + vle8.v v4, (a1), v0.t /* load rest of rhs */ > > + vmsne.vv v0, v2, v4 /* compare */ > > + add a0, a0, t1 /* advance ptrs */ > > + vfirst.m t3, v0 > > + add a1, a1, t1 > > + sub a2, a2, t1 > > + bltz t3, 1b > > + > > + sub t3, t3, t1 /* found a diff but we've already advanced a0 and > a1 */ > > + j 3b > > + > > +.Ltail: > > + beqz a2, 1f > > + addi a2, a2, -1 > > + lbu t0, (a0) > > + lbu t1, (a1) > > + bne t0, t1, .Ldiff > > + addi a0, a0, 1 > > + addi a1, a1, 1 > > + bnez t0, .Ltail > > +1: mv a0, zero > > + ret > > + > > + > > +.size strncmp, .-strncmp > > +libc_hidden_builtin_def (strncmp) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strncpy.S > b/sysdeps/riscv/rv64/rvv/strncpy.S > > new file mode 100644 > > index 0000000000..8b3a1e545c > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strncpy.S > > @@ -0,0 +1,96 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. 
> > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . */ > > + > > + > > +#include > > + > > +.globl strncpy > > +.type strncpy,@function > > + > > +/* > > + * optimized strcpy for riscv with vector extension > > + * assumptions: > > + * - vlenb is a power of 2 > > + * - page size >= 2*vlenb > > + */ > > + > > +.align 2 > > +strncpy: > > + mv t0, a0 /* need to return dest so > copy */ > > + > > + csrr t1, vlenb /* find vlenb*2 */ > > + add t1, t1, t1 > > + > > + addi t2, t1, -1 /* mask off unaligned part of > ptr */ > > + and t2, a1, t2 > > + beqz t2, .Laligned > > + > > + sub t2, t1, t2 /* search to align the > pointer */ > > + vsetvli zero, t2, e8, m2, tu, mu > > + vle8.v v2, (a1) > > + vmseq.vx v4, v2, zero > > + vmsif.m v0, v4 /* copy to dest */ > > + vfirst.m t3, v4 > > + bgeu t2, a2, .Ldest_full > > + vse8.v v2, (t0), v0.t > > + bgez t3, .Lterminator_found > > + add t0, t0, t2 > > + add a1, a1, t2 > > + sub a2, a2, t2 > > + beqz a2, .Ldone > > + > > +.Laligned: > > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per > pass */ > > + > > +1: > > + vle8.v v2, (a1) > > + add a1, a1, t1 > > + vmseq.vx v4, v2, zero > > + vmsif.m v0, v4 > > + vfirst.m t3, v4 > > + bgeu t1, a2, .Ldest_full > > + vse8.v v2, (t0), v0.t > > + add t0, t0, t1 > > + sub a2, a2, t1 > > + bltz t3, 1b > > + sub t0, t0, t1 > > + > > +.Lterminator_found: > > + addi sp, sp, -16 > > + sd ra, 0(sp) > > + sd a0, 8(sp) > > + add a0, t0, t3 > > + mv a1, zero > > + sub a2, a2, t3 > > + jal ra, memset > > + ld ra, 0(sp) > > + ld a0, 8(sp) > > + addi sp, sp, 16 > > +.Ldone: > > + ret > > + > > +.Ldest_full: > > + vid.v v6 > > + vmsltu.vx v4, v6, a2 > > + vmand.mm v0, v0, v4 > > + vse8.v v2, (t0), v0.t > > + ret > > + > > +.size strncpy, .-strncpy > > +libc_hidden_builtin_def (strncpy) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strnlen.S > b/sysdeps/riscv/rv64/rvv/strnlen.S > > new file mode 100644 > > index 0000000000..6d7ee65c7a > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strnlen.S > > @@ -0,0 +1,81 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . */ > > + > > + > > +#include > > + > > Maybe use a generic implementation that issues memchr (which should be > optimized > using vector instructions) [3] ? It would be a extra function call, but > it should really > help on both code size and icache pressure. 
> > [3] > https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-6-adhemerval.zanella@linaro.org/ > > > +.globl __strnlen > > +.type __strnlen,@function > > + > > +/* vector optimized strnlen > > + * assume it's safe to read to the end of the page > > + * containing either a null terminator or the last byte of the count or > both, > > + * but not past it > > + * assume page size >= vlenb*2 > > + */ > > + > > +.align 2 > > +__strnlen: > > + mv t4, a0 /* stash a copy of start for later > */ > > + beqz a1, .LzeroCount > > + > > + csrr t1, vlenb /* find vlenb*2 */ > > + add t1, t1, t1 > > + addi t2, t1, -1 /* mask off unaligned part of ptr > */ > > + and t2, a1, a0 > > + beqz t2, .Laligned > > + > > + sub t2, t1, t2 /* search to align pointer to t1 */ > > + bgeu t2, a1, 2f /* check it's safe */ > > + mv t2, a1 /* it's not! look as far as > permitted */ > > +2: vsetvli t2, t2, e8, m2, ta, ma > > + vle8.v v2, (a0) > > + vmseq.vx v0, v2, zero > > + vfirst.m t3, v0 > > + bgez t3, .Lfound > > + add a0, a0, t2 > > + sub a1, a1, t2 > > + bltu a1, t1, .LreachedCount > > + > > +.Laligned: > > + vsetvli zero, t1, e8, m2, ta, ma /* do 2*vlenb bytes per > pass */ > > + > > +1: vle8.v v2, (a0) > > + sub a1, a1, t1 > > + vmseq.vx v0, v2, zero > > + vfirst.m t3, v0 > > + bgez t3, .Lfound > > + add a0, a0, t1 > > + bgeu a1, t1, 1b > > +.LreachedCount: > > + mv t2, a1 /* in case 0 < a1 < t1 */ > > + bnez a1, 2b /* if so, still t2 bytes to check, all safe */ > > +.LzeroCount: > > + sub a0, a0, t4 > > + ret > > + > > +.Lfound: /* found the 0; subtract buffer start from current > pointer */ > > + add a0, a0, t3 /* and add offset into fetched data */ > > + sub a0, a0, t4 > > + ret > > + > > +.size __strnlen, .-__strnlen > > +weak_alias (__strnlen, strnlen) > > +libc_hidden_builtin_def (__strnlen) > > +libc_hidden_builtin_def (strnlen) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strrchr.S > b/sysdeps/riscv/rv64/rvv/strrchr.S > > new file mode 100644 > > index 0000000000..4bef8a3b9c > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strrchr.S > > @@ -0,0 +1,88 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . */ > > + > > + > > +#include > > It is really worth to add a strrchr optimization? The generic > implementation > already calls strchr (which should be optimized). 
> > > + > > +.globl strrchr > > +.type strrchr,@function > > + > > +/* > > + * optimized strrchr for riscv with vector extension > > + * assumptions: > > + * - vlenb is a power of 2 > > + * - page size >= 2*vlenb > > + */ > > + > > +.align 2 > > +strrchr: > > + mv t5, a0 /* stash buffer ptr somewhere safe */ > > + mv a0, zero /* result is nullptr unless we find better > below */ > > + > > + csrr t1, vlenb /* determine vlenb*2 */ > > + add t1, t1, t1 > > + addi t2, t1, -1 /* mask off unaligned part of > ptr */ > > + and t2, t5, t2 > > + beqz t2, .Laligned > > + > > + sub t2, t1, t2 /* search to align ptr to > 2*vlenb */ > > + vsetvli t2, t2, e8, m2, ta, mu > > + > > + vle8.v v2, (t5) /* load data into v2(,v3) */ > > + vmseq.vx v4, v2, zero /* check for null terminator */ > > + vfirst.m t4, v4 /* grab its position, if any */ > > + vmsbf.m v0, v4 /* select valid chars */ > > + vmseq.vx v0, v2, a1, v0.t /* search for candidate byte */ > > + vfirst.m t3, v0 /* grab its position, if any */ > > + bltz t3, 2f /* did we find a candidate? */ > > + > > +3: add a0, t3, t5 /* we did! grab the address */ > > + vmsof.m v1, v0 /* there might be more than > one */ > > + vmandn.mm v0, v0, v1 /* so clear the one we just > found */ > > + vfirst.m t3, v0 /* is there another? */ > > + bgez t3, 3b > > + > > +2: bgez t4, .Ldone /* did we see a null > terminator? */ > > + add t5, t5, t2 > > + > > +.Laligned: > > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per > pass */ > > + > > +1: vle8.v v2, (t5) > > + vmseq.vx v4, v2, zero > > + vfirst.m t4, v4 > > + vmsbf.m v0, v4 > > + vmseq.vx v0, v2, a1, v0.t > > + vfirst.m t3, v0 > > + bltz t3, 2f > > + > > +3: add a0, t3, t5 > > + vmsof.m v1, v0 > > + vmandn.mm v0, v0, v1 > > + vfirst.m t3, v0 > > + bgez t3, 3b > > + > > +2: add t5, t5, t1 > > + bltz t4, 1b > > + > > +.Ldone: > > + ret > > + > > +.size strrchr, .-strrchr > > +libc_hidden_builtin_def (strrchr) > > \ No newline at end of file > > diff --git a/sysdeps/riscv/rv64/rvv/strspn.S > b/sysdeps/riscv/rv64/rvv/strspn.S > > new file mode 100644 > > index 0000000000..2b9af5cc2d > > --- /dev/null > > +++ b/sysdeps/riscv/rv64/rvv/strspn.S > > @@ -0,0 +1,189 @@ > > + > > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc. > > + > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library. If not, see > > + . 
*/ > > + > > + > > +#include > > + > > +.globl strspn > > +.type strspn,@function > > + > > +.globl strcspn > > +.type strcspn,@function > > + > > +/* > > + * optimized strspn / strcspn for riscv with vector extension > > + * assumptions: > > + * - vlenb is a power of 2 > > + * - page size >= 32 > > + * strategy: > > + * - build a 256-bit table on the stack, where each elt is zero > > + * if encountering it should terminate computation and nonzero > otherwise > > + * - use vectorised lookups into this to check 2*vlen elts at a time; > > + * this code is identical for strspan and strcspan and can be shared > > + * > > + * note that while V mandates at least 128 bit wide regs, > > + * we are building a 256 bit lookup table > > + * therefore we use either LMUL=1 or 2 depending on what the target > supports > > + * therefore we only use even vector register numbers, > > + * so everything still works if we go with LMUL=2 > > + */ > > + > > I wonder if we could adapt the generic implementation, so riscv only > reimplements > the vectorized search instead of all the boilerplace to generate the table > and > early tests. > > > +# ----------------------------- > > + > > +.align 2 > > + > > +strspn: > > + lbu t0, 0(a1) > > + bnez t0, .Lbuild_table > > + mv a0, zero > > + ret > > + > > +.Lbuild_table: > > + mv a6, a0 /* store incoming a0 */ > > + li t1, 32 /* want to deal with 256 bits at a time, so 32 > bytes */ > > + > > + vsetvli zero, t1, e8, m1, tu, mu > > +#if __riscv_v_min_vlen < 256 > > + /* we want to build a 256-bit table, so use vlenb*2, > > + * m2 if regs are 128 bits wide or vlenb, m1 if >= 256 > > + * 'V' extension specifies a minimum vlen of 128 so this should > cover > > + * all cases; we can skip the check if we know vlen >= 256 at > compile time > > + */ > > + csrr t2, vlenb > > + bgeu t2, t1, 1f > > + vsetvli zero, t1, e8, m2, tu, mu > > +1: > > +#endif // __riscv_v_min_vlen > > + > > + /* read one char from the charset at a time and write the correct > bit > > + * in the lookup table; we could do SIMD iff we ever get an > extension > > + * that provides some way of scattering bytes into a reg group > > + */ > > + vmv.v.x v16, zero /* clear out table */ > > + vmv.v.x v8, zero /* clear out v8 */ > > + li t3, 1 > > + vmv.s.x v8, t3 /* v8 now all zeroes except bottom byte > */ > > + > > +1: vmv.v.x v2, zero /* clear out v2 */ > > + addi a1, a1, 1 /* advance charset ptr */ > > + srli t2, t0, 3 /* divide the byte we read earlier by 8 > */ > > + vslideup.vx v2, v8, t2 /* v2 now 1 in the correct byte 0 > elsewhere */ > > + vsll.vx v2, v2, t0 /* v2 now 1 in the correct bit, 0 > elsewhere */ > > + vor.vv v16, v16, v2 /* or it in */ > > + lbu t0, 0(a1) /* fetch next bute */ > > + bnez t0, 1b /* if it's null, go round again */ > > + > > +/* > > + * Table is now built in v16. > > + * Strategy: > > + * - fetch next t1 bytes from memory > > + * - vrgather on their values divided by 8 to get relevant bytes of > table > > + * - shift right to get the correct bit into bit 1 > > + * - and with 1, compare with expected terminator value, then check > mask > > + * to see if we've found a terminator > > + * > > + * Before we can begin, a0 needs to be t1-aligned, so that when we > fetch > > + * the next t1 bytes - any of which may be the null terminator - > > + * we do not cross a page boundary and read unmapped memory. Therefore > > + * we have one read of however many bytes are needed to align a0, > > + * before the main loop. 
> > + */ > > + > > +.Lscan_table: > > + vmv.v.x v8, t3 /* v8 now t1 bytes of 0x01 */ > > + > > + and t2, a0, t1 /* mask to align to t1 */ > > + beqz t2, 2f /* or skip if we're already aligned > */ > > + sub t2, t1, t2 /* t2 now bytes to read to align to > t1 */ > > + > > + vid.v v2 /* build mask instead of changing > vl */ > > + vmsltu.vx v0, v2, t2 /* so we don't need to track LMUL */ > > + > > + vle8.v v2, (a0), v0.t /* load next bytes from input */ > > + vsrl.vi v4, v2, 3 /* divide by 8 */ > > + vrgather.vv v6, v16, v4 /* corresponding bytes of bit table > */ > > + vsrl.vv v6, v6, v2 /* shift correct bits to lsb */ > > + vand.vv v6, v6, v8 /* and with 1 to complete the > lookups */ > > + vmseq.vx v4, v6, zero, v0.t /* check to see if any 0s are > present */ > > + vfirst.m t0, v4 /* index of the first 0, if any */ > > + bgez t0, .Lscan_end /* if we found one, stop */ > > + add a0, a0, t2 /* advance by number of bytes we > read */ > > + > > +2: add a6, a6, t1 /* we'll advance a0 before the exit > check */ > > +1: vle8.v v2, (a0) /* as above but unmasked so t1 elts per > pass */ > > + add a0, a0, t1 > > + > > + vsrl.vi v4, v2, 3 > > + vrgather.vv v6, v16, v4 > > + vsrl.vv v6, v6, v2 > > + vand.vv v6, v6, v8 > > + > > + vmseq.vx v4, v6, zero > > + vfirst.m t0, v4 > > + bltz t0, 1b > > + > > +.Lscan_end: > > + add a0, a0, t0 /* calculate offset to terminating byte > */ > > + sub a0, a0, a6 > > + ret > > +.size strspn, .-strspn > > + > > +/* strcspn > > + * > > + * table build exactly as for strspn, except: > > + * - the lookup table starts with all bits except bit 0 of byte 0 set > > + * - we clear the corresponding bit for each byte in the charset > > + * once table is built, we can reuse the scan code directly > > + */ > > + > > +strcspn: > > + lbu t0, 0(a1) > > + beqz t0, strlen /* no rejections -> prefix is whole string > */ > > + > > + mv a6, a0 > > + li t1, 32 > > + > > + vsetvli zero, t1, e8, m1, tu, mu > > +#if __riscv_v_min_vlen < 256 > > + csrr t2, vlenb > > + bgeu t2, t1, 1f > > + vsetvli zero, t1, e8, m2, tu, mu > > +1: > > +#endif // __riscv_v_min_vlen > > + > > + vmv.v.x v8, zero > > + li t3, 1 /* all bits clear except bit 0 of byte > 0 */ > > + vmv.s.x v8, t3 > > + vnot.v v16, v8 /* v16 is the inverse of that */ > > + li t4, -1 > > + > > +1: vmv.v.x v2, zero > > + addi a1, a1, 1 /* advance charset ptr */ > > + srli t2, t0, 3 /* select correct bit in v2 */ > > + vslideup.vx v2, v8, t2 > > + vsll.vx v2, v2, t0 > > + vnot.v v2, v2 /* invert */ > > + vand.vv v16, v16, v2 /* clear the relevant bit of table */ > > + lbu t0, 0(a1) > > + bnez t0, 1b > > + j .Lscan_table > > +.size strcspn, .-strcspn > > + > > +libc_hidden_builtin_def (strspn) > > +libc_hidden_builtin_def (strcspn) > > \ No newline at end of file > --000000000000ad982905f3b4aad3--