From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=5WhA=56=rivosinc.com=slewis@sourceware.org>
Received: from mail-lj1-x235.google.com (mail-lj1-x235.google.com [IPv6:2a00:1450:4864:20::235])
	by sourceware.org (Postfix) with ESMTPS id D33903858428
	for <libc-alpha@sourceware.org>; Thu,  2 Feb 2023 10:02:48 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D33903858428
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=rivosinc.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivosinc.com
Received: by mail-lj1-x235.google.com with SMTP id y19so1316905ljq.7
        for <libc-alpha@sourceware.org>; Thu, 02 Feb 2023 02:02:48 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=rivosinc-com.20210112.gappssmtp.com; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=Ry8wfPCJzmxCXjq06kZh7+eGOLNiwBRxoDUwIbAv9aM=;
        b=tBEmt8cyOqpipYKOQNa0UZQH0zWYeg4oJwGgiFIA62FFm4piEdHEuJfVAmDQuRQwEs
         tWHgCddFiOYYk6oSL/0ZdAa2MopkiqFXgErzpaNYbr2M1oO3P8pR2LqEnK/dQbvhLYVk
         ZBAE9Fki/OWEcKjXSfH1CIRmhHJSZYJpl848YRLclaEg6KePxv04ovyhZY6ZZEHWvGto
         amc0IXvi03sHl+J+lLk/aLiF37SA+db4O1I8nsO370iJWVPlTBNdtqYfZeU5HUJ84dGl
         ox3YIsITIj2a/eW2BYbIoUlu5/yLCKZxKst3QsrkFMlsvD6WwOxGkVmebfy8fpjRBDTT
         L3tA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=Ry8wfPCJzmxCXjq06kZh7+eGOLNiwBRxoDUwIbAv9aM=;
        b=gCR9rEYI+HtK7vVZDOWKrSPMe7G92Vv6QDAruMjTHVdfMF4jnn9xZ6ieErGDLZ7sod
         IGKwgY8WzeCk9xB4+93pMYuJGD0eMkBVK4sKRmM/3cL0+ktULfOS8IqiefLfGgFdHDii
         qQvjnj+nEntl7Oa937swPFoLXt8nW3b5CIAcRQ+huSRIJGxoej6tiPVJr1RK/auEKsW/
         F8e4p1cDpAwzxjyVUrlStXhlGMuUZfZR2pgJ1UOSv+etaq1poqZ+eeWOy3NOfFklicgj
         FdNVxjpX8RRfcSp44ISVnIhWLZVwrSiYSUauRPzM1iLgTFi+B//X6JGp5tGT+XgI5dUq
         ldDg==
X-Gm-Message-State: AO0yUKU0lqh7ICBpTc2qyH8+wTJrO1qz8Bv8xrLg9xDfdWv2BuSJ+0fB
	Hu7dCWdNN5qcVKI6P+5Q0jCJheJFDlV7hKd7tdFVJQ85xG2DlUii
X-Google-Smtp-Source: AK7set/B9vBs3XHd4AqdXyt9P3se3MuFgrWSJidJujLXLNT0Pv398cjD5sw6Zy6triTtfZbXh1BEcFVzoviyYfyv3L8=
X-Received: by 2002:a2e:9d05:0:b0:283:9a0a:a0f7 with SMTP id
 t5-20020a2e9d05000000b002839a0aa0f7mr754100lji.81.1675332166714; Thu, 02 Feb
 2023 02:02:46 -0800 (PST)
MIME-Version: 1.0
References: <20230201095232.15942-1-slewis@rivosinc.com> <20230201095232.15942-2-slewis@rivosinc.com>
 <87479d1a-abf3-b564-8613-2a48d26527b5@linaro.org>
In-Reply-To: <87479d1a-abf3-b564-8613-2a48d26527b5@linaro.org>
From: Sergei Lewis <slewis@rivosinc.com>
Date: Thu, 2 Feb 2023 10:02:35 +0000
Message-ID: <CAE2KcLqB56DQFpAeif0bbLNBUVRQRqLQwJByh44qvC94YWbiow@mail.gmail.com>
Subject: Re: [PATCH 2/2] riscv: vectorised mem* and str* functions
To: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: libc-alpha@sourceware.org
Content-Type: multipart/alternative; boundary="000000000000ad982905f3b4aad3"
X-Spam-Status: No, score=-9.4 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,GIT_PATCH_0,HTML_MESSAGE,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <libc-alpha.sourceware.org>

--000000000000ad982905f3b4aad3
Content-Type: text/plain; charset="UTF-8"

Thank you very much for the detailed review!

> > +#ifndef __riscv_strict_align
> Would this be defined by compiler as predefine macro or is it just a debug
> switch? If the later, I think it would be better to remove it.

The intent is to make use of the gcc feature in flight here:
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610115.html to
detect the situation where the build environment has been configured to
avoid unaligned access.


> It is really worth to add a strrchr optimization?  The generic
implementation
> already calls strchr (which should be optimized).

The performance win is actually quite significant; consider searching for
the first space in a piece of text compared to the last space - reusing
strchr in a loop as the generic implementation does would cause branches
every few bytes of input, and essentially destroy all benefits gained from
vectorisation. Worse, vectorised functions typically have a few
instructions of setup/overhead before the work begins and these would
likewise be repeated every few bytes of input.

> I wonder if we could adapt the generic implementation, so riscv only
reimplements
> the vectorized search instead of all the boilerplace to generate the
table and
> early tests.

The issue is that the table looks different for different implementations,
and possibly even for different cases in the same implementation; e.g. some
of the existing implementations use a 256-byte table with one byte per
character rather than a 256-bit bitfield as I do here (and going forward we
would potentially want such a path for riscv as well and select between
them based on the length of the character set - common use in parsing will
tend to produce very small character sets, but if we get a large one or
potentially always depending on architecture, using indexed loads/stores
will become faster than the bitfield approach I use here).

I am integrating all the other feedback, and will also work with your
changes to the generic implementations - it looks like there is quite a bit
of potential to reduce and simplify my changeset once yours goes in.

On Wed, Feb 1, 2023 at 5:38 PM Adhemerval Zanella Netto <
adhemerval.zanella@linaro.org> wrote:

>
>
> On 01/02/23 06:52, Sergei Lewis wrote:
> > Initial implementations of memchr, memcmp, memcpy, memmove, memset,
> strchr,
> > strcmp, strcpy, strlen, strncmp, strncpy, strnlen, strrchr, strspn
> > targeting the riscv "V" extension, version 1.0
> >
> > The vectorised implementations assume VLENB of at least 128 and at least
> 32
> > registers (as mandated by the "V" extension spec). They also assume that
> > VLENB is a power of two which is no larger than the page size, and (as
> > vectorised code in glibc for other platforms does) that it is safe to
> read
> > past null terminators / buffer ends provided one does not cross a page
> > boundary.
> >
> > Signed-off-by: Sergei Lewis <slewis@rivosinc.com>
>
> Some comments that might be useful since I am working the generic
> implementations
> below.
>
> Also, I think it should be splitted with one implementation per patch,
> unless the
> implementation is tied together (as for strchr/strchrnul for instance).
> Does
> the vectorized routine only work for rv64?
>
> > ---
> >  sysdeps/riscv/rv64/rvv/Implies     |   2 +
> >  sysdeps/riscv/rv64/rvv/memchr.S    | 127 +++++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memcmp.S    |  93 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memcpy.S    | 154 +++++++++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/memmove.c   |  22 ++++
> >  sysdeps/riscv/rv64/rvv/memset.S    |  89 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strchr.S    |  92 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strchrnul.c |  22 ++++
> >  sysdeps/riscv/rv64/rvv/strcmp.S    | 108 +++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strcpy.S    |  72 +++++++++++
> >  sysdeps/riscv/rv64/rvv/strcspn.c   |  22 ++++
> >  sysdeps/riscv/rv64/rvv/strlen.S    |  67 ++++++++++
> >  sysdeps/riscv/rv64/rvv/strncmp.S   | 104 ++++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strncpy.S   |  96 +++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strnlen.S   |  81 +++++++++++++
> >  sysdeps/riscv/rv64/rvv/strrchr.S   |  88 ++++++++++++++
> >  sysdeps/riscv/rv64/rvv/strspn.S    | 189 +++++++++++++++++++++++++++++
> >  17 files changed, 1428 insertions(+)
> >  create mode 100644 sysdeps/riscv/rv64/rvv/Implies
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memcmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memcpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memmove.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/memset.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strchrnul.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strcspn.c
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strlen.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strncmp.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strncpy.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strnlen.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strrchr.S
> >  create mode 100644 sysdeps/riscv/rv64/rvv/strspn.S
> >
> > diff --git a/sysdeps/riscv/rv64/rvv/Implies
> b/sysdeps/riscv/rv64/rvv/Implies
> > new file mode 100644
> > index 0000000000..b07b4cb906
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/Implies
> > @@ -0,0 +1,2 @@
> > +riscv/rv64/rvd
> > +
> > diff --git a/sysdeps/riscv/rv64/rvv/memchr.S
> b/sysdeps/riscv/rv64/rvv/memchr.S
> > new file mode 100644
> > index 0000000000..a7e32b8f25
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/memchr.S
> > @@ -0,0 +1,127 @@
> > +
>
> Spurious new line at the start.  We also require a brief comment describing
> the file contents for newer files.
>
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
>
> Not sure 2012 range fits here.
>
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +
> > +/* Optimised memchr for riscv with vector extension
> > + * Assumptions:
> > + *    - cpu becomes bandwidth limited at or before
> > + *            2 vector register sized read/write operations
> > + *          + 2 scalar operations
> > + *          + conditional branch
> > + */
> > +
> > +.globl  memchr
> > +.type   memchr,@function
> > +
> > +.align    2
> > +memchr:
>
> We have the ENTRY macro for that.
>
> > +    beqz a2, .Lnot_found
>
> Maybe use the L macro here for local labels;
>
> > +    csrr    t1, vlenb
> > +    bgeu    a2, t1, .Lvector_path   /* only use vector path if we're
> scanning
> > +                                       at least vlenb bytes */
> > +
> > +#ifndef __riscv_strict_align
>
> Would this be defined by compiler as predefine macro or is it just a debug
> switch? If the later, I think it would be better to remove it.
>
> > +    li a3, 8
> > +    blt a2, a3, .Lbytewise
> > +
> > +    li      t1, 0x0101010101010101
> > +    slli    a4, t1, 7     /* a4 = 0x8080808080808080 */
> > +    mul     t2, a1, t1    /* entirety of t2 is now repeats of target
> character;
> > +                             assume mul is at worst no worse than
> 3*(shift+OR),
> > +                             otherwise do that instead */
> > +
> > +/*
> > + * strategy:
> > + * t4 = ((*a0) ^ t2)
> > + *      - now t4 contains zero bytes if and only if next word of memory
> > + *        had target character at those positions
> > + *
> > + * t4 = ((t4-0x0101010101010101) & ~t4) & 0x8080808080808080
> > + *      - all nonzero bytes of t4 become 0; zero bytes become 0x80
> > + *
> > + * if t4 is nonzero, find the index of the byte within it, add to a0
> and return
> > + * otherwise, loop
> > + */
> > +
> > +1:
> > +    ld t4, (a0)          /* t4 = load next 8 bytes */
> > +    xor t4, t4, t2
> > +    sub t5, t4, t1
> > +    not t4, t4
> > +    and t4, t5, t4
> > +    and t4, t4, a4
> > +    bnez t4, .Lbytewise  /* could use ctzw, mod+lookup or just binary
> chop
> > +                            to locate byte of interest in t4 but
> profiling
> > +                            shows these approaches are at best no
> better */
> > +    addi a2, a2, -8
> > +    addi a0, a0, 8
> > +    bgeu a2, a3, 1b
> > +    beqz a2, .Lnot_found
> > +#endif // __riscv_strict_align
> > +
> > +/* too little data for a dword. mask calculation and branch mispredict
> costs
> > +   make checking a word not worthwhile. degrade to bytewise search. */
> > +
> > +.Lbytewise:
> > +    add t2, a0, a2
> > +
> > +1:
> > +    lb t1, (a0)
> > +    beq t1, a1, .Lfound
> > +    addi a0, a0, 1
> > +    blt a0, t2, 1b
> > +
> > +.Lnot_found:
> > +    mv a0, zero
> > +.Lfound:
> > +    ret
> > +
> > +.Lvector_path:
> > +    vsetvli t2, a2, e8, m2, ta, ma
> > +
> > +1:
> > +    vle8.v      v2, (a0)
> > +    vmseq.vx    v0, v2, a1
> > +    vfirst.m    t3, v0
> > +    bgez        t3, .Lvec_found
> > +    add         a0, a0, t2
> > +    sub         a2, a2, t2
> > +    bge         a2, t2, 1b
> > +    bnez        a2, 2f
> > +    mv          a0, zero
> > +    ret
> > +
> > +2:
> > +    vsetvli t2, a2, e8, m2, ta, ma
> > +    vle8.v      v2, (a0)
> > +    vmseq.vx    v0, v2, a1
> > +    vfirst.m    t3, v0
> > +    bgez        t3, .Lvec_found
> > +    mv          a0, zero
> > +    ret
> > +
> > +.Lvec_found:
> > +    add a0, a0, t3
> > +    ret
> > +
> > +.size   memchr, .-memchr
> > +libc_hidden_builtin_def (memchr)
> > \ No newline at end of file
>
> Please add a newline.
>
> > diff --git a/sysdeps/riscv/rv64/rvv/strcpy.S
> b/sysdeps/riscv/rv64/rvv/strcpy.S
> > new file mode 100644
> > index 0000000000..b21909d66f
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strcpy.S
> > @@ -0,0 +1,72 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
>
> You can add a optimize stpcpy and use to implement strcpy on top of that
> (as my generic proposal does [1]). ARMv6 does something similar [2]
>
> [1]
> https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-12-adhemerval.zanella@linaro.org/
> [2]
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/arm/armv6/strcpy.S;h=e9f63a56c1c605a21b05f7ac21412585b0705171;hb=HEAD
>
> > +#include <sysdep.h>
> > +
> > +.globl  strcpy
> > +.type   strcpy,@function
> > +
> > +/*
> > + *  optimized strcpy for riscv with vector extension
> > + *  assumptions:
> > + *  - vlenb is a power of 2
> > + *  - page size >= 2*vlenb
> > + */
> > +
> > +.align    2
> > +strcpy:
> > +    mv          t0, a0                     /* copy dest so we can
> return it */
> > +
> > +    csrr        t1, vlenb                  /* find vlenb*2 */
> > +    add         t1, t1, t1
> > +
> > +    addi        t2, t1, -1                 /* mask unaligned part of
> ptr */
> > +    and         t2, a1, t2
> > +    beqz        t2, .Laligned
> > +
> > +    sub         t2, t1, t2                 /* search enough to align
> ptr */
> > +    vsetvli     t2, t2, e8, m2, tu, mu
> > +    vle8.v      v2, (a1)
> > +    vmseq.vx    v4, v2, zero
> > +    vmsif.m     v0, v4                     /* copy but not past null */
> > +    vfirst.m    t3, v4
> > +    vse8.v      v2, (t0), v0.t
> > +    bgez        t3, .Ldone
> > +    add         t0, t0, t2
> > +    add         a1, a1, t2
> > +
> > +.Laligned:
> > +    vsetvli     zero, t1, e8, m2, ta, mu   /* now do 2*vlenb bytes per
> pass */
> > +
> > +1:
> > +    vle8.v      v2, (a1)
> > +    add         a1, a1, t1
> > +    vmseq.vx    v4, v2, zero
> > +    vmsif.m     v0, v4
> > +    vfirst.m    t3, v4
> > +    vse8.v      v2, (t0), v0.t
> > +    add         t0, t0, t1
> > +    bltz        t3, 1b
> > +
> > +.Ldone:
> > +    ret
> > +
> > +.size   strcpy, .-strcpy
> > +libc_hidden_builtin_def (strcpy)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strcspn.c
> b/sysdeps/riscv/rv64/rvv/strcspn.c
> > new file mode 100644
> > index 0000000000..f0595a72fb
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strcspn.c
> > @@ -0,0 +1,22 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +/* strcspn is implemented in strspn.S
> > + */
> > diff --git a/sysdeps/riscv/rv64/rvv/strlen.S
> b/sysdeps/riscv/rv64/rvv/strlen.S
> > new file mode 100644
> > index 0000000000..c77d500693
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strlen.S
> > @@ -0,0 +1,67 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl  strlen
> > +.type   strlen,@function
> > +
> > +/*
> > + *  optimized strlen for riscv with vector extension
> > + *  assumptions:
> > + *  - vlenb is a power of 2
> > + *  - page size >= 2*vlenb
> > + */
> > +
> > +.align    2
> > +strlen:
> > +    mv          t4, a0                    /* copy of buffer start */
> > +    csrr        t1, vlenb                 /* find vlenb*2 */
> > +    add         t1, t1, t1
> > +    addi        t2, t1, -1                /* mask off unaligned part of
> ptr */
> > +    and         t2, a0, t2
> > +    beqz        t2, .Laligned
> > +
> > +    sub         t2, t1, t2                /* search fwd to align ptr */
> > +    vsetvli     t2, t2, e8, m2, ta, ma
> > +    vle8.v      v2, (a0)
> > +    vmseq.vx    v0, v2, zero
> > +    vfirst.m    t3, v0
> > +    bgez        t3, .Lfound
> > +    add         a0, a0, t2
> > +
> > +.Laligned:
> > +    vsetvli     zero, t1, e8, m2, ta, ma  /* search 2*vlenb bytes per
> pass */
> > +    add         t4, t4, t1
> > +
> > +1:
> > +    vle8.v      v2, (a0)
> > +    add         a0, a0, t1
> > +    vmseq.vx    v0, v2, zero
> > +    vfirst.m    t3, v0
> > +    bltz        t3, 1b
> > +
> > +.Lfound:                                  /* found the 0; subtract
>     */
> > +    sub         a0, a0, t4                /* buffer start from current
> ptr  */
> > +    add         a0, a0, t3                /* and add offset into
> fetched    */
> > +    ret                                   /* data to get length */
> > +
> > +.size   strlen, .-strlen
> > +libc_hidden_builtin_def (strlen)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strncmp.S
> b/sysdeps/riscv/rv64/rvv/strncmp.S
> > new file mode 100644
> > index 0000000000..863e5cb525
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strncmp.S
> > @@ -0,0 +1,104 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl  strncmp
> > +.type   strncmp,@function
> > +
> > +.align    2
> > +
> > +/* as strcmp, but with added checks on a2 (max count)
> > + */
> > +
> > +strncmp:
> > +    csrr        t1, vlenb                   /* find vlenb*2 */
> > +    add         t1, t1, t1
> > +    blt         a2, t1, .Ltail              /* degrade if max < vlenb*2
> */
> > +    vsetvli     zero, t1, e8, m2, ta, mu
> > +    vid.v       v30
> > +    addi        t2, t1, -1                  /* mask unaligned part of
> ptr */
> > +    and         t6, a0, t2                  /* unaligned part of lhs */
> > +    and         t5, a1, t2                  /* unaligned part of rhs */
> > +    sub         t6, t1, t6                  /* safe count to read from
> lhs */
> > +    sub         t5, t1, t5                  /* same, rhs */
> > +    vmsltu.vx   v28, v30, t6                /* mask for first part of
> lhs */
> > +    vmsltu.vx   v26, v30, t5                /* mask for first part of
> rhs */
> > +    vmv.v.x     v16, zero
> > +    vmv.v.x     v18, zero
> > +
> > +
> > +1:  blt         a2, t1, .Ltail
> > +    vmv.v.v     v0, v28                     /* lhs mask */
> > +    vle8.v      v2, (a0), v0.t              /* masked load from lhs */
> > +    vmseq.vx    v16, v2, zero, v0.t         /* check loaded bytes for
> null */
> > +    vmv.v.v     v0, v26                     /*       rhs mask */
> > +    vfirst.m    t2, v16                     /* get lhs check result */
> > +    bgez        t2, .Ltail                  /* can we safely check rest
> */
> > +    vle8.v      v4, (a1), v0.t              /*       masked load from
> rhs */
> > +    vmseq.vx    v18, v4, zero, v0.t         /*       check partial rhs
> */
> > +    vmnot.m     v0, v28                     /* mask for rest of lhs */
> > +    vfirst.m    t3, v18                     /* get check result */
> > +    bltz        t3, 2f                      /* test it */
> > +    bge         t3, t6, .Ltail
> > +
> > +    vmsleu.vx   v0, v30, t3                 /* select rest of string +
> null */
> > +    vmsne.vv    v0, v2, v4, v0.t            /* compare */
> > +    vfirst.m    t3, v0
> > +    bgez        t3, 3f
> > +    mv          a0, zero
> > +    ret
> > +3:  add a0, a0, t3
> > +    add a1, a1, t3
> > +    lbu t0, (a0)
> > +    lbu t1, (a1)
> > +.Ldiff:
> > +    sub a0, t0, t1
> > +    ret
> > +
> > +    /* ...no null terminator in first part of lhs or rhs */
> > +2:  vle8.v      v2, (a0), v0.t              /* load rest of lhs */
> > +    vmnot.m     v0, v26                     /* mask for rest of rhs */
> > +    vle8.v      v4, (a1), v0.t              /* load rest of rhs */
> > +    vmsne.vv    v0, v2, v4                  /* compare */
> > +    add         a0, a0, t1                  /* advance ptrs */
> > +    vfirst.m    t3, v0
> > +    add         a1, a1, t1
> > +    sub         a2, a2, t1
> > +    bltz        t3, 1b
> > +
> > +    sub t3, t3, t1  /* found a diff but we've already advanced a0 and
> a1 */
> > +    j 3b
> > +
> > +.Ltail:
> > +    beqz a2, 1f
> > +    addi a2, a2, -1
> > +    lbu t0, (a0)
> > +    lbu t1, (a1)
> > +    bne t0, t1, .Ldiff
> > +    addi a0, a0, 1
> > +    addi a1, a1, 1
> > +    bnez t0, .Ltail
> > +1:  mv a0, zero
> > +    ret
> > +
> > +
> > +.size strncmp, .-strncmp
> > +libc_hidden_builtin_def (strncmp)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strncpy.S
> b/sysdeps/riscv/rv64/rvv/strncpy.S
> > new file mode 100644
> > index 0000000000..8b3a1e545c
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strncpy.S
> > @@ -0,0 +1,96 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl  strncpy
> > +.type   strncpy,@function
> > +
> > +/*
> > + *  optimized strcpy for riscv with vector extension
> > + *  assumptions:
> > + *  - vlenb is a power of 2
> > + *  - page size >= 2*vlenb
> > + */
> > +
> > +.align    2
> > +strncpy:
> > +    mv          t0, a0                    /* need to return dest so
> copy */
> > +
> > +    csrr        t1, vlenb                 /* find vlenb*2 */
> > +    add         t1, t1, t1
> > +
> > +    addi        t2, t1, -1                /* mask off unaligned part of
> ptr */
> > +    and         t2, a1, t2
> > +    beqz        t2, .Laligned
> > +
> > +    sub         t2, t1, t2                /* search to align the
> pointer */
> > +    vsetvli     zero, t2, e8, m2, tu, mu
> > +    vle8.v      v2, (a1)
> > +    vmseq.vx    v4, v2, zero
> > +    vmsif.m     v0, v4                    /* copy to dest */
> > +    vfirst.m    t3, v4
> > +    bgeu        t2, a2, .Ldest_full
> > +    vse8.v      v2, (t0), v0.t
> > +    bgez        t3, .Lterminator_found
> > +    add         t0, t0, t2
> > +    add         a1, a1, t2
> > +    sub         a2, a2, t2
> > +    beqz        a2, .Ldone
> > +
> > +.Laligned:
> > +    vsetvli     zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per
> pass */
> > +
> > +1:
> > +    vle8.v      v2, (a1)
> > +    add         a1, a1, t1
> > +    vmseq.vx    v4, v2, zero
> > +    vmsif.m     v0, v4
> > +    vfirst.m    t3, v4
> > +    bgeu        t1, a2, .Ldest_full
> > +    vse8.v      v2, (t0), v0.t
> > +    add         t0, t0, t1
> > +    sub         a2, a2, t1
> > +    bltz        t3, 1b
> > +    sub         t0, t0, t1
> > +
> > +.Lterminator_found:
> > +    addi        sp, sp, -16
> > +    sd          ra, 0(sp)
> > +    sd          a0, 8(sp)
> > +    add         a0, t0, t3
> > +    mv          a1, zero
> > +    sub         a2, a2, t3
> > +    jal         ra, memset
> > +    ld          ra, 0(sp)
> > +    ld          a0, 8(sp)
> > +    addi        sp, sp, 16
> > +.Ldone:
> > +    ret
> > +
> > +.Ldest_full:
> > +    vid.v       v6
> > +    vmsltu.vx   v4, v6, a2
> > +    vmand.mm    v0, v0, v4
> > +    vse8.v      v2, (t0), v0.t
> > +    ret
> > +
> > +.size   strncpy, .-strncpy
> > +libc_hidden_builtin_def (strncpy)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strnlen.S
> b/sysdeps/riscv/rv64/rvv/strnlen.S
> > new file mode 100644
> > index 0000000000..6d7ee65c7a
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strnlen.S
> > @@ -0,0 +1,81 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
>
> Maybe use a generic implementation that issues memchr (which should be
> optimized
> using vector instructions) [3] ?  It would be a extra function call, but
> it should really
> help on both code size and icache pressure.
>
> [3]
> https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-6-adhemerval.zanella@linaro.org/
>
> > +.globl  __strnlen
> > +.type   __strnlen,@function
> > +
> > +/* vector optimized strnlen
> > + * assume it's safe to read to the end of the page
> > + * containing either a null terminator or the last byte of the count or
> both,
> > + * but not past it
> > + * assume page size >= vlenb*2
> > + */
> > +
> > +.align    2
> > +__strnlen:
> > +    mv          t4, a0               /* stash a copy of start for later
> */
> > +    beqz        a1, .LzeroCount
> > +
> > +    csrr        t1, vlenb            /* find vlenb*2 */
> > +    add         t1, t1, t1
> > +    addi        t2, t1, -1           /* mask off unaligned part of ptr
> */
> > +    and         t2, a1, a0
> > +    beqz        t2, .Laligned
> > +
> > +    sub         t2, t1, t2           /* search to align pointer to t1 */
> > +    bgeu        t2, a1, 2f           /* check it's safe */
> > +    mv          t2, a1               /* it's not! look as far as
> permitted */
> > +2:  vsetvli     t2, t2, e8, m2, ta, ma
> > +    vle8.v      v2, (a0)
> > +    vmseq.vx    v0, v2, zero
> > +    vfirst.m    t3, v0
> > +    bgez        t3, .Lfound
> > +    add         a0, a0, t2
> > +    sub         a1, a1, t2
> > +    bltu        a1, t1, .LreachedCount
> > +
> > +.Laligned:
> > +    vsetvli     zero, t1, e8, m2, ta, ma    /* do 2*vlenb bytes per
> pass */
> > +
> > +1:  vle8.v      v2, (a0)
> > +    sub         a1, a1, t1
> > +    vmseq.vx    v0, v2, zero
> > +    vfirst.m    t3, v0
> > +    bgez        t3, .Lfound
> > +    add         a0, a0, t1
> > +    bgeu        a1, t1, 1b
> > +.LreachedCount:
> > +    mv          t2, a1    /* in case 0 < a1 < t1 */
> > +    bnez        a1, 2b    /* if so, still t2 bytes to check, all safe */
> > +.LzeroCount:
> > +    sub         a0, a0, t4
> > +    ret
> > +
> > +.Lfound:        /* found the 0; subtract buffer start from current
> pointer */
> > +    add         a0, a0, t3 /* and add offset into fetched data */
> > +    sub         a0, a0, t4
> > +    ret
> > +
> > +.size   __strnlen, .-__strnlen
> > +weak_alias (__strnlen, strnlen)
> > +libc_hidden_builtin_def (__strnlen)
> > +libc_hidden_builtin_def (strnlen)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strrchr.S
> b/sysdeps/riscv/rv64/rvv/strrchr.S
> > new file mode 100644
> > index 0000000000..4bef8a3b9c
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strrchr.S
> > @@ -0,0 +1,88 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
>
> It is really worth to add a strrchr optimization?  The generic
> implementation
> already calls strchr (which should be optimized).
>
> > +
> > +.globl  strrchr
> > +.type   strrchr,@function
> > +
> > +/*
> > + *  optimized strrchr for riscv with vector extension
> > + *  assumptions:
> > + *  - vlenb is a power of 2
> > + *  - page size >= 2*vlenb
> > + */
> > +
> > +.align    2
> > +strrchr:
> > +    mv          t5, a0    /* stash buffer ptr somewhere safe */
> > +    mv          a0, zero  /* result is nullptr unless we find better
> below */
> > +
> > +    csrr        t1, vlenb                /* determine vlenb*2 */
> > +    add         t1, t1, t1
> > +    addi        t2, t1, -1               /* mask off unaligned part of
> ptr */
> > +    and         t2, t5, t2
> > +    beqz        t2, .Laligned
> > +
> > +    sub         t2, t1, t2               /* search to align ptr to
> 2*vlenb */
> > +    vsetvli     t2, t2, e8, m2, ta, mu
> > +
> > +    vle8.v      v2, (t5)                 /* load data into v2(,v3) */
> > +    vmseq.vx    v4, v2, zero             /* check for null terminator */
> > +    vfirst.m    t4, v4                   /* grab its position, if any */
> > +    vmsbf.m     v0, v4                   /* select valid chars */
> > +    vmseq.vx    v0, v2, a1, v0.t         /* search for candidate byte */
> > +    vfirst.m    t3, v0                   /* grab its position, if any */
> > +    bltz        t3, 2f                   /* did we find a candidate? */
> > +
> > +3:  add         a0, t3, t5               /* we did! grab the address */
> > +    vmsof.m     v1, v0                   /* there might be more than
> one */
> > +    vmandn.mm   v0, v0, v1               /* so clear the one we just
> found */
> > +    vfirst.m    t3, v0                   /* is there another? */
> > +    bgez        t3, 3b
> > +
> > +2:  bgez        t4, .Ldone               /* did we see a null
> terminator? */
> > +    add         t5, t5, t2
> > +
> > +.Laligned:
> > +    vsetvli     zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per
> pass */
> > +
> > +1:  vle8.v      v2, (t5)
> > +    vmseq.vx    v4, v2, zero
> > +    vfirst.m    t4, v4
> > +    vmsbf.m     v0, v4
> > +    vmseq.vx    v0, v2, a1, v0.t
> > +    vfirst.m    t3, v0
> > +    bltz        t3, 2f
> > +
> > +3:  add         a0, t3, t5
> > +    vmsof.m     v1, v0
> > +    vmandn.mm   v0, v0, v1
> > +    vfirst.m    t3, v0
> > +    bgez        t3, 3b
> > +
> > +2:  add         t5, t5, t1
> > +    bltz        t4, 1b
> > +
> > +.Ldone:
> > +    ret
> > +
> > +.size   strrchr, .-strrchr
> > +libc_hidden_builtin_def (strrchr)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strspn.S
> b/sysdeps/riscv/rv64/rvv/strspn.S
> > new file mode 100644
> > index 0000000000..2b9af5cc2d
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strspn.S
> > @@ -0,0 +1,189 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl  strspn
> > +.type   strspn,@function
> > +
> > +.globl  strcspn
> > +.type   strcspn,@function
> > +
> > +/*
> > + *  optimized strspn / strcspn for riscv with vector extension
> > + *  assumptions:
> > + *  - vlenb is a power of 2
> > + *  - page size >= 32
> > + *  strategy:
> > + *  - build a 256-bit table on the stack, where each elt is zero
> > + *    if encountering it should terminate computation and nonzero
> otherwise
> > + *  - use vectorised lookups into this to check 2*vlen elts at a time;
> > + *    this code is identical for strspan and strcspan and can be shared
> > + *
> > + *  note that while V mandates at least 128 bit wide regs,
> > + *  we are building a 256 bit lookup table
> > + *  therefore we use either LMUL=1 or 2 depending on what the target
> supports
> > + *  therefore we only use even vector register numbers,
> > + *  so everything still works if we go with LMUL=2
> > + */
> > +
>
> I wonder if we could adapt the generic implementation, so riscv only
> reimplements
> the vectorized search instead of all the boilerplace to generate the table
> and
> early tests.
>
> > +# -----------------------------
> > +
> > +.align    2
> > +
> > +strspn:
> > +    lbu         t0, 0(a1)
> > +    bnez        t0, .Lbuild_table
> > +    mv          a0, zero
> > +    ret
> > +
> > +.Lbuild_table:
> > +    mv          a6, a0 /* store incoming a0 */
> > +    li          t1, 32 /* want to deal with 256 bits at a time, so 32
> bytes */
> > +
> > +    vsetvli     zero, t1, e8, m1, tu, mu
> > +#if __riscv_v_min_vlen < 256
> > +    /* we want to build a 256-bit table, so use vlenb*2,
> > +     * m2 if regs are 128 bits wide or vlenb, m1 if >= 256
> > +     * 'V' extension specifies a minimum vlen of 128 so this should
> cover
> > +     * all cases; we can skip the check if we know vlen >= 256 at
> compile time
> > +     */
> > +    csrr        t2, vlenb
> > +    bgeu        t2, t1, 1f
> > +    vsetvli     zero, t1, e8, m2, tu, mu
> > +1:
> > +#endif // __riscv_v_min_vlen
> > +
> > +    /* read one char from the charset at a time and write the correct
> bit
> > +     * in the lookup table; we could do SIMD iff we ever get an
> extension
> > +     * that provides some way of scattering bytes into a reg group
> > +     */
> > +    vmv.v.x     v16, zero       /* clear out table */
> > +    vmv.v.x     v8, zero        /* clear out v8 */
> > +    li          t3, 1
> > +    vmv.s.x     v8, t3          /* v8 now all zeroes except bottom byte
> */
> > +
> > +1:  vmv.v.x     v2, zero        /* clear out v2 */
> > +    addi        a1, a1, 1       /* advance charset ptr */
> > +    srli        t2, t0, 3       /* divide the byte we read earlier by 8
> */
> > +    vslideup.vx v2, v8, t2      /* v2 now 1 in the correct byte 0
> elsewhere */
> > +    vsll.vx     v2, v2, t0      /* v2 now 1 in the correct bit, 0
> elsewhere */
> > +    vor.vv      v16, v16, v2    /* or it in */
> > +    lbu         t0, 0(a1)       /* fetch next bute */
> > +    bnez        t0, 1b          /* if it's null, go round again */
> > +
> > +/*
> > + *   Table is now built in v16.
> > + *   Strategy:
> > + *   - fetch next t1 bytes from memory
> > + *   - vrgather on their values divided by 8 to get relevant bytes of
> table
> > + *   - shift right to get the correct bit into bit 1
> > + *   - and with 1, compare with expected terminator value, then check
> mask
> > + *     to see if we've found a terminator
> > + *
> > + *   Before we can begin, a0 needs to be t1-aligned, so that when we
> fetch
> > + *   the next t1 bytes - any of which may be the null terminator -
> > + *   we do not cross a page boundary and read unmapped memory. Therefore
> > + *   we have one read of however many bytes are needed to align a0,
> > + *   before the main loop.
> > + */
> > +
> > +.Lscan_table:
> > +    vmv.v.x     v8, t3              /* v8 now t1 bytes of 0x01 */
> > +
> > +    and         t2, a0, t1          /* mask to align to t1 */
> > +    beqz        t2, 2f              /* or skip if we're already aligned
> */
> > +    sub         t2, t1, t2          /* t2 now bytes to read to align to
> t1 */
> > +
> > +    vid.v       v2                  /* build mask instead of changing
> vl */
> > +    vmsltu.vx   v0, v2, t2          /* so we don't need to track LMUL */
> > +
> > +    vle8.v      v2, (a0), v0.t      /* load next bytes from input */
> > +    vsrl.vi     v4, v2, 3           /* divide by 8 */
> > +    vrgather.vv v6, v16, v4         /* corresponding bytes of bit table
> */
> > +    vsrl.vv     v6, v6, v2          /* shift correct bits to lsb */
> > +    vand.vv     v6, v6, v8          /* and with 1 to complete the
> lookups */
> > +    vmseq.vx    v4, v6, zero, v0.t  /* check to see if any 0s are
> present */
> > +    vfirst.m    t0, v4              /* index of the first 0, if any */
> > +    bgez        t0, .Lscan_end      /* if we found one, stop */
> > +    add         a0, a0, t2          /* advance by number of bytes we
> read */
> > +
> > +2:  add         a6, a6, t1     /* we'll advance a0 before the exit
> check */
> > +1:  vle8.v      v2, (a0)       /* as above but unmasked so t1 elts per
> pass */
> > +    add         a0, a0, t1
> > +
> > +    vsrl.vi     v4, v2, 3
> > +    vrgather.vv v6, v16, v4
> > +    vsrl.vv     v6, v6, v2
> > +    vand.vv     v6, v6, v8
> > +
> > +    vmseq.vx    v4, v6, zero
> > +    vfirst.m    t0, v4
> > +    bltz        t0, 1b
> > +
> > +.Lscan_end:
> > +    add         a0, a0, t0     /* calculate offset to terminating byte
> */
> > +    sub         a0, a0, a6
> > +    ret
> > +.size   strspn, .-strspn
> > +
> > +/* strcspn
> > + *
> > + * table build exactly as for strspn, except:
> > + * - the lookup table starts with all bits except bit 0 of byte 0 set
> > + * - we clear the corresponding bit for each byte in the charset
> > + * once table is built, we can reuse the scan code directly
> > + */
> > +
> > +strcspn:
> > +    lbu         t0, 0(a1)
> > +    beqz        t0, strlen   /* no rejections -> prefix is whole string
> */
> > +
> > +    mv          a6, a0
> > +    li          t1, 32
> > +
> > +    vsetvli     zero, t1, e8, m1, tu, mu
> > +#if __riscv_v_min_vlen < 256
> > +    csrr        t2, vlenb
> > +    bgeu        t2, t1, 1f
> > +    vsetvli     zero, t1, e8, m2, tu, mu
> > +1:
> > +#endif // __riscv_v_min_vlen
> > +
> > +    vmv.v.x     v8, zero
> > +    li          t3, 1           /* all bits clear except bit 0 of byte
> 0 */
> > +    vmv.s.x     v8, t3
> > +    vnot.v      v16, v8         /* v16 is the inverse of that */
> > +    li          t4, -1
> > +
> > +1:  vmv.v.x     v2, zero
> > +    addi        a1, a1, 1       /* advance charset ptr */
> > +    srli        t2, t0, 3       /* select correct bit in v2 */
> > +    vslideup.vx v2, v8, t2
> > +    vsll.vx     v2, v2, t0
> > +    vnot.v      v2, v2          /* invert */
> > +    vand.vv     v16, v16, v2    /* clear the relevant bit of table */
> > +    lbu         t0, 0(a1)
> > +    bnez        t0, 1b
> > +    j           .Lscan_table
> > +.size   strcspn, .-strcspn
> > +
> > +libc_hidden_builtin_def (strspn)
> > +libc_hidden_builtin_def (strcspn)
> > \ No newline at end of file
>

--000000000000ad982905f3b4aad3--