From: Noah Goldstein <goldstein.w.n@gmail.com>
To: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Sergei Lewis <slewis@rivosinc.com>, libc-alpha@sourceware.org
Subject: Re: [PATCH 2/2] riscv: vectorised mem* and str* functions
Date: Wed, 1 Feb 2023 12:13:09 -0600 [thread overview]
Message-ID: <CAFUsyfJ6Do8B725QYZWPMBLLYqHfu+iMvsJQ_nWWVjoX3MvhCA@mail.gmail.com> (raw)
In-Reply-To: <87479d1a-abf3-b564-8613-2a48d26527b5@linaro.org>
On Wed, Feb 1, 2023 at 11:38 AM Adhemerval Zanella Netto via
Libc-alpha <libc-alpha@sourceware.org> wrote:
>
>
>
> On 01/02/23 06:52, Sergei Lewis wrote:
> > Initial implementations of memchr, memcmp, memcpy, memmove, memset, strchr,
> > strcmp, strcpy, strlen, strncmp, strncpy, strnlen, strrchr, strspn
> > targeting the riscv "V" extension, version 1.0
> >
> > The vectorised implementations assume a VLEN of at least 128 bits and at least 32
> > registers (as mandated by the "V" extension spec). They also assume that
> > VLENB is a power of two which is no larger than the page size, and (as
> > vectorised code in glibc for other platforms does) that it is safe to read
> > past null terminators / buffer ends provided one does not cross a page
> > boundary.
> >
> > Signed-off-by: Sergei Lewis <slewis@rivosinc.com>
>
> Some comments that might be useful since I am working the generic implementations
> below.
>
> Also, I think it should be split with one implementation per patch, unless the
> implementation is tied together (as for strchr/strchrnul for instance). Does
> the vectorized routine only work for rv64?
>
> > ---
> > sysdeps/riscv/rv64/rvv/Implies | 2 +
> > sysdeps/riscv/rv64/rvv/memchr.S | 127 +++++++++++++++++++
> > sysdeps/riscv/rv64/rvv/memcmp.S | 93 ++++++++++++++
> > sysdeps/riscv/rv64/rvv/memcpy.S | 154 +++++++++++++++++++++++
> > sysdeps/riscv/rv64/rvv/memmove.c | 22 ++++
> > sysdeps/riscv/rv64/rvv/memset.S | 89 ++++++++++++++
> > sysdeps/riscv/rv64/rvv/strchr.S | 92 ++++++++++++++
> > sysdeps/riscv/rv64/rvv/strchrnul.c | 22 ++++
> > sysdeps/riscv/rv64/rvv/strcmp.S | 108 +++++++++++++++++
> > sysdeps/riscv/rv64/rvv/strcpy.S | 72 +++++++++++
> > sysdeps/riscv/rv64/rvv/strcspn.c | 22 ++++
> > sysdeps/riscv/rv64/rvv/strlen.S | 67 ++++++++++
> > sysdeps/riscv/rv64/rvv/strncmp.S | 104 ++++++++++++++++
> > sysdeps/riscv/rv64/rvv/strncpy.S | 96 +++++++++++++++
> > sysdeps/riscv/rv64/rvv/strnlen.S | 81 +++++++++++++
> > sysdeps/riscv/rv64/rvv/strrchr.S | 88 ++++++++++++++
> > sysdeps/riscv/rv64/rvv/strspn.S | 189 +++++++++++++++++++++++++++++
> > 17 files changed, 1428 insertions(+)
> > create mode 100644 sysdeps/riscv/rv64/rvv/Implies
> > create mode 100644 sysdeps/riscv/rv64/rvv/memchr.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/memcmp.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/memcpy.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/memmove.c
> > create mode 100644 sysdeps/riscv/rv64/rvv/memset.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strchr.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strchrnul.c
> > create mode 100644 sysdeps/riscv/rv64/rvv/strcmp.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strcpy.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strcspn.c
> > create mode 100644 sysdeps/riscv/rv64/rvv/strlen.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strncmp.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strncpy.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strnlen.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strrchr.S
> > create mode 100644 sysdeps/riscv/rv64/rvv/strspn.S
> >
> > diff --git a/sysdeps/riscv/rv64/rvv/Implies b/sysdeps/riscv/rv64/rvv/Implies
> > new file mode 100644
> > index 0000000000..b07b4cb906
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/Implies
> > @@ -0,0 +1,2 @@
> > +riscv/rv64/rvd
> > +
> > diff --git a/sysdeps/riscv/rv64/rvv/memchr.S b/sysdeps/riscv/rv64/rvv/memchr.S
> > new file mode 100644
> > index 0000000000..a7e32b8f25
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/memchr.S
> > @@ -0,0 +1,127 @@
> > +
>
> Spurious newline at the start. We also require a brief comment describing
> the file contents for new files.
>
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
>
> Not sure the 2012 start of the copyright range fits here.
>
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +
> > +/* Optimised memchr for riscv with vector extension
> > + * Assumptions:
> > + * - cpu becomes bandwidth limited at or before
> > + * 2 vector register sized read/write operations
> > + * + 2 scalar operations
> > + * + conditional branch
> > + */
> > +
> > +.globl memchr
> > +.type memchr,@function
> > +
> > +.align 2
> > +memchr:
>
> We have the ENTRY macro for that.
>
> > + beqz a2, .Lnot_found
>
> Maybe use the L macro here for local labels.
>
> > + csrr t1, vlenb
> > + bgeu a2, t1, .Lvector_path /* only use vector path if we're scanning
> > + at least vlenb bytes */
> > +
> > +#ifndef __riscv_strict_align
>
> Would this be defined by the compiler as a predefined macro, or is it just a debug
> switch? If the latter, I think it would be better to remove it.
>
> > + li a3, 8
> > + blt a2, a3, .Lbytewise
> > +
> > + li t1, 0x0101010101010101
> > + slli a4, t1, 7 /* a4 = 0x8080808080808080 */
> > + mul t2, a1, t1 /* entirety of t2 is now repeats of target character;
> > + assume mul is at worst no worse than 3*(shift+OR),
> > + otherwise do that instead */
> > +
> > +/*
> > + * strategy:
> > + * t4 = ((*a0) ^ t2)
> > + * - now t4 contains zero bytes if and only if next word of memory
> > + * had target character at those positions
> > + *
> > + * t4 = ((t4-0x0101010101010101) & ~t4) & 0x8080808080808080
> > + * - all nonzero bytes of t4 become 0; zero bytes become 0x80
> > + *
> > + * if t4 is nonzero, find the index of the byte within it, add to a0 and return
> > + * otherwise, loop
> > + */
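For anyone less familiar with the SWAR trick the comment above describes, it can be modeled in scalar C roughly as follows (a sketch for illustration only; `has_byte` is my name, not anything in the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the zero-byte trick described above: the result is
   nonzero iff some byte of WORD equals C.  Matching byte lanes become
   0x80; all other lanes become 0.  */
static uint64_t
has_byte (uint64_t word, unsigned char c)
{
  uint64_t ones = 0x0101010101010101ULL;
  uint64_t highs = ones << 7;        /* 0x8080808080808080 */
  uint64_t t = word ^ (ones * c);    /* zero byte wherever word matched c */
  return (t - ones) & ~t & highs;
}
```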
> > +
> > +1:
> > + ld t4, (a0) /* t4 = load next 8 bytes */
> > + xor t4, t4, t2
> > + sub t5, t4, t1
> > + not t4, t4
> > + and t4, t5, t4
> > + and t4, t4, a4
> > + bnez t4, .Lbytewise /* could use ctzw, mod+lookup or just binary chop
> > + to locate byte of interest in t4 but profiling
> > + shows these approaches are at best no better */
> > + addi a2, a2, -8
> > + addi a0, a0, 8
> > + bgeu a2, a3, 1b
> > + beqz a2, .Lnot_found
> > +#endif // __riscv_strict_align
> > +
> > +/* too little data for a dword. mask calculation and branch mispredict costs
> > + make checking a word not worthwhile. degrade to bytewise search. */
> > +
> > +.Lbytewise:
> > + add t2, a0, a2
> > +
> > +1:
> > + lb t1, (a0)
> > + beq t1, a1, .Lfound
> > + addi a0, a0, 1
> > + blt a0, t2, 1b
> > +
> > +.Lnot_found:
> > + mv a0, zero
> > +.Lfound:
> > + ret
> > +
> > +.Lvector_path:
> > + vsetvli t2, a2, e8, m2, ta, ma
> > +
> > +1:
> > + vle8.v v2, (a0)
> > + vmseq.vx v0, v2, a1
> > + vfirst.m t3, v0
> > + bgez t3, .Lvec_found
> > + add a0, a0, t2
> > + sub a2, a2, t2
> > + bge a2, t2, 1b
> > + bnez a2, 2f
> > + mv a0, zero
> > + ret
> > +
> > +2:
> > + vsetvli t2, a2, e8, m2, ta, ma
> > + vle8.v v2, (a0)
> > + vmseq.vx v0, v2, a1
> > + vfirst.m t3, v0
> > + bgez t3, .Lvec_found
> > + mv a0, zero
> > + ret
> > +
> > +.Lvec_found:
> > + add a0, a0, t3
> > + ret
> > +
> > +.size memchr, .-memchr
> > +libc_hidden_builtin_def (memchr)
> > \ No newline at end of file
>
> Please add a newline.
>
> > diff --git a/sysdeps/riscv/rv64/rvv/strcpy.S b/sysdeps/riscv/rv64/rvv/strcpy.S
> > new file mode 100644
> > index 0000000000..b21909d66f
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strcpy.S
> > @@ -0,0 +1,72 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
>
> You can add an optimized stpcpy and use it to implement strcpy on top of that
> (as my generic proposal does [1]). ARMv6 does something similar [2].
>
> [1] https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-12-adhemerval.zanella@linaro.org/
> [2] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/arm/armv6/strcpy.S;h=e9f63a56c1c605a21b05f7ac21412585b0705171;hb=HEAD
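The shape of that approach, modulo glibc's internal alias macros, is just the following (a sketch; `my_strcpy` is a placeholder name, not the glibc symbol):

```c
#include <assert.h>
#include <string.h>

/* Sketch of strcpy built on an optimized stpcpy: stpcpy returns a
   pointer to the copied null terminator, so strcpy only needs to save
   and return the original destination pointer.  */
char *
my_strcpy (char *dest, const char *src)
{
  stpcpy (dest, src);
  return dest;
}
```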
>
> > +#include <sysdep.h>
> > +
> > +.globl strcpy
> > +.type strcpy,@function
> > +
> > +/*
> > + * optimized strcpy for riscv with vector extension
> > + * assumptions:
> > + * - vlenb is a power of 2
> > + * - page size >= 2*vlenb
> > + */
> > +
> > +.align 2
> > +strcpy:
> > + mv t0, a0 /* copy dest so we can return it */
> > +
> > + csrr t1, vlenb /* find vlenb*2 */
> > + add t1, t1, t1
> > +
> > + addi t2, t1, -1 /* mask unaligned part of ptr */
> > + and t2, a1, t2
> > + beqz t2, .Laligned
> > +
> > + sub t2, t1, t2 /* search enough to align ptr */
> > + vsetvli t2, t2, e8, m2, tu, mu
> > + vle8.v v2, (a1)
> > + vmseq.vx v4, v2, zero
> > + vmsif.m v0, v4 /* copy but not past null */
> > + vfirst.m t3, v4
> > + vse8.v v2, (t0), v0.t
> > + bgez t3, .Ldone
> > + add t0, t0, t2
> > + add a1, a1, t2
> > +
> > +.Laligned:
> > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per pass */
> > +
> > +1:
> > + vle8.v v2, (a1)
> > + add a1, a1, t1
> > + vmseq.vx v4, v2, zero
> > + vmsif.m v0, v4
> > + vfirst.m t3, v4
> > + vse8.v v2, (t0), v0.t
> > + add t0, t0, t1
> > + bltz t3, 1b
> > +
> > +.Ldone:
> > + ret
> > +
> > +.size strcpy, .-strcpy
> > +libc_hidden_builtin_def (strcpy)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strcspn.c b/sysdeps/riscv/rv64/rvv/strcspn.c
> > new file mode 100644
> > index 0000000000..f0595a72fb
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strcspn.c
> > @@ -0,0 +1,22 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +/* strcspn is implemented in strspn.S
> > + */
> > diff --git a/sysdeps/riscv/rv64/rvv/strlen.S b/sysdeps/riscv/rv64/rvv/strlen.S
> > new file mode 100644
> > index 0000000000..c77d500693
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strlen.S
> > @@ -0,0 +1,67 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl strlen
> > +.type strlen,@function
> > +
> > +/*
> > + * optimized strlen for riscv with vector extension
> > + * assumptions:
> > + * - vlenb is a power of 2
> > + * - page size >= 2*vlenb
> > + */
> > +
> > +.align 2
> > +strlen:
> > + mv t4, a0 /* copy of buffer start */
> > + csrr t1, vlenb /* find vlenb*2 */
> > + add t1, t1, t1
> > + addi t2, t1, -1 /* mask off unaligned part of ptr */
> > + and t2, a0, t2
> > + beqz t2, .Laligned
> > +
> > + sub t2, t1, t2 /* search fwd to align ptr */
> > + vsetvli t2, t2, e8, m2, ta, ma
> > + vle8.v v2, (a0)
> > + vmseq.vx v0, v2, zero
> > + vfirst.m t3, v0
> > + bgez t3, .Lfound
> > + add a0, a0, t2
> > +
> > +.Laligned:
> > + vsetvli zero, t1, e8, m2, ta, ma /* search 2*vlenb bytes per pass */
> > + add t4, t4, t1
> > +
> > +1:
> > + vle8.v v2, (a0)
> > + add a0, a0, t1
> > + vmseq.vx v0, v2, zero
> > + vfirst.m t3, v0
> > + bltz t3, 1b
> > +
> > +.Lfound: /* found the 0; subtract */
> > + sub a0, a0, t4 /* buffer start from current ptr */
> > + add a0, a0, t3 /* and add offset into fetched */
> > + ret /* data to get length */
> > +
> > +.size strlen, .-strlen
> > +libc_hidden_builtin_def (strlen)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strncmp.S b/sysdeps/riscv/rv64/rvv/strncmp.S
> > new file mode 100644
> > index 0000000000..863e5cb525
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strncmp.S
> > @@ -0,0 +1,104 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl strncmp
> > +.type strncmp,@function
> > +
> > +.align 2
> > +
> > +/* as strcmp, but with added checks on a2 (max count)
> > + */
> > +
> > +strncmp:
> > + csrr t1, vlenb /* find vlenb*2 */
> > + add t1, t1, t1
> > + blt a2, t1, .Ltail /* degrade if max < vlenb*2 */
> > + vsetvli zero, t1, e8, m2, ta, mu
> > + vid.v v30
> > + addi t2, t1, -1 /* mask unaligned part of ptr */
> > + and t6, a0, t2 /* unaligned part of lhs */
> > + and t5, a1, t2 /* unaligned part of rhs */
> > + sub t6, t1, t6 /* safe count to read from lhs */
> > + sub t5, t1, t5 /* same, rhs */
> > + vmsltu.vx v28, v30, t6 /* mask for first part of lhs */
> > + vmsltu.vx v26, v30, t5 /* mask for first part of rhs */
> > + vmv.v.x v16, zero
> > + vmv.v.x v18, zero
> > +
> > +
> > +1: blt a2, t1, .Ltail
> > + vmv.v.v v0, v28 /* lhs mask */
> > + vle8.v v2, (a0), v0.t /* masked load from lhs */
> > + vmseq.vx v16, v2, zero, v0.t /* check loaded bytes for null */
> > + vmv.v.v v0, v26 /* rhs mask */
> > + vfirst.m t2, v16 /* get lhs check result */
> > + bgez t2, .Ltail /* can we safely check rest */
> > + vle8.v v4, (a1), v0.t /* masked load from rhs */
> > + vmseq.vx v18, v4, zero, v0.t /* check partial rhs */
> > + vmnot.m v0, v28 /* mask for rest of lhs */
> > + vfirst.m t3, v18 /* get check result */
> > + bltz t3, 2f /* test it */
> > + bge t3, t6, .Ltail
> > +
> > + vmsleu.vx v0, v30, t3 /* select rest of string + null */
> > + vmsne.vv v0, v2, v4, v0.t /* compare */
> > + vfirst.m t3, v0
> > + bgez t3, 3f
> > + mv a0, zero
> > + ret
> > +3: add a0, a0, t3
> > + add a1, a1, t3
> > + lbu t0, (a0)
> > + lbu t1, (a1)
> > +.Ldiff:
> > + sub a0, t0, t1
> > + ret
> > +
> > + /* ...no null terminator in first part of lhs or rhs */
> > +2: vle8.v v2, (a0), v0.t /* load rest of lhs */
> > + vmnot.m v0, v26 /* mask for rest of rhs */
> > + vle8.v v4, (a1), v0.t /* load rest of rhs */
> > + vmsne.vv v0, v2, v4 /* compare */
> > + add a0, a0, t1 /* advance ptrs */
> > + vfirst.m t3, v0
> > + add a1, a1, t1
> > + sub a2, a2, t1
> > + bltz t3, 1b
> > +
> > + sub t3, t3, t1 /* found a diff but we've already advanced a0 and a1 */
> > + j 3b
> > +
> > +.Ltail:
> > + beqz a2, 1f
> > + addi a2, a2, -1
> > + lbu t0, (a0)
> > + lbu t1, (a1)
> > + bne t0, t1, .Ldiff
> > + addi a0, a0, 1
> > + addi a1, a1, 1
> > + bnez t0, .Ltail
> > +1: mv a0, zero
> > + ret
> > +
> > +
> > +.size strncmp, .-strncmp
> > +libc_hidden_builtin_def (strncmp)
> > \ No newline at end of file
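As a cross-check for the .Ltail fallback above: the semantics the whole routine has to preserve are just the textbook byte loop (a sketch under a placeholder name, not the glibc generic code):

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics for strncmp: compare at most N bytes, stopping
   at the first difference or at a null terminator in either string.  */
int
my_strncmp (const char *a, const char *b, size_t n)
{
  for (; n > 0; --n, ++a, ++b)
    {
      unsigned char ca = *(const unsigned char *) a;
      unsigned char cb = *(const unsigned char *) b;
      if (ca != cb)
        return ca - cb;
      if (ca == '\0')
        return 0;
    }
  return 0;
}
```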
> > diff --git a/sysdeps/riscv/rv64/rvv/strncpy.S b/sysdeps/riscv/rv64/rvv/strncpy.S
> > new file mode 100644
> > index 0000000000..8b3a1e545c
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strncpy.S
> > @@ -0,0 +1,96 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl strncpy
> > +.type strncpy,@function
> > +
> > +/*
> > + * optimized strncpy for riscv with vector extension
> > + * assumptions:
> > + * - vlenb is a power of 2
> > + * - page size >= 2*vlenb
> > + */
> > +
> > +.align 2
> > +strncpy:
> > + mv t0, a0 /* need to return dest so copy */
> > +
> > + csrr t1, vlenb /* find vlenb*2 */
> > + add t1, t1, t1
> > +
> > + addi t2, t1, -1 /* mask off unaligned part of ptr */
> > + and t2, a1, t2
> > + beqz t2, .Laligned
> > +
> > + sub t2, t1, t2 /* search to align the pointer */
> > + vsetvli zero, t2, e8, m2, tu, mu
> > + vle8.v v2, (a1)
> > + vmseq.vx v4, v2, zero
> > + vmsif.m v0, v4 /* copy to dest */
> > + vfirst.m t3, v4
> > + bgeu t2, a2, .Ldest_full
> > + vse8.v v2, (t0), v0.t
> > + bgez t3, .Lterminator_found
> > + add t0, t0, t2
> > + add a1, a1, t2
> > + sub a2, a2, t2
> > + beqz a2, .Ldone
> > +
> > +.Laligned:
> > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per pass */
> > +
> > +1:
> > + vle8.v v2, (a1)
> > + add a1, a1, t1
> > + vmseq.vx v4, v2, zero
> > + vmsif.m v0, v4
> > + vfirst.m t3, v4
> > + bgeu t1, a2, .Ldest_full
> > + vse8.v v2, (t0), v0.t
> > + add t0, t0, t1
> > + sub a2, a2, t1
> > + bltz t3, 1b
> > + sub t0, t0, t1
> > +
> > +.Lterminator_found:
> > + addi sp, sp, -16
> > + sd ra, 0(sp)
> > + sd a0, 8(sp)
> > + add a0, t0, t3
> > + mv a1, zero
> > + sub a2, a2, t3
> > + jal ra, memset
> > + ld ra, 0(sp)
> > + ld a0, 8(sp)
> > + addi sp, sp, 16
> > +.Ldone:
> > + ret
> > +
> > +.Ldest_full:
> > + vid.v v6
> > + vmsltu.vx v4, v6, a2
> > + vmand.mm v0, v0, v4
> > + vse8.v v2, (t0), v0.t
> > + ret
> > +
> > +.size strncpy, .-strncpy
> > +libc_hidden_builtin_def (strncpy)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strnlen.S b/sysdeps/riscv/rv64/rvv/strnlen.S
> > new file mode 100644
> > index 0000000000..6d7ee65c7a
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strnlen.S
> > @@ -0,0 +1,81 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
>
> Maybe use a generic implementation that calls memchr (which should be optimized
> using vector instructions) [3]? It would be an extra function call, but it should
> really help with both code size and icache pressure.
>
> [3] https://patchwork.sourceware.org/project/glibc/patch/20230201170406.303978-6-adhemerval.zanella@linaro.org/
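The generic version referenced above boils down to this (a sketch; the name is a placeholder):

```c
#include <assert.h>
#include <string.h>

/* Sketch of strnlen implemented on top of memchr: the optimized memchr
   does the vectorized scan, and the length falls out of its return
   value.  */
size_t
my_strnlen (const char *s, size_t maxlen)
{
  const char *p = memchr (s, '\0', maxlen);
  return p == NULL ? maxlen : (size_t) (p - s);
}
```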
>
> > +.globl __strnlen
> > +.type __strnlen,@function
> > +
> > +/* vector optimized strnlen
> > + * assume it's safe to read to the end of the page
> > + * containing either a null terminator or the last byte of the count or both,
> > + * but not past it
> > + * assume page size >= vlenb*2
> > + */
> > +
> > +.align 2
> > +__strnlen:
> > + mv t4, a0 /* stash a copy of start for later */
> > + beqz a1, .LzeroCount
> > +
> > + csrr t1, vlenb /* find vlenb*2 */
> > + add t1, t1, t1
> > + addi t2, t1, -1 /* mask off unaligned part of ptr */
> > + and t2, t2, a0
> > + beqz t2, .Laligned
> > +
> > + sub t2, t1, t2 /* search to align pointer to t1 */
> > + bgeu t2, a1, 2f /* check it's safe */
> > + mv t2, a1 /* it's not! look as far as permitted */
> > +2: vsetvli t2, t2, e8, m2, ta, ma
> > + vle8.v v2, (a0)
> > + vmseq.vx v0, v2, zero
> > + vfirst.m t3, v0
> > + bgez t3, .Lfound
> > + add a0, a0, t2
> > + sub a1, a1, t2
> > + bltu a1, t1, .LreachedCount
> > +
> > +.Laligned:
> > + vsetvli zero, t1, e8, m2, ta, ma /* do 2*vlenb bytes per pass */
> > +
> > +1: vle8.v v2, (a0)
> > + sub a1, a1, t1
> > + vmseq.vx v0, v2, zero
> > + vfirst.m t3, v0
> > + bgez t3, .Lfound
> > + add a0, a0, t1
> > + bgeu a1, t1, 1b
> > +.LreachedCount:
> > + mv t2, a1 /* in case 0 < a1 < t1 */
> > + bnez a1, 2b /* if so, still t2 bytes to check, all safe */
> > +.LzeroCount:
> > + sub a0, a0, t4
> > + ret
> > +
> > +.Lfound: /* found the 0; subtract buffer start from current pointer */
> > + add a0, a0, t3 /* and add offset into fetched data */
> > + sub a0, a0, t4
> > + ret
> > +
> > +.size __strnlen, .-__strnlen
> > +weak_alias (__strnlen, strnlen)
> > +libc_hidden_builtin_def (__strnlen)
> > +libc_hidden_builtin_def (strnlen)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strrchr.S b/sysdeps/riscv/rv64/rvv/strrchr.S
> > new file mode 100644
> > index 0000000000..4bef8a3b9c
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strrchr.S
> > @@ -0,0 +1,88 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
>
> Is it really worth adding a strrchr optimization? The generic implementation
> already calls strchr (which should be optimized).
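i.e. the generic pattern, where all the scanning cost lands in the already-optimized strchr (a sketch; placeholder name):

```c
#include <assert.h>
#include <string.h>

/* Sketch of the generic strrchr: repeated strchr calls, so all the
   byte scanning happens inside the vectorized strchr.  */
char *
my_strrchr (const char *s, int c)
{
  if ((char) c == '\0')
    return strchr (s, '\0');

  char *found = NULL;
  char *p;
  while ((p = strchr (s, c)) != NULL)
    {
      found = p;
      s = p + 1;
    }
  return found;
}
```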
>
> > +
> > +.globl strrchr
> > +.type strrchr,@function
> > +
> > +/*
> > + * optimized strrchr for riscv with vector extension
> > + * assumptions:
> > + * - vlenb is a power of 2
> > + * - page size >= 2*vlenb
> > + */
> > +
> > +.align 2
> > +strrchr:
> > + mv t5, a0 /* stash buffer ptr somewhere safe */
> > + mv a0, zero /* result is nullptr unless we find better below */
> > +
> > + csrr t1, vlenb /* determine vlenb*2 */
> > + add t1, t1, t1
> > + addi t2, t1, -1 /* mask off unaligned part of ptr */
> > + and t2, t5, t2
> > + beqz t2, .Laligned
> > +
> > + sub t2, t1, t2 /* search to align ptr to 2*vlenb */
> > + vsetvli t2, t2, e8, m2, ta, mu
> > +
> > + vle8.v v2, (t5) /* load data into v2(,v3) */
> > + vmseq.vx v4, v2, zero /* check for null terminator */
> > + vfirst.m t4, v4 /* grab its position, if any */
> > + vmsbf.m v0, v4 /* select valid chars */
> > + vmseq.vx v0, v2, a1, v0.t /* search for candidate byte */
> > + vfirst.m t3, v0 /* grab its position, if any */
> > + bltz t3, 2f /* did we find a candidate? */
> > +
> > +3: add a0, t3, t5 /* we did! grab the address */
> > + vmsof.m v1, v0 /* there might be more than one */
> > + vmandn.mm v0, v0, v1 /* so clear the one we just found */
> > + vfirst.m t3, v0 /* is there another? */
> > + bgez t3, 3b
> > +
> > +2: bgez t4, .Ldone /* did we see a null terminator? */
> > + add t5, t5, t2
> > +
> > +.Laligned:
> > + vsetvli zero, t1, e8, m2, ta, mu /* now do 2*vlenb bytes per pass */
> > +
> > +1: vle8.v v2, (t5)
> > + vmseq.vx v4, v2, zero
> > + vfirst.m t4, v4
> > + vmsbf.m v0, v4
> > + vmseq.vx v0, v2, a1, v0.t
> > + vfirst.m t3, v0
> > + bltz t3, 2f
> > +
> > +3: add a0, t3, t5
> > + vmsof.m v1, v0
> > + vmandn.mm v0, v0, v1
> > + vfirst.m t3, v0
> > + bgez t3, 3b
> > +
> > +2: add t5, t5, t1
> > + bltz t4, 1b
> > +
> > +.Ldone:
> > + ret
> > +
> > +.size strrchr, .-strrchr
> > +libc_hidden_builtin_def (strrchr)
> > \ No newline at end of file
> > diff --git a/sysdeps/riscv/rv64/rvv/strspn.S b/sysdeps/riscv/rv64/rvv/strspn.S
> > new file mode 100644
> > index 0000000000..2b9af5cc2d
> > --- /dev/null
> > +++ b/sysdeps/riscv/rv64/rvv/strspn.S
> > @@ -0,0 +1,189 @@
> > +
> > +/* Copyright (C) 2012-2023 Free Software Foundation, Inc.
> > +
> > + This file is part of the GNU C Library.
> > +
> > + The GNU C Library is free software; you can redistribute it and/or
> > + modify it under the terms of the GNU Lesser General Public
> > + License as published by the Free Software Foundation; either
> > + version 2.1 of the License, or (at your option) any later version.
> > +
> > + The GNU C Library is distributed in the hope that it will be useful,
> > + but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> > + Lesser General Public License for more details.
> > +
> > + You should have received a copy of the GNU Lesser General Public
> > + License along with the GNU C Library. If not, see
> > + <https://www.gnu.org/licenses/>. */
> > +
> > +
> > +#include <sysdep.h>
> > +
> > +.globl strspn
> > +.type strspn,@function
> > +
> > +.globl strcspn
> > +.type strcspn,@function
> > +
> > +/*
> > + * optimized strspn / strcspn for riscv with vector extension
> > + * assumptions:
> > + * - vlenb is a power of 2
> > + * - page size >= 32
> > + * strategy:
> > + * - build a 256-bit table on the stack, where each elt is zero
> > + * if encountering it should terminate computation and nonzero otherwise
> > + * - use vectorised lookups into this to check 2*vlen elts at a time;
> > + * this code is identical for strspn and strcspn and can be shared
> > + *
> > + * note that while V mandates at least 128 bit wide regs,
> > + * we are building a 256 bit lookup table
> > + * therefore we use either LMUL=1 or 2 depending on what the target supports
> > + * therefore we only use even vector register numbers,
> > + * so everything still works if we go with LMUL=2
> > + */
> > +
>
> I wonder if we could adapt the generic implementation, so riscv only reimplements
> the vectorized search instead of all the boilerplate to generate the table and
> early tests.
+1
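As a rough model of what the shared generic part could provide, with the scalar scan loop as the piece an arch would replace with vectors (names and shape are mine, not the actual proposal):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: build the 256-bit membership table once, then scan.  strspn
   stops at the first byte *not* in the set (invert == 0); strcspn stops
   at the first byte that *is* in the set (invert == 1).  */
static size_t
span_with_table (const char *s, const char *set, int invert)
{
  unsigned char table[32] = { 0 };
  for (; *set != '\0'; ++set)
    {
      unsigned char c = (unsigned char) *set;
      table[c >> 3] |= (unsigned char) (1 << (c & 7));
    }

  const char *p = s;
  for (;; ++p)
    {
      unsigned char c = (unsigned char) *p;
      if (c == '\0' || (((table[c >> 3] >> (c & 7)) & 1) == invert))
        break;
    }
  return (size_t) (p - s);
}
```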
>
> > +# -----------------------------
> > +
> > +.align 2
> > +
> > +strspn:
> > + lbu t0, 0(a1)
> > + bnez t0, .Lbuild_table
> > + mv a0, zero
> > + ret
> > +
> > +.Lbuild_table:
> > + mv a6, a0 /* store incoming a0 */
> > + li t1, 32 /* want to deal with 256 bits at a time, so 32 bytes */
> > +
> > + vsetvli zero, t1, e8, m1, tu, mu
> > +#if __riscv_v_min_vlen < 256
> > + /* we want to build a 256-bit table, so use vlenb*2,
> > + * m2 if regs are 128 bits wide or vlenb, m1 if >= 256
> > + * 'V' extension specifies a minimum vlen of 128 so this should cover
> > + * all cases; we can skip the check if we know vlen >= 256 at compile time
> > + */
> > + csrr t2, vlenb
> > + bgeu t2, t1, 1f
> > + vsetvli zero, t1, e8, m2, tu, mu
> > +1:
> > +#endif // __riscv_v_min_vlen
> > +
> > + /* read one char from the charset at a time and write the correct bit
> > + * in the lookup table; we could do SIMD iff we ever get an extension
> > + * that provides some way of scattering bytes into a reg group
> > + */
> > + vmv.v.x v16, zero /* clear out table */
> > + vmv.v.x v8, zero /* clear out v8 */
> > + li t3, 1
> > + vmv.s.x v8, t3 /* v8 now all zeroes except bottom byte */
> > +
> > +1: vmv.v.x v2, zero /* clear out v2 */
> > + addi a1, a1, 1 /* advance charset ptr */
> > + srli t2, t0, 3 /* divide the byte we read earlier by 8 */
> > + vslideup.vx v2, v8, t2 /* v2 now 1 in the correct byte 0 elsewhere */
> > + vsll.vx v2, v2, t0 /* v2 now 1 in the correct bit, 0 elsewhere */
> > + vor.vv v16, v16, v2 /* or it in */
> > + lbu t0, 0(a1) /* fetch next byte */
> > + bnez t0, 1b /* if it's not null, go round again */
> > +
> > +/*
> > + * Table is now built in v16.
> > + * Strategy:
> > + * - fetch next t1 bytes from memory
> > + * - vrgather on their values divided by 8 to get relevant bytes of table
> > + * - shift right to get the correct bit into bit 1
> > + * - and with 1, compare with expected terminator value, then check mask
> > + * to see if we've found a terminator
> > + *
> > + * Before we can begin, a0 needs to be t1-aligned, so that when we fetch
> > + * the next t1 bytes - any of which may be the null terminator -
> > + * we do not cross a page boundary and read unmapped memory. Therefore
> > + * we have one read of however many bytes are needed to align a0,
> > + * before the main loop.
> > + */
> > +
> > +.Lscan_table:
> > + vmv.v.x v8, t3 /* v8 now t1 bytes of 0x01 */
> > +
> > + and t2, a0, t1 /* mask to align to t1 */
> > + beqz t2, 2f /* or skip if we're already aligned */
> > + sub t2, t1, t2 /* t2 now bytes to read to align to t1 */
> > +
> > + vid.v v2 /* build mask instead of changing vl */
> > + vmsltu.vx v0, v2, t2 /* so we don't need to track LMUL */
> > +
> > + vle8.v v2, (a0), v0.t /* load next bytes from input */
> > + vsrl.vi v4, v2, 3 /* divide by 8 */
> > + vrgather.vv v6, v16, v4 /* corresponding bytes of bit table */
> > + vsrl.vv v6, v6, v2 /* shift correct bits to lsb */
> > + vand.vv v6, v6, v8 /* and with 1 to complete the lookups */
> > + vmseq.vx v4, v6, zero, v0.t /* check to see if any 0s are present */
> > + vfirst.m t0, v4 /* index of the first 0, if any */
> > + bgez t0, .Lscan_end /* if we found one, stop */
> > + add a0, a0, t2 /* advance by number of bytes we read */
> > +
> > +2: add a6, a6, t1 /* we'll advance a0 before the exit check */
> > +1: vle8.v v2, (a0) /* as above but unmasked so t1 elts per pass */
> > + add a0, a0, t1
> > +
> > + vsrl.vi v4, v2, 3
> > + vrgather.vv v6, v16, v4
> > + vsrl.vv v6, v6, v2
> > + vand.vv v6, v6, v8
> > +
> > + vmseq.vx v4, v6, zero
> > + vfirst.m t0, v4
> > + bltz t0, 1b
> > +
> > +.Lscan_end:
> > + add a0, a0, t0 /* calculate offset to terminating byte */
> > + sub a0, a0, a6
> > + ret
> > +.size strspn, .-strspn
> > +
> > +/* strcspn
> > + *
> > + * table build exactly as for strspn, except:
> > + * - the lookup table starts with all bits except bit 0 of byte 0 set
> > + * - we clear the corresponding bit for each byte in the charset
> > + * Once the table is built, we can reuse the scan code directly.
> > + */
> > +
> > +strcspn:
> > + lbu t0, 0(a1)
> > + beqz t0, strlen /* no rejections -> prefix is whole string */
> > +
> > + mv a6, a0
> > + li t1, 32
> > +
> > + vsetvli zero, t1, e8, m1, tu, mu
> > +#if __riscv_v_min_vlen < 256
> > + csrr t2, vlenb
> > + bgeu t2, t1, 1f
> > + vsetvli zero, t1, e8, m2, tu, mu
> > +1:
> > +#endif // __riscv_v_min_vlen
> > +
> > + vmv.v.x v8, zero
> > + li t3, 1 /* all bits clear except bit 0 of byte 0 */
> > + vmv.s.x v8, t3
> > + vnot.v v16, v8 /* v16 is the inverse of that */
> > + li t4, -1
> > +
> > +1: vmv.v.x v2, zero
> > + addi a1, a1, 1 /* advance charset ptr */
> > + srli t2, t0, 3 /* select correct bit in v2 */
> > + vslideup.vx v2, v8, t2
> > + vsll.vx v2, v2, t0
> > + vnot.v v2, v2 /* invert */
> > + vand.vv v16, v16, v2 /* clear the relevant bit of table */
> > + lbu t0, 0(a1)
> > + bnez t0, 1b
> > + j .Lscan_table
> > +.size strcspn, .-strcspn
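
The complement trick reads nicely in scalar C as well. Again a sketch of mine,
not the patch itself: start from the complement table (every bit set except
bit 0 of byte 0, so NUL still stops the scan), clear the bit for each reject
byte, and then run the exact same scan loop as strspn, which is why the
assembly can jump straight to .Lscan_table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar sketch of the strcspn table build plus the shared scan.  */
static size_t
strcspn_bitmap (const char *s, const char *reject)
{
  uint8_t table[32];
  memset (table, 0xff, sizeof table);    /* start from all-ones */
  table[0] &= (uint8_t) ~1u;             /* ...except NUL's bit  */

  /* Clear the bit for each byte in the reject set (vnot + vand).  */
  for (const unsigned char *p = (const unsigned char *) reject; *p; p++)
    table[*p >> 3] &= (uint8_t) ~(1u << (*p & 7));

  /* Scan loop identical to the strspn case.  */
  size_t n = 0;
  const unsigned char *u = (const unsigned char *) s;
  while ((table[u[n] >> 3] >> (u[n] & 7)) & 1)
    n++;
  return n;
}
```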
> > +
> > +libc_hidden_builtin_def (strspn)
> > +libc_hidden_builtin_def (strcspn)
> > \ No newline at end of file