[RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu

public inbox for libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
@ 2023-02-07  0:15 Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready Christoph Muellner
                   ` (20 more replies)
  0 siblings, 21 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:15 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This RFC series introduces ifunc support for RISC-V and adds
optimized routines of memset(), memcpy()/memmove(), strlen(),
strcmp(), strncmp(), and cpu_relax().

The ifunc mechanism desides based on the following hart features:
- Available extensions
- Cache block size
- Fast unaligned accesses

Since we don't have an interface to get this information from the
kernel (at the moment), this patch uses environment variables instead,
which is also why this patch should not be considered for upstream
inclusion and is explicitly tagged as RFC.

The environment variables are:
- RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
- RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
- RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
- RISCV_RT_FAST_UNALIGNED (e.g. "1")

The environment variables are looked up and parsed early during
startup, where other architectures query similar properties from
the kernel or the CPU.
The ifunc implementation can use test macros to select a matching
implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

The following optimized routines exist:
- memset
- memcpy/memmove
- strlen
- strcmp
- strncmp
- cpu_relax

The following optimizations have been applied:
- excessive loop unrolling
- Zbb's orc.b instruction
- Zbb's ctz intruction
- Zicboz/Zic64b ability to clear a cache block in memory
- Fast unaligned accesses (but with keeping exception guarantees intact)
- Fast overlapping accesses

The patch was developed more than a year ago and was tested as part
of a vendor SDK since then. One of the areas where this patchset
was used is benchmarking (e.g. SPEC CPU2017).
The optimized string functions have been tested with the glibc tests
for that purpose.

The first patch of the series does not strictly belong to this series,
but was required to build and test SPEC CPU2017 benchmarks.

To build a cross-toolchain that includes these patches,
the riscv-gnu-toolchain or any other cross-toolchain
builder can be used.

Christoph Müllner (19):
  Inhibit early libcalls before ifunc support is ready
  riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
  riscv: Add ENTRY_ALIGN() macro
  riscv: Add hart feature run-time detection framework
  riscv: Introduction of ISA extensions
  riscv: Adding ISA string parser for environment variables
  riscv: hart-features: Add fast_unaligned property
  riscv: Add (empty) ifunc framework
  riscv: Add ifunc support for memset
  riscv: Add accelerated memset routines for RV64
  riscv: Add ifunc support for memcpy/memmove
  riscv: Add accelerated memcpy/memmove routines for RV64
  riscv: Add ifunc support for strlen
  riscv: Add accelerated strlen routine
  riscv: Add ifunc support for strcmp
  riscv: Add accelerated strcmp routines
  riscv: Add ifunc support for strncmp
  riscv: Add an optimized strncmp routine
  riscv: Add __riscv_cpu_relax() to allow yielding in busy loops

 csu/libc-start.c                              |   1 +
 elf/dl-support.c                              |   1 +
 sysdeps/riscv/dl-machine.h                    |  13 +
 sysdeps/riscv/ldsodefs.h                      |   1 +
 sysdeps/riscv/multiarch/Makefile              |  24 +
 sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
 sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
 sysdeps/riscv/multiarch/init-arch.h           |  24 +
 sysdeps/riscv/multiarch/memcpy.c              |  49 ++
 sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
 .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c             |  49 ++
 sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
 sysdeps/riscv/multiarch/memset.c              |  52 ++
 sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
 .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
 .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
 sysdeps/riscv/multiarch/strcmp.c              |  47 ++
 sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
 sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
 .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
 sysdeps/riscv/multiarch/strlen.c              |  44 ++
 sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
 sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
 sysdeps/riscv/multiarch/strncmp.c             |  44 ++
 sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
 sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
 sysdeps/riscv/sys/asm.h                       |  14 +-
 .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
 sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
 .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
 sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
 .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
 37 files changed, 2610 insertions(+), 5 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/Makefile
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
 create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
 create mode 100644 sysdeps/riscv/multiarch/init-arch.h
 create mode 100644 sysdeps/riscv/multiarch/memcpy.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memmove.c
 create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset.c
 create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/strlen.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strncmp.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
 create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h

-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 02/19] riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol Christoph Muellner
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

One of the few tasks in __libc_start_main_impl, before
ifunc support is ready on many architectures is to process
the AUX vector. GCC is able to detect libcall routines in
this code, which will result in invocations of uninitialized
ifunc pointers.

Let's set the proper attributes to these early functions
to avoid avoid libcalls.

This was observed to be an issue (endless loop) in combination with:
- GCC upstream/master
- glibc upstream/master
- glibc built with -O3
- target arch RISC-V (RV64)
- experimental RISC-V ifunc support patches
Other combinations/architectures might be affected as well.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 csu/libc-start.c | 1 +
 elf/dl-support.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/csu/libc-start.c b/csu/libc-start.c
index c3bb6d09bc..8566a54df5 100644
--- a/csu/libc-start.c
+++ b/csu/libc-start.c
@@ -231,6 +231,7 @@ STATIC int LIBC_START_MAIN (int (*main) (int, char **, char **
    locate constructors and destructors.  For statically linked
    executables, the relevant symbols are access directly.  */
 STATIC int
+inhibit_loop_to_libcall
 LIBC_START_MAIN (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
 		 int argc, char **argv,
 #ifdef LIBC_START_MAIN_AUXVEC_ARG
diff --git a/elf/dl-support.c b/elf/dl-support.c
index 9714f75db0..b0e9e1636a 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -242,6 +242,7 @@ __rtld_lock_define_initialized_recursive (, _dl_load_tls_lock)
 int _dl_clktck;
 
 void
+inhibit_loop_to_libcall
 _dl_aux_init (ElfW(auxv_t) *av)
 {
 #ifdef NEED_DL_SYSINFO
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 02/19] riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 03/19] riscv: Add ENTRY_ALIGN() macro Christoph Muellner
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

It is common practice in glibc to use C_LABEL() to construct the asm
name for a C symbol. Let's do this for RISC-V as well, even if this
is essentially a non-functional change.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/sys/asm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/riscv/sys/asm.h b/sysdeps/riscv/sys/asm.h
index 5432f2d5d2..b782cfa2f2 100644
--- a/sysdeps/riscv/sys/asm.h
+++ b/sysdeps/riscv/sys/asm.h
@@ -51,7 +51,7 @@
 		.globl	symbol;			\
 		.align	2;			\
 		.type	symbol,@function;	\
-symbol:						\
+		C_LABEL(symbol)			\
 		cfi_startproc;
 
 /* Mark end of function.  */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 03/19] riscv: Add ENTRY_ALIGN() macro
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 02/19] riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 04/19] riscv: Add hart feature run-time detection framework Christoph Muellner
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds an ENTRY_ALIGN() macro to generate
aligned function symbols in assembly files.
Since the LEAF() macro is a special-case of that,
we change LEAF() to be reflect this.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/sys/asm.h | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/sysdeps/riscv/sys/asm.h b/sysdeps/riscv/sys/asm.h
index b782cfa2f2..de6394b984 100644
--- a/sysdeps/riscv/sys/asm.h
+++ b/sysdeps/riscv/sys/asm.h
@@ -46,14 +46,18 @@
 # endif
 #endif
 
-/* Declare leaf routine.  */
-#define	LEAF(symbol)				\
-		.globl	symbol;			\
-		.align	2;			\
-		.type	symbol,@function;	\
+/* Define an entry point visible from C with custom p2-alignment.  */
+#define	ENTRY_ALIGN(symbol, align)		\
+		.globl symbol;			\
+		.p2align align;			\
+		.type symbol,@function;		\
 		C_LABEL(symbol)			\
 		cfi_startproc;
 
+/* Declare leaf routine.  */
+#define	LEAF(symbol)				\
+	ENTRY_ALIGN (symbol, 1)
+
 /* Mark end of function.  */
 #undef END
 #define END(function)				\
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 04/19] riscv: Add hart feature run-time detection framework
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (2 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 03/19] riscv: Add ENTRY_ALIGN() macro Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 05/19] riscv: Introduction of ISA extensions Christoph Muellner
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch introduces a framework to detect and store hart features
(e.g. ISA extensions and their parameters) for RISC-V.

This patch does not introduce a concrete mechanism for run-time
detection, but implements everything so that such a mechanism
can be introduced.

Most of the changes in this patch are inspired by similar code
for other architectures, so nothing surprising should be hidden here.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/dl-machine.h                    | 13 ++++
 sysdeps/riscv/ldsodefs.h                      |  1 +
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   | 62 +++++++++++++++++++
 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   | 46 ++++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.c | 43 +++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.h | 26 ++++++++
 sysdeps/unix/sysv/linux/riscv/libc-start.c    | 29 +++++++++
 7 files changed, 220 insertions(+)
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
 create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
 create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c

diff --git a/sysdeps/riscv/dl-machine.h b/sysdeps/riscv/dl-machine.h
index c0c9bd93ad..43f4f96c0e 100644
--- a/sysdeps/riscv/dl-machine.h
+++ b/sysdeps/riscv/dl-machine.h
@@ -28,6 +28,7 @@
 #include <dl-irel.h>
 #include <dl-static-tls.h>
 #include <dl-machine-rel.h>
+#include <hart-features.c>
 
 #ifndef _RTLD_PROLOGUE
 # define _RTLD_PROLOGUE(entry)						\
@@ -148,6 +149,18 @@ elf_machine_fixup_plt (struct link_map *map, lookup_t t,
   return *reloc_addr = value;
 }
 
+#define DL_PLATFORM_INIT dl_platform_init ()
+
+static inline void __attribute__ ((unused))
+dl_platform_init (void)
+{
+#ifdef SHARED
+  /* init_hart_features has been called early from __libc_start_main in
+     static executable.  */
+  init_hart_features (&GLRO(dl_riscv_hart_features));
+#endif /* SHARED */
+}
+
 #endif /* !dl_machine_h */
 
 #ifdef RESOLVE_MAP
diff --git a/sysdeps/riscv/ldsodefs.h b/sysdeps/riscv/ldsodefs.h
index 90e95e60c5..4b184de255 100644
--- a/sysdeps/riscv/ldsodefs.h
+++ b/sysdeps/riscv/ldsodefs.h
@@ -20,6 +20,7 @@
 #define _RISCV_LDSODEFS_H 1
 
 #include <elf.h>
+#include <hart-features.h>
 
 struct La_riscv_regs;
 struct La_riscv_retval;
diff --git a/sysdeps/unix/sysv/linux/riscv/dl-procinfo.c b/sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
new file mode 100644
index 0000000000..ce137d10c4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
@@ -0,0 +1,62 @@
+/* Data for RISC-V version of processor capability information.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* This information must be kept in sync with the _DL_PLATFORM_COUNT
+   definitions in procinfo.h.
+
+   If anything should be added here check whether the size of each string
+   is still ok with the given array size.
+
+   All the #ifdefs in the definitions are quite irritating but
+   necessary if we want to avoid duplicating the information.  There
+   are three different modes:
+
+   - PROCINFO_DECL is defined.  This means we are only interested in
+     declarations.
+
+   - PROCINFO_DECL is not defined:
+
+     + if SHARED is defined the file is included in an array
+       initializer.  The .element = { ... } syntax is needed.
+
+     + if SHARED is not defined a normal array initialization is
+       needed.
+  */
+
+#ifndef PROCINFO_CLASS
+# define PROCINFO_CLASS
+#endif
+
+#if !IS_IN (ldconfig)
+# if !defined PROCINFO_DECL && defined SHARED
+  ._dl_riscv_hart_features
+# else
+PROCINFO_CLASS struct hart_features _dl_riscv_hart_features
+# endif
+# ifndef PROCINFO_DECL
+= { }
+# endif
+# if !defined SHARED || defined PROCINFO_DECL
+;
+# else
+,
+# endif
+#endif
+
+#undef PROCINFO_DECL
+#undef PROCINFO_CLASS
diff --git a/sysdeps/unix/sysv/linux/riscv/dl-procinfo.h b/sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
new file mode 100644
index 0000000000..27aaebe02d
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
@@ -0,0 +1,46 @@
+/* RISC-V version of processor capability information handling macros.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef _DL_PROCINFO_H
+#define _DL_PROCINFO_H	1
+
+#include <sys/auxv.h>
+#include <unistd.h>
+#include <ldsodefs.h>
+#include <sysdep.h>
+
+/* We cannot provide a general printing function.  */
+#define _dl_procinfo(word, val) -1
+
+/* There are no hardware capabilities defined.  */
+#define _dl_hwcap_string(idx) ""
+
+/* By default there is no important hardware capability.  */
+#define HWCAP_IMPORTANT (0)
+
+/* We don't have any hardware capabilities.  */
+#define _DL_HWCAP_COUNT	0
+
+#define _dl_string_hwcap(str) (-1)
+
+/* There're no platforms to filter out.  */
+#define _DL_HWCAP_PLATFORM 0
+
+#define _dl_string_platform(str) (-1)
+
+#endif /* dl-procinfo.h */
diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.c b/sysdeps/unix/sysv/linux/riscv/hart-features.c
new file mode 100644
index 0000000000..41111eff57
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.c
@@ -0,0 +1,43 @@
+/* Initialize hart feature data.  RISC-V version.
+   This file is part of the GNU C Library.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <hart-features.h>
+
+/* The code in this file is executed very early, so we cannot call
+   indirect functions because ifunc support is not initialized.
+   Therefore this file adds a few simple helper functions to avoid
+   dependencies to functions outside of this file.  */
+
+static inline void
+inhibit_loop_to_libcall
+simple_memset (void *s, int c, size_t n)
+{
+  char *p = (char*)s;
+  while (n != 0)
+    {
+      *p = c;
+      n--;
+    }
+}
+
+/* Discover hart features and store them.  */
+static inline void
+init_hart_features (struct hart_features *hart_features)
+{
+  simple_memset (hart_features, 0, sizeof (*hart_features));
+}
diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.h b/sysdeps/unix/sysv/linux/riscv/hart-features.h
new file mode 100644
index 0000000000..a417cbc326
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.h
@@ -0,0 +1,26 @@
+/* Initialize CPU feature data.  RISC-V version.
+   This file is part of the GNU C Library.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef _CPU_FEATURES_RISCV_H
+#define _CPU_FEATURES_RISCV_H
+
+struct hart_features
+{
+};
+
+#endif /* _CPU_FEATURES_RISCV_H  */
diff --git a/sysdeps/unix/sysv/linux/riscv/libc-start.c b/sysdeps/unix/sysv/linux/riscv/libc-start.c
new file mode 100644
index 0000000000..57c7c09223
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/libc-start.c
@@ -0,0 +1,29 @@
+/* Override csu/libc-start.c on RISC-V
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef SHARED
+
+# include <ldsodefs.h>
+# include <hart-features.c>
+
+extern struct hart_features _dl_riscv_hart_features;
+
+# define ARCH_INIT_CPU_FEATURES() init_hart_features (&_dl_riscv_hart_features)
+
+#endif
+#include <csu/libc-start.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 05/19] riscv: Introduction of ISA extensions
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (3 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 04/19] riscv: Add hart feature run-time detection framework Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables Christoph Muellner
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The RISC-V ISA consists of a base ISA and a multitude of optional ISA
extensions. This patch introduces some of them, which are expected
to be relevant in the near future for ifunc-based optimizations in glibc:

* Base (i or e)
* M
* A
* F
* D
* C
* Zicsr
* Zifencei
* G
* Zihintpause
* zicbom
* zicbop
* zicboz
* zawrs
* zba
* zbb
* zbc
* zbs

Given the DSL-like definition it should be trivial to extend the list.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/unix/sysv/linux/riscv/hart-features.h | 27 +++++++
 .../unix/sysv/linux/riscv/isa-extensions.def  | 72 +++++++++++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def

diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.h b/sysdeps/unix/sysv/linux/riscv/hart-features.h
index a417cbc326..dd94685676 100644
--- a/sysdeps/unix/sysv/linux/riscv/hart-features.h
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.h
@@ -19,8 +19,35 @@
 #ifndef _CPU_FEATURES_RISCV_H
 #define _CPU_FEATURES_RISCV_H
 
+#define IS_RV32() \
+	(GLRO (dl_riscv_hart_features).xlen == 32)
+
+#define IS_RV64() \
+	(GLRO (dl_riscv_hart_features).xlen == 64)
+
+#define HAVE_RV(E) \
+	(GLRO (dl_riscv_hart_features).have_ ## E == 1)
+
+#define HAVE_CBOM_BLOCKSIZE(n)	\
+	(GLRO (dl_riscv_hart_features).cbom_blocksize == n)
+
+#define HAVE_CBOZ_BLOCKSIZE(n)	\
+	(GLRO (dl_riscv_hart_features).cboz_blocksize == n)
+
 struct hart_features
 {
+  const char* rt_march;
+  unsigned xlen;
+#define ISA_EXT(e)			\
+  unsigned have_##e:1;
+#define ISA_EXT_GROUP(g, ...)		\
+  unsigned have_##g:1;
+#include "isa-extensions.def"
+
+  const char* rt_cbom_blocksize;
+  unsigned cbom_blocksize;
+  const char* rt_cboz_blocksize;
+  unsigned cboz_blocksize;
 };
 
 #endif /* _CPU_FEATURES_RISCV_H  */
diff --git a/sysdeps/unix/sysv/linux/riscv/isa-extensions.def b/sysdeps/unix/sysv/linux/riscv/isa-extensions.def
new file mode 100644
index 0000000000..eb05823998
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/isa-extensions.def
@@ -0,0 +1,72 @@
+/* ISA extensions of RISC-V.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define RISC-V ISA extension.  */
+#ifndef ISA_EXT
+# define ISA_EXT(e)
+#endif
+
+/* Define RISC-V ISA extension group.  */
+#ifndef ISA_EXT_GROUP
+# define ISA_EXT_GROUP(...)
+#endif
+
+/*
+ * Here are the ordering rules of extension naming defined by RISC-V
+ * specification :
+ * 1. All extensions should be separated from other multi-letter extensions
+ *    by an underscore.
+ * 2. The first letter following the 'Z' conventionally indicates the most
+ *    closely related alphabetical extension category, IMAFDQLCBKJTPVH.
+ *    If multiple 'Z' extensions are named, they should be ordered first
+ *    by category, then alphabetically within a category.
+ * 3. Standard supervisor-level extensions (starts with 'S') should be
+ *    listed after standard unprivileged extensions.  If multiple
+ *    supervisor-level extensions are listed, they should be ordered
+ *    alphabetically.
+ * 4. Non-standard extensions (starts with 'X') must be listed after all
+ *    standard extensions. They must be separated from other multi-letter
+ *    extensions by an underscore.
+ */
+
+ISA_EXT (i)
+ISA_EXT (e)
+
+ISA_EXT (m)
+ISA_EXT (a)
+ISA_EXT (f)
+ISA_EXT (d)
+ISA_EXT (c)
+ISA_EXT (zicsr)
+ISA_EXT (zifencei)
+ISA_EXT_GROUP (g, i, m, a, f, d, zicsr, zifencei)
+
+ISA_EXT (zicbom)
+ISA_EXT (zicbop)
+ISA_EXT (zicboz)
+ISA_EXT (zihintpause)
+
+ISA_EXT (zawrs)
+
+ISA_EXT (zba)
+ISA_EXT (zbb)
+ISA_EXT (zbc)
+ISA_EXT (zbs)
+
+#undef ISA_EXT
+#undef ISA_EXT_GROUP
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (4 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 05/19] riscv: Introduction of ISA extensions Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  6:20   ` David Abdurachmanov
  2023-02-07  0:16 ` [RFC PATCH 07/19] riscv: hart-features: Add fast_unaligned property Christoph Muellner
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

RISC-V does not have a reliable mechanism to detect hart features
like supported ISA extensions or cache block sizes at run-time
as of now.

Not knowing the hart features limits optimization strategies of glibc
(e.g. ifunc support requires run-time hard feature knowledge).

To circumvent this limitation this patch introduces a mechanism to
get the hart features via environment variables:
* RISCV_RT_MARCH represents a lower-case ISA string (-march string)
  E.g. RISCV_RT_MARCH=rv64gc_zicboz
* RISCV_RT_CBOM_BLOCKSIZE represents the cbom instruction block size
  E.g. RISCV_RT_CBOZ_BLOCKSIZE=64
* RISCV_RT_CBOZ_BLOCKSIZE represents the cboz instruction block size

These environment variables are parsed during startup and the found
ISA extensions are stored a struct (hart_features) for evaluation
by dynamic dispatching code.

As the parser code is executed very early, we cannot call functions
that have direct or indirect (via getenv()) dependencies to strlen()
and strncmp(), as these functions cannot be called before the ifunc
support is initialized. Therefore, this patch contains its own helper
functions for strlen(), strncmp(), and getenv().

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/unix/sysv/linux/riscv/hart-features.c | 294 ++++++++++++++++++
 .../unix/sysv/linux/riscv/macro-for-each.h    |  24 ++
 2 files changed, 318 insertions(+)
 create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h

diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.c b/sysdeps/unix/sysv/linux/riscv/hart-features.c
index 41111eff57..6de41a26cc 100644
--- a/sysdeps/unix/sysv/linux/riscv/hart-features.c
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.c
@@ -17,12 +17,17 @@
    <https://www.gnu.org/licenses/>.  */
 
 #include <hart-features.h>
+#include <macro-for-each.h>
+#include <string_private.h>
 
 /* The code in this file is executed very early, so we cannot call
    indirect functions because ifunc support is not initialized.
    Therefore this file adds a few simple helper functions to avoid
    dependencies to functions outside of this file.  */
 
+#define xstr(s) str(s)
+#define str(s) #s
+
 static inline void
 inhibit_loop_to_libcall
 simple_memset (void *s, int c, size_t n)
@@ -35,9 +40,298 @@ simple_memset (void *s, int c, size_t n)
     }
 }
 
+static inline size_t
+inhibit_loop_to_libcall
+simple_strlen (const char *s)
+{
+  size_t n = 0;
+  char c = *s;
+  while (c != 0)
+    {
+      s++;
+      n++;
+      c = *s;
+    }
+  return n;
+}
+
+static inline int
+inhibit_loop_to_libcall
+simple_strncmp (const char *s1, const char *s2, size_t n)
+{
+  while (n != 0)
+    {
+      if (*s1 == 0 || *s1 != *s2)
+	return *((const unsigned char *)s1) - *((const unsigned char *)s2);
+      n--;
+      s1++;
+      s2++;
+    }
+  return 0;
+}
+
+extern char **__environ;
+static inline char*
+simple_getenv (const char *name)
+{
+  char **ep;
+  uint16_t name_start;
+
+  if (__environ == NULL || name[0] == 0 || name[1] == 0)
+    return NULL;
+
+  size_t len = simple_strlen (name);
+#if _STRING_ARCH_unaligned
+  name_start = *(const uint16_t *) name;
+#else
+  name_start = (((const unsigned char *) name)[0]
+		| (((const unsigned char *) name)[1] << 8));
+#endif
+  len -= 2;
+  name += 2;
+
+  for (ep = __environ; *ep != NULL; ++ep)
+    {
+#if _STRING_ARCH_unaligned
+      uint16_t ep_start = *(uint16_t *) *ep;
+#else
+      uint16_t ep_start = (((unsigned char *) *ep)[0]
+			   | (((unsigned char *) *ep)[1] << 8));
+#endif
+      if (name_start == ep_start && !simple_strncmp (*ep + 2, name, len)
+	  && (*ep)[len + 2] == '=')
+	return &(*ep)[len + 3];
+    }
+  return NULL;
+}
+
+/* Check if the given number is a power of 2.
+   Return true if so, or false otherwise.  */
+static inline int
+is_power_of_two (unsigned long v)
+{
+  return (v & (v - 1)) == 0;
+}
+
+/* Check if the given string str starts with
+   the prefix pre.  Return true if so, or false
+   otherwise.  */
+static inline int
+starts_with (const char *str, const char *pre)
+{
+  return simple_strncmp (pre, str, simple_strlen (pre)) == 0;
+}
+
+/* Lower all characters of a string up to the
+   first NUL-character in the string.  */
+static inline void
+strtolower (char *s)
+{
+  char c = *s;
+  while (c != '\0')
+    {
+      if (c >= 'A' && c <= 'Z')
+	*s = c + 'a' - 'A';
+      s++;
+      c = *s;
+    }
+}
+
+/* Count the number of detected extensions.  */
+static inline unsigned long
+count_extensions (struct hart_features *hart_features)
+{
+  unsigned long n = 0;
+#define ISA_EXT(e)							\
+  if (hart_features->have_##e == 1)					\
+    n++;
+#define ISA_EXT_GROUP(g, ...)						\
+  if (hart_features->have_##g == 1)					\
+    n++;
+#include "isa-extensions.def"
+  return n;
+}
+
+/* Check if the given charater is not '0'-'9'.  */
+static inline int
+notanumber (const char c)
+{
+  return (c < '0' || c > '9');
+}
+
+/* Parse RISCV_RT_MARCH and store found extensions.  */
+static inline void
+parse_rt_march (struct hart_features *hart_features)
+{
+  const char* s = simple_getenv ("RISCV_RT_MARCH");
+  if (s == NULL)
+    goto end;
+
+  hart_features->rt_march = s;
+
+  /* "RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I
+      indicating the supported address space size in bits for the base
+      integer ISA."  */
+  if (starts_with (s, "rv32") && notanumber (*(s+4)))
+    {
+      hart_features->xlen = 32;
+      s += 4;
+    }
+  else if (starts_with (s, "rv64") && notanumber (*(s+4)))
+    {
+      hart_features->xlen = 64;
+      s += 4;
+    }
+  else if (starts_with (s, "rv128") && notanumber (*(s+5)))
+    {
+      hart_features->xlen = 128;
+      s += 5;
+    }
+  else
+    {
+      goto fail;
+    }
+
+  /* Parse the extensions.  */
+  const char *s_old = s;
+  while (*s != '\0')
+    {
+#define ISA_EXT(e)							\
+      else if (starts_with (s, xstr (e)))				\
+	{								\
+	  hart_features->have_##e = 1;					\
+	  s += simple_strlen (xstr (e));				\
+	}
+#define ISA_EXT_GROUP(g, ...)						\
+      ISA_EXT (g)
+      if (0);
+#include "isa-extensions.def"
+
+      /* Consume optional version information.  */
+      while (*s >= '0' && *s <= '9')
+	s++;
+      while (*s == 'p')
+	s++;
+      while (*s >= '0' && *s <= '9')
+	s++;
+
+      /* Consume optional '_'.  */
+      if (*s == '_')
+	s++;
+
+      /* If we got stuck, bail out.  */
+      if (s == s_old)
+	goto fail;
+    }
+
+  /* Propagate subsets (until we reach a fixpoint).  */
+  unsigned long n = count_extensions (hart_features);
+  while (1)
+    {
+      /* Forward-propagation.  E.g.:
+      if (hart_features->have_g == 1)
+	{
+	  hart_features->have_i = 1;
+	  ...
+	  hart_features->have_zifencei = 1;
+	}  */
+#define ISA_EXT_GROUP_HEAD(y)						\
+      if (hart_features->have_##y)					\
+	{
+#define ISA_EXT_GROUP_SUBSET(s)						\
+	  hart_features->have_##s = 1;
+#define ISA_EXT_GROUP_TAIL(z)						\
+	}
+#define ISA_EXT_GROUP(x, ...)						\
+	ISA_EXT_GROUP_HEAD (x)						\
+	FOR_EACH (ISA_EXT_GROUP_SUBSET, __VA_ARGS__)			\
+	ISA_EXT_GROUP_TAIL (x)
+#include "isa-extensions.def"
+#undef ISA_EXT_GROUP_HEAD
+#undef ISA_EXT_GROUP_SUBSET
+#undef ISA_EXT_GROUP_TAIL
+
+      /* Backward-propagation.  E.g.:
+      if (1
+	  && hart_features->have_i == 1
+	  ...
+	  && hart_features->have_zifencei == 1
+	  )
+	hart_features->have_g = 1;  */
+#define ISA_EXT_GROUP_HEAD(y)						\
+      if (1
+#define ISA_EXT_GROUP_SUBSET(s)						\
+	  && hart_features->have_##s == 1
+#define ISA_EXT_GROUP_TAIL(z)						\
+	  )								\
+	hart_features->have_##z = 1;
+#define ISA_EXT_GROUP(x, ...)						\
+	ISA_EXT_GROUP_HEAD (x)						\
+	FOR_EACH (ISA_EXT_GROUP_SUBSET, __VA_ARGS__)			\
+	ISA_EXT_GROUP_TAIL (x)
+#include "isa-extensions.def"
+#undef ISA_EXT_GROUP_HEAD
+#undef ISA_EXT_GROUP_SUBSET
+#undef ISA_EXT_GROUP_TAIL
+
+      unsigned long n2 = count_extensions (hart_features);
+      /* Stop if fix-point reached.  */
+      if (n == n2)
+	break;
+      n = n2;
+    }
+
+end:
+  return;
+
+fail:
+  hart_features->rt_march = NULL;
+}
+
+/* Parse RISCV_RT_CBOM_BLOCKSIZE and store value.  */
+static inline void
+parse_rt_cbom_blocksize (struct hart_features *hart_features)
+{
+  hart_features->rt_cbom_blocksize = NULL;
+  hart_features->cbom_blocksize = 0;
+
+  const char *s = simple_getenv ("RISCV_RT_CBOM_BLOCKSIZE");
+  if (s == NULL)
+    return;
+
+  uint64_t v = _dl_strtoul (s, NULL);
+  if (!is_power_of_two (v))
+    return;
+
+  hart_features->rt_cbom_blocksize = s;
+  hart_features->cbom_blocksize = v;
+}
+
+/* Parse RISCV_RT_CBOZ_BLOCKSIZE and store value.  */
+static inline void
+parse_rt_cboz_blocksize (struct hart_features *hart_features)
+{
+  hart_features->rt_cboz_blocksize = NULL;
+  hart_features->cboz_blocksize = 0;
+
+  const char *s = simple_getenv ("RISCV_RT_CBOZ_BLOCKSIZE");
+  if (s == NULL)
+    return;
+
+  uint64_t v = _dl_strtoul (s, NULL);
+  if (!is_power_of_two (v))
+    return;
+
+  hart_features->rt_cboz_blocksize = s;
+  hart_features->cboz_blocksize = v;
+}
+
 /* Discover hart features and store them.  */
 static inline void
 init_hart_features (struct hart_features *hart_features)
 {
   simple_memset (hart_features, 0, sizeof (*hart_features));
+  parse_rt_march (hart_features);
+  parse_rt_cbom_blocksize (hart_features);
+  parse_rt_cboz_blocksize (hart_features);
 }
diff --git a/sysdeps/unix/sysv/linux/riscv/macro-for-each.h b/sysdeps/unix/sysv/linux/riscv/macro-for-each.h
new file mode 100644
index 0000000000..524bef3c0a
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/macro-for-each.h
@@ -0,0 +1,24 @@
+/* Recursive macros implementation by David Mazières
+   https://www.scs.stanford.edu/~dm/blog/va-opt.html  */
+
+#ifndef _MACRO_FOR_EACH_H
+#define _MACRO_FOR_EACH_H
+
+#define EXPAND1(...) __VA_ARGS__
+#define EXPAND2(...) EXPAND1 (EXPAND1 (EXPAND1 (EXPAND1 (__VA_ARGS__))))
+#define EXPAND3(...) EXPAND2 (EXPAND2 (EXPAND2 (EXPAND2 (__VA_ARGS__))))
+#define EXPAND4(...) EXPAND3 (EXPAND3 (EXPAND3 (EXPAND3 (__VA_ARGS__))))
+#define EXPAND(...)  EXPAND4 (EXPAND4 (EXPAND4 (EXPAND4 (__VA_ARGS__))))
+
+#define FOR_EACH(macro, ...)						\
+  __VA_OPT__ (EXPAND (FOR_EACH_HELPER (macro, __VA_ARGS__)))
+
+#define PARENS ()
+
+#define FOR_EACH_HELPER(macro, a1, ...)					\
+  macro (a1)								\
+  __VA_OPT__ (FOR_EACH_AGAIN PARENS (macro, __VA_ARGS__))
+
+#define FOR_EACH_AGAIN() FOR_EACH_HELPER
+
+#endif /* _MACRO_FOR_EACH_H  */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 07/19] riscv: hart-features: Add fast_unaligned property
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (5 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 08/19] riscv: Add (empty) ifunc framework Christoph Muellner
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

Having fast unaligned accesses opens the door for a performance
optimizations. Let's add this property to the hart-features
so that this property can be queried using the environment
variable RISCV_RT_FAST_UNALIGNED (e.g. by setting it to "1").

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/unix/sysv/linux/riscv/hart-features.c | 19 +++++++++++++++++++
 sysdeps/unix/sysv/linux/riscv/hart-features.h |  5 +++++
 2 files changed, 24 insertions(+)

diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.c b/sysdeps/unix/sysv/linux/riscv/hart-features.c
index 6de41a26cc..b3b7955534 100644
--- a/sysdeps/unix/sysv/linux/riscv/hart-features.c
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.c
@@ -326,6 +326,22 @@ parse_rt_cboz_blocksize (struct hart_features *hart_features)
   hart_features->cboz_blocksize = v;
 }
 
+/* Parse RISCV_RT_FAST_UNALIGNED and store value.  */
+static inline void
+parse_rt_fast_unaligned (struct hart_features *hart_features)
+{
+  hart_features->rt_fast_unaligned = NULL;
+  hart_features->fast_unaligned = 0;
+
+  const char *s = simple_getenv ("RISCV_RT_FAST_UNALIGNED");
+  if (s == NULL)
+    return;
+
+  uint64_t v = _dl_strtoul (s, NULL);
+  hart_features->rt_fast_unaligned = s;
+  hart_features->fast_unaligned = v;
+}
+
 /* Discover hart features and store them.  */
 static inline void
 init_hart_features (struct hart_features *hart_features)
@@ -334,4 +350,7 @@ init_hart_features (struct hart_features *hart_features)
   parse_rt_march (hart_features);
   parse_rt_cbom_blocksize (hart_features);
   parse_rt_cboz_blocksize (hart_features);
+
+  /* Parse tuning properties.  */
+  parse_rt_fast_unaligned (hart_features);
 }
diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.h b/sysdeps/unix/sysv/linux/riscv/hart-features.h
index dd94685676..b2cefd5748 100644
--- a/sysdeps/unix/sysv/linux/riscv/hart-features.h
+++ b/sysdeps/unix/sysv/linux/riscv/hart-features.h
@@ -34,6 +34,9 @@
 #define HAVE_CBOZ_BLOCKSIZE(n)	\
 	(GLRO (dl_riscv_hart_features).cboz_blocksize == n)
 
+#define HAVE_FAST_UNALIGNED() \
+	(GLRO (dl_riscv_hart_features).fast_unaligned != 0)
+
 struct hart_features
 {
   const char* rt_march;
@@ -48,6 +51,8 @@ struct hart_features
   unsigned cbom_blocksize;
   const char* rt_cboz_blocksize;
   unsigned cboz_blocksize;
+  const char* rt_fast_unaligned;
+  unsigned fast_unaligned;
 };
 
 #endif /* _CPU_FEATURES_RISCV_H  */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 08/19] riscv: Add (empty) ifunc framework
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (6 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 07/19] riscv: hart-features: Add fast_unaligned property Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 09/19] riscv: Add ifunc support for memset Christoph Muellner
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds the missing pieces to add ifunc implementations
of routines. No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  4 +++
 sysdeps/riscv/multiarch/ifunc-impl-list.c | 39 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/init-arch.h       | 24 ++++++++++++++
 3 files changed, 67 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/Makefile
 create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
 create mode 100644 sysdeps/riscv/multiarch/init-arch.h

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
new file mode 100644
index 0000000000..68d3f5192f
--- /dev/null
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -0,0 +1,4 @@
+ifeq ($(subdir),string)
+sysdep_routines += \
+
+endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
new file mode 100644
index 0000000000..c0cdca45fd
--- /dev/null
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -0,0 +1,39 @@
+/* Enumerate available IFUNC implementations of a function.  RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <assert.h>
+#include <string.h>
+#include <wchar.h>
+#include <ldsodefs.h>
+#include <ifunc-impl-list.h>
+#include <init-arch.h>
+#include <stdio.h>
+
+/* Maximum number of IFUNC implementations.  */
+#define MAX_IFUNC	7
+
+size_t
+__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+			size_t max)
+{
+  assert (max >= MAX_IFUNC);
+
+  size_t i = 0;
+
+  return i;
+}
diff --git a/sysdeps/riscv/multiarch/init-arch.h b/sysdeps/riscv/multiarch/init-arch.h
new file mode 100644
index 0000000000..c9afeec07b
--- /dev/null
+++ b/sysdeps/riscv/multiarch/init-arch.h
@@ -0,0 +1,24 @@
+/* Define INIT_ARCH for RISC-V.
+   This file is part of the GNU C Library.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#ifndef _INIT_ARCH_RISCV
+#define _INIT_ARCH_RISCV
+
+#define INIT_ARCH()
+
+#endif /* _INIT_ARCH_RISCV  */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 09/19] riscv: Add ifunc support for memset
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (7 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 08/19] riscv: Add (empty) ifunc framework Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64 Christoph Muellner
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds ifunc support for calls to memset to the RISC-V code.
No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  2 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |  3 ++
 sysdeps/riscv/multiarch/memset.c          | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/memset_generic.c  | 32 ++++++++++++++++++
 4 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/memset.c
 create mode 100644 sysdeps/riscv/multiarch/memset_generic.c

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 68d3f5192f..453f0f4e4c 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -1,4 +1,4 @@
 ifeq ($(subdir),string)
 sysdep_routines += \
-
+	memset_generic
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index c0cdca45fd..fd1752bc46 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -35,5 +35,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   size_t i = 0;
 
+  IFUNC_IMPL (i, name, memset,
+	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
+
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/memset.c b/sysdeps/riscv/multiarch/memset.c
new file mode 100644
index 0000000000..ae4289ab03
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset.c
@@ -0,0 +1,40 @@
+/* Multiple versions of memset. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine memset so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memset
+# define memset __redirect_memset
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_memset) __libc_memset;
+extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
+
+libc_ifunc (__libc_memset, __memset_generic);
+
+# undef memset
+strong_alias (__libc_memset, memset);
+#else
+# include <string/memset.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/memset_generic.c b/sysdeps/riscv/multiarch/memset_generic.c
new file mode 100644
index 0000000000..37acb398d4
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset_generic.c
@@ -0,0 +1,32 @@
+/* Memset for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define MEMSET __memset_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__memset_generic, __GI_memset, __memset_generic);
+#endif
+
+extern void *__memset_generic(void *s, int c, size_t n);
+
+#include <string/memset.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (8 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 09/19] riscv: Add ifunc support for memset Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 11/19] riscv: Add ifunc support for memcpy/memmove Christoph Muellner
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of memset() can be accelerated by
loop unrolling, fast unaligned accesses and cbo.zero.
Let's provide an implementation that supports that,
with a cbo.zero being optional and only available for
a block size of 64 bytes.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile              |   4 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |   4 +
 sysdeps/riscv/multiarch/memset.c              |  12 +
 .../riscv/multiarch/memset_rv64_unaligned.S   |  31 +++
 .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++++++++++++
 5 files changed, 267 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
 create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 453f0f4e4c..6e8ebb42d8 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -1,4 +1,6 @@
 ifeq ($(subdir),string)
 sysdep_routines += \
-	memset_generic
+	memset_generic \
+	memset_rv64_unaligned \
+	memset_rv64_unaligned_cboz64
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index fd1752bc46..e878977b73 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -36,6 +36,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   size_t i = 0;
 
   IFUNC_IMPL (i, name, memset,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_rv64_unaligned_cboz64)
+	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
   return i;
diff --git a/sysdeps/riscv/multiarch/memset.c b/sysdeps/riscv/multiarch/memset.c
index ae4289ab03..7ba10dd3da 100644
--- a/sysdeps/riscv/multiarch/memset.c
+++ b/sysdeps/riscv/multiarch/memset.c
@@ -31,7 +31,19 @@
 extern __typeof (__redirect_memset) __libc_memset;
 extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
 
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memset) __memset_rv64_unaligned_cboz64 attribute_hidden;
+extern __typeof (__redirect_memset) __memset_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memset,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED() && HAVE_RV(zicboz) && HAVE_CBOZ_BLOCKSIZE(64)
+	    ? __memset_rv64_unaligned_cboz64
+	    : (IS_RV64() && HAVE_FAST_UNALIGNED()
+	      ? __memset_rv64_unaligned
+	      : __memset_generic)));
+#else
 libc_ifunc (__libc_memset, __memset_generic);
+#endif
 
 # undef memset
 strong_alias (__libc_memset, memset);
diff --git a/sysdeps/riscv/multiarch/memset_rv64_unaligned.S b/sysdeps/riscv/multiarch/memset_rv64_unaligned.S
new file mode 100644
index 0000000000..561e564b42
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset_rv64_unaligned.S
@@ -0,0 +1,31 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#ifndef MEMSET
+# define MEMSET __memset_rv64_unaligned
+#endif
+
+#undef CBO_ZERO_THRESHOLD
+#define CBO_ZERO_THRESHOLD 0
+
+/* Assumptions: rv64i unaligned accesses.  */
+
+#include "./memset_rv64_unaligned_cboz64.S"
diff --git a/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S b/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
new file mode 100644
index 0000000000..710bb41e44
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
@@ -0,0 +1,217 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if __riscv_xlen == 64
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#define dstin	a0
+#define val	a1
+#define count	a2
+#define dst	a3
+#define dstend	a4
+#define tmp1	a5
+
+#ifndef MEMSET
+# define MEMSET __memset_rv64_unaligned_cboz64
+#endif
+
+/* cbo.zero can be used to improve the performance of memset-zero.
+ * However, the performance gain depends on the amount of data
+ * to be cleared. This threshold allows to set the minimum amount
+ * of bytes to enable the cbo.zero loop.
+ * To disable cbo.zero, set this threshold to 0. */
+#ifndef CBO_ZERO_THRESHOLD
+# define CBO_ZERO_THRESHOLD 128
+#endif
+
+/* Assumptions:
+ * rv64i_zicboz, 64 byte cbo.zero block size, unaligned accesses.  */
+
+ENTRY_ALIGN (MEMSET, 6)
+
+	/* Repeat the byte.  */
+	slli	tmp1, val, 8
+	or	val, tmp1, a1
+	slli	tmp1, val, 16
+	or	val, tmp1, a1
+	slli	tmp1, val, 32
+	or	val, tmp1, val
+
+	/* Calculate the end position.  */
+	add	dstend, dstin, count
+
+	/* Decide how to process.  */
+	li	tmp1, 96
+	bgtu	count, tmp1, L(set_long)
+	li	tmp1, 16
+	bgtu	count, tmp1, L(set_medium)
+
+	/* Set 0..16 bytes.  */
+	li	tmp1, 8
+	bltu	count, tmp1, 1f
+	/* Set 8..16 bytes.  */
+	sd	val, 0(dstin)
+	sd	val, -8(dstend)
+	ret
+
+	.p2align 3
+	/* Set 0..7 bytes.  */
+1:	li	tmp1, 4
+	bltu	count, tmp1, 2f
+	/* Set 4..7 bytes.  */
+	sw	val, 0(dstin)
+	sw	val, -4(dstend)
+	ret
+
+	/* Set 0..3 bytes.  */
+2:	beqz	count, 3f
+	sb	val, 0(dstin)
+	li	tmp1, 2
+	bltu	count, tmp1, 3f
+	sh	val, -2(dstend)
+3:	ret
+
+	.p2align 3
+	/* Set 17..96 bytes.  */
+L(set_medium):
+	sd	val, 0(dstin)
+	sd	val, 8(dstin)
+	li	tmp1, 64
+	bgtu	count, tmp1, L(set96)
+	sd	val, -16(dstend)
+	sd	val, -8(dstend)
+	li	tmp1, 32
+	bleu	count, tmp1, 1f
+	sd	val, 16(dstin)
+	sd	val, 24(dstin)
+	sd	val, -32(dstend)
+	sd	val, -24(dstend)
+1:	ret
+
+	.p2align 4
+	/* Set 65..96 bytes.  Write 64 bytes from the start and
+	   32 bytes from the end.  */
+L(set96):
+	sd	val, 16(dstin)
+	sd	val, 24(dstin)
+	sd	val, 32(dstin)
+	sd	val, 40(dstin)
+	sd	val, 48(dstin)
+	sd	val, 56(dstin)
+	sd	val, -32(dstend)
+	sd	val, -24(dstend)
+	sd	val, -16(dstend)
+	sd	val, -8(dstend)
+	ret
+
+	.p2align 4
+	/* Set 97+ bytes.  */
+L(set_long):
+	/* Store 16 bytes unaligned.  */
+	sd	val, 0(dstin)
+	sd	val, 8(dstin)
+
+#if CBO_ZERO_THRESHOLD
+	li	tmp1, CBO_ZERO_THRESHOLD
+	blt	count, tmp1, 1f
+	beqz	val, L(cbo_zero_64)
+1:
+#endif
+
+	/* Round down to the previous 16 byte boundary (keep offset of 16).  */
+	andi	dst, dstin, -16
+
+	/* Calculate loop termination position.  */
+	addi	tmp1, dstend, -(16+64)
+
+	/* Store 64 bytes in a loop.  */
+	.p2align 4
+1:	sd	val, 16(dst)
+	sd	val, 24(dst)
+	sd	val, 32(dst)
+	sd	val, 40(dst)
+	sd	val, 48(dst)
+	sd	val, 56(dst)
+	sd	val, 64(dst)
+	sd	val, 72(dst)
+	addi	dst, dst, 64
+	bltu	dst, tmp1, 1b
+
+	/* Calculate remainder (dst2 is 16 too less).  */
+	sub	count, dstend, dst
+
+	/* Check if we have more than 32 bytes to copy.  */
+	li	tmp1, (32+16)
+	ble	count, tmp1, 1f
+	sd	val, 16(dst)
+	sd	val, 24(dst)
+	sd	val, 32(dst)
+	sd	val, 40(dst)
+1:	sd	val, -32(dstend)
+	sd	val, -24(dstend)
+	sd	val, -16(dstend)
+	sd	val, -8(dstend)
+	ret
+
+#if CBO_ZERO_THRESHOLD
+	.option push
+	.option arch,+zicboz
+	.p2align 3
+L(cbo_zero_64):
+	/* Align dst (down).  */
+	sd	val, 16(dstin)
+	sd	val, 24(dstin)
+	sd	val, 32(dstin)
+	sd	val, 40(dstin)
+	sd	val, 48(dstin)
+	sd	val, 56(dstin)
+
+	/* Round up to the next 64 byte boundary.  */
+	andi	dst, dstin, -64
+	addi	dst, dst, 64
+
+	/* Calculate loop termination position.  */
+	addi	tmp1, dstend, -64
+
+	/* cbo.zero sets 64 bytes each time. */
+	.p2align 4
+1:	cbo.zero	(dst)
+	addi	dst, dst, 64
+	bltu	dst, tmp1, 1b
+
+	sub	count, dstend, dst
+	li	tmp1, 32
+	ble	count, tmp1, 1f
+	sd	val, 0(dst)
+	sd	val, 8(dst)
+	sd	val, 16(dst)
+	sd	val, 24(dst)
+1:	sd	val, -32(dstend)
+	sd	val, -24(dstend)
+	sd	val, -16(dstend)
+	sd	val, -8(dstend)
+	ret
+	.option pop
+#endif /* CBO_ZERO_THRESHOLD  */
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* __riscv_xlen == 64  */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 11/19] riscv: Add ifunc support for memcpy/memmove
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (9 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64 Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds ifunc support for calls to memcpy() and memmove()
to the RISC-V code.
No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  2 ++
 sysdeps/riscv/multiarch/ifunc-impl-list.c |  6 ++++
 sysdeps/riscv/multiarch/memcpy.c          | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/memcpy_generic.c  | 32 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c         | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/memmove_generic.c | 32 ++++++++++++++++++
 6 files changed, 152 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/memcpy.c
 create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
 create mode 100644 sysdeps/riscv/multiarch/memmove.c
 create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 6e8ebb42d8..6bc20c4fe0 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -1,5 +1,7 @@
 ifeq ($(subdir),string)
 sysdep_routines += \
+	memcpy_generic \
+	memmove_generic \
 	memset_generic \
 	memset_rv64_unaligned \
 	memset_rv64_unaligned_cboz64
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index e878977b73..16e4d7137f 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -35,6 +35,12 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   size_t i = 0;
 
+  IFUNC_IMPL (i, name, memcpy,
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
+
+  IFUNC_IMPL (i, name, memmove,
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
+
   IFUNC_IMPL (i, name, memset,
 #if __riscv_xlen == 64
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_rv64_unaligned_cboz64)
diff --git a/sysdeps/riscv/multiarch/memcpy.c b/sysdeps/riscv/multiarch/memcpy.c
new file mode 100644
index 0000000000..cc9185912a
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcpy.c
@@ -0,0 +1,40 @@
+/* Multiple versions of memcpy. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine memcpy so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memcpy
+# define memcpy __redirect_memcpy
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_memcpy) __libc_memcpy;
+extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
+
+libc_ifunc (__libc_memcpy, __memcpy_generic);
+
+# undef memcpy
+strong_alias (__libc_memcpy, memcpy);
+#else
+# include <string/memcpy.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/memcpy_generic.c b/sysdeps/riscv/multiarch/memcpy_generic.c
new file mode 100644
index 0000000000..fb46fe7622
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcpy_generic.c
@@ -0,0 +1,32 @@
+/* Memcpy for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define MEMCPY __memcpy_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__memcpy_generic, __GI_memcpy, __memcpy_generic);
+#endif
+
+extern void *__memcpy_generic(void *dest, const void *src, size_t n);
+
+#include <string/memcpy.c>
diff --git a/sysdeps/riscv/multiarch/memmove.c b/sysdeps/riscv/multiarch/memmove.c
new file mode 100644
index 0000000000..581a8327d6
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memmove.c
@@ -0,0 +1,40 @@
+/* Multiple versions of memmove. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine memmove so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef memmove
+# define memmove __redirect_memmove
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_memmove) __libc_memmove;
+extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
+
+libc_ifunc (__libc_memmove, __memmove_generic);
+
+# undef memmove
+strong_alias (__libc_memmove, memmove);
+#else
+# include <string/memmove.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/memmove_generic.c b/sysdeps/riscv/multiarch/memmove_generic.c
new file mode 100644
index 0000000000..4a9e83c13c
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memmove_generic.c
@@ -0,0 +1,32 @@
+/* Memmove for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define MEMMOVE __memmove_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__memmove_generic, __GI_memmove, __memmove_generic);
+#endif
+
+extern void *__memmove_generic(void *dest, const void *src, size_t n);
+
+#include <string/memmove.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (10 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 11/19] riscv: Add ifunc support for memcpy/memmove Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 13/19] riscv: Add ifunc support for strlen Christoph Muellner
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of memcpy()/memmove() can be accelerated by
loop unrolling and fast unaligned accesses.
Let's provide an implementation that is optimized accordingly.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile              |   2 +
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |   6 +
 sysdeps/riscv/multiarch/memcpy.c              |   9 +
 .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
 sysdeps/riscv/multiarch/memmove.c             |   9 +
 5 files changed, 501 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 6bc20c4fe0..b08d7d1c8b 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -2,6 +2,8 @@ ifeq ($(subdir),string)
 sysdep_routines += \
 	memcpy_generic \
 	memmove_generic \
+	memcpy_rv64_unaligned \
+	\
 	memset_generic \
 	memset_rv64_unaligned \
 	memset_rv64_unaligned_cboz64
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 16e4d7137f..84b3eb25a4 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -36,9 +36,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   size_t i = 0;
 
   IFUNC_IMPL (i, name, memcpy,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
 
   IFUNC_IMPL (i, name, memmove,
+#if __riscv_xlen == 64
+	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_rv64_unaligned)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
 
   IFUNC_IMPL (i, name, memset,
diff --git a/sysdeps/riscv/multiarch/memcpy.c b/sysdeps/riscv/multiarch/memcpy.c
index cc9185912a..68ac9bbe35 100644
--- a/sysdeps/riscv/multiarch/memcpy.c
+++ b/sysdeps/riscv/multiarch/memcpy.c
@@ -31,7 +31,16 @@
 extern __typeof (__redirect_memcpy) __libc_memcpy;
 extern __typeof (__redirect_memcpy) __memcpy_generic attribute_hidden;
 
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memcpy) __memcpy_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memcpy,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED()
+	    ? __memcpy_rv64_unaligned
+	    : __memcpy_generic));
+#else
 libc_ifunc (__libc_memcpy, __memcpy_generic);
+#endif
 
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
diff --git a/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
new file mode 100644
index 0000000000..372cd0baea
--- /dev/null
+++ b/sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
@@ -0,0 +1,475 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#if __riscv_xlen == 64
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+#define dst	a0
+#define src	a1
+#define count	a2
+#define srcend	a3
+#define dstend	a4
+#define tmp1	a5
+#define dst2	t6
+
+#define A_l	a6
+#define A_h	a7
+#define B_l	t0
+#define B_h	t1
+#define C_l	t2
+#define C_h	t3
+#define D_l	t4
+#define D_h	t5
+#define E_l	tmp1
+#define E_h	count
+#define F_l	dst2
+#define F_h	srcend
+
+#ifndef MEMCPY
+# define MEMCPY __memcpy_rv64_unaligned
+#endif
+
+#ifndef MEMMOVE
+# define MEMMOVE __memmove_rv64_unaligned
+#endif
+
+#ifndef COPY97_128
+# define COPY97_128 1
+#endif
+
+/* Assumptions: rv64i, unaligned accesses.  */
+
+/* memcpy/memmove is implemented by unrolling copy loops.
+   We have two strategies:
+   1) copy from front/start to back/end ("forward")
+   2) copy from back/end to front/start ("backward")
+   In case of memcpy(), the strategy does not matter for correctness.
+   For memmove() and overlapping buffers we need to use the following strategy:
+     if dst < src && src-dst < count -> copy from front to back
+     if src < dst && dst-src < count -> copy from back to front  */
+
+ENTRY_ALIGN (MEMCPY, 6)
+	/* Calculate the end position.  */
+	add	srcend, src, count
+	add	dstend, dst, count
+
+	/* Decide how to process.  */
+	li	tmp1, 96
+	bgtu	count, tmp1, L(copy_long_forward)
+	li	tmp1, 32
+	bgtu	count, tmp1, L(copy33_96)
+	li	tmp1, 16
+	bleu	count, tmp1, L(copy0_16)
+
+	/* Copy 17-32 bytes.  */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, -16(srcend)
+	ld	B_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, -16(dstend)
+	sd	B_h, -8(dstend)
+	ret
+
+L(copy0_16):
+	li	tmp1, 8
+	bleu	count, tmp1, L(copy0_8)
+	/* Copy 9-16 bytes.  */
+	ld	A_l, 0(src)
+	ld	A_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, -8(dstend)
+	ret
+
+	.p2align 3
+L(copy0_8):
+	li	tmp1, 4
+	bleu	count, tmp1, L(copy0_4)
+	/* Copy 5-8 bytes.  */
+	lw	A_l, 0(src)
+	lw	B_l, -4(srcend)
+	sw	A_l, 0(dst)
+	sw	B_l, -4(dstend)
+	ret
+
+L(copy0_4):
+	li	tmp1, 2
+	bleu	count, tmp1, L(copy0_2)
+	/* Copy 3-4 bytes.  */
+	lh	A_l, 0(src)
+	lh	B_l, -2(srcend)
+	sh	A_l, 0(dst)
+	sh	B_l, -2(dstend)
+	ret
+
+L(copy0_2):
+	li	tmp1, 1
+	bleu	count, tmp1, L(copy0_1)
+	/* Copy 2 bytes. */
+	lh	A_l, 0(src)
+	sh	A_l, 0(dst)
+	ret
+
+L(copy0_1):
+	beqz	count, L(copy0)
+	/* Copy 1 byte.  */
+	lb	A_l, 0(src)
+	sb	A_l, 0(dst)
+L(copy0):
+	ret
+
+	.p2align 4
+L(copy33_96):
+	/* Copy 33-96 bytes.  */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	ld	C_l, -32(srcend)
+	ld	C_h, -24(srcend)
+	ld	D_l, -16(srcend)
+	ld	D_h, -8(srcend)
+
+	li	tmp1, 64
+	bgtu	count, tmp1, L(copy65_96_preloaded)
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	C_l, -32(dstend)
+	sd	C_h, -24(dstend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ret
+
+	.p2align 4
+L(copy65_96_preloaded):
+	/* Copy 65-96 bytes with pre-loaded A, B, C and D.  */
+	ld	E_l, 32(src)
+	ld	E_h, 40(src)
+	ld	F_l, 48(src) /* dst2 will be overwritten.  */
+	ld	F_h, 56(src) /* srcend will be overwritten.  */
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	E_l, 32(dst)
+	sd	E_h, 40(dst)
+	sd	F_l, 48(dst)
+	sd	F_h, 56(dst)
+	sd	C_l, -32(dstend)
+	sd	C_h, -24(dstend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ret
+
+#ifdef COPY97_128
+	.p2align 4
+L(copy97_128_forward):
+	/* Copy 97-128 bytes from front to back.  */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	ld	C_l, -16(srcend)
+	ld	C_h, -8(srcend)
+	ld	D_l, -32(srcend)
+	ld	D_h, -24(srcend)
+	ld	E_l, -48(srcend)
+	ld	E_h, -40(srcend)
+	ld	F_l, -64(srcend) /* dst2 will be overwritten.  */
+	ld	F_h, -56(srcend) /* srcend will be overwritten.  */
+
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	ld	A_l, 32(src)
+	ld	A_h, 40(src)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	ld	B_l, 48(src)
+	ld	B_h, 56(src)
+
+	sd	C_l, -16(dstend)
+	sd	C_h, -8(dstend)
+	sd	D_l, -32(dstend)
+	sd	D_h, -24(dstend)
+	sd	E_l, -48(dstend)
+	sd	E_h, -40(dstend)
+	sd	F_l, -64(dstend)
+	sd	F_h, -56(dstend)
+
+	sd	A_l, 32(dst)
+	sd	A_h, 40(dst)
+	sd	B_l, 48(dst)
+	sd	B_h, 56(dst)
+	ret
+#endif
+
+	.p2align 4
+	/* Copy 97+ bytes from front to back.  */
+L(copy_long_forward):
+#ifdef COPY97_128
+	/* Avoid loop if possible.  */
+	li	tmp1, 128
+	ble	count, tmp1, L(copy97_128_forward)
+#endif
+
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ld	D_l, 0(src)
+	ld	D_h, 8(src)
+
+	/* Round down to the previous 16 byte boundary (keep offset of 16).  */
+	andi	tmp1, dst, 15
+	andi	dst2, dst, -16
+	sub	src, src, tmp1
+
+	ld	A_l, 16(src)
+	ld	A_h, 24(src)
+	sd	D_l, 0(dst)
+	sd	D_h, 8(dst)
+	ld	B_l, 32(src)
+	ld	B_h, 40(src)
+	ld	C_l, 48(src)
+	ld	C_h, 56(src)
+	ld	D_l, 64(src)
+	ld	D_h, 72(src)
+	addi	src, src, 64
+
+	/* Calculate loop termination position.  */
+	addi	tmp1, dstend, -(16+128)
+	bgeu	dst2, tmp1, L(copy64_from_end)
+
+	/* Store 64 bytes in a loop.  */
+	.p2align 4
+L(loop64_forward):
+	addi	src, src, 64
+	sd	A_l, 16(dst2)
+	sd	A_h, 24(dst2)
+	ld	A_l, -48(src)
+	ld	A_h, -40(src)
+	sd	B_l, 32(dst2)
+	sd	B_h, 40(dst2)
+	ld	B_l, -32(src)
+	ld	B_h, -24(src)
+	sd	C_l, 48(dst2)
+	sd	C_h, 56(dst2)
+	ld	C_l, -16(src)
+	ld	C_h, -8(src)
+	sd	D_l, 64(dst2)
+	sd	D_h, 72(dst2)
+	ld	D_l, 0(src)
+	ld	D_h, 8(src)
+	addi	dst2, dst2, 64
+	bltu	dst2, tmp1, L(loop64_forward)
+
+L(copy64_from_end):
+	ld	E_l, -64(srcend)
+	ld	E_h, -56(srcend)
+	sd	A_l, 16(dst2)
+	sd	A_h, 24(dst2)
+	ld	A_l, -48(srcend)
+	ld	A_h, -40(srcend)
+	sd	B_l, 32(dst2)
+	sd	B_h, 40(dst2)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	sd	C_l, 48(dst2)
+	sd	C_h, 56(dst2)
+	ld	C_l, -16(srcend)
+	ld	C_h, -8(srcend)
+	sd	D_l, 64(dst2)
+	sd	D_h, 72(dst2)
+	sd	E_l, -64(dstend)
+	sd	E_h, -56(dstend)
+	sd	A_l, -48(dstend)
+	sd	A_h, -40(dstend)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	sd	C_l, -16(dstend)
+	sd	C_h, -8(dstend)
+	ret
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+ENTRY_ALIGN (MEMMOVE, 6)
+	/* Calculate the end position.  */
+	add	srcend, src, count
+	add	dstend, dst, count
+
+	/* Decide how to process.  */
+	li	tmp1, 96
+	bgtu	count, tmp1, L(move_long)
+	li	tmp1, 32
+	bgtu	count, tmp1, L(copy33_96)
+	li	tmp1, 16
+	bleu	count, tmp1, L(copy0_16)
+
+	/* Copy 17-32 bytes.  */
+	ld	A_l, 0(src)
+	ld	A_h, 8(src)
+	ld	B_l, -16(srcend)
+	ld	B_h, -8(srcend)
+	sd	A_l, 0(dst)
+	sd	A_h, 8(dst)
+	sd	B_l, -16(dstend)
+	sd	B_h, -8(dstend)
+	ret
+
+#ifdef COPY97_128
+	.p2align 4
+L(copy97_128_backward):
+	/* Copy 97-128 bytes from back to front.  */
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	ld	E_l, -80(srcend)
+	ld	E_h, -72(srcend)
+	ld	F_l, -96(srcend) /* dst2 will be overwritten.  */
+	ld	F_h, -88(srcend) /* srcend will be overwritten.  */
+
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, 16(src)
+	ld	A_h, 24(src)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, 0(src)
+	ld	B_h, 8(src)
+
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	sd	E_l, -80(dstend)
+	sd	E_h, -72(dstend)
+	sd	F_l, -96(dstend)
+	sd	F_h, -88(dstend)
+
+	sd	A_l, 16(dst)
+	sd	A_h, 24(dst)
+	sd	B_l, 0(dst)
+	sd	B_h, 8(dst)
+	ret
+#endif
+
+	.p2align 4
+	/* Copy 97+ bytes.  */
+L(move_long):
+	/* dst-src is positive if src < dst.
+	   In this case we must copy forward if dst-src >= count.
+	   If dst-src is negative, then we can interpret the difference
+	   as unsigned value to enforce dst-src >= count as well.  */
+	sub	tmp1, dst, src
+	beqz	tmp1, L(copy0)
+	bgeu	tmp1, count, L(copy_long_forward)
+
+#ifdef COPY97_128
+	/* Avoid loop if possible.  */
+	li	tmp1, 128
+	ble	count, tmp1, L(copy97_128_backward)
+#endif
+
+	/* Copy 16 bytes and then align dst to 16-byte alignment.  */
+	ld	D_l, -16(srcend)
+	ld	D_h, -8(srcend)
+
+	/* Round down to the previous 16 byte boundary (keep offset of 16).  */
+	andi	tmp1, dstend, 15
+	sub	srcend, srcend, tmp1
+
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	sd	D_l, -16(dstend)
+	sd	D_h, -8(dstend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	andi	dstend, dstend, -16
+
+	/* Calculate loop termination position.  */
+	addi	tmp1, dst, 128
+	bleu	dstend, tmp1, L(copy64_from_start)
+
+	/* Store 64 bytes in a loop.  */
+	.p2align 4
+L(loop64_backward):
+	addi	srcend, srcend, -64
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, -16(srcend)
+	ld	A_h, -8(srcend)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, -32(srcend)
+	ld	B_h, -24(srcend)
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	ld	C_l, -48(srcend)
+	ld	C_h, -40(srcend)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	ld	D_l, -64(srcend)
+	ld	D_h, -56(srcend)
+	addi	dstend, dstend, -64
+	bgtu	dstend, tmp1, L(loop64_backward)
+
+L(copy64_from_start):
+	ld	E_l, 48(src)
+	ld	E_h, 56(src)
+	sd	A_l, -16(dstend)
+	sd	A_h, -8(dstend)
+	ld	A_l, 32(src)
+	ld	A_h, 40(src)
+	sd	B_l, -32(dstend)
+	sd	B_h, -24(dstend)
+	ld	B_l, 16(src)
+	ld	B_h, 24(src)
+	sd	C_l, -48(dstend)
+	sd	C_h, -40(dstend)
+	ld	C_l, 0(src)
+	ld	C_h, 8(src)
+	sd	D_l, -64(dstend)
+	sd	D_h, -56(dstend)
+	sd	E_l, 48(dst)
+	sd	E_h, 56(dst)
+	sd	A_l, 32(dst)
+	sd	A_h, 40(dst)
+	sd	B_l, 16(dst)
+	sd	B_h, 24(dst)
+	sd	C_l, 0(dst)
+	sd	C_h, 8(dst)
+	ret
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+
+#endif /* __riscv_xlen == 64  */
diff --git a/sysdeps/riscv/multiarch/memmove.c b/sysdeps/riscv/multiarch/memmove.c
index 581a8327d6..b446a9e036 100644
--- a/sysdeps/riscv/multiarch/memmove.c
+++ b/sysdeps/riscv/multiarch/memmove.c
@@ -31,7 +31,16 @@
 extern __typeof (__redirect_memmove) __libc_memmove;
 extern __typeof (__redirect_memmove) __memmove_generic attribute_hidden;
 
+#if __riscv_xlen == 64
+extern __typeof (__redirect_memmove) __memmove_rv64_unaligned attribute_hidden;
+
+libc_ifunc (__libc_memmove,
+	    (IS_RV64() && HAVE_FAST_UNALIGNED()
+	    ? __memmove_rv64_unaligned
+	    : __memmove_generic));
+#else
 libc_ifunc (__libc_memmove, __memmove_generic);
+#endif
 
 # undef memmove
 strong_alias (__libc_memmove, memmove);
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 13/19] riscv: Add ifunc support for strlen
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (11 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 14/19] riscv: Add accelerated strlen routine Christoph Muellner
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds ifunc support for calls to strlen to the RISC-V code.
No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  4 ++-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |  4 +++
 sysdeps/riscv/multiarch/strlen.c          | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/strlen_generic.c  | 32 ++++++++++++++++++
 4 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/strlen.c
 create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index b08d7d1c8b..8e2b020233 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -6,5 +6,7 @@ sysdep_routines += \
 	\
 	memset_generic \
 	memset_rv64_unaligned \
-	memset_rv64_unaligned_cboz64
+	memset_rv64_unaligned_cboz64 \
+	\
+	strlen_generic
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 84b3eb25a4..f848fc8401 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -54,5 +54,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 #endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
+  IFUNC_IMPL (i, name, strlen,
+	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
+
+
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/strlen.c b/sysdeps/riscv/multiarch/strlen.c
new file mode 100644
index 0000000000..85f7a91c9f
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strlen.c
@@ -0,0 +1,40 @@
+/* Multiple versions of strlen. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine strlen so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef strlen
+# define strlen __redirect_strlen
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_strlen) __libc_strlen;
+extern __typeof (__redirect_strlen) __strlen_generic attribute_hidden;
+
+libc_ifunc (__libc_strlen, __strlen_generic);
+
+# undef strlen
+strong_alias (__libc_strlen, strlen);
+#else
+# include <string/strlen.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/strlen_generic.c b/sysdeps/riscv/multiarch/strlen_generic.c
new file mode 100644
index 0000000000..10aa05e699
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strlen_generic.c
@@ -0,0 +1,32 @@
+/* strlen for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define STRLEN __strlen_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__strlen_generic, __GI_strlen, __strlen_generic);
+#endif
+
+extern size_t __strlen_generic(const char *str);
+
+#include <string/strlen.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 14/19] riscv: Add accelerated strlen routine
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (12 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 13/19] riscv: Add ifunc support for strlen Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 15/19] riscv: Add ifunc support for strcmp Christoph Muellner
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of strlen() can be accelerated using Zbb's orc.b
instruction. Let's add an implementation that provides that.
The implementation is part of the Bitmanip specification.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |   3 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
 sysdeps/riscv/multiarch/strlen.c          |   6 +-
 sysdeps/riscv/multiarch/strlen_zbb.S      | 105 ++++++++++++++++++++++
 4 files changed, 113 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 8e2b020233..b2247b7326 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -8,5 +8,6 @@ sysdep_routines += \
 	memset_rv64_unaligned \
 	memset_rv64_unaligned_cboz64 \
 	\
-	strlen_generic
+	strlen_generic \
+	strlen_zbb
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index f848fc8401..2b4d2e1c17 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -55,6 +55,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
 
   IFUNC_IMPL (i, name, strlen,
+	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_zbb)
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
 
 
diff --git a/sysdeps/riscv/multiarch/strlen.c b/sysdeps/riscv/multiarch/strlen.c
index 85f7a91c9f..8b2f4d94b2 100644
--- a/sysdeps/riscv/multiarch/strlen.c
+++ b/sysdeps/riscv/multiarch/strlen.c
@@ -30,8 +30,12 @@
 
 extern __typeof (__redirect_strlen) __libc_strlen;
 extern __typeof (__redirect_strlen) __strlen_generic attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_zbb attribute_hidden;
 
-libc_ifunc (__libc_strlen, __strlen_generic);
+libc_ifunc (__libc_strlen,
+	    HAVE_RV(zbb)
+	     ? __strlen_zbb
+	     : __strlen_generic);
 
 # undef strlen
 strong_alias (__libc_strlen, strlen);
diff --git a/sysdeps/riscv/multiarch/strlen_zbb.S b/sysdeps/riscv/multiarch/strlen_zbb.S
new file mode 100644
index 0000000000..a0ca599c8e
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strlen_zbb.S
@@ -0,0 +1,105 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+/* Assumptions: rvi_zbb.  */
+/* Implementation from the Bitmanip specification.  */
+
+#define src		a0
+#define result		a0
+#define addr		a1
+#define data		a2
+#define offset		a3
+#define offset_bits	a3
+#define valid_bytes	a4
+#define m1		a4
+
+#if __riscv_xlen == 64
+# define REG_L	ld
+# define SZREG	8
+#else
+# define REG_L	lw
+# define SZREG	4
+#endif
+
+#define BITSPERBYTELOG 3
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+# define CZ	clz
+# define SHIFT	sll
+#else
+# define CZ	ctz
+# define SHIFT	srl
+#endif
+
+#ifndef STRLEN
+# define STRLEN __strlen_zbb
+#endif
+
+.option push
+.option arch,+zbb
+
+ENTRY_ALIGN (STRLEN, 6)
+	andi	offset, src, SZREG-1
+	andi	addr, src, -SZREG
+
+	li	valid_bytes, SZREG
+	sub	valid_bytes, valid_bytes, offset
+	slli	offset_bits, offset, BITSPERBYTELOG
+	REG_L	data, 0(addr)
+	/* Shift the partial/unaligned chunk we loaded to remove the bytes
+	 * from before the start of the string, adding NUL bytes at the end. */
+	SHIFT	data, data, offset_bits
+	orc.b	data, data
+	not	data, data
+	/* Non-NUL bytes in the string have been expanded to 0x00, while
+	 * NUL bytes have become 0xff. Search for the first set bit
+	 * (corresponding to a NUL byte in the original chunk). */
+	CZ	data, data
+	/* The first chunk is special: compare against the number of valid
+	 * bytes in this chunk. */
+	srli	result, data, 3
+	bgtu	valid_bytes, result, L(done)
+	addi	offset, addr, SZREG
+	li	m1, -1
+
+	/* Our critical loop is 4 instructions and processes data in 4 byte
+	 * or 8 byte chunks.  */
+	.p2align 2
+L(loop):
+	REG_L	data, SZREG(addr)
+	addi	addr, addr, SZREG
+	orc.b	data, data
+	beq	data, m1, L(loop)
+
+L(epilogue):
+	not	data, data
+	CZ	data, data
+	sub	offset, addr, offset
+	add	result, result, offset
+	srli	data, data, 3
+	add	result, result, data
+L(done):
+	ret
+
+.option pop
+
+END (STRLEN)
+libc_hidden_builtin_def (STRLEN)
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 15/19] riscv: Add ifunc support for strcmp
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (13 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 14/19] riscv: Add accelerated strlen routine Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 16/19] riscv: Add accelerated strcmp routines Christoph Muellner
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds ifunc support for calls to strcmp to the RISC-V code.
No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  4 ++-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |  2 ++
 sysdeps/riscv/multiarch/strcmp.c          | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/strcmp_generic.c  | 32 ++++++++++++++++++
 4 files changed, 77 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/strcmp.c
 create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index b2247b7326..3017bde75a 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -9,5 +9,7 @@ sysdep_routines += \
 	memset_rv64_unaligned_cboz64 \
 	\
 	strlen_generic \
-	strlen_zbb
+	strlen_zbb \
+	\
+	strcmp_generic
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 2b4d2e1c17..64331a4c7f 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -58,6 +58,8 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_zbb)
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
 
+  IFUNC_IMPL (i, name, strcmp,
+	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcmp_generic))
 
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/strcmp.c b/sysdeps/riscv/multiarch/strcmp.c
new file mode 100644
index 0000000000..8c21a90afd
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strcmp.c
@@ -0,0 +1,40 @@
+/* Multiple versions of strcmp. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine strcmp so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef strcmp
+# define strcmp __redirect_strcmp
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_strcmp) __libc_strcmp;
+extern __typeof (__redirect_strcmp) __strcmp_generic attribute_hidden;
+
+libc_ifunc (__libc_strcmp, __strcmp_generic);
+
+# undef strcmp
+strong_alias (__libc_strcmp, strcmp);
+#else
+# include <string/strcmp.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/strcmp_generic.c b/sysdeps/riscv/multiarch/strcmp_generic.c
new file mode 100644
index 0000000000..d85cf3940f
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strcmp_generic.c
@@ -0,0 +1,32 @@
+/* strcmp for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define STRCMP __strcmp_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__strcmp_generic, __GI_strcmp, __strcmp_generic);
+#endif
+
+extern int __strcmp_generic(const char *s1, const char *s2);
+
+#include <string/strcmp.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (14 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 15/19] riscv: Add ifunc support for strcmp Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07 11:57   ` Xi Ruoyao
  2023-02-07  0:16 ` [RFC PATCH 17/19] riscv: Add ifunc support for strncmp Christoph Muellner
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of strcmp() can be accelerated using Zbb's orc.b
instruction and fast unaligned accesses. Howver, strcmp can use
unaligned accesses only if such an address does not change the
exception behaviour (compared to a single-byte compare loop).
Let's add an implementation that does all that.
Additionally, let's add the strcmp implementation from the
Bitmanip specification, which does not do any unaligned accesses.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile              |   4 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c     |   4 +-
 sysdeps/riscv/multiarch/strcmp.c              |  11 +-
 sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 +++++++++
 .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++++++++++++
 5 files changed, 332 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
 create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 3017bde75a..73a62be85d 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -11,5 +11,7 @@ sysdep_routines += \
 	strlen_generic \
 	strlen_zbb \
 	\
-	strcmp_generic
+	strcmp_generic \
+	strcmp_zbb \
+	strcmp_zbb_unaligned
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index 64331a4c7f..d354aa1178 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -59,7 +59,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
 
   IFUNC_IMPL (i, name, strcmp,
-	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcmp_generic))
+	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_zbb_unaligned)
+	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_zbb)
+	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
 
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/strcmp.c b/sysdeps/riscv/multiarch/strcmp.c
index 8c21a90afd..d3f2fe19ae 100644
--- a/sysdeps/riscv/multiarch/strcmp.c
+++ b/sysdeps/riscv/multiarch/strcmp.c
@@ -30,8 +30,15 @@
 
 extern __typeof (__redirect_strcmp) __libc_strcmp;
 extern __typeof (__redirect_strcmp) __strcmp_generic attribute_hidden;
-
-libc_ifunc (__libc_strcmp, __strcmp_generic);
+extern __typeof (__redirect_strcmp) __strcmp_zbb attribute_hidden;
+extern __typeof (__redirect_strcmp) __strcmp_zbb_unaligned attribute_hidden;
+
+libc_ifunc (__libc_strcmp,
+	    HAVE_RV(zbb) && HAVE_FAST_UNALIGNED()
+	    ? __strcmp_zbb_unaligned
+	    : HAVE_RV(zbb)
+	      ? __strcmp_zbb
+	      : __strcmp_generic);
 
 # undef strcmp
 strong_alias (__libc_strcmp, strcmp);
diff --git a/sysdeps/riscv/multiarch/strcmp_zbb.S b/sysdeps/riscv/multiarch/strcmp_zbb.S
new file mode 100644
index 0000000000..1c265d6107
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strcmp_zbb.S
@@ -0,0 +1,104 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+/* Assumptions: rvi_zbb.  */
+/* Implementation from the Bitmanip specification.  */
+
+#define src1		a0
+#define result		a0
+#define src2		a1
+#define data1		a2
+#define data2		a3
+#define align		a4
+#define data1_orcb	t0
+#define m1		t2
+
+#if __riscv_xlen == 64
+# define REG_L	ld
+# define SZREG	8
+#else
+# define REG_L	lw
+# define SZREG	4
+#endif
+
+#ifndef STRCMP
+# define STRCMP __strcmp_zbb
+#endif
+
+.option push
+.option arch,+zbb
+
+ENTRY_ALIGN (STRCMP, 6)
+	or	align, src1, src2
+	and	align, align, SZREG-1
+	bnez	align, L(simpleloop)
+	li	m1, -1
+
+	/* Main loop for aligned strings.  */
+	.p2align 2
+L(loop):
+	REG_L	data1, 0(src1)
+	REG_L	data2, 0(src2)
+	orc.b	data1_orcb, data1
+	bne	data1_orcb, m1, L(foundnull)
+	addi	src1, src1, SZREG
+	addi	src2, src2, SZREG
+	beq	data1, data2, L(loop)
+
+	/* Words don't match, and no null byte in the first word.
+	 * Get bytes in big-endian order and compare.  */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	rev8	data1, data1
+	rev8	data2, data2
+#endif
+	/* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
+	sltu	result, data1, data2
+	neg	result, result
+	ori	result, result, 1
+	ret
+
+L(foundnull):
+	/* Found a null byte.
+	 * If words don't match, fall back to simple loop.  */
+	bne	data1, data2, L(simpleloop)
+
+	/* Otherwise, strings are equal.  */
+	li	result, 0
+	ret
+
+	/* Simple loop for misaligned strings.  */
+	.p2align 3
+L(simpleloop):
+	lbu	data1, 0(src1)
+	lbu	data2, 0(src2)
+	addi	src1, src1, 1
+	addi	src2, src2, 1
+	bne	data1, data2, L(sub)
+	bnez	data1, L(simpleloop)
+
+L(sub):
+	sub	result, data1, data2
+	ret
+
+.option pop
+
+END (STRCMP)
+libc_hidden_builtin_def (STRCMP)
diff --git a/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
new file mode 100644
index 0000000000..ec21982b65
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
@@ -0,0 +1,213 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+/* Assumptions: rvi_zbb with fast unaligned access.  */
+/* Implementation inspired by aarch64/strcmp.S.  */
+
+#define src1		a0
+#define result		a0
+#define src2		a1
+#define off		a3
+#define m1		a4
+#define align1		a5
+#define src3		a6
+#define tmp		a7
+
+#define data1		t0
+#define data2		t1
+#define b1		t0
+#define b2		t1
+#define data3		t2
+#define data1_orcb	t3
+#define data3_orcb	t4
+#define shift		t5
+
+#if __riscv_xlen == 64
+# define REG_L	ld
+# define SZREG	8
+# define PTRLOG	3
+#else
+# define REG_L	lw
+# define SZREG	4
+# define PTRLOG	2
+#endif
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+# error big endian is untested!
+# define CZ	ctz
+# define SHIFT	srl
+# define SHIFT2	sll
+#else
+# define CZ	ctz
+# define SHIFT	sll
+# define SHIFT2	srl
+#endif
+
+#ifndef STRCMP
+# define STRCMP __strcmp_zbb_unaligned
+#endif
+
+.option push
+.option arch,+zbb
+
+ENTRY_ALIGN (STRCMP, 6)
+	/* off...delta from src1 to src2.  */
+	sub	off, src2, src1
+	li	m1, -1
+	andi	tmp, off, SZREG-1
+	andi	align1, src1, SZREG-1
+	bnez	tmp, L(misaligned8)
+	bnez	align1, L(mutual_align)
+
+	.p2align 4
+L(loop_aligned):
+	REG_L	data1, 0(src1)
+	add	tmp, src1, off
+	addi	src1, src1, SZREG
+	REG_L	data2, 0(tmp)
+
+L(start_realigned):
+	orc.b	data1_orcb, data1
+	bne	data1_orcb, m1, L(end)
+	beq	data1, data2, L(loop_aligned)
+
+L(fast_end):
+	/* Words don't match, and no NUL byte in one word.
+	   Get bytes in big-endian order and compare as words.  */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	rev8	data1, data1
+	rev8	data2, data2
+#endif
+	/* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
+	sltu	result, data1, data2
+	neg	result, result
+	ori	result, result, 1
+	ret
+
+L(end_orc):
+	orc.b	data1_orcb, data1
+L(end):
+	/* Words don't match or NUL byte in at least one word.
+	   data1_orcb holds orc.b value of data1.  */
+	xor	tmp, data1, data2
+	orc.b	tmp, tmp
+
+	orn	tmp, tmp, data1_orcb
+	CZ	shift, tmp
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	rev8	data1, data1
+	rev8	data2, data2
+#endif
+	sll	data1, data1, shift
+	sll	data2, data2, shift
+	srl	b1, data1, SZREG*8-8
+	srl	b2, data2, SZREG*8-8
+
+L(end_singlebyte):
+	sub	result, b1, b2
+	ret
+
+	.p2align 4
+L(mutual_align):
+	/* Sources are mutually aligned, but are not currently at an
+	   alignment boundary.  Round down the addresses and then mask off
+	   the bytes that precede the start point.  */
+	andi	src1, src1, -SZREG
+	add	tmp, src1, off
+	REG_L	data1, 0(src1)
+	addi	src1, src1, SZREG
+	REG_L	data2, 0(tmp)
+	/* Get number of bits to mask.  */
+	sll	shift, src2, 3
+	/* Bits to mask are now 0, others are 1.  */
+	SHIFT	tmp, m1, shift
+	/* Or with inverted value -> masked bits become 1.  */
+	orn	data1, data1, tmp
+	orn	data2, data2, tmp
+	j	L(start_realigned)
+
+L(misaligned8):
+	/* Skip slow loop if SRC1 is aligned.  */
+	beqz	align1, L(src1_aligned)
+L(do_misaligned):
+	/* Align SRC1 to 8 bytes.  */
+	lbu	b1, 0(src1)
+	lbu	b2, 0(src2)
+	beqz	b1, L(end_singlebyte)
+	bne	b1, b2, L(end_singlebyte)
+	addi	src1, src1, 1
+	addi	src2, src2, 1
+	andi	align1, src1, SZREG-1
+	bnez	align1, L(do_misaligned)
+
+L(src1_aligned):
+	/* SRC1 is aligned. Align SRC2 down and check for NUL there.
+	 * If there is no NUL, we may read the next word from SRC2.
+	 * If there is a NUL, we must not read a complete word from SRC2
+	 * because we might cross a page boundary.  */
+	/* Get number of bits to mask (upper bits are ignored by shifts).  */
+	sll	shift, src2, 3
+	/* src3 := align_down (src2)  */
+	andi	src3, src2, -SZREG
+	REG_L   data3, 0(src3)
+	addi	src3, src3, SZREG
+
+	/* Bits to mask are now 0, others are 1.  */
+	SHIFT	tmp, m1, shift
+	/* Or with inverted value -> masked bits become 1.  */
+	orn	data3_orcb, data3, tmp
+	/* Check for NUL in next aligned word.  */
+	orc.b	data3_orcb, data3_orcb
+	bne	data3_orcb, m1, L(unaligned_nul)
+
+	.p2align 4
+L(loop_unaligned):
+	/* Read the (aligned) data1 and the unaligned data2.  */
+	REG_L	data1, 0(src1)
+	addi	src1, src1, SZREG
+	REG_L	data2, 0(src2)
+	addi	src2, src2, SZREG
+	orc.b	data1_orcb, data1
+	bne	data1_orcb, m1, L(end)
+	bne	data1, data2, L(end)
+
+	/* Read the next aligned-down word.  */
+	REG_L	data3, 0(src3)
+	addi	src3, src3, SZREG
+	orc.b	data3_orcb, data3
+	beq	data3_orcb, m1, L(loop_unaligned)
+
+L(unaligned_nul):
+	/* src1 points to unread word (only first bytes relevant).
+	 * data3 holds next aligned-down word with NUL.
+	 * Compare the first bytes of data1 with the last bytes of data3.  */
+	REG_L	data1, 0(src1)
+	/* Shift NUL bytes into data3 to become data2.  */
+	SHIFT2	data2, data3, shift
+	bne	data1, data2, L(end_orc)
+	li	result, 0
+	ret
+
+.option pop
+
+END (STRCMP)
+libc_hidden_builtin_def (STRCMP)
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 17/19] riscv: Add ifunc support for strncmp
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (15 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 16/19] riscv: Add accelerated strcmp routines Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:16 ` [RFC PATCH 18/19] riscv: Add an optimized strncmp routine Christoph Muellner
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

This patch adds ifunc support for calls to strncmp to the RISC-V code.
No optimized code is added as part of this patch.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |  3 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |  2 ++
 sysdeps/riscv/multiarch/strncmp.c         | 40 +++++++++++++++++++++++
 sysdeps/riscv/multiarch/strncmp_generic.c | 32 ++++++++++++++++++
 4 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/riscv/multiarch/strncmp.c
 create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 73a62be85d..056ce2ffc0 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -13,5 +13,6 @@ sysdep_routines += \
 	\
 	strcmp_generic \
 	strcmp_zbb \
-	strcmp_zbb_unaligned
+	strcmp_zbb_unaligned \
+	strncmp_generic
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index d354aa1178..eb37ed6017 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -63,5 +63,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_zbb)
 	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
 
+  IFUNC_IMPL (i, name, strncmp,
+	      IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
new file mode 100644
index 0000000000..970aeb8b85
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strncmp.c
@@ -0,0 +1,40 @@
+/* Multiple versions of strncmp. RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+/* Define multiple versions only for the definition in libc.  */
+
+#if IS_IN (libc)
+/* Redefine strncmp so that the compiler won't complain about the type
+   mismatch with the IFUNC selector in strong_alias, below.  */
+# undef strncmp
+# define strncmp __redirect_strncmp
+# include <string.h>
+# include <ldsodefs.h>
+# include <sys/auxv.h>
+# include <init-arch.h>
+
+extern __typeof (__redirect_strncmp) __libc_strncmp;
+extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
+
+libc_ifunc (__libc_strncmp, __strncmp_generic);
+
+# undef strncmp
+strong_alias (__libc_strncmp, strncmp);
+#else
+# include <string/strncmp.c>
+#endif
diff --git a/sysdeps/riscv/multiarch/strncmp_generic.c b/sysdeps/riscv/multiarch/strncmp_generic.c
new file mode 100644
index 0000000000..9d8cdf2f1a
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strncmp_generic.c
@@ -0,0 +1,32 @@
+/* strncmp for RISC-V, default version for internal use.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <string.h>
+
+#define STRNCMP __strncmp_generic
+
+#ifdef SHARED
+# undef libc_hidden_builtin_def
+# define libc_hidden_builtin_def(name) \
+  __hidden_ver1(__strncmp_generic, __GI_strncmp, __strncmp_generic);
+#endif
+
+extern int __strncmp_generic(const char *s1, const char *s2, size_t n);
+
+#include <string/strncmp.c>
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (16 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 17/19] riscv: Add ifunc support for strncmp Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  1:19   ` Noah Goldstein
  2023-02-07  0:16 ` [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops Christoph Muellner
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The implementation of strncmp() can be accelerated using Zbb's orc.b
instruction. Let's add an optimized implementation that makes use
of this instruction.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile          |   3 +-
 sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
 sysdeps/riscv/multiarch/strncmp.c         |   6 +-
 sysdeps/riscv/multiarch/strncmp_zbb.S     | 119 ++++++++++++++++++++++
 4 files changed, 127 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 056ce2ffc0..9f22e31b99 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -14,5 +14,6 @@ sysdep_routines += \
 	strcmp_generic \
 	strcmp_zbb \
 	strcmp_zbb_unaligned \
-	strncmp_generic
+	strncmp_generic \
+	strncmp_zbb
 endif
diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
index eb37ed6017..82fd34d010 100644
--- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
@@ -64,6 +64,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
 
   IFUNC_IMPL (i, name, strncmp,
+	      IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_zbb)
 	      IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
   return i;
 }
diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
index 970aeb8b85..5b0fe08e98 100644
--- a/sysdeps/riscv/multiarch/strncmp.c
+++ b/sysdeps/riscv/multiarch/strncmp.c
@@ -30,8 +30,12 @@
 
 extern __typeof (__redirect_strncmp) __libc_strncmp;
 extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
+extern __typeof (__redirect_strncmp) __strncmp_zbb attribute_hidden;
 
-libc_ifunc (__libc_strncmp, __strncmp_generic);
+libc_ifunc (__libc_strncmp,
+	    HAVE_RV(zbb)
+	    ? __strncmp_zbb
+	    : __strncmp_generic);
 
 # undef strncmp
 strong_alias (__libc_strncmp, strncmp);
diff --git a/sysdeps/riscv/multiarch/strncmp_zbb.S b/sysdeps/riscv/multiarch/strncmp_zbb.S
new file mode 100644
index 0000000000..29cff30def
--- /dev/null
+++ b/sysdeps/riscv/multiarch/strncmp_zbb.S
@@ -0,0 +1,119 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+/* Assumptions: rvi_zbb.  */
+
+#define src1		a0
+#define result		a0
+#define src2		a1
+#define len		a2
+#define data1		a2
+#define data2		a3
+#define align		a4
+#define data1_orcb	t0
+#define limit		t1
+#define fast_limit      t2
+#define m1		t3
+
+#if __riscv_xlen == 64
+# define REG_L	ld
+# define SZREG	8
+# define PTRLOG	3
+#else
+# define REG_L	lw
+# define SZREG	4
+# define PTRLOG	2
+#endif
+
+#ifndef STRNCMP
+# define STRNCMP __strncmp_zbb
+#endif
+
+.option push
+.option arch,+zbb
+
+ENTRY_ALIGN (STRNCMP, 6)
+	beqz	len, L(equal)
+	or	align, src1, src2
+	and	align, align, SZREG-1
+	add	limit, src1, len
+	bnez	align, L(simpleloop)
+	li	m1, -1
+
+	/* Adjust limit for fast-path.  */
+	andi fast_limit, limit, -SZREG
+
+	/* Main loop for aligned string.  */
+	.p2align 3
+L(loop):
+	bge	src1, fast_limit, L(simpleloop)
+	REG_L	data1, 0(src1)
+	REG_L	data2, 0(src2)
+	orc.b	data1_orcb, data1
+	bne	data1_orcb, m1, L(foundnull)
+	addi	src1, src1, SZREG
+	addi	src2, src2, SZREG
+	beq	data1, data2, L(loop)
+
+	/* Words don't match, and no null byte in the first
+	 * word. Get bytes in big-endian order and compare.  */
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	rev8	data1, data1
+	rev8	data2, data2
+#endif
+	/* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
+	sltu	result, data1, data2
+	neg	result, result
+	ori	result, result, 1
+	ret
+
+L(foundnull):
+	/* Found a null byte.
+	 * If words don't match, fall back to simple loop.  */
+	bne	data1, data2, L(simpleloop)
+
+	/* Otherwise, strings are equal.  */
+	li	result, 0
+	ret
+
+	/* Simple loop for misaligned strings.  */
+	.p2align 3
+L(simpleloop):
+	bge	src1, limit, L(equal)
+	lbu	data1, 0(src1)
+	addi	src1, src1, 1
+	lbu	data2, 0(src2)
+	addi	src2, src2, 1
+	bne	data1, data2, L(sub)
+	bnez	data1, L(simpleloop)
+
+L(sub):
+	sub	result, data1, data2
+	ret
+
+L(equal):
+	li	result, 0
+	ret
+
+.option pop
+
+END (STRNCMP)
+libc_hidden_builtin_def (STRNCMP)
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (17 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 18/19] riscv: Add an optimized strncmp routine Christoph Muellner
@ 2023-02-07  0:16 ` Christoph Muellner
  2023-02-07  0:23   ` Andrew Waterman
  2023-02-07  2:59 ` [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Kito Cheng
  2023-02-07 16:40 ` Adhemerval Zanella Netto
  20 siblings, 1 reply; 42+ messages in thread
From: Christoph Muellner @ 2023-02-07  0:16 UTC (permalink / raw)
  To: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner
  Cc: Christoph Müllner

From: Christoph Müllner <christoph.muellner@vrull.eu>

The spinning loop of PTHREAD_MUTEX_ADAPTIVE_NP provides the hook
atomic_spin_nop() that can be used by architectures.

On RISC-V we have two instructions that can be used here:
* WRS.STO from the Zawrs extension
* PAUSE from the Zihintpause extension

Let's use these instructions and prefer WRS.STO over PAUSE
(based on availability of the corresponding ISA extension
at runtime).

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/Makefile              |  5 +++
 sysdeps/riscv/multiarch/cpu_relax.c           | 36 +++++++++++++++++
 sysdeps/riscv/multiarch/cpu_relax_impl.S      | 40 +++++++++++++++++++
 .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
 4 files changed, 84 insertions(+)
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
 create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S

diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
index 9f22e31b99..b5b9fcf986 100644
--- a/sysdeps/riscv/multiarch/Makefile
+++ b/sysdeps/riscv/multiarch/Makefile
@@ -17,3 +17,8 @@ sysdep_routines += \
 	strncmp_generic \
 	strncmp_zbb
 endif
+
+# nscd uses atomic_spin_nop which in turn requires cpu_relax
+ifeq ($(subdir),nscd)
+routines += cpu_relax cpu_relax_impl
+endif
diff --git a/sysdeps/riscv/multiarch/cpu_relax.c b/sysdeps/riscv/multiarch/cpu_relax.c
new file mode 100644
index 0000000000..4e6825ca50
--- /dev/null
+++ b/sysdeps/riscv/multiarch/cpu_relax.c
@@ -0,0 +1,36 @@
+/* CPU strand yielding for busy loops.  RISC-V version.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <ldsodefs.h>
+#include <init-arch.h>
+
+void __cpu_relax (void);
+extern void __cpu_relax_zawrs (void);
+extern void __cpu_relax_zihintpause (void);
+
+static void
+__cpu_relax_generic (void)
+{
+}
+
+libc_ifunc (__cpu_relax,
+	    HAVE_RV(zawrs)
+	    ? __cpu_relax_zawrs
+	    : HAVE_RV(zihintpause)
+	      ? __cpu_relax_zihintpause
+	      : __cpu_relax_generic);
diff --git a/sysdeps/riscv/multiarch/cpu_relax_impl.S b/sysdeps/riscv/multiarch/cpu_relax_impl.S
new file mode 100644
index 0000000000..5d349c351f
--- /dev/null
+++ b/sysdeps/riscv/multiarch/cpu_relax_impl.S
@@ -0,0 +1,40 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+.option push
+.option arch,+zawrs
+
+ENTRY_ALIGN (__cpu_relax_zawrs, 4)
+	wrs.sto
+	ret
+END (__cpu_relax_zawrs)
+
+.option pop
+
+.option push
+.option arch,+zihintpause
+
+ENTRY_ALIGN (__cpu_relax_zihintpause, 4)
+	pause
+	ret
+END (__cpu_relax_zihintpause)
+
+.option pop
diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
index dbf70d8d57..88aa58ef95 100644
--- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
+++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
@@ -178,4 +178,7 @@
 # error "ISAs that do not subsume the A extension are not supported"
 #endif /* !__riscv_atomic */
 
+extern void __cpu_relax (void);
+#define atomic_spin_nop() __cpu_relax()
+
 #endif /* bits/atomic.h */
-- 
2.39.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
  2023-02-07  0:16 ` [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops Christoph Muellner
@ 2023-02-07  0:23   ` Andrew Waterman
  2023-02-07  0:29     ` Christoph Müllner
  0 siblings, 1 reply; 42+ messages in thread
From: Andrew Waterman @ 2023-02-07  0:23 UTC (permalink / raw)
  To: Christoph Muellner
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, DJ Delorie, Vineet Gupta,
	Kito Cheng, Jeff Law, Philipp Tomsich, Heiko Stuebner

Note that all implementations must support `pause`, since it's a HINT
instruction encoded within a base-ISA instruction that has no
architecturally visible effect.  So it's not clear to me that there's
any virtue in distinguishing implementations that claim to support
Zihintpause from those that don't.


On Mon, Feb 6, 2023 at 4:17 PM Christoph Muellner
<christoph.muellner@vrull.eu> wrote:
>
> From: Christoph Müllner <christoph.muellner@vrull.eu>
>
> The spinning loop of PTHREAD_MUTEX_ADAPTIVE_NP provides the hook
> atomic_spin_nop() that can be used by architectures.
>
> On RISC-V we have two instructions that can be used here:
> * WRS.STO from the Zawrs extension
> * PAUSE from the Zihintpause extension
>
> Let's use these instructions and prefer WRS.STO over PAUSE
> (based on availability of the corresponding ISA extension
> at runtime).
>
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> ---
>  sysdeps/riscv/multiarch/Makefile              |  5 +++
>  sysdeps/riscv/multiarch/cpu_relax.c           | 36 +++++++++++++++++
>  sysdeps/riscv/multiarch/cpu_relax_impl.S      | 40 +++++++++++++++++++
>  .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
>  4 files changed, 84 insertions(+)
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
>
> diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
> index 9f22e31b99..b5b9fcf986 100644
> --- a/sysdeps/riscv/multiarch/Makefile
> +++ b/sysdeps/riscv/multiarch/Makefile
> @@ -17,3 +17,8 @@ sysdep_routines += \
>         strncmp_generic \
>         strncmp_zbb
>  endif
> +
> +# nscd uses atomic_spin_nop which in turn requires cpu_relax
> +ifeq ($(subdir),nscd)
> +routines += cpu_relax cpu_relax_impl
> +endif
> diff --git a/sysdeps/riscv/multiarch/cpu_relax.c b/sysdeps/riscv/multiarch/cpu_relax.c
> new file mode 100644
> index 0000000000..4e6825ca50
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/cpu_relax.c
> @@ -0,0 +1,36 @@
> +/* CPU strand yielding for busy loops.  RISC-V version.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <ldsodefs.h>
> +#include <init-arch.h>
> +
> +void __cpu_relax (void);
> +extern void __cpu_relax_zawrs (void);
> +extern void __cpu_relax_zihintpause (void);
> +
> +static void
> +__cpu_relax_generic (void)
> +{
> +}
> +
> +libc_ifunc (__cpu_relax,
> +           HAVE_RV(zawrs)
> +           ? __cpu_relax_zawrs
> +           : HAVE_RV(zihintpause)
> +             ? __cpu_relax_zihintpause
> +             : __cpu_relax_generic);
> diff --git a/sysdeps/riscv/multiarch/cpu_relax_impl.S b/sysdeps/riscv/multiarch/cpu_relax_impl.S
> new file mode 100644
> index 0000000000..5d349c351f
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/cpu_relax_impl.S
> @@ -0,0 +1,40 @@
> +/* Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +.option push
> +.option arch,+zawrs
> +
> +ENTRY_ALIGN (__cpu_relax_zawrs, 4)
> +       wrs.sto
> +       ret
> +END (__cpu_relax_zawrs)
> +
> +.option pop
> +
> +.option push
> +.option arch,+zihintpause
> +
> +ENTRY_ALIGN (__cpu_relax_zihintpause, 4)
> +       pause
> +       ret
> +END (__cpu_relax_zihintpause)
> +
> +.option pop
> diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> index dbf70d8d57..88aa58ef95 100644
> --- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> +++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> @@ -178,4 +178,7 @@
>  # error "ISAs that do not subsume the A extension are not supported"
>  #endif /* !__riscv_atomic */
>
> +extern void __cpu_relax (void);
> +#define atomic_spin_nop() __cpu_relax()
> +
>  #endif /* bits/atomic.h */
> --
> 2.39.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
  2023-02-07  0:23   ` Andrew Waterman
@ 2023-02-07  0:29     ` Christoph Müllner
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Müllner @ 2023-02-07  0:29 UTC (permalink / raw)
  To: Andrew Waterman
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, DJ Delorie, Vineet Gupta,
	Kito Cheng, Jeff Law, Philipp Tomsich, Heiko Stuebner

[-- Attachment #1: Type: text/plain, Size: 5921 bytes --]

On Tue, Feb 7, 2023 at 1:23 AM Andrew Waterman <andrew@sifive.com> wrote:

> Note that all implementations must support `pause`, since it's a HINT
> instruction encoded within a base-ISA instruction that has no
> architecturally visible effect.  So it's not clear to me that there's
> any virtue in distinguishing implementations that claim to support
> Zihintpause from those that don't.
>

Will be considered in a v2.
Thanks!


>
>
> On Mon, Feb 6, 2023 at 4:17 PM Christoph Muellner
> <christoph.muellner@vrull.eu> wrote:
> >
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > The spinning loop of PTHREAD_MUTEX_ADAPTIVE_NP provides the hook
> > atomic_spin_nop() that can be used by architectures.
> >
> > On RISC-V we have two instructions that can be used here:
> > * WRS.STO from the Zawrs extension
> > * PAUSE from the Zihintpause extension
> >
> > Let's use these instructions and prefer WRS.STO over PAUSE
> > (based on availability of the corresponding ISA extension
> > at runtime).
> >
> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> > ---
> >  sysdeps/riscv/multiarch/Makefile              |  5 +++
> >  sysdeps/riscv/multiarch/cpu_relax.c           | 36 +++++++++++++++++
> >  sysdeps/riscv/multiarch/cpu_relax_impl.S      | 40 +++++++++++++++++++
> >  .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
> >  4 files changed, 84 insertions(+)
> >  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
> >  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
> >
> > diff --git a/sysdeps/riscv/multiarch/Makefile
> b/sysdeps/riscv/multiarch/Makefile
> > index 9f22e31b99..b5b9fcf986 100644
> > --- a/sysdeps/riscv/multiarch/Makefile
> > +++ b/sysdeps/riscv/multiarch/Makefile
> > @@ -17,3 +17,8 @@ sysdep_routines += \
> >         strncmp_generic \
> >         strncmp_zbb
> >  endif
> > +
> > +# nscd uses atomic_spin_nop which in turn requires cpu_relax
> > +ifeq ($(subdir),nscd)
> > +routines += cpu_relax cpu_relax_impl
> > +endif
> > diff --git a/sysdeps/riscv/multiarch/cpu_relax.c
> b/sysdeps/riscv/multiarch/cpu_relax.c
> > new file mode 100644
> > index 0000000000..4e6825ca50
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/cpu_relax.c
> > @@ -0,0 +1,36 @@
> > +/* CPU strand yielding for busy loops.  RISC-V version.
> > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <ldsodefs.h>
> > +#include <init-arch.h>
> > +
> > +void __cpu_relax (void);
> > +extern void __cpu_relax_zawrs (void);
> > +extern void __cpu_relax_zihintpause (void);
> > +
> > +static void
> > +__cpu_relax_generic (void)
> > +{
> > +}
> > +
> > +libc_ifunc (__cpu_relax,
> > +           HAVE_RV(zawrs)
> > +           ? __cpu_relax_zawrs
> > +           : HAVE_RV(zihintpause)
> > +             ? __cpu_relax_zihintpause
> > +             : __cpu_relax_generic);
> > diff --git a/sysdeps/riscv/multiarch/cpu_relax_impl.S
> b/sysdeps/riscv/multiarch/cpu_relax_impl.S
> > new file mode 100644
> > index 0000000000..5d349c351f
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/cpu_relax_impl.S
> > @@ -0,0 +1,40 @@
> > +/* Copyright (C) 2022 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +.option push
> > +.option arch,+zawrs
> > +
> > +ENTRY_ALIGN (__cpu_relax_zawrs, 4)
> > +       wrs.sto
> > +       ret
> > +END (__cpu_relax_zawrs)
> > +
> > +.option pop
> > +
> > +.option push
> > +.option arch,+zihintpause
> > +
> > +ENTRY_ALIGN (__cpu_relax_zihintpause, 4)
> > +       pause
> > +       ret
> > +END (__cpu_relax_zihintpause)
> > +
> > +.option pop
> > diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > index dbf70d8d57..88aa58ef95 100644
> > --- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > +++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > @@ -178,4 +178,7 @@
> >  # error "ISAs that do not subsume the A extension are not supported"
> >  #endif /* !__riscv_atomic */
> >
> > +extern void __cpu_relax (void);
> > +#define atomic_spin_nop() __cpu_relax()
> > +
> >  #endif /* bits/atomic.h */
> > --
> > 2.39.1
> >
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-07  0:16 ` [RFC PATCH 18/19] riscv: Add an optimized strncmp routine Christoph Muellner
@ 2023-02-07  1:19   ` Noah Goldstein
  2023-02-08 15:13     ` Philipp Tomsich
  0 siblings, 1 reply; 42+ messages in thread
From: Noah Goldstein @ 2023-02-07  1:19 UTC (permalink / raw)
  To: Christoph Muellner
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner

On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner
<christoph.muellner@vrull.eu> wrote:
>
> From: Christoph Müllner <christoph.muellner@vrull.eu>
>
> The implementation of strncmp() can be accelerated using Zbb's orc.b
> instruction. Let's add an optimized implementation that makes use
> of this instruction.
>
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>

Not necessary, but imo performance patches should have at least some reference
to the expected speedup versus the existing alternatives.
> ---
>  sysdeps/riscv/multiarch/Makefile          |   3 +-
>  sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
>  sysdeps/riscv/multiarch/strncmp.c         |   6 +-
>  sysdeps/riscv/multiarch/strncmp_zbb.S     | 119 ++++++++++++++++++++++
>  4 files changed, 127 insertions(+), 2 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>
> diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
> index 056ce2ffc0..9f22e31b99 100644
> --- a/sysdeps/riscv/multiarch/Makefile
> +++ b/sysdeps/riscv/multiarch/Makefile
> @@ -14,5 +14,6 @@ sysdep_routines += \
>         strcmp_generic \
>         strcmp_zbb \
>         strcmp_zbb_unaligned \
> -       strncmp_generic
> +       strncmp_generic \
> +       strncmp_zbb
>  endif
> diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> index eb37ed6017..82fd34d010 100644
> --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> @@ -64,6 +64,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>               IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
>
>    IFUNC_IMPL (i, name, strncmp,
> +             IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_zbb)
>               IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
>    return i;
>  }
> diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
> index 970aeb8b85..5b0fe08e98 100644
> --- a/sysdeps/riscv/multiarch/strncmp.c
> +++ b/sysdeps/riscv/multiarch/strncmp.c
> @@ -30,8 +30,12 @@
>
>  extern __typeof (__redirect_strncmp) __libc_strncmp;
>  extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
> +extern __typeof (__redirect_strncmp) __strncmp_zbb attribute_hidden;
>
> -libc_ifunc (__libc_strncmp, __strncmp_generic);
> +libc_ifunc (__libc_strncmp,
> +           HAVE_RV(zbb)
> +           ? __strncmp_zbb
> +           : __strncmp_generic);
>
>  # undef strncmp
>  strong_alias (__libc_strncmp, strncmp);
> diff --git a/sysdeps/riscv/multiarch/strncmp_zbb.S b/sysdeps/riscv/multiarch/strncmp_zbb.S
> new file mode 100644
> index 0000000000..29cff30def
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/strncmp_zbb.S
> @@ -0,0 +1,119 @@
> +/* Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +/* Assumptions: rvi_zbb.  */
> +
> +#define src1           a0
> +#define result         a0
> +#define src2           a1
> +#define len            a2
> +#define data1          a2
> +#define data2          a3
> +#define align          a4
> +#define data1_orcb     t0
> +#define limit          t1
> +#define fast_limit      t2
> +#define m1             t3
> +
> +#if __riscv_xlen == 64
> +# define REG_L ld
> +# define SZREG 8
> +# define PTRLOG        3
> +#else
> +# define REG_L lw
> +# define SZREG 4
> +# define PTRLOG        2
> +#endif
> +
> +#ifndef STRNCMP
> +# define STRNCMP __strncmp_zbb
> +#endif
> +
> +.option push
> +.option arch,+zbb
> +
> +ENTRY_ALIGN (STRNCMP, 6)
> +       beqz    len, L(equal)
> +       or      align, src1, src2
> +       and     align, align, SZREG-1
> +       add     limit, src1, len
> +       bnez    align, L(simpleloop)
> +       li      m1, -1
> +
> +       /* Adjust limit for fast-path.  */
> +       andi fast_limit, limit, -SZREG
> +
> +       /* Main loop for aligned string.  */
> +       .p2align 3
> +L(loop):
> +       bge     src1, fast_limit, L(simpleloop)
> +       REG_L   data1, 0(src1)
> +       REG_L   data2, 0(src2)
> +       orc.b   data1_orcb, data1
> +       bne     data1_orcb, m1, L(foundnull)
> +       addi    src1, src1, SZREG
> +       addi    src2, src2, SZREG
> +       beq     data1, data2, L(loop)
> +
> +       /* Words don't match, and no null byte in the first
> +        * word. Get bytes in big-endian order and compare.  */
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +       rev8    data1, data1
> +       rev8    data2, data2
> +#endif
> +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
> +       sltu    result, data1, data2
> +       neg     result, result
> +       ori     result, result, 1
> +       ret
> +
> +L(foundnull):
> +       /* Found a null byte.
> +        * If words don't match, fall back to simple loop.  */
> +       bne     data1, data2, L(simpleloop)
> +
> +       /* Otherwise, strings are equal.  */
> +       li      result, 0
> +       ret
> +
> +       /* Simple loop for misaligned strings.  */
> +       .p2align 3
> +L(simpleloop):
> +       bge     src1, limit, L(equal)
> +       lbu     data1, 0(src1)
> +       addi    src1, src1, 1
> +       lbu     data2, 0(src2)
> +       addi    src2, src2, 1
> +       bne     data1, data2, L(sub)
> +       bnez    data1, L(simpleloop)
> +
> +L(sub):
> +       sub     result, data1, data2
> +       ret
> +
> +L(equal):
> +       li      result, 0
> +       ret
> +
> +.option pop
> +
> +END (STRNCMP)
> +libc_hidden_builtin_def (STRNCMP)
> --
> 2.39.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (18 preceding siblings ...)
  2023-02-07  0:16 ` [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops Christoph Muellner
@ 2023-02-07  2:59 ` Kito Cheng
  2023-02-07 16:40 ` Adhemerval Zanella Netto
  20 siblings, 0 replies; 42+ messages in thread
From: Kito Cheng @ 2023-02-07  2:59 UTC (permalink / raw)
  To: Christoph Muellner
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Jeff Law, Philipp Tomsich,
	Heiko Stuebner

nit: copyright year should be 2023

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables
  2023-02-07  0:16 ` [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables Christoph Muellner
@ 2023-02-07  6:20   ` David Abdurachmanov
  0 siblings, 0 replies; 42+ messages in thread
From: David Abdurachmanov @ 2023-02-07  6:20 UTC (permalink / raw)
  To: Christoph Muellner
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner

On Tue, Feb 7, 2023 at 1:20 AM Christoph Muellner
<christoph.muellner@vrull.eu> wrote:
>
> From: Christoph Müllner <christoph.muellner@vrull.eu>
>
> RISC-V does not have a reliable mechanism to detect hart features
> like supported ISA extensions or cache block sizes at run-time
> as of now.
>
> Not knowing the hart features limits optimization strategies of glibc
> (e.g. ifunc support requires run-time hard feature knowledge).
>
> To circumvent this limitation this patch introduces a mechanism to
> get the hart features via environment variables:
> * RISCV_RT_MARCH represents a lower-case ISA string (-march string)
>   E.g. RISCV_RT_MARCH=rv64gc_zicboz
> * RISCV_RT_CBOM_BLOCKSIZE represents the cbom instruction block size
>   E.g. RISCV_RT_CBOZ_BLOCKSIZE=64
> * RISCV_RT_CBOZ_BLOCKSIZE represents the cboz instruction block size

Hi,

Non-expert here, but have you considered defining RISCV glibc
tunables? It's designed for this purpose.

See: https://www.gnu.org/software/libc/manual/html_node/Tunables.html
Especially hardware capability tunables:
https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html

It would be nice to have tunables defined for RISC-V that could stay
with us for a long time.

Cheers,
david

> These environment variables are parsed during startup and the found
> ISA extensions are stored a struct (hart_features) for evaluation
> by dynamic dispatching code.
>
> As the parser code is executed very early, we cannot call functions
> that have direct or indirect (via getenv()) dependencies to strlen()
> and strncmp(), as these functions cannot be called before the ifunc
> support is initialized. Therefore, this patch contains its own helper
> functions for strlen(), strncmp(), and getenv().
>
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> ---
>  sysdeps/unix/sysv/linux/riscv/hart-features.c | 294 ++++++++++++++++++
>  .../unix/sysv/linux/riscv/macro-for-each.h    |  24 ++
>  2 files changed, 318 insertions(+)
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
>
> diff --git a/sysdeps/unix/sysv/linux/riscv/hart-features.c b/sysdeps/unix/sysv/linux/riscv/hart-features.c
> index 41111eff57..6de41a26cc 100644
> --- a/sysdeps/unix/sysv/linux/riscv/hart-features.c
> +++ b/sysdeps/unix/sysv/linux/riscv/hart-features.c
> @@ -17,12 +17,17 @@
>     <https://www.gnu.org/licenses/>.  */
>
>  #include <hart-features.h>
> +#include <macro-for-each.h>
> +#include <string_private.h>
>
>  /* The code in this file is executed very early, so we cannot call
>     indirect functions because ifunc support is not initialized.
>     Therefore this file adds a few simple helper functions to avoid
>     dependencies to functions outside of this file.  */
>
> +#define xstr(s) str(s)
> +#define str(s) #s
> +
>  static inline void
>  inhibit_loop_to_libcall
>  simple_memset (void *s, int c, size_t n)
> @@ -35,9 +40,298 @@ simple_memset (void *s, int c, size_t n)
>      }
>  }
>
> +static inline size_t
> +inhibit_loop_to_libcall
> +simple_strlen (const char *s)
> +{
> +  size_t n = 0;
> +  char c = *s;
> +  while (c != 0)
> +    {
> +      s++;
> +      n++;
> +      c = *s;
> +    }
> +  return n;
> +}
> +
> +static inline int
> +inhibit_loop_to_libcall
> +simple_strncmp (const char *s1, const char *s2, size_t n)
> +{
> +  while (n != 0)
> +    {
> +      if (*s1 == 0 || *s1 != *s2)
> +       return *((const unsigned char *)s1) - *((const unsigned char *)s2);
> +      n--;
> +      s1++;
> +      s2++;
> +    }
> +  return 0;
> +}
> +
> +extern char **__environ;
> +static inline char*
> +simple_getenv (const char *name)
> +{
> +  char **ep;
> +  uint16_t name_start;
> +
> +  if (__environ == NULL || name[0] == 0 || name[1] == 0)
> +    return NULL;
> +
> +  size_t len = simple_strlen (name);
> +#if _STRING_ARCH_unaligned
> +  name_start = *(const uint16_t *) name;
> +#else
> +  name_start = (((const unsigned char *) name)[0]
> +               | (((const unsigned char *) name)[1] << 8));
> +#endif
> +  len -= 2;
> +  name += 2;
> +
> +  for (ep = __environ; *ep != NULL; ++ep)
> +    {
> +#if _STRING_ARCH_unaligned
> +      uint16_t ep_start = *(uint16_t *) *ep;
> +#else
> +      uint16_t ep_start = (((unsigned char *) *ep)[0]
> +                          | (((unsigned char *) *ep)[1] << 8));
> +#endif
> +      if (name_start == ep_start && !simple_strncmp (*ep + 2, name, len)
> +         && (*ep)[len + 2] == '=')
> +       return &(*ep)[len + 3];
> +    }
> +  return NULL;
> +}
> +
> +/* Check if the given number is a power of 2.
> +   Return true if so, or false otherwise.  */
> +static inline int
> +is_power_of_two (unsigned long v)
> +{
> +  return (v & (v - 1)) == 0;
> +}
> +
> +/* Check if the given string str starts with
> +   the prefix pre.  Return true if so, or false
> +   otherwise.  */
> +static inline int
> +starts_with (const char *str, const char *pre)
> +{
> +  return simple_strncmp (pre, str, simple_strlen (pre)) == 0;
> +}
> +
> +/* Lower all characters of a string up to the
> +   first NUL-character in the string.  */
> +static inline void
> +strtolower (char *s)
> +{
> +  char c = *s;
> +  while (c != '\0')
> +    {
> +      if (c >= 'A' && c <= 'Z')
> +       *s = c + 'a' - 'A';
> +      s++;
> +      c = *s;
> +    }
> +}
> +
> +/* Count the number of detected extensions.  */
> +static inline unsigned long
> +count_extensions (struct hart_features *hart_features)
> +{
> +  unsigned long n = 0;
> +#define ISA_EXT(e)                                                     \
> +  if (hart_features->have_##e == 1)                                    \
> +    n++;
> +#define ISA_EXT_GROUP(g, ...)                                          \
> +  if (hart_features->have_##g == 1)                                    \
> +    n++;
> +#include "isa-extensions.def"
> +  return n;
> +}
> +
> +/* Check if the given charater is not '0'-'9'.  */
> +static inline int
> +notanumber (const char c)
> +{
> +  return (c < '0' || c > '9');
> +}
> +
> +/* Parse RISCV_RT_MARCH and store found extensions.  */
> +static inline void
> +parse_rt_march (struct hart_features *hart_features)
> +{
> +  const char* s = simple_getenv ("RISCV_RT_MARCH");
> +  if (s == NULL)
> +    goto end;
> +
> +  hart_features->rt_march = s;
> +
> +  /* "RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I
> +      indicating the supported address space size in bits for the base
> +      integer ISA."  */
> +  if (starts_with (s, "rv32") && notanumber (*(s+4)))
> +    {
> +      hart_features->xlen = 32;
> +      s += 4;
> +    }
> +  else if (starts_with (s, "rv64") && notanumber (*(s+4)))
> +    {
> +      hart_features->xlen = 64;
> +      s += 4;
> +    }
> +  else if (starts_with (s, "rv128") && notanumber (*(s+5)))
> +    {
> +      hart_features->xlen = 128;
> +      s += 5;
> +    }
> +  else
> +    {
> +      goto fail;
> +    }
> +
> +  /* Parse the extensions.  */
> +  const char *s_old = s;
> +  while (*s != '\0')
> +    {
> +#define ISA_EXT(e)                                                     \
> +      else if (starts_with (s, xstr (e)))                              \
> +       {                                                               \
> +         hart_features->have_##e = 1;                                  \
> +         s += simple_strlen (xstr (e));                                \
> +       }
> +#define ISA_EXT_GROUP(g, ...)                                          \
> +      ISA_EXT (g)
> +      if (0);
> +#include "isa-extensions.def"
> +
> +      /* Consume optional version information.  */
> +      while (*s >= '0' && *s <= '9')
> +       s++;
> +      while (*s == 'p')
> +       s++;
> +      while (*s >= '0' && *s <= '9')
> +       s++;
> +
> +      /* Consume optional '_'.  */
> +      if (*s == '_')
> +       s++;
> +
> +      /* If we got stuck, bail out.  */
> +      if (s == s_old)
> +       goto fail;
> +    }
> +
> +  /* Propagate subsets (until we reach a fixpoint).  */
> +  unsigned long n = count_extensions (hart_features);
> +  while (1)
> +    {
> +      /* Forward-propagation.  E.g.:
> +      if (hart_features->have_g == 1)
> +       {
> +         hart_features->have_i = 1;
> +         ...
> +         hart_features->have_zifencei = 1;
> +       }  */
> +#define ISA_EXT_GROUP_HEAD(y)                                          \
> +      if (hart_features->have_##y)                                     \
> +       {
> +#define ISA_EXT_GROUP_SUBSET(s)                                                \
> +         hart_features->have_##s = 1;
> +#define ISA_EXT_GROUP_TAIL(z)                                          \
> +       }
> +#define ISA_EXT_GROUP(x, ...)                                          \
> +       ISA_EXT_GROUP_HEAD (x)                                          \
> +       FOR_EACH (ISA_EXT_GROUP_SUBSET, __VA_ARGS__)                    \
> +       ISA_EXT_GROUP_TAIL (x)
> +#include "isa-extensions.def"
> +#undef ISA_EXT_GROUP_HEAD
> +#undef ISA_EXT_GROUP_SUBSET
> +#undef ISA_EXT_GROUP_TAIL
> +
> +      /* Backward-propagation.  E.g.:
> +      if (1
> +         && hart_features->have_i == 1
> +         ...
> +         && hart_features->have_zifencei == 1
> +         )
> +       hart_features->have_g = 1;  */
> +#define ISA_EXT_GROUP_HEAD(y)                                          \
> +      if (1
> +#define ISA_EXT_GROUP_SUBSET(s)                                                \
> +         && hart_features->have_##s == 1
> +#define ISA_EXT_GROUP_TAIL(z)                                          \
> +         )                                                             \
> +       hart_features->have_##z = 1;
> +#define ISA_EXT_GROUP(x, ...)                                          \
> +       ISA_EXT_GROUP_HEAD (x)                                          \
> +       FOR_EACH (ISA_EXT_GROUP_SUBSET, __VA_ARGS__)                    \
> +       ISA_EXT_GROUP_TAIL (x)
> +#include "isa-extensions.def"
> +#undef ISA_EXT_GROUP_HEAD
> +#undef ISA_EXT_GROUP_SUBSET
> +#undef ISA_EXT_GROUP_TAIL
> +
> +      unsigned long n2 = count_extensions (hart_features);
> +      /* Stop if fix-point reached.  */
> +      if (n == n2)
> +       break;
> +      n = n2;
> +    }
> +
> +end:
> +  return;
> +
> +fail:
> +  hart_features->rt_march = NULL;
> +}
> +
> +/* Parse RISCV_RT_CBOM_BLOCKSIZE and store value.  */
> +static inline void
> +parse_rt_cbom_blocksize (struct hart_features *hart_features)
> +{
> +  hart_features->rt_cbom_blocksize = NULL;
> +  hart_features->cbom_blocksize = 0;
> +
> +  const char *s = simple_getenv ("RISCV_RT_CBOM_BLOCKSIZE");
> +  if (s == NULL)
> +    return;
> +
> +  uint64_t v = _dl_strtoul (s, NULL);
> +  if (!is_power_of_two (v))
> +    return;
> +
> +  hart_features->rt_cbom_blocksize = s;
> +  hart_features->cbom_blocksize = v;
> +}
> +
> +/* Parse RISCV_RT_CBOZ_BLOCKSIZE and store value.  */
> +static inline void
> +parse_rt_cboz_blocksize (struct hart_features *hart_features)
> +{
> +  hart_features->rt_cboz_blocksize = NULL;
> +  hart_features->cboz_blocksize = 0;
> +
> +  const char *s = simple_getenv ("RISCV_RT_CBOZ_BLOCKSIZE");
> +  if (s == NULL)
> +    return;
> +
> +  uint64_t v = _dl_strtoul (s, NULL);
> +  if (!is_power_of_two (v))
> +    return;
> +
> +  hart_features->rt_cboz_blocksize = s;
> +  hart_features->cboz_blocksize = v;
> +}
> +
>  /* Discover hart features and store them.  */
>  static inline void
>  init_hart_features (struct hart_features *hart_features)
>  {
>    simple_memset (hart_features, 0, sizeof (*hart_features));
> +  parse_rt_march (hart_features);
> +  parse_rt_cbom_blocksize (hart_features);
> +  parse_rt_cboz_blocksize (hart_features);
>  }
> diff --git a/sysdeps/unix/sysv/linux/riscv/macro-for-each.h b/sysdeps/unix/sysv/linux/riscv/macro-for-each.h
> new file mode 100644
> index 0000000000..524bef3c0a
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/riscv/macro-for-each.h
> @@ -0,0 +1,24 @@
> +/* Recursive macros implementation by David Mazières
> +   https://www.scs.stanford.edu/~dm/blog/va-opt.html  */
> +
> +#ifndef _MACRO_FOR_EACH_H
> +#define _MACRO_FOR_EACH_H
> +
> +#define EXPAND1(...) __VA_ARGS__
> +#define EXPAND2(...) EXPAND1 (EXPAND1 (EXPAND1 (EXPAND1 (__VA_ARGS__))))
> +#define EXPAND3(...) EXPAND2 (EXPAND2 (EXPAND2 (EXPAND2 (__VA_ARGS__))))
> +#define EXPAND4(...) EXPAND3 (EXPAND3 (EXPAND3 (EXPAND3 (__VA_ARGS__))))
> +#define EXPAND(...)  EXPAND4 (EXPAND4 (EXPAND4 (EXPAND4 (__VA_ARGS__))))
> +
> +#define FOR_EACH(macro, ...)                                           \
> +  __VA_OPT__ (EXPAND (FOR_EACH_HELPER (macro, __VA_ARGS__)))
> +
> +#define PARENS ()
> +
> +#define FOR_EACH_HELPER(macro, a1, ...)                                        \
> +  macro (a1)                                                           \
> +  __VA_OPT__ (FOR_EACH_AGAIN PARENS (macro, __VA_ARGS__))
> +
> +#define FOR_EACH_AGAIN() FOR_EACH_HELPER
> +
> +#endif /* _MACRO_FOR_EACH_H  */
> --
> 2.39.1
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-02-07  0:16 ` [RFC PATCH 16/19] riscv: Add accelerated strcmp routines Christoph Muellner
@ 2023-02-07 11:57   ` Xi Ruoyao
  2023-02-07 14:15     ` Christoph Müllner
  0 siblings, 1 reply; 42+ messages in thread
From: Xi Ruoyao @ 2023-02-07 11:57 UTC (permalink / raw)
  To: Christoph Muellner, libc-alpha, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law,
	Philipp Tomsich, Heiko Stuebner
  Cc: Adhemerval Zanella

Is it possible to make the optimized generic string routine [1] ifunc-
aware in some way?  Or can we "templatize" it so we can make vectorized
string routines more easily?

[1]: https://sourceware.org/pipermail/libc-alpha/2023-February/145211.html

On Tue, 2023-02-07 at 01:16 +0100, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> The implementation of strcmp() can be accelerated using Zbb's orc.b
> instruction and fast unaligned accesses. Howver, strcmp can use
> unaligned accesses only if such an address does not change the
> exception behaviour (compared to a single-byte compare loop).
> Let's add an implementation that does all that.
> Additionally, let's add the strcmp implementation from the
> Bitmanip specification, which does not do any unaligned accesses.
> 
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> ---
>  sysdeps/riscv/multiarch/Makefile              |   4 +-
>  sysdeps/riscv/multiarch/ifunc-impl-list.c     |   4 +-
>  sysdeps/riscv/multiarch/strcmp.c              |  11 +-
>  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 +++++++++
>  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213
> ++++++++++++++++++
>  5 files changed, 332 insertions(+), 4 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> 
> diff --git a/sysdeps/riscv/multiarch/Makefile
> b/sysdeps/riscv/multiarch/Makefile
> index 3017bde75a..73a62be85d 100644
> --- a/sysdeps/riscv/multiarch/Makefile
> +++ b/sysdeps/riscv/multiarch/Makefile
> @@ -11,5 +11,7 @@ sysdep_routines += \
>         strlen_generic \
>         strlen_zbb \
>         \
> -       strcmp_generic
> +       strcmp_generic \
> +       strcmp_zbb \
> +       strcmp_zbb_unaligned
>  endif
> diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> index 64331a4c7f..d354aa1178 100644
> --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> @@ -59,7 +59,9 @@ __libc_ifunc_impl_list (const char *name, struct
> libc_ifunc_impl *array,
>               IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
>  
>    IFUNC_IMPL (i, name, strcmp,
> -             IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcmp_generic))
> +             IFUNC_IMPL_ADD (array, i, strcmp, 1,
> __strcmp_zbb_unaligned)
> +             IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_zbb)
> +             IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
>  
>    return i;
>  }
> diff --git a/sysdeps/riscv/multiarch/strcmp.c
> b/sysdeps/riscv/multiarch/strcmp.c
> index 8c21a90afd..d3f2fe19ae 100644
> --- a/sysdeps/riscv/multiarch/strcmp.c
> +++ b/sysdeps/riscv/multiarch/strcmp.c
> @@ -30,8 +30,15 @@
>  
>  extern __typeof (__redirect_strcmp) __libc_strcmp;
>  extern __typeof (__redirect_strcmp) __strcmp_generic
> attribute_hidden;
> -
> -libc_ifunc (__libc_strcmp, __strcmp_generic);
> +extern __typeof (__redirect_strcmp) __strcmp_zbb attribute_hidden;
> +extern __typeof (__redirect_strcmp) __strcmp_zbb_unaligned
> attribute_hidden;
> +
> +libc_ifunc (__libc_strcmp,
> +           HAVE_RV(zbb) && HAVE_FAST_UNALIGNED()
> +           ? __strcmp_zbb_unaligned
> +           : HAVE_RV(zbb)
> +             ? __strcmp_zbb
> +             : __strcmp_generic);
>  
>  # undef strcmp
>  strong_alias (__libc_strcmp, strcmp);
> diff --git a/sysdeps/riscv/multiarch/strcmp_zbb.S
> b/sysdeps/riscv/multiarch/strcmp_zbb.S
> new file mode 100644
> index 0000000000..1c265d6107
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/strcmp_zbb.S
> @@ -0,0 +1,104 @@
> +/* Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be
> useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +/* Assumptions: rvi_zbb.  */
> +/* Implementation from the Bitmanip specification.  */
> +
> +#define src1           a0
> +#define result         a0
> +#define src2           a1
> +#define data1          a2
> +#define data2          a3
> +#define align          a4
> +#define data1_orcb     t0
> +#define m1             t2
> +
> +#if __riscv_xlen == 64
> +# define REG_L ld
> +# define SZREG 8
> +#else
> +# define REG_L lw
> +# define SZREG 4
> +#endif
> +
> +#ifndef STRCMP
> +# define STRCMP __strcmp_zbb
> +#endif
> +
> +.option push
> +.option arch,+zbb
> +
> +ENTRY_ALIGN (STRCMP, 6)
> +       or      align, src1, src2
> +       and     align, align, SZREG-1
> +       bnez    align, L(simpleloop)
> +       li      m1, -1
> +
> +       /* Main loop for aligned strings.  */
> +       .p2align 2
> +L(loop):
> +       REG_L   data1, 0(src1)
> +       REG_L   data2, 0(src2)
> +       orc.b   data1_orcb, data1
> +       bne     data1_orcb, m1, L(foundnull)
> +       addi    src1, src1, SZREG
> +       addi    src2, src2, SZREG
> +       beq     data1, data2, L(loop)
> +
> +       /* Words don't match, and no null byte in the first word.
> +        * Get bytes in big-endian order and compare.  */
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +       rev8    data1, data1
> +       rev8    data2, data2
> +#endif
> +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless
> sequence.  */
> +       sltu    result, data1, data2
> +       neg     result, result
> +       ori     result, result, 1
> +       ret
> +
> +L(foundnull):
> +       /* Found a null byte.
> +        * If words don't match, fall back to simple loop.  */
> +       bne     data1, data2, L(simpleloop)
> +
> +       /* Otherwise, strings are equal.  */
> +       li      result, 0
> +       ret
> +
> +       /* Simple loop for misaligned strings.  */
> +       .p2align 3
> +L(simpleloop):
> +       lbu     data1, 0(src1)
> +       lbu     data2, 0(src2)
> +       addi    src1, src1, 1
> +       addi    src2, src2, 1
> +       bne     data1, data2, L(sub)
> +       bnez    data1, L(simpleloop)
> +
> +L(sub):
> +       sub     result, data1, data2
> +       ret
> +
> +.option pop
> +
> +END (STRCMP)
> +libc_hidden_builtin_def (STRCMP)
> diff --git a/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> new file mode 100644
> index 0000000000..ec21982b65
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> @@ -0,0 +1,213 @@
> +/* Copyright (C) 2022 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be
> useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +/* Assumptions: rvi_zbb with fast unaligned access.  */
> +/* Implementation inspired by aarch64/strcmp.S.  */
> +
> +#define src1           a0
> +#define result         a0
> +#define src2           a1
> +#define off            a3
> +#define m1             a4
> +#define align1         a5
> +#define src3           a6
> +#define tmp            a7
> +
> +#define data1          t0
> +#define data2          t1
> +#define b1             t0
> +#define b2             t1
> +#define data3          t2
> +#define data1_orcb     t3
> +#define data3_orcb     t4
> +#define shift          t5
> +
> +#if __riscv_xlen == 64
> +# define REG_L ld
> +# define SZREG 8
> +# define PTRLOG        3
> +#else
> +# define REG_L lw
> +# define SZREG 4
> +# define PTRLOG        2
> +#endif
> +
> +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
> +# error big endian is untested!
> +# define CZ    ctz
> +# define SHIFT srl
> +# define SHIFT2        sll
> +#else
> +# define CZ    ctz
> +# define SHIFT sll
> +# define SHIFT2        srl
> +#endif
> +
> +#ifndef STRCMP
> +# define STRCMP __strcmp_zbb_unaligned
> +#endif
> +
> +.option push
> +.option arch,+zbb
> +
> +ENTRY_ALIGN (STRCMP, 6)
> +       /* off...delta from src1 to src2.  */
> +       sub     off, src2, src1
> +       li      m1, -1
> +       andi    tmp, off, SZREG-1
> +       andi    align1, src1, SZREG-1
> +       bnez    tmp, L(misaligned8)
> +       bnez    align1, L(mutual_align)
> +
> +       .p2align 4
> +L(loop_aligned):
> +       REG_L   data1, 0(src1)
> +       add     tmp, src1, off
> +       addi    src1, src1, SZREG
> +       REG_L   data2, 0(tmp)
> +
> +L(start_realigned):
> +       orc.b   data1_orcb, data1
> +       bne     data1_orcb, m1, L(end)
> +       beq     data1, data2, L(loop_aligned)
> +
> +L(fast_end):
> +       /* Words don't match, and no NUL byte in one word.
> +          Get bytes in big-endian order and compare as words.  */
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +       rev8    data1, data1
> +       rev8    data2, data2
> +#endif
> +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless
> sequence.  */
> +       sltu    result, data1, data2
> +       neg     result, result
> +       ori     result, result, 1
> +       ret
> +
> +L(end_orc):
> +       orc.b   data1_orcb, data1
> +L(end):
> +       /* Words don't match or NUL byte in at least one word.
> +          data1_orcb holds orc.b value of data1.  */
> +       xor     tmp, data1, data2
> +       orc.b   tmp, tmp
> +
> +       orn     tmp, tmp, data1_orcb
> +       CZ      shift, tmp
> +
> +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> +       rev8    data1, data1
> +       rev8    data2, data2
> +#endif
> +       sll     data1, data1, shift
> +       sll     data2, data2, shift
> +       srl     b1, data1, SZREG*8-8
> +       srl     b2, data2, SZREG*8-8
> +
> +L(end_singlebyte):
> +       sub     result, b1, b2
> +       ret
> +
> +       .p2align 4
> +L(mutual_align):
> +       /* Sources are mutually aligned, but are not currently at an
> +          alignment boundary.  Round down the addresses and then mask
> off
> +          the bytes that precede the start point.  */
> +       andi    src1, src1, -SZREG
> +       add     tmp, src1, off
> +       REG_L   data1, 0(src1)
> +       addi    src1, src1, SZREG
> +       REG_L   data2, 0(tmp)
> +       /* Get number of bits to mask.  */
> +       sll     shift, src2, 3
> +       /* Bits to mask are now 0, others are 1.  */
> +       SHIFT   tmp, m1, shift
> +       /* Or with inverted value -> masked bits become 1.  */
> +       orn     data1, data1, tmp
> +       orn     data2, data2, tmp
> +       j       L(start_realigned)
> +
> +L(misaligned8):
> +       /* Skip slow loop if SRC1 is aligned.  */
> +       beqz    align1, L(src1_aligned)
> +L(do_misaligned):
> +       /* Align SRC1 to 8 bytes.  */
> +       lbu     b1, 0(src1)
> +       lbu     b2, 0(src2)
> +       beqz    b1, L(end_singlebyte)
> +       bne     b1, b2, L(end_singlebyte)
> +       addi    src1, src1, 1
> +       addi    src2, src2, 1
> +       andi    align1, src1, SZREG-1
> +       bnez    align1, L(do_misaligned)
> +
> +L(src1_aligned):
> +       /* SRC1 is aligned. Align SRC2 down and check for NUL there.
> +        * If there is no NUL, we may read the next word from SRC2.
> +        * If there is a NUL, we must not read a complete word from
> SRC2
> +        * because we might cross a page boundary.  */
> +       /* Get number of bits to mask (upper bits are ignored by
> shifts).  */
> +       sll     shift, src2, 3
> +       /* src3 := align_down (src2)  */
> +       andi    src3, src2, -SZREG
> +       REG_L   data3, 0(src3)
> +       addi    src3, src3, SZREG
> +
> +       /* Bits to mask are now 0, others are 1.  */
> +       SHIFT   tmp, m1, shift
> +       /* Or with inverted value -> masked bits become 1.  */
> +       orn     data3_orcb, data3, tmp
> +       /* Check for NUL in next aligned word.  */
> +       orc.b   data3_orcb, data3_orcb
> +       bne     data3_orcb, m1, L(unaligned_nul)
> +
> +       .p2align 4
> +L(loop_unaligned):
> +       /* Read the (aligned) data1 and the unaligned data2.  */
> +       REG_L   data1, 0(src1)
> +       addi    src1, src1, SZREG
> +       REG_L   data2, 0(src2)
> +       addi    src2, src2, SZREG
> +       orc.b   data1_orcb, data1
> +       bne     data1_orcb, m1, L(end)
> +       bne     data1, data2, L(end)
> +
> +       /* Read the next aligned-down word.  */
> +       REG_L   data3, 0(src3)
> +       addi    src3, src3, SZREG
> +       orc.b   data3_orcb, data3
> +       beq     data3_orcb, m1, L(loop_unaligned)
> +
> +L(unaligned_nul):
> +       /* src1 points to unread word (only first bytes relevant).
> +        * data3 holds next aligned-down word with NUL.
> +        * Compare the first bytes of data1 with the last bytes of
> data3.  */
> +       REG_L   data1, 0(src1)
> +       /* Shift NUL bytes into data3 to become data2.  */
> +       SHIFT2  data2, data3, shift
> +       bne     data1, data2, L(end_orc)
> +       li      result, 0
> +       ret
> +
> +.option pop
> +
> +END (STRCMP)
> +libc_hidden_builtin_def (STRCMP)

-- 
Xi Ruoyao <xry111@xry111.site>
School of Aerospace Science and Technology, Xidian University

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-02-07 11:57   ` Xi Ruoyao
@ 2023-02-07 14:15     ` Christoph Müllner
  2023-03-31  5:06       ` Jeff Law
  2023-03-31 14:32       ` Jeff Law
  0 siblings, 2 replies; 42+ messages in thread
From: Christoph Müllner @ 2023-02-07 14:15 UTC (permalink / raw)
  To: Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law, Philipp Tomsich,
	Heiko Stuebner, Adhemerval Zanella

[-- Attachment #1: Type: text/plain, Size: 15816 bytes --]

On Tue, Feb 7, 2023 at 12:57 PM Xi Ruoyao <xry111@xry111.site> wrote:

> Is it possible to make the optimized generic string routine [1] ifunc-
> aware in some way?  Or can we "templatize" it so we can make vectorized
> string routines more easily?
>

You mean to compile the generic code multiple times with different sets
of enabled extensions? Yes, that should work (my patchset does the same
for memset). Let's wait for the referenced patchset to land (looks like this
will happen soon).

Thanks,
Christoph



>
> [1]: https://sourceware.org/pipermail/libc-alpha/2023-February/145211.html
>
> On Tue, 2023-02-07 at 01:16 +0100, Christoph Muellner wrote:
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > The implementation of strcmp() can be accelerated using Zbb's orc.b
> > instruction and fast unaligned accesses. Howver, strcmp can use
> > unaligned accesses only if such an address does not change the
> > exception behaviour (compared to a single-byte compare loop).
> > Let's add an implementation that does all that.
> > Additionally, let's add the strcmp implementation from the
> > Bitmanip specification, which does not do any unaligned accesses.
> >
> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> > ---
> >  sysdeps/riscv/multiarch/Makefile              |   4 +-
> >  sysdeps/riscv/multiarch/ifunc-impl-list.c     |   4 +-
> >  sysdeps/riscv/multiarch/strcmp.c              |  11 +-
> >  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 +++++++++
> >  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213
> > ++++++++++++++++++
> >  5 files changed, 332 insertions(+), 4 deletions(-)
> >  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
> >  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> >
> > diff --git a/sysdeps/riscv/multiarch/Makefile
> > b/sysdeps/riscv/multiarch/Makefile
> > index 3017bde75a..73a62be85d 100644
> > --- a/sysdeps/riscv/multiarch/Makefile
> > +++ b/sysdeps/riscv/multiarch/Makefile
> > @@ -11,5 +11,7 @@ sysdep_routines += \
> >         strlen_generic \
> >         strlen_zbb \
> >         \
> > -       strcmp_generic
> > +       strcmp_generic \
> > +       strcmp_zbb \
> > +       strcmp_zbb_unaligned
> >  endif
> > diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > index 64331a4c7f..d354aa1178 100644
> > --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > @@ -59,7 +59,9 @@ __libc_ifunc_impl_list (const char *name, struct
> > libc_ifunc_impl *array,
> >               IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_generic))
> >
> >    IFUNC_IMPL (i, name, strcmp,
> > -             IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcmp_generic))
> > +             IFUNC_IMPL_ADD (array, i, strcmp, 1,
> > __strcmp_zbb_unaligned)
> > +             IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_zbb)
> > +             IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
> >
> >    return i;
> >  }
> > diff --git a/sysdeps/riscv/multiarch/strcmp.c
> > b/sysdeps/riscv/multiarch/strcmp.c
> > index 8c21a90afd..d3f2fe19ae 100644
> > --- a/sysdeps/riscv/multiarch/strcmp.c
> > +++ b/sysdeps/riscv/multiarch/strcmp.c
> > @@ -30,8 +30,15 @@
> >
> >  extern __typeof (__redirect_strcmp) __libc_strcmp;
> >  extern __typeof (__redirect_strcmp) __strcmp_generic
> > attribute_hidden;
> > -
> > -libc_ifunc (__libc_strcmp, __strcmp_generic);
> > +extern __typeof (__redirect_strcmp) __strcmp_zbb attribute_hidden;
> > +extern __typeof (__redirect_strcmp) __strcmp_zbb_unaligned
> > attribute_hidden;
> > +
> > +libc_ifunc (__libc_strcmp,
> > +           HAVE_RV(zbb) && HAVE_FAST_UNALIGNED()
> > +           ? __strcmp_zbb_unaligned
> > +           : HAVE_RV(zbb)
> > +             ? __strcmp_zbb
> > +             : __strcmp_generic);
> >
> >  # undef strcmp
> >  strong_alias (__libc_strcmp, strcmp);
> > diff --git a/sysdeps/riscv/multiarch/strcmp_zbb.S
> > b/sysdeps/riscv/multiarch/strcmp_zbb.S
> > new file mode 100644
> > index 0000000000..1c265d6107
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/strcmp_zbb.S
> > @@ -0,0 +1,104 @@
> > +/* Copyright (C) 2022 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be
> > useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +/* Assumptions: rvi_zbb.  */
> > +/* Implementation from the Bitmanip specification.  */
> > +
> > +#define src1           a0
> > +#define result         a0
> > +#define src2           a1
> > +#define data1          a2
> > +#define data2          a3
> > +#define align          a4
> > +#define data1_orcb     t0
> > +#define m1             t2
> > +
> > +#if __riscv_xlen == 64
> > +# define REG_L ld
> > +# define SZREG 8
> > +#else
> > +# define REG_L lw
> > +# define SZREG 4
> > +#endif
> > +
> > +#ifndef STRCMP
> > +# define STRCMP __strcmp_zbb
> > +#endif
> > +
> > +.option push
> > +.option arch,+zbb
> > +
> > +ENTRY_ALIGN (STRCMP, 6)
> > +       or      align, src1, src2
> > +       and     align, align, SZREG-1
> > +       bnez    align, L(simpleloop)
> > +       li      m1, -1
> > +
> > +       /* Main loop for aligned strings.  */
> > +       .p2align 2
> > +L(loop):
> > +       REG_L   data1, 0(src1)
> > +       REG_L   data2, 0(src2)
> > +       orc.b   data1_orcb, data1
> > +       bne     data1_orcb, m1, L(foundnull)
> > +       addi    src1, src1, SZREG
> > +       addi    src2, src2, SZREG
> > +       beq     data1, data2, L(loop)
> > +
> > +       /* Words don't match, and no null byte in the first word.
> > +        * Get bytes in big-endian order and compare.  */
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +       rev8    data1, data1
> > +       rev8    data2, data2
> > +#endif
> > +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless
> > sequence.  */
> > +       sltu    result, data1, data2
> > +       neg     result, result
> > +       ori     result, result, 1
> > +       ret
> > +
> > +L(foundnull):
> > +       /* Found a null byte.
> > +        * If words don't match, fall back to simple loop.  */
> > +       bne     data1, data2, L(simpleloop)
> > +
> > +       /* Otherwise, strings are equal.  */
> > +       li      result, 0
> > +       ret
> > +
> > +       /* Simple loop for misaligned strings.  */
> > +       .p2align 3
> > +L(simpleloop):
> > +       lbu     data1, 0(src1)
> > +       lbu     data2, 0(src2)
> > +       addi    src1, src1, 1
> > +       addi    src2, src2, 1
> > +       bne     data1, data2, L(sub)
> > +       bnez    data1, L(simpleloop)
> > +
> > +L(sub):
> > +       sub     result, data1, data2
> > +       ret
> > +
> > +.option pop
> > +
> > +END (STRCMP)
> > +libc_hidden_builtin_def (STRCMP)
> > diff --git a/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> > b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> > new file mode 100644
> > index 0000000000..ec21982b65
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> > @@ -0,0 +1,213 @@
> > +/* Copyright (C) 2022 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be
> > useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +/* Assumptions: rvi_zbb with fast unaligned access.  */
> > +/* Implementation inspired by aarch64/strcmp.S.  */
> > +
> > +#define src1           a0
> > +#define result         a0
> > +#define src2           a1
> > +#define off            a3
> > +#define m1             a4
> > +#define align1         a5
> > +#define src3           a6
> > +#define tmp            a7
> > +
> > +#define data1          t0
> > +#define data2          t1
> > +#define b1             t0
> > +#define b2             t1
> > +#define data3          t2
> > +#define data1_orcb     t3
> > +#define data3_orcb     t4
> > +#define shift          t5
> > +
> > +#if __riscv_xlen == 64
> > +# define REG_L ld
> > +# define SZREG 8
> > +# define PTRLOG        3
> > +#else
> > +# define REG_L lw
> > +# define SZREG 4
> > +# define PTRLOG        2
> > +#endif
> > +
> > +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
> > +# error big endian is untested!
> > +# define CZ    ctz
> > +# define SHIFT srl
> > +# define SHIFT2        sll
> > +#else
> > +# define CZ    ctz
> > +# define SHIFT sll
> > +# define SHIFT2        srl
> > +#endif
> > +
> > +#ifndef STRCMP
> > +# define STRCMP __strcmp_zbb_unaligned
> > +#endif
> > +
> > +.option push
> > +.option arch,+zbb
> > +
> > +ENTRY_ALIGN (STRCMP, 6)
> > +       /* off...delta from src1 to src2.  */
> > +       sub     off, src2, src1
> > +       li      m1, -1
> > +       andi    tmp, off, SZREG-1
> > +       andi    align1, src1, SZREG-1
> > +       bnez    tmp, L(misaligned8)
> > +       bnez    align1, L(mutual_align)
> > +
> > +       .p2align 4
> > +L(loop_aligned):
> > +       REG_L   data1, 0(src1)
> > +       add     tmp, src1, off
> > +       addi    src1, src1, SZREG
> > +       REG_L   data2, 0(tmp)
> > +
> > +L(start_realigned):
> > +       orc.b   data1_orcb, data1
> > +       bne     data1_orcb, m1, L(end)
> > +       beq     data1, data2, L(loop_aligned)
> > +
> > +L(fast_end):
> > +       /* Words don't match, and no NUL byte in one word.
> > +          Get bytes in big-endian order and compare as words.  */
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +       rev8    data1, data1
> > +       rev8    data2, data2
> > +#endif
> > +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless
> > sequence.  */
> > +       sltu    result, data1, data2
> > +       neg     result, result
> > +       ori     result, result, 1
> > +       ret
> > +
> > +L(end_orc):
> > +       orc.b   data1_orcb, data1
> > +L(end):
> > +       /* Words don't match or NUL byte in at least one word.
> > +          data1_orcb holds orc.b value of data1.  */
> > +       xor     tmp, data1, data2
> > +       orc.b   tmp, tmp
> > +
> > +       orn     tmp, tmp, data1_orcb
> > +       CZ      shift, tmp
> > +
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +       rev8    data1, data1
> > +       rev8    data2, data2
> > +#endif
> > +       sll     data1, data1, shift
> > +       sll     data2, data2, shift
> > +       srl     b1, data1, SZREG*8-8
> > +       srl     b2, data2, SZREG*8-8
> > +
> > +L(end_singlebyte):
> > +       sub     result, b1, b2
> > +       ret
> > +
> > +       .p2align 4
> > +L(mutual_align):
> > +       /* Sources are mutually aligned, but are not currently at an
> > +          alignment boundary.  Round down the addresses and then mask
> > off
> > +          the bytes that precede the start point.  */
> > +       andi    src1, src1, -SZREG
> > +       add     tmp, src1, off
> > +       REG_L   data1, 0(src1)
> > +       addi    src1, src1, SZREG
> > +       REG_L   data2, 0(tmp)
> > +       /* Get number of bits to mask.  */
> > +       sll     shift, src2, 3
> > +       /* Bits to mask are now 0, others are 1.  */
> > +       SHIFT   tmp, m1, shift
> > +       /* Or with inverted value -> masked bits become 1.  */
> > +       orn     data1, data1, tmp
> > +       orn     data2, data2, tmp
> > +       j       L(start_realigned)
> > +
> > +L(misaligned8):
> > +       /* Skip slow loop if SRC1 is aligned.  */
> > +       beqz    align1, L(src1_aligned)
> > +L(do_misaligned):
> > +       /* Align SRC1 to 8 bytes.  */
> > +       lbu     b1, 0(src1)
> > +       lbu     b2, 0(src2)
> > +       beqz    b1, L(end_singlebyte)
> > +       bne     b1, b2, L(end_singlebyte)
> > +       addi    src1, src1, 1
> > +       addi    src2, src2, 1
> > +       andi    align1, src1, SZREG-1
> > +       bnez    align1, L(do_misaligned)
> > +
> > +L(src1_aligned):
> > +       /* SRC1 is aligned. Align SRC2 down and check for NUL there.
> > +        * If there is no NUL, we may read the next word from SRC2.
> > +        * If there is a NUL, we must not read a complete word from
> > SRC2
> > +        * because we might cross a page boundary.  */
> > +       /* Get number of bits to mask (upper bits are ignored by
> > shifts).  */
> > +       sll     shift, src2, 3
> > +       /* src3 := align_down (src2)  */
> > +       andi    src3, src2, -SZREG
> > +       REG_L   data3, 0(src3)
> > +       addi    src3, src3, SZREG
> > +
> > +       /* Bits to mask are now 0, others are 1.  */
> > +       SHIFT   tmp, m1, shift
> > +       /* Or with inverted value -> masked bits become 1.  */
> > +       orn     data3_orcb, data3, tmp
> > +       /* Check for NUL in next aligned word.  */
> > +       orc.b   data3_orcb, data3_orcb
> > +       bne     data3_orcb, m1, L(unaligned_nul)
> > +
> > +       .p2align 4
> > +L(loop_unaligned):
> > +       /* Read the (aligned) data1 and the unaligned data2.  */
> > +       REG_L   data1, 0(src1)
> > +       addi    src1, src1, SZREG
> > +       REG_L   data2, 0(src2)
> > +       addi    src2, src2, SZREG
> > +       orc.b   data1_orcb, data1
> > +       bne     data1_orcb, m1, L(end)
> > +       bne     data1, data2, L(end)
> > +
> > +       /* Read the next aligned-down word.  */
> > +       REG_L   data3, 0(src3)
> > +       addi    src3, src3, SZREG
> > +       orc.b   data3_orcb, data3
> > +       beq     data3_orcb, m1, L(loop_unaligned)
> > +
> > +L(unaligned_nul):
> > +       /* src1 points to unread word (only first bytes relevant).
> > +        * data3 holds next aligned-down word with NUL.
> > +        * Compare the first bytes of data1 with the last bytes of
> > data3.  */
> > +       REG_L   data1, 0(src1)
> > +       /* Shift NUL bytes into data3 to become data2.  */
> > +       SHIFT2  data2, data3, shift
> > +       bne     data1, data2, L(end_orc)
> > +       li      result, 0
> > +       ret
> > +
> > +.option pop
> > +
> > +END (STRCMP)
> > +libc_hidden_builtin_def (STRCMP)
>
> --
> Xi Ruoyao <xry111@xry111.site>
> School of Aerospace Science and Technology, Xidian University
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
                   ` (19 preceding siblings ...)
  2023-02-07  2:59 ` [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Kito Cheng
@ 2023-02-07 16:40 ` Adhemerval Zanella Netto
  2023-02-07 17:16   ` DJ Delorie
  20 siblings, 1 reply; 42+ messages in thread
From: Adhemerval Zanella Netto @ 2023-02-07 16:40 UTC (permalink / raw)
  To: Christoph Muellner, libc-alpha, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law,
	Philipp Tomsich, Heiko Stuebner

On 06/02/23 21:15, Christoph Muellner wrote:
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> 
> This RFC series introduces ifunc support for RISC-V and adds
> optimized routines of memset(), memcpy()/memmove(), strlen(),
> strcmp(), strncmp(), and cpu_relax().
> 
> The ifunc mechanism desides based on the following hart features:
> - Available extensions
> - Cache block size
> - Fast unaligned accesses
> 
> Since we don't have an interface to get this information from the
> kernel (at the moment), this patch uses environment variables instead,
> which is also why this patch should not be considered for upstream
> inclusion and is explicitly tagged as RFC.
> 
> The environment variables are:
> - RISCV_RT_MARCH (e.g. "rv64gc_zicboz")
> - RISCV_RT_CBOZ_BLOCKSIZE (e.g. "64")
> - RISCV_RT_CBOM_BLOCKSIZE (e.g. "64")
> - RISCV_RT_FAST_UNALIGNED (e.g. "1")
> 
> The environment variables are looked up and parsed early during
> startup, where other architectures query similar properties from
> the kernel or the CPU.
> The ifunc implementation can use test macros to select a matching
> implementation (e.g. HAVE_RV(zbb) or HAVE_FAST_UNALIGNED()).

So now we have 3 different proposal mechanism to provide implementation runtime 
selection on riscv:

  1. The sysdep mechanism to select optimized routines based on compiler/ABI
     done at build time.  It is the current mechanism and it is also used
     on rvv routines [1].

  2. A ifunc one using a new riscv syscall to query the kernel the required 
     information.

  3. Another ifunc one using riscv specific environment variable.

Although all of them are interchangeable in a sense they can be used independently,
RISCV is following MIPS on having uncountable minor ABI variants due this exactly
available permutations.  This incurs in extra maintanance, extra documentation, 
extra testing, etc.

So I would like you RISCV arch-maintainers to first figure out what scheme you
want focus on, instead of trying to push multiple fronts with different ad-hoc 
schemes.

The first scheme, which is the oldest one used by architectures like arm, powerpc,
mips, etc. is the sysdep where you select the variant at build time.  It has the
advantage of no need to extra runtime cost or probing, and a slight code size 
reduction.  However it ties the ABI used to build glibc, which means you need 
multiple libc build if you targeting different chips/ABIs.

I recall that Red Hat and SuSE used to provided specialized glibc build for POWER
machines to try leverage new chips optimization (libm showed some gain, specially
back when ISA 2.05 added rounding instruction, and isa 2.07 GRP to FP special
register).  But I also recall that it was deprecated over using ifunc to optimize
only the required functions that does show performance improvement, since each
glibc build variantion required all the steps to validation.

And that's why aarch64 and x86_64 initially followed the patch to avoid using 
sysdeps folder and have a minimum default implementation that works on the minimum 
support ISA and provide any optimized variant through iFUNC.  And that is what I 
suggest you to do for *rvv*.

You can follow x86_64/s390 and add an extra optimization to only build certain 
variant if the ISA is high enough (for instance, if you targeting rbb, use it as 
default).  It requires a *lot* of boilerplate code, as you can see for the
x86_64-vX code recently; but it should integrate better with current ldconfig
and newer RPM support (and I expect that other packages managers to follow as
well).

And it lead us on *how* to select the ABI variants. I am sorry, but I will *block 
the new environment variables* as is. You might rework it through glibc hardware 
tunables [2], nevertheless I *strong* suggest you to *first* figure out the kernel
interface first prior starting working on providing the optimized routine in glibc.
The glibc tunable might then work a way to tune/test/filter the already in place
mechanism, ideally it should not rely on user intervention as default.

It was not clear from the 'hardware probing user interface' thread [3] why current 
Linux auxv advertise mechanism are not suffice enough for this specific interface 
(maybe you want something more generic like a cpuid-like interface).  It works for
aarch64 and powerpc, so I am not sure why RISCV can't start using it.

[1] https://sourceware.org/pipermail/libc-alpha/2023-February/145102.html
[2] https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
[3] https://yhbt.net/lore/all/20221013163551.6775-1-palmer@rivosinc.com/

> 
> The following optimized routines exist:
> - memset

It seems that main gain here is unaligned access, loop unrolling, and cache clear 
instruction.  Unfortuantely current implementation does not provide support for 
any of this, however I wonder if we could parametrize the generic implementation 
to allow at least some support for fast unaligned memory (we can factor cache clear 
as well).

I am working on refactor memcpy, memmove, memset, and memcmp to get rid of old
code and allow to work toward it.

> - memcpy/memmove

The generic implementation already does some loop unrolling, so I wonder if we can
improve the generic implementation by adding a swtich to assume unaligned access
(so there is no need to use the two load/merge strategy).

One advantage that is not easily reproducable on C is to branch to memcpy on memmove
if the copy shoud be fone fowards.  This is not easily done on generic implementation 
because we can't simply call memcpy in such case (since source and destiny can overlap
and it might call a memcpy routine that does not support it).

My approach on my generic refactor is just to remove the wordcopy and make memcpy
and memmove using the same strategy, but with different code.

> - strlen

The optimized routine seems quite similar to the generic one I installed recently [4],
which should use both cbz and orc.b with RISCV hooks [5]

[4] https://sourceware.org/git/?p=glibc.git;a=commit;h=350d8d13661a863e6b189f02d876fa265fe71302
[5] https://sourceware.org/git/?p=glibc.git;a=commit;h=25788431c0f5264c4830415de0cdd4d9926cbad9

> - strcmp
> - strncmp

The current generic implementations [6][7] now have a small advantage where unaligned
inputs are also improved by first aligning one input and operating with a double
load and merge comparision.

[6] https://sourceware.org/git/?p=glibc.git;a=commit;h=30cf54bf3072be942847400c1669bcd63aab039e
[7] https://sourceware.org/git/?p=glibc.git;a=commit;h=367c31b5d61164db97834917f5487094ebef2f58

> - cpu_relax
> 
> The following optimizations have been applied:
> - excessive loop unrolling
> - Zbb's orc.b instruction
> - Zbb's ctz intruction
> - Zicboz/Zic64b ability to clear a cache block in memory
> - Fast unaligned accesses (but with keeping exception guarantees intact)
> - Fast overlapping accesses
> 
> The patch was developed more than a year ago and was tested as part
> of a vendor SDK since then. One of the areas where this patchset
> was used is benchmarking (e.g. SPEC CPU2017).
> The optimized string functions have been tested with the glibc tests
> for that purpose.
> 
> The first patch of the series does not strictly belong to this series,
> but was required to build and test SPEC CPU2017 benchmarks.
> 
> To build a cross-toolchain that includes these patches,
> the riscv-gnu-toolchain or any other cross-toolchain
> builder can be used.
> 
> Christoph Müllner (19):
>   Inhibit early libcalls before ifunc support is ready
>   riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol
>   riscv: Add ENTRY_ALIGN() macro
>   riscv: Add hart feature run-time detection framework
>   riscv: Introduction of ISA extensions
>   riscv: Adding ISA string parser for environment variables
>   riscv: hart-features: Add fast_unaligned property
>   riscv: Add (empty) ifunc framework
>   riscv: Add ifunc support for memset
>   riscv: Add accelerated memset routines for RV64
>   riscv: Add ifunc support for memcpy/memmove
>   riscv: Add accelerated memcpy/memmove routines for RV64
>   riscv: Add ifunc support for strlen
>   riscv: Add accelerated strlen routine
>   riscv: Add ifunc support for strcmp
>   riscv: Add accelerated strcmp routines
>   riscv: Add ifunc support for strncmp
>   riscv: Add an optimized strncmp routine
>   riscv: Add __riscv_cpu_relax() to allow yielding in busy loops
> 
>  csu/libc-start.c                              |   1 +
>  elf/dl-support.c                              |   1 +
>  sysdeps/riscv/dl-machine.h                    |  13 +
>  sysdeps/riscv/ldsodefs.h                      |   1 +
>  sysdeps/riscv/multiarch/Makefile              |  24 +
>  sysdeps/riscv/multiarch/cpu_relax.c           |  36 ++
>  sysdeps/riscv/multiarch/cpu_relax_impl.S      |  40 ++
>  sysdeps/riscv/multiarch/ifunc-impl-list.c     |  70 +++
>  sysdeps/riscv/multiarch/init-arch.h           |  24 +
>  sysdeps/riscv/multiarch/memcpy.c              |  49 ++
>  sysdeps/riscv/multiarch/memcpy_generic.c      |  32 ++
>  .../riscv/multiarch/memcpy_rv64_unaligned.S   | 475 ++++++++++++++++++
>  sysdeps/riscv/multiarch/memmove.c             |  49 ++
>  sysdeps/riscv/multiarch/memmove_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/memset.c              |  52 ++
>  sysdeps/riscv/multiarch/memset_generic.c      |  32 ++
>  .../riscv/multiarch/memset_rv64_unaligned.S   |  31 ++
>  .../multiarch/memset_rv64_unaligned_cboz64.S  | 217 ++++++++
>  sysdeps/riscv/multiarch/strcmp.c              |  47 ++
>  sysdeps/riscv/multiarch/strcmp_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strcmp_zbb.S          | 104 ++++
>  .../riscv/multiarch/strcmp_zbb_unaligned.S    | 213 ++++++++
>  sysdeps/riscv/multiarch/strlen.c              |  44 ++
>  sysdeps/riscv/multiarch/strlen_generic.c      |  32 ++
>  sysdeps/riscv/multiarch/strlen_zbb.S          | 105 ++++
>  sysdeps/riscv/multiarch/strncmp.c             |  44 ++
>  sysdeps/riscv/multiarch/strncmp_generic.c     |  32 ++
>  sysdeps/riscv/multiarch/strncmp_zbb.S         | 119 +++++
>  sysdeps/riscv/sys/asm.h                       |  14 +-
>  .../unix/sysv/linux/riscv/atomic-machine.h    |   3 +
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.c   |  62 +++
>  sysdeps/unix/sysv/linux/riscv/dl-procinfo.h   |  46 ++
>  sysdeps/unix/sysv/linux/riscv/hart-features.c | 356 +++++++++++++
>  sysdeps/unix/sysv/linux/riscv/hart-features.h |  58 +++
>  .../unix/sysv/linux/riscv/isa-extensions.def  |  72 +++
>  sysdeps/unix/sysv/linux/riscv/libc-start.c    |  29 ++
>  .../unix/sysv/linux/riscv/macro-for-each.h    |  24 +
>  37 files changed, 2610 insertions(+), 5 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/Makefile
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax.c
>  create mode 100644 sysdeps/riscv/multiarch/cpu_relax_impl.S
>  create mode 100644 sysdeps/riscv/multiarch/ifunc-impl-list.c
>  create mode 100644 sysdeps/riscv/multiarch/init-arch.h
>  create mode 100644 sysdeps/riscv/multiarch/memcpy.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memcpy_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memmove.c
>  create mode 100644 sysdeps/riscv/multiarch/memmove_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/memset_rv64_unaligned_cboz64.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>  create mode 100644 sysdeps/riscv/multiarch/strlen.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strlen_zbb.S
>  create mode 100644 sysdeps/riscv/multiarch/strncmp.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_generic.c
>  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/dl-procinfo.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/hart-features.h
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/isa-extensions.def
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/libc-start.c
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/macro-for-each.h
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07 16:40 ` Adhemerval Zanella Netto
@ 2023-02-07 17:16   ` DJ Delorie
  2023-02-07 19:32     ` Philipp Tomsich
  0 siblings, 1 reply; 42+ messages in thread
From: DJ Delorie @ 2023-02-07 17:16 UTC (permalink / raw)
  To: Adhemerval Zanella Netto
  Cc: christoph.muellner, libc-alpha, palmer, darius, andrew, vineetg,
	kito.cheng, jeffreyalaw, philipp.tomsich, heiko.stuebner

Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> So now we have 3 different proposal mechanism to provide implementation runtime 
> selection on riscv:
>
>   1. The sysdep mechanism to select optimized routines based on compiler/ABI
>      done at build time.  It is the current mechanism and it is also used
>      on rvv routines [1].
>
>   2. A ifunc one using a new riscv syscall to query the kernel the required 
>      information.
>
>   3. Another ifunc one using riscv specific environment variable.

I'm also going to oppose #3 on principles.  We've been removing the use
of environment variables for tuning, in favor of tunables.

If we have a way to auto-detect the best implementation without relying
on the user, that's my preference.  Users are unreliable and require
documentation.  The compiler likely doesn't have access to the
hardware[*], so must rely on the user.  Thus, my preference is #2 - the
kernel has access to the hardware and its device tree, and can tell the
userspace what capabilities are available.

I would not be opposed to a tunable that overrides the autodetection; we
have something similar for x86.  But the default (and should be) is
"works basically correctly without user intervention".

[*] you can run gcc on the "right" hardware, but typically we
    build-once-run-everywhere.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07 17:16   ` DJ Delorie
@ 2023-02-07 19:32     ` Philipp Tomsich
  2023-02-07 21:14       ` DJ Delorie
  0 siblings, 1 reply; 42+ messages in thread
From: Philipp Tomsich @ 2023-02-07 19:32 UTC (permalink / raw)
  To: DJ Delorie
  Cc: Adhemerval Zanella Netto, christoph.muellner, libc-alpha, palmer,
	darius, andrew, vineetg, kito.cheng, jeffreyalaw, heiko.stuebner

On Tue, 7 Feb 2023 at 18:16, DJ Delorie <dj@redhat.com> wrote:
>
> Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> writes:
> > So now we have 3 different proposal mechanism to provide implementation runtime
> > selection on riscv:
> >
> >   1. The sysdep mechanism to select optimized routines based on compiler/ABI
> >      done at build time.  It is the current mechanism and it is also used
> >      on rvv routines [1].
> >
> >   2. A ifunc one using a new riscv syscall to query the kernel the required
> >      information.
> >
> >   3. Another ifunc one using riscv specific environment variable.
>
> I'm also going to oppose #3 on principles.  We've been removing the use
> of environment variables for tuning, in favor of tunables.

You may have missed the essential part of the commit message:
> > Since we don't have an interface to get this information from the
> > kernel (at the moment), this patch uses environment variables instead,
> > which is also why this patch should not be considered for upstream
> > inclusion and is explicitly tagged as RFC.

So this patch has always been a stand-in until option #2 is ready.
I am strongly opinionated towards a mechanism that uses existing
mechanisms in the ELF auxiliary vector to pass information — and tries
to avoid the introduction of a new arch-specific syscall. if possible.

> If we have a way to auto-detect the best implementation without relying
> on the user, that's my preference.  Users are unreliable and require
> documentation.  The compiler likely doesn't have access to the
> hardware[*], so must rely on the user.  Thus, my preference is #2 - the
> kernel has access to the hardware and its device tree, and can tell the
> userspace what capabilities are available.
>
> I would not be opposed to a tunable that overrides the autodetection; we
> have something similar for x86.  But the default (and should be) is
> "works basically correctly without user intervention".
>
> [*] you can run gcc on the "right" hardware, but typically we
>     build-once-run-everywhere.
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07 19:32     ` Philipp Tomsich
@ 2023-02-07 21:14       ` DJ Delorie
  2023-02-08 11:26         ` Christoph Müllner
  0 siblings, 1 reply; 42+ messages in thread
From: DJ Delorie @ 2023-02-07 21:14 UTC (permalink / raw)
  To: Philipp Tomsich
  Cc: adhemerval.zanella, christoph.muellner, libc-alpha, palmer,
	darius, andrew, vineetg, kito.cheng, jeffreyalaw, heiko.stuebner


Philipp Tomsich <philipp.tomsich@vrull.eu> writes:
> So this patch has always been a stand-in until option #2 is ready.
> I am strongly opinionated towards a mechanism that uses existing
> mechanisms in the ELF auxiliary vector to pass information — and tries
> to avoid the introduction of a new arch-specific syscall. if possible.

If the patch were converted to use tunables, it could be more than a
standin.  It's the environment variable itself I'm opposed to.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines
  2023-02-07 21:14       ` DJ Delorie
@ 2023-02-08 11:26         ` Christoph Müllner
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Müllner @ 2023-02-08 11:26 UTC (permalink / raw)
  To: DJ Delorie
  Cc: Philipp Tomsich, adhemerval.zanella, libc-alpha, palmer, darius,
	andrew, vineetg, kito.cheng, jeffreyalaw, heiko.stuebner

[-- Attachment #1: Type: text/plain, Size: 2172 bytes --]

On Tue, Feb 7, 2023 at 10:14 PM DJ Delorie <dj@redhat.com> wrote:

>
> Philipp Tomsich <philipp.tomsich@vrull.eu> writes:
> > So this patch has always been a stand-in until option #2 is ready.
> > I am strongly opinionated towards a mechanism that uses existing
> > mechanisms in the ELF auxiliary vector to pass information — and tries
> > to avoid the introduction of a new arch-specific syscall. if possible.
>
> If the patch were converted to use tunables, it could be more than a
> standin.  It's the environment variable itself I'm opposed to.
>

Thanks DJ and  Adhemerval for your valuable inputs!

As said in the cover letter, the environment variable approach was not
meant to be merged but represents a starting point for discussions.
It is not what we want but serves as a dirty placeholder, that allows
development of optimized routines.

The IFUNC support and the kernel-userspace API was discussed
multiple times in the past (e.g. on LPC2021 and LPC2022).
There are different opinions on the approaches so the whole process
is regularly getting stuck.

The topic, where most (if not all) in the RISC-V community agree on,
is that we don't want a compile-time-only approach.
Patches that rely on a compile-time-only approach are most likely
written such, because of the absence of ifunc support.

Meanwhile, we have heard from multiple vendors to work on their own
solutions downstream, which results in a duplication of work and not
necessarily in a common solution upstream.

This patchset was sent out to move the discussion from the idea level
to actual code, which can be reviewed, criticized, tested, improved, and
reused (get a common ground for RISC-V vendors).

Both of your comments show how we can move this patchset forward:
- eliminate the first patch
  See also: https://sourceware.org/bugzilla/show_bug.cgi?id=30095
- work on the kernel-userspace interface to query capabilities
- use tunables instead of env variable
- look how we can reuse more of the generic implementation
  to minimize ASM code with little or no benefits

I fully agree with all the mentioned points.

Thanks,
Christoph

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-07  1:19   ` Noah Goldstein
@ 2023-02-08 15:13     ` Philipp Tomsich
  2023-02-08 17:55       ` Palmer Dabbelt
  2023-02-08 18:04       ` Noah Goldstein
  0 siblings, 2 replies; 42+ messages in thread
From: Philipp Tomsich @ 2023-02-08 15:13 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Christoph Muellner, libc-alpha, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law,
	Heiko Stuebner

On Tue, 7 Feb 2023 at 02:20, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner
> <christoph.muellner@vrull.eu> wrote:
> >
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > The implementation of strncmp() can be accelerated using Zbb's orc.b
> > instruction. Let's add an optimized implementation that makes use
> > of this instruction.
> >
> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
>
> Not necessary, but imo performance patches should have at least some reference
> to the expected speedup versus the existing alternatives.

Given that this is effectively a SWAR-like optimization (orc.b allows
us to test 8 bytes in parallel for a NUL byte), we should be able to
show the benefit through a reduction in dynamic instructions.  Would
this be considered reasonable reference data?

> > ---
> >  sysdeps/riscv/multiarch/Makefile          |   3 +-
> >  sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
> >  sysdeps/riscv/multiarch/strncmp.c         |   6 +-
> >  sysdeps/riscv/multiarch/strncmp_zbb.S     | 119 ++++++++++++++++++++++
> >  4 files changed, 127 insertions(+), 2 deletions(-)
> >  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
> >
> > diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
> > index 056ce2ffc0..9f22e31b99 100644
> > --- a/sysdeps/riscv/multiarch/Makefile
> > +++ b/sysdeps/riscv/multiarch/Makefile
> > @@ -14,5 +14,6 @@ sysdep_routines += \
> >         strcmp_generic \
> >         strcmp_zbb \
> >         strcmp_zbb_unaligned \
> > -       strncmp_generic
> > +       strncmp_generic \
> > +       strncmp_zbb
> >  endif
> > diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > index eb37ed6017..82fd34d010 100644
> > --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > @@ -64,6 +64,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >               IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
> >
> >    IFUNC_IMPL (i, name, strncmp,
> > +             IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_zbb)
> >               IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
> >    return i;
> >  }
> > diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
> > index 970aeb8b85..5b0fe08e98 100644
> > --- a/sysdeps/riscv/multiarch/strncmp.c
> > +++ b/sysdeps/riscv/multiarch/strncmp.c
> > @@ -30,8 +30,12 @@
> >
> >  extern __typeof (__redirect_strncmp) __libc_strncmp;
> >  extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
> > +extern __typeof (__redirect_strncmp) __strncmp_zbb attribute_hidden;
> >
> > -libc_ifunc (__libc_strncmp, __strncmp_generic);
> > +libc_ifunc (__libc_strncmp,
> > +           HAVE_RV(zbb)
> > +           ? __strncmp_zbb
> > +           : __strncmp_generic);
> >
> >  # undef strncmp
> >  strong_alias (__libc_strncmp, strncmp);
> > diff --git a/sysdeps/riscv/multiarch/strncmp_zbb.S b/sysdeps/riscv/multiarch/strncmp_zbb.S
> > new file mode 100644
> > index 0000000000..29cff30def
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/strncmp_zbb.S
> > @@ -0,0 +1,119 @@
> > +/* Copyright (C) 2022 Free Software Foundation, Inc.
> > +
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +/* Assumptions: rvi_zbb.  */
> > +
> > +#define src1           a0
> > +#define result         a0
> > +#define src2           a1
> > +#define len            a2
> > +#define data1          a2
> > +#define data2          a3
> > +#define align          a4
> > +#define data1_orcb     t0
> > +#define limit          t1
> > +#define fast_limit      t2
> > +#define m1             t3
> > +
> > +#if __riscv_xlen == 64
> > +# define REG_L ld
> > +# define SZREG 8
> > +# define PTRLOG        3
> > +#else
> > +# define REG_L lw
> > +# define SZREG 4
> > +# define PTRLOG        2
> > +#endif
> > +
> > +#ifndef STRNCMP
> > +# define STRNCMP __strncmp_zbb
> > +#endif
> > +
> > +.option push
> > +.option arch,+zbb
> > +
> > +ENTRY_ALIGN (STRNCMP, 6)
> > +       beqz    len, L(equal)
> > +       or      align, src1, src2
> > +       and     align, align, SZREG-1
> > +       add     limit, src1, len
> > +       bnez    align, L(simpleloop)
> > +       li      m1, -1
> > +
> > +       /* Adjust limit for fast-path.  */
> > +       andi fast_limit, limit, -SZREG
> > +
> > +       /* Main loop for aligned string.  */
> > +       .p2align 3
> > +L(loop):
> > +       bge     src1, fast_limit, L(simpleloop)
> > +       REG_L   data1, 0(src1)
> > +       REG_L   data2, 0(src2)
> > +       orc.b   data1_orcb, data1
> > +       bne     data1_orcb, m1, L(foundnull)
> > +       addi    src1, src1, SZREG
> > +       addi    src2, src2, SZREG
> > +       beq     data1, data2, L(loop)
> > +
> > +       /* Words don't match, and no null byte in the first
> > +        * word. Get bytes in big-endian order and compare.  */
> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > +       rev8    data1, data1
> > +       rev8    data2, data2
> > +#endif
> > +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
> > +       sltu    result, data1, data2
> > +       neg     result, result
> > +       ori     result, result, 1
> > +       ret
> > +
> > +L(foundnull):
> > +       /* Found a null byte.
> > +        * If words don't match, fall back to simple loop.  */
> > +       bne     data1, data2, L(simpleloop)
> > +
> > +       /* Otherwise, strings are equal.  */
> > +       li      result, 0
> > +       ret
> > +
> > +       /* Simple loop for misaligned strings.  */
> > +       .p2align 3
> > +L(simpleloop):
> > +       bge     src1, limit, L(equal)
> > +       lbu     data1, 0(src1)
> > +       addi    src1, src1, 1
> > +       lbu     data2, 0(src2)
> > +       addi    src2, src2, 1
> > +       bne     data1, data2, L(sub)
> > +       bnez    data1, L(simpleloop)
> > +
> > +L(sub):
> > +       sub     result, data1, data2
> > +       ret
> > +
> > +L(equal):
> > +       li      result, 0
> > +       ret
> > +
> > +.option pop
> > +
> > +END (STRNCMP)
> > +libc_hidden_builtin_def (STRNCMP)
> > --
> > 2.39.1
> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-08 15:13     ` Philipp Tomsich
@ 2023-02-08 17:55       ` Palmer Dabbelt
  2023-02-08 19:48         ` Adhemerval Zanella Netto
  2023-02-08 18:04       ` Noah Goldstein
  1 sibling, 1 reply; 42+ messages in thread
From: Palmer Dabbelt @ 2023-02-08 17:55 UTC (permalink / raw)
  To: philipp.tomsich
  Cc: goldstein.w.n, christoph.muellner, libc-alpha, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, kito.cheng,
	jeffreyalaw, heiko.stuebner

On Wed, 08 Feb 2023 07:13:44 PST (-0800), philipp.tomsich@vrull.eu wrote:
> On Tue, 7 Feb 2023 at 02:20, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner
>> <christoph.muellner@vrull.eu> wrote:
>> >
>> > From: Christoph Müllner <christoph.muellner@vrull.eu>
>> >
>> > The implementation of strncmp() can be accelerated using Zbb's orc.b
>> > instruction. Let's add an optimized implementation that makes use
>> > of this instruction.
>> >
>> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
>>
>> Not necessary, but imo performance patches should have at least some reference
>> to the expected speedup versus the existing alternatives.
>
> Given that this is effectively a SWAR-like optimization (orc.b allows
> us to test 8 bytes in parallel for a NUL byte), we should be able to
> show the benefit through a reduction in dynamic instructions.  Would
> this be considered reasonable reference data?

Generally for performance improvements the only metrics that count come 
from real hardware.  Processor implementation is complex and it's not 
generally true that reducing dynamic instructions results in better 
performance (particularly when more complex flavors instructions replace 
simpler ones).

We've not been so good about this on the RISC-V side of things, though.  
I think that's largely because we didn't have all that much complexity 
around this, but there's a ton of stuff showing up right now.  The 
general theory has been that Zbb instructions will execute faster than 
their corresponding I sequences, but nobody has proved that.  I believe 
the new JH7110 has Zba and Zbb, so maybe the right answer there is to 
just benchmark things before merging them?  That way we can get back to 
doing things sanely before we go too far down the premature optimization 
rabbit hole.

FWIW: we had a pretty similar discussion in Linux land around these and 
nobody could get the JH7110 to boot, but given that we have ~6 months 
until glibc releases again hopefully that will be sorted out.  There's a 
bunch of ongoing work looking at the more core issues like probing, so 
maybe it's best to focus on getting that all sorted out first?  It's 
kind of awkward to have a bunch of routines posted in a whole new 
framework that's not sorting out all the probing dependencies.

>> > ---
>> >  sysdeps/riscv/multiarch/Makefile          |   3 +-
>> >  sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
>> >  sysdeps/riscv/multiarch/strncmp.c         |   6 +-
>> >  sysdeps/riscv/multiarch/strncmp_zbb.S     | 119 ++++++++++++++++++++++
>> >  4 files changed, 127 insertions(+), 2 deletions(-)
>> >  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
>> >
>> > diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
>> > index 056ce2ffc0..9f22e31b99 100644
>> > --- a/sysdeps/riscv/multiarch/Makefile
>> > +++ b/sysdeps/riscv/multiarch/Makefile
>> > @@ -14,5 +14,6 @@ sysdep_routines += \
>> >         strcmp_generic \
>> >         strcmp_zbb \
>> >         strcmp_zbb_unaligned \
>> > -       strncmp_generic
>> > +       strncmp_generic \
>> > +       strncmp_zbb
>> >  endif
>> > diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
>> > index eb37ed6017..82fd34d010 100644
>> > --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
>> > +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
>> > @@ -64,6 +64,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>> >               IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
>> >
>> >    IFUNC_IMPL (i, name, strncmp,
>> > +             IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_zbb)
>> >               IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
>> >    return i;
>> >  }
>> > diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
>> > index 970aeb8b85..5b0fe08e98 100644
>> > --- a/sysdeps/riscv/multiarch/strncmp.c
>> > +++ b/sysdeps/riscv/multiarch/strncmp.c
>> > @@ -30,8 +30,12 @@
>> >
>> >  extern __typeof (__redirect_strncmp) __libc_strncmp;
>> >  extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
>> > +extern __typeof (__redirect_strncmp) __strncmp_zbb attribute_hidden;
>> >
>> > -libc_ifunc (__libc_strncmp, __strncmp_generic);
>> > +libc_ifunc (__libc_strncmp,
>> > +           HAVE_RV(zbb)
>> > +           ? __strncmp_zbb
>> > +           : __strncmp_generic);
>> >
>> >  # undef strncmp
>> >  strong_alias (__libc_strncmp, strncmp);
>> > diff --git a/sysdeps/riscv/multiarch/strncmp_zbb.S b/sysdeps/riscv/multiarch/strncmp_zbb.S
>> > new file mode 100644
>> > index 0000000000..29cff30def
>> > --- /dev/null
>> > +++ b/sysdeps/riscv/multiarch/strncmp_zbb.S
>> > @@ -0,0 +1,119 @@
>> > +/* Copyright (C) 2022 Free Software Foundation, Inc.
>> > +
>> > +   This file is part of the GNU C Library.
>> > +
>> > +   The GNU C Library is free software; you can redistribute it and/or
>> > +   modify it under the terms of the GNU Lesser General Public
>> > +   License as published by the Free Software Foundation; either
>> > +   version 2.1 of the License, or (at your option) any later version.
>> > +
>> > +   The GNU C Library is distributed in the hope that it will be useful,
>> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> > +   Lesser General Public License for more details.
>> > +
>> > +   You should have received a copy of the GNU Lesser General Public
>> > +   License along with the GNU C Library.  If not, see
>> > +   <https://www.gnu.org/licenses/>.  */
>> > +
>> > +#include <sysdep.h>
>> > +#include <sys/asm.h>
>> > +
>> > +/* Assumptions: rvi_zbb.  */
>> > +
>> > +#define src1           a0
>> > +#define result         a0
>> > +#define src2           a1
>> > +#define len            a2
>> > +#define data1          a2
>> > +#define data2          a3
>> > +#define align          a4
>> > +#define data1_orcb     t0
>> > +#define limit          t1
>> > +#define fast_limit      t2
>> > +#define m1             t3
>> > +
>> > +#if __riscv_xlen == 64
>> > +# define REG_L ld
>> > +# define SZREG 8
>> > +# define PTRLOG        3
>> > +#else
>> > +# define REG_L lw
>> > +# define SZREG 4
>> > +# define PTRLOG        2
>> > +#endif
>> > +
>> > +#ifndef STRNCMP
>> > +# define STRNCMP __strncmp_zbb
>> > +#endif
>> > +
>> > +.option push
>> > +.option arch,+zbb
>> > +
>> > +ENTRY_ALIGN (STRNCMP, 6)
>> > +       beqz    len, L(equal)
>> > +       or      align, src1, src2
>> > +       and     align, align, SZREG-1
>> > +       add     limit, src1, len
>> > +       bnez    align, L(simpleloop)
>> > +       li      m1, -1
>> > +
>> > +       /* Adjust limit for fast-path.  */
>> > +       andi fast_limit, limit, -SZREG
>> > +
>> > +       /* Main loop for aligned string.  */
>> > +       .p2align 3
>> > +L(loop):
>> > +       bge     src1, fast_limit, L(simpleloop)
>> > +       REG_L   data1, 0(src1)
>> > +       REG_L   data2, 0(src2)
>> > +       orc.b   data1_orcb, data1
>> > +       bne     data1_orcb, m1, L(foundnull)
>> > +       addi    src1, src1, SZREG
>> > +       addi    src2, src2, SZREG
>> > +       beq     data1, data2, L(loop)
>> > +
>> > +       /* Words don't match, and no null byte in the first
>> > +        * word. Get bytes in big-endian order and compare.  */
>> > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
>> > +       rev8    data1, data1
>> > +       rev8    data2, data2
>> > +#endif
>> > +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
>> > +       sltu    result, data1, data2
>> > +       neg     result, result
>> > +       ori     result, result, 1
>> > +       ret
>> > +
>> > +L(foundnull):
>> > +       /* Found a null byte.
>> > +        * If words don't match, fall back to simple loop.  */
>> > +       bne     data1, data2, L(simpleloop)
>> > +
>> > +       /* Otherwise, strings are equal.  */
>> > +       li      result, 0
>> > +       ret
>> > +
>> > +       /* Simple loop for misaligned strings.  */
>> > +       .p2align 3
>> > +L(simpleloop):
>> > +       bge     src1, limit, L(equal)
>> > +       lbu     data1, 0(src1)
>> > +       addi    src1, src1, 1
>> > +       lbu     data2, 0(src2)
>> > +       addi    src2, src2, 1
>> > +       bne     data1, data2, L(sub)
>> > +       bnez    data1, L(simpleloop)
>> > +
>> > +L(sub):
>> > +       sub     result, data1, data2
>> > +       ret
>> > +
>> > +L(equal):
>> > +       li      result, 0
>> > +       ret
>> > +
>> > +.option pop
>> > +
>> > +END (STRNCMP)
>> > +libc_hidden_builtin_def (STRNCMP)
>> > --
>> > 2.39.1
>> >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-08 15:13     ` Philipp Tomsich
  2023-02-08 17:55       ` Palmer Dabbelt
@ 2023-02-08 18:04       ` Noah Goldstein
  1 sibling, 0 replies; 42+ messages in thread
From: Noah Goldstein @ 2023-02-08 18:04 UTC (permalink / raw)
  To: Philipp Tomsich
  Cc: Christoph Muellner, libc-alpha, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, Kito Cheng, Jeff Law,
	Heiko Stuebner

On Wed, Feb 8, 2023 at 9:13 AM Philipp Tomsich <philipp.tomsich@vrull.eu> wrote:
>
> On Tue, 7 Feb 2023 at 02:20, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner
> > <christoph.muellner@vrull.eu> wrote:
> > >
> > > From: Christoph Müllner <christoph.muellner@vrull.eu>
> > >
> > > The implementation of strncmp() can be accelerated using Zbb's orc.b
> > > instruction. Let's add an optimized implementation that makes use
> > > of this instruction.
> > >
> > > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> >
> > Not necessary, but imo performance patches should have at least some reference
> > to the expected speedup versus the existing alternatives.
>
> Given that this is effectively a SWAR-like optimization (orc.b allows
> us to test 8 bytes in parallel for a NUL byte), we should be able to
> show the benefit through a reduction in dynamic instructions.  Would
> this be considered reasonable reference data?
>

GLIBC has a benchmark suite for all the string/memory functions so
would expect improvement in those results compared to the generic
implementations.


> > > ---
> > >  sysdeps/riscv/multiarch/Makefile          |   3 +-
> > >  sysdeps/riscv/multiarch/ifunc-impl-list.c |   1 +
> > >  sysdeps/riscv/multiarch/strncmp.c         |   6 +-
> > >  sysdeps/riscv/multiarch/strncmp_zbb.S     | 119 ++++++++++++++++++++++
> > >  4 files changed, 127 insertions(+), 2 deletions(-)
> > >  create mode 100644 sysdeps/riscv/multiarch/strncmp_zbb.S
> > >
> > > diff --git a/sysdeps/riscv/multiarch/Makefile b/sysdeps/riscv/multiarch/Makefile
> > > index 056ce2ffc0..9f22e31b99 100644
> > > --- a/sysdeps/riscv/multiarch/Makefile
> > > +++ b/sysdeps/riscv/multiarch/Makefile
> > > @@ -14,5 +14,6 @@ sysdep_routines += \
> > >         strcmp_generic \
> > >         strcmp_zbb \
> > >         strcmp_zbb_unaligned \
> > > -       strncmp_generic
> > > +       strncmp_generic \
> > > +       strncmp_zbb
> > >  endif
> > > diff --git a/sysdeps/riscv/multiarch/ifunc-impl-list.c b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > > index eb37ed6017..82fd34d010 100644
> > > --- a/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > > +++ b/sysdeps/riscv/multiarch/ifunc-impl-list.c
> > > @@ -64,6 +64,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> > >               IFUNC_IMPL_ADD (array, i, strcmp, 1, __strcmp_generic))
> > >
> > >    IFUNC_IMPL (i, name, strncmp,
> > > +             IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_zbb)
> > >               IFUNC_IMPL_ADD (array, i, strncmp, 1, __strncmp_generic))
> > >    return i;
> > >  }
> > > diff --git a/sysdeps/riscv/multiarch/strncmp.c b/sysdeps/riscv/multiarch/strncmp.c
> > > index 970aeb8b85..5b0fe08e98 100644
> > > --- a/sysdeps/riscv/multiarch/strncmp.c
> > > +++ b/sysdeps/riscv/multiarch/strncmp.c
> > > @@ -30,8 +30,12 @@
> > >
> > >  extern __typeof (__redirect_strncmp) __libc_strncmp;
> > >  extern __typeof (__redirect_strncmp) __strncmp_generic attribute_hidden;
> > > +extern __typeof (__redirect_strncmp) __strncmp_zbb attribute_hidden;
> > >
> > > -libc_ifunc (__libc_strncmp, __strncmp_generic);
> > > +libc_ifunc (__libc_strncmp,
> > > +           HAVE_RV(zbb)
> > > +           ? __strncmp_zbb
> > > +           : __strncmp_generic);
> > >
> > >  # undef strncmp
> > >  strong_alias (__libc_strncmp, strncmp);
> > > diff --git a/sysdeps/riscv/multiarch/strncmp_zbb.S b/sysdeps/riscv/multiarch/strncmp_zbb.S
> > > new file mode 100644
> > > index 0000000000..29cff30def
> > > --- /dev/null
> > > +++ b/sysdeps/riscv/multiarch/strncmp_zbb.S
> > > @@ -0,0 +1,119 @@
> > > +/* Copyright (C) 2022 Free Software Foundation, Inc.
> > > +
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library.  If not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +
> > > +#include <sysdep.h>
> > > +#include <sys/asm.h>
> > > +
> > > +/* Assumptions: rvi_zbb.  */
> > > +
> > > +#define src1           a0
> > > +#define result         a0
> > > +#define src2           a1
> > > +#define len            a2
> > > +#define data1          a2
> > > +#define data2          a3
> > > +#define align          a4
> > > +#define data1_orcb     t0
> > > +#define limit          t1
> > > +#define fast_limit      t2
> > > +#define m1             t3
> > > +
> > > +#if __riscv_xlen == 64
> > > +# define REG_L ld
> > > +# define SZREG 8
> > > +# define PTRLOG        3
> > > +#else
> > > +# define REG_L lw
> > > +# define SZREG 4
> > > +# define PTRLOG        2
> > > +#endif
> > > +
> > > +#ifndef STRNCMP
> > > +# define STRNCMP __strncmp_zbb
> > > +#endif
> > > +
> > > +.option push
> > > +.option arch,+zbb
> > > +
> > > +ENTRY_ALIGN (STRNCMP, 6)
> > > +       beqz    len, L(equal)
> > > +       or      align, src1, src2
> > > +       and     align, align, SZREG-1
> > > +       add     limit, src1, len
> > > +       bnez    align, L(simpleloop)
> > > +       li      m1, -1
> > > +
> > > +       /* Adjust limit for fast-path.  */
> > > +       andi fast_limit, limit, -SZREG
> > > +
> > > +       /* Main loop for aligned string.  */
> > > +       .p2align 3
> > > +L(loop):
> > > +       bge     src1, fast_limit, L(simpleloop)
> > > +       REG_L   data1, 0(src1)
> > > +       REG_L   data2, 0(src2)
> > > +       orc.b   data1_orcb, data1
> > > +       bne     data1_orcb, m1, L(foundnull)
> > > +       addi    src1, src1, SZREG
> > > +       addi    src2, src2, SZREG
> > > +       beq     data1, data2, L(loop)
> > > +
> > > +       /* Words don't match, and no null byte in the first
> > > +        * word. Get bytes in big-endian order and compare.  */
> > > +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
> > > +       rev8    data1, data1
> > > +       rev8    data2, data2
> > > +#endif
> > > +       /* Synthesize (data1 >= data2) ? 1 : -1 in a branchless sequence.  */
> > > +       sltu    result, data1, data2
> > > +       neg     result, result
> > > +       ori     result, result, 1
> > > +       ret
> > > +
> > > +L(foundnull):
> > > +       /* Found a null byte.
> > > +        * If words don't match, fall back to simple loop.  */
> > > +       bne     data1, data2, L(simpleloop)
> > > +
> > > +       /* Otherwise, strings are equal.  */
> > > +       li      result, 0
> > > +       ret
> > > +
> > > +       /* Simple loop for misaligned strings.  */
> > > +       .p2align 3
> > > +L(simpleloop):
> > > +       bge     src1, limit, L(equal)
> > > +       lbu     data1, 0(src1)
> > > +       addi    src1, src1, 1
> > > +       lbu     data2, 0(src2)
> > > +       addi    src2, src2, 1
> > > +       bne     data1, data2, L(sub)
> > > +       bnez    data1, L(simpleloop)
> > > +
> > > +L(sub):
> > > +       sub     result, data1, data2
> > > +       ret
> > > +
> > > +L(equal):
> > > +       li      result, 0
> > > +       ret
> > > +
> > > +.option pop
> > > +
> > > +END (STRNCMP)
> > > +libc_hidden_builtin_def (STRNCMP)
> > > --
> > > 2.39.1
> > >

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 18/19] riscv: Add an optimized strncmp routine
  2023-02-08 17:55       ` Palmer Dabbelt
@ 2023-02-08 19:48         ` Adhemerval Zanella Netto
  0 siblings, 0 replies; 42+ messages in thread
From: Adhemerval Zanella Netto @ 2023-02-08 19:48 UTC (permalink / raw)
  To: Palmer Dabbelt, philipp.tomsich
  Cc: goldstein.w.n, christoph.muellner, libc-alpha, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, kito.cheng,
	jeffreyalaw, heiko.stuebner

On 08/02/23 14:55, Palmer Dabbelt wrote:
> On Wed, 08 Feb 2023 07:13:44 PST (-0800), philipp.tomsich@vrull.eu wrote:
>> On Tue, 7 Feb 2023 at 02:20, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>
>>> On Mon, Feb 6, 2023 at 6:23 PM Christoph Muellner
>>> <christoph.muellner@vrull.eu> wrote:
>>> >
>>> > From: Christoph Müllner <christoph.muellner@vrull.eu>
>>> >
>>> > The implementation of strncmp() can be accelerated using Zbb's orc.b
>>> > instruction. Let's add an optimized implementation that makes use
>>> > of this instruction.
>>> >
>>> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
>>>
>>> Not necessary, but imo performance patches should have at least some reference
>>> to the expected speedup versus the existing alternatives.
>>
>> Given that this is effectively a SWAR-like optimization (orc.b allows
>> us to test 8 bytes in parallel for a NUL byte), we should be able to
>> show the benefit through a reduction in dynamic instructions.  Would
>> this be considered reasonable reference data?
> 
> Generally for performance improvements the only metrics that count come from real hardware.  Processor implementation is complex and it's not generally true that reducing dynamic instructions results in better performance (particularly when more complex flavors instructions replace simpler ones).
> 

I agree with Noah here that we need to have some baseline performance number, 
even tough we are comparing naive implementations (what glibc used to have for
implementations).

> We've not been so good about this on the RISC-V side of things, though.  I think that's largely because we didn't have all that much complexity around this, but there's a ton of stuff showing up right now.  The general theory has been that Zbb instructions will execute faster than their corresponding I sequences, but nobody has proved that.  I believe the new JH7110 has Zba and Zbb, so maybe the right answer there is to just benchmark things before merging them?  That way we can get back to doing things sanely before we go too far down the premature optimization rabbit hole.
> 
> FWIW: we had a pretty similar discussion in Linux land around these and nobody could get the JH7110 to boot, but given that we have ~6 months until glibc releases again hopefully that will be sorted out.  There's a bunch of ongoing work looking at the more core issues like probing, so maybe it's best to focus on getting that all sorted out first?  It's kind of awkward to have a bunch of routines posted in a whole new framework that's not sorting out all the probing dependencies.

Just a heads up that with latest generic string routines optimization, all
str* routines should now use new zbb extensions (if compiler is instructed
to do so). I think you might squeeze some cycles with hand crafted assembly
routine, but I would rather focus on trying to optimize code generation
instead.

The generic routines still assumes that hardware can't or is prohibitive 
expensive to issue unaligned memory access.  However, I think we move toward 
this direction to start adding unaligned variants when it makes sense.

Another usual tuning is loop unrolling, which depends on underlying hardware.
Unfortunately we need to explicit force gcc to unroll some loop construction
(for instance check sysdeps/powerpc/powerpc64/power4/Makefile), so this might
be another approach you might use to tune RISCV routines.

The memcpy, memmove, memset, memcmp are a slight different subject.  Although
current generic mem routines does use some explicit unrolling, it also does
not take in consideration unaligned access, vector instructions, or special 
instruction (such as cache clear one).  And these usually make a lot of
difference.

What I would expect it maybe we can use a similar strategy Google is doing
with llvm libc, which based its work on the automemcpy paper [1]. It means
that for unaligned, each architecture will reimplement the memory routine
block.  Although the project focus on static compiling, I think using hooks
over assembly routines might be a better approach (you might reuse code
blocks or try different strategies more easily).

[1] https://storage.googleapis.com/pub-tools-public-publication-data/pdf/4f7c3da72d557ed418828823a8e59942859d677f.pdf

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-02-07 14:15     ` Christoph Müllner
@ 2023-03-31  5:06       ` Jeff Law
  2023-03-31 12:31         ` Adhemerval Zanella Netto
  2023-03-31 14:32       ` Jeff Law
  1 sibling, 1 reply; 42+ messages in thread
From: Jeff Law @ 2023-03-31  5:06 UTC (permalink / raw)
  To: Christoph Müllner, Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Philipp Tomsich,
	Heiko Stuebner, Adhemerval Zanella

On 2/7/23 07:15, Christoph Müllner wrote:

>      > diff --git a/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>      > b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
[ ... ]

>      > +
>      > +ENTRY_ALIGN (STRCMP, 6)
>      > +       /* off...delta from src1 to src2.  */
>      > +       sub     off, src2, src1
>      > +       li      m1, -1
>      > +       andi    tmp, off, SZREG-1
>      > +       andi    align1, src1, SZREG-1
>      > +       bnez    tmp, L(misaligned8)
>      > +       bnez    align1, L(mutual_align)
>      > +
>      > +       .p2align 4
>      > +L(loop_aligned):
>      > +       REG_L   data1, 0(src1)
>      > +       add     tmp, src1, off
>      > +       addi    src1, src1, SZREG
>      > +       REG_L   data2, 0(tmp)

So any thoughts on reducing the alignment?  Based on the data I've seen 
we very rarely ever take the branch to L(loop_aligned).    So aligning 
this particular label is of dubious value to begin with.  As it stands 
we have to emit 3 full sized nops to achieve the requested alignment and 
they can burn most of an issue cycle.

While it's highly dependent on pipeline state, there's a reasonable 
chance of shaving a cycle by reducing the alignment to p2align 3.

I haven't done much with analyzing the rest of the code as it just 
hasn't been hot in any of the cases I've looked at.

I'd be comfortable with this going in as-is or with the alignment 
adjustment.  Obviously wiring it up via ifunc is dependent upon settling 
the kernel->glibc interface.

Jeff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-03-31  5:06       ` Jeff Law
@ 2023-03-31 12:31         ` Adhemerval Zanella Netto
  2023-03-31 14:30           ` Jeff Law
  0 siblings, 1 reply; 42+ messages in thread
From: Adhemerval Zanella Netto @ 2023-03-31 12:31 UTC (permalink / raw)
  To: Jeff Law, Christoph Müllner, Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Philipp Tomsich,
	Heiko Stuebner



On 31/03/23 02:06, Jeff Law wrote:
> 
> 
> On 2/7/23 07:15, Christoph Müllner wrote:
> 
>>      > diff --git a/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
>>      > b/sysdeps/riscv/multiarch/strcmp_zbb_unaligned.S
> [ ... ]
> 
> 
>>      > +
>>      > +ENTRY_ALIGN (STRCMP, 6)
>>      > +       /* off...delta from src1 to src2.  */
>>      > +       sub     off, src2, src1
>>      > +       li      m1, -1
>>      > +       andi    tmp, off, SZREG-1
>>      > +       andi    align1, src1, SZREG-1
>>      > +       bnez    tmp, L(misaligned8)
>>      > +       bnez    align1, L(mutual_align)
>>      > +
>>      > +       .p2align 4
>>      > +L(loop_aligned):
>>      > +       REG_L   data1, 0(src1)
>>      > +       add     tmp, src1, off
>>      > +       addi    src1, src1, SZREG
>>      > +       REG_L   data2, 0(tmp)
> 
> So any thoughts on reducing the alignment?  Based on the data I've seen we very rarely ever take the branch to L(loop_aligned).    So aligning this particular label is of dubious value to begin with.  As it stands we have to emit 3 full sized nops to achieve the requested alignment and they can burn most of an issue cycle.
> 
> While it's highly dependent on pipeline state, there's a reasonable chance of shaving a cycle by reducing the alignment to p2align 3.
> 
> I haven't done much with analyzing the rest of the code as it just hasn't been hot in any of the cases I've looked at.
> 
> I'd be comfortable with this going in as-is or with the alignment adjustment.  Obviously wiring it up via ifunc is dependent upon settling the kernel->glibc interface.
> 
> Jeff

Is this implementation really better than new generic one [1]? With a target
with zbb support, the generic word comparison should use orc.b instruction [2].
And the final comparison, once with the last word or the mismatch word is found,
should use clz/ctz instruction [3] (result also in branchless code, albeit
I have not check if better than the snippet this implementation uses).

The generic implementation also has the advantage of use word instruction
on unaligned case, where this implementation does a naive byte per byte
check.

So maybe a better option would to optimize further the generic implementation.
One option might be to parametrize the final_cmp so you can use the branchless
trick (if it indeed is better than generic code).  Another option that the 
generic implementation does not explore is manual loop unrolling, as done by 
multiple assembly implementations.

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=string/strcmp.c;h=11ec8bac816b630417ccbfeba70f9eab6ec37874;hb=HEAD
[2] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/riscv/string-fza.h;h=4429653a001de09730cfa83325b29556a8afb5ed;hb=HEAD
[3] https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/generic/string-fzi.h;h=2deecefc236833abffbee886851d75e7ecf66755;hb=HEAD

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-03-31 12:31         ` Adhemerval Zanella Netto
@ 2023-03-31 14:30           ` Jeff Law
  2023-03-31 14:48             ` Adhemerval Zanella Netto
  0 siblings, 1 reply; 42+ messages in thread
From: Jeff Law @ 2023-03-31 14:30 UTC (permalink / raw)
  To: Adhemerval Zanella Netto, Christoph Müllner, Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Philipp Tomsich,
	Heiko Stuebner



On 3/31/23 06:31, Adhemerval Zanella Netto wrote:

>> Jeff
> 
> Is this implementation really better than new generic one [1]? With a target
> with zbb support, the generic word comparison should use orc.b instruction [2].
> And the final comparison, once with the last word or the mismatch word is found,
> should use clz/ctz instruction [3] (result also in branchless code, albeit
> I have not check if better than the snippet this implementation uses).
I haven't done any comparisons against the updated generic bits.  I 
nearly suggested to Christoph to do that evaluation, but when I wandered 
around sysdeps I saw that we still had multiple custom strcmp 
implementations and set that suggestion aside.


> 
> The generic implementation also has the advantage of use word instruction
> on unaligned case, where this implementation does a naive byte per byte
> check.
Yea, but in my digging this just didn't happen terribly often.  I don't 
think there's a lot of value there.  Along the same lines, my 
investigation didn't show any significant value to realign cases and I 
nearly suggested dropping them to avoid the branch in the hot path, but 
I wasn't confident enough in the breadth of my investigations to push it.

> 
> So maybe a better option would to optimize further the generic implementation.
> One option might be to parametrize the final_cmp so you can use the branchless
> trick (if it indeed is better than generic code).  Another option that the
> generic implementation does not explore is manual loop unrolling, as done by
> multiple assembly implementations.
I could certainly support that.  I was on the fence about pushing to use 
the generic bits, a little nudge could easily push me to that side.

jeff


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-02-07 14:15     ` Christoph Müllner
  2023-03-31  5:06       ` Jeff Law
@ 2023-03-31 14:32       ` Jeff Law
  1 sibling, 0 replies; 42+ messages in thread
From: Jeff Law @ 2023-03-31 14:32 UTC (permalink / raw)
  To: Christoph Müllner, Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Philipp Tomsich,
	Heiko Stuebner, Adhemerval Zanella



On 2/7/23 07:15, Christoph Müllner wrote:
> 
> On Tue, Feb 7, 2023 at 12:57 PM Xi Ruoyao <xry111@xry111.site> wrote:
> 
>     Is it possible to make the optimized generic string routine [1] ifunc-
>     aware in some way?  Or can we "templatize" it so we can make vectorized
>     string routines more easily?
> 
> 
> You mean to compile the generic code multiple times with different sets
> of enabled extensions? Yes, that should work (my patchset does the same
> for memset). Let's wait for the referenced patchset to land (looks like this
> will happen soon).
We would still need that kernel interface to be sorted out to use 
function multi-versioning.  Under the hood what GCC generates for FMV 
looks a lot like an ifunc resolver.

jeff

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-03-31 14:30           ` Jeff Law
@ 2023-03-31 14:48             ` Adhemerval Zanella Netto
  2023-03-31 17:19               ` Palmer Dabbelt
  0 siblings, 1 reply; 42+ messages in thread
From: Adhemerval Zanella Netto @ 2023-03-31 14:48 UTC (permalink / raw)
  To: Jeff Law, Christoph Müllner, Xi Ruoyao
  Cc: libc-alpha, Palmer Dabbelt, Darius Rad, Andrew Waterman,
	DJ Delorie, Vineet Gupta, Kito Cheng, Philipp Tomsich,
	Heiko Stuebner



On 31/03/23 11:30, Jeff Law wrote:
> 
> 
> On 3/31/23 06:31, Adhemerval Zanella Netto wrote:
> 
>>> Jeff
>>
>> Is this implementation really better than new generic one [1]? With a target
>> with zbb support, the generic word comparison should use orc.b instruction [2].
>> And the final comparison, once with the last word or the mismatch word is found,
>> should use clz/ctz instruction [3] (result also in branchless code, albeit
>> I have not check if better than the snippet this implementation uses).
> I haven't done any comparisons against the updated generic bits.  I nearly suggested to Christoph to do that evaluation, but when I wandered around sysdeps I saw that we still had multiple custom strcmp implementations and set that suggestion aside.
> 
> 
>>
>> The generic implementation also has the advantage of use word instruction
>> on unaligned case, where this implementation does a naive byte per byte
>> check.
> Yea, but in my digging this just didn't happen terribly often.  I don't think there's a lot of value there.  Along the same lines, my investigation didn't show any significant value to realign cases and I nearly suggested dropping them to avoid the branch in the hot path, but I wasn't confident enough in the breadth of my investigations to push it.
> >>
>> So maybe a better option would to optimize further the generic implementation.
>> One option might be to parametrize the final_cmp so you can use the branchless
>> trick (if it indeed is better than generic code).  Another option that the
>> generic implementation does not explore is manual loop unrolling, as done by
>> multiple assembly implementations.
> I could certainly support that.  I was on the fence about pushing to use the generic bits, a little nudge could easily push me to that side.

The initial realign could be tuned, I added mostly because it simplifies both
aligned and unaligned case a lot.  But it should be doable to use a similar
strategy as strchr/strlen to mask off the bits based on the input alignment.

The unaligned case is just to avoid drastic performance different between
input alignment, it is cheap and in the end should just be additional code
size.

But the main gain of using the generic implementation is one less assembly
routine to maintain and tune; and by improving the generic implementation
we gain in ecosystem as whole.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC PATCH 16/19] riscv: Add accelerated strcmp routines
  2023-03-31 14:48             ` Adhemerval Zanella Netto
@ 2023-03-31 17:19               ` Palmer Dabbelt
  0 siblings, 0 replies; 42+ messages in thread
From: Palmer Dabbelt @ 2023-03-31 17:19 UTC (permalink / raw)
  To: adhemerval.zanella
  Cc: jeffreyalaw, christoph.muellner, xry111, libc-alpha, Darius Rad,
	Andrew Waterman, DJ Delorie, Vineet Gupta, kito.cheng,
	philipp.tomsich, heiko.stuebner

On Fri, 31 Mar 2023 07:48:43 PDT (-0700), adhemerval.zanella@linaro.org wrote:
>
>
> On 31/03/23 11:30, Jeff Law wrote:
>>
>>
>> On 3/31/23 06:31, Adhemerval Zanella Netto wrote:
>>
>>>> Jeff
>>>
>>> Is this implementation really better than new generic one [1]? With a target
>>> with zbb support, the generic word comparison should use orc.b instruction [2].
>>> And the final comparison, once with the last word or the mismatch word is found,
>>> should use clz/ctz instruction [3] (result also in branchless code, albeit
>>> I have not check if better than the snippet this implementation uses).
>> I haven't done any comparisons against the updated generic bits.  I nearly suggested to Christoph to do that evaluation, but when I wandered around sysdeps I saw that we still had multiple custom strcmp implementations and set that suggestion aside.
>>
>>
>>>
>>> The generic implementation also has the advantage of use word instruction
>>> on unaligned case, where this implementation does a naive byte per byte
>>> check.
>> Yea, but in my digging this just didn't happen terribly often.  I don't think there's a lot of value there.  Along the same lines, my investigation didn't show any significant value to realign cases and I nearly suggested dropping them to avoid the branch in the hot path, but I wasn't confident enough in the breadth of my investigations to push it.
>> >>
>>> So maybe a better option would to optimize further the generic implementation.
>>> One option might be to parametrize the final_cmp so you can use the branchless
>>> trick (if it indeed is better than generic code).  Another option that the
>>> generic implementation does not explore is manual loop unrolling, as done by
>>> multiple assembly implementations.
>> I could certainly support that.  I was on the fence about pushing to use the generic bits, a little nudge could easily push me to that side.
>
> The initial realign could be tuned, I added mostly because it simplifies both
> aligned and unaligned case a lot.  But it should be doable to use a similar
> strategy as strchr/strlen to mask off the bits based on the input alignment.
>
> The unaligned case is just to avoid drastic performance different between
> input alignment, it is cheap and in the end should just be additional code
> size.
>
> But the main gain of using the generic implementation is one less assembly
> routine to maintain and tune; and by improving the generic implementation
> we gain in ecosystem as whole.

I think we should use the generic stuff where we can, just to avoid 
extra maintiance issues.  I think we'll eventually end up with 
vendor-specific assembly routines, but IMO it's best to only merge those 
if there's a meaningful performance advantage and there's no way to 
replicate it without resorting to assembly.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-03-31 17:19 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-07  0:15 [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 01/19] Inhibit early libcalls before ifunc support is ready Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 02/19] riscv: LEAF: Use C_LABEL() to construct the asm name for a C symbol Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 03/19] riscv: Add ENTRY_ALIGN() macro Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 04/19] riscv: Add hart feature run-time detection framework Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 05/19] riscv: Introduction of ISA extensions Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 06/19] riscv: Adding ISA string parser for environment variables Christoph Muellner
2023-02-07  6:20   ` David Abdurachmanov
2023-02-07  0:16 ` [RFC PATCH 07/19] riscv: hart-features: Add fast_unaligned property Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 08/19] riscv: Add (empty) ifunc framework Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 09/19] riscv: Add ifunc support for memset Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 10/19] riscv: Add accelerated memset routines for RV64 Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 11/19] riscv: Add ifunc support for memcpy/memmove Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 12/19] riscv: Add accelerated memcpy/memmove routines for RV64 Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 13/19] riscv: Add ifunc support for strlen Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 14/19] riscv: Add accelerated strlen routine Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 15/19] riscv: Add ifunc support for strcmp Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 16/19] riscv: Add accelerated strcmp routines Christoph Muellner
2023-02-07 11:57   ` Xi Ruoyao
2023-02-07 14:15     ` Christoph Müllner
2023-03-31  5:06       ` Jeff Law
2023-03-31 12:31         ` Adhemerval Zanella Netto
2023-03-31 14:30           ` Jeff Law
2023-03-31 14:48             ` Adhemerval Zanella Netto
2023-03-31 17:19               ` Palmer Dabbelt
2023-03-31 14:32       ` Jeff Law
2023-02-07  0:16 ` [RFC PATCH 17/19] riscv: Add ifunc support for strncmp Christoph Muellner
2023-02-07  0:16 ` [RFC PATCH 18/19] riscv: Add an optimized strncmp routine Christoph Muellner
2023-02-07  1:19   ` Noah Goldstein
2023-02-08 15:13     ` Philipp Tomsich
2023-02-08 17:55       ` Palmer Dabbelt
2023-02-08 19:48         ` Adhemerval Zanella Netto
2023-02-08 18:04       ` Noah Goldstein
2023-02-07  0:16 ` [RFC PATCH 19/19] riscv: Add __riscv_cpu_relax() to allow yielding in busy loops Christoph Muellner
2023-02-07  0:23   ` Andrew Waterman
2023-02-07  0:29     ` Christoph Müllner
2023-02-07  2:59 ` [RFC PATCH 00/19] riscv: ifunc support with optimized mem*/str*/cpu_relax routines Kito Cheng
2023-02-07 16:40 ` Adhemerval Zanella Netto
2023-02-07 17:16   ` DJ Delorie
2023-02-07 19:32     ` Philipp Tomsich
2023-02-07 21:14       ` DJ Delorie
2023-02-08 11:26         ` Christoph Müllner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).