public inbox for libc-alpha@sourceware.org
* [RFC PATCH 0/3] RISC-V: Use WRS.STO for atomic_spin_nop
@ 2024-04-18  9:46 Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 1/3] RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8 Christoph Müllner
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18  9:46 UTC (permalink / raw)
  To: libc-alpha, Adhemerval Zanella, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, Philipp Tomsich, Evan Green, Kito Cheng,
	Jeff Law, Vineet Gupta
  Cc: Christoph Müllner

This series utilizes the recently introduced hwprobe() and ifunc support
in RISC-V to implement atomic_spin_nop. The generic code uses the PAUSE
instruction (specified in Zihintpause as a HINT instruction).
If hwprobe() reports availability of Zawrs, then WRS.STO is used instead.

All specifications are ratified.  However, the second patch is not ready
to land in glibc yet, because the RISCV_HWPROBE_EXT_ZAWRS macro is not
defined in upstream Linux yet (there is just a patch from Andrew Jones
on LKML).  Therefore, this series is marked as RFC.  See also:
  https://lore.kernel.org/all/20240315134009.580167-10-ajones@ventanamicro.com/

The first patch of this series imports all hwprobe macros from Linux 6.8.
As this patch has no further dependencies, it could be merged at any time.

This patch was tested with a simple test code that calls
pthread_spin_lock() twice (triggering the spinning).
This program was compiled for rv64gc and executed using QEMU.
A small modification in QEMU was used to report if WRS.STO
was executed (-cpu "rv64,zawrs=false" vs -cpu "rv64,zawrs=true").

Christoph Müllner (3):
  RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8
  RISC-V: hwprobe: Add Zawrs test bit
  RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs

 sysdeps/riscv/multiarch/cpu-relax_generic.S   | 31 +++++++++++++++
 sysdeps/riscv/multiarch/cpu-relax_zawrs.S     | 28 +++++++++++++
 .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
 .../unix/sysv/linux/riscv/multiarch/Makefile  |  8 ++++
 .../sysv/linux/riscv/multiarch/cpu-relax.c    | 39 +++++++++++++++++++
 .../linux/riscv/multiarch/ifunc-impl-list.c   | 32 +++++++++++++--
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h   | 30 ++++++++++++++
 7 files changed, 167 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/cpu-relax_generic.S
 create mode 100644 sysdeps/riscv/multiarch/cpu-relax_zawrs.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c

-- 
2.44.0


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/3] RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8
  2024-04-18  9:46 [RFC PATCH 0/3] RISC-V: Use WRS.STO for atomic_spin_nop Christoph Müllner
@ 2024-04-18  9:46 ` Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 2/3] RISC-V: hwprobe: Add Zawrs test bit Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs Christoph Müllner
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18  9:46 UTC (permalink / raw)
  To: libc-alpha, Adhemerval Zanella, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, Philipp Tomsich, Evan Green, Kito Cheng,
	Jeff Law, Vineet Gupta
  Cc: Christoph Müllner

This patch imports additional extension bits for hwprobe
from Linux 6.8.  It does not change existing
behaviour, as none of the newly defined bits are used
anywhere.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h | 29 +++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h b/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
index 8ecb43bb69..4856189f3c 100644
--- a/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
+++ b/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
@@ -50,6 +50,35 @@ struct riscv_hwprobe {
 #define  RISCV_HWPROBE_EXT_ZBB (1 << 4)
 #define  RISCV_HWPROBE_EXT_ZBS (1 << 5)
 #define  RISCV_HWPROBE_EXT_ZICBOZ (1 << 6)
+#define  RISCV_HWPROBE_EXT_ZBC (1 << 7)
+#define  RISCV_HWPROBE_EXT_ZBKB (1 << 8)
+#define  RISCV_HWPROBE_EXT_ZBKC (1 << 9)
+#define  RISCV_HWPROBE_EXT_ZBKX (1 << 10)
+#define  RISCV_HWPROBE_EXT_ZKND (1 << 11)
+#define  RISCV_HWPROBE_EXT_ZKNE (1 << 12)
+#define  RISCV_HWPROBE_EXT_ZKNH (1 << 13)
+#define  RISCV_HWPROBE_EXT_ZKSED (1 << 14)
+#define  RISCV_HWPROBE_EXT_ZKSH (1 << 15)
+#define  RISCV_HWPROBE_EXT_ZKT (1 << 16)
+#define  RISCV_HWPROBE_EXT_ZVBB (1 << 17)
+#define  RISCV_HWPROBE_EXT_ZVBC (1 << 18)
+#define  RISCV_HWPROBE_EXT_ZVKB (1 << 19)
+#define  RISCV_HWPROBE_EXT_ZVKG (1 << 20)
+#define  RISCV_HWPROBE_EXT_ZVKNED (1 << 21)
+#define  RISCV_HWPROBE_EXT_ZVKNHA (1 << 22)
+#define  RISCV_HWPROBE_EXT_ZVKNHB (1 << 23)
+#define  RISCV_HWPROBE_EXT_ZVKSED (1 << 24)
+#define  RISCV_HWPROBE_EXT_ZVKSH (1 << 25)
+#define  RISCV_HWPROBE_EXT_ZVKT (1 << 26)
+#define  RISCV_HWPROBE_EXT_ZFH (1 << 27)
+#define  RISCV_HWPROBE_EXT_ZFHMIN (1 << 28)
+#define  RISCV_HWPROBE_EXT_ZIHINTNTL (1 << 29)
+#define  RISCV_HWPROBE_EXT_ZVFH (1 << 30)
+#define  RISCV_HWPROBE_EXT_ZVFHMIN (1 << 31)
+#define  RISCV_HWPROBE_EXT_ZFA (1ULL << 32)
+#define  RISCV_HWPROBE_EXT_ZTSO (1ULL << 33)
+#define  RISCV_HWPROBE_EXT_ZACAS (1ULL << 34)
+#define  RISCV_HWPROBE_EXT_ZICOND (1ULL << 35)
 #define RISCV_HWPROBE_KEY_CPUPERF_0 5
 #define  RISCV_HWPROBE_MISALIGNED_UNKNOWN (0 << 0)
 #define  RISCV_HWPROBE_MISALIGNED_EMULATED (1 << 0)
-- 
2.44.0



* [RFC PATCH 2/3] RISC-V: hwprobe: Add Zawrs test bit
  2024-04-18  9:46 [RFC PATCH 0/3] RISC-V: Use WRS.STO for atomic_spin_nop Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 1/3] RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8 Christoph Müllner
@ 2024-04-18  9:46 ` Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs Christoph Müllner
  2 siblings, 0 replies; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18  9:46 UTC (permalink / raw)
  To: libc-alpha, Adhemerval Zanella, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, Philipp Tomsich, Evan Green, Kito Cheng,
	Jeff Law, Vineet Gupta
  Cc: Christoph Müllner

This patch adds a hwprobe test bit for Zawrs.
The bit position is not settled, as the corresponding kernel
patch has not landed upstream so far:
  https://lore.kernel.org/all/20240315134009.580167-10-ajones@ventanamicro.com/

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h b/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
index 4856189f3c..16526c3aec 100644
--- a/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
+++ b/sysdeps/unix/sysv/linux/riscv/sys/hwprobe.h
@@ -79,6 +79,7 @@ struct riscv_hwprobe {
 #define  RISCV_HWPROBE_EXT_ZTSO (1ULL << 33)
 #define  RISCV_HWPROBE_EXT_ZACAS (1ULL << 34)
 #define  RISCV_HWPROBE_EXT_ZICOND (1ULL << 35)
+#define  RISCV_HWPROBE_EXT_ZAWRS (1ULL << 36)
 #define RISCV_HWPROBE_KEY_CPUPERF_0 5
 #define  RISCV_HWPROBE_MISALIGNED_UNKNOWN (0 << 0)
 #define  RISCV_HWPROBE_MISALIGNED_EMULATED (1 << 0)
-- 
2.44.0



* [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18  9:46 [RFC PATCH 0/3] RISC-V: Use WRS.STO for atomic_spin_nop Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 1/3] RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8 Christoph Müllner
  2024-04-18  9:46 ` [RFC PATCH 2/3] RISC-V: hwprobe: Add Zawrs test bit Christoph Müllner
@ 2024-04-18  9:46 ` Christoph Müllner
  2024-04-18 17:17   ` Palmer Dabbelt
  2 siblings, 1 reply; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18  9:46 UTC (permalink / raw)
  To: libc-alpha, Adhemerval Zanella, Palmer Dabbelt, Darius Rad,
	Andrew Waterman, Philipp Tomsich, Evan Green, Kito Cheng,
	Jeff Law, Vineet Gupta
  Cc: Christoph Müllner

The macro atomic_spin_nop can be used to implement arch-specific
CPU yielding that is used in busy loops (e.g. in pthread_spin_lock).
This patch introduces an ifunc-based implementation for RISC-V
that uses Zihintpause's PAUSE instruction for that purpose (as PAUSE
is a HINT instruction, there is no dependency on Zihintpause at
runtime).  Further, we test for Zawrs via hwprobe() and, if found,
we use WRS.STO instead of PAUSE.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 sysdeps/riscv/multiarch/cpu-relax_generic.S   | 31 +++++++++++++++
 sysdeps/riscv/multiarch/cpu-relax_zawrs.S     | 28 +++++++++++++
 .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
 .../unix/sysv/linux/riscv/multiarch/Makefile  |  8 ++++
 .../sysv/linux/riscv/multiarch/cpu-relax.c    | 39 +++++++++++++++++++
 .../linux/riscv/multiarch/ifunc-impl-list.c   | 32 +++++++++++++--
 6 files changed, 137 insertions(+), 4 deletions(-)
 create mode 100644 sysdeps/riscv/multiarch/cpu-relax_generic.S
 create mode 100644 sysdeps/riscv/multiarch/cpu-relax_zawrs.S
 create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c

diff --git a/sysdeps/riscv/multiarch/cpu-relax_generic.S b/sysdeps/riscv/multiarch/cpu-relax_generic.S
new file mode 100644
index 0000000000..d3ccfdce84
--- /dev/null
+++ b/sysdeps/riscv/multiarch/cpu-relax_generic.S
@@ -0,0 +1,31 @@
+/* CPU strand yielding for busy loops.  RISC-V version.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+.option push
+.option arch, +zihintpause
+ENTRY (__cpu_relax_generic)
+	/* While we can use the `pause` instruction without
+	   the need of Zihintpause (because it is a HINT instruction),
+	   we still have to enable Zihintpause for the assembler.  */
+	pause
+	ret
+END (__cpu_relax_generic)
+.option pop
diff --git a/sysdeps/riscv/multiarch/cpu-relax_zawrs.S b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
new file mode 100644
index 0000000000..6d27b354df
--- /dev/null
+++ b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
@@ -0,0 +1,28 @@
+/* CPU strand yielding for busy loops.  RISC-V version with Zawrs.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sys/asm.h>
+
+.option push
+.option arch, +zawrs
+ENTRY (__cpu_relax_zawrs)
+	wrs.sto
+	ret
+END (__cpu_relax_zawrs)
+.option pop
diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
index c1c9d949a0..02b9b7a421 100644
--- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
+++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
@@ -178,4 +178,7 @@
 # error "ISAs that do not subsume the A extension are not supported"
 #endif /* !__riscv_atomic */
 
+extern void __cpu_relax (void);
+#define atomic_spin_nop() __cpu_relax()
+
 #endif /* bits/atomic.h */
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
index fcef5659d4..0cdf37a38b 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
@@ -1,3 +1,11 @@
+# nscd uses atomic_spin_nop which in turn requires cpu_relax
+ifeq ($(subdir),nscd)
+sysdep_routines += \
+  cpu-relax \
+  cpu-relax_generic \
+  cpu-relax_zawrs
+endif
+
 ifeq ($(subdir),string)
 sysdep_routines += \
   memcpy \
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
new file mode 100644
index 0000000000..5aeb120e21
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
@@ -0,0 +1,39 @@
+/* Multiple versions of cpu-relax.
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+
+# include <ifunc-init.h>
+# include <riscv-ifunc.h>
+# include <sys/hwprobe.h>
+
+void __cpu_relax (void);
+extern __typeof (__cpu_relax) __cpu_relax_generic attribute_hidden;
+extern __typeof (__cpu_relax) __cpu_relax_zawrs attribute_hidden;
+
+static inline __typeof (__cpu_relax) *
+select_cpu_relax_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
+{
+  unsigned long long int v;
+  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_IMA_EXT_0, &v) == 0
+      && (v & RISCV_HWPROBE_EXT_ZAWRS))
+    return __cpu_relax_zawrs;
+
+  return __cpu_relax_generic;
+}
+
+riscv_libc_ifunc (__cpu_relax, select_cpu_relax_ifunc);
diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
index 9f806d7a9e..9c7a8c2e1f 100644
--- a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
+++ b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
@@ -20,24 +20,48 @@
 #include <string.h>
 #include <sys/hwprobe.h>
 
+#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
+
+void __cpu_relax (void);
+
 size_t
 __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 			size_t max)
 {
   size_t i = max;
+  struct riscv_hwprobe pairs[] = {
+    { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 },
+    { .key = RISCV_HWPROBE_KEY_CPUPERF_0 },
+  };
 
   bool fast_unaligned = false;
+  bool has_zawrs = false;
+
+  if (__riscv_hwprobe (pairs, ARRAY_SIZE (pairs), 0, NULL, 0) == 0)
+    {
+      struct riscv_hwprobe *pair;
 
-  struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_CPUPERF_0 };
-  if (__riscv_hwprobe (&pair, 1, 0, NULL, 0) == 0
-      && (pair.value & RISCV_HWPROBE_MISALIGNED_MASK)
+      /* RISCV_HWPROBE_KEY_IMA_EXT_0  */
+      pair = &pairs[0];
+      if (pair->value & RISCV_HWPROBE_EXT_ZAWRS)
+	has_zawrs = true;
+
+      /* RISCV_HWPROBE_KEY_CPUPERF_0  */
+      pair = &pairs[1];
+      if ((pair->value & RISCV_HWPROBE_MISALIGNED_MASK)
           == RISCV_HWPROBE_MISALIGNED_FAST)
-    fast_unaligned = true;
+	fast_unaligned = true;
+    }
 
   IFUNC_IMPL (i, name, memcpy,
 	      IFUNC_IMPL_ADD (array, i, memcpy, fast_unaligned,
 			      __memcpy_noalignment)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
 
+  IFUNC_IMPL (i, name, __cpu_relax,
+	      IFUNC_IMPL_ADD (array, i, __cpu_relax, has_zawrs,
+			      __cpu_relax_zawrs)
+	      IFUNC_IMPL_ADD (array, i, __cpu_relax, 1, __cpu_relax_generic))
+
   return 0;
 }
-- 
2.44.0



* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18  9:46 ` [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs Christoph Müllner
@ 2024-04-18 17:17   ` Palmer Dabbelt
  2024-04-18 20:03     ` Vineet Gupta
  2024-04-18 20:19     ` Christoph Müllner
  0 siblings, 2 replies; 11+ messages in thread
From: Palmer Dabbelt @ 2024-04-18 17:17 UTC (permalink / raw)
  To: christoph.muellner
  Cc: libc-alpha, adhemerval.zanella, Darius Rad, Andrew Waterman,
	philipp.tomsich, Evan Green, kito.cheng, jeffreyalaw,
	Vineet Gupta, christoph.muellner

On Thu, 18 Apr 2024 02:46:35 PDT (-0700), christoph.muellner@vrull.eu wrote:
> The macro atomic_spin_nop can be used to implement arch-specific
> CPU yielding that is used in busy loops (e.g. in pthread_spin_lock).
> This patch introduces an ifunc-based implementation for RISC-V,
> that uses Zihintpause's PAUSE instruction for that matter (as PAUSE
> is a HINT instruction there is not dependency to Zihintpause at
> runtime).  Further, we test for Zawrs via hwprobe() and if found
> we use WRS.STO instead of PAUSE.
>
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> ---
>  sysdeps/riscv/multiarch/cpu-relax_generic.S   | 31 +++++++++++++++
>  sysdeps/riscv/multiarch/cpu-relax_zawrs.S     | 28 +++++++++++++
>  .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
>  .../unix/sysv/linux/riscv/multiarch/Makefile  |  8 ++++
>  .../sysv/linux/riscv/multiarch/cpu-relax.c    | 39 +++++++++++++++++++
>  .../linux/riscv/multiarch/ifunc-impl-list.c   | 32 +++++++++++++--
>  6 files changed, 137 insertions(+), 4 deletions(-)
>  create mode 100644 sysdeps/riscv/multiarch/cpu-relax_generic.S
>  create mode 100644 sysdeps/riscv/multiarch/cpu-relax_zawrs.S
>  create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
>
> diff --git a/sysdeps/riscv/multiarch/cpu-relax_generic.S b/sysdeps/riscv/multiarch/cpu-relax_generic.S
> new file mode 100644
> index 0000000000..d3ccfdce84
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/cpu-relax_generic.S
> @@ -0,0 +1,31 @@
> +/* CPU strand yielding for busy loops.  RISC-V version.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +.option push
> +.option arch, +zihintpause
> +ENTRY (__cpu_relax_generic)
> +	/* While we can use the `pause` instruction without
> +	   the need of Zihintpause (because it is a HINT instruction),
> +	   we still have to enable Zihintpause for the assembler.  */
> +	pause
> +	ret
> +END (__cpu_relax_generic)
> +.option pop
> diff --git a/sysdeps/riscv/multiarch/cpu-relax_zawrs.S b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
> new file mode 100644
> index 0000000000..6d27b354df
> --- /dev/null
> +++ b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
> @@ -0,0 +1,28 @@
> +/* CPU strand yielding for busy loops.  RISC-V version with Zawrs.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <sys/asm.h>
> +
> +.option push
> +.option arch, +zawrs
> +ENTRY (__cpu_relax_zawrs)
> +	wrs.sto

This has the same forward progress/eventual success violation as the 
code you sent for GCC and Linux does.  It doesn't really matter if the 
user of the reservation is in a builtin, an asm block, or a function.  
The compiler just doesn't know about those reservation rules and isn't 
going to generate code that follows them.

That said, as far as I can tell the same grey area in the WRS spec that 
I pointed out during one of the Linux reviews still exists.   So if 
there's some hardware that has a more generous set of LR->WRS rules than 
the ISA-required LR->SC rules that's fine, we just need to know what the 
hardware actually is so we can write down the rules and then stick to 
them -- it's just a performance thing for WRS, so having weaker rules 
seems entirely plausible.

I don't know of any hardware with WRS, but sorry if I missed one.  Do 
you have something publicly available that behaves this way?

> +	ret
> +END (__cpu_relax_zawrs)
> +.option pop
> diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> index c1c9d949a0..02b9b7a421 100644
> --- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> +++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> @@ -178,4 +178,7 @@
>  # error "ISAs that do not subsume the A extension are not supported"
>  #endif /* !__riscv_atomic */
>
> +extern void __cpu_relax (void);
> +#define atomic_spin_nop() __cpu_relax()
> +
>  #endif /* bits/atomic.h */
> diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> index fcef5659d4..0cdf37a38b 100644
> --- a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> @@ -1,3 +1,11 @@
> +# nscd uses atomic_spin_nop which in turn requires cpu_relax
> +ifeq ($(subdir),nscd)
> +sysdep_routines += \
> +  cpu-relax \
> +  cpu-relax_generic \
> +  cpu-relax_zawrs
> +endif
> +
>  ifeq ($(subdir),string)
>  sysdep_routines += \
>    memcpy \
> diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
> new file mode 100644
> index 0000000000..5aeb120e21
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
> @@ -0,0 +1,39 @@
> +/* Multiple versions of cpu-relax.
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +# include <ifunc-init.h>
> +# include <riscv-ifunc.h>
> +# include <sys/hwprobe.h>
> +
> +void __cpu_relax (void);
> +extern __typeof (__cpu_relax) __cpu_relax_generic attribute_hidden;
> +extern __typeof (__cpu_relax) __cpu_relax_zawrs attribute_hidden;
> +
> +static inline __typeof (__cpu_relax) *
> +select_cpu_relax_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
> +{
> +  unsigned long long int v;
> +  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_IMA_EXT_0, &v) == 0
> +      && (v & RISCV_HWPROBE_EXT_ZAWRS))
> +    return __cpu_relax_zawrs;
> +
> +  return __cpu_relax_generic;
> +}
> +
> +riscv_libc_ifunc (__cpu_relax, select_cpu_relax_ifunc);
> diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> index 9f806d7a9e..9c7a8c2e1f 100644
> --- a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> @@ -20,24 +20,48 @@
>  #include <string.h>
>  #include <sys/hwprobe.h>
>
> +#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
> +
> +void __cpu_relax (void);
> +
>  size_t
>  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  			size_t max)
>  {
>    size_t i = max;
> +  struct riscv_hwprobe pairs[] = {
> +    { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 },
> +    { .key = RISCV_HWPROBE_KEY_CPUPERF_0 },
> +  };
>
>    bool fast_unaligned = false;
> +  bool has_zawrs = false;
> +
> +  if (__riscv_hwprobe (pairs, ARRAY_SIZE (pairs), 0, NULL, 0) == 0)
> +    {
> +      struct riscv_hwprobe *pair;
>
> -  struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_CPUPERF_0 };
> -  if (__riscv_hwprobe (&pair, 1, 0, NULL, 0) == 0
> -      && (pair.value & RISCV_HWPROBE_MISALIGNED_MASK)
> +      /* RISCV_HWPROBE_KEY_IMA_EXT_0  */
> +      pair = &pairs[0];
> +      if (pair->value & RISCV_HWPROBE_EXT_ZAWRS)
> +	has_zawrs = true;
> +
> +      /* RISCV_HWPROBE_KEY_CPUPERF_0  */
> +      pair = &pairs[1];
> +      if ((pair->value & RISCV_HWPROBE_MISALIGNED_MASK)
>            == RISCV_HWPROBE_MISALIGNED_FAST)
> -    fast_unaligned = true;
> +	fast_unaligned = true;
> +    }
>
>    IFUNC_IMPL (i, name, memcpy,
>  	      IFUNC_IMPL_ADD (array, i, memcpy, fast_unaligned,
>  			      __memcpy_noalignment)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
>
> +  IFUNC_IMPL (i, name, __cpu_relax,
> +	      IFUNC_IMPL_ADD (array, i, __cpu_relax, has_zawrs,
> +			      __cpu_relax_zawrs)
> +	      IFUNC_IMPL_ADD (array, i, __cpu_relax, 1, __cpu_relax_generic))
> +
>    return 0;
>  }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 17:17   ` Palmer Dabbelt
@ 2024-04-18 20:03     ` Vineet Gupta
  2024-04-18 20:25       ` Christoph Müllner
  2024-04-18 20:19     ` Christoph Müllner
  1 sibling, 1 reply; 11+ messages in thread
From: Vineet Gupta @ 2024-04-18 20:03 UTC (permalink / raw)
  To: Palmer Dabbelt, christoph.muellner
  Cc: libc-alpha, adhemerval.zanella, Darius Rad, Andrew Waterman,
	philipp.tomsich, Evan Green, kito.cheng, jeffreyalaw,
	gnu-toolchain

Hi Christoph,

My 2 cents.

On 4/18/24 10:17, Palmer Dabbelt wrote:
> On Thu, 18 Apr 2024 02:46:35 PDT (-0700), christoph.muellner@vrull.eu wrote:
>> The macro atomic_spin_nop can be used to implement arch-specific
>> CPU yielding that is used in busy loops (e.g. in pthread_spin_lock).
>> This patch introduces an ifunc-based implementation for RISC-V,
>> that uses Zihintpause's PAUSE instruction for that matter (as PAUSE
>> is a HINT instruction there is not dependency to Zihintpause at
>> runtime).  Further, we test for Zawrs via hwprobe() and if found
>> we use WRS.STO instead of PAUSE.

[snip]

>> +ENTRY (__cpu_relax_generic)
>> +	/* While we can use the `pause` instruction without
>> +	   the need of Zihintpause (because it is a HINT instruction),
>> +	   we still have to enable Zihintpause for the assembler.  */
>> +	pause
>> +	ret
>> +END (__cpu_relax_generic)

[snip]

>> +.option push
>> +.option arch, +zawrs
>> +ENTRY (__cpu_relax_zawrs)
>> +	wrs.sto
> This has the same forward progress/eventual success violation as the 
> code you sent for GCC and Linux does.  It doesn't really matter if the 
> user of the reservation is in a builtin, an asm block, or a function.  
> The compiler just doesn't know about those reservation rules and isn't 
> going to generate code that follows them.

So, this routine maps to atomic_spin_nop () called in generic code in a
bunch of places.

The easiest case is nptl/pthread_spin_lock.c which looks something like this

__pthread_spin_lock (pthread_spinlock_t *lock)
...
   do
    {
      atomic_spin_nop ();
      val = atomic_load_relaxed (lock);
    }
  while (val != 0);

This works fine for a PAUSE-based implementation, which doesn't need a
prior reservation, but WRS does; even if both the load and the spin are
inlined, the reservation happens after WRS, which is wrong.

So I think before we can implement this optimization for riscv, we need
a generic glibc change to replace atomic_spin_nop () with a variant
that also takes the lock/memory under consideration.
The fallback implementation of atomic_load_and_spin_if_cond() will be
what the above loop does. For arches that do implement the API, based on
ISA semantics, they can choose to ignore the mem arg completely (RISC-V
PAUSE based implementation) or implement a little loop which explicitly
takes reservation on the mem/lock and calls WRS.

Does that make sense ?

We just need to sort out the second use of atomic_spin_nop in
nptl/pthread_mutex_lock.c, where this conversion might not be
straightforward.  The recent backoff spin looping is getting in the way
of the obvious solution; however, for discussion's sake, if we ignore
it, the code looks like this:

PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
...
       int cnt=0, max_cnt = something;
       do
        {
            if (cnt ++>= max_cnt)
            {
                 LLL_MUTEX_LOCK (mutex);
                 break;
            }
            atomic_spin_nop ();
       }
       while (LLL_MUTEX_READ_LOCK (mutex) != 0 || LLL_MUTEX_TRYLOCK
(mutex) != 0);

And this seems like it can be converted to an
atomic_load_and_spin_if_cond (mutex, != 0) kind of API.

Let me know if you want me to spin this - pun intended ;-)

-Vineet


> That said, as far as I can tell the same grey area in the WRS spec that 
> I pointed out during one of the Linux reviews still exists.   So if 
> there's some hardware that has a more generous set of LR->WRS rules than 
> the ISA-required LR->SC rules that's fine, we just need to know what the 
> hardware actually is so we can write down the rules and then stick to 
> them -- it's just a performance thing for WRS, so having weaker rules 
> seems entirely plausible.
>
> I don't know of any hardware with WRS, but sorry if I missed one.  Do 
> you have something publicly availiable that behaves this way?


* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 17:17   ` Palmer Dabbelt
  2024-04-18 20:03     ` Vineet Gupta
@ 2024-04-18 20:19     ` Christoph Müllner
  2024-04-18 20:36       ` Vineet Gupta
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18 20:19 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: libc-alpha, adhemerval.zanella, Darius Rad, Andrew Waterman,
	philipp.tomsich, Evan Green, kito.cheng, jeffreyalaw,
	Vineet Gupta

On Thu, Apr 18, 2024 at 7:17 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Thu, 18 Apr 2024 02:46:35 PDT (-0700), christoph.muellner@vrull.eu wrote:
> > The macro atomic_spin_nop can be used to implement arch-specific
> > CPU yielding that is used in busy loops (e.g. in pthread_spin_lock).
> > This patch introduces an ifunc-based implementation for RISC-V,
> > that uses Zihintpause's PAUSE instruction for that matter (as PAUSE
> > is a HINT instruction there is no dependency on Zihintpause at
> > runtime).  Further, we test for Zawrs via hwprobe() and if found
> > we use WRS.STO instead of PAUSE.
> >
> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> > ---
> >  sysdeps/riscv/multiarch/cpu-relax_generic.S   | 31 +++++++++++++++
> >  sysdeps/riscv/multiarch/cpu-relax_zawrs.S     | 28 +++++++++++++
> >  .../unix/sysv/linux/riscv/atomic-machine.h    |  3 ++
> >  .../unix/sysv/linux/riscv/multiarch/Makefile  |  8 ++++
> >  .../sysv/linux/riscv/multiarch/cpu-relax.c    | 39 +++++++++++++++++++
> >  .../linux/riscv/multiarch/ifunc-impl-list.c   | 32 +++++++++++++--
> >  6 files changed, 137 insertions(+), 4 deletions(-)
> >  create mode 100644 sysdeps/riscv/multiarch/cpu-relax_generic.S
> >  create mode 100644 sysdeps/riscv/multiarch/cpu-relax_zawrs.S
> >  create mode 100644 sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
> >
> > diff --git a/sysdeps/riscv/multiarch/cpu-relax_generic.S b/sysdeps/riscv/multiarch/cpu-relax_generic.S
> > new file mode 100644
> > index 0000000000..d3ccfdce84
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/cpu-relax_generic.S
> > @@ -0,0 +1,31 @@
> > +/* CPU strand yielding for busy loops.  RISC-V version.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +.option push
> > +.option arch, +zihintpause
> > +ENTRY (__cpu_relax_generic)
> > +     /* While we can use the `pause` instruction without
> > +        the need of Zihintpause (because it is a HINT instruction),
> > +        we still have to enable Zihintpause for the assembler.  */
> > +     pause
> > +     ret
> > +END (__cpu_relax_generic)
> > +.option pop
> > diff --git a/sysdeps/riscv/multiarch/cpu-relax_zawrs.S b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
> > new file mode 100644
> > index 0000000000..6d27b354df
> > --- /dev/null
> > +++ b/sysdeps/riscv/multiarch/cpu-relax_zawrs.S
> > @@ -0,0 +1,28 @@
> > +/* CPU strand yielding for busy loops.  RISC-V version with Zawrs.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library.  If not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <sys/asm.h>
> > +
> > +.option push
> > +.option arch, +zawrs
> > +ENTRY (__cpu_relax_zawrs)
> > +     wrs.sto
>
> This has the same forward progress/eventual success violation as the
> code you sent for GCC and Linux does.  It doesn't really matter if the
> user of the reservation is in a builtin, an asm block, or a function.
> The compiler just doesn't know about those reservation rules and isn't
> going to generate code that follows them.

I see.  The main issue is that we don't have a valid reservation when
calling WRS, so the whole use of the Zawrs instructions is pointless.
So the only way to move Zawrs forward would be to adjust the locking
routines (introducing new primitives that have to be implemented for
all architectures).


>
> That said, as far as I can tell the same grey area in the WRS spec that
> I pointed out during one of the Linux reviews still exists.   So if
> there's some hardware that has a more generous set of LR->WRS rules than
> the ISA-required LR->SC rules that's fine, we just need to know what the
> hardware actually is so we can write down the rules and then stick to
> them -- it's just a performance thing for WRS, so having weaker rules
> seems entirely plausible.
>
> I don't know of any hardware with WRS, but sorry if I missed one.  Do
> > you have something publicly available that behaves this way?
>
> > +     ret
> > +END (__cpu_relax_zawrs)
> > +.option pop
> > diff --git a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > index c1c9d949a0..02b9b7a421 100644
> > --- a/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > +++ b/sysdeps/unix/sysv/linux/riscv/atomic-machine.h
> > @@ -178,4 +178,7 @@
> >  # error "ISAs that do not subsume the A extension are not supported"
> >  #endif /* !__riscv_atomic */
> >
> > +extern void __cpu_relax (void);
> > +#define atomic_spin_nop() __cpu_relax()
> > +
> >  #endif /* bits/atomic.h */
> > diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> > index fcef5659d4..0cdf37a38b 100644
> > --- a/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> > +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/Makefile
> > @@ -1,3 +1,11 @@
> > +# nscd uses atomic_spin_nop which in turn requires cpu_relax
> > +ifeq ($(subdir),nscd)
> > +sysdep_routines += \
> > +  cpu-relax \
> > +  cpu-relax_generic \
> > +  cpu-relax_zawrs
> > +endif
> > +
> >  ifeq ($(subdir),string)
> >  sysdep_routines += \
> >    memcpy \
> > diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
> > new file mode 100644
> > index 0000000000..5aeb120e21
> > --- /dev/null
> > +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/cpu-relax.c
> > @@ -0,0 +1,39 @@
> > +/* Multiple versions of cpu-relax.
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +
> > +# include <ifunc-init.h>
> > +# include <riscv-ifunc.h>
> > +# include <sys/hwprobe.h>
> > +
> > +void __cpu_relax (void);
> > +extern __typeof (__cpu_relax) __cpu_relax_generic attribute_hidden;
> > +extern __typeof (__cpu_relax) __cpu_relax_zawrs attribute_hidden;
> > +
> > +static inline __typeof (__cpu_relax) *
> > +select_cpu_relax_ifunc (uint64_t dl_hwcap, __riscv_hwprobe_t hwprobe_func)
> > +{
> > +  unsigned long long int v;
> > +  if (__riscv_hwprobe_one (hwprobe_func, RISCV_HWPROBE_KEY_IMA_EXT_0, &v) == 0
> > +      && (v & RISCV_HWPROBE_EXT_ZAWRS))
> > +    return __cpu_relax_zawrs;
> > +
> > +  return __cpu_relax_generic;
> > +}
> > +
> > +riscv_libc_ifunc (__cpu_relax, select_cpu_relax_ifunc);
> > diff --git a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> > index 9f806d7a9e..9c7a8c2e1f 100644
> > --- a/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/unix/sysv/linux/riscv/multiarch/ifunc-impl-list.c
> > @@ -20,24 +20,48 @@
> >  #include <string.h>
> >  #include <sys/hwprobe.h>
> >
> > +#define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
> > +
> > +void __cpu_relax (void);
> > +
> >  size_t
> >  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >                       size_t max)
> >  {
> >    size_t i = max;
> > +  struct riscv_hwprobe pairs[] = {
> > +    { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 },
> > +    { .key = RISCV_HWPROBE_KEY_CPUPERF_0 },
> > +  };
> >
> >    bool fast_unaligned = false;
> > +  bool has_zawrs = false;
> > +
> > +  if (__riscv_hwprobe (pairs, ARRAY_SIZE (pairs), 0, NULL, 0) == 0)
> > +    {
> > +      struct riscv_hwprobe *pair;
> >
> > -  struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_CPUPERF_0 };
> > -  if (__riscv_hwprobe (&pair, 1, 0, NULL, 0) == 0
> > -      && (pair.value & RISCV_HWPROBE_MISALIGNED_MASK)
> > +      /* RISCV_HWPROBE_KEY_IMA_EXT_0  */
> > +      pair = &pairs[0];
> > +      if (pair->value & RISCV_HWPROBE_EXT_ZAWRS)
> > +     has_zawrs = true;
> > +
> > +      /* RISCV_HWPROBE_KEY_CPUPERF_0  */
> > +      pair = &pairs[1];
> > +      if ((pair->value & RISCV_HWPROBE_MISALIGNED_MASK)
> >            == RISCV_HWPROBE_MISALIGNED_FAST)
> > -    fast_unaligned = true;
> > +     fast_unaligned = true;
> > +    }
> >
> >    IFUNC_IMPL (i, name, memcpy,
> >             IFUNC_IMPL_ADD (array, i, memcpy, fast_unaligned,
> >                             __memcpy_noalignment)
> >             IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
> >
> > +  IFUNC_IMPL (i, name, __cpu_relax,
> > +           IFUNC_IMPL_ADD (array, i, __cpu_relax, has_zawrs,
> > +                           __cpu_relax_zawrs)
> > +           IFUNC_IMPL_ADD (array, i, __cpu_relax, 1, __cpu_relax_generic))
> > +
> >    return 0;
> >  }

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 20:03     ` Vineet Gupta
@ 2024-04-18 20:25       ` Christoph Müllner
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Müllner @ 2024-04-18 20:25 UTC (permalink / raw)
  To: Vineet Gupta
  Cc: Palmer Dabbelt, libc-alpha, adhemerval.zanella, Darius Rad,
	Andrew Waterman, philipp.tomsich, Evan Green, kito.cheng,
	jeffreyalaw, gnu-toolchain

On Thu, Apr 18, 2024 at 10:03 PM Vineet Gupta <vineetg@rivosinc.com> wrote:
>
> Hi Christoph,
>
> My 2 cents.
>
> On 4/18/24 10:17, Palmer Dabbelt wrote:
> > On Thu, 18 Apr 2024 02:46:35 PDT (-0700), christoph.muellner@vrull.eu wrote:
> >> The macro atomic_spin_nop can be used to implement arch-specific
> >> CPU yielding that is used in busy loops (e.g. in pthread_spin_lock).
> >> This patch introduces an ifunc-based implementation for RISC-V,
> >> that uses Zihintpause's PAUSE instruction for that matter (as PAUSE
> >> is a HINT instruction there is no dependency on Zihintpause at
> >> runtime).  Further, we test for Zawrs via hwprobe() and if found
> >> we use WRS.STO instead of PAUSE.
>
> [snip]
>
> >> +ENTRY (__cpu_relax_generic)
> >> +    /* While we can use the `pause` instruction without
> >> +       the need of Zihintpause (because it is a HINT instruction),
> >> +       we still have to enable Zihintpause for the assembler.  */
> >> +    pause
> >> +    ret
> >> +END (__cpu_relax_generic)
>
> [snip]
>
> >> +.option push
> >> +.option arch, +zawrs
> >> +ENTRY (__cpu_relax_zawrs)
> >> +    wrs.sto
> > This has the same forward progress/eventual success violation as the
> > code you sent for GCC and Linux does.  It doesn't really matter if the
> > user of the reservation is in a builtin, an asm block, or a function.
> > The compiler just doesn't know about those reservation rules and isn't
> > going to generate code that follows them.
>
> So, this routine maps to atomic_spin_nop () called in generic code in a
> bunch of places.
>
> The easiest case is nptl/pthread_spin_lock.c which looks something like this
>
> __pthread_spin_lock (pthread_spinlock_t *lock)
> ...
>    do
>     {
>       atomic_spin_nop ();
>       val = atomic_load_relaxed (lock);
>     }
>   while (val != 0);
>
> This works fine for a PAUSE based implementation which doesn't need a
> prior reservation, but WRS does and even if both the load and spin are
> inlined, reservation happens after WRS which is wrong.
>
> So I think before we can implement this optimization for riscv, we need
> a generic glibc change to replace of atomic_spin_nop () with a variant
> which also takes the lock/memory under consideration.
> The fallback implementation of atomic_load_and_spin_if_cond() will be
> what the above loop does. For arches that do implement the API, based on
> ISA semantics, they can choose to ignore the mem arg completely (RISC-V
> PAUSE based implementation) or implement a little loop which explicitly
> takes reservation on the mem/lock and calls WRS.
>
> Does that make sense ?

I did something similar in the Linux patch a while ago.
Here I wanted to avoid changes in arch-independent code.
But you are right, wrs.sto does not make sense without a
valid reservation.

>
> We just need to sort out the second use of atomic_spin_nop in
> nptl/pthread_mutex_lock.c, where this conversion might not be
> straightforward.  The recently added back-off spin looping gets in the
> way of the obvious solution; however, for discussion's sake, if we
> ignore it the code looks like this:
>
> PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> ...
>        int cnt=0, max_cnt = something;
>        do
>         {
>             if (cnt++ >= max_cnt)
>             {
>                  LLL_MUTEX_LOCK (mutex);
>                  break;
>             }
>             atomic_spin_nop ();
>        }
>        while (LLL_MUTEX_READ_LOCK (mutex) != 0 || LLL_MUTEX_TRYLOCK
> (mutex) != 0);
>
> And this seems like it can be converted to a
> atomic_load_and_spin_if_cond(mutex, !=0 ) kind of API.
>
> Let me know if you want me to spin this - pun intended ;-)
>
> -Vineet
>
>
> > That said, as far as I can tell the same grey area in the WRS spec that
> > I pointed out during one of the Linux reviews still exists.   So if
> > there's some hardware that has a more generous set of LR->WRS rules than
> > the ISA-required LR->SC rules that's fine, we just need to know what the
> > hardware actually is so we can write down the rules and then stick to
> > them -- it's just a performance thing for WRS, so having weaker rules
> > seems entirely plausible.
> >
> > I don't know of any hardware with WRS, but sorry if I missed one.  Do
> > you have something publicly available that behaves this way?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 20:19     ` Christoph Müllner
@ 2024-04-18 20:36       ` Vineet Gupta
  2024-04-18 21:10         ` Palmer Dabbelt
  0 siblings, 1 reply; 11+ messages in thread
From: Vineet Gupta @ 2024-04-18 20:36 UTC (permalink / raw)
  To: Christoph Müllner, Palmer Dabbelt
  Cc: libc-alpha, adhemerval.zanella, Darius Rad, Andrew Waterman,
	philipp.tomsich, Evan Green, kito.cheng, jeffreyalaw

On 4/18/24 13:19, Christoph Müllner wrote:
>> This has the same forward progress/eventual success violation as the
>> code you sent for GCC and Linux does.  It doesn't really matter if the
>> user of the reservation is in a builtin, an asm block, or a function.
>> The compiler just doesn't know about those reservation rules and isn't
>> going to generate code that follows them.
> I see. The main issue is that we don't have a valid reservation when
> calling WRS,
> so the whole use of Zawrs instructions is pointless.
> So the only way to move Zawrs forward would be to adjust the locking routines
> (introducing new primitives that have to be implemented for all architectures).

Not explicitly, anyway: the generic fallback will take care of every
arch, except SPARC/x86, which implement atomic_spin_nop with pause-like
semantics, but even they don't need to change at all if we implement the
new API atomic_load_and_spin_if_cond_whatever () in terms of the
existing atomic_spin_nop ().

-Vineet

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 20:36       ` Vineet Gupta
@ 2024-04-18 21:10         ` Palmer Dabbelt
  2024-04-19 14:09           ` Andrew Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Palmer Dabbelt @ 2024-04-18 21:10 UTC (permalink / raw)
  To: Vineet Gupta, ajones, Charlie Jenkins
  Cc: christoph.muellner, libc-alpha, adhemerval.zanella, Darius Rad,
	Andrew Waterman, philipp.tomsich, Evan Green, kito.cheng,
	jeffreyalaw

On Thu, 18 Apr 2024 13:36:32 PDT (-0700), Vineet Gupta wrote:
> On 4/18/24 13:19, Christoph Müllner wrote:
>>> This has the same forward progress/eventual success violation as the
>>> code you sent for GCC and Linux does.  It doesn't really matter if the
>>> user of the reservation is in a builtin, an asm block, or a function.
>>> The compiler just doesn't know about those reservation rules and isn't
>>> going to generate code that follows them.
>> I see. The main issue is that we don't have a valid reservation when
>> calling WRS,
>> so the whole use of Zawrs instructions is pointless.
>> So the only way to move Zawrs forward would be to adjust the locking routines
>> (introducing new primitives that have to be implemented for all architectures).
>
> Not explicitly, anyway: the generic fallback will take care of every
> arch, except SPARC/x86, which implement atomic_spin_nop with pause-like
> semantics, but even they don't need to change at all if we implement the
> new API atomic_load_and_spin_if_cond_whatever () in terms of the
> existing atomic_spin_nop ().

Ya, sounds about right.

IIRC I just hooked some of the LLL macros when doing the POC/estimates, 
but it was at least a year ago so I forget exactly how it all fit 
together.  Whatever it is, they end up with basically the same 
load->cond->lr->beq->{wrs,sc} patterns as a bunch of the Linux routines 
should have (sort of a test-and-test-and-set type pattern, or how arm64 
does the load-and-cmpxchg routines).

Adding Charlie and Drew, as we were talking about the Linux side of 
things recently.

> -Vineet

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs
  2024-04-18 21:10         ` Palmer Dabbelt
@ 2024-04-19 14:09           ` Andrew Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Jones @ 2024-04-19 14:09 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Vineet Gupta, Charlie Jenkins, christoph.muellner, libc-alpha,
	adhemerval.zanella, Darius Rad, Andrew Waterman, philipp.tomsich,
	Evan Green, kito.cheng, jeffreyalaw

On Thu, Apr 18, 2024 at 02:10:42PM -0700, Palmer Dabbelt wrote:
> On Thu, 18 Apr 2024 13:36:32 PDT (-0700), Vineet Gupta wrote:
> > On 4/18/24 13:19, Christoph Müllner wrote:
> > > > This has the same forward progress/eventual success violation as the
> > > > code you sent for GCC and Linux does.  It doesn't really matter if the
> > > > user of the reservation is in a builtin, an asm block, or a function.
> > > > The compiler just doesn't know about those reservation rules and isn't
> > > > going to generate code that follows them.
> > > I see. The main issue is that we don't have a valid reservation when
> > > calling WRS,
> > > so the whole use of Zawrs instructions is pointless.
> > > So the only way to move Zawrs forward would be to adjust the locking routines
> > > (introducing new primitives that have to be implemented for all architectures).
> > 
> > Not explicitly, anyway: the generic fallback will take care of every
> > arch, except SPARC/x86, which implement atomic_spin_nop with pause-like
> > semantics, but even they don't need to change at all if we implement the
> > new API atomic_load_and_spin_if_cond_whatever () in terms of the
> > existing atomic_spin_nop ().
> 
> Ya, sounds about right.
> 
> IIRC I just hooked some of the LLL macros when doing the POC/estimates, but
> it was at least a year ago so I forget exactly how it all fit together.
> Whatever it is, they end up with basically the same
> load->cond->lr->beq->{wrs,sc} patterns as a bunch of the Linux routines
> should have (sort of a test-and-test-and-set type pattern, or how arm64 does
> the load-and-cmpxchg routines).
> 
> Adding Charlie and Drew, as we were talking about the Linux side of things
> recently.

I've just now posted another version [1].

[1] https://lore.kernel.org/all/20240419135321.70781-8-ajones@ventanamicro.com/

Thanks,
drew

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-04-19 14:09 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-18  9:46 [RFC PATCH 0/3] RISC-V: Use WRS.STO for atomic_spin_nop Christoph Müllner
2024-04-18  9:46 ` [RFC PATCH 1/3] RISC-V: Sync hwprobe: Sync extension bits with Linux 6.8 Christoph Müllner
2024-04-18  9:46 ` [RFC PATCH 2/3] RISC-V: hwprobe: Add Zawrs test bit Christoph Müllner
2024-04-18  9:46 ` [RFC PATCH 3/3] RISC-V: Implement CPU yielding for busy loops with Zihintpause/Zawrs Christoph Müllner
2024-04-18 17:17   ` Palmer Dabbelt
2024-04-18 20:03     ` Vineet Gupta
2024-04-18 20:25       ` Christoph Müllner
2024-04-18 20:19     ` Christoph Müllner
2024-04-18 20:36       ` Vineet Gupta
2024-04-18 21:10         ` Palmer Dabbelt
2024-04-19 14:09           ` Andrew Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).