public inbox for gcc-patches@gcc.gnu.org
* [PATCH v3 0/3] Libatomic: Add LSE128 atomics support for AArch64
@ 2024-01-03  1:28 Victor Do Nascimento
  2024-01-03  1:28 ` [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface Victor Do Nascimento
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-03  1:28 UTC (permalink / raw)
  To: gcc-patches
  Cc: kyrylo.tkachov, richard.sandiford, Richard.Earnshaw,
	Victor Do Nascimento

v3 updates:

   1. In the absence of a `HWCAP_LSE128' feature bit in current Linux
   kernel releases, the feature check continues to rely on a
   userspace-issued `mrs' instruction.  Since the ABI for exporting
   the AArch64 CPU ID/feature registers to userspace relies on
   FEAT_IDST [1], we make the ID_AA64ISAR0_EL1-mediated feature check
   contingent on the HWCAP_CPUID bit being set, ensuring FEAT_IDST
   support and avoiding potential runtime errors.

   2. Since LSE2 is mandatory from Armv8.4 onward, LSE128, introduced
   in Armv9.4, necessarily implies LSE2, so a separate check for LSE2
   is not necessary.

   3. Given that the infrastructure for exposing `mrs' to userspace
   hooks into the exception handler, the feature-register read is
   relatively expensive and ought to be avoided where possible.
   Consequently, where prerequisites can be ascertained via HWCAPs, we
   query those first, returning early whenever we know unequivocally
   that a given feature cannot be implemented due to unmet
   dependencies.  Such checks are added to both `has_lse2' and
   `has_lse128'.
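
   A minimal sketch of the resulting check order in `has_lse128'
   (mirroring the implementation in patch 3/3):

     static inline bool
     has_lse128 (unsigned long hwcap)
     {
       /* Cheap HWCAP-based early exits come first...  */
       if (!(hwcap & HWCAP_CPUID) || !(hwcap & HWCAP_USCAT))
         return false;
       /* ...and only then the (trapping) system-register read.  */
       unsigned long isar0;
       asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
       return AT_FEAT_FIELD (isar0) >= 3;
     }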

Regression-tested on aarch64-none-linux-gnu on Cortex-A72 and
LSE128-enabled Armv-A Base RevC AEM FVP.

[1] https://www.kernel.org/doc/html/v6.6/arch/arm64/cpu-feature-registers.html

---

Building upon Wilco Dijkstra's work on AArch64 128-bit atomics for
Libatomic, namely the patches from [1] and [2], this patch series
extends the library's capabilities to dynamically select and emit
Armv9.4-a LSE128 implementations of atomic operations via ifuncs at
run-time whenever architectural support is present.

Regression tested on the aarch64-linux-gnu target with LSE128 support.

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/620529.html
[2] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626358.html

Victor Do Nascimento (3):
  libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface
  libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  aarch64: Add explicit checks for implicit LSE/LSE2 requirements.

 libatomic/Makefile.am                        |   3 +
 libatomic/Makefile.in                        |   1 +
 libatomic/acinclude.m4                       |  19 ++
 libatomic/auto-config.h.in                   |   3 +
 libatomic/config/linux/aarch64/atomic_16.S   | 251 ++++++++++++++++---
 libatomic/config/linux/aarch64/host-config.h |  36 ++-
 libatomic/configure                          |  59 ++++-
 libatomic/configure.ac                       |   1 +
 8 files changed, 331 insertions(+), 42 deletions(-)

-- 
2.42.0



* [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface
  2024-01-03  1:28 [PATCH v3 0/3] Libatomic: Add LSE128 atomics support for AArch64 Victor Do Nascimento
@ 2024-01-03  1:28 ` Victor Do Nascimento
  2024-01-05 11:10   ` Richard Sandiford
  2024-01-03  1:28 ` [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a Victor Do Nascimento
  2024-01-03  1:28 ` [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements Victor Do Nascimento
  2 siblings, 1 reply; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-03  1:28 UTC (permalink / raw)
  To: gcc-patches
  Cc: kyrylo.tkachov, richard.sandiford, Richard.Earnshaw,
	Victor Do Nascimento

The introduction of further architectural-feature-dependent ifuncs
for AArch64 makes hard-coded ifunc `_i<n>' suffixes on functions
cumbersome to work with.  It is awkward to remember which ifunc maps
onto which arch feature, and the scheme makes the code harder to
maintain when new ifuncs are added and their suffixes possibly
altered.

This patch uses pre-processor `#define' statements to map each suffix to
a descriptive feature name macro, for example:

  #define LSE2 _i1

and reconstructs function names with the pre-processor's token
concatenation feature, such that for `MACRO(<name>_i<n>)', we would
now have `MACRO_FEAT(name, feature)' and in the macro definition body
we replace `name` with `name##feature`.
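
The indirection through a second macro (e.g. `ENTRY_FEAT1') is
required because arguments of `##' are not macro-expanded before
pasting.  A minimal illustration (the PASTE* macro names are
illustrative only):

  #define LSE2 _i1
  #define PASTE_DIRECT(name, feat) name##feat
  #define PASTE(name, feat) PASTE_DIRECT (name, feat)

  PASTE_DIRECT (libat_store_16, LSE2)	/* -> libat_store_16LSE2  */
  PASTE (libat_store_16, LSE2)		/* -> libat_store_16_i1   */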

Consequently, for base functionality, where the ifunc suffix is
absent, the macro interface remains the same.  For example, the entry
and endpoints of `libat_store_16' remain defined by:

  - ENTRY (libat_store_16)
and
  - END (libat_store_16)

For the LSE2 implementation of the same 16-byte atomic store, we now
have:

  - ENTRY_FEAT (libat_store_16, LSE2)
and
  - END_FEAT (libat_store_16, LSE2)

For the aliasing of ifunc names, we define the following new
implementation of the ALIAS macro:

  - ALIAS (FN_BASE_NAME, FROM_SUFFIX, TO_SUFFIX)

Defining the base feature name macro `CORE' to map to the empty
string, we can then alias the LSE2 `libat_exchange_16' to its base
implementation with:

  - ALIAS (libat_exchange_16, LSE2, CORE)

libatomic/ChangeLog:
	* config/linux/aarch64/atomic_16.S (CORE): New macro.
	(LSE2): Likewise.
	(ENTRY_FEAT): Likewise.
	(END_FEAT): Likewise.
	(ENTRY_FEAT1): Likewise.
	(END_FEAT1): Likewise.
	(ALIAS): Modify macro to take in `arch' arguments.
---
 libatomic/config/linux/aarch64/atomic_16.S | 83 +++++++++++++---------
 1 file changed, 49 insertions(+), 34 deletions(-)

diff --git a/libatomic/config/linux/aarch64/atomic_16.S b/libatomic/config/linux/aarch64/atomic_16.S
index a099037179b..eb8e749b8a2 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -40,22 +40,38 @@
 
 	.arch	armv8-a+lse
 
-#define ENTRY(name)		\
-	.global name;		\
-	.hidden name;		\
-	.type name,%function;	\
-	.p2align 4;		\
-name:				\
-	.cfi_startproc;		\
+#define ENTRY(name) ENTRY_FEAT (name, CORE)
+
+#define ENTRY_FEAT(name, feat)		\
+	ENTRY_FEAT1(name, feat)
+
+#define ENTRY_FEAT1(name, feat)	\
+	.global name##feat;		\
+	.hidden name##feat;		\
+	.type name##feat,%function;	\
+	.p2align 4;			\
+name##feat:				\
+	.cfi_startproc;			\
 	hint	34	// bti c
 
-#define END(name)		\
-	.cfi_endproc;		\
-	.size name, .-name;
+#define END(name) END_FEAT (name, CORE)
 
-#define ALIAS(alias,name)	\
-	.global alias;		\
-	.set alias, name;
+#define END_FEAT(name, feat)			\
+	END_FEAT1(name, feat)
+
+#define END_FEAT1(name, feat)		\
+	.cfi_endproc;			\
+	.size name##feat, .-name##feat;
+
+#define ALIAS(alias, from, to)		\
+	ALIAS1(alias,from,to)
+
+#define ALIAS1(alias, from, to)		\
+	.global alias##from;		\
+	.set alias##from, alias##to;
+
+#define CORE
+#define LSE2	_i1
 
 #define res0 x0
 #define res1 x1
@@ -108,7 +124,7 @@ ENTRY (libat_load_16)
 END (libat_load_16)
 
 
-ENTRY (libat_load_16_i1)
+ENTRY_FEAT (libat_load_16, LSE2)
 	cbnz	w1, 1f
 
 	/* RELAXED.  */
@@ -128,7 +144,7 @@ ENTRY (libat_load_16_i1)
 	ldp	res0, res1, [x0]
 	dmb	ishld
 	ret
-END (libat_load_16_i1)
+END_FEAT (libat_load_16, LSE2)
 
 
 ENTRY (libat_store_16)
@@ -148,7 +164,7 @@ ENTRY (libat_store_16)
 END (libat_store_16)
 
 
-ENTRY (libat_store_16_i1)
+ENTRY_FEAT (libat_store_16, LSE2)
 	cbnz	w4, 1f
 
 	/* RELAXED.  */
@@ -160,7 +176,7 @@ ENTRY (libat_store_16_i1)
 	stlxp	w4, in0, in1, [x0]
 	cbnz	w4, 1b
 	ret
-END (libat_store_16_i1)
+END_FEAT (libat_store_16, LSE2)
 
 
 ENTRY (libat_exchange_16)
@@ -237,7 +253,7 @@ ENTRY (libat_compare_exchange_16)
 END (libat_compare_exchange_16)
 
 
-ENTRY (libat_compare_exchange_16_i1)
+ENTRY_FEAT (libat_compare_exchange_16, LSE2)
 	ldp	exp0, exp1, [x1]
 	mov	tmp0, exp0
 	mov	tmp1, exp1
@@ -270,7 +286,7 @@ ENTRY (libat_compare_exchange_16_i1)
 	/* ACQ_REL/SEQ_CST.  */
 4:	caspal	exp0, exp1, in0, in1, [x0]
 	b	0b
-END (libat_compare_exchange_16_i1)
+END_FEAT (libat_compare_exchange_16, LSE2)
 
 
 ENTRY (libat_fetch_add_16)
@@ -556,21 +572,20 @@ END (libat_test_and_set_16)
 
 /* Alias entry points which are the same in baseline and LSE2.  */
 
-ALIAS (libat_exchange_16_i1, libat_exchange_16)
-ALIAS (libat_fetch_add_16_i1, libat_fetch_add_16)
-ALIAS (libat_add_fetch_16_i1, libat_add_fetch_16)
-ALIAS (libat_fetch_sub_16_i1, libat_fetch_sub_16)
-ALIAS (libat_sub_fetch_16_i1, libat_sub_fetch_16)
-ALIAS (libat_fetch_or_16_i1, libat_fetch_or_16)
-ALIAS (libat_or_fetch_16_i1, libat_or_fetch_16)
-ALIAS (libat_fetch_and_16_i1, libat_fetch_and_16)
-ALIAS (libat_and_fetch_16_i1, libat_and_fetch_16)
-ALIAS (libat_fetch_xor_16_i1, libat_fetch_xor_16)
-ALIAS (libat_xor_fetch_16_i1, libat_xor_fetch_16)
-ALIAS (libat_fetch_nand_16_i1, libat_fetch_nand_16)
-ALIAS (libat_nand_fetch_16_i1, libat_nand_fetch_16)
-ALIAS (libat_test_and_set_16_i1, libat_test_and_set_16)
-
+ALIAS (libat_exchange_16, LSE2, CORE)
+ALIAS (libat_fetch_add_16, LSE2, CORE)
+ALIAS (libat_add_fetch_16, LSE2, CORE)
+ALIAS (libat_fetch_sub_16, LSE2, CORE)
+ALIAS (libat_sub_fetch_16, LSE2, CORE)
+ALIAS (libat_fetch_or_16, LSE2, CORE)
+ALIAS (libat_or_fetch_16, LSE2, CORE)
+ALIAS (libat_fetch_and_16, LSE2, CORE)
+ALIAS (libat_and_fetch_16, LSE2, CORE)
+ALIAS (libat_fetch_xor_16, LSE2, CORE)
+ALIAS (libat_xor_fetch_16, LSE2, CORE)
+ALIAS (libat_fetch_nand_16, LSE2, CORE)
+ALIAS (libat_nand_fetch_16, LSE2, CORE)
+ALIAS (libat_test_and_set_16, LSE2, CORE)
 
 /* GNU_PROPERTY_AARCH64_* macros from elf.h for use in asm code.  */
 #define FEATURE_1_AND 0xc0000000
-- 
2.42.0



* [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-03  1:28 [PATCH v3 0/3] Libatomic: Add LSE128 atomics support for AArch64 Victor Do Nascimento
  2024-01-03  1:28 ` [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface Victor Do Nascimento
@ 2024-01-03  1:28 ` Victor Do Nascimento
  2024-01-05 11:47   ` Richard Sandiford
  2024-01-03  1:28 ` [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements Victor Do Nascimento
  2 siblings, 1 reply; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-03  1:28 UTC (permalink / raw)
  To: gcc-patches
  Cc: kyrylo.tkachov, richard.sandiford, Richard.Earnshaw,
	Victor Do Nascimento

The armv9.4-a architectural revision adds three new atomic operations
associated with the LSE128 feature:

  * LDCLRP - Atomic AND NOT (bitclear) of a location with 128-bit
  value held in a pair of registers, with original data loaded into
  the same 2 registers.
  * LDSETP - Atomic OR (bitset) of a location with 128-bit value held
  in a pair of registers, with original data loaded into the same 2
  registers.
  * SWPP - Atomic swap of one 128-bit value with 128-bit value held
  in a pair of registers.
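
For illustration, each of these operates on a register pair; e.g. for
LDSETP (the operand choice here is illustrative):

  /* Atomically OR the 128-bit value in x0:x1 into [x2]; the previous
     contents of the memory location are returned in x0:x1.  */
  ldsetp	x0, x1, [x2]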

This patch adds the logic required to make use of these when the
architectural feature is present and a suitable assembler is available.

In order to do this, the following changes are made:

  1. Add a configure-time check for LSE128 support in the
  assembler.
  2. Edit host-config.h so that when N == 16, nifunc = 2.
  3. Where available due to LSE128, implement the second ifunc, making
  use of the novel instructions.
  4. For atomic functions unable to make use of these new
  instructions, define a new alias which causes the _i1 function
  variant to point ahead to the corresponding _i2 implementation.
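
Conceptually, the ifunc selection for 16-byte operations then works
as follows (a sketch only - libatomic generates its resolvers from
the IFUNC_COND_<n> macros, and the function names and signatures here
are illustrative):

  extern void libat_exchange_16 (void);
  extern void libat_exchange_16_i1 (void);	/* LSE128 variant.  */
  extern void libat_exchange_16_i2 (void);	/* LSE2 variant.  */

  typedef void (*atomic_fn) (void);

  static atomic_fn
  select_exchange_16 (unsigned long hwcap)
  {
    if (has_lse128 (hwcap))	/* IFUNC_COND_1.  */
      return libat_exchange_16_i1;
    if (has_lse2 (hwcap))	/* IFUNC_COND_2.  */
      return libat_exchange_16_i2;
    return libat_exchange_16;	/* Baseline.  */
  }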

libatomic/ChangeLog:

	* Makefile.am (AM_CPPFLAGS): Add conditional setting of
	-DHAVE_FEAT_LSE128.
	* acinclude.m4 (LIBAT_TEST_FEAT_LSE128): New.
	* config/linux/aarch64/atomic_16.S (LSE128): New macro
	definition.
	(libat_exchange_16): New LSE128 variant.
	(libat_fetch_or_16): Likewise.
	(libat_or_fetch_16): Likewise.
	(libat_fetch_and_16): Likewise.
	(libat_and_fetch_16): Likewise.
	* config/linux/aarch64/host-config.h (IFUNC_COND_2): New.
	(IFUNC_NCOND): Add operand size checking.
	(has_lse2): Renamed from `ifunc1`.
	(has_lse128): New.
	(HAS_LSE128): Likewise.
	* libatomic/configure.ac: Add call to LIBAT_TEST_FEAT_LSE128.
	* configure (ac_subst_vars): Regenerated via autoreconf.
	* libatomic/Makefile.in: Likewise.
	* libatomic/auto-config.h.in: Likewise.
---
 libatomic/Makefile.am                        |   3 +
 libatomic/Makefile.in                        |   1 +
 libatomic/acinclude.m4                       |  19 +++
 libatomic/auto-config.h.in                   |   3 +
 libatomic/config/linux/aarch64/atomic_16.S   | 170 ++++++++++++++++++-
 libatomic/config/linux/aarch64/host-config.h |  29 +++-
 libatomic/configure                          |  59 ++++++-
 libatomic/configure.ac                       |   1 +
 8 files changed, 276 insertions(+), 9 deletions(-)

diff --git a/libatomic/Makefile.am b/libatomic/Makefile.am
index c0b8dea5037..24e843db67d 100644
--- a/libatomic/Makefile.am
+++ b/libatomic/Makefile.am
@@ -130,6 +130,9 @@ libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix _$(s)_.lo,$(SIZEOBJS)))
 ## On a target-specific basis, include alternates to be selected by IFUNC.
 if HAVE_IFUNC
 if ARCH_AARCH64_LINUX
+if ARCH_AARCH64_HAVE_LSE128
+AM_CPPFLAGS	     = -DHAVE_FEAT_LSE128
+endif
 IFUNC_OPTIONS	     = -march=armv8-a+lse
 libatomic_la_LIBADD += $(foreach s,$(SIZES),$(addsuffix _$(s)_1_.lo,$(SIZEOBJS)))
 libatomic_la_SOURCES += atomic_16.S
diff --git a/libatomic/Makefile.in b/libatomic/Makefile.in
index dc2330b91fd..cd48fa21334 100644
--- a/libatomic/Makefile.in
+++ b/libatomic/Makefile.in
@@ -452,6 +452,7 @@ M_SRC = $(firstword $(filter %/$(M_FILE), $(all_c_files)))
 libatomic_la_LIBADD = $(foreach s,$(SIZES),$(addsuffix \
 	_$(s)_.lo,$(SIZEOBJS))) $(am__append_1) $(am__append_3) \
 	$(am__append_4) $(am__append_5)
+@ARCH_AARCH64_HAVE_LSE128_TRUE@@ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@AM_CPPFLAGS = -DHAVE_FEAT_LSE128
 @ARCH_AARCH64_LINUX_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=armv8-a+lse
 @ARCH_ARM_LINUX_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=armv7-a+fp -DHAVE_KERNEL64
 @ARCH_I386_TRUE@@HAVE_IFUNC_TRUE@IFUNC_OPTIONS = -march=i586
diff --git a/libatomic/acinclude.m4 b/libatomic/acinclude.m4
index f35ab5b60a5..4197db8f404 100644
--- a/libatomic/acinclude.m4
+++ b/libatomic/acinclude.m4
@@ -83,6 +83,25 @@ AC_DEFUN([LIBAT_TEST_ATOMIC_BUILTIN],[
   ])
 ])
 
+dnl
+dnl Test if the host assembler supports armv9.4-a LSE128 insns.
+dnl
+AC_DEFUN([LIBAT_TEST_FEAT_LSE128],[
+  AC_CACHE_CHECK([for armv9.4-a LSE128 insn support],
+    [libat_cv_have_feat_lse128],[
+    AC_LANG_CONFTEST([AC_LANG_PROGRAM([],[asm(".arch armv9-a+lse128")])])
+    if AC_TRY_EVAL(ac_link); then
+      eval libat_cv_have_feat_lse128=yes
+    else
+      eval libat_cv_have_feat_lse128=no
+    fi
+    rm -f conftest*
+  ])
+  LIBAT_DEFINE_YESNO([HAVE_FEAT_LSE128], [$libat_cv_have_feat_lse128],
+	[Have LSE128 support for 16 byte integers.])
+  AM_CONDITIONAL([ARCH_AARCH64_HAVE_LSE128], [test x$libat_cv_have_feat_lse128 = xyes])
+])
+
 dnl
 dnl Test if we have __atomic_load and __atomic_store for mode $1, size $2
 dnl
diff --git a/libatomic/auto-config.h.in b/libatomic/auto-config.h.in
index ab3424a759e..7c78933b07d 100644
--- a/libatomic/auto-config.h.in
+++ b/libatomic/auto-config.h.in
@@ -105,6 +105,9 @@
 /* Define to 1 if you have the <dlfcn.h> header file. */
 #undef HAVE_DLFCN_H
 
+/* Have LSE128 support for 16 byte integers. */
+#undef HAVE_FEAT_LSE128
+
 /* Define to 1 if you have the <fenv.h> header file. */
 #undef HAVE_FENV_H
 
diff --git a/libatomic/config/linux/aarch64/atomic_16.S b/libatomic/config/linux/aarch64/atomic_16.S
index eb8e749b8a2..a15135c34a4 100644
--- a/libatomic/config/linux/aarch64/atomic_16.S
+++ b/libatomic/config/linux/aarch64/atomic_16.S
@@ -35,10 +35,14 @@
    writes, this will be true when using atomics in actual code.
 
    The libat_<op>_16 entry points are ARMv8.0.
-   The libat_<op>_16_i1 entry points are used when LSE2 is available.  */
-
+   The libat_<op>_16_i1 entry points are used when LSE128 is available.
+   The libat_<op>_16_i2 entry points are used when LSE2 is available.  */
 
+#if HAVE_FEAT_LSE128
+	.arch	armv9-a+lse128
+#else
 	.arch	armv8-a+lse
+#endif
 
 #define ENTRY(name) ENTRY_FEAT (name, CORE)
 
@@ -71,7 +75,8 @@ name##feat:				\
 	.set alias##from, alias##to;
 
 #define CORE
-#define LSE2	_i1
+#define LSE128	_i1
+#define LSE2	_i2
 
 #define res0 x0
 #define res1 x1
@@ -206,6 +211,31 @@ ENTRY (libat_exchange_16)
 END (libat_exchange_16)
 
 
+#if HAVE_FEAT_LSE128
+ENTRY_FEAT (libat_exchange_16, LSE128)
+	mov	tmp0, x0
+	mov	res0, in0
+	mov	res1, in1
+	cbnz	w4, 1f
+
+	/* RELAXED.  */
+	swpp	res0, res1, [tmp0]
+	ret
+1:
+	cmp	w4, ACQUIRE
+	b.hi	2f
+
+	/* ACQUIRE/CONSUME.  */
+	swppa	res0, res1, [tmp0]
+	ret
+
+	/* RELEASE/ACQ_REL/SEQ_CST.  */
+2:	swppal	res0, res1, [tmp0]
+	ret
+END_FEAT (libat_exchange_16, LSE128)
+#endif
+
+
 ENTRY (libat_compare_exchange_16)
 	ldp	exp0, exp1, [x1]
 	cbz	w4, 3f
@@ -399,6 +429,31 @@ ENTRY (libat_fetch_or_16)
 END (libat_fetch_or_16)
 
 
+#if HAVE_FEAT_LSE128
+ENTRY_FEAT (libat_fetch_or_16, LSE128)
+	mov	tmp0, x0
+	mov	res0, in0
+	mov	res1, in1
+	cbnz	w4, 1f
+
+	/* RELAXED.  */
+	ldsetp	res0, res1, [tmp0]
+	ret
+1:
+	cmp	w4, ACQUIRE
+	b.hi	2f
+
+	/* ACQUIRE/CONSUME.  */
+	ldsetpa	res0, res1, [tmp0]
+	ret
+
+	/* RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldsetpal	res0, res1, [tmp0]
+	ret
+END_FEAT (libat_fetch_or_16, LSE128)
+#endif
+
+
 ENTRY (libat_or_fetch_16)
 	mov	x5, x0
 	cbnz	w4, 2f
@@ -421,6 +476,36 @@ ENTRY (libat_or_fetch_16)
 END (libat_or_fetch_16)
 
 
+#if HAVE_FEAT_LSE128
+ENTRY_FEAT (libat_or_fetch_16, LSE128)
+	cbnz	w4, 1f
+	mov	tmp0, in0
+	mov	tmp1, in1
+
+	/* RELAXED.  */
+	ldsetp	in0, in1, [x0]
+	orr	res0, in0, tmp0
+	orr	res1, in1, tmp1
+	ret
+1:
+	cmp	w4, ACQUIRE
+	b.hi	2f
+
+	/* ACQUIRE/CONSUME.  */
+	ldsetpa	in0, in1, [x0]
+	orr	res0, in0, tmp0
+	orr	res1, in1, tmp1
+	ret
+
+	/* RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldsetpal	in0, in1, [x0]
+	orr	res0, in0, tmp0
+	orr	res1, in1, tmp1
+	ret
+END_FEAT (libat_or_fetch_16, LSE128)
+#endif
+
+
 ENTRY (libat_fetch_and_16)
 	mov	x5, x0
 	cbnz	w4, 2f
@@ -443,6 +528,32 @@ ENTRY (libat_fetch_and_16)
 END (libat_fetch_and_16)
 
 
+#if HAVE_FEAT_LSE128
+ENTRY_FEAT (libat_fetch_and_16, LSE128)
+	mov	tmp0, x0
+	mvn	res0, in0
+	mvn	res1, in1
+	cbnz	w4, 1f
+
+	/* RELAXED.  */
+	ldclrp	res0, res1, [tmp0]
+	ret
+
+1:
+	cmp	w4, ACQUIRE
+	b.hi	2f
+
+	/* ACQUIRE/CONSUME.  */
+	ldclrpa res0, res1, [tmp0]
+	ret
+
+	/* RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldclrpal	res0, res1, [tmp0]
+	ret
+END_FEAT (libat_fetch_and_16, LSE128)
+#endif
+
+
 ENTRY (libat_and_fetch_16)
 	mov	x5, x0
 	cbnz	w4, 2f
@@ -465,6 +576,37 @@ ENTRY (libat_and_fetch_16)
 END (libat_and_fetch_16)
 
 
+#if HAVE_FEAT_LSE128
+ENTRY_FEAT (libat_and_fetch_16, LSE128)
+	mvn	tmp0, in0
+	mvn	tmp1, in1
+	cbnz	w4, 1f
+
+	/* RELAXED.  */
+	ldclrp	tmp0, tmp1, [x0]
+	and	res0, tmp0, in0
+	and	res1, tmp1, in1
+	ret
+
+1:
+	cmp	w4, ACQUIRE
+	b.hi	2f
+
+	/* ACQUIRE/CONSUME.  */
+	ldclrpa tmp0, tmp1, [x0]
+	and	res0, tmp0, in0
+	and	res1, tmp1, in1
+	ret
+
+	/* RELEASE/ACQ_REL/SEQ_CST.  */
+2:	ldclrpal	tmp0, tmp1, [x0]
+	and	res0, tmp0, in0
+	and	res1, tmp1, in1
+	ret
+END_FEAT (libat_and_fetch_16, LSE128)
+#endif
+
+
 ENTRY (libat_fetch_xor_16)
 	mov	x5, x0
 	cbnz	w4, 2f
@@ -570,6 +712,28 @@ ENTRY (libat_test_and_set_16)
 END (libat_test_and_set_16)
 
 
+/* Alias entry points which are the same in LSE2 and LSE128.  */
+
+#if !HAVE_FEAT_LSE128
+ALIAS (libat_exchange_16, LSE128, LSE2)
+ALIAS (libat_fetch_or_16, LSE128, LSE2)
+ALIAS (libat_fetch_and_16, LSE128, LSE2)
+ALIAS (libat_or_fetch_16, LSE128, LSE2)
+ALIAS (libat_and_fetch_16, LSE128, LSE2)
+#endif
+ALIAS (libat_load_16, LSE128, LSE2)
+ALIAS (libat_store_16, LSE128, LSE2)
+ALIAS (libat_compare_exchange_16, LSE128, LSE2)
+ALIAS (libat_fetch_add_16, LSE128, LSE2)
+ALIAS (libat_add_fetch_16, LSE128, LSE2)
+ALIAS (libat_fetch_sub_16, LSE128, LSE2)
+ALIAS (libat_sub_fetch_16, LSE128, LSE2)
+ALIAS (libat_fetch_xor_16, LSE128, LSE2)
+ALIAS (libat_xor_fetch_16, LSE128, LSE2)
+ALIAS (libat_fetch_nand_16, LSE128, LSE2)
+ALIAS (libat_nand_fetch_16, LSE128, LSE2)
+ALIAS (libat_test_and_set_16, LSE128, LSE2)
+
 /* Alias entry points which are the same in baseline and LSE2.  */
 
 ALIAS (libat_exchange_16, LSE2, CORE)
diff --git a/libatomic/config/linux/aarch64/host-config.h b/libatomic/config/linux/aarch64/host-config.h
index ac4d922ca5c..c5485d63855 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -26,14 +26,17 @@
 
 #ifdef HWCAP_USCAT
 # if N == 16
-#  define IFUNC_COND_1	ifunc1 (hwcap)
+#  define IFUNC_COND_1		(has_lse128 (hwcap))
+#  define IFUNC_COND_2		(has_lse2 (hwcap))
+#  define IFUNC_NCOND(N)	2
 # else
-#  define IFUNC_COND_1	(hwcap & HWCAP_ATOMICS)
+#  define IFUNC_COND_1		(hwcap & HWCAP_ATOMICS)
+#  define IFUNC_NCOND(N)	1
 # endif
 #else
 #  define IFUNC_COND_1	(false)
+#  define IFUNC_NCOND(N)	1
 #endif
-#define IFUNC_NCOND(N)	(1)
 
 #endif /* HAVE_IFUNC */
 
@@ -48,7 +51,7 @@
 #define MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfff)
 
 static inline bool
-ifunc1 (unsigned long hwcap)
+has_lse2 (unsigned long hwcap)
 {
   if (hwcap & HWCAP_USCAT)
     return true;
@@ -64,6 +67,24 @@ ifunc1 (unsigned long hwcap)
 
   return false;
 }
+
+/* LSE128 atomic support encoded in ID_AA64ISAR0_EL1.Atomic,
+   bits[23:20].  The expected value is 0b0011.  Check that.  */
+
+#define AT_FEAT_FIELD(isar0)	(((isar0) >> 20) & 15)
+
+static inline bool
+has_lse128 (unsigned long hwcap)
+{
+  if (!(hwcap & HWCAP_CPUID))
+    return false;
+  unsigned long isar0;
+  asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
+  if (AT_FEAT_FIELD (isar0) >= 3)
+    return true;
+  return false;
+}
+
 #endif
 
 #include_next <host-config.h>
diff --git a/libatomic/configure b/libatomic/configure
index d579bab96f8..ee3bbb97d69 100755
--- a/libatomic/configure
+++ b/libatomic/configure
@@ -657,6 +657,8 @@ LIBAT_BUILD_VERSIONED_SHLIB_TRUE
 OPT_LDFLAGS
 SECTION_LDFLAGS
 SYSROOT_CFLAGS_FOR_TARGET
+ARCH_AARCH64_HAVE_LSE128_FALSE
+ARCH_AARCH64_HAVE_LSE128_TRUE
 enable_aarch64_lse
 libtool_VERSION
 ENABLE_DARWIN_AT_RPATH_FALSE
@@ -11456,7 +11458,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11459 "configure"
+#line 11461 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11562,7 +11564,7 @@ else
   lt_dlunknown=0; lt_dlno_uscore=1; lt_dlneed_uscore=2
   lt_status=$lt_dlunknown
   cat > conftest.$ac_ext <<_LT_EOF
-#line 11565 "configure"
+#line 11567 "configure"
 #include "confdefs.h"
 
 #if HAVE_DLFCN_H
@@ -11926,6 +11928,55 @@ ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext $
 ac_compiler_gnu=$ac_cv_c_compiler_gnu
 
 
+
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for armv9.4-a LSE128 insn support" >&5
+$as_echo_n "checking for armv9.4-a LSE128 insn support... " >&6; }
+if ${libat_cv_have_feat_lse128+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+
+    cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+asm(".arch armv9-a+lse128")
+  ;
+  return 0;
+}
+_ACEOF
+    if { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_link\""; } >&5
+  (eval $ac_link) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; then
+      eval libat_cv_have_feat_lse128=yes
+    else
+      eval libat_cv_have_feat_lse128=no
+    fi
+    rm -f conftest*
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libat_cv_have_feat_lse128" >&5
+$as_echo "$libat_cv_have_feat_lse128" >&6; }
+
+  yesno=`echo $libat_cv_have_feat_lse128 | tr 'yesno' '1  0 '`
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_FEAT_LSE128 $yesno
+_ACEOF
+
+
+   if test x$libat_cv_have_feat_lse128 = xyes; then
+  ARCH_AARCH64_HAVE_LSE128_TRUE=
+  ARCH_AARCH64_HAVE_LSE128_FALSE='#'
+else
+  ARCH_AARCH64_HAVE_LSE128_TRUE='#'
+  ARCH_AARCH64_HAVE_LSE128_FALSE=
+fi
+
+
     ;;
 esac
 
@@ -15989,6 +16040,10 @@ if test -z "${ENABLE_DARWIN_AT_RPATH_TRUE}" && test -z "${ENABLE_DARWIN_AT_RPATH
   as_fn_error $? "conditional \"ENABLE_DARWIN_AT_RPATH\" was never defined.
 Usually this means the macro was only invoked conditionally." "$LINENO" 5
 fi
+if test -z "${ARCH_AARCH64_HAVE_LSE128_TRUE}" && test -z "${ARCH_AARCH64_HAVE_LSE128_FALSE}"; then
+  as_fn_error $? "conditional \"ARCH_AARCH64_HAVE_LSE128\" was never defined.
+Usually this means the macro was only invoked conditionally." "$LINENO" 5
+fi
 
 if test -z "${LIBAT_BUILD_VERSIONED_SHLIB_TRUE}" && test -z "${LIBAT_BUILD_VERSIONED_SHLIB_FALSE}"; then
   as_fn_error $? "conditional \"LIBAT_BUILD_VERSIONED_SHLIB\" was never defined.
diff --git a/libatomic/configure.ac b/libatomic/configure.ac
index 5f2821ac3f4..b2fe68d7d0f 100644
--- a/libatomic/configure.ac
+++ b/libatomic/configure.ac
@@ -169,6 +169,7 @@ AC_MSG_RESULT([$target_thread_file])
 case "$target" in
  *aarch64*)
     ACX_PROG_CC_WARNING_OPTS([-march=armv8-a+lse],[enable_aarch64_lse])
+    LIBAT_TEST_FEAT_LSE128()
     ;;
 esac
 
-- 
2.42.0



* [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements.
  2024-01-03  1:28 [PATCH v3 0/3] Libatomic: Add LSE128 atomics support for AArch64 Victor Do Nascimento
  2024-01-03  1:28 ` [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface Victor Do Nascimento
  2024-01-03  1:28 ` [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a Victor Do Nascimento
@ 2024-01-03  1:28 ` Victor Do Nascimento
  2024-01-05 11:58   ` Richard Sandiford
  2 siblings, 1 reply; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-03  1:28 UTC (permalink / raw)
  To: gcc-patches
  Cc: kyrylo.tkachov, richard.sandiford, Richard.Earnshaw,
	Victor Do Nascimento

At present, evaluation of both `has_lse2(hwcap)' and
`has_lse128(hwcap)' may require issuing an `mrs' instruction to query
a system register.  When issued from user-space, this instruction
results in a trap, which the kernel handles by returning the value
read from the system register.  Given the computational expense of
the associated context switch, it is important to forgo the operation
wherever possible.

In light of this, given that the architectural prerequisites for
these features have long been assigned HWCAP bits by the kernel, we
can inexpensively query for their availability before attempting to
read any system registers.  Where one of these early tests fails, we
can assert that the main feature of interest (be it LSE2 or LSE128)
cannot be present, allowing us to return from the function early and
skip the unnecessary, expensive kernel-mediated access to system
registers.

libatomic/ChangeLog:

	* config/linux/aarch64/host-config.h (has_lse2): Add test for LSE.
	(has_lse128): Add test for LSE2.
---
 libatomic/config/linux/aarch64/host-config.h | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/libatomic/config/linux/aarch64/host-config.h b/libatomic/config/linux/aarch64/host-config.h
index c5485d63855..3be4db6e5f8 100644
--- a/libatomic/config/linux/aarch64/host-config.h
+++ b/libatomic/config/linux/aarch64/host-config.h
@@ -53,8 +53,13 @@
 static inline bool
 has_lse2 (unsigned long hwcap)
 {
+  /* Check for LSE2.  */
   if (hwcap & HWCAP_USCAT)
     return true;
+  /* No point checking further for atomic 128-bit load/store if LSE
+     prerequisite not met.  */
+  if (!(hwcap & HWCAP_ATOMICS))
+    return false;
   if (!(hwcap & HWCAP_CPUID))
     return false;
 
@@ -76,12 +81,14 @@ has_lse2 (unsigned long hwcap)
 static inline bool
 has_lse128 (unsigned long hwcap)
 {
-  if (!(hwcap & HWCAP_CPUID))
-    return false;
+  /* In the absence of HWCAP_CPUID we are unable to check for LSE128; return.
+     If the feature check is available, check the LSE2 prerequisite first.  */
+  if (!(hwcap & HWCAP_CPUID) || !(hwcap & HWCAP_USCAT))
+     return false;
   unsigned long isar0;
   asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
   if (AT_FEAT_FIELD (isar0) >= 3)
-    return true;
+      return true;
   return false;
 }
 
-- 
2.42.0



* Re: [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface
  2024-01-03  1:28 ` [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface Victor Do Nascimento
@ 2024-01-05 11:10   ` Richard Sandiford
  2024-01-08 12:33     ` Victor Do Nascimento
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Sandiford @ 2024-01-05 11:10 UTC (permalink / raw)
  To: Victor Do Nascimento; +Cc: gcc-patches, kyrylo.tkachov, Richard.Earnshaw

Victor Do Nascimento <victor.donascimento@arm.com> writes:
> The introduction of further architectural-feature-dependent ifuncs
> for AArch64 makes hard-coded ifunc `_i<n>' suffixes on functions
> cumbersome to work with.  It is awkward to remember which ifunc maps
> onto which arch feature, and the scheme makes the code harder to
> maintain when new ifuncs are added and their suffixes possibly
> altered.
>
> This patch uses pre-processor `#define' statements to map each suffix to
> a descriptive feature name macro, for example:
>
>   #define LSE2 _i1
>
> and reconstructs function names with the pre-processor's token
> concatenation feature, such that for `MACRO(<name>_i<n>)', we would
> now have `MACRO_FEAT(name, feature)' and in the macro definition body
> we replace `name` with `name##feature`.

FWIW, another way of doing this would be to have:

#define CORE(NAME) NAME
#define LSE2(NAME) NAME##_i1

and use feature(name) instead of name##feature.  This has the slight
advantage of not using ## on empty tokens, and the maybe slightly
better advantage of not needing the extra forwarding step in:

#define ENTRY_FEAT(name, feat)		\
	ENTRY_FEAT1(name, feat)

#define ENTRY_FEAT1(name, feat)	\

WDYT?

Richard

> Consequently, for base functionality, where the ifunc suffix is
> absent, the macro interface remains the same.  For example, the entry
> and endpoints of `libat_store_16' remain defined by:
>
>   - ENTRY (libat_store_16)
> and
>   - END (libat_store_16)
>
> For the LSE2 implementation of the same 16-byte atomic store, we now
> have:
>
>   - ENTRY_FEAT (libat_store_16, LSE2)
> and
>   - END_FEAT (libat_store_16, LSE2)
>
> For the aliasing of ifunc names, we define the following new
> implementation of the ALIAS macro:
>
>   - ALIAS (FN_BASE_NAME, FROM_SUFFIX, TO_SUFFIX)
>
> Defining the base feature name macro `CORE' to map to the empty
> string, we can then alias the LSE2 `libat_exchange_16' to its base
> implementation with:
>
>   - ALIAS (libat_exchange_16, LSE2, CORE)
>
> libatomic/ChangeLog:
> 	* config/linux/aarch64/atomic_16.S (CORE): New macro.
> 	(LSE2): Likewise.
> 	(ENTRY_FEAT): Likewise.
> 	(END_FEAT): Likewise.
> 	(ENTRY_FEAT1): Likewise.
> 	(END_FEAT1): Likewise.
> 	(ALIAS): Modify macro to take in `arch' arguments.
> ---
>  libatomic/config/linux/aarch64/atomic_16.S | 83 +++++++++++++---------
>  1 file changed, 49 insertions(+), 34 deletions(-)
>
> diff --git a/libatomic/config/linux/aarch64/atomic_16.S b/libatomic/config/linux/aarch64/atomic_16.S
> index a099037179b..eb8e749b8a2 100644
> --- a/libatomic/config/linux/aarch64/atomic_16.S
> +++ b/libatomic/config/linux/aarch64/atomic_16.S
> @@ -40,22 +40,38 @@
>  
>  	.arch	armv8-a+lse
>  
> -#define ENTRY(name)		\
> -	.global name;		\
> -	.hidden name;		\
> -	.type name,%function;	\
> -	.p2align 4;		\
> -name:				\
> -	.cfi_startproc;		\
> +#define ENTRY(name) ENTRY_FEAT (name, CORE)
> +
> +#define ENTRY_FEAT(name, feat)		\
> +	ENTRY_FEAT1(name, feat)
> +
> +#define ENTRY_FEAT1(name, feat)	\
> +	.global name##feat;		\
> +	.hidden name##feat;		\
> +	.type name##feat,%function;	\
> +	.p2align 4;			\
> +name##feat:				\
> +	.cfi_startproc;			\
>  	hint	34	// bti c
>  
> -#define END(name)		\
> -	.cfi_endproc;		\
> -	.size name, .-name;
> +#define END(name) END_FEAT (name, CORE)
>  
> -#define ALIAS(alias,name)	\
> -	.global alias;		\
> -	.set alias, name;
> +#define END_FEAT(name, feat)			\
> +	END_FEAT1(name, feat)
> +
> +#define END_FEAT1(name, feat)		\
> +	.cfi_endproc;			\
> +	.size name##feat, .-name##feat;
> +
> +#define ALIAS(alias, from, to)		\
> +	ALIAS1(alias,from,to)
> +
> +#define ALIAS1(alias, from, to)		\
> +	.global alias##from;		\
> +	.set alias##from, alias##to;
> +
> +#define CORE
> +#define LSE2	_i1
>  
>  #define res0 x0
>  #define res1 x1
> @@ -108,7 +124,7 @@ ENTRY (libat_load_16)
>  END (libat_load_16)
>  
>  
> -ENTRY (libat_load_16_i1)
> +ENTRY_FEAT (libat_load_16, LSE2)
>  	cbnz	w1, 1f
>  
>  	/* RELAXED.  */
> @@ -128,7 +144,7 @@ ENTRY (libat_load_16_i1)
>  	ldp	res0, res1, [x0]
>  	dmb	ishld
>  	ret
> -END (libat_load_16_i1)
> +END_FEAT (libat_load_16, LSE2)
>  
>  
>  ENTRY (libat_store_16)
> @@ -148,7 +164,7 @@ ENTRY (libat_store_16)
>  END (libat_store_16)
>  
>  
> -ENTRY (libat_store_16_i1)
> +ENTRY_FEAT (libat_store_16, LSE2)
>  	cbnz	w4, 1f
>  
>  	/* RELAXED.  */
> @@ -160,7 +176,7 @@ ENTRY (libat_store_16_i1)
>  	stlxp	w4, in0, in1, [x0]
>  	cbnz	w4, 1b
>  	ret
> -END (libat_store_16_i1)
> +END_FEAT (libat_store_16, LSE2)
>  
>  
>  ENTRY (libat_exchange_16)
> @@ -237,7 +253,7 @@ ENTRY (libat_compare_exchange_16)
>  END (libat_compare_exchange_16)
>  
>  
> -ENTRY (libat_compare_exchange_16_i1)
> +ENTRY_FEAT (libat_compare_exchange_16, LSE2)
>  	ldp	exp0, exp1, [x1]
>  	mov	tmp0, exp0
>  	mov	tmp1, exp1
> @@ -270,7 +286,7 @@ ENTRY (libat_compare_exchange_16_i1)
>  	/* ACQ_REL/SEQ_CST.  */
>  4:	caspal	exp0, exp1, in0, in1, [x0]
>  	b	0b
> -END (libat_compare_exchange_16_i1)
> +END_FEAT (libat_compare_exchange_16, LSE2)
>  
>  
>  ENTRY (libat_fetch_add_16)
> @@ -556,21 +572,20 @@ END (libat_test_and_set_16)
>  
>  /* Alias entry points which are the same in baseline and LSE2.  */
>  
> -ALIAS (libat_exchange_16_i1, libat_exchange_16)
> -ALIAS (libat_fetch_add_16_i1, libat_fetch_add_16)
> -ALIAS (libat_add_fetch_16_i1, libat_add_fetch_16)
> -ALIAS (libat_fetch_sub_16_i1, libat_fetch_sub_16)
> -ALIAS (libat_sub_fetch_16_i1, libat_sub_fetch_16)
> -ALIAS (libat_fetch_or_16_i1, libat_fetch_or_16)
> -ALIAS (libat_or_fetch_16_i1, libat_or_fetch_16)
> -ALIAS (libat_fetch_and_16_i1, libat_fetch_and_16)
> -ALIAS (libat_and_fetch_16_i1, libat_and_fetch_16)
> -ALIAS (libat_fetch_xor_16_i1, libat_fetch_xor_16)
> -ALIAS (libat_xor_fetch_16_i1, libat_xor_fetch_16)
> -ALIAS (libat_fetch_nand_16_i1, libat_fetch_nand_16)
> -ALIAS (libat_nand_fetch_16_i1, libat_nand_fetch_16)
> -ALIAS (libat_test_and_set_16_i1, libat_test_and_set_16)
> -
> +ALIAS (libat_exchange_16, LSE2, CORE)
> +ALIAS (libat_fetch_add_16, LSE2, CORE)
> +ALIAS (libat_add_fetch_16, LSE2, CORE)
> +ALIAS (libat_fetch_sub_16, LSE2, CORE)
> +ALIAS (libat_sub_fetch_16, LSE2, CORE)
> +ALIAS (libat_fetch_or_16, LSE2, CORE)
> +ALIAS (libat_or_fetch_16, LSE2, CORE)
> +ALIAS (libat_fetch_and_16, LSE2, CORE)
> +ALIAS (libat_and_fetch_16, LSE2, CORE)
> +ALIAS (libat_fetch_xor_16, LSE2, CORE)
> +ALIAS (libat_xor_fetch_16, LSE2, CORE)
> +ALIAS (libat_fetch_nand_16, LSE2, CORE)
> +ALIAS (libat_nand_fetch_16, LSE2, CORE)
> +ALIAS (libat_test_and_set_16, LSE2, CORE)
>  
>  /* GNU_PROPERTY_AARCH64_* macros from elf.h for use in asm code.  */
>  #define FEATURE_1_AND 0xc0000000


* Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-03  1:28 ` [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a Victor Do Nascimento
@ 2024-01-05 11:47   ` Richard Sandiford
  2024-01-08 10:17     ` Victor Do Nascimento
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Sandiford @ 2024-01-05 11:47 UTC (permalink / raw)
  To: Victor Do Nascimento; +Cc: gcc-patches, kyrylo.tkachov, Richard.Earnshaw

Victor Do Nascimento <victor.donascimento@arm.com> writes:
> The armv9.4-a architectural revision adds three new atomic operations
> associated with the LSE128 feature:
>
>   * LDCLRP - Atomic AND NOT (bitclear) of a location with 128-bit
>   value held in a pair of registers, with original data loaded into
>   the same 2 registers.
>   * LDSETP - Atomic OR (bitset) of a location with 128-bit value held
>   in a pair of registers, with original data loaded into the same 2
>   registers.
>   * SWPP - Atomic swap of one 128-bit value with 128-bit value held
>   in a pair of registers.
>
> This patch adds the logic required to make use of these when the
> architectural feature is present and a suitable assembler is available.
>
> In order to do this, the following changes are made:
>
>   1. Add a configure-time check for LSE128 support in the
>   assembler.
>   2. Edit host-config.h so that when N == 16, nifunc = 2.
>   3. Where available due to LSE128, implement the second ifunc, making
>   use of the novel instructions.
>   4. For atomic functions unable to make use of these new
>   instructions, define a new alias which causes the _i1 function
>   variant to point ahead to the corresponding _i2 implementation.
>
> libatomic/ChangeLog:
>
> 	* Makefile.am (AM_CPPFLAGS): Add conditional setting of
> 	-DHAVE_FEAT_LSE128.
> 	* acinclude.m4 (LIBAT_TEST_FEAT_LSE128): New.
> 	* config/linux/aarch64/atomic_16.S (LSE128): New macro
> 	definition.
> 	(libat_exchange_16): New LSE128 variant.
> 	(libat_fetch_or_16): Likewise.
> 	(libat_or_fetch_16): Likewise.
> 	(libat_fetch_and_16): Likewise.
> 	(libat_and_fetch_16): Likewise.
> 	* config/linux/aarch64/host-config.h (IFUNC_COND_2): New.
> 	(IFUNC_NCOND): Add operand size checking.
> 	(has_lse2): Renamed from `ifunc1`.
> 	(has_lse128): New.
> 	(HAS_LSE128): Likewise.
> 	* libatomic/configure.ac: Add call to LIBAT_TEST_FEAT_LSE128.
> 	* configure (ac_subst_vars): Regenerated via autoreconf.
> 	* libatomic/Makefile.in: Likewise.
> 	* libatomic/auto-config.h.in: Likewise.
> ---
>  libatomic/Makefile.am                        |   3 +
>  libatomic/Makefile.in                        |   1 +
>  libatomic/acinclude.m4                       |  19 +++
>  libatomic/auto-config.h.in                   |   3 +
>  libatomic/config/linux/aarch64/atomic_16.S   | 170 ++++++++++++++++++-
>  libatomic/config/linux/aarch64/host-config.h |  29 +++-
>  libatomic/configure                          |  59 ++++++-
>  libatomic/configure.ac                       |   1 +
>  8 files changed, 276 insertions(+), 9 deletions(-)
>
> [...]
> diff --git a/libatomic/acinclude.m4 b/libatomic/acinclude.m4
> index f35ab5b60a5..4197db8f404 100644
> --- a/libatomic/acinclude.m4
> +++ b/libatomic/acinclude.m4
> @@ -83,6 +83,25 @@ AC_DEFUN([LIBAT_TEST_ATOMIC_BUILTIN],[
>    ])
>  ])
>  
> +dnl
> +dnl Test if the host assembler supports armv9.4-a LSE128 insns.
> +dnl
> +AC_DEFUN([LIBAT_TEST_FEAT_LSE128],[
> +  AC_CACHE_CHECK([for armv9.4-a LSE128 insn support],
> +    [libat_cv_have_feat_lse128],[
> +    AC_LANG_CONFTEST([AC_LANG_PROGRAM([],[asm(".arch armv9-a+lse128")])])
> +    if AC_TRY_EVAL(ac_link); then

ac_compile should be enough for this.  The link step isn't really
adding anything.

> +      eval libat_cv_have_feat_lse128=yes
> +    else
> +      eval libat_cv_have_feat_lse128=no
> +    fi
> +    rm -f conftest*
> +  ])
> +  LIBAT_DEFINE_YESNO([HAVE_FEAT_LSE128], [$libat_cv_have_feat_lse128],
> +	[Have LSE128 support for 16 byte integers.])
> +  AM_CONDITIONAL([ARCH_AARCH64_HAVE_LSE128], [test x$libat_cv_have_feat_lse128 = xyes])
> +])
> +
>  dnl
>  dnl Test if we have __atomic_load and __atomic_store for mode $1, size $2
>  dnl
> [...]
> @@ -206,6 +211,31 @@ ENTRY (libat_exchange_16)
>  END (libat_exchange_16)
>  
>  
> +#if HAVE_FEAT_LSE128
> +ENTRY_FEAT (libat_exchange_16, LSE128)
> +	mov	tmp0, x0
> +	mov	res0, in0
> +	mov	res1, in1
> +	cbnz	w4, 1f
> +
> +	/* RELAXED.  */
> +	swpp	res0, res1, [tmp0]
> +	ret
> +1:
> +	cmp	w4, ACQUIRE
> +	b.hi	2f
> +
> +	/* ACQUIRE/CONSUME.  */
> +	swppa	res0, res1, [tmp0]
> +	ret
> +
> +	/* RELEASE/ACQ_REL/SEQ_CST.  */
> +2:	swppal	res0, res1, [tmp0]
> +	ret
> +END_FEAT (libat_exchange_16, LSE128)
> +#endif

Is there no benefit to using SWPPL for RELEASE here?  Similarly for the
others.

Looks good otherwise.

Thanks,
Richard


* Re: [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements.
  2024-01-03  1:28 ` [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements Victor Do Nascimento
@ 2024-01-05 11:58   ` Richard Sandiford
  2024-01-05 12:08     ` Richard Sandiford
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Sandiford @ 2024-01-05 11:58 UTC (permalink / raw)
  To: Victor Do Nascimento; +Cc: gcc-patches, kyrylo.tkachov, Richard.Earnshaw

Victor Do Nascimento <victor.donascimento@arm.com> writes:
> At present, evaluation of both `has_lse2(hwcap)' and
> `has_lse128(hwcap)' may require issuing an `mrs' instruction to query
> a system register.  When issued from user-space, this instruction
> results in a trap, which the kernel handles by returning the value
> read from the system register.  Given the computational expense of
> the associated context switch, it is important to forgo the operation
> wherever possible.
>
> In light of this, given that the architectural prerequisites for
> these features have long been assigned HWCAP bits by the kernel, we
> can inexpensively query for their availability before attempting to
> read any system registers.  Where one of these early tests fails, we
> can assert that the main feature of interest (be it LSE2 or LSE128)
> cannot be present, allowing us to return from the function early and
> skip the unnecessary, expensive kernel-mediated access to system
> registers.
>
> libatomic/ChangeLog:
>
> 	* config/linux/aarch64/host-config.h (has_lse2): Add test for LSE.
> 	(has_lse128): Add test for LSE2.
> ---
>  libatomic/config/linux/aarch64/host-config.h | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/libatomic/config/linux/aarch64/host-config.h b/libatomic/config/linux/aarch64/host-config.h
> index c5485d63855..3be4db6e5f8 100644
> --- a/libatomic/config/linux/aarch64/host-config.h
> +++ b/libatomic/config/linux/aarch64/host-config.h
> @@ -53,8 +53,13 @@
>  static inline bool
>  has_lse2 (unsigned long hwcap)
>  {
> +  /* Check for LSE2.  */
>    if (hwcap & HWCAP_USCAT)
>      return true;
> +  /* No point checking further for atomic 128-bit load/store if LSE
> +     prerequisite not met.  */
> +  if (!(hwcap & HWCAP_ATOMICS))
> +    return false;

This part is OK.

>    if (!(hwcap & HWCAP_CPUID))
>      return false;
>  
> @@ -76,12 +81,14 @@ has_lse2 (unsigned long hwcap)
>  static inline bool
>  has_lse128 (unsigned long hwcap)
>  {
> -  if (!(hwcap & HWCAP_CPUID))
> -    return false;
> +  /* In the absence of HWCAP_CPUID we are unable to check for LSE128; return.
> +     If the feature check is available, check the LSE2 prerequisite first.  */
> +  if (!(hwcap & HWCAP_CPUID) || !(hwcap & HWCAP_USCAT))
> +     return false;

The inconsistency feels wrong here.  If we're saying that HWCAP_USCAT
is now so old that we don't need to fall back on CPUID, then it feels
like we should have the courage of our convictions and do the same for
has_lse2.  If instead we still want to support libcs that predate
HWCAP_USCAT, we should do the same here too.

>    unsigned long isar0;
>    asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
>    if (AT_FEAT_FIELD (isar0) >= 3)
> -    return true;
> +      return true;

The original formatting was correct.

Thanks,
Richard

>    return false;
>  }


* Re: [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements.
  2024-01-05 11:58   ` Richard Sandiford
@ 2024-01-05 12:08     ` Richard Sandiford
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2024-01-05 12:08 UTC (permalink / raw)
  To: Victor Do Nascimento; +Cc: gcc-patches, kyrylo.tkachov, Richard.Earnshaw

Richard Sandiford <richard.sandiford@arm.com> writes:
> Victor Do Nascimento <victor.donascimento@arm.com> writes:
>> At present, evaluation of both `has_lse2(hwcap)' and
>> `has_lse128(hwcap)' may require issuing an `mrs' instruction to query
>> a system register.  When issued from user-space, this instruction
>> results in a trap, which the kernel handles by returning the value
>> read from the system register.  Given the computational expense of
>> the associated context switch, it is important to forgo the operation
>> wherever possible.
>>
>> In light of this, given that the architectural prerequisites for
>> these features have long been assigned HWCAP bits by the kernel, we
>> can inexpensively query for their availability before attempting to
>> read any system registers.  Where one of these early tests fails, we
>> can assert that the main feature of interest (be it LSE2 or LSE128)
>> cannot be present, allowing us to return from the function early and
>> skip the unnecessary, expensive kernel-mediated access to system
>> registers.
>>
>> libatomic/ChangeLog:
>>
>> 	* config/linux/aarch64/host-config.h (has_lse2): Add test for LSE.
>> 	(has_lse128): Add test for LSE2.
>> ---
>>  libatomic/config/linux/aarch64/host-config.h | 13 ++++++++++---
>>  1 file changed, 10 insertions(+), 3 deletions(-)
>>
>> diff --git a/libatomic/config/linux/aarch64/host-config.h b/libatomic/config/linux/aarch64/host-config.h
>> index c5485d63855..3be4db6e5f8 100644
>> --- a/libatomic/config/linux/aarch64/host-config.h
>> +++ b/libatomic/config/linux/aarch64/host-config.h
>> @@ -53,8 +53,13 @@
>>  static inline bool
>>  has_lse2 (unsigned long hwcap)
>>  {
>> +  /* Check for LSE2.  */
>>    if (hwcap & HWCAP_USCAT)
>>      return true;
>> +  /* No point checking further for atomic 128-bit load/store if LSE
>> +     prerequisite not met.  */
>> +  if (!(hwcap & HWCAP_ATOMICS))
>> +    return false;
>
> This part is OK.
>
>>    if (!(hwcap & HWCAP_CPUID))
>>      return false;
>>  
>> @@ -76,12 +81,14 @@ has_lse2 (unsigned long hwcap)
>>  static inline bool
>>  has_lse128 (unsigned long hwcap)
>>  {
>> -  if (!(hwcap & HWCAP_CPUID))
>> -    return false;
>> +  /* In the absence of HWCAP_CPUID we are unable to check for LSE128; return.
>> +     If the feature check is available, check the LSE2 prerequisite first.  */
>> +  if (!(hwcap & HWCAP_CPUID) || !(hwcap & HWCAP_USCAT))
>> +     return false;
>
> The inconsistency feels wrong here.  If we're saying that HWCAP_USCAT
> is now so old that we don't need to fall back on CPUID, then it feels
> like we should have the courage of our convictions and do the same for
> has_lse2.  If instead we still want to support libcs that predate
> HWCAP_USCAT, we should do the same here too.

Sorry, scratch that, I'd misread has_lse2.  The CPUID fallback there is
only for Neoverse N1, which we know doesn't support LSE128.

So the patch is OK with the formatting fixed: the returns should be
indented by their original amount.

Thanks,
Richard

>>    unsigned long isar0;
>>    asm volatile ("mrs %0, ID_AA64ISAR0_EL1" : "=r" (isar0));
>>    if (AT_FEAT_FIELD (isar0) >= 3)
>> -    return true;
>> +      return true;
>
> The original formatting was correct.
>
> Thanks,
> Richard
>
>>    return false;
>>  }


* Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-05 11:47   ` Richard Sandiford
@ 2024-01-08 10:17     ` Victor Do Nascimento
  2024-01-08 13:06       ` Wilco Dijkstra
  0 siblings, 1 reply; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-08 10:17 UTC (permalink / raw)
  To: gcc-patches, kyrylo.tkachov, Richard.Earnshaw, richard.sandiford,
	wilco.dijkstra



On 1/5/24 11:47, Richard Sandiford wrote:
> Victor Do Nascimento <victor.donascimento@arm.com> writes:
>> The armv9.4-a architectural revision adds three new atomic operations
>> associated with the LSE128 feature:
>>
>>    * LDCLRP - Atomic AND NOT (bitclear) of a location with 128-bit
>>    value held in a pair of registers, with original data loaded into
>>    the same 2 registers.
>>    * LDSETP - Atomic OR (bitset) of a location with 128-bit value held
>>    in a pair of registers, with original data loaded into the same 2
>>    registers.
>>    * SWPP - Atomic swap of one 128-bit value with 128-bit value held
>>    in a pair of registers.
>>
>> This patch adds the logic required to make use of these when the
>> architectural feature is present and a suitable assembler is available.
>>
>> In order to do this, the following changes are made:
>>
>>    1. Add a configure-time check for LSE128 support in the
>>    assembler.
>>    2. Edit host-config.h so that when N == 16, nifunc = 2.
>>    3. Where available due to LSE128, implement the second ifunc, making
>>    use of the novel instructions.
>>    4. For atomic functions unable to make use of these new
>>    instructions, define a new alias which causes the _i1 function
>>    variant to point ahead to the corresponding _i2 implementation.
>>
>> libatomic/ChangeLog:
>>
>> 	* Makefile.am (AM_CPPFLAGS): Add conditional setting of
>> 	-DHAVE_FEAT_LSE128.
>> 	* acinclude.m4 (LIBAT_TEST_FEAT_LSE128): New.
>> 	* config/linux/aarch64/atomic_16.S (LSE128): New macro
>> 	definition.
>> 	(libat_exchange_16): New LSE128 variant.
>> 	(libat_fetch_or_16): Likewise.
>> 	(libat_or_fetch_16): Likewise.
>> 	(libat_fetch_and_16): Likewise.
>> 	(libat_and_fetch_16): Likewise.
>> 	* config/linux/aarch64/host-config.h (IFUNC_COND_2): New.
>> 	(IFUNC_NCOND): Add operand size checking.
>> 	(has_lse2): Renamed from `ifunc1`.
>> 	(has_lse128): New.
>> 	(HAS_LSE128): Likewise.
>> 	* libatomic/configure.ac: Add call to LIBAT_TEST_FEAT_LSE128.
>> 	* configure (ac_subst_vars): Regenerated via autoreconf.
>> 	* libatomic/Makefile.in: Likewise.
>> 	* libatomic/auto-config.h.in: Likewise.
>> ---
>>   libatomic/Makefile.am                        |   3 +
>>   libatomic/Makefile.in                        |   1 +
>>   libatomic/acinclude.m4                       |  19 +++
>>   libatomic/auto-config.h.in                   |   3 +
>>   libatomic/config/linux/aarch64/atomic_16.S   | 170 ++++++++++++++++++-
>>   libatomic/config/linux/aarch64/host-config.h |  29 +++-
>>   libatomic/configure                          |  59 ++++++-
>>   libatomic/configure.ac                       |   1 +
>>   8 files changed, 276 insertions(+), 9 deletions(-)
>>
>> [...]
>> diff --git a/libatomic/acinclude.m4 b/libatomic/acinclude.m4
>> index f35ab5b60a5..4197db8f404 100644
>> --- a/libatomic/acinclude.m4
>> +++ b/libatomic/acinclude.m4
>> @@ -83,6 +83,25 @@ AC_DEFUN([LIBAT_TEST_ATOMIC_BUILTIN],[
>>     ])
>>   ])
>>   
>> +dnl
>> +dnl Test if the host assembler supports armv9.4-a LSE128 insns.
>> +dnl
>> +AC_DEFUN([LIBAT_TEST_FEAT_LSE128],[
>> +  AC_CACHE_CHECK([for armv9.4-a LSE128 insn support],
>> +    [libat_cv_have_feat_lse128],[
>> +    AC_LANG_CONFTEST([AC_LANG_PROGRAM([],[asm(".arch armv9-a+lse128")])])
>> +    if AC_TRY_EVAL(ac_link); then
> 
> ac_compile should be enough for this.  The link step isn't really
> adding anything.
> 
>> +      eval libat_cv_have_feat_lse128=yes
>> +    else
>> +      eval libat_cv_have_feat_lse128=no
>> +    fi
>> +    rm -f conftest*
>> +  ])
>> +  LIBAT_DEFINE_YESNO([HAVE_FEAT_LSE128], [$libat_cv_have_feat_lse128],
>> +	[Have LSE128 support for 16 byte integers.])
>> +  AM_CONDITIONAL([ARCH_AARCH64_HAVE_LSE128], [test x$libat_cv_have_feat_lse128 = xyes])
>> +])
>> +
>>   dnl
>>   dnl Test if we have __atomic_load and __atomic_store for mode $1, size $2
>>   dnl
>> [...]
>> @@ -206,6 +211,31 @@ ENTRY (libat_exchange_16)
>>   END (libat_exchange_16)
>>   
>>   
>> +#if HAVE_FEAT_LSE128
>> +ENTRY_FEAT (libat_exchange_16, LSE128)
>> +	mov	tmp0, x0
>> +	mov	res0, in0
>> +	mov	res1, in1
>> +	cbnz	w4, 1f
>> +
>> +	/* RELAXED.  */
>> +	swpp	res0, res1, [tmp0]
>> +	ret
>> +1:
>> +	cmp	w4, ACQUIRE
>> +	b.hi	2f
>> +
>> +	/* ACQUIRE/CONSUME.  */
>> +	swppa	res0, res1, [tmp0]
>> +	ret
>> +
>> +	/* RELEASE/ACQ_REL/SEQ_CST.  */
>> +2:	swppal	res0, res1, [tmp0]
>> +	ret
>> +END_FEAT (libat_exchange_16, LSE128)
>> +#endif
> 
> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the
> others.

We started off implementing all available memory orderings.  Wilco
saw value in merging less restricted orderings into more restricted
ones - mainly to reduce code size in less frequently used atomics.

This saw us combine the RELEASE and ACQ_REL/SEQ_CST cases to make the
functions a little smaller.
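
Concretely, for SWPP the v3 patch maps the memory orders as:

   RELAXED                  ->  swpp
   ACQUIRE/CONSUME          ->  swppa
   RELEASE/ACQ_REL/SEQ_CST  ->  swppal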

> 
> Looks good otherwise.
> 
> Thanks,
> Richard


* Re: [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface
  2024-01-05 11:10   ` Richard Sandiford
@ 2024-01-08 12:33     ` Victor Do Nascimento
  2024-01-08 15:36       ` Richard Sandiford
  0 siblings, 1 reply; 14+ messages in thread
From: Victor Do Nascimento @ 2024-01-08 12:33 UTC (permalink / raw)
  To: gcc-patches, kyrylo.tkachov, Richard.Earnshaw, richard.sandiford



On 1/5/24 11:10, Richard Sandiford wrote:
> Victor Do Nascimento <victor.donascimento@arm.com> writes:
>> The introduction of further architectural-feature-dependent ifuncs
>> for AArch64 makes hard-coded ifunc `_i<n>' suffixes on functions
>> cumbersome to work with.  It is awkward to remember which ifunc maps
>> onto which arch feature, and the scheme makes the code harder to
>> maintain when new ifuncs are added and their suffixes possibly
>> altered.
>>
>> This patch uses pre-processor `#define' statements to map each suffix to
>> a descriptive feature name macro, for example:
>>
>>    #define LSE2 _i1
>>
>> and reconstructs function names with the pre-processor's token
>> concatenation feature, such that for `MACRO(<name>_i<n>)', we would
>> now have `MACRO_FEAT(name, feature)' and in the macro definition body
>> we replace `name` with `name##feature`.
> 
> FWIW, another way of doing this would be to have:
> 
> #define CORE(NAME) NAME
> #define LSE2(NAME) NAME##_i1
> 
> and use feature(name) instead of name##feature.  This has the slight
> advantage of not using ## on empty tokens, and the maybe slightly
> better advantage of not needing the extra forwarding step in:
> 
> #define ENTRY_FEAT(name, feat)		\
> 	ENTRY_FEAT1(name, feat)
> 
> #define ENTRY_FEAT1(name, feat)	\
> 
> WDYT?
> 
> Richard
> 

While from a strictly stylistic point of view, I'm not so keen on the 
resulting interface and its 'function call within a function call' look, 
e.g.

   ENTRY (LSE2 (libat_compare_exchange_16))

and

   ALIAS (LSE128 (libat_compare_exchange_16), \
          LSE2 (libat_compare_exchange_16))

on the implementation side of things, I like the benefits this brings
about, namely allowing the use of the unaltered original
implementations of the ENTRY, END and ALIAS macros, with the
aforementioned advantages of not having to use ## on empty tokens and
abolishing the need for the extra forwarding step.

I'm happy enough to go with this approach.
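
For reference, a minimal sketch of how this might look (illustrative
only, not the final revision):

   #define CORE(name)	name
   #define LSE128(name)	name##_i1
   #define LSE2(name)	name##_i2

Because the feature macro is applied at the call site, its argument
is fully expanded before the unchanged ENTRY/END/ALIAS bodies see it;
e.g. `ENTRY (LSE2 (libat_store_16))' becomes
`ENTRY (libat_store_16_i2)'.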

Cheers

>> Consequently, for base functionality, where the ifunc suffix is
>> absent, the macro interface remains the same.  For example, the entry
>> and endpoints of `libat_store_16' remain defined by:
>>
>>    - ENTRY (libat_store_16)
>> and
>>    - END (libat_store_16)
>>
>> For the LSE2 implementation of the same 16-byte atomic store, we now
>> have:
>>
>>    - ENTRY_FEAT (libat_store_16, LSE2)
>> and
>>    - END_FEAT (libat_store_16, LSE2)
>>
>> For the aliasing of ifunc names, we define the following new
>> implementation of the ALIAS macro:
>>
>>    - ALIAS (FN_BASE_NAME, FROM_SUFFIX, TO_SUFFIX)
>>
>> Defining the base feature name macro `CORE' to map to the empty
>> string, we can then alias the LSE2 `libat_exchange_16' to its base
>> implementation with:
>>
>>    - ALIAS (libat_exchange_16, LSE2, CORE)
>>
>> libatomic/ChangeLog:
>> 	* config/linux/aarch64/atomic_16.S (CORE): New macro.
>> 	(LSE2): Likewise.
>> 	(ENTRY_FEAT): Likewise.
>> 	(END_FEAT): Likewise.
>> 	(ENTRY_FEAT1): Likewise.
>> 	(END_FEAT1): Likewise.
>> 	(ALIAS): Modify macro to take in `arch' arguments.
>> ---
>>   libatomic/config/linux/aarch64/atomic_16.S | 83 +++++++++++++---------
>>   1 file changed, 49 insertions(+), 34 deletions(-)
>>
>> diff --git a/libatomic/config/linux/aarch64/atomic_16.S b/libatomic/config/linux/aarch64/atomic_16.S
>> index a099037179b..eb8e749b8a2 100644
>> --- a/libatomic/config/linux/aarch64/atomic_16.S
>> +++ b/libatomic/config/linux/aarch64/atomic_16.S
>> @@ -40,22 +40,38 @@
>>   
>>   	.arch	armv8-a+lse
>>   
>> -#define ENTRY(name)		\
>> -	.global name;		\
>> -	.hidden name;		\
>> -	.type name,%function;	\
>> -	.p2align 4;		\
>> -name:				\
>> -	.cfi_startproc;		\
>> +#define ENTRY(name) ENTRY_FEAT (name, CORE)
>> +
>> +#define ENTRY_FEAT(name, feat)		\
>> +	ENTRY_FEAT1(name, feat)
>> +
>> +#define ENTRY_FEAT1(name, feat)	\
>> +	.global name##feat;		\
>> +	.hidden name##feat;		\
>> +	.type name##feat,%function;	\
>> +	.p2align 4;			\
>> +name##feat:				\
>> +	.cfi_startproc;			\
>>   	hint	34	// bti c
>>   
>> -#define END(name)		\
>> -	.cfi_endproc;		\
>> -	.size name, .-name;
>> +#define END(name) END_FEAT (name, CORE)
>>   
>> -#define ALIAS(alias,name)	\
>> -	.global alias;		\
>> -	.set alias, name;
>> +#define END_FEAT(name, feat)			\
>> +	END_FEAT1(name, feat)
>> +
>> +#define END_FEAT1(name, feat)		\
>> +	.cfi_endproc;			\
>> +	.size name##feat, .-name##feat;
>> +
>> +#define ALIAS(alias, from, to)		\
>> +	ALIAS1(alias,from,to)
>> +
>> +#define ALIAS1(alias, from, to)		\
>> +	.global alias##from;		\
>> +	.set alias##from, alias##to;
>> +
>> +#define CORE
>> +#define LSE2	_i1
>>   
>>   #define res0 x0
>>   #define res1 x1
>> @@ -108,7 +124,7 @@ ENTRY (libat_load_16)
>>   END (libat_load_16)
>>   
>>   
>> -ENTRY (libat_load_16_i1)
>> +ENTRY_FEAT (libat_load_16, LSE2)
>>   	cbnz	w1, 1f
>>   
>>   	/* RELAXED.  */
>> @@ -128,7 +144,7 @@ ENTRY (libat_load_16_i1)
>>   	ldp	res0, res1, [x0]
>>   	dmb	ishld
>>   	ret
>> -END (libat_load_16_i1)
>> +END_FEAT (libat_load_16, LSE2)
>>   
>>   
>>   ENTRY (libat_store_16)
>> @@ -148,7 +164,7 @@ ENTRY (libat_store_16)
>>   END (libat_store_16)
>>   
>>   
>> -ENTRY (libat_store_16_i1)
>> +ENTRY_FEAT (libat_store_16, LSE2)
>>   	cbnz	w4, 1f
>>   
>>   	/* RELAXED.  */
>> @@ -160,7 +176,7 @@ ENTRY (libat_store_16_i1)
>>   	stlxp	w4, in0, in1, [x0]
>>   	cbnz	w4, 1b
>>   	ret
>> -END (libat_store_16_i1)
>> +END_FEAT (libat_store_16, LSE2)
>>   
>>   
>>   ENTRY (libat_exchange_16)
>> @@ -237,7 +253,7 @@ ENTRY (libat_compare_exchange_16)
>>   END (libat_compare_exchange_16)
>>   
>>   
>> -ENTRY (libat_compare_exchange_16_i1)
>> +ENTRY_FEAT (libat_compare_exchange_16, LSE2)
>>   	ldp	exp0, exp1, [x1]
>>   	mov	tmp0, exp0
>>   	mov	tmp1, exp1
>> @@ -270,7 +286,7 @@ ENTRY (libat_compare_exchange_16_i1)
>>   	/* ACQ_REL/SEQ_CST.  */
>>   4:	caspal	exp0, exp1, in0, in1, [x0]
>>   	b	0b
>> -END (libat_compare_exchange_16_i1)
>> +END_FEAT (libat_compare_exchange_16, LSE2)
>>   
>>   
>>   ENTRY (libat_fetch_add_16)
>> @@ -556,21 +572,20 @@ END (libat_test_and_set_16)
>>   
>>   /* Alias entry points which are the same in baseline and LSE2.  */
>>   
>> -ALIAS (libat_exchange_16_i1, libat_exchange_16)
>> -ALIAS (libat_fetch_add_16_i1, libat_fetch_add_16)
>> -ALIAS (libat_add_fetch_16_i1, libat_add_fetch_16)
>> -ALIAS (libat_fetch_sub_16_i1, libat_fetch_sub_16)
>> -ALIAS (libat_sub_fetch_16_i1, libat_sub_fetch_16)
>> -ALIAS (libat_fetch_or_16_i1, libat_fetch_or_16)
>> -ALIAS (libat_or_fetch_16_i1, libat_or_fetch_16)
>> -ALIAS (libat_fetch_and_16_i1, libat_fetch_and_16)
>> -ALIAS (libat_and_fetch_16_i1, libat_and_fetch_16)
>> -ALIAS (libat_fetch_xor_16_i1, libat_fetch_xor_16)
>> -ALIAS (libat_xor_fetch_16_i1, libat_xor_fetch_16)
>> -ALIAS (libat_fetch_nand_16_i1, libat_fetch_nand_16)
>> -ALIAS (libat_nand_fetch_16_i1, libat_nand_fetch_16)
>> -ALIAS (libat_test_and_set_16_i1, libat_test_and_set_16)
>> -
>> +ALIAS (libat_exchange_16, LSE2, CORE)
>> +ALIAS (libat_fetch_add_16, LSE2, CORE)
>> +ALIAS (libat_add_fetch_16, LSE2, CORE)
>> +ALIAS (libat_fetch_sub_16, LSE2, CORE)
>> +ALIAS (libat_sub_fetch_16, LSE2, CORE)
>> +ALIAS (libat_fetch_or_16, LSE2, CORE)
>> +ALIAS (libat_or_fetch_16, LSE2, CORE)
>> +ALIAS (libat_fetch_and_16, LSE2, CORE)
>> +ALIAS (libat_and_fetch_16, LSE2, CORE)
>> +ALIAS (libat_fetch_xor_16, LSE2, CORE)
>> +ALIAS (libat_xor_fetch_16, LSE2, CORE)
>> +ALIAS (libat_fetch_nand_16, LSE2, CORE)
>> +ALIAS (libat_nand_fetch_16, LSE2, CORE)
>> +ALIAS (libat_test_and_set_16, LSE2, CORE)
>>   
>>   /* GNU_PROPERTY_AARCH64_* macros from elf.h for use in asm code.  */
>>   #define FEATURE_1_AND 0xc0000000

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-08 10:17     ` Victor Do Nascimento
@ 2024-01-08 13:06       ` Wilco Dijkstra
  2024-01-08 14:39         ` Richard Sandiford
  0 siblings, 1 reply; 14+ messages in thread
From: Wilco Dijkstra @ 2024-01-08 13:06 UTC (permalink / raw)
  To: Victor Do Nascimento, gcc-patches, Kyrylo Tkachov,
	Richard Earnshaw, Richard Sandiford

Hi,

>> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the
>> others.
>
> We started off implementing all the available memory orderings.
> Wilco saw value in merging less restricted orderings into more
> restricted ones - mainly to reduce code size in the less frequently
> used atomics.
> 
> We therefore combined the RELEASE and ACQ_REL/SEQ_CST cases to make
> the functions a little smaller.

Benchmarking showed that LSE and LSE2 RMW atomics have similar performance once
the atomic is acquire, release or both. Given there is already a significant overhead due
to the function call, PLT indirection and argument setup, it doesn't make sense to add
extra taken branches that may mispredict or cause extra fetch cycles...
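
For concreteness, with the v3 macro names, the LSE128 exchange comes out
along these lines (a sketch of the shape being discussed, reusing the
register aliases (res0/res1, in0/in1, tmp0) and memory-model constants
already defined in atomic_16.S):

ENTRY_FEAT (libat_exchange_16, LSE128)
	mov	tmp0, x0	/* Free up x0/x1 for the result.  */
	mov	res0, in0
	mov	res1, in1
	cbnz	w4, 1f

	/* RELAXED.  */
	swpp	res0, res1, [tmp0]
	ret
1:
	cmp	w4, ACQUIRE
	b.hi	2f

	/* ACQUIRE/CONSUME.  */
	swppa	res0, res1, [tmp0]
	ret

	/* RELEASE/ACQ_REL/SEQ_CST all take the SWPPAL path.  */
2:	swppal	res0, res1, [tmp0]
	ret
END_FEAT (libat_exchange_16, LSE128)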

The goal for next GCC is to inline these instructions directly to avoid these overheads.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-08 13:06       ` Wilco Dijkstra
@ 2024-01-08 14:39         ` Richard Sandiford
  2024-01-08 16:45           ` Wilco Dijkstra
  0 siblings, 1 reply; 14+ messages in thread
From: Richard Sandiford @ 2024-01-08 14:39 UTC (permalink / raw)
  To: Wilco Dijkstra
  Cc: Victor Do Nascimento, gcc-patches, Kyrylo Tkachov, Richard Earnshaw

Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> Hi,
>
>>> Is there no benefit to using SWPPL for RELEASE here?  Similarly for the
>>> others.
>>
>> We started off implementing all the available memory orderings.
>> Wilco saw value in merging less restricted orderings into more
>> restricted ones - mainly to reduce code size in the less frequently
>> used atomics.
>>
>> We therefore combined the RELEASE and ACQ_REL/SEQ_CST cases to make
>> the functions a little smaller.
>
> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance once
> the atomic is acquire, release or both. Given there is already a significant overhead due
> to the function call, PLT indirection and argument setup, it doesn't make sense to add
> extra taken branches that may mispredict or cause extra fetch cycles...

Thanks for the extra context, especially wrt the LSE/LSE2 benchmarking.
If there isn't any difference for acquire vs. the rest, is there a
justification we can use for keeping the acquire branch, rather than
using SWPAL for everything except relaxed?

If so, then Victor, could you include that in the explanation above and
add it as a source comment?  Although maybe tone down "doesn't make
sense to add" to something like "doesn't seem worth adding". :)

Richard

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface
  2024-01-08 12:33     ` Victor Do Nascimento
@ 2024-01-08 15:36       ` Richard Sandiford
  0 siblings, 0 replies; 14+ messages in thread
From: Richard Sandiford @ 2024-01-08 15:36 UTC (permalink / raw)
  To: Victor Do Nascimento; +Cc: gcc-patches, kyrylo.tkachov, Richard.Earnshaw

Victor Do Nascimento <Victor.DoNascimento@arm.com> writes:
> On 1/5/24 11:10, Richard Sandiford wrote:
>> Victor Do Nascimento <victor.donascimento@arm.com> writes:
>>> The introduction of further architectural-feature dependent ifuncs
>>> for AArch64 makes hard-coding ifunc `_i<n>' suffixes to functions
>>> cumbersome to work with.  It is awkward to remember which ifunc maps
>>> onto which arch feature, and the code becomes harder to maintain when
>>> new ifuncs are added and their suffixes possibly altered.
>>>
>>> This patch uses pre-processor `#define' statements to map each suffix to
>>> a descriptive feature name macro, for example:
>>>
>>>    #define LSE2 _i1
>>>
>>> and reconstructs function names with the pre-processor's token
>>> concatenation feature, such that for `MACRO(<name>_i<n>)', we would
>>> now have `MACRO_FEAT(name, feature)' and in the macro definition body
>>> we replace `name` with `name##feature`.
>> 
>> FWIW, another way of doing this would be to have:
>> 
>> #define CORE(NAME) NAME
>> #define LSE2(NAME) NAME##_i1
>> 
>> and use feature(name) instead of name##feature.  This has the slight
>> advantage of not using ## on empty tokens, and the maybe slightly
>> better advantage of not needing the extra forwarding step in:
>> 
>> #define ENTRY_FEAT(name, feat)		\
>> 	ENTRY_FEAT1(name, feat)
>> 
>> #define ENTRY_FEAT1(name, feat)	\
>> 
>> WDYT?
>> 
>> Richard
>> 
>
> While, from a strictly stylistic point of view, I'm not so keen on the
> resulting interface and its 'function call within a function call'
> look, e.g.
>
>    ENTRY (LSE2 (libat_compare_exchange_16))
>
> and
>
>    ALIAS (LSE128 (libat_compare_exchange_16), \
>           LSE2 (libat_compare_exchange_16))
>
> on the implementation side of things I like the benefits this brings:
> namely, it lets the ENTRY, END and ALIAS macros keep their unaltered
> original implementations, with the aforementioned advantages of not
> having to use ## on empty tokens and of abolishing the need for the
> extra forwarding step.
>
> I'm happy enough to go with this approach.

I was thinking that the invocations would stay the same.  A C example is:

#define LSE2(NAME) NAME##_i1
#define ENTRY(NAME, FEAT) void FEAT (NAME) ()
ENTRY(foo, LSE2) {}

https://godbolt.org/z/rdn5dEMPM
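
The same trick extends to ALIAS with no forwarding step needed, e.g.
(untested sketch):

#define ALIAS(NAME, FROM, TO)	\
	.global FROM (NAME);	\
	.set FROM (NAME), TO (NAME);

so that ALIAS (libat_exchange_16, LSE2, CORE) aliases
libat_exchange_16_i1 to libat_exchange_16.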

Thanks,
Richard

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a
  2024-01-08 14:39         ` Richard Sandiford
@ 2024-01-08 16:45           ` Wilco Dijkstra
  0 siblings, 0 replies; 14+ messages in thread
From: Wilco Dijkstra @ 2024-01-08 16:45 UTC (permalink / raw)
  To: Richard Sandiford
  Cc: Victor Do Nascimento, gcc-patches, Kyrylo Tkachov, Richard Earnshaw

Hi Richard,

>> Benchmarking showed that LSE and LSE2 RMW atomics have similar performance once
>> the atomic is acquire, release or both. Given there is already a significant overhead due
>> to the function call, PLT indirection and argument setup, it doesn't make sense to add
>> extra taken branches that may mispredict or cause extra fetch cycles...
>
> Thanks for the extra context, especially wrt the LSE/LSE2 benchmarking.
> If there isn't any difference for acquire vs. the rest, is there a
> justification we can use for keeping the acquire branch, rather than
> using SWPAL for everything except relaxed?

The results showed that acquire is typically slightly faster than release (5-10%), so for
the most frequently used atomics (CAS and SWP) it makes sense to support acquire as its
own case.  In most cases, once you have release semantics, adding acquire on top didn't
make things slower, so combining release/acq_rel/seq_cst avoids unnecessary extra
branches and keeps the code small.

> If so, then Victor, could you include that in the explanation above and
> add it as a source comment?  Although maybe tone down "doesn't make
> sense to add" to something like "doesn't seem worth adding". :)

Yes it's worth adding a comment to this effect.
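
Something along these lines, perhaps (exact wording up for grabs):

/* Handle ACQUIRE separately: benchmarking showed acquire RMW atomics
   are slightly faster than release, while release/acq_rel/seq_cst
   perform much the same, so those three share the most restrictive
   implementation to save a taken branch and keep the code small.  */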

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-01-08 16:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-03  1:28 [PATCH v3 0/3] Libatomic: Add LSE128 atomics support for AArch64 Victor Do Nascimento
2024-01-03  1:28 ` [PATCH v3 1/3] libatomic: atomic_16.S: Improve ENTRY, END and ALIAS macro interface Victor Do Nascimento
2024-01-05 11:10   ` Richard Sandiford
2024-01-08 12:33     ` Victor Do Nascimento
2024-01-08 15:36       ` Richard Sandiford
2024-01-03  1:28 ` [PATCH v3 2/3] libatomic: Enable LSE128 128-bit atomics for armv9.4-a Victor Do Nascimento
2024-01-05 11:47   ` Richard Sandiford
2024-01-08 10:17     ` Victor Do Nascimento
2024-01-08 13:06       ` Wilco Dijkstra
2024-01-08 14:39         ` Richard Sandiford
2024-01-08 16:45           ` Wilco Dijkstra
2024-01-03  1:28 ` [PATCH v3 3/3] aarch64: Add explicit checks for implicit LSE/LSE2 requirements Victor Do Nascimento
2024-01-05 11:58   ` Richard Sandiford
2024-01-05 12:08     ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).