public inbox for libc-alpha@sourceware.org
* [PATCH 1/2] S390: Optimize atomic macros.
@ 2016-11-23 15:09 Stefan Liebler
  2016-11-23 15:09 ` [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock Stefan Liebler
  2016-11-24 12:51 ` [PATCH 1/2] S390: Optimize atomic macros Torvald Riegel
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Liebler @ 2016-11-23 15:09 UTC (permalink / raw)
  To: libc-alpha; +Cc: Stefan Liebler

The atomic_compare_and_exchange_val_acq macro is now implemented with
gcc __sync_val_compare_and_swap instead of an inline assembly with
a compare-and-swap instruction.
The memory is compared against the expected OLDVAL before the compare-and-swap
instruction is used, in case OLDVAL is constant at compile time.  This is used
in various locking code.  If the lock is already acquired, another cpu does not
have to lock the memory exclusively.  If OLDVAL is not constant, the
compare-and-swap instruction is used directly, as the users of this macro
usually load the current value right before invoking it.
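
As an illustration only (not part of the patch), a hypothetical caller that
passes a constant OLDVAL of zero, so a contended lock is detected with a
plain load and compare before any exclusive access is requested:

/* Hypothetical sketch; try_acquire is not a glibc function.  */
static inline int
try_acquire (int *lock)
{
  /* Constant OLDVAL 0: the macro first loads *lock and only issues the cs
     instruction if the lock currently looks free.  */
  return atomic_compare_and_exchange_val_acq (lock, 1, 0) == 0;
}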

The same applies to atomic_compare_and_exchange_bool_acq, which wasn't
defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
If the macro is used as the condition in an if/while expression, the condition
code is used to e.g. jump directly to another code sequence.  Before this
change, the old value returned by the compare-and-swap instruction was compared
with the given OLDVAL to determine if e.g. a jump is needed.
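
For example (a sketch mirroring what patch 2/2 does in the spin-lock code),
a caller can branch on the macro directly and gcc uses the condition code of
the cs instruction for that branch:

/* lock_var is a hypothetical int lock word.  */
while (atomic_compare_and_exchange_bool_acq (&lock_var, 1, 0))
  ;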

The atomic_exchange_acq macro now uses the load-and-and instruction for a
constant zero value instead of a compare-and-swap loop.  This instruction is
available on z196 zarch and higher cpus.  This is e.g. used in unlocking code.
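
A minimal sketch of the idea (assuming z196 zarch and a hypothetical int lock
word): because x & 0 is always 0, exchanging in a constant zero can be
expressed with __sync_fetch_and_and, which gcc can emit as a single
load-and-and instruction on z196 and newer:

static inline int
unlock_get_old (int *lock)
{
  /* Atomically store 0 and return the previous value.  */
  return __sync_fetch_and_and (lock, 0);
}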

The newly defined atomic_exchange_and_add macro is implemented with the gcc
builtin __sync_fetch_and_add, which uses the load-and-add instruction on z196
zarch and higher cpus instead of a loop with the compare-and-swap instruction.
The same applies to the atomic_or_val, atomic_and_val, ... macros, which use
the appropriate z196 instructions.

The macros lll_trylock and lll_cond_trylock are extended by a __glibc_unlikely
hint.  With the hint, gcc on s390 emits code in e.g. pthread_mutex_trylock
which does not use jumps if the lock is free.  Without the hint, it has
to jump if the lock is free.

ChangeLog:

	* sysdeps/s390/atomic-machine.h
	(__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN): New define.
	(atomic_compare_and_exchange_val_acq):
	Use __sync_val_compare_and_swap and first compare with a non-atomic
	instruction if OLDVAL is constant.
	(atomic_compare_and_exchange_bool_acq): New define.
	(atomic_exchange_acq): Use load-and-and instruction for constant
	zero values, if available.
	(atomic_exchange_and_add, catomic_exchange_and_add, atomic_or_val,
	atomic_or, catomic_or, atomic_bit_test_set, atomic_and_val,
	atomic_and, catomic_and): New define.
	* sysdeps/unix/sysv/linux/s390/lowlevellock.h:
	(lll_trylock, lll_cond_trylock): New define.
---
 sysdeps/s390/atomic-machine.h               | 195 ++++++++++++++++++++--------
 sysdeps/unix/sysv/linux/s390/lowlevellock.h |  25 +++-
 2 files changed, 160 insertions(+), 60 deletions(-)

diff --git a/sysdeps/s390/atomic-machine.h b/sysdeps/s390/atomic-machine.h
index 4ba4107..650bf98 100644
--- a/sysdeps/s390/atomic-machine.h
+++ b/sysdeps/s390/atomic-machine.h
@@ -45,76 +45,159 @@ typedef uintmax_t uatomic_max_t;
 
 #define USE_ATOMIC_COMPILER_BUILTINS 0
 
-
-#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("cs %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" (newval), "m" (*__archmem) : "cc", "memory" );	\
-     __archold; })
-
 #ifdef __s390x__
 # define __HAVE_64B_ATOMICS 1
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("csg %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" ((long) (newval)), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
 #else
 # define __HAVE_64B_ATOMICS 0
-/* For 31 bit we do not really need 64-bit compare-and-exchange. We can
-   implement them by use of the csd instruction. The straightforward
-   implementation causes warnings so we skip the definition for now.  */
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
 #endif
 
+#ifdef HAVE_S390_MIN_Z196_ZARCH_ASM_SUPPORT
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 1
+#else
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 0
+#endif
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return the old *MEM value.  */
+/* Compare *MEM against expected OLDVAL before using compare-and-swap
+   instruction in case of OLDVAL is constant.  This is used in various
+   locking code.  If the lock is already acquired another cpu has not to
+   exclusively lock the memory.  */
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    __typeof (*(mem)) __atg1_ret;					\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect ((__atg1_ret = *(mem))			\
+			     == (oldval), 1))				\
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),	\
+						(newval));		\
+    __atg1_ret; })
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return zero if *MEM was changed or non-zero if no exchange happened.  */
+/* Same as with atomic_compare_and_exchange_val_acq,  constant OLDVALs are
+   compared before using compare-and-swap instruction.  As this macro is
+   normally used in conjunction with if or while, gcc emits a conditional-
+   branch instruction to use the condition-code of compare-and-swap instruction
+   instead of comparing the old value.  */
+#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    int __atg3_ret = 1;							\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect (*(mem) == (oldval), 1))			\
+      __atg3_ret = !__sync_bool_compare_and_swap ((mem), (oldval),	\
+						  (newval));		\
+    __atg3_ret; })
+
 /* Store NEWVALUE in *MEM and return the old value.  */
 /* On s390, the atomic_exchange_acq is different from generic implementation,
    because the generic one does not use the condition-code of cs-instruction
-   to determine if looping is needed. Instead it saves the old-value and
-   compares it against old-value returned by cs-instruction.  */
+   to determine if looping is needed.  Instead it saves the old-value and
+   compares it against old-value returned by cs-instruction.
+   Setting a constant zero can be done with load-and-and instruction which
+   is available on a z196 zarch and higher cpus.  This is used in unlocking
+   code.  */
 #ifdef __s390x__
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
-     else if (sizeof (*mem) == 8)					\
-       __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
-			     "   jl 0b"					\
-			     : "+d" ( __atg5_oldval), "=Q" (*__atg5_memp) \
-			     : "d" ((long) __atg5_value), "m" (*__atg5_memp) \
-			     : "cc", "memory" );			\
-     else								\
-       abort ();							\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
+    else								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else if (sizeof (*(mem)) == 8)					\
+	  __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" ( __atg5_oldval),		\
+				  "=Q" (*__atg5_memp)			\
+				: "d" ((long) __atg5_value),		\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
      __atg5_oldval; })
 #else
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
     else								\
-      abort ();								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
     __atg5_oldval; })
 #endif
+
+/* Add VALUE to *MEM and return the old value of *MEM.  */
+/* The gcc builtin uses load-and-add instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_exchange_and_add(mem, value)	\
+  __sync_fetch_and_add (mem, value)
+#define catomic_exchange_and_add(mem, value)	\
+  atomic_exchange_and_add (mem, value)
+
+/* Atomically *mem |= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-or instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_or_val(mem, mask)		\
+  __sync_fetch_and_or (mem, mask)
+/* Atomically *mem |= mask.  */
+#define atomic_or(mem, mask)			\
+  do {						\
+    atomic_or_val (mem, mask);			\
+  } while (0)
+#define catomic_or(mem, mask)			\
+  atomic_or (mem, mask)
+
+/* Atomically *mem |= 1 << bit and return true if the bit was set in old value
+   of *mem.  */
+/* The load-and-or instruction is used on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_bit_test_set(mem, bit)					\
+  ({ __typeof (*(mem)) __atg14_old;					\
+    __typeof (mem) __atg14_memp = (mem);				\
+    __typeof (*(mem)) __atg14_mask = ((__typeof (*(mem))) 1 << (bit));	\
+    __atg14_old = atomic_or_val (__atg14_memp, __atg14_mask);		\
+    __atg14_old & __atg14_mask; })
+
+/* Atomically *mem &= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-and instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_and_val(mem, mask)		\
+  __sync_fetch_and_and (mem, mask)
+/* Atomically *mem &= mask.  */
+#define atomic_and(mem, mask)			\
+  do {						\
+    atomic_and_val (mem, mask);			\
+  } while (0)
+#define catomic_and(mem, mask)			\
+  atomic_and(mem, mask)
diff --git a/sysdeps/unix/sysv/linux/s390/lowlevellock.h b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
index ada2e5b..8d564ed 100644
--- a/sysdeps/unix/sysv/linux/s390/lowlevellock.h
+++ b/sysdeps/unix/sysv/linux/s390/lowlevellock.h
@@ -21,13 +21,30 @@
 
 #include <sysdeps/nptl/lowlevellock.h>
 
+#undef lll_trylock
+/* If LOCK is 0 (not acquired), set to 1 (acquired with no waiters) and return
+   0.  Otherwise leave lock unchanged and return non-zero to indicate that the
+   lock was not acquired.  */
+/* With the __glibc_unlikely hint gcc on s390 emits code in e.g.
+   pthread_mutex_trylock which does not use jumps in case the lock is free.
+   Without the hint it has to jump if the lock is free.  */
+#define lll_trylock(lock)						\
+  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 1, 0))
+
+#undef lll_cond_trylock
+/* If LOCK is 0 (not acquired), set to 2 (acquired, possibly with waiters) and
+   return 0.  Otherwise leave lock unchanged and return non-zero to indicate
+   that the lock was not acquired.  */
+#define lll_cond_trylock(lock)						\
+  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
+
 /* Transactional lock elision definitions.  */
-# ifdef ENABLE_LOCK_ELISION
+#ifdef ENABLE_LOCK_ELISION
 extern int __lll_timedlock_elision
   (int *futex, short *adapt_count, const struct timespec *timeout, int private)
   attribute_hidden;
 
-#  define lll_timedlock_elision(futex, adapt_count, timeout, private)	\
+# define lll_timedlock_elision(futex, adapt_count, timeout, private)	\
   __lll_timedlock_elision(&(futex), &(adapt_count), timeout, private)
 
 extern int __lll_lock_elision (int *futex, short *adapt_count, int private)
@@ -39,12 +56,12 @@ extern int __lll_unlock_elision(int *futex, int private)
 extern int __lll_trylock_elision(int *futex, short *adapt_count)
   attribute_hidden;
 
-#  define lll_lock_elision(futex, adapt_count, private) \
+# define lll_lock_elision(futex, adapt_count, private) \
   __lll_lock_elision (&(futex), &(adapt_count), private)
 #  define lll_unlock_elision(futex, adapt_count, private) \
   __lll_unlock_elision (&(futex), private)
 #  define lll_trylock_elision(futex, adapt_count) \
   __lll_trylock_elision(&(futex), &(adapt_count))
-# endif  /* ENABLE_LOCK_ELISION */
+#endif  /* ENABLE_LOCK_ELISION */
 
 #endif	/* lowlevellock.h */
-- 
2.5.5

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-11-23 15:09 [PATCH 1/2] S390: Optimize atomic macros Stefan Liebler
@ 2016-11-23 15:09 ` Stefan Liebler
  2016-11-24 13:04   ` Torvald Riegel
  2016-11-24 12:51 ` [PATCH 1/2] S390: Optimize atomic macros Torvald Riegel
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Liebler @ 2016-11-23 15:09 UTC (permalink / raw)
  To: libc-alpha; +Cc: Stefan Liebler

This patch changes the behaviour of pthread_spin_lock on s390:
Instead of spinning on the lock with the compare-and-swap instruction,
the atomic_compare_and_exchange_bool_acq macro is used.
The s390 atomic_compare_and_exchange_bool_acq macro, called with constant
zero as oldval, first compares the lock with normal instructions.  If it is
free, the compare-and-swap instruction is used to acquire the lock.  While
the lock is held by one cpu, another cpu will not exclusively lock the
memory of the lock in a loop.  If the lock is unlocked by another cpu,
that is observed by the normal instructions and the lock is acquired
with a compare-and-swap instruction.

ChangeLog:

	* sysdeps/s390/nptl/pthread_spin_lock.c:
	Use atomic_compare_and_exchange_bool_acq macro instead of
	inline assembly.
	* sysdeps/s390/nptl/pthread_spin_trylock.c: Likewise.
---
 sysdeps/s390/nptl/pthread_spin_lock.c    | 13 +++++++------
 sysdeps/s390/nptl/pthread_spin_trylock.c | 12 +++++-------
 2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/sysdeps/s390/nptl/pthread_spin_lock.c b/sysdeps/s390/nptl/pthread_spin_lock.c
index def6a24..9c06949 100644
--- a/sysdeps/s390/nptl/pthread_spin_lock.c
+++ b/sysdeps/s390/nptl/pthread_spin_lock.c
@@ -21,12 +21,13 @@
 int
 pthread_spin_lock (pthread_spinlock_t *lock)
 {
-  int oldval;
+  /* The s390 atomic_compare_and_exchange_bool_acq macro called with constant
+     zero as oldval first compares the lock with normal instructions.  If it is
+     free the compare-and-swap instruction is used to acquire the lock.  While
+     the lock is held by one cpu another cpu will not exclusively lock the
+     memory of the lock in a loop.  */
+  while (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
+    ;
 
-  __asm__ __volatile__ ("0: lhi %0,0\n"
-			"   cs  %0,%2,%1\n"
-			"   jl  0b"
-			: "=&d" (oldval), "=Q" (*lock)
-			: "d" (1), "m" (*lock) : "cc" );
   return 0;
 }
diff --git a/sysdeps/s390/nptl/pthread_spin_trylock.c b/sysdeps/s390/nptl/pthread_spin_trylock.c
index 4c00e08..d9e88ef 100644
--- a/sysdeps/s390/nptl/pthread_spin_trylock.c
+++ b/sysdeps/s390/nptl/pthread_spin_trylock.c
@@ -22,11 +22,9 @@
 int
 pthread_spin_trylock (pthread_spinlock_t *lock)
 {
-  int old;
-
-  __asm__ __volatile__ ("cs %0,%3,%1"
-			: "=d" (old), "=Q" (*lock)
-			: "0" (0), "d" (1), "m" (*lock) : "cc" );
-
-  return old != 0 ? EBUSY : 0;
+  /* The s390 atomic_compare_and_exchange_bool_acq macro called with constant
+     zero as oldval first compares the lock with normal instructions.  If it is
+     free the compare-and-swap instruction is used to acquire the lock.  */
+  return __glibc_unlikely (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
+    ? EBUSY : 0;
 }
-- 
2.5.5

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-11-23 15:09 [PATCH 1/2] S390: Optimize atomic macros Stefan Liebler
  2016-11-23 15:09 ` [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock Stefan Liebler
@ 2016-11-24 12:51 ` Torvald Riegel
  2016-12-09 15:17   ` Stefan Liebler
  1 sibling, 1 reply; 15+ messages in thread
From: Torvald Riegel @ 2016-11-24 12:51 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> The atomic_compare_and_exchange_val_acq macro is now implemented with
> gcc __sync_val_compare_and_swap instead of an inline assembly with
> compare-and-swap instruction.

Can you use the new GCC atomic builtins, or are they still insufficient
on s390?

Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
Generally, this is where we want to get to, so that we have to maintain
as few atomic operations as possible.

Also note that we'll be phasing out the old-style glibc atomics
eventually and use the new C11-like atomics.

> The memory is compared against expected OLDVAL before using compare-and-swap
> instruction in case of OLDVAL is constant at compile time.  This is used in
> various locking code.  If the lock is already acquired, another cpu has not to
> exclusively lock the memory.  If OLDVAL is not constant the compare-and-swap
> instruction is used directly as the usages of this macro usually load the
> current value in front of this macro.

Generally, I think the compiler should do these optimizations (under the
assumption that we drive these using builtin_expect and the like).  If
GCC is not doing these (with new _atomic* builtins), is there a patch about
that or at least a BZ?  I'd be okay with accepting an intermediate
solution that optimizes atomics in glibc, but I'd want to make this
clear in code comments and reference the GCC BZ that makes the
improvement request; this ensures that we can drop the glibc hack once
GCC does what you want it to do.

For this particular case, I'm hoping we can just fix this in the
compiler.

Regarding your patch, the compiler barrier is in the wrong position for
an acquire MO operation; for the new C11-like atomics, it would be
unnecessary because we only guarantee relaxed MO in the failure case.

> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
> defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
> If the macro is used as condition in an if/while expression, the condition
> code is used to e.g. jump directly to another code sequence.  Before this
> change, the old value returned by compare-and-swap instruction was compared
> with given OLDVAL to determine if e.g. a jump is needed.
> 
> The atomic_exchange_acq macro is now using the load-and-and instruction for a
> constant zero value instead of a compare-and-swap loop.  This instruction is
> available on a z196 zarch and higher cpus.  This is e.g. used in unlocking code.

See above.  This should be fixed in the compiler.

> The newly defined atomic_exchange_and_add macro is implemented with gcc
> builtin __sync_fetch_and_add which uses load-and-add instruction on z196 zarch
> and higher cpus instead of a loop with compare-and-swap instruction.
> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
> the appropriate z196 instruction.

Can you just use the new _atomic builtins?  The compiler should just
give us _atomic* builtins that are optimized, and we should use them.

With that in place, we could just implement the old-style glibc atomic
operations based on the _atomic* builtins, and have much less code to
maintain.

> The macros lll_trylock, lll_cond_trylock are extended by an __glibc_unlikely
> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
> which does not use jumps in case the lock is free.  Without the hint it had
> to jump if the lock was free.

I think it's debatable whether trylock should expect the lock to be
free.  OTOH, if it's not free, I guess that we should be in the slow
path in most use cases anyway.  Either way, I think it's a choice we
should make generically for all architectures, and I want to keep
arch-specific code to a minimum.  Thus, please split this out and
propose it for the generic lowlevellock.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-11-23 15:09 ` [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock Stefan Liebler
@ 2016-11-24 13:04   ` Torvald Riegel
  2016-12-09 15:17     ` Stefan Liebler
  0 siblings, 1 reply; 15+ messages in thread
From: Torvald Riegel @ 2016-11-24 13:04 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> This patch changes the behaviour of pthread_spin_lock on s390:
> Instead of spinning on the lock with compare-and-swap instruction,
> the atomic_compare_and_exchange_bool_acq macro is used.
> The s390 atomic_compare_and_exchange_bool_acq macro called with constant
> zero as oldval first compares the lock with normal instructions.  If it is
> free the compare-and-swap instruction is used to aquire the lock.  While
> the lock is held by one cpu another cpu will not exclusively lock the
> memory of the lock in a loop.  If the lock is unlocked by another cpu
> it is observed by the normal instruction and the lock is acquired
> with a compare-and-swap instruction.

I don't want to have more arch-specific code than we really need.
Please compare what you have against the generic spinlock code.  If you
find the generic spinlock code lacking, then please propose a change for
it in a way that does not make things worse for other architectures.

If there are arch-specific properties that significantly affect the
approach the generic spinlock takes (e.g., assumptions about CAS vs
atomic-exchange runtime overheads), then please split them out.

In the long term, this will also benefit s390.  For example, the
spinlock code you have has no backoff, and introduces spinning with
loads in a pretty ugly way through the hack you have added to the CAS
(first loading the lock's value and comparing it manually if the
supplied argument is a constant).
While the generic spinlock code we have is also not very great,
improving it is what will make life easier for the whole glibc project.
Also, if someone else improves the generic spinlock code, s390 would
miss out on this if you have added your custom spinlock code.
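
For illustration only (not a concrete proposal), spinning with loads done
explicitly might look roughly like this with the generic C11-like glibc
atomics, assuming an int *lock:

while (atomic_compare_and_exchange_bool_acq (lock, 1, 0))
  {
    /* Wait with plain loads until the lock looks free again.  */
    while (atomic_load_relaxed (lock) != 0)
      atomic_spin_nop ();
  }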

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-11-24 12:51 ` [PATCH 1/2] S390: Optimize atomic macros Torvald Riegel
@ 2016-12-09 15:17   ` Stefan Liebler
  2016-12-16 17:08     ` Stefan Liebler
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Liebler @ 2016-12-09 15:17 UTC (permalink / raw)
  To: libc-alpha

On 11/24/2016 01:51 PM, Torvald Riegel wrote:
> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>> The atomic_compare_and_exchange_val_acq macro is now implemented with
>> gcc __sync_val_compare_and_swap instead of an inline assembly with
>> compare-and-swap instruction.
>
> Can you use the new GCC atomic builtins, or are they still insufficient
> on s390?
>
> Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
> Generally, this is where we want to get to, so that we have to maintain
> as few atomic operations as possible.
>
> Also note that we'll be phasing out the old-style glibc atomics
> eventually and use the new C11-like atomics.
I've activated it and tried an example like this (also compare it with 
pthread_once):
int
foo (int *lock)
{
   int val, newval;
   val = atomic_load_acquire (lock);
   do
     {
       newval = val | 123;
     }
   while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));

   return 0;
}

The assembly produced by GCC 5.4/6.2 contains a new stack frame; the old
value is stored to and loaded from the stack frame before the
cs instruction.
After the cs, the old value is stored again and reloaded if a further
round of cs is needed.
The condition code of cs is extracted into a register and then compared
to determine if a further round of cs is needed.

The cs instruction itself already updates the old-value register, so a
conditional jump could be used directly for a further round.

An upstream gcc produces better code, but it's also not perfect!
I've already informed Andreas Krebbel.

Thus I won't set USE_ATOMIC_COMPILER_BUILTINS to 1 now.
Once gcc is fixed, I will activate it only for that version and newer.

>
>> The memory is compared against expected OLDVAL before using compare-and-swap
>> instruction in case of OLDVAL is constant at compile time.  This is used in
>> various locking code.  If the lock is already acquired, another cpu has not to
>> exclusively lock the memory.  If OLDVAL is not constant the compare-and-swap
>> instruction is used directly as the usages of this macro usually load the
>> current value in front of this macro.
>
> Generally, I think the compiler should do these optimizations (under the
> assumption that we drive these using builtin_expect and the like).  If
> GCC not doing these (with new _atomic* builtins), is there a patch about
> that or at least a BZ?  I'd be okay with accepting an intermediate
> solution that optimizes atomics in glibc, but I'd want to make this
> clear in code comments and reference the GCC BZ that makes the
> improvement request; this ensures that we can drop the glibc hack once
> GCC does what you want it to do.
>
No. This comparison will not be included in the gcc builtins, as they
are intended to be only the compare-and-swap instruction.

> For this particular case, I'm hoping we can just fix this in the
> compiler.
>
> Regarding your patch, the compiler barrier is in the wrong position for
> an acquire MO operation; for the new C11-like atomics, it would be
> unnecessary because we only guarantee relaxed MO in the failure case.
>
If *mem is equal to the constant oldval, the sync builtin acts as a
full barrier after loading *mem.  If *mem is not equal to the constant
oldval, you are right: there is no barrier after the load, and the
compiler could hoist memory accesses to before the load of *mem.
Thus I can add the compiler barrier in the else path, as below.
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)\
+  ({ __asm__ __volatile__ ("" ::: "memory");			\
+    __typeof (*(mem)) __atg1_ret;				\
+    if (!__builtin_constant_p (oldval)				\
+	|| __builtin_expect ((__atg1_ret = *(mem))		\
+			     == (oldval), 1))			\
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),\
+						(newval));	\
+    else							\
+      __asm__ __volatile__ ("" ::: "memory");			\
+    __atg1_ret; })
The intention of the full barrier before the load and the if-statement
is to "force" the load and compare directly before the cs instruction.

>> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
>> defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
>> If the macro is used as condition in an if/while expression, the condition
>> code is used to e.g. jump directly to another code sequence.  Before this
>> change, the old value returned by compare-and-swap instruction was compared
>> with given OLDVAL to determine if e.g. a jump is needed.
I will add the barrier here in the same way as mentioned above.
>>
>> The atomic_exchange_acq macro is now using the load-and-and instruction for a
>> constant zero value instead of a compare-and-swap loop.  This instruction is
>> available on a z196 zarch and higher cpus.  This is e.g. used in unlocking code.
>
> See above.  This should be fixed in the compiler.
There is no sync builtin which can be used for exchanging the value as
done in atomic_exchange_acq.  Thus I have to do it in this glibc macro.
But you are right, the compiler can fix this for the C11 builtin
__atomic_exchange_n.
But even if it is fixed in a new version, builds with older compilers
won't use the load-and-and instruction.
In the case of e.g. lll_unlock, an additional register loaded with the
old value and an additional jump are then needed.
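
For reference, the C11-style form this would eventually map to is the real
gcc builtin __atomic_exchange_n; the definition below is only a sketch of
what a builtins-based macro could look like once the compiler emits the
desired code:

#define atomic_exchange_acquire(mem, newvalue) \
  __atomic_exchange_n ((mem), (newvalue), __ATOMIC_ACQUIRE)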

>
>> The newly defined atomic_exchange_and_add macro is implemented with gcc
>> builtin __sync_fetch_and_add which uses load-and-add instruction on z196 zarch
>> and higher cpus instead of a loop with compare-and-swap instruction.
>> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
>> the appropriate z196 instruction.
>
> Can you just use the new _atomic builtins?  The compiler should just
> give use _atomic* builtins that are optimized, and we should use them.
>
I've checked whether __atomic_fetch_or|add use the lao|laa instructions,
and yes, they use them, too.
As there are issues with the __atomic* builtins as mentioned above,
the usage of the __sync_fetch_* builtins is currently needed here.
I don't intend to use a mix of __sync and __atomic builtins.

> With that in place, we could just implement the old-style glibc atomic
> operations based on the _atomic* builtins, and have much less code to
> maintain.
>
Yes, you are right, this is the correct way forward.
I saw that at least aarch64 is already doing it this way.
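
A rough sketch of that direction (assuming USE_ATOMIC_COMPILER_BUILTINS is 1
and the compiler generates good code; the exact mapping would live in the
generic headers):

#define atomic_exchange_and_add_acq(mem, value) \
  __atomic_fetch_add ((mem), (value), __ATOMIC_ACQUIRE)
#define atomic_exchange_and_add_rel(mem, value) \
  __atomic_fetch_add ((mem), (value), __ATOMIC_RELEASE)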

>> The macros lll_trylock, lll_cond_trylock are extended by an __glibc_unlikely
>> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
>> which does not use jumps in case the lock is free.  Without the hint it had
>> to jump if the lock was free.
>
> I think it's debatable whether trylock should expect the lock to be
> free.  OTOH, if it's not free, I guess that we should be in the slow
> path in most use cases anyway.  Either way, I think it's a choice we
> should make generically for all architectures, and I want to keep
> arch-specific code to a minimum.  Thus, please split this out and
> propose it for the generic lowlevellock.
>
Okay.  I will send a separate patch to propose this change for common code.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-11-24 13:04   ` Torvald Riegel
@ 2016-12-09 15:17     ` Stefan Liebler
  2016-12-16 17:08       ` Stefan Liebler
  2016-12-16 20:12       ` Florian Weimer
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Liebler @ 2016-12-09 15:17 UTC (permalink / raw)
  To: libc-alpha

On 11/24/2016 02:04 PM, Torvald Riegel wrote:
> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>> This patch changes the behaviour of pthread_spin_lock on s390:
>> Instead of spinning on the lock with compare-and-swap instruction,
>> the atomic_compare_and_exchange_bool_acq macro is used.
>> The s390 atomic_compare_and_exchange_bool_acq macro called with constant
>> zero as oldval first compares the lock with normal instructions.  If it is
>> free the compare-and-swap instruction is used to acquire the lock.  While
>> the lock is held by one cpu another cpu will not exclusively lock the
>> memory of the lock in a loop.  If the lock is unlocked by another cpu
>> it is observed by the normal instruction and the lock is acquired
>> with a compare-and-swap instruction.
>
> I don't want to have more arch-specific code than we really need.
> Please compare what you have against the generic spinlock code.  If you
> find the generic spinlock code lacking, then please propose a change for
> it in a way that does not make things worse for other architectures.
>
> If there are arch-specific properties that significantly affect the
> approach the generic spinlock takes (eg, assumptions about CAS vs
> atomic-exchange runtime overheads), then please split them out.
>
> In the long term, this will also benefit s390.  For example, the
> spinlock code you have has no backoff, and introduces spinning with
> loads in a pretty ugly way through the hack you have added to the CAS
> (first loading the lock's value and comparing it manually if the
> supplied argument is a constant).
> While the generic spinlock code we have is also not very great,
> improving it is what will make life easier for the whole glibc project.
> Also, if someone else improves the generic spinlock code, s390 would
> miss out on this if you have added your custom spinlock code.
>

I had a look at the assembler output of the generic spinlock code, e.g.:
int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
   return atomic_exchange_acq (lock, 1) ? EBUSY : 0;
}

In the s390x assembler output, a new stack frame is generated; the lock
value is loaded, stored to the stack, loaded from the stack and then passed
to the cs instruction.
After the cs instruction, the "old value" is stored to the stack, loaded
from the stack and then compared against zero.

I also had a look at the aarch64 pthread_spin_trylock.os compiled with the
build-many script and gcc 6.2.
aarch64 is using the __atomic builtins for atomic_exchange_acq,
atomic_compare_and_exchange_val_acq, etc.
The atomic_exchange_acq results in the exclusive load/store.  Then the
old value is also stored to the stack and is immediately loaded back in
order to compare it against zero.

The type pthread_spinlock_t is a volatile int!
If lock is cast to (int *) before being passed to the atomic macros,
the assembler output looks okay.
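
I.e. something like this (only a sketch):

int
pthread_spin_trylock (pthread_spinlock_t *lock)
{
  /* Cast away the volatile from pthread_spinlock_t before using the
     atomic macro.  */
  return atomic_exchange_acq ((int *) lock, 1) ? EBUSY : 0;
}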

If I change the current generic spinlock code, do I have to rewrite it
to use the C11-like macros?

I've tested the following function in advance:
int
foo (pthread_spinlock_t *lock)
{
   return atomic_load_acquire (lock);
}

With the __atomic_load_n version of atomic_load_acquire, the assembler
output contains a load whose result is returned, as expected.

The non-c11-macro results in the following:
    0:   58 10 20 00             l       %r1,0(%r2)
    4:   b3 c1 00 0f             ldgr    %f0,%r15
    8:   e3 f0 ff 58 ff 71       lay     %r15,-168(%r15)
    e:   50 10 f0 a4             st      %r1,164(%r15)
   12:   58 10 f0 a4             l       %r1,164(%r15)
   16:   50 10 f0 a0             st      %r1,160(%r15)
   1a:   58 20 f0 a0             l       %r2,160(%r15)
   1e:   b3 cd 00 f0             lgdr    %r15,%f0
   22:   b9 14 00 22             lgfr    %r2,%r2
   26:   07 fe                   br      %r14
The extra stores/loads to/from the stack result from the "__typeof (*(mem))
__atg101_val" usage in the atomic_load_* macros in conjunction with volatile.

As this behaviour is not s390 specific, I've grep'ed to see which arches
use the generic spinlock code and whether they use the C11-like atomic macros:
sysdeps/hppa/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/unix/sysv/linux/hppa/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/microblaze/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/microblaze/atomic-machine.h:39:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/aarch64/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/aarch64/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 1

sysdeps/mips/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/mips/atomic-machine.h:95:#define USE_ATOMIC_COMPILER_BUILTINS 1
sysdeps/mips/atomic-machine.h:216:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/nios2/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/unix/sysv/linux/nios2/atomic-machine.h:35:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/arm/nptl/pthread_spin_lock.c:18:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/arm/atomic-machine.h:37:#define USE_ATOMIC_COMPILER_BUILTINS 0

sysdeps/m68k/nptl/pthread_spin_lock.c:19:#define SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
sysdeps/m68k/m680x0/m68020/atomic-machine.h:48:#define USE_ATOMIC_COMPILER_BUILTINS 0
sysdeps/m68k/coldfire/atomic-machine.h:54:#define USE_ATOMIC_COMPILER_BUILTINS 0
sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 0

What is your suggestion on how to proceed with the volatile int type in
conjunction with the atomic macros?

Bye Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-12-09 15:17   ` Stefan Liebler
@ 2016-12-16 17:08     ` Stefan Liebler
  2016-12-16 19:15       ` Torvald Riegel
  2016-12-16 19:18       ` Torvald Riegel
  0 siblings, 2 replies; 15+ messages in thread
From: Stefan Liebler @ 2016-12-16 17:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: Torvald Riegel

[-- Attachment #1: Type: text/plain, Size: 7892 bytes --]

On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> On 11/24/2016 01:51 PM, Torvald Riegel wrote:
>> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>>> The atomic_compare_and_exchange_val_acq macro is now implemented with
>>> gcc __sync_val_compare_and_swap instead of an inline assembly with
>>> compare-and-swap instruction.
>>
>> Can you use the new GCC atomic builtins, or are they still insufficient
>> on s390?
>>
>> Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
>> Generally, this is where we want to get to, so that we have to maintain
>> as few atomic operations as possible.
>>
>> Also note that we'll be phasing out the old-style glibc atomics
>> eventually and use the new C11-like atomics.
> I've activated it and tried an example like this (also compare it with
> pthread_once):
> int
> foo (int *lock)
> {
>   int val, newval;
>   val = atomic_load_acquire (lock);
>   do
>     {
>       newval = val | 123;
>     }
>   while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));
>
>   return 0;
> }
>
> The assembly produced by GCC 5.4/6.2 contains a new stack-frame, the old
> value is stored and loaded to/from the stack-frame before the
> cs-instruction.
> After the cs, the old value is stored again and reloaded if a further
> round of cs is needed.
> The condition-code of cs is extracted into a register and then compared
> to determine if a further round of cs is needed.
>
> The cs-instruction itself updates the old-value-register and a
> conditional jump can be used for a further round.
>
> An upstream gcc produces better code, but it's also not perfect!
> I've already informed Andreas Krebbel.
>
> Thus I won't set USE_ATOMIC_COMPILER_BUILTINS to 1 now.
> And if gcc is fixed, I will activate it only for this version and newer.
>
>>
>>> The memory is compared against expected OLDVAL before using
>>> compare-and-swap
>>> instruction in case of OLDVAL is constant at compile time.  This is
>>> used in
>>> various locking code.  If the lock is already acquired, another cpu
>>> has not to
>>> exclusively lock the memory.  If OLDVAL is not constant the
>>> compare-and-swap
>>> instruction is used directly as the usages of this macro usually load
>>> the
>>> current value in front of this macro.
>>
>> Generally, I think the compiler should do these optimizations (under the
>> assumption that we drive these using builtin_expect and the like).  If
>> GCC not doing these (with new _atomic* builtins), is there a patch about
>> that or at least a BZ?  I'd be okay with accepting an intermediate
>> solution that optimizes atomics in glibc, but I'd want to make this
>> clear in code comments and reference the GCC BZ that makes the
>> improvement request; this ensures that we can drop the glibc hack once
>> GCC does what you want it to do.
>>
> No. This comparison will not be included in the gcc builtins as they
> are intended to only be compare-and-swap instruction.
>
>> For this particular case, I'm hoping we can just fix this in the
>> compiler.
>>
>> Regarding your patch, the compiler barrier is in the wrong position for
>> an acquire MO operation; for the new C11-like atomics, it would be
>> unnecessary because we only guarantee relaxed MO in the failure case.
>>
> If *mem is equal to constant oldval, the sync builtin is considered as
> full barrier after loading *mem. If *mem is not equal to constant
> oldval, you are right there is no barrier after the load and the
> compiler could hoist memory accesses before the load of mem!?
> Thus I can add the compiler barrier in the else path like below.
> +#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)\
> +  ({ __asm__ __volatile__ ("" ::: "memory");            \
> +    __typeof (*(mem)) __atg1_ret;                \
> +    if (!__builtin_constant_p (oldval)                \
> +    || __builtin_expect ((__atg1_ret = *(mem))        \
> +                 == (oldval), 1))            \
> +      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),\
> +                        (newval));    \
> +    else                            \
> +      __asm__ __volatile__ ("" ::: "memory");            \
> +    __atg1_ret; })
> The intention of the full barrier before the load and the if-statement
> is to "force" the load and compare directly before cs instruction.
>
>>> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
>>> defined before.  Now it is implemented with gcc
>>> __sync_bool_compare_and_swap.
>>> If the macro is used as condition in an if/while expression, the
>>> condition
>>> code is used to e.g. jump directly to another code sequence.  Before
>>> this
>>> change, the old value returned by compare-and-swap instruction was
>>> compared
>>> with given OLDVAL to determine if e.g. a jump is needed.
> I will add the barrier here in the same way as mention above.
>>>
>>> The atomic_exchange_acq macro is now using the load-and-and
>>> instruction for a
>>> constant zero value instead of a compare-and-swap loop.  This
>>> instruction is
>>> available on a z196 zarch and higher cpus.  This is e.g. used in
>>> unlocking code.
>>
>> See above.  This should be fixed in the compiler.
> There is no sync-builtin which can be used for exchanging the value as
> done in atomic_exchange_acq. Thus I have to do it in this glibc-macro.
> But you are right, the compiler can fix this for c11 builtin
> __atomic_exchange_n.
> But even if it is fixed in a new version, builds with older compilers
> won't use the load-and-and instruction.
> In case of e.g. lll_unlock, an additional register loaded with the
> old-value and an additional jump is needed.
>
>>
>>> The newly defined atomic_exchange_and_add macro is implemented with gcc
>>> builtin __sync_fetch_and_add which uses load-and-add instruction on
>>> z196 zarch
>>> and higher cpus instead of a loop with compare-and-swap instruction.
>>> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
>>> the appropriate z196 instruction.
>>
>> Can you just use the new _atomic builtins?  The compiler should just
>> give use _atomic* builtins that are optimized, and we should use them.
>>
> I've checked, if __atomic_fetch_or|add use lao|laa instructions.
> And yes they use it, too.
> As there are issues with the __atomic* builtins as mentioned above,
> the usage of the __sync_fetch_* is currently needed here.
> I don't intend to use a mix of __sync and __atomic builtins.
>
>> With that in place, we could just implement the old-style glibc atomic
>> operations based on the _atomic* builtins, and have much less code to
>> maintain.
>>
> Yes, you are right, this is the correct way in future.
> I saw at least aarch64 is doing that this way.
>
>>> The macros lll_trylock, lll_cond_trylock are extended by an
>>> __glibc_unlikely
>>> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
>>> which does not use jumps in case the lock is free.  Without the hint
>>> it had
>>> to jump if the lock was free.
>>
>> I think it's debatable whether trylock should expect the lock to be
>> free.  OTOH, if it's not free, I guess that we should be in the slow
>> path in most use cases anyway.  Either way, I think it's a choice we
>> should make generically for all architectures, and I want to keep
>> arch-specific code to a minimum.  Thus, please split this out and
>> propose it for the generic lowlevellock.
>>
> Okay. I will send a separate patch to propose this change for common-code.
>


Here is an update of this patch.  The compiler barrier in
atomic_compare_and_exchange_val_acq and
atomic_compare_and_exchange_bool_acq is now fixed.
The hints in the lll_trylock/lll_cond_trylock macros have been split out and
posted here:
"[PATCH] Add __glibc_unlikely hint in lll_trylock, lll_cond_trylock."
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00451.html

The patch-file contains the updated ChangeLog.

Bye.
Stefan

[-- Attachment #2: optimize_s390_atomic_machine.patch --]
[-- Type: text/x-patch, Size: 11659 bytes --]

commit 587b15ad5de619a66d647553310c9da79da57a06
Author: Stefan Liebler <stli@linux.vnet.ibm.com>
Date:   Fri Dec 16 17:53:50 2016 +0100

    S390: Optimize atomic macros.
    
    The atomic_compare_and_exchange_val_acq macro is now implemented with
    gcc __sync_val_compare_and_swap instead of an inline assembly with
    a compare-and-swap instruction.
    The memory is compared against the expected OLDVAL before the compare-and-swap
    instruction is used, in case OLDVAL is constant at compile time.  This is used
    in various locking code.  If the lock is already acquired, another cpu does not
    have to lock the memory exclusively.  If OLDVAL is not constant, the
    compare-and-swap instruction is used directly, as the users of this macro
    usually load the current value right before invoking it.
    
    The same applies to atomic_compare_and_exchange_bool_acq, which wasn't
    defined before.  Now it is implemented with gcc __sync_bool_compare_and_swap.
    If the macro is used as the condition in an if/while expression, the condition
    code is used to e.g. jump directly to another code sequence.  Before this
    change, the old value returned by the compare-and-swap instruction was compared
    with the given OLDVAL to determine if e.g. a jump is needed.
    
    The atomic_exchange_acq macro now uses the load-and-and instruction for a
    constant zero value instead of a compare-and-swap loop.  This instruction is
    available on z196 zarch and higher cpus.  This is e.g. used in unlocking code.
    
    The newly defined atomic_exchange_and_add macro is implemented with the gcc
    builtin __sync_fetch_and_add, which uses the load-and-add instruction on z196
    zarch and higher cpus instead of a loop with the compare-and-swap instruction.
    The same applies to the atomic_or_val, atomic_and_val, ... macros, which use
    the appropriate z196 instructions.
    
    ChangeLog:
    
    	* sysdeps/s390/atomic-machine.h
    	(__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN): New define.
    	(atomic_compare_and_exchange_val_acq):
    	Use __sync_val_compare_and_swap and first compare with a non-atomic
    	instruction if OLDVAL is constant.
    	(atomic_compare_and_exchange_bool_acq): New define.
    	(atomic_exchange_acq): Use load-and-and instruction for constant
    	zero values, if available.
    	(atomic_exchange_and_add, catomic_exchange_and_add, atomic_or_val,
    	atomic_or, catomic_or, atomic_bit_test_set, atomic_and_val,
    	atomic_and, catomic_and): New define.

diff --git a/sysdeps/s390/atomic-machine.h b/sysdeps/s390/atomic-machine.h
index 4ba4107..e6956ff 100644
--- a/sysdeps/s390/atomic-machine.h
+++ b/sysdeps/s390/atomic-machine.h
@@ -45,76 +45,163 @@ typedef uintmax_t uatomic_max_t;
 
 #define USE_ATOMIC_COMPILER_BUILTINS 0
 
-
-#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("cs %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" (newval), "m" (*__archmem) : "cc", "memory" );	\
-     __archold; })
-
 #ifdef __s390x__
 # define __HAVE_64B_ATOMICS 1
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("csg %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" ((long) (newval)), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
 #else
 # define __HAVE_64B_ATOMICS 0
-/* For 31 bit we do not really need 64-bit compare-and-exchange. We can
-   implement them by use of the csd instruction. The straightforward
-   implementation causes warnings so we skip the definition for now.  */
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
 #endif
 
+#ifdef HAVE_S390_MIN_Z196_ZARCH_ASM_SUPPORT
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 1
+#else
+# define __ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN 0
+#endif
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return the old *MEM value.  */
+/* Compare *MEM against expected OLDVAL before using compare-and-swap
+   instruction in case of OLDVAL is constant.  This is used in various
+   locking code.  If the lock is already acquired another cpu has not to
+   exclusively lock the memory.  */
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    __typeof (*(mem)) __atg1_ret;					\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect ((__atg1_ret = *(mem))			\
+			     == (oldval), 1))				\
+      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),	\
+						(newval));		\
+    else								\
+      __asm__ __volatile__ ("" ::: "memory");				\
+    __atg1_ret; })
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return zero if *MEM was changed or non-zero if no exchange happened.  */
+/* Same as with atomic_compare_and_exchange_val_acq,  constant OLDVALs are
+   compared before using compare-and-swap instruction.  As this macro is
+   normally used in conjunction with if or while, gcc emits a conditional-
+   branch instruction to use the condition-code of compare-and-swap instruction
+   instead of comparing the old value.  */
+#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
+  ({ __asm__ __volatile__ ("" ::: "memory");				\
+    int __atg3_ret = 1;							\
+    if (!__builtin_constant_p (oldval)					\
+	|| __builtin_expect (*(mem) == (oldval), 1))			\
+      __atg3_ret = !__sync_bool_compare_and_swap ((mem), (oldval),	\
+						  (newval));		\
+    else								\
+      __asm__ __volatile__ ("" ::: "memory");				\
+    __atg3_ret; })
+
 /* Store NEWVALUE in *MEM and return the old value.  */
 /* On s390, the atomic_exchange_acq is different from generic implementation,
    because the generic one does not use the condition-code of cs-instruction
-   to determine if looping is needed. Instead it saves the old-value and
-   compares it against old-value returned by cs-instruction.  */
+   to determine if looping is needed.  Instead it saves the old-value and
+   compares it against old-value returned by cs-instruction.
+   Setting a constant zero can be done with load-and-and instruction which
+   is available on a z196 zarch and higher cpus.  This is used in unlocking
+   code.  */
 #ifdef __s390x__
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
-     else if (sizeof (*mem) == 8)					\
-       __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
-			     "   jl 0b"					\
-			     : "+d" ( __atg5_oldval), "=Q" (*__atg5_memp) \
-			     : "d" ((long) __atg5_value), "m" (*__atg5_memp) \
-			     : "cc", "memory" );			\
-     else								\
-       abort ();							\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
+    else								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else if (sizeof (*(mem)) == 8)					\
+	  __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" ( __atg5_oldval),		\
+				  "=Q" (*__atg5_memp)			\
+				: "d" ((long) __atg5_value),		\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
      __atg5_oldval; })
 #else
 # define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
+  ({ __typeof (*(mem)) __atg5_oldval;					\
+    if (__ATOMIC_MACROS_HAVE_Z196_ZARCH_INSN != 0			\
+	&& __builtin_constant_p (newvalue) && (newvalue) == 0)		\
+      {									\
+	__atg5_oldval = __sync_fetch_and_and (mem, 0);			\
+      }									\
     else								\
-      abort ();								\
+      {									\
+	__typeof (mem) __atg5_memp = (mem);				\
+	__atg5_oldval = *__atg5_memp;					\
+	__typeof (*(mem)) __atg5_value = (newvalue);			\
+	if (sizeof (*(mem)) == 4)					\
+	  __asm__ __volatile__ ("0: cs %0,%2,%1\n"			\
+				"   jl 0b"				\
+				: "+d" (__atg5_oldval),			\
+				  "=Q" (*__atg5_memp)			\
+				: "d" (__atg5_value),			\
+				  "m" (*__atg5_memp)			\
+				: "cc", "memory" );			\
+	else								\
+	  abort ();							\
+      }									\
     __atg5_oldval; })
 #endif
+
+/* Add VALUE to *MEM and return the old value of *MEM.  */
+/* The gcc builtin uses load-and-add instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_exchange_and_add(mem, value)	\
+  __sync_fetch_and_add (mem, value)
+#define catomic_exchange_and_add(mem, value)	\
+  atomic_exchange_and_add (mem, value)
+
+/* Atomically *mem |= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-or instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_or_val(mem, mask)		\
+  __sync_fetch_and_or (mem, mask)
+/* Atomically *mem |= mask.  */
+#define atomic_or(mem, mask)			\
+  do {						\
+    atomic_or_val (mem, mask);			\
+  } while (0)
+#define catomic_or(mem, mask)			\
+  atomic_or (mem, mask)
+
+/* Atomically *mem |= 1 << bit and return true if the bit was set in old value
+   of *mem.  */
+/* The load-and-or instruction is used on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_bit_test_set(mem, bit)					\
+  ({ __typeof (*(mem)) __atg14_old;					\
+    __typeof (mem) __atg14_memp = (mem);				\
+    __typeof (*(mem)) __atg14_mask = ((__typeof (*(mem))) 1 << (bit));	\
+    __atg14_old = atomic_or_val (__atg14_memp, __atg14_mask);		\
+    __atg14_old & __atg14_mask; })
+
+/* Atomically *mem &= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-and instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_and_val(mem, mask)		\
+  __sync_fetch_and_and (mem, mask)
+/* Atomically *mem &= mask.  */
+#define atomic_and(mem, mask)			\
+  do {						\
+    atomic_and_val (mem, mask);			\
+  } while (0)
+#define catomic_and(mem, mask)			\
+  atomic_and(mem, mask)
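
To illustrate the intended effect of the constant-zero fast path above, here
is a minimal caller-side sketch (not part of the patch; the helper name and
the waiter wake-up are made up, and glibc's internal <atomic.h> is assumed
to be in scope):

/* With a compile-time constant 0, atomic_exchange_acq expands to
   __sync_fetch_and_and (mem, 0), i.e. a single load-and-and when built
   for z196 zarch or newer; with a non-constant new value it falls back
   to the cs/csg loop shown in the diff.  */
static inline void
toy_unlock (int *futex)
{
  int old = atomic_exchange_acq (futex, 0);	/* constant-zero path */
  if (old > 1)
    {
      /* Hypothetical slow path: there were waiters to wake up.  */
    }
}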

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-12-09 15:17     ` Stefan Liebler
@ 2016-12-16 17:08       ` Stefan Liebler
  2016-12-16 19:21         ` Torvald Riegel
  2016-12-16 20:12       ` Florian Weimer
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Liebler @ 2016-12-16 17:08 UTC (permalink / raw)
  To: libc-alpha; +Cc: Torvald Riegel

On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> On 11/24/2016 02:04 PM, Torvald Riegel wrote:
>> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
>>> This patch changes the behaviour of pthread_spin_lock on s390:
>>> Instead of spinning on the lock with compare-and-swap instruction,
>>> the atomic_compare_and_exchange_bool_acq macro is used.
>>> The s390 atomic_compare_and_exchange_bool_acq macro called with constant
>>> zero as oldval first compares the lock with normal instructions.  If
>>> it is
>>> free the compare-and-swap instruction is used to acquire the lock.  While
>>> the lock is held by one cpu another cpu will not exclusively lock the
>>> memory of the lock in a loop.  If the lock is unlocked by another cpu
>>> it is observed by the normal instruction and the lock is acquired
>>> with a compare-and-swap instruction.
>>
>> I don't want to have more arch-specific code than we really need.
>> Please compare what you have against the generic spinlock code.  If you
>> find the generic spinlock code lacking, then please propose a change for
>> it in a way that does not make things worse for other architectures.
>>
>> If there are arch-specific properties that significantly affect the
>> approach the generic spinlock takes (eg, assumptions about CAS vs
>> atomic-exchange runtime overheads), then please split them out.
>>
>> In the long term, this will also benefit s390.  For example, the
>> spinlock code you have has no backoff, and introduces spinning with
>> loads in a pretty ugly way through the hack you have added to the CAS
>> (first loading the lock's value and comparing it manually if the
>> supplied argument is a constant).
>> While the generic spinlock code we have is also not very great,
>> improving it is what will make life easier for the whole glibc project.
>> Also, if someone else improves the generic spinlock code, s390 would
>> miss out on this if you have added your custom spinlock code.
>>
>
> I had a look into the assembler output of generic spinlock code, e.g:
> int
> pthread_spin_trylock (pthread_spinlock_t *lock)
> {
>   return atomic_exchange_acq (lock, 1) ? EBUSY : 0;
> }
>
> On s390x assembler output, a new stack-frame is generated, the lock
> value is loaded, stored to stack, loaded from stack and then passed to
> cs-instruction.
> After cs-instruction, the "old-value" is stored to stack, loaded from
> stack and then compared to zero.
>
> I also had a look into the aarch64 pthread_spin_trylock.os compiled with
> build-many-script and a gcc 6.2.
> aarch64 is using the __atomic-builtins for atomic_exchange_acq,
> atomic_compare_and_exchange_val_acq, ... .
> The atomic_exchange_acq results in the exclusive load/store. Then the
> old-value is also stored to stack and is immediately loaded back in
> order to compare it against zero.
>
> The type pthread_spinlock_t is a volatile int!
> If lock is cast to (int *) before passing it to the atomic macros,
> the assembler output looks okay.
>
> If I change the current generic spinlock code, do I have to rewrite it
> to the c11-like macros?
>
> I've tested the following function in advance:
> int
> foo (pthread_spinlock_t *lock)
> {
>   return atomic_load_acquire (lock);
> }
>
> With the __atomic_load_n version of atomic_load_acquire, the assembler
> output contains a load which is returned as expected.
>
> The non-c11-macro results in the following:
>    0:   58 10 20 00             l       %r1,0(%r2)
>    4:   b3 c1 00 0f             ldgr    %f0,%r15
>    8:   e3 f0 ff 58 ff 71       lay     %r15,-168(%r15)
>    e:   50 10 f0 a4             st      %r1,164(%r15)
>   12:   58 10 f0 a4             l       %r1,164(%r15)
>   16:   50 10 f0 a0             st      %r1,160(%r15)
>   1a:   58 20 f0 a0             l       %r2,160(%r15)
>   1e:   b3 cd 00 f0             lgdr    %r15,%f0
>   22:   b9 14 00 22             lgfr    %r2,%r2
>   26:   07 fe                   br      %r14
> The extra stores/loads to/from stack result from the "__typeof (*(mem))
> __atg101_val" usages in atomic_load_* macros in conjunction with volatile.
>
> As this behaviour is not s390 specific, I've grep'ed to see which arches
> use the generic spin-lock code and the c11-like-atomic-macros:
> sysdeps/hppa/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/unix/sysv/linux/hppa/atomic-machine.h:40:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/microblaze/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/microblaze/atomic-machine.h:39:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/aarch64/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/aarch64/atomic-machine.h:40:#define USE_ATOMIC_COMPILER_BUILTINS 1
>
> sysdeps/mips/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/mips/atomic-machine.h:95:#define USE_ATOMIC_COMPILER_BUILTINS 1
> sysdeps/mips/atomic-machine.h:216:#define USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/nios2/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/unix/sysv/linux/nios2/atomic-machine.h:35:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/arm/nptl/pthread_spin_lock.c:18:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/arm/atomic-machine.h:37:#define USE_ATOMIC_COMPILER_BUILTINS 0
>
> sysdeps/m68k/nptl/pthread_spin_lock.c:19:#define
> SPIN_LOCK_READS_BETWEEN_CMPXCHG 1000
> sysdeps/m68k/m680x0/m68020/atomic-machine.h:48:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
> sysdeps/m68k/coldfire/atomic-machine.h:54:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
> sysdeps/unix/sysv/linux/m68k/coldfire/atomic-machine.h:40:#define
> USE_ATOMIC_COMPILER_BUILTINS 0
>
> What is your suggestion, how to proceed with the volatile int type in
> conjunction with the atomic-macros?
>
> Bye Stefan
>
This patch is not needed anymore, as I've posted adjusted generic 
spinlock code:
[PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00617.html

[PATCH 2/2] S390: Use generic spinlock code.
https://www.sourceware.org/ml/libc-alpha/2016-12/msg00618.html

Please have a look.

Bye.
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-12-16 17:08     ` Stefan Liebler
@ 2016-12-16 19:15       ` Torvald Riegel
  2017-02-27 11:36         ` Stefan Liebler
  2016-12-16 19:18       ` Torvald Riegel
  1 sibling, 1 reply; 15+ messages in thread
From: Torvald Riegel @ 2016-12-16 19:15 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Fri, 2016-12-16 at 18:08 +0100, Stefan Liebler wrote:
> On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> > On 11/24/2016 01:51 PM, Torvald Riegel wrote:
> >> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> >>> The atomic_compare_and_exchange_val_acq macro is now implemented with
> >>> gcc __sync_val_compare_and_swap instead of an inline assembly with
> >>> compare-and-swap instruction.
> >>
> >> Can you use the new GCC atomic builtins, or are they still insufficient
> >> on s390?
> >>
> >> Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
> >> Generally, this is where we want to get to, so that we have to maintain
> >> as few atomic operations as possible.
> >>
> >> Also note that we'll be phasing out the old-style glibc atomics
> >> eventually and use the new C11-like atomics.
> > I've activated it and tried an example like this (also compare it with
> > pthread_once):
> > int
> > foo (int *lock)
> > {
> >   int val, newval;
> >   val = atomic_load_acquire (lock);

Note that this can be a relaxed MO load.

> >   do
> >     {
> >       newval = val | 123;
> >     }
> >   while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));
> >
> >   return 0;
> > }
> >
> > The assembly produced by GCC 5.4/6.2 contains a new stack-frame, the old
> > value is stored and loaded to/from the stack-frame before the
> > cs-instruction.
> > After the cs, the old value is stored again and reloaded if a further
> > round of cs is needed.
> > The condition-code of cs is extracted into a register and then compared
> > to determine if a further round of cs is needed.
> >
> > The cs-instruction itself updates the old-value-register and a
> > conditional jump can be used for a further round.

That indeed sounds like the compiler misses quite a few optimizations.
Is this the same if you use a relaxed load before the loop?
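
Concretely, the relaxed-load variant of the example above would be something
like this (a sketch using the C11-like macro names from include/atomic.h):

int
foo (int *lock)
{
  /* A relaxed load is sufficient for the initial read; the required
     acquire ordering comes from the successful CAS below, and a failed
     CAS updates val with the current value anyway.  */
  int val = atomic_load_relaxed (lock);
  int newval;
  do
    newval = val | 123;
  while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));

  return 0;
}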

> > An upstream gcc produces better code, but it's also not perfect!

How much closer to "perfect" is it?

> > I've already informed Andreas Krebbel.
> >
> > Thus I won't set USE_ATOMIC_COMPILER_BUILTINS to 1 now.
> > And if gcc is fixed, I will activate it only for this version and newer.

That is understandable.

I'd still prefer if you could include a summary of the information you
provided above as a comment in the code (or open up a bug with the
information, so that it doesn't get lost, or add something to the
Concurrency wiki page, or ...).

> >>
> >>> The memory is compared against expected OLDVAL before using
> >>> compare-and-swap
> >>> instruction in case of OLDVAL is constant at compile time.  This is
> >>> used in
> >>> various locking code.  If the lock is already aquired, another cpu
> >>> has not to
> >>> exclusively lock the memory.  If OLDVAL is not constant the
> >>> compare-and-swap
> >>> instruction is used directly as the usages of this macro usually load
> >>> the
> >>> current value in front of this macro.
> >>
> >> Generally, I think the compiler should do these optimizations (under the
> >> assumption that we drive these using builtin_expect and the like).  If
> >> GCC not doing these (with new _atomic* builtins), is there a patch about
> >> that or at least a BZ?  I'd be okay with accepting an intermediate
> >> solution that optimizes atomics in glibc, but I'd want to make this
> >> clear in code comments and reference the GCC BZ that makes the
> >> improvement request; this ensures that we can drop the glibc hack once
> >> GCC does what you want it to do.
> >>
> > No. This comparison will not be included in the gcc builtins as they
> > are intended to only be compare-and-swap instruction.

They are not just that, actually.  We expect the builtins to have the
effect of a CAS when executed, but under the hood that's more than just
issuing the instruction.  For example, we expect the compiler to pick
suitable registers and do something efficient with the flags.

I agree that your case could be considered borderline, because it's
"outside" of the CAS (as opposed to an LL/SC machine where the check
could be after the LL, or something like that).  However, you only do
the optimization for constants.

> >> For this particular case, I'm hoping we can just fix this in the
> >> compiler.
> >>
> >> Regarding your patch, the compiler barrier is in the wrong position for
> >> an acquire MO operation; for the new C11-like atomics, it would be
> >> unnecessary because we only guarantee relaxed MO in the failure case.
> >>
> > If *mem is equal to constant oldval, the sync builtin is considered as
> > full barrier after loading *mem. If *mem is not equal to constant
> > oldval, you are right there is no barrier after the load and the
> > compiler could hoist memory accesses before the load of mem!?

If the true branch of the conditional is taken, the compiler won't hoist
the load before the __sync builtin, because it needs the load to
determine whether the __sync should actually happen.

> > Thus I can add the compiler barrier in the else path like below.

Yes, and given that s390 is TSO-like (IIRC), this would be like an
acquire barrier.

However, the more I think about this, the more I'm worried that
transforming an idempotent CAS into just a load might mess with the
assurances we give regarding the C11 memory model.  At least in C11, the
memory model gives stronger guarantees for an idempotent CAS than for a
load, I believe; for example, that it potentially extends release
sequences, it has different effects on ordering, etc.

What you're proposing is effectively a different s390-atomics-to-C11
mapping.  We should be *very* sure that this is a valid mapping before
we make such a change.  I'm in no position to assess that, simply
because I don't know enough about the s390 model.  Are you confident
enough to make this assessment, and can you show why it's correct?

What you are proposing is a special case, I'd argue (eg, because of
expecting a constant).  I think it would be better if you propose
changes to what you think the uses of this optimization are because we
can much more easily argue about correctness of concrete uses of this.

> > +#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)\
> > +  ({ __asm__ __volatile__ ("" ::: "memory");            \
> > +    __typeof (*(mem)) __atg1_ret;                \
> > +    if (!__builtin_constant_p (oldval)                \
> > +    || __builtin_expect ((__atg1_ret = *(mem))        \
> > +                 == (oldval), 1))            \
> > +      __atg1_ret = __sync_val_compare_and_swap ((mem), (oldval),\
> > +                        (newval));    \
> > +    else                            \
> > +      __asm__ __volatile__ ("" ::: "memory");            \
> > +    __atg1_ret; })
> > The intention of the full barrier before the load and the if-statement
> > is to "force" the load and compare directly before cs instruction.

But that wouldn't be required for correctness, or do you have something
additional in mind that you want to achieve?  Or is affecting compiler
optimizations the intent here?

> >>> The same applies to atomic_compare_and_exchange_bool_acq which wasn't
> >>> defined before.  Now it is implemented with gcc
> >>> __sync_bool_compare_and_swap.
> >>> If the macro is used as condition in an if/while expression, the
> >>> condition
> >>> code is used to e.g. jump directly to another code sequence.  Before
> >>> this
> >>> change, the old value returned by compare-and-swap instruction was
> >>> compared
> >>> with given OLDVAL to determine if e.g. a jump is needed.
> > I will add the barrier here in the same way as mention above.
> >>>
> >>> The atomic_exchange_acq macro is now using the load-and-and
> >>> instruction for a
> >>> constant zero value instead of a compare-and-swap loop.  This
> >>> instruction is
> >>> available on a z196 zarch and higher cpus.  This is e.g. used in
> >>> unlocking code.
> >>
> >> See above.  This should be fixed in the compiler.
> > There is no sync-builtin which can be used for exchanging the value as
> > done in atomic_exchange_acq. Thus I have to do it in this glibc-macro.
> > But you are right, the compiler can fix this for c11 builtin
> > __atomic_exchange_n.
> > But even if it is fixed in a new version, builds with older compilers
> > won't use the load-and-and instruction.

First, can we just use the __atomic builtins instead of __sync?  For the
minimum compiler version current glibc supports, what is the code
generated for __atomic_fetch_add?  IIRC, the minimum GCC version we
support already supports the __atomic builtins.  I'd be surprised if
__sync is optimized better than __atomic.
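
A test case for that comparison could be as small as the following (a sketch;
the function names are made up, and on z196 zarch or newer both versions
would be expected to compile to a single laa):

int
fetch_add_sync (int *mem, int val)
{
  /* Legacy __sync builtin; implicitly a full barrier.  */
  return __sync_fetch_and_add (mem, val);
}

int
fetch_add_atomic (int *mem, int val)
{
  /* C11-style builtin with an explicit memory order.  */
  return __atomic_fetch_add (mem, val, __ATOMIC_ACQUIRE);
}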

I'll also say again that especially the exchange to 0 should be a
compiler optimization.  If there is truly a reason to apply the stop-gap
measure that you propose, we must have a GCC bug report for this so we
can track when this happens, and the number of this GCC bug must go into
a comment you add to the code.

> > In case of e.g. lll_unlock, an additional register loaded with the
> > old-value and an additional jump is needed.

That is a comparison against the CAS version?

> >>
> >>> The newly defined atomic_exchange_and_add macro is implemented with gcc
> >>> builtin __sync_fetch_and_add which uses load-and-add instruction on
> >>> z196 zarch
> >>> and higher cpus instead of a loop with compare-and-swap instruction.
> >>> The same applies to atomic_or_val, atomic_and_val, ... macros, which use
> >>> the appropiate z196 instruction.
> >>
> >> Can you just use the new _atomic builtins?  The compiler should just
> >> give use _atomic* builtins that are optimized, and we should use them.
> >>
> > I've checked, if __atomic_fetch_or|add use lao|laa instructions.
> > And yes they use it, too.
> > As there are issues with the __atomic* builtins as mentioned above,
> > the usage of the __sync_fetch_* is currently needed here.
> > I don't intend to use a mix of __sync and __atomic builtins.

Why?  __atomic is much better specified.  __sync is like a seq_cst
__atomic, essentially.

> >> With that in place, we could just implement the old-style glibc atomic
> >> operations based on the _atomic* builtins, and have much less code to
> >> maintain.
> >>
> > Yes, you are right, this is the correct way in future.
> > I saw at least aarch64 is doing that this way.
> >
> >>> The macros lll_trylock, lll_cond_trylock are extended by an
> >>> __glibc_unlikely
> >>> hint. With the hint gcc on s390 emits code in e.g. pthread_mutex_trylock
> >>> which does not use jumps in case the lock is free.  Without the hint
> >>> it had
> >>> to jump if the lock was free.
> >>
> >> I think it's debatable whether trylock should expect the lock to be
> >> free.  OTOH, if it's not free, I guess that we should be in the slow
> >> path in most use cases anyway.  Either way, I think it's a choice we
> >> should make generically for all architectures, and I want to keep
> >> arch-specific code to a minimum.  Thus, please split this out and
> >> propose it for the generic lowlevellock.
> >>
> > Okay. I will send a separate patch to propose this change for common-code.
> >
> 
> 
> Here is an update of this patch. The compiler barrier in 
> atomic_compare_and_exchange_val_acq and 
> atomic_compare_and_exchange_bool_acq is now fixed.
> The hints in macros lll_trylock/lll_cond_trylock are separated and 
> posted here:
> "[PATCH] Add __glibc_unlikely hint in lll_trylock, lll_cond_trylock."
> https://www.sourceware.org/ml/libc-alpha/2016-12/msg00451.html
> 
> The patch-file contains the updated ChangeLog.

Thanks.  See my comments above.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-12-16 17:08     ` Stefan Liebler
  2016-12-16 19:15       ` Torvald Riegel
@ 2016-12-16 19:18       ` Torvald Riegel
  1 sibling, 0 replies; 15+ messages in thread
From: Torvald Riegel @ 2016-12-16 19:18 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Fri, 2016-12-16 at 18:08 +0100, Stefan Liebler wrote:
> On 12/09/2016 04:16 PM, Stefan Liebler wrote:
> > On 11/24/2016 01:51 PM, Torvald Riegel wrote:
> >> On Wed, 2016-11-23 at 16:09 +0100, Stefan Liebler wrote:
> >>> The atomic_compare_and_exchange_val_acq macro is now implemented with
> >>> gcc __sync_val_compare_and_swap instead of an inline assembly with
> >>> compare-and-swap instruction.
> >>
> >> Can you use the new GCC atomic builtins, or are they still insufficient
> >> on s390?
> >>
> >> Related to that, can you define USE_ATOMIC_COMPILER_BUILTINS to 1?
> >> Generally, this is where we want to get to, so that we have to maintain
> >> as few atomic operations as possible.
> >>
> >> Also note that we'll be phasing out the old-style glibc atomics
> >> eventually and use the new C11-like atomics.
> > I've activated it and tried an example like this (also compare it with
> > pthread_once):
> > int
> > foo (int *lock)
> > {
> >   int val, newval;
> >   val = atomic_load_acquire (lock);
> >   do
> >     {
> >       newval = val | 123;
> >     }
> >   while (!atomic_compare_exchange_weak_acquire (lock, &val, newval));
> >
> >   return 0;
> > }
> >
> > The assembly produced by GCC 5.4/6.2 contains a new stack-frame, the old
> > value is stored and loaded to/from the stack-frame before the
> > cs-instruction.
> > After the cs, the old value is stored again and reloaded if a further
> > round of cs is needed.
> > The condition-code of cs is extracted into a register and then compared
> > to determine if a further round of cs is needed.

Having read your other message, did you try with an int* or volatile
int* lock?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-12-16 17:08       ` Stefan Liebler
@ 2016-12-16 19:21         ` Torvald Riegel
  0 siblings, 0 replies; 15+ messages in thread
From: Torvald Riegel @ 2016-12-16 19:21 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Fri, 2016-12-16 at 18:08 +0100, Stefan Liebler wrote:
> > What is your suggestion, how to proceed with the volatile int type in
> > conjunction with the atomic-macros?

Good catch re volatile.  It should not be volatile, which is consistent
with how we do atomics elsewhere.  We shouldn't change the user-facing
type, but should cast to non-volatile internally.
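
A sketch of what that could look like at one use site (illustrative only;
toy_spin_trylock is a made-up name, <atomic.h> here is glibc's internal
header, and whether such a cast is even permitted is exactly what the next
message questions):

#include <errno.h>
#include <pthread.h>
#include <atomic.h>	/* glibc-internal atomic macros */

int
toy_spin_trylock (pthread_spinlock_t *lock)
{
  /* pthread_spinlock_t stays volatile int for users; the volatile
     qualifier is stripped only for the internal atomic access.  */
  int *ilock = (int *) lock;
  return atomic_exchange_acq (ilock, 1) ? EBUSY : 0;
}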

> This patch is not needed anymore as I've posted an adjusted generic 
> spinlock code:
> [PATCH 1/2] Optimize generic spinlock code and use C11 like atomic macros.
> https://www.sourceware.org/ml/libc-alpha/2016-12/msg00617.html
> 
> [PATCH 2/2] S390: Use generic spinlock code.
> https://www.sourceware.org/ml/libc-alpha/2016-12/msg00618.html
> 
> Please have a look.

I'll do.  Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock.
  2016-12-09 15:17     ` Stefan Liebler
  2016-12-16 17:08       ` Stefan Liebler
@ 2016-12-16 20:12       ` Florian Weimer
  1 sibling, 0 replies; 15+ messages in thread
From: Florian Weimer @ 2016-12-16 20:12 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

* Stefan Liebler:

> The type pthread_spinlock_t is a volatile int!
> If lock is cast to (int *) before passing it to the atomic macros,
> the assembler output looks okay.

That's undefined:

“If an attempt is made to refer to an object defined with a
volatile-qualified type through use of an lvalue with
non-volatile-qualified type, the behavior is undefined.”

But we cannot drop the volatile qualifier from the definition of
pthread_spinlock_t because it would change the C++ mangling of
pthread_spinlock_t * and similar types.

Maybe we can receive a GCC extension to support this?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2016-12-16 19:15       ` Torvald Riegel
@ 2017-02-27 11:36         ` Stefan Liebler
  2017-02-28  7:33           ` Torvald Riegel
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Liebler @ 2017-02-27 11:36 UTC (permalink / raw)
  To: libc-alpha

[-- Attachment #1: Type: text/plain, Size: 1868 bytes --]

This patch activates c11 atomic builtins by defining
USE_ATOMIC_COMPILER_BUILTINS to 1.
Note:
E.g. in nptl/pthread_key_delete.c, if compiled with GCC 6 or older,
an extra stack frame is generated and the old value is stored on the
stack before the cs instruction, although it is never loaded back from
the stack.  An unreleased GCC 7 omits those stack operations.

E.g. in nptl/pthread_once.c, the condition code of the cs instruction is
evaluated by a sequence of ipm, sra, compare and jump instructions
instead of one conditional jump instruction.
This also occurs with an unreleased GCC 7.

These shortcomings do not really hurt.  Nevertheless, the gcc developers are
investigating them and plan to fix them before the GCC 7 release.

The atomic_fetch_abc_def c11 builtins are now using load-and-abc
instructions on z196 zarch and higher cpus instead of a loop
with compare-and-swap instruction.

Some of the non-c11 atomic macros from include/atomic.h are now
implemented with help of the c11 atomic builtins.
The other non-c11 atomic macros are using the macros defined here.

Optimizations like using load-and-and for exchanging memory to zero
will be done by GCC in the future.

ChangeLog:

	* sysdeps/s390/atomic-machine.h
	(USE_ATOMIC_COMPILER_BUILTINS): Define to 1.
	(__arch_compare_and_exchange_val_8_acq,
	__arch_compare_and_exchange_val_16_acq,
	__arch_compare_and_exchange_val_32_acq,
	__arch_compare_and_exchange_val_64_acq):
	Delete macro.
	(atomic_compare_and_exchange_val_acq,
	atomic_compare_and_exchange_val_rel,
	atomic_compare_and_exchange_bool_acq,
	catomic_compare_and_exchange_bool_acq,
	atomic_exchange_acq, atomic_exchange_rel,
	atomic_exchange_and_add_acq,
	atomic_exchange_and_add_rel,
	catomic_exchange_and_add, atomic_or_val,
	atomic_or, catomic_or, atomic_bit_test_set,
	atomic_and_val, atomic_and, catomic_and):
	Define macros with help of c11 atomic builtins.

[-- Attachment #2: 20170227_s390_atomic-machine.patch --]
[-- Type: text/x-patch, Size: 10440 bytes --]

commit 187e63cf8411585313051f020a39f8b6010280a9
Author: Stefan Liebler <stli@linux.vnet.ibm.com>
Date:   Mon Feb 27 12:23:32 2017 +0100

    S390: Optimize atomic macros.
    
    This patch activates c11 atomic builtins by defining
    USE_ATOMIC_COMPILER_BUILTINS to 1.
    Note:
    E.g. in nptl/pthread_key_delete.c, if compiled with GCC 6 or older,
    an extra stack frame is generated and the old value is stored on the
    stack before the cs instruction, although it is never loaded back from
    the stack.  An unreleased GCC 7 omits those stack operations.
    
    E.g. in nptl/pthread_once.c, the condition code of the cs instruction is
    evaluated by a sequence of ipm, sra, compare and jump instructions instead
    of one conditional jump instruction.  This also occurs with an unreleased
    GCC 7.
    
    These shortcomings do not really hurt.  Nevertheless, the gcc developers are
    investigating them and plan to fix them before the GCC 7 release.
    
    The atomic_fetch_abc_def c11 builtins are now using load-and-abc instructions
    on z196 zarch and higher cpus instead of a loop with compare-and-swap
    instruction.
    
    Some of the non-c11 atomic macros from include/atomic.h are now implemented
    with help of the c11 atomic builtins.  The other non-c11 atomic macros
    are using the macros defined here.
    
    ChangeLog:
    
    	* sysdeps/s390/atomic-machine.h
    	(USE_ATOMIC_COMPILER_BUILTINS): Define to 1.
    	(__arch_compare_and_exchange_val_8_acq,
    	__arch_compare_and_exchange_val_16_acq,
    	__arch_compare_and_exchange_val_32_acq,
    	__arch_compare_and_exchange_val_64_acq):
    	Delete macro.
    	(atomic_compare_and_exchange_val_acq,
    	atomic_compare_and_exchange_val_rel,
    	atomic_compare_and_exchange_bool_acq,
    	catomic_compare_and_exchange_bool_acq,
    	atomic_exchange_acq, atomic_exchange_rel,
    	atomic_exchange_and_add_acq,
    	atomic_exchange_and_add_rel,
    	catomic_exchange_and_add, atomic_or_val,
    	atomic_or, catomic_or, atomic_bit_test_set,
    	atomic_and_val, atomic_and, catomic_and):
    	Define macros with help of c11 atomic builtins.

diff --git a/sysdeps/s390/atomic-machine.h b/sysdeps/s390/atomic-machine.h
index 211d3d6..6f0a6fe 100644
--- a/sysdeps/s390/atomic-machine.h
+++ b/sysdeps/s390/atomic-machine.h
@@ -43,78 +43,117 @@ typedef uintptr_t uatomicptr_t;
 typedef intmax_t atomic_max_t;
 typedef uintmax_t uatomic_max_t;
 
-#define USE_ATOMIC_COMPILER_BUILTINS 0
+/* Activate all c11 atomic builtins.
 
+   Note:
+   E.g. in nptl/pthread_key_delete.c, if compiled with GCC 6 or older,
+   an extra stack frame is generated and the old value is stored on the
+   stack before the cs instruction, although it is never loaded back from
+   the stack.  An unreleased GCC 7 omits those stack operations.
 
-#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
+   E.g. in nptl/pthread_once.c, the condition code of the cs instruction is
+   evaluated by a sequence of ipm, sra, compare and jump instructions instead
+   of one conditional jump instruction.  This also occurs with an unreleased
+   GCC 7.
 
-#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
-
-#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("cs %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" (newval), "m" (*__archmem) : "cc", "memory" );	\
-     __archold; })
+   The atomic_fetch_abc_def c11 builtins are now using load-and-abc instructions
+   on z196 zarch and higher cpus instead of a loop with compare-and-swap
+   instruction.  */
+#define USE_ATOMIC_COMPILER_BUILTINS 1
 
 #ifdef __s390x__
 # define __HAVE_64B_ATOMICS 1
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __archmem = (mem);					      \
-     __typeof (*mem) __archold = (oldval);				      \
-     __asm__ __volatile__ ("csg %0,%2,%1"				      \
-			   : "+d" (__archold), "=Q" (*__archmem)	      \
-			   : "d" ((long) (newval)), "m" (*__archmem) : "cc", "memory" ); \
-     __archold; })
 #else
 # define __HAVE_64B_ATOMICS 0
-/* For 31 bit we do not really need 64-bit compare-and-exchange. We can
-   implement them by use of the csd instruction. The straightforward
-   implementation causes warnings so we skip the definition for now.  */
-# define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
-  (abort (), (__typeof (*mem)) 0)
 #endif
 
+/* Implement some of the non-c11 atomic macros from include/atomic.h
+   with help of the c11 atomic builtins.  The other non-c11 atomic macros
+   are using the macros defined here.  */
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return the old *MEM value.  */
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval)	\
+  ({ __atomic_check_size((mem));					\
+    typeof ((__typeof (*(mem))) *(mem)) __atg1_oldval = (oldval);	\
+    __atomic_compare_exchange_n (mem, (void *) &__atg1_oldval,		\
+				 newval, 1, __ATOMIC_ACQUIRE,		\
+				 __ATOMIC_RELAXED);			\
+    __atg1_oldval; })
+#define atomic_compare_and_exchange_val_rel(mem, newval, oldval)	\
+  ({ __atomic_check_size((mem));					\
+    typeof ((__typeof (*(mem))) *(mem)) __atg1_2_oldval = (oldval);	\
+    __atomic_compare_exchange_n (mem, (void *) &__atg1_2_oldval,	\
+				 newval, 1, __ATOMIC_RELEASE,		\
+				 __ATOMIC_RELAXED);			\
+    __atg1_2_oldval; })
+
+/* Atomically store NEWVAL in *MEM if *MEM is equal to OLDVAL.
+   Return zero if *MEM was changed or non-zero if no exchange happened.  */
+#define atomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
+  ({ __atomic_check_size((mem));					\
+    typeof ((__typeof (*(mem))) *(mem)) __atg2_oldval = (oldval);	\
+    !__atomic_compare_exchange_n (mem, (void *) &__atg2_oldval, newval,	\
+				  1, __ATOMIC_ACQUIRE,			\
+				  __ATOMIC_RELAXED); })
+#define catomic_compare_and_exchange_bool_acq(mem, newval, oldval)	\
+  atomic_compare_and_exchange_bool_acq (mem, newval, oldval)
+
 /* Store NEWVALUE in *MEM and return the old value.  */
-/* On s390, the atomic_exchange_acq is different from generic implementation,
-   because the generic one does not use the condition-code of cs-instruction
-   to determine if looping is needed. Instead it saves the old-value and
-   compares it against old-value returned by cs-instruction.  */
-#ifdef __s390x__
-# define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
-     else if (sizeof (*mem) == 8)					\
-       __asm__ __volatile__ ("0: csg %0,%2,%1\n"			\
-			     "   jl 0b"					\
-			     : "+d" ( __atg5_oldval), "=Q" (*__atg5_memp) \
-			     : "d" ((long) __atg5_value), "m" (*__atg5_memp) \
-			     : "cc", "memory" );			\
-     else								\
-       abort ();							\
-     __atg5_oldval; })
-#else
-# define atomic_exchange_acq(mem, newvalue)				\
-  ({ __typeof (mem) __atg5_memp = (mem);				\
-    __typeof (*(mem)) __atg5_oldval = *__atg5_memp;			\
-    __typeof (*(mem)) __atg5_value = (newvalue);			\
-    if (sizeof (*mem) == 4)						\
-      __asm__ __volatile__ ("0: cs %0,%2,%1\n"				\
-			    "   jl 0b"					\
-			    : "+d" (__atg5_oldval), "=Q" (*__atg5_memp)	\
-			    : "d" (__atg5_value), "m" (*__atg5_memp)	\
-			    : "cc", "memory" );				\
-    else								\
-      abort ();								\
-    __atg5_oldval; })
-#endif
+#define atomic_exchange_acq(mem, newvalue)				\
+  ({ __atomic_check_size((mem));					\
+    __atomic_exchange_n (mem, newvalue, __ATOMIC_ACQUIRE); })
+#define atomic_exchange_rel(mem, newvalue)				\
+  ({ __atomic_check_size((mem));					\
+    __atomic_exchange_n (mem, newvalue, __ATOMIC_RELEASE); })
+
+/* Add VALUE to *MEM and return the old value of *MEM.  */
+/* The gcc builtin uses load-and-add instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+# define atomic_exchange_and_add_acq(mem, operand)			\
+  ({ __atomic_check_size((mem));					\
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_ACQUIRE); })
+# define atomic_exchange_and_add_rel(mem, operand)			\
+  ({ __atomic_check_size((mem));					\
+  __atomic_fetch_add ((mem), (operand), __ATOMIC_RELEASE); })
+#define catomic_exchange_and_add(mem, value)	\
+  atomic_exchange_and_add (mem, value)
+
+/* Atomically *mem |= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-or instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_or_val(mem, operand)					\
+  ({ __atomic_check_size((mem));					\
+  __atomic_fetch_or ((mem), (operand), __ATOMIC_ACQUIRE); })
+/* Atomically *mem |= mask.  */
+#define atomic_or(mem, mask)			\
+  do {						\
+    atomic_or_val (mem, mask);			\
+  } while (0)
+#define catomic_or(mem, mask)			\
+  atomic_or (mem, mask)
+
+/* Atomically *mem |= 1 << bit and return true if the bit was set in old value
+   of *mem.  */
+/* The load-and-or instruction is used on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_bit_test_set(mem, bit)					\
+  ({ __typeof (*(mem)) __atg14_old;					\
+    __typeof (mem) __atg14_memp = (mem);				\
+    __typeof (*(mem)) __atg14_mask = ((__typeof (*(mem))) 1 << (bit));	\
+    __atg14_old = atomic_or_val (__atg14_memp, __atg14_mask);		\
+    __atg14_old & __atg14_mask; })
+
+/* Atomically *mem &= mask and return the old value of *mem.  */
+/* The gcc builtin uses load-and-and instruction on z196 zarch and higher cpus
+   instead of a loop with compare-and-swap instruction.  */
+#define atomic_and_val(mem, operand)					\
+  ({ __atomic_check_size((mem));					\
+  __atomic_fetch_and ((mem), (operand), __ATOMIC_ACQUIRE); })
+/* Atomically *mem &= mask.  */
+#define atomic_and(mem, mask)			\
+  do {						\
+    atomic_and_val (mem, mask);			\
+  } while (0)
+#define catomic_and(mem, mask)			\
+  atomic_and(mem, mask)
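
For a caller-side view of what stays the same with this patch (a sketch, not
taken from glibc; the helper and the lock encoding are made up):

/* Existing callers keep using the old-style macro; it now expands to
   __atomic_compare_exchange_n with acquire/relaxed ordering instead of
   the removed cs/csg inline assembly.  Returns 0 if the lock was
   acquired, nonzero if it was already taken.  */
static inline int
toy_trylock (int *futex)
{
  return atomic_compare_and_exchange_bool_acq (futex, 1, 0);
}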

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2017-02-27 11:36         ` Stefan Liebler
@ 2017-02-28  7:33           ` Torvald Riegel
  2017-03-06 14:42             ` Stefan Liebler
  0 siblings, 1 reply; 15+ messages in thread
From: Torvald Riegel @ 2017-02-28  7:33 UTC (permalink / raw)
  To: Stefan Liebler; +Cc: libc-alpha

On Mon, 2017-02-27 at 12:36 +0100, Stefan Liebler wrote:
> This patch activates c11 atomic builtins by defining
> USE_ATOMIC_COMPILER_BUILTINS to 1.
> Note:
> E.g. in nptl/pthread_key_delete.c if compiled with GCCs 6 and before,
> an extra stack-frame is generated and the old value is stored on stack
> before cs instruction but it never loads this value from stack.
> An unreleased GCC 7 omit those stack operations.
> 
> E.g. in nptl/pthread_once.c the condition code of cs instruction is
> evaluated by a sequence of ipm, sra, compare and jump instructions
> instead of one conditional jump instruction.
> This also occurs with an unreleased GCC 7.
> 
> These shortcomings does not really hurt.  Nevertheless, the gcc guys are
> investigating those ones and plan to fix them before GCC 7 release.
> 
> The atomic_fetch_abc_def c11 builtins are now using load-and-abc
> instructions on z196 zarch and higher cpus instead of a loop
> with compare-and-swap instruction.
> 
> Some of the non-c11 atomic macros from include/atomic.h are now
> implemented with help of the c11 atomic builtins.
> The other non-c11 atomic macros are using the macros defined here.
> 
> Optimizations like using load-and-and for exchanging memory to zero
> will be done by GCC in future.
> 
> ChangeLog:
> 
> 	* sysdeps/s390/atomic-machine.h
> 	(USE_ATOMIC_COMPILER_BUILTINS): Define to 1.
> 	(__arch_compare_and_exchange_val_8_acq,
> 	__arch_compare_and_exchange_val_16_acq,
> 	__arch_compare_and_exchange_val_32_acq,
> 	__arch_compare_and_exchange_val_64_acq):
> 	Delete macro.
> 	(atomic_compare_and_exchange_val_acq,
> 	atomic_compare_and_exchange_val_rel,
> 	atomic_compare_and_exchange_bool_acq,
> 	catomic_compare_and_exchange_bool_acq,
> 	atomic_exchange_acq, atomic_exchange_rel,
> 	atomic_exchange_and_add_acq,
> 	atomic_exchange_and_add_rel,
> 	catomic_exchange_and_add, atomic_or_val,
> 	atomic_or, catomic_or, atomic_bit_test_set,
> 	atomic_and_val, atomic_and, catomic_and):
> 	Define macros with help of c11 atomic builtins.

s/c11/C11/ throughout the patch.
Besides that, this looks good to me.  Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] S390: Optimize atomic macros.
  2017-02-28  7:33           ` Torvald Riegel
@ 2017-03-06 14:42             ` Stefan Liebler
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Liebler @ 2017-03-06 14:42 UTC (permalink / raw)
  To: libc-alpha

On 02/28/2017 08:33 AM, Torvald Riegel wrote:
> On Mon, 2017-02-27 at 12:36 +0100, Stefan Liebler wrote:
>> This patch activates c11 atomic builtins by defining
>> USE_ATOMIC_COMPILER_BUILTINS to 1.
>> Note:
>> E.g. in nptl/pthread_key_delete.c if compiled with GCCs 6 and before,
>> an extra stack-frame is generated and the old value is stored on stack
>> before cs instruction but it never loads this value from stack.
>> An unreleased GCC 7 omit those stack operations.
>>
>> E.g. in nptl/pthread_once.c the condition code of cs instruction is
>> evaluated by a sequence of ipm, sra, compare and jump instructions
>> instead of one conditional jump instruction.
>> This also occurs with an unreleased GCC 7.
>>
>> These shortcomings does not really hurt.  Nevertheless, the gcc guys are
>> investigating those ones and plan to fix them before GCC 7 release.
>>
>> The atomic_fetch_abc_def c11 builtins are now using load-and-abc
>> instructions on z196 zarch and higher cpus instead of a loop
>> with compare-and-swap instruction.
>>
>> Some of the non-c11 atomic macros from include/atomic.h are now
>> implemented with help of the c11 atomic builtins.
>> The other non-c11 atomic macros are using the macros defined here.
>>
>> Optimizations like using load-and-and for exchanging memory to zero
>> will be done by GCC in future.
>>
>> ChangeLog:
>>
>> 	* sysdeps/s390/atomic-machine.h
>> 	(USE_ATOMIC_COMPILER_BUILTINS): Define to 1.
>> 	(__arch_compare_and_exchange_val_8_acq,
>> 	__arch_compare_and_exchange_val_16_acq,
>> 	__arch_compare_and_exchange_val_32_acq,
>> 	__arch_compare_and_exchange_val_64_acq):
>> 	Delete macro.
>> 	(atomic_compare_and_exchange_val_acq,
>> 	atomic_compare_and_exchange_val_rel,
>> 	atomic_compare_and_exchange_bool_acq,
>> 	catomic_compare_and_exchange_bool_acq,
>> 	atomic_exchange_acq, atomic_exchange_rel,
>> 	atomic_exchange_and_add_acq,
>> 	atomic_exchange_and_add_rel,
>> 	catomic_exchange_and_add, atomic_or_val,
>> 	atomic_or, catomic_or, atomic_bit_test_set,
>> 	atomic_and_val, atomic_and, catomic_and):
>> 	Define macros with help of c11 atomic builtins.
>
> s/c11/C11/ throughout the patch.
> Besides that, this looks good to me.  Thanks.
>
Thanks for the review.
I've changed c11 to C11 and committed this patch.
Bye.
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-03-06 14:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2016-11-23 15:09 [PATCH 1/2] S390: Optimize atomic macros Stefan Liebler
2016-11-23 15:09 ` [PATCH 2/2] S390: Test if lock is free before using atomic instruction in spin-lock Stefan Liebler
2016-11-24 13:04   ` Torvald Riegel
2016-12-09 15:17     ` Stefan Liebler
2016-12-16 17:08       ` Stefan Liebler
2016-12-16 19:21         ` Torvald Riegel
2016-12-16 20:12       ` Florian Weimer
2016-11-24 12:51 ` [PATCH 1/2] S390: Optimize atomic macros Torvald Riegel
2016-12-09 15:17   ` Stefan Liebler
2016-12-16 17:08     ` Stefan Liebler
2016-12-16 19:15       ` Torvald Riegel
2017-02-27 11:36         ` Stefan Liebler
2017-02-28  7:33           ` Torvald Riegel
2017-03-06 14:42             ` Stefan Liebler
2016-12-16 19:18       ` Torvald Riegel
