public inbox for libc-alpha@sourceware.org
* [PATCH v2] PowerPC: libc single-thread lock optimization
@ 2014-08-22 13:50 Adhemerval Zanella
  2014-08-22 14:00 ` Future atomic operation cleanup Richard Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2014-08-22 13:50 UTC (permalink / raw)
  To: GNU C. Library

Hi,

Following comments on my first patch to optimize glibc's internal single-thread
locking/atomics [1], I have changed the implementation to now use relaxed
atomics instead.  Addressing the concerns raised in the last discussion,
the primitives are still signal-safe (although not thread-safe), so if a future
malloc implementation is changed to be async-safe, it won't require adjusting
the powerpc atomics.
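
In outline, the change boils down to the dispatch below (an illustrative
sketch only; the size dispatch and the remaining operations are in the patch.
The relaxed variants still use lwarx/stwcx. loops, so they stay atomic with
respect to signal handlers and only drop the memory barriers):

  /* Sketch: use the relaxed variant while the process is known to be
     single-threaded, the fully fenced variant otherwise.  */
  #define __atomic_is_single_thread \
    (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)

  #define atomic_exchange_acq(mem, value)		\
    (__atomic_is_single_thread				\
     ? atomic_exchange_relaxed (mem, value)		\
     : __arch_atomic_exchange_32_acq (mem, value))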

For catomic_and and catomic_or I followed the definitions in 'include/atomic.h'
(which powerpc currently uses) and implemented the atomics with acquire
semantics.  The new implementation is also simpler.

On synthetic benchmarks it shows an improvement of 5-10% for malloc
calls and a performance increase of 7-8% in 483.xalancbmk from
SPEC CPU2006 (numbers from a POWER8 machine).

Checked on powerpc64, powerpc32 and powerpc64le.

[1] https://sourceware.org/ml/libc-alpha/2014-05/msg00118.html

--

	* sysdeps/powerpc/bits/atomic.h: Include <tls.h>.
	(__arch_compare_and_exchange_val_32_relaxed): New macro: atomic compare
	and exchange with relaxed semantics.
	(atomic_compare_and_exchange_val_relaxed): Likewise.
	(__arch_atomic_exchange_32_relaxed): New macro: atomic exchange with
	relaxed semantics.
	(atomic_exchange_relaxed): Likewise.
	(__arch_atomic_and_32): New macro: atomic bitwise and with acquire
	semantics.
	(__arch_atomic_and_32_relaxed): New macro: atomic bitwise and with
	relaxed semantics.
	(atomic_and_relaxed): Likewise.
	(__arch_atomic_or_32): New macro: atomic bitwise or with acquire
	semantics.
	(__arch_atomic_or_32_relaxed): New macro: atomic bitwise or with
	relaxed semantics.
	(atomic_or_relaxed): Likewise.
	(__atomic_is_single_thread): New macro: check whether the program is
	single-threaded.
	(atomic_compare_and_exchange_val_acq): Use the relaxed variant when
	single-threaded.
	(atomic_compare_and_exchange_val_rel): Likewise.
	(atomic_exchange_acq): Likewise.
	(atomic_exchange_rel): Likewise.
	(catomic_and): Likewise.
	(catomic_or): Likewise.
	* sysdeps/powerpc/powerpc32/bits/atomic.h
	(__arch_compare_and_exchange_val_64_relaxed): New macro: empty
	implementation (calls abort).
	(__arch_atomic_exchange_64_relaxed): Likewise.
	* sysdeps/powerpc/powerpc64/bits/atomic.h
	(__arch_compare_and_exchange_val_64_relaxed): New macro: atomic compare
	and exchange with relaxed semantics.
	(__arch_atomic_exchange_64_relaxed): New macro: atomic exchange with
	relaxed semantics.
	(__arch_atomic_and_64_relaxed): New macro: atomic bitwise and with
	relaxed semantics.
	(__arch_atomic_and_64): New macro: atomic bitwise and with acquire
	semantics.
	(__arch_atomic_or_64_relaxed): New macro: atomic bitwise or with
	relaxed semantics.
	(__arch_atomic_or_64): New macro: atomic bitwise or with acquire
	semantics.

---

diff --git a/sysdeps/powerpc/bits/atomic.h b/sysdeps/powerpc/bits/atomic.h
index 2ffba48..be590c7 100644
--- a/sysdeps/powerpc/bits/atomic.h
+++ b/sysdeps/powerpc/bits/atomic.h
@@ -27,6 +27,7 @@
  */
 
 #include <stdint.h>
+#include <tls.h>
 
 typedef int32_t atomic32_t;
 typedef uint32_t uatomic32_t;
@@ -113,6 +114,23 @@ typedef uintmax_t uatomic_max_t;
       __tmp;								      \
   })
 
+#define __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval)	      \
+  ({									      \
+      __typeof (*(mem)) __tmp;						      \
+      __typeof (mem)  __memp = (mem);					      \
+      __asm __volatile (						      \
+		        "1:	lwarx	%0,0,%1\n"			      \
+		        "	cmpw	%0,%2\n"			      \
+		        "	bne	2f\n"				      \
+		        "	stwcx.	%3,0,%1\n"			      \
+		        "	bne-	1b\n"				      \
+		        "2:	"					      \
+		        : "=&r" (__tmp)					      \
+		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
+		        : "cr0", "memory");				      \
+      __tmp;								      \
+  })
+
 #define __arch_atomic_exchange_32_acq(mem, value)			      \
   ({									      \
     __typeof (*mem) __val;						      \
@@ -127,6 +145,18 @@ typedef uintmax_t uatomic_max_t;
     __val;								      \
   })
 
+#define __arch_atomic_exchange_32_relaxed(mem, value)			      \
+  ({									      \
+    __typeof (*mem) __val;						      \
+    __asm __volatile ("1:	lwarx	%0,0,%2\n"			      \
+		      "		stwcx.	%3,0,%2\n"			      \
+		      "		bne-	1b\n"				      \
+		      : "=&r" (__val), "=m" (*mem)			      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
 #define __arch_atomic_exchange_32_rel(mem, value) \
   ({									      \
     __typeof (*mem) __val;						      \
@@ -140,6 +170,7 @@ typedef uintmax_t uatomic_max_t;
     __val;								      \
   })
 
+
 #define __arch_atomic_exchange_and_add_32(mem, value) \
   ({									      \
     __typeof (*mem) __val, __tmp;					      \
@@ -153,6 +184,62 @@ typedef uintmax_t uatomic_max_t;
     __val;								      \
   })
 
+#define __arch_atomic_and_32(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	lwarx	%0,0,%3\n"			      \
+		      "		and	%1,%0,%4\n"			      \
+		      "		stwcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      "         " __ARCH_ACQ_INSTR			      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_and_32_relaxed(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("		\n"					      \
+		      "1:	lwarx	%0,0,%3\n"			      \
+		      "		and	%1,%0,%4\n"			      \
+		      "		stwcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_or_32(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	lwarx	%0,0,%3\n"			      \
+		      "		or	%1,%0,%4\n"			      \
+		      "		stwcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      "         " __ARCH_ACQ_INSTR	 		      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_or_32_relaxed(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("		\n"					      \
+		      "1:	lwarx	%0,0,%3\n"			      \
+		      "		or	%1,%0,%4\n"			      \
+		      "		stwcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
 #define __arch_atomic_increment_val_32(mem) \
   ({									      \
     __typeof (*(mem)) __val;						      \
@@ -194,10 +281,27 @@ typedef uintmax_t uatomic_max_t;
      __val;								      \
   })
 
-#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
+#define __atomic_is_single_thread                                             \
+  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
+
+#define atomic_compare_and_exchange_val_relaxed(mem, newval, oldval) \
   ({									      \
     __typeof (*(mem)) __result;						      \
     if (sizeof (*mem) == 4)						      \
+      __result = __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval); \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval); \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
+
+#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_compare_and_exchange_val_relaxed (mem, newval, oldval); \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_compare_and_exchange_val_32_acq(mem, newval, oldval); \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_compare_and_exchange_val_64_acq(mem, newval, oldval); \
@@ -209,7 +313,9 @@ typedef uintmax_t uatomic_max_t;
 #define atomic_compare_and_exchange_val_rel(mem, newval, oldval) \
   ({									      \
     __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_compare_and_exchange_val_relaxed (mem, newval, oldval); \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_compare_and_exchange_val_32_rel(mem, newval, oldval); \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_compare_and_exchange_val_64_rel(mem, newval, oldval); \
@@ -218,10 +324,24 @@ typedef uintmax_t uatomic_max_t;
     __result;								      \
   })
 
-#define atomic_exchange_acq(mem, value) \
+#define atomic_exchange_relaxed(mem, value) \
   ({									      \
     __typeof (*(mem)) __result;						      \
     if (sizeof (*mem) == 4)						      \
+      __result = __arch_atomic_exchange_32_relaxed (mem, value);	      \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_atomic_exchange_64_relaxed (mem, value);	      \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
+
+#define atomic_exchange_acq(mem, value)					      \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_exchange_relaxed (mem, value);			      \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_atomic_exchange_32_acq (mem, value);		      \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_atomic_exchange_64_acq (mem, value);		      \
@@ -233,7 +353,9 @@ typedef uintmax_t uatomic_max_t;
 #define atomic_exchange_rel(mem, value) \
   ({									      \
     __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_exchange_relaxed (mem, value);			      \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_atomic_exchange_32_rel (mem, value);		      \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_atomic_exchange_64_rel (mem, value);		      \
@@ -294,3 +416,55 @@ typedef uintmax_t uatomic_max_t;
        abort ();							      \
     __result;								      \
   })
+
+#define atomic_and_relaxed(mem, arg) \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (sizeof (*mem) == 4)						      \
+      __result = __arch_atomic_and_32_relaxed(mem, arg);		      \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_atomic_and_64_relaxed(mem, arg);		      \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
+
+#define catomic_and(mem, arg) \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_and_relaxed (mem, arg); \
+    else if (sizeof (*mem) == 4)					      \
+      __result = __arch_atomic_and_32(mem, arg); \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_atomic_and_64(mem, arg); \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
+
+#define atomic_or_relaxed(mem, arg) \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (sizeof (*mem) == 4)						      \
+      __result = __arch_atomic_or_32_relaxed(mem, arg);			      \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_atomic_or_64_relaxed(mem, arg);			      \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
+
+#define catomic_or(mem, arg) \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_or_relaxed (mem, arg); \
+    else if (sizeof (*mem) == 4)					      \
+      __result = __arch_atomic_or_32(mem, arg); \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_atomic_or_64(mem, arg); \
+    else 								      \
+       abort ();							      \
+    __result;								      \
+  })
diff --git a/sysdeps/powerpc/powerpc32/bits/atomic.h b/sysdeps/powerpc/powerpc32/bits/atomic.h
index 7613bdc..1b9f82a 100644
--- a/sysdeps/powerpc/powerpc32/bits/atomic.h
+++ b/sysdeps/powerpc/powerpc32/bits/atomic.h
@@ -83,6 +83,9 @@
 #define __arch_compare_and_exchange_bool_64_rel(mem, newval, oldval) \
   (abort (), 0)
 
+#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
+  (abort (), (__typeof (*mem)) 0)
+
 #define __arch_compare_and_exchange_val_64_rel(mem, newval, oldval) \
   (abort (), (__typeof (*mem)) 0)
 
@@ -92,6 +95,9 @@
 #define __arch_atomic_exchange_64_rel(mem, value) \
     ({ abort (); (*mem) = (value); })
 
+#define __arch_atomic_exchange_64_relaxed(mem, value) \
+    ({ abort (); (*mem) = (value); })
+
 #define __arch_atomic_exchange_and_add_64(mem, value) \
     ({ abort (); (*mem) = (value); })
 
diff --git a/sysdeps/powerpc/powerpc64/bits/atomic.h b/sysdeps/powerpc/powerpc64/bits/atomic.h
index 527fe7c..b8a1035 100644
--- a/sysdeps/powerpc/powerpc64/bits/atomic.h
+++ b/sysdeps/powerpc/powerpc64/bits/atomic.h
@@ -143,6 +143,23 @@
       __tmp;								      \
   })
 
+#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
+  ({									      \
+      __typeof (*(mem)) __tmp;						      \
+      __typeof (mem)  __memp = (mem);					      \
+      __asm __volatile ("\n"				      \
+		        "1:	ldarx	%0,0,%1\n"	      \
+		        "	cmpd	%0,%2\n"			      \
+		        "	bne	2f\n"				      \
+		        "	stdcx.	%3,0,%1\n"			      \
+		        "	bne-	1b\n"				      \
+		        "2:	"					      \
+		        : "=&r" (__tmp)					      \
+		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
+		        : "cr0", "memory");				      \
+      __tmp;								      \
+  })
+
 #define __arch_atomic_exchange_64_acq(mem, value) \
     ({									      \
       __typeof (*mem) __val;						      \
@@ -170,6 +187,18 @@
       __val;								      \
     })
 
+#define __arch_atomic_exchange_64_relaxed(mem, value) \
+    ({									      \
+      __typeof (*mem) __val;						      \
+      __asm __volatile ("1:	ldarx	%0,0,%2\n"			      \
+			"	stdcx.	%3,0,%2\n"			      \
+			"	bne-	1b"				      \
+			: "=&r" (__val), "=m" (*mem)			      \
+			: "b" (mem), "r" (value), "m" (*mem)		      \
+			: "cr0", "memory");				      \
+      __val;								      \
+    })
+
 #define __arch_atomic_exchange_and_add_64(mem, value) \
     ({									      \
       __typeof (*mem) __val, __tmp;					      \
@@ -224,6 +253,60 @@
      __val;								      \
   })
 
+#define __arch_atomic_and_64_relaxed(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
+		      "		and	%1,%0,%4\n"			      \
+		      "		stdcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_and_64(mem, value) \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
+		      "		and	%1,%0,%4\n"			      \
+		      "		stdcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      "       " __ARCH_ACQ_INSTR			      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_or_64_relaxed(mem, value)				      \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
+		      "		or	%1,%0,%4\n"			      \
+		      "		stdcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
+#define __arch_atomic_or_64(mem, value)					      \
+  ({									      \
+    __typeof (*mem) __val, __tmp;					      \
+    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
+		      "		or	%1,%0,%4\n"			      \
+		      "		stdcx.	%1,0,%3\n"			      \
+		      "		bne-	1b\n"				      \
+		      "       " __ARCH_ACQ_INSTR			      \
+		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
+		      : "b" (mem), "r" (value), "m" (*mem)		      \
+		      : "cr0", "memory");				      \
+    __val;								      \
+  })
+
 /*
  * All powerpc64 processors support the new "light weight"  sync (lwsync).
  */

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Future atomic operation cleanup
  2014-08-22 13:50 [PATCH v2] PowerPC: libc single-thread lock optimization Adhemerval Zanella
@ 2014-08-22 14:00 ` Richard Henderson
  2014-08-22 14:50   ` Joseph S. Myers
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Richard Henderson @ 2014-08-22 14:00 UTC (permalink / raw)
  To: Adhemerval Zanella, GNU C. Library

On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> Hi,
> 
> Following comments from my first patch to optimize single-thread internal
> glibc locking/atomics [1], I have changed the implementation to use now
> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
> the primitives are still signal-safe (although not thread-safe), so if future
> malloc implementation is changed to be async-safe, it won't require to a
> adjust powerpc atomics.
> 
> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> (which powerpc is currently using) and implemented the atomics with acquire
> semantics.  The new implementation also is simpler.
> 
> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> calls and an performance increase of 7-8% in 483.xalancbmk from
> speccpu2006 (number from an POWER8 machine).
> 
> Checked on powerpc64, powerpc32 and powerpc64le.

Wow, that's a lot of boiler-plate.

When development opens again, can we simplify all of these atomic operations by
assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
and later, fall back to the __sync builtins for earlier gcc, and completely
drop support for truly ancient compilers that support neither.

As a bonus we'd get to unify the implementations across all of the targets.
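
For instance (a sketch only, with a made-up macro name), a generic
compare-and-exchange could reduce to something like:

  /* GCC >= 4.8: use the __atomic builtins; otherwise fall back to the
     older __sync builtins.  Returns the previous value, like
     atomic_compare_and_exchange_val_acq does today.  */
  #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8)
  # define generic_cas_val_acq(mem, newval, oldval)			\
    ({ __typeof (*(mem)) __old = (oldval);				\
       __atomic_compare_exchange_n ((mem), &__old, (newval), 0,	\
				    __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE); \
       __old; })
  #else
  # define generic_cas_val_acq(mem, newval, oldval)			\
    __sync_val_compare_and_swap ((mem), (oldval), (newval))
  #endif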


r~

> 
> [1] https://sourceware.org/ml/libc-alpha/2014-05/msg00118.html
> 
> --
> 
> 	* sysdeps/powerpc/bits/atomic.h
> 	(__arch_compare_and_exchange_val_32_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantic.
> 	(atomic_compare_and_exchange_val_relaxed): Likewise.
> 	(__arch_atomic_exchange_32_relaxed): New macro: atomic exchange with
> 	relaxed semantic.
> 	(atomic_exchange_relaxed): Likewise.
> 	(__arch_atomic_and_32): New macro: atomic bitwise and with acquire
> 	semantic.
> 	(__arch_atomic_and_32_relaxed): New macro: atomic bitwise and with
> 	relaxed semantic.
> 	(atomic_and_relaxed): Likewise.
> 	(__arch_atomic_or_32): New macro: atomic bitwise and with acquire
> 	semantic.
> 	(__arch_atomic_or_32_relaxed): New macro: atomic bitwise and with
> 	relaxed semantic.
> 	(atomic_or_relaxed): Likewise.
> 	(__atomic_is_single_thread): New macro: check if program is
> 	single-thread.
> 	(atomic_compare_and_exchange_val_acq): Add relaxed operation for
> 	single-thread.
> 	(atomic_compare_and_exchange_val_rel): Likewise.
> 	(atomic_exchange_rel): Likewise.
> 	(catomic_and): Likewise.
> 	(catomic_or): Likewise.
> 	* sysdeps/powerpc/powerpc32/bits/atomic.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: add empty
> 	implementation.
> 	(__arch_atomic_exchange_64_relaxed): Likewise.
> 	* sysdeps/powerpc/powerpc32/bits/atomic.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantics.
> 	(__arch_atomic_exchange_64_relaxed): New macro: atomic exchange with
> 	relaxed semantic.
> 	(__arch_atomic_and_64_relaxed): New macro: atomic exchange with
> 	relaxed semantic.
> 	(__arch_atomic_and_64): New macro: atomic bitwise and with acquire
> 	semantic.
> 	(__arch_atomic_or_64_relaxed): New macro: atomic bitwise or with
> 	relaxed semantic.
> 	(__arch_atomic_or_64): New macro: atomic bitwise or with acquire
> 	semantic.
> 
> ---
> 
> diff --git a/sysdeps/powerpc/bits/atomic.h b/sysdeps/powerpc/bits/atomic.h
> index 2ffba48..be590c7 100644
> --- a/sysdeps/powerpc/bits/atomic.h
> +++ b/sysdeps/powerpc/bits/atomic.h
> @@ -27,6 +27,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <tls.h>
>  
>  typedef int32_t atomic32_t;
>  typedef uint32_t uatomic32_t;
> @@ -113,6 +114,23 @@ typedef uintmax_t uatomic_max_t;
>        __tmp;								      \
>    })
>  
> +#define __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval)	      \
> +  ({									      \
> +      __typeof (*(mem)) __tmp;						      \
> +      __typeof (mem)  __memp = (mem);					      \
> +      __asm __volatile (						      \
> +		        "1:	lwarx	%0,0,%1\n"			      \
> +		        "	cmpw	%0,%2\n"			      \
> +		        "	bne	2f\n"				      \
> +		        "	stwcx.	%3,0,%1\n"			      \
> +		        "	bne-	1b\n"				      \
> +		        "2:	"					      \
> +		        : "=&r" (__tmp)					      \
> +		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
> +		        : "cr0", "memory");				      \
> +      __tmp;								      \
> +  })
> +
>  #define __arch_atomic_exchange_32_acq(mem, value)			      \
>    ({									      \
>      __typeof (*mem) __val;						      \
> @@ -127,6 +145,18 @@ typedef uintmax_t uatomic_max_t;
>      __val;								      \
>    })
>  
> +#define __arch_atomic_exchange_32_relaxed(mem, value)			      \
> +  ({									      \
> +    __typeof (*mem) __val;						      \
> +    __asm __volatile ("1:	lwarx	%0,0,%2\n"			      \
> +		      "		stwcx.	%3,0,%2\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      : "=&r" (__val), "=m" (*mem)			      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
>  #define __arch_atomic_exchange_32_rel(mem, value) \
>    ({									      \
>      __typeof (*mem) __val;						      \
> @@ -140,6 +170,7 @@ typedef uintmax_t uatomic_max_t;
>      __val;								      \
>    })
>  
> +
>  #define __arch_atomic_exchange_and_add_32(mem, value) \
>    ({									      \
>      __typeof (*mem) __val, __tmp;					      \
> @@ -153,6 +184,62 @@ typedef uintmax_t uatomic_max_t;
>      __val;								      \
>    })
>  
> +#define __arch_atomic_and_32(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	lwarx	%0,0,%3\n"			      \
> +		      "		add	%1,%0,%4\n"			      \
> +		      "		stwcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      "         " __ARCH_ACQ_INSTR			      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_and_32_relaxed(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("		\n"					      \
> +		      "1:	lwarx	%0,0,%3\n"			      \
> +		      "		and	%1,%0,%4\n"			      \
> +		      "		stwcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_or_32(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	lwarx	%0,0,%3\n"			      \
> +		      "		or	%1,%0,%4\n"			      \
> +		      "		stwcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      "         " __ARCH_ACQ_INSTR	 		      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_or_32_relaxed(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("		\n"					      \
> +		      "1:	lwarx	%0,0,%3\n"			      \
> +		      "		or	%1,%0,%4\n"			      \
> +		      "		stwcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
>  #define __arch_atomic_increment_val_32(mem) \
>    ({									      \
>      __typeof (*(mem)) __val;						      \
> @@ -194,10 +281,27 @@ typedef uintmax_t uatomic_max_t;
>       __val;								      \
>    })
>  
> -#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> +#define __atomic_is_single_thread                                             \
> +  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> +
> +#define atomic_compare_and_exchange_val_relaxed(mem, newval, oldval) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
>      if (sizeof (*mem) == 4)						      \
> +      __result = __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval); \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval); \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
> +#define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_compare_and_exchange_val_relaxed (mem, newval, oldval); \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_compare_and_exchange_val_32_acq(mem, newval, oldval); \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_compare_and_exchange_val_64_acq(mem, newval, oldval); \
> @@ -209,7 +313,9 @@ typedef uintmax_t uatomic_max_t;
>  #define atomic_compare_and_exchange_val_rel(mem, newval, oldval) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_compare_and_exchange_val_relaxed (mem, newval, oldval); \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_compare_and_exchange_val_32_rel(mem, newval, oldval); \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_compare_and_exchange_val_64_rel(mem, newval, oldval); \
> @@ -218,10 +324,24 @@ typedef uintmax_t uatomic_max_t;
>      __result;								      \
>    })
>  
> -#define atomic_exchange_acq(mem, value) \
> +#define atomic_exchange_relaxed(mem, value) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
>      if (sizeof (*mem) == 4)						      \
> +      __result = __arch_atomic_exchange_32_relaxed (mem, value);	      \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_atomic_exchange_64_relaxed (mem, value);	      \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
> +#define atomic_exchange_acq(mem, value)					      \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_exchange_relaxed (mem, value);			      \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_atomic_exchange_32_acq (mem, value);		      \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_atomic_exchange_64_acq (mem, value);		      \
> @@ -233,7 +353,9 @@ typedef uintmax_t uatomic_max_t;
>  #define atomic_exchange_rel(mem, value) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_exchange_relaxed (mem, value);			      \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_atomic_exchange_32_rel (mem, value);		      \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_atomic_exchange_64_rel (mem, value);		      \
> @@ -294,3 +416,55 @@ typedef uintmax_t uatomic_max_t;
>         abort ();							      \
>      __result;								      \
>    })
> +
> +#define atomic_and_relaxed(mem, arg) \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (sizeof (*mem) == 4)						      \
> +      __result = __arch_atomic_and_32_relaxed(mem, arg);		      \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_atomic_and_64_relaxed(mem, arg);		      \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
> +#define catomic_and(mem, arg) \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_and_relaxed (mem, arg); \
> +    else if (sizeof (*mem) == 4)					      \
> +      __result = __arch_atomic_and_32(mem, arg); \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_atomic_and_64(mem, arg); \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
> +#define atomic_or_relaxed(mem, arg) \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (sizeof (*mem) == 4)						      \
> +      __result = __arch_atomic_or_32_relaxed(mem, arg);			      \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_atomic_or_64_relaxed(mem, arg);			      \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
> +#define catomic_or(mem, arg) \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_or_relaxed (mem, arg); \
> +    else if (sizeof (*mem) == 4)					      \
> +      __result = __arch_atomic_or_32(mem, arg); \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_atomic_or_64(mem, arg); \
> +    else 								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> diff --git a/sysdeps/powerpc/powerpc32/bits/atomic.h b/sysdeps/powerpc/powerpc32/bits/atomic.h
> index 7613bdc..1b9f82a 100644
> --- a/sysdeps/powerpc/powerpc32/bits/atomic.h
> +++ b/sysdeps/powerpc/powerpc32/bits/atomic.h
> @@ -83,6 +83,9 @@
>  #define __arch_compare_and_exchange_bool_64_rel(mem, newval, oldval) \
>    (abort (), 0)
>  
> +#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
> +  (abort (), (__typeof (*mem)) 0)
> +
>  #define __arch_compare_and_exchange_val_64_rel(mem, newval, oldval) \
>    (abort (), (__typeof (*mem)) 0)
>  
> @@ -92,6 +95,9 @@
>  #define __arch_atomic_exchange_64_rel(mem, value) \
>      ({ abort (); (*mem) = (value); })
>  
> +#define __arch_atomic_exchange_64_relaxed(mem, value) \
> +    ({ abort (); (*mem) = (value); })
> +
>  #define __arch_atomic_exchange_and_add_64(mem, value) \
>      ({ abort (); (*mem) = (value); })
>  
> diff --git a/sysdeps/powerpc/powerpc64/bits/atomic.h b/sysdeps/powerpc/powerpc64/bits/atomic.h
> index 527fe7c..b8a1035 100644
> --- a/sysdeps/powerpc/powerpc64/bits/atomic.h
> +++ b/sysdeps/powerpc/powerpc64/bits/atomic.h
> @@ -143,6 +143,23 @@
>        __tmp;								      \
>    })
>  
> +#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
> +  ({									      \
> +      __typeof (*(mem)) __tmp;						      \
> +      __typeof (mem)  __memp = (mem);					      \
> +      __asm __volatile ("\n"				      \
> +		        "1:	ldarx	%0,0,%1\n"	      \
> +		        "	cmpd	%0,%2\n"			      \
> +		        "	bne	2f\n"				      \
> +		        "	stdcx.	%3,0,%1\n"			      \
> +		        "	bne-	1b\n"				      \
> +		        "2:	"					      \
> +		        : "=&r" (__tmp)					      \
> +		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
> +		        : "cr0", "memory");				      \
> +      __tmp;								      \
> +  })
> +
>  #define __arch_atomic_exchange_64_acq(mem, value) \
>      ({									      \
>        __typeof (*mem) __val;						      \
> @@ -170,6 +187,18 @@
>        __val;								      \
>      })
>  
> +#define __arch_atomic_exchange_64_relaxed(mem, value) \
> +    ({									      \
> +      __typeof (*mem) __val;						      \
> +      __asm __volatile ("1:	ldarx	%0,0,%2\n"			      \
> +			"	stdcx.	%3,0,%2\n"			      \
> +			"	bne-	1b"				      \
> +			: "=&r" (__val), "=m" (*mem)			      \
> +			: "b" (mem), "r" (value), "m" (*mem)		      \
> +			: "cr0", "memory");				      \
> +      __val;								      \
> +    })
> +
>  #define __arch_atomic_exchange_and_add_64(mem, value) \
>      ({									      \
>        __typeof (*mem) __val, __tmp;					      \
> @@ -224,6 +253,60 @@
>       __val;								      \
>    })
>  
> +#define __arch_atomic_and_64_relaxed(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
> +		      "		and	%1,%0,%4\n"			      \
> +		      "		stdcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_and_64(mem, value) \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
> +		      "		and	%1,%0,%4\n"			      \
> +		      "		stdcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      "       " __ARCH_ACQ_INSTR			      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_or_64_relaxed(mem, value)				      \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
> +		      "		or	%1,%0,%4\n"			      \
> +		      "		stdcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
> +#define __arch_atomic_or_64(mem, value)					      \
> +  ({									      \
> +    __typeof (*mem) __val, __tmp;					      \
> +    __asm __volatile ("1:	ldarx	%0,0,%3\n"			      \
> +		      "		or	%1,%0,%4\n"			      \
> +		      "		stdcx.	%1,0,%3\n"			      \
> +		      "		bne-	1b\n"				      \
> +		      "       " __ARCH_ACQ_INSTR			      \
> +		      : "=&b" (__val), "=&r" (__tmp), "=m" (*mem)	      \
> +		      : "b" (mem), "r" (value), "m" (*mem)		      \
> +		      : "cr0", "memory");				      \
> +    __val;								      \
> +  })
> +
>  /*
>   * All powerpc64 processors support the new "light weight"  sync (lwsync).
>   */
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-22 14:00 ` Future atomic operation cleanup Richard Henderson
@ 2014-08-22 14:50   ` Joseph S. Myers
  2014-08-22 17:26   ` Torvald Riegel
  2014-08-28 13:57   ` Adhemerval Zanella
  2 siblings, 0 replies; 20+ messages in thread
From: Joseph S. Myers @ 2014-08-22 14:50 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Adhemerval Zanella, GNU C. Library

On Fri, 22 Aug 2014, Richard Henderson wrote:

> When development opens again, can we simplify all of these atomic operations by
> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> and later, fall back to the __sync builtins for earlier gcc, and completely
> drop support for truly ancient compilers that support neither.
> 
> As a bonus we'd get to unify the implementations across all of the targets.

Well, no, we wouldn't - __atomic_* and __sync_* aren't expanded inline on 
all targets, and at least we can't depend on libatomic (and I'm not sure 
about using out-of-line __sync_* from libgcc.a either).  But you could 
have a default version of bits/atomic.h defining things in the way you 
outline (supporting using __atomic_* with GCC 4.7 as well if desired by 
the architecture - MIPS deliberately does that for MIPS16 even though 
efficient __atomic_* support on MIPS was only added for 4.8), conditional 
on the macros such as __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4, and with 
architectures only overriding that default to allow for cases where some 
of the operations may not be expanded inline by GCC.
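
As a rough sketch of such a default (illustrative only; the real thing would
be per-size and per-operation):

  /* Only provide the generic definition when the compiler can expand the
     builtin inline for this size; otherwise the architecture's
     bits/atomic.h has to supply its own.  */
  #ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
  # define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
    __sync_val_compare_and_swap ((mem), (oldval), (newval))
  #endif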

(We could also increase the minimum GCC version for building glibc from 
4.4 to 4.6 or 4.7 to reduce the number of cases needing consideration of 
what compiler support is / is not present.)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-22 14:00 ` Future atomic operation cleanup Richard Henderson
  2014-08-22 14:50   ` Joseph S. Myers
@ 2014-08-22 17:26   ` Torvald Riegel
  2014-08-28 13:57   ` Adhemerval Zanella
  2 siblings, 0 replies; 20+ messages in thread
From: Torvald Riegel @ 2014-08-22 17:26 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Adhemerval Zanella, GNU C. Library

On Fri, 2014-08-22 at 07:00 -0700, Richard Henderson wrote:
> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> > Hi,
> > 
> > Following comments from my first patch to optimize single-thread internal
> > glibc locking/atomics [1], I have changed the implementation to use now
> > relaxed atomics instead.  Addresing the concerns raised in last discussion, 
> > the primitives are still signal-safe (although not thread-safe), so if future
> > malloc implementation is changed to be async-safe, it won't require to a
> > adjust powerpc atomics.
> > 
> > For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> > (which powerpc is currently using) and implemented the atomics with acquire
> > semantics.  The new implementation also is simpler.
> > 
> > On synthetic benchmarks it shows an improvement of 5-10% for malloc
> > calls and an performance increase of 7-8% in 483.xalancbmk from
> > speccpu2006 (number from an POWER8 machine).
> > 
> > Checked on powerpc64, powerpc32 and powerpc64le.
> 
> Wow, that's a lot of boiler-plate.
> 
> When development opens again, can we simplify all of these atomic operations by
> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> and later, fall back to the __sync builtins for earlier gcc, and completely
> drop support for truly ancient compilers that support neither.

I definitely agree we should be moving away from the custom
implementation of the atomics.  I'm fine with having wrappers to enable
the custom stuff Joseph mentioned.

> As a bonus we'd get to unify the implementations across all of the targets.

And we should take that as an opportunity to unify towards the C11 model
and operations.  Not the same function names, but the same functionality,
including relaxed MO, and relaxed-MO atomic loads/stores instead of plain
loads/stores where those are used to synchronize, etc.
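
Concretely, something of this shape (a sketch; these names are placeholders,
not a proposal for the final spelling):

  /* C11-like operations with explicit memory orders, mapped onto the
     __atomic builtins.  */
  #define atomic_load_relaxed(mem) \
    __atomic_load_n ((mem), __ATOMIC_RELAXED)
  #define atomic_store_relaxed(mem, val) \
    __atomic_store_n ((mem), (val), __ATOMIC_RELAXED)
  #define atomic_fetch_add_acquire(mem, val) \
    __atomic_fetch_add ((mem), (val), __ATOMIC_ACQUIRE)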

This has been on my list for a while, but I didn't get to it so far.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-22 14:00 ` Future atomic operation cleanup Richard Henderson
  2014-08-22 14:50   ` Joseph S. Myers
  2014-08-22 17:26   ` Torvald Riegel
@ 2014-08-28 13:57   ` Adhemerval Zanella
  2014-08-28 15:37     ` Torvald Riegel
  2 siblings, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2014-08-28 13:57 UTC (permalink / raw)
  To: Richard Henderson, GNU C. Library

On 22-08-2014 11:00, Richard Henderson wrote:
> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
>> Hi,
>>
>> Following comments from my first patch to optimize single-thread internal
>> glibc locking/atomics [1], I have changed the implementation to use now
>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
>> the primitives are still signal-safe (although not thread-safe), so if future
>> malloc implementation is changed to be async-safe, it won't require to a
>> adjust powerpc atomics.
>>
>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
>> (which powerpc is currently using) and implemented the atomics with acquire
>> semantics.  The new implementation also is simpler.
>>
>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
>> calls and an performance increase of 7-8% in 483.xalancbmk from
>> speccpu2006 (number from an POWER8 machine).
>>
>> Checked on powerpc64, powerpc32 and powerpc64le.
> Wow, that's a lot of boiler-plate.
>
> When development opens again, can we simplify all of these atomic operations by
> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> and later, fall back to the __sync builtins for earlier gcc, and completely
> drop support for truly ancient compilers that support neither.
>
> As a bonus we'd get to unify the implementations across all of the targets.
>
>
> r~
>
I also agree we should move to a more unified implementation (in fact, I
plan to get rid of the powerpc lowlevellock.h when development opens again).
However, I really don't want to either wait or reimplement all the custom
atomics just to push this optimization...

I think such a change will require a lot of iteration and testing, which is not
the intent of this patch.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-28 13:57   ` Adhemerval Zanella
@ 2014-08-28 15:37     ` Torvald Riegel
  2014-08-28 16:28       ` Adhemerval Zanella
  0 siblings, 1 reply; 20+ messages in thread
From: Torvald Riegel @ 2014-08-28 15:37 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Richard Henderson, GNU C. Library

On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
> On 22-08-2014 11:00, Richard Henderson wrote:
> > On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> >> Hi,
> >>
> >> Following comments from my first patch to optimize single-thread internal
> >> glibc locking/atomics [1], I have changed the implementation to use now
> >> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
> >> the primitives are still signal-safe (although not thread-safe), so if future
> >> malloc implementation is changed to be async-safe, it won't require to a
> >> adjust powerpc atomics.
> >>
> >> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> >> (which powerpc is currently using) and implemented the atomics with acquire
> >> semantics.  The new implementation also is simpler.
> >>
> >> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> >> calls and an performance increase of 7-8% in 483.xalancbmk from
> >> speccpu2006 (number from an POWER8 machine).
> >>
> >> Checked on powerpc64, powerpc32 and powerpc64le.
> > Wow, that's a lot of boiler-plate.
> >
> > When development opens again, can we simplify all of these atomic operations by
> > assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> > and later, fall back to the __sync builtins for earlier gcc, and completely
> > drop support for truly ancient compilers that support neither.
> >
> > As a bonus we'd get to unify the implementations across all of the targets.
> >
> >
> > r~
> >
> I also agree we should move to more a unified implementation (in fact, I
> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
> I really don't want to either wait or reimplement all the custom atomic to push
> this optimization... 

I believe that, unless the caveat Joseph mentioned actually applies to
the archs you're concerned about, using the compiler builtins and doing
the unification would simplify your patch considerably.  It would also
avoid having to iterate over the changes of your patch again once we do
all of the unification.

> I think such change will require a lot of iteration and testing, which is not
> the intend of this patch.

If we move to C11-like atomics, then those will certainly go in
incrementally, and exist in parallel with the current atomics for a
while (until we have reviewed all existing code using the old atomics and
moved it over to the new ones).

If we use the compiler builtins, we'd also have less to test because there
would be no additional custom atomics implementation to maintain.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-28 15:37     ` Torvald Riegel
@ 2014-08-28 16:28       ` Adhemerval Zanella
  2014-08-29  9:55         ` Torvald Riegel
  0 siblings, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2014-08-28 16:28 UTC (permalink / raw)
  To: libc-alpha

On 28-08-2014 12:37, Torvald Riegel wrote:
> On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
>> On 22-08-2014 11:00, Richard Henderson wrote:
>>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
>>>> Hi,
>>>>
>>>> Following comments from my first patch to optimize single-thread internal
>>>> glibc locking/atomics [1], I have changed the implementation to use now
>>>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
>>>> the primitives are still signal-safe (although not thread-safe), so if future
>>>> malloc implementation is changed to be async-safe, it won't require to a
>>>> adjust powerpc atomics.
>>>>
>>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
>>>> (which powerpc is currently using) and implemented the atomics with acquire
>>>> semantics.  The new implementation also is simpler.
>>>>
>>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
>>>> calls and an performance increase of 7-8% in 483.xalancbmk from
>>>> speccpu2006 (number from an POWER8 machine).
>>>>
>>>> Checked on powerpc64, powerpc32 and powerpc64le.
>>> Wow, that's a lot of boiler-plate.
>>>
>>> When development opens again, can we simplify all of these atomic operations by
>>> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
>>> and later, fall back to the __sync builtins for earlier gcc, and completely
>>> drop support for truly ancient compilers that support neither.
>>>
>>> As a bonus we'd get to unify the implementations across all of the targets.
>>>
>>>
>>> r~
>>>
>> I also agree we should move to more a unified implementation (in fact, I
>> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
>> I really don't want to either wait or reimplement all the custom atomic to push
>> this optimization... 
> I believe that, unless the caveat Joseph mentioned actually applies to
> the archs you're concerned about, using the compiler builtins and doing
> the unification would simplify your patch considerably.  It would also
> avoid having to iterate over the changes of your patch again once we do
> all of the unification.
>
>> I think such change will require a lot of iteration and testing, which is not
>> the intend of this patch.
> If we move to C11-like atomics, then those will certainly go in
> incrementally, and exist in parallel with the current atomics for a
> while (until we reviewed all existing code using the old atomics and
> moved it over to the new atomics).
>
> If we use the compiler builtins, we'd also have to test less because we
> have no additional custom atomics implementation we need to maintain.
>
I do agree that moving to the compiler builtins is the best way, although with
the current minimum of GCC 4.4 we still need hand-crafted assembly for the atomics.
Also, the __atomic builtins are only supported for GCC 4.7+ afaik [1], and the C11
ones for 4.9.  We could do what Joseph suggested and move the minimum required
compiler to GCC 4.7.

But if we do intend to raise the minimum to GCC 4.7, I can work on changing all the
atomics to GCC builtins.

[1] https://gcc.gnu.org/onlinedocs/gcc-4.7.4/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-28 16:28       ` Adhemerval Zanella
@ 2014-08-29  9:55         ` Torvald Riegel
  2014-08-29 11:10           ` Adhemerval Zanella
  0 siblings, 1 reply; 20+ messages in thread
From: Torvald Riegel @ 2014-08-29  9:55 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Thu, 2014-08-28 at 13:27 -0300, Adhemerval Zanella wrote:
> On 28-08-2014 12:37, Torvald Riegel wrote:
> > On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
> >> On 22-08-2014 11:00, Richard Henderson wrote:
> >>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> >>>> Hi,
> >>>>
> >>>> Following comments from my first patch to optimize single-thread internal
> >>>> glibc locking/atomics [1], I have changed the implementation to use now
> >>>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
> >>>> the primitives are still signal-safe (although not thread-safe), so if future
> >>>> malloc implementation is changed to be async-safe, it won't require to a
> >>>> adjust powerpc atomics.
> >>>>
> >>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> >>>> (which powerpc is currently using) and implemented the atomics with acquire
> >>>> semantics.  The new implementation also is simpler.
> >>>>
> >>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> >>>> calls and an performance increase of 7-8% in 483.xalancbmk from
> >>>> speccpu2006 (number from an POWER8 machine).
> >>>>
> >>>> Checked on powerpc64, powerpc32 and powerpc64le.
> >>> Wow, that's a lot of boiler-plate.
> >>>
> >>> When development opens again, can we simplify all of these atomic operations by
> >>> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> >>> and later, fall back to the __sync builtins for earlier gcc, and completely
> >>> drop support for truly ancient compilers that support neither.
> >>>
> >>> As a bonus we'd get to unify the implementations across all of the targets.
> >>>
> >>>
> >>> r~
> >>>
> >> I also agree we should move to more a unified implementation (in fact, I
> >> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
> >> I really don't want to either wait or reimplement all the custom atomic to push
> >> this optimization... 
> > I believe that, unless the caveat Joseph mentioned actually applies to
> > the archs you're concerned about, using the compiler builtins and doing
> > the unification would simplify your patch considerably.  It would also
> > avoid having to iterate over the changes of your patch again once we do
> > all of the unification.
> >
> >> I think such change will require a lot of iteration and testing, which is not
> >> the intend of this patch.
> > If we move to C11-like atomics, then those will certainly go in
> > incrementally, and exist in parallel with the current atomics for a
> > while (until we reviewed all existing code using the old atomics and
> > moved it over to the new atomics).
> >
> > If we use the compiler builtins, we'd also have to test less because we
> > have no additional custom atomics implementation we need to maintain.
> >
> I do agree moving to a compiler specific is the best way, although for current
> status where we support GCC 4.4 we still need hand crafted assembly for atomics.
> Also, the GCC atomics are only support for 4.7+ afaik [1] and C11 for 4.9.

Note that I'm not arguing for using C11 at this point, but using atomics
that are very much like the atomic_*_explicit operations.  The function
names can be different, but functionality, parameter ordering, etc.,
would be similar.

> We can make what Joseph suggested and move minimum compiler required for GCC 4.7.
> 
> But if we intend to actually add mininum GCC 4.7, I can work on changing all the
> atomics to GCC builtins.

Another option would be to enable the new atomics that we don't
have right now (e.g., memory_order_relaxed) only when building with 4.7 or
higher, and to fall back to some of the custom assembly ones otherwise.
This would mean no performance regressions, but the optimizations would
only be effective for 4.7 or higher.  memory_order_relaxed loads/stores
could be an exception, because I think we need to make them explicit,
and we don't want to slow down code that now uses plain nonatomic
loads/stores in cases where it should actually be using memory_order_relaxed
accesses.
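
Schematically (a sketch; the names are placeholders):

  /* New operations via the builtins on GCC >= 4.7; on older compilers
     fall back to an existing, stronger operation so nothing regresses.  */
  #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)
  # define atomic_exchange_relaxed(mem, val) \
    __atomic_exchange_n ((mem), (val), __ATOMIC_RELAXED)
  #else
  # define atomic_exchange_relaxed(mem, val) \
    atomic_exchange_acq (mem, val)
  #endif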


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29  9:55         ` Torvald Riegel
@ 2014-08-29 11:10           ` Adhemerval Zanella
  2014-08-29 13:28             ` Torvald Riegel
  0 siblings, 1 reply; 20+ messages in thread
From: Adhemerval Zanella @ 2014-08-29 11:10 UTC (permalink / raw)
  To: libc-alpha

On 29-08-2014 06:55, Torvald Riegel wrote:
> On Thu, 2014-08-28 at 13:27 -0300, Adhemerval Zanella wrote:
>> On 28-08-2014 12:37, Torvald Riegel wrote:
>>> On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
>>>> On 22-08-2014 11:00, Richard Henderson wrote:
>>>>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Following comments from my first patch to optimize single-thread internal
>>>>>> glibc locking/atomics [1], I have changed the implementation to use now
>>>>>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
>>>>>> the primitives are still signal-safe (although not thread-safe), so if future
>>>>>> malloc implementation is changed to be async-safe, it won't require to a
>>>>>> adjust powerpc atomics.
>>>>>>
>>>>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
>>>>>> (which powerpc is currently using) and implemented the atomics with acquire
>>>>>> semantics.  The new implementation also is simpler.
>>>>>>
>>>>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
>>>>>> calls and an performance increase of 7-8% in 483.xalancbmk from
>>>>>> speccpu2006 (number from an POWER8 machine).
>>>>>>
>>>>>> Checked on powerpc64, powerpc32 and powerpc64le.
>>>>> Wow, that's a lot of boiler-plate.
>>>>>
>>>>> When development opens again, can we simplify all of these atomic operations by
>>>>> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
>>>>> and later, fall back to the __sync builtins for earlier gcc, and completely
>>>>> drop support for truly ancient compilers that support neither.
>>>>>
>>>>> As a bonus we'd get to unify the implementations across all of the targets.
>>>>>
>>>>>
>>>>> r~
>>>>>
>>>> I also agree we should move to more a unified implementation (in fact, I
>>>> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
>>>> I really don't want to either wait or reimplement all the custom atomic to push
>>>> this optimization... 
>>> I believe that, unless the caveat Joseph mentioned actually applies to
>>> the archs you're concerned about, using the compiler builtins and doing
>>> the unification would simplify your patch considerably.  It would also
>>> avoid having to iterate over the changes of your patch again once we do
>>> all of the unification.
>>>
>>>> I think such change will require a lot of iteration and testing, which is not
>>>> the intend of this patch.
>>> If we move to C11-like atomics, then those will certainly go in
>>> incrementally, and exist in parallel with the current atomics for a
>>> while (until we reviewed all existing code using the old atomics and
>>> moved it over to the new atomics).
>>>
>>> If we use the compiler builtins, we'd also have to test less because we
>>> have no additional custom atomics implementation we need to maintain.
>>>
>> I do agree moving to a compiler specific is the best way, although for current
>> status where we support GCC 4.4 we still need hand crafted assembly for atomics.
>> Also, the GCC atomics are only support for 4.7+ afaik [1] and C11 for 4.9.
> Note that I'm not arguing for using C11 at this point, but using atomics
> that are very much like the atomic_*_explicit operations.  The function
> names can be different, but functionality, parameter ordering, etc.,
> would be similar.
>
>> We can make what Joseph suggested and move minimum compiler required for GCC 4.7.
>>
>> But if we intend to actually add mininum GCC 4.7, I can work on changing all the
>> atomics to GCC builtins.
> Another option would be to only enable the new atomics that we don't
> have right now (e.g., memory_order_relaxed) when building with 4.7 or
> higher, and falling back to some of the custom assembly ones otherwise.
> This would mean no performance regressions but the optimizations would
> only be effective for 4.7 or higher.  memory_order_relaxed loads/stores
> could be an exception, because I think we need to make them explicity,
> and we don't want to slow down code that now uses plain nonatomic
> loads/stores in cases where it should actually be memory_order_relaxed
> accesses.
>
>
I don't like this conditional enabling because it would require more logic
to check the compiler version and generate different code depending on it.
If the idea is to simplify and clean up the code, I think it is better to
either add custom arch-specific code that builds regardless of the compiler,
or use the compiler builtins and get rid of the hand-crafted assembly.
I see no point in enabling both.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 11:10           ` Adhemerval Zanella
@ 2014-08-29 13:28             ` Torvald Riegel
  2014-08-29 13:32               ` Adhemerval Zanella
  0 siblings, 1 reply; 20+ messages in thread
From: Torvald Riegel @ 2014-08-29 13:28 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

On Fri, 2014-08-29 at 08:10 -0300, Adhemerval Zanella wrote:
> On 29-08-2014 06:55, Torvald Riegel wrote:
> > On Thu, 2014-08-28 at 13:27 -0300, Adhemerval Zanella wrote:
> >> On 28-08-2014 12:37, Torvald Riegel wrote:
> >>> On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
> >>>> On 22-08-2014 11:00, Richard Henderson wrote:
> >>>>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Following comments from my first patch to optimize single-thread internal
> >>>>>> glibc locking/atomics [1], I have changed the implementation to use now
> >>>>>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
> >>>>>> the primitives are still signal-safe (although not thread-safe), so if future
> >>>>>> malloc implementation is changed to be async-safe, it won't require to a
> >>>>>> adjust powerpc atomics.
> >>>>>>
> >>>>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> >>>>>> (which powerpc is currently using) and implemented the atomics with acquire
> >>>>>> semantics.  The new implementation also is simpler.
> >>>>>>
> >>>>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> >>>>>> calls and an performance increase of 7-8% in 483.xalancbmk from
> >>>>>> speccpu2006 (number from an POWER8 machine).
> >>>>>>
> >>>>>> Checked on powerpc64, powerpc32 and powerpc64le.
> >>>>> Wow, that's a lot of boiler-plate.
> >>>>>
> >>>>> When development opens again, can we simplify all of these atomic operations by
> >>>>> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
> >>>>> and later, fall back to the __sync builtins for earlier gcc, and completely
> >>>>> drop support for truly ancient compilers that support neither.
> >>>>>
> >>>>> As a bonus we'd get to unify the implementations across all of the targets.
> >>>>>
> >>>>>
> >>>>> r~
> >>>>>
> >>>> I also agree we should move to more a unified implementation (in fact, I
> >>>> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
> >>>> I really don't want to either wait or reimplement all the custom atomic to push
> >>>> this optimization... 
> >>> I believe that, unless the caveat Joseph mentioned actually applies to
> >>> the archs you're concerned about, using the compiler builtins and doing
> >>> the unification would simplify your patch considerably.  It would also
> >>> avoid having to iterate over the changes of your patch again once we do
> >>> all of the unification.
> >>>
> >>>> I think such change will require a lot of iteration and testing, which is not
> >>>> the intend of this patch.
> >>> If we move to C11-like atomics, then those will certainly go in
> >>> incrementally, and exist in parallel with the current atomics for a
> >>> while (until we reviewed all existing code using the old atomics and
> >>> moved it over to the new atomics).
> >>>
> >>> If we use the compiler builtins, we'd also have to test less because we
> >>> have no additional custom atomics implementation we need to maintain.
> >>>
> >> I do agree moving to a compiler specific is the best way, although for current
> >> status where we support GCC 4.4 we still need hand crafted assembly for atomics.
> >> Also, the GCC atomics are only support for 4.7+ afaik [1] and C11 for 4.9.
> > Note that I'm not arguing for using C11 at this point, but using atomics
> > that are very much like the atomic_*_explicit operations.  The function
> > names can be different, but functionality, parameter ordering, etc.,
> > would be similar.
> >
> >> We can make what Joseph suggested and move minimum compiler required for GCC 4.7.
> >>
> >> But if we intend to actually add mininum GCC 4.7, I can work on changing all the
> >> atomics to GCC builtins.
> > Another option would be to only enable the new atomics that we don't
> > have right now (e.g., memory_order_relaxed) when building with 4.7 or
> > higher, and falling back to some of the custom assembly ones otherwise.
> > This would mean no performance regressions but the optimizations would
> > only be effective for 4.7 or higher.  memory_order_relaxed loads/stores
> > could be an exception, because I think we need to make them explicity,
> > and we don't want to slow down code that now uses plain nonatomic
> > loads/stores in cases where it should actually be memory_order_relaxed
> > accesses.
> >
> >
> I don't like this conditional enable because it will require more logic
> to check for compiler version and add different code generation depending
> of this. If the idea is to simplify and cleanup code I think it is better
> to either add custom arch-specific code that will build regardless of
> compiler or use compiler builtins and get rid off hand-craft assembly.
> I see no point in enabling both.

*If* we can make 4.7 a requirement, then I'm certainly all for using
just the builtins on architectures where they are available and get
inlined.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 13:28             ` Torvald Riegel
@ 2014-08-29 13:32               ` Adhemerval Zanella
  2014-08-29 16:33                 ` Chris Metcalf
                                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Adhemerval Zanella @ 2014-08-29 13:32 UTC (permalink / raw)
  To: libc-alpha

On 29-08-2014 10:27, Torvald Riegel wrote:
> On Fri, 2014-08-29 at 08:10 -0300, Adhemerval Zanella wrote:
>> On 29-08-2014 06:55, Torvald Riegel wrote:
>>> On Thu, 2014-08-28 at 13:27 -0300, Adhemerval Zanella wrote:
>>>> On 28-08-2014 12:37, Torvald Riegel wrote:
>>>>> On Thu, 2014-08-28 at 10:57 -0300, Adhemerval Zanella wrote:
>>>>>> On 22-08-2014 11:00, Richard Henderson wrote:
>>>>>>> On 08/22/2014 06:50 AM, Adhemerval Zanella wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Following comments from my first patch to optimize single-thread internal
>>>>>>>> glibc locking/atomics [1], I have changed the implementation to use now
>>>>>>>> relaxed atomics instead.  Addresing the concerns raised in last discussion, 
>>>>>>>> the primitives are still signal-safe (although not thread-safe), so if future
>>>>>>>> malloc implementation is changed to be async-safe, it won't require to a
>>>>>>>> adjust powerpc atomics.
>>>>>>>>
>>>>>>>> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
>>>>>>>> (which powerpc is currently using) and implemented the atomics with acquire
>>>>>>>> semantics.  The new implementation also is simpler.
>>>>>>>>
>>>>>>>> On synthetic benchmarks it shows an improvement of 5-10% for malloc
>>>>>>>> calls and an performance increase of 7-8% in 483.xalancbmk from
>>>>>>>> speccpu2006 (number from an POWER8 machine).
>>>>>>>>
>>>>>>>> Checked on powerpc64, powerpc32 and powerpc64le.
>>>>>>> Wow, that's a lot of boiler-plate.
>>>>>>>
>>>>>>> When development opens again, can we simplify all of these atomic operations by
>>>>>>> assuming compiler primitives?  That is, use the __atomic builtins for gcc 4.8
>>>>>>> and later, fall back to the __sync builtins for earlier gcc, and completely
>>>>>>> drop support for truly ancient compilers that support neither.
>>>>>>>
>>>>>>> As a bonus we'd get to unify the implementations across all of the targets.
>>>>>>>
>>>>>>>
>>>>>>> r~
>>>>>>>
>>>>>> I also agree we should move to more a unified implementation (in fact, I
>>>>>> plan to get rid of powerpc lowlevellock.h when devel opens again).  However
>>>>>> I really don't want to either wait or reimplement all the custom atomic to push
>>>>>> this optimization... 
>>>>> I believe that, unless the caveat Joseph mentioned actually applies to
>>>>> the archs you're concerned about, using the compiler builtins and doing
>>>>> the unification would simplify your patch considerably.  It would also
>>>>> avoid having to iterate over the changes of your patch again once we do
>>>>> all of the unification.
>>>>>
>>>>>> I think such change will require a lot of iteration and testing, which is not
>>>>>> the intend of this patch.
>>>>> If we move to C11-like atomics, then those will certainly go in
>>>>> incrementally, and exist in parallel with the current atomics for a
>>>>> while (until we reviewed all existing code using the old atomics and
>>>>> moved it over to the new atomics).
>>>>>
>>>>> If we use the compiler builtins, we'd also have to test less because we
>>>>> have no additional custom atomics implementation we need to maintain.
>>>>>
>>>> I do agree moving to a compiler specific is the best way, although for current
>>>> status where we support GCC 4.4 we still need hand crafted assembly for atomics.
>>>> Also, the GCC atomics are only support for 4.7+ afaik [1] and C11 for 4.9.
>>> Note that I'm not arguing for using C11 at this point, but using atomics
>>> that are very much like the atomic_*_explicit operations.  The function
>>> names can be different, but functionality, parameter ordering, etc.,
>>> would be similar.
>>>
>>>> We can make what Joseph suggested and move minimum compiler required for GCC 4.7.
>>>>
>>>> But if we intend to actually add mininum GCC 4.7, I can work on changing all the
>>>> atomics to GCC builtins.
>>> Another option would be to only enable the new atomics that we don't
>>> have right now (e.g., memory_order_relaxed) when building with 4.7 or
>>> higher, and falling back to some of the custom assembly ones otherwise.
>>> This would mean no performance regressions but the optimizations would
>>> only be effective for 4.7 or higher.  memory_order_relaxed loads/stores
>>> could be an exception, because I think we need to make them explicity,
>>> and we don't want to slow down code that now uses plain nonatomic
>>> loads/stores in cases where it should actually be memory_order_relaxed
>>> accesses.
>>>
>>>
>> I don't like this conditional enable because it will require more logic
>> to check for compiler version and add different code generation depending
>> of this. If the idea is to simplify and cleanup code I think it is better
>> to either add custom arch-specific code that will build regardless of
>> compiler or use compiler builtins and get rid off hand-craft assembly.
>> I see no point in enabling both.
> *If* we can make 4.7 a requirement, then I'm certainly all for using
> just the builtins on architectures where they are available and get
> inlined.
>
>
And that is exactly what I am asking: are we willing to raise the GCC requirement
for 2.21?  If we are, I see no issue in reworking the patch to use GCC builtins.
If we aren't, I see no point in trying to adjust this patch to use the GCC
builtins conditionally for GCC 4.7+.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 13:32               ` Adhemerval Zanella
@ 2014-08-29 16:33                 ` Chris Metcalf
  2014-10-23 17:12                   ` Carlos O'Donell
  2014-08-29 21:52                 ` Roland McGrath
  2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
  2 siblings, 1 reply; 20+ messages in thread
From: Chris Metcalf @ 2014-08-29 16:33 UTC (permalink / raw)
  To: Adhemerval Zanella, libc-alpha

On 8/29/2014 9:32 AM, Adhemerval Zanella wrote:
> On 29-08-2014 10:27, Torvald Riegel wrote:
>> *If* we can make 4.7 a requirement, then I'm certainly all for using
>> just the builtins on architectures where they are available and get
>> inlined.
>>
> And that is exactly what I am asking: are we willing to raise GCC requirements
> for 2.21?  If we will, I see no issue on rework the patch to use GCC builtins.
> If we won't, I see to no point in trying to adjust this patch to use conditionally
> the GCC builtins for GCC 4.7+.

It seems a little too early still, in my opinion.  For example, RHEL 7 was
only officially released a couple of months ago, and RHEL 6 was gcc 4.4, so
perhaps it's worth adding some mechanism to enable it for testing or
whatever (e.g. configure with --use-atomic-builtins) in the meanwhile.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 13:32               ` Adhemerval Zanella
  2014-08-29 16:33                 ` Chris Metcalf
@ 2014-08-29 21:52                 ` Roland McGrath
  2014-09-05 17:51                   ` Torvald Riegel
  2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
  2 siblings, 1 reply; 20+ messages in thread
From: Roland McGrath @ 2014-08-29 21:52 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: libc-alpha

I don't see a problem with conditional code based on available compiler
version--that is, conditional code within the atomic.h implementations
themselves.  What I take to be the main thrust of Torvald's suggestion is
that we get ASAP to a situation where all the code outside of actual
atomic.h implementations themselves is using a new atomic.h API that has
semantics and signatures identical to, and names similar to, the
C11-supporting builtins.  That enables us to clean up, fix, and optimize
all the actual uses of atomics throughout the codebase, without waiting
until we can rely on requiring newer compiler versions throughout.

I tend to think it is still too early to mandate 4.7 or later.  I think we
can in the 2.21 cycle move up to 4.6 as the minimum (and 4.8 or
4.9 as the recommended version).  But we should discuss that further.
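
For illustration, a minimal sketch of the kind of wrapper described above
(the USE_ATOMIC_COMPILER_BUILTINS switch and the name atomic_exchange_acquire
are used here only as placeholders for whatever glibc ends up choosing):
callers use one C11-like operation, and atomic.h itself decides, per compiler
and architecture, how it is implemented.

#if USE_ATOMIC_COMPILER_BUILTINS
/* New enough compiler: map straight to the builtin.  */
# define atomic_exchange_acquire(mem, val) \
    __atomic_exchange_n ((mem), (val), __ATOMIC_ACQUIRE)
#else
/* Otherwise reuse the existing hand-written per-arch macro.  */
# define atomic_exchange_acquire(mem, val) \
    atomic_exchange_acq ((mem), (val))
#endif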

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 21:52                 ` Roland McGrath
@ 2014-09-05 17:51                   ` Torvald Riegel
  0 siblings, 0 replies; 20+ messages in thread
From: Torvald Riegel @ 2014-09-05 17:51 UTC (permalink / raw)
  To: Roland McGrath; +Cc: Adhemerval Zanella, libc-alpha

On Fri, 2014-08-29 at 14:52 -0700, Roland McGrath wrote:
> I don't see a problem with conditional code based on available compiler
> version--that is, conditional code within the atomic.h implementations
> themselves.  What I take to be the main thrust of Torvald's suggestion is
> that we get ASAP to a situation where all the code outside of actual
> atomic.h implementations themselves is using a new atomic.h API that has
> semantics and signatures identical to, and names similar to, the
> C11-supporting builtins.

Yes.  We might pick names that are similar to the C11 functions too, and
that might even be easier to learn for people.

> That enables us to clean up, fix, and optimize
> all the actual uses of atomics throughout the codebase, without waiting
> until we can rely on requiring newer compiler versions throughout.

Yes.  It would also separate the use of a certain weak atomic op (e.g.,
memory_order_relaxed) from whether a particular architecture actually
implements it efficiently, and how.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Future atomic operation cleanup
  2014-08-29 16:33                 ` Chris Metcalf
@ 2014-10-23 17:12                   ` Carlos O'Donell
  0 siblings, 0 replies; 20+ messages in thread
From: Carlos O'Donell @ 2014-10-23 17:12 UTC (permalink / raw)
  To: libc-alpha

On 08/29/2014 12:33 PM, Chris Metcalf wrote:
> On 8/29/2014 9:32 AM, Adhemerval Zanella wrote:
>> On 29-08-2014 10:27, Torvald Riegel wrote:
>>> *If* we can make 4.7 a requirement, then I'm certainly all for using
>>> just the builtins on architectures where they are available and get
>>> inlined.
>>>
>> And that is exactly what I am asking: are we willing to raise GCC requirements
>> for 2.21?  If we will, I see no issue on rework the patch to use GCC builtins.
>> If we won't, I see to no point in trying to adjust this patch to use conditionally
>> the GCC builtins for GCC 4.7+.
> 
> It seems a little too early still, in my opinion.  For example, RHEL 7 was
> only officially released a couple of months ago, and RHEL 6 was gcc 4.4, so
> Perhaps it's worth adding some mechanism to enable it for testing or
> whatever (e.g. configure with --use-atomic-builtins) in the meanwhile.

Note that RHEL 6 and RHEL 7 have a Developer Toolset (dts) in their software
collections (scl) that allows you to use gcc 4.8 and soon gcc 4.9 on those
systems to build applications.

Shipping a compiler that is decoupled from the system compiler, but still
uses the default system runtimes, is becoming the norm for RHEL.

Cheers,
Carlos.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v3] PowerPC: libc single-thread lock optimization
  2014-08-29 13:32               ` Adhemerval Zanella
  2014-08-29 16:33                 ` Chris Metcalf
  2014-08-29 21:52                 ` Roland McGrath
@ 2016-03-11 18:36                 ` Tulio Magno Quites Machado Filho
  2016-03-11 18:39                   ` Florian Weimer
                                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2016-03-11 18:36 UTC (permalink / raw)
  To: libc-alpha; +Cc: adhemerval.zanella, munroesj

I continued the work started by Adhemerval.  The discussion around version 2
of this patch is available at http://patchwork.sourceware.org/patch/2516/

Nowadays, we already require GCC 4.7, so we can safely rely on compiler
built-ins for most of our atomic primitives.

Changes since v2:
 - Updated ChangeLog and commit message.
 - Replaced the following atomic primitives by compiler built-ins:
   exchange*, and* and or*.

---8<---

Add relaxed atomics as a lock optimization.  Addressing the concerns
raised in previous discussions, the primitives are still signal-safe
(although not thread-safe), so if a future implementation relying on
this code (e.g. malloc) is changed to be async-safe, it won't require
adjusting the powerpc atomics.

For catomic_and and catomic_or I followed the definition at 'include/atomic.h'
(which powerpc is currently using) and implemented the atomics with acquire
semantics.  The new implementation is based on compiler built-ins.

On synthetic benchmarks it shows an improvement of 5-10% for malloc
calls and a performance increase of 7-8% in 483.xalancbmk from
speccpu2006 (numbers from a POWER8 machine).

Checked on powerpc32, powerpc64 and powerpc64le.

2016-03-11  Adhemerval Zanella Netto  <azanella@linux.vnet.ibm.com>
            Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>

	* malloc/malloc.c (malloc_consolidate): Replace 0 by NULL in
	order to match the type of p when calling atomic_exchange_acq().
	* sysdeps/powerpc/atomic-machine.h
	(__arch_atomic_exchange_32_acq): Removed.
	(__arch_atomic_exchange_32_rel): Likewise.
	(__arch_compare_and_exchange_val_32_relaxed): New macro: atomic compare
	and exchange with relaxed semantic.
	(atomic_compare_and_exchange_val_relaxed): Likewise.
	(__atomic_is_single_thread): New macro: check if program is
	single-thread.
	(atomic_compare_and_exchange_val_acq): Add relaxed operation for
	single-thread.
	(atomic_compare_and_exchange_val_rel): Likewise.
	(atomic_exchange_acq): Likewise.
	(atomic_exchange_rel): Likewise.
	(catomic_and): Add relaxed operation and use compiler built-ins.
	(catomic_or): Likewise.
	(atomic_exchange_acq): Modify to use compiler built-ins.
	(atomic_exchange_rel): Likewise.
	* sysdeps/powerpc/powerpc32/atomic-machine.h
	(__arch_compare_and_exchange_val_64_relaxed): New macro: add empty
	implementation.
	(__arch_atomic_exchange_64_relaxed): Likewise.
	* sysdeps/powerpc/powerpc64/atomic-machine.h
	(__arch_compare_and_exchange_val_64_relaxed): New macro: atomic compare
	and exchange with relaxed semantics.
	(__arch_atomic_exchange_64_acq): Removed.
	(__arch_atomic_exchange_64_rel): Removed.
---
 malloc/malloc.c                            |   2 +-
 sysdeps/powerpc/atomic-machine.h           | 128 ++++++++++++++++++-----------
 sysdeps/powerpc/powerpc32/atomic-machine.h |   6 ++
 sysdeps/powerpc/powerpc64/atomic-machine.h |  38 ++++-----
 4 files changed, 103 insertions(+), 71 deletions(-)

diff --git a/malloc/malloc.c b/malloc/malloc.c
index b8a43bf..1eed794 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4150,7 +4150,7 @@ static void malloc_consolidate(mstate av)
     maxfb = &fastbin (av, NFASTBINS - 1);
     fb = &fastbin (av, 0);
     do {
-      p = atomic_exchange_acq (fb, 0);
+      p = atomic_exchange_acq (fb, NULL);
       if (p != 0) {
 	do {
 	  check_inuse_chunk(av, p);
diff --git a/sysdeps/powerpc/atomic-machine.h b/sysdeps/powerpc/atomic-machine.h
index 8b0e1e7..7e6c699 100644
--- a/sysdeps/powerpc/atomic-machine.h
+++ b/sysdeps/powerpc/atomic-machine.h
@@ -27,6 +27,7 @@
  */
 
 #include <stdint.h>
+#include <tls.h>
 
 typedef int32_t atomic32_t;
 typedef uint32_t uatomic32_t;
@@ -78,6 +79,9 @@ typedef uintmax_t uatomic_max_t;
 
 #define atomic_full_barrier()	__asm ("sync" ::: "memory")
 
+/* We can't convert __arch_compare_and_exchange_val_* to compiler built-ins
+   yet because the built-ins expect a pointer to the expected value while
+   our current implementation pass the value directly.  */
 #define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval)	      \
   ({									      \
       __typeof (*(mem)) __tmp;						      \
@@ -112,33 +116,24 @@ typedef uintmax_t uatomic_max_t;
       __tmp;								      \
   })
 
-#define __arch_atomic_exchange_32_acq(mem, value)			      \
+#define __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval)	      \
   ({									      \
-    __typeof (*mem) __val;						      \
-    __asm __volatile (							      \
-		      "1:	lwarx	%0,0,%2" MUTEX_HINT_ACQ "\n"	      \
-		      "		stwcx.	%3,0,%2\n"			      \
-		      "		bne-	1b\n"				      \
-		      "   " __ARCH_ACQ_INSTR				      \
-		      : "=&r" (__val), "=m" (*mem)			      \
-		      : "b" (mem), "r" (value), "m" (*mem)		      \
-		      : "cr0", "memory");				      \
-    __val;								      \
-  })
-
-#define __arch_atomic_exchange_32_rel(mem, value) \
-  ({									      \
-    __typeof (*mem) __val;						      \
-    __asm __volatile (__ARCH_REL_INSTR "\n"				      \
-		      "1:	lwarx	%0,0,%2" MUTEX_HINT_REL "\n"	      \
-		      "		stwcx.	%3,0,%2\n"			      \
-		      "		bne-	1b"				      \
-		      : "=&r" (__val), "=m" (*mem)			      \
-		      : "b" (mem), "r" (value), "m" (*mem)		      \
-		      : "cr0", "memory");				      \
-    __val;								      \
+      __typeof (*(mem)) __tmp;						      \
+      __typeof (mem)  __memp = (mem);					      \
+      __asm __volatile (						      \
+		        "1:	lwarx	%0,0,%1\n"			      \
+		        "	cmpw	%0,%2\n"			      \
+		        "	bne	2f\n"				      \
+		        "	stwcx.	%3,0,%1\n"			      \
+		        "	bne-	1b\n"				      \
+		        "2:	"					      \
+		        : "=&r" (__tmp)					      \
+		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
+		        : "cr0", "memory");				      \
+      __tmp;								      \
   })
 
+/* The following atomic primitives aren't available as compiler built-ins.  */
 #define __arch_atomic_exchange_and_add_32(mem, value) \
   ({									      \
     __typeof (*mem) __val, __tmp;					      \
@@ -221,10 +216,30 @@ typedef uintmax_t uatomic_max_t;
      __val;								      \
   })
 
+#define __atomic_is_single_thread				\
+  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
+
+#define atomic_compare_and_exchange_val_relaxed(mem, newval, oldval)	      \
+  ({									      \
+    __typeof (*(mem)) __result;						      \
+    if (sizeof (*mem) == 4)						      \
+      __result = __arch_compare_and_exchange_val_32_relaxed(mem, newval,      \
+							    oldval);	      \
+    else if (sizeof (*mem) == 8)					      \
+      __result = __arch_compare_and_exchange_val_64_relaxed(mem, newval,      \
+							    oldval);	      \
+    else								      \
+       abort ();							      \
+    __result;								      \
+  })
+
 #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
   ({									      \
     __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_compare_and_exchange_val_relaxed (mem, newval,	      \
+							  oldval);	      \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_compare_and_exchange_val_32_acq(mem, newval, oldval); \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_compare_and_exchange_val_64_acq(mem, newval, oldval); \
@@ -236,7 +251,10 @@ typedef uintmax_t uatomic_max_t;
 #define atomic_compare_and_exchange_val_rel(mem, newval, oldval) \
   ({									      \
     __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
+    if (__atomic_is_single_thread)					      \
+      __result = atomic_compare_and_exchange_val_relaxed (mem, newval,	      \
+							  oldval);	      \
+    else if (sizeof (*mem) == 4)					      \
       __result = __arch_compare_and_exchange_val_32_rel(mem, newval, oldval); \
     else if (sizeof (*mem) == 8)					      \
       __result = __arch_compare_and_exchange_val_64_rel(mem, newval, oldval); \
@@ -245,28 +263,24 @@ typedef uintmax_t uatomic_max_t;
     __result;								      \
   })
 
-#define atomic_exchange_acq(mem, value) \
-  ({									      \
-    __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
-      __result = __arch_atomic_exchange_32_acq (mem, value);		      \
-    else if (sizeof (*mem) == 8)					      \
-      __result = __arch_atomic_exchange_64_acq (mem, value);		      \
-    else 								      \
-       abort ();							      \
-    __result;								      \
+#define atomic_exchange_acq(mem, value)					\
+  ({									\
+    __typeof (value) __ret;						\
+    if (__atomic_is_single_thread)					\
+      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELAXED);	\
+    else								\
+      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_ACQUIRE);	\
+    __ret;								\
   })
 
-#define atomic_exchange_rel(mem, value) \
-  ({									      \
-    __typeof (*(mem)) __result;						      \
-    if (sizeof (*mem) == 4)						      \
-      __result = __arch_atomic_exchange_32_rel (mem, value);		      \
-    else if (sizeof (*mem) == 8)					      \
-      __result = __arch_atomic_exchange_64_rel (mem, value);		      \
-    else 								      \
-       abort ();							      \
-    __result;								      \
+#define atomic_exchange_rel(mem, value)					\
+  ({									\
+    __typeof (value) __ret;						\
+    if (__atomic_is_single_thread)					\
+      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELAXED);	\
+    else								\
+      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELEASE);	\
+    __ret;								\
   })
 
 #define atomic_exchange_and_add(mem, value) \
@@ -280,6 +294,7 @@ typedef uintmax_t uatomic_max_t;
        abort ();							      \
     __result;								      \
   })
+
 #define atomic_exchange_and_add_acq(mem, value) \
   ({									      \
     __typeof (*(mem)) __result;						      \
@@ -291,6 +306,7 @@ typedef uintmax_t uatomic_max_t;
        abort ();							      \
     __result;								      \
   })
+
 #define atomic_exchange_and_add_rel(mem, value) \
   ({									      \
     __typeof (*(mem)) __result;						      \
@@ -343,3 +359,23 @@ typedef uintmax_t uatomic_max_t;
        abort ();							      \
     __result;								      \
   })
+
+#define catomic_and(mem, arg)						\
+  ({									\
+    __typeof (arg) __ret;						\
+    if (__atomic_is_single_thread)					\
+      __ret = __atomic_fetch_and((mem), arg, __ATOMIC_RELAXED);		\
+    else								\
+      __ret = __atomic_fetch_and((mem), arg, __ATOMIC_ACQUIRE);		\
+    __ret;								\
+  })
+
+#define catomic_or(mem, arg)						\
+  ({									\
+    __typeof (arg) __ret;						\
+    if (__atomic_is_single_thread)					\
+      __ret = __atomic_fetch_or((mem), arg, __ATOMIC_RELAXED);		\
+    else								\
+      __ret = __atomic_fetch_or((mem), arg, __ATOMIC_ACQUIRE);		\
+    __ret;								\
+  })
diff --git a/sysdeps/powerpc/powerpc32/atomic-machine.h b/sysdeps/powerpc/powerpc32/atomic-machine.h
index 1d407b3..c733d43 100644
--- a/sysdeps/powerpc/powerpc32/atomic-machine.h
+++ b/sysdeps/powerpc/powerpc32/atomic-machine.h
@@ -86,6 +86,9 @@
 #define __arch_compare_and_exchange_bool_64_rel(mem, newval, oldval) \
   (abort (), 0)
 
+#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
+  (abort (), (__typeof (*mem)) 0)
+
 #define __arch_compare_and_exchange_val_64_rel(mem, newval, oldval) \
   (abort (), (__typeof (*mem)) 0)
 
@@ -95,6 +98,9 @@
 #define __arch_atomic_exchange_64_rel(mem, value) \
     ({ abort (); (*mem) = (value); })
 
+#define __arch_atomic_exchange_64_relaxed(mem, value) \
+    ({ abort (); (*mem) = (value); })
+
 #define __arch_atomic_exchange_and_add_64(mem, value) \
     ({ abort (); (*mem) = (value); })
 
diff --git a/sysdeps/powerpc/powerpc64/atomic-machine.h b/sysdeps/powerpc/powerpc64/atomic-machine.h
index 751487a..515572e 100644
--- a/sysdeps/powerpc/powerpc64/atomic-machine.h
+++ b/sysdeps/powerpc/powerpc64/atomic-machine.h
@@ -146,32 +146,22 @@
       __tmp;								      \
   })
 
-#define __arch_atomic_exchange_64_acq(mem, value) \
-    ({									      \
-      __typeof (*mem) __val;						      \
-      __asm __volatile (__ARCH_REL_INSTR "\n"				      \
-			"1:	ldarx	%0,0,%2" MUTEX_HINT_ACQ "\n"	      \
-			"	stdcx.	%3,0,%2\n"			      \
+#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval)	      \
+  ({									      \
+      __typeof (*(mem)) __tmp;						      \
+      __typeof (mem)  __memp = (mem);					      \
+      __asm __volatile ("\n"						      \
+			"1:	ldarx	%0,0,%1\n"			      \
+			"	cmpd	%0,%2\n"			      \
+			"	bne	2f\n"				      \
+			"	stdcx.	%3,0,%1\n"			      \
 			"	bne-	1b\n"				      \
-		  " " __ARCH_ACQ_INSTR					      \
-			: "=&r" (__val), "=m" (*mem)			      \
-			: "b" (mem), "r" (value), "m" (*mem)		      \
+			"2:	"					      \
+			: "=&r" (__tmp)					      \
+			: "b" (__memp), "r" (oldval), "r" (newval)	      \
 			: "cr0", "memory");				      \
-      __val;								      \
-    })
-
-#define __arch_atomic_exchange_64_rel(mem, value) \
-    ({									      \
-      __typeof (*mem) __val;						      \
-      __asm __volatile (__ARCH_REL_INSTR "\n"				      \
-			"1:	ldarx	%0,0,%2" MUTEX_HINT_REL "\n"	      \
-			"	stdcx.	%3,0,%2\n"			      \
-			"	bne-	1b"				      \
-			: "=&r" (__val), "=m" (*mem)			      \
-			: "b" (mem), "r" (value), "m" (*mem)		      \
-			: "cr0", "memory");				      \
-      __val;								      \
-    })
+      __tmp;								      \
+  })
 
 #define __arch_atomic_exchange_and_add_64(mem, value) \
     ({									      \
-- 
2.1.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3] PowerPC: libc single-thread lock optimization
  2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
@ 2016-03-11 18:39                   ` Florian Weimer
  2016-03-11 21:12                     ` Tulio Magno Quites Machado Filho
  2016-03-28 17:36                   ` [PING][PATCH " Tulio Magno Quites Machado Filho
  2016-04-07 11:24                   ` [PATCH " Torvald Riegel
  2 siblings, 1 reply; 20+ messages in thread
From: Florian Weimer @ 2016-03-11 18:39 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho; +Cc: libc-alpha, adhemerval.zanella, munroesj

On 03/11/2016 07:35 PM, Tulio Magno Quites Machado Filho wrote:
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index b8a43bf..1eed794 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4150,7 +4150,7 @@ static void malloc_consolidate(mstate av)
>      maxfb = &fastbin (av, NFASTBINS - 1);
>      fb = &fastbin (av, 0);
>      do {
> -      p = atomic_exchange_acq (fb, 0);
> +      p = atomic_exchange_acq (fb, NULL);
>        if (p != 0) {
>  	do {
>  	  check_inuse_chunk(av, p);

This should go in immediately and separately; it is independent of the
rest of the patch.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3] PowerPC: libc single-thread lock optimization
  2016-03-11 18:39                   ` Florian Weimer
@ 2016-03-11 21:12                     ` Tulio Magno Quites Machado Filho
  0 siblings, 0 replies; 20+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2016-03-11 21:12 UTC (permalink / raw)
  To: Florian Weimer; +Cc: libc-alpha, adhemerval.zanella, munroesj

Florian Weimer <fweimer@redhat.com> writes:

> On 03/11/2016 07:35 PM, Tulio Magno Quites Machado Filho wrote:
>> diff --git a/malloc/malloc.c b/malloc/malloc.c
>> index b8a43bf..1eed794 100644
>> --- a/malloc/malloc.c
>> +++ b/malloc/malloc.c
>> @@ -4150,7 +4150,7 @@ static void malloc_consolidate(mstate av)
>>      maxfb = &fastbin (av, NFASTBINS - 1);
>>      fb = &fastbin (av, 0);
>>      do {
>> -      p = atomic_exchange_acq (fb, 0);
>> +      p = atomic_exchange_acq (fb, NULL);
>>        if (p != 0) {
>>  	do {
>>  	  check_inuse_chunk(av, p);
>
> This should go in immediately and separately, it is independent of  the
> rest of the patch.

Makes sense.
I pushed this hunk as commit b43f552a.

Thanks!

-- 
Tulio Magno

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PING][PATCH v3] PowerPC: libc single-thread lock optimization
  2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
  2016-03-11 18:39                   ` Florian Weimer
@ 2016-03-28 17:36                   ` Tulio Magno Quites Machado Filho
  2016-04-07 11:24                   ` [PATCH " Torvald Riegel
  2 siblings, 0 replies; 20+ messages in thread
From: Tulio Magno Quites Machado Filho @ 2016-03-28 17:36 UTC (permalink / raw)
  To: libc-alpha; +Cc: adhemerval.zanella, munroesj


Ping!

Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com> writes:

> I continued the work started by Adhemerval.  The discussion around version 2
> of this patch is available at http://patchwork.sourceware.org/patch/2516/
>
> Nowadays, we already require GCC 4.7, so we can safely rely on compiler
> built-ins for most of our atomic primitives.
>
> Changes since v2:
>  - Updated ChangeLog and commit message.
>  - Replaced the following atomic primitives by compiler built-ins:
>    exchange*, and* and or*.
>
> ---8<---
>
> Add relaxed atomics as a lock optimization.  Addressing the concerns
> raised in previous discussions, the primitives are still signal-safe
> (although not thread-safe), so if future implementations relying on
> this code (e.g. malloc) is changed to be async-safe, it won't require to
> adjust powerpc atomics.
>
> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> (which powerpc is currently using) and implemented the atomics with acquire
> semantics.  The new implementation is based on compiler built-ins.
>
> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> calls and a performance increase of 7-8% in 483.xalancbmk from
> speccpu2006 (number from a POWER8 machine).
>
> Checked on powerpc32, powerpc64 and powerpc64le.
>
> 2016-03-11  Adhemerval Zanella Netto  <azanella@linux.vnet.ibm.com>
>             Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>
>
> 	* malloc/malloc.c (malloc_consolidate): Replace 0 by NULL in
> 	order to match the type of p when calling atomic_exchange_acq().
> 	* sysdeps/powerpc/atomic-machine.h
> 	(__arch_atomic_exchange_32_acq): Removed.
> 	(__arch_atomic_exchange_32_rel): Likewise
> 	(__arch_compare_and_exchange_val_32_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantic.
> 	(atomic_compare_and_exchange_val_relaxed): Likewise.
> 	(__atomic_is_single_thread): New macro: check if program is
> 	single-thread.
> 	(atomic_compare_and_exchange_val_acq): Add relaxed operation for
> 	single-thread.
> 	(atomic_compare_and_exchange_val_rel): Likewise.
> 	(atomic_exchange_acq): Likewise.
> 	(atomic_exchange_rel): Likewise.
> 	(catomic_and): Add relaxed operation and use compiler built-ins.
> 	(catomic_or): Likewise.
> 	(atomic_exchange_acq): Modify to use compiler built-ins.
> 	(atomic_exchange_rel): Likewise.
> 	* sysdeps/powerpc/powerpc32/atomic-machine.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: add empty
> 	implementation.
> 	(__arch_atomic_exchange_64_relaxed): Likewise.
> 	* sysdeps/powerpc/powerpc64/atomic-machine.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantics.
> 	(__arch_atomic_exchange_64_acq): Removed.
> 	(__arch_atomic_exchange_64_rel): Removed.

-- 
Tulio Magno

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v3] PowerPC: libc single-thread lock optimization
  2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
  2016-03-11 18:39                   ` Florian Weimer
  2016-03-28 17:36                   ` [PING][PATCH " Tulio Magno Quites Machado Filho
@ 2016-04-07 11:24                   ` Torvald Riegel
  2 siblings, 0 replies; 20+ messages in thread
From: Torvald Riegel @ 2016-04-07 11:24 UTC (permalink / raw)
  To: Tulio Magno Quites Machado Filho; +Cc: libc-alpha, adhemerval.zanella, munroesj

On Fri, 2016-03-11 at 15:35 -0300, Tulio Magno Quites Machado Filho
wrote:
> I continued the work started by Adhemerval.  The discussion around version 2
> of this patch is available at http://patchwork.sourceware.org/patch/2516/
> 
> Nowadays, we already require GCC 4.7, so we can safely rely on compiler
> built-ins for most of our atomic primitives.

Why don't you change USE_ATOMIC_COMPILER_BUILTINS to 1 then (in both
sysdeps/powerpc{32,64}/atomic-machine.h)?
There is also more cleanup possible when the new builtins are available
(see below).

> Changes since v2:
>  - Updated ChangeLog and commit message.
>  - Replaced the following atomic primitives by compiler built-ins:
>    exchange*, and* and or*.
> 
> ---8<---
> 
> Add relaxed atomics as a lock optimization.  Addressing the concerns
> raised in previous discussions, the primitives are still signal-safe
> (although not thread-safe),

I can't follow you.  Why are atomics not thread-safe?

> so if future implementations relying on
> this code (e.g. malloc) is changed to be async-safe, it won't require to
> adjust powerpc atomics.
> 
> For catomic_and and catomic_or I follow the definition at 'include/atomic.h'
> (which powerpc is currently using) and implemented the atomics with acquire
> semantics.  The new implementation is based on compiler built-ins.

The catomic* variants should be phased out, IMO.  I think it's still
okay to add them for power, but I'd encourage you to rather look for
places in which those are used and try to optimize those uses for
single-threaded executions.

> On synthetic benchmarks it shows an improvement of 5-10% for malloc
> calls and a performance increase of 7-8% in 483.xalancbmk from
> speccpu2006 (number from a POWER8 machine).

What causes the performance difference in the latter case?

> Checked on powerpc32, powerpc64 and powerpc64le.
> 
> 2016-03-11  Adhemerval Zanella Netto  <azanella@linux.vnet.ibm.com>
>             Tulio Magno Quites Machado Filho  <tuliom@linux.vnet.ibm.com>
> 
> 	* malloc/malloc.c (malloc_consolidate): Replace 0 by NULL in
> 	order to match the type of p when calling atomic_exchange_acq().
> 	* sysdeps/powerpc/atomic-machine.h
> 	(__arch_atomic_exchange_32_acq): Removed.
> 	(__arch_atomic_exchange_32_rel): Likewise
> 	(__arch_compare_and_exchange_val_32_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantic.
> 	(atomic_compare_and_exchange_val_relaxed): Likewise.
> 	(__atomic_is_single_thread): New macro: check if program is
> 	single-thread.
> 	(atomic_compare_and_exchange_val_acq): Add relaxed operation for
> 	single-thread.
> 	(atomic_compare_and_exchange_val_rel): Likewise.
> 	(atomic_exchange_acq): Likewise.
> 	(atomic_exchange_rel): Likewise.
> 	(catomic_and): Add relaxed operation and use compiler built-ins.
> 	(catomic_or): Likewise.
> 	(atomic_exchange_acq): Modify to use compiler built-ins.
> 	(atomic_exchange_rel): Likewise.
> 	* sysdeps/powerpc/powerpc32/atomic-machine.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: add empty
> 	implementation.
> 	(__arch_atomic_exchange_64_relaxed): Likewise.
> 	* sysdeps/powerpc/powerpc64/atomic-machine.h
> 	(__arch_compare_and_exchange_val_64_relaxed): New macro: atomic compare
> 	and exchange with relaxed semantics.
> 	(__arch_atomic_exchange_64_acq): Removed.
> 	(__arch_atomic_exchange_64_rel): Removed.
> ---
>  malloc/malloc.c                            |   2 +-
>  sysdeps/powerpc/atomic-machine.h           | 128 ++++++++++++++++++-----------
>  sysdeps/powerpc/powerpc32/atomic-machine.h |   6 ++
>  sysdeps/powerpc/powerpc64/atomic-machine.h |  38 ++++-----
>  4 files changed, 103 insertions(+), 71 deletions(-)
> 
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index b8a43bf..1eed794 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -4150,7 +4150,7 @@ static void malloc_consolidate(mstate av)
>      maxfb = &fastbin (av, NFASTBINS - 1);
>      fb = &fastbin (av, 0);
>      do {
> -      p = atomic_exchange_acq (fb, 0);
> +      p = atomic_exchange_acq (fb, NULL);
>        if (p != 0) {
>  	do {
>  	  check_inuse_chunk(av, p);
> diff --git a/sysdeps/powerpc/atomic-machine.h b/sysdeps/powerpc/atomic-machine.h
> index 8b0e1e7..7e6c699 100644
> --- a/sysdeps/powerpc/atomic-machine.h
> +++ b/sysdeps/powerpc/atomic-machine.h
> @@ -27,6 +27,7 @@
>   */
>  
>  #include <stdint.h>
> +#include <tls.h>
>  
>  typedef int32_t atomic32_t;
>  typedef uint32_t uatomic32_t;
> @@ -78,6 +79,9 @@ typedef uintmax_t uatomic_max_t;
>  
>  #define atomic_full_barrier()	__asm ("sync" ::: "memory")
>  
> +/* We can't convert __arch_compare_and_exchange_val_* to compiler built-ins
> +   yet because the built-ins expect a pointer to the expected value while
> +   our current implementation pass the value directly.  */

Why can't you just use a temporary variable?
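
For illustration, the temporary-variable approach hinted at above could look
roughly like this (a sketch only, not committed code): the builtin wants a
pointer to the expected value, so the macro introduces a temporary and
returns its final contents, which is the old value the value-returning
interface is supposed to hand back.

/* On success __expected still holds OLDVAL; on failure the builtin
   overwrites it with the value observed in *MEM.  */
#define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval)	\
  ({									\
    __typeof (*(mem)) __expected = (oldval);				\
    __atomic_compare_exchange_n ((mem), &__expected, (newval), 0,	\
				 __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);	\
    __expected;								\
  })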

>  #define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval)	      \
>    ({									      \
>        __typeof (*(mem)) __tmp;						      \
> @@ -112,33 +116,24 @@ typedef uintmax_t uatomic_max_t;
>        __tmp;								      \
>    })
>  
> -#define __arch_atomic_exchange_32_acq(mem, value)			      \
> +#define __arch_compare_and_exchange_val_32_relaxed(mem, newval, oldval)	      \

I don't think we need to add this, and can instead use the builtins (see
above).

>    ({									      \
> -    __typeof (*mem) __val;						      \
> -    __asm __volatile (							      \
> -		      "1:	lwarx	%0,0,%2" MUTEX_HINT_ACQ "\n"	      \
> -		      "		stwcx.	%3,0,%2\n"			      \
> -		      "		bne-	1b\n"				      \
> -		      "   " __ARCH_ACQ_INSTR				      \
> -		      : "=&r" (__val), "=m" (*mem)			      \
> -		      : "b" (mem), "r" (value), "m" (*mem)		      \
> -		      : "cr0", "memory");				      \
> -    __val;								      \
> -  })
> -
> -#define __arch_atomic_exchange_32_rel(mem, value) \
> -  ({									      \
> -    __typeof (*mem) __val;						      \
> -    __asm __volatile (__ARCH_REL_INSTR "\n"				      \
> -		      "1:	lwarx	%0,0,%2" MUTEX_HINT_REL "\n"	      \
> -		      "		stwcx.	%3,0,%2\n"			      \
> -		      "		bne-	1b"				      \
> -		      : "=&r" (__val), "=m" (*mem)			      \
> -		      : "b" (mem), "r" (value), "m" (*mem)		      \
> -		      : "cr0", "memory");				      \
> -    __val;								      \
> +      __typeof (*(mem)) __tmp;						      \
> +      __typeof (mem)  __memp = (mem);					      \
> +      __asm __volatile (						      \
> +		        "1:	lwarx	%0,0,%1\n"			      \
> +		        "	cmpw	%0,%2\n"			      \
> +		        "	bne	2f\n"				      \
> +		        "	stwcx.	%3,0,%1\n"			      \
> +		        "	bne-	1b\n"				      \
> +		        "2:	"					      \
> +		        : "=&r" (__tmp)					      \
> +		        : "b" (__memp), "r" (oldval), "r" (newval)	      \
> +		        : "cr0", "memory");				      \
> +      __tmp;								      \
>    })
>  
> +/* The following atomic primitives aren't available as compiler built-ins.  */

Isn't that simply an atomic fetch-and-add?
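
Spelled out (illustrative only; relaxed ordering is assumed here on the
grounds that the hand-written 32-bit variant carries no barrier), that
would be:

#define __arch_atomic_exchange_and_add_32(mem, value) \
  __atomic_fetch_add ((mem), (value), __ATOMIC_RELAXED)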

>  #define __arch_atomic_exchange_and_add_32(mem, value) \
>    ({									      \
>      __typeof (*mem) __val, __tmp;					      \
> @@ -221,10 +216,30 @@ typedef uintmax_t uatomic_max_t;
>       __val;								      \
>    })
>  
> +#define __atomic_is_single_thread				\
> +  (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> +
> +#define atomic_compare_and_exchange_val_relaxed(mem, newval, oldval)	      \
> +  ({									      \
> +    __typeof (*(mem)) __result;						      \
> +    if (sizeof (*mem) == 4)						      \
> +      __result = __arch_compare_and_exchange_val_32_relaxed(mem, newval,      \
> +							    oldval);	      \
> +    else if (sizeof (*mem) == 8)					      \
> +      __result = __arch_compare_and_exchange_val_64_relaxed(mem, newval,      \
> +							    oldval);	      \
> +    else								      \
> +       abort ();							      \
> +    __result;								      \
> +  })
> +
>  #define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_compare_and_exchange_val_relaxed (mem, newval,	      \
> +							  oldval);	      \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_compare_and_exchange_val_32_acq(mem, newval, oldval); \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_compare_and_exchange_val_64_acq(mem, newval, oldval); \
> @@ -236,7 +251,10 @@ typedef uintmax_t uatomic_max_t;
>  #define atomic_compare_and_exchange_val_rel(mem, newval, oldval) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> +    if (__atomic_is_single_thread)					      \
> +      __result = atomic_compare_and_exchange_val_relaxed (mem, newval,	      \
> +							  oldval);	      \
> +    else if (sizeof (*mem) == 4)					      \
>        __result = __arch_compare_and_exchange_val_32_rel(mem, newval, oldval); \
>      else if (sizeof (*mem) == 8)					      \
>        __result = __arch_compare_and_exchange_val_64_rel(mem, newval, oldval); \
> @@ -245,28 +263,24 @@ typedef uintmax_t uatomic_max_t;
>      __result;								      \
>    })

We need to be consistent about where we try to optimize for single-threaded
executions.  Currently, we do so in catomic_* and, I believe, in some pieces
of code using atomics.  The atomic_* functions, even the old ones,
should not do that.
Eventually, we should put special cases for single-threaded executions
into the code using atomics and not into the atomics (also see above,
phasing out catomic_*), because avoiding concurrent algorithms altogether
is even faster than doing something in atomics (e.g., one can avoid CAS
loops altogether if there's no other thread because the CAS will never
fail).
Another reason to do this is that the patch adds the overhead of the
single-thread check to all atomics, even in cases where it's clear that
the code will be used often in a multi-threaded setting.

Therefore, I do not agree with this change. 

Instead of this, could you please look at the places where you think we
should optimize for single-threaded executions, and suggest
optimizations for these?  This approach also allows you to distinguish
between use cases where there is no concurrency whatsoever in a
single-threaded execution, and cases that still have to be signal-safe
and thus have to consider the concurrency caused by reentrancy.
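
For illustration, the kind of call-site specialization suggested here could
look like the sketch below.  It is not code from the patch: the function name
is made up, and SINGLE_THREAD_P stands in for whatever single-thread
predicate the call site has available.

static int
exchange_value (int *mem, int newval)
{
  if (SINGLE_THREAD_P)
    {
      /* No other threads: a CAS could never fail here, so skip the loop
	 and update directly.  */
      int old = *mem;
      *mem = newval;
      return old;
    }

  int old = __atomic_load_n (mem, __ATOMIC_RELAXED);
  /* __atomic_compare_exchange_n refreshes OLD with the observed value on
     failure, so simply retry until the store succeeds.  */
  while (!__atomic_compare_exchange_n (mem, &old, newval, 0,
				       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
    ;
  return old;
}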

> -#define atomic_exchange_acq(mem, value) \
> -  ({									      \
> -    __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> -      __result = __arch_atomic_exchange_32_acq (mem, value);		      \
> -    else if (sizeof (*mem) == 8)					      \
> -      __result = __arch_atomic_exchange_64_acq (mem, value);		      \
> -    else 								      \
> -       abort ();							      \
> -    __result;								      \
> +#define atomic_exchange_acq(mem, value)					\
> +  ({									\
> +    __typeof (value) __ret;						\
> +    if (__atomic_is_single_thread)					\
> +      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELAXED);	\
> +    else								\
> +      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_ACQUIRE);	\
> +    __ret;								\

See above.  The change to using the new __atomic builtins instead of the
custom 32/64 bit variants is fine though, and could be a good cleanup
patch (please post as a separate patch).

>    })
>  
> -#define atomic_exchange_rel(mem, value) \
> -  ({									      \
> -    __typeof (*(mem)) __result;						      \
> -    if (sizeof (*mem) == 4)						      \
> -      __result = __arch_atomic_exchange_32_rel (mem, value);		      \
> -    else if (sizeof (*mem) == 8)					      \
> -      __result = __arch_atomic_exchange_64_rel (mem, value);		      \
> -    else 								      \
> -       abort ();							      \
> -    __result;								      \
> +#define atomic_exchange_rel(mem, value)					\
> +  ({									\
> +    __typeof (value) __ret;						\
> +    if (__atomic_is_single_thread)					\
> +      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELAXED);	\
> +    else								\
> +      __ret = __atomic_exchange_n ((mem), (value), __ATOMIC_RELEASE);	\
> +    __ret;								\

Likewise.

>    })
>  
>  #define atomic_exchange_and_add(mem, value) \
> @@ -280,6 +294,7 @@ typedef uintmax_t uatomic_max_t;
>         abort ();							      \
>      __result;								      \
>    })
> +
>  #define atomic_exchange_and_add_acq(mem, value) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> @@ -291,6 +306,7 @@ typedef uintmax_t uatomic_max_t;
>         abort ();							      \
>      __result;								      \
>    })
> +
>  #define atomic_exchange_and_add_rel(mem, value) \
>    ({									      \
>      __typeof (*(mem)) __result;						      \
> @@ -343,3 +359,23 @@ typedef uintmax_t uatomic_max_t;
>         abort ();							      \
>      __result;								      \
>    })
> +
> +#define catomic_and(mem, arg)						\
> +  ({									\
> +    __typeof (arg) __ret;						\
> +    if (__atomic_is_single_thread)					\
> +      __ret = __atomic_fetch_and((mem), arg, __ATOMIC_RELAXED);		\
> +    else								\
> +      __ret = __atomic_fetch_and((mem), arg, __ATOMIC_ACQUIRE);		\
> +    __ret;								\
> +  })
> +
> +#define catomic_or(mem, arg)						\
> +  ({									\
> +    __typeof (arg) __ret;						\
> +    if (__atomic_is_single_thread)					\
> +      __ret = __atomic_fetch_or((mem), arg, __ATOMIC_RELAXED);		\
> +    else								\
> +      __ret = __atomic_fetch_or((mem), arg, __ATOMIC_ACQUIRE);		\
> +    __ret;								\
> +  })

The only call sites of these appear to be in malloc.  Instead of
supporting them here, could you please investigate (or propose) a
separate patch that puts the optimization into malloc and gets rid of
catomic_and and catomic_or for all archs?
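
For illustration, a call site that today uses catomic_or (mem, mask) could
instead open-code the single-thread case along these lines (a sketch only,
not a hunk from any posted patch; the function name and the use of
SINGLE_THREAD_P are assumptions):

static inline void
set_flag_bits (unsigned int *mem, unsigned int mask)
{
  if (SINGLE_THREAD_P)
    /* Single-threaded: a plain read-modify-write is enough.  */
    *mem |= mask;
  else
    __atomic_fetch_or (mem, mask, __ATOMIC_ACQUIRE);
}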

> diff --git a/sysdeps/powerpc/powerpc32/atomic-machine.h b/sysdeps/powerpc/powerpc32/atomic-machine.h
> index 1d407b3..c733d43 100644
> --- a/sysdeps/powerpc/powerpc32/atomic-machine.h
> +++ b/sysdeps/powerpc/powerpc32/atomic-machine.h
> @@ -86,6 +86,9 @@
>  #define __arch_compare_and_exchange_bool_64_rel(mem, newval, oldval) \
>    (abort (), 0)
>  
> +#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval) \
> +  (abort (), (__typeof (*mem)) 0)
> +
>  #define __arch_compare_and_exchange_val_64_rel(mem, newval, oldval) \
>    (abort (), (__typeof (*mem)) 0)
>  
> @@ -95,6 +98,9 @@
>  #define __arch_atomic_exchange_64_rel(mem, value) \
>      ({ abort (); (*mem) = (value); })
>  
> +#define __arch_atomic_exchange_64_relaxed(mem, value) \
> +    ({ abort (); (*mem) = (value); })
> +
>  #define __arch_atomic_exchange_and_add_64(mem, value) \
>      ({ abort (); (*mem) = (value); })
>  
> diff --git a/sysdeps/powerpc/powerpc64/atomic-machine.h b/sysdeps/powerpc/powerpc64/atomic-machine.h
> index 751487a..515572e 100644
> --- a/sysdeps/powerpc/powerpc64/atomic-machine.h
> +++ b/sysdeps/powerpc/powerpc64/atomic-machine.h
> @@ -146,32 +146,22 @@
>        __tmp;								      \
>    })
>  
> -#define __arch_atomic_exchange_64_acq(mem, value) \
> -    ({									      \
> -      __typeof (*mem) __val;						      \
> -      __asm __volatile (__ARCH_REL_INSTR "\n"				      \
> -			"1:	ldarx	%0,0,%2" MUTEX_HINT_ACQ "\n"	      \
> -			"	stdcx.	%3,0,%2\n"			      \
> +#define __arch_compare_and_exchange_val_64_relaxed(mem, newval, oldval)	      \

See above.

> +  ({									      \
> +      __typeof (*(mem)) __tmp;						      \
> +      __typeof (mem)  __memp = (mem);					      \
> +      __asm __volatile ("\n"						      \
> +			"1:	ldarx	%0,0,%1\n"			      \
> +			"	cmpd	%0,%2\n"			      \
> +			"	bne	2f\n"				      \
> +			"	stdcx.	%3,0,%1\n"			      \
>  			"	bne-	1b\n"				      \
> -		  " " __ARCH_ACQ_INSTR					      \
> -			: "=&r" (__val), "=m" (*mem)			      \
> -			: "b" (mem), "r" (value), "m" (*mem)		      \
> +			"2:	"					      \
> +			: "=&r" (__tmp)					      \
> +			: "b" (__memp), "r" (oldval), "r" (newval)	      \
>  			: "cr0", "memory");				      \
> -      __val;								      \
> -    })
> -
> -#define __arch_atomic_exchange_64_rel(mem, value) \
> -    ({									      \
> -      __typeof (*mem) __val;						      \
> -      __asm __volatile (__ARCH_REL_INSTR "\n"				      \
> -			"1:	ldarx	%0,0,%2" MUTEX_HINT_REL "\n"	      \
> -			"	stdcx.	%3,0,%2\n"			      \
> -			"	bne-	1b"				      \
> -			: "=&r" (__val), "=m" (*mem)			      \
> -			: "b" (mem), "r" (value), "m" (*mem)		      \
> -			: "cr0", "memory");				      \
> -      __val;								      \
> -    })
> +      __tmp;								      \
> +  })
>  
>  #define __arch_atomic_exchange_and_add_64(mem, value) \
>      ({									      \



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-04-07 11:24 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-22 13:50 [PATCH v2] PowerPC: libc single-thread lock optimization Adhemerval Zanella
2014-08-22 14:00 ` Future atomic operation cleanup Richard Henderson
2014-08-22 14:50   ` Joseph S. Myers
2014-08-22 17:26   ` Torvald Riegel
2014-08-28 13:57   ` Adhemerval Zanella
2014-08-28 15:37     ` Torvald Riegel
2014-08-28 16:28       ` Adhemerval Zanella
2014-08-29  9:55         ` Torvald Riegel
2014-08-29 11:10           ` Adhemerval Zanella
2014-08-29 13:28             ` Torvald Riegel
2014-08-29 13:32               ` Adhemerval Zanella
2014-08-29 16:33                 ` Chris Metcalf
2014-10-23 17:12                   ` Carlos O'Donell
2014-08-29 21:52                 ` Roland McGrath
2014-09-05 17:51                   ` Torvald Riegel
2016-03-11 18:36                 ` [PATCH v3] PowerPC: libc single-thread lock optimization Tulio Magno Quites Machado Filho
2016-03-11 18:39                   ` Florian Weimer
2016-03-11 21:12                     ` Tulio Magno Quites Machado Filho
2016-03-28 17:36                   ` [PING][PATCH " Tulio Magno Quites Machado Filho
2016-04-07 11:24                   ` [PATCH " Torvald Riegel
