public inbox for libc-ports@sourceware.org
 help / color / mirror / Atom feed
* PI mutex support for pthread_cond_* now in nptl
@ 2013-02-18 10:56 Siddhesh Poyarekar
  2013-02-18 23:50 ` Joseph S. Myers
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-18 10:56 UTC (permalink / raw)
  To: libc-ports; +Cc: libc-alpha

Hi arch-maintainers!

The x86 assembly code for pthread_cond_* functions has, since some
time now, been capable of utilizing PI mutexes (using
PTHREAD_PRIO_INHERIT when initializing a mutex) optimally by using the
FUTEX_WAIT_REQUEUE_PI and its companion FUTEX_CMP_REQUEUE_PI to
requeue waiters woken on pthread_cond_broadcast to the PI mutex while
waking just one child, hence avoiding the thundering herd effect.
This code is now available on the default C implementation in master,
which is used by most non-x86 architectures in glibc.

To allow your architecture to use this, you need to define the
following:

- lll_futex_wait_requeue_pi macro to call the FUTEX_WAIT_REQUEUE_PI
  operation in a futex syscall.

- lll_futex_timed_wait_requeue_pi, which is the timed equivalent of
  the above.

- lll_futex_cmp_requeue_pi to call FUTEX_CMP_REQUEUE_PI.

I have committed implementations for s390 and powerpc, which can be
used as examples for implementation on other architectures.


Siddhesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-18 10:56 PI mutex support for pthread_cond_* now in nptl Siddhesh Poyarekar
@ 2013-02-18 23:50 ` Joseph S. Myers
  2013-02-19  0:50   ` Joseph S. Myers
  2013-02-19 16:42 ` Richard Henderson
       [not found] ` <5123AB55.2070100__45742.2411222526$1361292204$gmane$org@twiddle.net>
  2 siblings, 1 reply; 18+ messages in thread
From: Joseph S. Myers @ 2013-02-18 23:50 UTC (permalink / raw)
  To: libc-ports

I've applied this patch to implement this for ARM.

diff --git a/ports/ChangeLog.arm b/ports/ChangeLog.arm
index 3a5880b..5b5220f 100644
--- a/ports/ChangeLog.arm
+++ b/ports/ChangeLog.arm
@@ -1,3 +1,13 @@
+2013-02-18  Joseph Myers  <joseph@codesourcery.com>
+
+	[BZ #14920]
+	* sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h
+	(FUTEX_WAIT_REQUEUE_PI): Define.
+	(FUTEX_CMP_REQUEUE_PI): Likewise.
+	(lll_futex_wait_requeue_pi): Likewise.
+	(lll_futex_timed_wait_requeue_pi): Likewise.
+	(lll_futex_cmp_requeue_pi): Likewise.
+
 2013-02-18  Siddhesh Poyarekar  <siddhesh@redhat.com>
 
 	* sysdeps/unix/sysv/linux/arm/nptl/libc.abilist: Add
diff --git a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h
index 0278812..fe21cda 100644
--- a/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h
+++ b/ports/sysdeps/unix/sysv/linux/arm/nptl/lowlevellock.h
@@ -36,6 +36,8 @@
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_WAIT_BITSET	9
 #define FUTEX_WAKE_BITSET	10
+#define FUTEX_WAIT_REQUEUE_PI   11
+#define FUTEX_CMP_REQUEUE_PI    12
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
 
@@ -141,6 +143,32 @@
     INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
   })
 
+/* Priority Inheritance support.  */
+#define lll_futex_wait_requeue_pi(futexp, val, mutex, private) \
+  lll_futex_timed_wait_requeue_pi (futexp, val, NULL, 0, mutex, private)
+
+#define lll_futex_timed_wait_requeue_pi(futexp, val, timespec, clockbit,      \
+					mutex, private)			      \
+  ({									      \
+    INTERNAL_SYSCALL_DECL (__err);					      \
+    int __op = FUTEX_WAIT_REQUEUE_PI | clockbit;			      \
+									      \
+    INTERNAL_SYSCALL (futex, __err, 5, (futexp),			      \
+		      __lll_private_flag (__op, private),		      \
+		      (val), (timespec), mutex); 			      \
+  })
+
+#define lll_futex_cmp_requeue_pi(futexp, nr_wake, nr_move, mutex, val, priv)  \
+  ({									      \
+    INTERNAL_SYSCALL_DECL (__err);					      \
+    long int __ret;							      \
+									      \
+    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
+			      __lll_private_flag (FUTEX_CMP_REQUEUE_PI, priv),\
+			      (nr_wake), (nr_move), (mutex), (val));	      \
+    INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
+  })
+
 
 #define lll_trylock(lock)	\
   atomic_compare_and_exchange_val_acq(&(lock), 1, 0)

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-18 23:50 ` Joseph S. Myers
@ 2013-02-19  0:50   ` Joseph S. Myers
  0 siblings, 0 replies; 18+ messages in thread
From: Joseph S. Myers @ 2013-02-19  0:50 UTC (permalink / raw)
  To: libc-ports

I've applied this patch with the MIPS implementation.

diff --git a/ports/ChangeLog.mips b/ports/ChangeLog.mips
index de967c6..c07bf97 100644
--- a/ports/ChangeLog.mips
+++ b/ports/ChangeLog.mips
@@ -1,3 +1,13 @@
+2013-02-19  Joseph Myers  <joseph@codesourcery.com>
+
+	[BZ #14920]
+	* sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+	(FUTEX_WAIT_REQUEUE_PI): Define.
+	(FUTEX_CMP_REQUEUE_PI): Likewise.
+	(lll_futex_wait_requeue_pi): Likewise.
+	(lll_futex_timed_wait_requeue_pi): Likewise.
+	(lll_futex_cmp_requeue_pi): Likewise.
+
 2013-02-18  Siddhesh Poyarekar  <siddhesh@redhat.com>
 
 	* sysdeps/unix/sysv/linux/mips/mips32/nptl/libc.abilist: Add
diff --git a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
index fe7c747..4f27b8b 100644
--- a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+++ b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
@@ -36,6 +36,8 @@
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_WAIT_BITSET	9
 #define FUTEX_WAKE_BITSET	10
+#define FUTEX_WAIT_REQUEUE_PI   11
+#define FUTEX_CMP_REQUEUE_PI    12
 #define FUTEX_PRIVATE_FLAG	128
 #define FUTEX_CLOCK_REALTIME	256
 
@@ -141,6 +143,34 @@
     INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
   })
 
+/* Priority Inheritance support.  */
+#define lll_futex_wait_requeue_pi(futexp, val, mutex, private) \
+  lll_futex_timed_wait_requeue_pi (futexp, val, NULL, 0, mutex, private)
+
+#define lll_futex_timed_wait_requeue_pi(futexp, val, timespec, clockbit,      \
+					mutex, private)			      \
+  ({									      \
+    INTERNAL_SYSCALL_DECL (__err);					      \
+    long int __ret;							      \
+    int __op = FUTEX_WAIT_REQUEUE_PI | clockbit;			      \
+									      \
+    __ret = INTERNAL_SYSCALL (futex, __err, 5, (futexp),		      \
+			      __lll_private_flag (__op, private),	      \
+			      (val), (timespec), mutex); 		      \
+    INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;		      \
+  })
+
+#define lll_futex_cmp_requeue_pi(futexp, nr_wake, nr_move, mutex, val, priv)  \
+  ({									      \
+    INTERNAL_SYSCALL_DECL (__err);					      \
+    long int __ret;							      \
+									      \
+    __ret = INTERNAL_SYSCALL (futex, __err, 6, (futexp),		      \
+			      __lll_private_flag (FUTEX_CMP_REQUEUE_PI, priv),\
+			      (nr_wake), (nr_move), (mutex), (val));	      \
+    INTERNAL_SYSCALL_ERROR_P (__ret, __err);				      \
+  })
+
 static inline int __attribute__((always_inline))
 __lll_trylock(int *futex)
 {

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-18 10:56 PI mutex support for pthread_cond_* now in nptl Siddhesh Poyarekar
  2013-02-18 23:50 ` Joseph S. Myers
@ 2013-02-19 16:42 ` Richard Henderson
  2013-02-19 16:49   ` Siddhesh Poyarekar
  2013-02-19 17:18   ` Joseph S. Myers
       [not found] ` <5123AB55.2070100__45742.2411222526$1361292204$gmane$org@twiddle.net>
  2 siblings, 2 replies; 18+ messages in thread
From: Richard Henderson @ 2013-02-19 16:42 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: libc-ports, libc-alpha

On 2013-02-18 02:56, Siddhesh Poyarekar wrote:
> Hi arch-maintainers!
>
> The x86 assembly code for pthread_cond_* functions has, since some
> time now, been capable of utilizing PI mutexes (using
> PTHREAD_PRIO_INHERIT when initializing a mutex) optimally by using the
> FUTEX_WAIT_REQUEUE_PI and its companion FUTEX_CMP_REQUEUE_PI to
> requeue waiters woken on pthread_cond_broadcast to the PI mutex while
> waking just one child, hence avoiding the thundering herd effect.
> This code is now available on the default C implementation in master,
> which is used by most non-x86 architectures in glibc.
>
> To allow your architecture to use this, you need to define the
> following:
>
> - lll_futex_wait_requeue_pi macro to call the FUTEX_WAIT_REQUEUE_PI
>    operation in a futex syscall.
>
> - lll_futex_timed_wait_requeue_pi, which is the timed equivalent of
>    the above.
>
> - lll_futex_cmp_requeue_pi to call FUTEX_CMP_REQUEUE_PI.
>
> I have committed implementations for s390 and powerpc, which can be
> used as examples for implementation on other architectures.

Any chance we can move these macros into a generic linux header?
Given that we're using INTERNAL_SYSCALL macros, the definitions ought to 
be the same for all targets.


r~

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 16:42 ` Richard Henderson
@ 2013-02-19 16:49   ` Siddhesh Poyarekar
  2013-02-19 17:18   ` Joseph S. Myers
  1 sibling, 0 replies; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-19 16:49 UTC (permalink / raw)
  To: Richard Henderson; +Cc: libc-ports, libc-alpha

On Tue, Feb 19, 2013 at 08:41:57AM -0800, Richard Henderson wrote:
> Any chance we can move these macros into a generic linux header?
> Given that we're using INTERNAL_SYSCALL macros, the definitions ought
> to be the same for all targets.

It is on my TODO list, but not near the top as of now.  Maybe someone
else might want to get to it before I do.

Siddhesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
       [not found] ` <5123AB55.2070100__45742.2411222526$1361292204$gmane$org@twiddle.net>
@ 2013-02-19 17:01   ` Andreas Schwab
  2013-02-19 17:18     ` Siddhesh Poyarekar
  0 siblings, 1 reply; 18+ messages in thread
From: Andreas Schwab @ 2013-02-19 17:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Siddhesh Poyarekar, libc-ports, libc-alpha

Richard Henderson <rth@twiddle.net> writes:

> Any chance we can move these macros into a generic linux header?

What would that do to architectures that don't implement PI futexes?

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 16:42 ` Richard Henderson
  2013-02-19 16:49   ` Siddhesh Poyarekar
@ 2013-02-19 17:18   ` Joseph S. Myers
  2013-02-19 17:21     ` Siddhesh Poyarekar
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Joseph S. Myers @ 2013-02-19 17:18 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Siddhesh Poyarekar, libc-ports, libc-alpha

On Tue, 19 Feb 2013, Richard Henderson wrote:

> Any chance we can move these macros into a generic linux header?
> Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> the same for all targets.

Generally most of lowlevellock.h should probably be shared between 
architectures.  (If some architectures don't implement a particular 
feature as of a particular kernel version, that's a matter for 
kernel-features.h and __ASSUME_* conditionals.)

There are however two different groups of those headers, those where e.g. 
lll_futex_timed_wait ends with

  INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;

(e.g. powerpc) and those (e.g. s390) where it just ends with a call to 
INTERNAL_SYSCALL and relies on the return value of INTERNAL_SYSCALL being 
the correct return from the macro.  I'm not sure whether those differences 
actually serve a useful purpose or whether the return value of 
INTERNAL_SYSCALL differs in a relevant way between architectures.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 17:01   ` Andreas Schwab
@ 2013-02-19 17:18     ` Siddhesh Poyarekar
  2013-02-19 17:39       ` Andreas Schwab
  0 siblings, 1 reply; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-19 17:18 UTC (permalink / raw)
  To: Andreas Schwab; +Cc: Richard Henderson, libc-ports, libc-alpha

On Tue, Feb 19, 2013 at 06:01:13PM +0100, Andreas Schwab wrote:
> What would that do to architectures that don't implement PI futexes?
> 

They get them implemented for free.  As far as the linux kernel is
concerned, PI futexes are available on all architectures.

Siddhesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 17:18   ` Joseph S. Myers
@ 2013-02-19 17:21     ` Siddhesh Poyarekar
  2013-02-19 20:06     ` Torvald Riegel
       [not found]     ` <1361304381.581.80.camel__6928.53579898856$1361304432$gmane$org@triegel.csb>
  2 siblings, 0 replies; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-19 17:21 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: Richard Henderson, libc-ports, libc-alpha

On Tue, Feb 19, 2013 at 05:18:20PM +0000, Joseph S. Myers wrote:
> There are however two different groups of those headers, those where e.g. 
> lll_futex_timed_wait ends with
> 
>   INTERNAL_SYSCALL_ERROR_P (__ret, __err) ? -__ret : __ret;
> 
> (e.g. powerpc) and those (e.g. s390) where it just ends with a call to 
> INTERNAL_SYSCALL and relies on the return value of INTERNAL_SYSCALL being 
> the correct return from the macro.  I'm not sure whether those differences 
> actually serve a useful purpose or whether the return value of 
> INTERNAL_SYSCALL differs in a relevant way between architectures.

From the very superficial analysis I had done last year, it should be
possible to consolidate all that.

Siddhesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 17:18     ` Siddhesh Poyarekar
@ 2013-02-19 17:39       ` Andreas Schwab
  2013-02-19 17:52         ` Siddhesh Poyarekar
  0 siblings, 1 reply; 18+ messages in thread
From: Andreas Schwab @ 2013-02-19 17:39 UTC (permalink / raw)
  To: Siddhesh Poyarekar; +Cc: Richard Henderson, libc-ports, libc-alpha

Siddhesh Poyarekar <siddhesh@redhat.com> writes:

> As far as the linux kernel is concerned, PI futexes are available on
> all architectures.

No.

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 17:39       ` Andreas Schwab
@ 2013-02-19 17:52         ` Siddhesh Poyarekar
  0 siblings, 0 replies; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-19 17:52 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Siddhesh Poyarekar, Richard Henderson, libc-ports, libc-alpha

On 19 February 2013 23:09, Andreas Schwab <schwab@suse.de> wrote:
> Siddhesh Poyarekar <siddhesh@redhat.com> writes:
>
>> As far as the linux kernel is concerned, PI futexes are available on
>> all architectures.
>
> No.
>

Oh ok, it's not.  It's a runtime check and not a compile-time check
that I looked for.


-- 
http://siddhesh.in

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 17:18   ` Joseph S. Myers
  2013-02-19 17:21     ` Siddhesh Poyarekar
@ 2013-02-19 20:06     ` Torvald Riegel
  2013-02-20 18:13       ` Steven Munroe
       [not found]     ` <1361304381.581.80.camel__6928.53579898856$1361304432$gmane$org@triegel.csb>
  2 siblings, 1 reply; 18+ messages in thread
From: Torvald Riegel @ 2013-02-19 20:06 UTC (permalink / raw)
  To: Joseph S. Myers
  Cc: Richard Henderson, Siddhesh Poyarekar, libc-ports, libc-alpha

On Tue, 2013-02-19 at 17:18 +0000, Joseph S. Myers wrote:
> On Tue, 19 Feb 2013, Richard Henderson wrote:
> 
> > Any chance we can move these macros into a generic linux header?
> > Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> > the same for all targets.
> 
> Generally most of lowlevellock.h should probably be shared between 
> architectures.  (If some architectures don't implement a particular 
> feature as of a particular kernel version, that's a matter for 
> kernel-features.h and __ASSUME_* conditionals.)

On a related note: What are the reasons to have arch-specific assembler
versions of many of the synchronization operations?  I would be
surprised if they'd provide a significant performance advantage; has
anyone recent measurements for this?

It seems to me that it would be useful to consolidate the different
versions that exist for the synchronization operations into shared C
code as long as this doesn't make a significant performance difference.
They are all based on atomic operations and futex operations, both of
which we have in C code (especially if we have compilers that support
the C11 memory model).  Or are there other reasons for keeping different
versions that I'm not aware of?


Torvald

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-19 20:06     ` Torvald Riegel
@ 2013-02-20 18:13       ` Steven Munroe
  2013-02-20 20:25         ` Torvald Riegel
  0 siblings, 1 reply; 18+ messages in thread
From: Steven Munroe @ 2013-02-20 18:13 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Joseph S. Myers, Richard Henderson, Siddhesh Poyarekar,
	libc-ports, libc-alpha

On Tue, 2013-02-19 at 21:06 +0100, Torvald Riegel wrote:
> On Tue, 2013-02-19 at 17:18 +0000, Joseph S. Myers wrote:
> > On Tue, 19 Feb 2013, Richard Henderson wrote:
> > 
> > > Any chance we can move these macros into a generic linux header?
> > > Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> > > the same for all targets.
> > 
> > Generally most of lowlevellock.h should probably be shared between 
> > architectures.  (If some architectures don't implement a particular 
> > feature as of a particular kernel version, that's a matter for 
> > kernel-features.h and __ASSUME_* conditionals.)
> 
> On a related note: What are the reasons to have arch-specific assembler
> versions of many of the synchronization operations?  I would be
> surprised if they'd provide a significant performance advantage; has
> anyone recent measurements for this?
> 
The introduction of GCC compiler builtins like __sync is fairly recent
and the new __atomic builtins start with GCC-4.7. So until recently we
had no choice. 

For platforms (like PowerPC) that implement acquire/release the GCC
__sync builtins are not sufficient and GCC-4.7 __atomic builtins are not
pervasive enough to make that the default. 

> It seems to me that it would be useful to consolidate the different
> versions that exist for the synchronization operations into shared C
> code as long as this doesn't make a significant performance difference.
> They are all based on atomic operations and futex operations, both of
> which we have in C code (especially if we have compilers that support
> the C11 memory model).  Or are there other reasons for keeping different
> versions that I'm not aware of?
> 
I disagree. The performance of lowlevellocks and associated platform
specific optimizations are too import to move forward with the
consolidation you suggest.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-20 18:13       ` Steven Munroe
@ 2013-02-20 20:25         ` Torvald Riegel
  2013-02-20 22:04           ` Roland McGrath
                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Torvald Riegel @ 2013-02-20 20:25 UTC (permalink / raw)
  To: munroesj
  Cc: Joseph S. Myers, Richard Henderson, Siddhesh Poyarekar,
	libc-ports, libc-alpha

On Wed, 2013-02-20 at 10:59 -0600, Steven Munroe wrote:
> On Tue, 2013-02-19 at 21:06 +0100, Torvald Riegel wrote:
> > On Tue, 2013-02-19 at 17:18 +0000, Joseph S. Myers wrote:
> > > On Tue, 19 Feb 2013, Richard Henderson wrote:
> > > 
> > > > Any chance we can move these macros into a generic linux header?
> > > > Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> > > > the same for all targets.
> > > 
> > > Generally most of lowlevellock.h should probably be shared between 
> > > architectures.  (If some architectures don't implement a particular 
> > > feature as of a particular kernel version, that's a matter for 
> > > kernel-features.h and __ASSUME_* conditionals.)
> > 
> > On a related note: What are the reasons to have arch-specific assembler
> > versions of many of the synchronization operations?  I would be
> > surprised if they'd provide a significant performance advantage; has
> > anyone recent measurements for this?
> > 
> The introduction of GCC compiler builtins like __sync is fairly recent
> and the new __atomic builtins start with GCC-4.7. So until recently we
> had no choice. 

Using assembler for the atomic operations is possible (e.g., as in
Boehm's libatomic-ops, or in./sysdeps/powerpc/bits/atomic.h and others).
It doesn't allow for the same level of compiler optimization across
barriers, but it's unclear whether that has much benefit, and GCC
doesn't do it yet anyway.

There are some cases in which compilers that don't support the C11/C++11
memory model can generate code that wouldn't be correct in such a model,
and which can theoretically interfere with other concurrent code (e.g.,
introduce data races due to accesses being too wide).  However, because
we don't have custom assembler for everything, we should be already
exposed to that.

> For platforms (like PowerPC) that implement acquire/release the GCC
> __sync builtins are not sufficient and GCC-4.7 __atomic builtins are not
> pervasive enough to make that the default. 

I agree regarding the __sync builtins, but using assembler in place of
the __atomic builtins should work, or not?

> > It seems to me that it would be useful to consolidate the different
> > versions that exist for the synchronization operations into shared C
> > code as long as this doesn't make a significant performance difference.
> > They are all based on atomic operations and futex operations, both of
> > which we have in C code (especially if we have compilers that support
> > the C11 memory model).  Or are there other reasons for keeping different
> > versions that I'm not aware of?
> > 
> I disagree. The performance of lowlevellocks and associated platform
> specific optimizations are too import to move forward with the
> consolidation you suggest.

Which specific optimizations do you refer to?  I didn't see any for
powerpc, for example (i.e., the lock fast path is C up to the point of
the atomic operation).  The ones that I saw are for x86, and I'm
wondering whether they provide much benefit.  Especially because this
can mostly just matter for the execution path taken when a free lock is
acquired; once you get any cache miss, you're to some extent on the slow
path anyway.  Also, for the Linux platforms I looked at, the mutex
algorithms are the same.

Do you have any recent measurements (or could point to them) that show
the benefit of the optimizations you refer to?

For example, we've spent quite some time debugging a PI cond var failure
in the past, and this wasn't made any easier by having several
(different) versions of the cond var implementation.

Torvald

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-20 20:25         ` Torvald Riegel
@ 2013-02-20 22:04           ` Roland McGrath
  2013-02-20 22:41           ` Steven Munroe
  2013-02-21  4:32           ` Siddhesh Poyarekar
  2 siblings, 0 replies; 18+ messages in thread
From: Roland McGrath @ 2013-02-20 22:04 UTC (permalink / raw)
  To: Torvald Riegel; +Cc: libc-ports, libc-alpha

At least the x86 assembly versions were written because the code produced
by GCC at the time was drastically worse.  But that was long ago.  We now
mandate compiling with GCC 4.3 or newer and recommend 4.6 or newer, so it
merits examining the code produced by those compilers and reconsidering the
judgments of the past.  But we will be conservative and not drop any
hand-written assembly unless recent compilers at -O2 are demonstrated to
produce code at least as good.  (Perhaps the reasonable standard is
something like at least as good from 4.6 -O3 and not a whole lot worse from
4.3 -O2.)  As time goes on, we can of course examine then-common compilers
and reconsider these judgments.  As of now, even 4.7 is not common yet.


Thanks,
Roland

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-20 20:25         ` Torvald Riegel
  2013-02-20 22:04           ` Roland McGrath
@ 2013-02-20 22:41           ` Steven Munroe
  2013-02-21  4:32           ` Siddhesh Poyarekar
  2 siblings, 0 replies; 18+ messages in thread
From: Steven Munroe @ 2013-02-20 22:41 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: munroesj, Joseph S. Myers, Richard Henderson, Siddhesh Poyarekar,
	libc-ports, libc-alpha

On Wed, 2013-02-20 at 21:25 +0100, Torvald Riegel wrote:
> On Wed, 2013-02-20 at 10:59 -0600, Steven Munroe wrote:
> > On Tue, 2013-02-19 at 21:06 +0100, Torvald Riegel wrote:
> > > On Tue, 2013-02-19 at 17:18 +0000, Joseph S. Myers wrote:
> > > > On Tue, 19 Feb 2013, Richard Henderson wrote:
> > > > 
> > > > > Any chance we can move these macros into a generic linux header?
> > > > > Given that we're using INTERNAL_SYSCALL macros, the definitions ought to be
> > > > > the same for all targets.
> > > > 
> > > > Generally most of lowlevellock.h should probably be shared between 
> > > > architectures.  (If some architectures don't implement a particular 
> > > > feature as of a particular kernel version, that's a matter for 
> > > > kernel-features.h and __ASSUME_* conditionals.)
> > > 
> > > On a related note: What are the reasons to have arch-specific assembler
> > > versions of many of the synchronization operations?  I would be
> > > surprised if they'd provide a significant performance advantage; has
> > > anyone recent measurements for this?
> > > 
> > The introduction of GCC compiler builtins like __sync is fairly recent
> > and the new __atomic builtins start with GCC-4.7. So until recently we
> > had no choice. 
> 
> Using assembler for the atomic operations is possible (e.g., as in
> Boehm's libatomic-ops, or in./sysdeps/powerpc/bits/atomic.h and others).
> It doesn't allow for the same level of compiler optimization across
> barriers, but it's unclear whether that has much benefit, and GCC
> doesn't do it yet anyway.
> 
> There are some cases in which compilers that don't support the C11/C++11
> memory model can generate code that wouldn't be correct in such a model,
> and which can theoretically interfere with other concurrent code (e.g.,
> introduce data races due to accesses being too wide).  However, because
> we don't have custom assembler for everything, we should be already
> exposed to that.
> 
> > For platforms (like PowerPC) that implement acquire/release the GCC
> > __sync builtins are not sufficient and GCC-4.7 __atomic builtins are not
> > pervasive enough to make that the default. 
> 
> I agree regarding the __sync builtins, but using assembler in place of
> the __atomic builtins should work, or not?
> 
> > > It seems to me that it would be useful to consolidate the different
> > > versions that exist for the synchronization operations into shared C
> > > code as long as this doesn't make a significant performance difference.
> > > They are all based on atomic operations and futex operations, both of
> > > which we have in C code (especially if we have compilers that support
> > > the C11 memory model).  Or are there other reasons for keeping different
> > > versions that I'm not aware of?
> > > 
> > I disagree. The performance of lowlevellocks and associated platform
> > specific optimizations are too import to move forward with the
> > consolidation you suggest.
> 
> Which specific optimizations do you refer to?  I didn't see any for
> powerpc, for example (i.e., the lock fast path is C up to the point of
> the atomic operation).  The ones that I saw are for x86, and I'm
> wondering whether they provide much benefit.  Especially because this
> can mostly just matter for the execution path taken when a free lock is
> acquired; once you get any cache miss, you're to some extent on the slow
> path anyway.  Also, for the Linux platforms I looked at, the mutex
> algorithms are the same.
> 
Like the lwarx MUTEX_HINT (EH field) hint.

> Do you have any recent measurements (or could point to them) that show
> the benefit of the optimizations you refer to?
> 
No. I don't current have access to a machine big enough show this effect
and I cant tell you about the specific customer. So you will have to
trust me on this.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
  2013-02-20 20:25         ` Torvald Riegel
  2013-02-20 22:04           ` Roland McGrath
  2013-02-20 22:41           ` Steven Munroe
@ 2013-02-21  4:32           ` Siddhesh Poyarekar
  2 siblings, 0 replies; 18+ messages in thread
From: Siddhesh Poyarekar @ 2013-02-21  4:32 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: munroesj, Joseph S. Myers, Richard Henderson, libc-ports, libc-alpha

On Wed, Feb 20, 2013 at 09:25:26PM +0100, Torvald Riegel wrote:
> For example, we've spent quite some time debugging a PI cond var failure
> in the past, and this wasn't made any easier by having several
> (different) versions of the cond var implementation.

I'd like to think that at least the variations are gone now with my
patches, so if you're looking to only do an analysis of the algorithm,
the C code should give sufficient insight into what the
architecture-specific code does.

Siddhesh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: PI mutex support for pthread_cond_* now in nptl
       [not found]     ` <1361304381.581.80.camel__6928.53579898856$1361304432$gmane$org@triegel.csb>
@ 2013-02-22  4:11       ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2013-02-22  4:11 UTC (permalink / raw)
  To: Torvald Riegel
  Cc: Joseph S. Myers, Richard Henderson, Siddhesh Poyarekar,
	libc-ports, libc-alpha

Torvald Riegel <triegel@redhat.com> writes:

> On a related note: What are the reasons to have arch-specific assembler
> versions of many of the synchronization operations?  I would be
> surprised if they'd provide a significant performance advantage; has
> anyone recent measurements for this?

Yes changing the x86 assembler code is a big pain and I bet it prevents
a lot of optimizations because it's hard to change.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-02-22  4:11 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-18 10:56 PI mutex support for pthread_cond_* now in nptl Siddhesh Poyarekar
2013-02-18 23:50 ` Joseph S. Myers
2013-02-19  0:50   ` Joseph S. Myers
2013-02-19 16:42 ` Richard Henderson
2013-02-19 16:49   ` Siddhesh Poyarekar
2013-02-19 17:18   ` Joseph S. Myers
2013-02-19 17:21     ` Siddhesh Poyarekar
2013-02-19 20:06     ` Torvald Riegel
2013-02-20 18:13       ` Steven Munroe
2013-02-20 20:25         ` Torvald Riegel
2013-02-20 22:04           ` Roland McGrath
2013-02-20 22:41           ` Steven Munroe
2013-02-21  4:32           ` Siddhesh Poyarekar
     [not found]     ` <1361304381.581.80.camel__6928.53579898856$1361304432$gmane$org@triegel.csb>
2013-02-22  4:11       ` Andi Kleen
     [not found] ` <5123AB55.2070100__45742.2411222526$1361292204$gmane$org@twiddle.net>
2013-02-19 17:01   ` Andreas Schwab
2013-02-19 17:18     ` Siddhesh Poyarekar
2013-02-19 17:39       ` Andreas Schwab
2013-02-19 17:52         ` Siddhesh Poyarekar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).