public inbox for libc-ports@sourceware.org
* [PATCH] Optimize libc_lock_lock for MIPS XLP.
@ 2012-06-14  5:04 Maxim Kuvyrkov
  2012-06-14 12:39 ` Chris Metcalf
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-06-14  5:04 UTC (permalink / raw)
  To: Joseph S. Myers; +Cc: GLIBC Devel, libc-ports

These two patches (a libc part and a ports part) optimize the libc_lock_lock() macro, which GLIBC uses for internal locking, to take advantage of the fetch_and_add instruction that is available as an extension on certain processors, e.g., the MIPS-architecture XLP.

The libc_lock_lock macros implement a boolean lock: 0 corresponds to the unlocked state and non-zero corresponds to the locked state.  It is, therefore, possible to use fetch_and_add semantics to acquire the lock in libc_lock_lock.  For XLP this translates to a single LDADD instruction.  This optimization benefits architectures that can perform fetch_and_add faster than compare_and_exchange; such architectures indicate this by defining the new macro "lll_add_lock".
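
In rough pseudo-C, the fast path this gives is (a simplified sketch of the
patch below, not the literal code):

  if (atomic_exchange_and_add (&lock, 1) != 0)	/* old value 0: we own the lock */
    __lll_lock_wait_private (&lock);		/* non-zero: take the slow path */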

The unlocking counterpart doesn't require any change, as it already uses a plain atomic_exchange operation, which, incidentally, is also supported on XLP as a single instruction.
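
For reference, that existing generic unlock path is roughly (a simplified
sketch, not the exact code):

  if (atomic_exchange_rel (&lock, 0) > 1)	/* >1: waiters may exist */
    lll_futex_wake (&lock, 1, private);		/* wake one of them */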

Tested on XLP with no regressions.  OK to apply once 2.16 branches off?

Thank you,

--
Maxim Kuvyrkov
Mentor Graphics


2012-06-15  Tom de Vries  <vries@codesourcery.com>
	    Maxim Kuvyrkov  <maxim@codesourcery.com>

	libc/
	* nptl/sysdeps/pthread/bits/libc-lockP.h (__libc_lock_lock): Use
	lll_add_lock when it is available.

	ports/
	* sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h (__lll_add_lock)
	(lll_add_lock): Define.
---
 nptl/sysdeps/pthread/bits/libc-lockP.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/nptl/sysdeps/pthread/bits/libc-lockP.h b/nptl/sysdeps/pthread/bits/libc-lockP.h
index 0ebac91..58d8366 100644
--- a/nptl/sysdeps/pthread/bits/libc-lockP.h
+++ b/nptl/sysdeps/pthread/bits/libc-lockP.h
@@ -176,8 +176,14 @@ typedef pthread_key_t __libc_key_t;
 
 /* Lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_lock(NAME) \
+# if defined lll_add_lock
+/* lll_add_lock is faster, so use it when it's available.  */
+#  define __libc_lock_lock(NAME) \
+  ({ lll_add_lock (NAME, LLL_PRIVATE); 0; })
+# else
+#  define __libc_lock_lock(NAME) \
   ({ lll_lock (NAME, LLL_PRIVATE); 0; })
+# endif
 #else
 # define __libc_lock_lock(NAME) \
   __libc_maybe_call (__pthread_mutex_lock, (&(NAME)), 0)
-- 
1.7.4.1

---
 sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h |   23 ++++++++++++++++++++-
 1 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h b/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
index 88b601e..bbe9ea7 100644
--- a/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+++ b/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
@@ -1,5 +1,4 @@
-/* Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008,
-   2009 Free Software Foundation, Inc.
+/* Copyright (C) 2003-2012 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -172,6 +171,26 @@ extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
   }))
 #define lll_lock(futex, private) __lll_lock (&(futex), private)
 
+#if defined(_MIPS_ARCH_XLP)
+/* XLP has a dedicated exchange_and_add instruction, which is significantly
+   faster than ll/sc and doesn't require explicit syncs.
+   As atomic.h currently only provides a full-barrier atomic_exchange_and_add,
+   replacing an acquire-barrier operation with that full-barrier one is not
+   beneficial for MIPS processors in general.
+   Limit this optimization to XLP for now.  */
+#define __lll_add_lock(futex, private)					      \
+  ((void) ({								      \
+    int *__futex = (futex);						      \
+    if (__builtin_expect (atomic_exchange_and_add (__futex, 1), 0))    \
+      {									      \
+	if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)	      \
+	  __lll_lock_wait_private (__futex);				      \
+	else								      \
+	  __lll_lock_wait (__futex, private);				      \
+      }									      \
+  }))
+#define lll_add_lock(futex, private) __lll_add_lock (&(futex), private)
+#endif
 
 #define __lll_robust_lock(futex, id, private)				      \
   ({									      \
-- 
1.7.4.1



* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-14  5:04 [PATCH] Optimize libc_lock_lock for MIPS XLP Maxim Kuvyrkov
@ 2012-06-14 12:39 ` Chris Metcalf
  2012-06-15  1:21   ` Maxim Kuvyrkov
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Metcalf @ 2012-06-14 12:39 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Joseph S. Myers, GLIBC Devel, libc-ports

On 6/14/2012 1:03 AM, Maxim Kuvyrkov wrote:
> These two patches (a libc part and a ports part) optimize the libc_lock_lock() macro, which GLIBC uses for internal locking, to take advantage of the fetch_and_add instruction that is available as an extension on certain processors, e.g., the MIPS-architecture XLP.
>
> The libc_lock_lock macros implement a boolean lock: 0 corresponds to the unlocked state and non-zero corresponds to the locked state.

Just to be clear, if you put this comment somewhere when you commit, you
should say locks are tristate, where 0 is unlocked, 1 is locked and
uncontended, and >1 is locked and contended.
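
That is, roughly (illustrative only):

  *futex == 0	/* unlocked */
  *futex == 1	/* locked, no waiters */
  *futex >= 2	/* locked, waiters possible: unlock must futex_wake */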

> It is, therefore, possible to use fetch_and_add semantics to acquire the lock in libc_lock_lock.  For XLP this translates to a single LDADD instruction.  This optimization benefits architectures that can perform fetch_and_add faster than compare_and_exchange; such architectures indicate this by defining the new macro "lll_add_lock".
>
> The unlocking counterpart doesn't require any change, as it already uses a plain atomic_exchange operation, which, incidentally, is also supported on XLP as a single instruction.

This seems like it would work well for a single thread acquiring the lock,
but I have some questions about it in the presence of multiple threads
trying to acquire the lock.

First, the generic __lll_lock_wait() code assumes the contended value is
exactly "2".  So if two or more threads both try and fail to acquire the
lock, the value will be >2.  This will cause the waiters to busywait,
spinning on atomic exchange instructions, rather than calling into
futex_wait().  I think it might be possible to change the generic code to
support the more general ">1" semantics of contended locks, but it might be
a bit less efficient, so you might end up wanting to provide overrides for
these functions on MIPS.  Even on MIPS it might result in a certain amount
of spinning since you'd have to hit the race window correctly to feed the
right value of the lock to futex_wait.

Second, if a lock is held long enough for 4 billion threads to try to
acquire it and fail, you will end up with an unlocked lock. :-)  I'm not
sure how likely this seems, but it is a potential issue.  You might
consider, for example, doing a cmpxchg on the contended-lock path to try to
reset the lock value back to 2 again; if it fails, it's not a big deal,
since statistically I would expect the occasional thread to succeed, which
is all you need.
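
Something along these lines, perhaps (an untested sketch, using the atomic.h
macro names):

  int __old = atomic_exchange_and_add (__futex, 1);
  if (__old != 0)
    {
      /* Best-effort clamp back to the contended value; if the cmpxchg
         loses a race, some other thread will clamp the value instead.  */
      atomic_compare_and_exchange_val_acq (__futex, 2, __old + 1);
      __lll_lock_wait (__futex, private);
    }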

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com




* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-14 12:39 ` Chris Metcalf
@ 2012-06-15  1:21   ` Maxim Kuvyrkov
  2012-06-15  2:44     ` Chris Metcalf
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-06-15  1:21 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 15/06/2012, at 12:39 AM, Chris Metcalf wrote:

> On 6/14/2012 1:03 AM, Maxim Kuvyrkov wrote:
>> These two patches (a libc part and a ports part) optimize the libc_lock_lock() macro, which GLIBC uses for internal locking, to take advantage of the fetch_and_add instruction that is available as an extension on certain processors, e.g., the MIPS-architecture XLP.
>> 
>> The libc_lock_lock macros implement a boolean lock: 0 corresponds to the unlocked state and non-zero corresponds to the locked state.
> 
> Just to be clear, if you put this comment somewhere when you commit, you
> should say locks are tristate, where 0 is unlocked, 1 is locked and
> uncontended, and >1 is locked and contended.

Right, it's all coming back now.  I will update the comments to mention this.  [This optimization was written around 6 months ago, and not by me.  This and the points below are worth elaborating on, thanks for bringing them up.]

I've CC'ed Tom de Vries, who is the original author of the patch.  Tom, please let us know if I'm misrepresenting the optimization or the rationale for its correctness.

> 
>> It is, therefore, possible to use fetch_and_add semantics to acquire the lock in libc_lock_lock.  For XLP this translates to a single LDADD instruction.  This optimization benefits architectures that can perform fetch_and_add faster than compare_and_exchange; such architectures indicate this by defining the new macro "lll_add_lock".
>> 
>> The unlocking counterpart doesn't require any change, as it already uses a plain atomic_exchange operation, which, incidentally, is also supported on XLP as a single instruction.
> 
> This seems like it would work well for a single thread acquiring the lock,
> but I have some questions about it in the presence of multiple threads
> trying to acquire the lock.
> 
> First, the generic __lll_lock_wait() code assumes the contended value is
> exactly "2".

Um, not exactly.  __lll_lock_wait() *sets* the contended lock to a value of "2", but it will work as well with >2 values.

void
__lll_lock_wait (int *futex, int private)
{
  if (*futex == 2)
    lll_futex_wait (futex, 2, private);

  while (atomic_exchange_acq (futex, 2) != 0)
    lll_futex_wait (futex, 2, private);
}

>  So if two or more threads both try and fail to acquire the
> lock, the value will be >2.  This will cause the waiters to busywait,
> spinning on atomic exchange instructions, rather than calling into
> futex_wait().

As I read it, in the case of a contended lock __lll_lock_wait() will reset the value of the lock to "2" before calling lll_futex_wait().  I agree that there is a timing window in which other threads will see a value of the lock greater than "2", but the value will not grow into the hundreds or billions, as it will be constantly reset to "2" by the atomic_exchange in __lll_lock_wait().

I do not see how threads will get into a busywait state, though.  Would you please elaborate on that?

>  I think it might be possible to change the generic code to
> support the more general ">1" semantics of contended locks, but it might be
> a bit less efficient, so you might end up wanting to provide overrides for
> these functions on MIPS.  Even on MIPS it might result in a certain amount
> of spinning since you'd have to hit the race window correctly to feed the
> right value of the lock to futex_wait.
> 
> Second, if a lock is held long enough for 4 billion threads to try to
> acquire it and fail, you will end up with an unlocked lock. :-)  I'm not
> sure how likely this seems, but it is a potential issue.  You might
> consider, for example, doing a cmpxchg on the contended-lock path to try to
> reset the lock value back to 2 again; if it fails, it's not a big deal,
> since statistically I would expect the occasional thread to succeed, which
> is all you need.

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-15  1:21   ` Maxim Kuvyrkov
@ 2012-06-15  2:44     ` Chris Metcalf
  2012-06-15  2:50       ` Maxim Kuvyrkov
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Metcalf @ 2012-06-15  2:44 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 6/14/2012 9:20 PM, Maxim Kuvyrkov wrote:
>> First, the generic __lll_lock_wait() code assumes the contended value is
>> exactly "2".
> Um, not exactly.  __lll_lock_wait() *sets* the contended lock to a value of "2", but it will work as well with >2 values.
>
> void
> __lll_lock_wait (int *futex, int private)
> {
>   if (*futex == 2)
>     lll_futex_wait (futex, 2, private);
>
>   while (atomic_exchange_acq (futex, 2) != 0)
>     lll_futex_wait (futex, 2, private);
> }
>
>>  So if two or more threads both try and fail to acquire the
>> lock, the value will be >2.  This will cause the waiters to busywait,
>> spinning on atomic exchange instructions, rather than calling into
>> futex_wait().
> As I read it, in the case of a contended lock __lll_lock_wait() will reset the value of the lock to "2" before calling lll_futex_wait().  I agree that there is a timing window in which other threads will see a value of the lock greater than "2", but the value will not grow into the hundreds or billions, as it will be constantly reset to "2" by the atomic_exchange in __lll_lock_wait().
>
> I do not see how threads will get into a busywait state, though.  Would you please elaborate on that?

You are correct.  I was thinking that the while loop had a cmpxchg, but
since it's just a straight-up exchange, the flow will be something like:

- Fail to initially call lll_futex_wait() if the lock is contended
- Fall through to while loop
- Spin as long as the lock is contended enough that *futex > 2
- Enter futex_wait

So a little busy under high contention, but probably settles out reasonably
well.
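
For example, with three threads and the lock initially free (an illustrative
trace):

  T1: exchange_and_add -> old 0, *futex = 1; T1 owns the lock.
  T2: exchange_and_add -> old 1, *futex = 2; sees 2 and calls futex_wait.
  T3: exchange_and_add -> old 2, *futex = 3; 3 != 2, so it spins once:
      atomic_exchange_acq (futex, 2) resets the value, then futex_wait.
  T1: unlock exchanges in 0 and sees the old value 2 > 1, so it wakes
      one waiter.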

Since Tilera makes chips with 64 cores I tend to worry more about spinning
race cases with a lot of cores contending at once :-)

>>  I think it might be possible to change the generic code to
>> support the more general ">1" semantics of contended locks, but it might be
>> a bit less efficient, so you might end up wanting to provide overrides for
>> these functions on MIPS.  Even on MIPS it might result in a certain amount
>> of spinning since you'd have to hit the race window correctly to feed the
>> right value of the lock to futex_wait.
>>
>> Second, if a lock is held long enough for 4 billion threads to try to
>> acquire it and fail, you will end up with an unlocked lock. :-)  I'm not
>> sure how likely this seems, but it is a potential issue.  You might
>> consider, for example, doing a cmpxchg on the contended-lock path to try to
>> reset the lock value back to 2 again; if it fails, it's not a big deal,
>> since statistically I would expect the occasional thread to succeed, which
>> is all you need.
> Thank you,
>
> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com




* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-15  2:44     ` Chris Metcalf
@ 2012-06-15  2:50       ` Maxim Kuvyrkov
  2012-06-27 21:45         ` Maxim Kuvyrkov
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-06-15  2:50 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 15/06/2012, at 2:44 PM, Chris Metcalf wrote:

> On 6/14/2012 9:20 PM, Maxim Kuvyrkov wrote:
...
>> As I read it, in the case of a contended lock __lll_lock_wait() will reset the value of the lock to "2" before calling lll_futex_wait().  I agree that there is a timing window in which other threads will see a value of the lock greater than "2", but the value will not grow into the hundreds or billions, as it will be constantly reset to "2" by the atomic_exchange in __lll_lock_wait().
>> 
>> I do not see how threads will get into a busywait state, though.  Would you please elaborate on that?
> 
> You are correct.  I was thinking that the while loop had a cmpxchg, but
> since it's just a straight-up exchange, the flow will be something like:
> 
> - Fail to initially call lll_futex_wait() if the lock is contended
> - Fall through to while loop
> - Spin as long as the lock is contended enough that *futex > 2
> - Enter futex_wait
> 
> So a little busy under high contention, but probably settles out reasonably
> well.

Exactly.  I will include the above scenario in the comment to make it more transparent.

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-15  2:50       ` Maxim Kuvyrkov
@ 2012-06-27 21:45         ` Maxim Kuvyrkov
  2012-06-28 17:30           ` Chris Metcalf
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-06-27 21:45 UTC (permalink / raw)
  To: Joseph S. Myers, GLIBC Devel, libc-ports; +Cc: Chris Metcalf, Tom de Vries

On 15/06/2012, at 2:49 PM, Maxim Kuvyrkov wrote:

> On 15/06/2012, at 2:44 PM, Chris Metcalf wrote:
> ...
>> So a little busy under high contention, but probably settles out reasonably
>> well.

Attached is an improved patch that also optimizes __libc_lock_trylock using XLP's atomic instructions.

The patch also removes the unnecessary indirection step represented by the new macro lll_add_lock, which was then used to define __libc_lock_lock, and instead defines __libc_lock_lock and __libc_lock_trylock directly in lowlevellock.h.  This makes the changes outside of ports/ trivial.

Tested on MIPS XLP with no regressions.  OK to apply for 2.17?

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


Allow overrides of __libc_lock_lock and __libc_lock_trylock.

	* nptl/sysdeps/pthread/bits/libc-lockP.h (__libc_lock_lock)
	(__libc_lock_trylock): Allow pre-existing definitions.
---
 nptl/sysdeps/pthread/bits/libc-lockP.h |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/nptl/sysdeps/pthread/bits/libc-lockP.h b/nptl/sysdeps/pthread/bits/libc-lockP.h
index 0ebac91..9c61662 100644
--- a/nptl/sysdeps/pthread/bits/libc-lockP.h
+++ b/nptl/sysdeps/pthread/bits/libc-lockP.h
@@ -176,8 +176,10 @@ typedef pthread_key_t __libc_key_t;
 
 /* Lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_lock(NAME) \
+# ifndef __libc_lock_lock
+#  define __libc_lock_lock(NAME) \
   ({ lll_lock (NAME, LLL_PRIVATE); 0; })
+# endif
 #else
 # define __libc_lock_lock(NAME) \
   __libc_maybe_call (__pthread_mutex_lock, (&(NAME)), 0)
@@ -189,8 +191,10 @@ typedef pthread_key_t __libc_key_t;
 
 /* Try to lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_trylock(NAME) \
+# ifndef __libc_lock_trylock
+#  define __libc_lock_trylock(NAME) \
   lll_trylock (NAME)
+# endif
 #else
 # define __libc_lock_trylock(NAME) \
   __libc_maybe_call (__pthread_mutex_trylock, (&(NAME)), 0)
-- 
1.7.4.1

Optimize libc_lock_lock for XLP.

2012-06-28  Tom de Vries  <vries@codesourcery.com>
	    Maxim Kuvyrkov  <maxim@codesourcery.com>

	* sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h (__libc_lock_lock)
	(__libc_lock_trylock): Define for XLP.
---
 sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h |   39 ++++++++++++++++++++-
 1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h b/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
index 88b601e..a441e6b 100644
--- a/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+++ b/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
@@ -1,5 +1,4 @@
-/* Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008,
-   2009 Free Software Foundation, Inc.
+/* Copyright (C) 2003-2012 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -291,4 +290,40 @@ extern int __lll_timedwait_tid (int *, const struct timespec *)
     __res;						\
   })
 
+#ifdef _MIPS_ARCH_XLP
+/* Implement __libc_lock_lock using exchange_and_add, which expands into
+   a single LDADD instruction on XLP.  This is a simplified expansion of
+   ({ lll_lock (NAME, LLL_PRIVATE); 0; }).
+
+   __lll_lock_wait_private() resets lock value to '2', which prevents unbounded
+   increase of the lock value and [with billions of threads] overflow.
+
+   As atomic.h currently only supports a full-barrier atomic_exchange_and_add,
+   using a full-barrier operation instead of an acquire-barrier operation is
+   not beneficial for MIPS in general.  Limit this optimization to XLP for
+   now.  */
+# define __libc_lock_lock(NAME)						\
+  ({									\
+    int *__futex = &(NAME);						\
+    if (__builtin_expect (atomic_exchange_and_add (__futex, 1), 0))	\
+      __lll_lock_wait_private (__futex);				\
+    0;									\
+  })
+
+# define __libc_lock_trylock(NAME)					\
+  ({									\
+  int *__futex = &(NAME);						\
+  int __result;								\
+  if (atomic_exchange_and_add (__futex, 1) == 0)			\
+    __result = 0;							\
+  else									\
+    /* The lock is already locked.  Set it to 'contended' state to avoid \
+       unbounded increase from subsequent trylocks.  This slightly degrades \
+       performance of locked-but-uncontended case, as lll_futex_wake() will be \
+       called unnecessarily.  */					\
+    __result = (atomic_exchange_acq (__futex, 2) != 0);			\
+  __result;								\
+  })
+#endif
+
 #endif	/* lowlevellock.h */
-- 
1.7.4.1


* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-27 21:45         ` Maxim Kuvyrkov
@ 2012-06-28 17:30           ` Chris Metcalf
  2012-07-06 19:42             ` Tom de Vries
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Metcalf @ 2012-06-28 17:30 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 6/27/2012 5:45 PM, Maxim Kuvyrkov wrote:
> Attached is an improved patch that also optimizes __libc_lock_trylock using XLP's atomic instructions.
>
> The patch also removes the unnecessary indirection step represented by the new macro lll_add_lock, which was then used to define __libc_lock_lock, and instead defines __libc_lock_lock and __libc_lock_trylock directly in lowlevellock.h.  This makes the changes outside of ports/ trivial.
>
> Tested on MIPS XLP with no regressions.  OK to apply for 2.17?

It looks OK to me.  I would want someone else to sign off on it before
applying to 2.17.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com




* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-06-28 17:30           ` Chris Metcalf
@ 2012-07-06 19:42             ` Tom de Vries
  2012-08-14  4:00               ` Maxim Kuvyrkov
  0 siblings, 1 reply; 12+ messages in thread
From: Tom de Vries @ 2012-07-06 19:42 UTC (permalink / raw)
  To: Chris Metcalf
  Cc: Maxim Kuvyrkov, Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 28/06/12 19:30, Chris Metcalf wrote:
>> Attached is an improved patch that also optimizes __libc_lock_trylock using XLP's atomic instructions.
>>
>> The patch also removes the unnecessary indirection step represented by the new macro lll_add_lock, which was then used to define __libc_lock_lock, and instead defines __libc_lock_lock and __libc_lock_trylock directly in lowlevellock.h.  This makes the changes outside of ports/ trivial.
>>
>> Tested on MIPS XLP with no regressions.  OK to apply for 2.17?
> 
> It looks OK to me.  I would want someone else to sign off on it before
> applying to 2.17.
> 

Chris,

I cannot sign off on this, but I reviewed the current patch as well and it looks
ok to me too.

Thanks,
- Tom



* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-07-06 19:42             ` Tom de Vries
@ 2012-08-14  4:00               ` Maxim Kuvyrkov
  2012-08-14 19:33                 ` Chris Metcalf
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-08-14  4:00 UTC (permalink / raw)
  To: Tom de Vries, Chris Metcalf
  Cc: Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 7/07/2012, at 7:41 AM, Tom de Vries wrote:

> On 28/06/12 19:30, Chris Metcalf wrote:
>> 
>> 
>> It looks OK to me.  I would want someone else to sign off on it before
>> applying to 2.17.
>> 
> 
> Chris,
> 
> I cannot sign off on this, but I reviewed the current patch as well and it looks
> ok to me too.
> 
> Thanks,
> - Tom

Attached is an updated version of the patch.  Given the reviews from Chris and Tom, I intend to commit this patch in a couple of days if no one objects.

The differences in this version are:
1. the use of the now-available atomic_exchange_and_add_acq macro (previously only atomic_exchange_and_add existed),
2. __libc_lock_lock is now defined for all MIPS processors, not just XLP, since there is no downside to using atomic_exchange_and_add_acq versus atomic_compare_and_exchange_acq,
3. as Tom correctly spotted, in __libc_lock_trylock we only need to perform the exchange for values >= 2; for 0 and 1 everything works out by itself (the cases are spelled out below).
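
To spell out the trylock cases (this matches the comment in the attached
patch):

  old == 0: the lock was free; it is now 1 and we own it -- success.
  old == 1: the lock was held but uncontended; it is now 2 ("contended"),
            which is already the right state.  The only cost is a possibly
            unnecessary futex_wake at unlock time.
  old >= 2: exchange the value back to 2 so that repeated trylocks cannot
            push it toward overflow.  (If the owner released the lock in
            the window, the exchange returns 0 and we have in fact acquired
            the lock.)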

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


Optimize __libc_lock_lock and __libc_lock_trylock for MIPS.

	* nptl/sysdeps/pthread/bits/libc-lockP.h (__libc_lock_lock)
	(__libc_lock_trylock): Allow pre-existing definitions.

	ports/
	* sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h (__libc_lock_lock)
	(__libc_lock_trylock): Define versions optimized for MIPS.
---
 nptl/sysdeps/pthread/bits/libc-lockP.h             |   10 ++++-
 .../unix/sysv/linux/mips/nptl/lowlevellock.h       |   39 +++++++++++++++++++-
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/nptl/sysdeps/pthread/bits/libc-lockP.h b/nptl/sysdeps/pthread/bits/libc-lockP.h
index 0ebac91..7adaeb4 100644
--- a/nptl/sysdeps/pthread/bits/libc-lockP.h
+++ b/nptl/sysdeps/pthread/bits/libc-lockP.h
@@ -176,9 +176,12 @@ typedef pthread_key_t __libc_key_t;
 
 /* Lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_lock(NAME) \
+# ifndef __libc_lock_lock
+#  define __libc_lock_lock(NAME) \
   ({ lll_lock (NAME, LLL_PRIVATE); 0; })
+# endif
 #else
+# undef __libc_lock_lock
 # define __libc_lock_lock(NAME) \
   __libc_maybe_call (__pthread_mutex_lock, (&(NAME)), 0)
 #endif
@@ -189,9 +192,12 @@ typedef pthread_key_t __libc_key_t;
 
 /* Try to lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_trylock(NAME) \
+# ifndef __libc_lock_trylock
+#  define __libc_lock_trylock(NAME) \
   lll_trylock (NAME)
+# endif
 #else
+# undef __libc_lock_trylock
 # define __libc_lock_trylock(NAME) \
   __libc_maybe_call (__pthread_mutex_trylock, (&(NAME)), 0)
 #endif
diff --git a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
index 88b601e..2584e7d 100644
--- a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+++ b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
@@ -1,5 +1,4 @@
-/* Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008,
-   2009 Free Software Foundation, Inc.
+/* Copyright (C) 2003-2012 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -291,4 +290,40 @@ extern int __lll_timedwait_tid (int *, const struct timespec *)
     __res;						\
   })
 
+/* Implement __libc_lock_lock using exchange_and_add, which expands into
+   a single instruction on XLP processors.  We enable this for all MIPS
+   processors as atomic_exchange_and_add_acq and
+   atomic_compared_and_exchange_acq take the same time to execute.
+   This is a simplified expansion of ({ lll_lock (NAME, LLL_PRIVATE); 0; }).
+
+   Note: __lll_lock_wait_private() resets lock value to '2', which prevents
+   unbounded increase of the lock value and [with billions of threads]
+   overflow.  */
+#define __libc_lock_lock(NAME)						\
+  ({									\
+    int *__futex = &(NAME);						\
+    if (__builtin_expect (atomic_exchange_and_add_acq (__futex, 1), 0))	\
+      __lll_lock_wait_private (__futex);				\
+    0;									\
+  })
+
+#ifdef _MIPS_ARCH_XLP
+/* The generic version using a single atomic_compare_and_exchange_acq takes
+   less time for non-XLP processors, so we use the version below on XLP only.  */
+# define __libc_lock_trylock(NAME)					\
+  ({									\
+  int *__futex = &(NAME);						\
+  int __result = atomic_exchange_and_add_acq (__futex, 1);		\
+  /* If __result == 0, we succeeded in acquiring the lock.		\
+     If __result == 1, we switched the lock to 'contended' state, which	\
+     will cause a [possibly unnecessary] call to lll_futex_wake.  This is \
+     unlikely, so we accept the possible inefficiency.			\
+     If __result >= 2, we need to set the lock to 'contended' state to avoid \
+     unbounded increase from subsequent trylocks.  */			\
+  if (__result >= 2)							\
+    __result = (atomic_exchange_acq (__futex, 2) != 0);			\
+  __result;								\
+  })
+#endif
+
 #endif	/* lowlevellock.h */
-- 
1.7.4.1



* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-08-14  4:00               ` Maxim Kuvyrkov
@ 2012-08-14 19:33                 ` Chris Metcalf
  2012-08-14 21:30                   ` Maxim Kuvyrkov
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Metcalf @ 2012-08-14 19:33 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Tom de Vries, Joseph S. Myers, GLIBC Devel, libc-ports, Tom de Vries

On 8/14/2012 12:00 AM, Maxim Kuvyrkov wrote:
> +   atomic_compared_and_exchange_acq take the same time to execute.

Typo.

> +  if (__result >= 2)							\
> +    __result = (atomic_exchange_acq (__futex, 2) != 0);			\

Why not just return the old value in memory here (i.e. omit the "!= 0"), as
you do with the exchange_and_add above?  That seems more parallel in
structure, and also more efficient.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com


* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-08-14 19:33                 ` Chris Metcalf
@ 2012-08-14 21:30                   ` Maxim Kuvyrkov
  2012-08-14 21:40                     ` Joseph S. Myers
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Kuvyrkov @ 2012-08-14 21:30 UTC (permalink / raw)
  To: Chris Metcalf, Joseph S. Myers
  Cc: Tom de Vries, GLIBC Devel, libc-ports, Tom de Vries

On 15/08/2012, at 7:33 AM, Chris Metcalf wrote:

> On 8/14/2012 12:00 AM, Maxim Kuvyrkov wrote:
>> +   atomic_compared_and_exchange_acq take the same time to execute.
> 
> Typo.

Fixed.

> 
>> +  if (__result >= 2)							\
>> +    __result = (atomic_exchange_acq (__futex, 2) != 0);			\
> 
> Why not just return the old value in memory here (i.e. omit the "!= 0"), as
> you do with the exchange_and_add above?  That seems more parallel in
> structure, and also more efficient.

I think you are right here.

The "!= 0" comes from the pattern of how __lll_trylock, __lll_cond_trylock and __lll_robust_trylock are defined.  They all use "atomic_compare_and_exchange_val_acq (futex, <value>, 0) != 0", which seems excessive as well.

I've removed the "!= 0" from __libc_lock_trylock and re-checked the testsuite.  Updated patch attached.

Joseph, you are the MIPS maintainer, do you have any comments on this patch?

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics

Optimize __libc_lock_lock and __libc_lock_trylock for MIPS.

	* nptl/sysdeps/pthread/bits/libc-lockP.h (__libc_lock_lock)
	(__libc_lock_trylock): Allow pre-existing definitions.

	ports/
	* sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h (__libc_lock_lock)
	(__libc_lock_trylock): Define versions optimized for MIPS.
---
 nptl/sysdeps/pthread/bits/libc-lockP.h             |   10 ++++-
 .../unix/sysv/linux/mips/nptl/lowlevellock.h       |   39 +++++++++++++++++++-
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/nptl/sysdeps/pthread/bits/libc-lockP.h b/nptl/sysdeps/pthread/bits/libc-lockP.h
index 0ebac91..7adaeb4 100644
--- a/nptl/sysdeps/pthread/bits/libc-lockP.h
+++ b/nptl/sysdeps/pthread/bits/libc-lockP.h
@@ -176,9 +176,12 @@ typedef pthread_key_t __libc_key_t;
 
 /* Lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_lock(NAME) \
+# ifndef __libc_lock_lock
+#  define __libc_lock_lock(NAME) \
   ({ lll_lock (NAME, LLL_PRIVATE); 0; })
+# endif
 #else
+# undef __libc_lock_lock
 # define __libc_lock_lock(NAME) \
   __libc_maybe_call (__pthread_mutex_lock, (&(NAME)), 0)
 #endif
@@ -189,9 +192,12 @@ typedef pthread_key_t __libc_key_t;
 
 /* Try to lock the named lock variable.  */
 #if !defined NOT_IN_libc || defined IS_IN_libpthread
-# define __libc_lock_trylock(NAME) \
+# ifndef __libc_lock_trylock
+#  define __libc_lock_trylock(NAME) \
   lll_trylock (NAME)
+# endif
 #else
+# undef __libc_lock_trylock
 # define __libc_lock_trylock(NAME) \
   __libc_maybe_call (__pthread_mutex_trylock, (&(NAME)), 0)
 #endif
diff --git a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
index 88b601e..d368ae1 100644
--- a/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
+++ b/ports/sysdeps/unix/sysv/linux/mips/nptl/lowlevellock.h
@@ -1,5 +1,4 @@
-/* Copyright (C) 2003, 2004, 2005, 2006, 2007, 2008,
-   2009 Free Software Foundation, Inc.
+/* Copyright (C) 2003-2012 Free Software Foundation, Inc.
    This file is part of the GNU C Library.
 
    The GNU C Library is free software; you can redistribute it and/or
@@ -291,4 +290,40 @@ extern int __lll_timedwait_tid (int *, const struct timespec *)
     __res;						\
   })
 
+/* Implement __libc_lock_lock using exchange_and_add, which expands into
+   a single instruction on XLP processors.  We enable this for all MIPS
+   processors as atomic_exchange_and_add_acq and
+   atomic_compare_and_exchange_acq take the same time to execute.
+   This is a simplified expansion of ({ lll_lock (NAME, LLL_PRIVATE); 0; }).
+
+   Note: __lll_lock_wait_private() resets lock value to '2', which prevents
+   unbounded increase of the lock value and [with billions of threads]
+   overflow.  */
+#define __libc_lock_lock(NAME)						\
+  ({									\
+    int *__futex = &(NAME);						\
+    if (__builtin_expect (atomic_exchange_and_add_acq (__futex, 1), 0))	\
+      __lll_lock_wait_private (__futex);				\
+    0;									\
+  })
+
+#ifdef _MIPS_ARCH_XLP
+/* The generic version using a single atomic_compare_and_exchange_acq takes
+   less time for non-XLP processors, so we use the version below on XLP only.  */
+# define __libc_lock_trylock(NAME)					\
+  ({									\
+  int *__futex = &(NAME);						\
+  int __result = atomic_exchange_and_add_acq (__futex, 1);		\
+  /* If __result == 0, we succeeded in acquiring the lock.		\
+     If __result == 1, we switched the lock to 'contended' state, which	\
+     will cause a [possibly unnecessary] call to lll_futex_wake.  This is \
+     unlikely, so we accept the possible inefficiency.			\
+     If __result >= 2, we need to set the lock to 'contended' state to avoid \
+     unbounded increase from subsequent trylocks.  */			\
+  if (__result >= 2)							\
+    __result = atomic_exchange_acq (__futex, 2);			\
+  __result;								\
+  })
+#endif
+
 #endif	/* lowlevellock.h */
-- 
1.7.4.1




* Re: [PATCH] Optimize libc_lock_lock for MIPS XLP.
  2012-08-14 21:30                   ` Maxim Kuvyrkov
@ 2012-08-14 21:40                     ` Joseph S. Myers
  0 siblings, 0 replies; 12+ messages in thread
From: Joseph S. Myers @ 2012-08-14 21:40 UTC (permalink / raw)
  To: Maxim Kuvyrkov
  Cc: Chris Metcalf, Tom de Vries, GLIBC Devel, libc-ports, Tom de Vries

On Wed, 15 Aug 2012, Maxim Kuvyrkov wrote:

> Joseph, you are the MIPS maintainer, do you have any comments on this 
> patch?

I don't have any comments here.

-- 
Joseph S. Myers
joseph@codesourcery.com

