public inbox for libc-alpha@sourceware.org
* [PATCH v2] nptl: Add backoff mechanism to spinlock loop
@ 2022-03-28  8:47 Wangyang Guo
  2022-03-28 16:41 ` Noah Goldstein
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Wangyang Guo @ 2022-03-28  8:47 UTC (permalink / raw)
  To: libc-alpha; +Cc: Wangyang Guo, goldstein.w.n, hjl.tools

When multiple threads are waiting for a lock at the same time, once the
lock owner releases the lock, all waiters will see the lock as available
and try to lock it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced. As the
number of try-lock attempts increases, it becomes more likely that a
larger number of threads are competing for the adaptive mutex lock, so
the wait time is increased exponentially. A random jitter is also added
to avoid synchronized try-lock attempts from other threads.

v2: Remove read-check before try-lock for performance.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..7e75ec1cba 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -26,6 +26,7 @@
 #include <futex-internal.h>
 #include <stap-probe.h>
 #include <shlib-compat.h>
+#include <random-bits.h>
 
 /* Some of the following definitions differ when pthread_mutex_cond_lock.c
    includes this file.  */
@@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
 # define PTHREAD_MUTEX_VERSIONS 1
 #endif
 
-#ifndef LLL_MUTEX_READ_LOCK
-# define LLL_MUTEX_READ_LOCK(mutex) \
-  atomic_load_relaxed (&(mutex)->__data.__lock)
-#endif
-
 static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
      __attribute_noinline__;
 
@@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = random_bits ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, spin count is exponential backoff plus
+	         random jitter, random range is [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		  atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Binary exponential backoff, prepare for next loop.  */
+	      exp_backoff <<= 1;
 	    }
-	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
-		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
+	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
 	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
 	}
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-28  8:47 [PATCH v2] nptl: Add backoff mechanism to spinlock loop Wangyang Guo
@ 2022-03-28 16:41 ` Noah Goldstein
  2022-03-30 11:44   ` Guo, Wangyang
  2022-03-30 11:53 ` Adhemerval Zanella
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Noah Goldstein @ 2022-03-28 16:41 UTC (permalink / raw)
  To: Wangyang Guo; +Cc: GNU C Library, H.J. Lu

On Mon, Mar 28, 2022 at 3:47 AM Wangyang Guo <wangyang.guo@intel.com> wrote:
>
> When multiple threads are waiting for a lock at the same time, once the
> lock owner releases the lock, all waiters will see the lock as available
> and try to lock it, which may cause an expensive CAS storm.
>
> Binary exponential backoff with random jitter is introduced. As the
> number of try-lock attempts increases, it becomes more likely that a
> larger number of threads are competing for the adaptive mutex lock, so
> the wait time is increased exponentially. A random jitter is also added
> to avoid synchronized try-lock attempts from other threads.
>
> v2: Remove read-check before try-lock for performance.
>
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index d2e652d151..7e75ec1cba 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -26,6 +26,7 @@
>  #include <futex-internal.h>
>  #include <stap-probe.h>
>  #include <shlib-compat.h>
> +#include <random-bits.h>
>
>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>     includes this file.  */
> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>
> -#ifndef LLL_MUTEX_READ_LOCK
> -# define LLL_MUTEX_READ_LOCK(mutex) \
> -  atomic_load_relaxed (&(mutex)->__data.__lock)
> -#endif
> -
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>
> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>           int cnt = 0;
>           int max_cnt = MIN (max_adaptive_count (),
>                              mutex->__data.__spins * 2 + 10);
> +         int spin_count, exp_backoff = 1;
> +         unsigned int jitter = random_bits ();
>           do
>             {
> -             if (cnt++ >= max_cnt)
> +             /* In each loop, spin count is exponential backoff plus
> +                random jitter, random range is [0, exp_backoff-1].  */
> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> +             cnt += spin_count;
> +             if (cnt >= max_cnt)
>                 {
> +                 /* If cnt exceeds max spin count, just go to wait
> +                    queue.  */
>                   LLL_MUTEX_LOCK (mutex);
>                   break;
>                 }
> -             atomic_spin_nop ();
> +             do
> +                 atomic_spin_nop ();
> +             while (--spin_count > 0);
> +             /* Binary exponential backoff, prepare for next loop.  */
> +             exp_backoff <<= 1;
>             }
> -         while (LLL_MUTEX_READ_LOCK (mutex) != 0
> -                || LLL_MUTEX_TRYLOCK (mutex) != 0);
> +         while (LLL_MUTEX_TRYLOCK (mutex) != 0);

Does it perform better w/o the read?

In general can you post some benchmarks varying the number of threads / size of
the critical section? Would also be nice if you could collect some
stats regarding
the average number of failed CAS attempts before/after.
>
>           mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>         }
> --
> 2.35.1
>


* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-28 16:41 ` Noah Goldstein
@ 2022-03-30 11:44   ` Guo, Wangyang
  2022-03-30 19:39     ` Noah Goldstein
  0 siblings, 1 reply; 27+ messages in thread
From: Guo, Wangyang @ 2022-03-30 11:44 UTC (permalink / raw)
  To: libc-alpha; +Cc: Noah Goldstein, wangyang.guo

On 3/29/2022 12:41 AM, Noah Goldstein via Libc-alpha wrote:
> On Mon, Mar 28, 2022 at 3:47 AM Wangyang Guo <wangyang.guo@intel.com> wrote:
>>
>> When multiple threads are waiting for a lock at the same time, once the
>> lock owner releases the lock, all waiters will see the lock as available
>> and try to lock it, which may cause an expensive CAS storm.
>>
>> Binary exponential backoff with random jitter is introduced. As the
>> number of try-lock attempts increases, it becomes more likely that a
>> larger number of threads are competing for the adaptive mutex lock, so
>> the wait time is increased exponentially. A random jitter is also added
>> to avoid synchronized try-lock attempts from other threads.
>>
>> v2: Remove read-check before try-lock for performance.
>>
>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>> ---
>>   nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
>>   1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index d2e652d151..7e75ec1cba 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -26,6 +26,7 @@
>>   #include <futex-internal.h>
>>   #include <stap-probe.h>
>>   #include <shlib-compat.h>
>> +#include <random-bits.h>
>>
>>   /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>>      includes this file.  */
>> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>>   # define PTHREAD_MUTEX_VERSIONS 1
>>   #endif
>>
>> -#ifndef LLL_MUTEX_READ_LOCK
>> -# define LLL_MUTEX_READ_LOCK(mutex) \
>> -  atomic_load_relaxed (&(mutex)->__data.__lock)
>> -#endif
>> -
>>   static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>>        __attribute_noinline__;
>>
>> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>>            int cnt = 0;
>>            int max_cnt = MIN (max_adaptive_count (),
>>                               mutex->__data.__spins * 2 + 10);
>> +         int spin_count, exp_backoff = 1;
>> +         unsigned int jitter = random_bits ();
>>            do
>>              {
>> -             if (cnt++ >= max_cnt)
>> +             /* In each loop, spin count is exponential backoff plus
>> +                random jitter, random range is [0, exp_backoff-1].  */
>> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
>> +             cnt += spin_count;
>> +             if (cnt >= max_cnt)
>>                  {
>> +                 /* If cnt exceeds max spin count, just go to wait
>> +                    queue.  */
>>                    LLL_MUTEX_LOCK (mutex);
>>                    break;
>>                  }
>> -             atomic_spin_nop ();
>> +             do
>> +                 atomic_spin_nop ();
>> +             while (--spin_count > 0);
>> +             /* Binary exponential backoff, prepare for next loop.  */
>> +             exp_backoff <<= 1;
>>              }
>> -         while (LLL_MUTEX_READ_LOCK (mutex) != 0
>> -                || LLL_MUTEX_TRYLOCK (mutex) != 0);
>> +         while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> 
> Does it perform better w/o the read?
> 
> In general can you post some benchmarks varying the number of threads / size of
> the critical section? Would also be nice if you could collect some
> stats regarding
> the average number of failed CAS attempts before/after.

I worked out a mutex bench: 
https://gist.github.com/guowangy/459085e5be6bfeda101c591d2d2304c5
It can test with different thread counts and critical section sizes
(the critical section length is determined by the number of pause
instructions).

Here is the result of v1 against upstream:
(Tested in Xeon 8380)

npause	1	4	16	64	128   <== thread num
--------------------------------------------
0	1.00	1.70	1.94	1.70	1.76
1	0.99	1.57	1.75	1.54	1.58
4	1.00	1.28	1.42	1.38	1.41
16	1.00	1.02	1.11	1.18	1.19
64	1.00	1.00	0.95	0.99	0.97
128	1.00	0.92	0.92	0.87	0.87

The first row header is the number of threads.
The first column is the number of pauses, which represents different
critical section sizes.
The value is the ratio of v1/upstream throughput (higher is better).
RSD varies by thread num: thread(4, 16) < 2%, thread(1, 64, 128) < 1%.

Thread=1 is the baseline and should be the same for both.
Backoff works well when the critical section is small.
But it causes some regression when the critical section is very large,
because upstream makes frequent lock attempts to reduce latency.
Since the adaptive mutex is aimed at 'quick' locks, I think it's not a
big problem here.

The next is v2 against v1:
	1	4	16	64	128
--------------------------------------------
0	1.00	0.99	0.95	1.02	1.02
1	1.00	1.08	0.92	1.04	1.04
4	1.00	0.94	0.94	1.05	1.04
16	1.00	1.01	0.98	1.03	1.03
64	1.00	1.01	1.00	1.00	1.01
128	1.00	1.02	1.00	1.00	1.00

v2 without the read-check can be beneficial at large thread counts, but
has drawbacks at small thread counts.

I also have a version with the random jitter removed (v3); here is the 
result of v3 against v1:
	1	4	16	64	128
--------------------------------------------
0	1.00	0.95	1.04	1.02	1.03
1	1.01	1.09	1.04	1.02	1.03
4	1.00	1.02	1.05	1.02	1.02
16	1.00	1.01	1.01	1.02	1.02
64	1.00	1.04	1.02	1.01	1.01
128	1.00	0.98	0.99	0.99	0.98

The results show that the version without random jitter is a better solution.

>>
>>            mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>>          }
>> --
>> 2.35.1
>>
> 



* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-28  8:47 [PATCH v2] nptl: Add backoff mechanism to spinlock loop Wangyang Guo
  2022-03-28 16:41 ` Noah Goldstein
@ 2022-03-30 11:53 ` Adhemerval Zanella
  2022-03-30 17:07   ` Noah Goldstein
  2022-05-04  2:50 ` [PATCH v3] " Wangyang Guo
  2022-05-04  2:58 ` Wangyang Guo
  3 siblings, 1 reply; 27+ messages in thread
From: Adhemerval Zanella @ 2022-03-30 11:53 UTC (permalink / raw)
  To: Wangyang Guo, libc-alpha



On 28/03/2022 05:47, Wangyang Guo via Libc-alpha wrote:
> When multiple threads are waiting for a lock at the same time, once the
> lock owner releases the lock, all waiters will see the lock as available
> and try to lock it, which may cause an expensive CAS storm.
>
> Binary exponential backoff with random jitter is introduced. As the
> number of try-lock attempts increases, it becomes more likely that a
> larger number of threads are competing for the adaptive mutex lock, so
> the wait time is increased exponentially. A random jitter is also added
> to avoid synchronized try-lock attempts from other threads.
> 
> v2: Remove read-check before try-lock for performance.
> 
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index d2e652d151..7e75ec1cba 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -26,6 +26,7 @@
>  #include <futex-internal.h>
>  #include <stap-probe.h>
>  #include <shlib-compat.h>
> +#include <random-bits.h>
>  
>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>     includes this file.  */
> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>  
> -#ifndef LLL_MUTEX_READ_LOCK
> -# define LLL_MUTEX_READ_LOCK(mutex) \
> -  atomic_load_relaxed (&(mutex)->__data.__lock)
> -#endif
> -
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>  
> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>  	  int cnt = 0;
>  	  int max_cnt = MIN (max_adaptive_count (),
>  			     mutex->__data.__spins * 2 + 10);
> +	  int spin_count, exp_backoff = 1;
> +	  unsigned int jitter = random_bits ();

This will issue a syscall for architectures that do not have clock_gettime
in the vDSO, which is a performance regression.  You will need to move the
jitter setup to be arch-specific, with the generic interface setting
no random jitter.

>  	  do
>  	    {
> -	      if (cnt++ >= max_cnt)
> +	      /* In each loop, spin count is exponential backoff plus
> +	         random jitter, random range is [0, exp_backoff-1].  */
> +	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> +	      cnt += spin_count;
> +	      if (cnt >= max_cnt)
>  		{
> +		  /* If cnt exceeds max spin count, just go to wait
> +		     queue.  */
>  		  LLL_MUTEX_LOCK (mutex);
>  		  break;
>  		}
> -	      atomic_spin_nop ();
> +	      do
> +		  atomic_spin_nop ();
> +	      while (--spin_count > 0);
> +	      /* Binary exponential backoff, prepare for next loop.  */
> +	      exp_backoff <<= 1;
>  	    }
> -	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
> -		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
> +	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
>  	  mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
>  	}


* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-30 11:53 ` Adhemerval Zanella
@ 2022-03-30 17:07   ` Noah Goldstein
  2022-03-30 17:21     ` Adhemerval Zanella
  0 siblings, 1 reply; 27+ messages in thread
From: Noah Goldstein @ 2022-03-30 17:07 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Wangyang Guo, GNU C Library

On Wed, Mar 30, 2022 at 6:54 AM Adhemerval Zanella via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
>
>
> On 28/03/2022 05:47, Wangyang Guo via Libc-alpha wrote:
> > When multiple threads are waiting for a lock at the same time, once the
> > lock owner releases the lock, all waiters will see the lock as available
> > and try to lock it, which may cause an expensive CAS storm.
> >
> > Binary exponential backoff with random jitter is introduced. As the
> > number of try-lock attempts increases, it becomes more likely that a
> > larger number of threads are competing for the adaptive mutex lock, so
> > the wait time is increased exponentially. A random jitter is also added
> > to avoid synchronized try-lock attempts from other threads.
> >
> > v2: Remove read-check before try-lock for performance.
> >
> > Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> > ---
> >  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
> >  1 file changed, 16 insertions(+), 9 deletions(-)
> >
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index d2e652d151..7e75ec1cba 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -26,6 +26,7 @@
> >  #include <futex-internal.h>
> >  #include <stap-probe.h>
> >  #include <shlib-compat.h>
> > +#include <random-bits.h>
> >
> >  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
> >     includes this file.  */
> > @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> >  # define PTHREAD_MUTEX_VERSIONS 1
> >  #endif
> >
> > -#ifndef LLL_MUTEX_READ_LOCK
> > -# define LLL_MUTEX_READ_LOCK(mutex) \
> > -  atomic_load_relaxed (&(mutex)->__data.__lock)
> > -#endif
> > -
> >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >       __attribute_noinline__;
> >
> > @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >         int cnt = 0;
> >         int max_cnt = MIN (max_adaptive_count (),
> >                            mutex->__data.__spins * 2 + 10);
> > +       int spin_count, exp_backoff = 1;
> > +       unsigned int jitter = random_bits ();
>
> This will issue a syscall for architectures that do not have clock_gettime
> in the vDSO, which is a performance regression.  You will need to move the
> jitter setup to be arch-specific, with the generic interface setting
> no random jitter.

What would be the best init jitter for arch w/ only syscall timers? TID? Or
something else?
>
> >         do
> >           {
> > -           if (cnt++ >= max_cnt)
> > +           /* In each loop, spin count is exponential backoff plus
> > +              random jitter, random range is [0, exp_backoff-1].  */
> > +           spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> > +           cnt += spin_count;
> > +           if (cnt >= max_cnt)
> >               {
> > +               /* If cnt exceeds max spin count, just go to wait
> > +                  queue.  */
> >                 LLL_MUTEX_LOCK (mutex);
> >                 break;
> >               }
> > -           atomic_spin_nop ();
> > +           do
> > +               atomic_spin_nop ();
> > +           while (--spin_count > 0);
> > +           /* Binary exponential backoff, prepare for next loop.  */
> > +           exp_backoff <<= 1;
> >           }
> > -       while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > -              || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > +       while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
> >         mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> >       }


* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-30 17:07   ` Noah Goldstein
@ 2022-03-30 17:21     ` Adhemerval Zanella
  2022-04-12 11:53       ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Adhemerval Zanella @ 2022-03-30 17:21 UTC (permalink / raw)
  To: Noah Goldstein; +Cc: Wangyang Guo, GNU C Library



On 30/03/2022 14:07, Noah Goldstein wrote:
> On Wed, Mar 30, 2022 at 6:54 AM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>>
>>
>> On 28/03/2022 05:47, Wangyang Guo via Libc-alpha wrote:
>>> When multiple threads are waiting for a lock at the same time, once the
>>> lock owner releases the lock, all waiters will see the lock as available
>>> and try to lock it, which may cause an expensive CAS storm.
>>>
>>> Binary exponential backoff with random jitter is introduced. As the
>>> number of try-lock attempts increases, it becomes more likely that a
>>> larger number of threads are competing for the adaptive mutex lock, so
>>> the wait time is increased exponentially. A random jitter is also added
>>> to avoid synchronized try-lock attempts from other threads.
>>>
>>> v2: Remove read-check before try-lock for performance.
>>>
>>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>>> ---
>>>  nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
>>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>> index d2e652d151..7e75ec1cba 100644
>>> --- a/nptl/pthread_mutex_lock.c
>>> +++ b/nptl/pthread_mutex_lock.c
>>> @@ -26,6 +26,7 @@
>>>  #include <futex-internal.h>
>>>  #include <stap-probe.h>
>>>  #include <shlib-compat.h>
>>> +#include <random-bits.h>
>>>
>>>  /* Some of the following definitions differ when pthread_mutex_cond_lock.c
>>>     includes this file.  */
>>> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>>>  # define PTHREAD_MUTEX_VERSIONS 1
>>>  #endif
>>>
>>> -#ifndef LLL_MUTEX_READ_LOCK
>>> -# define LLL_MUTEX_READ_LOCK(mutex) \
>>> -  atomic_load_relaxed (&(mutex)->__data.__lock)
>>> -#endif
>>> -
>>>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>>>       __attribute_noinline__;
>>>
>>> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>>>         int cnt = 0;
>>>         int max_cnt = MIN (max_adaptive_count (),
>>>                            mutex->__data.__spins * 2 + 10);
>>> +       int spin_count, exp_backoff = 1;
>>> +       unsigned int jitter = random_bits ();
>>
>> This will issue a syscall for architectures that do not have clock_gettime
>> in the vDSO, which is a performance regression.  You will need to move the
>> jitter setup to be arch-specific, with the generic interface setting
>> no random jitter.
> 
> What would be the best init jitter for arch w/ only syscall timers? TID? Or
> something else?

We don't cache the TID anymore, so it would still generate syscalls.  I think
it is safer that, since we don't have an internal source of entropy (Florian
sent an arc4random implementation some time ago that we could use in this
case), the generic interface would not do any jitter at all (so it would
basically be the algorithm we already have).


* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-30 11:44   ` Guo, Wangyang
@ 2022-03-30 19:39     ` Noah Goldstein
  0 siblings, 0 replies; 27+ messages in thread
From: Noah Goldstein @ 2022-03-30 19:39 UTC (permalink / raw)
  To: Guo, Wangyang; +Cc: GNU C Library

On Wed, Mar 30, 2022 at 6:45 AM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>
> On 3/29/2022 12:41 AM, Noah Goldstein via Libc-alpha wrote:
> > On Mon, Mar 28, 2022 at 3:47 AM Wangyang Guo <wangyang.guo@intel.com> wrote:
> >>
> >> When multiple threads are waiting for a lock at the same time, once the
> >> lock owner releases the lock, all waiters will see the lock as available
> >> and try to lock it, which may cause an expensive CAS storm.
> >>
> >> Binary exponential backoff with random jitter is introduced. As the
> >> number of try-lock attempts increases, it becomes more likely that a
> >> larger number of threads are competing for the adaptive mutex lock, so
> >> the wait time is increased exponentially. A random jitter is also added
> >> to avoid synchronized try-lock attempts from other threads.
> >>
> >> v2: Remove read-check before try-lock for performance.
> >>
> >> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> >> ---
> >>   nptl/pthread_mutex_lock.c | 25 ++++++++++++++++---------
> >>   1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> >> index d2e652d151..7e75ec1cba 100644
> >> --- a/nptl/pthread_mutex_lock.c
> >> +++ b/nptl/pthread_mutex_lock.c
> >> @@ -26,6 +26,7 @@
> >>   #include <futex-internal.h>
> >>   #include <stap-probe.h>
> >>   #include <shlib-compat.h>
> >> +#include <random-bits.h>
> >>
> >>   /* Some of the following definitions differ when pthread_mutex_cond_lock.c
> >>      includes this file.  */
> >> @@ -64,11 +65,6 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> >>   # define PTHREAD_MUTEX_VERSIONS 1
> >>   #endif
> >>
> >> -#ifndef LLL_MUTEX_READ_LOCK
> >> -# define LLL_MUTEX_READ_LOCK(mutex) \
> >> -  atomic_load_relaxed (&(mutex)->__data.__lock)
> >> -#endif
> >> -
> >>   static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >>        __attribute_noinline__;
> >>
> >> @@ -138,17 +134,28 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >>            int cnt = 0;
> >>            int max_cnt = MIN (max_adaptive_count (),
> >>                               mutex->__data.__spins * 2 + 10);
> >> +         int spin_count, exp_backoff = 1;
> >> +         unsigned int jitter = random_bits ();
> >>            do
> >>              {
> >> -             if (cnt++ >= max_cnt)
> >> +             /* In each loop, spin count is exponential backoff plus
> >> +                random jitter, random range is [0, exp_backoff-1].  */
> >> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> >> +             cnt += spin_count;
> >> +             if (cnt >= max_cnt)
> >>                  {
> >> +                 /* If cnt exceeds max spin count, just go to wait
> >> +                    queue.  */
> >>                    LLL_MUTEX_LOCK (mutex);
> >>                    break;
> >>                  }
> >> -             atomic_spin_nop ();
> >> +             do
> >> +                 atomic_spin_nop ();
> >> +             while (--spin_count > 0);
> >> +             /* Binary exponential backoff, prepare for next loop.  */
> >> +             exp_backoff <<= 1;
> >>              }
> >> -         while (LLL_MUTEX_READ_LOCK (mutex) != 0
> >> -                || LLL_MUTEX_TRYLOCK (mutex) != 0);
> >> +         while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
> > Does it perform better w/o the read?
> >
> > In general can you post some benchmarks varying the number of threads / size of
> > the critical section? Would also be nice if you could collect some
> > stats regarding
> > the average number of failed CAS attempts before/after.
>
> I worked out a mutex bench:
> https://gist.github.com/guowangy/459085e5be6bfeda101c591d2d2304c5
> It can test with different thread counts and critical section sizes
> (the critical section length is determined by the number of pause
> instructions).

Can you add a benchmark to glibc?  Also, you are compiling without
optimizations enabled.

>
> Here is the result of v1 against upstream:
> (Tested in Xeon 8380)
>
> npause  1       4       16      64      128   <== thread num
> --------------------------------------------
> 0       1.00    1.70    1.94    1.70    1.76
> 1       0.99    1.57    1.75    1.54    1.58
> 4       1.00    1.28    1.42    1.38    1.41
> 16      1.00    1.02    1.11    1.18    1.19
> 64      1.00    1.00    0.95    0.99    0.97
> 128     1.00    0.92    0.92    0.87    0.87

I was able to essentially reproduce this.

Maybe in light of these changes `max_cnt` needs a new threshold.
Also maybe instead of `cnt += spin_count` it should stay `cnt++` so we don't
hit the futex so eagerly.


>
> The first row header is the number of threads.
> The first column is the number of pauses, which represents different
> critical section sizes.
> The value is the ratio of v1/upstream throughput (higher is better).
> RSD varies by thread num: thread(4, 16) < 2%, thread(1, 64, 128) < 1%.
>
> Thread=1 is the baseline and should be the same for both.
> Backoff works well when the critical section is small.
> But it causes some regression when the critical section is very large,
> because upstream makes frequent lock attempts to reduce latency.
> Since the adaptive mutex is aimed at 'quick' locks, I think it's not a
> big problem here.
>
> The next is v2 against v1:
>         1       4       16      64      128
> --------------------------------------------
> 0       1.00    0.99    0.95    1.02    1.02
> 1       1.00    1.08    0.92    1.04    1.04
> 4       1.00    0.94    0.94    1.05    1.04
> 16      1.00    1.01    0.98    1.03    1.03
> 64      1.00    1.01    1.00    1.00    1.01
> 128     1.00    1.02    1.00    1.00    1.00
>
> v2 without the read-check can be beneficial at large thread counts, but
> has drawbacks at small thread counts.
>
I am seeing a pretty significant benefit in keeping the read check
on my Tigerlake (only measuring up to 16 threads as it's an 8-core
machine).

W/ read check:
[pause:   0, thread:   1] -- Throughput:   76528393 ops/s RSD: 4.55%
[pause:   0, thread:   4] -- Throughput:   28065221 ops/s RSD: 1.58%
[pause:   0, thread:   8] -- Throughput:   14757408 ops/s RSD: 1.86%
[pause:   0, thread:  16] -- Throughput:   15501579 ops/s RSD: 1.06%

w/o read check:
[pause:   0, thread:   1] -- Throughput:   71303250 ops/s RSD: 4.98%
[pause:   0, thread:   4] -- Throughput:   11308910 ops/s RSD: 8.96%
[pause:   0, thread:   8] -- Throughput:    7380217 ops/s RSD: 5.52%
[pause:   0, thread:  16] -- Throughput:    8253073 ops/s RSD: 2.19%


> I also have a version with the random jitter removed (v3); here is the
> result of v3 against v1:
>         1       4       16      64      128
> --------------------------------------------
> 0       1.00    0.95    1.04    1.02    1.03
> 1       1.01    1.09    1.04    1.02    1.03
> 4       1.00    1.02    1.05    1.02    1.02
> 16      1.00    1.01    1.01    1.02    1.02
> 64      1.00    1.04    1.02    1.01    1.01
> 128     1.00    0.98    0.99    0.99    0.98
>
> The results show that the version without random jitter is a better solution.
>
> >>
> >>            mutex->__data.__spins += (cnt - mutex->__data.__spins) / 8;
> >>          }
> >> --
> >> 2.35.1
> >>
> >
>


* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-03-30 17:21     ` Adhemerval Zanella
@ 2022-04-12 11:53       ` Florian Weimer
  2022-04-22 13:30         ` Yann Droneaud
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2022-04-12 11:53 UTC (permalink / raw)
  To: Adhemerval Zanella via Libc-alpha
  Cc: Noah Goldstein, Adhemerval Zanella, Wangyang Guo

* Adhemerval Zanella via Libc-alpha:

> On 30/03/2022 14:07, Noah Goldstein wrote:
>> What would be the best init jitter for arch w/ only syscall timers? TID? Or
>> something else?
>
> We don't cache the TID anymore, so it would still generate syscalls.

We do have the TID for the running process in the TCB.  We need it (or
something like it) for recursive locks.

Maybe for this application, we should use some number of upper bits from
a per-thread linear congruential generator, initialized to a random
value at thread start?

Thanks,
Florian



* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-12 11:53       ` Florian Weimer
@ 2022-04-22 13:30         ` Yann Droneaud
  2022-04-22 13:32           ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Yann Droneaud @ 2022-04-22 13:30 UTC (permalink / raw)
  To: Florian Weimer, Adhemerval Zanella via Libc-alpha; +Cc: Wangyang Guo

On 12/04/2022 13:53, Florian Weimer via Libc-alpha wrote:
> * Adhemerval Zanella via Libc-alpha:
>
>> On 30/03/2022 14:07, Noah Goldstein wrote:
>>> What would be the best init jitter for arch w/ only syscall timers? TID? Or
>>> something else?
>> We don't cache the TID anymore, so it would still generate syscalls.
> We do have the TID for the running process in the TCB.  We need it (or
> something like it) for recursive locks.
>
> Maybe for this application, we should use some number of upper bits from
> a per-thread linear congruential generator, initialized to a random
> value at thread start?


As each running thread has its own stack, the thread's stack address can
be used as a seed for such a PRNG.


Regards.

-- 

Yann Droneaud

OPTEYA



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-22 13:30         ` Yann Droneaud
@ 2022-04-22 13:32           ` Florian Weimer
  2022-04-22 13:35             ` Cristian Rodríguez
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2022-04-22 13:32 UTC (permalink / raw)
  To: Yann Droneaud; +Cc: Adhemerval Zanella via Libc-alpha, Wangyang Guo

* Yann Droneaud:

> On 12/04/2022 at 13:53, Florian Weimer via Libc-alpha wrote:
>> * Adhemerval Zanella via Libc-alpha:
>>
>>> On 30/03/2022 14:07, Noah Goldstein wrote:
>>>> What would be the best init jitter for arch w/ only syscall timers? TID? Or
>>>> something else?
>>> We don't cache the TID anymore, so it would still generate syscalls.
>> We do have the TID for the running process in the TCB.  We need it (or
>> something like it) for recursive locks.
>>
>> Maybe for this application, we should use some number of upper bits from
>> a per-thread linear congruential generator, initialized to a random
>> value at thread start?

> As each running thread has its own stack, the thread's stack address
> can be used as a seed for such a PRNG.

We would broadcast the stack address though, which is generally frowned
upon.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-22 13:32           ` Florian Weimer
@ 2022-04-22 13:35             ` Cristian Rodríguez
  2022-04-22 15:25               ` Noah Goldstein
  0 siblings, 1 reply; 27+ messages in thread
From: Cristian Rodríguez @ 2022-04-22 13:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Yann Droneaud, Wangyang Guo, Adhemerval Zanella via Libc-alpha

On Fri, Apr 22, 2022 at 9:32 AM Florian Weimer via Libc-alpha
<libc-alpha@sourceware.org> wrote:

> > As each running thread has its own stack, the thread's stack address
> > can be used as a seed for such a PRNG.
>
> > We would broadcast the stack address though, which is generally frowned
> > upon.

You can use the siphash approach suggested by Jason if there is a
concern of leaking that information.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-22 13:35             ` Cristian Rodríguez
@ 2022-04-22 15:25               ` Noah Goldstein
  2022-04-26 12:25                 ` Florian Weimer
  0 siblings, 1 reply; 27+ messages in thread
From: Noah Goldstein @ 2022-04-22 15:25 UTC (permalink / raw)
  To: Cristian Rodríguez
  Cc: Florian Weimer, Yann Droneaud, Wangyang Guo,
	Adhemerval Zanella via Libc-alpha

On Fri, Apr 22, 2022 at 8:35 AM Cristian Rodríguez via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Fri, Apr 22, 2022 at 9:32 AM Florian Weimer via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>
> > > As each running thread has its own stack, the thread's stack address
> > > can be used as a seed for such a PRNG.
> >
> > We would broadcast the stack address though, which is generally frowned
> > upon.

Why is that?

Also next version is likely to be based on:
https://patchwork.sourceware.org/project/glibc/patch/20220404213942.652799-1-goldstein.w.n@gmail.com/

But just grabbing the stack sounds faster.

>
> You can use the siphash approach suggested by Jason if there is a
> concern of leaking that information.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-22 15:25               ` Noah Goldstein
@ 2022-04-26 12:25                 ` Florian Weimer
  2022-04-26 12:42                   ` Yann Droneaud
  0 siblings, 1 reply; 27+ messages in thread
From: Florian Weimer @ 2022-04-26 12:25 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: Cristian Rodríguez, Yann Droneaud, Wangyang Guo,
	Adhemerval Zanella via Libc-alpha

* Noah Goldstein:

> On Fri, Apr 22, 2022 at 8:35 AM Cristian Rodríguez via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> On Fri, Apr 22, 2022 at 9:32 AM Florian Weimer via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>
>> > > As each running thread has its own stack, the thread's stack address
>> > > can be used as a seed for such a PRNG.
>> >
>> > We would broadcast the stack address though, which is generally frowned
>> > upon.
>
> Why is that?

Potential bypass of ASLR hardening.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] nptl: Add backoff mechanism to spinlock loop
  2022-04-26 12:25                 ` Florian Weimer
@ 2022-04-26 12:42                   ` Yann Droneaud
  0 siblings, 0 replies; 27+ messages in thread
From: Yann Droneaud @ 2022-04-26 12:42 UTC (permalink / raw)
  To: Florian Weimer, Noah Goldstein
  Cc: Cristian Rodríguez, Wangyang Guo, Adhemerval Zanella via Libc-alpha

Hi,

On 26/04/2022 at 14:25, Florian Weimer wrote:
> * Noah Goldstein:
>
>> On Fri, Apr 22, 2022 at 8:35 AM Cristian Rodríguez via Libc-alpha
>> <libc-alpha@sourceware.org> wrote:
>>> On Fri, Apr 22, 2022 at 9:32 AM Florian Weimer via Libc-alpha
>>> <libc-alpha@sourceware.org> wrote:
>>>
>>>>> As each running thread has its own stack, the thread's stack address
>>>>> can be used as a seed for such a PRNG.
>>>> We would broadcast the stack address though, which is generally frowned
>>>> upon.
>> Why is that?
> Potential bypass of ASLR hardening.

The attack would be to monitor the behavior of multiple threads in a
process contending for a lock, and to use precise timings to recover the
few bits of each thread's stack address that leak as part of the backoff
mechanism.

It may sound possible, but I find it unlikely without a full compromise
of the running process.

That said, with the stack address used as the key in SipHash, it is
cryptographically infeasible to recover the key from the output, so no
exploitable leakage would occur.
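
The idea of deriving a seed from the stack address without exposing the address could be sketched as follows (illustrative only: a real implementation would use SipHash keyed with a process-wide secret, as suggested in the thread; the splitmix64-style mixer and the `seed_from_stack` name below are stand-ins to keep the sketch short):

```c
#include <stdint.h>

/* Mix the stack address with a secret so that observing the output
   does not reveal the address.  Every step is invertible, so distinct
   addresses always map to distinct seeds for a fixed key.  */
static uint64_t
seed_from_stack (uint64_t stack_addr, uint64_t secret_key)
{
  uint64_t z = stack_addr ^ secret_key;
  z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
  z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
  return z ^ (z >> 31);
}
```

Unlike SipHash, a non-cryptographic mixer like this only obscures the address; it does not make recovery computationally infeasible, which is exactly the gap the SipHash suggestion addresses.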

Regards.

-- 

Yann Droneaud

OPTEYA



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3] nptl: Add backoff mechanism to spinlock loop
  2022-03-28  8:47 [PATCH v2] nptl: Add backoff mechanism to spinlock loop Wangyang Guo
  2022-03-28 16:41 ` Noah Goldstein
  2022-03-30 11:53 ` Adhemerval Zanella
@ 2022-05-04  2:50 ` Wangyang Guo
  2022-05-04  2:58 ` Wangyang Guo
  3 siblings, 0 replies; 27+ messages in thread
From: Wangyang Guo @ 2022-05-04  2:50 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools, goldstein.w.n, Wangyang Guo

When multiple threads wait for a lock at the same time, once the lock
owner releases it, the waiters all see the lock become available and try
to acquire it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced.  As the
number of try-lock attempts increases, it is more likely that a larger
number of threads are competing for the adaptive mutex lock, so the wait
time is increased exponentially.  A random jitter is also added to avoid
synchronized try-lock attempts from other threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore the read-check since it works well on some platforms.
2. Make backoff arch-dependent, and enable it for x86_64.
3. Limit the max backoff to reduce latency in large critical sections.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	128	160
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	128	160
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	128	160
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c                   | 16 +++++++--
 sysdeps/nptl/pthreadP.h                     |  1 +
 sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
 sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
 4 files changed, 89 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
 create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..6e767a8724 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = get_jitter ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, spin count is exponential backoff plus
+		 random jitter, random range is [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Prepare for next loop.  */
+	      exp_backoff = get_next_backoff (exp_backoff);
 	    }
 	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
 		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
index 601db4ff2b..39af275c25 100644
--- a/sysdeps/nptl/pthreadP.h
+++ b/sysdeps/nptl/pthreadP.h
@@ -33,6 +33,7 @@
 #include <kernel-features.h>
 #include <errno.h>
 #include <internal-signals.h>
+#include <pthread_mutex_backoff.h>
 #include "pthread_mutex_conf.h"
 
 
diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..2ccdcaf1e7
--- /dev/null
+++ b/sysdeps/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,35 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+static inline unsigned int
+get_jitter ()
+{
+  /* Arch-dependent random jitter; returning 0 disables randomization.  */
+  return 0;
+}
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Compute the next backoff; returning 1 disables mutex backoff.  */
+  return 1;
+}
+
+#endif
diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..ec74c3d9db
--- /dev/null
+++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,39 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+#include <fast-jitter.h>
+
+static inline unsigned int
+get_jitter (void)
+{
+  return get_fast_jitter ();
+}
+
+#define MAX_BACKOFF 16
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Binary exponential backoff.  Limiting the max backoff
+     can reduce latency in large critical sections.  */
+  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
+}
+
+#endif
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3] nptl: Add backoff mechanism to spinlock loop
  2022-03-28  8:47 [PATCH v2] nptl: Add backoff mechanism to spinlock loop Wangyang Guo
                   ` (2 preceding siblings ...)
  2022-05-04  2:50 ` [PATCH v3] " Wangyang Guo
@ 2022-05-04  2:58 ` Wangyang Guo
  2022-05-04  3:17   ` [PATCH v4] " Wangyang Guo
  3 siblings, 1 reply; 27+ messages in thread
From: Wangyang Guo @ 2022-05-04  2:58 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools, goldstein.w.n, Wangyang Guo

When multiple threads wait for a lock at the same time, once the lock
owner releases it, the waiters all see the lock become available and try
to acquire it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced.  As the
number of try-lock attempts increases, it is more likely that a larger
number of threads are competing for the adaptive mutex lock, so the wait
time is increased exponentially.  A random jitter is also added to avoid
synchronized try-lock attempts from other threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore the read-check since it works well on some platforms.
2. Make backoff arch-dependent, and enable it for x86_64.
3. Limit the max backoff to reduce latency in large critical sections.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	128	160
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	128	160
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	128	160
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c                   | 16 +++++++--
 sysdeps/nptl/pthreadP.h                     |  1 +
 sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
 sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
 4 files changed, 89 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
 create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..6e767a8724 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = get_jitter ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, spin count is exponential backoff plus
+		 random jitter, random range is [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Prepare for next loop.  */
+	      exp_backoff = get_next_backoff (exp_backoff);
 	    }
 	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
 		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
index 601db4ff2b..39af275c25 100644
--- a/sysdeps/nptl/pthreadP.h
+++ b/sysdeps/nptl/pthreadP.h
@@ -33,6 +33,7 @@
 #include <kernel-features.h>
 #include <errno.h>
 #include <internal-signals.h>
+#include <pthread_mutex_backoff.h>
 #include "pthread_mutex_conf.h"
 
 
diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..2ccdcaf1e7
--- /dev/null
+++ b/sysdeps/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,35 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+static inline unsigned int
+get_jitter ()
+{
+  /* Arch-dependent random jitter; returning 0 disables randomization.  */
+  return 0;
+}
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Compute the next backoff; returning 1 disables mutex backoff.  */
+  return 1;
+}
+
+#endif
diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..ec74c3d9db
--- /dev/null
+++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,39 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+#include <fast-jitter.h>
+
+static inline unsigned int
+get_jitter (void)
+{
+  return get_fast_jitter ();
+}
+
+#define MAX_BACKOFF 16
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Binary exponential backoff.  Limiting the max backoff
+     can reduce latency in large critical sections.  */
+  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
+}
+
+#endif
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-04  2:58 ` Wangyang Guo
@ 2022-05-04  3:17   ` Wangyang Guo
  2022-05-05  1:56     ` H.J. Lu
  2022-05-06  1:50     ` [PATCH v5] " Wangyang Guo
  0 siblings, 2 replies; 27+ messages in thread
From: Wangyang Guo @ 2022-05-04  3:17 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools, goldstein.w.n, Wangyang Guo

When multiple threads wait for a lock at the same time, once the lock
owner releases it, the waiters all see the lock become available and try
to acquire it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced.  As the
number of try-lock attempts increases, it is more likely that a larger
number of threads are competing for the adaptive mutex lock, so the wait
time is increased exponentially.  A random jitter is also added to avoid
synchronized try-lock attempts from other threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore the read-check since it works well on some platforms.
2. Make backoff arch-dependent, and enable it for x86_64.
3. Limit the max backoff to reduce latency in large critical sections.

v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
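
For illustration, the spin-count schedule produced by the patched loop can be simulated outside glibc (MAX_BACKOFF of 16 matches the x86_64 header below; the fixed jitter value and the `spin_count_at` helper are purely illustrative):

```c
#define MAX_BACKOFF 16

/* Mirror of the patch's x86_64 get_next_backoff: double the backoff
   until it reaches the cap, then hold it there.  */
static int
get_next_backoff (int backoff)
{
  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
}

/* Spin count of iteration I for a fixed jitter value, exactly as the
   patched loop computes it: exponential backoff plus a random jitter
   in [0, exp_backoff - 1].  */
static int
spin_count_at (int i, unsigned int jitter)
{
  int exp_backoff = 1;
  while (i-- > 0)
    exp_backoff = get_next_backoff (exp_backoff);
  return exp_backoff + (int) (jitter & (unsigned int) (exp_backoff - 1));
}
```

With a jitter of 5 (binary 101), the schedule is 1, 3, 5, 13, 21, and then stays at 21 once the exponential term saturates at MAX_BACKOFF; different jitter values shift each step within its power-of-two window, which is what desynchronizes the competing threads.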

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	112	140
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	112	140
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	112	140
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c                   | 16 +++++++--
 sysdeps/nptl/pthreadP.h                     |  1 +
 sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
 sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
 4 files changed, 89 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
 create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..6e767a8724 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = get_jitter ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, spin count is exponential backoff plus
+		 random jitter, random range is [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Prepare for next loop.  */
+	      exp_backoff = get_next_backoff (exp_backoff);
 	    }
 	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
 		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
index 601db4ff2b..39af275c25 100644
--- a/sysdeps/nptl/pthreadP.h
+++ b/sysdeps/nptl/pthreadP.h
@@ -33,6 +33,7 @@
 #include <kernel-features.h>
 #include <errno.h>
 #include <internal-signals.h>
+#include <pthread_mutex_backoff.h>
 #include "pthread_mutex_conf.h"
 
 
diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..5b26c22ac7
--- /dev/null
+++ b/sysdeps/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,35 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+static inline unsigned int
+get_jitter (void)
+{
+  /* Arch-dependent random jitter; returning 0 disables randomization.  */
+  return 0;
+}
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Compute the next backoff; returning 1 disables mutex backoff.  */
+  return 1;
+}
+
+#endif
diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..ec74c3d9db
--- /dev/null
+++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,39 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+#include <fast-jitter.h>
+
+static inline unsigned int
+get_jitter (void)
+{
+  return get_fast_jitter ();
+}
+
+#define MAX_BACKOFF 16
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Binary exponential backoff.  Limiting the max backoff
+     can reduce latency in large critical sections.  */
+  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
+}
+
+#endif
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-04  3:17   ` [PATCH v4] " Wangyang Guo
@ 2022-05-05  1:56     ` H.J. Lu
  2022-05-05  2:52       ` Noah Goldstein
  2022-05-05  2:59       ` Guo, Wangyang
  2022-05-06  1:50     ` [PATCH v5] " Wangyang Guo
  1 sibling, 2 replies; 27+ messages in thread
From: H.J. Lu @ 2022-05-05  1:56 UTC (permalink / raw)
  To: Wangyang Guo; +Cc: libc-alpha, goldstein.w.n

On Tue, May 3, 2022 at 8:17 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>
> When multiple threads wait for a lock at the same time, once the lock
> owner releases it, the waiters all see the lock become available and try
> to acquire it, which may cause an expensive CAS storm.
>
> Binary exponential backoff with random jitter is introduced.  As the
> number of try-lock attempts increases, it is more likely that a larger
> number of threads are competing for the adaptive mutex lock, so the wait
> time is increased exponentially.  A random jitter is also added to avoid
> synchronized try-lock attempts from other threads.
>
> v2: Remove read-check before try-lock for performance.
>
> v3:
> 1. Restore the read-check since it works well on some platforms.
> 2. Make backoff arch-dependent, and enable it for x86_64.
> 3. Limit the max backoff to reduce latency in large critical sections.
>
> v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
>
> Result of pthread-mutex-locks bench
>
> Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> First Row: thread number
> First Col: critical section length
> Values: backoff vs upstream, time based, low is better
>
> non-critical-length: 1
>         1       2       4       8       16      32      64      112     140
> 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
>
> non-critical-length: 32
>         1       2       4       8       16      32      64      112     140
> 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
>
> non-critical-length: 128
>         1       2       4       8       16      32      64      112     140
> 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98

For large critical section lengths, the new code can be up to 39% slower.
Is it true?  Are small critical section lengths more common?

> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>  nptl/pthread_mutex_lock.c                   | 16 +++++++--
>  sysdeps/nptl/pthreadP.h                     |  1 +
>  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
>  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
>  4 files changed, 89 insertions(+), 2 deletions(-)
>  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
>  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index d2e652d151..6e767a8724 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>           int cnt = 0;
>           int max_cnt = MIN (max_adaptive_count (),
>                              mutex->__data.__spins * 2 + 10);
> +         int spin_count, exp_backoff = 1;
> +         unsigned int jitter = get_jitter ();
>           do
>             {
> -             if (cnt++ >= max_cnt)
> +             /* In each loop, spin count is exponential backoff plus
> +                random jitter, random range is [0, exp_backoff-1].  */
> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> +             cnt += spin_count;
> +             if (cnt >= max_cnt)
>                 {
> +                 /* If cnt exceeds max spin count, just go to wait
> +                    queue.  */
>                   LLL_MUTEX_LOCK (mutex);
>                   break;
>                 }
> -             atomic_spin_nop ();
> +             do
> +               atomic_spin_nop ();
> +             while (--spin_count > 0);
> +             /* Prepare for next loop.  */
> +             exp_backoff = get_next_backoff (exp_backoff);
>             }
>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> index 601db4ff2b..39af275c25 100644
> --- a/sysdeps/nptl/pthreadP.h
> +++ b/sysdeps/nptl/pthreadP.h
> @@ -33,6 +33,7 @@
>  #include <kernel-features.h>
>  #include <errno.h>
>  #include <internal-signals.h>
> +#include <pthread_mutex_backoff.h>
>  #include "pthread_mutex_conf.h"
>
>
> diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> new file mode 100644
> index 0000000000..5b26c22ac7
> --- /dev/null
> +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> @@ -0,0 +1,35 @@
> +/* Pthread mutex backoff configuration.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> +
> +static inline unsigned int
> +get_jitter (void)
> +{
> +  /* Arch dependent random jitter, return 0 disables random.  */
> +  return 0;
> +}
> +
> +static inline int
> +get_next_backoff (int backoff)
> +{
> +  /* Next backoff, return 1 disables mutex backoff.  */
> +  return 1;
> +}
> +
> +#endif
> diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> new file mode 100644
> index 0000000000..ec74c3d9db
> --- /dev/null
> +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> @@ -0,0 +1,39 @@
> +/* Pthread mutex backoff configuration.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> +
> +#include <fast-jitter.h>
> +
> +static inline unsigned int
> +get_jitter (void)
> +{
> +  return get_fast_jitter ();
> +}
> +
> +#define MAX_BACKOFF 16
> +
> +static inline int
> +get_next_backoff (int backoff)
> +{
> +  /* Binary exponential backoff. Limiting max backoff
> +     can reduce latency in large critical section.  */
> +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> +}
> +
> +#endif
> --
> 2.35.1
>


-- 
H.J.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-05  1:56     ` H.J. Lu
@ 2022-05-05  2:52       ` Noah Goldstein
  2022-05-05  2:59       ` Guo, Wangyang
  1 sibling, 0 replies; 27+ messages in thread
From: Noah Goldstein @ 2022-05-05  2:52 UTC (permalink / raw)
  To: H.J. Lu; +Cc: Wangyang Guo, GNU C Library

On Wed, May 4, 2022 at 8:57 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, May 3, 2022 at 8:17 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> >
> > When multiple threads wait for a lock at the same time, once the lock
> > owner releases the lock, all waiters see the lock available and try to
> > lock it, which may cause an expensive CAS storm.
> >
> > Binary exponential backoff with random jitter is introduced.  As the
> > number of try-lock attempts increases, it is more likely that many
> > threads are competing for the adaptive mutex, so the wait time grows
> > exponentially.  A random jitter is also added to avoid synchronized
> > try-lock attempts from other threads.
> >
> > v2: Remove read-check before try-lock for performance.
> >
> > v3:
> > 1. Restore read-check since it works well in some platform.
> > 2. Make backoff arch dependent, and enable it for x86_64.
> > 3. Limit max backoff to reduce latency in large critical section.
> >
> > v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
> >
> > Result of pthread-mutex-locks bench
> >
> > Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> > First Row: thread number
> > First Col: critical section length
> > Values: backoff vs upstream, time based, low is better
> >
> > non-critical-length: 1
> >         1       2       4       8       16      32      64      112     140
> > 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> > 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> > 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> > 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> > 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> > 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> > 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> > 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> > 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
> >
> > non-critical-length: 32
> >         1       2       4       8       16      32      64      112     140
> > 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> > 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> > 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> > 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> > 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> > 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> > 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> > 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> > 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
> >
> > non-critical-length: 128
> >         1       2       4       8       16      32      64      112     140
> > 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> > 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> > 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> > 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> > 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> > 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> > 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> > 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> > 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
>
> For large critical section lengths, the new code can be up to 39% slower.
> Is it true?  Are small critical section lengths more common?

Only when a VERY high number of threads is competing for the lock,
which is probably uncommon outside of benchmarks.

>
> > Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> > ---
> >  nptl/pthread_mutex_lock.c                   | 16 +++++++--
> >  sysdeps/nptl/pthreadP.h                     |  1 +
> >  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
> >  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
> >  4 files changed, 89 insertions(+), 2 deletions(-)
> >  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
> >  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> >
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index d2e652d151..6e767a8724 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >           int cnt = 0;
> >           int max_cnt = MIN (max_adaptive_count (),
> >                              mutex->__data.__spins * 2 + 10);
> > +         int spin_count, exp_backoff = 1;
> > +         unsigned int jitter = get_jitter ();
> >           do
> >             {
> > -             if (cnt++ >= max_cnt)
> > +             /* In each loop, spin count is exponential backoff plus
> > +                random jitter, random range is [0, exp_backoff-1].  */
> > +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> > +             cnt += spin_count;
> > +             if (cnt >= max_cnt)
> >                 {
> > +                 /* If cnt exceeds max spin count, just go to wait
> > +                    queue.  */
> >                   LLL_MUTEX_LOCK (mutex);
> >                   break;
> >                 }
> > -             atomic_spin_nop ();
> > +             do
> > +               atomic_spin_nop ();
> > +             while (--spin_count > 0);
> > +             /* Prepare for next loop.  */
> > +             exp_backoff = get_next_backoff (exp_backoff);
> >             }
> >           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> >                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> > index 601db4ff2b..39af275c25 100644
> > --- a/sysdeps/nptl/pthreadP.h
> > +++ b/sysdeps/nptl/pthreadP.h
> > @@ -33,6 +33,7 @@
> >  #include <kernel-features.h>
> >  #include <errno.h>
> >  #include <internal-signals.h>
> > +#include <pthread_mutex_backoff.h>
> >  #include "pthread_mutex_conf.h"
> >
> >
> > diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> > new file mode 100644
> > index 0000000000..5b26c22ac7
> > --- /dev/null
> > +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> > @@ -0,0 +1,35 @@
> > +/* Pthread mutex backoff configuration.
> > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > +
> > +static inline unsigned int
> > +get_jitter (void)
> > +{
> > +  /* Arch dependent random jitter, return 0 disables random.  */
> > +  return 0;
> > +}
> > +
> > +static inline int
> > +get_next_backoff (int backoff)
> > +{
> > +  /* Next backoff, return 1 disables mutex backoff.  */
> > +  return 1;
> > +}
> > +
> > +#endif
> > diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > new file mode 100644
> > index 0000000000..ec74c3d9db
> > --- /dev/null
> > +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > @@ -0,0 +1,39 @@
> > +/* Pthread mutex backoff configuration.
> > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > +
> > +#include <fast-jitter.h>
> > +
> > +static inline unsigned int
> > +get_jitter (void)
> > +{
> > +  return get_fast_jitter ();
> > +}
> > +
> > +#define MAX_BACKOFF 16
> > +
> > +static inline int
> > +get_next_backoff (int backoff)
> > +{
> > +  /* Binary exponential backoff. Limiting max backoff
> > +     can reduce latency in large critical section.  */
> > +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> > +}
> > +
> > +#endif
> > --
> > 2.35.1
> >
>
>
> --
> H.J.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-05  1:56     ` H.J. Lu
  2022-05-05  2:52       ` Noah Goldstein
@ 2022-05-05  2:59       ` Guo, Wangyang
  2022-05-05 22:44         ` H.J. Lu
  1 sibling, 1 reply; 27+ messages in thread
From: Guo, Wangyang @ 2022-05-05  2:59 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

On 5/5/2022 9:56 AM, H.J. Lu via Libc-alpha wrote:
> On Tue, May 3, 2022 at 8:17 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>>
>> When multiple threads wait for a lock at the same time, once the lock
>> owner releases the lock, all waiters see the lock available and try to
>> lock it, which may cause an expensive CAS storm.
>>
>> Binary exponential backoff with random jitter is introduced.  As the
>> number of try-lock attempts increases, it is more likely that many
>> threads are competing for the adaptive mutex, so the wait time grows
>> exponentially.  A random jitter is also added to avoid synchronized
>> try-lock attempts from other threads.
>>
>> v2: Remove read-check before try-lock for performance.
>>
>> v3:
>> 1. Restore read-check since it works well in some platform.
>> 2. Make backoff arch dependent, and enable it for x86_64.
>> 3. Limit max backoff to reduce latency in large critical section.
>>
>> v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
>>
>> Result of pthread-mutex-locks bench
>>
>> Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
>> First Row: thread number
>> First Col: critical section length
>> Values: backoff vs upstream, time based, low is better
>>
>> non-critical-length: 1
>>          1       2       4       8       16      32      64      112     140
>> 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
>> 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
>> 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
>> 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
>> 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
>> 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
>> 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
>> 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
>> 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
>>
>> non-critical-length: 32
>>          1       2       4       8       16      32      64      112     140
>> 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
>> 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
>> 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
>> 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
>> 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
>> 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
>> 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
>> 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
>> 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
>>
>> non-critical-length: 128
>>          1       2       4       8       16      32      64      112     140
>> 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
>> 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
>> 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
>> 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
>> 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
>> 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
>> 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
>> 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
>> 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
> 
> For large critical section lengths, the new code can be up to 39% slower.
> Is it true?  Are small critical section lengths more common?
The adaptive mutex is aimed at "quick" locks.
Small critical sections are more common when users choose to use an
adaptive pthread_mutex.
> 
>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>> ---
>>   nptl/pthread_mutex_lock.c                   | 16 +++++++--
>>   sysdeps/nptl/pthreadP.h                     |  1 +
>>   sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
>>   sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
>>   4 files changed, 89 insertions(+), 2 deletions(-)
>>   create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
>>   create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>>
>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>> index d2e652d151..6e767a8724 100644
>> --- a/nptl/pthread_mutex_lock.c
>> +++ b/nptl/pthread_mutex_lock.c
>> @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>>            int cnt = 0;
>>            int max_cnt = MIN (max_adaptive_count (),
>>                               mutex->__data.__spins * 2 + 10);
>> +         int spin_count, exp_backoff = 1;
>> +         unsigned int jitter = get_jitter ();
>>            do
>>              {
>> -             if (cnt++ >= max_cnt)
>> +             /* In each loop, spin count is exponential backoff plus
>> +                random jitter, random range is [0, exp_backoff-1].  */
>> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
>> +             cnt += spin_count;
>> +             if (cnt >= max_cnt)
>>                  {
>> +                 /* If cnt exceeds max spin count, just go to wait
>> +                    queue.  */
>>                    LLL_MUTEX_LOCK (mutex);
>>                    break;
>>                  }
>> -             atomic_spin_nop ();
>> +             do
>> +               atomic_spin_nop ();
>> +             while (--spin_count > 0);
>> +             /* Prepare for next loop.  */
>> +             exp_backoff = get_next_backoff (exp_backoff);
>>              }
>>            while (LLL_MUTEX_READ_LOCK (mutex) != 0
>>                   || LLL_MUTEX_TRYLOCK (mutex) != 0);
>> diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
>> index 601db4ff2b..39af275c25 100644
>> --- a/sysdeps/nptl/pthreadP.h
>> +++ b/sysdeps/nptl/pthreadP.h
>> @@ -33,6 +33,7 @@
>>   #include <kernel-features.h>
>>   #include <errno.h>
>>   #include <internal-signals.h>
>> +#include <pthread_mutex_backoff.h>
>>   #include "pthread_mutex_conf.h"
>>
>>
>> diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
>> new file mode 100644
>> index 0000000000..5b26c22ac7
>> --- /dev/null
>> +++ b/sysdeps/nptl/pthread_mutex_backoff.h
>> @@ -0,0 +1,35 @@
>> +/* Pthread mutex backoff configuration.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <https://www.gnu.org/licenses/>.  */
>> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
>> +#define _PTHREAD_MUTEX_BACKOFF_H 1
>> +
>> +static inline unsigned int
>> +get_jitter (void)
>> +{
>> +  /* Arch dependent random jitter, return 0 disables random.  */
>> +  return 0;
>> +}
>> +
>> +static inline int
>> +get_next_backoff (int backoff)
>> +{
>> +  /* Next backoff, return 1 disables mutex backoff.  */
>> +  return 1;
>> +}
>> +
>> +#endif
>> diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>> new file mode 100644
>> index 0000000000..ec74c3d9db
>> --- /dev/null
>> +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>> @@ -0,0 +1,39 @@
>> +/* Pthread mutex backoff configuration.
>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>> +   This file is part of the GNU C Library.
>> +
>> +   The GNU C Library is free software; you can redistribute it and/or
>> +   modify it under the terms of the GNU Lesser General Public
>> +   License as published by the Free Software Foundation; either
>> +   version 2.1 of the License, or (at your option) any later version.
>> +
>> +   The GNU C Library is distributed in the hope that it will be useful,
>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> +   Lesser General Public License for more details.
>> +
>> +   You should have received a copy of the GNU Lesser General Public
>> +   License along with the GNU C Library; if not, see
>> +   <https://www.gnu.org/licenses/>.  */
>> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
>> +#define _PTHREAD_MUTEX_BACKOFF_H 1
>> +
>> +#include <fast-jitter.h>
>> +
>> +static inline unsigned int
>> +get_jitter (void)
>> +{
>> +  return get_fast_jitter ();
>> +}
>> +
>> +#define MAX_BACKOFF 16
>> +
>> +static inline int
>> +get_next_backoff (int backoff)
>> +{
>> +  /* Binary exponential backoff. Limiting max backoff
>> +     can reduce latency in large critical section.  */
>> +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
>> +}
>> +
>> +#endif
>> --
>> 2.35.1
>>
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-05  2:59       ` Guo, Wangyang
@ 2022-05-05 22:44         ` H.J. Lu
  2022-05-06  1:52           ` Guo, Wangyang
  0 siblings, 1 reply; 27+ messages in thread
From: H.J. Lu @ 2022-05-05 22:44 UTC (permalink / raw)
  To: Guo, Wangyang; +Cc: libc-alpha

On Wed, May 4, 2022 at 7:59 PM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>
> On 5/5/2022 9:56 AM, H.J. Lu via Libc-alpha wrote:
> > On Tue, May 3, 2022 at 8:17 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> >>
> >> When multiple threads wait for a lock at the same time, once the lock
> >> owner releases the lock, all waiters see the lock available and try to
> >> lock it, which may cause an expensive CAS storm.
> >>
> >> Binary exponential backoff with random jitter is introduced.  As the
> >> number of try-lock attempts increases, it is more likely that many
> >> threads are competing for the adaptive mutex, so the wait time grows
> >> exponentially.  A random jitter is also added to avoid synchronized
> >> try-lock attempts from other threads.
> >>
> >> v2: Remove read-check before try-lock for performance.
> >>
> >> v3:
> >> 1. Restore read-check since it works well in some platform.
> >> 2. Make backoff arch dependent, and enable it for x86_64.
> >> 3. Limit max backoff to reduce latency in large critical section.
> >>
> >> v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
> >>
> >> Result of pthread-mutex-locks bench
> >>
> >> Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> >> First Row: thread number
> >> First Col: critical section length
> >> Values: backoff vs upstream, time based, low is better
> >>
> >> non-critical-length: 1
> >>          1       2       4       8       16      32      64      112     140
> >> 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> >> 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> >> 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> >> 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> >> 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> >> 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> >> 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> >> 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> >> 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
> >>
> >> non-critical-length: 32
> >>          1       2       4       8       16      32      64      112     140
> >> 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> >> 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> >> 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> >> 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> >> 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> >> 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> >> 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> >> 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> >> 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
> >>
> >> non-critical-length: 128
> >>          1       2       4       8       16      32      64      112     140
> >> 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> >> 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> >> 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> >> 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> >> 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> >> 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> >> 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> >> 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> >> 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
> >
> > For large critical section lengths, the new code can be up to 39% slower.
> > Is it true?  Are small critical section lengths more common?
> The adaptive mutex is aimed at "quick" locks.
> Small critical sections are more common when users choose to use an
> adaptive pthread_mutex.

Please update the commit log to document the rationale for this change.

Thanks.


> >
> >> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> >> ---
> >>   nptl/pthread_mutex_lock.c                   | 16 +++++++--
> >>   sysdeps/nptl/pthreadP.h                     |  1 +
> >>   sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
> >>   sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
> >>   4 files changed, 89 insertions(+), 2 deletions(-)
> >>   create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
> >>   create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> >>
> >> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> >> index d2e652d151..6e767a8724 100644
> >> --- a/nptl/pthread_mutex_lock.c
> >> +++ b/nptl/pthread_mutex_lock.c
> >> @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >>            int cnt = 0;
> >>            int max_cnt = MIN (max_adaptive_count (),
> >>                               mutex->__data.__spins * 2 + 10);
> >> +         int spin_count, exp_backoff = 1;
> >> +         unsigned int jitter = get_jitter ();
> >>            do
> >>              {
> >> -             if (cnt++ >= max_cnt)
> >> +             /* In each loop, spin count is exponential backoff plus
> >> +                random jitter, random range is [0, exp_backoff-1].  */
> >> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> >> +             cnt += spin_count;
> >> +             if (cnt >= max_cnt)
> >>                  {
> >> +                 /* If cnt exceeds max spin count, just go to wait
> >> +                    queue.  */
> >>                    LLL_MUTEX_LOCK (mutex);
> >>                    break;
> >>                  }
> >> -             atomic_spin_nop ();
> >> +             do
> >> +               atomic_spin_nop ();
> >> +             while (--spin_count > 0);
> >> +             /* Prepare for next loop.  */
> >> +             exp_backoff = get_next_backoff (exp_backoff);
> >>              }
> >>            while (LLL_MUTEX_READ_LOCK (mutex) != 0
> >>                   || LLL_MUTEX_TRYLOCK (mutex) != 0);
> >> diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> >> index 601db4ff2b..39af275c25 100644
> >> --- a/sysdeps/nptl/pthreadP.h
> >> +++ b/sysdeps/nptl/pthreadP.h
> >> @@ -33,6 +33,7 @@
> >>   #include <kernel-features.h>
> >>   #include <errno.h>
> >>   #include <internal-signals.h>
> >> +#include <pthread_mutex_backoff.h>
> >>   #include "pthread_mutex_conf.h"
> >>
> >>
> >> diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> >> new file mode 100644
> >> index 0000000000..5b26c22ac7
> >> --- /dev/null
> >> +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> >> @@ -0,0 +1,35 @@
> >> +/* Pthread mutex backoff configuration.
> >> +   Copyright (C) 2022 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <https://www.gnu.org/licenses/>.  */
> >> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> >> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> >> +
> >> +static inline unsigned int
> >> +get_jitter (void)
> >> +{
> >> +  /* Arch dependent random jitter, return 0 disables random.  */
> >> +  return 0;
> >> +}
> >> +
> >> +static inline int
> >> +get_next_backoff (int backoff)
> >> +{
> >> +  /* Next backoff, return 1 disables mutex backoff.  */
> >> +  return 1;
> >> +}
> >> +
> >> +#endif
> >> diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> >> new file mode 100644
> >> index 0000000000..ec74c3d9db
> >> --- /dev/null
> >> +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> >> @@ -0,0 +1,39 @@
> >> +/* Pthread mutex backoff configuration.
> >> +   Copyright (C) 2022 Free Software Foundation, Inc.
> >> +   This file is part of the GNU C Library.
> >> +
> >> +   The GNU C Library is free software; you can redistribute it and/or
> >> +   modify it under the terms of the GNU Lesser General Public
> >> +   License as published by the Free Software Foundation; either
> >> +   version 2.1 of the License, or (at your option) any later version.
> >> +
> >> +   The GNU C Library is distributed in the hope that it will be useful,
> >> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> +   Lesser General Public License for more details.
> >> +
> >> +   You should have received a copy of the GNU Lesser General Public
> >> +   License along with the GNU C Library; if not, see
> >> +   <https://www.gnu.org/licenses/>.  */
> >> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> >> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> >> +
> >> +#include <fast-jitter.h>
> >> +
> >> +static inline unsigned int
> >> +get_jitter (void)
> >> +{
> >> +  return get_fast_jitter ();
> >> +}
> >> +
> >> +#define MAX_BACKOFF 16
> >> +
> >> +static inline int
> >> +get_next_backoff (int backoff)
> >> +{
> >> +  /* Binary exponential backoff.  Limiting the max backoff
> >> +     can reduce latency in large critical sections.  */
> >> +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> >> +}
> >> +
> >> +#endif
> >> --
> >> 2.35.1
> >>
> >
> >
>


--
H.J.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v5] nptl: Add backoff mechanism to spinlock loop
  2022-05-04  3:17   ` [PATCH v4] " Wangyang Guo
  2022-05-05  1:56     ` H.J. Lu
@ 2022-05-06  1:50     ` Wangyang Guo
  2022-05-06  3:06       ` H.J. Lu
  1 sibling, 1 reply; 27+ messages in thread
From: Wangyang Guo @ 2022-05-06  1:50 UTC (permalink / raw)
  To: libc-alpha; +Cc: hjl.tools, goldstein.w.n, Wangyang Guo

When multiple threads wait for a lock at the same time, all waiters see
the lock become available as soon as the owner releases it, and all try
to take it at once, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced.  As the
number of try-lock attempts increases, it is more likely that a larger
number of threads are competing for the adaptive mutex lock, so the
wait time grows exponentially.  Random jitter is also added to avoid
synchronized try-lock attempts from other threads.

v2: Remove read-check before try-lock for performance.

v3:
1. Restore read-check since it works well in some platform.
2. Make backoff arch dependent, and enable it for x86_64.
3. Limit max backoff to reduce latency in large critical section.

v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h

v5: Commit log updated for regression in large critical section.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	112	140
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	112	140
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	112	140
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

There is a regression for large critical sections, but the adaptive
mutex is aimed at "quick" locks.  Small critical sections are more
common when users choose to use the adaptive pthread_mutex.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
---
 nptl/pthread_mutex_lock.c                   | 16 +++++++--
 sysdeps/nptl/pthreadP.h                     |  1 +
 sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
 sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
 4 files changed, 89 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
 create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d2e652d151..6e767a8724 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 	  int cnt = 0;
 	  int max_cnt = MIN (max_adaptive_count (),
 			     mutex->__data.__spins * 2 + 10);
+	  int spin_count, exp_backoff = 1;
+	  unsigned int jitter = get_jitter ();
 	  do
 	    {
-	      if (cnt++ >= max_cnt)
+	      /* In each loop, the spin count is the exponential backoff
+		 plus a random jitter in the range [0, exp_backoff-1].  */
+	      spin_count = exp_backoff + (jitter & (exp_backoff - 1));
+	      cnt += spin_count;
+	      if (cnt >= max_cnt)
 		{
+		  /* If cnt exceeds max spin count, just go to wait
+		     queue.  */
 		  LLL_MUTEX_LOCK (mutex);
 		  break;
 		}
-	      atomic_spin_nop ();
+	      do
+		atomic_spin_nop ();
+	      while (--spin_count > 0);
+	      /* Prepare for next loop.  */
+	      exp_backoff = get_next_backoff (exp_backoff);
 	    }
 	  while (LLL_MUTEX_READ_LOCK (mutex) != 0
 		 || LLL_MUTEX_TRYLOCK (mutex) != 0);
diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
index 601db4ff2b..39af275c25 100644
--- a/sysdeps/nptl/pthreadP.h
+++ b/sysdeps/nptl/pthreadP.h
@@ -33,6 +33,7 @@
 #include <kernel-features.h>
 #include <errno.h>
 #include <internal-signals.h>
+#include <pthread_mutex_backoff.h>
 #include "pthread_mutex_conf.h"
 
 
diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..5b26c22ac7
--- /dev/null
+++ b/sysdeps/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,35 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+static inline unsigned int
+get_jitter (void)
+{
+  /* Arch dependent random jitter, return 0 disables random.  */
+  return 0;
+}
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Next backoff, return 1 disables mutex backoff.  */
+  return 1;
+}
+
+#endif
diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
new file mode 100644
index 0000000000..ec74c3d9db
--- /dev/null
+++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
@@ -0,0 +1,39 @@
+/* Pthread mutex backoff configuration.
+   Copyright (C) 2022 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <https://www.gnu.org/licenses/>.  */
+#ifndef _PTHREAD_MUTEX_BACKOFF_H
+#define _PTHREAD_MUTEX_BACKOFF_H 1
+
+#include <fast-jitter.h>
+
+static inline unsigned int
+get_jitter (void)
+{
+  return get_fast_jitter ();
+}
+
+#define MAX_BACKOFF 16
+
+static inline int
+get_next_backoff (int backoff)
+{
+  /* Binary exponential backoff.  Limiting the max backoff
+     can reduce latency in large critical sections.  */
+  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
+}
+
+#endif
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] nptl: Add backoff mechanism to spinlock loop
  2022-05-05 22:44         ` H.J. Lu
@ 2022-05-06  1:52           ` Guo, Wangyang
  0 siblings, 0 replies; 27+ messages in thread
From: Guo, Wangyang @ 2022-05-06  1:52 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha

On 5/6/2022 6:44 AM, H.J. Lu via Libc-alpha wrote:
> On Wed, May 4, 2022 at 7:59 PM Guo, Wangyang <wangyang.guo@intel.com> wrote:
>>
>> On 5/5/2022 9:56 AM, H.J. Lu via Libc-alpha wrote:
>>> On Tue, May 3, 2022 at 8:17 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>>>>
> >>>> When multiple threads wait for a lock at the same time, all waiters see
> >>>> the lock become available as soon as the owner releases it, and all try
> >>>> to take it at once, which may cause an expensive CAS storm.
> >>>>
> >>>> Binary exponential backoff with random jitter is introduced.  As the
> >>>> number of try-lock attempts increases, it is more likely that a larger
> >>>> number of threads are competing for the adaptive mutex lock, so the
> >>>> wait time grows exponentially.  Random jitter is also added to avoid
> >>>> synchronized try-lock attempts from other threads.
>>>>
>>>> v2: Remove read-check before try-lock for performance.
>>>>
>>>> v3:
>>>> 1. Restore read-check since it works well in some platform.
>>>> 2. Make backoff arch dependent, and enable it for x86_64.
>>>> 3. Limit max backoff to reduce latency in large critical section.
>>>>
>>>> v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
>>>>
>>>> Result of pthread-mutex-locks bench
>>>>
>>>> Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
>>>> First Row: thread number
>>>> First Col: critical section length
>>>> Values: backoff vs upstream, time based, low is better
>>>>
>>>> non-critical-length: 1
>>>>           1       2       4       8       16      32      64      112     140
>>>> 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
>>>> 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
>>>> 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
>>>> 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
>>>> 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
>>>> 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
>>>> 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
>>>> 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
>>>> 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
>>>>
>>>> non-critical-length: 32
>>>>           1       2       4       8       16      32      64      112     140
>>>> 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
>>>> 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
>>>> 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
>>>> 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
>>>> 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
>>>> 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
>>>> 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
>>>> 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
>>>> 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
>>>>
>>>> non-critical-length: 128
>>>>           1       2       4       8       16      32      64      112     140
>>>> 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
>>>> 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
>>>> 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
>>>> 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
>>>> 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
>>>> 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
>>>> 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
>>>> 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
>>>> 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
>>>
>>> For large critical section lengths, the new code can be up to 39% slower.
>>> Is it true?  Are small critical section lengths more common?
>> The adaptive mutex is aimed at "quick" locks.
>> Small critical sections are more common when users choose to use the
>> adaptive pthread_mutex.
> 
> Please update the commit log to document the rationale for this change.
> 
> Thanks.

Done, updated in v5 patch.

> 
> 
>>>
>>>> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
>>>> ---
>>>>    nptl/pthread_mutex_lock.c                   | 16 +++++++--
>>>>    sysdeps/nptl/pthreadP.h                     |  1 +
>>>>    sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
>>>>    sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
>>>>    4 files changed, 89 insertions(+), 2 deletions(-)
>>>>    create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
>>>>    create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>>>>
>>>> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
>>>> index d2e652d151..6e767a8724 100644
>>>> --- a/nptl/pthread_mutex_lock.c
>>>> +++ b/nptl/pthread_mutex_lock.c
>>>> @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>>>>             int cnt = 0;
>>>>             int max_cnt = MIN (max_adaptive_count (),
>>>>                                mutex->__data.__spins * 2 + 10);
>>>> +         int spin_count, exp_backoff = 1;
>>>> +         unsigned int jitter = get_jitter ();
>>>>             do
>>>>               {
>>>> -             if (cnt++ >= max_cnt)
>>>> +             /* In each loop, spin count is exponential backoff plus
>>>> +                random jitter, random range is [0, exp_backoff-1].  */
>>>> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
>>>> +             cnt += spin_count;
>>>> +             if (cnt >= max_cnt)
>>>>                   {
>>>> +                 /* If cnt exceeds max spin count, just go to wait
>>>> +                    queue.  */
>>>>                     LLL_MUTEX_LOCK (mutex);
>>>>                     break;
>>>>                   }
>>>> -             atomic_spin_nop ();
>>>> +             do
>>>> +               atomic_spin_nop ();
>>>> +             while (--spin_count > 0);
>>>> +             /* Prepare for next loop.  */
>>>> +             exp_backoff = get_next_backoff (exp_backoff);
>>>>               }
>>>>             while (LLL_MUTEX_READ_LOCK (mutex) != 0
>>>>                    || LLL_MUTEX_TRYLOCK (mutex) != 0);
>>>> diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
>>>> index 601db4ff2b..39af275c25 100644
>>>> --- a/sysdeps/nptl/pthreadP.h
>>>> +++ b/sysdeps/nptl/pthreadP.h
>>>> @@ -33,6 +33,7 @@
>>>>    #include <kernel-features.h>
>>>>    #include <errno.h>
>>>>    #include <internal-signals.h>
>>>> +#include <pthread_mutex_backoff.h>
>>>>    #include "pthread_mutex_conf.h"
>>>>
>>>>
>>>> diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
>>>> new file mode 100644
>>>> index 0000000000..5b26c22ac7
>>>> --- /dev/null
>>>> +++ b/sysdeps/nptl/pthread_mutex_backoff.h
>>>> @@ -0,0 +1,35 @@
>>>> +/* Pthread mutex backoff configuration.
>>>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>>>> +   This file is part of the GNU C Library.
>>>> +
>>>> +   The GNU C Library is free software; you can redistribute it and/or
>>>> +   modify it under the terms of the GNU Lesser General Public
>>>> +   License as published by the Free Software Foundation; either
>>>> +   version 2.1 of the License, or (at your option) any later version.
>>>> +
>>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> +   Lesser General Public License for more details.
>>>> +
>>>> +   You should have received a copy of the GNU Lesser General Public
>>>> +   License along with the GNU C Library; if not, see
>>>> +   <https://www.gnu.org/licenses/>.  */
>>>> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
>>>> +#define _PTHREAD_MUTEX_BACKOFF_H 1
>>>> +
>>>> +static inline unsigned int
>>>> +get_jitter (void)
>>>> +{
>>>> +  /* Arch dependent random jitter, return 0 disables random.  */
>>>> +  return 0;
>>>> +}
>>>> +
>>>> +static inline int
>>>> +get_next_backoff (int backoff)
>>>> +{
>>>> +  /* Next backoff, return 1 disables mutex backoff.  */
>>>> +  return 1;
>>>> +}
>>>> +
>>>> +#endif
>>>> diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>>>> new file mode 100644
>>>> index 0000000000..ec74c3d9db
>>>> --- /dev/null
>>>> +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>>>> @@ -0,0 +1,39 @@
>>>> +/* Pthread mutex backoff configuration.
>>>> +   Copyright (C) 2022 Free Software Foundation, Inc.
>>>> +   This file is part of the GNU C Library.
>>>> +
>>>> +   The GNU C Library is free software; you can redistribute it and/or
>>>> +   modify it under the terms of the GNU Lesser General Public
>>>> +   License as published by the Free Software Foundation; either
>>>> +   version 2.1 of the License, or (at your option) any later version.
>>>> +
>>>> +   The GNU C Library is distributed in the hope that it will be useful,
>>>> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> +   Lesser General Public License for more details.
>>>> +
>>>> +   You should have received a copy of the GNU Lesser General Public
>>>> +   License along with the GNU C Library; if not, see
>>>> +   <https://www.gnu.org/licenses/>.  */
>>>> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
>>>> +#define _PTHREAD_MUTEX_BACKOFF_H 1
>>>> +
>>>> +#include <fast-jitter.h>
>>>> +
>>>> +static inline unsigned int
>>>> +get_jitter (void)
>>>> +{
>>>> +  return get_fast_jitter ();
>>>> +}
>>>> +
>>>> +#define MAX_BACKOFF 16
>>>> +
>>>> +static inline int
>>>> +get_next_backoff (int backoff)
>>>> +{
> >>>> +  /* Binary exponential backoff.  Limiting the max backoff
> >>>> +     can reduce latency in large critical sections.  */
>>>> +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
>>>> +}
>>>> +
>>>> +#endif
>>>> --
>>>> 2.35.1
>>>>
>>>
>>>
>>
> 
> 
> --
> H.J.
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5] nptl: Add backoff mechanism to spinlock loop
  2022-05-06  1:50     ` [PATCH v5] " Wangyang Guo
@ 2022-05-06  3:06       ` H.J. Lu
  2022-09-11 20:29         ` Sunil Pandey
  0 siblings, 1 reply; 27+ messages in thread
From: H.J. Lu @ 2022-05-06  3:06 UTC (permalink / raw)
  To: Wangyang Guo; +Cc: GNU C Library, Noah Goldstein

On Thu, May 5, 2022 at 6:50 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
>
> When multiple threads wait for a lock at the same time, all waiters see
> the lock become available as soon as the owner releases it, and all try
> to take it at once, which may cause an expensive CAS storm.
>
> Binary exponential backoff with random jitter is introduced.  As the
> number of try-lock attempts increases, it is more likely that a larger
> number of threads are competing for the adaptive mutex lock, so the
> wait time grows exponentially.  Random jitter is also added to avoid
> synchronized try-lock attempts from other threads.
>
> v2: Remove read-check before try-lock for performance.
>
> v3:
> 1. Restore read-check since it works well in some platform.
> 2. Make backoff arch dependent, and enable it for x86_64.
> 3. Limit max backoff to reduce latency in large critical section.
>
> v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
>
> v5: Commit log updated for regression in large critical section.
>
> Result of pthread-mutex-locks bench
>
> Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> First Row: thread number
> First Col: critical section length
> Values: backoff vs upstream, time based, low is better
>
> non-critical-length: 1
>         1       2       4       8       16      32      64      112     140
> 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
>
> non-critical-length: 32
>         1       2       4       8       16      32      64      112     140
> 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
>
> non-critical-length: 128
>         1       2       4       8       16      32      64      112     140
> 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
>
> There is a regression for large critical sections, but the adaptive
> mutex is aimed at "quick" locks.  Small critical sections are more
> common when users choose to use the adaptive pthread_mutex.
>
> Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> ---
>  nptl/pthread_mutex_lock.c                   | 16 +++++++--
>  sysdeps/nptl/pthreadP.h                     |  1 +
>  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
>  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
>  4 files changed, 89 insertions(+), 2 deletions(-)
>  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
>  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index d2e652d151..6e767a8724 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>           int cnt = 0;
>           int max_cnt = MIN (max_adaptive_count (),
>                              mutex->__data.__spins * 2 + 10);
> +         int spin_count, exp_backoff = 1;
> +         unsigned int jitter = get_jitter ();
>           do
>             {
> -             if (cnt++ >= max_cnt)
> +             /* In each loop, spin count is exponential backoff plus
> +                random jitter, random range is [0, exp_backoff-1].  */
> +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> +             cnt += spin_count;
> +             if (cnt >= max_cnt)
>                 {
> +                 /* If cnt exceeds max spin count, just go to wait
> +                    queue.  */
>                   LLL_MUTEX_LOCK (mutex);
>                   break;
>                 }
> -             atomic_spin_nop ();
> +             do
> +               atomic_spin_nop ();
> +             while (--spin_count > 0);
> +             /* Prepare for next loop.  */
> +             exp_backoff = get_next_backoff (exp_backoff);
>             }
>           while (LLL_MUTEX_READ_LOCK (mutex) != 0
>                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> index 601db4ff2b..39af275c25 100644
> --- a/sysdeps/nptl/pthreadP.h
> +++ b/sysdeps/nptl/pthreadP.h
> @@ -33,6 +33,7 @@
>  #include <kernel-features.h>
>  #include <errno.h>
>  #include <internal-signals.h>
> +#include <pthread_mutex_backoff.h>
>  #include "pthread_mutex_conf.h"
>
>
> diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> new file mode 100644
> index 0000000000..5b26c22ac7
> --- /dev/null
> +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> @@ -0,0 +1,35 @@
> +/* Pthread mutex backoff configuration.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> +
> +static inline unsigned int
> +get_jitter (void)
> +{
> +  /* Arch-dependent random jitter; returning 0 disables randomness.  */
> +  return 0;
> +}
> +
> +static inline int
> +get_next_backoff (int backoff)
> +{
> +  /* Next backoff; returning 1 disables mutex backoff.  */
> +  return 1;
> +}
> +
> +#endif
> diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> new file mode 100644
> index 0000000000..ec74c3d9db
> --- /dev/null
> +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> @@ -0,0 +1,39 @@
> +/* Pthread mutex backoff configuration.
> +   Copyright (C) 2022 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> +#define _PTHREAD_MUTEX_BACKOFF_H 1
> +
> +#include <fast-jitter.h>
> +
> +static inline unsigned int
> +get_jitter (void)
> +{
> +  return get_fast_jitter ();
> +}
> +
> +#define MAX_BACKOFF 16
> +
> +static inline int
> +get_next_backoff (int backoff)
> +{
> +  /* Binary exponential backoff.  Limiting the max backoff
> +     can reduce latency in large critical sections.  */
> +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> +}
> +
> +#endif
> --
> 2.35.1
>

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Please wait until next Monday for more comments.

Thanks.

-- 
H.J.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5] nptl: Add backoff mechanism to spinlock loop
  2022-05-06  3:06       ` H.J. Lu
@ 2022-09-11 20:29         ` Sunil Pandey
  2022-09-14  1:26           ` Noah Goldstein
  2022-09-29  0:12           ` Noah Goldstein
  0 siblings, 2 replies; 27+ messages in thread
From: Sunil Pandey @ 2022-09-11 20:29 UTC (permalink / raw)
  To: H.J. Lu, Libc-stable Mailing List, Florian Weimer
  Cc: Wangyang Guo, GNU C Library

On Thu, May 5, 2022 at 8:07 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On Thu, May 5, 2022 at 6:50 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> >
> > When multiple threads are waiting for a lock at the same time, once the
> > lock owner releases the lock, all waiters will see the lock as available
> > and try to acquire it, which may cause an expensive CAS storm.
> >
> > Binary exponential backoff with random jitter is introduced.  As
> > try-lock attempts increase, it becomes more likely that a larger number
> > of threads are competing for the adaptive mutex lock, so the wait time
> > is increased exponentially.  A random jitter is also added to avoid
> > synchronized try-lock attempts from other threads.
> >
> > v2: Remove read-check before try-lock for performance.
> >
> > v3:
> > 1. Restore the read-check since it works well on some platforms.
> > 2. Make the backoff arch-dependent, and enable it for x86_64.
> > 3. Limit the max backoff to reduce latency in large critical sections.
> >
> > v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
> >
> > v5: Commit log updated regarding the regression in large critical sections.
> >
> > Result of pthread-mutex-locks bench
> >
> > Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> > First Row: thread number
> > First Col: critical section length
> > Values: backoff vs upstream, time based, low is better
> >
> > non-critical-length: 1
> >         1       2       4       8       16      32      64      112     140
> > 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> > 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> > 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> > 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> > 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> > 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> > 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> > 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> > 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
> >
> > non-critical-length: 32
> >         1       2       4       8       16      32      64      112     140
> > 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> > 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> > 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> > 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> > 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> > 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> > 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> > 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> > 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
> >
> > non-critical-length: 128
> >         1       2       4       8       16      32      64      112     140
> > 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> > 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> > 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> > 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> > 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> > 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> > 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> > 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> > 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
> >
> > There is a regression for large critical sections, but the adaptive
> > mutex is aimed at "quick" locks.  Small critical sections are more
> > common when users choose to use an adaptive pthread_mutex.
> >
> > Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> > ---
> >  nptl/pthread_mutex_lock.c                   | 16 +++++++--
> >  sysdeps/nptl/pthreadP.h                     |  1 +
> >  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
> >  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
> >  4 files changed, 89 insertions(+), 2 deletions(-)
> >  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
> >  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> >
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index d2e652d151..6e767a8724 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >           int cnt = 0;
> >           int max_cnt = MIN (max_adaptive_count (),
> >                              mutex->__data.__spins * 2 + 10);
> > +         int spin_count, exp_backoff = 1;
> > +         unsigned int jitter = get_jitter ();
> >           do
> >             {
> > -             if (cnt++ >= max_cnt)
> > +             /* In each loop, the spin count is the exponential
> > +                backoff plus a random jitter in [0, exp_backoff-1].  */
> > +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> > +             cnt += spin_count;
> > +             if (cnt >= max_cnt)
> >                 {
> > +                 /* If cnt exceeds the max spin count, just go to
> > +                    the wait queue.  */
> >                   LLL_MUTEX_LOCK (mutex);
> >                   break;
> >                 }
> > -             atomic_spin_nop ();
> > +             do
> > +               atomic_spin_nop ();
> > +             while (--spin_count > 0);
> > +             /* Prepare for next loop.  */
> > +             exp_backoff = get_next_backoff (exp_backoff);
> >             }
> >           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> >                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> > index 601db4ff2b..39af275c25 100644
> > --- a/sysdeps/nptl/pthreadP.h
> > +++ b/sysdeps/nptl/pthreadP.h
> > @@ -33,6 +33,7 @@
> >  #include <kernel-features.h>
> >  #include <errno.h>
> >  #include <internal-signals.h>
> > +#include <pthread_mutex_backoff.h>
> >  #include "pthread_mutex_conf.h"
> >
> >
> > diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> > new file mode 100644
> > index 0000000000..5b26c22ac7
> > --- /dev/null
> > +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> > @@ -0,0 +1,35 @@
> > +/* Pthread mutex backoff configuration.
> > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > +
> > +static inline unsigned int
> > +get_jitter (void)
> > +{
> > +  /* Arch-dependent random jitter; returning 0 disables randomness.  */
> > +  return 0;
> > +}
> > +
> > +static inline int
> > +get_next_backoff (int backoff)
> > +{
> > +  /* Next backoff; returning 1 disables mutex backoff.  */
> > +  return 1;
> > +}
> > +
> > +#endif
> > diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > new file mode 100644
> > index 0000000000..ec74c3d9db
> > --- /dev/null
> > +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > @@ -0,0 +1,39 @@
> > +/* Pthread mutex backoff configuration.
> > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <https://www.gnu.org/licenses/>.  */
> > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > +
> > +#include <fast-jitter.h>
> > +
> > +static inline unsigned int
> > +get_jitter (void)
> > +{
> > +  return get_fast_jitter ();
> > +}
> > +
> > +#define MAX_BACKOFF 16
> > +
> > +static inline int
> > +get_next_backoff (int backoff)
> > +{
> > +  /* Binary exponential backoff.  Limiting the max backoff
> > +     can reduce latency in large critical sections.  */
> > +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> > +}
> > +
> > +#endif
> > --
> > 2.35.1
> >
>
> LGTM.
>
> Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
>
> Please wait until next Monday for more comments.
>
> Thanks.
>
> --
> H.J.

I would like to backport this patch to release branches 2.33, 2.34, and 2.35.

Are there any comments, suggestions, or objections to this?

commit 8162147872491bb5b48e91543b19c49a29ae6b6d
Author: Wangyang Guo <wangyang.guo@intel.com>
Date:   Fri May 6 01:50:10 2022 +0000

    nptl: Add backoff mechanism to spinlock loop

    When multiple threads are waiting for a lock at the same time, once the
    lock owner releases the lock, all waiters will see the lock as available
    and try to acquire it, which may cause an expensive CAS storm.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5] nptl: Add backoff mechanism to spinlock loop
  2022-09-11 20:29         ` Sunil Pandey
@ 2022-09-14  1:26           ` Noah Goldstein
  2022-09-29  0:12           ` Noah Goldstein
  1 sibling, 0 replies; 27+ messages in thread
From: Noah Goldstein @ 2022-09-14  1:26 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: H.J. Lu, Libc-stable Mailing List, Florian Weimer, Wangyang Guo,
	GNU C Library

On Sun, Sep 11, 2022 at 1:30 PM Sunil Pandey via Libc-stable
<libc-stable@sourceware.org> wrote:
>
> On Thu, May 5, 2022 at 8:07 PM H.J. Lu via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On Thu, May 5, 2022 at 6:50 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> > >
> > > When multiple threads are waiting for a lock at the same time, once the
> > > lock owner releases the lock, all waiters will see the lock as available
> > > and try to acquire it, which may cause an expensive CAS storm.
> > >
> > > Binary exponential backoff with random jitter is introduced.  As
> > > try-lock attempts increase, it becomes more likely that a larger number
> > > of threads are competing for the adaptive mutex lock, so the wait time
> > > is increased exponentially.  A random jitter is also added to avoid
> > > synchronized try-lock attempts from other threads.
> > >
> > > v2: Remove read-check before try-lock for performance.
> > >
> > > v3:
> > > 1. Restore the read-check since it works well on some platforms.
> > > 2. Make the backoff arch-dependent, and enable it for x86_64.
> > > 3. Limit the max backoff to reduce latency in large critical sections.
> > >
> > > v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
> > >
> > > v5: Commit log updated regarding the regression in large critical sections.
> > >
> > > Result of pthread-mutex-locks bench
> > >
> > > Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> > > First Row: thread number
> > > First Col: critical section length
> > > Values: backoff vs upstream, time based, low is better
> > >
> > > non-critical-length: 1
> > >         1       2       4       8       16      32      64      112     140
> > > 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> > > 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> > > 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> > > 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> > > 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> > > 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> > > 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> > > 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> > > 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
> > >
> > > non-critical-length: 32
> > >         1       2       4       8       16      32      64      112     140
> > > 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> > > 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> > > 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> > > 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> > > 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> > > 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> > > 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> > > 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> > > 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
> > >
> > > non-critical-length: 128
> > >         1       2       4       8       16      32      64      112     140
> > > 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> > > 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> > > 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> > > 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> > > 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> > > 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> > > 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> > > 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> > > 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
> > >
> > > There is a regression for large critical sections, but the adaptive
> > > mutex is aimed at "quick" locks.  Small critical sections are more
> > > common when users choose to use an adaptive pthread_mutex.
> > >
> > > Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> > > ---
> > >  nptl/pthread_mutex_lock.c                   | 16 +++++++--
> > >  sysdeps/nptl/pthreadP.h                     |  1 +
> > >  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
> > >  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
> > >  4 files changed, 89 insertions(+), 2 deletions(-)
> > >  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
> > >  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > >
> > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > index d2e652d151..6e767a8724 100644
> > > --- a/nptl/pthread_mutex_lock.c
> > > +++ b/nptl/pthread_mutex_lock.c
> > > @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > >           int cnt = 0;
> > >           int max_cnt = MIN (max_adaptive_count (),
> > >                              mutex->__data.__spins * 2 + 10);
> > > +         int spin_count, exp_backoff = 1;
> > > +         unsigned int jitter = get_jitter ();
> > >           do
> > >             {
> > > -             if (cnt++ >= max_cnt)
> > > +             /* In each loop, the spin count is the exponential
> > > +                backoff plus a random jitter in [0, exp_backoff-1].  */
> > > +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> > > +             cnt += spin_count;
> > > +             if (cnt >= max_cnt)
> > >                 {
> > > +                 /* If cnt exceeds the max spin count, just go to
> > > +                    the wait queue.  */
> > >                   LLL_MUTEX_LOCK (mutex);
> > >                   break;
> > >                 }
> > > -             atomic_spin_nop ();
> > > +             do
> > > +               atomic_spin_nop ();
> > > +             while (--spin_count > 0);
> > > +             /* Prepare for next loop.  */
> > > +             exp_backoff = get_next_backoff (exp_backoff);
> > >             }
> > >           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > >                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > > diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> > > index 601db4ff2b..39af275c25 100644
> > > --- a/sysdeps/nptl/pthreadP.h
> > > +++ b/sysdeps/nptl/pthreadP.h
> > > @@ -33,6 +33,7 @@
> > >  #include <kernel-features.h>
> > >  #include <errno.h>
> > >  #include <internal-signals.h>
> > > +#include <pthread_mutex_backoff.h>
> > >  #include "pthread_mutex_conf.h"
> > >
> > >
> > > diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> > > new file mode 100644
> > > index 0000000000..5b26c22ac7
> > > --- /dev/null
> > > +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> > > @@ -0,0 +1,35 @@
> > > +/* Pthread mutex backoff configuration.
> > > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library; if not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > > +
> > > +static inline unsigned int
> > > +get_jitter (void)
> > > +{
> > > +  /* Arch-dependent random jitter; returning 0 disables randomness.  */
> > > +  return 0;
> > > +}
> > > +
> > > +static inline int
> > > +get_next_backoff (int backoff)
> > > +{
> > > +  /* Next backoff; returning 1 disables mutex backoff.  */
> > > +  return 1;
> > > +}
> > > +
> > > +#endif
> > > diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > > new file mode 100644
> > > index 0000000000..ec74c3d9db
> > > --- /dev/null
> > > +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > > @@ -0,0 +1,39 @@
> > > +/* Pthread mutex backoff configuration.
> > > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library; if not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > > +
> > > +#include <fast-jitter.h>
> > > +
> > > +static inline unsigned int
> > > +get_jitter (void)
> > > +{
> > > +  return get_fast_jitter ();
> > > +}
> > > +
> > > +#define MAX_BACKOFF 16
> > > +
> > > +static inline int
> > > +get_next_backoff (int backoff)
> > > +{
> > > +  /* Binary exponential backoff.  Limiting the max backoff
> > > +     can reduce latency in large critical sections.  */
> > > +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> > > +}
> > > +
> > > +#endif
> > > --
> > > 2.35.1
> > >
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Please wait until next Monday for more comments.
> >
> > Thanks.
> >
> > --
> > H.J.
>
> I would like to backport this patch to release branches 2.33, 2.34, and 2.35.
>
> Are there any comments, suggestions, or objections to this?

Fine by me.
>
> commit 8162147872491bb5b48e91543b19c49a29ae6b6d
> Author: Wangyang Guo <wangyang.guo@intel.com>
> Date:   Fri May 6 01:50:10 2022 +0000
>
>     nptl: Add backoff mechanism to spinlock loop
>
>     When multiple threads are waiting for a lock at the same time, once the
>     lock owner releases the lock, all waiters will see the lock as available
>     and try to acquire it, which may cause an expensive CAS storm.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v5] nptl: Add backoff mechanism to spinlock loop
  2022-09-11 20:29         ` Sunil Pandey
  2022-09-14  1:26           ` Noah Goldstein
@ 2022-09-29  0:12           ` Noah Goldstein
  1 sibling, 0 replies; 27+ messages in thread
From: Noah Goldstein @ 2022-09-29  0:12 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: H.J. Lu, Libc-stable Mailing List, Florian Weimer, Wangyang Guo,
	GNU C Library

On Sun, Sep 11, 2022 at 4:30 PM Sunil Pandey via Libc-stable
<libc-stable@sourceware.org> wrote:
>
> On Thu, May 5, 2022 at 8:07 PM H.J. Lu via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On Thu, May 5, 2022 at 6:50 PM Wangyang Guo <wangyang.guo@intel.com> wrote:
> > >
> > > When multiple threads are waiting for a lock at the same time, once the
> > > lock owner releases the lock, all waiters will see the lock as available
> > > and try to acquire it, which may cause an expensive CAS storm.
> > >
> > > Binary exponential backoff with random jitter is introduced.  As
> > > try-lock attempts increase, it becomes more likely that a larger number
> > > of threads are competing for the adaptive mutex lock, so the wait time
> > > is increased exponentially.  A random jitter is also added to avoid
> > > synchronized try-lock attempts from other threads.
> > >
> > > v2: Remove read-check before try-lock for performance.
> > >
> > > v3:
> > > 1. Restore the read-check since it works well on some platforms.
> > > 2. Make the backoff arch-dependent, and enable it for x86_64.
> > > 3. Limit the max backoff to reduce latency in large critical sections.
> > >
> > > v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h
> > >
> > > v5: Commit log updated regarding the regression in large critical sections.
> > >
> > > Result of pthread-mutex-locks bench
> > >
> > > Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
> > > First Row: thread number
> > > First Col: critical section length
> > > Values: backoff vs upstream, time based, low is better
> > >
> > > non-critical-length: 1
> > >         1       2       4       8       16      32      64      112     140
> > > 0       0.99    0.58    0.52    0.49    0.43    0.44    0.46    0.52    0.54
> > > 1       0.98    0.43    0.56    0.50    0.44    0.45    0.50    0.56    0.57
> > > 2       0.99    0.41    0.57    0.51    0.45    0.47    0.48    0.60    0.61
> > > 4       0.99    0.45    0.59    0.53    0.48    0.49    0.52    0.64    0.65
> > > 8       1.00    0.66    0.71    0.63    0.56    0.59    0.66    0.72    0.71
> > > 16      0.97    0.78    0.91    0.73    0.67    0.70    0.79    0.80    0.80
> > > 32      0.95    1.17    0.98    0.87    0.82    0.86    0.89    0.90    0.90
> > > 64      0.96    0.95    1.01    1.01    0.98    1.00    1.03    0.99    0.99
> > > 128     0.99    1.01    1.01    1.17    1.08    1.12    1.02    0.97    1.02
> > >
> > > non-critical-length: 32
> > >         1       2       4       8       16      32      64      112     140
> > > 0       1.03    0.97    0.75    0.65    0.58    0.58    0.56    0.70    0.70
> > > 1       0.94    0.95    0.76    0.65    0.58    0.58    0.61    0.71    0.72
> > > 2       0.97    0.96    0.77    0.66    0.58    0.59    0.62    0.74    0.74
> > > 4       0.99    0.96    0.78    0.66    0.60    0.61    0.66    0.76    0.77
> > > 8       0.99    0.99    0.84    0.70    0.64    0.66    0.71    0.80    0.80
> > > 16      0.98    0.97    0.95    0.76    0.70    0.73    0.81    0.85    0.84
> > > 32      1.04    1.12    1.04    0.89    0.82    0.86    0.93    0.91    0.91
> > > 64      0.99    1.15    1.07    1.00    0.99    1.01    1.05    0.99    0.99
> > > 128     1.00    1.21    1.20    1.22    1.25    1.31    1.12    1.10    0.99
> > >
> > > non-critical-length: 128
> > >         1       2       4       8       16      32      64      112     140
> > > 0       1.02    1.00    0.99    0.67    0.61    0.61    0.61    0.74    0.73
> > > 1       0.95    0.99    1.00    0.68    0.61    0.60    0.60    0.74    0.74
> > > 2       1.00    1.04    1.00    0.68    0.59    0.61    0.65    0.76    0.76
> > > 4       1.00    0.96    0.98    0.70    0.63    0.63    0.67    0.78    0.77
> > > 8       1.01    1.02    0.89    0.73    0.65    0.67    0.71    0.81    0.80
> > > 16      0.99    0.96    0.96    0.79    0.71    0.73    0.80    0.84    0.84
> > > 32      0.99    0.95    1.05    0.89    0.84    0.85    0.94    0.92    0.91
> > > 64      1.00    0.99    1.16    1.04    1.00    1.02    1.06    0.99    0.99
> > > 128     1.00    1.06    0.98    1.14    1.39    1.26    1.08    1.02    0.98
> > >
> > > There is a regression for large critical sections, but the adaptive
> > > mutex is aimed at "quick" locks.  Small critical sections are more
> > > common when users choose to use an adaptive pthread_mutex.
> > >
> > > Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
> > > ---
> > >  nptl/pthread_mutex_lock.c                   | 16 +++++++--
> > >  sysdeps/nptl/pthreadP.h                     |  1 +
> > >  sysdeps/nptl/pthread_mutex_backoff.h        | 35 ++++++++++++++++++
> > >  sysdeps/x86_64/nptl/pthread_mutex_backoff.h | 39 +++++++++++++++++++++
> > >  4 files changed, 89 insertions(+), 2 deletions(-)
> > >  create mode 100644 sysdeps/nptl/pthread_mutex_backoff.h
> > >  create mode 100644 sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > >
> > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > index d2e652d151..6e767a8724 100644
> > > --- a/nptl/pthread_mutex_lock.c
> > > +++ b/nptl/pthread_mutex_lock.c
> > > @@ -138,14 +138,26 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > >           int cnt = 0;
> > >           int max_cnt = MIN (max_adaptive_count (),
> > >                              mutex->__data.__spins * 2 + 10);
> > > +         int spin_count, exp_backoff = 1;
> > > +         unsigned int jitter = get_jitter ();
> > >           do
> > >             {
> > > -             if (cnt++ >= max_cnt)
> > > +             /* In each loop, the spin count is the exponential
> > > +                backoff plus a random jitter in [0, exp_backoff-1].  */
> > > +             spin_count = exp_backoff + (jitter & (exp_backoff - 1));
> > > +             cnt += spin_count;
> > > +             if (cnt >= max_cnt)
> > >                 {
> > > +                 /* If cnt exceeds the max spin count, just go to
> > > +                    the wait queue.  */
> > >                   LLL_MUTEX_LOCK (mutex);
> > >                   break;
> > >                 }
> > > -             atomic_spin_nop ();
> > > +             do
> > > +               atomic_spin_nop ();
> > > +             while (--spin_count > 0);
> > > +             /* Prepare for next loop.  */
> > > +             exp_backoff = get_next_backoff (exp_backoff);
> > >             }
> > >           while (LLL_MUTEX_READ_LOCK (mutex) != 0
> > >                  || LLL_MUTEX_TRYLOCK (mutex) != 0);
> > > diff --git a/sysdeps/nptl/pthreadP.h b/sysdeps/nptl/pthreadP.h
> > > index 601db4ff2b..39af275c25 100644
> > > --- a/sysdeps/nptl/pthreadP.h
> > > +++ b/sysdeps/nptl/pthreadP.h
> > > @@ -33,6 +33,7 @@
> > >  #include <kernel-features.h>
> > >  #include <errno.h>
> > >  #include <internal-signals.h>
> > > +#include <pthread_mutex_backoff.h>
> > >  #include "pthread_mutex_conf.h"
> > >
> > >
> > > diff --git a/sysdeps/nptl/pthread_mutex_backoff.h b/sysdeps/nptl/pthread_mutex_backoff.h
> > > new file mode 100644
> > > index 0000000000..5b26c22ac7
> > > --- /dev/null
> > > +++ b/sysdeps/nptl/pthread_mutex_backoff.h
> > > @@ -0,0 +1,35 @@
> > > +/* Pthread mutex backoff configuration.
> > > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library; if not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > > +
> > > +static inline unsigned int
> > > +get_jitter (void)
> > > +{
> > > +  /* Arch-dependent random jitter; returning 0 disables
> > > +     randomization.  */
> > > +  return 0;
> > > +}
> > > +
> > > +static inline int
> > > +get_next_backoff (int backoff)
> > > +{
> > > +  /* Next backoff value; returning 1 disables mutex backoff.  */
> > > +  return 1;
> > > +}
> > > +
> > > +#endif
> > > diff --git a/sysdeps/x86_64/nptl/pthread_mutex_backoff.h b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > > new file mode 100644
> > > index 0000000000..ec74c3d9db
> > > --- /dev/null
> > > +++ b/sysdeps/x86_64/nptl/pthread_mutex_backoff.h
> > > @@ -0,0 +1,39 @@
> > > +/* Pthread mutex backoff configuration.
> > > +   Copyright (C) 2022 Free Software Foundation, Inc.
> > > +   This file is part of the GNU C Library.
> > > +
> > > +   The GNU C Library is free software; you can redistribute it and/or
> > > +   modify it under the terms of the GNU Lesser General Public
> > > +   License as published by the Free Software Foundation; either
> > > +   version 2.1 of the License, or (at your option) any later version.
> > > +
> > > +   The GNU C Library is distributed in the hope that it will be useful,
> > > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > > +   Lesser General Public License for more details.
> > > +
> > > +   You should have received a copy of the GNU Lesser General Public
> > > +   License along with the GNU C Library; if not, see
> > > +   <https://www.gnu.org/licenses/>.  */
> > > +#ifndef _PTHREAD_MUTEX_BACKOFF_H
> > > +#define _PTHREAD_MUTEX_BACKOFF_H 1
> > > +
> > > +#include <fast-jitter.h>
> > > +
> > > +static inline unsigned int
> > > +get_jitter (void)
> > > +{
> > > +  return get_fast_jitter ();
> > > +}
> > > +
> > > +#define MAX_BACKOFF 16
> > > +
> > > +static inline int
> > > +get_next_backoff (int backoff)
> > > +{
> > > +  /* Binary exponential backoff.  Limiting the max backoff
> > > +     can reduce latency in large critical sections.  */
> > > +  return (backoff < MAX_BACKOFF) ? backoff << 1 : backoff;
> > > +}
> > > +
> > > +#endif
> > > --
> > > 2.35.1
> > >
> >
> > LGTM.
> >
> > Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
> >
> > Please wait until next Monday for more comments.
> >
> > Thanks.
> >
> > --
> > H.J.
>
> I would like to backport this patch to release branches 2.33, 2.34, and 2.35.

Fine by me.
>
> Any comments, suggestions, or objections to this?
>
> commit 8162147872491bb5b48e91543b19c49a29ae6b6d
> Author: Wangyang Guo <wangyang.guo@intel.com>
> Date:   Fri May 6 01:50:10 2022 +0000
>
>     nptl: Add backoff mechanism to spinlock loop
>
>     When multiple threads are waiting for a lock at the same time, once
>     the lock owner releases the lock, all waiters see the lock as
>     available and try to acquire it, which may cause an expensive CAS
>     storm.


end of thread [~2022-09-29  0:12 UTC]

Thread overview: 27+ messages
2022-03-28  8:47 [PATCH v2] nptl: Add backoff mechanism to spinlock loop Wangyang Guo
2022-03-28 16:41 ` Noah Goldstein
2022-03-30 11:44   ` Guo, Wangyang
2022-03-30 19:39     ` Noah Goldstein
2022-03-30 11:53 ` Adhemerval Zanella
2022-03-30 17:07   ` Noah Goldstein
2022-03-30 17:21     ` Adhemerval Zanella
2022-04-12 11:53       ` Florian Weimer
2022-04-22 13:30         ` Yann Droneaud
2022-04-22 13:32           ` Florian Weimer
2022-04-22 13:35             ` Cristian Rodríguez
2022-04-22 15:25               ` Noah Goldstein
2022-04-26 12:25                 ` Florian Weimer
2022-04-26 12:42                   ` Yann Droneaud
2022-05-04  2:50 ` [PATCH v3] " Wangyang Guo
2022-05-04  2:58 ` Wangyang Guo
2022-05-04  3:17   ` [PATCH v4] " Wangyang Guo
2022-05-05  1:56     ` H.J. Lu
2022-05-05  2:52       ` Noah Goldstein
2022-05-05  2:59       ` Guo, Wangyang
2022-05-05 22:44         ` H.J. Lu
2022-05-06  1:52           ` Guo, Wangyang
2022-05-06  1:50     ` [PATCH v5] " Wangyang Guo
2022-05-06  3:06       ` H.J. Lu
2022-09-11 20:29         ` Sunil Pandey
2022-09-14  1:26           ` Noah Goldstein
2022-09-29  0:12           ` Noah Goldstein
