public inbox for libc-alpha@sourceware.org
* [PATCH v6 0/4] Optimize CAS [BZ #28537]
@ 2021-11-11 16:24 H.J. Lu
  2021-11-11 16:24 ` [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK " H.J. Lu
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Changes in v6:

1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the
spinlock loop if the compare may fail.
2. Remove low level lock changes.
3. Don't change CAS usages in __pthread_mutex_lock_full.
4. Avoid extra load with CAS in __pthread_mutex_clocklock_common.
5. Reduce CAS in malloc spinlocks.

Changes in v5:

1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
2. Remove an atomic load in a CAS usage which has already been optimized.
3. Add an empty statement with a semicolon to a goto label for older
compiler versions.
4. Simplify CAS optimization.

A CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line exclusively and causes
excessive cache line bouncing.

Optimize CAS in low level locks and pthread_mutex_lock.c:

1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the
spinlock loop if the compare may fail, to reduce cache line bouncing on
contended locks (see the sketch below).
2. Replace the boolean CAS with a value CAS to avoid the extra load.
3. Change malloc spinlocks to do an atomic load and check whether the
compare may fail.  If so, skip the CAS and spin, to reduce cache line
bouncing on contended locks.
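
For illustration, here is a minimal sketch of the read-then-CAS idea in
portable C11 atomics.  This is not the glibc code (glibc uses its internal
atomic_* macros and atomic_spin_nop); it only shows the shape of the
optimization:

  #include <stdatomic.h>

  static atomic_int lock;

  static void
  spin_lock (void)
  {
    for (;;)
      {
        int expected = 0;
        /* The relaxed load keeps the cache line in the shared state
           while spinning; the CAS, which needs the line in the
           exclusive state, is only attempted when it is likely to
           succeed.  */
        if (atomic_load_explicit (&lock, memory_order_relaxed) == 0
            && atomic_compare_exchange_weak_explicit
                 (&lock, &expected, 1, memory_order_acquire,
                  memory_order_relaxed))
          break;
        /* A PAUSE-style spin hint would go here.  */
      }
  }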

With all CAS optimizations applied, on a machine with 112 cores, the
results are (before, after, percent improvement):

              mutex-empty    17.4575    17.3908  0.38%
             mutex-filler    48.4768    46.4925  4.1%
      mutex_trylock-empty    19.2726    19.2737  -0.0057%
     mutex_trylock-filler    54.0893     54.105  -0.029%
        rwlock_read-empty    39.7572    39.8933  -0.34%
       rwlock_read-filler     75.109    74.0818  1.4%
     rwlock_tryread-empty    5.28944    5.28938  0.0011%
    rwlock_tryread-filler    39.6297     39.734  -0.26%
       rwlock_write-empty    60.6644    60.6151  0.081%
      rwlock_write-filler      92.92    90.0722  3.1%
    rwlock_trywrite-empty    7.24741    6.59308  9%
   rwlock_trywrite-filler    42.7404    41.6767  2.5%
          spin_lock-empty    19.1078    19.1079  -0.00052%
         spin_lock-filler    51.0646    51.6041  -1.1%
       spin_trylock-empty    16.4707    16.4811  -0.063%
      spin_trylock-filler    50.5355    50.4012  0.27%
           sem_wait-empty    42.1991    42.1683  0.073%
          sem_wait-filler    74.6699    74.7883  -0.16%
        sem_trywait-empty    5.27062     5.2702  0.008%
       sem_trywait-filler    40.1541    40.1684  -0.036%
            condvar-empty    5488.91    5165.95  5.9%
           condvar-filler    1442.43    1474.21  -2.2%
  consumer_producer-empty    16508.2    16705.3  -1.2%
 consumer_producer-filler    16781.1    16942.3  -0.96%

H.J. Lu (4):
  Add LLL_MUTEX_READ_LOCK [BZ #28537]
  Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  Reduce CAS in malloc spinlocks
  Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ
    #28537]

 malloc/arena.c                 |  5 +++++
 malloc/malloc.c                | 10 ++++++++++
 nptl/pthread_mutex_lock.c      | 17 ++++++++++++-----
 nptl/pthread_mutex_timedlock.c | 10 +++++-----
 4 files changed, 32 insertions(+), 10 deletions(-)

-- 
2.33.1



* [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
@ 2021-11-11 16:24 ` H.J. Lu
  2021-11-12 17:23   ` Szabolcs Nagy
  2021-11-17  2:24   ` Noah Goldstein
  2021-11-11 16:24 ` [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full " H.J. Lu
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 24+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

A CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line exclusively and causes
excessive cache line bouncing.

Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the
spinlock loop if the compare may fail, to reduce cache line bouncing on
contended locks.
---
 nptl/pthread_mutex_lock.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 2bd41767e0..72058c719c 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
 # define PTHREAD_MUTEX_VERSIONS 1
 #endif
 
+#ifndef LLL_MUTEX_READ_LOCK
+# define LLL_MUTEX_READ_LOCK(mutex) \
+  atomic_load_relaxed (&(mutex)->__data.__lock)
+#endif
+
 static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
      __attribute_noinline__;
 
@@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
 		  break;
 		}
 	      atomic_spin_nop ();
+	      if (LLL_MUTEX_READ_LOCK (mutex) != 0)
+		continue;
 	    }
 	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
 
-- 
2.33.1



* [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-11 16:24 ` [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK " H.J. Lu
@ 2021-11-11 16:24 ` H.J. Lu
  2021-11-12 16:31   ` Szabolcs Nagy
  2021-11-12 18:50   ` Andreas Schwab
  2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
  2021-11-11 16:24 ` [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537] H.J. Lu
  3 siblings, 2 replies; 24+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Replace boolean CAS with value CAS to avoid the extra load.
---
 nptl/pthread_mutex_lock.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 72058c719c..762059b230 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -304,12 +304,12 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	     meantime.  */
 	  if ((oldval & FUTEX_WAITERS) == 0)
 	    {
-	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
-							oldval | FUTEX_WAITERS,
-							oldval)
-		  != 0)
+	      int val;
+	      if ((val = atomic_compare_and_exchange_val_acq
+		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
+		    oldval)) != oldval)
 		{
-		  oldval = mutex->__data.__lock;
+		  oldval = val;
 		  continue;
 		}
 	      oldval |= FUTEX_WAITERS;
-- 
2.33.1



* [PATCH v6 3/4] Reduce CAS in malloc spinlocks
  2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-11 16:24 ` [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK " H.J. Lu
  2021-11-11 16:24 ` [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full " H.J. Lu
@ 2021-11-11 16:24 ` H.J. Lu
  2023-02-23  5:48   ` DJ Delorie
  2021-11-11 16:24 ` [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537] H.J. Lu
  3 siblings, 1 reply; 24+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Do an atomic load and check whether the compare may fail.  If so, skip the
CAS and spin, to reduce cache line bouncing on contended locks.
---
 malloc/arena.c  |  5 +++++
 malloc/malloc.c | 10 ++++++++++
 2 files changed, 15 insertions(+)

diff --git a/malloc/arena.c b/malloc/arena.c
index 78ef4cf18c..e7fbe7c183 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -899,6 +899,11 @@ arena_get2 (size_t size, mstate avoid_arena)
          enough address space to create that many arenas.  */
       if (__glibc_unlikely (n <= narenas_limit - 1))
         {
+          if (atomic_load_relaxed (&narenas) != n)
+           {
+              atomic_spin_nop ();
+              goto repeat;
+           }
           if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
             goto repeat;
           a = _int_new_arena (size);
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 095d97a3be..403ffb84ef 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
       pp = REVEAL_PTR (victim->fd);                                     \
       if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
 	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
+      if (atomic_load_relaxed (fb) != victim)		\
+	{						\
+	  atomic_spin_nop ();				\
+	  continue;					\
+	}						\
     }							\
   while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
 	 != victim);					\
@@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
 	    malloc_printerr ("double free or corruption (fasttop)");
 	  old2 = old;
 	  p->fd = PROTECT_PTR (&p->fd, old);
+	  if (atomic_load_relaxed (fb) != old2)
+	    {
+	      atomic_spin_nop ();
+	      continue;
+	    }
 	}
       while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
 	     != old2);
-- 
2.33.1



* [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
                   ` (2 preceding siblings ...)
  2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
@ 2021-11-11 16:24 ` H.J. Lu
  2021-11-12 16:32   ` Szabolcs Nagy
  2021-11-12 18:51   ` Andreas Schwab
  3 siblings, 2 replies; 24+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Replace boolean CAS with value CAS to avoid the extra load.
---
 nptl/pthread_mutex_timedlock.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
index 57f3f28869..f763cfc7fa 100644
--- a/nptl/pthread_mutex_timedlock.c
+++ b/nptl/pthread_mutex_timedlock.c
@@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
 	     meantime.  */
 	  if ((oldval & FUTEX_WAITERS) == 0)
 	    {
-	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
-							oldval | FUTEX_WAITERS,
-							oldval)
-		  != 0)
+	      int val;
+	      if ((val = atomic_compare_and_exchange_val_acq
+		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
+		    oldval)) != oldval)
 		{
-		  oldval = mutex->__data.__lock;
+		  oldval = val;
 		  continue;
 		}
 	      oldval |= FUTEX_WAITERS;
-- 
2.33.1



* Re: [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full " H.J. Lu
@ 2021-11-12 16:31   ` Szabolcs Nagy
  2021-11-12 18:50   ` Andreas Schwab
  1 sibling, 0 replies; 24+ messages in thread
From: Szabolcs Nagy @ 2021-11-12 16:31 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Andreas Schwab, Paul A . Clarke,
	Arjan van de Ven

The 11/11/2021 08:24, H.J. Lu via Libc-alpha wrote:
> Replace boolean CAS with value CAS to avoid the extra load.

this looks good in general.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

> ---
>  nptl/pthread_mutex_lock.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 72058c719c..762059b230 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -304,12 +304,12 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  	     meantime.  */
>  	  if ((oldval & FUTEX_WAITERS) == 0)
>  	    {
> -	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> -							oldval | FUTEX_WAITERS,
> -							oldval)
> -		  != 0)
> +	      int val;
> +	      if ((val = atomic_compare_and_exchange_val_acq
> +		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> +		    oldval)) != oldval)
>  		{
> -		  oldval = mutex->__data.__lock;
> +		  oldval = val;
>  		  continue;
>  		}
>  	      oldval |= FUTEX_WAITERS;
> -- 
> 2.33.1
> 


* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537] H.J. Lu
@ 2021-11-12 16:32   ` Szabolcs Nagy
  2021-11-12 18:51   ` Andreas Schwab
  1 sibling, 0 replies; 24+ messages in thread
From: Szabolcs Nagy @ 2021-11-12 16:32 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Andreas Schwab, Paul A . Clarke,
	Arjan van de Ven

The 11/11/2021 08:24, H.J. Lu via Libc-alpha wrote:
> Replace boolean CAS with value CAS to avoid the extra load.

this looks good in general.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

> ---
>  nptl/pthread_mutex_timedlock.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 57f3f28869..f763cfc7fa 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
>  	     meantime.  */
>  	  if ((oldval & FUTEX_WAITERS) == 0)
>  	    {
> -	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> -							oldval | FUTEX_WAITERS,
> -							oldval)
> -		  != 0)
> +	      int val;
> +	      if ((val = atomic_compare_and_exchange_val_acq
> +		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> +		    oldval)) != oldval)
>  		{
> -		  oldval = mutex->__data.__lock;
> +		  oldval = val;
>  		  continue;
>  		}
>  	      oldval |= FUTEX_WAITERS;
> -- 
> 2.33.1
> 


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK " H.J. Lu
@ 2021-11-12 17:23   ` Szabolcs Nagy
  2021-11-17  2:24   ` Noah Goldstein
  1 sibling, 0 replies; 24+ messages in thread
From: Szabolcs Nagy @ 2021-11-12 17:23 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Andreas Schwab, Paul A . Clarke,
	Arjan van de Ven

The 11/11/2021 08:24, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> 
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> 
> Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> loop if compare may fail to reduce cache line bouncing on contended locks.

essentially the contended loop goes from

  if (cas(lock, 0, 1))
    break;

to

  if (read(lock) == 0)
    if (cas(lock, 0, 1))
      break;

i believe this helps on aarch64 too if it is only
done after contention is detected.

since the first cas access to lock is kept as is,
the common, uncontended case should not be affected.
this looks good to me.

> ---
>  nptl/pthread_mutex_lock.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 2bd41767e0..72058c719c 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>  
> +#ifndef LLL_MUTEX_READ_LOCK
> +# define LLL_MUTEX_READ_LOCK(mutex) \
> +  atomic_load_relaxed (&(mutex)->__data.__lock)
> +#endif
> +
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>  
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>  		  break;
>  		}
>  	      atomic_spin_nop ();
> +	      if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> +		continue;
>  	    }
>  	  while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>  
> -- 
> 2.33.1
> 


* Re: [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full " H.J. Lu
  2021-11-12 16:31   ` Szabolcs Nagy
@ 2021-11-12 18:50   ` Andreas Schwab
  2022-09-11 20:16     ` Sunil Pandey
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Schwab @ 2021-11-12 18:50 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Paul A . Clarke, Noah Goldstein

On Nov 11 2021, H.J. Lu wrote:

> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 72058c719c..762059b230 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -304,12 +304,12 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>  	     meantime.  */
>  	  if ((oldval & FUTEX_WAITERS) == 0)
>  	    {
> -	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> -							oldval | FUTEX_WAITERS,
> -							oldval)
> -		  != 0)
> +	      int val;
> +	      if ((val = atomic_compare_and_exchange_val_acq
> +		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> +		    oldval)) != oldval)

Please move the assignment out of the condition.
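
I.e. something like this (one possible shape; not taken from a posted
patch):

	      int val = atomic_compare_and_exchange_val_acq
		(&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
	      if (val != oldval)
		{
		  oldval = val;
		  continue;
		}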

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537] H.J. Lu
  2021-11-12 16:32   ` Szabolcs Nagy
@ 2021-11-12 18:51   ` Andreas Schwab
  2022-09-11 20:12     ` Sunil Pandey
  1 sibling, 1 reply; 24+ messages in thread
From: Andreas Schwab @ 2021-11-12 18:51 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Paul A . Clarke, Noah Goldstein

On Nov 11 2021, H.J. Lu wrote:

> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> index 57f3f28869..f763cfc7fa 100644
> --- a/nptl/pthread_mutex_timedlock.c
> +++ b/nptl/pthread_mutex_timedlock.c
> @@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
>  	     meantime.  */
>  	  if ((oldval & FUTEX_WAITERS) == 0)
>  	    {
> -	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> -							oldval | FUTEX_WAITERS,
> -							oldval)
> -		  != 0)
> +	      int val;
> +	      if ((val = atomic_compare_and_exchange_val_acq
> +		   (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> +		    oldval)) != oldval)

Please move the assignment out of the condition.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-11 16:24 ` [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK " H.J. Lu
  2021-11-12 17:23   ` Szabolcs Nagy
@ 2021-11-17  2:24   ` Noah Goldstein
  2021-11-17 23:54     ` H.J. Lu
  1 sibling, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-11-17  2:24 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Paul A . Clarke

On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> loop if compare may fail to reduce cache line bouncing on contended locks.
> ---
>  nptl/pthread_mutex_lock.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> index 2bd41767e0..72058c719c 100644
> --- a/nptl/pthread_mutex_lock.c
> +++ b/nptl/pthread_mutex_lock.c
> @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
>  # define PTHREAD_MUTEX_VERSIONS 1
>  #endif
>
> +#ifndef LLL_MUTEX_READ_LOCK
> +# define LLL_MUTEX_READ_LOCK(mutex) \
> +  atomic_load_relaxed (&(mutex)->__data.__lock)
> +#endif
> +
>  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
>       __attribute_noinline__;
>
> @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
>                   break;
>                 }
>               atomic_spin_nop ();
> +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> +               continue;

Now that the lock spins on a simple read, should `max_cnt` be adjusted?
https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
>             }
>           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
>
> --
> 2.33.1
>


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-17  2:24   ` Noah Goldstein
@ 2021-11-17 23:54     ` H.J. Lu
  2021-11-18  0:03       ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: H.J. Lu @ 2021-11-17 23:54 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Paul A . Clarke

On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading.  See Appendix
> > A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> >
> > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > loop if compare may fail to reduce cache line bouncing on contended locks.
> > ---
> >  nptl/pthread_mutex_lock.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index 2bd41767e0..72058c719c 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> >  # define PTHREAD_MUTEX_VERSIONS 1
> >  #endif
> >
> > +#ifndef LLL_MUTEX_READ_LOCK
> > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > +#endif
> > +
> >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >       __attribute_noinline__;
> >
> > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> >                   break;
> >                 }
> >               atomic_spin_nop ();
> > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > +               continue;
>
> Now that the lock spins on a simple read should `max_cnt` be adjusted?

Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
LLL_MUTEX_TRYLOCK.  It doesn't change the flow.

> https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> >             }
> >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> >
> > --
> > 2.33.1
> >



-- 
H.J.


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-17 23:54     ` H.J. Lu
@ 2021-11-18  0:03       ` Noah Goldstein
  2021-11-18  0:31         ` H.J. Lu
  0 siblings, 1 reply; 24+ messages in thread
From: Noah Goldstein @ 2021-11-18  0:03 UTC (permalink / raw)
  To: H.J. Lu
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Paul A . Clarke

On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > >
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > > loop if compare may fail to reduce cache line bouncing on contended locks.
> > > ---
> > >  nptl/pthread_mutex_lock.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > index 2bd41767e0..72058c719c 100644
> > > --- a/nptl/pthread_mutex_lock.c
> > > +++ b/nptl/pthread_mutex_lock.c
> > > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> > >  # define PTHREAD_MUTEX_VERSIONS 1
> > >  #endif
> > >
> > > +#ifndef LLL_MUTEX_READ_LOCK
> > > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > > +#endif
> > > +
> > >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> > >       __attribute_noinline__;
> > >
> > > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > >                   break;
> > >                 }
> > >               atomic_spin_nop ();
> > > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > > +               continue;
> >
> > Now that the lock spins on a simple read should `max_cnt` be adjusted?
>
> Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> LLL_MUTEX_TRYLOCK.  It doesn't change the flow.

Yes, but the loop will be able to run `max_cnt` iterations much faster now.
Just wondering if the value needs to be re-tuned.  Not that it necessarily
needs to be.
>
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> > >             }
> > >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> > >
> > > --
> > > 2.33.1
> > >
>
>
>
> --
> H.J.


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-18  0:03       ` Noah Goldstein
@ 2021-11-18  0:31         ` H.J. Lu
  2021-11-18  1:16           ` Arjan van de Ven
  0 siblings, 1 reply; 24+ messages in thread
From: H.J. Lu @ 2021-11-18  0:31 UTC (permalink / raw)
  To: Noah Goldstein
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Paul A . Clarke

On Wed, Nov 17, 2021 at 4:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 5:55 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Tue, Nov 16, 2021 at 6:24 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On Thu, Nov 11, 2021 at 10:24 AM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > >
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
> > > > loop if compare may fail to reduce cache line bouncing on contended locks.
> > > > ---
> > > >  nptl/pthread_mutex_lock.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > > index 2bd41767e0..72058c719c 100644
> > > > --- a/nptl/pthread_mutex_lock.c
> > > > +++ b/nptl/pthread_mutex_lock.c
> > > > @@ -64,6 +64,11 @@ lll_mutex_lock_optimized (pthread_mutex_t *mutex)
> > > >  # define PTHREAD_MUTEX_VERSIONS 1
> > > >  #endif
> > > >
> > > > +#ifndef LLL_MUTEX_READ_LOCK
> > > > +# define LLL_MUTEX_READ_LOCK(mutex) \
> > > > +  atomic_load_relaxed (&(mutex)->__data.__lock)
> > > > +#endif
> > > > +
> > > >  static int __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> > > >       __attribute_noinline__;
> > > >
> > > > @@ -141,6 +146,8 @@ PTHREAD_MUTEX_LOCK (pthread_mutex_t *mutex)
> > > >                   break;
> > > >                 }
> > > >               atomic_spin_nop ();
> > > > +             if (LLL_MUTEX_READ_LOCK (mutex) != 0)
> > > > +               continue;
> > >
> > > Now that the lock spins on a simple read should `max_cnt` be adjusted?
> >
> > Adding LLL_MUTEX_READ_LOCK just avoids the more expensive
> > LLL_MUTEX_TRYLOCK.  It doesn't change the flow.
>
> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> to be.

Maybe, if we can find some data to show for it.

> >
> > > https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=762059b230ba97140d6ca16c7273b489592dd3bc;hb=d672a98a1af106bd68deb15576710cd61363f7a6#l143
> > > >             }
> > > >           while (LLL_MUTEX_TRYLOCK (mutex) != 0);
> > > >
> > > > --
> > > > 2.33.1
> > > >
> >
> >
> >
> > --
> > H.J.



-- 
H.J.


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-18  0:31         ` H.J. Lu
@ 2021-11-18  1:16           ` Arjan van de Ven
  2022-09-11 20:19             ` Sunil Pandey
  0 siblings, 1 reply; 24+ messages in thread
From: Arjan van de Ven @ 2021-11-18  1:16 UTC (permalink / raw)
  To: H.J. Lu, Noah Goldstein
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Andreas Schwab,
	Paul A . Clarke

On 11/17/2021 4:31 PM, H.J. Lu wrote:
>> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
>> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
>> to be.
> Maybe if we can find some data to show for.
> 

wondering when this was last tuned... I assume today's CPUs and CPUs from, say, 5 or 10 years ago
have an order of magnitude different performance in this regard...
if there wasn't a need to retune during that time, maybe this value is so robust that it doesn't
need retuning.

or maybe it's time to retune this in general sometime soon after this patch goes in ;)


* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2021-11-12 18:51   ` Andreas Schwab
@ 2022-09-11 20:12     ` Sunil Pandey
  2022-09-11 20:15       ` Arjan van de Ven
  2022-09-29  0:09       ` Noah Goldstein
  0 siblings, 2 replies; 24+ messages in thread
From: Sunil Pandey @ 2022-09-11 20:12 UTC (permalink / raw)
  To: Andreas Schwab, Libc-stable Mailing List
  Cc: H.J. Lu, Florian Weimer, GNU C Library, Arjan van de Ven,
	Paul A . Clarke

On Fri, Nov 12, 2021 at 10:53 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Nov 11 2021, H.J. Lu wrote:
>
> > diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> > index 57f3f28869..f763cfc7fa 100644
> > --- a/nptl/pthread_mutex_timedlock.c
> > +++ b/nptl/pthread_mutex_timedlock.c
> > @@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
> >            meantime.  */
> >         if ((oldval & FUTEX_WAITERS) == 0)
> >           {
> > -           if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> > -                                                     oldval | FUTEX_WAITERS,
> > -                                                     oldval)
> > -               != 0)
> > +           int val;
> > +           if ((val = atomic_compare_and_exchange_val_acq
> > +                (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> > +                 oldval)) != oldval)
>
> Please move the assignment out of the condition.
>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

I would like to backport this patch to release branches 2.33 and 2.34.

Any comments, suggestions, or objections?

commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:31:51 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

    Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>


* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2022-09-11 20:12     ` Sunil Pandey
@ 2022-09-11 20:15       ` Arjan van de Ven
  2022-09-11 21:26         ` Florian Weimer
  2022-09-29  0:09       ` Noah Goldstein
  1 sibling, 1 reply; 24+ messages in thread
From: Arjan van de Ven @ 2022-09-11 20:15 UTC (permalink / raw)
  To: Sunil Pandey, Andreas Schwab, Libc-stable Mailing List
  Cc: H.J. Lu, Florian Weimer, GNU C Library, Paul A . Clarke

On 9/11/2022 1:12 PM, Sunil Pandey wrote:
> On Fri, Nov 12, 2021 at 10:53 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> On Nov 11 2021, H.J. Lu wrote:
>>
>>> diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
>>> index 57f3f28869..f763cfc7fa 100644
>>> --- a/nptl/pthread_mutex_timedlock.c
>>> +++ b/nptl/pthread_mutex_timedlock.c
>>> @@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
>>>             meantime.  */
>>>          if ((oldval & FUTEX_WAITERS) == 0)
>>>            {
>>> -           if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
>>> -                                                     oldval | FUTEX_WAITERS,
>>> -                                                     oldval)
>>> -               != 0)
>>> +           int val;
>>> +           if ((val = atomic_compare_and_exchange_val_acq
>>> +                (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
>>> +                 oldval)) != oldval)
>>
>> Please move the assignment out of the condition.
>>
>> Andreas.
>>
>> --
>> Andreas Schwab, schwab@linux-m68k.org
>> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
>> "And now for something completely different."
> 
> I would like to backport this patch to release branch 2.33 and 2.34
> 

what exactly is the stable branch policy that would suggest backporting performance improvements like this?
(most projects stick to "strict bugfixes and other gross oversights" as much as possible)



* Re: [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-12 18:50   ` Andreas Schwab
@ 2022-09-11 20:16     ` Sunil Pandey
  2022-09-29  0:10       ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Sunil Pandey @ 2022-09-11 20:16 UTC (permalink / raw)
  To: Andreas Schwab, Libc-stable Mailing List
  Cc: H.J. Lu, Florian Weimer, GNU C Library, Arjan van de Ven,
	Paul A . Clarke

On Fri, Nov 12, 2021 at 10:52 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Nov 11 2021, H.J. Lu wrote:
>
> > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > index 72058c719c..762059b230 100644
> > --- a/nptl/pthread_mutex_lock.c
> > +++ b/nptl/pthread_mutex_lock.c
> > @@ -304,12 +304,12 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> >            meantime.  */
> >         if ((oldval & FUTEX_WAITERS) == 0)
> >           {
> > -           if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> > -                                                     oldval | FUTEX_WAITERS,
> > -                                                     oldval)
> > -               != 0)
> > +           int val;
> > +           if ((val = atomic_compare_and_exchange_val_acq
> > +                (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> > +                 oldval)) != oldval)
>
> Please move the assignment out of the condition.
>
> Andreas.
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."

I would like to backport this patch to release branches 2.33 and 2.34.

Any comments, suggestions, or objections?

commit 49302b8fdf9103b6fc0a398678668a22fa19574c
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:54:01 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

    Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2021-11-18  1:16           ` Arjan van de Ven
@ 2022-09-11 20:19             ` Sunil Pandey
  2022-09-29  0:10               ` Noah Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: Sunil Pandey @ 2022-09-11 20:19 UTC (permalink / raw)
  To: Arjan van de Ven, Libc-stable Mailing List
  Cc: H.J. Lu, Noah Goldstein, Florian Weimer, Andreas Schwab,
	GNU C Library, Paul A . Clarke

On Wed, Nov 17, 2021 at 5:17 PM Arjan van de Ven via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> On 11/17/2021 4:31 PM, H.J. Lu wrote:
> >> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> >> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> >> to be.
> > Maybe if we can find some data to show for.
> >
>
> wondering when this was last tuned.. I assume todays CPUs and CPUs from, say, 5 or 10 years ago
> have an order of magnitude different performance in this regard....
> if there wasn't a need to retune during that, maybe this value is so robust that it doesn't need
> retuning.
>
> or maybe it's time to retune this in general sometime soon after this patch goes in ;)

I would like to backport this patch to release branches 2.33 and 2.34.

Any comments, suggestions, or objections?

commit d672a98a1af106bd68deb15576710cd61363f7a6
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Tue Nov 2 18:33:07 2021 -0700

    Add LLL_MUTEX_READ_LOCK [BZ #28537]

    CAS instruction is expensive.  From the x86 CPU's point of view, getting
    a cache line for writing is more expensive than reading.  See Appendix
    A.2 Spinlock in:

    https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf


* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2022-09-11 20:15       ` Arjan van de Ven
@ 2022-09-11 21:26         ` Florian Weimer
  0 siblings, 0 replies; 24+ messages in thread
From: Florian Weimer @ 2022-09-11 21:26 UTC (permalink / raw)
  To: Arjan van de Ven via Libc-alpha
  Cc: Sunil Pandey, Andreas Schwab, Libc-stable Mailing List,
	Arjan van de Ven, Paul A . Clarke

* Arjan van de Ven via Libc-alpha:

> what exactly is the stable branch policy that would suggest to
> backport performance improvements like this ?  (most projects are
> sticking to "strict bugfixes and other gross oversights" as much as
> possible)

We occasionally backport safe (and not-so-safe) performance
optimizations.  We draw the line at ABI changes; those aren't possible.
It's also very desirable that all future releases have a superset of the
backports.

But anything else depends on who is willing to do the work.

Thanks,
Florian



* Re: [PATCH v6 4/4] Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
  2022-09-11 20:12     ` Sunil Pandey
  2022-09-11 20:15       ` Arjan van de Ven
@ 2022-09-29  0:09       ` Noah Goldstein
  1 sibling, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2022-09-29  0:09 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: Andreas Schwab, Libc-stable Mailing List, Florian Weimer,
	Arjan van de Ven, GNU C Library, Paul A . Clarke

On Sun, Sep 11, 2022 at 4:13 PM Sunil Pandey via Libc-stable
<libc-stable@sourceware.org> wrote:
>
> On Fri, Nov 12, 2021 at 10:53 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > On Nov 11 2021, H.J. Lu wrote:
> >
> > > diff --git a/nptl/pthread_mutex_timedlock.c b/nptl/pthread_mutex_timedlock.c
> > > index 57f3f28869..f763cfc7fa 100644
> > > --- a/nptl/pthread_mutex_timedlock.c
> > > +++ b/nptl/pthread_mutex_timedlock.c
> > > @@ -233,12 +233,12 @@ __pthread_mutex_clocklock_common (pthread_mutex_t *mutex,
> > >            meantime.  */
> > >         if ((oldval & FUTEX_WAITERS) == 0)
> > >           {
> > > -           if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> > > -                                                     oldval | FUTEX_WAITERS,
> > > -                                                     oldval)
> > > -               != 0)
> > > +           int val;
> > > +           if ((val = atomic_compare_and_exchange_val_acq
> > > +                (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> > > +                 oldval)) != oldval)
> >
> > Please move the assignment out of the condition.
> >
> > Andreas.
> >
> > --
> > Andreas Schwab, schwab@linux-m68k.org
> > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > "And now for something completely different."
>
> I would like to backport this patch to release branch 2.33 and 2.34
>
> Any comments/suggestions or objections on this.
>
> commit 0b82747dc48d5bf0871bdc6da8cb6eec1256355f
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Thu Nov 11 06:31:51 2021 -0800
>
>     Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
>
>     Replace boolean CAS with value CAS to avoid the extra load.
>
>     Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Fine by me.


* Re: [PATCH v6 1/4] Add LLL_MUTEX_READ_LOCK [BZ #28537]
  2022-09-11 20:19             ` Sunil Pandey
@ 2022-09-29  0:10               ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2022-09-29  0:10 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: Arjan van de Ven, Libc-stable Mailing List, H.J. Lu,
	Florian Weimer, Andreas Schwab, GNU C Library, Paul A . Clarke

On Sun, Sep 11, 2022 at 4:19 PM Sunil Pandey <skpgkp2@gmail.com> wrote:
>
> On Wed, Nov 17, 2021 at 5:17 PM Arjan van de Ven via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
> >
> > On 11/17/2021 4:31 PM, H.J. Lu wrote:
> > >> Yes, but the loop will be able to run `max_cnt` iterations much faster now.
> > >> Just wondering if the value needs to be re-tuned. Not that is necessarily needs
> > >> to be.
> > > Maybe if we can find some data to show for.
> > >
> >
> > wondering when this was last tuned.. I assume todays CPUs and CPUs from, say, 5 or 10 years ago
> > have an order of magnitude different performance in this regard....
> > if there wasn't a need to retune during that, maybe this value is so robust that it doesn't need
> > retuning.
> >
> > or maybe it's time to retune this in general sometime soon after this patch goes in ;)
>
> I would like to backport this patch to release branch 2.33 and 2.34

Fine by me.
>
> Any comments/suggestions or objections on this.
>
> commit d672a98a1af106bd68deb15576710cd61363f7a6
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Tue Nov 2 18:33:07 2021 -0700
>
>     Add LLL_MUTEX_READ_LOCK [BZ #28537]
>
>     CAS instruction is expensive.  From the x86 CPU's point of view, getting
>     a cache line for writing is more expensive than reading.  See Appendix
>     A.2 Spinlock in:
>
>     https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf


* Re: [PATCH v6 2/4] Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
  2022-09-11 20:16     ` Sunil Pandey
@ 2022-09-29  0:10       ` Noah Goldstein
  0 siblings, 0 replies; 24+ messages in thread
From: Noah Goldstein @ 2022-09-29  0:10 UTC (permalink / raw)
  To: Sunil Pandey
  Cc: Andreas Schwab, Libc-stable Mailing List, Florian Weimer,
	Arjan van de Ven, GNU C Library, Paul A . Clarke

On Sun, Sep 11, 2022 at 4:17 PM Sunil Pandey via Libc-stable
<libc-stable@sourceware.org> wrote:
>
> On Fri, Nov 12, 2021 at 10:52 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >
> > On Nov 11 2021, H.J. Lu wrote:
> >
> > > diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
> > > index 72058c719c..762059b230 100644
> > > --- a/nptl/pthread_mutex_lock.c
> > > +++ b/nptl/pthread_mutex_lock.c
> > > @@ -304,12 +304,12 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
> > >            meantime.  */
> > >         if ((oldval & FUTEX_WAITERS) == 0)
> > >           {
> > > -           if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
> > > -                                                     oldval | FUTEX_WAITERS,
> > > -                                                     oldval)
> > > -               != 0)
> > > +           int val;
> > > +           if ((val = atomic_compare_and_exchange_val_acq
> > > +                (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
> > > +                 oldval)) != oldval)
> >
> > Please move the assignment out of the condition.
> >
> > Andreas.
> >
> > --
> > Andreas Schwab, schwab@linux-m68k.org
> > GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> > "And now for something completely different."
>
> I would like to backport this patch to release branch 2.33 and 2.34
>
> Any comments/suggestions or objections on this.
>
> commit 49302b8fdf9103b6fc0a398678668a22fa19574c
> Author: H.J. Lu <hjl.tools@gmail.com>
> Date:   Thu Nov 11 06:54:01 2021 -0800
>
>     Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
>
>     Replace boolean CAS with value CAS to avoid the extra load.
>
>     Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>

Fine by me.


* Re: [PATCH v6 3/4] Reduce CAS in malloc spinlocks
  2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
@ 2023-02-23  5:48   ` DJ Delorie
  0 siblings, 0 replies; 24+ messages in thread
From: DJ Delorie @ 2023-02-23  5:48 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha


Sorry for letting this one slip...

"H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org> writes:
>        size_t n = narenas;
>        if (__glibc_unlikely (n <= narenas_limit - 1))
>          {
> +          if (atomic_load_relaxed (&narenas) != n)
> +           {
> +              atomic_spin_nop ();
> +              goto repeat;
> +           }
>            if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
>              goto repeat;

I understand that a congested spinloop will benefit from this kind of
change, but... we JUST loaded narenas into n, and adding arenas is rare.
We probably should have loaded it atomically, but still, we just loaded
it.  The odds of malloc being so congested that we miss the CAS are
essentially (but of course not exactly) zero.  Are we just adding an
unneeded atomic read here?  Do any benchmarks say this would be
beneficial?

Also, the malloc code is already complicated enough.  Are the extra
lines of code and slight reduction in readability justified?

Also, we've been migrating to C11-like atomics; would this patch need
changing for that?

Should target-specific atomics optimizations be "hidden" somewhere in
the atomics implementation?  Just because x86 may benefit from a
pre-read doesn't mean that all targets will, and if x86 generally
benefits, it should update its implementation of the atomics to do that
at a lower level.
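
Something like the following, hypothetically (no such macro exists
today; this only illustrates where a per-target pre-read could live):

  /* Hypothetical x86 override: do a cheap relaxed read first and skip
     the expensive CAS when it would fail anyway.  Returning the read
     value mimics a failed CAS.  Illustrative only.  */
  #define catomic_compare_and_exchange_val_acq_spin(mem, newval, oldval) \
    ({ __typeof (*(mem)) __tmp = atomic_load_relaxed (mem);              \
       __tmp == (oldval)                                                 \
         ? catomic_compare_and_exchange_val_acq (mem, newval, oldval)    \
         : __tmp; })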

>            a = _int_new_arena (size);
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 095d97a3be..403ffb84ef 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
>        pp = REVEAL_PTR (victim->fd);                                     \
>        if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
>  	malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
> +      if (atomic_load_relaxed (fb) != victim)		\
> +	{						\
> +	  atomic_spin_nop ();				\
> +	  continue;					\
> +	}						\
>      }							\
>    while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
>  	 != victim);					\
> @@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
>  	    malloc_printerr ("double free or corruption (fasttop)");
>  	  old2 = old;
>  	  p->fd = PROTECT_PTR (&p->fd, old);
> +	  if (atomic_load_relaxed (fb) != old2)
> +	    {
> +	      atomic_spin_nop ();
> +	      continue;
> +	    }
>  	}
>        while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
>  	     != old2);

Likewise, although these are less rare, they are still not so common that
I'd expect a benefit from the extra code.


