public inbox for libc-alpha@sourceware.org
* [PATCH v5 0/3] Optimize CAS [BZ #28537]
@ 2021-11-10 18:41 H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 1/3] Reduce CAS in low level locks " H.J. Lu
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: H.J. Lu @ 2021-11-10 18:41 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Changes in v5:

1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
2. Remove an atomic load in a CAS usage that has already been optimized.
3. Add an empty statement (a lone semicolon) after a goto label for
older compiler versions (see the small example below).
4. Simplify the CAS optimization.
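
On item 3: before C23 a label must be followed by a statement, and a
declaration is not a statement, so older compilers reject a label placed
directly in front of a declaration.  A tiny, purely illustrative example
of the workaround (the function and variable names are made up):

void
f (int *p)
{
  if (*p != 0)
    goto out;
  *p = 1;
 out:;   /* The lone ';' is an empty statement; without it the label
            would directly precede the declaration below, which older
            compilers reject.  */
  int copy = *p;
  (void) copy;
}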

The CAS instruction is expensive.  From the x86 CPU's point of view,
getting a cache line for writing is more expensive than getting it for
reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line in exclusive state and
causes excessive cache line bouncing.

Optimize CAS in the low level locks and in pthread_mutex_lock.c:

1. Do an atomic load first and skip the CAS when the loaded value shows
the compare would fail, to reduce cache line bouncing on contended locks
(see the sketch below).
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq so the old value is returned
directly and the extra load is avoided.
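
As a standalone illustration of point 1, here is a minimal sketch
written against C11 stdatomic rather than glibc's internal atomic
macros; the try_lock name and the 0/1 lock values are only for the
example:

#include <stdatomic.h>
#include <stdbool.h>

/* Read the lock word first and attempt the CAS only when the read
   shows it can succeed.  The relaxed load leaves the cache line in
   shared state; only the CAS requests it in exclusive state.  */
static bool
try_lock (atomic_int *lock)
{
  if (atomic_load_explicit (lock, memory_order_relaxed) != 0)
    /* Lock is held: skip the CAS and avoid requesting ownership of
       the cache line.  */
    return false;

  int expected = 0;
  return atomic_compare_exchange_strong_explicit
    (lock, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

Under contention, waiters then mostly spin on a shared copy of the
cache line instead of taking turns invalidating it.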

This is the first patch set to optimize CAS.  I will submit the rest of
the CAS optimizations in glibc after this patch set has been accepted.

With all CAS optimizations applied, on a machine with 112 cores,
"make check -j28" under heavy load took

3093.18user 1644.12system 22:26.05elapsed 351%CPU

vs without CAS optimizations

3746.07user 1614.93system 22:02.91elapsed 405%CPU

H.J. Lu (3):
  Reduce CAS in low level locks [BZ #28537]
  Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  Optimize CAS in __pthread_mutex_lock_full [BZ #28537]

 nptl/lowlevellock.c         | 12 ++++-----
 nptl/pthread_mutex_lock.c   | 49 +++++++++++++++++++++++--------------
 sysdeps/nptl/lowlevellock.h | 33 +++++++++++++++++--------
 3 files changed, 60 insertions(+), 34 deletions(-)

-- 
2.33.1



* [PATCH v5 1/3] Reduce CAS in low level locks [BZ #28537]
  2021-11-10 18:41 [PATCH v5 0/3] Optimize CAS [BZ #28537] H.J. Lu
@ 2021-11-10 18:41 ` H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: H.J. Lu @ 2021-11-10 18:41 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

The CAS instruction is expensive.  From the x86 CPU's point of view,
getting a cache line for writing is more expensive than getting it for
reading.  See Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare-and-swap grabs the cache line in exclusive state and
causes excessive cache line bouncing.

1. Change the low level locks to do an atomic load and skip the CAS
when the loaded value shows the compare would fail, to reduce cache
line bouncing on contended locks.
2. In __lll_lock, replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq and pass the returned value down to
__lll_lock_wait and __lll_lock_wait_private, avoiding a redundant load
there (sketched below).
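
The resulting shape is roughly as sketched below, using C11 stdatomic
and a spinning stand-in for the futex syscall; the names and the
futex_wait_stub are not the glibc code, only an illustration of how the
value observed on the fast path reaches the slow path:

#include <stdatomic.h>

/* Stand-in for the futex syscall wrapper: "block" (here: spin) while
   the futex word still holds EXPECTED.  Illustration only.  */
static void
futex_wait_stub (atomic_int *futex, int expected)
{
  while (atomic_load_explicit (futex, memory_order_relaxed) == expected)
    ;
}

/* Slow path: the value the fast path already observed is passed in,
   so no extra load is needed to decide whether to wait right away.  */
static void
lock_slow (atomic_int *futex, int futex_value)
{
  if (futex_value == 2)          /* Already "locked, with waiters".  */
    goto wait;

  while (atomic_exchange_explicit (futex, 2, memory_order_acquire) != 0)
    {
    wait:
      futex_wait_stub (futex, 2);
    }
}

/* Fast path: load first, CAS only if the lock looks free, and hand
   the observed value to the slow path on failure.  */
static void
lock (atomic_int *futex)
{
  int val = atomic_load_explicit (futex, memory_order_relaxed);
  if (val == 0)
    {
      int expected = 0;
      if (atomic_compare_exchange_strong_explicit
            (futex, &expected, 1, memory_order_acquire,
             memory_order_relaxed))
        return;                  /* Uncontended acquisition.  */
      val = expected;            /* CAS failed: EXPECTED holds the value seen.  */
    }
  lock_slow (futex, val);
}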
---
 nptl/lowlevellock.c         | 12 ++++++------
 sysdeps/nptl/lowlevellock.h | 33 +++++++++++++++++++++++----------
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/nptl/lowlevellock.c b/nptl/lowlevellock.c
index 8c740529c4..d1965c01ca 100644
--- a/nptl/lowlevellock.c
+++ b/nptl/lowlevellock.c
@@ -22,30 +22,30 @@
 #include <stap-probe.h>
 
 void
-__lll_lock_wait_private (int *futex)
+__lll_lock_wait_private (int *futex, int futex_value)
 {
-  if (atomic_load_relaxed (futex) == 2)
+  if (futex_value == 2)
     goto futex;
 
   while (atomic_exchange_acquire (futex, 2) != 0)
     {
     futex:
-      LIBC_PROBE (lll_lock_wait_private, 1, futex);
+      LIBC_PROBE (lll_lock_wait_private, 2, futex, futex_value);
       futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2.  */
     }
 }
 libc_hidden_def (__lll_lock_wait_private)
 
 void
-__lll_lock_wait (int *futex, int private)
+__lll_lock_wait (int *futex, int futex_value, int private)
 {
-  if (atomic_load_relaxed (futex) == 2)
+  if (futex_value == 2)
     goto futex;
 
   while (atomic_exchange_acquire (futex, 2) != 0)
     {
     futex:
-      LIBC_PROBE (lll_lock_wait, 1, futex);
+      LIBC_PROBE (lll_lock_wait, 2, futex, futex_value);
       futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2.  */
     }
 }
diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 4d95114ed3..05260eb706 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -66,7 +66,12 @@
    0.  Otherwise leave lock unchanged and return non-zero to indicate that the
    lock was not acquired.  */
 #define __lll_trylock(lock)	\
-  __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock), 1, 0))
+  (__extension__ ({							\
+    __typeof (*(lock)) __lock_value = atomic_load_relaxed (lock);	\
+    (__lock_value != 0							\
+     || __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock),	\
+								1, 0)));\
+  }))
 #define lll_trylock(lock)	\
    __lll_trylock (&(lock))
 
@@ -74,11 +79,16 @@
    return 0.  Otherwise leave lock unchanged and return non-zero to indicate
    that the lock was not acquired.  */
 #define lll_cond_trylock(lock)	\
-  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
-
-extern void __lll_lock_wait_private (int *futex);
+  (__extension__ ({							\
+    __typeof (lock) __lock_value = atomic_load_relaxed (&(lock));	\
+    (__lock_value != 0							\
+     || __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock),\
+								2, 0)));\
+  }))
+
+extern void __lll_lock_wait_private (int *futex, int futex_value);
 libc_hidden_proto (__lll_lock_wait_private)
-extern void __lll_lock_wait (int *futex, int private);
+extern void __lll_lock_wait (int *futex, int futex_value, int private);
 libc_hidden_proto (__lll_lock_wait)
 
 /* This is an expression rather than a statement even though its value is
@@ -95,13 +105,15 @@ libc_hidden_proto (__lll_lock_wait)
   ((void)                                                               \
    ({                                                                   \
      int *__futex = (futex);                                            \
-     if (__glibc_unlikely                                               \
-         (atomic_compare_and_exchange_bool_acq (__futex, 1, 0)))        \
+     int __futex_value = atomic_load_relaxed (futex);                   \
+     if (__futex_value != 0                                             \
+         || ((__futex_value = atomic_compare_and_exchange_val_acq       \
+              (__futex, 1, 0)) != 0))                                   \
        {                                                                \
          if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
-           __lll_lock_wait_private (__futex);                           \
+           __lll_lock_wait_private (futex, __futex_value);              \
          else                                                           \
-           __lll_lock_wait (__futex, private);                          \
+           __lll_lock_wait (futex, __futex_value, private);             \
        }                                                                \
    }))
 #define lll_lock(futex, private)	\
@@ -120,7 +132,8 @@ libc_hidden_proto (__lll_lock_wait)
    ({                                                                   \
      int *__futex = (futex);                                            \
      if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))      \
-       __lll_lock_wait (__futex, private);                              \
+       __lll_lock_wait (__futex, atomic_load_relaxed (__futex),         \
+			private); \
    }))
 #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
 
-- 
2.33.1



* [PATCH v5 2/3] Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-10 18:41 [PATCH v5 0/3] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 1/3] Reduce CAS in low level locks " H.J. Lu
@ 2021-11-10 18:41 ` H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 3/3] Optimize " H.J. Lu
  2021-11-10 20:23 ` [PATCH v5 0/3] Optimize CAS " Paul A. Clarke
  3 siblings, 0 replies; 6+ messages in thread
From: H.J. Lu @ 2021-11-10 18:41 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

Change __pthread_mutex_lock_full to do an atomic load and skip the CAS
when the loaded value shows the compare would fail, to reduce cache line
bouncing on contended locks.
---
 nptl/pthread_mutex_lock.c | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 2bd41767e0..d7e8efedd2 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -223,13 +223,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	      newval |= (oldval & FUTEX_WAITERS) | assume_other_futex_waiters;
 #endif
 
-	      newval
-		= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
-						       newval, oldval);
-
-	      if (newval != oldval)
+	      int val = atomic_load_relaxed (&mutex->__data.__lock);
+	      if (val != oldval
+		  || ((val = atomic_compare_and_exchange_val_acq
+			 (&mutex->__data.__lock, newval, oldval))
+		      != oldval))
 		{
-		  oldval = newval;
+		  oldval = val;
 		  continue;
 		}
 
@@ -411,11 +411,15 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 # ifdef NO_INCR
 	newval |= FUTEX_WAITERS;
 # endif
+	oldval = atomic_load_relaxed (&mutex->__data.__lock);
+	if (oldval != 0)
+	  goto locked_mutex;
 	oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 						      newval, 0);
 
 	if (oldval != 0)
 	  {
+ locked_mutex:;
 	    /* The mutex is locked.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust
@@ -554,6 +558,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	    ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
 	    oldprio = ceiling;
 
+	    oldval = atomic_load_relaxed (&mutex->__data.__lock);
+	    if (oldval != ceilval)
+	      goto ceilval_failed;
+
 	    oldval
 	      = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 #ifdef NO_INCR
@@ -568,10 +576,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 	    do
 	      {
-		oldval
-		  = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
-							 ceilval | 2,
-							 ceilval | 1);
+	        oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ ceilval_failed:
+		if (oldval == (ceilval | 1))
+		  oldval
+		    = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+							   ceilval | 2,
+							   ceilval | 1);
 
 		if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
 		  break;
@@ -581,9 +592,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 			      ceilval | 2,
 			      PTHREAD_MUTEX_PSHARED (mutex));
 	      }
-	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
-							ceilval | 2, ceilval)
-		   != ceilval);
+	    while (atomic_load_relaxed (&mutex->__data.__lock) != ceilval
+		   || (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+							    ceilval | 2, ceilval)
+		       != ceilval));
 	  }
 	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
 
-- 
2.33.1



* [PATCH v5 3/3] Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-10 18:41 [PATCH v5 0/3] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 1/3] Reduce CAS in low level locks " H.J. Lu
  2021-11-10 18:41 ` [PATCH v5 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
@ 2021-11-10 18:41 ` H.J. Lu
  2021-11-10 20:23 ` [PATCH v5 0/3] Optimize CAS " Paul A. Clarke
  3 siblings, 0 replies; 6+ messages in thread
From: H.J. Lu @ 2021-11-10 18:41 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
	Paul A . Clarke, Noah Goldstein

1. Do an atomic load first and skip the CAS when the loaded value shows
the compare would fail, to reduce cache line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq so the old value is returned
directly and the extra load is avoided.
---
 nptl/pthread_mutex_lock.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index d7e8efedd2..24ff1772cd 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -297,12 +297,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	     meantime.  */
 	  if ((oldval & FUTEX_WAITERS) == 0)
 	    {
-	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
-							oldval | FUTEX_WAITERS,
-							oldval)
-		  != 0)
+	      int val = atomic_load_relaxed (&mutex->__data.__lock);
+	      if (val != oldval
+		  || ((val = atomic_compare_and_exchange_val_acq
+			 (&mutex->__data.__lock, oldval | FUTEX_WAITERS,
+			  oldval)) != oldval))
 		{
-		  oldval = mutex->__data.__lock;
+		  oldval = val;
 		  continue;
 		}
 	      oldval |= FUTEX_WAITERS;
-- 
2.33.1



* Re: [PATCH v5 0/3] Optimize CAS [BZ #28537]
  2021-11-10 18:41 [PATCH v5 0/3] Optimize CAS [BZ #28537] H.J. Lu
                   ` (2 preceding siblings ...)
  2021-11-10 18:41 ` [PATCH v5 3/3] Optimize " H.J. Lu
@ 2021-11-10 20:23 ` Paul A. Clarke
  2021-11-10 20:53   ` H.J. Lu
  3 siblings, 1 reply; 6+ messages in thread
From: Paul A. Clarke @ 2021-11-10 20:23 UTC (permalink / raw)
  To: H.J. Lu
  Cc: libc-alpha, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Noah Goldstein

On Wed, Nov 10, 2021 at 10:41:50AM -0800, H.J. Lu wrote:
> Changes in v5:
> 
> 1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
> 2. Remove an atomic load in a CAS usage that has already been optimized.
> 3. Add an empty statement (a lone semicolon) after a goto label for
> older compiler versions.
> 4. Simplify the CAS optimization.
> 
> The CAS instruction is expensive.  From the x86 CPU's point of view,
> getting a cache line for writing is more expensive than getting it for
> reading.  See Appendix A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf 
> 
> A full compare-and-swap grabs the cache line in exclusive state and
> causes excessive cache line bouncing.
> 
> Optimize CAS in the low level locks and in pthread_mutex_lock.c:
> 
> 1. Do an atomic load first and skip the CAS when the loaded value shows
> the compare would fail, to reduce cache line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq so the old value is returned
> directly and the extra load is avoided.
> 
> This is the first patch set to optimize CAS.  I will submit the rest of
> the CAS optimizations in glibc after this patch set has been accepted.
> 
> With all CAS optimizations applied, on a machine with 112 cores,
> "make check -j28" under heavy load took
> 
> 3093.18user 1644.12system 22:26.05elapsed 351%CPU
> 
> vs without CAS optimizations
> 
> 3746.07user 1614.93system 22:02.91elapsed 405%CPU

I read that as about 2% slower with your changes. Is that the desired result?

PC


* Re: [PATCH v5 0/3] Optimize CAS [BZ #28537]
  2021-11-10 20:23 ` [PATCH v5 0/3] Optimize CAS " Paul A. Clarke
@ 2021-11-10 20:53   ` H.J. Lu
  0 siblings, 0 replies; 6+ messages in thread
From: H.J. Lu @ 2021-11-10 20:53 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: GNU C Library, Florian Weimer, Oleh Derevenko, Arjan van de Ven,
	Andreas Schwab, Noah Goldstein

On Wed, Nov 10, 2021 at 12:23 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 10:41:50AM -0800, H.J. Lu wrote:
> > Changes in v5:
> >
> > 1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
> > 2. Remove an atomic load in a CAS usage that has already been optimized.
> > 3. Add an empty statement (a lone semicolon) after a goto label for
> > older compiler versions.
> > 4. Simplify the CAS optimization.
> >
> > The CAS instruction is expensive.  From the x86 CPU's point of view,
> > getting a cache line for writing is more expensive than getting it for
> > reading.  See Appendix A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > A full compare-and-swap grabs the cache line in exclusive state and
> > causes excessive cache line bouncing.
> >
> > Optimize CAS in the low level locks and in pthread_mutex_lock.c:
> >
> > 1. Do an atomic load first and skip the CAS when the loaded value shows
> > the compare would fail, to reduce cache line bouncing on contended locks.
> > 2. Replace atomic_compare_and_exchange_bool_acq with
> > atomic_compare_and_exchange_val_acq so the old value is returned
> > directly and the extra load is avoided.
> >
> > This is the first patch set to optimize CAS.  I will submit the rest of
> > the CAS optimizations in glibc after this patch set has been accepted.
> >
> > With all CAS optimizations applied, on a machine with 112 cores,
> > "make check -j28" under heavy load took
> >
> > 3093.18user 1644.12system 22:26.05elapsed 351%CPU
> >
> > vs without CAS optimizations
> >
> > 3746.07user 1614.93system 22:02.91elapsed 405%CPU
>
> I read that as about 2% slower with your changes. Is that the desired result?
>

The machine was under heavy load.  My reading is that with the CAS
optimizations it takes fewer CPU cycles and reduces CPU utilization by
about 13% (from 405% CPU to 351% CPU).  That saves power and leaves CPU
cycles for other tasks.

-- 
H.J.
