* [PATCH v4 0/3] Optimize CAS [BZ #28537]
@ 2021-11-10  0:16 H.J. Lu
  2021-11-10  0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10  0:16 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab

The CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare and swap grabs the cache line in exclusive state and
causes excessive cache line bouncing.

Optimize CAS in low level locks and pthread_mutex_lock.c:

1. Do an atomic load first and skip the CAS if the compare would fail,
to reduce cache line bouncing on contended locks (see the sketch after
this list).
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq to avoid the extra load.
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
don't know whether a failed CAS is actually rare; in the contended case
it is clearly not.
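
As an illustration of 1, here is a minimal sketch of the load-before-CAS
(test and test-and-set) pattern in plain C11 atomics.  This shows the
idea only; it is not the glibc code:

  #include <stdatomic.h>

  /* Spin until the lock is observed free, then attempt the CAS.  The
     relaxed load keeps the cache line in a shared state; only a CAS
     that is expected to succeed requests the line for writing.  */
  static void
  sketch_lock (atomic_int *lock)
  {
    for (;;)
      if (atomic_load_explicit (lock, memory_order_relaxed) == 0)
        {
          int expected = 0;
          if (atomic_compare_exchange_weak_explicit
              (lock, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
            return;
        }
  }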

This is the first patch set to optimize CAS.  I will investigate the
remaining CAS usages in glibc after this patch set has been accepted.

H.J. Lu (3):
  Reduce CAS in low level locks [BZ #28537]
  Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  Optimize CAS in __pthread_mutex_lock_full [BZ #28537]

 nptl/lowlevellock.c         | 12 ++++-----
 nptl/pthread_mutex_lock.c   | 53 ++++++++++++++++++++++++++++---------
 sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
 3 files changed, 67 insertions(+), 27 deletions(-)

-- 
2.33.1



* [PATCH v4 1/3] Reduce CAS in low level locks [BZ #28537]
  2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
@ 2021-11-10  0:16 ` H.J. Lu
  2021-11-10  1:56   ` Noah Goldstein
  2021-11-10  0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10  0:16 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab

The CAS instruction is expensive.  From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading.  See Appendix
A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

A full compare and swap grabs the cache line in exclusive state and
causes excessive cache line bouncing.

1. Change low level locks to do an atomic load first and skip the CAS
if the compare would fail, to reduce cache line bouncing on contended
locks.
2. In __lll_lock, replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq and pass the result down to
__lll_lock_wait and __lll_lock_wait_private to avoid the redundant load
there (see the sketch after this list).
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
don't know whether a failed CAS is actually rare; in the contended case
it is clearly not.
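
To illustrate 2, a sketch of the difference (illustrative fragment
using glibc's macro names, not the exact patch): the bool variant
discards the value it observed, so the failing path must reload the
lock word, while the val variant returns the observed value and can
pass it straight to the wait function:

  /* Before: a failed CAS forces a second load of *futex.  */
  if (atomic_compare_and_exchange_bool_acq (futex, 1, 0))
    __lll_lock_wait (futex, private);      /* reloads *futex inside */

  /* After: the value observed by the CAS travels with the call.  */
  int value = atomic_compare_and_exchange_val_acq (futex, 1, 0);
  if (value != 0)
    __lll_lock_wait (futex, value, private);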
---
 nptl/lowlevellock.c         | 12 ++++++------
 sysdeps/nptl/lowlevellock.h | 29 ++++++++++++++++++++---------
 2 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/nptl/lowlevellock.c b/nptl/lowlevellock.c
index 8c740529c4..d1965c01ca 100644
--- a/nptl/lowlevellock.c
+++ b/nptl/lowlevellock.c
@@ -22,30 +22,30 @@
 #include <stap-probe.h>
 
 void
-__lll_lock_wait_private (int *futex)
+__lll_lock_wait_private (int *futex, int futex_value)
 {
-  if (atomic_load_relaxed (futex) == 2)
+  if (futex_value == 2)
     goto futex;
 
   while (atomic_exchange_acquire (futex, 2) != 0)
     {
     futex:
-      LIBC_PROBE (lll_lock_wait_private, 1, futex);
+      LIBC_PROBE (lll_lock_wait_private, 2, futex, futex_value);
       futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2.  */
     }
 }
 libc_hidden_def (__lll_lock_wait_private)
 
 void
-__lll_lock_wait (int *futex, int private)
+__lll_lock_wait (int *futex, int futex_value, int private)
 {
-  if (atomic_load_relaxed (futex) == 2)
+  if (futex_value == 2)
     goto futex;
 
   while (atomic_exchange_acquire (futex, 2) != 0)
     {
     futex:
-      LIBC_PROBE (lll_lock_wait, 1, futex);
+      LIBC_PROBE (lll_lock_wait, 2, futex, futex_value);
       futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2.  */
     }
 }
diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 4d95114ed3..4235d13de9 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -66,7 +66,11 @@
    0.  Otherwise leave lock unchanged and return non-zero to indicate that the
    lock was not acquired.  */
 #define __lll_trylock(lock)	\
-  __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock), 1, 0))
+  (__extension__ ({							\
+    __typeof (*(lock)) __lock_value = atomic_load_relaxed (lock);	\
+    (__lock_value != 0							\
+     || atomic_compare_and_exchange_bool_acq ((lock), 1, 0));		\
+  }))
 #define lll_trylock(lock)	\
    __lll_trylock (&(lock))
 
@@ -74,11 +78,15 @@
    return 0.  Otherwise leave lock unchanged and return non-zero to indicate
    that the lock was not acquired.  */
 #define lll_cond_trylock(lock)	\
-  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
+  (__extension__ ({							\
+    __typeof (lock) __lock_value = atomic_load_relaxed (&(lock));	\
+    (__lock_value != 0							\
+     || atomic_compare_and_exchange_bool_acq (&(lock), 2, 0));		\
+  }))
 
-extern void __lll_lock_wait_private (int *futex);
+extern void __lll_lock_wait_private (int *futex, int futex_value);
 libc_hidden_proto (__lll_lock_wait_private)
-extern void __lll_lock_wait (int *futex, int private);
+extern void __lll_lock_wait (int *futex, int futex_value, int private);
 libc_hidden_proto (__lll_lock_wait)
 
 /* This is an expression rather than a statement even though its value is
@@ -95,13 +103,15 @@ libc_hidden_proto (__lll_lock_wait)
   ((void)                                                               \
    ({                                                                   \
      int *__futex = (futex);                                            \
-     if (__glibc_unlikely                                               \
-         (atomic_compare_and_exchange_bool_acq (__futex, 1, 0)))        \
+     int __futex_value = atomic_load_relaxed (__futex);                 \
+     if (__futex_value != 0                                             \
+         || ((__futex_value = atomic_compare_and_exchange_val_acq       \
+              (__futex, 1, 0)) != 0))                                   \
        {                                                                \
          if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
-           __lll_lock_wait_private (__futex);                           \
+           __lll_lock_wait_private (__futex, __futex_value);            \
          else                                                           \
-           __lll_lock_wait (__futex, private);                          \
+           __lll_lock_wait (__futex, __futex_value, private);           \
        }                                                                \
    }))
 #define lll_lock(futex, private)	\
@@ -120,7 +130,8 @@ libc_hidden_proto (__lll_lock_wait)
    ({                                                                   \
      int *__futex = (futex);                                            \
      if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))      \
-       __lll_lock_wait (__futex, private);                              \
+       __lll_lock_wait (__futex, atomic_load_relaxed (__futex),         \
+			private); \
    }))
 #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
 
-- 
2.33.1



* [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-10  0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
@ 2021-11-10  0:16 ` H.J. Lu
  2021-11-10  0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10  0:16 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab

Change __pthread_mutex_lock_full to do an atomic load first and skip
the CAS if the compare would fail, to reduce cache line bouncing on
contended locks.  A sketch of the retry pattern follows.
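
A stand-alone sketch of the pattern (illustrative names, not the glibc
code): re-read the lock word before each CAS and hand a changed value
back to the caller's dispatch loop instead of issuing a CAS that is
bound to fail:

  #include <stdatomic.h>
  #include <stdbool.h>

  /* CAS *lock from *oldval to newval, but only if the lock word still
     holds *oldval.  On failure *oldval is updated to the observed
     value so the caller can re-dispatch on the fresh state, mirroring
     what atomic_compare_and_exchange_val_acq reports.  */
  static bool
  cas_if_still (atomic_int *lock, int *oldval, int newval)
  {
    int currval = atomic_load_explicit (lock, memory_order_relaxed);
    if (currval != *oldval)
      {
        *oldval = currval;      /* skip the CAS; it would fail */
        return false;
      }
    return atomic_compare_exchange_strong_explicit
      (lock, oldval, newval, memory_order_acquire,
       memory_order_relaxed);
  }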
---
 nptl/pthread_mutex_lock.c | 38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 2bd41767e0..1126ecba95 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -204,6 +204,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	     our TID | assume_other_futex_waiters.  */
 	  if (__glibc_likely (oldval == 0))
 	    {
+	      oldval = atomic_load_relaxed (&mutex->__data.__lock);
+	      if (oldval != 0)
+		goto tid_failed;
+
 	      oldval
 	        = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 	            id | assume_other_futex_waiters, 0);
@@ -213,6 +217,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 	  if ((oldval & FUTEX_OWNER_DIED) != 0)
 	    {
+	      int currval = atomic_load_relaxed (&mutex->__data.__lock);
+	      if (currval != oldval)
+		{
+		  oldval = currval;
+		  continue;
+		}
+
 	      /* The previous owner died.  Try locking the mutex.  */
 	      int newval = id;
 #ifdef NO_INCR
@@ -259,6 +270,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	      return EOWNERDEAD;
 	    }
 
+ tid_failed:
 	  /* Check whether we already hold the mutex.  */
 	  if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
 	    {
@@ -411,11 +423,15 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 # ifdef NO_INCR
 	newval |= FUTEX_WAITERS;
 # endif
+	oldval = atomic_load_relaxed (&mutex->__data.__lock);
+	if (oldval != 0)
+	  goto locked_mutex;
 	oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 						      newval, 0);
 
 	if (oldval != 0)
 	  {
+ locked_mutex:
 	    /* The mutex is locked.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust
@@ -554,6 +570,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	    ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
 	    oldprio = ceiling;
 
+	    oldval = atomic_load_relaxed (&mutex->__data.__lock);
+	    if (oldval != ceilval)
+	      goto ceilval_failed;
+
 	    oldval
 	      = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
 #ifdef NO_INCR
@@ -568,10 +588,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 
 	    do
 	      {
-		oldval
-		  = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
-							 ceilval | 2,
-							 ceilval | 1);
+	        oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ ceilval_failed:
+		if (oldval == (ceilval | 1))
+		  oldval
+		    = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+							   ceilval | 2,
+							   ceilval | 1);
 
 		if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
 		  break;
@@ -581,9 +604,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 			      ceilval | 2,
 			      PTHREAD_MUTEX_PSHARED (mutex));
 	      }
-	    while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
-							ceilval | 2, ceilval)
-		   != ceilval);
+	    while (atomic_load_relaxed (&mutex->__data.__lock) != ceilval
+		   || (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+							    ceilval | 2, ceilval)
+		       != ceilval));
 	  }
 	while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
 
-- 
2.33.1



* [PATCH v4 3/3] Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
  2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
  2021-11-10  0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
  2021-11-10  0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
@ 2021-11-10  0:16 ` H.J. Lu
  2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
  2021-11-10 15:35 ` Paul A. Clarke
  4 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10  0:16 UTC (permalink / raw)
  To: libc-alpha
  Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab

1. Do an atomic load first and skip the CAS if the compare would fail,
to reduce cache line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq to avoid the extra load.
---
 nptl/pthread_mutex_lock.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1126ecba95..9a18073f29 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -309,12 +309,17 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
 	     meantime.  */
 	  if ((oldval & FUTEX_WAITERS) == 0)
 	    {
-	      if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
-							oldval | FUTEX_WAITERS,
-							oldval)
-		  != 0)
+	      int currval = atomic_load_relaxed (&mutex->__data.__lock);
+	      if (currval != oldval)
+		{
+		  oldval = currval;
+		  continue;
+		}
+	      int newval = atomic_compare_and_exchange_val_acq
+		(&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
+	      if (newval != oldval)
 		{
-		  oldval = mutex->__data.__lock;
+		  oldval = newval;
 		  continue;
 		}
 	      oldval |= FUTEX_WAITERS;
-- 
2.33.1



* Re: [PATCH v4 1/3] Reduce CAS in low level locks [BZ #28537]
  2021-11-10  0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
@ 2021-11-10  1:56   ` Noah Goldstein
  0 siblings, 0 replies; 18+ messages in thread
From: Noah Goldstein @ 2021-11-10  1:56 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab

On Tue, Nov 9, 2021 at 6:21 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> 1. Change low level locks to do an atomic load and skip CAS if compare
> may fail to reduce cache line bouncing on contended locks.
> 2. In __lll_lock, replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq and pass down the result to
> __lll_lock_wait and __lll_lock_wait_private to avoid the redundant load
> there.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.

I think, especially since you have a manual check of old vs. new, the
atomic CAS failure should still be wrapped in `__glibc_unlikely`.
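
Something like this (illustrative only, reusing the patch's names):

  int value = atomic_load_relaxed (futex);
  if (value != 0
      || __glibc_unlikely ((value = atomic_compare_and_exchange_val_acq
                              (futex, 1, 0)) != 0))
    __lll_lock_wait (futex, value, private);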

> ---
>  nptl/lowlevellock.c         | 12 ++++++------
>  sysdeps/nptl/lowlevellock.h | 29 ++++++++++++++++++++---------
>  2 files changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/nptl/lowlevellock.c b/nptl/lowlevellock.c
> index 8c740529c4..d1965c01ca 100644
> --- a/nptl/lowlevellock.c
> +++ b/nptl/lowlevellock.c
> @@ -22,30 +22,30 @@
>  #include <stap-probe.h>
>
>  void
> -__lll_lock_wait_private (int *futex)
> +__lll_lock_wait_private (int *futex, int futex_value)
>  {
> -  if (atomic_load_relaxed (futex) == 2)
> +  if (futex_value == 2)
>      goto futex;
>
>    while (atomic_exchange_acquire (futex, 2) != 0)
>      {
>      futex:
> -      LIBC_PROBE (lll_lock_wait_private, 1, futex);
> +      LIBC_PROBE (lll_lock_wait_private, 2, futex, futex_value);
>        futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2.  */
>      }
>  }
>  libc_hidden_def (__lll_lock_wait_private)
>
>  void
> -__lll_lock_wait (int *futex, int private)
> +__lll_lock_wait (int *futex, int futex_value, int private)
>  {
> -  if (atomic_load_relaxed (futex) == 2)
> +  if (futex_value == 2)
>      goto futex;
>
>    while (atomic_exchange_acquire (futex, 2) != 0)
>      {
>      futex:
> -      LIBC_PROBE (lll_lock_wait, 1, futex);
> +      LIBC_PROBE (lll_lock_wait, 2, futex, futex_value);
>        futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2.  */
>      }
>  }
> diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
> index 4d95114ed3..4235d13de9 100644
> --- a/sysdeps/nptl/lowlevellock.h
> +++ b/sysdeps/nptl/lowlevellock.h
> @@ -66,7 +66,11 @@
>     0.  Otherwise leave lock unchanged and return non-zero to indicate that the
>     lock was not acquired.  */
>  #define __lll_trylock(lock)    \
> -  __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock), 1, 0))
> +  (__extension__ ({                                                    \
> +    __typeof (*(lock)) __lock_value = atomic_load_relaxed (lock);      \
> +    (__lock_value != 0                                                 \
> +     || atomic_compare_and_exchange_bool_acq ((lock), 1, 0));          \
> +  }))
>  #define lll_trylock(lock)      \
>     __lll_trylock (&(lock))
>
> @@ -74,11 +78,15 @@
>     return 0.  Otherwise leave lock unchanged and return non-zero to indicate
>     that the lock was not acquired.  */
>  #define lll_cond_trylock(lock) \
> -  __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
> +  (__extension__ ({                                                    \
> +    __typeof (lock) __lock_value = atomic_load_relaxed (&(lock));      \
> +    (__lock_value != 0                                                 \
> +     || atomic_compare_and_exchange_bool_acq (&(lock), 2, 0));         \
> +  }))
>
> -extern void __lll_lock_wait_private (int *futex);
> +extern void __lll_lock_wait_private (int *futex, int futex_value);
>  libc_hidden_proto (__lll_lock_wait_private)
> -extern void __lll_lock_wait (int *futex, int private);
> +extern void __lll_lock_wait (int *futex, int futex_value, int private);
>  libc_hidden_proto (__lll_lock_wait)
>
>  /* This is an expression rather than a statement even though its value is
> @@ -95,13 +103,15 @@ libc_hidden_proto (__lll_lock_wait)
>    ((void)                                                               \
>     ({                                                                   \
>       int *__futex = (futex);                                            \
> -     if (__glibc_unlikely                                               \
> -         (atomic_compare_and_exchange_bool_acq (__futex, 1, 0)))        \
> +     int __futex_value = atomic_load_relaxed (futex);                   \
> +     if (__futex_value != 0                                             \
> +         || ((__futex_value = atomic_compare_and_exchange_val_acq       \
> +              (__futex, 1, 0) != 0)))                                   \
>         {                                                                \
>           if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
> -           __lll_lock_wait_private (__futex);                           \
> +           __lll_lock_wait_private (futex, __futex_value);              \
>           else                                                           \
> -           __lll_lock_wait (__futex, private);                          \
> +           __lll_lock_wait (futex, __futex_value, private);             \
>         }                                                                \
>     }))
>  #define lll_lock(futex, private)       \
> @@ -120,7 +130,8 @@ libc_hidden_proto (__lll_lock_wait)
>     ({                                                                   \
>       int *__futex = (futex);                                            \
>       if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0))      \
> -       __lll_lock_wait (__futex, private);                              \
> +       __lll_lock_wait (__futex, atomic_load_relaxed (__futex),         \
> +                       private); \
>     }))
>  #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
>
> --
> 2.33.1
>


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
                   ` (2 preceding siblings ...)
  2021-11-10  0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
@ 2021-11-10 14:26 ` Paul E Murphy
  2021-11-10 20:07   ` Paul A. Clarke
  2021-11-10 15:35 ` Paul A. Clarke
  4 siblings, 1 reply; 18+ messages in thread
From: Paul E Murphy @ 2021-11-10 14:26 UTC (permalink / raw)
  To: H.J. Lu, libc-alpha; +Cc: Florian Weimer, Arjan van de Ven, Andreas Schwab

On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> 
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> 
> Optimize CAS in low level locks and pthread_mutex_lock.c:
> 
> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq to avoid the extra load.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.

Are you able to share benchmarks of this change? I am curious what 
effects this might have on other platforms.

> 
> This is the first patch set to optimize CAS.  I will investigate the rest
> CAS usages in glibc after this patch set has been accepted.
> 
> H.J. Lu (3):
>    Reduce CAS in low level locks [BZ #28537]
>    Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
>    Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
> 
>   nptl/lowlevellock.c         | 12 ++++-----
>   nptl/pthread_mutex_lock.c   | 53 ++++++++++++++++++++++++++++---------
>   sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
>   3 files changed, 67 insertions(+), 27 deletions(-)
> 


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
                   ` (3 preceding siblings ...)
  2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
@ 2021-11-10 15:35 ` Paul A. Clarke
  2021-11-10 15:42   ` H.J. Lu
  2021-11-10 15:52   ` Florian Weimer
  4 siblings, 2 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 15:35 UTC (permalink / raw)
  To: H.J. Lu; +Cc: libc-alpha, Florian Weimer, Arjan van de Ven, Andreas Schwab

On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading.  See Appendix
> A.2 Spinlock in:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> 
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
> 
> Optimize CAS in low level locks and pthread_mutex_lock.c:
> 
> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq to avoid the extra load.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.

I see build errors:

In file included from pthread_mutex_cond_lock.c:23:
../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
      int private = (robust
      ^~~
../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
      int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed  */,
      ^~~
../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
      if (e == ESRCH || e == EDEADLK)
          ^
../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
In file included from ../include/assert.h:1,
                 from ../nptl/pthread_mutex_lock.c:18,
                 from pthread_mutex_cond_lock.c:23:
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
                                  ^
../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
   assert (e != EDEADLK
   ^~~~~~
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
   ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
                                  ^
../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
   assert (e != ESRCH || !robust);
   ^~~~~~

PC


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:35 ` Paul A. Clarke
@ 2021-11-10 15:42   ` H.J. Lu
  2021-11-10 15:50     ` Paul A. Clarke
  2021-11-10 15:52   ` Florian Weimer
  1 sibling, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 15:42 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab

On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading.  See Appendix
> > A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> >
> > Optimize CAS in low level locks and pthread_mutex_lock.c:
> >
> > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > line bouncing on contended locks.
> > 2. Replace atomic_compare_and_exchange_bool_acq with
> > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > don't know if it's actually rare; in the contended case it is clearly not
> > rare.
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>       int private = (robust
>       ^~~
> ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
>       int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed  */,
>       ^~~
> ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
>       if (e == ESRCH || e == EDEADLK)
>           ^
> ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> In file included from ../include/assert.h:1,
>                  from ../nptl/pthread_mutex_lock.c:18,
>                  from pthread_mutex_cond_lock.c:23:
> ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
>    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
>                                   ^
> ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
>    assert (e != EDEADLK
>    ^~~~~~
> ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
>    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
>                                   ^
> ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
>    assert (e != ESRCH || !robust);
>    ^~~~~~
>
> PC

On PPC?

-- 
H.J.


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:42   ` H.J. Lu
@ 2021-11-10 15:50     ` Paul A. Clarke
  2021-11-10 15:52       ` H.J. Lu
  0 siblings, 1 reply; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 15:50 UTC (permalink / raw)
  To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab

On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf 
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> >       int private = (robust
> >       ^~~
> > ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
> >       int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed  */,
> >       ^~~
> > ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
> >       if (e == ESRCH || e == EDEADLK)
> >           ^
> > ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> > In file included from ../include/assert.h:1,
> >                  from ../nptl/pthread_mutex_lock.c:18,
> >                  from pthread_mutex_cond_lock.c:23:
> > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> >    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
> >                                   ^
> > ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
> >    assert (e != EDEADLK
> >    ^~~~~~
> > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> >    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
> >                                   ^
> > ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
> >    assert (e != ESRCH || !robust);
> >    ^~~~~~
> 
> On PPC?

Yes, powerpc64le.  Failing command:
gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm -I../sysdeps/ieee754/ldbl-opt -I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o

PC


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:50     ` Paul A. Clarke
@ 2021-11-10 15:52       ` H.J. Lu
  0 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 15:52 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab

On Wed, Nov 10, 2021 at 7:50 AM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> > On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> > >
> > > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > I see build errors:
> > >
> > > In file included from pthread_mutex_cond_lock.c:23:
> > > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > >       int private = (robust
> > >       ^~~
> > > ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
> > >       int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed  */,
> > >       ^~~
> > > ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
> > >       if (e == ESRCH || e == EDEADLK)
> > >           ^
> > > ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> > > In file included from ../include/assert.h:1,
> > >                  from ../nptl/pthread_mutex_lock.c:18,
> > >                  from pthread_mutex_cond_lock.c:23:
> > > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > >    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
> > >                                   ^
> > > ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
> > >    assert (e != EDEADLK
> > >    ^~~~~~
> > > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > >    ((void) sizeof ((expr) ? 1 : 0), __extension__ ({   \
> > >                                   ^
> > > ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
> > >    assert (e != ESRCH || !robust);
> > >    ^~~~~~
> >
> > On PPC?
>
> Yes, powerpc64le.  Failing command:
> gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm -I../sysdeps/ieee754/ldbl-opt -I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o
>
> PC

I am testing it now.


-- 
H.J.


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:35 ` Paul A. Clarke
  2021-11-10 15:42   ` H.J. Lu
@ 2021-11-10 15:52   ` Florian Weimer
  2021-11-10 16:03     ` H.J. Lu
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: Florian Weimer @ 2021-11-10 15:52 UTC (permalink / raw)
  To: Paul A. Clarke; +Cc: H.J. Lu, libc-alpha, Arjan van de Ven, Andreas Schwab

* Paul A. Clarke:

> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
>> CAS instruction is expensive.  From the x86 CPU's point of view, getting
>> a cache line for writing is more expensive than reading.  See Appendix
>> A.2 Spinlock in:
>> 
>> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>> 
>> The full compare and swap will grab the cache line exclusive and cause
>> excessive cache line bouncing.
>> 
>> Optimize CAS in low level locks and pthread_mutex_lock.c:
>> 
>> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
>> line bouncing on contended locks.
>> 2. Replace atomic_compare_and_exchange_bool_acq with
>> atomic_compare_and_exchange_val_acq to avoid the extra load.
>> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
>> don't know if it's actually rare; in the contended case it is clearly not
>> rare.
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
>       int private = (robust
>       ^~~

The patch has:

+ locked_mutex:
 	    /* The mutex is locked.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust

This is only supported in recent C versions; I think the workaround is
to add an empty statement with a semicolon, like this:

+ locked_mutex:;
 	    /* The mutex is locked.  The kernel will now take care of
 	       everything.  */
 	    int private = (robust

Thanks,
Florian



* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:52   ` Florian Weimer
@ 2021-11-10 16:03     ` H.J. Lu
  2021-11-10 16:04     ` Paul A. Clarke
  2021-11-10 16:14     ` Andreas Schwab
  2 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 16:03 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Paul A. Clarke, GNU C Library, Arjan van de Ven, Andreas Schwab

On Wed, Nov 10, 2021 at 7:52 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Paul A. Clarke:
>
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> >> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> >> a cache line for writing is more expensive than reading.  See Appendix
> >> A.2 Spinlock in:
> >>
> >> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >>
> >> The full compare and swap will grab the cache line exclusive and cause
> >> excessive cache line bouncing.
> >>
> >> Optimize CAS in low level locks and pthread_mutex_lock.c:
> >>
> >> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> >> line bouncing on contended locks.
> >> 2. Replace atomic_compare_and_exchange_bool_acq with
> >> atomic_compare_and_exchange_val_acq to avoid the extra load.
> >> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> >> don't know if it's actually rare; in the contended case it is clearly not
> >> rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> >       int private = (robust
> >       ^~~
>
> The patch has:
>
> + locked_mutex:
>             /* The mutex is locked.  The kernel will now take care of
>                everything.  */
>             int private = (robust
>
> This is only supported in recent C versions, I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> + locked_mutex:;
>             /* The mutex is locked.  The kernel will now take care of
>                everything.  */
>             int private = (robust
>
> Thanks,
> Florian
>

Can you try the users/hjl/x86/atomic-nptl branch:

https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/x86/atomic-nptl

I fixed a couple of issues and added more CAS optimizations.

-- 
H.J.


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:52   ` Florian Weimer
  2021-11-10 16:03     ` H.J. Lu
@ 2021-11-10 16:04     ` Paul A. Clarke
  2021-11-10 16:14     ` Andreas Schwab
  2 siblings, 0 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 16:04 UTC (permalink / raw)
  To: Florian Weimer; +Cc: H.J. Lu, libc-alpha, Arjan van de Ven, Andreas Schwab

On Wed, Nov 10, 2021 at 04:52:28PM +0100, Florian Weimer wrote:
> * Paul A. Clarke:
> 
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> >> CAS instruction is expensive.  From the x86 CPU's point of view, getting
> >> a cache line for writing is more expensive than reading.  See Appendix
> >> A.2 Spinlock in:
> >> 
> >> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf 
> >> 
> >> The full compare and swap will grab the cache line exclusive and cause
> >> excessive cache line bouncing.
> >> 
> >> Optimize CAS in low level locks and pthread_mutex_lock.c:
> >> 
> >> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> >> line bouncing on contended locks.
> >> 2. Replace atomic_compare_and_exchange_bool_acq with
> >> atomic_compare_and_exchange_val_acq to avoid the extra load.
> >> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> >> don't know if it's actually rare; in the contended case it is clearly not
> >> rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> >       int private = (robust
> >       ^~~
> 
> The patch has:
> 
> + locked_mutex:
>  	    /* The mutex is locked.  The kernel will now take care of
>  	       everything.  */
>  	    int private = (robust
> 
> This is only supported in recent C versions, I think the workaround is
> to add an empty statement with a semicolon, like this:
> 
> + locked_mutex:;
>  	    /* The mutex is locked.  The kernel will now take care of
>  	       everything.  */
>  	    int private = (robust

This change resolved the issue.

PC


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 15:52   ` Florian Weimer
  2021-11-10 16:03     ` H.J. Lu
  2021-11-10 16:04     ` Paul A. Clarke
@ 2021-11-10 16:14     ` Andreas Schwab
  2 siblings, 0 replies; 18+ messages in thread
From: Andreas Schwab @ 2021-11-10 16:14 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Paul A. Clarke, H.J. Lu, libc-alpha, Arjan van de Ven

On Nov 10 2021, Florian Weimer wrote:

> This is only supported in recent C versions

Only in C++.
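
A minimal reproducer, for illustration (mine, not from the reported
build; C17 rejects it with the error quoted above):

  int
  f (int x)
  {
   out:
    int y = x + 1;   /* C17: error, a label must precede a statement */
    return y;
  }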

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
@ 2021-11-10 20:07   ` Paul A. Clarke
  2021-11-10 21:33     ` H.J. Lu
  2021-11-10 23:34     ` H.J. Lu
  0 siblings, 2 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 20:07 UTC (permalink / raw)
  To: Paul E Murphy
  Cc: H.J. Lu, libc-alpha, Florian Weimer, Andreas Schwab, Arjan van de Ven

On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading.  See Appendix
> > A.2 Spinlock in:
> > 
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > 
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> > 
> > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > 
> > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > line bouncing on contended locks.
> > 2. Replace atomic_compare_and_exchange_bool_acq with
> > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > don't know if it's actually rare; in the contended case it is clearly not
> > rare.
> 
> Are you able to share benchmarks of this change? I am curious what effects
> this might have on other platforms.

I'd like to see the expected performance results, too.

For me, the results are not uniformly positive (Power10).
From bench-pthread-locks (the last column is the improvement over the
unpatched build):

                         bench   bench-patched   delta
mutex-empty              4.73371 4.54792   3.9%
mutex-filler             18.5395 18.3419   1.1%
mutex_trylock-empty      10.46   2.46364  76.4%
mutex_trylock-filler     16.2188 16.1758   0.3%
rwlock_read-empty        16.5118 16.4681   0.3%
rwlock_read-filler       20.68   20.4416   1.2%
rwlock_tryread-empty     2.06572 2.17284  -5.2%
rwlock_tryread-filler    16.082  16.1215  -0.2%
rwlock_write-empty       31.3723 31.259    0.4%
rwlock_write-filler      41.6492 69.313  -66.4%
rwlock_trywrite-empty    2.20584 2.32178  -5.3%
rwlock_trywrite-filler   15.7044 15.9088  -1.3%
spin_lock-empty          16.7964 16.7731   0.1%
spin_lock-filler         20.6118 20.4175   0.9%
spin_trylock-empty       8.99989 8.98879   0.1%
spin_trylock-filler      16.4732 15.9957   2.9%
sem_wait-empty           15.805  15.7391   0.4%
sem_wait-filler          19.2346 19.5098  -1.4%
sem_trywait-empty        2.06405 2.03782   1.3%
sem_trywait-filler       15.921  15.8408   0.5%
condvar-empty            1385.84 1387.29  -0.1%
condvar-filler           1419.82 1424.01  -0.3%
consumer_producer-empty  2550.01 2395.29   6.1%
consumer_producer-filler 2709.4  2558.28   5.6%

PC


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 20:07   ` Paul A. Clarke
@ 2021-11-10 21:33     ` H.J. Lu
  2021-11-11  0:30       ` Paul A. Clarke
  2021-11-10 23:34     ` H.J. Lu
  1 sibling, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 21:33 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
	Arjan van de Ven

On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                          bench   bench-patched
> mutex-empty              4.73371 4.54792   3.9%
> mutex-filler             18.5395 18.3419   1.1%
> mutex_trylock-empty      10.46   2.46364  76.4%
> mutex_trylock-filler     16.2188 16.1758   0.3%
> rwlock_read-empty        16.5118 16.4681   0.3%
> rwlock_read-filler       20.68   20.4416   1.2%
> rwlock_tryread-empty     2.06572 2.17284  -5.2%
> rwlock_tryread-filler    16.082  16.1215  -0.2%
> rwlock_write-empty       31.3723 31.259    0.4%
> rwlock_write-filler      41.6492 69.313  -66.4%
> rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> spin_lock-empty          16.7964 16.7731   0.1%
> spin_lock-filler         20.6118 20.4175   0.9%
> spin_trylock-empty       8.99989 8.98879   0.1%
> spin_trylock-filler      16.4732 15.9957   2.9%
> sem_wait-empty           15.805  15.7391   0.4%
> sem_wait-filler          19.2346 19.5098  -1.4%
> sem_trywait-empty        2.06405 2.03782   1.3%
> sem_trywait-filler       15.921  15.8408   0.5%
> condvar-empty            1385.84 1387.29  -0.1%
> condvar-filler           1419.82 1424.01  -0.3%
> consumer_producer-empty  2550.01 2395.29   6.1%
> consumer_producer-filler 2709.4  2558.28   5.6%

Small regressions on uncontended locks are expected due to the extra
check.  What do you get with my current branch?

https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

BTW, how did you compare the two results?  I tried compare_bench.py
and got:

Traceback (most recent call last):
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
line 196, in <module>
    main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
  File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
line 165, in main
    bench1 = bench.parse_bench(bench1, schema)
  File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py",
line 137, in parse_bench
    bench = json.load(benchfile)
  File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)


-- 
H.J.


* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 20:07   ` Paul A. Clarke
  2021-11-10 21:33     ` H.J. Lu
@ 2021-11-10 23:34     ` H.J. Lu
  1 sibling, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 23:34 UTC (permalink / raw)
  To: Paul A. Clarke
  Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
	Arjan van de Ven

On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading.  See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
>                          bench   bench-patched
> mutex-empty              4.73371 4.54792   3.9%
> mutex-filler             18.5395 18.3419   1.1%
> mutex_trylock-empty      10.46   2.46364  76.4%
> mutex_trylock-filler     16.2188 16.1758   0.3%
> rwlock_read-empty        16.5118 16.4681   0.3%
> rwlock_read-filler       20.68   20.4416   1.2%
> rwlock_tryread-empty     2.06572 2.17284  -5.2%
> rwlock_tryread-filler    16.082  16.1215  -0.2%
> rwlock_write-empty       31.3723 31.259    0.4%
> rwlock_write-filler      41.6492 69.313  -66.4%
> rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> spin_lock-empty          16.7964 16.7731   0.1%
> spin_lock-filler         20.6118 20.4175   0.9%
> spin_trylock-empty       8.99989 8.98879   0.1%
> spin_trylock-filler      16.4732 15.9957   2.9%
> sem_wait-empty           15.805  15.7391   0.4%
> sem_wait-filler          19.2346 19.5098  -1.4%
> sem_trywait-empty        2.06405 2.03782   1.3%
> sem_trywait-filler       15.921  15.8408   0.5%
> condvar-empty            1385.84 1387.29  -0.1%
> condvar-filler           1419.82 1424.01  -0.3%
> consumer_producer-empty  2550.01 2395.29   6.1%
> consumer_producer-filler 2709.4  2558.28   5.6%
>
> PC

Here are the results on a machine with 112 cores:

             mutex-empty    16.0112    16.5728  -3.5%
             mutex-filler    49.4354    48.7608  1.4%
      mutex_trylock-empty    19.2854    8.56795  56%
     mutex_trylock-filler    54.9643    41.5418  24%
        rwlock_read-empty    39.8855    39.7448  0.35%
       rwlock_read-filler    75.1334    75.1218  0.015%
     rwlock_tryread-empty    5.29094     5.2917  -0.014%
    rwlock_tryread-filler    39.6653     40.209  -1.4%
       rwlock_write-empty    60.6445    60.6236  0.034%
      rwlock_write-filler     91.431    92.9016  -1.6%
    rwlock_trywrite-empty    5.28404    5.94623  -13%
   rwlock_trywrite-filler    40.7044    40.7709  -0.16%
          spin_lock-empty    19.1067    19.1068  -0.00052%
         spin_lock-filler     51.643    51.2963  0.67%
       spin_trylock-empty    16.4705    16.4707  -0.0012%
      spin_trylock-filler    45.4647    50.5047  -11%
           sem_wait-empty     42.169    42.1889  -0.047%
          sem_wait-filler    74.4302    74.4577  -0.037%
        sem_trywait-empty    5.27318    5.27172  0.028%
       sem_trywait-filler     40.191    40.8506  -1.6%
            condvar-empty    5404.27    5406.39  -0.039%
           condvar-filler    5022.93    1566.82  69%
  consumer_producer-empty    15899.2    16755.8  -5.4%
 consumer_producer-filler    16076.9    16065.8  0.069%

rwlock_trywrite-empty has a 13% regression and spin_trylock-filler
an 11% regression, but there are also 69% (condvar-filler), 56%
(mutex_trylock-empty) and 24% (mutex_trylock-filler) improvements.
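
The mutex_trylock numbers show where the idiom pays off most: when a
relaxed load already sees the lock held, trylock can fail without
issuing a CAS at all.  As an illustrative sketch (again GCC __atomic
builtins, not the real __lll_trylock):

/* Returns 0 on success, nonzero if the lock is already held.  */
static inline int
lll_trylock_sketch (int *futex)
{
  /* Fail fast on a plain load, with no exclusive cache-line request.  */
  if (__atomic_load_n (futex, __ATOMIC_RELAXED) != 0)
    return 1;
  int expected = 0;
  return !__atomic_compare_exchange_n (futex, &expected, 1, 0,
				       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}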


--
H.J.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
  2021-11-10 21:33     ` H.J. Lu
@ 2021-11-11  0:30       ` Paul A. Clarke
  0 siblings, 0 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-11  0:30 UTC (permalink / raw)
  To: H.J. Lu
  Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
	Arjan van de Ven

On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive.  From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading.  See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf 
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > Are you able to share benchmarks of this change? I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> >                          bench   bench-patched
> > mutex-empty              4.73371 4.54792   3.9%
> > mutex-filler             18.5395 18.3419   1.1%
> > mutex_trylock-empty      10.46   2.46364  76.4%
> > mutex_trylock-filler     16.2188 16.1758   0.3%
> > rwlock_read-empty        16.5118 16.4681   0.3%
> > rwlock_read-filler       20.68   20.4416   1.2%
> > rwlock_tryread-empty     2.06572 2.17284  -5.2%
> > rwlock_tryread-filler    16.082  16.1215  -0.2%
> > rwlock_write-empty       31.3723 31.259    0.4%
> > rwlock_write-filler      41.6492 69.313  -66.4%
> > rwlock_trywrite-empty    2.20584 2.32178  -5.3%
> > rwlock_trywrite-filler   15.7044 15.9088  -1.3%
> > spin_lock-empty          16.7964 16.7731   0.1%
> > spin_lock-filler         20.6118 20.4175   0.9%
> > spin_trylock-empty       8.99989 8.98879   0.1%
> > spin_trylock-filler      16.4732 15.9957   2.9%
> > sem_wait-empty           15.805  15.7391   0.4%
> > sem_wait-filler          19.2346 19.5098  -1.4%
> > sem_trywait-empty        2.06405 2.03782   1.3%
> > sem_trywait-filler       15.921  15.8408   0.5%
> > condvar-empty            1385.84 1387.29  -0.1%
> > condvar-filler           1419.82 1424.01  -0.3%
> > consumer_producer-empty  2550.01 2395.29   6.1%
> > consumer_producer-filler 2709.4  2558.28   5.6%
> 
> Small regressions on uncontended locks are expected due to the extra
> check.  What do you get with my current branch?
> 
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl

                         bench   bench-hjl
mutex-empty              4.73371 4.65279   1.7%
mutex-filler             18.5395 18.3971   0.8%
mutex_trylock-empty      10.46   10.1671   2.8%
mutex_trylock-filler     16.2188 16.7105  -3.0%
rwlock_read-empty        16.5118 16.4697   0.3%
rwlock_read-filler       20.68   20.0416   3.1%
rwlock_tryread-empty     2.06572 2.038     1.3%
rwlock_tryread-filler    16.082  15.7182   2.3%
rwlock_write-empty       31.3723 31.1147   0.8%
rwlock_write-filler      41.6492 69.8115 -67.6%
rwlock_trywrite-empty    2.20584 2.32175  -5.3%
rwlock_trywrite-filler   15.7044 15.86    -1.0%
spin_lock-empty          16.7964 16.4342   2.2%
spin_lock-filler         20.6118 20.3916   1.1%
spin_trylock-empty       8.99989 8.98884   0.1%
spin_trylock-filler      16.4732 16.1979   1.7%
sem_wait-empty           15.805  15.7558   0.3%
sem_wait-filler          19.2346 19.2554  -0.1%
sem_trywait-empty        2.06405 2.03789   1.3%
sem_trywait-filler       15.921  15.7884   0.8%
condvar-empty            1385.84 1341.96   3.2%
condvar-filler           1419.82 1343.06   5.4%
consumer_producer-empty  2550.01 2446.33   4.1%
consumer_producer-filler 2709.4  2659.59   1.8%

...still one very bad outlier (rwlock_write-filler, -67.6%), and a few
of concern.

> BTW, how did you compare the 2 results?  I tried compare_bench.py
> and got
> 
> Traceback (most recent call last):
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 196, in <module>
>     main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
>   File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 165, in main
>     bench1 = bench.parse_bench(bench1, schema)
>   File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py",
> line 137, in parse_bench
>     bench = json.load(benchfile)
>   File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
>     return loads(fp.read(),
>   File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
>     return _default_decoder.decode(s)
>   File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
>     raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)

I did it the old-fashioned way, in a spreadsheet.  :-)
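The convention is simply (old - new) / old, e.g. rwlock_write-filler:
(41.6492 - 69.8115) / 41.6492 = -67.6%.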

I see the same errors you see with compare_bench.py.

Upon further investigation, compare_bench.py expects input in the form
produced by "make bench", i.e. one well-formed JSON document.  The
output from running the benchtest directly is not, which is why
json.load() bails out with "Extra data".  Using the respective outputs
from "make BENCHSET=bench-pthread bench":
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
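
(In this output, "+++" marks improvements and "---" regressions;
differences below the 2% --threshold are filtered out.)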

PC

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-11-11  0:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-10  0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
2021-11-10  0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
2021-11-10  1:56   ` Noah Goldstein
2021-11-10  0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
2021-11-10  0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
2021-11-10 20:07   ` Paul A. Clarke
2021-11-10 21:33     ` H.J. Lu
2021-11-11  0:30       ` Paul A. Clarke
2021-11-10 23:34     ` H.J. Lu
2021-11-10 15:35 ` Paul A. Clarke
2021-11-10 15:42   ` H.J. Lu
2021-11-10 15:50     ` Paul A. Clarke
2021-11-10 15:52       ` H.J. Lu
2021-11-10 15:52   ` Florian Weimer
2021-11-10 16:03     ` H.J. Lu
2021-11-10 16:04     ` Paul A. Clarke
2021-11-10 16:14     ` Andreas Schwab
