* [PATCH v4 0/3] Optimize CAS [BZ #28537]
@ 2021-11-10 0:16 H.J. Lu
2021-11-10 0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
` (4 more replies)
0 siblings, 5 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 0:16 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab
The CAS instruction is expensive. From the x86 CPU's point of view, getting
a cache line for writing is more expensive than getting one for reading.
See Appendix A.2 Spinlock in:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
A full compare-and-swap grabs the cache line in exclusive state and causes
excessive cache line bouncing.
Optimize CAS in the low-level locks and pthread_mutex_lock.c:
1. Do an atomic load and skip the CAS if the compare may fail, to reduce
cache line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq to avoid an extra load.
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock, since we
don't know whether failure is actually rare; in the contended case it is
clearly not rare.
This is the first patch set to optimize CAS. I will investigate the
remaining CAS usages in glibc after this patch set has been accepted.
H.J. Lu (3):
Reduce CAS in low level locks [BZ #28537]
Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
nptl/lowlevellock.c | 12 ++++-----
nptl/pthread_mutex_lock.c | 53 ++++++++++++++++++++++++++++---------
sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
3 files changed, 67 insertions(+), 27 deletions(-)
--
2.33.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 1/3] Reduce CAS in low level locks [BZ #28537]
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
@ 2021-11-10 0:16 ` H.J. Lu
2021-11-10 1:56 ` Noah Goldstein
2021-11-10 0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 0:16 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab
The CAS instruction is expensive. From the x86 CPU's point of view, getting
a cache line for writing is more expensive than getting one for reading.
See Appendix A.2 Spinlock in:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
A full compare-and-swap grabs the cache line in exclusive state and causes
excessive cache line bouncing.
1. Change the low-level locks to do an atomic load and skip the CAS if the
compare may fail, to reduce cache line bouncing on contended locks.
2. In __lll_lock, replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq and pass the result down to
__lll_lock_wait and __lll_lock_wait_private to avoid a redundant load
there.
3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock, since we
don't know whether failure is actually rare; in the contended case it is
clearly not rare.
---
nptl/lowlevellock.c | 12 ++++++------
sysdeps/nptl/lowlevellock.h | 29 ++++++++++++++++++++---------
2 files changed, 26 insertions(+), 15 deletions(-)
diff --git a/nptl/lowlevellock.c b/nptl/lowlevellock.c
index 8c740529c4..d1965c01ca 100644
--- a/nptl/lowlevellock.c
+++ b/nptl/lowlevellock.c
@@ -22,30 +22,30 @@
#include <stap-probe.h>
void
-__lll_lock_wait_private (int *futex)
+__lll_lock_wait_private (int *futex, int futex_value)
{
- if (atomic_load_relaxed (futex) == 2)
+ if (futex_value == 2)
goto futex;
while (atomic_exchange_acquire (futex, 2) != 0)
{
futex:
- LIBC_PROBE (lll_lock_wait_private, 1, futex);
+ LIBC_PROBE (lll_lock_wait_private, 2, futex, futex_value);
futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2. */
}
}
libc_hidden_def (__lll_lock_wait_private)
void
-__lll_lock_wait (int *futex, int private)
+__lll_lock_wait (int *futex, int futex_value, int private)
{
- if (atomic_load_relaxed (futex) == 2)
+ if (futex_value == 2)
goto futex;
while (atomic_exchange_acquire (futex, 2) != 0)
{
futex:
- LIBC_PROBE (lll_lock_wait, 1, futex);
+ LIBC_PROBE (lll_lock_wait, 2, futex, futex_value);
futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2. */
}
}
diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
index 4d95114ed3..4235d13de9 100644
--- a/sysdeps/nptl/lowlevellock.h
+++ b/sysdeps/nptl/lowlevellock.h
@@ -66,7 +66,11 @@
0. Otherwise leave lock unchanged and return non-zero to indicate that the
lock was not acquired. */
#define __lll_trylock(lock) \
- __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock), 1, 0))
+ (__extension__ ({ \
+ __typeof (*(lock)) __lock_value = atomic_load_relaxed (lock); \
+ (__lock_value != 0 \
+ || atomic_compare_and_exchange_bool_acq ((lock), 1, 0)); \
+ }))
#define lll_trylock(lock) \
__lll_trylock (&(lock))
@@ -74,11 +78,15 @@
return 0. Otherwise leave lock unchanged and return non-zero to indicate
that the lock was not acquired. */
#define lll_cond_trylock(lock) \
- __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
+ (__extension__ ({ \
+ __typeof (lock) __lock_value = atomic_load_relaxed (&(lock)); \
+ (__lock_value != 0 \
+ || atomic_compare_and_exchange_bool_acq (&(lock), 2, 0)); \
+ }))
-extern void __lll_lock_wait_private (int *futex);
+extern void __lll_lock_wait_private (int *futex, int futex_value);
libc_hidden_proto (__lll_lock_wait_private)
-extern void __lll_lock_wait (int *futex, int private);
+extern void __lll_lock_wait (int *futex, int futex_value, int private);
libc_hidden_proto (__lll_lock_wait)
/* This is an expression rather than a statement even though its value is
@@ -95,13 +103,15 @@ libc_hidden_proto (__lll_lock_wait)
((void) \
({ \
int *__futex = (futex); \
- if (__glibc_unlikely \
- (atomic_compare_and_exchange_bool_acq (__futex, 1, 0))) \
+ int __futex_value = atomic_load_relaxed (__futex); \
+ if (__futex_value != 0 \
+ || ((__futex_value = atomic_compare_and_exchange_val_acq \
+ (__futex, 1, 0)) != 0)) \
{ \
if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
- __lll_lock_wait_private (__futex); \
+ __lll_lock_wait_private (__futex, __futex_value); \
else \
- __lll_lock_wait (__futex, private); \
+ __lll_lock_wait (__futex, __futex_value, private); \
} \
}))
#define lll_lock(futex, private) \
@@ -120,7 +130,8 @@ libc_hidden_proto (__lll_lock_wait)
({ \
int *__futex = (futex); \
if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0)) \
- __lll_lock_wait (__futex, private); \
+ __lll_lock_wait (__futex, atomic_load_relaxed (__futex), \
+ private); \
}))
#define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
--
2.33.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
2021-11-10 0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
@ 2021-11-10 0:16 ` H.J. Lu
2021-11-10 0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
` (2 subsequent siblings)
4 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 0:16 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab
Change __pthread_mutex_lock_full to do an atomic load and skip the CAS if
the compare may fail, to reduce cache line bouncing on contended locks.
---
nptl/pthread_mutex_lock.c | 38 +++++++++++++++++++++++++++++++-------
1 file changed, 31 insertions(+), 7 deletions(-)
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 2bd41767e0..1126ecba95 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -204,6 +204,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
our TID | assume_other_futex_waiters. */
if (__glibc_likely (oldval == 0))
{
+ oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ if (oldval != 0)
+ goto tid_failed;
+
oldval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
id | assume_other_futex_waiters, 0);
@@ -213,6 +217,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
if ((oldval & FUTEX_OWNER_DIED) != 0)
{
+ int currval = atomic_load_relaxed (&mutex->__data.__lock);
+ if (currval != oldval)
+ {
+ oldval = currval;
+ continue;
+ }
+
/* The previous owner died. Try locking the mutex. */
int newval = id;
#ifdef NO_INCR
@@ -259,6 +270,7 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
return EOWNERDEAD;
}
+ tid_failed:
/* Check whether we already hold the mutex. */
if (__glibc_unlikely ((oldval & FUTEX_TID_MASK) == id))
{
@@ -411,11 +423,15 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
# ifdef NO_INCR
newval |= FUTEX_WAITERS;
# endif
+ oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ if (oldval != 0)
+ goto locked_mutex;
oldval = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
newval, 0);
if (oldval != 0)
{
+ locked_mutex:
/* The mutex is locked. The kernel will now take care of
everything. */
int private = (robust
@@ -554,6 +570,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
ceilval = ceiling << PTHREAD_MUTEX_PRIO_CEILING_SHIFT;
oldprio = ceiling;
+ oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ if (oldval != ceilval)
+ goto ceilval_failed;
+
oldval
= atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
#ifdef NO_INCR
@@ -568,10 +588,13 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
do
{
- oldval
- = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
- ceilval | 2,
- ceilval | 1);
+ oldval = atomic_load_relaxed (&mutex->__data.__lock);
+ ceilval_failed:
+ if (oldval == (ceilval | 1))
+ oldval
+ = atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+ ceilval | 2,
+ ceilval | 1);
if ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval)
break;
@@ -581,9 +604,10 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
ceilval | 2,
PTHREAD_MUTEX_PSHARED (mutex));
}
- while (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
- ceilval | 2, ceilval)
- != ceilval);
+ while (atomic_load_relaxed (&mutex->__data.__lock) != ceilval
+ || (atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
+ ceilval | 2, ceilval)
+ != ceilval));
}
while ((oldval & PTHREAD_MUTEX_PRIO_CEILING_MASK) != ceilval);
--
2.33.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 3/3] Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
2021-11-10 0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
2021-11-10 0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
@ 2021-11-10 0:16 ` H.J. Lu
2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
2021-11-10 15:35 ` Paul A. Clarke
4 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 0:16 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab
1. Do an atomic load and skip the CAS if the compare may fail, to reduce
cache line bouncing on contended locks.
2. Replace atomic_compare_and_exchange_bool_acq with
atomic_compare_and_exchange_val_acq to avoid an extra load.
---
nptl/pthread_mutex_lock.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/nptl/pthread_mutex_lock.c b/nptl/pthread_mutex_lock.c
index 1126ecba95..9a18073f29 100644
--- a/nptl/pthread_mutex_lock.c
+++ b/nptl/pthread_mutex_lock.c
@@ -309,12 +309,17 @@ __pthread_mutex_lock_full (pthread_mutex_t *mutex)
meantime. */
if ((oldval & FUTEX_WAITERS) == 0)
{
- if (atomic_compare_and_exchange_bool_acq (&mutex->__data.__lock,
- oldval | FUTEX_WAITERS,
- oldval)
- != 0)
+ int currval = atomic_load_relaxed (&mutex->__data.__lock);
+ if (currval != oldval)
+ {
+ oldval = currval;
+ continue;
+ }
+ int newval = atomic_compare_and_exchange_val_acq
+ (&mutex->__data.__lock, oldval | FUTEX_WAITERS, oldval);
+ if (newval != oldval)
{
- oldval = mutex->__data.__lock;
+ oldval = newval;
continue;
}
oldval |= FUTEX_WAITERS;
--
2.33.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 1/3] Reduce CAS in low level locks [BZ #28537]
2021-11-10 0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
@ 2021-11-10 1:56 ` Noah Goldstein
0 siblings, 0 replies; 18+ messages in thread
From: Noah Goldstein @ 2021-11-10 1:56 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab
On Tue, Nov 9, 2021 at 6:21 PM H.J. Lu via Libc-alpha
<libc-alpha@sourceware.org> wrote:
>
> CAS instruction is expensive. From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading. See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> 1. Change low level locks to do an atomic load and skip CAS if compare
> may fail to reduce cache line bouncing on contended locks.
> 2. In __lll_lock, replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq and pass down the result to
> __lll_lock_wait and __lll_lock_wait_private to avoid the redundant load
> there.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.
I think that, especially since you have a manual check of old vs. new, the
atomic CAS failure should still be wrapped in `__glibc_unlikely`.
> ---
> nptl/lowlevellock.c | 12 ++++++------
> sysdeps/nptl/lowlevellock.h | 29 ++++++++++++++++++++---------
> 2 files changed, 26 insertions(+), 15 deletions(-)
>
> diff --git a/nptl/lowlevellock.c b/nptl/lowlevellock.c
> index 8c740529c4..d1965c01ca 100644
> --- a/nptl/lowlevellock.c
> +++ b/nptl/lowlevellock.c
> @@ -22,30 +22,30 @@
> #include <stap-probe.h>
>
> void
> -__lll_lock_wait_private (int *futex)
> +__lll_lock_wait_private (int *futex, int futex_value)
> {
> - if (atomic_load_relaxed (futex) == 2)
> + if (futex_value == 2)
> goto futex;
>
> while (atomic_exchange_acquire (futex, 2) != 0)
> {
> futex:
> - LIBC_PROBE (lll_lock_wait_private, 1, futex);
> + LIBC_PROBE (lll_lock_wait_private, 2, futex, futex_value);
> futex_wait ((unsigned int *) futex, 2, LLL_PRIVATE); /* Wait if *futex == 2. */
> }
> }
> libc_hidden_def (__lll_lock_wait_private)
>
> void
> -__lll_lock_wait (int *futex, int private)
> +__lll_lock_wait (int *futex, int futex_value, int private)
> {
> - if (atomic_load_relaxed (futex) == 2)
> + if (futex_value == 2)
> goto futex;
>
> while (atomic_exchange_acquire (futex, 2) != 0)
> {
> futex:
> - LIBC_PROBE (lll_lock_wait, 1, futex);
> + LIBC_PROBE (lll_lock_wait, 2, futex, futex_value);
> futex_wait ((unsigned int *) futex, 2, private); /* Wait if *futex == 2. */
> }
> }
> diff --git a/sysdeps/nptl/lowlevellock.h b/sysdeps/nptl/lowlevellock.h
> index 4d95114ed3..4235d13de9 100644
> --- a/sysdeps/nptl/lowlevellock.h
> +++ b/sysdeps/nptl/lowlevellock.h
> @@ -66,7 +66,11 @@
> 0. Otherwise leave lock unchanged and return non-zero to indicate that the
> lock was not acquired. */
> #define __lll_trylock(lock) \
> - __glibc_unlikely (atomic_compare_and_exchange_bool_acq ((lock), 1, 0))
> + (__extension__ ({ \
> + __typeof (*(lock)) __lock_value = atomic_load_relaxed (lock); \
> + (__lock_value != 0 \
> + || atomic_compare_and_exchange_bool_acq ((lock), 1, 0)); \
> + }))
> #define lll_trylock(lock) \
> __lll_trylock (&(lock))
>
> @@ -74,11 +78,15 @@
> return 0. Otherwise leave lock unchanged and return non-zero to indicate
> that the lock was not acquired. */
> #define lll_cond_trylock(lock) \
> - __glibc_unlikely (atomic_compare_and_exchange_bool_acq (&(lock), 2, 0))
> + (__extension__ ({ \
> + __typeof (lock) __lock_value = atomic_load_relaxed (&(lock)); \
> + (__lock_value != 0 \
> + || atomic_compare_and_exchange_bool_acq (&(lock), 2, 0)); \
> + }))
>
> -extern void __lll_lock_wait_private (int *futex);
> +extern void __lll_lock_wait_private (int *futex, int futex_value);
> libc_hidden_proto (__lll_lock_wait_private)
> -extern void __lll_lock_wait (int *futex, int private);
> +extern void __lll_lock_wait (int *futex, int futex_value, int private);
> libc_hidden_proto (__lll_lock_wait)
>
> /* This is an expression rather than a statement even though its value is
> @@ -95,13 +103,15 @@ libc_hidden_proto (__lll_lock_wait)
> ((void) \
> ({ \
> int *__futex = (futex); \
> - if (__glibc_unlikely \
> - (atomic_compare_and_exchange_bool_acq (__futex, 1, 0))) \
> + int __futex_value = atomic_load_relaxed (futex); \
> + if (__futex_value != 0 \
> + || ((__futex_value = atomic_compare_and_exchange_val_acq \
> + (__futex, 1, 0) != 0))) \
> { \
> if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
> - __lll_lock_wait_private (__futex); \
> + __lll_lock_wait_private (futex, __futex_value); \
> else \
> - __lll_lock_wait (__futex, private); \
> + __lll_lock_wait (futex, __futex_value, private); \
> } \
> }))
> #define lll_lock(futex, private) \
> @@ -120,7 +130,8 @@ libc_hidden_proto (__lll_lock_wait)
> ({ \
> int *__futex = (futex); \
> if (__glibc_unlikely (atomic_exchange_acq (__futex, 2) != 0)) \
> - __lll_lock_wait (__futex, private); \
> + __lll_lock_wait (__futex, atomic_load_relaxed (__futex), \
> + private); \
> }))
> #define lll_cond_lock(futex, private) __lll_cond_lock (&(futex), private)
>
> --
> 2.33.1
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
` (2 preceding siblings ...)
2021-11-10 0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
@ 2021-11-10 14:26 ` Paul E Murphy
2021-11-10 20:07 ` Paul A. Clarke
2021-11-10 15:35 ` Paul A. Clarke
4 siblings, 1 reply; 18+ messages in thread
From: Paul E Murphy @ 2021-11-10 14:26 UTC (permalink / raw)
To: H.J. Lu, libc-alpha; +Cc: Florian Weimer, Arjan van de Ven, Andreas Schwab
On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive. From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading. See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> Optimize CAS in low level locks and pthread_mutex_lock.c:
>
> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq to avoid the extra load.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.
Are you able to share benchmarks of this change? I am curious what
effects this might have on other platforms.
>
> This is the first patch set to optimize CAS. I will investigate the rest
> CAS usages in glibc after this patch set has been accepted.
>
> H.J. Lu (3):
> Reduce CAS in low level locks [BZ #28537]
> Reduce CAS in __pthread_mutex_lock_full [BZ #28537]
> Optimize CAS in __pthread_mutex_lock_full [BZ #28537]
>
> nptl/lowlevellock.c | 12 ++++-----
> nptl/pthread_mutex_lock.c | 53 ++++++++++++++++++++++++++++---------
> sysdeps/nptl/lowlevellock.h | 29 +++++++++++++-------
> 3 files changed, 67 insertions(+), 27 deletions(-)
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
` (3 preceding siblings ...)
2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
@ 2021-11-10 15:35 ` Paul A. Clarke
2021-11-10 15:42 ` H.J. Lu
2021-11-10 15:52 ` Florian Weimer
4 siblings, 2 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 15:35 UTC (permalink / raw)
To: H.J. Lu; +Cc: libc-alpha, Florian Weimer, Arjan van de Ven, Andreas Schwab
On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> CAS instruction is expensive. From the x86 CPU's point of view, getting
> a cache line for writing is more expensive than reading. See Appendix
> A.2 Spinlock in:
>
> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>
> The full compare and swap will grab the cache line exclusive and cause
> excessive cache line bouncing.
>
> Optimize CAS in low level locks and pthread_mutex_lock.c:
>
> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> line bouncing on contended locks.
> 2. Replace atomic_compare_and_exchange_bool_acq with
> atomic_compare_and_exchange_val_acq to avoid the extra load.
> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> don't know if it's actually rare; in the contended case it is clearly not
> rare.
I see build errors:
In file included from pthread_mutex_cond_lock.c:23:
../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
int private = (robust
^~~
../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
^~~
../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
if (e == ESRCH || e == EDEADLK)
^
../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
In file included from ../include/assert.h:1,
from ../nptl/pthread_mutex_lock.c:18,
from pthread_mutex_cond_lock.c:23:
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
^
../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
assert (e != EDEADLK
^~~~~~
../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
^
../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
assert (e != ESRCH || !robust);
^~~~~~
PC
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:35 ` Paul A. Clarke
@ 2021-11-10 15:42 ` H.J. Lu
2021-11-10 15:50 ` Paul A. Clarke
2021-11-10 15:52 ` Florian Weimer
1 sibling, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 15:42 UTC (permalink / raw)
To: Paul A. Clarke
Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab
On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading. See Appendix
> > A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> >
> > Optimize CAS in low level locks and pthread_mutex_lock.c:
> >
> > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > line bouncing on contended locks.
> > 2. Replace atomic_compare_and_exchange_bool_acq with
> > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > don't know if it's actually rare; in the contended case it is clearly not
> > rare.
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> int private = (robust
> ^~~
> ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
> int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
> ^~~
> ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
> if (e == ESRCH || e == EDEADLK)
> ^
> ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> In file included from ../include/assert.h:1,
> from ../nptl/pthread_mutex_lock.c:18,
> from pthread_mutex_cond_lock.c:23:
> ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> ^
> ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
> assert (e != EDEADLK
> ^~~~~~
> ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> ^
> ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
> assert (e != ESRCH || !robust);
> ^~~~~~
>
> PC
On PPC?
--
H.J.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:42 ` H.J. Lu
@ 2021-11-10 15:50 ` Paul A. Clarke
2021-11-10 15:52 ` H.J. Lu
0 siblings, 1 reply; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 15:50 UTC (permalink / raw)
To: H.J. Lu; +Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab
On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading. See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > int private = (robust
> > ^~~
> > ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
> > int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
> > ^~~
> > ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
> > if (e == ESRCH || e == EDEADLK)
> > ^
> > ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> > In file included from ../include/assert.h:1,
> > from ../nptl/pthread_mutex_lock.c:18,
> > from pthread_mutex_cond_lock.c:23:
> > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> > ^
> > ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
> > assert (e != EDEADLK
> > ^~~~~~
> > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> > ^
> > ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
> > assert (e != ESRCH || !robust);
> > ^~~~~~
>
> On PPC?
Yes, powerpc64le. Failing command:
gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm -I../sysdeps/ieee754/ldbl-opt 
-I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o
PC
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:50 ` Paul A. Clarke
@ 2021-11-10 15:52 ` H.J. Lu
0 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 15:52 UTC (permalink / raw)
To: Paul A. Clarke
Cc: GNU C Library, Florian Weimer, Arjan van de Ven, Andreas Schwab
On Wed, Nov 10, 2021 at 7:50 AM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 07:42:20AM -0800, H.J. Lu wrote:
> > On Wed, Nov 10, 2021 at 7:36 AM Paul A. Clarke <pc@us.ibm.com> wrote:
> > >
> > > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading. See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > I see build errors:
> > >
> > > In file included from pthread_mutex_cond_lock.c:23:
> > > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > > int private = (robust
> > > ^~~
> > > ../nptl/pthread_mutex_lock.c:445:6: error: expected expression before ‘int’
> > > int e = __futex_lock_pi64 (&mutex->__data.__lock, 0 /* ununsed */,
> > > ^~~
> > > ../nptl/pthread_mutex_lock.c:447:10: error: ‘e’ undeclared (first use in this function)
> > > if (e == ESRCH || e == EDEADLK)
> > > ^
> > > ../nptl/pthread_mutex_lock.c:447:10: note: each undeclared identifier is reported only once for each function it appears in
> > > In file included from ../include/assert.h:1,
> > > from ../nptl/pthread_mutex_lock.c:18,
> > > from pthread_mutex_cond_lock.c:23:
> > > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > > ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> > > ^
> > > ../nptl/pthread_mutex_lock.c:449:3: note: in expansion of macro ‘assert’
> > > assert (e != EDEADLK
> > > ^~~~~~
> > > ../assert/assert.h:105:34: error: left-hand operand of comma expression has no effect [-Werror=unused-value]
> > > ((void) sizeof ((expr) ? 1 : 0), __extension__ ({ \
> > > ^
> > > ../nptl/pthread_mutex_lock.c:454:3: note: in expansion of macro ‘assert’
> > > assert (e != ESRCH || !robust);
> > > ^~~~~~
> >
> > On PPC?
>
> Yes, powerpc64le. Failing command:
> gcc pthread_mutex_cond_lock.c -c -std=gnu11 -fgnu89-inline -g -O2 -Wall -Wwrite-strings -Wundef -Werror -fmerge-all-constants -frounding-math -fno-stack-protector -fno-common -Wstrict-prototypes -Wold-style-definition -fmath-errno -mabi=ieeelongdouble -Wno-psabi -mno-gnu-attribute -mlong-double-128 -ftls-model=initial-exec -I../include -I/home/pc/locks/build-patched/nptl -I/home/pc/locks/build-patched -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/fpu -I../sysdeps/unix/sysv/linux/powerpc/powerpc64/le -I../sysdeps/unix/sysv/linux/powerpc/powerpc64 -I../sysdeps/unix/sysv/linux/wordsize-64 -I../sysdeps/unix/sysv/linux/powerpc -I../sysdeps/powerpc/nptl -I../sysdeps/unix/sysv/linux/include -I../sysdeps/unix/sysv/linux -I../sysdeps/nptl -I../sysdeps/pthread -I../sysdeps/gnu -I../sysdeps/unix/inet -I../sysdeps/unix/sysv -I../sysdeps/unix/powerpc -I../sysdeps/unix -I../sysdeps/posix -I../sysdeps/powerpc/powerpc64/le/power8/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/fpu/multiarch -I../sysdeps/powerpc/powerpc64/le/power8/fpu -I../sysdeps/powerpc/powerpc64/le/power7/fpu -I../sysdeps/powerpc/powerpc64/le/fpu -I../sysdeps/powerpc/powerpc64/fpu -I../sysdeps/powerpc/powerpc64/le/power8/multiarch -I../sysdeps/powerpc/powerpc64/le/power7/multiarch -I../sysdeps/powerpc/powerpc64/le/multiarch -I../sysdeps/powerpc/powerpc64/multiarch -I../sysdeps/powerpc/powerpc64/le/power8 -I../sysdeps/powerpc/powerpc64/power8 -I../sysdeps/powerpc/powerpc64/le/power7 -I../sysdeps/powerpc/powerpc64/power7 -I../sysdeps/powerpc/powerpc64/power6 -I../sysdeps/powerpc/powerpc64/power4 -I../sysdeps/powerpc/power4 -I../sysdeps/powerpc/powerpc64/le -I../sysdeps/powerpc/powerpc64 -I../sysdeps/wordsize-64 -I../sysdeps/powerpc/fpu -I../sysdeps/powerpc -I../sysdeps/ieee754/ldbl-128ibm-compat -I../sysdeps/ieee754/ldbl-128ibm/include -I../sysdeps/ieee754/ldbl-128ibm -I../sysdeps/ieee754/ldbl-opt 
-I../sysdeps/ieee754/dbl-64 -I../sysdeps/ieee754/flt-32 -I../sysdeps/ieee754/float128 -I../sysdeps/ieee754 -I../sysdeps/generic -I.. -I../libio -I. -D_LIBC_REENTRANT -include /home/pc/locks/build-patched/libc-modules.h -DMODULE_NAME=libc -include ../include/libc-symbols.h -DTOP_NAMESPACE=glibc -o /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o -MD -MP -MF /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o.dt -MT /home/pc/locks/build-patched/nptl/pthread_mutex_cond_lock.o
>
> PC
I am testing it now.
--
H.J.
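The first two optimizations in the quoted cover letter (an atomic load before the CAS, and using the value returned by the CAS instead of re-loading) can be sketched with C11 atomics. The names below are illustrative, not glibc's actual lll/atomic macros:

```c
#include <stdatomic.h>

/* Sketch of a test-and-test-and-set trylock: poll the lock with a
   plain atomic load and only issue the CAS when the compare can
   succeed, so a held lock is observed through shared (read-only)
   cache-line accesses instead of exclusive ones.
   Returns 0 on success, nonzero if the lock is held.  */
static int
try_lock (atomic_int *lock)
{
  /* Cheap read first: skip the CAS entirely if it would fail.  */
  if (atomic_load_explicit (lock, memory_order_relaxed) != 0)
    return 1;

  int expected = 0;
  /* On failure the CAS writes the observed value back into EXPECTED;
     this is what the _val_ (rather than _bool_) CAS variant exploits
     to avoid an extra load after a failed exchange.  */
  if (atomic_compare_exchange_strong_explicit (lock, &expected, 1,
                                               memory_order_acquire,
                                               memory_order_relaxed))
    return 0;
  return 1;
}
```

The trade-off the benchmark discussion below turns on: the extra load is nearly free on a contended lock (it avoids a doomed exclusive cache-line grab) but adds a few cycles on an uncontended one.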
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:35 ` Paul A. Clarke
2021-11-10 15:42 ` H.J. Lu
@ 2021-11-10 15:52 ` Florian Weimer
2021-11-10 16:03 ` H.J. Lu
` (2 more replies)
1 sibling, 3 replies; 18+ messages in thread
From: Florian Weimer @ 2021-11-10 15:52 UTC (permalink / raw)
To: Paul A. Clarke; +Cc: H.J. Lu, libc-alpha, Arjan van de Ven, Andreas Schwab
* Paul A. Clarke:
> On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
>> CAS instruction is expensive. From the x86 CPU's point of view, getting
>> a cache line for writing is more expensive than reading. See Appendix
>> A.2 Spinlock in:
>>
>> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
>>
>> The full compare and swap will grab the cache line exclusive and cause
>> excessive cache line bouncing.
>>
>> Optimize CAS in low level locks and pthread_mutex_lock.c:
>>
>> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
>> line bouncing on contended locks.
>> 2. Replace atomic_compare_and_exchange_bool_acq with
>> atomic_compare_and_exchange_val_acq to avoid the extra load.
>> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
>> don't know if it's actually rare; in the contended case it is clearly not
>> rare.
>
> I see build errors:
>
> In file included from pthread_mutex_cond_lock.c:23:
> ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> int private = (robust
> ^~~
The patch has:
+ locked_mutex:
/* The mutex is locked. The kernel will now take care of
everything. */
int private = (robust
This is only supported in recent C versions; I think the workaround is
to add an empty statement with a semicolon, like this:
+ locked_mutex:;
/* The mutex is locked. The kernel will now take care of
everything. */
int private = (robust
Thanks,
Florian
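The diagnostic and the fix can be reproduced in a standalone translation unit (function and variable names are illustrative; the real code is in nptl/pthread_mutex_lock.c). Before C23, a label must be followed by a statement, and a declaration is not a statement, so the empty statement after the label (':;') is the portable workaround:

```c
/* Without the ';' after the label, pre-C23 compilers reject this with
   "a label can only be part of a statement and a declaration is not
   a statement".  */
static int
take_over_lock (int robust)
{
  goto locked_mutex;

 locked_mutex:;
  /* The mutex is locked.  The kernel will now take care of
     everything.  */
  int private = robust ? 1 : 0;
  return private;
}
```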
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:52 ` Florian Weimer
@ 2021-11-10 16:03 ` H.J. Lu
2021-11-10 16:04 ` Paul A. Clarke
2021-11-10 16:14 ` Andreas Schwab
2 siblings, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 16:03 UTC (permalink / raw)
To: Florian Weimer
Cc: Paul A. Clarke, GNU C Library, Arjan van de Ven, Andreas Schwab
On Wed, Nov 10, 2021 at 7:52 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Paul A. Clarke:
>
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> >> CAS instruction is expensive. From the x86 CPU's point of view, getting
> >> a cache line for writing is more expensive than reading. See Appendix
> >> A.2 Spinlock in:
> >>
> >> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >>
> >> The full compare and swap will grab the cache line exclusive and cause
> >> excessive cache line bouncing.
> >>
> >> Optimize CAS in low level locks and pthread_mutex_lock.c:
> >>
> >> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> >> line bouncing on contended locks.
> >> 2. Replace atomic_compare_and_exchange_bool_acq with
> >> atomic_compare_and_exchange_val_acq to avoid the extra load.
> >> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> >> don't know if it's actually rare; in the contended case it is clearly not
> >> rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > int private = (robust
> > ^~~
>
> The patch has:
>
> + locked_mutex:
> /* The mutex is locked. The kernel will now take care of
> everything. */
> int private = (robust
>
> This is only supported in recent C versions; I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> + locked_mutex:;
> /* The mutex is locked. The kernel will now take care of
> everything. */
> int private = (robust
>
> Thanks,
> Florian
>
Can you try the users/hjl/x86/atomic-nptl branch:
https://gitlab.com/x86-glibc/glibc/-/commits/users/hjl/x86/atomic-nptl
I fixed a couple of issues and added more CAS optimizations.
--
H.J.
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:52 ` Florian Weimer
2021-11-10 16:03 ` H.J. Lu
@ 2021-11-10 16:04 ` Paul A. Clarke
2021-11-10 16:14 ` Andreas Schwab
2 siblings, 0 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 16:04 UTC (permalink / raw)
To: Florian Weimer; +Cc: H.J. Lu, libc-alpha, Arjan van de Ven, Andreas Schwab
On Wed, Nov 10, 2021 at 04:52:28PM +0100, Florian Weimer wrote:
> * Paul A. Clarke:
>
> > On Tue, Nov 09, 2021 at 04:16:11PM -0800, H.J. Lu via Libc-alpha wrote:
> >> CAS instruction is expensive. From the x86 CPU's point of view, getting
> >> a cache line for writing is more expensive than reading. See Appendix
> >> A.2 Spinlock in:
> >>
> >> https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >>
> >> The full compare and swap will grab the cache line exclusive and cause
> >> excessive cache line bouncing.
> >>
> >> Optimize CAS in low level locks and pthread_mutex_lock.c:
> >>
> >> 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> >> line bouncing on contended locks.
> >> 2. Replace atomic_compare_and_exchange_bool_acq with
> >> atomic_compare_and_exchange_val_acq to avoid the extra load.
> >> 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> >> don't know if it's actually rare; in the contended case it is clearly not
> >> rare.
> >
> > I see build errors:
> >
> > In file included from pthread_mutex_cond_lock.c:23:
> > ../nptl/pthread_mutex_lock.c: In function ‘__pthread_mutex_cond_lock_full’:
> > ../nptl/pthread_mutex_lock.c:442:6: error: a label can only be part of a statement and a declaration is not a statement
> > int private = (robust
> > ^~~
>
> The patch has:
>
> + locked_mutex:
> /* The mutex is locked. The kernel will now take care of
> everything. */
> int private = (robust
>
> This is only supported in recent C versions; I think the workaround is
> to add an empty statement with a semicolon, like this:
>
> + locked_mutex:;
> /* The mutex is locked. The kernel will now take care of
> everything. */
> int private = (robust
This change resolved the issue.
PC
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 15:52 ` Florian Weimer
2021-11-10 16:03 ` H.J. Lu
2021-11-10 16:04 ` Paul A. Clarke
@ 2021-11-10 16:14 ` Andreas Schwab
2 siblings, 0 replies; 18+ messages in thread
From: Andreas Schwab @ 2021-11-10 16:14 UTC (permalink / raw)
To: Florian Weimer; +Cc: Paul A. Clarke, H.J. Lu, libc-alpha, Arjan van de Ven
On Nov 10 2021, Florian Weimer wrote:
> This is only supported in recent C versions
Only in C++.
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
@ 2021-11-10 20:07 ` Paul A. Clarke
2021-11-10 21:33 ` H.J. Lu
2021-11-10 23:34 ` H.J. Lu
0 siblings, 2 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-10 20:07 UTC (permalink / raw)
To: Paul E Murphy
Cc: H.J. Lu, libc-alpha, Florian Weimer, Andreas Schwab, Arjan van de Ven
On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > a cache line for writing is more expensive than reading. See Appendix
> > A.2 Spinlock in:
> >
> > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> >
> > The full compare and swap will grab the cache line exclusive and cause
> > excessive cache line bouncing.
> >
> > Optimize CAS in low level locks and pthread_mutex_lock.c:
> >
> > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > line bouncing on contended locks.
> > 2. Replace atomic_compare_and_exchange_bool_acq with
> > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > don't know if it's actually rare; in the contended case it is clearly not
> > rare.
>
> Are you able to share benchmarks of this change? I am curious what effects
> this might have on other platforms.
I'd like to see the expected performance results, too.
For me, the results are not uniformly positive (Power10).
From bench-pthread-locks:
bench bench-patched
mutex-empty 4.73371 4.54792 3.9%
mutex-filler 18.5395 18.3419 1.1%
mutex_trylock-empty 10.46 2.46364 76.4%
mutex_trylock-filler 16.2188 16.1758 0.3%
rwlock_read-empty 16.5118 16.4681 0.3%
rwlock_read-filler 20.68 20.4416 1.2%
rwlock_tryread-empty 2.06572 2.17284 -5.2%
rwlock_tryread-filler 16.082 16.1215 -0.2%
rwlock_write-empty 31.3723 31.259 0.4%
rwlock_write-filler 41.6492 69.313 -66.4%
rwlock_trywrite-empty 2.20584 2.32178 -5.3%
rwlock_trywrite-filler 15.7044 15.9088 -1.3%
spin_lock-empty 16.7964 16.7731 0.1%
spin_lock-filler 20.6118 20.4175 0.9%
spin_trylock-empty 8.99989 8.98879 0.1%
spin_trylock-filler 16.4732 15.9957 2.9%
sem_wait-empty 15.805 15.7391 0.4%
sem_wait-filler 19.2346 19.5098 -1.4%
sem_trywait-empty 2.06405 2.03782 1.3%
sem_trywait-filler 15.921 15.8408 0.5%
condvar-empty 1385.84 1387.29 -0.1%
condvar-filler 1419.82 1424.01 -0.3%
consumer_producer-empty 2550.01 2395.29 6.1%
consumer_producer-filler 2709.4 2558.28 5.6%
PC
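For reference, the percentage column in the table above is the plain relative change against the baseline run (positive means the patched build is faster). A one-line helper (illustrative name) reproduces the rows, e.g. 76.4% for mutex_trylock-empty and -66.4% for rwlock_write-filler:

```c
/* Relative improvement of PATCHED over BASE in percent, matching the
   rightmost column of the benchmark tables in this thread.  */
static double
improvement (double base, double patched)
{
  return (base - patched) / base * 100.0;
}
```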
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 20:07 ` Paul A. Clarke
@ 2021-11-10 21:33 ` H.J. Lu
2021-11-11 0:30 ` Paul A. Clarke
2021-11-10 23:34 ` H.J. Lu
1 sibling, 1 reply; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 21:33 UTC (permalink / raw)
To: Paul A. Clarke
Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
Arjan van de Ven
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading. See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
> bench bench-patched
> mutex-empty 4.73371 4.54792 3.9%
> mutex-filler 18.5395 18.3419 1.1%
> mutex_trylock-empty 10.46 2.46364 76.4%
> mutex_trylock-filler 16.2188 16.1758 0.3%
> rwlock_read-empty 16.5118 16.4681 0.3%
> rwlock_read-filler 20.68 20.4416 1.2%
> rwlock_tryread-empty 2.06572 2.17284 -5.2%
> rwlock_tryread-filler 16.082 16.1215 -0.2%
> rwlock_write-empty 31.3723 31.259 0.4%
> rwlock_write-filler 41.6492 69.313 -66.4%
> rwlock_trywrite-empty 2.20584 2.32178 -5.3%
> rwlock_trywrite-filler 15.7044 15.9088 -1.3%
> spin_lock-empty 16.7964 16.7731 0.1%
> spin_lock-filler 20.6118 20.4175 0.9%
> spin_trylock-empty 8.99989 8.98879 0.1%
> spin_trylock-filler 16.4732 15.9957 2.9%
> sem_wait-empty 15.805 15.7391 0.4%
> sem_wait-filler 19.2346 19.5098 -1.4%
> sem_trywait-empty 2.06405 2.03782 1.3%
> sem_trywait-filler 15.921 15.8408 0.5%
> condvar-empty 1385.84 1387.29 -0.1%
> condvar-filler 1419.82 1424.01 -0.3%
> consumer_producer-empty 2550.01 2395.29 6.1%
> consumer_producer-filler 2709.4 2558.28 5.6%
Small regressions on uncontended locks are expected due to the extra
check. What do you get with my current branch
https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl
BTW, how did you compare the 2 results? I tried compare_bench.py
and got
Traceback (most recent call last):
File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
line 196, in <module>
main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
line 165, in main
bench1 = bench.parse_bench(bench1, schema)
File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py",
line 137, in parse_bench
bench = json.load(benchfile)
File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
--
H.J.
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 20:07 ` Paul A. Clarke
2021-11-10 21:33 ` H.J. Lu
@ 2021-11-10 23:34 ` H.J. Lu
1 sibling, 0 replies; 18+ messages in thread
From: H.J. Lu @ 2021-11-10 23:34 UTC (permalink / raw)
To: Paul A. Clarke
Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
Arjan van de Ven
On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
>
> On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > a cache line for writing is more expensive than reading. See Appendix
> > > A.2 Spinlock in:
> > >
> > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > >
> > > The full compare and swap will grab the cache line exclusive and cause
> > > excessive cache line bouncing.
> > >
> > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > >
> > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > line bouncing on contended locks.
> > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > don't know if it's actually rare; in the contended case it is clearly not
> > > rare.
> >
> > Are you able to share benchmarks of this change? I am curious what effects
> > this might have on other platforms.
>
> I'd like to see the expected performance results, too.
>
> For me, the results are not uniformly positive (Power10).
> From bench-pthread-locks:
>
> bench bench-patched
> mutex-empty 4.73371 4.54792 3.9%
> mutex-filler 18.5395 18.3419 1.1%
> mutex_trylock-empty 10.46 2.46364 76.4%
> mutex_trylock-filler 16.2188 16.1758 0.3%
> rwlock_read-empty 16.5118 16.4681 0.3%
> rwlock_read-filler 20.68 20.4416 1.2%
> rwlock_tryread-empty 2.06572 2.17284 -5.2%
> rwlock_tryread-filler 16.082 16.1215 -0.2%
> rwlock_write-empty 31.3723 31.259 0.4%
> rwlock_write-filler 41.6492 69.313 -66.4%
> rwlock_trywrite-empty 2.20584 2.32178 -5.3%
> rwlock_trywrite-filler 15.7044 15.9088 -1.3%
> spin_lock-empty 16.7964 16.7731 0.1%
> spin_lock-filler 20.6118 20.4175 0.9%
> spin_trylock-empty 8.99989 8.98879 0.1%
> spin_trylock-filler 16.4732 15.9957 2.9%
> sem_wait-empty 15.805 15.7391 0.4%
> sem_wait-filler 19.2346 19.5098 -1.4%
> sem_trywait-empty 2.06405 2.03782 1.3%
> sem_trywait-filler 15.921 15.8408 0.5%
> condvar-empty 1385.84 1387.29 -0.1%
> condvar-filler 1419.82 1424.01 -0.3%
> consumer_producer-empty 2550.01 2395.29 6.1%
> consumer_producer-filler 2709.4 2558.28 5.6%
>
> PC
Here are the results on a machine with 112 cores:
mutex-empty 16.0112 16.5728 -3.5%
mutex-filler 49.4354 48.7608 1.4%
mutex_trylock-empty 19.2854 8.56795 56%
mutex_trylock-filler 54.9643 41.5418 24%
rwlock_read-empty 39.8855 39.7448 0.35%
rwlock_read-filler 75.1334 75.1218 0.015%
rwlock_tryread-empty 5.29094 5.2917 -0.014%
rwlock_tryread-filler 39.6653 40.209 -1.4%
rwlock_write-empty 60.6445 60.6236 0.034%
rwlock_write-filler 91.431 92.9016 -1.6%
rwlock_trywrite-empty 5.28404 5.94623 -13%
rwlock_trywrite-filler 40.7044 40.7709 -0.16%
spin_lock-empty 19.1067 19.1068 -0.00052%
spin_lock-filler 51.643 51.2963 0.67%
spin_trylock-empty 16.4705 16.4707 -0.0012%
spin_trylock-filler 45.4647 50.5047 -11%
sem_wait-empty 42.169 42.1889 -0.047%
sem_wait-filler 74.4302 74.4577 -0.037%
sem_trywait-empty 5.27318 5.27172 0.028%
sem_trywait-filler 40.191 40.8506 -1.6%
condvar-empty 5404.27 5406.39 -0.039%
condvar-filler 5022.93 1566.82 69%
consumer_producer-empty 15899.2 16755.8 -5.4%
consumer_producer-filler 16076.9 16065.8 0.069%
rwlock_trywrite-empty has a 13% regression and spin_trylock-filler an
11% regression, but there are also 69%, 56%, and 24% improvements.
--
H.J.
* Re: [PATCH v4 0/3] Optimize CAS [BZ #28537]
2021-11-10 21:33 ` H.J. Lu
@ 2021-11-11 0:30 ` Paul A. Clarke
0 siblings, 0 replies; 18+ messages in thread
From: Paul A. Clarke @ 2021-11-11 0:30 UTC (permalink / raw)
To: H.J. Lu
Cc: Paul E Murphy, GNU C Library, Florian Weimer, Andreas Schwab,
Arjan van de Ven
On Wed, Nov 10, 2021 at 01:33:26PM -0800, H.J. Lu wrote:
> On Wed, Nov 10, 2021 at 12:07 PM Paul A. Clarke <pc@us.ibm.com> wrote:
> >
> > On Wed, Nov 10, 2021 at 08:26:09AM -0600, Paul E Murphy via Libc-alpha wrote:
> > > On 11/9/21 6:16 PM, H.J. Lu via Libc-alpha wrote:
> > > > CAS instruction is expensive. From the x86 CPU's point of view, getting
> > > > a cache line for writing is more expensive than reading. See Appendix
> > > > A.2 Spinlock in:
> > > >
> > > > https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
> > > >
> > > > The full compare and swap will grab the cache line exclusive and cause
> > > > excessive cache line bouncing.
> > > >
> > > > Optimize CAS in low level locks and pthread_mutex_lock.c:
> > > >
> > > > 1. Do an atomic load and skip CAS if compare may fail to reduce cache
> > > > line bouncing on contended locks.
> > > > 2. Replace atomic_compare_and_exchange_bool_acq with
> > > > atomic_compare_and_exchange_val_acq to avoid the extra load.
> > > > 3. Drop __glibc_unlikely in __lll_trylock and lll_cond_trylock since we
> > > > don't know if it's actually rare; in the contended case it is clearly not
> > > > rare.
> > >
> > > Are you able to share benchmarks of this change? I am curious what effects
> > > this might have on other platforms.
> >
> > I'd like to see the expected performance results, too.
> >
> > For me, the results are not uniformly positive (Power10).
> > From bench-pthread-locks:
> >
> > bench bench-patched
> > mutex-empty 4.73371 4.54792 3.9%
> > mutex-filler 18.5395 18.3419 1.1%
> > mutex_trylock-empty 10.46 2.46364 76.4%
> > mutex_trylock-filler 16.2188 16.1758 0.3%
> > rwlock_read-empty 16.5118 16.4681 0.3%
> > rwlock_read-filler 20.68 20.4416 1.2%
> > rwlock_tryread-empty 2.06572 2.17284 -5.2%
> > rwlock_tryread-filler 16.082 16.1215 -0.2%
> > rwlock_write-empty 31.3723 31.259 0.4%
> > rwlock_write-filler 41.6492 69.313 -66.4%
> > rwlock_trywrite-empty 2.20584 2.32178 -5.3%
> > rwlock_trywrite-filler 15.7044 15.9088 -1.3%
> > spin_lock-empty 16.7964 16.7731 0.1%
> > spin_lock-filler 20.6118 20.4175 0.9%
> > spin_trylock-empty 8.99989 8.98879 0.1%
> > spin_trylock-filler 16.4732 15.9957 2.9%
> > sem_wait-empty 15.805 15.7391 0.4%
> > sem_wait-filler 19.2346 19.5098 -1.4%
> > sem_trywait-empty 2.06405 2.03782 1.3%
> > sem_trywait-filler 15.921 15.8408 0.5%
> > condvar-empty 1385.84 1387.29 -0.1%
> > condvar-filler 1419.82 1424.01 -0.3%
> > consumer_producer-empty 2550.01 2395.29 6.1%
> > consumer_producer-filler 2709.4 2558.28 5.6%
>
> Small regressions on uncontended locks are expected due to extra
> check. What do you get with my current branch
>
> https://gitlab.com/x86-glibc/glibc/-/tree/users/hjl/x86/atomic-nptl
bench bench-hjl
mutex-empty 4.73371 4.65279 1.7%
mutex-filler 18.5395 18.3971 0.8%
mutex_trylock-empty 10.46 10.1671 2.8%
mutex_trylock-filler 16.2188 16.7105 -3.0%
rwlock_read-empty 16.5118 16.4697 0.3%
rwlock_read-filler 20.68 20.0416 3.1%
rwlock_tryread-empty 2.06572 2.038 1.3%
rwlock_tryread-filler 16.082 15.7182 2.3%
rwlock_write-empty 31.3723 31.1147 0.8%
rwlock_write-filler 41.6492 69.8115 -67.6%
rwlock_trywrite-empty 2.20584 2.32175 -5.3%
rwlock_trywrite-filler 15.7044 15.86 -1.0%
spin_lock-empty 16.7964 16.4342 2.2%
spin_lock-filler 20.6118 20.3916 1.1%
spin_trylock-empty 8.99989 8.98884 0.1%
spin_trylock-filler 16.4732 16.1979 1.7%
sem_wait-empty 15.805 15.7558 0.3%
sem_wait-filler 19.2346 19.2554 -0.1%
sem_trywait-empty 2.06405 2.03789 1.3%
sem_trywait-filler 15.921 15.7884 0.8%
condvar-empty 1385.84 1341.96 3.2%
condvar-filler 1419.82 1343.06 5.4%
consumer_producer-empty 2550.01 2446.33 4.1%
consumer_producer-filler 2709.4 2659.59 1.8%
...still one very bad outlier, and a few of concern.
> BTW, how did you compare the 2 results? I tried compare_bench.py
> and got
>
> Traceback (most recent call last):
> File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 196, in <module>
> main(args.bench1, args.bench2, args.schema, args.threshold, args.stats)
> File "/export/gnu/import/git/gitlab/x86-glibc/benchtests/scripts/compare_bench.py",
> line 165, in main
> bench1 = bench.parse_bench(bench1, schema)
> File "/export/ssd/git/gitlab/x86-glibc/benchtests/scripts/import_bench.py",
> line 137, in parse_bench
> bench = json.load(benchfile)
> File "/usr/lib64/python3.10/json/__init__.py", line 293, in load
> return loads(fp.read(),
> File "/usr/lib64/python3.10/json/__init__.py", line 346, in loads
> return _default_decoder.decode(s)
> File "/usr/lib64/python3.10/json/decoder.py", line 340, in decode
> raise JSONDecodeError("Extra data", s, end)
> json.decoder.JSONDecodeError: Extra data: line 1 column 18 (char 17)
I did it the old-fashioned way, in a spreadsheet. :-)
I see the same errors you see with compare_bench.py.
Upon further investigation, compare_bench.py expects input in the form
produced by "make bench". The output from running the benchtest directly
is insufficient. Using the respective outputs from
"make BENCHSET=bench-pthread bench":
--
$ ./benchtests/scripts/compare_bench.py --threshold 2 --stats mean A.out B.out
[snip]
+++ thread_create(stack=1024,guard=2)[mean]: (2.15%) from 372674 to 364660
+++ thread_create(stack=2048,guard=1)[mean]: (4.88%) from 377835 to 359396
+++ thread_create(stack=2048,guard=2)[mean]: (3.58%) from 377306 to 363798
+++ pthread_locks(mutex-empty)[mean]: (4.27%) from 4.85936 to 4.65185
--- pthread_locks(mutex_trylock-filler)[mean]: (3.09%) from 16.0579 to 16.5533
--- pthread_locks(rwlock_write-filler)[mean]: (56.90%) from 44.4255 to 69.7047
--- pthread_locks(rwlock_trywrite-empty)[mean]: (6.73%) from 2.17594 to 2.32244
+++ pthread_locks(spin_lock-empty)[mean]: (2.17%) from 16.8086 to 16.4436
--- pthread_locks(spin_trylock-filler)[mean]: (2.34%) from 16.1119 to 16.4896
+++ pthread_locks(consumer_producer-empty)[mean]: (2.94%) from 2531.95 to 2457.48
--
PC
Thread overview: 18+ messages
2021-11-10 0:16 [PATCH v4 0/3] Optimize CAS [BZ #28537] H.J. Lu
2021-11-10 0:16 ` [PATCH v4 1/3] Reduce CAS in low level locks " H.J. Lu
2021-11-10 1:56 ` Noah Goldstein
2021-11-10 0:16 ` [PATCH v4 2/3] Reduce CAS in __pthread_mutex_lock_full " H.J. Lu
2021-11-10 0:16 ` [PATCH v4 3/3] Optimize " H.J. Lu
2021-11-10 14:26 ` [PATCH v4 0/3] Optimize CAS " Paul E Murphy
2021-11-10 20:07 ` Paul A. Clarke
2021-11-10 21:33 ` H.J. Lu
2021-11-11 0:30 ` Paul A. Clarke
2021-11-10 23:34 ` H.J. Lu
2021-11-10 15:35 ` Paul A. Clarke
2021-11-10 15:42 ` H.J. Lu
2021-11-10 15:50 ` Paul A. Clarke
2021-11-10 15:52 ` H.J. Lu
2021-11-10 15:52 ` Florian Weimer
2021-11-10 16:03 ` H.J. Lu
2021-11-10 16:04 ` Paul A. Clarke
2021-11-10 16:14 ` Andreas Schwab