* [PATCH v6 3/4] Reduce CAS in malloc spinlocks
@ 2021-11-15 13:01 Wilco Dijkstra
0 siblings, 0 replies; 6+ messages in thread
From: Wilco Dijkstra @ 2021-11-15 13:01 UTC (permalink / raw)
To: H.J. Lu; +Cc: 'GNU C Library'
Hi,
A quick check shows that the atomic loads are always inserted before the first
CAS, and since these locks are mostly uncontended, this will actually hurt
performance on all targets. Also, it's not like we've ever had complaints about
the number of arenas we can create in malloc by having all CPUs create one at
exactly the same time...
So a change like this really needs to show gains in malloc benchmarks.
Cheers,
Wilco
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v6 3/4] Reduce CAS in malloc spinlocks
@ 2023-02-23 18:27 Wilco Dijkstra
2023-02-23 19:53 ` H.J. Lu
0 siblings, 1 reply; 6+ messages in thread
From: Wilco Dijkstra @ 2023-02-23 18:27 UTC (permalink / raw)
To: dj, H.J. Lu; +Cc: 'GNU C Library'
Hi DJ,
> size_t n = narenas;
> if (__glibc_unlikely (n <= narenas_limit - 1))
> {
> + if (atomic_load_relaxed (&narenas) != n)
> + {
> + atomic_spin_nop ();
> + goto repeat;
> + }
> if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
> goto repeat;
Before we consider optimizing it, we should first simplify it. All this wants
to do is a relaxed atomic add, then check the maximum arenas and
treat the case of having too many arenas in the same way as failure to
create another arena (ie. just atomically decrement again). Ie. no CAS
loop required, and there is nothing to optimize either.
> Should target-specific atomics optimizations be "hidden" somewhere in
> the atomics implementation? Just because x86 may benefit from a
> pre-read doesn't mean that all targets will, and if x86 generally
> benefits, it should update its implementation of the atomics to do that
> at a lower level.
We have been removing secret optimizations hidden behind atomics and just
use standard atomics everywhere. These micro-optimizations are often
counterproductive - it's far better to do them at a higher level, similar
to the single-thread path in malloc/free.
For these there is no evidence they are heavily contended - if anything the
extra code will just slow things down. The cost is in the CAS itself, we could
remove it from the multithreaded malloc path by splitting the free list into a
local one (only accessed when you have the malloc lock) and a concurrent one.
Cheers,
Wilco
* Re: [PATCH v6 3/4] Reduce CAS in malloc spinlocks
2023-02-23 18:27 Wilco Dijkstra
@ 2023-02-23 19:53 ` H.J. Lu
2023-02-23 20:07 ` DJ Delorie
0 siblings, 1 reply; 6+ messages in thread
From: H.J. Lu @ 2023-02-23 19:53 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: dj, GNU C Library
On Thu, Feb 23, 2023 at 10:28 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi DJ,
>
> > size_t n = narenas;
> > if (__glibc_unlikely (n <= narenas_limit - 1))
> > {
> > + if (atomic_load_relaxed (&narenas) != n)
> > + {
> > + atomic_spin_nop ();
> > + goto repeat;
> > + }
> > if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
> > goto repeat;
>
> Before we consider optimizing it, we should first simplify it. All this wants
> to do is a relaxed atomic add, then check the maximum arenas and
> treat the case of having too many arenas in the same way as failure to
> create another arena (ie. just atomically decrement again). Ie. no CAS
> loop required, and there is nothing to optimize either.
>
> > Should target-specific atomics optimizations be "hidden" somewhere in
> > the atomics implementation? Just because x86 may benefit from a
> > pre-read doesn't mean that all targets will, and if x86 generally
> > benefits, it should update its implementation of the atomics to do that
> > at a lower level.
>
> We have been removing secret optimizations hidden behind atomics and just
> use standard atomics everywhere. These micro-optimizations are often
> counterproductive - it's far better to do them at a higher level, similar
> to the single-thread path in malloc/free.
>
> For these there is no evidence they are heavily contended - if anything the
> extra code will just slow things down. The cost is in the CAS itself, we could
> remove it from the multithreaded malloc path by splitting the free list into a
> local one (only accessed when you have the malloc lock) and a concurrent one.
I didn't pursue it further since I couldn't show how much it would
improve performance.
--
H.J.
* Re: [PATCH v6 3/4] Reduce CAS in malloc spinlocks
2023-02-23 19:53 ` H.J. Lu
@ 2023-02-23 20:07 ` DJ Delorie
0 siblings, 0 replies; 6+ messages in thread
From: DJ Delorie @ 2023-02-23 20:07 UTC (permalink / raw)
To: H.J. Lu; +Cc: Wilco.Dijkstra, libc-alpha
"H.J. Lu" <hjl.tools@gmail.com> writes:
> I didn't pursue it further since I couldn't show how much it would
> improve performance.
Yeah, that's one of the big problems with malloc: benchmarking. It's
easy to write a bad benchmark, and hard to capture what real-world
applications are doing. And any benchmark complex enough to be relevant
in this case would be chaotic enough to lose the tiny changes in the
noise.
* [PATCH v6 0/4] Optimize CAS [BZ #28537]
@ 2021-11-11 16:24 H.J. Lu
2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
0 siblings, 1 reply; 6+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
Paul A . Clarke, Noah Goldstein
Changes in v6:
1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
loop if compare may fail.
2. Remove low level lock changes.
3. Don't change CAS usages in __pthread_mutex_lock_full.
4. Avoid extra load with CAS in __pthread_mutex_clocklock_common.
5. Reduce CAS in malloc spinlocks.
Changes in v5:
1. Put back __glibc_unlikely in __lll_trylock and lll_cond_trylock.
2. Remove an atomic load in a CAS usage which has been already optimized.
3. Add an empty statement with a semicolon to a goto label for older
compiler versions.
4. Simplify CAS optimization.
CAS instruction is expensive. From the x86 CPU's point of view, getting
a cache line for writing is more expensive than reading. See Appendix
A.2 Spinlock in:
https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
The full compare and swap will grab the cache line exclusive and cause
excessive cache line bouncing.
Optimize CAS in low level locks and pthread_mutex_lock.c:
1. Add LLL_MUTEX_READ_LOCK to do an atomic load and skip CAS in spinlock
loop if compare may fail to reduce cache line bouncing on contended locks.
2. Replace boolean CAS with value CAS to avoid the extra load.
3. Change malloc spinlocks to do an atomic load and check if compare may
fail. Skip CAS and spin if compare may fail to reduce cache line bouncing
on contended locks.
With all CAS optimizations applied, on a machine with 112 cores
(benchmark, time before, time after, improvement):
mutex-empty 17.4575 17.3908 0.38%
mutex-filler 48.4768 46.4925 4.1%
mutex_trylock-empty 19.2726 19.2737 -0.0057%
mutex_trylock-filler 54.0893 54.105 -0.029%
rwlock_read-empty 39.7572 39.8933 -0.34%
rwlock_read-filler 75.109 74.0818 1.4%
rwlock_tryread-empty 5.28944 5.28938 0.0011%
rwlock_tryread-filler 39.6297 39.734 -0.26%
rwlock_write-empty 60.6644 60.6151 0.081%
rwlock_write-filler 92.92 90.0722 3.1%
rwlock_trywrite-empty 7.24741 6.59308 9%
rwlock_trywrite-filler 42.7404 41.6767 2.5%
spin_lock-empty 19.1078 19.1079 -0.00052%
spin_lock-filler 51.0646 51.6041 -1.1%
spin_trylock-empty 16.4707 16.4811 -0.063%
spin_trylock-filler 50.5355 50.4012 0.27%
sem_wait-empty 42.1991 42.1683 0.073%
sem_wait-filler 74.6699 74.7883 -0.16%
sem_trywait-empty 5.27062 5.2702 0.008%
sem_trywait-filler 40.1541 40.1684 -0.036%
condvar-empty 5488.91 5165.95 5.9%
condvar-filler 1442.43 1474.21 -2.2%
consumer_producer-empty 16508.2 16705.3 -1.2%
consumer_producer-filler 16781.1 16942.3 -0.96%
H.J. Lu (4):
Add LLL_MUTEX_READ_LOCK [BZ #28537]
Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
Reduce CAS in malloc spinlocks
Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ
#28537]
malloc/arena.c | 5 +++++
malloc/malloc.c | 10 ++++++++++
nptl/pthread_mutex_lock.c | 17 ++++++++++++-----
nptl/pthread_mutex_timedlock.c | 10 +++++-----
4 files changed, 32 insertions(+), 10 deletions(-)
--
2.33.1
* [PATCH v6 3/4] Reduce CAS in malloc spinlocks
2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
@ 2021-11-11 16:24 ` H.J. Lu
2023-02-23 5:48 ` DJ Delorie
0 siblings, 1 reply; 6+ messages in thread
From: H.J. Lu @ 2021-11-11 16:24 UTC (permalink / raw)
To: libc-alpha
Cc: Florian Weimer, Oleh Derevenko, Arjan van de Ven, Andreas Schwab,
Paul A . Clarke, Noah Goldstein
Do an atomic load and check if compare may fail. Skip CAS and spin if
compare may fail to reduce cache line bouncing on contended locks.
---
malloc/arena.c | 5 +++++
malloc/malloc.c | 10 ++++++++++
2 files changed, 15 insertions(+)
diff --git a/malloc/arena.c b/malloc/arena.c
index 78ef4cf18c..e7fbe7c183 100644
--- a/malloc/arena.c
+++ b/malloc/arena.c
@@ -899,6 +899,11 @@ arena_get2 (size_t size, mstate avoid_arena)
enough address space to create that many arenas. */
if (__glibc_unlikely (n <= narenas_limit - 1))
{
+ if (atomic_load_relaxed (&narenas) != n)
+ {
+ atomic_spin_nop ();
+ goto repeat;
+ }
if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
goto repeat;
a = _int_new_arena (size);
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 095d97a3be..403ffb84ef 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
pp = REVEAL_PTR (victim->fd); \
if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp))) \
malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
+ if (atomic_load_relaxed (fb) != victim) \
+ { \
+ atomic_spin_nop (); \
+ continue; \
+ } \
} \
while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
!= victim); \
@@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
malloc_printerr ("double free or corruption (fasttop)");
old2 = old;
p->fd = PROTECT_PTR (&p->fd, old);
+ if (atomic_load_relaxed (fb) != old2)
+ {
+ atomic_spin_nop ();
+ continue;
+ }
}
while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
!= old2);
--
2.33.1
* Re: [PATCH v6 3/4] Reduce CAS in malloc spinlocks
2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
@ 2023-02-23 5:48 ` DJ Delorie
0 siblings, 0 replies; 6+ messages in thread
From: DJ Delorie @ 2023-02-23 5:48 UTC (permalink / raw)
To: H.J. Lu; +Cc: libc-alpha
Sorry for letting this one slip...
"H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org> writes:
> size_t n = narenas;
> if (__glibc_unlikely (n <= narenas_limit - 1))
> {
> + if (atomic_load_relaxed (&narenas) != n)
> + {
> + atomic_spin_nop ();
> + goto repeat;
> + }
> if (catomic_compare_and_exchange_bool_acq (&narenas, n + 1, n))
> goto repeat;
I understand that a congested spinloop will benefit from this kind of
change, but... we JUST loaded narenas into n, and adding arenas is rare.
We probably should have loaded it atomically, but still, we just loaded
it. The odds of malloc being so congested that we miss the CAS is
essentially (but of course not exactly) zero. Are we just adding an
unneeded atomic read here? Do any benchmarks say this would be
beneficial?
Also, the malloc code is already complicated enough. Are the extra
lines of code and slight reduction in readability justified?
Also, we've been migrating to C11-like atomics; would this patch need
changing for that?
Should target-specific atomics optimizations be "hidden" somewhere in
the atomics implementation? Just because x86 may benefit from a
pre-read doesn't mean that all targets will, and if x86 generally
benefits, it should update its implementation of the atomics to do that
at a lower level.
> a = _int_new_arena (size);
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 095d97a3be..403ffb84ef 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -3717,6 +3717,11 @@ _int_malloc (mstate av, size_t bytes)
> pp = REVEAL_PTR (victim->fd); \
> if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp))) \
> malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
> + if (atomic_load_relaxed (fb) != victim) \
> + { \
> + atomic_spin_nop (); \
> + continue; \
> + } \
> } \
> while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim)) \
> != victim); \
> @@ -4435,6 +4440,11 @@ _int_free (mstate av, mchunkptr p, int have_lock)
> malloc_printerr ("double free or corruption (fasttop)");
> old2 = old;
> p->fd = PROTECT_PTR (&p->fd, old);
> + if (atomic_load_relaxed (fb) != old2)
> + {
> + atomic_spin_nop ();
> + continue;
> + }
> }
> while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
> != old2);
Likewise: although these paths are less rare, they're still not common
enough that I'd expect a benefit from the extra code.
end of thread, other threads:[~2023-02-23 20:07 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-15 13:01 [PATCH v6 3/4] Reduce CAS in malloc spinlocks Wilco Dijkstra
-- strict thread matches above, loose matches on Subject: below --
2023-02-23 18:27 Wilco Dijkstra
2023-02-23 19:53 ` H.J. Lu
2023-02-23 20:07 ` DJ Delorie
2021-11-11 16:24 [PATCH v6 0/4] Optimize CAS [BZ #28537] H.J. Lu
2021-11-11 16:24 ` [PATCH v6 3/4] Reduce CAS in malloc spinlocks H.J. Lu
2023-02-23 5:48 ` DJ Delorie