From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <triegel@redhat.com>
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [216.205.24.124])
 by sourceware.org (Postfix) with ESMTP id A553D384A018
 for <libc-alpha@sourceware.org>; Mon, 18 Jan 2021 23:41:20 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org A553D384A018
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-563-uuFukoCePFmc7RlyNM3h4g-1; Mon, 18 Jan 2021 18:41:18 -0500
X-MC-Unique: uuFukoCePFmc7RlyNM3h4g-1
Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.phx2.redhat.com
 [10.5.11.22])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 1B68F1006C82;
 Mon, 18 Jan 2021 23:41:17 +0000 (UTC)
Received: from ovpn-115-172.ams2.redhat.com (ovpn-115-172.ams2.redhat.com
 [10.36.115.172])
 by smtp.corp.redhat.com (Postfix) with ESMTPS id ED9D710023B9;
 Mon, 18 Jan 2021 23:41:15 +0000 (UTC)
Message-ID: <cc83b3a78173012ee30006d44b65732a8fa2eabd.camel@redhat.com>
Subject: Re: [PATCH 3/5] nptl: Optimization by not incrementing wrefs in
 pthread_cond_wait
From: Torvald Riegel <triegel@redhat.com>
To: Malte Skarupke <malteskarupke@web.de>, libc-alpha@sourceware.org
Cc: malteskarupke@fastmail.fm
Date: Tue, 19 Jan 2021 00:41:13 +0100
In-Reply-To: <20210116204950.16434-3-malteskarupke@web.de>
References: <20210116204950.16434-1-malteskarupke@web.de>
 <20210116204950.16434-3-malteskarupke@web.de>
Mime-Version: 1.0
X-Scanned-By: MIMEDefang 2.84 on 10.5.11.22
X-Mimecast-Spam-Score: 0
X-Mimecast-Originator: redhat.com
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=-14.5 required=5.0 tests=BAYES_00, DKIMWL_WL_HIGH,
 DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0,
 RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_HELO_NONE,
 SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.2
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list <libc-alpha.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/libc-alpha>,
 <mailto:libc-alpha-request@sourceware.org?subject=subscribe>
X-List-Received-Date: Mon, 18 Jan 2021 23:41:22 -0000

On Sat, 2021-01-16 at 15:49 -0500, Malte Skarupke wrote:
> After I broadened the scope of grefs, it covered mostly the same
> scope
> as wrefs. The duplicate atomic increment/decrement was unnecessary. 

This will not work as is because __wrefs and __g_refs are used
differently, which matters for destruction safety.  See below for
details.

You should be able to fix it, but this may require a CAS instead of an
read-modify-write when decreasing group ref counts.  It might just be
faster to keep __wrefs (and less complex too).

The real potential for a performance degradation is not in the number
of atomic ops anyway, IMO, but in that your changes now require
signalers to wait for waiters when switching groups, even if the
waiters were just spinning (ie, in high-throughput scenarios).

AFAIA glibc is still lacking proper tuning of when to switch from
spinning to blocking via futexes, but this change should decrease the
benefits of spinning in condvars.

> diff --git a/nptl/pthread_cond_broadcast.c
> b/nptl/pthread_cond_broadcast.c
> index 8d887aab93..e10432ce7c 100644
> --- a/nptl/pthread_cond_broadcast.c
> +++ b/nptl/pthread_cond_broadcast.c
> @@ -40,10 +40,12 @@ __pthread_cond_broadcast (pthread_cond_t *cond)
>  {
>    LIBC_PROBE (cond_broadcast, 1, cond);
> 
> -  unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
> -  if (wrefs >> 3 == 0)
> +  unsigned int grefs0 = atomic_load_relaxed (cond->__data.__g_refs);
> +  unsigned int grefs1 = atomic_load_relaxed (cond->__data.__g_refs +
> 1);
> +  if ((grefs0 >> 1) == 0 && (grefs1 >> 1) == 0)

See below.

>      return 0;
> -  int private = __condvar_get_private (wrefs);
> +  unsigned int flags = atomic_load_relaxed (&cond->__data.__wrefs);
> +  int private = __condvar_get_private (flags);
> 
>    __condvar_acquire_lock (cond, private);
> 
> diff --git a/nptl/pthread_cond_destroy.c
> b/nptl/pthread_cond_destroy.c
> index 31034905d1..1c27385f89 100644
> --- a/nptl/pthread_cond_destroy.c
> +++ b/nptl/pthread_cond_destroy.c
> @@ -37,22 +37,35 @@
>     signal or broadcast calls.
>     Thus, we can assume that all waiters that are still accessing the
> condvar
>     have been woken.  We wait until they have confirmed to have woken
> up by
> -   decrementing __wrefs.  */
> +   decrementing __g_refs.  */
>  int
>  __pthread_cond_destroy (pthread_cond_t *cond)
>  {
>    LIBC_PROBE (cond_destroy, 1, cond);
> 
> -  /* Set the wake request flag.  We could also spin, but destruction
> that is
> -     concurrent with still-active waiters is probably neither common
> nor
> -     performance critical.  Acquire MO to synchronize with waiters
> confirming
> -     that they finished.  */
> -  unsigned int wrefs = atomic_fetch_or_acquire (&cond-
> >__data.__wrefs, 4);
> -  int private = __condvar_get_private (wrefs);
> +  unsigned int flags = atomic_load_relaxed (&cond->__data.__wrefs);
> +  int private = __condvar_get_private (flags);
> +  for (unsigned g = 0; g < 2; ++g)
> +    {
> +      while (true)
> +	{
> +	  /* Set the wake request flag.  We could also spin, but
> destruction that is
> +	     concurrent with still-active waiters is probably neither
> common nor
> +	     performance critical.  Acquire MO to synchronize with
> waiters confirming
> +	     that they finished.  */
> +	  unsigned r = atomic_fetch_or_acquire (cond->__data.__g_refs +
> g, 1) | 1;
> +	  if (r == 1)
> +	    break;

You wait until the refcount is zero, but that is not necessarily the
last access to cond in __condvar_dev_grefs.  Hence this does not ensure
destruction safety.  Also see below.

> +	  futex_wait_simple (cond->__data.__g_refs + g, r, private);
> +	}
> +    }
> +
> +  /* Same as above, except to synchronize with canceled
> threads.  This wake
> +     flag never gets cleared, so it's enough to set it once.  */
> +  unsigned int wrefs = atomic_fetch_or_acquire (&cond-
> >__data.__wrefs, 4) | 4;
>    while (wrefs >> 3 != 0)
>      {
>        futex_wait_simple (&cond->__data.__wrefs, wrefs, private);
> -      /* See above.  */
>        wrefs = atomic_load_acquire (&cond->__data.__wrefs);
>      }
>    /* The memory the condvar occupies can now be reused.  */
> diff --git a/nptl/pthread_cond_signal.c b/nptl/pthread_cond_signal.c
> index 4281ad4d3b..0cd534cc40 100644
> --- a/nptl/pthread_cond_signal.c
> +++ b/nptl/pthread_cond_signal.c
> @@ -39,10 +39,12 @@ __pthread_cond_signal (pthread_cond_t *cond)
>    /* First check whether there are waiters.  Relaxed MO is fine for
> that for
>       the same reasons that relaxed MO is fine when observing __wseq
> (see
>       below).  */
> -  unsigned int wrefs = atomic_load_relaxed (&cond->__data.__wrefs);
> -  if (wrefs >> 3 == 0)
> +  unsigned int grefs0 = atomic_load_relaxed (cond->__data.__g_refs);
> +  unsigned int grefs1 = atomic_load_relaxed (cond->__data.__g_refs +
> 1);
> +  if ((grefs0 >> 1) == 0 && (grefs1 >> 1) == 0)
>      return 0;

This really needs an explanation why that is supposed to work.  The
existing comments talk about a single atomic load of wseq, but here you
have two separate atomic loads.

I believe it should be correct, but this isn't obvious and thus should
be clearly explained in a comment.

> -  int private = __condvar_get_private (wrefs);
> +  unsigned int flags = atomic_load_relaxed (&cond->__data.__wrefs);
> +  int private = __condvar_get_private (flags);
> 
>    __condvar_acquire_lock (cond, private);
> 
> diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
> index 7b825f81df..0ee0247874 100644
> --- a/nptl/pthread_cond_wait.c
> +++ b/nptl/pthread_cond_wait.c
> @@ -43,19 +43,6 @@ struct _condvar_cleanup_buffer
>  };
> 
> 
> -/* Decrease the waiter reference count.  */
> -static void
> -__condvar_confirm_wakeup (pthread_cond_t *cond, int private)
> -{
> -  /* If destruction is pending (i.e., the wake-request flag is
> nonzero) and we
> -     are the last waiter (prior value of __wrefs was 1 << 3), then
> wake any
> -     threads waiting in pthread_cond_destroy.  Release MO to
> synchronize with
> -     these threads.  Don't bother clearing the wake-up request
> flag.  */
> -  if ((atomic_fetch_add_release (&cond->__data.__wrefs, -8) >> 2) ==
> 3)
> -    futex_wake (&cond->__data.__wrefs, INT_MAX, private);
> -}
> -
> -
>  /* Cancel waiting after having registered as a waiter
> previously.  SEQ is our
>     position and G is our group index.
>     The goal of cancellation is to make our group smaller if that is
> still
> @@ -81,7 +68,7 @@ __condvar_cancel_waiting (pthread_cond_t *cond,
> uint64_t seq, unsigned int g,
>  {
>    bool consumed_signal = false;
> 
> -  /* No deadlock with group switching is possible here because we
> have do
> +  /* No deadlock with group switching is possible here because we do
>       not hold a reference on the group.  */
>    __condvar_acquire_lock (cond, private);
> 
> @@ -172,6 +159,14 @@ __condvar_cleanup_waiting (void *arg)
>    pthread_cond_t *cond = cbuffer->cond;
>    unsigned g = cbuffer->wseq & 1;
> 
> +  /* Normally we are not allowed to touch cond any more after 

, after "Normally" and s/any more/anymore/

> calling
> +     __condvar_dec_grefs

Precisely, we are not allowed to touch memory anymore after
__condvar_confirm_wakeup -- which takes different actions than
__condvar_dec_grefs.  So these aren't quite the same.

> , because pthread_cond_destroy looks at __g_refs to
> +     determine when all waiters have woken. Since we will do more
> work in this
> +     function, we are using an extra channel to communicate to
> +     pthread_cond_destroy that it is not allowed to finish yet: We
> increment
> +     the fourth bit on __wrefs.

You increment the refcount consisting of the bits starting at the
fourth bit, not just the fourth bit.

> Relaxed MO is enough. The synchronization
> +     happens because __condvar_dec_grefs uses release MO. */
> +  atomic_fetch_add_relaxed (&cond->__data.__wrefs, 8);
>    __condvar_dec_grefs (cond, g, cbuffer->private);
> 
>    __condvar_cancel_waiting (cond, cbuffer->wseq >> 1, g, cbuffer-
> >private);
> @@ -183,7 +178,12 @@ __condvar_cleanup_waiting (void *arg)
>       conservatively.  */
>    futex_wake (cond->__data.__g_signals + g, 1, cbuffer->private);
> 
> -  __condvar_confirm_wakeup (cond, cbuffer->private);
> +  /* If destruction is pending (i.e., the wake-request flag is
> nonzero) and we
> +     are the last waiter (prior value of __wrefs was 1 << 3), then
> wake any
> +     threads waiting in pthread_cond_destroy.  Release MO to
> synchronize with
> +     these threads.  Don't bother clearing the wake-up request
> flag.  */
> +  if ((atomic_fetch_add_release (&cond->__data.__wrefs, -8) >> 2) ==
> 3)
> +    futex_wake (&cond->__data.__wrefs, INT_MAX, cbuffer->private);
> 
>    /* XXX If locking the mutex fails, should we just stop
> execution?  This
>       might be better than silently ignoring the error.  */
> @@ -287,20 +287,21 @@ __condvar_cleanup_waiting (void *arg)
>     __g1_orig_size: Initial size of G1
>       * The two least-significant bits represent the condvar-internal 
> lock.
>       * Only accessed while having acquired the condvar-internal
> lock.
> -   __wrefs: Waiter reference counter.
> +   __wrefs: Flags and count of waiters who called pthread_cancel.

...and reference counter for waiters that called...

Also update the previous paragraphs in this algorithm overview that
discuss __g_refs and __wrefs. 

>       * Bit 2 is true if waiters should run futex_wake when they
> remove the
>         last reference.  pthread_cond_destroy uses this as futex
> word.
>       * Bit 1 is the clock ID (0 == CLOCK_REALTIME, 1 ==
> CLOCK_MONOTONIC).
>       * Bit 0 is true iff this is a process-shared condvar.
> -     * Simple reference count used by both waiters and
> pthread_cond_destroy.
> -     (If the format of __wrefs is changed, update
> nptl_lock_constants.pysym
> -      and the pretty printers.)
> +     * Simple reference count used by __condvar_cleanup_waiting and
> pthread_cond_destroy.
> +     (If the format of __wrefs is changed, update the pretty
> printers.)
>     For each of the two groups, we have:
>     __g_refs: Futex waiter reference count.
>       * LSB is true if waiters should run futex_wake when they remove
> the
>         last reference.
>       * Reference count used by waiters concurrently with signalers
> that have
>         acquired the condvar-internal lock.
> +     (If the format of __g_refs is changed, update
> nptl_lock_constants.pysym
> +      and the pretty printers.)
>     __g_signals: The number of signals that can still be consumed.
>       * Used as a futex word by waiters.  Used concurrently by
> waiters and
>         signalers.
> @@ -409,9 +410,7 @@ __pthread_cond_wait_common (pthread_cond_t *cond,
> pthread_mutex_t *mutex,
>       synchronize with the dummy read-modify-write in
>       __condvar_quiesce_and_switch_g1 if we read from that.  */
>    atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);
> -  /* Increase the waiter reference count.  Relaxed MO is sufficient
> because
> -     we only need to synchronize when decrementing the reference
> count.  */
> -  unsigned int flags = atomic_fetch_add_relaxed (&cond-
> >__data.__wrefs, 8);
> +  unsigned int flags = atomic_load_relaxed (&cond->__data.__wrefs);
>    int private = __condvar_get_private (flags);
> 
>    /* Now that we are registered as a waiter, we can release the
> mutex.
> @@ -425,7 +424,6 @@ __pthread_cond_wait_common (pthread_cond_t *cond,
> pthread_mutex_t *mutex,
>    if (__glibc_unlikely (err != 0))
>      {
>        __condvar_cancel_waiting (cond, seq, g, private);
> -      __condvar_confirm_wakeup (cond, private);
>        __condvar_dec_grefs (cond, g, private);
>        return err;
>      }
> @@ -530,7 +528,6 @@ __pthread_cond_wait_common (pthread_cond_t *cond,
> pthread_mutex_t *mutex,
>    /* Confirm that we have been woken.  We do that before acquiring
> the mutex
>       to allow for execution of pthread_cond_destroy while having
> acquired the
>       mutex.  */
> -  __condvar_confirm_wakeup (cond, private);
>    __condvar_dec_grefs (cond, g, private);

__condvar_dec_grefs clears the wake-up request flag after resetting the
waiter flag.  Thus, the former will not be the last access to the
condvar.