From: Jonathan Wakely
Date: Sat, 9 Mar 2024 12:18:52 +0000
Subject: Re: [PATCH 1/2] libstdc++: Atomic wait/notify ABI stabilization
To: libstdc++@gcc.gnu.org, gcc-patches@gcc.gnu.org
Cc: Thomas Rodgers, Nate Eldredge

On Thu, 16 Nov 2023 at 13:49, Jonathan Wakely wrote: > From: Thomas Rodgers > > These two patches were written by Tom earlier this year, before he left > Red Hat. We should finish reviewing them for GCC 14 (and probably squash > them into one?) > > Tom, you mentioned further work that changes the __platform_wait_t* to > uintptr_t, is that ready, or likely to be ready very soon? > > Tested x86_64-linux, testing underway on powerpc-aix and sparc-solaris. > > > -- >8 -- > > This represents a major refactoring of the previous atomic::wait > and atomic::notify implementation detail. The aim of this change > is to simplify the implementation details and position the resulting > implementation so that much of the current header-only detail > can be moved into the shared library, while also accounting for > anticipated changes to wait/notify functionality for C++26. > > The previous implementation implemented spin logic in terms of > the types __default_spin_policy, __timed_backoff_spin_policy, and > the free function __atomic_spin. These are replaced in favor of > two new free functions: __spin_impl and __spin_until_impl. These > currently inline free functions are expected to be moved into the > libstdc++ shared library in a future commit. > > The previous implementation derived untimed and timed wait > implementation detail from __detail::__waiter_pool_base. This > is-a relationship is removed in the new version and the previous > implementation detail is renamed to reflect this change. The > static _S_for member has been renamed as well to indicate that it > returns the __waiter_pool_impl entry in the static 'side table' > for a given awaited address. > > This new implementation replaces all of the non-templated waiting > detail of __waiter_base, __waiter_pool, __waiter, __enters_wait, and > __bare_wait with the __wait_impl free function, and the supporting > __wait_flags enum and __wait_args struct. This currently inline free > function is expected to be moved into the libstdc++ shared library > in a future commit. > > This new implementation replaces all of the non-templated notifying > detail of __waiter_base, __waiter_pool, and __waiter with the > __notify_impl free function. This currently inline free function > is expected to be moved into the libstdc++ shared library in a > future commit.
> > The __atomic_wait_address template function is updated to account > for the above changes and to support the expected C++26 change to > pass the most recent observed value to the caller supplied predicate. > > A new non-templated __atomic_wait_address_v free function is added > that only works for atomic types that operate only on __platform_wait_t > and requires the caller to supply a memory order. This is intended > to be the simplest code path for such types. > > The __atomic_wait_address_v template function is now implemented in > terms of the new __atomic_wait_address template and continues to accept > a user supplied "value function" to retrieve the current value of > the atomic. > > The __atomic_notify_address template function is updated to account > for the above changes. > > The template __platform_wait_until_impl is renamed to > __platform_wait_until. The previous __platform_wait_until template is deleted > and the functionality previously provided is moved to the new template > function __wait_until. A similar change is made to the > __cond_wait_until_impl/__cond_wait_until implementation. > > This new implementation similarly replaces all of the non-templated > waiting detail of __timed_waiter_pool, __timed_waiter, etc. with > the new __wait_until_impl free function. This currently inline free > function is expected to be moved into the libstdc++ shared library > in a future commit. > > This implementation replaces all templated waiting functions that > manage clock conversion as well as relative waiting (wait_for) with > the new template functions __wait_until and __wait_for. > > Similarly the previous implementation detail for the various > __atomic_wait_address_Xxx templates is adjusted to account for the > implementation changes outlined above. > > All of the "bare wait" versions of __atomic_wait_Xxx have been removed > and replaced with a defaulted boolean __bare_wait parameter on the > new version of these templates. > > libstdc++-v3/ChangeLog: > > * include/bits/atomic_timed_wait.h: > (__detail::__platform_wait_until_impl): Rename to > __platform_wait_until. > (__detail::__platform_wait_until): Remove previous > definition. > (__detail::__cond_wait_until_impl): Rename to > __cond_wait_until. > (__detail::__cond_wait_until): Remove previous > definition. > (__detail::__spin_until_impl): New function. > (__detail::__wait_until_impl): New function. > (__detail::__wait_until): New function. > (__detail::__wait_for): New function. > (__detail::__timed_waiter_pool): Remove type. > (__detail::__timed_backoff_spin_policy): Remove type. > (__detail::__timed_waiter): Remove type. > (__detail::__enters_timed_wait): Remove type alias. > (__detail::__bare_timed_wait): Remove type alias. > (__atomic_wait_address_until): Adjust to new implementation > detail. > (__atomic_wait_address_until_v): Likewise. > (__atomic_wait_address_bare): Remove. > (__atomic_wait_address_for): Adjust to new implementation > detail. > (__atomic_wait_address_for_v): Likewise. > (__atomic_wait_address_for_bare): Remove. > * include/bits/atomic_wait.h: Include bits/stl_pair.h. > (__detail::__default_spin_policy): Remove type. > (__detail::__atomic_spin): Remove function. > (__detail::__waiter_pool_base): Rename to __waiter_pool_impl. > Remove _M_notify. Rename _S_for to _S_impl_for. > (__detail::__waiter_base): Remove type. > (__detail::__waiter_pool): Remove type. > (__detail::__waiter): Remove type. > (__detail::__enters_wait): Remove type alias. > (__detail::__bare_wait): Remove type alias. 
> (__detail::__wait_flags): New enum. > (__detail::__wait_args): New struct. > (__detail::__wait_result_type): New type alias. > (__detail::__spin_impl): New function. > (__detail::__wait_impl): New function. > (__atomic_wait_address): Adjust to new implementation detail. > (__atomic_wait_address_v): Likewise. > (__atomic_notify_address): Likewise. > (__atomic_wait_address_bare): Delete. > (__atomic_notify_address_bare): Likewise. > * include/bits/semaphore_base.h: Adjust implementation to > use new __atomic_wait_address_v contract. > * include/std/barrier: Adjust implementation to use new > __atomic_wait contract. > * include/std/latch: Adjust implementation to use new > __atomic_wait contract. > * testsuite/29_atomics/atomic/wait_notify/100334.cc (main): > Adjust to for __detail::__waiter_pool_base renaming. > --- > libstdc++-v3/include/bits/atomic_timed_wait.h | 549 ++++++++---------- > libstdc++-v3/include/bits/atomic_wait.h | 486 ++++++++-------- > libstdc++-v3/include/bits/semaphore_base.h | 53 +- > libstdc++-v3/include/std/barrier | 6 +- > libstdc++-v3/include/std/latch | 5 +- > .../29_atomics/atomic/wait_notify/100334.cc | 4 +- > 6 files changed, 514 insertions(+), 589 deletions(-) > > diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h > b/libstdc++-v3/include/bits/atomic_timed_wait.h > index ba21ab20a11..90427ef8b42 100644 > --- a/libstdc++-v3/include/bits/atomic_timed_wait.h > +++ b/libstdc++-v3/include/bits/atomic_timed_wait.h > @@ -74,62 +74,32 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > #ifdef _GLIBCXX_HAVE_LINUX_FUTEX > #define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > // returns true if wait ended before timeout > - template > - bool > - __platform_wait_until_impl(const __platform_wait_t* __addr, > - __platform_wait_t __old, > - const chrono::time_point<__wait_clock_t, > _Dur>& > - __atime) noexcept > - { > - auto __s = chrono::time_point_cast(__atime); > - auto __ns = chrono::duration_cast(__atime - > __s); > For much of the mail below I've removed many of the - lines in the diff, because the changes are huge and often the removed - lines are completely unrelated to the added + lines adjacent to them. Just seeing the new code without unrelated code next to it is clearer. + bool > + __platform_wait_until(const __platform_wait_t* __addr, > + __platform_wait_t __old, > + const __wait_clock_t::time_point& __atime) > noexcept > + { > + auto __s = chrono::time_point_cast(__atime); > + auto __ns = chrono::duration_cast(__atime - > __s); > > - struct timespec __rt = > + struct timespec __rt = > { > static_cast(__s.time_since_epoch().count()), > static_cast(__ns.count()) > }; > > + auto __e = syscall (SYS_futex, __addr, > + > static_cast(__futex_wait_flags::__wait_bitset_private), > + __old, &__rt, nullptr, > + > static_cast(__futex_wait_flags::__bitset_match_any)); > + if (__e) > + { > + if (errno == ETIMEDOUT) > return false; > - } > + if (errno != EINTR && errno != EAGAIN) > + __throw_system_error(errno); > + } > + return true; > } > #else > // define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement > __platform_wait_until() > @@ -139,15 +109,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > > #ifdef _GLIBCXX_HAS_GTHREADS > // Returns true if wait ended before timeout. 
> > + bool > + __cond_wait_until(__condvar& __cv, mutex& __mx, > + const __wait_clock_t::time_point& __atime) > + { > auto __s = chrono::time_point_cast(__atime); > auto __ns = chrono::duration_cast(__atime - > __s); > > @@ -158,293 +123,261 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > }; > > #ifdef _GLIBCXX_USE_PTHREAD_COND_CLOCKWAIT > - if constexpr (is_same_v) > + if constexpr (is_same_v) > __cv.wait_until(__mx, CLOCK_MONOTONIC, __ts); > else > #endif > __cv.wait_until(__mx, __ts); > > + return __wait_clock_t::now() < __atime; > } > #endif // _GLIBCXX_HAS_GTHREADS > > + inline __wait_result_type > + __spin_until_impl(const __platform_wait_t* __addr, __wait_args __args, > + const __wait_clock_t::time_point& __deadline) > { > > + auto __t0 = __wait_clock_t::now(); > + using namespace literals::chrono_literals; > + > + __platform_wait_t __val; > + auto __now = __wait_clock_t::now(); > + for (; __now < __deadline; __now = __wait_clock_t::now()) > + { > + auto __elapsed = __now - __t0; > +#ifndef _GLIBCXX_NO_SLEEP > + if (__elapsed > 128ms) > + { > + this_thread::sleep_for(64ms); > + } > + else if (__elapsed > 64us) > + { > + this_thread::sleep_for(__elapsed / 2); > + } > + else > +#endif > + if (__elapsed > 4us) > + { > + __thread_yield(); > + } > + else > + { > + auto __res = __detail::__spin_impl(__addr, __args); > + if (__res.first) > + return __res; > + } > + > + __atomic_load(__addr, &__val, __args._M_order); > + if (__val != __args._M_old) > + return make_pair(true, __val); > + } > + return make_pair(false, __val); > + } > + > + inline __wait_result_type > + __wait_until_impl(const __platform_wait_t* __addr, __wait_args __args, > + const __wait_clock_t::time_point& __atime) > + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > + __waiter_pool_impl* __pool = nullptr; > +#else > + // if we don't have __platform_wait, we always need the side-table > + __waiter_pool_impl* __pool = > &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + > + __platform_wait_t* __wait_addr; > + if (__args & __wait_flags::__proxy_wait) > { > #ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > > > + __pool = &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + __wait_addr = &__pool->_M_ver; > + __atomic_load(__wait_addr, &__args._M_old, __args._M_order); > } > > + else > + __wait_addr = const_cast<__platform_wait_t*>(__addr); > > + if (__args & __wait_flags::__do_spin) > + { > + auto __res = __detail::__spin_until_impl(__wait_addr, __args, > __atime); > + if (__res.first) > + return __res; > + if (__args & __wait_flags::__spin_only) > + return __res; > As observed by Nate Eldredge, this early return is always taken, even if the caller didn't set __spin_only. That's because args & __spin_only is args & 12 which is true if do_spin is set (which it must be or we wouldn't have entered this block). That means that any call to __wait_impl with __do_spin set in the flags (which is in the default flags) will return before ever waiting, resulting in the caller doing a busy loop. + } > > + if (!(__args & __wait_flags::__track_contention)) > This condition seems backwards. We want to track contention (i.e. use _M_enter_wait and _M_leave_wait) when the flag is set, but this does it when it's not set. { > > > + // caller does not externally track contention > +#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > + __pool = (__pool == nullptr) ? 
> &__waiter_pool_impl::_S_impl_for(__addr) > + : __pool; > This would be simpler as just: if (!__pool) __pool = ...; > +#endif > + __pool->_M_enter_wait(); > } > > + __wait_result_type __res; > +#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT > + if (__platform_wait_until(__wait_addr, __args._M_old, __atime)) > + __res = make_pair(true, __args._M_old); > + else > + __res = make_pair(false, __args._M_old); > +#else > + __platform_wait_t __val; > + __atomic_load(__wait_addr, &__val, __args._M_order); > + if (__val == __args._M_old) > + { > + lock_guard __l{ __pool->_M_mtx }; > + __atomic_load(__wait_addr, &__val, __args._M_order); > + if (__val == __args._M_old && > + __cond_wait_until(__pool->_M_cv, __pool->_M_mtx, __atime)) > + __res = make_pair(true, __val); > There's a missing 'else' here, so we don't set anything in __res in this case. I think it probably wants to be {false, __val} ? + } > + else > + __res = make_pair(false, __val); > +#endif > + > + if (!(__args & __wait_flags::__track_contention)) > + // caller does not externally track contention > + __pool->_M_leave_wait(); > + return __res; > + } > + > + template > + __wait_result_type > + __wait_until(const __platform_wait_t* __addr, __wait_args __args, > + const chrono::time_point<_Clock, _Dur>& __atime) > noexcept > { > > + if constexpr (is_same_v<__wait_clock_t, _Clock>) > + return __detail::__wait_until_impl(__addr, __args, __atime); > + else > { > > + auto __res = __detail::__wait_until_impl(__addr, __args, > + __to_wait_clock(__atime)); > + if (!__res.first) > + { > + // We got a timeout when measured against __clock_t but > + // we need to check against the caller-supplied clock > + // to tell whether we should return a timeout. > + if (_Clock::now() < __atime) > + return make_pair(true, __res.second); > + } > + return __res; > + } > + } > > > + template > + __wait_result_type > + __wait_for(const __platform_wait_t* __addr, __wait_args __args, > + const chrono::duration<_Rep, _Period>& __rtime) noexcept > + { > + if (!__rtime.count()) > + // no rtime supplied, just spin a bit > + return __detail::__wait_impl(__addr, __args | > __wait_flags::__spin_only); > This should set __do_spin | __spin_only if the latter no longer implies the former. 
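To spell that out, here is roughly what I mean (a sketch using the patch's names, untested): if __spin_only becomes a plain bit that no longer includes __do_spin, then the `__args & __wait_flags::__spin_only` check in __wait_impl/__wait_until_impl only fires when the caller really asked for a spin-only wait, and the no-timeout branch of __wait_for has to request both bits explicitly:

  enum class __wait_flags : uint32_t
  {
    __abi_version = 0,
    __proxy_wait = 1,
    __track_contention = 2,
    __do_spin = 4,
    __spin_only = 8, // no longer implies __do_spin
    __abi_version_mask = 0xffff0000,
  };

  // in __wait_for, when no relative time is supplied:
  if (!__rtime.count())
    // no rtime supplied, just spin a bit
    return __detail::__wait_impl(__addr,
	       __args | __wait_flags::__do_spin | __wait_flags::__spin_only);
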
> + auto const __reltime = > chrono::ceil<__wait_clock_t::duration>(__rtime); > + auto const __atime = chrono::steady_clock::now() + __reltime; > + return __detail::__wait_until(__addr, __args, __atime); > + } > } // namespace __detail > > // returns true if wait ended before timeout > + template + typename _Pred, typename _ValFn, > + typename _Clock, typename _Dur> > + bool > + __atomic_wait_address_until(const _Tp* __addr, _Pred&& __pred, > + _ValFn&& __vfn, > + const chrono::time_point<_Clock, _Dur>& > __atime, > + bool __bare_wait = false) noexcept > + { > + const auto __wait_addr = > + reinterpret_cast(__addr); > + __detail::__wait_args __args{ __addr, __bare_wait }; > + _Tp __val = __vfn(); > + while (!__pred(__val)) > + { > + auto __res = __detail::__wait_until(__wait_addr, __args, > __atime); > + if (!__res.first) > + // timed out > + return __res.first; // C++26 will also return last observed > __val > + __val = __vfn(); > + } > + return true; // C++26 will also return last observed __val > + } > + > + template > + bool > + __atomic_wait_address_until_v(const __detail::__platform_wait_t* > __addr, > + __detail::__platform_wait_t __old, > + int __order, > + const chrono::time_point<_Clock, _Dur>& > __atime, > + bool __bare_wait = false) noexcept > + { > + __detail::__wait_args __args{ __addr, __old, __order, __bare_wait }; > + auto __res = __detail::__wait_until(__addr, __args, __atime); > + return __res.first; // C++26 will also return last observed __val > + } > + > template typename _Clock, typename _Dur> > bool > __atomic_wait_address_until_v(const _Tp* __addr, _Tp&& __old, > _ValFn&& __vfn, > - const chrono::time_point<_Clock, _Dur>& > - __atime) noexcept > + const chrono::time_point<_Clock, _Dur>& > __atime, > + bool __bare_wait = false) noexcept > { > - __detail::__enters_timed_wait __w{__addr}; > - return __w._M_do_wait_until_v(__old, __vfn, __atime); > + auto __pfn = [&](const _Tp& __val) > + { return !__detail::__atomic_compare(__old, __val); }; > + return __atomic_wait_address_until(__addr, __pfn, > forward<_ValFn>(__vfn), > + __atime, __bare_wait); > } > > - template - typename _Clock, typename _Dur> > - bool > - __atomic_wait_address_until(const _Tp* __addr, _Pred __pred, > - const chrono::time_point<_Clock, _Dur>& > - __atime) > noexcept > - { > - __detail::__enters_timed_wait __w{__addr}; > - return __w._M_do_wait_until(__pred, __atime); > - } > + template + typename _Pred, typename _ValFn, > + typename _Rep, typename _Period> > + bool > + __atomic_wait_address_for(const _Tp* __addr, _Pred&& __pred, > + _ValFn&& __vfn, > + const chrono::duration<_Rep, _Period>& > __rtime, > + bool __bare_wait = false) noexcept > + { > + const auto __wait_addr = > + reinterpret_cast(__addr); > + __detail::__wait_args __args{ __addr, __bare_wait }; > + _Tp __val = __vfn(); > + while (!__pred(__val)) > + { > + auto __res = __detail::__wait_for(__wait_addr, __args, __rtime); > + if (!__res.first) > + // timed out > + return __res.first; // C++26 will also return last observed > __val > + __val = __vfn(); > + } > + return true; // C++26 will also return last observed __val > + } > > - template - typename _Clock, typename _Dur> > + template > bool > - __atomic_wait_address_until_bare(const __detail::__platform_wait_t* > __addr, > - _Pred __pred, > - const chrono::time_point<_Clock, _Dur>& > - __atime) > noexcept > - { > - __detail::__bare_timed_wait __w{__addr}; > - return __w._M_do_wait_until(__pred, __atime); > - } > + __atomic_wait_address_for_v(const __detail::__platform_wait_t* 
__addr, > + __detail::__platform_wait_t __old, > + int __order, > + const chrono::time_point<_Rep, _Period>& > __rtime, > + bool __bare_wait = false) noexcept > + { > + __detail::__wait_args __args{ __addr, __old, __order, __bare_wait }; > + auto __res = __detail::__wait_for(__addr, __args, __rtime); > + return __res.first; // C++26 will also return last observed __Val > + } > > template typename _Rep, typename _Period> > bool > __atomic_wait_address_for_v(const _Tp* __addr, _Tp&& __old, _ValFn&& > __vfn, > - const chrono::duration<_Rep, _Period>& __rtime) > noexcept > + const chrono::duration<_Rep, _Period>& > __rtime, > + bool __bare_wait = false) noexcept > { > - __detail::__enters_timed_wait __w{__addr}; > - return __w._M_do_wait_for_v(__old, __vfn, __rtime); > - } > - > - template - typename _Rep, typename _Period> > - bool > - __atomic_wait_address_for(const _Tp* __addr, _Pred __pred, > - const chrono::duration<_Rep, _Period>& __rtime) > noexcept > - { > - > - __detail::__enters_timed_wait __w{__addr}; > - return __w._M_do_wait_for(__pred, __rtime); > - } > - > - template - typename _Rep, typename _Period> > - bool > - __atomic_wait_address_for_bare(const __detail::__platform_wait_t* > __addr, > - _Pred __pred, > - const chrono::duration<_Rep, _Period>& __rtime) > noexcept > - { > - __detail::__bare_timed_wait __w{__addr}; > - return __w._M_do_wait_for(__pred, __rtime); > + auto __pfn = [&](const _Tp& __val) > + { return !__detail::__atomic_compare(__old, __val); }; > + return __atomic_wait_address_for(__addr, __pfn, > forward<_ValFn>(__vfn), > + __rtime, __bare_wait); > } > _GLIBCXX_END_NAMESPACE_VERSION > } // namespace std > diff --git a/libstdc++-v3/include/bits/atomic_wait.h > b/libstdc++-v3/include/bits/atomic_wait.h > index 506b7f0cc9d..63d132fc9a8 100644 > --- a/libstdc++-v3/include/bits/atomic_wait.h > +++ b/libstdc++-v3/include/bits/atomic_wait.h > @@ -47,7 +47,8 @@ > # include > #endif > > -# include // std::mutex, std::__condvar > +#include > +#include // std::mutex, std::__condvar > > namespace std _GLIBCXX_VISIBILITY(default) > { > @@ -131,7 +132,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > __thread_yield() noexcept > { > #if defined _GLIBCXX_HAS_GTHREADS && defined _GLIBCXX_USE_SCHED_YIELD > - __gthread_yield(); > + __gthread_yield(); > #endif > } > > @@ -188,7 +157,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0; > } > > - struct __waiter_pool_base > + struct __waiter_pool_impl > Would it make sense to call this type "__wait_proxy" or something like that? "Waiter pool impl" doesn't really tell me much, where is the waiter pool that this is implementing? There used to be __waiter_pool and __waiter_pool_base and __waiter, but now there's just __waiter_pool_impl which isn't very descriptive in isolation. > { > // Don't use std::hardware_destructive_interference_size here > because we > // don't want the layout of library types to depend on compiler > options. 
> @@ -205,7 +174,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > #ifndef _GLIBCXX_HAVE_PLATFORM_WAIT > __condvar _M_cv; > #endif > - __waiter_pool_base() = default; > + __waiter_pool_impl() = default; > > void > _M_enter_wait() noexcept > > - > - static __waiter_pool_base& > - _S_for(const void* __addr) noexcept > + static __waiter_pool_impl& > + _S_impl_for(const void* __addr) noexcept > { > constexpr uintptr_t __ct = 16; > - static __waiter_pool_base __w[__ct]; > + static __waiter_pool_impl __w[__ct]; > auto __key = (uintptr_t(__addr) >> 2) % __ct; > return __w[__key]; > } > }; > > + enum class __wait_flags : uint32_t > { > > + __abi_version = 0, > + __proxy_wait = 1, > + __track_contention = 2, > + __do_spin = 4, > + __spin_only = 8 | __do_spin, // implies __do_spin > This should be just 8 and not also imply __do_spin. > + __abi_version_mask = 0xffff0000, > }; > > - template > - struct __waiter_base > + struct __wait_args > + { > + __platform_wait_t _M_old; > + int _M_order = __ATOMIC_ACQUIRE; > + __wait_flags _M_flags; > + > + template > + explicit __wait_args(const _Tp* __addr, > + bool __bare_wait = false) noexcept > + : _M_flags{ _S_flags_for(__addr, __bare_wait) } > + { } > + > + __wait_args(const __platform_wait_t* __addr, __platform_wait_t > __old, > + int __order, bool __bare_wait = false) noexcept > + : _M_old{ __old } > + , _M_order{ __order } > + , _M_flags{ _S_flags_for(__addr, __bare_wait) } > + { } > + > + __wait_args(const __wait_args&) noexcept = default; > + __wait_args& > + operator=(const __wait_args&) noexcept = default; > + > + bool > + operator&(__wait_flags __flag) const noexcept > { > > > + using __t = underlying_type_t<__wait_flags>; > + return static_cast<__t>(_M_flags) > + & static_cast<__t>(__flag); > + } > > + __wait_args > + operator|(__wait_flags __flag) const noexcept > + { > + using __t = underlying_type_t<__wait_flags>; > + __wait_args __res{ *this }; > + const auto __flags = static_cast<__t>(__res._M_flags) > + | static_cast<__t>(__flag); > + __res._M_flags = __wait_flags{ __flags }; > + return __res; > + } > Seems like it would be clearer to define operator| for the actual enum type, so that these member functions don't need to repeat all the casting logic. > + private: > + static int > + constexpr _S_default_flags() noexcept > + { > + using __t = underlying_type_t<__wait_flags>; > + return static_cast<__t>(__wait_flags::__abi_version) > + | static_cast<__t>(__wait_flags::__do_spin); > + } > > + template > + static int > + constexpr _S_flags_for(const _Tp*, bool __bare_wait) noexcept > { > > + auto __res = _S_default_flags(); > + if (!__bare_wait) > + __res |= static_cast(__wait_flags::__track_contention); > + if constexpr (!__platform_wait_uses_type<_Tp>) > + __res |= static_cast(__wait_flags::__proxy_wait); > + return __res; > } > + template > + static int > + _S_memory_order_for(const _Tp*, int __order) noexcept > { > > + if constexpr (__platform_wait_uses_type<_Tp>) > + return __order; > + return __ATOMIC_ACQUIRE; > + } > + }; > This function is never used. Should it be? Is the idea to force the use of acquire ordering when waiting on a proxy instead of waiting on the atomic variable directly? Is that done elsewhere? Can this function be removed, or should it be used somewhere? 
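Going back to the operator| comment just above, something like these non-member overloads (untested sketch) would let __wait_args::operator| and operator& be written without repeating the casting logic:

  constexpr __wait_flags
  operator|(__wait_flags __a, __wait_flags __b) noexcept
  {
    using __t = underlying_type_t<__wait_flags>;
    return static_cast<__wait_flags>(static_cast<__t>(__a)
				       | static_cast<__t>(__b));
  }

  constexpr bool
  operator&(__wait_flags __a, __wait_flags __b) noexcept
  {
    using __t = underlying_type_t<__wait_flags>;
    return static_cast<__t>(__a) & static_cast<__t>(__b);
  }
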
> + > + using __wait_result_type = pair; > + inline __wait_result_type > + __spin_impl(const __platform_wait_t* __addr, __wait_args __args) > + { > + __platform_wait_t __val; > + for (auto __i = 0; __i < __atomic_spin_count; ++__i) > + { > + __atomic_load(__addr, &__val, __args._M_order); > + if (__val != __args._M_old) > + return make_pair(true, __val); > + if (__i < __atomic_spin_count_relax) > + __detail::__thread_relax(); > + else > + __detail::__thread_yield(); > + } > + return make_pair(false, __val); > Every use of make_pair can be replaced by just {x, y} to avoid making a function call (which will compile faster too). > + } > + > + inline __wait_result_type > + __wait_impl(const __platform_wait_t* __addr, __wait_args __args) > + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __waiter_pool_impl* __pool = nullptr; > +#else > + // if we don't have __platform_wait, we always need the side-table > + __waiter_pool_impl* __pool = > &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + > + __platform_wait_t* __wait_addr; > This could be const __platform_wait_t* > + if (__args & __wait_flags::__proxy_wait) > + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __pool = &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + __wait_addr = &__pool->_M_ver; > + __atomic_load(__wait_addr, &__args._M_old, __args._M_order); > + } > + else > + __wait_addr = const_cast<__platform_wait_t*>(__addr); > + > + if (__args & __wait_flags::__do_spin) > + { > + auto __res = __detail::__spin_impl(__wait_addr, __args); > + if (__res.first) > + return __res; > + if (__args & __wait_flags::__spin_only) > + return __res; > As above, this early return is always taken, resulting in busy loops. } > > + if (!(__args & __wait_flags::__track_contention)) > + { > + // caller does not externally track contention > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __pool = (__pool == nullptr) ? > &__waiter_pool_impl::_S_impl_for(__addr) > + : __pool; > if (!pool) pool = ...; > +#endif > + __pool->_M_enter_wait(); > + } > > + __wait_result_type __res; > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __platform_wait(__wait_addr, __args._M_old); > + __res = make_pair(false, __args._M_old); > +#else > + __platform_wait_t __val; > + __atomic_load(__wait_addr, &__val, __args._M_order); > + if (__val == __args._M_old) > + { > + lock_guard __l{ __pool->_M_mtx }; > + __atomic_load(__wait_addr, &__val, __args._M_order); > + if (__val == __args._M_old) > + __pool->_M_cv.wait(__pool->_M_mtx); > + } > + __res = make_pair(false, __val); > +#endif > > - using __enters_wait = __waiter; > - using __bare_wait = __waiter; > + if (!(__args & __wait_flags::__track_contention)) > Condition is backwards? > + // caller does not externally track contention > + __pool->_M_leave_wait(); > This _M_leave_wait() could be done by an RAII type's destructor instead. > + return __res; > + } > + > + inline void > + __notify_impl(const __platform_wait_t* __addr, [[maybe_unused]] bool > __all, > + __wait_args __args) > + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __waiter_pool_impl* __pool = nullptr; > +#else > + // if we don't have __platform_notify, we always need the side-table > + __waiter_pool_impl* __pool = > &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + > + if (!(__args & __wait_flags::__track_contention)) > Condition is backwards? 
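On the RAII point a few comments up, I'm thinking of a small guard type along these lines (hypothetical name, untested sketch), so that _M_enter_wait/_M_leave_wait stay paired on every path out of __wait_impl and __wait_until_impl, including exceptional ones:

  struct __scoped_waiter // hypothetical helper, not in the patch
  {
    explicit
    __scoped_waiter(__waiter_pool_impl* __pool) noexcept
    : _M_pool(__pool)
    {
      if (_M_pool)
	_M_pool->_M_enter_wait();
    }

    ~__scoped_waiter()
    {
      if (_M_pool)
	_M_pool->_M_leave_wait();
    }

    __scoped_waiter(const __scoped_waiter&) = delete;
    __scoped_waiter& operator=(const __scoped_waiter&) = delete;

    __waiter_pool_impl* _M_pool;
  };

The wait functions would construct it with the pool pointer when __wait_flags::__track_contention is set, and with nullptr for a bare wait.
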
> + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __pool = &__waiter_pool_impl::_S_impl_for(__addr); > +#endif > + if (!__pool->_M_waiting()) > + return; > + } > + > + __platform_wait_t* __wait_addr; > + if (__args & __wait_flags::__proxy_wait) > + { > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __pool = (__pool == nullptr) ? > &__waiter_pool_impl::_S_impl_for(__addr) > + : __pool; > if (!pool) pool = ...; > +#endif > + __wait_addr = &__pool->_M_ver; > + __atomic_fetch_add(__wait_addr, 1, __ATOMIC_RELAXED); > + __all = true; > + } > For a non-proxied wait, __wait_addr is uninitialized here. So the futex wait uses a random address. There needs to be `else __wait_addr = __addr;` + > +#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT > + __platform_notify(__wait_addr, __all); > +#else > + lock_guard __l{ __pool->_M_mtx }; > + __pool->_M_cv.notify_all(); > +#endif > + } > } // namespace __detail > > + template + typename _Pred, typename _ValFn> > + void > + __atomic_wait_address(const _Tp* __addr, > + _Pred&& __pred, _ValFn&& __vfn, > + bool __bare_wait = false) noexcept > + { > + const auto __wait_addr = > + reinterpret_cast(__addr); > + __detail::__wait_args __args{ __addr, __bare_wait }; > + _Tp __val = __vfn(); > + while (!__pred(__val)) > + { > As observed by Nate, the old value is never set in the __args, so the wait always uses 0 (the value set in the __wait_args constructor). It needs: if constexpr (__platform_wait_uses_type<_Tp>) __args._M_old = __builtin_bit_cast(__platform_wait_t, __val); For proxied waits _M_old isn't used, pool->_M_ver is used instead. > + __detail::__wait_impl(__wait_addr, __args); > + __val = __vfn(); > + } > + // C++26 will return __val > + } > + > + inline void > + __atomic_wait_address_v(const __detail::__platform_wait_t* __addr, > + __detail::__platform_wait_t __old, > + int __order) > + { > + __detail::__wait_args __args{ __addr, __old, __order }; > + // C++26 will not ignore the return value here > + __detail::__wait_impl(__addr, __args); > + } > + > template > void > __atomic_wait_address_v(const _Tp* __addr, _Tp __old, > _ValFn __vfn) noexcept > { > diff --git a/libstdc++-v3/include/std/latch > b/libstdc++-v3/include/std/latch > index 27cd80d7100..525647c1701 100644 > --- a/libstdc++-v3/include/std/latch > +++ b/libstdc++-v3/include/std/latch > @@ -74,8 +74,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > _GLIBCXX_ALWAYS_INLINE void > wait() const noexcept > { > - auto const __pred = [this] { return this->try_wait(); }; > - std::__atomic_wait_address(&_M_a, __pred); > + auto const __vfn = [this] { return this->try_wait(); }; > + auto const __pred = [this](bool __b) { return __b; }; > This second lambda doesn't need to capture 'this'. > + std::__atomic_wait_address(&_M_a, __pred, __vfn); > } > > _GLIBCXX_ALWAYS_INLINE void > >
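To be explicit about that last comment, the predicate in std::latch::wait() only needs its bool argument, so (sketch):

      auto const __vfn = [this] { return this->try_wait(); };
      auto const __pred = [](bool __b) { return __b; };
      std::__atomic_wait_address(&_M_a, __pred, __vfn);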