Date: Wed, 3 Mar 2021 17:31:23 +0000
From: Jonathan Wakely
To: Thomas Rodgers
Cc: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org, trodgers@redhat.com, Thomas Rodgers
Subject: Re: [PATCH] [libstdc++] Refactor/cleanup of atomic wait implementation
Message-ID: <20210303173123.GV3008@redhat.com>
References: <20210222215307.49424-1-rodgert@appliantology.com> <20210223215722.140761-1-rodgert@appliantology.com>
In-Reply-To: <20210223215722.140761-1-rodgert@appliantology.com>
On 23/02/21 13:57 -0800, Thomas Rodgers wrote:
>From: Thomas Rodgers
>
>* This revises the previous version to fix std::__condvar::wait_until() usage.
>
>This is a substantial rewrite of the atomic wait/notify (and timed wait
>counterparts) implementation.
>
>The previous __platform_wait looped on EINTR however this behavior is
>not required by the standard. A new _GLIBCXX_HAVE_PLATFORM_WAIT macro
>now controls whether wait/notify are implemented using a platform
>specific primitive or with a platform agnostic mutex/condvar. This
>patch only supplies a definition for linux futexes. A future update
>could add support __ulock_wait/wake on Darwin, for instance.
>
>The members of __waiters were lifted to a new base class. The members
>are now arranged such that overall sizeof(__waiters_base) fits in two
>cache lines (on platforms with at least 64 byte cache lines). The
>definition will also use destructive_interference_size for this if it
>is available.
>
>The __waiters type is now specific to untimed waits. Timed waits have a
>corresponding __timed_waiters type. Much of the code has been moved from
>the previous __atomic_wait() free function to the __waiter_base template
>and a __waiter derived type is provided to implement the un-timed wait
>operations. A similar change has been made to the timed wait
>implementation.
>
>The __atomic_spin code has been extended to take a spin policy which is
>invoked after the initial busy wait loop. The default policy is to
>return from the spin. The timed wait code adds a timed backoff spinning
>policy.
>The code from <thread> which implements this_thread::sleep_for,
>sleep_until has been moved to a new header <bits/std_thread_sleep.h>
>which allows the thread sleep code to be consumed without pulling in the
>whole of <thread>.
>
>The entry points into the wait/notify code have been restructured to
>support either -
> * Testing the current value of the atomic stored at the given address
>   and waiting on a notification.
> * Applying a predicate to determine if the wait was satisfied.
>The entry points were renamed to make it clear that the wait and wake
>operations operate on addresses. The first variant takes the expected
>value and a function which returns the current value that should be used
>in comparison operations, these operations are named with a _v suffix
>(e.g. 'value'). All atomic<_Tp> wait/notify operations use the first
>variant. Barriers, latches and semaphores use the predicate variant.
>
>This change also centralizes what it means to compare values for the
>purposes of atomic::wait rather than scattering through individual
>predicates.
>
>This change also centralizes the repetitive code which adjusts for
>different user supplied clocks (this should be moved elsewhere
>and all such adjustments should use a common implementation).
>
>libstdc++-v3/ChangeLog:
>	* include/Makefile.am: Add new header.
>	* include/Makefile.in: Regenerate.
>	* include/bits/atomic_base.h: Adjust all calls
>	to __atomic_wait/__atomic_notify for new call signatures.
>	* include/bits/atomic_wait.h: Extensive rewrite.
>	* include/bits/atomic_timed_wait.h: Likewise.
>	* include/bits/semaphore_base.h: Adjust all calls
>	to __atomic_wait/__atomic_notify for new call signatures.
>	* include/bits/std_thread_sleep.h: New file.
>	* include/std/atomic: Likewise.
>	* include/std/barrier: Likewise.
>	* include/std/latch: Likewise.
>	* testsuite/29_atomics/atomic/wait_notify/bool.cc: Simplify
>	test.
>	* testsuite/29_atomics/atomic/wait_notify/generic.cc: Likewise.
>	* testsuite/29_atomics/atomic/wait_notify/pointers.cc: Likewise.
>	* testsuite/29_atomics/atomic_flag/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_float/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_integral/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_ref/wait_notify.cc: Likewise.

Some of this diff is very confusing, where the context being shown as
removed is actually a completely different function. Please try
--diff-algorithm=histogram for the next version of this patch. It might
make it easier to read.

>+    struct __timed_backoff_spin_policy
>+    {
>+      __wait_clock_t::time_point _M_deadline;
>+      __wait_clock_t::time_point _M_t0;
>+
>+      template<typename _Clock, typename _Dur>
>+	__timed_backoff_spin_policy(chrono::time_point<_Clock, _Dur>
>+				      __deadline = _Clock::time_point::max(),
>+				    chrono::time_point<_Clock, _Dur>
>+				      __t0 = _Clock::now()) noexcept
>+	: _M_deadline(__to_wait_clock(__deadline))
>+	, _M_t0(__to_wait_clock(__t0))

If this policy object is constructed with a time_point using the
steady_clock then it will still call __to_wait_clock to convert it to
the steady_clock, making multiple unnecessary (and expensive) calls to
steady_clock::now(). I think you either need to overload the
constructor or overload __to_wait_clock.

>+	{ }
>+
>+      bool
>+      operator()() noexcept

This can be const.
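For what it's worth, the __to_wait_clock overload route suggested above
could look something like this standalone sketch. The names
__wait_clock_t and __to_wait_clock mirror the patch, but the bodies here
are illustrative only, with a counter standing in for the expensive
now() calls:

```cpp
#include <chrono>

// Illustration only: in the patch __wait_clock_t is the clock used by
// the wait implementation; here it is just steady_clock.
using __wait_clock_t = std::chrono::steady_clock;

int conversions = 0; // counts the expensive cross-clock conversions

// Generic version: converting a foreign clock's time_point requires
// sampling both clocks' now(), which is what we want to avoid for
// time_points already on the wait clock.
template<typename _Clock, typename _Dur>
  __wait_clock_t::time_point
  __to_wait_clock(const std::chrono::time_point<_Clock, _Dur>& __t)
  {
    ++conversions;
    const auto __delta = __t - _Clock::now();
    return std::chrono::time_point_cast<__wait_clock_t::duration>(
	     __wait_clock_t::now() + __delta);
  }

// More specialized overload for time_points already on the wait clock:
// partial ordering prefers this one, so no conversion happens at all.
template<typename _Dur>
  __wait_clock_t::time_point
  __to_wait_clock(const std::chrono::time_point<__wait_clock_t, _Dur>& __t)
  { return std::chrono::time_point_cast<__wait_clock_t::duration>(__t); }
```

With that overload in place the policy constructor needs no change;
steady_clock arguments simply bypass the conversion.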
>	{
>-	  static_assert(sizeof(__timed_waiters) == sizeof(__waiters));
>-	  return static_cast<__timed_waiters&>(__waiters::_S_for(__t));
>+	  using namespace literals::chrono_literals;
>+	  auto __now = __wait_clock_t::now();
>+	  if (_M_deadline <= __now)
>+	    return false;
>+
>+	  auto __elapsed = __now - _M_t0;
>+	  if (__elapsed > 128ms)
>+	    {
>+	      this_thread::sleep_for(64ms);
>+	    }
>+	  else if (__elapsed > 64us)
>+	    {
>+	      this_thread::sleep_for(__elapsed / 2);
>+	    }
>+	  else if (__elapsed > 4us)
>+	    {
>+	      __thread_yield();
>+	    }
>+	  else
>+	    return false;
>	}
>      };
>-  } // namespace __detail

>+    template<typename _Tp, typename _ValFn,
>+	     typename _Rep, typename _Period>
>+      bool
>+      _M_do_wait_for_v(_Tp __old, _ValFn __vfn,
>+		       const chrono::duration<_Rep, _Period>& __rtime) noexcept
>+      {
>+	__platform_wait_t __val;
>+	if (_M_do_spin_v(__old, move(__vfn), __val))

This should be std::move (there's another case of this in the patch
too).

>+	  return true;
>+
>+	if (!__rtime.count())
>+	  return false; // no rtime supplied, and spin did not acquire
>+
>+	using __dur = chrono::steady_clock::duration;
>+	auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	if (__reltime < __rtime)
>+	  ++__reltime;

This is C++20 code so it can use chrono::ceil here instead of
duration_cast, then you don't need the increment.
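To spell out the chrono::ceil point with a standalone example (the
helper names here are made up for illustration, and a coarse
microsecond tick is used so the rounding is visible): duration_cast
truncates toward zero and would shorten the requested wait, while ceil
rounds up in one step.

```cpp
#include <chrono>

// Truncating conversion: 1500ns becomes 1us, 500ns too early, which is
// why the patch adds the manual "+1 tick" correction afterwards.
inline std::chrono::microseconds
truncate_us(std::chrono::nanoseconds rtime)
{ return std::chrono::duration_cast<std::chrono::microseconds>(rtime); }

// chrono::ceil (available since C++17) rounds up directly, so the
// comparison and increment are unnecessary.
inline std::chrono::microseconds
round_up_us(std::chrono::nanoseconds rtime)
{ return std::chrono::ceil<std::chrono::microseconds>(rtime); }
```

ceil never produces a deadline earlier than the requested one, which is
the property the hand-written increment was preserving.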
>+	return _M_w._M_do_wait_until(_M_addr, __val,
>+				     chrono::steady_clock::now() + __reltime);
>      }
>-      while (!__pred() && __atime < _Clock::now());
>-      __w._M_leave_wait();
>
>-      // if timed out, return false
>-      return (_Clock::now() < __atime);
>+    template<typename _Pred,
>+	     typename _Rep, typename _Period>
>+      bool
>+      _M_do_wait_for(_Pred __pred,
>+		     const chrono::duration<_Rep, _Period>& __rtime) noexcept
>+      {
>+	__platform_wait_t __val;
>+	if (_M_do_spin(__pred, __val))
>+	  return true;
>+
>+	if (!__rtime.count())
>+	  return false; // no rtime supplied, and spin did not acquire
>+
>+	using __dur = chrono::steady_clock::duration;
>+	auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	if (__reltime < __rtime)
>+	  ++__reltime;

chrono::ceil here too.

>+  template<typename _Tp>
>+    inline constexpr bool __platform_wait_uses_type
>+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>

This is_same check seems redundant, as the following will be true
anyway.

>+	|| ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>+	    && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));

This should be alignof(_Tp) not alignof(_Tp*) shouldn't it?

And alignof(_Tp) > alignof(__platform_wait_t) is OK too, so >= not ==.

We need the is_scalar check from Thiago's patch. We don't want to try
and use a futex for something like:

struct S { short s; char c; /* padding */ };

>  #else
>-    = false;
>+      = false;
>  #endif
>
>+  namespace __detail
>+  {
>  #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>+#define _GLIBCXX_HAVE_PLATFORM_WAIT

Redefinition, as I pointed out in my earlier mail.

>+    struct __default_spin_policy
>+    {
>+      bool
>+      operator()() noexcept

This can be const.
>+      { return false; }
>+    };
>+
>+    template<typename _Pred,
>+	     typename _Spin = __default_spin_policy>
>+      bool
>+      __atomic_spin(_Pred& __pred, _Spin __spin = _Spin{ }) noexcept
>      {
>-	__platform_wait_t __res;
>-	__atomic_load(&_M_ver, &__res, __ATOMIC_ACQUIRE);
>-	__atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL);
>-	return __res;
>+	for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
>+	  {
>+	    if (__pred())
>+	      return true;
>+
>+	    if (__i < __detail::__atomic_spin_count_2)
>+	      __detail::__thread_relax();
>+	    else
>+	      __detail::__thread_yield();
>+	  }

I keep wondering (and not bothering to check) whether having two loops
(for counts of 12 and then 4) would make more sense than this branch in
each loop. It doesn't matter though.

>+	while (__spin())
>+	  {
>+	    if (__pred())
>+	      return true;
>+	  }
>+
>+	return false;
>      }
>
>-    void
>-    _M_leave_wait() noexcept
>+    template<typename _Tp>
>+      bool __atomic_compare(const _Tp& __a, const _Tp& __b)
>      {
>-	__atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL);
>+	// TODO make this do the correct padding bit ignoring comparison
>+	return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) != 0;
>      }
>
>-    void
>-    _M_do_wait(__platform_wait_t __version) noexcept
>-    {
>-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>-      __platform_wait(&_M_ver, __version);
>+#ifdef __cpp_lib_hardware_interference_size
>+    struct alignas(hardware_destructive_interference_size)
>  #else
>-      __platform_wait_t __cur = 0;
>-      while (__cur <= __version)
>-	{
>-	  __waiters::__lock_t __l(_M_mtx);
>-	  _M_cv.wait(_M_mtx);
>-	  __platform_wait_t __last = __cur;
>-	  __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
>-	  if (__cur < __last)
>-	    break; // break the loop if version overflows
>-	}
>+    struct alignas(64)
>+#endif
>+    __waiters_base
>+    {
>+      __platform_wait_t _M_wait = 0;
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      mutex _M_mtx;
>  #endif
>-    }
>+
>+#ifdef __cpp_lib_hardware_interference_size
>+      alignas(hardware_destructive_interference_size)
>+#else
>+      alignas(64)
>+#endif

Please do this #ifdef dance once and define a constant
that can be used in both places, instead of repeating the #ifdef.

e.g.

  struct __waiters_base
  {
#ifdef __cpp_lib_hardware_interference_size
    static constexpr auto _S_align = hardware_destructive_interference_size;
#else
    static constexpr auto _S_align = 64;
#endif

    alignas(_S_align) __platform_wait_t _M_wait = 0;

#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
    mutex _M_mtx;
#endif

    alignas(_S_align) __platform_wait_t _M_ver = 0;

#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
    __condvar _M_cond;
#endif

    __waiters_base() = default;
    // ...

>+      __platform_wait_t _M_ver = 0;
>+
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      __condvar _M_cv;
>+
>+      __waiters_base() noexcept = default;

Should this be outside the #ifdef block? I think the noexcept is
redundant, but harmless.

>+#endif
>+
>+      void
>+      _M_enter_wait() noexcept
>+      { __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>+
>+      void
>+      _M_leave_wait() noexcept
>+      { __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>
>      bool
>      _M_waiting() const noexcept
>      {
>	__platform_wait_t __res;
>	__atomic_load(&_M_wait, &__res, __ATOMIC_ACQUIRE);
>-	return __res;
>+	return __res > 0;
>      }
>
>      void
>-      _M_notify(bool __all) noexcept
>+      _M_notify(const __platform_wait_t* __addr, bool __all) noexcept
>      {
>-	__atomic_fetch_add(&_M_ver, 1, __ATOMIC_ACQ_REL);
>+	if (!_M_waiting())
>+	  return;
>+
>  #ifdef _GLIBCXX_HAVE_LINUX_FUTEX

Should this check HAVE_PLATFORM_WAIT instead?

>-	__platform_notify(&_M_ver, __all);
>+	__platform_notify(__addr, __all);
>  #else
>	if (__all)
>	  _M_cv.notify_all();

>+    struct __waiter : __waiter_base<__waiters>
>    {
>-      using namespace __detail;
>-      if (std::__atomic_spin(__pred))
>-	return;
>+      template<typename _Tp>
>+	__waiter(const _Tp* __addr, bool __waiting = true) noexcept

Make this constructor explicit please.
>+	: __waiter_base(__addr, __waiting)
>+	{ }
>
>diff --git a/libstdc++-v3/include/bits/semaphore_base.h b/libstdc++-v3/include/bits/semaphore_base.h
>index b65717e64d7..95d5414ff80 100644
>--- a/libstdc++-v3/include/bits/semaphore_base.h
>+++ b/libstdc++-v3/include/bits/semaphore_base.h
>@@ -181,40 +181,32 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>      __atomic_semaphore(const __atomic_semaphore&) = delete;
>      __atomic_semaphore& operator=(const __atomic_semaphore&) = delete;
>
>+    static _GLIBCXX_ALWAYS_INLINE bool
>+    _S_do_try_acquire(_Tp* __counter) noexcept
>+    {
>+      auto __old = __atomic_impl::load(__counter, memory_order::acquire);
>+
>+      if (__old == 0)
>+	return false;
>+
>+      return __atomic_impl::compare_exchange_strong(__counter,
>+						    __old, __old - 1,
>+						    memory_order::acquire,
>+						    memory_order::release);
>+    }

If we keep calling this in a loop it means that we reload the value
every time using atomic_load, despite the compare_exchange telling us
that value. Can't we reuse that value returned from the CAS?
If the caller provides it by reference:

  static _GLIBCXX_ALWAYS_INLINE bool
  _S_do_try_acquire(_Tp* __counter, _Tp& __old) noexcept
  {
    if (__old == 0)
      return false;

    return __atomic_impl::compare_exchange_strong(__counter,
						  __old, __old - 1,
						  memory_order::acquire,
						  memory_order::release);
  }

>+
>      _GLIBCXX_ALWAYS_INLINE void
>      _M_acquire() noexcept
>      {
>-	auto const __pred = [this]
>-	  {
>-	    auto __old = __atomic_impl::load(&this->_M_counter,
>-					     memory_order::acquire);
>-	    if (__old == 0)
>-	      return false;
>-	    return __atomic_impl::compare_exchange_strong(&this->_M_counter,
>-							  __old, __old - 1,
>-							  memory_order::acquire,
>-							  memory_order::release);
>-	  };
>-	auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
>-	std::__atomic_wait(&_M_counter, __old, __pred);
>+	auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };

Then the predicate can maintain that state:

  auto __old = __atomic_impl::load(&_M_counter, memory_order::acquire);
  auto const __pred = [this, __old] () mutable
    { return _S_do_try_acquire(&this->_M_counter, __old); };

Or is reloading it every time needed, because we do a yield/relax/spin
after the CAS and so the value it returns might be stale before the
next CAS?

>+	std::__atomic_wait_address(&_M_counter, __pred);
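The behaviour this relies on can be shown with plain std::atomic, for
what it's worth. This is a standalone sketch, not the libstdc++
internals (try_acquire is a made-up name, and a failure order of
release is not permitted for std::atomic's compare_exchange_strong, so
relaxed is used here): a failed CAS writes the freshly observed value
back into the expected argument, which is exactly the state the mutable
lambda carries between attempts.

```cpp
#include <atomic>

// Sketch of the by-reference try-acquire: __old caches the last
// observed counter value; no separate load per attempt.
template<typename _Tp>
  bool
  try_acquire(std::atomic<_Tp>* __counter, _Tp& __old)
  {
    if (__old == 0)
      return false; // counter exhausted as far as we know; caller waits

    // On failure, compare_exchange_strong stores the current counter
    // value into __old, so the next attempt starts from fresh state.
    return __counter->compare_exchange_strong(__old, __old - 1,
					      std::memory_order_acquire,
					      std::memory_order_relaxed);
  }
```

A mutable lambda can then cache the value across calls, refreshing it
only when a CAS fails, e.g.
auto pred = [&c, old]() mutable { return try_acquire(&c, old); };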