Date: Wed, 3 Mar 2021 17:31:23 +0000
From: Jonathan Wakely
To: Thomas Rodgers
Cc: gcc-patches@gcc.gnu.org, libstdc++@gcc.gnu.org, trodgers@redhat.com, Thomas Rodgers
Subject: Re: [PATCH] [libstdc++] Refactor/cleanup of atomic wait implementation
Message-ID: <20210303173123.GV3008@redhat.com>
References: <20210222215307.49424-1-rodgert@appliantology.com> <20210223215722.140761-1-rodgert@appliantology.com>
In-Reply-To: <20210223215722.140761-1-rodgert@appliantology.com>
On 23/02/21 13:57 -0800, Thomas Rodgers wrote:
>From: Thomas Rodgers
>
>* This revises the previous version to fix std::__condvar::wait_until() usage.
>
>This is a substantial rewrite of the atomic wait/notify (and timed wait
>counterparts) implementation.
>
>The previous __platform_wait looped on EINTR however this behavior is
>not required by the standard. A new _GLIBCXX_HAVE_PLATFORM_WAIT macro
>now controls whether wait/notify are implemented using a platform
>specific primitive or with a platform agnostic mutex/condvar. This
>patch only supplies a definition for linux futexes. A future update
>could add support __ulock_wait/wake on Darwin, for instance.
>
>The members of __waiters were lifted to a new base class. The members
>are now arranged such that overall sizeof(__waiters_base) fits in two
>cache lines (on platforms with at least 64 byte cache lines). The
>definition will also use destructive_interference_size for this if it
>is available.
>
>The __waiters type is now specific to untimed waits. Timed waits have a
>corresponding __timed_waiters type. Much of the code has been moved from
>the previous __atomic_wait() free function to the __waiter_base template
>and a __waiter derived type is provided to implement the un-timed wait
>operations. A similar change has been made to the timed wait
>implementation.
>
>The __atomic_spin code has been extended to take a spin policy which is
>invoked after the initial busy wait loop. The default policy is to
>return from the spin. The timed wait code adds a timed backoff spinning
>policy.
>The code from <thread> which implements this_thread::sleep_for,
>sleep_until has been moved to a new header <bits/std_thread_sleep.h>
>which allows the thread sleep code to be consumed without pulling in the
>whole of <thread>.
>
>The entry points into the wait/notify code have been restructured to
>support either -
> * Testing the current value of the atomic stored at the given address
>   and waiting on a notification.
> * Applying a predicate to determine if the wait was satisfied.
>The entry points were renamed to make it clear that the wait and wake
>operations operate on addresses. The first variant takes the expected
>value and a function which returns the current value that should be used
>in comparison operations, these operations are named with a _v suffix
>(e.g. 'value'). All atomic<_Tp> wait/notify operations use the first
>variant. Barriers, latches and semaphores use the predicate variant.
>
>This change also centralizes what it means to compare values for the
>purposes of atomic::wait rather than scattering through individual
>predicates.
>
>This change also centralizes the repetitive code which adjusts for
>different user supplied clocks (this should be moved elsewhere
>and all such adjustments should use a common implementation).
>
>libstdc++-v3/ChangeLog:
>	* include/Makefile.am: Add new header.
>	* include/Makefile.in: Regenerate.
>	* include/bits/atomic_base.h: Adjust all calls
>	to __atomic_wait/__atomic_notify for new call signatures.
>	* include/bits/atomic_wait.h: Extensive rewrite.
>	* include/bits/atomic_timed_wait.h: Likewise.
>	* include/bits/semaphore_base.h: Adjust all calls
>	to __atomic_wait/__atomic_notify for new call signatures.
>	* include/bits/std_thread_sleep.h: New file.
>	* include/std/atomic: Likewise.
>	* include/std/barrier: Likewise.
>	* include/std/latch: Likewise.
>	* testsuite/29_atomics/atomic/wait_notify/bool.cc: Simplify
>	test.
>	* testsuite/29_atomics/atomic/wait_notify/generic.cc: Likewise.
>	* testsuite/29_atomics/atomic/wait_notify/pointers.cc: Likewise.
>	* testsuite/29_atomics/atomic_flag/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_float/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_integral/wait_notify.cc: Likewise.
>	* testsuite/29_atomics/atomic_ref/wait_notify.cc: Likewise.

Some of this diff is very confusing, where the context being shown as
removed is actually a completely different function. Please try
--diff-algorithm=histogram for the next version of this patch. It might
make it easier to read.

>+    struct __timed_backoff_spin_policy
>+    {
>+      __wait_clock_t::time_point _M_deadline;
>+      __wait_clock_t::time_point _M_t0;
>+
>+      template<typename _Clock, typename _Dur>
>+	__timed_backoff_spin_policy(chrono::time_point<_Clock, _Dur>
>+				      __deadline = _Clock::time_point::max(),
>+				    chrono::time_point<_Clock, _Dur>
>+				      __t0 = _Clock::now()) noexcept
>+	: _M_deadline(__to_wait_clock(__deadline))
>+	, _M_t0(__to_wait_clock(__t0))

If this policy object is constructed with a time_point using the
steady_clock then it will still call __to_wait_clock to convert it to
the steady_clock, making multiple unnecessary (and expensive) calls to
steady_clock::now(). I think you either need to overload the
constructor or overload __to_wait_clock.

>+	{ }
>+
>+      bool
>+      operator()() noexcept

This can be const.
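For what it's worth, the __to_wait_clock overload route suggested above
could look something like this standalone sketch. The names
__wait_clock_t and __to_wait_clock mirror the patch, but the bodies here
are illustrative only, with a counter standing in for the expensive
now() calls:

```cpp
#include <chrono>

// Illustration only: in the patch __wait_clock_t is the clock used by
// the wait implementation; here it is just steady_clock.
using __wait_clock_t = std::chrono::steady_clock;

int conversions = 0; // counts the expensive cross-clock conversions

// Generic version: converting a foreign clock's time_point requires
// sampling both clocks' now(), which is what we want to avoid for
// time_points already on the wait clock.
template<typename _Clock, typename _Dur>
  __wait_clock_t::time_point
  __to_wait_clock(const std::chrono::time_point<_Clock, _Dur>& __t)
  {
    ++conversions;
    const auto __delta = __t - _Clock::now();
    return std::chrono::time_point_cast<__wait_clock_t::duration>(
	     __wait_clock_t::now() + __delta);
  }

// More specialized overload for time_points already on the wait clock:
// partial ordering prefers this one, so no conversion happens at all.
template<typename _Dur>
  __wait_clock_t::time_point
  __to_wait_clock(const std::chrono::time_point<__wait_clock_t, _Dur>& __t)
  { return std::chrono::time_point_cast<__wait_clock_t::duration>(__t); }
```

With that overload in place the policy constructor needs no change;
steady_clock arguments simply bypass the conversion.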
>	{
>-	  static_assert(sizeof(__timed_waiters) == sizeof(__waiters));
>-	  return static_cast<__timed_waiters&>(__waiters::_S_for(__t));
>+	  using namespace literals::chrono_literals;
>+	  auto __now = __wait_clock_t::now();
>+	  if (_M_deadline <= __now)
>+	    return false;
>+
>+	  auto __elapsed = __now - _M_t0;
>+	  if (__elapsed > 128ms)
>+	    {
>+	      this_thread::sleep_for(64ms);
>+	    }
>+	  else if (__elapsed > 64us)
>+	    {
>+	      this_thread::sleep_for(__elapsed / 2);
>+	    }
>+	  else if (__elapsed > 4us)
>+	    {
>+	      __thread_yield();
>+	    }
>+	  else
>+	    return false;
>	}
>      };
>-  } // namespace __detail

>+    template<typename _Tp, typename _ValFn,
>+	     typename _Rep, typename _Period>
>+      bool
>+      _M_do_wait_for_v(_Tp __old, _ValFn __vfn,
>+		       const chrono::duration<_Rep, _Period>& __rtime) noexcept
>+      {
>+	__platform_wait_t __val;
>+	if (_M_do_spin_v(__old, move(__vfn), __val))

This should be std::move (there's another case of this in the patch
too).

>+	  return true;
>+
>+	if (!__rtime.count())
>+	  return false; // no rtime supplied, and spin did not acquire
>+
>+	using __dur = chrono::steady_clock::duration;
>+	auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	if (__reltime < __rtime)
>+	  ++__reltime;

This is C++20 code so it can use chrono::ceil here instead of
duration_cast, then you don't need the increment.
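To spell out the chrono::ceil point with a standalone example (the
helper names here are made up for illustration, and a coarse
microsecond tick is used so the rounding is visible): duration_cast
truncates toward zero and would shorten the requested wait, while ceil
rounds up in one step.

```cpp
#include <chrono>

// Truncating conversion: 1500ns becomes 1us, 500ns too early, which is
// why the patch adds the manual "+1 tick" correction afterwards.
inline std::chrono::microseconds
truncate_us(std::chrono::nanoseconds rtime)
{ return std::chrono::duration_cast<std::chrono::microseconds>(rtime); }

// chrono::ceil (available since C++17) rounds up directly, so the
// comparison and increment are unnecessary.
inline std::chrono::microseconds
round_up_us(std::chrono::nanoseconds rtime)
{ return std::chrono::ceil<std::chrono::microseconds>(rtime); }
```

ceil never produces a deadline earlier than the requested one, which is
the property the hand-written increment was preserving.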
>+	return _M_w._M_do_wait_until(_M_addr, __val,
>+				     chrono::steady_clock::now() + __reltime);
>      }
>-      while (!__pred() && __atime < _Clock::now());
>-      __w._M_leave_wait();
>
>-      // if timed out, return false
>-      return (_Clock::now() < __atime);
>+    template<typename _Pred,
>+	     typename _Rep, typename _Period>
>+      bool
>+      _M_do_wait_for(_Pred __pred,
>+		     const chrono::duration<_Rep, _Period>& __rtime) noexcept
>+      {
>+	__platform_wait_t __val;
>+	if (_M_do_spin(__pred, __val))
>+	  return true;
>+
>+	if (!__rtime.count())
>+	  return false; // no rtime supplied, and spin did not acquire
>+
>+	using __dur = chrono::steady_clock::duration;
>+	auto __reltime = chrono::duration_cast<__dur>(__rtime);
>+	if (__reltime < __rtime)
>+	  ++__reltime;

chrono::ceil here too.

>+  template<typename _Tp>
>+    inline constexpr bool __platform_wait_uses_type
>+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      = is_same_v<remove_cv_t<_Tp>, __detail::__platform_wait_t>

This is_same check seems redundant, as the following will be true
anyway.

>+	|| ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>+	    && (alignof(_Tp*) == alignof(__detail::__platform_wait_t)));

This should be alignof(_Tp) not alignof(_Tp*) shouldn't it?

And alignof(_Tp) > alignof(__platform_wait_t) is OK too, so >= not ==.

We need the is_scalar check from Thiago's patch. We don't want to try
and use a futex for something like:

struct S { short s; char c; /* padding */ };

>  #else
>-    = false;
>+      = false;
>  #endif
>
>+  namespace __detail
>+  {
>  #ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>+#define _GLIBCXX_HAVE_PLATFORM_WAIT

Redefinition, as I pointed out in my earlier mail.

>+    struct __default_spin_policy
>+    {
>+      bool
>+      operator()() noexcept

This can be const.
>+      { return false; }
>+    };
>+
>+    template<typename _Pred,
>+	     typename _Spin = __default_spin_policy>
>+      bool
>+      __atomic_spin(_Pred& __pred, _Spin __spin = _Spin{ }) noexcept
>      {
>-	__platform_wait_t __res;
>-	__atomic_load(&_M_ver, &__res, __ATOMIC_ACQUIRE);
>-	__atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL);
>-	return __res;
>+	for (auto __i = 0; __i < __detail::__atomic_spin_count_1; ++__i)
>+	  {
>+	    if (__pred())
>+	      return true;
>+
>+	    if (__i < __detail::__atomic_spin_count_2)
>+	      __detail::__thread_relax();
>+	    else
>+	      __detail::__thread_yield();
>+	  }

I keep wondering (and not bothering to check) whether having two loops
(for counts of 12 and then 4) would make more sense than this branch in
each loop. It doesn't matter though.

>+	while (__spin())
>+	  {
>+	    if (__pred())
>+	      return true;
>+	  }
>+
>+	return false;
>      }
>
>-    void
>-    _M_leave_wait() noexcept
>+    template<typename _Tp>
>+      bool __atomic_compare(const _Tp& __a, const _Tp& __b)
>      {
>-	__atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL);
>+	// TODO make this do the correct padding bit ignoring comparison
>+	return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) != 0;
>      }
>
>-    void
>-    _M_do_wait(__platform_wait_t __version) noexcept
>-    {
>-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>-      __platform_wait(&_M_ver, __version);
>+#ifdef __cpp_lib_hardware_interference_size
>+    struct alignas(hardware_destructive_interference_size)
>  #else
>-      __platform_wait_t __cur = 0;
>-      while (__cur <= __version)
>-	{
>-	  __waiters::__lock_t __l(_M_mtx);
>-	  _M_cv.wait(_M_mtx);
>-	  __platform_wait_t __last = __cur;
>-	  __atomic_load(&_M_ver, &__cur, __ATOMIC_ACQUIRE);
>-	  if (__cur < __last)
>-	    break; // break the loop if version overflows
>-	}
>+    struct alignas(64)
>+#endif
>+    __waiters_base
>+    {
>+      __platform_wait_t _M_wait = 0;
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      mutex _M_mtx;
>  #endif
>-    }
>+
>+#ifdef __cpp_lib_hardware_interference_size
>+      alignas(hardware_destructive_interference_size)
>+#else
>+      alignas(64)
>+#endif

Please do this #ifdef dance once and define a constant
that can be used in both places, instead of repeating the #ifdef.

e.g.

  struct __waiters_base
  {
#ifdef __cpp_lib_hardware_interference_size
    static constexpr auto _S_align = hardware_destructive_interference_size;
#else
    static constexpr auto _S_align = 64;
#endif

    alignas(_S_align) __platform_wait_t _M_wait = 0;

#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
    mutex _M_mtx;
#endif

    alignas(_S_align) __platform_wait_t _M_ver = 0;

#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
    __condvar _M_cond;
#endif

    __waiters_base() = default;
    // ...

>+      __platform_wait_t _M_ver = 0;
>+
>+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
>+      __condvar _M_cv;
>+
>+      __waiters_base() noexcept = default;

Should this be outside the #ifdef block? I think the noexcept is
redundant, but harmless.

>+#endif
>+
>+      void
>+      _M_enter_wait() noexcept
>+      { __atomic_fetch_add(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>+
>+      void
>+      _M_leave_wait() noexcept
>+      { __atomic_fetch_sub(&_M_wait, 1, __ATOMIC_ACQ_REL); }
>
>      bool
>      _M_waiting() const noexcept
>      {
>	__platform_wait_t __res;
>	__atomic_load(&_M_wait, &__res, __ATOMIC_ACQUIRE);
>-	return __res;
>+	return __res > 0;
>      }
>
>      void
>-      _M_notify(bool __all) noexcept
>+      _M_notify(const __platform_wait_t* __addr, bool __all) noexcept
>      {
>-	__atomic_fetch_add(&_M_ver, 1, __ATOMIC_ACQ_REL);
>+	if (!_M_waiting())
>+	  return;
>+
>  #ifdef _GLIBCXX_HAVE_LINUX_FUTEX

Should this check HAVE_PLATFORM_WAIT instead?

>-	__platform_notify(&_M_ver, __all);
>+	__platform_notify(__addr, __all);
>  #else
>	if (__all)
>	  _M_cv.notify_all();

>+    struct __waiter : __waiter_base<__waiters>
>    {
>-      using namespace __detail;
>-      if (std::__atomic_spin(__pred))
>-	return;
>+      template<typename _Tp>
>+	__waiter(const _Tp* __addr, bool __waiting = true) noexcept

Make this constructor explicit please.
>+	: __waiter_base(__addr, __waiting)
>+	{ }
>
>diff --git a/libstdc++-v3/include/bits/semaphore_base.h b/libstdc++-v3/include/bits/semaphore_base.h
>index b65717e64d7..95d5414ff80 100644
>--- a/libstdc++-v3/include/bits/semaphore_base.h
>+++ b/libstdc++-v3/include/bits/semaphore_base.h
>@@ -181,40 +181,32 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>      __atomic_semaphore(const __atomic_semaphore&) = delete;
>      __atomic_semaphore& operator=(const __atomic_semaphore&) = delete;
>
>+    static _GLIBCXX_ALWAYS_INLINE bool
>+    _S_do_try_acquire(_Tp* __counter) noexcept
>+    {
>+      auto __old = __atomic_impl::load(__counter, memory_order::acquire);
>+
>+      if (__old == 0)
>+	return false;
>+
>+      return __atomic_impl::compare_exchange_strong(__counter,
>+						    __old, __old - 1,
>+						    memory_order::acquire,
>+						    memory_order::release);
>+    }

If we keep calling this in a loop it means that we reload the value
every time using atomic_load, despite the compare_exchange telling us
that value. Can't we reuse that value returned from the CAS?
If the caller provides it by reference:

  static _GLIBCXX_ALWAYS_INLINE bool
  _S_do_try_acquire(_Tp* __counter, _Tp& __old) noexcept
  {
    if (__old == 0)
      return false;

    return __atomic_impl::compare_exchange_strong(__counter,
						  __old, __old - 1,
						  memory_order::acquire,
						  memory_order::release);
  }

>+
>      _GLIBCXX_ALWAYS_INLINE void
>      _M_acquire() noexcept
>      {
>-	auto const __pred = [this]
>-	  {
>-	    auto __old = __atomic_impl::load(&this->_M_counter,
>-					     memory_order::acquire);
>-	    if (__old == 0)
>-	      return false;
>-	    return __atomic_impl::compare_exchange_strong(&this->_M_counter,
>-							  __old, __old - 1,
>-							  memory_order::acquire,
>-							  memory_order::release);
>-	  };
>-	auto __old = __atomic_impl::load(&_M_counter, memory_order_relaxed);
>-	std::__atomic_wait(&_M_counter, __old, __pred);
>+	auto const __pred = [this] { return _S_do_try_acquire(&this->_M_counter); };

Then the predicate can maintain that state:

  auto __old = __atomic_impl::load(&_M_counter, memory_order::acquire);
  auto const __pred = [this, __old] () mutable
    { return _S_do_try_acquire(&this->_M_counter, __old); };

Or is reloading it every time needed, because we do a yield/relax/spin
after the CAS and so the value it returns might be stale before the
next CAS?

>+	std::__atomic_wait_address(&_M_counter, __pred);
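The behaviour this relies on can be shown with plain std::atomic, for
what it's worth. This is a standalone sketch, not the libstdc++
internals (try_acquire is a made-up name, and a failure order of
release is not permitted for std::atomic's compare_exchange_strong, so
relaxed is used here): a failed CAS writes the freshly observed value
back into the expected argument, which is exactly the state the mutable
lambda carries between attempts.

```cpp
#include <atomic>

// Sketch of the by-reference try-acquire: __old caches the last
// observed counter value; no separate load per attempt.
template<typename _Tp>
  bool
  try_acquire(std::atomic<_Tp>* __counter, _Tp& __old)
  {
    if (__old == 0)
      return false; // counter exhausted as far as we know; caller waits

    // On failure, compare_exchange_strong stores the current counter
    // value into __old, so the next attempt starts from fresh state.
    return __counter->compare_exchange_strong(__old, __old - 1,
					      std::memory_order_acquire,
					      std::memory_order_relaxed);
  }
```

A mutable lambda can then cache the value across calls, refreshing it
only when a CAS fails, e.g.
auto pred = [&c, old]() mutable { return try_acquire(&c, old); };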