This is the right patch. The previous one is missing noexcept. Sorry.

On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote:
> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
>
> Changes from the last patch include:
> - Add a TSAN macro to bits/c++config.
> - Use separate constexpr bool-s for the conditions for lock-freedom,
>   double-width and alignment.
> - Move the code in the optimized path to a separate function
>   _M_release_double_width_cas.
>
> Thanks,
> Maged
>
> On Fri, Jul 16, 2021 at 11:55 AM Maged Michael wrote:
>> Thank you, Jonathan, for the detailed comments! I'll update the patch
>> accordingly.
>>
>> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely wrote:
>>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>>> >
>>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>>> > the two atomic instructions that decrement each of the use count and
>>> > the weak count when both are 1. I proposed the general idea in an
>>> > earlier thread
>>> > (https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
>>> > and got useful feedback on a draft patch and responses to related
>>> > questions about multi-granular atomicity and alignment. This patch is
>>> > based on that feedback.
>>> >
>>> > I added a check for thread sanitizer to use the current algorithm in
>>> > that case because TSAN does not support multi-granular atomicity. I'd
>>> > like to add a check of __has_feature(thread_sanitizer) for building
>>> > using LLVM. I found examples of __has_feature in libstdc++
>>>
>>> There are no uses of __has_feature in libstdc++. We do use
>>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>>> (which GCC doesn't support) to work around some weird semantics of
>>> __has_builtin in older versions of Clang.
>>>
>>> > but it doesn't seem to be recognized in shared_ptr_base.h. Any
>>> > guidance on how to check __has_feature(thread_sanitizer) in this
>>> > patch?
>>>
>>> I think we want to do something like this in include/bits/c++config
>>>
>>> #if __SANITIZE_THREAD__
>>> # define _GLIBCXX_TSAN 1
>>> #elif defined __has_feature
>>> # if __has_feature(thread_sanitizer)
>>> #  define _GLIBCXX_TSAN 1
>>> # endif
>>> #endif
>>>
>>> Then in bits/shared_ptr_base.h
>>>
>>> #if _GLIBCXX_TSAN
>>>       _M_release_orig();
>>>       return;
>>> #endif
>>>
>>> > GCC generates code for _M_release that is larger and more complex
>>> > than that generated by LLVM. I'd like to file a bug report about
>>> > that. Jonathan,
>>>
>>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>>
>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
>> generated less code than gcc. I see in the response to the issue that
>> the new glibc is expected to optimize better. So maybe this will
>> eliminate the issue.
>>
>>> > would you please create a bugzilla account for me
>>> > (https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>>
>>> Done (sorry, I didn't notice the request in this mail until coming
>>> back to it to review the patch properly).
>>>
>> Thank you!
>>
>>> >
>>> > Information about the patch:
>>> >
>>> > - Benefits of the patch: Save the cost of the last atomic decrements
>>> >   of each of the use count and the weak count in _Sp_counted_base.
>>> >   Atomic instructions are significantly slower than regular loads
>>> >   and stores across major architectures.
>>> >
>>> > - How current code works: _M_release() atomically decrements the use
>>> >   count, checks if it was 1, if so calls _M_dispose(), atomically
>>> >   decrements the weak count, checks if it was 1, and if so calls
>>> >   _M_destroy().
>>> >
>>> > - How the proposed patch works: _M_release() loads both use count and
>>> >   weak count together atomically (when properly aligned), checks if
>>> >   the value is equal to the value with both counts equal to 1 (e.g.,
>>> >   0x100000001), and if so calls _M_dispose() and _M_destroy().
>>> >   Otherwise, it follows the original algorithm.
>>> >
>>> > - Why it works: When the current thread executing _M_release() finds
>>> >   each of the counts is equal to 1, then (when _Lock_policy is
>>> >   _S_atomic) no other threads could possibly hold use or weak
>>> >   references to this control block. That is, no other threads could
>>> >   possibly access the counts or the protected object.
>>> >
>>> > - The proposed patch is intended to interact correctly with current
>>> >   code (under certain conditions: _Lock_policy is _S_atomic, proper
>>> >   alignment, and native lock-free support for atomic operations).
>>> >   That is, multiple threads using different versions of the code with
>>> >   and without the patch operating on the same objects should always
>>> >   interact correctly. The intent for the patch is to be ABI
>>> >   compatible with the current implementation.
>>> >
>>> > - The proposed patch involves a performance trade-off between saving
>>> >   the costs of two atomic instructions when the counts are both 1 vs
>>> >   adding the cost of loading the combined counts and comparing them
>>> >   with two ones (e.g., 0x100000001).
>>> >
>>> > - The patch has been in use (built using LLVM) in a large environment
>>> >   for many months. The performance gains outweigh the losses (roughly
>>> >   10 to 1) across a large variety of workloads.
>>> >
>>> > I'd appreciate feedback on the patch and any suggestions for checking
>>> > __has_feature(thread_sanitizer).
>>>
>>> N.B. gmail completely mangles patches unless you send them as
>>> attachments.
>>>
>>> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
>>> > b/libstdc++-v3/include/bits/shared_ptr_base.h
>>> > index 368b2d7379a..a8fc944af5f 100644
>>> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>>> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>>> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>> >           if (!_M_add_ref_lock_nothrow())
>>> >             __throw_bad_weak_ptr();
>>> >         }
>>> >
>>> >       bool
>>> >       _M_add_ref_lock_nothrow() noexcept;
>>> >
>>> >       void
>>> >       _M_release() noexcept
>>> >       {
>>> > +#if __SANITIZE_THREAD__
>>> > +        _M_release_orig();
>>> > +        return;
>>> > +#endif
>>> > +        if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>>>
>>> The line break should come before the logical operator, not after.
>>> This makes it easier to see which operator it is, because it's at a
>>> predictable position, not off on the right edge somewhere.
>>>
>>> i.e.
>>>
>>>         if (!__atomic_always_lock_free(sizeof(long long), 0)
>>>             || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>>>             || sizeof(long long) < (2 * sizeof(_Atomic_word))
>>>             || sizeof(long long) > (sizeof(void*)))
>>>
>>> But I think I'd like to see this condition expressed differently
>>> anyway, see below.
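The check at the heart of the proposed fast path can be sketched in isolation. This is a simplified illustration, not the actual libstdc++ code: the counted_base_model type, its two 32-bit counters following the vptr, and the use of the GCC __atomic builtins are assumptions made for the example.

    #include <cstdint>
    #include <cstdio>

    struct counted_base_model
    {
      virtual ~counted_base_model() = default;  // vptr precedes the counters,
      std::int32_t use_count  = 1;              // so they are pointer-aligned
      std::int32_t weak_count = 1;              // on a typical 64-bit target.
    };

    inline bool both_counts_are_one(counted_base_model* p)
    {
      // With both 32-bit counters equal to 1, the combined 64-bit value is
      // 1 + (1LL << 32) == 0x100000001 regardless of byte order, because
      // both halves hold the same value.
      constexpr std::int64_t both_one = 1LL + (1LL << 32);
      return __atomic_load_n(reinterpret_cast<std::int64_t*>(&p->use_count),
                             __ATOMIC_ACQUIRE) == both_one;
    }

    int main()
    {
      counted_base_model c;
      std::printf("%d\n", both_counts_are_one(&c)); // prints 1
    }

The posted patch performs the same kind of double-width load on _M_use_count and falls back to the original algorithm whenever the loaded value is anything other than the both-counts-equal-1 constant.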
>>>
>>> > +        !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>>> > +        sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>>>
>>> Shouldn't this be != instead of < ?
>>>
>>> On a big endian target where sizeof(long long) > sizeof(_Atomic_word)
>>> loading two _Atomic_word objects will fill the high bits of the long
>>> long, and so the (1LL + (1LL << (8 * 4))) calculation won't match what
>>> got loaded into the long long.
>>>
>>> > +        sizeof(long long) > (sizeof(void*)))
>>>
>>> This is checking the alignment, right? I think we can do so more
>>> reliably, and should comment it.
>>>
>>> I think I'd like to see this condition defined as a number of
>>> constexpr booleans, with comments. Maybe:
>>>
>>>     constexpr bool __lock_free
>>>       = __atomic_always_lock_free(sizeof(long long), 0)
>>>         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>>>     constexpr bool __double_word
>>>       = sizeof(long long) == 2 * sizeof(_Atomic_word);
>>>     // The ref-count members follow the vptr, so are aligned to
>>>     // alignof(void*).
>>>     constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>>
>>>     if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>>>       {
>>>         _M_release_double_width_cas();
>>>         return;
>>>       }
>>>     else
>>>       {
>>>         // ... original body of _M_release();
>>>       }
>>>
>>> > +          {
>>> > +            _M_release_orig();
>>> > +            return;
>>> > +          }
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>>> > +        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
>>> > +            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>>>
>>> This should use __CHAR_BIT__ instead of 8.
>>>
>>> > +          {
>>> > +            // Both counts are 1, so there are no weak references and
>>> > +            // we are releasing the last strong reference. No other
>>> > +            // threads can observe the effects of this _M_release()
>>> > +            // call (e.g. calling use_count()) without a data race.
>>> > +            *(long long*)(&_M_use_count) = 0;
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>>> > +            _M_dispose();
>>> > +            _M_destroy();
>>> > +          }
>>> > +        else
>>> > +          {
>>> > +            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
>>> > +              {
>>> > +                _M_release_last_use();
>>> > +              }
>>> > +          }
>>> > +      }
>>> > +
>>> > +      void
>>> > +      __attribute__ ((noinline))
>>> > +      _M_release_last_use() noexcept
>>> > +      {
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> > +        _M_dispose();
>>> > +        if (_Mutex_base<_Lp>::_S_need_barriers)
>>> > +          {
>>> > +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>>> > +          }
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>>> > +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>>> > +                                                   -1) == 1)
>>> > +          {
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>>> > +            _M_destroy();
>>> > +          }
>>> > +      }
>>> > +
>>> > +      void
>>> > +      _M_release_orig() noexcept
>>> > +      {
>>> >         // Be race-detector-friendly. For more info see bits/c++config.
>>> >         _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>>> >         if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
>>> >           {
>>> >             _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> >             _M_dispose();
>>> >             // There must be a memory barrier between dispose() and destroy()
>>> >             // to ensure that the effects of dispose() are observed in the
>>> >             // thread that runs destroy().
>>> >             // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
>>> >
>>> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>> >     _Sp_counted_base<_S_single>::_M_release() noexcept
>>> >     {
>>> >       if (--_M_use_count == 0)
>>> >         {
>>> >           _M_dispose();
>>> >           if (--_M_weak_count == 0)
>>> >             _M_destroy();
>>> >         }
>>> >     }
>>> >
>>> > +  template<>
>>> > +    inline void
>>> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
>>> > +    {
>>> > +      _M_release_orig();
>>> > +    }
>>> > +
>>> >   template<>
>>> >     inline void
>>> >     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
>>> >     { ++_M_weak_count; }
>>> >
>>> >   template<>
>>> >     inline void
>>> >     _Sp_counted_base<_S_single>::_M_weak_release() noexcept
>>> >     {
>>> >       if (--_M_weak_count == 0)
>>
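Stepping back from the diff, a usage-level sketch of the case the patch targets. The comments summarize the behaviour described in this thread, not a guaranteed implementation detail.

    #include <memory>

    int main()
    {
      std::shared_ptr<int> p = std::make_shared<int>(42);
      // Here the use count is 1 and, with no weak_ptr observers, the weak
      // count is also 1 (it represents the set of strong references).
      p.reset();
      // Unpatched _M_release(): two atomic RMW decrements, one per count,
      // each checked against 1.
      // Patched _M_release(): a single double-width atomic load observes
      // both counts equal to 1, then _M_dispose() and _M_destroy() run
      // without further atomic RMW operations.
    }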