Please find attached an updated patch after incorporating Jonathan's
suggestions. Changes from the last patch include:

- Add a TSAN macro to bits/c++config.
- Use separate constexpr bools for the conditions for lock-freedom,
  double-width atomics and alignment.
- Move the code in the optimized path to a separate function,
  _M_release_double_width_cas.

Thanks,
Maged


On Fri, Jul 16, 2021 at 11:55 AM Maged Michael wrote:

> Thank you, Jonathan, for the detailed comments! I'll update the patch
> accordingly.
>
> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely wrote:
>
>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>> >
>> > Please find a proposed patch for _Sp_counted_base::_M_release to
>> > skip the two atomic instructions that decrement each of the use
>> > count and the weak count when both are 1. I proposed the general
>> > idea in an earlier thread
>> > (https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
>> > and got useful feedback on a draft patch and responses to related
>> > questions about multi-granular atomicity and alignment. This patch
>> > is based on that feedback.
>> >
>> > I added a check for thread sanitizer to use the current algorithm
>> > in that case because TSAN does not support multi-granular
>> > atomicity. I'd like to add a check of
>> > __has_feature(thread_sanitizer) for building using LLVM. I found
>> > examples of __has_feature in libstdc++
>>
>> There are no uses of __has_feature in libstdc++. We do use
>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>> (which GCC doesn't support) to work around some weird semantics of
>> __has_builtin in older versions of Clang.
>>
>> > but it doesn't seem to be recognized in shared_ptr_base.h. Any
>> > guidance on how to check __has_feature(thread_sanitizer) in this
>> > patch?
>>
>> I think we want to do something like this in include/bits/c++config:
>>
>>   #if __SANITIZE_THREAD__
>>   # define _GLIBCXX_TSAN 1
>>   #elif defined __has_feature
>>   # if __has_feature(thread_sanitizer)
>>   #  define _GLIBCXX_TSAN 1
>>   # endif
>>   #endif
>>
>> Then in bits/shared_ptr_base.h:
>>
>>   #if _GLIBCXX_TSAN
>>   _M_release_orig();
>>   return;
>>   #endif
>>
>> > GCC generates code for _M_release that is larger and more complex
>> > than that generated by LLVM. I'd like to file a bug report about
>> > that. Jonathan,
>>
>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>
> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
> generated less code than gcc. I see in the response to the issue that
> the new glibc is expected to optimize better. So maybe this will
> eliminate the issue.
>
>> > would you please create a bugzilla account for me
>> > (https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>
>> Done (sorry, I didn't notice the request in this mail until coming
>> back to it to review the patch properly).
>
> Thank you!
>
>> > Information about the patch:
>> >
>> > - Benefits of the patch: Save the cost of the last atomic decrement
>> >   of each of the use count and the weak count in _Sp_counted_base.
>> >   Atomic instructions are significantly slower than regular loads
>> >   and stores across major architectures.
>> >
>> > - How current code works: _M_release() atomically decrements the
>> >   use count, checks if it was 1, if so calls _M_dispose(),
>> >   atomically decrements the weak count, checks if it was 1, and if
>> >   so calls _M_destroy().
>> >
>> > - How the proposed patch works: _M_release() loads both the use
>> >   count and the weak count together atomically (when properly
>> >   aligned), checks whether the loaded value corresponds to both
>> >   counts being equal to 1 (e.g., 0x100000001), and if so calls
>> >   _M_dispose() and _M_destroy(). Otherwise, it follows the original
>> >   algorithm.
>> >
>> > - Why it works: When the current thread executing _M_release() finds
>> >   each of the counts is equal to 1, then (when _Lock_policy is
>> >   _S_atomic) no other threads could possibly hold use or weak
>> >   references to this control block. That is, no other threads could
>> >   possibly access the counts or the protected object.
>> >
>> > - The proposed patch is intended to interact correctly with current
>> >   code (under certain conditions: _Lock_policy is _S_atomic, proper
>> >   alignment, and native lock-free support for atomic operations).
>> >   That is, multiple threads using different versions of the code,
>> >   with and without the patch, operating on the same objects should
>> >   always interact correctly. The intent is for the patch to be ABI
>> >   compatible with the current implementation.
>> >
>> > - The proposed patch involves a performance trade-off between saving
>> >   the cost of two atomic instructions when the counts are both 1 vs.
>> >   adding the cost of loading the combined counts and comparing them
>> >   with two ones (e.g., 0x100000001).
>> >
>> > - The patch has been in use (built using LLVM) in a large
>> >   environment for many months. The performance gains outweigh the
>> >   losses (roughly 10 to 1) across a large variety of workloads.
>> >
>> > I'd appreciate feedback on the patch and any suggestions for
>> > checking __has_feature(thread_sanitizer).
>>
>> N.B. gmail completely mangles patches unless you send them as
>> attachments.
>>
>> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
>> > index 368b2d7379a..a8fc944af5f 100644
>> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >          if (!_M_add_ref_lock_nothrow())
>> >            __throw_bad_weak_ptr();
>> >        }
>> >
>> >        bool
>> >        _M_add_ref_lock_nothrow() noexcept;
>> >
>> >        void
>> >        _M_release() noexcept
>> >        {
>> > +#if __SANITIZE_THREAD__
>> > +        _M_release_orig();
>> > +        return;
>> > +#endif
>> > +        if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>>
>> The line break should come before the logical operator, not after.
>> This makes it easier to see which operator it is, because it's at a
>> predictable position, not off on the right edge somewhere. i.e.
>>
>>   if (!__atomic_always_lock_free(sizeof(long long), 0)
>>       || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>>       || sizeof(long long) < (2 * sizeof(_Atomic_word))
>>       || sizeof(long long) > (sizeof(void*)))
>>
>> But I think I'd like to see this condition expressed differently
>> anyway, see below.
>>
>> > +            !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>> > +            sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>>
>> Shouldn't this be != instead of < ?
>>
>> On a big endian target where sizeof(long long) > sizeof(_Atomic_word)
>> loading two _Atomic_word objects will fill the high bits of the long
>> long, and so the (1LL + (1LL << (8 * 4))) calculation won't match what
>> got loaded into the long long.
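To see concretely why the width condition matters for the combined-count
comparison, here is a small standalone snippet. It is not part of the
patch: the names `counts` and `combined` are invented for illustration,
and plain `int` stands in for _Atomic_word. It shows that two adjacent
counters that are both 1 read back as 1LL + (1LL << 32) on either
endianness only because the wide type is exactly twice the counter width:

    #include <cassert>
    #include <cstring>

    int main()
    {
      // Two adjacent counters, both 1, standing in for _M_use_count and
      // _M_weak_count (plain ints stand in for _Atomic_word).
      int counts[2] = {1, 1};
      long long combined = 0;

      // The sketch assumes the double-width type is *exactly* two
      // counters wide, which is what the suggested __double_word
      // constant enforces.
      static_assert(sizeof(combined) == 2 * sizeof(counts[0]),
                    "double-width assumption");
      std::memcpy(&combined, counts, sizeof(combined));

      // With the exact-width assumption this holds on both little- and
      // big-endian targets. If long long were wider than two counters,
      // a big-endian layout would leave the counters in the high-order
      // bytes and the comparison would fail -- which is why '<' in the
      // posted condition should be '=='.
      assert(combined == (1LL + (1LL << (8 * sizeof(counts[0])))));
    }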
>> > +            sizeof(long long) > (sizeof(void*)))
>>
>> This is checking the alignment, right? I think we can do so more
>> reliably, and should comment it.
>>
>> I think I'd like to see this condition defined as a number of
>> constexpr booleans, with comments. Maybe:
>>
>>   constexpr bool __lock_free
>>     = __atomic_always_lock_free(sizeof(long long), 0)
>>       && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>>   constexpr bool __double_word
>>     = sizeof(long long) == 2 * sizeof(_Atomic_word);
>>   // The ref-count members follow the vptr, so are aligned to alignof(void*).
>>   constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>>   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>>     {
>>       _M_release_double_width_cas();
>>       return;
>>     }
>>   else
>>     {
>>       // ... original body of _M_release();
>>     }
>>
>> > +          {
>> > +            _M_release_orig();
>> > +            return;
>> > +          }
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> > +        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
>> > +            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>>
>> This should use __CHAR_BIT__ instead of 8.
>>
>> > +          {
>> > +            // Both counts are 1, so there are no weak references and
>> > +            // we are releasing the last strong reference. No other
>> > +            // threads can observe the effects of this _M_release()
>> > +            // call (e.g. calling use_count()) without a data race.
>> > +            *(long long*)(&_M_use_count) = 0;
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> > +            _M_dispose();
>> > +            _M_destroy();
>> > +          }
>> > +        else
>> > +          {
>> > +            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
>> > +              {
>> > +                _M_release_last_use();
>> > +              }
>> > +          }
>> > +      }
>> > +
>> > +      void
>> > +      __attribute__ ((noinline))
>> > +      _M_release_last_use() noexcept
>> > +      {
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> > +        _M_dispose();
>> > +        if (_Mutex_base<_Lp>::_S_need_barriers)
>> > +          {
>> > +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>> > +          }
>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> > +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> > +                                                   -1) == 1)
>> > +          {
>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> > +            _M_destroy();
>> > +          }
>> > +      }
>> > +
>> > +      void
>> > +      _M_release_orig() noexcept
>> > +      {
>> >          // Be race-detector-friendly. For more info see bits/c++config.
>> >          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>> >          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
>> >            {
>> >              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> >              _M_dispose();
>> >              // There must be a memory barrier between dispose() and destroy()
>> >              // to ensure that the effects of dispose() are observed in the
>> >              // thread that runs destroy().
>> >              // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
>> >
>> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >      _Sp_counted_base<_S_single>::_M_release() noexcept
>> >      {
>> >        if (--_M_use_count == 0)
>> >          {
>> >            _M_dispose();
>> >            if (--_M_weak_count == 0)
>> >              _M_destroy();
>> >          }
>> >      }
>> >
>> > +  template<>
>> > +    inline void
>> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
>> > +    {
>> > +      _M_release_orig();
>> > +    }
>> > +
>> >    template<>
>> >      inline void
>> >      _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
>> >      { ++_M_weak_count; }
>> >
>> >    template<>
>> >      inline void
>> >      _Sp_counted_base<_S_single>::_M_weak_release() noexcept
>> >      {
>> >        if (--_M_weak_count == 0)
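Stepping back from the diff, the overall shape of the optimization can be
sketched with a toy, self-contained control block. This is only an
illustration under simplifying assumptions: plain std::atomic instead of
the __atomic builtins and __exchange_and_add_dispatch, fixed 32-bit
counters, and no _Lock_policy, TSAN or alignment dispatch. All names such
as ToyCountedBase and release() are invented; it is not the libstdc++
implementation, and like the patch it relies on reading two adjacent
counters through one wide load, which is beyond what the standard
guarantees:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>

    struct ToyCountedBase
    {
      // Two adjacent 32-bit counts, aligned so they can also be read as
      // one 64-bit value (the "double-width" load used by the fast path).
      alignas(8) std::atomic<std::int32_t> use_count{1};
      std::atomic<std::int32_t> weak_count{1};

      void dispose() { std::puts("dispose: destroy the managed object"); }
      void destroy() { std::puts("destroy: free the control block"); }

      void release()
      {
        // Fast path: if both counts are 1, the calling thread holds the
        // only reference of any kind, so no other thread can observe or
        // modify the counts, and the two separate atomic decrements can
        // be skipped.
        auto* both =
          reinterpret_cast<std::atomic<std::uint64_t>*>(&use_count);
        if (both->load(std::memory_order_acquire) == (1ULL + (1ULL << 32)))
          {
            dispose();
            destroy();
            return;
          }

        // Slow path: the original algorithm -- decrement the use count,
        // and if it drops to zero, dispose and then decrement the weak
        // count.
        if (use_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
          {
            dispose();
            if (weak_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
              destroy();
          }
      }
    };

    int main()
    {
      ToyCountedBase cb;   // a shared_ptr with no copies and no weak_ptrs
      cb.release();        // takes the fast path: both counts are 1
    }

The fast path needs no compare-and-swap: if both counts are 1, no other
owner exists, so any concurrent access to the control block would already
be a use-after-free bug in the calling program.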