On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's
> >> suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> >>   double-width and alignment.
> >> - Move the code in the optimized path to a separate function
> >>   _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.

It occurs to me now that _M_release_double_width_cas is the wrong name,
because we're not doing a DWCAS, just a double-width load. What do you
think about renaming it to _M_release_combined instead, since it does a
combined load of the two counts?

I'm no longer confident in the alignof suggestion I gave you.

+            constexpr bool __double_word
+              = sizeof(long long) == 2 * sizeof(_Atomic_word);
+            // The ref-count members follow the vptr, so are aligned to
+            // alignof(void*).
+            constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) this constant will be true, because
alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and _M_weak_count
members could be in different cache lines, and the atomic load will not
be atomic. I think we want to use __alignof(long long) here to get the
"natural" alignment, not the smaller 4B alignment allowed by SysV ABI.
That means that we won't do this optimization on IA32, but I think
that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check, something like

  (uintptr_t)&_M_use_count / 64 == (uintptr_t)&_M_weak_count / 64

to check they're on the same cache line. I think for now we should just
use __alignof and live without the optimization on IA32.

Secondly:

+      void
+      __attribute__ ((noinline))

This needs to be __noinline__ because noinline is not a reserved word,
so users can do:

#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?) that showed this should be non-inline? I'd like to add a comment
explaining why we want it to be noinline.

In that function ...

+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }

I think we can remove this fence. The _S_need_barriers constant is only
true for the _S_mutex policy, and that just calls _M_release_orig(), so
it never uses this function. I'll remove it and add a comment noting
that the lack of barrier is intentional.

+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:

  _Sp_counted_base<_S_single>::_M_release() (already present)
  _Sp_counted_base<_S_mutex>::_M_release()
  _Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new, as currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count and then call _M_release_last_use() (which still
has the barrier needed for the _S_mutex policy). The _S_atomic one
would have your new optimization.
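Roughly the shape I have in mind for the _S_mutex one, just to
illustrate (a simplified sketch from memory, not lifted from the
attached patch; the _S_atomic specialization would contain your
combined-load fast path):

  template<>
    inline void
    _Sp_counted_base<_S_mutex>::_M_release() noexcept
    {
      // Be race-detector-friendly, as in the primary template.
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          // Last use: dispose of the managed object, run the acq_rel
          // fence this policy still needs, then drop the implicit weak
          // reference (possibly destroying this control block).
          _M_release_last_use();
        }
    }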
See the attached patch showing what I mean. I find this version a bit
simpler to understand, as we just have _M_release and
_M_release_last_use, without _M_release_double_width_cas and
_M_release_orig.

What do you think of this version? Does it lose any important
properties of your version which I've failed to notice?
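P.S. In case a concrete check is useful for the IA32 point above, here
is a small standalone program (not part of either patch) that prints
the relevant values. Built with g++ -m32 on a recent GCC I'd expect it
to show sizeof(long long) == 8 but alignof(long long) == 4, while
__alignof gives the natural 8:

  #include <iostream>

  int main()
  {
    // alignof reports the ABI's minimum alignment (4 for long long on
    // IA32), whereas GCC's __alignof reports the natural/preferred
    // alignment (8), which is what the combined-load condition needs.
    std::cout << "sizeof(long long)    = " << sizeof(long long) << '\n'
              << "alignof(long long)   = " << alignof(long long) << '\n'
              << "__alignof(long long) = " << __alignof(long long) << '\n'
              << "alignof(void*)       = " << alignof(void*) << '\n';
  }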