I've pushed this change to trunk now (it was posted and reviewed in
stage 1, I just didn't get around to pushing it until now). The final
version of the patch is attached to this mail.

Thanks for the nice optimization, Maged!

On Wed, 4 Aug 2021 at 20:49, Maged Michael via Libstdc++ wrote:
>
> On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely wrote:
>
> > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> > >
> > > Sorry. I totally missed the rest of your message and the patch. My
> > > fuzzy eyesight, which usually guesses correctly 90% of the time,
> > > mistook "Secondly" on a line by itself for "Sincerely" :-)
> >
> > :-)
> >
> > > The noinlining was based on looking at generated code. That was for
> > > clang. It was inlining the _M_last_use function for every instance
> > > of _M_release (e.g., ~shared_ptr). This optimization with the
> > > noinline for _M_release_last_use ended up reducing the text size of
> > > massive binaries by 1.5% (several megabytes), which was an extra
> > > benefit. Without the noinline we saw a code size increase.
> >
> > Wow, that is a convincing argument for making it not inline, thanks.
> >
> > > IIUC, we can use the following. Right?
> > >
> > > __attribute__((__noinline__))
> >
> > Right.
> >
> > > I didn't understand the part about programmers doing
> > > #define noinline 1. I don't see code in the patch that uses noinline.
> >
> > This is a valid C++ program:
> >
> > #define noinline 1
> > #include <memory>
> > int main() { }
> >
> > But if anything in <memory> uses "noinline" then this valid program
> > will not compile. Which is why we must use ((__noinline__)) instead
> > of ((noinline)).
>
> Thanks. Now I get it.
>
> > > How about something like this comment?
> > >
> > > // Noinline to avoid code size increase.
> >
> > Great, thanks.
> >
> > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > > Actually I take back what I said. Sorry. I think the logic in your
> > > patch is correct. I missed some of the atomic decrements.
> > > But I'd be concerned about performance. If we make
> > > _M_release_last_use noinline then we are adding overhead to the
> > > fast path of the original logic (where both counts are 1).
> >
> > Oh, I see. So the code duplication serves a purpose. We want the
> > _M_release_last_use() code to be non-inline for the new logic, because
> > in the common case we never call it (we either detect that both counts
> > are 1 and do the dispose & destroy without atomic decrements, or we do
> > a single decrement and don't dispose or destroy anything). But for the
> > old logic, we do want that code inlined into _M_release (or
> > _M_release_orig as it was called in your patch). Have I got that right
> > now?
>
> Yes. Completely right.
>
> > What if we remove the __noinline__ from _M_release_last_use() so that
> > it can be inlined, but then add a noinline wrapper that can be called
> > when we don't want to inline it?
> >
> > So:
> >
> >     // Called by _M_release() when the use count reaches zero.
> >     void
> >     _M_release_last_use() noexcept
> >     {
> >       // unchanged from previous patch, but without the attribute.
> >       // ...
> >     }
> >
> >     // As above, but 'noinline' to reduce code size on the cold path.
> >     __attribute__((__noinline__))
> >     void
> >     _M_release_last_use_cold() noexcept
> >     { _M_release_last_use(); }
> >
> > And then:
> >
> >   template<>
> >     inline void
> >     _Sp_counted_base<_S_atomic>::_M_release() noexcept
> >     {
> >       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > #if ! _GLIBCXX_TSAN
> >       constexpr bool __lock_free
> >         = __atomic_always_lock_free(sizeof(long long), 0)
> >         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
> >       constexpr bool __double_word
> >         = sizeof(long long) == 2 * sizeof(_Atomic_word);
> >       // The ref-count members follow the vptr, so are aligned to
> >       // alignof(void*).
> >       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
> >       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> >         {
> >           constexpr long long __unique_ref
> >             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >           auto __both_counts
> >             = reinterpret_cast<long long*>(&_M_use_count);
> >
> >           _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> >           if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE)
> >               == __unique_ref)
> >             {
> >               // Both counts are 1, so there are no weak references and
> >               // we are releasing the last strong reference. No other
> >               // threads can observe the effects of this _M_release()
> >               // call (e.g. calling use_count()) without a data race.
> >               *(long long*)(&_M_use_count) = 0;
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> >               _M_dispose();
> >               _M_destroy();
> >               return;
> >             }
> >           if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1)
> >               == 1)
> >             {
> >               _M_release_last_use_cold();
> >               return;
> >             }
> >         }
> >       else
> > #endif
> >       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> >         {
> >           _M_release_last_use();
> >         }
> >     }
> >
> > So we use the noinline version for the else branch in the new logic,
> > but the can-inline version for the old logic. Would that work?
>
> Yes.
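
[Aside: the __unique_ref constant in the listing above packs
"use == 1 && weak == 1" into a single double-word comparison, so the
common case needs one atomic load instead of an atomic read-modify-write.
A minimal standalone sketch of the trick, assuming a 32-bit counter
(as _Atomic_word typically is) and the GCC/Clang __atomic builtins;
the names are invented for illustration and this is not libstdc++ code:

    #include <cassert>
    #include <climits>

    // Two adjacent counters, standing in for _M_use_count and
    // _M_weak_count. The real class is 8-byte aligned via its vptr;
    // here we ask for that alignment explicitly.
    struct alignas(long long) counts
    {
      int use  = 1;   // role of _M_use_count
      int weak = 1;   // role of _M_weak_count
    };

    int main()
    {
      static_assert(sizeof(long long) == 2 * sizeof(int),
                    "double-word layout assumed");

      // 1 + (1 << 32): a 64-bit pattern whose two 32-bit halves are
      // both 1. Because the pattern is symmetric, it matches
      // "both counts == 1" on little- and big-endian targets alike.
      constexpr long long unique_ref
        = 1LL + (1LL << (CHAR_BIT * sizeof(int)));

      counts c;
      // One atomic load observes both counters at once. Strictly this
      // cast is type-punning; the real patch can rely on it because
      // the library controls the object's layout.
      long long both
        = __atomic_load_n(reinterpret_cast<long long*>(&c),
                          __ATOMIC_ACQUIRE);
      assert(both == unique_ref); // no weak refs, last strong ref
    }

If the loaded value equals unique_ref, no other owner exists, so the
object can be disposed and destroyed with plain stores, as in the patch.]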
> > We could also consider adding __attribute__((__cold__)) to the
> > _M_release_last_use_cold() function, and/or add [[__unlikely__]] to
> > the 'if' that calls it, but maybe that's overkill.
> >
> > It seems like this will slightly pessimize the case where the last
> > use count is dropped but there are weak refs, as that will take the
> > cold path now. That's probably OK, given the benefits for the "both
> > counts are 1" case, and that it might reduce the code size for the
> > "use count > 1" case.
>
> Right. This pessimizes the case of having weak_ptr-s when the last use
> is released (e.g., enable_shared_from_this).
> For background, when I applied this optimization there were gains and
> losses but an overall net gain. Then I investigated whether the losses
> were large enough to warrant developing a library of custom
> shared_ptr-s that we could use only in the cases with gains, keeping
> the standard one where we'd lose (for example if gains were 3x and
> losses 2x). I found that overall (over 100s of services) the gains and
> losses were closer to 1.1x gains and 0.1x losses, which didn't warrant
> the hassle of maintaining a custom library.
>
> > It's past 8pm here so I'll come back to this tomorrow and do some
> > code size measurements of my own, now that I have a clearer
> > understanding of what your original patch was optimizing for.
>
> Thank you very much, Jonathan!
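
[Aside: a standalone sketch of the __cold__ / [[__unlikely__]] hints
mentioned above, in the hot-wrapper/cold-twin shape the patch uses.
Illustration only: the names are invented, the atomics are elided for
brevity, and this is not the committed patch:

    #include <cstdio>

    struct refcounted
    {
      int use = 1;

      void last_use() noexcept               // inlinable hot version
      { std::puts("dispose + destroy"); }

      __attribute__((__noinline__, __cold__)) // out-of-line, placed in
      void last_use_cold() noexcept           // a cold text section
      { last_use(); }

      void release() noexcept
      {
        // Plain decrement for brevity; the real code uses
        // __gnu_cxx::__exchange_and_add_dispatch.
        if (--use == 0)
          [[unlikely]]                        // C++20 branch hint
          last_use_cold();
      }
    };

    int main()
    {
      refcounted r;
      r.release();   // takes the "cold" path exactly once
    }

__cold__ tells GCC to optimize the function for size and group it with
other rarely executed code; [[unlikely]] is the standard C++20 spelling
of the branch hint, which library headers write as [[__unlikely__]] to
stay out of the user's macro namespace.]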