I've pushed this change to trunk now (it was posted and reviewed in
stage 1, I just didn't get around to pushing it until now). The final
version of the patch is attached to this mail.

Thanks for the nice optimization, Maged!

On Wed, 4 Aug 2021 at 20:49, Maged Michael via Libstdc++ wrote:
>
> On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely wrote:
>
> > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> > >
> > > Sorry. I totally missed the rest of your message and the patch. My
> > > fuzzy eyesight, which usually guesses correctly 90% of the time,
> > > mistook "Secondly" on a line by itself for "Sincerely" :-)
> >
> > :-)
> >
> > > The noinlining was based on looking at generated code. That was for
> > > clang. It was inlining the _M_last_use function for every instance
> > > of _M_release (e.g., ~shared_ptr). This optimization with the
> > > noinline for _M_release_last_use ended up reducing the text size of
> > > massive binaries by 1.5% (several megabytes), which was an extra
> > > benefit. Without the noinline we saw a code size increase.
> >
> > Wow, that is a convincing argument for making it not inline, thanks.
> >
> > > IIUC, we can use the following. Right?
> > >
> > > __attribute__((__noinline__))
> >
> > Right.
> >
> > > I didn't understand the part about programmers doing
> > > #define noinline 1. I don't see code in the patch that uses noinline.
> >
> > This is a valid C++ program:
> >
> > #define noinline 1
> > #include <memory>
> > int main() { }
> >
> > But if anything in <memory> uses "noinline" then this valid program
> > will not compile. Which is why we must use ((__noinline__)) instead
> > of ((noinline)).
>
> Thanks. Now I get it.
>
> > > How about something like this comment?
> > >
> > > // Noinline to avoid code size increase.
> >
> > Great, thanks.
> >
> > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > > Actually I take back what I said. Sorry. I think the logic in your
> > > patch is correct. I missed some of the atomic decrements.
> > > But I'd be concerned about performance. If we make
> > > _M_release_last_use noinline then we are adding overhead to the
> > > fast path of the original logic (where both counts are 1).
> >
> > Oh, I see. So the code duplication serves a purpose. We want the
> > _M_release_last_use() code to be non-inline for the new logic, because
> > in the common case we never call it (we either detect that both counts
> > are 1 and do the dispose & destroy without atomic decrements, or we do
> > a single decrement and don't dispose or destroy anything). But for the
> > old logic, we do want that code inlined into _M_release (or
> > _M_release_orig as it was called in your patch). Have I got that right
> > now?
>
> Yes. Completely right.
>
> > What if we remove the __noinline__ from _M_release_last_use() so that
> > it can be inlined, but then add a noinline wrapper that can be called
> > when we don't want to inline it?
> >
> > So:
> >
> >     // Called by _M_release() when the use count reaches zero.
> >     void
> >     _M_release_last_use() noexcept
> >     {
> >       // unchanged from previous patch, but without the attribute.
> >       // ...
> >     }
> >
> >     // As above, but 'noinline' to reduce code size on the cold path.
> >     __attribute__((__noinline__))
> >     void
> >     _M_release_last_use_cold() noexcept
> >     { _M_release_last_use(); }
> >
> > And then:
> >
> >   template<>
> >     inline void
> >     _Sp_counted_base<_S_atomic>::_M_release() noexcept
> >     {
> >       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > #if ! _GLIBCXX_TSAN
> >       constexpr bool __lock_free
> >         = __atomic_always_lock_free(sizeof(long long), 0)
> >         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
> >       constexpr bool __double_word
> >         = sizeof(long long) == 2 * sizeof(_Atomic_word);
> >       // The ref-count members follow the vptr, so are aligned to
> >       // alignof(void*).
> >       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
> >       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> >         {
> >           constexpr long long __unique_ref
> >             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >           auto __both_counts
> >             = reinterpret_cast<long long*>(&_M_use_count);
> >
> >           _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> >           if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE)
> >               == __unique_ref)
> >             {
> >               // Both counts are 1, so there are no weak references and
> >               // we are releasing the last strong reference. No other
> >               // threads can observe the effects of this _M_release()
> >               // call (e.g. calling use_count()) without a data race.
> >               *(long long*)(&_M_use_count) = 0;
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> >               _M_dispose();
> >               _M_destroy();
> >               return;
> >             }
> >           if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1)
> >               == 1)
> >             {
> >               _M_release_last_use_cold();
> >               return;
> >             }
> >         }
> >       else
> > #endif
> >       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> >         {
> >           _M_release_last_use();
> >         }
> >     }
> >
> > So we use the noinline version for the else branch in the new logic,
> > but the can-inline version for the old logic. Would that work?
>
> Yes.
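
[Aside: the __unique_ref constant in the listing above packs
"use == 1 && weak == 1" into a single double-word comparison, so the
common case needs one atomic load instead of an atomic read-modify-write.
A minimal standalone sketch of the trick, assuming a 32-bit counter
(as _Atomic_word typically is) and the GCC/Clang __atomic builtins;
the names are invented for illustration and this is not libstdc++ code:

    #include <cassert>
    #include <climits>

    // Two adjacent counters, standing in for _M_use_count and
    // _M_weak_count. The real class is 8-byte aligned via its vptr;
    // here we ask for that alignment explicitly.
    struct alignas(long long) counts
    {
      int use  = 1;   // role of _M_use_count
      int weak = 1;   // role of _M_weak_count
    };

    int main()
    {
      static_assert(sizeof(long long) == 2 * sizeof(int),
                    "double-word layout assumed");

      // 1 + (1 << 32): a 64-bit pattern whose two 32-bit halves are
      // both 1. Because the pattern is symmetric, it matches
      // "both counts == 1" on little- and big-endian targets alike.
      constexpr long long unique_ref
        = 1LL + (1LL << (CHAR_BIT * sizeof(int)));

      counts c;
      // One atomic load observes both counters at once. Strictly this
      // cast is type-punning; the real patch can rely on it because
      // the library controls the object's layout.
      long long both
        = __atomic_load_n(reinterpret_cast<long long*>(&c),
                          __ATOMIC_ACQUIRE);
      assert(both == unique_ref); // no weak refs, last strong ref
    }

If the loaded value equals unique_ref, no other owner exists, so the
object can be disposed and destroyed with plain stores, as in the patch.]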
> > We could also consider adding __attribute__((__cold__)) to the
> > _M_release_last_use_cold() function, and/or add [[__unlikely__]] to
> > the 'if' that calls it, but maybe that's overkill.
> >
> > It seems like this will slightly pessimize the case where the last
> > use count is dropped but there are weak refs, as that will take the
> > cold path now. That's probably OK, given the benefits for the "both
> > counts are 1" case, and that it might reduce the code size for the
> > "use count > 1" case.
>
> Right. This pessimizes the case of having weak_ptr-s when the last use
> is released (e.g., enable_shared_from_this).
> For background, when I applied this optimization there were gains and
> losses but an overall net gain. Then I investigated whether the losses
> were large enough to warrant developing a library of custom
> shared_ptr-s that we could use only in the cases with gains, keeping
> the standard one where we'd lose (for example if gains were 3x and
> losses 2x). I found that overall (over 100s of services) the gains and
> losses were closer to 1.1x gains and 0.1x losses, which didn't warrant
> the hassle of maintaining a custom library.
>
> > It's past 8pm here so I'll come back to this tomorrow and do some
> > code size measurements of my own, now that I have a clearer
> > understanding of what your original patch was optimizing for.
>
> Thank you very much, Jonathan!
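
[Aside: a standalone sketch of the __cold__ / [[__unlikely__]] hints
mentioned above, in the hot-wrapper/cold-twin shape the patch uses.
Illustration only: the names are invented, the atomics are elided for
brevity, and this is not the committed patch:

    #include <cstdio>

    struct refcounted
    {
      int use = 1;

      void last_use() noexcept               // inlinable hot version
      { std::puts("dispose + destroy"); }

      __attribute__((__noinline__, __cold__)) // out-of-line, placed in
      void last_use_cold() noexcept           // a cold text section
      { last_use(); }

      void release() noexcept
      {
        // Plain decrement for brevity; the real code uses
        // __gnu_cxx::__exchange_and_add_dispatch.
        if (--use == 0)
          [[unlikely]]                        // C++20 branch hint
          last_use_cold();
      }
    };

    int main()
    {
      refcounted r;
      r.release();   // takes the "cold" path exactly once
    }

__cold__ tells GCC to optimize the function for size and group it with
other rarely executed code; [[unlikely]] is the standard C++20 spelling
of the branch hint, which library headers write as [[__unlikely__]] to
stay out of the user's macro namespace.]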