* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
@ 2020-12-07 12:30 Maged Michael
2020-12-07 15:57 ` Jonathan Wakely
0 siblings, 1 reply; 11+ messages in thread
From: Maged Michael @ 2020-12-07 12:30 UTC (permalink / raw)
To: libstdc++
The attached patch was scrubbed. Including it inline.
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..c675b0f6916 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -122,80 +122,126 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
class _Sp_counted_base
: public _Mutex_base<_Lp>
{
public:
_Sp_counted_base() noexcept
: _M_use_count(1), _M_weak_count(1) { }
virtual
~_Sp_counted_base() noexcept
{ }
// Called when _M_use_count drops to zero, to release the resources
// managed by *this.
virtual void
_M_dispose() noexcept = 0;
// Called when _M_weak_count drops to zero.
virtual void
_M_destroy() noexcept
{ delete this; }
virtual void*
_M_get_deleter(const std::type_info&) noexcept = 0;
void
_M_add_ref_copy()
{ __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
void
_M_add_ref_lock()
{
if (!_M_add_ref_lock_nothrow())
__throw_bad_weak_ptr();
}
bool
_M_add_ref_lock_nothrow() noexcept;
void
_M_release() noexcept
+ {
+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
+ {
+ _M_release_orig();
+ return;
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if ((__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+ == 0x100000001))
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_dispose();
+ _M_destroy();
+ }
+ else
+ {
+ if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+ {
+ _M_release_last_use();
+ }
+ }
+ }
+
+ void
+ __attribute__ ((noinline))
+ _M_release_last_use() noexcept
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _M_dispose();
+ if (_Mutex_base<_Lp>::_S_need_barriers)
+ {
+ __atomic_thread_fence (__ATOMIC_ACQ_REL);
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+ -1) == 1)
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_destroy();
+ }
+ }
+
+ void
+ _M_release_orig() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
_M_dispose();
// There must be a memory barrier between dispose() and destroy()
// to ensure that the effects of dispose() are observed in the
// thread that runs destroy().
// See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
if (_Mutex_base<_Lp>::_S_need_barriers)
{
__atomic_thread_fence (__ATOMIC_ACQ_REL);
}
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
-1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
_M_destroy();
}
}
}
void
_M_weak_add_ref() noexcept
{ __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
void
_M_weak_release() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
if (_Mutex_base<_Lp>::_S_need_barriers)
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 12:30 Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release Maged Michael
@ 2020-12-07 15:57 ` Jonathan Wakely
2020-12-07 16:23 ` Jonathan Wakely
0 siblings, 1 reply; 11+ messages in thread
From: Jonathan Wakely @ 2020-12-07 15:57 UTC (permalink / raw)
To: Maged Michael; +Cc: libstdc++
On 07/12/20 07:30 -0500, Maged Michael via Libstdc++ wrote:
> void
> _M_release() noexcept
>+ {
>+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
>+ {
>+ _M_release_orig();
>+ return;
>+ }
>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>+ if ((__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
>+ == 0x100000001))
>+ {
I think we want a comment here explaining what it's doing:
// Both counts are 1, so there are no weak references and
// we are releasing the last strong reference. No other
// threads can observe the effects of this _M_release()
// call (e.g. calling use_count()) without a data race.
>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>+ _M_dispose();
_M_dispose() will run the destructor for the owned object. If that
object has a reference to the shared_ptr then it can call use_count()
on it. That isn't a data race, because it's within the same thread.
Previously the object's destructor would have observed that the use
count had dropped to zero. In this new implementation, the use count
is still 1 when the object is destroyed.
Similarly, the object's destructor could create a new weak_ptr from
*this. Previously that would see that the use count is zero, and so
fail. With the new code it would succeed, and increase the weak count,
and be left with a dangling pointer after the destroy here:
>+ _M_destroy();
I think to avoid these problems you need to set both counts to 0
before calling _M_dispose(). Those can be simple non-atomic writes
though, since we know that only the current thread can observe those
values without data races.
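A minimal sketch of that suggestion, applied to the fast path of the
proposed _M_release() (member names are the ones from the patch above; the
exact placement is only illustrative, not a committed fix):
    if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
        == 0x100000001LL)
      {
        // Only this thread can reach the counts now, so plain stores are
        // enough.  Zeroing them means the owned object's destructor
        // observes use_count() == 0, and a weak_ptr created from *this in
        // the destructor is already expired, as before the patch.
        _M_use_count = 0;
        _M_weak_count = 0;
        _M_dispose();
        _M_destroy();
      }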
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 15:57 ` Jonathan Wakely
@ 2020-12-07 16:23 ` Jonathan Wakely
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Wakely @ 2020-12-07 16:23 UTC (permalink / raw)
To: Maged Michael; +Cc: libstdc++
On 07/12/20 15:57 +0000, Jonathan Wakely wrote:
>On 07/12/20 07:30 -0500, Maged Michael via Libstdc++ wrote:
>> void
>> _M_release() noexcept
>>+ {
>>+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
>>+ {
>>+ _M_release_orig();
>>+ return;
>>+ }
>>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>>+ if ((__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
>>+ == 0x100000001))
>>+ {
>
>I think we want a comment here explaining what it's doing:
>
> // Both counts are 1, so there are no weak references and
> // we are releasing the last strong reference. No other
> // threads can observe the effects of this _M_release()
> // call (e.g. calling use_count()) without a data race.
>
>
>>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>>+ _M_dispose();
>
>_M_dispose() will run the destructor for the owned object. If that
>object has a reference to the shared_ptr then it can call use_count()
>on it. That isn't a data race, because it's within the same thread.
>Previously the object's destructor would have observed that the use
>count had dropped to zero. In this new implementation, the use count
>is still 1 when the object is destroyed.
>
>Similarly, the object's destructor could create a new weak_ptr from
>*this. Previously that would see that the use count is zero, and so
>fail. With the new code it would succeed, and increase the weak count,
>and be left with a dangling pointer after the destroy here:
>
>>+ _M_destroy();
>
>I think to avoid these problems you need to set both counts to 0
>before calling _M_dispose(). Those can be simple non-atomic writes
>though, since we know that only the current thread can observe those
>values without data races.
Ah, actually maybe this isn't a problem, because of how we implement
shared_ptr::reset() and shared_ptr::operator=.
We do:
void
reset() noexcept
{ __shared_ptr().swap(*this); }
Which means that the final _M_release() is called on an unnamed temporary,
which the object's destructor cannot refer to.
But we should make sure we have a test verifying that. Something like
this:
#include <memory>
struct X
{
~X();
};
std::shared_ptr<X> p;
std::weak_ptr<X> w(p);
X::~X()
{
w = p;
if (!w.expired())
std::terminate();
}
int main()
{
p = std::make_shared<X>();
p.reset();
p = w.lock();
if (p)
std::terminate();
}
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-08 8:01 ` Marc Glisse
@ 2020-12-08 9:56 ` Jonathan Wakely
0 siblings, 0 replies; 11+ messages in thread
From: Jonathan Wakely @ 2020-12-08 9:56 UTC (permalink / raw)
To: libstdc++
On Tue, 8 Dec 2020, 08:01 Marc Glisse, <marc.glisse@inria.fr> wrote:
> On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:
>
> > Any pointers on where to file bug reports about the generated code?
>
> https://gcc.gnu.org/bugzilla/
> (creating an account may not be as trivial as it used to be, but it still
> works)
>
I can create bugzilla accounts, so if you want one created for your Gmail
address just let me know.
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-08 4:10 ` Maged Michael
@ 2020-12-08 8:01 ` Marc Glisse
2020-12-08 9:56 ` Jonathan Wakely
0 siblings, 1 reply; 11+ messages in thread
From: Marc Glisse @ 2020-12-08 8:01 UTC (permalink / raw)
To: Maged Michael; +Cc: libstdc++
On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:
> Any pointers on where to file bug reports about the generated code?
https://gcc.gnu.org/bugzilla/
(creating an account may not be as trivial as it used to be, but it still
works)
>> ISTR that libstdc++ uses a surprisingly strong memory barrier when
>> incrementing references (acq_rel, where I would expect acquire to suffice,
>> although I haven't thought about weak references that much). Did you also
>> experiment with that?
(I actually meant "relaxed")
> I applied to x86, where acq_rel doesn't add extra memory fence instructions
> for atomic increment/decrement operations.
Indeed, on x86 just making the increment atomic requires lock add or
lock inc, which already includes a strong barrier. It does make a
difference in generated code on aarch64 or powerpc64el though.
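For illustration, the difference is roughly the one between these two
increments (written with plain std::atomic here, not the libstdc++
dispatcher):
    #include <atomic>
    void add_ref_acq_rel(std::atomic<int>& count)
    { count.fetch_add(1, std::memory_order_acq_rel); }  // barriered RMW on
                                                         // aarch64/powerpc
    void add_ref_relaxed(std::atomic<int>& count)
    { count.fetch_add(1, std::memory_order_relaxed); }  // plain atomic add
Both forms compile to the same lock add on x86; only on the weakly ordered
targets does the relaxed form drop the extra ordering.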
--
Marc Glisse
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 15:28 ` Marc Glisse
2020-12-07 15:41 ` Jonathan Wakely
@ 2020-12-08 4:10 ` Maged Michael
2020-12-08 8:01 ` Marc Glisse
1 sibling, 1 reply; 11+ messages in thread
From: Maged Michael @ 2020-12-08 4:10 UTC (permalink / raw)
To: libstdc++
Marc,
Thank you for your comments.
Any pointers on where to file bug reports about the generated code?
Thanks,
Maged
On Mon, Dec 7, 2020 at 10:29 AM Marc Glisse <marc.glisse@inria.fr> wrote:
> Thanks for doing this!
>
> As someone who recently wrote
>
> if (count.load(std::memory_order_relaxed) == 1
> || count.fetch_sub(1, std::memory_order_release) == 1) {
> std::atomic_thread_fence(std::memory_order_acquire);
> // destroy here
> }
>
> for a much simpler reference-counted type without weak references, I am
> happy to see that this can generalize to the more complicated case of
> shared_ptr.
>
> Yes, hardening it may be a bit hard. Some checks with alignof/offsetof
> could make sense, but it is complicated by the fact that things are
> happening in the base class that depend on the derived class. Some
> allocators guarantee a minimum alignment, likely sufficient for your case,
> but again I am not sure how to test for that exactly.
>
> ISTR that libstdc++ uses a surprisingly strong memory barrier when
> incrementing references (acq_rel, where I would expect acquire to suffice,
> although I haven't thought about weak references that much). Did you also
> experiment with that?
>
>
I applied it to x86, where acq_rel doesn't add extra memory fence instructions
for atomic increment/decrement operations.
> Don't hesitate to file bug reports about missed optimizations in the
> compiler itself, for the redundant checks and branches you mention
> compared to llvm.
>
> On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:
>
> > Hi all,
> >
> > The attached patch includes a proposed alternative implementation of
> > _Sp_counted_base::_M_release(). I'd appreciate your feedback.
> >
> > Benefits: Save the cost of the last atomic decrements of each of the use
> > count and the weak count in _Sp_counted_base. Atomic instructions are
> > significantly slower than regular loads and stores across
> > major architectures.
> >
> > How current code works: _M_release() atomically decrements the use count,
> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
> > weak count, checks if it was 1, and if so calls _M_destroy().
> >
> > How the proposed algorithm works: _M_release() loads both use count and
> > weak count together atomically (assuming 8-byte alignment, discussed
> > later), checks if the value is equal to 0x100000001 (i.e., both counts
> are
> > equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
> > follows the original algorithm.
> >
> > Why it works: When the current thread executing _M_release() finds each
> of
> > the counts is equal to 1, then no other threads could possibly hold use
> or
> > weak references to this control block. That is, no other threads could
> > possibly access the counts or the protected object.
> >
> > There are two crucial high-level issues that I'd like to point out first:
> > - Atomicity of access to the counts together
> > - Proper alignment of the counts together
> >
> > The patch is intended to apply the proposed algorithm only to the case of
> > 64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
> >
> > ** Atomicity **
> > - The proposed algorithm depends on the mutual atomicity among 8-byte
> > atomic operations and 4-byte atomic operations on each of the 4-byte
> halves
> > of the 8-byte aligned 8-byte block.
> > - The standard does not guarantee atomicity of 8-byte operations on a
> pair
> > of 8-byte aligned 4-byte objects.
> > - To my knowledge this works in practice on systems that guarantee native
> > implementation of 4-byte and 8-byte atomic operations.
> > - Can we limit applying the proposed algorithm to architectures that
> > guarantee native implementation of atomic operations?
> >
> > ** Alignment **
> > - _Sp_counted_base is an internal base class. Three internal classes are
> > derived from it.
> > - Two of these classes include a pointer as a first member. That is, the
> > layout (in the relevant case) is: use count (4 bytes), weak count (4
> > bytes), pointer (8 bytes). My understanding is that in these cases the
> two
> > 4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
> > - In the third case (_Sp_counted_ptr_inplace), only includes user data
> > after the two counts, without necessarily including any 8-byte aligned
> > members. For example, if the protected object is int32_t, then the
> > _Sp_counted_ptr_inplace object consists of three 4-byte members (the two
> > counts inherited from _Sp_counted_base and the user data). My
> understanding
> > is that the standard does not guarantee 8-byte alignment in such a case.
> > - Is 8-byte alignment guaranteed in practice in some widely-used
> > environments?
> > - Can 8-byte alignment be checked at build time in some widely-used
> > environments?
> >
> > Other points:
> > - The proposed algorithm can interact correctly with the current
> algorithm.
> > That is, multiple threads using different versions of the code with and
> > without the patch operating on the same objects should always interact
> > correctly. The intent for the patch is to be ABI compatible with the
> > current implementation.
> > - The proposed patch involves a performance trade-off between saving the
> > costs of atomic instructions when the counts are both 1 vs adding the
> cost
> > of loading the 8-byte combined counts and comparison with 0x100000001.
> > - I noticed a big difference between the code generated by GCC vs LLVM.
> GCC
> > seems to generate noticeably more code and what seems to be redundant
> null
> > checks and branches.
> > - The patch has been in use (built using LLVM) in a large environment for
> > many months. The performance gains outweigh the losses (roughly 10 to 1)
> > across a large variety of workloads.
> >
> > I'd appreciate your feedback on the alignment and atomicity issues as
> well
> > as any other comments.
> >
> > Thank you,
> > Maged
>
> --
> Marc Glisse
>
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 15:37 ` Jonathan Wakely
@ 2020-12-08 4:06 ` Maged Michael
0 siblings, 0 replies; 11+ messages in thread
From: Maged Michael @ 2020-12-08 4:06 UTC (permalink / raw)
To: Jonathan Wakely; +Cc: libstdc++
Jonathan,
Thank you for your comments both on the high-level idea and on the draft
patch. This is exactly what I needed for guidance to work on the patch.
Maged
On Mon, Dec 7, 2020 at 10:37 AM Jonathan Wakely <jwakely@redhat.com> wrote:
> On 07/12/20 07:24 -0500, Maged Michael via Libstdc++ wrote:
> >Hi all,
> >
> >The attached patch includes a proposed alternative implementation of
> >_Sp_counted_base::_M_release(). I'd appreciate your feedback.
>
> This is very interesting, thanks.
>
> >Benefits: Save the cost of the last atomic decrements of each of the use
> >count and the weak count in _Sp_counted_base. Atomic instructions are
> >significantly slower than regular loads and stores across
> >major architectures.
> >
> >How current code works: _M_release() atomically decrements the use count,
> >checks if it was 1, if so calls _M_dispose(), atomically decrements the
> >weak count, checks if it was 1, and if so calls _M_destroy().
> >
> >How the proposed algorithm works: _M_release() loads both use count and
> >weak count together atomically (assuming 8-byte alignment, discussed
> >later), checks if the value is equal to 0x100000001 (i.e., both counts are
> >equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
> >follows the original algorithm.
> >
> >Why it works: When the current thread executing _M_release() finds each of
> >the counts is equal to 1, then no other threads could possibly hold use or
> >weak references to this control block. That is, no other threads could
> >possibly access the counts or the protected object.
> >
> >There are two crucial high-level issues that I'd like to point out first:
> >- Atomicity of access to the counts together
> >- Proper alignment of the counts together
> >
> >The patch is intended to apply the proposed algorithm only to the case of
> >64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
> >
> >** Atomicity **
> >- The proposed algorithm depends on the mutual atomicity among 8-byte
> >atomic operations and 4-byte atomic operations on each of the 4-byte
> halves
> >of the 8-byte aligned 8-byte block.
> >- The standard does not guarantee atomicity of 8-byte operations on a pair
> >of 8-byte aligned 4-byte objects.
> >- To my knowledge this works in practice on systems that guarantee native
> >implementation of 4-byte and 8-byte atomic operations.
> >- Can we limit applying the proposed algorithm to architectures that
> >guarantee native implementation of atomic operations?
>
> Check __ATOMIC_LLONG_LOCK_FREE > 1
>
> Or std::atomic<long long>::is_always_lock_free
> && std::atomic<_Atomic_word>::is_always_lock_free
>
> >** Alignment **
> >- _Sp_counted_base is an internal base class. Three internal classes are
> >derived from it.
> >- Two of these classes include a pointer as a first member. That is, the
> >layout (in the relevant case) is: use count (4 bytes), weak count (4
> >bytes), pointer (8 bytes). My understanding is that in these cases the two
> >4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
>
> N.B. _Atomic_word is long on sparc64, but that's OK as we won't try to
> use the new code in that case.
>
> >- In the third case (_Sp_counted_ptr_inplace), only includes user data
> >after the two counts, without necessarily including any 8-byte aligned
> >members. For example, if the protected object is int32_t, then the
> >_Sp_counted_ptr_inplace object consists of three 4-byte members (the two
> >counts inherited from _Sp_counted_base and the user data). My
> understanding
> >is that the standard does not guarantee 8-byte alignment in such a case.
> >- Is 8-byte alignment guaranteed in practice in some widely-used
> >environments?
>
> For all types derived from _Sp_counted_base the whole class is going
> to depend on the alignment of the vptr, which means that even the
> _Sp_counted_ptr_inplace is guaranteed to be aligned to alignof(void*).
>
> >- Can 8-byte alignment be checked at build time in some widely-used
> >environments?
>
> Yes, we absolutely must check the alignments before using the new code
> path, either with static_assert or if-constexpr. The expected
> alignment must be checked along with the expected sizes of
> _Atomic_word and long long.
>
> On that point, the patch has:
>
>+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) !=
>
> This will fail to compile on all targets except LP64 ones, because
> __LP64__ is not defined there.
>
> Why is __LP64__ the right check anyway?
>
> Shouldn't it depend on __ATOMIC_LLONG_LOCK_FREE > 1 because what
> matters is that both 8-byte and 4-byte atomics are lock-free?
>
> Why do we care if the sizes are 4 bytes and 8 bytes, wouldn't it also
> work for 16-bit int and 32-bit long long? i.e.
>
> if constexpr (sizeof(long long) < (2 * sizeof(_Atomic_word))
>
> (Except that the constexpr should be _GLIBCXX17_CONSTEXPR which
> expands to nothing for C++11 and C++14, which don't support
> if-constexpr).
>
> I think this also needs to be restricted to the _S_atomic lock policy,
> otherwise you're mixing different ways to modify the same memory
> locations using incompatible techniques.
>
> >Other points:
> >- The proposed algorithm can interact correctly with the current
> algorithm.
> >That is, multiple threads using different versions of the code with and
> >without the patch operating on the same objects should always interact
> >correctly. The intent for the patch is to be ABI compatible with the
> >current implementation.
>
> Yes, this is important, but I disagree that the patch preserves that
> property. It's only true for the _S_atomic policy.
>
>
> >- The proposed patch involves a performance trade-off between saving the
> >costs of atomic instructions when the counts are both 1 vs adding the cost
> >of loading the 8-byte combined counts and comparison with 0x100000001.
> >- I noticed a big difference between the code generated by GCC vs LLVM.
> GCC
> >seems to generate noticeably more code and what seems to be redundant null
> >checks and branches.
> >- The patch has been in use (built using LLVM) in a large environment for
> >many months. The performance gains outweigh the losses (roughly 10 to 1)
> >across a large variety of workloads.
>
> That's great.
>
> >I'd appreciate your feedback on the alignment and atomicity issues as well
> >as any other comments.
>
> Please CC the gcc-patches@gcc.gnu.org list for all patches as well as
> the libstdc++ list, thanks!
>
> I think the idea is definitely worth pursuing, but the patch needs
> work (which is fine because a change like this won't be able to get
> into the repo until after GCC 11 is released in a few months, so
> there's no rush).
>
>
>
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 15:28 ` Marc Glisse
@ 2020-12-07 15:41 ` Jonathan Wakely
2020-12-08 4:10 ` Maged Michael
1 sibling, 0 replies; 11+ messages in thread
From: Jonathan Wakely @ 2020-12-07 15:41 UTC (permalink / raw)
To: libstdc++
On 07/12/20 16:28 +0100, Marc Glisse wrote:
>Thanks for doing this!
>
>As someone who recently wrote
>
> if (count.load(std::memory_order_relaxed) == 1
> || count.fetch_sub(1, std::memory_order_release) == 1) {
> std::atomic_thread_fence(std::memory_order_acquire);
> // destroy here
> }
>
>for a much simpler reference-counted type without weak references, I am
>happy to see that this can generalize to the more complicated case of
>shared_ptr.
>
>Yes, hardening it may be a bit hard. Some checks with alignof/offsetof
>could make sense, but it is complicated by the fact that things are
>happening in the base class that depend on the derived class. Some
>allocators guarantee a minimum alignment, likely sufficient for your
>case, but again I am not sure how to test for that exactly.
>
>ISTR that libstdc++ uses a surprisingly strong memory barrier when
>incrementing references (acq_rel, where I would expect acquire to
>suffice, although I haven't thought about weak references that much).
>Did you also experiment with that?
That's because we go through the dispatcher functions in
<ext/atomicity.h>, which don't take a memory_order argument and so use a
conservative ordering. That could be improved.
The dispatchers are beneficial (especially with glibc 2.32 or later)
because they downgrade the atomic ops to non-atomic equivalents when
the program only has a single thread. The proposed patch loses that
benefit, meaning that the 8-byte load will always be done using an
atomic instruction, even when not needed. That should be improved too.
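Something like the following could keep that shortcut in the combined-load
path (a rough sketch only; it assumes a helper along the lines of
__gnu_cxx::__is_single_threaded() from <ext/atomicity.h>):
    long long __counts;
    if (__gnu_cxx::__is_single_threaded())
      // No other thread can exist, so an ordinary load is fine.
      __counts = *(long long*)(&_M_use_count);
    else
      __counts = __atomic_load_n((long long*)(&_M_use_count),
                                 __ATOMIC_ACQUIRE);
    if (__counts == 0x100000001LL)
      {
        // dispose and destroy as in the proposed patch
      }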
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 12:24 Maged Michael
2020-12-07 15:28 ` Marc Glisse
@ 2020-12-07 15:37 ` Jonathan Wakely
2020-12-08 4:06 ` Maged Michael
1 sibling, 1 reply; 11+ messages in thread
From: Jonathan Wakely @ 2020-12-07 15:37 UTC (permalink / raw)
To: Maged Michael; +Cc: libstdc++
On 07/12/20 07:24 -0500, Maged Michael via Libstdc++ wrote:
>Hi all,
>
>The attached patch includes a proposed alternative implementation of
>_Sp_counted_base::_M_release(). I'd appreciate your feedback.
This is very interesting, thanks.
>Benefits: Save the cost of the last atomic decrements of each of the use
>count and the weak count in _Sp_counted_base. Atomic instructions are
>significantly slower than regular loads and stores across
>major architectures.
>
>How current code works: _M_release() atomically decrements the use count,
>checks if it was 1, if so calls _M_dispose(), atomically decrements the
>weak count, checks if it was 1, and if so calls _M_destroy().
>
>How the proposed algorithm works: _M_release() loads both use count and
>weak count together atomically (assuming 8-byte alignment, discussed
>later), checks if the value is equal to 0x100000001 (i.e., both counts are
>equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
>follows the original algorithm.
>
>Why it works: When the current thread executing _M_release() finds each of
>the counts is equal to 1, then no other threads could possibly hold use or
>weak references to this control block. That is, no other threads could
>possibly access the counts or the protected object.
>
>There are two crucial high-level issues that I'd like to point out first:
>- Atomicity of access to the counts together
>- Proper alignment of the counts together
>
>The patch is intended to apply the proposed algorithm only to the case of
>64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
>
>** Atomicity **
>- The proposed algorithm depends on the mutual atomicity among 8-byte
>atomic operations and 4-byte atomic operations on each of the 4-byte halves
>of the 8-byte aligned 8-byte block.
>- The standard does not guarantee atomicity of 8-byte operations on a pair
>of 8-byte aligned 4-byte objects.
>- To my knowledge this works in practice on systems that guarantee native
>implementation of 4-byte and 8-byte atomic operations.
>- Can we limit applying the proposed algorithm to architectures that
>guarantee native implementation of atomic operations?
Check __ATOMIC_LLONG_LOCK_FREE > 1
Or std::atomic<long long>::is_always_lock_free
&& std::atomic<_Atomic_word>::is_always_lock_free
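For example, the fallback test at the top of the new _M_release() could use
that condition in place of __LP64__ (shown with the macro form, which also
works before C++17; a sketch, not the final spelling):
    if (__ATOMIC_LLONG_LOCK_FREE <= 1
        || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
      {
        _M_release_orig();
        return;
      }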
>** Alignment **
>- _Sp_counted_base is an internal base class. Three internal classes are
>derived from it.
>- Two of these classes include a pointer as a first member. That is, the
>layout (in the relevant case) is: use count (4 bytes), weak count (4
>bytes), pointer (8 bytes). My understanding is that in these cases the two
>4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
N.B. _Atomic_word is long on sparc64, but that's OK as we won't try to
use the new code in that case.
>- In the third case (_Sp_counted_ptr_inplace), only includes user data
>after the two counts, without necessarily including any 8-byte aligned
>members. For example, if the protected object is int32_t, then the
>_Sp_counted_ptr_inplace object consists of three 4-byte members (the two
>counts inherited from _Sp_counted_base and the user data). My understanding
>is that the standard does not guarantee 8-byte alignment in such a case.
>- Is 8-byte alignment guaranteed in practice in some widely-used
>environments?
For all types derived from _Sp_counted_base the whole class is going
to depend on the alignment of the vptr, which means that even the
_Sp_counted_ptr_inplace is guaranteed to be aligned to alignof(void*).
>- Can 8-byte alignment be checked at build time in some widely-used
>environments?
Yes, we absolutely must check the alignments before using the new code
path, either with static_assert or if-constexpr. The expected
alignment must be checked along with the expected sizes of
_Atomic_word and long long.
On that point, the patch has:
>+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) !=
This will fail to compile on all targets except LP64 ones, because
__LP64__ is not defined there.
Why is __LP64__ the right check anyway?
Shouldn't it depend on __ATOMIC_LLONG_LOCK_FREE > 1 because what
matters is that both 8-byte and 4-byte atomics are lock-free?
Why do we care if the sizes are 4 bytes and 8 bytes, wouldn't it also
work for 16-bit int and 32-bit long long? i.e.
if constexpr (sizeof(long long) < 2 * sizeof(_Atomic_word))
(Except that the constexpr should be _GLIBCXX17_CONSTEXPR which
expands to nothing for C++11 and C++14, which don't support
if-constexpr).
I think this also needs to be restricted to the _S_atomic lock policy,
otherwise you're mixing different ways to modify the same memory
locations using incompatible techniques.
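Putting those points together, the guard might end up looking something
like this (the alignment test and the way the lock policy is checked are
assumptions made for the sake of the sketch):
    if _GLIBCXX17_CONSTEXPR (_Lp != __gnu_cxx::_S_atomic
          || __ATOMIC_LLONG_LOCK_FREE <= 1
          || sizeof(long long) < 2 * sizeof(_Atomic_word)
          || __alignof__(_Sp_counted_base) < __alignof__(long long))
      {
        _M_release_orig();
        return;
      }
    // The 0x100000001 constant would likewise need to be computed from
    // sizeof(_Atomic_word) rather than hard-coded.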
>Other points:
>- The proposed algorithm can interact correctly with the current algorithm.
>That is, multiple threads using different versions of the code with and
>without the patch operating on the same objects should always interact
>correctly. The intent for the patch is to be ABI compatible with the
>current implementation.
Yes, this is important, but I disagree that the patch preserves that
property. It's only true for the _S_atomic policy.
>- The proposed patch involves a performance trade-off between saving the
>costs of atomic instructions when the counts are both 1 vs adding the cost
>of loading the 8-byte combined counts and comparison with 0x100000001.
>- I noticed a big difference between the code generated by GCC vs LLVM. GCC
>seems to generate noticeably more code and what seems to be redundant null
>checks and branches.
>- The patch has been in use (built using LLVM) in a large environment for
>many months. The performance gains outweigh the losses (roughly 10 to 1)
>across a large variety of workloads.
That's great.
>I'd appreciate your feedback on the alignment and atomicity issues as well
>as any other comments.
Please CC the gcc-patches@gcc.gnu.org list for all patches as well as
the libstdc++ list, thanks!
I think the idea is definitely worth pursuing, but the patch needs
work (which is fine because a change like this won't be able to get
into the repo until after GCC 11 is released in a few months, so
there's no rush).
* Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
2020-12-07 12:24 Maged Michael
@ 2020-12-07 15:28 ` Marc Glisse
2020-12-07 15:41 ` Jonathan Wakely
2020-12-08 4:10 ` Maged Michael
2020-12-07 15:37 ` Jonathan Wakely
1 sibling, 2 replies; 11+ messages in thread
From: Marc Glisse @ 2020-12-07 15:28 UTC (permalink / raw)
To: Maged Michael; +Cc: libstdc++
Thanks for doing this!
As someone who recently wrote
if (count.load(std::memory_order_relaxed) == 1
|| count.fetch_sub(1, std::memory_order_release) == 1) {
std::atomic_thread_fence(std::memory_order_acquire);
// destroy here
}
for a much simpler reference-counted type without weak references, I am
happy to see that this can generalize to the more complicated case of
shared_ptr.
Yes, hardening it may be a bit hard. Some checks with alignof/offsetof
could make sense, but it is complicated by the fact that things are
happening in the base class that depend on the derived class. Some
allocators guarantee a minimum alignment, likely sufficient for your case,
but again I am not sure how to test for that exactly.
ISTR that libstdc++ uses a surprisingly strong memory barrier when
incrementing references (acq_rel, where I would expect acquire to suffice,
although I haven't thought about weak references that much). Did you also
experiment with that?
Don't hesitate to file bug reports about missed optimizations in the
compiler itself, for the redundant checks and branches you mention
compared to llvm.
On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:
> Hi all,
>
> The attached patch includes a proposed alternative implementation of
> _Sp_counted_base::_M_release(). I'd appreciate your feedback.
>
> Benefits: Save the cost of the last atomic decrements of each of the use
> count and the weak count in _Sp_counted_base. Atomic instructions are
> significantly slower than regular loads and stores across
> major architectures.
>
> How current code works: _M_release() atomically decrements the use count,
> checks if it was 1, if so calls _M_dispose(), atomically decrements the
> weak count, checks if it was 1, and if so calls _M_destroy().
>
> How the proposed algorithm works: _M_release() loads both use count and
> weak count together atomically (assuming 8-byte alignment, discussed
> later), checks if the value is equal to 0x100000001 (i.e., both counts are
> equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
> follows the original algorithm.
>
> Why it works: When the current thread executing _M_release() finds each of
> the counts is equal to 1, then no other threads could possibly hold use or
> weak references to this control block. That is, no other threads could
> possibly access the counts or the protected object.
>
> There are two crucial high-level issues that I'd like to point out first:
> - Atomicity of access to the counts together
> - Proper alignment of the counts together
>
> The patch is intended to apply the proposed algorithm only to the case of
> 64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
>
> ** Atomicity **
> - The proposed algorithm depends on the mutual atomicity among 8-byte
> atomic operations and 4-byte atomic operations on each of the 4-byte halves
> of the 8-byte aligned 8-byte block.
> - The standard does not guarantee atomicity of 8-byte operations on a pair
> of 8-byte aligned 4-byte objects.
> - To my knowledge this works in practice on systems that guarantee native
> implementation of 4-byte and 8-byte atomic operations.
> - Can we limit applying the proposed algorithm to architectures that
> guarantee native implementation of atomic operations?
>
> ** Alignment **
> - _Sp_counted_base is an internal base class. Three internal classes are
> derived from it.
> - Two of these classes include a pointer as a first member. That is, the
> layout (in the relevant case) is: use count (4 bytes), weak count (4
> bytes), pointer (8 bytes). My understanding is that in these cases the two
> 4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
> - In the third case (_Sp_counted_ptr_inplace), only includes user data
> after the two counts, without necessarily including any 8-byte aligned
> members. For example, if the protected object is int32_t, then the
> _Sp_counted_ptr_inplace object consists of three 4-byte members (the two
> counts inherited from _Sp_counted_base and the user data). My understanding
> is that the standard does not guarantee 8-byte alignment in such a case.
> - Is 8-byte alignment guaranteed in practice in some widely-used
> environments?
> - Can 8-byte alignment be checked at build time in some widely-used
> environments?
>
> Other points:
> - The proposed algorithm can interact correctly with the current algorithm.
> That is, multiple threads using different versions of the code with and
> without the patch operating on the same objects should always interact
> correctly. The intent for the patch is to be ABI compatible with the
> current implementation.
> - The proposed patch involves a performance trade-off between saving the
> costs of atomic instructions when the counts are both 1 vs adding the cost
> of loading the 8-byte combined counts and comparison with 0x100000001.
> - I noticed a big difference between the code generated by GCC vs LLVM. GCC
> seems to generate noticeably more code and what seems to be redundant null
> checks and branches.
> - The patch has been in use (built using LLVM) in a large environment for
> many months. The performance gains outweigh the losses (roughly 10 to 1)
> across a large variety of workloads.
>
> I'd appreciate your feedback on the alignment and atomicity issues as well
> as any other comments.
>
> Thank you,
> Maged
--
Marc Glisse
* Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
@ 2020-12-07 12:24 Maged Michael
2020-12-07 15:28 ` Marc Glisse
2020-12-07 15:37 ` Jonathan Wakely
0 siblings, 2 replies; 11+ messages in thread
From: Maged Michael @ 2020-12-07 12:24 UTC (permalink / raw)
To: libstdc++
[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]
Hi all,
The attached patch includes a proposed alternative implementation of
_Sp_counted_base::_M_release(). I'd appreciate your feedback.
Benefits: Save the cost of the last atomic decrements of each of the use
count and the weak count in _Sp_counted_base. Atomic instructions are
significantly slower than regular loads and stores across
major architectures.
How current code works: _M_release() atomically decrements the use count,
checks if it was 1, if so calls _M_dispose(), atomically decrements the
weak count, checks if it was 1, and if so calls _M_destroy().
How the proposed algorithm works: _M_release() loads both use count and
weak count together atomically (assuming 8-byte alignment, discussed
later), checks if the value is equal to 0x100000001 (i.e., both counts are
equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
follows the original algorithm.
Why it works: When the current thread executing _M_release() finds each of
the counts is equal to 1, then no other threads could possibly hold use or
weak references to this control block. That is, no other threads could
possibly access the counts or the protected object.
There are two crucial high-level issues that I'd like to point out first:
- Atomicity of access to the counts together
- Proper alignment of the counts together
The patch is intended to apply the proposed algorithm only to the case of
64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
** Atomicity **
- The proposed algorithm depends on the mutual atomicity among 8-byte
atomic operations and 4-byte atomic operations on each of the 4-byte halves
of the 8-byte aligned 8-byte block.
- The standard does not guarantee atomicity of 8-byte operations on a pair
of 8-byte aligned 4-byte objects.
- To my knowledge this works in practice on systems that guarantee native
implementation of 4-byte and 8-byte atomic operations.
- Can we limit applying the proposed algorithm to architectures that
guarantee native implementation of atomic operations?
** Alignment **
- _Sp_counted_base is an internal base class. Three internal classes are
derived from it.
- Two of these classes include a pointer as a first member. That is, the
layout (in the relevant case) is: use count (4 bytes), weak count (4
bytes), pointer (8 bytes). My understanding is that in these cases the two
4-byte counts are guaranteed to occupy an 8-byte aligned 8 byte range.
- The third case (_Sp_counted_ptr_inplace) only includes user data
after the two counts, without necessarily including any 8-byte aligned
members. For example, if the protected object is int32_t, then the
_Sp_counted_ptr_inplace object consists of three 4-byte members (the two
counts inherited from _Sp_counted_base and the user data). My understanding
is that the standard does not guarantee 8-byte alignment in such a case.
- Is 8-byte alignment guaranteed in practice in some widely-used
environments?
- Can 8-byte alignment be checked at build time in some widely-used
environments?
Other points:
- The proposed algorithm can interact correctly with the current algorithm.
That is, multiple threads using different versions of the code with and
without the patch operating on the same objects should always interact
correctly. The intent for the patch is to be ABI compatible with the
current implementation.
- The proposed patch involves a performance trade-off between saving the
costs of atomic instructions when the counts are both 1 vs adding the cost
of loading the 8-byte combined counts and comparison with 0x100000001.
- I noticed a big difference between the code generated by GCC vs LLVM. GCC
seems to generate noticeably more code and what seems to be redundant null
checks and branches.
- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.
I'd appreciate your feedback on the alignment and atomicity issues as well
as any other comments.
Thank you,
Maged
[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 4269 bytes --]
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..c675b0f6916 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -122,80 +122,126 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
class _Sp_counted_base
: public _Mutex_base<_Lp>
{
public:
_Sp_counted_base() noexcept
: _M_use_count(1), _M_weak_count(1) { }
virtual
~_Sp_counted_base() noexcept
{ }
// Called when _M_use_count drops to zero, to release the resources
// managed by *this.
virtual void
_M_dispose() noexcept = 0;
// Called when _M_weak_count drops to zero.
virtual void
_M_destroy() noexcept
{ delete this; }
virtual void*
_M_get_deleter(const std::type_info&) noexcept = 0;
void
_M_add_ref_copy()
{ __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
void
_M_add_ref_lock()
{
if (!_M_add_ref_lock_nothrow())
__throw_bad_weak_ptr();
}
bool
_M_add_ref_lock_nothrow() noexcept;
void
_M_release() noexcept
+ {
+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
+ {
+ _M_release_orig();
+ return;
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if ((__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+ == 0x100000001))
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_dispose();
+ _M_destroy();
+ }
+ else
+ {
+ if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+ {
+ _M_release_last_use();
+ }
+ }
+ }
+
+ void
+ __attribute__ ((noinline))
+ _M_release_last_use() noexcept
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _M_dispose();
+ if (_Mutex_base<_Lp>::_S_need_barriers)
+ {
+ __atomic_thread_fence (__ATOMIC_ACQ_REL);
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+ -1) == 1)
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_destroy();
+ }
+ }
+
+ void
+ _M_release_orig() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
_M_dispose();
// There must be a memory barrier between dispose() and destroy()
// to ensure that the effects of dispose() are observed in the
// thread that runs destroy().
// See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
if (_Mutex_base<_Lp>::_S_need_barriers)
{
__atomic_thread_fence (__ATOMIC_ACQ_REL);
}
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
-1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
_M_destroy();
}
}
}
void
_M_weak_add_ref() noexcept
{ __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
void
_M_weak_release() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
if (_Mutex_base<_Lp>::_S_need_barriers)
Thread overview: 11+ messages
2020-12-07 12:30 Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release Maged Michael
2020-12-07 15:57 ` Jonathan Wakely
2020-12-07 16:23 ` Jonathan Wakely
-- strict thread matches above, loose matches on Subject: below --
2020-12-07 12:24 Maged Michael
2020-12-07 15:28 ` Marc Glisse
2020-12-07 15:41 ` Jonathan Wakely
2020-12-08 4:10 ` Maged Michael
2020-12-08 8:01 ` Marc Glisse
2020-12-08 9:56 ` Jonathan Wakely
2020-12-07 15:37 ` Jonathan Wakely
2020-12-08 4:06 ` Maged Michael