From: Maged Michael <maged.michael@gmail.com>
To: libstdc++@gcc.gnu.org
Subject: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
Date: Mon, 7 Dec 2020 07:24:06 -0500
Message-ID: <CABBFi0-HanBFtYt4J81bpEojGXXi6AQtAak6b7vU6nCKwAg+4g@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 4037 bytes --]
Hi all,
The attached patch includes a proposed alternative implementation of
_Sp_counted_base::_M_release(). I'd appreciate your feedback.
Benefits: The patch saves the cost of the last atomic decrement of each of the
use count and the weak count in _Sp_counted_base. Atomic instructions are
significantly slower than regular loads and stores across
major architectures.
How the current code works: _M_release() atomically decrements the use count
and checks whether it was 1; if so, it calls _M_dispose(), atomically
decrements the weak count, and checks whether it was 1; if so, it calls
_M_destroy().
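For illustration, a minimal self-contained sketch of that control flow (not
the actual library code; dispose/destroy stand in for _M_dispose()/_M_destroy(),
and plain __atomic builtins stand in for the library's dispatch helpers):

struct counted_base_sketch
{
  int use_count  = 1;   // _Atomic_word in the real code
  int weak_count = 1;

  void dispose() noexcept { /* destroy the managed object */ }
  void destroy() noexcept { /* delete the control block   */ }

  void release_current() noexcept
  {
    // Atomic decrement of the use count; fetch_sub returns the old value.
    if (__atomic_fetch_sub(&use_count, 1, __ATOMIC_ACQ_REL) == 1)
      {
        dispose();
        // Last use reference gone; drop the weak count held by the owners.
        if (__atomic_fetch_sub(&weak_count, 1, __ATOMIC_ACQ_REL) == 1)
          destroy();
      }
  }
};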
How the proposed algorithm works: _M_release() loads both use count and
weak count together atomically (assuming 8-byte alignment, discussed
later), checks if the value is equal to 0x100000001 (i.e., both counts are
equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
follows the original algorithm.
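Continuing the sketch above, the proposed fast path looks roughly as follows
(assuming the two 4-byte counts occupy one naturally aligned 8-byte block;
when both counts are 1, the combined 8-byte value is 0x100000001 regardless of
which half holds which count):

// Added to counted_base_sketch above.
void release_proposed() noexcept
{
  // One 8-byte acquire load observes both counts at once (the cast mirrors
  // the one in the attached patch).
  long long both = __atomic_load_n(
      reinterpret_cast<long long*>(&use_count), __ATOMIC_ACQUIRE);
  if (both == 0x100000001LL)
    {
      // This thread holds the last use and the last weak reference, so no
      // other thread can access the counts or the managed object; both
      // atomic decrements can be skipped.
      dispose();
      destroy();
    }
  else
    release_current();   // fall back to the decrement-and-check path
}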
Why it works: When the thread executing _M_release() finds that both counts
are equal to 1, no other thread can possibly hold a use or weak reference to
this control block. That is, no other thread can possibly access the counts or
the protected object.
There are two crucial high-level issues that I'd like to point out first:
- Atomicity of access to the counts together
- Proper alignment of the counts together
The patch is intended to apply the proposed algorithm only to the case of
64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
** Atomicity **
- The proposed algorithm depends on mutual atomicity between 8-byte atomic
operations on the 8-byte-aligned block and 4-byte atomic operations on each of
its two 4-byte halves.
- The standard does not guarantee the atomicity of an 8-byte operation on a
pair of adjacent 4-byte objects, even when the pair is 8-byte aligned.
- To my knowledge this works in practice on systems that guarantee native
implementation of 4-byte and 8-byte atomic operations.
- Can we limit applying the proposed algorithm to architectures that
guarantee native implementations of atomic operations? (A possible
compile-time guard is sketched right after this list.)
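As a rough sketch of what such a limit could look like at compile time (my
assumption, not part of the attached patch):

#include <ext/atomicity.h>   // libstdc++-internal _Atomic_word

// Hypothetical guard: only consider the combined-load path when the counts
// are 4 bytes, long long is 8 bytes, and the compiler reports always-lock-free
// (native) atomics at both widths. Lock-freedom at both sizes is necessary but
// does not by itself prove the mutual atomicity between the 8-byte load and
// the 4-byte read-modify-writes.
constexpr bool maybe_use_combined_release =
    sizeof(_Atomic_word) == 4
    && sizeof(long long) == 8
    && __atomic_always_lock_free(sizeof(_Atomic_word), 0)
    && __atomic_always_lock_free(sizeof(long long), 0);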
** Alignment **
- _Sp_counted_base is an internal base class. Three internal classes are
derived from it.
- Two of these classes include a pointer as their first member. That is, the
layout (in the relevant case) is: use count (4 bytes), weak count (4
bytes), pointer (8 bytes). My understanding is that in these cases the two
4-byte counts are guaranteed to occupy an 8-byte-aligned 8-byte range.
- The third class (_Sp_counted_ptr_inplace) includes only user data after the
two counts, without necessarily including any 8-byte aligned members. For
example, if the protected object is an int32_t, the _Sp_counted_ptr_inplace
object consists of three 4-byte members (the two counts inherited from
_Sp_counted_base and the user data). My understanding is that the standard
does not guarantee 8-byte alignment in such a case.
- Is 8-byte alignment guaranteed in practice in some widely-used
environments?
- Can 8-byte alignment be checked at build time in some widely-used
environments? (One possible shape for such checks is sketched right after
this list.)
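On the build-time question, one possible shape for such checks, using a
hypothetical placeholder type rather than the real internal classes, and
assuming a 64-bit target where the vtable pointer of the (virtual) control
block precedes the counts:

#include <cstdint>

// Hypothetical placeholder, not the real layout: vtable pointer first, then
// the two counts, then 4-byte user data, as in make_shared<int32_t>().
struct control_block_placeholder
{
  void*        vptr;
  int          use_count;
  int          weak_count;
  std::int32_t user_data;
};

// Build-time check of the mechanism: the object is at least 8-byte aligned,
// so the counts at offset 8 form an 8-byte aligned 8-byte block.
static_assert(alignof(control_block_placeholder) >= 8,
              "counts not guaranteed to be 8-byte aligned");

// Runtime check a debug build could apply before using the combined load.
inline bool
counts_are_8byte_aligned(const void* first_count) noexcept
{ return (reinterpret_cast<std::uintptr_t>(first_count) & 7) == 0; }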
Other points:
- The proposed algorithm can interact correctly with the current algorithm.
That is, multiple threads using different versions of the code with and
without the patch operating on the same objects should always interact
correctly. The intent for the patch is to be ABI compatible with the
current implementation.
- The proposed patch involves a performance trade-off: it saves the cost of
the atomic instructions when both counts are 1, but adds the cost of loading
the 8-byte combined counts and comparing them with 0x100000001. (A rough
micro-benchmark shape is sketched after this list.)
- I noticed a big difference between the code generated by GCC and by LLVM.
GCC seems to generate noticeably more code, including what appear to be
redundant null checks and branches.
- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.
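As a rough way to observe the common case of that trade-off, something like
the following micro-benchmark sketch could be used (not the source of the
numbers above; results vary by architecture, and a real benchmark would also
have to prevent the compiler from eliding the allocation):

#include <chrono>
#include <cstdio>
#include <memory>

int main()
{
  constexpr int iterations = 10'000'000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i)
    {
      // use count = 1 and weak count = 1 (no weak_ptr), so destruction takes
      // the path the patch changes: one 8-byte load and compare instead of
      // two atomic decrements.
      auto p = std::make_shared<int>(i);
    }
  auto stop = std::chrono::steady_clock::now();
  std::printf("%.1f ns per create/destroy\n",
              std::chrono::duration<double, std::nano>(stop - start).count()
                  / iterations);
}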
I'd appreciate your feedback on the alignment and atomicity issues as well
as any other comments.
Thank you,
Maged
[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 4269 bytes --]
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..c675b0f6916 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -122,80 +122,126 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
class _Sp_counted_base
: public _Mutex_base<_Lp>
{
public:
_Sp_counted_base() noexcept
: _M_use_count(1), _M_weak_count(1) { }
virtual
~_Sp_counted_base() noexcept
{ }
// Called when _M_use_count drops to zero, to release the resources
// managed by *this.
virtual void
_M_dispose() noexcept = 0;
// Called when _M_weak_count drops to zero.
virtual void
_M_destroy() noexcept
{ delete this; }
virtual void*
_M_get_deleter(const std::type_info&) noexcept = 0;
void
_M_add_ref_copy()
{ __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
void
_M_add_ref_lock()
{
if (!_M_add_ref_lock_nothrow())
__throw_bad_weak_ptr();
}
bool
_M_add_ref_lock_nothrow() noexcept;
void
_M_release() noexcept
+ {
+ if (!__LP64__ || sizeof(_Atomic_word) != 4 || sizeof(long long) != 8)
+ {
+ _M_release_orig();
+ return;
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if ((__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+ == 0x100000001))
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_dispose();
+ _M_destroy();
+ }
+ else
+ {
+ if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+ {
+ _M_release_last_use();
+ }
+ }
+ }
+
+ void
+ __attribute__ ((noinline))
+ _M_release_last_use() noexcept
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+ _M_dispose();
+ if (_Mutex_base<_Lp>::_S_need_barriers)
+ {
+ __atomic_thread_fence (__ATOMIC_ACQ_REL);
+ }
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+ if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+ -1) == 1)
+ {
+ _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+ _M_destroy();
+ }
+ }
+
+ void
+ _M_release_orig() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
_M_dispose();
// There must be a memory barrier between dispose() and destroy()
// to ensure that the effects of dispose() are observed in the
// thread that runs destroy().
// See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
if (_Mutex_base<_Lp>::_S_need_barriers)
{
__atomic_thread_fence (__ATOMIC_ACQ_REL);
}
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
-1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
_M_destroy();
}
}
}
void
_M_weak_add_ref() noexcept
{ __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
void
_M_weak_release() noexcept
{
// Be race-detector-friendly. For more info see bits/c++config.
_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
{
_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
if (_Mutex_base<_Lp>::_S_need_barriers)