From: Marc Glisse <marc.glisse@inria.fr>
To: Maged Michael <maged.michael@gmail.com>
Cc: libstdc++@gcc.gnu.org
Subject: Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
Date: Mon, 7 Dec 2020 16:28:55 +0100 (CET)
Message-ID: <cad0c97e-4c8d-3849-7ea7-41461cba4aa0@stedding.saclay.inria.fr>
In-Reply-To: <CABBFi0-HanBFtYt4J81bpEojGXXi6AQtAak6b7vU6nCKwAg+4g@mail.gmail.com>

Thanks for doing this!

As someone who recently wrote

   // Fast path: if a relaxed load shows count == 1, this thread is the
   // sole owner and the expensive atomic RMW can be skipped entirely.
   if (count.load(std::memory_order_relaxed) == 1
       || count.fetch_sub(1, std::memory_order_release) == 1) {
     std::atomic_thread_fence(std::memory_order_acquire);
     // destroy here
   }

for a much simpler reference-counted type without weak references, I am
happy to see that this can generalize to the more complicated case of
shared_ptr.

Yes, hardening it may be a bit hard. Some checks with alignof/offsetof
could make sense, but they are complicated by the fact that things
happening in the base class depend on the derived class. Some allocators
guarantee a minimum alignment, likely sufficient for your case, but again
I am not sure exactly how to test for that.
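
For the default allocator, one small piece of that could be checking the
minimum alignment operator new guarantees, e.g. (C++17)

   static_assert(__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8,
                 "control block might not be 8-byte aligned");

though that of course says nothing about user-provided allocators.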

ISTR that libstdc++ uses a surprisingly strong memory barrier when 
incrementing references (acq_rel, where I would expect acquire to suffice, 
although I haven't thought about weak references that much). Did you also 
experiment with that?
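
(Concretely, I would expect something like

   __atomic_fetch_add(&_M_use_count, 1, __ATOMIC_ACQUIRE);

to suffice for an increment, instead of the __ATOMIC_ACQ_REL that
__atomic_add_dispatch ends up using, if I remember the code correctly.)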

Don't hesitate to file bug reports about missed optimizations in the
compiler itself, for the redundant checks and branches you mention
compared to LLVM.

On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:

> Hi all,
>
> The attached patch includes a proposed alternative implementation of
> _Sp_counted_base::_M_release(). I'd appreciate your feedback.
>
> Benefits: Save the cost of the last atomic decrement of each of the use
> count and the weak count in _Sp_counted_base. Atomic instructions are
> significantly slower than regular loads and stores across
> major architectures.
>
> How the current code works: _M_release() atomically decrements the use
> count and checks whether it was 1; if so, it calls _M_dispose(), atomically
> decrements the weak count, checks whether that was 1, and if so calls
> _M_destroy().
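>
> In rough pseudo-C++ (a simplified sketch, not the exact libstdc++ source):
>
>    void _M_release()
>    {
>      // Dropping the last use reference destroys the managed object...
>      if (__atomic_fetch_sub(&_M_use_count, 1, __ATOMIC_ACQ_REL) == 1)
>        {
>          _M_dispose();
>          // ...then the implicit weak reference is dropped; the last
>          // weak reference frees the control block itself.
>          if (__atomic_fetch_sub(&_M_weak_count, 1, __ATOMIC_ACQ_REL) == 1)
>            _M_destroy();
>        }
>    }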
>
> How the proposed algorithm works: _M_release() loads both use count and
> weak count together atomically (assuming 8-byte alignment, discussed
> later), checks if the value is equal to 0x100000001 (i.e., both counts are
> equal to 1), and if so calls _M_dispose() and _M_destroy(). Otherwise, it
> follows the original algorithm.
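>
> The fast path then looks roughly like this (a simplified sketch; the
> fallback helper name is made up, and the actual patch differs in detail):
>
>    void _M_release()
>    {
>      uint64_t __both;
>      // Assumes the two 4-byte counts form one 8-byte aligned 8-byte
>      // block, so they can be read together with a single atomic load
>      // (see the alignment discussion below).
>      __atomic_load(reinterpret_cast<uint64_t*>(&_M_use_count),
>                    &__both, __ATOMIC_ACQUIRE);
>      if (__both == 0x100000001ULL) // use count == 1 && weak count == 1
>        {
>          _M_dispose();
>          _M_destroy();
>        }
>      else
>        _M_release_fallback(); // the original decrement-based algorithm
>    }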
>
> Why it works: When the current thread executing _M_release() finds each of
> the counts is equal to 1, then no other threads could possibly hold use or
> weak references to this control block. That is, no other threads could
> possibly access the counts or the protected object.
>
> There are two crucial high-level issues that I'd like to point out first:
> - Atomicity of access to the counts together
> - Proper alignment of the counts together
>
> The patch is intended to apply the proposed algorithm only to the case of
> 64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
>
> ** Atomicity **
> - The proposed algorithm depends on 8-byte atomic operations being
> mutually atomic with 4-byte atomic operations on each of the 4-byte halves
> of the 8-byte aligned 8-byte block.
> - The standard does not guarantee atomicity of 8-byte operations on an
> 8-byte aligned pair of 4-byte objects.
> - To my knowledge this works in practice on systems that guarantee native
> implementation of 4-byte and 8-byte atomic operations.
> - Can we limit applying the proposed algorithm to architectures that
> guarantee native implementation of atomic operations?
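>
> One conceivable build-time gate (sketch only; the guard macro name is
> made up) would use the compiler's predefined lock-free macros:
>
>    // Enable the combined-count fast path only on LP64 targets where
>    // both 4-byte and 8-byte atomics are always lock-free, i.e. native.
>    #if defined(__LP64__) && __GCC_ATOMIC_INT_LOCK_FREE == 2 \
>        && __GCC_ATOMIC_LLONG_LOCK_FREE == 2
>    # define _GLIBCXX_SP_COMBINED_RELEASE 1
>    #endif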
>
> ** Alignment **
> - _Sp_counted_base is an internal base class. Three internal classes are
> derived from it.
> - Two of these classes include a pointer as their first member. That is,
> the layout (in the relevant case) is: use count (4 bytes), weak count (4
> bytes), pointer (8 bytes). My understanding is that in these cases the two
> 4-byte counts are guaranteed to occupy an 8-byte aligned 8-byte range.
> - The third class (_Sp_counted_ptr_inplace) only includes user data after
> the two counts, without necessarily including any 8-byte aligned members.
> For example, if the protected object is int32_t, then the
> _Sp_counted_ptr_inplace object consists of three 4-byte members (the two
> counts inherited from _Sp_counted_base and the user data). My understanding
> is that the standard does not guarantee 8-byte alignment in such a case.
> - Is 8-byte alignment guaranteed in practice in some widely-used
> environments?
> - Can 8-byte alignment be checked at build time in some widely-used
> environments?
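>
> For what it's worth, a conceivable build-time check (a sketch; it relies
> on the vtable pointer forcing pointer alignment on typical ABIs):
>
>    // _Sp_counted_base is polymorphic, so on LP64 the vptr gives the
>    // object 8-byte alignment, and the two 4-byte counts that follow it
>    // share one aligned 8-byte slot.
>    static_assert(alignof(_Sp_counted_base<_S_atomic>) >= 8,
>                  "combined 8-byte access to the counts needs alignment");
>    static_assert(sizeof(_Atomic_word) == 4,
>                  "the fast path assumes 4-byte counts");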
>
> Other points:
> - The proposed algorithm can interact correctly with the current algorithm.
> That is, multiple threads using different versions of the code with and
> without the patch operating on the same objects should always interact
> correctly. The intent for the patch is to be ABI compatible with the
> current implementation.
> - The proposed patch involves a performance trade-off: saving the cost of
> the atomic instructions when both counts are 1, versus adding the cost of
> loading the 8-byte combined counts and comparing the result with
> 0x100000001.
> - I noticed a big difference between the code generated by GCC and by
> LLVM. GCC seems to generate noticeably more code, including what appear to
> be redundant null checks and branches.
> - The patch has been in use (built using LLVM) in a large environment for
> many months. The performance gains outweigh the losses (roughly 10 to 1)
> across a large variety of workloads.
>
> I'd appreciate your feedback on the alignment and atomicity issues as well
> as any other comments.
>
> Thank you,
> Maged

-- 
Marc Glisse

