From: Maged Michael <maged.michael@gmail.com>
To: libstdc++@gcc.gnu.org
Subject: Re: Proposed patch to skip the last atomic decrements in _Sp_counted_base::_M_release
Date: Mon, 7 Dec 2020 23:10:16 -0500
Message-ID: <CABBFi09ovWwPpUnC2V_vU5iVMLR0vQd77EgrO5XTFPOofmWzdQ@mail.gmail.com>
In-Reply-To: <cad0c97e-4c8d-3849-7ea7-41461cba4aa0@stedding.saclay.inria.fr>

Marc,

Thank you for your comments.

Any pointers on where to file bug reports about the generated code?

Thanks,
Maged

On Mon, Dec 7, 2020 at 10:29 AM Marc Glisse <marc.glisse@inria.fr> wrote:

> Thanks for doing this!
>
> As someone who recently wrote
>
>    if (count.load(std::memory_order_relaxed) == 1
>        || count.fetch_sub(1, std::memory_order_release) == 1) {
>      std::atomic_thread_fence(std::memory_order_acquire);
>      // destroy here
>    }
>
> for a much simpler reference-counted type without weak references, I am
> happy to see that this can generalize to the more complicated case of
> shared_ptr.
>
> Yes, hardening it may be a bit hard. Some checks with alignof/offsetof
> could make sense, but it is complicated by the fact that things are
> happening in the base class that depend on the derived class. Some
> allocators guarantee a minimum alignment, likely sufficient for your case,
> but again I am not sure how to test for that exactly.
>
> ISTR that libstdc++ uses a surprisingly strong memory barrier when
> incrementing references (acq_rel, where I would expect acquire to suffice,
> although I haven't thought about weak references that much). Did you also
> experiment with that?
>
>
My experiments were on x86, where acq_rel doesn't add extra memory fence
instructions for atomic increment/decrement operations.
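
For example (an illustrative snippet; on x86-64 both orderings compile
to the same lock-prefixed instruction, with no separate fence):

  #include <atomic>

  std::atomic<int> c{0};

  // Both of these emit: lock addl $1, c(%rip)
  // The lock prefix already provides full ordering on x86, so acq_rel
  // costs nothing extra for a read-modify-write.
  void inc_relaxed() { c.fetch_add(1, std::memory_order_relaxed); }
  void inc_acq_rel() { c.fetch_add(1, std::memory_order_acq_rel); }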


> Don't hesitate to file bug reports about missed optimizations in the
> compiler itself, for the redundant checks and branches you mention
> compared to llvm.
>
> On Mon, 7 Dec 2020, Maged Michael via Libstdc++ wrote:
>
> > Hi all,
> >
> > The attached patch includes a proposed alternative implementation of
> > _Sp_counted_base::_M_release(). I'd appreciate your feedback.
> >
> > Benefits: Saves the cost of the last atomic decrement of each of the
> > use count and the weak count in _Sp_counted_base. Atomic instructions
> > are significantly slower than regular loads and stores across major
> > architectures.
> >
> > How the current code works: _M_release() atomically decrements the
> > use count and checks whether it was 1; if so, it calls _M_dispose(),
> > atomically decrements the weak count, checks whether that was 1, and
> > if so calls _M_destroy().
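> >
> > In outline (a simplified sketch using std::atomic members; the actual
> > code uses the library's internal __exchange_and_add_dispatch helper):
> >
> >   void _M_release() noexcept
> >   {
> >     if (_M_use_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
> >       {
> >         _M_dispose();  // destroy the managed object
> >         if (_M_weak_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
> >           _M_destroy();  // destroy the control block itself
> >       }
> >   }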
> >
> > How the proposed algorithm works: _M_release() loads both the use
> > count and the weak count together atomically (assuming 8-byte
> > alignment, discussed later), checks whether the value equals
> > 0x100000001 (i.e., both counts are equal to 1), and if so calls
> > _M_dispose() and _M_destroy(). Otherwise, it follows the original
> > algorithm.
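> >
> > Roughly (a sketch; the cast and the constant assume the layout and
> > atomicity properties discussed below, and _M_release_orig is a
> > hypothetical name for the original path):
> >
> >   void _M_release() noexcept
> >   {
> >     constexpr long long __both_counts_one = 0x100000001LL;
> >     // One 8-byte atomic load observes both 4-byte counts at once.
> >     if (__atomic_load_n(reinterpret_cast<long long*>(&_M_use_count),
> >                         __ATOMIC_ACQUIRE) == __both_counts_one)
> >       {
> >         // Both counts are 1: no other thread holds any reference,
> >         // so the two atomic decrements can be skipped entirely.
> >         _M_dispose();
> >         _M_destroy();
> >       }
> >     else
> >       _M_release_orig();  // hypothetical helper: the original algorithm
> >   }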
> >
> > Why it works: When the current thread executing _M_release() finds
> > that each of the counts is equal to 1, no other thread can possibly
> > hold a use or weak reference to this control block. That is, no other
> > thread can possibly access the counts or the protected object.
> >
> > There are two crucial high-level issues that I'd like to point out first:
> > - Atomicity of access to the counts together
> > - Proper alignment of the counts together
> >
> > The patch is intended to apply the proposed algorithm only to the case of
> > 64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
> >
> > ** Atomicity **
> > - The proposed algorithm depends on mutual atomicity between 8-byte
> > atomic operations on the 8-byte aligned block and 4-byte atomic
> > operations on each of its 4-byte halves.
> > - The standard does not guarantee atomicity of 8-byte operations on an
> > 8-byte aligned pair of 4-byte objects.
> > - To my knowledge this works in practice on systems that guarantee
> > native implementation of 4-byte and 8-byte atomic operations.
> > - Can we limit applying the proposed algorithm to architectures that
> > guarantee native implementation of atomic operations? (One possible
> > build-time guard is sketched below.)
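> >
> > For example (a sketch; the macro name is illustrative, not part of
> > the patch):
> >
> >   // Use the combined-load path only where both 4-byte and 8-byte
> >   // atomics are always lock-free, i.e. natively supported.
> >   #if defined(__LP64__) && ATOMIC_INT_LOCK_FREE == 2 \
> >       && ATOMIC_LLONG_LOCK_FREE == 2
> >   # define _SP_USE_COMBINED_RELEASE 1
> >   #endif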
> >
> > ** Alignment **
> > - _Sp_counted_base is an internal base class. Three internal classes
> > are derived from it.
> > - Two of these classes include a pointer as their first member. That
> > is, the layout (in the relevant case) is: use count (4 bytes), weak
> > count (4 bytes), pointer (8 bytes). My understanding is that in these
> > cases the two 4-byte counts are guaranteed to occupy an 8-byte aligned
> > 8-byte range.
> > - The third case (_Sp_counted_ptr_inplace) only includes user data
> > after the two counts, without necessarily including any 8-byte aligned
> > members. For example, if the protected object is int32_t, then the
> > _Sp_counted_ptr_inplace object consists of three 4-byte members (the
> > two counts inherited from _Sp_counted_base and the user data). My
> > understanding is that the standard does not guarantee 8-byte alignment
> > in such a case.
> > - Is 8-byte alignment guaranteed in practice in some widely-used
> > environments?
> > - Can 8-byte alignment be checked at build time in some widely-used
> > environments? (A best-effort check is sketched below.)
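> >
> > For example (a sketch of best-effort compile-time checks; they verify
> > the type's declared alignment, not the alignment a particular
> > allocator actually provides):
> >
> >   static_assert(sizeof(_Atomic_word) == 4,
> >                 "combined load assumes 4-byte counts");
> >   static_assert(alignof(_Sp_counted_base) >= 8,
> >                 "counts must lie in an 8-byte aligned block");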
> >
> > Other points:
> > - The proposed algorithm can interact correctly with the current
> > algorithm. That is, multiple threads using different versions of the
> > code, with and without the patch, operating on the same objects will
> > always interact correctly. The intent is for the patch to be ABI
> > compatible with the current implementation.
> > - The proposed patch involves a performance trade-off: saving the
> > cost of the atomic decrements when both counts are 1 versus adding
> > the cost of loading the 8-byte combined counts and comparing with
> > 0x100000001.
> > - I noticed a big difference between the code generated by GCC vs.
> > LLVM. GCC seems to generate noticeably more code, including what seem
> > to be redundant null checks and branches.
> > - The patch has been in use (built using LLVM) in a large environment
> > for many months. The performance gains outweigh the losses (roughly
> > 10 to 1) across a large variety of workloads.
> >
> > I'd appreciate your feedback on the alignment and atomicity issues,
> > as well as any other comments.
> >
> > Thank you,
> > Maged
>
> --
> Marc Glisse
>

Thread overview: 11+ messages
2020-12-07 12:24 Maged Michael
2020-12-07 15:28 ` Marc Glisse
2020-12-07 15:41   ` Jonathan Wakely
2020-12-08  4:10   ` Maged Michael [this message]
2020-12-08  8:01     ` Marc Glisse
2020-12-08  9:56       ` Jonathan Wakely
2020-12-07 15:37 ` Jonathan Wakely
2020-12-08  4:06   ` Maged Michael
2020-12-07 12:30 Maged Michael
2020-12-07 15:57 ` Jonathan Wakely
2020-12-07 16:23   ` Jonathan Wakely
