From: Maged Michael
Date: Fri, 16 Jul 2021 11:55:17 -0400
Subject: Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
To: Jonathan Wakely
Cc: "libstdc++", gcc-patches
Thank you, Jonathan, for the detailed comments! I'll update the patch accordingly.

On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely wrote:
> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
> >
> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
> > the two atomic instructions that decrement each of the use count and
> > the weak count when both are 1. I proposed the general idea in an
> > earlier thread
> > (https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
> > and got useful feedback on a draft patch and responses to related
> > questions about multi-granular atomicity and alignment. This patch is
> > based on that feedback.
> >
> > I added a check for thread sanitizer to use the current algorithm in
> > that case because TSAN does not support multi-granular atomicity. I'd
> > like to add a check of __has_feature(thread_sanitizer) for building
> > using LLVM. I found examples of __has_feature in libstdc++
>
> There are no uses of __has_feature in libstdc++. We do use
> __has_builtin (which GCC also supports) and Clang's __is_identifier
> (which GCC doesn't support) to work around some weird semantics of
> __has_builtin in older versions of Clang.
>
> > but it doesn't seem to be recognized in shared_ptr_base.h. Any
> > guidance on how to check __has_feature(thread_sanitizer) in this
> > patch?
> I think we want to do something like this in include/bits/c++config:
>
>   #if __SANITIZE_THREAD__
>   # define _GLIBCXX_TSAN 1
>   #elif defined __has_feature
>   # if __has_feature(thread_sanitizer)
>   #  define _GLIBCXX_TSAN 1
>   # endif
>   #endif
>
> Then in bits/shared_ptr_base.h:
>
>   #if _GLIBCXX_TSAN
>   _M_release_orig();
>   return;
>   #endif
>
> > GCC generates code for _M_release that is larger and more complex
> > than that generated by LLVM. I'd like to file a bug report about
> > that. Jonathan,
>
> Is this the same issue as https://gcc.gnu.org/PR101406 ?

Partly, yes. Even when using __atomic_add_dispatch I noticed that clang
generated less code than gcc. I see in the response to the issue that the
new glibc is expected to optimize better, so maybe this will eliminate
the issue.

> > would you please create a bugzilla account for me
> > (https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
> Done (sorry, I didn't notice the request in this mail until coming
> back to it to review the patch properly).

Thank you!

> > Information about the patch:
> >
> > - Benefits of the patch: Save the cost of the last atomic decrement of
> >   each of the use count and the weak count in _Sp_counted_base. Atomic
> >   instructions are significantly slower than regular loads and stores
> >   across major architectures.
> >
> > - How the current code works: _M_release() atomically decrements the
> >   use count, checks whether it was 1, and if so calls _M_dispose();
> >   it then atomically decrements the weak count, checks whether it was
> >   1, and if so calls _M_destroy().
> >
> > - How the proposed patch works: _M_release() loads both the use count
> >   and the weak count together atomically (when properly aligned) and
> >   checks whether the loaded value corresponds to both counts being 1
> >   (e.g., 0x100000001). If so, it calls _M_dispose() and _M_destroy();
> >   otherwise, it follows the original algorithm.
> > - Why it works: When the current thread executing _M_release() finds
> >   each of the counts equal to 1, then (when _Lock_policy is _S_atomic)
> >   no other thread could possibly hold a use or weak reference to this
> >   control block. That is, no other thread could possibly access the
> >   counts or the protected object.
> >
> > - The proposed patch is intended to interact correctly with current
> >   code (under certain conditions: _Lock_policy is _S_atomic, proper
> >   alignment, and native lock-free support for atomic operations). That
> >   is, multiple threads using different versions of the code, with and
> >   without the patch, operating on the same objects should always
> >   interact correctly. The intent is for the patch to be ABI-compatible
> >   with the current implementation.
> >
> > - The proposed patch involves a performance trade-off: saving the cost
> >   of two atomic instructions when both counts are 1, versus adding the
> >   cost of loading the combined counts and comparing them with the
> >   both-counts-equal-to-1 value (e.g., 0x100000001).
> >
> > - The patch has been in use (built using LLVM) in a large environment
> >   for many months. The performance gains outweigh the losses (roughly
> >   10 to 1) across a large variety of workloads.
> >
> > I'd appreciate feedback on the patch and any suggestions for checking
> > __has_feature(thread_sanitizer).
>
> N.B. gmail completely mangles patches unless you send them as attachments.
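[Editor's illustration] The fast path described above can be sketched as a
simplified, standalone model. This is not the libstdc++ code: the struct and
member names (CombinedCounts, release, disposed, destroyed) are hypothetical,
the two counts are modeled as the halves of a single std::atomic<uint64_t>
(sidestepping the type-punning and alignment concerns the real patch must
handle), and a 32-bit count width is assumed.

```cpp
#include <atomic>
#include <cstdint>

struct CombinedCounts {
    // Value seen when use count (low half) and weak count (high half)
    // are both 1, i.e. 0x100000001.
    static constexpr std::uint64_t kBothOne = 1ULL + (1ULL << 32);

    std::atomic<std::uint64_t> counts{kBothOne};
    bool disposed = false;   // stands in for _M_dispose()
    bool destroyed = false;  // stands in for _M_destroy()

    void release() {
        // Fast path: one atomic load instead of two atomic RMWs. If both
        // counts are 1, no other thread can hold any reference, so we can
        // dispose and destroy without further synchronization.
        if (counts.load(std::memory_order_acquire) == kBothOne) {
            counts.store(0, std::memory_order_relaxed);
            disposed = true;
            destroyed = true;
            return;
        }
        // Slow path: atomically decrement the use count (low half), as the
        // original algorithm does.
        std::uint64_t prev = counts.fetch_sub(1, std::memory_order_acq_rel);
        if ((prev & 0xffffffffULL) == 1) {
            disposed = true;
            // ... the original algorithm would then decrement the weak
            // count and possibly destroy the control block.
        }
    }
};
```

A release on a freshly constructed object (both counts 1) takes the fast
path; with a use count of 2, the first release only decrements, and the
second then qualifies for the fast path.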
> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
> > index 368b2d7379a..a8fc944af5f 100644
> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >          if (!_M_add_ref_lock_nothrow())
> >            __throw_bad_weak_ptr();
> >        }
> >
> >       bool
> >       _M_add_ref_lock_nothrow() noexcept;
> >
> >       void
> >       _M_release() noexcept
> >       {
> > +#if __SANITIZE_THREAD__
> > +       _M_release_orig();
> > +       return;
> > +#endif
> > +       if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>
> The line break should come before the logical operator, not after.
> This makes it easier to see which operator it is, because it's at a
> predictable position, not off on the right edge somewhere.
>
> i.e.
>
>   if (!__atomic_always_lock_free(sizeof(long long), 0)
>       || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>       || sizeof(long long) < (2 * sizeof(_Atomic_word))
>       || sizeof(long long) > (sizeof(void*)))
>
> But I think I'd like to see this condition expressed differently
> anyway, see below.
>
> > +           !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
> > +           sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>
> Shouldn't this be != instead of < ?
>
> On a big endian target where sizeof(long long) > sizeof(_Atomic_word),
> loading two _Atomic_word objects will fill the high bits of the long
> long, and so the (1LL + (1LL << (8 * sizeof(_Atomic_word)))) calculation
> won't match what got loaded into the long long.
>
> > +           sizeof(long long) > (sizeof(void*)))
>
> This is checking the alignment, right? I think we can do so more
> reliably, and should comment it.
>
> I think I'd like to see this condition defined as a number of
> constexpr booleans, with comments.
> Maybe:
>
>   constexpr bool __lock_free
>     = __atomic_always_lock_free(sizeof(long long), 0)
>       && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>   constexpr bool __double_word
>     = sizeof(long long) == 2 * sizeof(_Atomic_word);
>   // The ref-count members follow the vptr, so are aligned to alignof(void*).
>   constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
>   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>     {
>       _M_release_double_width_cas();
>       return;
>     }
>   else
>     {
>       // ... original body of _M_release();
>     }
>
> > +         {
> > +           _M_release_orig();
> > +           return;
> > +         }
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> > +       if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
> > +           == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>
> This should use __CHAR_BIT__ instead of 8.
>
> > +         {
> > +           // Both counts are 1, so there are no weak references and
> > +           // we are releasing the last strong reference. No other
> > +           // threads can observe the effects of this _M_release()
> > +           // call (e.g. calling use_count()) without a data race.
> > +           *(long long*)(&_M_use_count) = 0;
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> > +           _M_dispose();
> > +           _M_destroy();
> > +         }
> > +       else
> > +         {
> > +           if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
> > +             {
> > +               _M_release_last_use();
> > +             }
> > +         }
> > +     }
> > +
> > +     void
> > +     __attribute__ ((noinline))
> > +     _M_release_last_use() noexcept
> > +     {
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> > +       _M_dispose();
> > +       if (_Mutex_base<_Lp>::_S_need_barriers)
> > +         {
> > +           __atomic_thread_fence (__ATOMIC_ACQ_REL);
> > +         }
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> > +       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> > +                                                  -1) == 1)
> > +         {
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> > +           _M_destroy();
> > +         }
> > +     }
> > +
> > +     void
> > +     _M_release_orig() noexcept
> > +     {
> >        // Be race-detector-friendly. For more info see bits/c++config.
> >        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> >        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> >          {
> >            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >            _M_dispose();
> >            // There must be a memory barrier between dispose() and destroy()
> >            // to ensure that the effects of dispose() are observed in the
> >            // thread that runs destroy().
> >            // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >      _Sp_counted_base<_S_single>::_M_release() noexcept
> >      {
> >        if (--_M_use_count == 0)
> >          {
> >            _M_dispose();
> >            if (--_M_weak_count == 0)
> >              _M_destroy();
> >          }
> >      }
> >
> > +    template<>
> > +    inline void
> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
> > +    {
> > +      _M_release_orig();
> > +    }
> > +
> >      template<>
> >      inline void
> >      _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
> >      { ++_M_weak_count; }
> >
> >      template<>
> >      inline void
> >      _Sp_counted_base<_S_single>::_M_weak_release() noexcept
> >      {
> >        if (--_M_weak_count == 0)
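[Editor's illustration] The reviewer's suggested constexpr gating can be tried
standalone. This sketch assumes C++17 and a GCC/Clang compiler (for the
__atomic_always_lock_free builtin); AtomicWord = int is an assumption matching
common glibc targets, and the names (AtomicWord, release_path,
combined_matches) are hypothetical, not libstdc++ identifiers. It also uses
CHAR_BIT rather than a hard-coded 8, per the review comment.

```cpp
#include <climits>
#include <cstring>

// Hypothetical stand-in for libstdc++'s _Atomic_word (an int on most
// glibc targets).
using AtomicWord = int;

// The three gating conditions from the review, as constexpr booleans.
constexpr bool kLockFree =
    __atomic_always_lock_free(sizeof(long long), 0)
    && __atomic_always_lock_free(sizeof(AtomicWord), 0);
constexpr bool kDoubleWord =
    sizeof(long long) == 2 * sizeof(AtomicWord);
// In the real control block the ref-count members follow the vptr, so
// they are aligned to alignof(void*).
constexpr bool kAligned = alignof(long long) <= alignof(void*);

// The value a combined load sees when both counts equal 1.
constexpr long long kBothCountsOne =
    1LL + (1LL << (CHAR_BIT * sizeof(AtomicWord)));

// Which path a release would take on this target: "fast" when all three
// conditions hold, "orig" otherwise.
const char* release_path() {
    if constexpr (kLockFree && kDoubleWord && kAligned)
        return "fast";
    else
        return "orig";
}

// When the sizes match exactly, two adjacent words both equal to 1
// combine to kBothCountsOne on either endianness (the bit pattern is
// symmetric); the mismatch the reviewer describes only arises when
// sizeof(long long) exceeds 2 * sizeof(AtomicWord).
bool combined_matches() {
    AtomicWord counts[2] = {1, 1};
    long long combined = 0;
    static_assert(sizeof(counts) == sizeof(combined), "sizes must match");
    std::memcpy(&combined, counts, sizeof(combined));
    return combined == kBothCountsOne;
}
```

On a typical 64-bit target all three booleans are true and release_path()
reports the fast path; the combined value check holds regardless of byte
order.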