From: Maged Michael
Date: Fri, 16 Jul 2021 11:55:17 -0400
Subject: Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
To: Jonathan Wakely
Cc: "libstdc++", gcc-patches
Thank you, Jonathan, for the detailed comments! I'll update the patch accordingly.

On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely wrote:
> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
> >
> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
> > the two atomic instructions that decrement each of the use count and
> > the weak count when both are 1. I proposed the general idea in an
> > earlier thread
> > (https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
> > and got useful feedback on a draft patch and responses to related
> > questions about multi-granular atomicity and alignment. This patch is
> > based on that feedback.
> >
> > I added a check for thread sanitizer to use the current algorithm in
> > that case because TSAN does not support multi-granular atomicity. I'd
> > like to add a check of __has_feature(thread_sanitizer) for building
> > using LLVM. I found examples of __has_feature in libstdc++
>
> There are no uses of __has_feature in libstdc++. We do use
> __has_builtin (which GCC also supports) and Clang's __is_identifier
> (which GCC doesn't support) to work around some weird semantics of
> __has_builtin in older versions of Clang.
>
> > but it doesn't seem to be recognized in shared_ptr_base.h. Any
> > guidance on how to check __has_feature(thread_sanitizer) in this
> > patch?
> I think we want to do something like this in include/bits/c++config:
>
>   #if __SANITIZE_THREAD__
>   # define _GLIBCXX_TSAN 1
>   #elif defined __has_feature
>   # if __has_feature(thread_sanitizer)
>   #  define _GLIBCXX_TSAN 1
>   # endif
>   #endif
>
> Then in bits/shared_ptr_base.h:
>
>   #if _GLIBCXX_TSAN
>   _M_release_orig();
>   return;
>   #endif
>
> > GCC generates code for _M_release that is larger and more complex
> > than that generated by LLVM. I'd like to file a bug report about
> > that. Jonathan,
>
> Is this the same issue as https://gcc.gnu.org/PR101406 ?

Partly, yes. Even when using __atomic_add_dispatch I noticed that clang
generated less code than gcc. I see in the response to the issue that the
new glibc is expected to optimize better, so maybe this will eliminate
the issue.

> > would you please create a bugzilla account for me
> > (https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
> Done (sorry, I didn't notice the request in this mail until coming
> back to it to review the patch properly).

Thank you!

> > Information about the patch:
> >
> > - Benefits of the patch: Save the cost of the last atomic decrement of
> >   each of the use count and the weak count in _Sp_counted_base. Atomic
> >   instructions are significantly slower than regular loads and stores
> >   across major architectures.
> >
> > - How the current code works: _M_release() atomically decrements the
> >   use count, checks whether it was 1, and if so calls _M_dispose();
> >   it then atomically decrements the weak count, checks whether it was
> >   1, and if so calls _M_destroy().
> >
> > - How the proposed patch works: _M_release() loads both the use count
> >   and the weak count together atomically (when properly aligned) and
> >   checks whether the loaded value corresponds to both counts being 1
> >   (e.g., 0x100000001). If so, it calls _M_dispose() and _M_destroy();
> >   otherwise, it follows the original algorithm.
> > - Why it works: When the current thread executing _M_release() finds
> >   each of the counts equal to 1, then (when _Lock_policy is _S_atomic)
> >   no other thread could possibly hold a use or weak reference to this
> >   control block. That is, no other thread could possibly access the
> >   counts or the protected object.
> >
> > - The proposed patch is intended to interact correctly with current
> >   code (under certain conditions: _Lock_policy is _S_atomic, proper
> >   alignment, and native lock-free support for atomic operations). That
> >   is, multiple threads using different versions of the code, with and
> >   without the patch, operating on the same objects should always
> >   interact correctly. The intent is for the patch to be ABI-compatible
> >   with the current implementation.
> >
> > - The proposed patch involves a performance trade-off: saving the cost
> >   of two atomic instructions when both counts are 1, versus adding the
> >   cost of loading the combined counts and comparing them with the
> >   both-counts-equal-to-1 value (e.g., 0x100000001).
> >
> > - The patch has been in use (built using LLVM) in a large environment
> >   for many months. The performance gains outweigh the losses (roughly
> >   10 to 1) across a large variety of workloads.
> >
> > I'd appreciate feedback on the patch and any suggestions for checking
> > __has_feature(thread_sanitizer).
>
> N.B. gmail completely mangles patches unless you send them as attachments.
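[Editor's illustration] The fast path described above can be sketched as a
simplified, standalone model. This is not the libstdc++ code: the struct and
member names (CombinedCounts, release, disposed, destroyed) are hypothetical,
the two counts are modeled as the halves of a single std::atomic<uint64_t>
(sidestepping the type-punning and alignment concerns the real patch must
handle), and a 32-bit count width is assumed.

```cpp
#include <atomic>
#include <cstdint>

struct CombinedCounts {
    // Value seen when use count (low half) and weak count (high half)
    // are both 1, i.e. 0x100000001.
    static constexpr std::uint64_t kBothOne = 1ULL + (1ULL << 32);

    std::atomic<std::uint64_t> counts{kBothOne};
    bool disposed = false;   // stands in for _M_dispose()
    bool destroyed = false;  // stands in for _M_destroy()

    void release() {
        // Fast path: one atomic load instead of two atomic RMWs. If both
        // counts are 1, no other thread can hold any reference, so we can
        // dispose and destroy without further synchronization.
        if (counts.load(std::memory_order_acquire) == kBothOne) {
            counts.store(0, std::memory_order_relaxed);
            disposed = true;
            destroyed = true;
            return;
        }
        // Slow path: atomically decrement the use count (low half), as the
        // original algorithm does.
        std::uint64_t prev = counts.fetch_sub(1, std::memory_order_acq_rel);
        if ((prev & 0xffffffffULL) == 1) {
            disposed = true;
            // ... the original algorithm would then decrement the weak
            // count and possibly destroy the control block.
        }
    }
};
```

A release on a freshly constructed object (both counts 1) takes the fast
path; with a use count of 2, the first release only decrements, and the
second then qualifies for the fast path.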
> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
> > index 368b2d7379a..a8fc944af5f 100644
> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >          if (!_M_add_ref_lock_nothrow())
> >            __throw_bad_weak_ptr();
> >        }
> >
> >       bool
> >       _M_add_ref_lock_nothrow() noexcept;
> >
> >       void
> >       _M_release() noexcept
> >       {
> > +#if __SANITIZE_THREAD__
> > +       _M_release_orig();
> > +       return;
> > +#endif
> > +       if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>
> The line break should come before the logical operator, not after.
> This makes it easier to see which operator it is, because it's at a
> predictable position, not off on the right edge somewhere.
>
> i.e.
>
>   if (!__atomic_always_lock_free(sizeof(long long), 0)
>       || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>       || sizeof(long long) < (2 * sizeof(_Atomic_word))
>       || sizeof(long long) > (sizeof(void*)))
>
> But I think I'd like to see this condition expressed differently
> anyway, see below.
>
> > +           !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
> > +           sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>
> Shouldn't this be != instead of < ?
>
> On a big endian target where sizeof(long long) > sizeof(_Atomic_word),
> loading two _Atomic_word objects will fill the high bits of the long
> long, and so the (1LL + (1LL << (8 * sizeof(_Atomic_word)))) calculation
> won't match what got loaded into the long long.
>
> > +           sizeof(long long) > (sizeof(void*)))
>
> This is checking the alignment, right? I think we can do so more
> reliably, and should comment it.
>
> I think I'd like to see this condition defined as a number of
> constexpr booleans, with comments.
> Maybe:
>
>   constexpr bool __lock_free
>     = __atomic_always_lock_free(sizeof(long long), 0)
>       && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>   constexpr bool __double_word
>     = sizeof(long long) == 2 * sizeof(_Atomic_word);
>   // The ref-count members follow the vptr, so are aligned to alignof(void*).
>   constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
>   if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>     {
>       _M_release_double_width_cas();
>       return;
>     }
>   else
>     {
>       // ... original body of _M_release();
>     }
>
> > +         {
> > +           _M_release_orig();
> > +           return;
> > +         }
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> > +       if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
> > +           == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>
> This should use __CHAR_BIT__ instead of 8.
>
> > +         {
> > +           // Both counts are 1, so there are no weak references and
> > +           // we are releasing the last strong reference. No other
> > +           // threads can observe the effects of this _M_release()
> > +           // call (e.g. calling use_count()) without a data race.
> > +           *(long long*)(&_M_use_count) = 0;
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> > +           _M_dispose();
> > +           _M_destroy();
> > +         }
> > +       else
> > +         {
> > +           if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
> > +             {
> > +               _M_release_last_use();
> > +             }
> > +         }
> > +     }
> > +
> > +     void
> > +     __attribute__ ((noinline))
> > +     _M_release_last_use() noexcept
> > +     {
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> > +       _M_dispose();
> > +       if (_Mutex_base<_Lp>::_S_need_barriers)
> > +         {
> > +           __atomic_thread_fence (__ATOMIC_ACQ_REL);
> > +         }
> > +       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> > +       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> > +                                                  -1) == 1)
> > +         {
> > +           _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> > +           _M_destroy();
> > +         }
> > +     }
> > +
> > +     void
> > +     _M_release_orig() noexcept
> > +     {
> >        // Be race-detector-friendly. For more info see bits/c++config.
> >        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> >        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> >          {
> >            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >            _M_dispose();
> >            // There must be a memory barrier between dispose() and destroy()
> >            // to ensure that the effects of dispose() are observed in the
> >            // thread that runs destroy().
> >            // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >      _Sp_counted_base<_S_single>::_M_release() noexcept
> >      {
> >        if (--_M_use_count == 0)
> >          {
> >            _M_dispose();
> >            if (--_M_weak_count == 0)
> >              _M_destroy();
> >          }
> >      }
> >
> > +    template<>
> > +    inline void
> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
> > +    {
> > +      _M_release_orig();
> > +    }
> > +
> >      template<>
> >      inline void
> >      _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
> >      { ++_M_weak_count; }
> >
> >      template<>
> >      inline void
> >      _Sp_counted_base<_S_single>::_M_weak_release() noexcept
> >      {
> >        if (--_M_weak_count == 0)
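[Editor's illustration] The reviewer's suggested constexpr gating can be tried
standalone. This sketch assumes C++17 and a GCC/Clang compiler (for the
__atomic_always_lock_free builtin); AtomicWord = int is an assumption matching
common glibc targets, and the names (AtomicWord, release_path,
combined_matches) are hypothetical, not libstdc++ identifiers. It also uses
CHAR_BIT rather than a hard-coded 8, per the review comment.

```cpp
#include <climits>
#include <cstring>

// Hypothetical stand-in for libstdc++'s _Atomic_word (an int on most
// glibc targets).
using AtomicWord = int;

// The three gating conditions from the review, as constexpr booleans.
constexpr bool kLockFree =
    __atomic_always_lock_free(sizeof(long long), 0)
    && __atomic_always_lock_free(sizeof(AtomicWord), 0);
constexpr bool kDoubleWord =
    sizeof(long long) == 2 * sizeof(AtomicWord);
// In the real control block the ref-count members follow the vptr, so
// they are aligned to alignof(void*).
constexpr bool kAligned = alignof(long long) <= alignof(void*);

// The value a combined load sees when both counts equal 1.
constexpr long long kBothCountsOne =
    1LL + (1LL << (CHAR_BIT * sizeof(AtomicWord)));

// Which path a release would take on this target: "fast" when all three
// conditions hold, "orig" otherwise.
const char* release_path() {
    if constexpr (kLockFree && kDoubleWord && kAligned)
        return "fast";
    else
        return "orig";
}

// When the sizes match exactly, two adjacent words both equal to 1
// combine to kBothCountsOne on either endianness (the bit pattern is
// symmetric); the mismatch the reviewer describes only arises when
// sizeof(long long) exceeds 2 * sizeof(AtomicWord).
bool combined_matches() {
    AtomicWord counts[2] = {1, 1};
    long long combined = 0;
    static_assert(sizeof(counts) == sizeof(combined), "sizes must match");
    std::memcpy(&combined, counts, sizeof(combined));
    return combined == kBothCountsOne;
}
```

On a typical 64-bit target all three booleans are true and release_path()
reports the fast path; the combined value check holds regardless of byte
order.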