public inbox for libstdc++@gcc.gnu.org
* [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
@ 2020-12-17 20:49 Maged Michael
  2021-01-05 16:19 ` Maged Michael
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Maged Michael @ 2020-12-17 20:49 UTC (permalink / raw)
  To: libstdc++; +Cc: Maged Michael, gcc-patches

Please find a proposed patch for _Sp_counted_base::_M_release to skip the
two atomic instructions that decrement each of the use count and the weak
count when both are 1. I proposed the general idea in an earlier thread (
https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got
useful feedback on a draft patch and responses to related questions about
multi-granular atomicity and alignment. This patch is based on that
feedback.


I added a check for thread sanitizer so that the current algorithm is used
in that case, because TSAN does not support multi-granular atomicity. I'd
like to add a check of __has_feature(thread_sanitizer) for builds that use
LLVM. I found examples of __has_feature in libstdc++, but it doesn't seem
to be recognized in shared_ptr_base.h. Any guidance on how to check
__has_feature(thread_sanitizer) in this patch?


GCC generates code for _M_release that is larger and more complex than the
code generated by LLVM. I'd like to file a bug report about that. Jonathan,
would you please create a bugzilla account for me
(https://gcc.gnu.org/bugzilla/) using my gmail address? Thank you.


Information about the patch:

- Benefits of the patch: Save the cost of the last atomic decrement of
each of the use count and the weak count in _Sp_counted_base. Atomic
read-modify-write instructions are significantly slower than regular loads
and stores across major architectures.

- How the current code works: _M_release() atomically decrements the use
count and checks whether it was 1; if so, it calls _M_dispose(), then
atomically decrements the weak count, checks whether that was 1, and if so
calls _M_destroy().
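
In rough pseudocode (simplified; the real code uses
__exchange_and_add_dispatch and the race-detector annotations shown in the
diff below):

      void
      _M_release() noexcept
      {
        // Atomically decrement the use count; the old value is returned.
        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
          {
            _M_dispose();   // destroy the managed object
            // Atomically decrement the weak count; the control block
            // itself is destroyed only when the last weak reference is
            // also gone.
            if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
              _M_destroy();
          }
      }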

- How the proposed patch works: _M_release() loads the use count and the
weak count together atomically (when properly aligned), checks whether the
combined value corresponds to both counts being equal to 1 (e.g.,
0x100000001 with 32-bit counts), and if so calls _M_dispose() and
_M_destroy(). Otherwise, it falls back to the original algorithm.
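
A minimal sketch of the fast path (assuming a 32-bit _Atomic_word, a
lock-free 64-bit atomic load, and suitable alignment; the full patch below
also handles the fallback conditions):

      void
      _M_release() noexcept
      {
        // Combined value of the two adjacent counts when both are 1.
        constexpr long long __both_one
          = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
        // Load the use count and the weak count with one atomic load.
        if (__atomic_load_n((long long*)&_M_use_count, __ATOMIC_ACQUIRE)
            == __both_one)
          {
            // No other thread holds any reference, so the two atomic
            // decrements can be skipped.
            _M_dispose();
            _M_destroy();
          }
        else
          _M_release_orig();   // original two-decrement algorithm
      }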

- Why it works: When the thread executing _M_release() finds that each of
the counts is equal to 1, then (when _Lock_policy is _S_atomic) no other
thread can possibly hold a use or weak reference to this control block.
That is, no other thread can possibly access the counts or the protected
object.

- The proposed patch is intended to interact correctly with the current
code (under certain conditions: _Lock_policy is _S_atomic, proper
alignment, and native lock-free support for atomic operations). That is,
multiple threads using different versions of the code, with and without the
patch, operating on the same objects should always interact correctly. The
patch is intended to be ABI compatible with the current implementation.

- The proposed patch involves a performance trade-off: it saves the cost
of two atomic instructions when both counts are 1, at the added cost of
loading the combined counts and comparing them with the value that
represents two counts of 1 (e.g., 0x100000001).
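
For example, with 32-bit counts the expected combined value works out to
1 + 2^32 = 0x100000001:

      // Hypothetical 32-bit _Atomic_word: use count 1 in the low word,
      // weak count 1 in the adjacent word.
      static_assert(1LL + (1LL << 32) == 0x100000001LL,
                    "combined value when both counts are 1");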

- The patch has been in use (built using LLVM) in a large environment for
many months. The performance gains outweigh the losses (roughly 10 to 1)
across a large variety of workloads.


I'd appreciate feedback on the patch and any suggestions for checking
__has_feature(thread_sanitizer).


Maged



diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 368b2d7379a..a8fc944af5f 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
        if (!_M_add_ref_lock_nothrow())
          __throw_bad_weak_ptr();
       }

       bool
       _M_add_ref_lock_nothrow() noexcept;

       void
       _M_release() noexcept
       {
+#if __SANITIZE_THREAD__
+        _M_release_orig();
+        return;
+#endif
+        if (!__atomic_always_lock_free(sizeof(long long), 0) ||
+            !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
+            sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
+            sizeof(long long) > (sizeof(void*)))
+          {
+            _M_release_orig();
+            return;
+          }
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
+          {
+            // Both counts are 1, so there are no weak references and
+            // we are releasing the last strong reference. No other
+            // threads can observe the effects of this _M_release()
+            // call (e.g. calling use_count()) without a data race.
+            *(long long*)(&_M_use_count) = 0;
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_dispose();
+            _M_destroy();
+          }
+        else
+          {
+            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+              {
+                _M_release_last_use();
+              }
+          }
+      }
+
+      void
+      __attribute__ ((noinline))
+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }
+
+      void
+      _M_release_orig() noexcept
+      {
         // Be race-detector-friendly.  For more info see bits/c++config.
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
          {
             _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
            _M_dispose();
            // There must be a memory barrier between dispose() and destroy()
            // to ensure that the effects of dispose() are observed in the
            // thread that runs destroy().
            // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
@@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _Sp_counted_base<_S_single>::_M_release() noexcept
     {
       if (--_M_use_count == 0)
         {
           _M_dispose();
           if (--_M_weak_count == 0)
             _M_destroy();
         }
     }

+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      _M_release_orig();
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
     { ++_M_weak_count; }

   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_release() noexcept
     {
       if (--_M_weak_count == 0)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2020-12-17 20:49 [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1 Maged Michael
@ 2021-01-05 16:19 ` Maged Michael
  2021-06-10 14:41 ` Maged Michael
  2021-07-16 13:54 ` Jonathan Wakely
  2 siblings, 0 replies; 21+ messages in thread
From: Maged Michael @ 2021-01-05 16:19 UTC (permalink / raw)
  To: libstdc++; +Cc: gcc-patches, Jonathan Wakely

Please let me know if more information about this patch is needed. Thank
you.

Maged


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2020-12-17 20:49 [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1 Maged Michael
  2021-01-05 16:19 ` Maged Michael
@ 2021-06-10 14:41 ` Maged Michael
  2021-07-16 13:54 ` Jonathan Wakely
  2 siblings, 0 replies; 21+ messages in thread
From: Maged Michael @ 2021-06-10 14:41 UTC (permalink / raw)
  To: libstdc++; +Cc: gcc-patches, Jonathan Wakely

Would appreciate any comments on this proposed patch.

Thanks,
Maged


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2020-12-17 20:49 [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1 Maged Michael
  2021-01-05 16:19 ` Maged Michael
  2021-06-10 14:41 ` Maged Michael
@ 2021-07-16 13:54 ` Jonathan Wakely
  2021-07-16 15:55   ` Maged Michael
  2 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-07-16 13:54 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>
> Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> two atomic instructions that decrement each of the use count and the weak
> count when both are 1. I proposed the general idea in an earlier thread (
> https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and got
> useful feedback on a draft patch and responses to related questions about
> multi-granular atomicity and alignment. This patch is based on that
> feedback.
>
>
> I added a check for thread sanitizer to use the current algorithm in that
> case because TSAN does not support multi-granular atomicity. I'd like to
> add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> found examples of __has_feature in libstdc++

There are no uses of __has_feature in libstdc++. We do use
__has_builtin (which GCC also supports) and Clang's __is_identifier
(which GCC doesn't support) to work around some weird semantics of
__has_builtin in older versions of Clang.


> but it doesn't seem to be
> recognized in shared_ptr_base.h. Any guidance on how to check
> __has_feature(thread_sanitizer) in this patch?

I think we want to do something like this in include/bits/c++config

#if __SANITIZE_THREAD__
#  define _GLIBCXX_TSAN 1
#elif defined __has_feature
# if __has_feature(thread_sanitizer)
#  define _GLIBCXX_TSAN 1
# endif
#endif

Then in bits/shared_ptr_base.h

#if _GLIBCXX_TSAN
        _M_release_orig();
        return;
#endif



> GCC generates code for _M_release that is larger and more complex than that
> generated by LLVM. I'd like to file a bug report about that. Jonathan,

Is this the same issue as https://gcc.gnu.org/PR101406 ?

> would you please create a bugzilla account for me (
> https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.

Done (sorry, I didn't notice the request in this mail until coming
back to it to review the patch properly).


> I'd appreciate feedback on the patch and any suggestions for checking
> __has_feature(thread_sanitizer).

N.B. gmail completely mangles patches unless you send them as attachments.


> diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
> index 368b2d7379a..a8fc944af5f 100644
> --- a/libstdc++-v3/include/bits/shared_ptr_base.h
> +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
> @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>         if (!_M_add_ref_lock_nothrow())
>           __throw_bad_weak_ptr();
>        }
>
>        bool
>        _M_add_ref_lock_nothrow() noexcept;
>
>        void
>        _M_release() noexcept
>        {
> +#if __SANITIZE_THREAD__
> +        _M_release_orig();
> +        return;
> +#endif
> +        if (!__atomic_always_lock_free(sizeof(long long), 0) ||

The line break should come before the logical operator, not after.
This makes it easier to see which operator it is, because it's at a
predictable position, not off on the right edge somewhere.

i.e.

        if (!__atomic_always_lock_free(sizeof(long long), 0)
           || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
           || sizeof(long long) < (2 * sizeof(_Atomic_word))
           || sizeof(long long) > (sizeof(void*)))

But I think I'd like to see this condition expressed differently
anyway, see below.

> +            !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
> +            sizeof(long long) < (2 * sizeof(_Atomic_word)) ||

Shouldn't this be != instead of < ?

On a big endian target where sizeof(long long) > sizeof(_Atomic_word)
loading two _Atomic_word objects will fill the high bits of the long
long, and so the (1LL + (1LL << (8 * 4))) calculation won't match what
got loaded into the long long.
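
For example (hypothetical sizes), with a 16-bit _Atomic_word and a 64-bit
big-endian long long:

    // With these sizes the two counts occupy only the first 4 bytes at
    // &_M_use_count, so an 8-byte big-endian load places them in the
    // *high* half of the result:
    //
    //   loaded value == (1LL << 48) | (1LL << 32) | <next 4 bytes>
    //
    // while the expected constant is
    //
    //   1LL + (1LL << (8 * sizeof(_Atomic_word))) == 0x10001
    //
    // so the comparison never matches even when both counts are 1.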

> +            sizeof(long long) > (sizeof(void*)))

This is checking the alignment, right? I think we can do so more
reliably, and should comment it.

I think I'd like to see this condition defined as a number of
constexpr booleans, with comments. Maybe:

constexpr bool __lock_free
  = __atomic_always_lock_free(sizeof(long long), 0)
     && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
constexpr bool __double_word
  = sizeof(long long) == 2 * sizeof(_Atomic_word);
// The ref-count members follow the vptr, so are aligned to alignof(void*).
constexpr bool __aligned = alignof(long long) <= alignof(void*);

if _GLIBCXX17_CONSTEXPR
  (__lock_free && __double_word && __aligned)
  {
    _M_release_double_width_cas();
    return;
  }
else
  {
    // ... original body of _M_release();
  }

> +          {
> +            _M_release_orig();
> +            return;
> +          }
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
> +            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))

This should use __CHAR_BIT__ instead of 8.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-07-16 13:54 ` Jonathan Wakely
@ 2021-07-16 15:55   ` Maged Michael
  2021-08-02 13:23     ` Maged Michael
  0 siblings, 1 reply; 21+ messages in thread
From: Maged Michael @ 2021-07-16 15:55 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

Thank you, Jonathan, for the detailed comments! I'll update the patch
accordingly.

On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
> >
> > Please find a proposed patch for _Sp_counted_base::_M_release to skip the
> > two atomic instructions that decrement each of the use count and the weak
> > count when both are 1. I proposed the general idea in an earlier thread (
> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html) and
> got
> > useful feedback on a draft patch and responses to related questions about
> > multi-granular atomicity and alignment. This patch is based on that
> > feedback.
> >
> >
> > I added a check for thread sanitizer to use the current algorithm in that
> > case because TSAN does not support multi-granular atomicity. I'd like to
> > add a check of __has_feature(thread_sanitizer) for building using LLVM. I
> > found examples of __has_feature in libstdc++
>
> There are no uses of __has_feature in libstdc++. We do use
> __has_builtin (which GCC also supports) and Clang's __is_identifier
> (which GCC doesn't support) to work around some weird semantics of
> __has_builtin in older versions of Clang.
>
>
> > but it doesn't seem to be
> > recognized in shared_ptr_base.h. Any guidance on how to check
> > __has_feature(thread_sanitizer) in this patch?
>
> I think we want to do something like this in include/bits/c++config
>
> #if __SANITIZE_THREAD__
> #  define _GLIBCXX_TSAN 1
> #elif defined __has_feature
> # if __has_feature(thread_sanitizer)
> #  define _GLIBCXX_TSAN 1
> # endif
> #endif
>
> Then in bits/shared_ptr_base.h
>
> #if _GLIBCXX_TSAN
>         _M_release_orig();
>         return;
> #endif
>
>
>
> > GCC generates code for _M_release that is larger and more complex than
> that
> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>
> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>

Partly yes. Even when using __atomic_add_dispatch I noticed that clang
generated less code than gcc. I see in the response to the issue that the
new glibc is expected to optimize better. So maybe this will eliminate the
issue.


> > would you please create a bugzilla account for me (
> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>
> Done (sorry, I didn't notice the request in this mail until coming
> back to it to review the patch properly).
>

Thank you!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-07-16 15:55   ` Maged Michael
@ 2021-08-02 13:23     ` Maged Michael
  2021-08-02 13:29       ` Maged Michael
  0 siblings, 1 reply; 21+ messages in thread
From: Maged Michael @ 2021-08-02 13:23 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 12086 bytes --]

Please find attached an updated patch after incorporating Jonathan's
suggestions.

Changes from the last patch include:
- Add a TSAN macro to bits/c++config.
- Use separate constexpr bools for the lock-freedom, double-width, and
alignment conditions.
- Move the code in the optimized path to a separate function,
_M_release_double_width_cas (the resulting dispatch is sketched below).
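
For illustration, the resulting dispatch in _M_release() looks roughly like
this (a sketch following Jonathan's suggestion; see the attached patch for
the exact code):

      void
      _M_release() noexcept
      {
#if _GLIBCXX_TSAN
        _M_release_orig();
        return;
#endif
        // Single- and double-width atomic accesses must be lock-free.
        constexpr bool __lock_free
          = __atomic_always_lock_free(sizeof(long long), 0)
            && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
        // long long must cover exactly the two adjacent counts.
        constexpr bool __double_word
          = sizeof(long long) == 2 * sizeof(_Atomic_word);
        // The ref-count members follow the vptr, so they are aligned to
        // alignof(void*).
        constexpr bool __aligned = alignof(long long) <= alignof(void*);

        if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
          _M_release_double_width_cas();
        else
          _M_release_orig();
      }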

Thanks,
Maged


[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 5098 bytes --]

diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 69ace386dd7..543fac268b4 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -126,20 +126,29 @@
 # define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
 #endif
 
 // Macro to warn about unused results.
 #if __cplusplus >= 201703L
 # define _GLIBCXX_NODISCARD [[__nodiscard__]]
 #else
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
 
 // Macro for constexpr, to support in mixed 03/0x mode.
 #ifndef _GLIBCXX_CONSTEXPR
 # if __cplusplus >= 201103L
 #  define _GLIBCXX_CONSTEXPR constexpr
 #  define _GLIBCXX_USE_CONSTEXPR constexpr
 # else
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..3368426e417 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,93 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	if (!_M_add_ref_lock_nothrow())
 	  __throw_bad_weak_ptr();
       }
 
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
       void
       _M_release() noexcept
       {
+#if _GLIBCXX_TSAN
+        _M_release_orig();
+        return;
+#endif
+        constexpr bool __lock_free
+          = __atomic_always_lock_free(sizeof(long long), 0)
+          && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);
+        if _GLIBCXX17_CONSTEXPR
+          (__lock_free && __double_word && __aligned)
+          {
+            _M_release_double_width_cas();
+            return;
+          }
+        else
+          {
+            _M_release_orig();
+          }
+      }
+
+      void
+      _M_release_double_width_cas()
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+            == (1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)))))
+          {
+            // Both counts are 1, so there are no weak references and
+            // we are releasing the last strong reference. No other
+            // threads can observe the effects of this _M_release()
+            // call (e.g. calling use_count()) without a data race.
+            *(long long*)(&_M_use_count) = 0;
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_dispose();
+            _M_destroy();
+          }
+        else
+          {
+            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+              {
+                _M_release_last_use();
+              }
+          }
+      }
+
+      void
+      __attribute__ ((noinline))
+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }
+
+      void
+      _M_release_orig() noexcept
+      {
         // Be race-detector-friendly.  For more info see bits/c++config.
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
 	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
 	  {
             _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
 	    _M_dispose();
 	    // There must be a memory barrier between dispose() and destroy()
 	    // to ensure that the effects of dispose() are observed in the
 	    // thread that runs destroy().
 	    // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
@@ -279,20 +352,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _Sp_counted_base<_S_single>::_M_release() noexcept
     {
       if (--_M_use_count == 0)
         {
           _M_dispose();
           if (--_M_weak_count == 0)
             _M_destroy();
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      _M_release_orig();
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
     { ++_M_weak_count; }
 
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_release() noexcept
     {
       if (--_M_weak_count == 0)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-02 13:23     ` Maged Michael
@ 2021-08-02 13:29       ` Maged Michael
  2021-08-03 20:59         ` Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Maged Michael @ 2021-08-02 13:29 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 12713 bytes --]

This is the right patch. The previous one is missing noexcept. Sorry.


On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com>
wrote:

> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
>
> Changes from the last patch include:
> - Add a TSAN macro to bits/c++config.
> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
>
> Thanks,
> Maged
>
>
> On Fri, Jul 16, 2021 at 11:55 AM Maged Michael <maged.michael@gmail.com>
> wrote:
>
>> Thank you, Jonathan, for the detailed comments! I'll update the patch
>> accordingly.
>>
>> On Fri, Jul 16, 2021 at 9:55 AM Jonathan Wakely <jwakely.gcc@gmail.com>
>> wrote:
>>
>>> On Thu, 17 Dec 2020 at 20:51, Maged Michael wrote:
>>> >
>>> > Please find a proposed patch for _Sp_counted_base::_M_release to skip
>>> the
>>> > two atomic instructions that decrement each of the use count and the
>>> weak
>>> > count when both are 1. I proposed the general idea in an earlier
>>> thread (
>>> > https://gcc.gnu.org/pipermail/libstdc++/2020-December/051642.html)
>>> and got
>>> > useful feedback on a draft patch and responses to related questions
>>> about
>>> > multi-granular atomicity and alignment. This patch is based on that
>>> > feedback.
>>> >
>>> >
>>> > I added a check for thread sanitizer to use the current algorithm in
>>> that
>>> > case because TSAN does not support multi-granular atomicity. I'd like
>>> to
>>> > add a check of __has_feature(thread_sanitizer) for building using
>>> LLVM. I
>>> > found examples of __has_feature in libstdc++
>>>
>>> There are no uses of __has_feature in libstdc++. We do use
>>> __has_builtin (which GCC also supports) and Clang's __is_identifier
>>> (which GCC doesn't support) to work around some weird semantics of
>>> __has_builtin in older versions of Clang.
>>>
>>>
>>> > but it doesn't seem to be
>>> > recognized in shared_ptr_base.h. Any guidance on how to check
>>> > __has_feature(thread_sanitizer) in this patch?
>>>
>>> I think we want to do something like this in include/bits/c++config
>>>
>>> #if __SANITIZE_THREAD__
>>> #  define _GLIBCXX_TSAN 1
>>> #elif defined __has_feature
>>> # if __has_feature(thread_sanitizer)
>>> #  define _GLIBCXX_TSAN 1
>>> # endif
>>> #endif
>>>
>>> Then in bits/shared_ptr_base.h
>>>
>>> #if _GLIBCXX_TSAN
>>>         _M_release_orig();
>>>         return;
>>> #endif
>>>
>>>
>>>
>>> > GCC generates code for _M_release that is larger and more complex than
>>> that
>>> > generated by LLVM. I'd like to file a bug report about that. Jonathan,
>>>
>>> Is this the same issue as https://gcc.gnu.org/PR101406 ?
>>>
>> Partly yes. Even when using __atomic_add_dispatch I noticed that clang
>> generated less code than gcc. I see in the response to the issue that the
>> new glibc is expected to optimize better. So maybe this will eliminate the
>> issue.
>>
>>
>>> > would you please create a bugzilla account for me (
>>> > https://gcc.gnu.org/bugzilla/) using my gmail address. Thank you.
>>>
>>> Done (sorry, I didn't notice the request in this mail until coming
>>> back to it to review the patch properly).
>>>
>> Thank you!
>>
>>
>>>
>>>
>>> >
>>> >
>>> > Information about the patch:
>>> >
>>> > - Benefits of the patch: Save the cost of the last atomic decrements of
>>> > each of the use count and the weak count in _Sp_counted_base. Atomic
>>> > instructions are significantly slower than regular loads and stores
>>> across
>>> > major architectures.
>>> >
>>> > - How current code works: _M_release() atomically decrements the use
>>> count,
>>> > checks if it was 1, if so calls _M_dispose(), atomically decrements the
>>> > weak count, checks if it was 1, and if so calls _M_destroy().
>>> >
>>> > - How the proposed patch works: _M_release() loads both use count and
>>> weak
>>> > count together atomically (when properly aligned), checks if the value
>>> is
>>> > equal to the value of both counts equal to 1 (e.g., 0x100000001), and
>>> if so
>>> > calls _M_dispose() and _M_destroy(). Otherwise, it follows the original
>>> > algorithm.
>>> >
>>> > - Why it works: When the current thread executing _M_release() finds
>>> each
>>> > of the counts is equal to 1, then (when _lock_policy is _S_atomic) no
>>> other
>>> > threads could possibly hold use or weak references to this control
>>> block.
>>> > That is, no other threads could possibly access the counts or the
>>> protected
>>> > object.
>>> >
>>> > - The proposed patch is intended to interact correctly with current
>>> code
>>> > (under certain conditions: _Lock_policy is _S_atomic, proper
>>> alignment, and
>>> > native lock-free support for atomic operations). That is, multiple
>>> threads
>>> > using different versions of the code with and without the patch
>>> operating
>>> > on the same objects should always interact correctly. The intent for
>>> the
>>> > patch is to be ABI compatible with the current implementation.
>>> >
>>> > - The proposed patch involves a performance trade-off between saving
>>> the
>>> > costs of two atomic instructions when the counts are both 1 vs adding
>>> the
>>> > cost of loading the combined counts and comparison with two ones (e.g.,
>>> > 0x100000001).
>>> >
>>> > - The patch has been in use (built using LLVM) in a large environment
>>> for
>>> > many months. The performance gains outweigh the losses (roughly 10 to
>>> 1)
>>> > across a large variety of workloads.
>>> >
>>> >
>>> > I'd appreciate feedback on the patch and any suggestions for checking
>>> > __has_feature(thread_sanitizer).
>>>
>>> N.B. gmail completely mangles patches unless you send them as
>>> attachments.
>>>
>>>
>>> > diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h
>>> > b/libstdc++-v3/include/bits/shared_ptr_base.h
>>> >
>>> > index 368b2d7379a..a8fc944af5f 100644
>>> >
>>> > --- a/libstdc++-v3/include/bits/shared_ptr_base.h
>>> >
>>> > +++ b/libstdc++-v3/include/bits/shared_ptr_base.h
>>> >
>>> > @@ -153,20 +153,78 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>> >
>>> >         if (!_M_add_ref_lock_nothrow())
>>> >
>>> >           __throw_bad_weak_ptr();
>>> >
>>> >        }
>>> >
>>> >
>>> >        bool
>>> >
>>> >        _M_add_ref_lock_nothrow() noexcept;
>>> >
>>> >
>>> >        void
>>> >
>>> >        _M_release() noexcept
>>> >
>>> >        {
>>> >
>>> > +#if __SANITIZE_THREAD__
>>> >
>>> > +        _M_release_orig();
>>> >
>>> > +        return;
>>> >
>>> > +#endif
>>> >
>>> > +        if (!__atomic_always_lock_free(sizeof(long long), 0) ||
>>>
>>> The line break should come before the logical operator, not after.
>>> This makes it easier to see which operator it is, because it's at a
>>> predictable position, not off on the right edge somewhere.
>>>
>>> i.e.
>>>
>>>         if (!__atomic_always_lock_free(sizeof(long long), 0)
>>>            || !__atomic_always_lock_free(sizeof(_Atomic_word), 0)
>>>            || sizeof(long long) < (2 * sizeof(_Atomic_word))
>>>            || sizeof(long long) > (sizeof(void*)))
>>>
>>> But I think I'd like to see this condition expressed differently
>>> anyway, see below.
>>>
>>> >
>>> > +            !__atomic_always_lock_free(sizeof(_Atomic_word), 0) ||
>>> >
>>> > +            sizeof(long long) < (2 * sizeof(_Atomic_word)) ||
>>>
>>> Shouldn't this be != instead of < ?
>>>
>>> On a big endian target where sizeof(long long) > sizeof(_Atomic_word)
>>> loading two _Atomic_word objects will fill the high bits of the long
>>> long, and so the (1LL + (1LL << (8 * 4))) calculation won't match what
>>> got loaded into the long long.
>>>
>>> >
>>> > +            sizeof(long long) > (sizeof(void*)))
>>>
>>> This is checking the alignment, right? I think we can do so more
>>> reliably, and should comment it.
>>>
>>> I think I'd like to see this condition defined as a number of
>>> constexpr booleans, with comments. Maybe:
>>>
>>> constexpr bool __lock_free
>>>   = __atomic_always_lock_free(sizeof(long long), 0)
>>>      && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>>> constexpr bool __double_word
>>>   = sizeof(long long) == 2 * sizeof(_Atomic_word);
>>> // The ref-count members follow the vptr, so are aligned to
>>> alignof(void*).
>>> constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>>
>>> if _GLIBCXX17_CONSTEXPR
>>>   (__lock_free && __double_word && __aligned)
>>>   {
>>>     _M_release_double_width_cas();
>>>     return;
>>>   }
>>> else
>>>   {
>>>     // ... original body of _M_release();
>>>   }
>>>
>>> > +          {
>>> >
>>> > +            _M_release_orig();
>>> >
>>> > +            return;
>>> >
>>> > +          }
>>> >
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>>> >
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>>> >
>>> > +        if (__atomic_load_n((long long*)(&_M_use_count),
>>> __ATOMIC_ACQUIRE)
>>> >
>>> > +            == (1LL + (1LL << (8 * sizeof(_Atomic_word)))))
>>>
>>> This should use __CHAR_BIT__ instead of 8.
>>>
>>>
>>> >
>>> > +          {
>>> >
>>> > +            // Both counts are 1, so there are no weak references and
>>> >
>>> > +            // we are releasing the last strong reference. No other
>>> >
>>> > +            // threads can observe the effects of this _M_release()
>>> >
>>> > +            // call (e.g. calling use_count()) without a data race.
>>> >
>>> > +            *(long long*)(&_M_use_count) = 0;
>>> >
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> >
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>>> >
>>> > +            _M_dispose();
>>> >
>>> > +            _M_destroy();
>>> >
>>> > +          }
>>> >
>>> > +        else
>>> >
>>> > +          {
>>> >
>>> > +            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) ==
>>> 1))
>>> >
>>> > +              {
>>> >
>>> > +                _M_release_last_use();
>>> >
>>> > +              }
>>> >
>>> > +          }
>>> >
>>> > +      }
>>> >
>>> > +
>>> >
>>> > +      void
>>> >
>>> > +      __attribute__ ((noinline))
>>> >
>>> > +      _M_release_last_use() noexcept
>>> >
>>> > +      {
>>> >
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> >
>>> > +        _M_dispose();
>>> >
>>> > +        if (_Mutex_base<_Lp>::_S_need_barriers)
>>> >
>>> > +          {
>>> >
>>> > +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>>> >
>>> > +          }
>>> >
>>> > +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>>> >
>>> > +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>>> >
>>> > +                                                   -1) == 1)
>>> >
>>> > +          {
>>> >
>>> > +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>>> >
>>> > +            _M_destroy();
>>> >
>>> > +          }
>>> >
>>> > +      }
>>> >
>>> > +
>>> >
>>> > +      void
>>> >
>>> > +      _M_release_orig() noexcept
>>> >
>>> > +      {
>>> >
>>> >          // Be race-detector-friendly.  For more info see
>>> bits/c++config.
>>> >
>>> >          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
>>> >
>>> >         if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1)
>>> == 1)
>>> >
>>> >           {
>>> >
>>> >              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>>> >
>>> >             _M_dispose();
>>> >
>>> >             // There must be a memory barrier between dispose() and
>>> > destroy()
>>> >
>>> >             // to ensure that the effects of dispose() are observed in
>>> the
>>> >
>>> >             // thread that runs destroy().
>>> >
>>> >             // See
>>> http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
>>> >
>>> > @@ -279,20 +337,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>> >
>>> >      _Sp_counted_base<_S_single>::_M_release() noexcept
>>> >
>>> >      {
>>> >
>>> >        if (--_M_use_count == 0)
>>> >
>>> >          {
>>> >
>>> >            _M_dispose();
>>> >
>>> >            if (--_M_weak_count == 0)
>>> >
>>> >              _M_destroy();
>>> >
>>> >          }
>>> >
>>> >      }
>>> >
>>> >
>>> > +  template<>
>>> >
>>> > +    inline void
>>> >
>>> > +    _Sp_counted_base<_S_mutex>::_M_release() noexcept
>>> >
>>> > +    {
>>> >
>>> > +      _M_release_orig();
>>> >
>>> > +    }
>>> >
>>> > +
>>> >
>>> >    template<>
>>> >
>>> >      inline void
>>> >
>>> >      _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
>>> >
>>> >      { ++_M_weak_count; }
>>> >
>>> >
>>> >    template<>
>>> >
>>> >      inline void
>>> >
>>> >      _Sp_counted_base<_S_single>::_M_weak_release() noexcept
>>> >
>>> >      {
>>> >
>>> >        if (--_M_weak_count == 0)
>>>
>>

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 5107 bytes --]

diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 69ace386dd7..543fac268b4 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -126,20 +126,29 @@
 # define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
 #endif
 
 // Macro to warn about unused results.
 #if __cplusplus >= 201703L
 # define _GLIBCXX_NODISCARD [[__nodiscard__]]
 #else
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
 
 // Macro for constexpr, to support in mixed 03/0x mode.
 #ifndef _GLIBCXX_CONSTEXPR
 # if __cplusplus >= 201103L
 #  define _GLIBCXX_CONSTEXPR constexpr
 #  define _GLIBCXX_USE_CONSTEXPR constexpr
 # else
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..834ca13bdc6 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -153,20 +153,93 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	if (!_M_add_ref_lock_nothrow())
 	  __throw_bad_weak_ptr();
       }
 
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
       void
       _M_release() noexcept
       {
+#if _GLIBCXX_TSAN
+        _M_release_orig();
+        return;
+#endif
+        constexpr bool __lock_free
+          = __atomic_always_lock_free(sizeof(long long), 0)
+          && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);
+        if _GLIBCXX17_CONSTEXPR
+          (__lock_free && __double_word && __aligned)
+          {
+            _M_release_double_width_cas();
+            return;
+          }
+        else
+          {
+            _M_release_orig();
+          }
+      }
+
+      void
+      _M_release_double_width_cas() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__atomic_load_n((long long*)(&_M_use_count), __ATOMIC_ACQUIRE)
+            == (1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)))))
+          {
+            // Both counts are 1, so there are no weak references and
+            // we are releasing the last strong reference. No other
+            // threads can observe the effects of this _M_release()
+            // call (e.g. calling use_count()) without a data race.
+            *(long long*)(&_M_use_count) = 0;
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_dispose();
+            _M_destroy();
+          }
+        else
+          {
+            if ((__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1))
+              {
+                _M_release_last_use();
+              }
+          }
+      }
+
+      void
+      __attribute__ ((noinline))
+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }
+
+      void
+      _M_release_orig() noexcept
+      {
         // Be race-detector-friendly.  For more info see bits/c++config.
         _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
 	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
 	  {
             _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
 	    _M_dispose();
 	    // There must be a memory barrier between dispose() and destroy()
 	    // to ensure that the effects of dispose() are observed in the
 	    // thread that runs destroy().
 	    // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
@@ -279,20 +352,27 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     _Sp_counted_base<_S_single>::_M_release() noexcept
     {
       if (--_M_use_count == 0)
         {
           _M_dispose();
           if (--_M_weak_count == 0)
             _M_destroy();
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      _M_release_orig();
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept
     { ++_M_weak_count; }
 
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_release() noexcept
     {
       if (--_M_weak_count == 0)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-02 13:29       ` Maged Michael
@ 2021-08-03 20:59         ` Jonathan Wakely
  2021-08-04 15:32           ` Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-08-03 20:59 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>
> This is the right patch. The previous one is missing noexcept. Sorry.
>
>
> On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com> wrote:
>>
>> Please find attached an updated patch after incorporating Jonathan's suggestions.
>>
>> Changes from the last patch include:
>> - Add a TSAN macro to bits/c++config.
>> - Use separate constexpr bool-s for the conditions for lock-freedom, double-width and alignment.
>> - Move the code in the optimized path to a separate function _M_release_double_width_cas.

Thanks for the updated patch. At a quick glance it looks great. I'll
apply it locally and test it tomorrow.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-03 20:59         ` Jonathan Wakely
@ 2021-08-04 15:32           ` Jonathan Wakely
  2021-08-04 15:47             ` Maged Michael
  2021-08-04 17:19             ` Maged Michael
  0 siblings, 2 replies; 21+ messages in thread
From: Jonathan Wakely @ 2021-08-04 15:32 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>
> On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> >
> > This is the right patch. The previous one is missing noexcept. Sorry.
> >
> >
> > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com> wrote:
> >>
> >> Please find attached an updated patch after incorporating Jonathan's suggestions.
> >>
> >> Changes from the last patch include:
> >> - Add a TSAN macro to bits/c++config.
> >> - Use separate constexpr bool-s for the conditions for lock-freedom, double-width and alignment.
> >> - Move the code in the optimized path to a separate function _M_release_double_width_cas.
>
> Thanks for the updated patch. At a quick glance it looks great. I'll
> apply it locally and test it tomorrow.


It occurs to me now that _M_release_double_width_cas is the wrong
name, because we're not doing a DWCAS, just a double-width load. What
do you think about renaming it to _M_release_combined instead? Since
it does a combined load of the two counts.

I'm no longer confident in the alignof suggestion I gave you.

+        constexpr bool __double_word
+          = sizeof(long long) == 2 * sizeof(_Atomic_word);
+        // The ref-count members follow the vptr, so are aligned to
+        // alignof(void*).
+        constexpr bool __aligned = alignof(long long) <= alignof(void*);

For IA32 (i.e. 32-bit x86) this constant will be true, because
alignof(long long) and alignof(void*) are both 4, even though
sizeof(long long) is 8. So in theory the _M_use_count and
_M_weak_count members could be in different cache lines, and the
atomic load will not be atomic. I think we want to use __alignof(long
long) here to get the "natural" alignment, not the smaller 4B
alignment allowed by SysV ABI. That means that we won't do this
optimization on IA32, but I think that's acceptable.

Alternatively, we could keep using alignof, but with an additional
run-time check something like (uintptr_t)&_M_use_count / 64 ==
(uintptr_t)&_M_weak_count / 64 to check they're on the same cache
line. I think for now we should just use __alignof and live without
the optimization on IA32.
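
To make the two options concrete, here is a rough sketch (member names as
in the patch; the 64-byte cache line size is just an assumption for the
illustration, not something the ABI guarantees):

#include <cstdint>  // for std::uintptr_t

// Option 1 (what I suggest for now): require the natural alignment of
// long long.  On IA32 __alignof(long long) is 8 while alignof(void*) is
// only 4, so the combined load is simply not used there.
constexpr bool __aligned = __alignof(long long) <= alignof(void*);

// Option 2 (not proposed for now): keep alignof, but also verify at run
// time that both counts sit on the same cache line.
inline bool
__same_cache_line(const void* __a, const void* __b) noexcept
{
  return (reinterpret_cast<std::uintptr_t>(__a) / 64)
      == (reinterpret_cast<std::uintptr_t>(__b) / 64);
}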

Secondly:

+      void
+      __attribute__ ((noinline))

This needs to be __noinline__ because noinline is not a reserved word,
so users can do:
#define noinline 1
#include <memory>

Was it performance profiling, or code-size measurements (or something
else?) that showed this should be non-inline?
I'd like to add a comment explaining why we want it to be noinline.

In that function ...

+      _M_release_last_use() noexcept
+      {
+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+        _M_dispose();
+        if (_Mutex_base<_Lp>::_S_need_barriers)
+          {
+            __atomic_thread_fence (__ATOMIC_ACQ_REL);
+          }

I think we can remove this fence. The _S_need_barriers constant is
only true for the _S_mutex policy, and that just calls
_M_release_orig(), so never uses this function. I'll remove it and add
a comment noting that the lack of barrier is intentional.

+        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+                                                   -1) == 1)
+          {
+            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+            _M_destroy();
+          }
+      }
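
With the fence removed, the helper would end up looking something like
this (just a sketch of what I mean, not a final form):

void
__attribute__ ((__noinline__))
_M_release_last_use() noexcept
{
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
  _M_dispose();
  // No __atomic_thread_fence here: _S_need_barriers is only true for
  // the _S_mutex policy, and that policy never reaches this function.
  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count, -1) == 1)
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
      _M_destroy();
    }
}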

Alternatively, we could keep the fence in _M_release_last_use() and
refactor the other _M_release functions, so that we have explicit
specializations for each of:
_Sp_counted_base<_S_single>::_M_release() (already present)
_Sp_counted_base<_S_mutex>::_M_release()
_Sp_counted_base<_S_atomic>::_M_release()

The second and third would be new, as currently they both use the
definition in the primary template. The _S_mutex one would just
decrement _M_use_count then call _M_release_last_use() (which still
has the barrier needed for the _S_mutex policy). The _S_atomic one
would have your new optimization. See the attached patch showing what
I mean. I find this version a bit simpler to understand, as we just
have _M_release and _M_release_last_use, without
_M_release_double_width_cas and _M_release_orig. What do you think of
this version? Does it lose any important properties of your version
which I've failed to notice?

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 4410 bytes --]

diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 32b8957f814..07465f7ecd5 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -143,6 +143,15 @@
 # define _GLIBCXX_NODISCARD
 #endif
 
+// Macro for TSAN.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 
 
 #if __cplusplus
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 5be935d174d..b2397c8fddb 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -143,10 +143,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       virtual void*
       _M_get_deleter(const std::type_info&) noexcept = 0;
 
+      // Increment the use count (used when the count is greater than zero).
       void
       _M_add_ref_copy()
       { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
 
+      // Increment the use count if it is non-zero, throw otherwise.
       void
       _M_add_ref_lock()
       {
@@ -154,15 +156,18 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  __throw_bad_weak_ptr();
       }
 
+      // Increment the use count if it is non-zero.
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
+      // Decrement the use count.
       void
-      _M_release() noexcept
-      {
-        // Be race-detector-friendly.  For more info see bits/c++config.
-        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
-	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+      _M_release() noexcept;
+
+      // Called by _M_release() when the use count reaches zero.
+      __attribute__((__noinline__))
+      void
+      _M_release_last_use() noexcept
       {
 	_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
 	_M_dispose();
@@ -184,12 +189,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	    _M_destroy();
 	  }
       }
-      }
 
+      // Increment the weak count.
       void
       _M_weak_add_ref() noexcept
       { __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
 
+      // Decrement the weak count.
       void
       _M_weak_release() noexcept
       {
@@ -286,6 +292,61 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      // Be race-detector-friendly.  For more info see bits/c++config.
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+	{
+	  _M_release_last_use();
+	}
+    }
+
+  template<>
+    inline void
+    _Sp_counted_base<_S_atomic>::_M_release() noexcept
+    {
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+#if ! _GLIBCXX_TSAN
+      constexpr bool __lock_free
+	= __atomic_always_lock_free(sizeof(long long), 0)
+	&& __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+      constexpr bool __double_word
+	= sizeof(long long) == 2 * sizeof(_Atomic_word);
+      // The ref-count members follow the vptr, so are aligned to
+      // alignof(void*).
+      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
+      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
+	{
+	  constexpr long long __unique_ref
+	    = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
+	  auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
+
+	  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+	  if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
+	    {
+	      // Both counts are 1, so there are no weak references and
+	      // we are releasing the last strong reference. No other
+	      // threads can observe the effects of this _M_release()
+	      // call (e.g. calling use_count()) without a data race.
+	      *(long long*)(&_M_use_count) = 0;
+	      _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+	      _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+	      _M_dispose();
+	      _M_destroy();
+	      __builtin_puts("double double bye bye");
+	      return;
+	    }
+	}
+#endif
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+	{
+	  _M_release_last_use();
+	}
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 15:32           ` Jonathan Wakely
@ 2021-08-04 15:47             ` Maged Michael
  2021-08-04 15:52               ` Jonathan Wakely
  2021-08-04 17:19             ` Maged Michael
  1 sibling, 1 reply; 21+ messages in thread
From: Maged Michael @ 2021-08-04 15:47 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

Thanks, Jonathan!

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com>
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>

Yes definitely _M_release_combined makes sense. Will do.


> I'm no longer confident in the alignof suggestion I gave you.
>
> +        constexpr bool __double_word
> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +        // The ref-count members follow the vptr, so are aligned to
> +        // alignof(void*).
> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
I'd rather avoid any runtime checks because they may negate the speed
rationale for doing this optimization.
I'd be OK with not doing this optimization for any 32-bit architecture.

Is it OK to change the __align condition to the following?
constexpr bool __aligned =
  (alignof(long long) <= alignof(void*))
  && (sizeof(long long) == sizeof(void*));

Thanks,
Maged



> Secondly:
>
> +      void
> +      __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
> In that function ...
>
> +      _M_release_last_use() noexcept
> +      {
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +        _M_dispose();
> +        if (_Mutex_base<_Lp>::_S_need_barriers)
> +          {
> +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
> +          }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +                                                   -1) == 1)
> +          {
> +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +            _M_destroy();
> +          }
> +      }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The _S_atomic one
> would have your new optimization. See the attached patch showing what
> I mean. I find this version a bit simpler to understand, as we just
> have _M_release and _M_release_last_use, without
> _M_release_double_width_cas and _M_release_orig. What do you think of
> this version? Does it lose any important properties of your version
> which I've failed to notice?
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 15:47             ` Maged Michael
@ 2021-08-04 15:52               ` Jonathan Wakely
  0 siblings, 0 replies; 21+ messages in thread
From: Jonathan Wakely @ 2021-08-04 15:52 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

On Wed, 4 Aug 2021 at 16:47, Maged Michael <maged.michael@gmail.com> wrote:
>
> Thanks, Jonathan!
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely <jwakely.gcc@gmail.com> wrote:
>>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com> wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>
>
> Yes definitely _M_release_combined makes sense. Will do.

See the patch I attached to my previous mail, which has a refactored
version that gets rid of that function entirely.

>
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +        constexpr bool __double_word
>> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +        // The ref-count members follow the vptr, so are aligned to
>> +        // alignof(void*).
>> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
> I'd rather avoid any runtime checks because they may negate the speed rationale for doing this optimization.
> I'd be OK with not doing this optimization for any 32-bit architecture.
>
> Is it OK to change the __align condition to the following?
> constexpr bool __aligned =
>   (alignof(long long) <= alignof(void*))
>   && (sizeof(long long) == sizeof(void*));

Yes, that will work fine.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 15:32           ` Jonathan Wakely
  2021-08-04 15:47             ` Maged Michael
@ 2021-08-04 17:19             ` Maged Michael
  2021-08-04 17:34               ` Maged Michael
  2021-08-04 19:32               ` Jonathan Wakely
  1 sibling, 2 replies; 21+ messages in thread
From: Maged Michael @ 2021-08-04 17:19 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

Sorry. I totally missed the rest of your message and the patch. My fuzzy
eyesight, which usually guesses correctly 90% of the time, mistook
"Secondly" on a line by itself for "Sincerely" :-)

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com>
> wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's
> suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
> double-width and alignment.
> > >> - Move the code in the optimized path to a separate function
> _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
>
> I'm no longer confident in the alignof suggestion I gave you.
>
> +        constexpr bool __double_word
> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +        // The ref-count members follow the vptr, so are aligned to
> +        // alignof(void*).
> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use __alignof(long
> long) here to get the "natural" alignment, not the smaller 4B
> alignment allowed by SysV ABI. That means that we won't do this
> optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> Secondly:
>
> +      void
> +      __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.
>
The noinlining was based on looking at generated code. That was for clang.
It was inlining the _M_release_last_use function for every instance of
_M_release (e.g., ~shared_ptr). This optimization, with the noinline on
_M_release_last_use, ended up reducing the text size of some very large
binaries by 1.5% (several megabytes), which was an extra benefit. Without
the noinline we saw a code size increase.

IIUC, we can use the following. Right?

__attribute__((__noinline__))


I didn't understand the part about programmers doing #define noinline 1. I
don't see code in the patch that uses noinline.

How about something like this comment?

// Noinline to avoid code size increase.
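
Putting your suggestions together, the declaration I have in mind would be
roughly this (a sketch only, not the final patch):

// Noinline to avoid code size increase: inlining this slow path into
// every _M_release() (e.g. every ~shared_ptr()) measurably grew the text
// size of large binaries.
__attribute__((__noinline__))
void
_M_release_last_use() noexcept;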



In that function ...
>
> +      _M_release_last_use() noexcept
> +      {
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +        _M_dispose();
> +        if (_Mutex_base<_Lp>::_S_need_barriers)
> +          {
> +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
> +          }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +                                                   -1) == 1)
> +          {
> +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +            _M_destroy();
> +          }
> +      }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The _S_atomic one
> would have your new optimization. See the attached patch showing what
> I mean. I find this version a bit simpler to understand, as we just
> have _M_release and _M_release_last_use, without
> _M_release_double_width_cas and _M_release_orig. What do you think of
> this version? Does it lose any important properties of your version
> which I've failed to notice?
>

Unless I missed something I don't think the patch is correct. IIUC it uses
_M_release_last_use as if it is equivalent to what was _M_release_orig,
which is not true. The logic in _M_release_last_use assumes that the use
count has already been decremented and the return value was 1. So I don't
think the refactor works. Please correct me if I missed something.

If I am not mistaken, how about I prepare a new patch that incorporates the
following?
- __noinline__
- The change to __align in my last message.
- Remove the unnecessary checks based on _S_need_barriers in the combined /
last_use path.
- Going back to having _M_release_orig and _M_release_combined
- Include all of the comments that you added in your patch.

Would that be OK with you?

Thanks,
Maged

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 17:19             ` Maged Michael
@ 2021-08-04 17:34               ` Maged Michael
  2021-08-04 19:32               ` Jonathan Wakely
  1 sibling, 0 replies; 21+ messages in thread
From: Maged Michael @ 2021-08-04 17:34 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

On Wed, Aug 4, 2021 at 1:19 PM Maged Michael <maged.michael@gmail.com>
wrote:

> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely <jwakely.gcc@gmail.com>
> wrote:
>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael <maged.michael@gmail.com>
>> wrote:
>> > >>
>> > >> Please find attached an updated patch after incorporating Jonathan's
>> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bool-s for the conditions for lock-freedom,
>> double-width and alignment.
>> > >> - Move the code in the optimized path to a separate function
>> _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load. What
>> do you think about renaming it to _M_release_combined instead? Since
>> it does a combined load of the two counts.
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +        constexpr bool __double_word
>> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +        // The ref-count members follow the vptr, so are aligned to
>> +        // alignof(void*).
>> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use __alignof(long
>> long) here to get the "natural" alignment, not the smaller 4B
>> alignment allowed by SysV ABI. That means that we won't do this
>> optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
>> line. I think for now we should just use __alignof and live without
>> the optimization on IA32.
>>
>> Secondly:
>>
>> +      void
>> +      __attribute__ ((noinline))
>>
>> This needs to be __noinline__ because noinline is not a reserved word,
>> so users can do:
>> #define noinline 1
>> #include <memory>
>>
>> Was it performance profiling, or code-size measurements (or something
>> else?) that showed this should be non-inline?
>> I'd like to add a comment explaining why we want it to be noinline.
>>
> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_release_last_use function for every instance
> of _M_release (e.g., ~shared_ptr). This optimization, with the noinline on
> _M_release_last_use, ended up reducing the text size of some very large
> binaries by 1.5% (several megabytes), which was an extra benefit. Without
> the noinline we saw a code size increase.
>
> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))
>
>
> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> How about something like this comment?
>
> // Noinline to avoid code size increase.
>
>
>
> In that function ...
>>
>> +      _M_release_last_use() noexcept
>> +      {
>> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> +        _M_dispose();
>> +        if (_Mutex_base<_Lp>::_S_need_barriers)
>> +          {
>> +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
>> +          }
>>
>> I think we can remove this fence. The _S_need_barriers constant is
>> only true for the _S_mutex policy, and that just calls
>> _M_release_orig(), so never uses this function. I'll remove it and add
>> a comment noting that the lack of barrier is intentional.
>>
>> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> +                                                   -1) == 1)
>> +          {
>> +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> +            _M_destroy();
>> +          }
>> +      }
>>
>> Alternatively, we could keep the fence in _M_release_last_use() and
>> refactor the other _M_release functions, so that we have explicit
>> specializations for each of:
>> _Sp_counted_base<_S_single>::_M_release() (already present)
>> _Sp_counted_base<_S_mutex>::_M_release()
>> _Sp_counted_base<_S_atomic>::_M_release()
>>
>> The second and third would be new, as currently they both use the
>> definition in the primary template. The _S_mutex one would just
>> decrement _M_use_count then call _M_release_last_use() (which still
>> has the barrier needed for the _S_mutex policy). The _S_atomic one
>> would have your new optimization. See the attached patch showing what
>> I mean. I find this version a bit simpler to understand, as we just
>> have _M_release and _M_release_last_use, without
>> _M_release_double_width_cas and _M_release_orig. What do you think of
>> this version? Does it lose any important properties of your version
>> which I've failed to notice?
>>
>
> Unless I missed something I don't think the patch is correct. IIUC it uses
> _M_release_last_use as if it is equivalent to what was _M_release_orig,
> which is not true. The logic in _M_release_last_use assumes that the use
> count has already been decremented and the return value was 1. So I don't
> think the refactor works. Please correct me if I missed something.
>
Actually I take back what I said. Sorry. I think the logic in your patch
is correct. I missed some of the atomic decrements.
But I'd be concerned about performance. If we make _M_release_last_use
noinline then we are adding overhead to the fast path of the original logic
(where both counts are 1).

What do you think?



> If I am not mistaken, how about I prepare a new patch that incorporates
> the following?
> - __noinline__
> - The change to __align in my last message.
> - Remove the unnecessary checks based on _S_need_barriers in the combined
> / last_use path.
> - Going back to having _M_release_orig and _M_release_combined
> - Include all of the comments that you added in your patch.
>
> Would that be OK with you?
>
> Thanks,
> Maged
>
>


* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 17:19             ` Maged Michael
  2021-08-04 17:34               ` Maged Michael
@ 2021-08-04 19:32               ` Jonathan Wakely
  2021-08-04 19:48                 ` Maged Michael
  1 sibling, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-08-04 19:32 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
>
> Sorry. I totally missed the rest of your message and the patch. My fuzzy eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" on a line by itself for "Sincerely" :-)

:-)

> The noinlining was based on looking at generated code. That was for clang. It was inlining the _M_last_use function for every instance of _M_release (e.g., ~shared_ptr). This optimization with the noinline for _M_release_last_use ended up reducing massive binary text sizes by 1.5% (several megabytes), which was an extra benefit. Without the noinline we saw code size increase.

Wow, that is a convincing argument for making it not inline, thanks.

> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))

Right.

> I didn't understand the part about programmers doing #define noinline 1. I don't see code in the patch that uses noinline.

This is a valid C++ program:

#define noinline 1
#include <memory>
int main() { }

But if anything in <memory> uses "noinline" then this valid program
will not compile. Which is why we must use ((__noinline__)) instead of
((noinline)).
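
For example (an illustrative snippet, not part of the patch), with that
macro in effect the unreserved spelling no longer even parses, while the
reserved spelling is untouched:

  #define noinline 1
  // __attribute__((noinline)) void f();     // expands to __attribute__((1)): error
  __attribute__((__noinline__)) void g() { } // unaffected by the macro
  int main() { g(); }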



>
> How about something like this comment?
>
> // Noinline to avoid code size increase.

Great, thanks.

On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> Actually I take back what I said. Sorry. I think the logic in your patch is correct. I missed some of the atomic decrements.
> But I'd be concerned about performance. If we make _M_release_last_use noinline then we are adding overhead to the fast path of the original logic (where both counts are 1).

Oh, I see. So the code duplication serves a purpose. We want the
_M_release_last_use() code to be non-inline for the new logic, because
in the common case we never call it (we either detect that both counts
are 1 and do the dispose & destroy without atomic decrements, or we do
a single decrement and don't dispose or destroy anything). But for the
old logic, we do want that code inlined into _M_release (or
_M_release_orig as it was called in your patch). Have I got that right
now?

What if we remove the __noinline__ from _M_release_last_use() so that
it can be inlined, but then add a noinline wrapper that can be called
when we don't want to inline it?

So:
      // Called by _M_release() when the use count reaches zero.
      void
      _M_release_last_use() noexcept
      {
        // unchanged from previous patch, but without the attribute.
        // ...
      }

      // As above, but 'noinline' to reduce code size on the cold path.
      __attribute__((__noinline__))
      void
      _M_release_last_use_cold() noexcept
      { _M_release_last_use(); }


And then:

  template<>
    inline void
    _Sp_counted_base<_S_atomic>::_M_release() noexcept
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
      constexpr bool __lock_free
        = __atomic_always_lock_free(sizeof(long long), 0)
        && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // The ref-count members follow the vptr, so are aligned to
      // alignof(void*).
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
        {
          constexpr long long __unique_ref
            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
            {
              // Both counts are 1, so there are no weak references and
              // we are releasing the last strong reference. No other
              // threads can observe the effects of this _M_release()
              // call (e.g. calling use_count()) without a data race.
              *(long long*)(&_M_use_count) = 0;
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
              _M_dispose();
              _M_destroy();
              return;
            }
          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            {
              _M_release_last_use_cold();
              return;
            }
        }
      else
#endif
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          _M_release_last_use();
        }
    }


So we use the noinline version for the else branch in the new logic,
but the can-inline version for the old logic. Would that work?

We could also consider adding __attribute__((__cold__)) to the
_M_release_last_use_cold() function, and/or add [[__unlikely__]] to
the 'if' that calls it, but maybe that's overkill.

It seems like this will slightly pessimize the case where the last use
count is dropped but there are weak refs, as that will take the cold
path now. That's probably OK, given the benefits for the "both counts
are 1" case, and that it might reduce the code size for the "use count
> 1" case.

It's past 8pm here so I'll come back to this tomorrow and do some code
size measurements of my own, now that I have a clearer understanding
of what your original patch was optimizing for.


* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 19:32               ` Jonathan Wakely
@ 2021-08-04 19:48                 ` Maged Michael
  2021-12-08 11:49                   ` Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Maged Michael @ 2021-08-04 19:48 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: libstdc++, gcc-patches

On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely <jwakely.gcc@gmail.com>
wrote:

> On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> >
> > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> :-)
>
> > The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline we
> saw code size increase.
>
> Wow, that is a convincing argument for making it not inline, thanks.
>
> > IIUC, we can use the following. Right?
> >
> > __attribute__((__noinline__))
>
> Right.
>
> > I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> This is a valid C++ program:
>
> #define noinline 1
> #include <memory>
> int main() { }
>
> But if anything in <memory> uses "noinline" then this valid program
> will not compile. Which is why we must use ((__noinline__)) instead of
> ((noinline)).
>
> Thanks. Now I get it.


>
>
> >
> > How about something like this comment?
> >
> > // Noinline to avoid code size increase.
>
> Great, thanks.
>
> On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > Actually I take back what I said. Sorry. I think the logic in your patch
> is correct. I missed some of the atomic decrements.
> > But I'd be concerned about performance. If we make _M_release_last_use
> noinline then we are adding overhead to the fast path of the original logic
> (where both counts are 1).
>
> Oh, I see. So the code duplication serves a purpose. We want the
> _M_release_last_use() code to be non-inline for the new logic, because
> in the common case we never call it (we either detect that both counts
> are 1 and do the dispose & destroy without atomic decrements, or we do
> a single decrement and don't dispose or destroy anything). But for the
> old logic, we do want that code inlined into _M_release (or
> _M_release_orig as it was called in your patch). Have I got that right
> now?
>
> Yes. Completely right.


> What if we remove the __noinline__ from _M_release_last_use() so that
> it can be inlined, but then add a noinline wrapper that can be called
> when we don't want to inline it?
>
> So:
>       // Called by _M_release() when the use count reaches zero.
>       void
>       _M_release_last_use() noexcept
>       {
>         // unchanged from previous patch, but without the attribute.
>         // ...
>       }
>
>       // As above, but 'noinline' to reduce code size on the cold path.
>       __attribute__((__noinline__))
>       void
>       _M_release_last_use_cold() noexcept
>       { _M_release_last_use(); }
>
>
> And then:
>
>   template<>
>     inline void
>     _Sp_counted_base<_S_atomic>::_M_release() noexcept
>     {
>       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> #if ! _GLIBCXX_TSAN
>       constexpr bool __lock_free
>         = __atomic_always_lock_free(sizeof(long long), 0)
>         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
>       constexpr bool __double_word
>         = sizeof(long long) == 2 * sizeof(_Atomic_word);
>       // The ref-count members follow the vptr, so are aligned to
>       // alignof(void*).
>       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>         {
>           constexpr long long __unique_ref
>             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>           auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
>
>           _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>           if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) ==
> __unique_ref)
>             {
>               // Both counts are 1, so there are no weak references and
>               // we are releasing the last strong reference. No other
>               // threads can observe the effects of this _M_release()
>               // call (e.g. calling use_count()) without a data race.
>               *(long long*)(&_M_use_count) = 0;
>               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>               _M_dispose();
>               _M_destroy();
>               return;
>             }
>           if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) ==
> 1)
>             {
>               _M_release_last_use_cold();
>               return;
>             }
>         }
>       else
> #endif
>       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
>         {
>           _M_release_last_use();
>         }
>     }
>
>
> So we use the noinline version for the else branch in the new logic,
> but the can-inline version for the old logic. Would that work?
>
> Yes.


> We could also consider adding __attribute__((__cold__)) to the
> _M_release_last_use_cold() function, and/or add [[__unlikely__]] to
> the 'if' that calls it, but maybe that's overkill.
>
> It seems like this will slightly pessimize the case where the last use
> count is dropped but there are weak refs, as that will take the cold
> path now. That's probably OK, given the benefits for the "both counts
> are 1" case, and that it might reduce the code size for the "use count
> > 1" case.
>

Right, this pessimizes the case of having weak_ptr-s when the last use is
released (e.g., enable_shared_from_this).
For background, when I applied this optimization there were gains and
losses but overall net gain. Then I investigated whether the losses are
large enough to warrant developing a library for custom shared_ptr-s that
we can use only in the cases of gains and use the standard one for the
cases where we'd lose (for example if gains are 3x and losses are 2x). I
found out that overall (over 100s of services) the gains and losses are
closer to 1.1x gains and 0.1x losses, which didn't warrant the hassle of
maintaining a custom library.


>
> It's past 8pm here so I'll come back to this tomorrow and do some code
> size measurements of my own, now that I have a clearer understanding
> of what your original patch was optimizing for.
>

Thank you very much, Jonathan!


* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-08-04 19:48                 ` Maged Michael
@ 2021-12-08 11:49                   ` Jonathan Wakely
  2021-12-08 19:15                     ` Rainer Orth
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-12-08 11:49 UTC (permalink / raw)
  To: Maged Michael; +Cc: libstdc++, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 7176 bytes --]

I've pushed this change to trunk now (it was posted and reviewed in
stage 1, I just didn't get around to pushing it until now).

The final version of the patch is attached to this mail.

Thanks for the nice optimization, Maged!


On Wed, 4 Aug 2021 at 20:49, Maged Michael via Libstdc++
<libstdc++@gcc.gnu.org> wrote:
>
> On Wed, Aug 4, 2021 at 3:32 PM Jonathan Wakely <jwakely.gcc@gmail.com>
> wrote:
>
> > On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
> > >
> > > Sorry. I totally missed the rest of your message and the patch. My fuzzy
> > eyesight, which usually guesses correctly 90% of the time, mistook
> > "Secondly" on a line by itself for "Sincerely" :-)
> >
> > :-)
> >
> > > The noinlining was based on looking at generated code. That was for
> > clang. It was inlining the _M_last_use function for every instance of
> > _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> > _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> > (several megabytes), which was an extra benefit. Without the noinline we
> > saw code size increase.
> >
> > Wow, that is a convincing argument for making it not inline, thanks.
> >
> > > IIUC, we can use the following. Right?
> > >
> > > __attribute__((__noinline__))
> >
> > Right.
> >
> > > I didn't understand the part about programmers doing #define noinline 1.
> > I don't see code in the patch that uses noinline.
> >
> > This is a valid C++ program:
> >
> > #define noinline 1
> > #include <memory>
> > int main() { }
> >
> > But if anything in <memory> uses "noinline" then this valid program
> > will not compile. Which is why we must use ((__noinline__)) instead of
> > ((noinline)).
> >
> > Thanks. Now I get it.
>
>
> >
> >
> > >
> > > How about something like this comment?
> > >
> > > // Noinline to avoid code size increase.
> >
> > Great, thanks.
> >
> > On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> > > Actually I take back what I said. Sorry. I think the logic in your patch
> > is correct. I missed some of the atomic decrements.
> > > But I'd be concerned about performance. If we make _M_release_last_use
> > noinline then we are adding overhead to the fast path of the original logic
> > (where both counts are 1).
> >
> > Oh, I see. So the code duplication serves a purpose. We want the
> > _M_release_last_use() code to be non-inline for the new logic, because
> > in the common case we never call it (we either detect that both counts
> > are 1 and do the dispose & destroy without atomic decrements, or we do
> > a single decrement and don't dispose or destroy anything). But for the
> > old logic, we do want that code inlined into _M_release (or
> > _M_release_orig as it was called in your patch). Have I got that right
> > now?
> >
> > Yes. Completely right.
>
>
> > What if we remove the __noinline__ from _M_release_last_use() so that
> > it can be inlined, but then add a noinline wrapper that can be called
> > when we don't want to inline it?
> >
> > So:
> >       // Called by _M_release() when the use count reaches zero.
> >       void
> >       _M_release_last_use() noexcept
> >       {
> >         // unchanged from previous patch, but without the attribute.
> >         // ...
> >       }
> >
> >       // As above, but 'noinline' to reduce code size on the cold path.
> >       __attribute__((__noinline__))
> >       void
> >       _M_release_last_use_cold() noexcept
> >       { _M_release_last_use(); }
> >
> >
> > And then:
> >
> >   template<>
> >     inline void
> >     _Sp_counted_base<_S_atomic>::_M_release() noexcept
> >     {
> >       _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
> > #if ! _GLIBCXX_TSAN
> >       constexpr bool __lock_free
> >         = __atomic_always_lock_free(sizeof(long long), 0)
> >         && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
> >       constexpr bool __double_word
> >         = sizeof(long long) == 2 * sizeof(_Atomic_word);
> >       // The ref-count members follow the vptr, so are aligned to
> >       // alignof(void*).
> >       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
> >       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
> >         {
> >           constexpr long long __unique_ref
> >             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >           auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
> >
> >           _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> >           if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) ==
> > __unique_ref)
> >             {
> >               // Both counts are 1, so there are no weak references and
> >               // we are releasing the last strong reference. No other
> >               // threads can observe the effects of this _M_release()
> >               // call (e.g. calling use_count()) without a data race.
> >               *(long long*)(&_M_use_count) = 0;
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> >               _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> >               _M_dispose();
> >               _M_destroy();
> >               return;
> >             }
> >           if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) ==
> > 1)
> >             {
> >               _M_release_last_use_cold();
> >               return;
> >             }
> >         }
> >       else
> > #endif
> >       if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
> >         {
> >           _M_release_last_use();
> >         }
> >     }
> >
> >
> > So we use the noinline version for the else branch in the new logic,
> > but the can-inline version for the old logic. Would that work?
> >
> > Yes.
>
>
> > We could also consider adding __attribute__((__cold__)) to the
> > _M_release_last_use_cold() function, and/or add [[__unlikely__]] to
> > the 'if' that calls it, but maybe that's overkill.
> >
> > It seems like this will slightly pessimize the case where the last use
> > count is dropped but there are weak refs, as that will take the cold
> > path now. That's probably OK, given the benefits for the "both counts
> > are 1" case, and that it might reduce the code size for the "use count
> > > 1" case.
> >
>
> Right, this pessimizes the case of having weak_ptr-s when the last use is
> released (e.g., enable_shared_from_this).
> For background, when I applied this optimization there were gains and
> losses but overall net gain. Then I investigated whether the losses are
> large enough to warrant developing a library for custom shared_ptr-s that
> we can use only in the cases of gains and use the standard one for the
> cases where we'd lose (for example if gains are 3x and losses are 2x). I
> found out that overall (over 100s of services) the gains and losses are
> closer to 1.1x gains and 0.1x losses, which didn't warrant the hassle of
> maintaining a custom library.
>
>
> >
> > It's past 8pm here so I'll come back to this tomorrow and do some code
> > size measurements of my own, now that I have a clearer understanding
> > of what your original patch was optimizing for.
> >
>
> Thank you very much, Jonathan!
>

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 10807 bytes --]

commit dbf8bd3c2f2cd2d27ca4f0fe379bd9490273c6d7
Author: Maged Michael <maged.michael@gmail.com>
Date:   Tue Dec 7 15:20:58 2021

    libstdc++: Skip atomic instructions in shared_ptr when both counts are 1
    
    This rewrites _Sp_counted_base::_M_release to skip the two atomic
    instructions that decrement each of the use count and the weak count
    when both are 1.
    
    Benefits: Save the cost of the last atomic decrements of each of the use
    count and the weak count in _Sp_counted_base. Atomic instructions are
    significantly slower than regular loads and stores across major
    architectures.
    
    How current code works: _M_release() atomically decrements the use
    count, checks if it was 1, if so calls _M_dispose(), atomically
    decrements the weak count, checks if it was 1, and if so calls
    _M_destroy().
    
    How the proposed algorithm works: _M_release() loads both use count and
    weak count together atomically (assuming suitable alignment, discussed
    later), checks if the value corresponds to a 0x1 value in the individual
    count members, and if so calls _M_dispose() and _M_destroy().
    Otherwise, it follows the original algorithm.
    
    Why it works: When the current thread executing _M_release() finds each
    of the counts is equal to 1, then no other threads could possibly hold
    use or weak references to this control block. That is, no other threads
    could possibly access the counts or the protected object.
    
    There are two crucial high-level issues that I'd like to point out first:
    - Atomicity of access to the counts together
    - Proper alignment of the counts together
    
    The patch is intended to apply the proposed algorithm only to the case of
    64-bit mode, 4-byte counts, and 8-byte aligned _Sp_counted_base.
    
    ** Atomicity **
    - The proposed algorithm depends on the mutual atomicity among 8-byte
    atomic operations and 4-byte atomic operations on each of the 4-byte halves
    of the 8-byte aligned 8-byte block.
    - The standard does not guarantee atomicity of 8-byte operations on a pair
    of 8-byte aligned 4-byte objects.
    - To my knowledge this works in practice on systems that guarantee native
    implementation of 4-byte and 8-byte atomic operations.
    - __atomic_always_lock_free is used to check for native atomic operations.
    
    ** Alignment **
    - _Sp_counted_base is an internal base class, with a virtual destructor,
    so it has a vptr at the beginning of the class, and will be aligned to
    alignof(void*) i.e. 8 bytes.
    - The first members of the class are the 4-byte use count and 4-byte
    weak count, which will occupy 8 contiguous bytes immediately after the
    vptr, i.e. they form an 8-byte aligned 8 byte range.
    
    Other points:
    - The proposed algorithm can interact correctly with the current algorithm.
    That is, multiple threads using different versions of the code with and
    without the patch operating on the same objects should always interact
    correctly. The intent for the patch is to be ABI compatible with the
    current implementation.
    - The proposed patch involves a performance trade-off between saving the
    costs of atomic instructions when the counts are both 1 vs adding the cost
    of loading the 8-byte combined counts and comparison with {0x1, 0x1}.
    - I noticed a big difference between the code generated by GCC vs LLVM. GCC
    seems to generate noticeably more code and what seems to be redundant null
    checks and branches.
    - The patch has been in use (built using LLVM) in a large environment for
    many months. The performance gains outweigh the losses (roughly 10 to 1)
    across a large variety of workloads.
    
    Signed-off-by: Jonathan Wakely <jwakely@redhat.com>
    
    Co-authored-by: Jonathan Wakely <jwakely@redhat.com>
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/c++config (_GLIBCXX_TSAN): Define macro
            indicating that TSan is in use.
            * include/bits/shared_ptr_base.h (_Sp_counted_base::_M_release):
            Replace definition in primary template with explicit
            specializations for _S_mutex and _S_atomic policies.
            (_Sp_counted_base<_S_mutex>::_M_release): New specialization.
            (_Sp_counted_base<_S_atomic>::_M_release): New specialization,
            using a single atomic load to access both reference counts at
            once.
            (_Sp_counted_base::_M_release_last_use): New member function.

diff --git a/libstdc++-v3/include/bits/c++config b/libstdc++-v3/include/bits/c++config
index 90513ccae38..f2d704f57eb 100644
--- a/libstdc++-v3/include/bits/c++config
+++ b/libstdc++-v3/include/bits/c++config
@@ -577,6 +577,15 @@ namespace std
   do { __glibcxx_constexpr_assert(cond); } while (false)
 #endif
 
+// Macro indicating that TSAN is in use.
+#if __SANITIZE_THREAD__
+#  define _GLIBCXX_TSAN 1
+#elif defined __has_feature
+# if __has_feature(thread_sanitizer)
+#  define _GLIBCXX_TSAN 1
+# endif
+#endif
+
 // Macros for race detectors.
 // _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) and
 // _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) should be used to explain
diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 3473a74280d..90ad30947b0 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -143,10 +143,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       virtual void*
       _M_get_deleter(const std::type_info&) noexcept = 0;
 
+      // Increment the use count (used when the count is greater than zero).
       void
       _M_add_ref_copy()
       { __gnu_cxx::__atomic_add_dispatch(&_M_use_count, 1); }
 
+      // Increment the use count if it is non-zero, throw otherwise.
       void
       _M_add_ref_lock()
       {
@@ -154,42 +156,51 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
 	  __throw_bad_weak_ptr();
       }
 
+      // Increment the use count if it is non-zero.
       bool
       _M_add_ref_lock_nothrow() noexcept;
 
+      // Decrement the use count.
       void
-      _M_release() noexcept
-      {
-        // Be race-detector-friendly.  For more info see bits/c++config.
-        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
-	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
-	  {
-            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
-	    _M_dispose();
-	    // There must be a memory barrier between dispose() and destroy()
-	    // to ensure that the effects of dispose() are observed in the
-	    // thread that runs destroy().
-	    // See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
-	    if (_Mutex_base<_Lp>::_S_need_barriers)
-	      {
-		__atomic_thread_fence (__ATOMIC_ACQ_REL);
-	      }
+      _M_release() noexcept;
 
-            // Be race-detector-friendly.  For more info see bits/c++config.
-            _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
-	    if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
-						       -1) == 1)
-              {
-                _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
-	        _M_destroy();
-              }
+      // Called by _M_release() when the use count reaches zero.
+      void
+      _M_release_last_use() noexcept
+      {
+	_GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+	_M_dispose();
+	// There must be a memory barrier between dispose() and destroy()
+	// to ensure that the effects of dispose() are observed in the
+	// thread that runs destroy().
+	// See http://gcc.gnu.org/ml/libstdc++/2005-11/msg00136.html
+	if (_Mutex_base<_Lp>::_S_need_barriers)
+	  {
+	    __atomic_thread_fence (__ATOMIC_ACQ_REL);
+	  }
+
+	// Be race-detector-friendly.  For more info see bits/c++config.
+	_GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+	if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
+						   -1) == 1)
+	  {
+	    _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+	    _M_destroy();
 	  }
       }
 
+      // As above, but 'noinline' to reduce code size on the cold path.
+      __attribute__((__noinline__))
+      void
+      _M_release_last_use_cold() noexcept
+      { _M_release_last_use(); }
+
+      // Increment the weak count.
       void
       _M_weak_add_ref() noexcept
       { __gnu_cxx::__atomic_add_dispatch(&_M_weak_count, 1); }
 
+      // Decrement the weak count.
       void
       _M_weak_release() noexcept
       {
@@ -286,6 +297,67 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
         }
     }
 
+  template<>
+    inline void
+    _Sp_counted_base<_S_mutex>::_M_release() noexcept
+    {
+      // Be race-detector-friendly.  For more info see bits/c++config.
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+	{
+	  _M_release_last_use();
+	}
+    }
+
+  template<>
+    inline void
+    _Sp_counted_base<_S_atomic>::_M_release() noexcept
+    {
+      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
+#if ! _GLIBCXX_TSAN
+      constexpr bool __lock_free
+	= __atomic_always_lock_free(sizeof(long long), 0)
+	&& __atomic_always_lock_free(sizeof(_Atomic_word), 0);
+      constexpr bool __double_word
+	= sizeof(long long) == 2 * sizeof(_Atomic_word);
+      // The ref-count members follow the vptr, so are aligned to
+      // alignof(void*).
+      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
+      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
+	{
+	  constexpr long long __unique_ref
+	    = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
+	  auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
+
+	  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
+	  if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
+	    {
+	      // Both counts are 1, so there are no weak references and
+	      // we are releasing the last strong reference. No other
+	      // threads can observe the effects of this _M_release()
+	      // call (e.g. calling use_count()) without a data race.
+	      *(long long*)(&_M_use_count) = 0;
+	      _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
+	      _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
+	      _M_dispose();
+	      _M_destroy();
+	      return;
+	    }
+	  if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+	    [[__unlikely__]]
+	    {
+	      _M_release_last_use_cold();
+	      return;
+	    }
+	}
+      else
+#endif
+      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
+	{
+	  _M_release_last_use();
+	}
+    }
+
   template<>
     inline void
     _Sp_counted_base<_S_single>::_M_weak_add_ref() noexcept


* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-12-08 11:49                   ` Jonathan Wakely
@ 2021-12-08 19:15                     ` Rainer Orth
  2021-12-08 19:21                       ` Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Rainer Orth @ 2021-12-08 19:15 UTC (permalink / raw)
  To: Jonathan Wakely via Gcc-patches; +Cc: Maged Michael, Jonathan Wakely, libstdc++

Hi Jonathan,

> I've pushed this change to trunk now (it was posted and reviewed in
> stage 1, I just didn't get around to pushing it until now).
>
> The final version of the patch is attached to this mail.

unfortunately, it breaks Solaris/SPARC bootstrap:

In file included from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
                 from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
                 from /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
/var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: error: right operand of shift expression '(1 << 64)' is greater than or equal to the precision 64 of the left operand [-fpermissive]
  329 |             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
      |                     ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
make[9]: *** [Makefile:1875: sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1

For 64-bit SPARC, _Atomic_word is long.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University


* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-12-08 19:15                     ` Rainer Orth
@ 2021-12-08 19:21                       ` Jonathan Wakely
  2021-12-08 19:27                         ` Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-12-08 19:21 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Jonathan Wakely via Gcc-patches, libstdc++

On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote:
>
> Hi Jonathan,
>
> > I've pushed this change to trunk now (it was posted and reviewed in
> > stage 1, I just didn't get around to pushing it until now).
> >
> > The final version of the patch is attached to this mail.
>
> unfortunately, it breaks Solaris/SPARC bootstrap:
>
> In file included from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
>                  from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
>                  from /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
> /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: error: right operand of shift expression '(1 << 64)' is greater than or equal to the precision 64 of the left operand [-fpermissive]
>   329 |             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
>       |                     ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> make[9]: *** [Makefile:1875: sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1
>
> For 64-bit SPARC, _Atomic_word is long.

Ah yes, so we need to disable this optimization. Patch coming up ...



* Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
  2021-12-08 19:21                       ` Jonathan Wakely
@ 2021-12-08 19:27                         ` Jonathan Wakely
  2021-12-08 23:47                           ` [committed] libstdc++: Fix undefined shift when _Atomic_word is 64-bit Jonathan Wakely
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-12-08 19:27 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Jonathan Wakely via Gcc-patches, libstdc++

On Wed, 8 Dec 2021 at 19:21, Jonathan Wakely wrote:
>
> On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote:
> >
> > Hi Jonathan,
> >
> > > I've pushed this change to trunk now (it was posted and reviewed in
> > > stage 1, I just didn't get around to pushing it until now).
> > >
> > > The final version of the patch is attached to this mail.
> >
> > unfortunately, it breaks Solaris/SPARC bootstrap:
> >
> > In file included from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
> >                  from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
> >                  from /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
> > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: error: right operand of shift expression '(1 << 64)' is greater than or equal to the precision 64 of the left operand [-fpermissive]
> >   329 |             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> >       |                     ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > make[9]: *** [Makefile:1875: sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1
> >
> > For 64-bit SPARC, _Atomic_word is long.
>
> Ah yes, so we need to disable this optimization. Patch coming up ...

Gah, I remembered to check that:

      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // The ref-count members follow the vptr, so are aligned to
      // alignof(void*).
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)


But for C++11 and C++14 that is a normal runtime condition not
if-constexpr, so the undefined shift still gets compiled, even though
it can never be reached at runtime.
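
A stripped-down sketch of what goes wrong and the obvious way out (the
typedef below just mimics the 64-bit _Atomic_word case; the names are
illustrative, not the actual libstdc++ code):

  typedef long _Atomic_word;   // 8 bytes on sparcv9 and other LP64 targets

  constexpr bool __double_word
    = sizeof(long long) == 2 * sizeof(_Atomic_word);   // false here

  // Rejected even though the guarded branch can never be taken at run
  // time, because 1LL << 64 is not a valid constant expression:
  //   constexpr long long __unique_ref
  //     = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));

  // Making the shift width itself depend on the condition keeps it valid:
  constexpr int __shiftbits
    = __double_word ? int(__CHAR_BIT__ * sizeof(_Atomic_word)) : 0;
  constexpr long long __unique_ref = 1LL + (1LL << __shiftbits);

  int main() { }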



* [committed] libstdc++: Fix undefined shift when _Atomic_word is 64-bit
  2021-12-08 19:27                         ` Jonathan Wakely
@ 2021-12-08 23:47                           ` Jonathan Wakely
  2021-12-09 12:28                             ` Rainer Orth
  0 siblings, 1 reply; 21+ messages in thread
From: Jonathan Wakely @ 2021-12-08 23:47 UTC (permalink / raw)
  To: Rainer Orth; +Cc: Jonathan Wakely via Gcc-patches, libstdc++

[-- Attachment #1: Type: text/plain, Size: 2443 bytes --]

On Wed, 8 Dec 2021 at 19:27, Jonathan Wakely wrote:
>
> On Wed, 8 Dec 2021 at 19:21, Jonathan Wakely wrote:
> >
> > On Wed, 8 Dec 2021 at 19:17, Rainer Orth wrote:
> > >
> > > Hi Jonathan,
> > >
> > > > I've pushed this change to trunk now (it was posted and reviewed in
> > > > stage 1, I just didn't get around to pushing it until now).
> > > >
> > > > The final version of the patch is attached to this mail.
> > >
> > > unfortunately, it breaks Solaris/SPARC bootstrap:
> > >
> > > In file included from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr.h:53,
> > >                  from /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/memory:77,
> > >                  from /vol/gcc/src/hg/master/local/libstdc++-v3/include/precompiled/stdc++.h:82:
> > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h: In member function 'void std::_Sp_counted_base<_Lp>::_M_release() [with __gnu_cxx::_Lock_policy _Lp = __gnu_cxx::_S_atomic]':
> > > /var/gcc/regression/master/11.4-gcc-gas/build/sparc-sun-solaris2.11/sparcv9/libstdc++-v3/include/bits/shared_ptr_base.h:329:26: error: right operand of shift expression '(1 << 64)' is greater than or equal to the precision 64 of the left operand [-fpermissive]
> > >   329 |             = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
> > >       |                     ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > make[9]: *** [Makefile:1875: sparc-sun-solaris2.11/bits/stdc++.h.gch/O2ggnu++0x.gch] Error 1
> > >
> > > For 64-bit SPARC, _Atomic_word is long.
> >
> > Ah yes, so we need to disable this optimization. Patch coming up ...
>
> Gah, I remembered to check that:
>
>       constexpr bool __double_word
>         = sizeof(long long) == 2 * sizeof(_Atomic_word);
>       // The ref-count members follow the vptr, so are aligned to
>       // alignof(void*).
>       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>
>
> But for C++11 and C++14 that is a normal runtime condition not
> if-constexpr, so the undefined shift still gets compiled, even though
> it can never be reached at runtime.

Fixed like so. Tested sparc-sun-solaris2.11, pushed to trunk.

[-- Attachment #2: patch.txt --]
[-- Type: text/plain, Size: 1632 bytes --]

commit c15aa46cca0649b68613d3292cf71c7cc57ef78f
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Wed Dec 8 19:36:24 2021

    libstdc++: Fix undefined shift when _Atomic_word is 64-bit
    
    The check for _Atomic_word being 32-bit is just a normal runtime
    condition for C++11 and C++14, because it doesn't use if-constexpr. That
    means the 1LL << (CHAR_BIT * sizeof(_Atomic_word)) expression expands to
    1LL << 64 on Solaris, which is ill-formed.
    
    This adds another indirection so that the shift width is zero if the
    code is unreachable.
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/shared_ptr_base.h (_Sp_counted_base::_M_release()):
            Make shift width conditional on __double_word condition.

diff --git a/libstdc++-v3/include/bits/shared_ptr_base.h b/libstdc++-v3/include/bits/shared_ptr_base.h
index 90ad30947b0..f315d8f354f 100644
--- a/libstdc++-v3/include/bits/shared_ptr_base.h
+++ b/libstdc++-v3/include/bits/shared_ptr_base.h
@@ -325,8 +325,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
 	{
-	  constexpr long long __unique_ref
-	    = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
+	  constexpr int __wordbits = __CHAR_BIT__ * sizeof(_Atomic_word);
+	  constexpr int __shiftbits = __double_word ? __wordbits : 0;
+	  constexpr long long __unique_ref = 1LL + (1LL << __shiftbits);
 	  auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);
 
 	  _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);


* Re: [committed] libstdc++: Fix undefined shift when _Atomic_word is 64-bit
  2021-12-08 23:47                           ` [committed] libstdc++: Fix undefined shift when _Atomic_word is 64-bit Jonathan Wakely
@ 2021-12-09 12:28                             ` Rainer Orth
  0 siblings, 0 replies; 21+ messages in thread
From: Rainer Orth @ 2021-12-09 12:28 UTC (permalink / raw)
  To: Jonathan Wakely; +Cc: Jonathan Wakely via Gcc-patches, libstdc++

Hi Jonathan,

>> > Ah yes, so we need to disable this optimization. Patch coming up ...
>>
>> Gah, I remembered to check that:
>>
>>       constexpr bool __double_word
>>         = sizeof(long long) == 2 * sizeof(_Atomic_word);
>>       // The ref-count members follow the vptr, so are aligned to
>>       // alignof(void*).
>>       constexpr bool __aligned = __alignof(long long) <= alignof(void*);
>>       if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
>>
>>
>> But for C++11 and C++14 that is a normal runtime condition not
>> if-constexpr, so the undefined shift still gets compiled, even though
>> it can never be reached at runtime.
>
> Fixed like so. Tested sparc-sun-solaris2.11, pushed to trunk.

great, thanks a lot.

	Rainer

-- 
-----------------------------------------------------------------------------
Rainer Orth, Center for Biotechnology, Bielefeld University


