From: Maged Michael
Date: Wed, 4 Aug 2021 13:19:23 -0400
Subject: Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
To: Jonathan Wakely
Cc: libstdc++, gcc-patches

Sorry. I totally missed the rest of your message and the patch. My fuzzy
eyesight, which usually guesses correctly 90% of the time, mistook
"Secondly" on a line by itself for "Sincerely" :-)

On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely wrote:
> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
> >
> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
> > >
> > > This is the right patch. The previous one is missing noexcept. Sorry.
> > >
> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote:
> > >>
> > >> Please find attached an updated patch after incorporating Jonathan's suggestions.
> > >>
> > >> Changes from the last patch include:
> > >> - Add a TSAN macro to bits/c++config.
> > >> - Use separate constexpr bool-s for the conditions for lock-freedom, double-width and alignment.
> > >> - Move the code in the optimized path to a separate function _M_release_double_width_cas.
> >
> > Thanks for the updated patch. At a quick glance it looks great. I'll
> > apply it locally and test it tomorrow.
>
> It occurs to me now that _M_release_double_width_cas is the wrong
> name, because we're not doing a DWCAS, just a double-width load. What
> do you think about renaming it to _M_release_combined instead? Since
> it does a combined load of the two counts.
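
For concreteness, the fast path being discussed is roughly the following
standalone sketch (toy class and names, not the actual patch; it assumes
the two counts sit adjacent in memory after the vptr and uses the GCC
__atomic_load_n builtin to read them together):

    // Toy sketch of the "combined load" fast path; not libstdc++ code.
    struct toy_counted_base
    {
      virtual ~toy_counted_base() = default;
      virtual void dispose() noexcept { }              // would destroy the managed object
      virtual void destroy() noexcept { delete this; } // would free the control block

      // Stand-ins for the two adjacent _Atomic_word members that follow
      // the vptr in _Sp_counted_base.
      int use_count  = 1;
      int weak_count = 1;

      void release() noexcept
      {
        constexpr bool double_word = sizeof(long long) == 2 * sizeof(int);
        // __alignof gives the natural alignment (see the discussion below).
        constexpr bool aligned = __alignof(long long) <= alignof(void*);
        if (double_word && aligned)
          {
            // Value of the combined word when both counts are exactly 1.
            constexpr long long unique_ref = 1LL + (1LL << (8 * sizeof(int)));
            // A single atomic load reads both counts at once: a combined
            // load, not a double-width CAS.
            long long both = __atomic_load_n(
                reinterpret_cast<long long*>(&use_count), __ATOMIC_ACQUIRE);
            if (both == unique_ref)
              {
                // This thread holds the only strong and weak references,
                // so no atomic read-modify-write operations are needed.
                dispose();
                destroy();
                return;
              }
          }
        // Fallback: the usual two atomic decrements (elided in this sketch).
      }
    };

    int main()
    {
      (new toy_counted_base())->release(); // takes the fast path and frees itself
    }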
> I'm no longer confident in the alignof suggestion I gave you.
>
> +        constexpr bool __double_word
> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
> +        // The ref-count members follow the vptr, so are aligned to
> +        // alignof(void*).
> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>
> For IA32 (i.e. 32-bit x86) this constant will be true, because
> alignof(long long) and alignof(void*) are both 4, even though
> sizeof(long long) is 8. So in theory the _M_use_count and
> _M_weak_count members could be in different cache lines, and the
> atomic load will not be atomic. I think we want to use
> __alignof(long long) here to get the "natural" alignment, not the
> smaller 4B alignment allowed by SysV ABI. That means that we won't do
> this optimization on IA32, but I think that's acceptable.
>
> Alternatively, we could keep using alignof, but with an additional
> run-time check something like (uintptr_t)&_M_use_count / 64 ==
> (uintptr_t)&_M_weak_count / 64 to check they're on the same cache
> line. I think for now we should just use __alignof and live without
> the optimization on IA32.
>
> Secondly:
>
> +      void
> +      __attribute__ ((noinline))
>
> This needs to be __noinline__ because noinline is not a reserved word,
> so users can do:
> #define noinline 1
> #include <memory>
>
> Was it performance profiling, or code-size measurements (or something
> else?) that showed this should be non-inline?
> I'd like to add a comment explaining why we want it to be noinline.

The noinlining was based on looking at generated code. That was for
clang. It was inlining the _M_last_use function for every instance of
_M_release (e.g., ~shared_ptr). This optimization, with the noinline on
_M_release_last_use, ended up reducing the text size of some very large
binaries by 1.5% (several megabytes), which was an extra benefit. Without
the noinline we saw a code size increase.

IIUC, we can use the following, right?
__attribute__((__noinline__))

I didn't understand the part about programmers doing
#define noinline 1
I don't see code in the patch that uses noinline.

How about something like this comment?
// Noinline to avoid code size increase.

In that function ...

> +      _M_release_last_use() noexcept
> +      {
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
> +        _M_dispose();
> +        if (_Mutex_base<_Lp>::_S_need_barriers)
> +          {
> +            __atomic_thread_fence (__ATOMIC_ACQ_REL);
> +          }
>
> I think we can remove this fence. The _S_need_barriers constant is
> only true for the _S_mutex policy, and that just calls
> _M_release_orig(), so never uses this function. I'll remove it and add
> a comment noting that the lack of barrier is intentional.
>
> +        _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
> +        if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
> +                                                   -1) == 1)
> +          {
> +            _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
> +            _M_destroy();
> +          }
> +      }
>
> Alternatively, we could keep the fence in _M_release_last_use() and
> refactor the other _M_release functions, so that we have explicit
> specializations for each of:
> _Sp_counted_base<_S_single>::_M_release() (already present)
> _Sp_counted_base<_S_mutex>::_M_release()
> _Sp_counted_base<_S_atomic>::_M_release()
>
> The second and third would be new, as currently they both use the
> definition in the primary template. The _S_mutex one would just
> decrement _M_use_count then call _M_release_last_use() (which still
> has the barrier needed for the _S_mutex policy). The _S_atomic one
> would have your new optimization. See the attached patch showing what
> I mean. I find this version a bit simpler to understand, as we just
> have _M_release and _M_release_last_use, without
> _M_release_double_width_cas and _M_release_orig. What do you think of
> this version? Does it lose any important properties of your version
> which I've failed to notice?
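
Coming back to the alignment point above for a moment: the difference
between alignof and __alignof, and the run-time check you describe, can
be seen with a small standalone program (toy struct and names, purely
illustrative, not libstdc++ code):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Toy stand-in for the two adjacent reference-count members.
    struct counts { int use; int weak; };

    // The run-time check described above: are both members on the same
    // 64-byte cache line?
    static bool same_cache_line(const counts& c)
    {
      return (std::uintptr_t)&c.use / 64 == (std::uintptr_t)&c.weak / 64;
    }

    int main()
    {
      counts c{1, 1};
      // On the SysV i386 ABI alignof(long long) is 4, so an 8-byte pair
      // can legally straddle a cache line; __alignof(long long) reports
      // the natural alignment of 8 on both i386 and x86-64.
      std::printf("alignof(long long)   = %zu\n", alignof(long long));
      std::printf("__alignof(long long) = %zu\n",
                  static_cast<std::size_t>(__alignof(long long)));
      std::printf("same cache line      = %s\n",
                  same_cache_line(c) ? "yes" : "no");
    }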
On the refactoring, unless I missed something I don't think the patch is
correct. IIUC it uses _M_release_last_use as if it is equivalent to what
was _M_release_orig, which is not true. The logic in _M_release_last_use
assumes that the use count has already been decremented and the return
value was 1. So I don't think the refactor works. Please correct me if I
missed something (see the P.S. below for a toy illustration of what I
mean).

If I am not mistaken, how about I prepare a new patch that incorporates
the following?
- __noinline__
- The change to __align in my last message.
- Remove the unnecessary checks based on _S_need_barriers in the
  combined / last_use path.
- Going back to having _M_release_orig and _M_release_combined.
- Include all of the comments that you added in your patch.

Would that be OK with you?

Thanks,
Maged
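
P.S. To make the precondition point concrete, here is a tiny standalone
model (toy names, and it prints instead of freeing anything; not the real
code):

    #include <atomic>
    #include <cstdio>

    struct toy_control_block
    {
      std::atomic<int> use{1};
      std::atomic<int> weak{1};

      // Models _M_release_last_use. Precondition: the caller has already
      // decremented `use` and observed the old value 1. Substituting it
      // for release_orig() would skip that decrement entirely.
      void release_last_use()
      {
        std::puts("dispose managed object");
        if (weak.fetch_sub(1) == 1)
          std::puts("destroy control block"); // the real code frees it here
      }

      // Models _M_release_orig: it performs the use-count decrement itself.
      void release_orig()
      {
        if (use.fetch_sub(1) == 1)   // old value 1 => this was the last use
          release_last_use();
      }
    };

    int main()
    {
      toy_control_block b;
      b.release_orig();   // correct entry point for an ordinary release
    }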