From: Maged Michael
Date: Wed, 4 Aug 2021 13:34:21 -0400
Subject: Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
To: Jonathan Wakely
Cc: "libstdc++" <libstdc++@gcc.gnu.org>, gcc-patches <gcc-patches@gcc.gnu.org>

On Wed, Aug 4, 2021 at 1:19 PM Maged Michael wrote:

> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)
>
> On Wed, Aug 4, 2021 at 11:32 AM Jonathan Wakely wrote:
>
>> On Tue, 3 Aug 2021 at 21:59, Jonathan Wakely wrote:
>> >
>> > On Mon, 2 Aug 2021 at 14:29, Maged Michael wrote:
>> > >
>> > > This is the right patch. The previous one is missing noexcept. Sorry.
>> > >
>> > > On Mon, Aug 2, 2021 at 9:23 AM Maged Michael wrote:
>> > >>
>> > >> Please find attached an updated patch incorporating Jonathan's
>> > >> suggestions.
>> > >>
>> > >> Changes from the last patch include:
>> > >> - Add a TSAN macro to bits/c++config.
>> > >> - Use separate constexpr bools for the lock-freedom, double-width,
>> > >>   and alignment conditions.
>> > >> - Move the code in the optimized path to a separate function,
>> > >>   _M_release_double_width_cas.
>> >
>> > Thanks for the updated patch. At a quick glance it looks great. I'll
>> > apply it locally and test it tomorrow.
>>
>> It occurs to me now that _M_release_double_width_cas is the wrong
>> name, because we're not doing a DWCAS, just a double-width load.
>> What do you think about renaming it to _M_release_combined instead,
>> since it does a combined load of the two counts?
>>
>> I'm no longer confident in the alignof suggestion I gave you.
>>
>> +        constexpr bool __double_word
>> +          = sizeof(long long) == 2 * sizeof(_Atomic_word);
>> +        // The ref-count members follow the vptr, so are aligned to
>> +        // alignof(void*).
>> +        constexpr bool __aligned = alignof(long long) <= alignof(void*);
>>
>> For IA32 (i.e. 32-bit x86) this constant will be true, because
>> alignof(long long) and alignof(void*) are both 4, even though
>> sizeof(long long) is 8. So in theory the _M_use_count and
>> _M_weak_count members could be in different cache lines, and the
>> atomic load will not be atomic. I think we want to use
>> __alignof(long long) here to get the "natural" alignment, not the
>> smaller 4B alignment allowed by the SysV ABI. That means we won't do
>> this optimization on IA32, but I think that's acceptable.
>>
>> Alternatively, we could keep using alignof, but with an additional
>> run-time check, something like (uintptr_t)&_M_use_count / 64 ==
>> (uintptr_t)&_M_weak_count / 64, to check that they're on the same
>> cache line. I think for now we should just use __alignof and live
>> without the optimization on IA32.
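
To make sure I follow the alignment point, here is a tiny stand-alone
sketch of the three conditions plus the run-time same-cache-line
alternative. The mock struct and all names here are illustrative
stand-ins, not the patch itself:

    // Illustrative only: _Atomic_word is int on most targets, and the
    // mock control block just reproduces the vptr-then-counts layout.
    #include <cstdint>
    #include <cstdio>

    typedef int _Atomic_word;

    struct _Sp_counted_base_mock
    {
      virtual ~_Sp_counted_base_mock() { }
      _Atomic_word _M_use_count = 1;   // follows the vptr, so the pair
      _Atomic_word _M_weak_count = 1;  // is aligned to alignof(void*)
    };

    int main()
    {
      constexpr bool __lock_free
        = __atomic_always_lock_free(sizeof(long long), 0);
      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // __alignof is the natural alignment: 8 for long long on IA32,
      // where alignof (the SysV ABI minimum) is only 4.
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);

      _Sp_counted_base_mock __b;
      // The run-time alternative: both counts on one 64-byte line.
      bool __same_line
        = (std::uintptr_t)&__b._M_use_count / 64
          == (std::uintptr_t)&__b._M_weak_count / 64;

      std::printf("lock_free=%d double_word=%d aligned=%d same_line=%d\n",
                  __lock_free, __double_word, __aligned, __same_line);
    }

On x86-64 all four print 1 here; with __alignof, IA32 fails the
__aligned test even when the run-time check would happen to pass.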
>> Secondly:
>>
>> +      void
>> +      __attribute__ ((noinline))
>>
>> This needs to be __noinline__, because noinline is not a reserved
>> word, so users can do:
>>   #define noinline 1
>>   #include <memory>
>>
>> Was it performance profiling, or code-size measurements (or something
>> else?) that showed this should be non-inline?
>> I'd like to add a comment explaining why we want it to be noinline.
>
> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization, with the noinline
> on _M_release_last_use, ended up reducing massive binary text sizes
> by 1.5% (several megabytes), which was an extra benefit. Without the
> noinline we saw a code size increase.
>
> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))
>
> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.
>
> How about something like this comment?
>
> // Noinline to avoid code size increase.
>
>> In that function ...
>>
>> +        _M_release_last_use() noexcept
>> +        {
>> +          _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
>> +          _M_dispose();
>> +          if (_Mutex_base<_Lp>::_S_need_barriers)
>> +            {
>> +              __atomic_thread_fence (__ATOMIC_ACQ_REL);
>> +            }
>>
>> I think we can remove this fence. The _S_need_barriers constant is
>> only true for the _S_mutex policy, and that just calls
>> _M_release_orig(), so it never uses this function. I'll remove it and
>> add a comment noting that the lack of a barrier is intentional.
>>
>> +          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
>> +          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_weak_count,
>> +                                                     -1) == 1)
>> +            {
>> +              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
>> +              _M_destroy();
>> +            }
>> +        }
>>
>> Alternatively, we could keep the fence in _M_release_last_use() and
>> refactor the other _M_release functions, so that we have explicit
>> specializations for each of:
>> _Sp_counted_base<_S_single>::_M_release() (already present)
>> _Sp_counted_base<_S_mutex>::_M_release()
>> _Sp_counted_base<_S_atomic>::_M_release()
>>
>> The second and third would be new, as currently they both use the
>> definition in the primary template. The _S_mutex one would just
>> decrement _M_use_count and then call _M_release_last_use() (which
>> still has the barrier needed for the _S_mutex policy). The _S_atomic
>> one would have your new optimization. See the attached patch showing
>> what I mean. I find this version a bit simpler to understand, as we
>> just have _M_release and _M_release_last_use, without
>> _M_release_double_width_cas and _M_release_orig. What do you think of
>> this version? Does it lose any important properties of your version
>> that I've failed to notice?
>
> Unless I missed something, I don't think the patch is correct. IIUC it
> uses _M_release_last_use as if it were equivalent to what was
> _M_release_orig, which is not true. The logic in _M_release_last_use
> assumes that the use count has already been decremented and the return
> value was 1. So I don't think the refactor works. Please correct me if
> I missed something.

Actually I take back what I said. Sorry. I think the logic in your patch
is correct. I missed some of the atomic decrements. But I'd be concerned
about performance. If we make _M_release_last_use noinline then we are
adding overhead to the fast path of the original logic (where both
counts are 1). What do you think?

> If I am not mistaken, how about I prepare a new patch that incorporates
> the following?
> - __noinline__
> - The change to __aligned in my last message.
> - Remove the unnecessary checks based on _S_need_barriers in the
>   combined / last_use path.
> - Going back to having _M_release_orig and _M_release_combined.
> - Include all of the comments that you added in your patch.
>
> Would that be OK with you?
>
> Thanks,
> Maged
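
P.S. For the archive, a rough stand-alone sketch of the shape of the
fast path we're discussing: one combined acquire load, skipping the
atomic RMWs when both counts are 1. It omits the TSAN/lock-free/
alignment guards above, and the mock names are illustrative stand-ins,
not the real control-block classes:

    #include <cstring>

    typedef int _Atomic_word;
    static_assert(sizeof(long long) == 2 * sizeof(_Atomic_word),
                  "combined load assumes double-width counts");

    struct _Sp_counted_mock
    {
      virtual ~_Sp_counted_mock() { }
      virtual void _M_dispose() noexcept { } // delete managed object
      virtual void _M_destroy() noexcept { delete this; }

      _Atomic_word _M_use_count = 1;
      _Atomic_word _M_weak_count = 1;

      void
      _M_release() noexcept
      {
        // Bit pattern of "use == 1 && weak == 1" viewed as long long.
        long long __both_one;
        const _Atomic_word __ones[2] = { 1, 1 };
        std::memcpy(&__both_one, __ones, sizeof(__both_one));

        // One double-width atomic load instead of two RMW operations
        // (the cast-based type punning is simplified for the sketch).
        long long __both = __atomic_load_n(
            reinterpret_cast<long long*>(&_M_use_count),
            __ATOMIC_ACQUIRE);
        if (__both == __both_one)
          {
            // Both counts are 1: no other strong or weak refs exist,
            // so no interleaving is possible; skip the atomic RMWs.
            _M_dispose();
            _M_destroy();
          }
        else if (__atomic_fetch_sub(&_M_use_count, 1,
                                    __ATOMIC_ACQ_REL) == 1)
          _M_release_last_use();
      }

      // Noinline to avoid code size increase: otherwise this body is
      // inlined into every ~shared_ptr.
      __attribute__((__noinline__))
      void
      _M_release_last_use() noexcept
      {
        _M_dispose();
        if (__atomic_fetch_sub(&_M_weak_count, 1,
                               __ATOMIC_ACQ_REL) == 1)
          _M_destroy();
      }
    };

    int main()
    {
      // Exercises the fast path: both counts start at 1.
      (new _Sp_counted_mock())->_M_release();
    }

The win is that the common "last use, no weak_ptrs" case tears down
with a single acquire load instead of two locked RMW instructions.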