From: Jonathan Wakely
Date: Wed, 4 Aug 2021 20:32:02 +0100
Subject: Re: [PATCH] libstdc++: Skip atomic instructions in _Sp_counted_base::_M_release when both counts are 1
To: Maged Michael
Cc: libstdc++, gcc-patches

On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
>
> Sorry. I totally missed the rest of your message and the patch. My fuzzy
> eyesight, which usually guesses correctly 90% of the time, mistook
> "Secondly" on a line by itself for "Sincerely" :-)

:-)

> The noinlining was based on looking at generated code. That was for
> clang. It was inlining the _M_last_use function for every instance of
> _M_release (e.g., ~shared_ptr). This optimization with the noinline for
> _M_release_last_use ended up reducing massive binary text sizes by 1.5%
> (several megabytes), which was an extra benefit. Without the noinline
> we saw code size increase.

Wow, that is a convincing argument for making it not inline, thanks.

> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))

Right.

> I didn't understand the part about programmers doing #define noinline 1.
> I don't see code in the patch that uses noinline.

This is a valid C++ program:

#define noinline 1
#include <memory>
int main() { }

But if anything in <memory> uses "noinline" then this valid program
will not compile. Which is why we must use ((__noinline__)) instead of
((noinline)).

> How about something like this comment?
>
> // Noinline to avoid code size increase.

Great, thanks.
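(Back to the #define point for a moment, to make the failure mode
concrete: the attribute name in ((noinline)) is an ordinary identifier,
so the user's macro rewrites it before the compiler ever sees the
attribute. A minimal standalone illustration, not code from the patch
and with a made-up function name:

#define noinline 1              // legal in user code

__attribute__((__noinline__))   // OK: users cannot legally #define
void f() { }                    // the reserved name __noinline__

// __attribute__((noinline))    // would expand to __attribute__((1)),
//                              // which is a syntax error

int main() { f(); }

Hence the reserved spellings everywhere in the headers.)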
On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> Actually I take back what I said. Sorry. I think the logic in your patch
> is correct. I missed some of the atomic decrements.
> But I'd be concerned about performance. If we make _M_release_last_use
> noinline then we are adding overhead to the fast path of the original
> logic (where both counts are 1).

Oh, I see. So the code duplication serves a purpose. We want the
_M_release_last_use() code to be non-inline for the new logic, because
in the common case we never call it (we either detect that both counts
are 1 and do the dispose & destroy without atomic decrements, or we do
a single decrement and don't dispose or destroy anything). But for the
old logic, we do want that code inlined into _M_release (or
_M_release_orig as it was called in your patch). Have I got that right
now?

What if we remove the __noinline__ from _M_release_last_use() so that
it can be inlined, but then add a noinline wrapper that can be called
when we don't want to inline it? So:

    // Called by _M_release() when the use count reaches zero.
    void
    _M_release_last_use() noexcept
    {
      // unchanged from previous patch, but without the attribute.
      // ...
    }

    // As above, but 'noinline' to reduce code size on the cold path.
    __attribute__((__noinline__))
    void
    _M_release_last_use_cold() noexcept
    { _M_release_last_use(); }

And then:

  template<>
    inline void
    _Sp_counted_base<_S_atomic>::_M_release() noexcept
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
      constexpr bool __lock_free
        = __atomic_always_lock_free(sizeof(long long), 0)
        && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // The ref-count members follow the vptr, so are aligned to
      // alignof(void*).
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
        {
          constexpr long long __unique_ref
            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
            {
              // Both counts are 1, so there are no weak references and
              // we are releasing the last strong reference. No other
              // threads can observe the effects of this _M_release()
              // call (e.g. calling use_count()) without a data race.
              *(long long*)(&_M_use_count) = 0;
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
              _M_dispose();
              _M_destroy();
              return;
            }
          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            {
              _M_release_last_use_cold();
              return;
            }
        }
      else
#endif
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          _M_release_last_use();
        }
    }

So we use the noinline version for the else branch in the new logic,
but the can-inline version for the old logic.

Would that work?

We could also consider adding __attribute__((__cold__)) to the
_M_release_last_use_cold() function, and/or add [[__unlikely__]] to
the 'if' that calls it, but maybe that's overkill (a rough sketch of
what that would look like is below).

It seems like this will slightly pessimize the case where the last use
count is dropped but there are weak refs, as that will take the cold
path now. That's probably OK, given the benefits for the "both counts
are 1" case, and that it might reduce the code size for the
"use count > 1" case.
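For reference, that would look something like this (an untested sketch,
not part of the patch; [[__unlikely__]] is the C++20 attribute, so it
would presumably need guarding for older dialects in the real headers):

    // As above, but 'noinline' and 'cold' so the compiler keeps this
    // entirely out of the hot path.
    __attribute__((__noinline__, __cold__))
    void
    _M_release_last_use_cold() noexcept
    { _M_release_last_use(); }

and at the call site inside the new _M_release():

          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            [[__unlikely__]]
            {
              _M_release_last_use_cold();
              return;
            }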
It's past 8pm here so I'll come back to this tomorrow and do some code size measurements of my own, now that I have a clearer understanding of what your original patch was optimizing for.