From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rodgert@appliantology.com>
Received: from mx.kolabnow.com (mx.kolabnow.com [95.128.36.42])
 by sourceware.org (Postfix) with ESMTPS id 84960386185F
 for <libstdc++@gcc.gnu.org>; Mon,  1 Mar 2021 17:46:15 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 84960386185F
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=appliantology.com
Authentication-Results: sourceware.org;
 spf=none smtp.mailfrom=rodgert@appliantology.com
Received: from localhost (unknown [127.0.0.1])
 by ext-mx-out001.mykolab.com (Postfix) with ESMTP id 4F543B49;
 Mon,  1 Mar 2021 18:46:14 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kolabnow.com; h=
 message-id:references:in-reply-to:subject:subject:from:from:date
 :date:content-type:content-type:mime-version:received:received
 :received; s=dkim20160331; t=1614620772; x=1616435173; bh=11sMYD
 c2d9Wr4esCKC2pyxmVFj3a364O5MdNmt+y8+A=; b=bHVhjFj7tNvc+UybMFvfaV
 wKLgTUfHdhwLPKrOdKoD9erYPDmB67JgDCcQUax2Eeidr2t21ajfs5D9JEzOfmiG
 W5yhcKzz92V5uosFU8wmI/l9GjBJaLWVoh9+iYgvKNRwLe0IkoiaxLijqnecMddl
 TVr4JMZfLZNs1odYxIcACtEbDy0i8rK792kOfp5xHxspP2frEpkYydBReXLW7Ufp
 It9rfltAf/p/YWh9ah9H40rZr9FntqIpdiLuC7ATwLUXSGU1JEBygDibA/ZBKQdV
 pjbyXnmO1ZceNSLetFP0IpJ5Cqk8i4T39wJqU7rEiMe2RN96+q4G9mRIOQIP50//
 jAS1QcjeQXI8cn9omgpSE025sbjVKzwgT0u5gM9FlmTFOzf2Saj45V68PUQFTMee
 ndPmBS1kkbhEYP5EqSEmIJrORnheHi0ViWq863re/HV10qEU9Xi/IZ7V9871L4H/
 jy7JGgn4s/luY1K8yFeysbWgwIYEfQb1dA4OYTI3rIcy6ghFVgwDyz5Pgk2Im+/K
 sgDpJXCDK3LwmVKuvhzx5JMvlb3CO70UOPVsq3dImayGJHfX4+hmhtiC/+hhtSHf
 7v4EAIaVvo11j8inaI9XnBE8myAKjbvNYbaxQWorIp0AfKSoHPyatnEyI1vSaEom
 0vK04sEZdFgFRZlZh677A=
X-Virus-Scanned: amavisd-new at mykolab.com
X-Spam-Score: -1.898
X-Spam-Level: 
X-Spam-Status: No, score=-6.1 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, HTML_MESSAGE, KAM_NUMSUBJECT, KAM_SHORT, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=no autolearn_force=no version=3.4.2
Received: from mx.kolabnow.com ([127.0.0.1])
 by localhost (ext-mx-out001.mykolab.com [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id RDwDDkPGUfFe; Mon,  1 Mar 2021 18:46:12 +0100 (CET)
Received: from int-mx002.mykolab.com (unknown [10.9.13.2])
 by ext-mx-out001.mykolab.com (Postfix) with ESMTPS id 4244C4EF;
 Mon,  1 Mar 2021 18:46:11 +0100 (CET)
Received: from int-subm001.mykolab.com (unknown [10.9.37.1])
 by int-mx002.mykolab.com (Postfix) with ESMTPS id C4FFB2250;
 Mon,  1 Mar 2021 18:46:10 +0100 (CET)
MIME-Version: 1.0
Date: Mon, 01 Mar 2021 09:46:08 -0800
From: Thomas Rodgers <rodgert@appliantology.com>
To: Thiago Macieira <thiago.macieira@intel.com>
Cc: trodgers@redhat.com, jwakely@redhat.com, libstdc++@gcc.gnu.org
Subject: Re: C++2a synchronisation inefficient in GCC 11
In-Reply-To: <812e6f42e4b6decfa864f12d73f42e5e@appliantology.com>
References: <1968544.UC5HiB4uFJ@tjmaciei-mobl1>
 <62df56697b99a0ab414e8f1e35f10b0b@appliantology.com>
 <9c8e2e24618f5a538d01c2ccbb0f0124@appliantology.com>
 <812e6f42e4b6decfa864f12d73f42e5e@appliantology.com>
Message-ID: <b789b26fd01d07fe505aaba83e9a1bc6@appliantology.com>
X-Sender: rodgert@appliantology.com
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
Content-Type: text/plain; charset=US-ASCII;
 format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: libstdc++@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libstdc++ mailing list <libstdc++.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/libstdc++/>
List-Post: <mailto:libstdc++@gcc.gnu.org>
List-Help: <mailto:libstdc++-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Mon, 01 Mar 2021 17:46:18 -0000

I discussed the prematureness of the contention optimization with 
ogiroux this morning. Confirming, libc++ does do this, and in his 
measurements it saves on the order of 50us.

Regarding the use of _Hash_impl::hash(), that can be changed to just use 
the pointer value itself. This is what libatomic does. I'm happy to 
propose that as a subsequent patch to the most recent version of this 
code which I submitted recently.

On 2021-02-26 19:01, Thomas Rodgers wrote:

> On 2021-02-26 17:29, Thomas Rodgers wrote:
> 
>> On 2021-02-26 17:13, Thomas Rodgers wrote:
>> 
>> We are SPECIFICALLY NOT committing to an ABI for wait/notify with 
>> GCC11. These features are experimental and I have no intention of 
>> treating this as a blocker.
>> 
>> On 2021-02-25 14:50, Thiago Macieira via Libstdc++ wrote:
>> 
>> Hello all
>> 
>> I was investigating the implementation of std::atomic<T>::wait to use 
>> inside
>> my own code (Qt's QMutex and QSemaphore), when I realised that it has
>> huge inefficiencies. Since everything in this implementation is inline 
>> and,
>> once released, it will tie our hands until the next ABI break 
>> (libstdc++.so.
>> 7). And no, inline namespaces and ABI tags are no different than ABI 
>> breaks,
>> they only make the ABI break more or less silent in the process.
>> 
>> Here's a summary of the findings:
>> 
>> 1) everything is inline
>> 2) futex code is still behind a lot of code calling into 
>> std::_Hash_bytes
> 
> It's likely possible to remove the dependence on _Hash_bytes. I looked 
> at it while working on the most recent version of this code but decided 
> to not touch it for GCC11 because...it's experimental and we
> aren't committing to an ABI.
> 
>> 3) other int-sized (incl. enums) atomics don't use futex
> 
> The most recent version of this code extends the range of types which 
> use futex to all 32bit types.
> 
>> 4) std::latch and std::counting_semaphore defaults preclude from using
>> futex on Linux
>> 5) std::barrier implementation also uses a type that futex(2) can't 
>> handle
>> 
>> Here's what the code currently looks like: 
>> https://gcc.godbolt.org/z/MeP8rr
>> 
>> Given how far we are in the release cycle, a simple solution is to 
>> simply
>> disable the entire solution by #if 0'ing out bits/atomic_wait.h. If
>> __cpp_lib_atomic_wait isn't defined, the rest of the functionality 
>> cascades
>> disabled too. Then we can spend however long is necessary to get a 
>> good
>> implementation for GCC 12.
>> 
>> I have patches for all but the first issue, but they are a result of 
>> only 3
>> hours of hacking to prove the concept. I will post them, but I do not 
>> expect
>> them to get merged as-is.
>> 
>> Details:
>> 
>> 1) everything is inline
>> 
>> I think this is unnecessary optimisation, especially since it expands 
>> to a LOT
>> of code (see link above). A cursory research into libc++ and MS STL 
>> shows that
>> the actual wait and notification are out-of-line. I recommend we do 
>> the same
>> for libstdc++: if we're going to syscall anyway, the extra function 
>> call isn't
>> going to be the performance bottleneck.
>> 
>> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050360.html>, 
>> Jonathan
>> wrote:
>> 
>> Eventually we'll want to move the rest of this function (which doesn't
>> depend on the template argument) into the compiled library, but it's
>> better to be header-only for now.
>> But that has never happened.
>> 
>> Having out-of-line implementations would allow the implementation for 
>> generic-
>> sized atomic wait/notify to fine-tune as time and experience 
>> progresses. The
>> std::__detail::__waiter implementation looks clever, but it's unclear 
>> if it is
>> the best for all workloads. Moreover, the implementation is limited to 
>> 16
>> waiters.
>> 
>> Moreover, it would allow us to have support in the future for a 64-bit 
>> futex.
>> Linus did shoot down any need for it 13 years ago, but the world has 
>> changed
>> since and it's not inconceivable to have a pointer-sized futex on 
>> every
>> platform, which on 64-bit would be 64-bit. Windows and Darwin do have 
>> support
>> for 64-bit compare-and-wait functionality.
>> 
>> And it would allow us to have futex-equivalent support in other OSes, 
>> added at
>> our leisure. The current implementation does not support Windows's
>> WaitOnAddress and could never support it if released in the current 
>> state.
>> It's also possible other OSes will eventually add equivalent 
>> functionality
>> because of the C++ standard. For example, FreeBSD already has futex() 
>> support
>> inside its kernel in order to run Linux binaries, but it is not 
>> exposed to the
>> FreeBSD-personality syscall ABI.
>> 
>> Inlining certain functionality at a later date is possible. 
>> De-inlining isn't.
>> 
>> 2) futex code for int is still behind a lot of boiler plate
>> 
>> It comes from the creation/use of std::__detail::__waiter in
>> std::__atomic_wait and __atomic_notify. That does allow one to detect 
>> whether
>> there is contention in the first place and avoid the system call, but 
>> I think
>> that's premature optimisation.
>> 
>> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050362.html>, 
>> Thomas
>> says:
>> 
>> __atomic_notify checks to see if that count is non-zero
>> before invoking the platform/semaphore notify because it is cheaper
>> to do the atomic load than it is to make the syscall() when there are 
>> no
>> waiters.
>> That's true, but it ignores the fact that users of 
>> std::atomic<T>::wait() are
>> likely to track contention by themselves on the value of the atomic (I 
>> know I
>> do). libc++, MS STL and the original atomic_wait[1 [1]] implementation 
>> operate on
>> direct system calls and don't try to track contention. Because of 
>> that,
>> libstdc++ adding an extra layer of tracking is double book-keeping and
>> violates "don't pay for what you don't need".
>> 
>> [1] https://github.com/ogiroux/atomic_wait
> 
> THe final version that Olivier committed to libc++ does track 
> contention.
> 
>> 3) other int-sized atomics don't use futex
>> 
>> The kernel doesn't care what we store in those 32 bits, only that they 
>> are
>> comparable (FUTEX_OP does provide less-than and greater-than 
>> operations that
>> are signed, but we don't use that and the kernel API can be later 
>> extended to
>> unsigned compares). Therefore, I would recommend we extend futex 
>> support to
>> all 32-bit signed scalar types. This would add:
>> * pointers and long on 32-bit architectures
>> * unsigned
>> * untyped enums and typed enums on int & unsigned int
>> * float
>> 
>> 4) std::latch and std::counting_semaphore defaults too high
>> 
>> The standard requires a ptrdiff_t in the API, but it does allow us to 
>> set hard
>> limits and/or defaults for best performance. Right now, std::latch is
>> implemented using ptrdiff_t internally and std::counting_semaphore's 
>> default
>> limit uses PTRDIFF_MAX:
>> 
>> template<ptrdiff_t __least_max_value =
>> __gnu_cxx::__int_traits<ptrdiff_t>::__max>
>> class counting_semaphore
>> 
>> As a consequence, those two classes will use 64-bit atomics on 64-bit
>> architectures, which means they will not use futex. I recommend 
>> changing
>> std::latch's internal to int and lowering the default maximum in
>> std::counting_semaphore to INT_MAX.
>> 
>> 5) std::barrier uses a type that futex can't handle
>> 
>> The barrier phase is the actual type that gets waited / notified on, 
>> but it is
>> an enum class on unsigned char:
>> 
>> enum class __barrier_phase_t : unsigned char { };
>> 
>> This is a harder problem to solve, but I really do recommend making it 
>> 32-bit
>> in width. Whether each barrier phase holds only 256 values or the full 
>> 32
>> bits, I don't know (I've implemented with wrapping on 256).
>> 
>> What is of consequence is that the __state_t type could increase in 
>> size
>> ("could" because we can simply reduce the number of tickets in each 
>> state
>> entry). I haven't done enough research on this to know whether it is 
>> the right
>> way to go.
> 
> My understanding from Olivier Giroux, who's barrier implementation code 
> this is derived from, is that he extensively tested it's performance on 
> a wide variety of hardware, including time on Summit. Short of hard 
> data to the contrary, I am not inclined to second guess the 
> implementation he tested (which uses char tickets).
> 
> Links:
> ------
> [1] https://github.com/ogiroux/atomic_wait