From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx.kolabnow.com (mx.kolabnow.com [95.128.36.42]) by sourceware.org (Postfix) with ESMTPS id 84960386185F for ; Mon, 1 Mar 2021 17:46:15 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 84960386185F Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=appliantology.com Authentication-Results: sourceware.org; spf=none smtp.mailfrom=rodgert@appliantology.com Received: from localhost (unknown [127.0.0.1]) by ext-mx-out001.mykolab.com (Postfix) with ESMTP id 4F543B49; Mon, 1 Mar 2021 18:46:14 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kolabnow.com; h= message-id:references:in-reply-to:subject:subject:from:from:date :date:content-type:content-type:mime-version:received:received :received; s=dkim20160331; t=1614620772; x=1616435173; bh=11sMYD c2d9Wr4esCKC2pyxmVFj3a364O5MdNmt+y8+A=; b=bHVhjFj7tNvc+UybMFvfaV wKLgTUfHdhwLPKrOdKoD9erYPDmB67JgDCcQUax2Eeidr2t21ajfs5D9JEzOfmiG W5yhcKzz92V5uosFU8wmI/l9GjBJaLWVoh9+iYgvKNRwLe0IkoiaxLijqnecMddl TVr4JMZfLZNs1odYxIcACtEbDy0i8rK792kOfp5xHxspP2frEpkYydBReXLW7Ufp It9rfltAf/p/YWh9ah9H40rZr9FntqIpdiLuC7ATwLUXSGU1JEBygDibA/ZBKQdV pjbyXnmO1ZceNSLetFP0IpJ5Cqk8i4T39wJqU7rEiMe2RN96+q4G9mRIOQIP50// jAS1QcjeQXI8cn9omgpSE025sbjVKzwgT0u5gM9FlmTFOzf2Saj45V68PUQFTMee ndPmBS1kkbhEYP5EqSEmIJrORnheHi0ViWq863re/HV10qEU9Xi/IZ7V9871L4H/ jy7JGgn4s/luY1K8yFeysbWgwIYEfQb1dA4OYTI3rIcy6ghFVgwDyz5Pgk2Im+/K sgDpJXCDK3LwmVKuvhzx5JMvlb3CO70UOPVsq3dImayGJHfX4+hmhtiC/+hhtSHf 7v4EAIaVvo11j8inaI9XnBE8myAKjbvNYbaxQWorIp0AfKSoHPyatnEyI1vSaEom 0vK04sEZdFgFRZlZh677A= X-Virus-Scanned: amavisd-new at mykolab.com X-Spam-Score: -1.898 X-Spam-Level: X-Spam-Status: No, score=-6.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, HTML_MESSAGE, KAM_NUMSUBJECT, KAM_SHORT, SPF_HELO_NONE, SPF_NONE, TXREP autolearn=no autolearn_force=no version=3.4.2 Received: from mx.kolabnow.com ([127.0.0.1]) by localhost (ext-mx-out001.mykolab.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RDwDDkPGUfFe; Mon, 1 Mar 2021 18:46:12 +0100 (CET) Received: from int-mx002.mykolab.com (unknown [10.9.13.2]) by ext-mx-out001.mykolab.com (Postfix) with ESMTPS id 4244C4EF; Mon, 1 Mar 2021 18:46:11 +0100 (CET) Received: from int-subm001.mykolab.com (unknown [10.9.37.1]) by int-mx002.mykolab.com (Postfix) with ESMTPS id C4FFB2250; Mon, 1 Mar 2021 18:46:10 +0100 (CET) MIME-Version: 1.0 Date: Mon, 01 Mar 2021 09:46:08 -0800 From: Thomas Rodgers To: Thiago Macieira Cc: trodgers@redhat.com, jwakely@redhat.com, libstdc++@gcc.gnu.org Subject: Re: C++2a synchronisation inefficient in GCC 11 In-Reply-To: <812e6f42e4b6decfa864f12d73f42e5e@appliantology.com> References: <1968544.UC5HiB4uFJ@tjmaciei-mobl1> <62df56697b99a0ab414e8f1e35f10b0b@appliantology.com> <9c8e2e24618f5a538d01c2ccbb0f0124@appliantology.com> <812e6f42e4b6decfa864f12d73f42e5e@appliantology.com> Message-ID: X-Sender: rodgert@appliantology.com X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on server2.sourceware.org Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.29 X-BeenThere: libstdc++@gcc.gnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libstdc++ mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 01 Mar 2021 17:46:18 -0000 I discussed the prematureness of the contention optimization with ogiroux this morning. Confirming, libc++ does do this, and in his measurements it saves on the order of 50us. Regarding the use of _Hash_impl::hash(), that can be changed to just use the pointer value itself. This is what libatomic does. I'm happy to propose that as a subsequent patch to the most recent version of this code which I submitted recently. On 2021-02-26 19:01, Thomas Rodgers wrote: > On 2021-02-26 17:29, Thomas Rodgers wrote: > >> On 2021-02-26 17:13, Thomas Rodgers wrote: >> >> We are SPECIFICALLY NOT committing to an ABI for wait/notify with >> GCC11. These features are experimental and I have no intention of >> treating this as a blocker. >> >> On 2021-02-25 14:50, Thiago Macieira via Libstdc++ wrote: >> >> Hello all >> >> I was investigating the implementation of std::atomic::wait to use >> inside >> my own code (Qt's QMutex and QSemaphore), when I realised that it has >> huge inefficiencies. Since everything in this implementation is inline >> and, >> once released, it will tie our hands until the next ABI break >> (libstdc++.so. >> 7). And no, inline namespaces and ABI tags are no different than ABI >> breaks, >> they only make the ABI break more or less silent in the process. >> >> Here's a summary of the findings: >> >> 1) everything is inline >> 2) futex code is still behind a lot of code calling into >> std::_Hash_bytes > > It's likely possible to remove the dependence on _Hash_bytes. I looked > at it while working on the most recent version of this code but decided > to not touch it for GCC11 because...it's experimental and we > aren't committing to an ABI. > >> 3) other int-sized (incl. enums) atomics don't use futex > > The most recent version of this code extends the range of types which > use futex to all 32bit types. > >> 4) std::latch and std::counting_semaphore defaults preclude from using >> futex on Linux >> 5) std::barrier implementation also uses a type that futex(2) can't >> handle >> >> Here's what the code currently looks like: >> https://gcc.godbolt.org/z/MeP8rr >> >> Given how far we are in the release cycle, a simple solution is to >> simply >> disable the entire solution by #if 0'ing out bits/atomic_wait.h. If >> __cpp_lib_atomic_wait isn't defined, the rest of the functionality >> cascades >> disabled too. Then we can spend however long is necessary to get a >> good >> implementation for GCC 12. >> >> I have patches for all but the first issue, but they are a result of >> only 3 >> hours of hacking to prove the concept. I will post them, but I do not >> expect >> them to get merged as-is. >> >> Details: >> >> 1) everything is inline >> >> I think this is unnecessary optimisation, especially since it expands >> to a LOT >> of code (see link above). A cursory research into libc++ and MS STL >> shows that >> the actual wait and notification are out-of-line. I recommend we do >> the same >> for libstdc++: if we're going to syscall anyway, the extra function >> call isn't >> going to be the performance bottleneck. >> >> In , >> Jonathan >> wrote: >> >> Eventually we'll want to move the rest of this function (which doesn't >> depend on the template argument) into the compiled library, but it's >> better to be header-only for now. >> But that has never happened. >> >> Having out-of-line implementations would allow the implementation for >> generic- >> sized atomic wait/notify to fine-tune as time and experience >> progresses. The >> std::__detail::__waiter implementation looks clever, but it's unclear >> if it is >> the best for all workloads. Moreover, the implementation is limited to >> 16 >> waiters. >> >> Moreover, it would allow us to have support in the future for a 64-bit >> futex. >> Linus did shoot down any need for it 13 years ago, but the world has >> changed >> since and it's not inconceivable to have a pointer-sized futex on >> every >> platform, which on 64-bit would be 64-bit. Windows and Darwin do have >> support >> for 64-bit compare-and-wait functionality. >> >> And it would allow us to have futex-equivalent support in other OSes, >> added at >> our leisure. The current implementation does not support Windows's >> WaitOnAddress and could never support it if released in the current >> state. >> It's also possible other OSes will eventually add equivalent >> functionality >> because of the C++ standard. For example, FreeBSD already has futex() >> support >> inside its kernel in order to run Linux binaries, but it is not >> exposed to the >> FreeBSD-personality syscall ABI. >> >> Inlining certain functionality at a later date is possible. >> De-inlining isn't. >> >> 2) futex code for int is still behind a lot of boiler plate >> >> It comes from the creation/use of std::__detail::__waiter in >> std::__atomic_wait and __atomic_notify. That does allow one to detect >> whether >> there is contention in the first place and avoid the system call, but >> I think >> that's premature optimisation. >> >> In , >> Thomas >> says: >> >> __atomic_notify checks to see if that count is non-zero >> before invoking the platform/semaphore notify because it is cheaper >> to do the atomic load than it is to make the syscall() when there are >> no >> waiters. >> That's true, but it ignores the fact that users of >> std::atomic::wait() are >> likely to track contention by themselves on the value of the atomic (I >> know I >> do). libc++, MS STL and the original atomic_wait[1 [1]] implementation >> operate on >> direct system calls and don't try to track contention. Because of >> that, >> libstdc++ adding an extra layer of tracking is double book-keeping and >> violates "don't pay for what you don't need". >> >> [1] https://github.com/ogiroux/atomic_wait > > THe final version that Olivier committed to libc++ does track > contention. > >> 3) other int-sized atomics don't use futex >> >> The kernel doesn't care what we store in those 32 bits, only that they >> are >> comparable (FUTEX_OP does provide less-than and greater-than >> operations that >> are signed, but we don't use that and the kernel API can be later >> extended to >> unsigned compares). Therefore, I would recommend we extend futex >> support to >> all 32-bit signed scalar types. This would add: >> * pointers and long on 32-bit architectures >> * unsigned >> * untyped enums and typed enums on int & unsigned int >> * float >> >> 4) std::latch and std::counting_semaphore defaults too high >> >> The standard requires a ptrdiff_t in the API, but it does allow us to >> set hard >> limits and/or defaults for best performance. Right now, std::latch is >> implemented using ptrdiff_t internally and std::counting_semaphore's >> default >> limit uses PTRDIFF_MAX: >> >> template> __gnu_cxx::__int_traits::__max> >> class counting_semaphore >> >> As a consequence, those two classes will use 64-bit atomics on 64-bit >> architectures, which means they will not use futex. I recommend >> changing >> std::latch's internal to int and lowering the default maximum in >> std::counting_semaphore to INT_MAX. >> >> 5) std::barrier uses a type that futex can't handle >> >> The barrier phase is the actual type that gets waited / notified on, >> but it is >> an enum class on unsigned char: >> >> enum class __barrier_phase_t : unsigned char { }; >> >> This is a harder problem to solve, but I really do recommend making it >> 32-bit >> in width. Whether each barrier phase holds only 256 values or the full >> 32 >> bits, I don't know (I've implemented with wrapping on 256). >> >> What is of consequence is that the __state_t type could increase in >> size >> ("could" because we can simply reduce the number of tickets in each >> state >> entry). I haven't done enough research on this to know whether it is >> the right >> way to go. > > My understanding from Olivier Giroux, who's barrier implementation code > this is derived from, is that he extensively tested it's performance on > a wide variety of hardware, including time on Summit. Short of hard > data to the contrary, I am not inclined to second guess the > implementation he tested (which uses char tickets). > > Links: > ------ > [1] https://github.com/ogiroux/atomic_wait