From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <rodgert@appliantology.com>
Received: from mx.kolabnow.com (mx.kolabnow.com [95.128.36.42])
 by sourceware.org (Postfix) with ESMTPS id 2070F3854830
 for <libstdc++@gcc.gnu.org>; Sat, 27 Feb 2021 01:13:54 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 2070F3854830
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none)
 header.from=appliantology.com
Authentication-Results: sourceware.org;
 spf=none smtp.mailfrom=rodgert@appliantology.com
Received: from localhost (unknown [127.0.0.1])
 by ext-mx-out001.mykolab.com (Postfix) with ESMTP id 99437B69;
 Sat, 27 Feb 2021 02:13:52 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kolabnow.com; h=
 message-id:references:in-reply-to:subject:subject:from:from:date
 :date:content-type:content-type:mime-version:received:received
 :received; s=dkim20160331; t=1614388431; x=1616202832; bh=Aqy7gx
 XhJePNP0UlMyoJG+7IUl5Zex6zyU+Vx26q4NM=; b=TJ0CPJWLb+am2FIMnLDNet
 qI5tUptDJ/QoKM8P43JpG13JbaYLcR2sfYBjTfYJcvYwAKMh/I2EQznE68jh8PEW
 u+ntaBLolTRNW4F7CTfDEu5KM6nm6fkXbjQ4cv4CQr9+YUWScXzPShkPffosMMMz
 /MID9MLR34v/JfuepkdIfMrCLB7oo7zqIvcDA1tMIG+Dd3KNLwboScVVUkh9K/M5
 MIeCa8yajRSoqKotls5K1HWo+JLP4PwgHu6Rp6ACzU2lmwh4Lx31m6Vlo2cPrM5R
 Q5Krz0cdqq8usfG8CFfSVl8dwnEVfIwbb4s69pZNYFHr2GhRYESfg5f3O+d5uHSo
 AmifaLom5/FZlzUbDwwmvhgjJoG+QkYRDLfc0/42oKQBpxlUQA9lcJBSope0Ynua
 pBf+4K6sQORXVJZO0GMchzveetr+uPB6nuC28iGBh29E3C5mhu1UsY16h7rWoi0H
 8I9gEJaC2papIE+Ix1kTciaNwbFDw3UMv1S3N+oyJHwacvcCfkUsjDnujcB8UM1r
 l3fjZxGrd/0shpMFm1Eb5OtILJ+hZfkosbvYpcVdMTH4URRTGE1umnuOurfTbaZr
 VRiXN08zqRwqo5+qP5k5zN96889XP48LsJu0QTghFiiBxRZq+57/ymhII08u3Ivv
 B6f2CFJ18UhqWypit/JjA=
X-Virus-Scanned: amavisd-new at mykolab.com
X-Spam-Score: -1.898
X-Spam-Level: 
X-Spam-Status: No, score=-6.4 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, HTML_MESSAGE, KAM_NUMSUBJECT, KAM_SHORT, SPF_HELO_NONE, SPF_NONE,
 TXREP autolearn=no autolearn_force=no version=3.4.2
Received: from mx.kolabnow.com ([127.0.0.1])
 by localhost (ext-mx-out001.mykolab.com [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id YER4HsV3TS7b; Sat, 27 Feb 2021 02:13:51 +0100 (CET)
Received: from int-mx002.mykolab.com (unknown [10.9.13.2])
 by ext-mx-out001.mykolab.com (Postfix) with ESMTPS id D6FEA82F;
 Sat, 27 Feb 2021 02:13:50 +0100 (CET)
Received: from int-subm001.mykolab.com (unknown [10.9.37.1])
 by int-mx002.mykolab.com (Postfix) with ESMTPS id F3A98223C;
 Sat, 27 Feb 2021 02:13:48 +0100 (CET)
MIME-Version: 1.0
Date: Fri, 26 Feb 2021 17:13:46 -0800
From: Thomas Rodgers <rodgert@appliantology.com>
To: Thiago Macieira <thiago.macieira@intel.com>
Cc: libstdc++@gcc.gnu.org, trodgers@redhat.com, jwakely@redhat.com
Subject: Re: C++2a synchronisation inefficient in GCC 11
In-Reply-To: <1968544.UC5HiB4uFJ@tjmaciei-mobl1>
References: <1968544.UC5HiB4uFJ@tjmaciei-mobl1>
Message-ID: <62df56697b99a0ab414e8f1e35f10b0b@appliantology.com>
X-Sender: rodgert@appliantology.com
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on
 server2.sourceware.org
Content-Type: text/plain; charset=US-ASCII;
 format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.29
X-BeenThere: libstdc++@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libstdc++ mailing list <libstdc++.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/libstdc++/>
List-Post: <mailto:libstdc++@gcc.gnu.org>
List-Help: <mailto:libstdc++-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/libstdc++>,
 <mailto:libstdc++-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Sat, 27 Feb 2021 01:13:56 -0000

We are SPECIFICALLY NOT committing to an ABI for wait/notify with GCC11. 
These features are experimental and I have no intention of treating this 
as a blocker.

On 2021-02-25 14:50, Thiago Macieira via Libstdc++ wrote:

> Hello all
> 
> I was investigating the implementation of std::atomic<T>::wait to use 
> inside
> my own code (Qt's QMutex and QSemaphore), when I realised that it has
> huge inefficiencies. Since everything in this implementation is inline 
> and,
> once released, it will tie our hands until the next ABI break 
> (libstdc++.so.
> 7). And no, inline namespaces and ABI tags are no different than ABI 
> breaks,
> they only make the ABI break more or less silent in the process.
> 
> Here's a summary of the findings:
> 
> 1) everything is inline
> 2) futex code is still behind a lot of code calling into 
> std::_Hash_bytes
> 3) other int-sized (incl. enums) atomics don't use futex
> 4) std::latch and std::counting_semaphore defaults preclude from using
> futex on Linux
> 5) std::barrier implementation also uses a type that futex(2) can't 
> handle
> 
> Here's what the code currently looks like: 
> https://gcc.godbolt.org/z/MeP8rr
> 
> Given how far we are in the release cycle, a simple solution is to 
> simply
> disable the entire solution by #if 0'ing out bits/atomic_wait.h. If
> __cpp_lib_atomic_wait isn't defined, the rest of the functionality 
> cascades
> disabled too. Then we can spend however long is necessary to get a good
> implementation for GCC 12.
> 
> I have patches for all but the first issue, but they are a result of 
> only 3
> hours of hacking to prove the concept. I will post them, but I do not 
> expect
> them to get merged as-is.
> 
> Details:
> 
> 1) everything is inline
> 
> I think this is unnecessary optimisation, especially since it expands 
> to a LOT
> of code (see link above). A cursory research into libc++ and MS STL 
> shows that
> the actual wait and notification are out-of-line. I recommend we do the 
> same
> for libstdc++: if we're going to syscall anyway, the extra function 
> call isn't
> going to be the performance bottleneck.
> 
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050360.html>, 
> Jonathan
> wrote:
> 
>> Eventually we'll want to move the rest of this function (which doesn't
>> depend on the template argument) into the compiled library, but it's
>> better to be header-only for now.
> 
> But that has never happened.
> 
> Having out-of-line implementations would allow the implementation for 
> generic-
> sized atomic wait/notify to fine-tune as time and experience 
> progresses. The
> std::__detail::__waiter implementation looks clever, but it's unclear 
> if it is
> the best for all workloads. Moreover, the implementation is limited to 
> 16
> waiters.
> 
> Moreover, it would allow us to have support in the future for a 64-bit 
> futex.
> Linus did shoot down any need for it 13 years ago, but the world has 
> changed
> since and it's not inconceivable to have a pointer-sized futex on every
> platform, which on 64-bit would be 64-bit. Windows and Darwin do have 
> support
> for 64-bit compare-and-wait functionality.
> 
> And it would allow us to have futex-equivalent support in other OSes, 
> added at
> our leisure. The current implementation does not support Windows's
> WaitOnAddress and could never support it if released in the current 
> state.
> It's also possible other OSes will eventually add equivalent 
> functionality
> because of the C++ standard. For example, FreeBSD already has futex() 
> support
> inside its kernel in order to run Linux binaries, but it is not exposed 
> to the
> FreeBSD-personality syscall ABI.
> 
> Inlining certain functionality at a later date is possible. De-inlining 
> isn't.
> 
> 2) futex code for int is still behind a lot of boiler plate
> 
> It comes from the creation/use of std::__detail::__waiter in
> std::__atomic_wait and __atomic_notify. That does allow one to detect 
> whether
> there is contention in the first place and avoid the system call, but I 
> think
> that's premature optimisation.
> 
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050362.html>, 
> Thomas
> says:
> 
>> __atomic_notify checks to see if that count is non-zero
>> before invoking the platform/semaphore notify because it is cheaper
>> to do the atomic load than it is to make the syscall() when there are 
>> no
>> waiters.
> 
> That's true, but it ignores the fact that users of 
> std::atomic<T>::wait() are
> likely to track contention by themselves on the value of the atomic (I 
> know I
> do). libc++, MS STL and the original atomic_wait[1 [1]] implementation 
> operate on
> direct system calls and don't try to track contention. Because of that,
> libstdc++ adding an extra layer of tracking is double book-keeping and
> violates "don't pay for what you don't need".
> 
> [1] https://github.com/ogiroux/atomic_wait
> 
> 3) other int-sized atomics don't use futex
> 
> The kernel doesn't care what we store in those 32 bits, only that they 
> are
> comparable (FUTEX_OP does provide less-than and greater-than operations 
> that
> are signed, but we don't use that and the kernel API can be later 
> extended to
> unsigned compares). Therefore, I would recommend we extend futex 
> support to
> all 32-bit signed scalar types. This would add:
> * pointers and long on 32-bit architectures
> * unsigned
> * untyped enums and typed enums on int & unsigned int
> * float
> 
> 4) std::latch and std::counting_semaphore defaults too high
> 
> The standard requires a ptrdiff_t in the API, but it does allow us to 
> set hard
> limits and/or defaults for best performance. Right now, std::latch is
> implemented using ptrdiff_t internally and std::counting_semaphore's 
> default
> limit uses PTRDIFF_MAX:
> 
> template<ptrdiff_t __least_max_value =
> __gnu_cxx::__int_traits<ptrdiff_t>::__max>
> class counting_semaphore
> 
> As a consequence, those two classes will use 64-bit atomics on 64-bit
> architectures, which means they will not use futex. I recommend 
> changing
> std::latch's internal to int and lowering the default maximum in
> std::counting_semaphore to INT_MAX.
> 
> 5) std::barrier uses a type that futex can't handle
> 
> The barrier phase is the actual type that gets waited / notified on, 
> but it is
> an enum class on unsigned char:
> 
> enum class __barrier_phase_t : unsigned char { };
> 
> This is a harder problem to solve, but I really do recommend making it 
> 32-bit
> in width. Whether each barrier phase holds only 256 values or the full 
> 32
> bits, I don't know (I've implemented with wrapping on 256).
> 
> What is of consequence is that the __state_t type could increase in 
> size
> ("could" because we can simply reduce the number of tickets in each 
> state
> entry). I haven't done enough research on this to know whether it is 
> the right
> way to go.



Links:
------
[1] https://github.com/ogiroux/atomic_wait