public inbox for libstdc++@gcc.gnu.org
 help / color / mirror / Atom feed
* C++2a synchronisation inefficient in GCC 11
@ 2021-02-25 22:50 Thiago Macieira
  2021-02-26 11:19 ` Jonathan Wakely
                   ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Thiago Macieira @ 2021-02-25 22:50 UTC (permalink / raw)
  To: libstdc++; +Cc: trodgers, jwakely

Hello all

I was investigating the implementation of std::atomic<T>::wait to use inside 
my own code (Qt's QMutex and QSemaphore), when I realised that it has 
huge inefficiencies. Since everything in this implementation is inline and, 
once released, it will tie our hands until the next ABI break (libstdc++.so.
7). And no, inline namespaces and ABI tags are no different than ABI breaks, 
they only make the ABI break more or less silent in the process.

Here's a summary of the findings:

 1) everything is inline
 2) futex code is still behind a lot of code calling into std::_Hash_bytes
 3) other int-sized (incl. enums) atomics don't use futex
 4) std::latch and std::counting_semaphore defaults preclude from using
    futex on Linux
 5) std::barrier implementation also uses a type that futex(2) can't handle

Here's what the code currently looks like: https://gcc.godbolt.org/z/MeP8rr

Given how far we are in the release cycle, a simple solution is to simply 
disable the entire solution by #if 0'ing out bits/atomic_wait.h. If 
__cpp_lib_atomic_wait isn't defined, the rest of the functionality cascades 
disabled too. Then we can spend however long is necessary to get a good 
implementation for GCC 12.

I have patches for all but the first issue, but they are a result of only 3 
hours of hacking to prove the concept. I will post them, but I do not expect 
them to get merged as-is.

Details:

1) everything is inline

I think this is unnecessary optimisation, especially since it expands to a LOT 
of code (see link above). A cursory research into libc++ and MS STL shows that 
the actual wait and notification are out-of-line. I recommend we do the same 
for libstdc++: if we're going to syscall anyway, the extra function call isn't 
going to be the performance bottleneck.

In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050360.html>, Jonathan 
wrote:
> Eventually we'll want to move the rest of this function (which doesn't
> depend on the template argument) into the compiled library, but it's
> better to be header-only for now.

But that has never happened.

Having out-of-line implementations would allow the implementation for generic-
sized atomic wait/notify to fine-tune as time and experience progresses. The 
std::__detail::__waiter implementation looks clever, but it's unclear if it is 
the best for all workloads. Moreover, the implementation is limited to 16 
waiters.

Moreover, it would allow us to have support in the future for a 64-bit futex. 
Linus did shoot down any need for it 13 years ago, but the world has changed 
since and it's not inconceivable to have a pointer-sized futex on every 
platform, which on 64-bit would be 64-bit. Windows and Darwin do have support 
for 64-bit compare-and-wait functionality.

And it would allow us to have futex-equivalent support in other OSes, added at 
our leisure. The current implementation does not support Windows's 
WaitOnAddress and could never support it if released in the current state. 
It's also possible other OSes will eventually add equivalent functionality 
because of the C++ standard. For example, FreeBSD already has futex() support 
inside its kernel in order to run Linux binaries, but it is not exposed to the 
FreeBSD-personality syscall ABI.

Inlining certain functionality at a later date is possible. De-inlining isn't.

2) futex code for int is still behind a lot of boiler plate

It comes from the creation/use of std::__detail::__waiter in 
std::__atomic_wait and __atomic_notify. That does allow one to detect whether 
there is contention in the first place and avoid the system call, but I think 
that's premature optimisation.

In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050362.html>, Thomas 
says:
> __atomic_notify checks to see if that count is non-zero
> before invoking the platform/semaphore notify because it is cheaper
> to do the atomic load than it is to make the syscall() when there are no
> waiters.

That's true, but it ignores the fact that users of std::atomic<T>::wait() are 
likely to track contention by themselves on the value of the atomic (I know I 
do). libc++, MS STL and the original atomic_wait[1] implementation operate on 
direct system calls and don't try to track contention. Because of that, 
libstdc++ adding an extra layer of tracking is double book-keeping and 
violates "don't pay for what you don't need".

[1] https://github.com/ogiroux/atomic_wait

3) other int-sized atomics don't use futex

The kernel doesn't care what we store in those 32 bits, only that they are 
comparable (FUTEX_OP does provide less-than and greater-than operations that 
are signed, but we don't use that and the kernel API can be later extended to 
unsigned compares). Therefore, I would recommend we extend futex support to 
all 32-bit signed scalar types. This would add:
 * pointers and long on 32-bit architectures
 * unsigned
 * untyped enums and typed enums on int & unsigned int
 * float

4) std::latch and std::counting_semaphore defaults too high

The standard requires a ptrdiff_t in the API, but it does allow us to set hard 
limits and/or defaults for best performance. Right now, std::latch is 
implemented using ptrdiff_t internally and std::counting_semaphore's default 
limit uses PTRDIFF_MAX:

  template<ptrdiff_t __least_max_value =
			__gnu_cxx::__int_traits<ptrdiff_t>::__max>
    class counting_semaphore

As a consequence, those two classes will use 64-bit atomics on 64-bit 
architectures, which means they will not use futex. I recommend changing 
std::latch's internal to int and lowering the default maximum in 
std::counting_semaphore to INT_MAX.

5) std::barrier uses a type that futex can't handle

The barrier phase is the actual type that gets waited / notified on, but it is 
an enum class on unsigned char:

  enum class __barrier_phase_t : unsigned char { };

This is a harder problem to solve, but I really do recommend making it 32-bit 
in width. Whether each barrier phase holds only 256 values or the full 32 
bits, I don't know (I've implemented with wrapping on 256).

What is of consequence is that the __state_t type could increase in size 
("could" because we can simply reduce the number of tickets in each state 
entry). I haven't done enough research on this to know whether it is the right 
way to go.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel DPG Cloud Engineering




^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2021-03-03 17:41 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-25 22:50 C++2a synchronisation inefficient in GCC 11 Thiago Macieira
2021-02-26 11:19 ` Jonathan Wakely
2021-02-26 17:37   ` Thiago Macieira
2021-02-26 18:29     ` Jonathan Wakely
2021-02-26 19:30       ` Ville Voutilainen
2021-02-26 21:17         ` Jonathan Wakely
2021-02-26 21:18           ` Ville Voutilainen
2021-02-26 21:39             ` Jonathan Wakely
2021-02-26 18:47     ` Ville Voutilainen
2021-02-26 23:53       ` Thiago Macieira
2021-02-26 23:58         ` Ville Voutilainen
2021-02-27  0:11           ` Thiago Macieira
2021-02-27  0:18             ` Ville Voutilainen
2021-02-27  0:36               ` Thiago Macieira
2021-02-27  0:44                 ` Ville Voutilainen
2021-02-27  0:53                   ` Thiago Macieira
2021-02-27  1:03                     ` Ville Voutilainen
2021-03-03 14:30                   ` Jonathan Wakely
2021-03-03 17:07                     ` Thiago Macieira
2021-03-03 17:14                       ` Jonathan Wakely
2021-02-27  0:22             ` Marc Glisse
2021-02-27  0:30               ` Ville Voutilainen
2021-02-27  0:43               ` Thiago Macieira
2021-03-03 14:24         ` Jonathan Wakely
2021-03-03 17:12           ` Thiago Macieira
2021-02-26 15:59 ` [PATCH 1/5] std::latch: reduce internal implementation from ptrdiff_t to int Thiago Macieira
2021-02-26 15:59   ` [PATCH 2/5] Atomic __platform_wait: accept any 32-bit type, not just int Thiago Macieira
2021-03-03 14:34     ` Jonathan Wakely
2021-03-03 16:21       ` Jonathan Wakely
2021-03-03 17:27         ` Thiago Macieira
2021-03-03 17:34         ` Ville Voutilainen
2021-03-03 17:41           ` Jonathan Wakely
2021-02-26 15:59   ` [PATCH 3/5] std::__atomic_wait: don't use __detail::__waiter with futex Thiago Macieira
2021-02-26 15:59   ` [PATCH 4/5] barrier: use int instead of unsigned char for the phase state Thiago Macieira
2021-02-28 15:05     ` Hans-Peter Nilsson
2021-03-01 16:28       ` Thomas Rodgers
2021-03-01 17:24       ` Thiago Macieira
2021-03-01 17:38         ` Thomas Rodgers
2021-03-01 17:40           ` Thomas Rodgers
2021-03-01 18:06           ` Thiago Macieira
2021-03-01 19:08             ` Thomas Rodgers
2021-03-01 18:12         ` Ville Voutilainen
2021-03-01 19:44           ` Thiago Macieira
2021-03-01 20:35             ` Ville Voutilainen
2021-03-01 21:54               ` Thiago Macieira
2021-03-01 22:04                 ` Ville Voutilainen
2021-03-01 22:21                   ` Thiago Macieira
2021-03-01 22:31                     ` Ville Voutilainen
2021-03-01 22:40                       ` Thiago Macieira
2021-02-26 15:59   ` [PATCH 5/5] barrier: optimise by not having the hasher in a loop Thiago Macieira
2021-03-03 14:36     ` Jonathan Wakely
2021-02-26 18:14   ` [PATCH 1/5] std::latch: reduce internal implementation from ptrdiff_t to int Andreas Schwab
2021-02-26 19:08     ` Thiago Macieira
2021-02-26 19:31       ` Andreas Schwab
2021-02-27  0:13         ` Thiago Macieira
2021-02-28 21:31           ` Hans-Peter Nilsson
2021-03-01  8:56             ` Richard Biener
2021-03-03 14:56               ` Jonathan Wakely
2021-03-03 15:02                 ` Andreas Schwab
2021-03-03 15:10                 ` Jonathan Wakely
2021-03-03 15:37                 ` Hans-Peter Nilsson
2021-03-01 16:32             ` Thomas Rodgers
2021-03-03 14:34   ` Jonathan Wakely
2021-03-03 17:14     ` Thiago Macieira
2021-03-03 17:18       ` Jonathan Wakely
2021-02-27  1:13 ` C++2a synchronisation inefficient in GCC 11 Thomas Rodgers
2021-02-27  1:29   ` Thomas Rodgers
2021-02-27  3:01     ` Thomas Rodgers
2021-03-01 17:46       ` Thomas Rodgers
2021-03-01 18:00         ` Thiago Macieira
2021-03-01 18:34           ` Thomas Rodgers
2021-03-01 19:11             ` Thiago Macieira
2021-02-27  2:02   ` Ville Voutilainen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).