public inbox for gcc-bugs@sourceware.org

* [Bug libstdc++/106772] New: atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-08-29 12:38 UTC

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

            Bug ID: 106772
           Summary: atomic<T>::wait shouldn't touch waiter pool if used
                    platform wait
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: valera.mironow at gmail dot com
  Target Milestone: ---

atomic<T>::wait shouldn't touch the waiter pool when the platform wait is
used directly.

Otherwise it is affected by waiter pool bugs, and performance can degrade
on machines with many cores, because the waiter pool has only 16 cells.
It also performs a fetch_add with seq_cst, which can hide synchronization
issues.
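
For reference, a minimal sketch of the direct platform-wait path this
report asks for, assuming a Linux futex on a lock-free std::atomic<int>
(the wrappers below are illustrative, not libstdc++ code):

```
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Wait directly on the atomic's own storage: no shared waiter pool and no
// extra reference counting; the kernel re-checks the value atomically.
// Assumes std::atomic<int> is layout-compatible with int.
inline void direct_wait(std::atomic<int>& a, int old) {
  while (a.load(std::memory_order_acquire) == old)
    syscall(SYS_futex, reinterpret_cast<int*>(&a),
            FUTEX_WAIT_PRIVATE, old, nullptr);
}

inline void direct_notify_one(std::atomic<int>& a) {
  syscall(SYS_futex, reinterpret_cast<int*>(&a), FUTEX_WAKE_PRIVATE, 1);
}
```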

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-08-29 12:40 UTC

--- Comment #1 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
The libc++ implementation does this. So does the MSVC STL, of course.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-08-29 15:12 UTC

Thomas Rodgers <rodgertq at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
This is not a bug. The size of the waiter pool notwithstanding, the point
is that you want to determine whether a notify is likely to wake anything,
to avoid the comparatively very expensive syscall to wake the futex.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-20  3:38 UTC

--- Comment #3 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
Since this latter point has come up before, I want to additionally note
that the optimization of keeping an atomic count of waiters per waiter pool
bucket means that, in my testing, a call to notify_one/notify_all is
roughly 25x faster than naively issuing a FUTEX_WAKE syscall when there is
no possibility of the wake reaching a waiter.

2022-09-19T20:34:28-07:00
Running ./benchmark
Run on (20 X 4800 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 0.69, 0.61, 1.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may
be noisy and will incur extra overhead.
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
BM_empty_notify_checked       3.79 ns         3.79 ns    179929051
BM_empty_notify_syscall       94.1 ns         93.9 ns      7477997

For types that can use a futex directly (e.g. int), there is no place to
put that extra atomic to perform this check, so we can either make the type
that is directly usable by the underlying platform significantly more
expensive to call, or we can use the waiter count in the waiter_pool.
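
In other words, the notify path amounts to the following sketch (invented
names, not the actual libstdc++ internals; __waiters stands in for the
per-bucket count in the waiter pool):

```
// Pay for the syscall only when the bucket says someone may be waiting;
// the no-waiter case is a single atomic load (~4ns above, vs ~94ns for an
// unconditional FUTEX_WAKE).
void notify_one(std::atomic<int>& a) {
  if (__waiters(&a).load(std::memory_order_seq_cst) == 0)
    return;
  futex_wake(&a, 1);  // hypothetical wrapper around the FUTEX_WAKE syscall
}
```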

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20  7:39 UTC

--- Comment #4 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
Nice, thanks for the benchmarks without the code.

Also, why do I need to call notify when there is no waiter?

Common usage, from my point of view, looks like this:

atomic.wait(value, mo) is essentially:

while (atomic.load(mo) == value) {
  futex_wait();
}

And the notify side looks like:

atomic.store(1, mo);
atomic.notify_one();

See:

https://github.com/lewissbaker/cppcoro/blob/master/lib/lightweight_manual_reset_event.cpp

https://github.com/YACLib/YACLib/blob/main/src/util/atomic_event.cpp

and others.

So your optimization is pretty useless.

Also, if for some reason multiple notifications are possible, or a notify
can happen before the wait, the code can look like this:

if (atomic.exchange(non_wait_value, mo) == wait_value) {
  atomic.notify_one();
}

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20  7:47 UTC

--- Comment #5 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
The single reason I want to use atomic::wait/notify is as a cross-platform
API for futex, wait/wake on address, ulock, etc.
Not because I need YOU to decide for me how many spins or which other
optimizations I need.

boost::atomic already does this.

Why can't you do the same?

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-20 14:14 UTC

--- Comment #6 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #5)
> The single reason I want to use atomic::wait/notify is as a cross-platform
> API for futex, wait/wake on address, ulock, etc.
> Not because I need YOU to decide for me how many spins or which other
> optimizations I need.
> 
> boost::atomic already does this.
> 
> Why can't you do the same?

(In reply to Mkkt Bkkt from comment #4)
> Nice, thanks for the benchmarks without the code.
> 
> Also, why do I need to call notify when there is no waiter?
> 
> Common usage, from my point of view, looks like this:
> 
> atomic.wait(value, mo) is essentially:
> 
> while (atomic.load(mo) == value) {
>   futex_wait();
> }
> 
> And the notify side looks like:
> 
> atomic.store(1, mo);
> atomic.notify_one();
> 
> See:
> 
> https://github.com/lewissbaker/cppcoro/blob/master/lib/lightweight_manual_reset_event.cpp
> 

I have every confidence that Lewis knows how to bring a paper for a
'lightweight manual reset event' to SG1; I suspect it will be well received
when he does.

> https://github.com/YACLib/YACLib/blob/main/src/util/atomic_event.cpp
> 
> and others.
> 
> So your optimization is pretty useless.
> 
> Also, if for some reason multiple notifications are possible, or a notify
> can happen before the wait, the code can look like this:
> 
> if (atomic.exchange(non_wait_value, mo) == wait_value) {
>   atomic.notify_one();
> }

I

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 14:20 UTC

--- Comment #7 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
Can you give an example of when this optimization is needed and cannot be
done on the user side rather than in the stdlib?

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 14:23 UTC

--- Comment #8 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
> I have every confidence that Lewis knows how to bring a paper for a
> 'lightweight manual reset event' to SG1; I suspect it will be well
> received when he does.

So, at least until C++26, I and any other developer who knows that notify
is never called on a non-waiting atomic should use ifdefs for the
Linux/Windows/macOS etc. syscalls?
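
For the record, such a kludge might look roughly like this sketch (the
function names are the real platform primitives, but error handling and
includes are omitted, and __ulock_wake is a private, undocumented Apple
API):

```
// Per-platform dispatch for a bare wake with no waiter bookkeeping, for
// code that guarantees every notify has a matching waiter.
inline void bare_notify_one(void* addr) {
#if defined(__linux__)
  syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, 1);
#elif defined(_WIN32)
  WakeByAddressSingle(addr);
#elif defined(__APPLE__)
  __ulock_wake(UL_COMPARE_AND_WAIT, addr, 0);
#endif
}
```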

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 14:27 UTC

--- Comment #9 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
Why do you think you are smarter than the MSVC STL, libc++, and
boost::atomic developers?

Maybe it's about your "I"?

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: redi at gcc dot gnu.org @ 2022-09-20 14:36 UTC

--- Comment #10 from Jonathan Wakely <redi at gcc dot gnu.org> ---
Please try to be civil, or your requests will simply be ignored.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-20 14:40 UTC

--- Comment #11 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #9)
> Why do you think you are smarter than the MSVC STL, libc++, and
> boost::atomic developers?
> 
> Maybe it's about your "I"?

I should ignore this (see jwakely's response), but -

I very much don't think I am smarter than the person (ogiroux@apple.com)
who wrote libc++'s implementation. There are minor differences between
libstdc++'s details and libc++'s, but in this regard in particular they
behave the same.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 14:54 UTC

--- Comment #12 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
First off, I was toxic, sorry.
But you started it first; maybe that's allowed for a maintainer, I don't
know.

But I am still waiting for the benchmark code.

I also want to see a usage example where we notify an atomic that is in a
non-deterministic wait state, and where we cannot apply this optimization
outside of libstdc++.

Because honestly I don't understand the purpose of this optimization,
except for benchmarks like

```
atomic<int> x;
while (...) {
  x.notify_one();
}
```

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 15:45 UTC

--- Comment #13 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
Here is pseudocode for a simple mutex that doesn't need to call notify on
every unlock. The same is possible in other cases; if it isn't, please
share a counterexample with me.

```
constexpr int kUnlocked = 0;
constexpr int kLocked = 1;
constexpr int kLockedWithWaiters = 2;

std::atomic<int> state{kUnlocked};

// compare_exchange_strong needs an lvalue for the expected value, so wrap
// it: cas(expected, desired) returns true if state went expected->desired.
bool cas(int expected, int desired) {
  return state.compare_exchange_strong(expected, desired);
}

void lock() {
  // some spinning first
  if (cas(kUnlocked, kLocked)) {
    return;
  }
  do {
    if (state.load() == kLockedWithWaiters ||
        cas(kLocked, kLockedWithWaiters)) {
      state.wait(kLockedWithWaiters);
    }
  } while (!cas(kUnlocked, kLockedWithWaiters));
}

void unlock() {
  // Wake a waiter only when the state says there may be one.
  if (state.exchange(kUnlocked) == kLockedWithWaiters) {
    state.notify_one();
  }
}
```

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: redi at gcc dot gnu.org @ 2022-09-20 15:55 UTC

--- Comment #14 from Jonathan Wakely <redi at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #12)
> First off, I was toxic, sorry.
> But you started it first; maybe that's allowed for a maintainer, I don't
> know.

No, you started it with sarcasm and disparaging remarks in comment 4, and
continued the aggression in comment 5.

Your example demonstrates a state machine where the only possible values of the
atomic are known (e.g. a locked state and unlocked state). That is not the only
way to use atomics. If you have an atomic counter and want to signal when it
has been incremented, you cannot tell from the previous value whether another
thread is waiting.
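
To illustrate with a sketch: with a change counter like the one below, the
publishing thread has no way to know whether anyone is currently blocked in
wait, so it must always notify, and the library's cheap waiter check is
what keeps that affordable (the names here are illustrative):

```
#include <atomic>

std::atomic<int> version{0};

// Called whether or not anyone is waiting; the previous value of the
// counter says nothing about waiters.
void publish() {
  version.fetch_add(1, std::memory_order_release);
  version.notify_all();  // cheap when the waiter pool shows no waiters
}

// May or may not be running at any given moment.
void wait_for_change(int last_seen) {
  version.wait(last_seen);  // returns once version != last_seen
}
```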

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 16:00 UTC

--- Comment #15 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
>  If you have an atomic counter and want to signal when it has been incremented, you cannot tell from the previous value whether another thread is waiting.

I gave an example of that.

Are you talking about semaphore-like usage?

As far as I know, the semaphore in libstdc++ uses a bare wait without
touching the waiter pool. Am I wrong?

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-20 16:05 UTC

--- Comment #16 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
> it with sarcasm

I started with sarcasm because you restarted this thread with some doubtful
benchmarks without the code for them.

I think that's very disrespectful.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-28 21:54 UTC

--- Comment #17 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
Created attachment 53638
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53638&action=edit
benchmark main.cc

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-28 22:00 UTC

--- Comment #18 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
Thanks for providing the source of your benchmarks!

Can you also provide an example of using atomic::notify where this
optimization is needed?

If such an example exists, I think you are right and it is necessary.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-28 22:04 UTC

--- Comment #19 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
An example with a futex or other syscall would also work, because atomic::wait
is still not widely used.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-28 22:18 UTC

--- Comment #20 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
My main concern with this optimization is that it's not zero-overhead.

It's not necessary when we expect to have some waiters; in that case it is
just additional synchronization and contention in the waiter pool (which
has a small fixed size; just imagine a system with 100+ cores: if we have
more than 16 waiting threads, some of them do fetch_add/sub on the same
atomic, which can be expensive, especially on NUMA).

And at the same time, I don't understand when I would need to notify yet
could not know that the notification is unneeded.
I don't understand when it is useful.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-28 22:34 UTC

--- Comment #21 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #16)
> > it with sarcasm
> 
> I started with sarcasm because you restarted this thread with some
> doubtful benchmarks without the code for them.
> 
> I think that's very disrespectful.

I wasn't sarcastic in what I posted. As I noted, this question has come up
before in different contexts; Bugzilla is a useful historical archive, so
updating this issue with my reasoning and a bit of data was primarily a
capture task.

So, let's try this again.

I did not post the original source because it required hacking on the
libstdc++ headers. I have posted a version that does not require that; the
results are identical.

This test is the example Jonathan cited in comment #14: incrementing an
atomic int and calling notify.

This isn't about semaphore or any other synchronization primitive. Those types
are free to make different choices that may be more appropriate to the
constrained usage of the type just like Lewis' lightweight manual reset event
does (I will also note that Lewis has reviewed this implementation, and has
written a paper to be discussed at the Fall meeting, p2616). 

There are, as Jonathan has pointed out, use cases where notify can and will
be called without the notifier having any way to determine whether it will
wake a waiter. One example that I, as the person who is going to have to
implement C++26 executors, care about is a wait-free work-stealing deque.
To be algorithmically correct it requires nothing more than spinning for
work on an empty queue; but after spinning on an empty queue, making the
rounds trying to steal work from other deques, and maybe spinning a bit
more just to be sure, the dequeuing thread that hasn't been able to acquire
more work will probably want to enter a wait until it knows it can do
productive work. Another thread pushing work into that queue cannot
determine whether the dequeuing thread is spinning for new work, stealing,
or has entered a wait; but atomic<int>::notify() does know that, and can
avoid penalizing the submitting thread with a syscall when there is no
thread in a wait on the other end of the deque, which is the expected case
for this algorithm.
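
A sketch of that scenario (the Task type and the queue operations are
hypothetical names; the point is only that the push side cannot see the
consumer's state, while notify() can):

```
#include <atomic>

std::atomic<int> work_epoch{0};

// Consumer side: spin, try to steal, and only then park on the epoch
// word. A real implementation re-checks its own deque before sleeping.
void consumer_idle(int seen_epoch) {
  if (spin_for_work() || try_steal()) return;
  work_epoch.wait(seen_epoch);  // blocks only while the epoch is unchanged
}

// Producer side: it cannot tell whether the consumer is spinning,
// stealing, or parked, so it must always notify; the waiter check makes
// the common no-waiter case nearly free.
void push(Task task) {
  deque_push(task);
  work_epoch.fetch_add(1, std::memory_order_release);
  work_epoch.notify_one();
}
```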

p1135 was the paper that added atomic wait/notify. One of the co-authors of
that paper wrote the libc++ implementation. That implementation, like
libstdc++'s, is not simply a wrapper over the underlying platform's wait
primitive. It has two 'backends': an exponentially timed backoff and ulock
wait/wake. libstdc++ currently has futex and condvar backends. Both
implementations choose a short-term spinning strategy and a long-term
waiting strategy (spinning, futex/ulock, condvar).
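
Schematically, both libraries' wait paths look something like this sketch
(the spin count and the platform_wait helper are invented stand-ins, not
the real internals):

```
#include <atomic>
#include <thread>

// Two-phase wait: spin briefly, then fall back to the platform primitive.
// platform_wait() stands in for the futex/ulock/condvar backend.
template <typename Pred>
void two_phase_wait(std::atomic<int>& a, int old, Pred done) {
  for (int i = 0; i < 64 && !done(); ++i)  // short-term spinning strategy
    std::this_thread::yield();
  while (!done())                          // long-term waiting strategy
    platform_wait(a, old);
}
```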

I have confirmed with the libc++ implementation's author (who also chairs
the C++ committee's Concurrency and Parallelism study group) that it was
never the intention of p1135, or the subsequently standardized language in
C++20, to imply that wait/notify were direct portable analogs of the
platform's waiting primitives. There are users, such as yourself, who want
exactly that; there are other users (like in my prior industry) for whom
busy waiting is the desired strategy; and in between those two choices are
people who want it to work as advertised in the standard, and to do so
'efficiently'. Both libc++ and libstdc++ take a balanced approach,
somewhere between always going to the OS and always spinning.

There is an open question here that your original issue raises -

* At what point do collisions on the waiter pool, with the cache invalidation
traffic and spurious wakeups that result, swamp the gain of doing this empty
waiter check on notify?

I also made a comment about 'the size of the waiter pool notwithstanding'.
I chose a smaller size than libc++ chose, in part because Jonathan and I
did not want to make any sort of ABI commitment until this had been in a
few GCC releases. This implementation is header-only at present, and still
considered experimental. libc++ committed to an ABI early.

In the sizing of the libc++ waiter pool there is the comment that 'there is
no magic in this value'.

Not only is there no magic, there is no test of any sort that I have done,
or that has been done on libc++, to determine what effect different pool
sizes have under different load scenarios. So all of it is a guess at this
point. I will likely match libc++ when I move this into the .so.

Finally, in the most recent mailing there is p2643, which proposes
additional changes to atomic waiting. One proposal is to add a 'hinted'
wait that would allow the caller to steer the choices atomic wait/notify
makes. I have conferred with the other authors of the paper; this latter
option is not without controversy, and likely has some sharp edges for the
user, but I plan to raise the discussion at the Fall WG21 meeting to see
what the other members of SG1 think.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-28 22:50 UTC

--- Comment #22 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #20)
> My main concern with this optimization is that it's not zero-overhead.
> 
> It's not necessary when we expect to have some waiters; in that case it is
> just additional synchronization and contention in the waiter pool (which
> has a small fixed size; just imagine a system with 100+ cores: if we have
> more than 16 waiting threads, some of them do fetch_add/sub on the same
> atomic, which can be expensive, especially on NUMA).
> 
> And at the same time, I don't understand when I would need to notify yet
> could not know that the notification is unneeded.
> I don't understand when it is useful.

You are correct, it is not zero-overhead. It also isn't clear what those
overheads actually are. As I noted in comment #21, there is no test over a
variety of workloads to inform this discussion, either.

Your example of '100+ core' systems, especially on NUMA, is certainly a
valid one. I would ask: at what point do those collisions and the resulting
cache invalidation traffic swamp the cost of just making the syscall? I do
plan to put these tests together, because there is another algorithm I am
exploring that I believe will reduce the likelihood of spurious wakeups and
achieve the same result as this particular approach without generating the
same invalidation traffic. At this point, I don't anticipate doing that
work until after GCC13 stage1 closes.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-28 23:09 UTC

--- Comment #23 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
(In reply to Thomas Rodgers from comment #21)
> I wasn't sarcastic in what I posted. As I noted, this question has come up
> before in different contexts; Bugzilla is a useful historical archive, so
> updating this issue with my reasoning and a bit of data was primarily a
> capture task.

Ok, sorry

> This test is the example Jonathan cited in comment #14: incrementing an
> atomic int and calling notify.

Incrementing the atomic int here doesn't touch the waiter pool atomic. I
think that in reality it is possible to get contention between the load in
notify and the fetch_add/sub in some wait (maybe on other atomics).

> This isn't about semaphore or any other synchronization primitive. Those
> types are free to make different choices that may be more appropriate to the
> constrained usage of the type just like Lewis' lightweight manual reset
> event does (I will also note that Lewis has reviewed this implementation,
> and has written a paper to be discussed at the Fall meeting, p2616). 

OK, but that just means that at least until C++26 developers have to use
their own kludge.

> There are, as Jonathan has pointed out, use cases where notify can and
> will be called without the notifier having any way to determine whether it
> will wake a waiter. One example that I, as the person who is going to have
> to implement C++26 executors, care about is a wait-free work-stealing
> deque. To be algorithmically correct it requires nothing more than
> spinning for work on an empty queue; but after spinning on an empty queue,
> making the rounds trying to steal work from other deques, and maybe
> spinning a bit more just to be sure, the dequeuing thread that hasn't been
> able to acquire more work will probably want to enter a wait until it
> knows it can do productive work. Another thread pushing work into that
> queue cannot determine whether the dequeuing thread is spinning for new
> work, stealing, or has entered a wait; but atomic<int>::notify() does know
> that, and can avoid penalizing the submitting thread with a syscall when
> there is no thread in a wait on the other end of the deque, which is the
> expected case for this algorithm.

It's a good example, but I think it would be better to have your own
"waiter pool" in the executor for that; see the sketch below.
Let me try to explain why:
1) In that case you can control the size of the waiter pool; for example,
just create a separate atomic per threadpool thread.
2) You avoid contention and false positives with other atomics, because it
is not global.
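
A sketch of (1) and (2), with hypothetical names: one futex word per worker
instead of a shared 16-bucket pool:

```
#include <atomic>

struct Worker {
  std::atomic<int> parked{0};  // one wait word per thread: no sharing
};

void park(Worker& w) {
  w.parked.store(1, std::memory_order_seq_cst);
  // A real version re-checks the work queues here before sleeping.
  w.parked.wait(1);                 // sleep until unpark() flips it back
}

void unpark(Worker& w) {
  if (w.parked.exchange(0) == 1)    // wake only a thread that parked
    w.parked.notify_one();
}
```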

> p1135 was the paper that added atomic wait/notify. One of the co-authors
> of that paper wrote the libc++ implementation. That implementation, like
> libstdc++'s, is not simply a wrapper over the underlying platform's wait
> primitive. It has two 'backends': an exponentially timed backoff and ulock
> wait/wake. libstdc++ currently has futex and condvar backends. Both
> implementations choose a short-term spinning strategy and a long-term
> waiting strategy (spinning, futex/ulock, condvar).

I know (the one point where I was wrong: libc++ also uses the waiter pool
in the platform case), and I don't think that's good.
From my point of view it would be better to have something like:

std::wait_spin waiter;
bool std::wait_spin::wait(Condition condition_func);
void std::wait_spin::reset();

Also, the libc++ implementation does a timed wait on the futex (but not on
ulock); I don't know why, do you?
https://discourse.llvm.org/t/why-do-atomic-int-wait-use-timed-futex/65341

But in general that's not what this thread is about.

> I have confirmed with the libc++ implementation's author, (who also chairs
> the C++ committee's Concurrency and Parallelism study group), that it was
> never the intention of p1135 or the subsequently standardized language in
> C++20 to imply that wait/notify were direct portable analogs to the
> platform's waiting primitives. There are users, such as yourself that want
> exactly that, there are other users (like in my prior industry) where busy
> waiting is the desired strategy, and in between those two choices are people
> who want it to work as advertised in the standard, and to do so
> 'efficiently'. Both libc++ and libstdc++ take a balanced approach somewhere
> between always going to OS and always spinning.

It's a good point, although atomic wait/notify is too low-level for that to
make sense, in my opinion.
As a result, it seems to me that in this form it isn't very suitable for
developers who know what they want (the syscall).

And at the same time, developers who write business code don't particularly
need it (they generally use some library for coroutines/callbacks, for
example something like p2300).


> There is an open question here that your original issue raises -
> 
> * At what point do collisions on the waiter pool, with the cache
> invalidation traffic and spurious wakeups that result, swamp the gain of
> doing this empty waiter check on notify?

Just start a threadpool with the deques from your example.
No tasks.
All threads spin a bit and then go to wait; some of the threads do a
fetch_add on the same atomic.
Then we push a lot of work; all threads wake up, and some of them do a
fetch_sub on the same atomic.

> I also made a comment about 'the size of the waiter pool notwithstanding'.
> I chose a smaller size than libc++ chose, in part because Jonathan and I
> did not want to make any sort of ABI commitment until this had been in a
> few GCC releases. This implementation is header-only at present, and still
> considered experimental. libc++ committed to an ABI early.
> 
> In the sizing of the libc++ waiter pool there is the comment that 'there
> is no magic in this value'.
> 
> Not only is there no magic, there is no test of any sort that I have done,
> or that has been done on libc++, to determine what effect different pool
> sizes have under different load scenarios. So all of it is a guess at this
> point. I will likely match libc++ when I move this into the .so.

We just have different visions of its usage.
I think a user who uses atomic::wait knows about spinning before the
syscall, knows that the syscall isn't needed when there are no waiters,
etc.

You and the libc++ devs think the user cannot be assumed to know all of
this, and that it is better to have a more universal implementation.

These are just different approaches, and now I understand that arguing
about it was a little pointless.

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: valera.mironow at gmail dot com @ 2022-09-28 23:25 UTC

--- Comment #24 from Mkkt Bkkt <valera.mironow at gmail dot com> ---
(In reply to Thomas Rodgers from comment #22)
> Your example of '100+ core' systems, especially on NUMA, is certainly a
> valid one. I would ask: at what point do those collisions and the
> resulting cache invalidation traffic swamp the cost of just making the
> syscall? I do plan to put these tests together, because there is another
> algorithm I am exploring that I believe will reduce the likelihood of
> spurious wakeups and achieve the same result as this particular approach
> without generating the same invalidation traffic. At this point, I don't
> anticipate doing that work until after GCC13 stage1 closes.

Let me try to explain:

Syscall overhead is a roughly constant cost, commonly around 10-30ns (a
futex syscall can be more expensive, around 100ns as in your example).

But core counts keep growing, and ARM is becoming more popular
(fetch_add/sub costs more there compared to x86).
People have already run into situations where a fetch_add costs more than
the syscall overhead:

https://pkolaczk.github.io/server-slower-than-a-laptop/
https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

I don't think we will hit problems like the ones in those links with
atomic::wait/notify in real code, but I'm pretty sure that in some cases it
can be more expensive than the syscall part of atomic::wait/notify.

Of course it would be better to prove it; maybe someday I will :(

* [Bug libstdc++/106772] atomic<T>::wait shouldn't touch waiter pool if used platform wait
From: rodgertq at gcc dot gnu.org @ 2022-09-28 23:40 UTC

--- Comment #25 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #24)
> (In reply to Thomas Rodgers from comment #22)
> > Your example of '100+ core' systems, especially on NUMA, is certainly a
> > valid one. I would ask: at what point do those collisions and the
> > resulting cache invalidation traffic swamp the cost of just making the
> > syscall? I do plan to put these tests together, because there is another
> > algorithm I am exploring that I believe will reduce the likelihood of
> > spurious wakeups and achieve the same result as this particular approach
> > without generating the same invalidation traffic. At this point, I don't
> > anticipate doing that work until after GCC13 stage1 closes.
> 
> Let me try to explain:
> 
> Syscall overhead is a roughly constant cost, commonly around 10-30ns (a
> futex syscall can be more expensive, around 100ns as in your example).
> 
> But core counts keep growing, and ARM is becoming more popular
> (fetch_add/sub costs more there compared to x86).
> People have already run into situations where a fetch_add costs more than
> the syscall overhead:
> 
> https://pkolaczk.github.io/server-slower-than-a-laptop/
> https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
> 
> I don't think we will hit problems like the ones in those links with
> atomic::wait/notify in real code, but I'm pretty sure that in some cases
> it can be more expensive than the syscall part of atomic::wait/notify.
> 
> Of course it would be better to prove it; maybe someday I will :(

So, to your previous comment: I don't think the discussion is at all
pointless. I plan to raise some of these issues at the next SG1 meeting in
November. Sure, that doesn't help *you*, or any developer with your
specific intent, until C++26, and maybe Boost's implementation is a better
choice; I also get how unsatisfying an answer that is.

I'm well aware of the potential scalability problems, and I have a
longer-term plan to get concrete data on how different implementation
choices impact scalability. The barrier implementation (which uses the same
algorithm as libc++'s), for example, spreads this traffic over 64
individual atomic_refs for this very reason, and that implementation has
been shown to scale quite well on ORNL's Summit. But not all users of
libstdc++ have those sorts of problems, either.
