From mboxrd@z Thu Jan 1 00:00:00 1970
From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up
 pthread_cond_wait due to a bug in undoing stealing
Date: Wed, 10 May 2023 21:29:15 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #60 from Frank Barrus ---
Comment on attachment 13419
  --> https://sourceware.org/bugzilla/attachment.cgi?id=13419
proposed lost wakeup fix with g_signals relative to g1_start and no signal
stealing

> pthreads NPTL: lost wakeup fix 2
>
> This fixes the lost wakeup (from a bug in signal stealing) with a change
> in the usage of g_signals[] in the condition variable internal state.
> It also completely eliminates the concept and handling of signal stealing,
> as well as the need for signalers to block to wait for waiters to wake
> up every time there is a G1/G2 switch.  This greatly reduces the average
> and maximum latency for pthread_cond_signal.
>
> The g_signals[] field now contains a signal count that is relative to
> the current g1_start value.  Since it is a 32-bit field, and the LSB is
> still reserved (though no longer used), it has a 31-bit value that
> corresponds to the low 31 bits of the sequence number in g1_start.
> (Since g1_start also has an LSB flag, this means bits 31:1 in g_signals
> correspond to bits 31:1 in g1_start, plus the current signal count.)
>
> By making the signal count relative to g1_start, there is no longer
> any ambiguity or A/B/A issue, and thus any checks before blocking,
> including the futex call itself, are guaranteed not to block if the G1/G2
> switch occurs, even if the signal count remains the same.  This allows
> safely blocking in G2 until the switch to G1 occurs, and then
> transitioning from G1 to a new G1 or G2, while always being able to
> distinguish the state change.  It removes the race condition and A/B/A
> problems that otherwise occurred when a late (pre-empted) waiter resumed
> just as the futex call attempted to block on g_signals, at which point
> there was no longer any opportunity to re-check things like whether the
> current G1 group was already closed.
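
To make the new invariant concrete before diving into the patch: a waiter
decides whether it may consume a signal by comparing the g_signals futex
word against the low bits of g1_start.  Here is a minimal standalone sketch
of that check; the helper name and parameter list are mine, not part of the
patch (which performs this computation inline in __pthread_cond_wait_common):

  #include <stdint.h>

  /* Decide whether group G currently has a consumable signal.
     G_SIGNALS is the group's 32-bit futex word; its LSB is reserved, so
     one signal is worth 2.  G1_START is the 64-bit group-start sequence;
     its LSB holds the index of the current G2.  */
  static int
  group_has_signal (uint32_t g_signals, uint64_t g1_start, unsigned int g)
  {
    /* "lowseq" is what the futex word holds when the group's signal count
       is zero: the low 31 bits of g1_start placed in bits 31:1.  If G is
       the current G2 (G2 never receives signals), compare the word against
       itself so the count reads as zero.  */
    uint32_t lowseq = ((g1_start & 1) == g)
                      ? g_signals : ((uint32_t) g1_start & ~1U);

    /* The signed difference tolerates 31-bit wrap-around.  Every G1/G2
       switch advances lowseq, so a stale waiter computes a mismatch here
       instead of seeing an aliased "same" value (the old A/B/A problem).  */
    return (int) (g_signals - lowseq) >= 2;
  }
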
>
> By fixing these issues, the signal stealing code can be eliminated,
> since there is no concept of signal stealing anymore.  The code that
> blocks until all waiters have exited g_refs can also be removed, since
> any waiters still in the g_refs region are now guaranteed to safely wake
> up and exit.  If there are any still left at this time, they are all
> sent one final futex wakeup to ensure that they are not blocked any
> longer, but there is no need for the signaller to block and wait for
> them to wake up and exit the g_refs region.
>
> The signal count is then effectively "zeroed", but since it is now
> relative to g1_start, this is done by advancing it to a new value that
> can be observed by any pending blocking waiters.  Any late waiters can
> always tell the difference, and can thus just cleanly exit if they are
> in a stale G1 or G2.  They can never steal a signal from the current
> G1 if they are not in the current G1, since the signal value that has
> to match in the cmpxchg has the low 31 bits of the g1_start value
> contained in it, which is checked first, so it won't match if there
> has been a G1/G2 change.
>
> Note: the 31-bit sequence number used in g_signals is designed to
> handle wrap-around when checking the signal count.  But if an entire
> 31-bit wraparound (2 billion signals) occurs while there is still a
> late waiter that has not yet resumed, and the count happens to then
> match the current g1_start low bits, and the pre-emption occurred after
> the normal "closed group" checks (which are 64-bit) but before the
> futex syscall and signal-consuming code, then an A/B/A issue could
> still result and cause an incorrect assumption about whether the waiter
> should block.  This particular scenario seems unlikely in practice.
> Note that once awake from the futex, the waiter would notice the closed
> group before consuming the signal (since that check is still 64-bit and
> would not be aliased by the wrap-around in g_signals), so the biggest
> impact would be blocking on the futex until the next full wakeup from a
> G1/G2 switch.
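
One more illustration before the diff itself, of the "zeroing" described
above: the G1/G2 switch never stores an absolute 0 into g_signals; it
advances the word to the new group's base value.  A sketch with
illustrative names (the patch computes this inline in
__condvar_quiesce_and_switch_g1):

  #include <stdint.h>

  /* Compute the "zero signals" futex-word value for a group whose G1
     started at OLD_G1_START and originally held OLD_ORIG_SIZE waiters.
     The new g1_start is their sum; shifting it left by one mirrors the
     condvar's sequence encoding, and clearing the reserved LSB yields
     the word that means "no signals available" for the reopened slot.  */
  static uint32_t
  group_zero_value (uint64_t old_g1_start, uint64_t old_orig_size)
  {
    return (uint32_t) ((old_g1_start + old_orig_size) << 1) & ~1U;
  }

Because this value is new to the slot (modulo the 31-bit wrap discussed in
the note above), a late waiter whose pending futex wait still expects the
old word value fails with EAGAIN instead of blocking on a stale group.
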
>
>Signed-off-by: Frank Barrus
>---
> nptl/pthread_cond_common.c | 105 +++++++++++----------------------
> nptl/pthread_cond_wait.c   | 144 +++++++++++++------------------------------
> 2 files changed, 81 insertions(+), 168 deletions(-)
>
>diff --git a/nptl/pthread_cond_common.c b/nptl/pthread_cond_common.c
>index c35b9ef..eae7f4d 100644
>--- a/nptl/pthread_cond_common.c
>+++ b/nptl/pthread_cond_common.c
>@@ -341,7 +341,6 @@ static bool __attribute__ ((unused))
> __condvar_quiesce_and_switch_g1 (pthread_cond_t *cond, uint64_t wseq,
>                                  unsigned int *g1index, int private)
> {
>-  const unsigned int maxspin = 0;
>   unsigned int g1 = *g1index;
>
>   /* If there is no waiter in G2, we don't do anything.  The expression may
>@@ -362,84 +361,46 @@ __condvar_quiesce_and_switch_g1 (pthread_cond_t *cond, uint64_t wseq,
>    * New waiters arriving concurrently with the group switching will all go
>      into G2 until we atomically make the switch.  Waiters existing in G2
>      are not affected.
>-   * Waiters in G1 will be closed out immediately by setting a flag in
>-     __g_signals, which will prevent waiters from blocking using a futex on
>-     __g_signals and also notifies them that the group is closed.  As a
>-     result, they will eventually remove their group reference, allowing us
>-     to close switch group roles.  */
>-
>-  /* First, set the closed flag on __g_signals.  This tells waiters that are
>-     about to wait that they shouldn't do that anymore.  This basically
>-     serves as an advance notificaton of the upcoming change to __g1_start;
>-     waiters interpret it as if __g1_start was larger than their waiter
>-     sequence position.  This allows us to change __g1_start after waiting
>-     for all existing waiters with group references to leave, which in turn
>-     makes recovery after stealing a signal simpler because it then can be
>-     skipped if __g1_start indicates that the group is closed (otherwise,
>-     we would have to recover always because waiters don't know how big their
>-     groups are).  Relaxed MO is fine.  */
>-  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
>-
>-  /* Wait until there are no group references anymore.  The fetch-or operation
>-     injects us into the modification order of __g_refs; release MO ensures
>-     that waiters incrementing __g_refs after our fetch-or see the previous
>-     changes to __g_signals and to __g1_start that had to happen before we can
>-     switch this G1 and alias with an older group (we have two groups, so
>-     aliasing requires switching group roles twice).  Note that nobody else
>-     can have set the wake-request flag, so we do not have to act upon it.
>-
>-     Also note that it is harmless if older waiters or waiters from this G1
>-     get a group reference after we have quiesced the group because it will
>-     remain closed for them either because of the closed flag in __g_signals
>-     or the later update to __g1_start.  New waiters will never arrive here
>-     but instead continue to go into the still current G2.  */
>-  unsigned r = atomic_fetch_or_release (cond->__data.__g_refs + g1, 0);
>-  while ((r >> 1) > 0)
>-    {
>-      for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
>-        {
>-          /* TODO Back off.  */
>-          r = atomic_load_relaxed (cond->__data.__g_refs + g1);
>-        }
>-      if ((r >> 1) > 0)
>-        {
>-          /* There is still a waiter after spinning.  Set the wake-request
>-             flag and block.  Relaxed MO is fine because this is just about
>-             this futex word.
>-
>-             Update r to include the set wake-request flag so that the upcoming
>-             futex_wait only blocks if the flag is still set (otherwise, we'd
>-             violate the basic client-side futex protocol).  */
>-          r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
>-
>-          if ((r >> 1) > 0)
>-            futex_wait_simple (cond->__data.__g_refs + g1, r, private);
>-          /* Reload here so we eventually see the most recent value even if we
>-             do not spin.  */
>-          r = atomic_load_relaxed (cond->__data.__g_refs + g1);
>-        }
>-    }
>-  /* Acquire MO so that we synchronize with the release operation that waiters
>-     use to decrement __g_refs and thus happen after the waiters we waited
>-     for.  */
>-  atomic_thread_fence_acquire ();
>+   * Waiters in G1 will be closed out immediately by the advancing of
>+     __g_signals to the next "lowseq" (low 31 bits of the new g1_start),
>+     which will prevent waiters from blocking using a futex on
>+     __g_signals since it provides enough signals for all possible
>+     remaining waiters.  As a result, they can each consume a signal
>+     and they will eventually remove their group reference.  */
>
>   /* Update __g1_start, which finishes closing this group.  The value we add
>      will never be negative because old_orig_size can only be zero when we
>      switch groups the first time after a condvar was initialized, in which
>-     case G1 will be at index 1 and we will add a value of 1.  See above for
>-     why this takes place after waiting for quiescence of the group.
>+     case G1 will be at index 1 and we will add a value of 1.
>      Relaxed MO is fine because the change comes with no additional
>      constraints that others would have to observe.  */
>   __condvar_add_g1_start_relaxed (cond,
>       (old_orig_size << 1) + (g1 == 1 ? 1 : - 1));
>
>-  /* Now reopen the group, thus enabling waiters to again block using the
>-     futex controlled by __g_signals.  Release MO so that observers that see
>-     no signals (and thus can block) also see the write __g1_start and thus
>-     that this is now a new group (see __pthread_cond_wait_common for the
>-     matching acquire MO loads).  */
>-  atomic_store_release (cond->__data.__g_signals + g1, 0);
>+  unsigned int lowseq = ((old_g1_start + old_orig_size) << 1) & ~1U;
>+
>+  /* If any waiters still hold group references (and thus could be blocked),
>+     then wake them all up now and prevent any running ones from blocking.
>+     This is effectively a catch-all for any possible current or future
>+     bugs that can allow the group size to reach 0 before all G1 waiters
>+     have been awakened or at least given signals to consume, or any
>+     other case that can leave blocked (or about to block) older waiters.  */
>+  if ((atomic_fetch_or_release (cond->__data.__g_refs + g1, 0) >> 1) > 0)
>+    {
>+      /* First advance signals to the end of the group (i.e. enough signals
>+         for the entire G1 group) to ensure that waiters which have not
>+         yet blocked in the futex will not block.
>+         Note that in the vast majority of cases, this should never
>+         actually be necessary, since __g_signals will have enough
>+         signals for the remaining g_refs waiters.  As an optimization,
>+         we could check this first before proceeding, although that
>+         could still leave the potential for futex lost wakeup bugs
>+         if the signal count was non-zero but the futex wakeup
>+         was somehow lost.  */
>+      atomic_store_relaxed (cond->__data.__g_signals + g1, lowseq);
>+
>+      futex_wake (cond->__data.__g_signals + g1, INT_MAX, private);
>+    }
>
>   /* At this point, the old G1 is now a valid new G2 (but not in use yet).
>      No old waiter can neither grab a signal nor acquire a reference without
>@@ -451,6 +412,10 @@ __condvar_quiesce_and_switch_g1 (pthread_cond_t *cond, uint64_t wseq,
>   g1 ^= 1;
>   *g1index ^= 1;
>
>+  /* Now advance the new G1 g_signals to the new lowseq, giving it
>+     an effective signal count of 0 to start.  */
>+  atomic_store_relaxed (cond->__data.__g_signals + g1, lowseq);
>+
>   /* These values are just observed by signalers, and thus protected by the
>      lock.  */
>   unsigned int orig_size = wseq - (old_g1_start + old_orig_size);
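
The pthread_cond_wait.c half below makes the waiter pass its last observed
g_signals value to the futex instead of the constant 0.  Roughly, in an
illustrative wrapper (the real call in the patch is
__futex_abstimed_wait_cancelable64; the futex_wait declaration here is an
assumption for the sketch, not a quoted glibc signature):

  #include <stdint.h>

  /* Assumed helper with the usual futex-wait shape: sleep only while
     *WORD still equals EXPECTED, else fail immediately with EAGAIN.  */
  extern int futex_wait (uint32_t *word, uint32_t expected, int private);

  static int
  block_on_group (uint32_t *g_signals_word, uint32_t observed, int private)
  {
    /* The kernel atomically re-checks *g_signals_word == OBSERVED before
       sleeping.  Any concurrent G1/G2 switch advances the word (it embeds
       the g1_start low bits), so a stale waiter cannot block here; under
       the old scheme an aliased 0 could slip past this check.  */
    return futex_wait (g_signals_word, observed, private);
  }
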
>diff --git a/nptl/pthread_cond_wait.c b/nptl/pthread_cond_wait.c
>index 54e504a..69cba15 100644
>--- a/nptl/pthread_cond_wait.c
>+++ b/nptl/pthread_cond_wait.c
>@@ -239,9 +239,7 @@ __condvar_cleanup_waiting (void *arg)
>      signaled), and a reference count.
>
>    The group reference count is used to maintain the number of waiters that
>-   are using the group's futex.  Before a group can change its role, the
>-   reference count must show that no waiters are using the futex anymore; this
>-   prevents ABA issues on the futex word.
>+   are using the group's futex.
>
>    To represent which intervals in the waiter sequence the groups cover (and
>    thus also which group slot contains G1 or G2), we use a 64b counter to
>@@ -301,11 +299,12 @@ __condvar_cleanup_waiting (void *arg)
>      last reference.
>    * Reference count used by waiters concurrently with signalers that have
>      acquired the condvar-internal lock.
>-   __g_signals: The number of signals that can still be consumed.
>+   __g_signals: The number of signals that can still be consumed, relative to
>+     the current g1_start.  (i.e. bits 31 to 1 of __g_signals are bits
>+     31 to 1 of g1_start with the signal count added)
>    * Used as a futex word by waiters.  Used concurrently by waiters and
>      signalers.
>-   * LSB is true iff this group has been completely signaled (i.e., it is
>-     closed).
>+   * LSB is currently reserved and 0.
>    __g_size: Waiters remaining in this group (i.e., which have not been
>      signaled yet.
>    * Accessed by signalers and waiters that cancel waiting (both do so only
>@@ -329,18 +328,6 @@ __condvar_cleanup_waiting (void *arg)
>    sufficient because if a waiter can see a sufficiently large value, it could
>    have also consume a signal in the waiters group.
>
>-   Waiters try to grab a signal from __g_signals without holding a reference
>-   count, which can lead to stealing a signal from a more recent group after
>-   their own group was already closed.  They cannot always detect whether they
>-   in fact did because they do not know when they stole, but they can
>-   conservatively add a signal back to the group they stole from; if they
>-   did so unnecessarily, all that happens is a spurious wake-up.  To make this
>-   even less likely, __g1_start contains the index of the current g2 too,
>-   which allows waiters to check if there aliasing on the group slots; if
>-   there wasn't, they didn't steal from the current G1, which means that the
>-   G1 they stole from must have been already closed and they do not need to
>-   fix anything.
>-
>    It is essential that the last field in pthread_cond_t is __g_signals[1]:
>    The previous condvar used a pointer-sized field in pthread_cond_t, so a
>    PTHREAD_COND_INITIALIZER from that condvar implementation might only
>@@ -436,6 +423,9 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>     {
>       while (1)
>         {
>+          uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>+          unsigned int lowseq = (g1_start & 1) == g ? signals : g1_start & ~1U;
>+
>           /* Spin-wait first.
>              Note that spinning first without checking whether a timeout
>              passed might lead to what looks like a spurious wake-up even
>@@ -447,35 +437,45 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>              having to compare against the current time seems to be the right
>              choice from a performance perspective for most use cases.  */
>           unsigned int spin = maxspin;
>-          while (signals == 0 && spin > 0)
>+          while (spin > 0 && ((int)(signals - lowseq) < 2))
>             {
>               /* Check that we are not spinning on a group that's already
>                  closed.  */
>-              if (seq < (__condvar_load_g1_start_relaxed (cond) >> 1))
>-                goto done;
>+              if (seq < (g1_start >> 1))
>+                break;
>
>               /* TODO Back off.  */
>
>               /* Reload signals.  See above for MO.  */
>               signals = atomic_load_acquire (cond->__data.__g_signals + g);
>+              g1_start = __condvar_load_g1_start_relaxed (cond);
>+              lowseq = (g1_start & 1) == g ? signals : g1_start & ~1U;
>               spin--;
>             }
>
>-          /* If our group will be closed as indicated by the flag on signals,
>-             don't bother grabbing a signal.  */
>-          if (signals & 1)
>-            goto done;
>-
>-          /* If there is an available signal, don't block.  */
>-          if (signals != 0)
>+          if (seq < (g1_start >> 1))
>+            {
>+              /* If the group is closed already,
>+                 then this waiter originally had enough extra signals to
>+                 consume, up until the time its group was closed.  */
>+              goto done;
>+            }
>+
>+          /* If there is an available signal, don't block.
>+             If __g1_start has advanced at all, then we must be in G1
>+             by now, perhaps in the process of switching back to an older
>+             G2, but in either case we're allowed to consume the available
>+             signal and should not block anymore.  */
>+          if ((int)(signals - lowseq) >= 2)
>             break;
>
>           /* No signals available after spinning, so prepare to block.
>              We first acquire a group reference and use acquire MO for that so
>              that we synchronize with the dummy read-modify-write in
>              __condvar_quiesce_and_switch_g1 if we read from that.  In turn,
>-             in this case this will make us see the closed flag on __g_signals
>-             that designates a concurrent attempt to reuse the group's slot.
>+             in this case this will make us see the advancement of __g_signals
>+             to the upcoming new g1_start that occurs with a concurrent
>+             attempt to reuse the group's slot.
>              We use acquire MO for the __g_signals check to make the
>              __g1_start check work (see spinning above).
>              Note that the group reference acquisition will not mask the
>@@ -483,15 +483,24 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>              an atomic read-modify-write operation and thus extend the release
>              sequence.  */
>           atomic_fetch_add_acquire (cond->__data.__g_refs + g, 2);
>-          if (((atomic_load_acquire (cond->__data.__g_signals + g) & 1) != 0)
>-              || (seq < (__condvar_load_g1_start_relaxed (cond) >> 1)))
>+          signals = atomic_load_acquire (cond->__data.__g_signals + g);
>+          g1_start = __condvar_load_g1_start_relaxed (cond);
>+          lowseq = (g1_start & 1) == g ? signals : g1_start & ~1U;
>+
>+          if (seq < (g1_start >> 1))
>             {
>-              /* Our group is closed.  Wake up any signalers that might be
>-                 waiting.  */
>+              /* group is closed already, so don't block */
>               __condvar_dec_grefs (cond, g, private);
>               goto done;
>             }
>
>+          if ((int)(signals - lowseq) >= 2)
>+            {
>+              /* a signal showed up or G1/G2 switched after we grabbed the refcount */
>+              __condvar_dec_grefs (cond, g, private);
>+              break;
>+            }
>+
>           // Now block.
>           struct _pthread_cleanup_buffer buffer;
>           struct _condvar_cleanup_buffer cbuffer;
>@@ -502,7 +511,7 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>           __pthread_cleanup_push (&buffer, __condvar_cleanup_waiting, &cbuffer);
>
>           err = __futex_abstimed_wait_cancelable64 (
>-            cond->__data.__g_signals + g, 0, clockid, abstime, private);
>+            cond->__data.__g_signals + g, signals, clockid, abstime, private);
>
>           __pthread_cleanup_pop (&buffer, 0);
>
>@@ -525,6 +534,8 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>           signals = atomic_load_acquire (cond->__data.__g_signals + g);
>         }
>
>+      if (seq < (__condvar_load_g1_start_relaxed (cond) >> 1))
>+        goto done;
>     }
>   /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
>      of __g1_start below (see spinning above for a similar case).  In
>@@ -533,69 +544,6 @@ __pthread_cond_wait_common (pthread_cond_t *cond, pthread_mutex_t *mutex,
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
>                                                 &signals, signals - 2));
>
>-  /* We consumed a signal but we could have consumed from a more recent group
>-     that aliased with ours due to being in the same group slot.  If this
>-     might be the case our group must be closed as visible through
>-     __g1_start.  */
>-  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>-  if (seq < (g1_start >> 1))
>-    {
>-      /* We potentially stole a signal from a more recent group but we do not
>-         know which group we really consumed from.
>-         We do not care about groups older than current G1 because they are
>-         closed; we could have stolen from these, but then we just add a
>-         spurious wake-up for the current groups.
>-         We will never steal a signal from current G2 that was really intended
>-         for G2 because G2 never receives signals (until it becomes G1).  We
>-         could have stolen a signal from G2 that was conservatively added by a
>-         previous waiter that also thought it stole a signal -- but given that
>-         that signal was added unnecessarily, it's not a problem if we steal
>-         it.
>-         Thus, the remaining case is that we could have stolen from the current
>-         G1, where "current" means the __g1_start value we observed.  However,
>-         if the current G1 does not have the same slot index as we do, we did
>-         not steal from it and do not need to undo that.  This is the reason
>-         for putting a bit with G2's index into __g1_start as well.  */
>-      if (((g1_start & 1) ^ 1) == g)
>-        {
>-          /* We have to conservatively undo our potential mistake of stealing
>-             a signal.  We can stop trying to do that when the current G1
>-             changes because other spinning waiters will notice this too and
>-             __condvar_quiesce_and_switch_g1 has checked that there are no
>-             futex waiters anymore before switching G1.
>-             Relaxed MO is fine for the __g1_start load because we need to
>-             merely be able to observe this fact and not have to observe
>-             something else as well.
>-             ??? Would it help to spin for a little while to see whether the
>-             current G1 gets closed?  This might be worthwhile if the group is
>-             small or close to being closed.  */
>-          unsigned int s = atomic_load_relaxed (cond->__data.__g_signals + g);
>-          while (__condvar_load_g1_start_relaxed (cond) == g1_start)
>-            {
>-              /* Try to add a signal.  We don't need to acquire the lock
>-                 because at worst we can cause a spurious wake-up.  If the
>-                 group is in the process of being closed (LSB is true), this
>-                 has an effect similar to us adding a signal.  */
>-              if (((s & 1) != 0)
>-                  || atomic_compare_exchange_weak_relaxed
>-                     (cond->__data.__g_signals + g, &s, s + 2))
>-                {
>-                  /* If we added a signal, we also need to add a wake-up on
>-                     the futex.  We also need to do that if we skipped adding
>-                     a signal because the group is being closed because
>-                     while __condvar_quiesce_and_switch_g1 could have closed
>-                     the group, it might stil be waiting for futex waiters to
>-                     leave (and one of those waiters might be the one we stole
>-                     the signal from, which cause it to block using the
>-                     futex).  */
>-                  futex_wake (cond->__data.__g_signals + g, 1, private);
>-                  break;
>-                }
>-              /* TODO Back off.  */
>-            }
>-        }
>-    }
>-
>  done:
>
>   /* Confirm that we have been woken.  We do that before acquiring the mutex

-- 
You are receiving this mail because:
You are on the CC list for the bug.