From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Thu, 15 Apr 2021 14:34:14 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #36 from Frank Barrus ---
(In reply to Frank Barrus from comment #35)
> (In reply to Qin Li from comment #34)
> > Hi Frank, I am the original reporter of this bug. Could you share a
> > sneak peek version of the alternate fix that you mentioned below?
> >
> > > FYI, I'm currently testing a different pthreads fix for this issue
> > > that does it without the suggested "broadcast" solution that some
> > > distros appear to be adopting for now.
> >
> > The reason I am asking is that several months after applying the
> > broadcast fix we started to observe a different hang caused either by
> > pthread_cond_signal/pthread_cond_wait or by the constructs they rely
> > on, e.g. the futex. Originally I feared it was related to or caused by
> > the broadcast fix, but I later realized this might be another issue,
> > since it has also been reported independently in this bug by Arun
> > without the broadcast fix applied:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3
> >
> > The signature of this different issue is this: one thread is blocked
> > "infinitely" in pthread_cond_signal when calling
> > __condvar_quiesce_and_switch_g1:
> >
> > #0 futex_wait
> > #1 futex_wait_simple
> > #2 __condvar_quiesce_and_switch_g1
> > #3 __pthread_cond_signal
> >
> > All the other threads are blocked in pthread_cond_wait waiting for the
> > signal.
> >
> > Another interesting observation is that the "infinitely" blocked thread
> > in pthread_cond_signal can be unblocked if I send a SIGSEGV to the
> > Linux .NET Core process that hit this issue, which has a segfault
> > handler that calls another external executable to take a core dump of
> > the process. I am not exactly sure how much of that special
> > signal-handling logic is needed to get pthread_cond_signal unblocked.
> > It is possible that such a signal causes a spurious wakeup from
> > futex_wait that unblocks __condvar_quiesce_and_switch_g1 and in turn
> > __pthread_cond_signal, but this is pure speculation.
> >
> > It would be nice to know whether the ^^^ hanging pthread_cond_signal
> > signature has also been seen by the community and whether there might
> > be any investigation or fix available.
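
(To make the stack above easier to follow: frames #0-#2 are the signaler
waiting for the reference count of the old G1 group to drain to zero.  A
simplified paraphrase of that loop, condensed from the glibc 2.27-era nptl
condvar code rather than quoted verbatim, looks roughly like this:

   /* __condvar_quiesce_and_switch_g1, wait-for-G1-to-drain part (sketch).
      The waiter count lives in the upper bits of __g_refs[g1]; the LSB is
      a wake-request flag.  */
   unsigned int r =
     atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

   while ((r >> 1) > 0)     /* waiters still referencing the old G1 */
     {
       /* This is the futex_wait in frame #0.  Any POSIX signal delivered
          to the thread also interrupts it, after which the loop re-reads
          the word -- consistent with the observation that sending a
          signal, or attaching and resuming gdb, "unsticks" the blocked
          pthread_cond_signal.  */
       futex_wait_simple (cond->__data.__g_refs + g1, r, private);
       r = atomic_load_relaxed (cond->__data.__g_refs + g1);
     }

The sketch omits the spinning and flag re-checks of the real code; the
relevant atomics are quoted exactly in the reply below.)
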
>
> Hi Qin,
>
> I believe you may be right about the second bug. I have also seen such a
> signature ever since the fix went in to "or" in the 1 bit so that it
> doesn't spin while waiting for g_refs to reach 0:
>
>    r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
>
> Whenever I've encountered threads "stuck" in that futex_wait, they have
> always become unblocked as soon as I resumed from gdb (which makes sense
> if they're just missing a futex wake, since the signal handling unblocks
> them from the futex), so I was never quite sure whether they were truly
> stuck there (which they seemed to be) or whether I was just catching them
> in a transient state. However, since this was always at a time when other
> parts of our system had also become stuck, it seemed quite likely that
> the threads really were stuck in the futex_wait.
>
> I have not seen this happen any more since applying my new fix and
> testing with it, but that is purely anecdotal evidence of fixing the
> secondary issue, since it was quite rare to begin with.
>
> If there's a race in the wait/wake logic itself for the g_refs counting,
> I haven't found it yet, unless there's some very obscure bug in the
> underlying atomics themselves that allows the atomic or and the atomic
> subtract to cross paths somehow, thus missing the wakeup flag, i.e. these
> two atomics:
>
>    r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
>    (in the quiesce_and_switch)
> and:
>    if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
>    (in the gref_decrefs used from the cond_wait)
>
> I think the more likely cause is related to still having an extra waiter
> that hasn't been woken yet and hasn't seen the group being closed. Note
> that when the group is closed, the LSB of g_signals is set, but there is
> no futex wakeup:
>
>    atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
>
> This should be fine if there are no other bugs, since every waiter should
> have been woken already. But since we're already dealing with a mix of
> both current waiters and older ones that were pre-empted first and are
> now resuming, possibly not yet aware that they are in an older group,
> there might be some race there. In particular, if a signal did actually
> get stolen and was not properly detected and replaced, there could be an
> extra waiter still stuck in a futex wait on g_signals. Without an
> explicit wakeup when the group is closed, it won't see the "closed" bit
> until it wakes from the futex, which won't happen in that case if no more
> signals are sent to that group. However, an interrupt from normal signal
> handling (SIG* signals) will break it out of the futex, which is why a
> gdb attach and resume gets it unstuck in that case. It then sees the
> closed group on g_signals and proceeds as if nothing was ever wrong.
>
> Until we figure out the exact cause, I cannot guarantee that my new fix
> also addresses this issue, although, as I said, I have not yet seen it
> occur with the fix. Also, since I eliminated the need for handling stolen
> signals and signal replacement, it might remove the cause of this second
> bug as a side effect (if indeed there truly is a second bug).
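
(For concreteness, the other half of that handshake -- again a simplified
paraphrase of the nptl code rather than verbatim source -- is the
waiter-side reference drop that is supposed to wake the quiescing
signaler, contrasted with the group-close step that sets the closed bit
without any futex wake:

   /* Waiter leaving group g: drop our reference (each waiter holds 2,
      since the LSB of __g_refs[g] is the wake-request flag).  A result of
      3 means the old value was 2 (one last reference) + 1 (flag), i.e. we
      were the last waiter out and the signaler asked to be woken.  */
   if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
     futex_wake (cond->__data.__g_refs + g, INT_MAX, private);

   /* Closing the group, by contrast, only sets the LSB of
      __g_signals[g1]; there is no futex_wake here.  */
   atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);

So a waiter still parked in futex_wait on __g_signals[g1] -- for example
one whose signal was stolen and never replaced -- never sees the closed
bit, never reaches the g_refs decrement above, and the signaler in
__condvar_quiesce_and_switch_g1 keeps waiting for a count that will never
reach zero, until an unrelated signal interrupts one of the futex waits.)
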
>
> Do you happen to still have the condvar state from when you've hit this
> bug? Or can it be reproduced often enough to capture this state? Could
> you instrument your calls to pthread_cond_signal to capture the condvar
> state before you send the signal as well? (and perhaps also in your
> waiters?) I ended up adding circular logging of the condvar state (plus a
> timestamp and TID) before and after every pthread_cond_wait and
> pthread_cond_signal to diagnose the lost wakeup, so I could see how the
> events interleaved, including the often 2ms to 3ms pre-emption time in
> the older waiters that were coming back and leading to the bug. I also
> called rusage() and captured the voluntary and involuntary
> context-switch counts and added those to the condvar event logs, to make
> sure I was always seeing a pre-emptive involuntary context switch when
> the bug occurred. You might not need to go that far, but it would help to
> at least find out the current condvar state for all the threads involved
> when you see the futex_wait get stuck.
>
> I'm working on trying to get my patch available soon for further review
> and testing.
>
> If you have a self-contained test case that you can release and that can
> show this new "second bug" (if it is one) even occasionally, let me know.
> Thanks!

I should clarify what I'm looking for here, in particular from the condvar
state in this case, as well as from knowing the state of all the other
waiters and where they are stuck (or spinning). Assuming this second bug
exists, the signaler seems to be stuck in the futex_wait for g_refs to
become 0, and an interrupt to that futex_wait from a signal handler
"fixes" the problem, which implies the underlying condvar state is
currently correct and there is just a missing/lost futex_wake. However, is
the wakeup to this particular futex_wait the one that fixes it? Or is it
an extra wakeup to a waiter blocked on g_signals that is needed to fix it?
We need to see whether the condvar state shows that the g_refs waiter
already has what it needs to continue and just missed a wakeup, or whether
it is legitimately still waiting for g_refs to reach 0 and is still at a
refcount of 1 or more because another waiter is still blocked on g_signals
and has not been woken to see that the group is now closed (the more
likely cause).

Seeing the condvar state would greatly help in figuring out which case it
is.

--
You are receiving this mail because:
You are on the CC list for the bug.