From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Thu, 15 Apr 2021 14:13:55 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #35 from Frank Barrus ---
(In reply to Qin Li from comment #34)
> Hi Frank, I am the original reporter of this bug. Could you share a
> sneak peek version of your alternate fix that you mentioned below?
>
> > FYI, I'm currently testing a different pthreads fix for this issue
> > that does it without the suggested "broadcast" solution that some
> > distros appear to be adopting for now.
>
> The reason I am asking is that several months after applying the
> broadcast fix we started to observe a different hang caused by either
> pthread_cond_signal/pthread_cond_wait, or the constructs they rely on,
> e.g. futex. Originally I feared it was related to or caused by the
> broadcast fix, but later realized this might be another issue, as it
> has also been independently reported in this bug by Arun without the
> broadcast fix applied:
> https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3
>
> The signature of this different issue is this: 1 thread blocked
> "infinitely" in pthread_cond_signal when calling
> __condvar_quiesce_and_switch_g1:
>
> #0 futex_wait
> #1 futex_wait_simple
> #2 __condvar_quiesce_and_switch_g1
> #3 __pthread_cond_signal
>
> And all the other threads are blocked in pthread_cond_wait waiting for
> the signal.
>
> Another interesting observation is that the "infinitely" blocked thread
> in pthread_cond_signal can be unblocked if I send a SIGSEGV to the
> Linux .NET Core process that hit this issue, which has a segfault
> handler that calls another external executable to take a core dump of
> the process. I am not exactly sure how much of that special signal
> handling logic is important to get pthread_cond_signal unblocked. It is
> possible that such a signal causes a spurious wakeup from futex_wait
> that actually unblocks __condvar_quiesce_and_switch_g1 and later
> __pthread_cond_signal, but this is pure speculation.
>
> It would be nice to know if the ^^^ hanging pthread_cond_signal
> signature has also been discovered by the community and whether there
> might be any investigation/fix available.
Hi Qin,

I believe you may be right about the second bug. I have also seen such a
signature ever since the fix went in to "or" in the 1 bit so it doesn't
spin while waiting for g_refs to reach 0:

  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

Whenever I've encountered threads "stuck" in that futex_wait, they have
always become unblocked as soon as I resumed from gdb (which makes sense
if they're just missing a futex wake, since the signal handling unblocks
them from the futex), so I was never quite sure if they were truly stuck
there (which they seemed to be) or whether I was just catching them in a
transient state. However, since this was always at a time when other
parts of our system had also become stuck, it seemed quite likely that
the threads really were stuck in the futex_wait.

I have not seen this happen anymore since applying my new fix and
testing with it, but that's purely anecdotal evidence of fixing the
secondary issue, since it was quite rare to begin with.

If there's a race in the wait/wake logic itself for the grefs counting,
I haven't found it yet, unless there's some very obscure bug in the
underlying atomics themselves allowing the atomic "or" and atomic
subtract to cross paths somehow, thus missing the wakeup flag, i.e.
these two atomics:

  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

(in the quiesce_and_switch), and:

  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)

(in __condvar_dec_grefs, used from the cond_wait)

I think the more likely cause is related to still having an extra waiter
that didn't get woken yet and didn't see the group being closed. Note
that when the group is closed, the LSB of g_signals is set, but there's
no futex wakeup:

  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);

This should be fine if there are no other bugs, since every waiter
should have been woken already. But since we're already dealing with a
mix of both current waiters and older ones that were pre-empted first
and which are now resuming, and possibly not yet aware that they are in
an older group, there might be some race there. In particular, if a
signal did actually get stolen and not properly detected and replaced,
there could be an extra waiter stuck in a futex wait on g_signals still.
Without an explicit wakeup when closing the group, it won't see the
"closed" bit until it wakes from the futex, which won't happen in that
case if there are no more signals sent to that group. However, an
interrupt from normal signal handling (SIG* signals) will break it out
of the futex, which is why a gdb attach and resume would get it unstuck
in that case. And then it will see the closed group on g_signals and
proceed as if nothing was ever wrong.

Until we figure out the exact cause, I cannot guarantee that my new fix
also addresses this issue, although as I said, I have not yet seen it
occur with the fix. Also, since I eliminated the need for handling
stolen signals and signal replacement, it might remove the cause of this
second bug as a side effect (if indeed there truly is a second bug).

Do you happen to still have the condvar state from when you hit this
bug? Or can it be reproduced often enough to capture this state? Could
you instrument your calls to pthread_cond_signal to capture the condvar
state before you send the signal (and perhaps also in your waiters)?
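Even fairly minimal instrumentation would capture enough to correlate
the events. Something like the following is the shape I have in mind
(just a sketch, not code from my patch: the cv_wait/cv_signal/
cv_log_event names are made up here, and the raw memcpy of the condvar
is only meant for decoding offline against the struct __pthread_cond_s
layout of the glibc you're actually running):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

enum { CV_WAIT_IN, CV_WAIT_OUT, CV_SIGNAL_IN, CV_SIGNAL_OUT };

struct cv_record
{
  uint64_t t_ns;                                /* CLOCK_MONOTONIC, in ns */
  pid_t tid;                                    /* kernel thread id */
  int ev;                                       /* CV_* event code */
  unsigned char cond[sizeof (pthread_cond_t)];  /* raw condvar snapshot */
};

#define CV_LOG_LEN 4096                         /* power of two */
static struct cv_record cv_log[CV_LOG_LEN];
static atomic_uint cv_log_pos;

static void
cv_log_event (pthread_cond_t *cond, int ev)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  unsigned int i = atomic_fetch_add_explicit (&cv_log_pos, 1,
                                              memory_order_relaxed);
  struct cv_record *r = &cv_log[i & (CV_LOG_LEN - 1)];
  r->t_ns = (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
  r->tid = (pid_t) syscall (SYS_gettid);
  r->ev = ev;
  /* Best-effort raw copy (racy on the signal side, but good enough for
     diagnostics); decode offline against struct __pthread_cond_s
     (__wseq, __g1_start, __g_refs[], __g_signals[], __g_size[], ...).  */
  memcpy (r->cond, cond, sizeof r->cond);
}

/* Call these instead of pthread_cond_wait/pthread_cond_signal directly.  */
int
cv_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  cv_log_event (cond, CV_WAIT_IN);
  int rc = pthread_cond_wait (cond, mutex);
  cv_log_event (cond, CV_WAIT_OUT);
  return rc;
}

int
cv_signal (pthread_cond_t *cond)
{
  cv_log_event (cond, CV_SIGNAL_IN);
  int rc = pthread_cond_signal (cond);
  cv_log_event (cond, CV_SIGNAL_OUT);
  return rc;
}

Dumping the ring buffer (from gdb, or from a signal handler once the
hang is detected) should be enough to see how the waits and signals
interleaved and what the condvar words looked like around the last
signal before the stuck futex_wait.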
I ended up adding circular logging of the condvar state (and a timestamp
and TID) before and after every pthread_cond_wait and pthread_cond_signal
to diagnose the lost wakeup, so I could see how the events interleaved,
including the often 2ms to 3ms pre-emption time in the older waiters
that were coming back and leading to the bug. I also called getrusage()
and captured the voluntary and involuntary context-switch counts and
added those to the condvar event logs to make sure I was always seeing a
pre-emptive involuntary context switch when the bug occurred (a minimal
version of that helper is sketched below). You might not need to go that
far, but it would help to at least find out the current condvar state
for all the threads involved when you see the futex_wait get stuck.

I'm working on getting my patch available soon for further review and
testing.

If you have a self-contained test case that you can release that can
even occasionally show this new "second bug" (if it is one), let me
know.

Thanks!
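For reference, the per-thread context-switch counts I mentioned above
came from getrusage() with RUSAGE_THREAD (Linux-specific, needs
_GNU_SOURCE); roughly this, with the helper name made up for
illustration:

#define _GNU_SOURCE
#include <sys/resource.h>

/* Voluntary vs. involuntary context switches for the calling thread.
   A jump in the involuntary count across a cond_wait is what showed
   the 2ms-3ms pre-emption window in my logs.  */
static void
cv_csw_counts (long *voluntary, long *involuntary)
{
  struct rusage ru;
  if (getrusage (RUSAGE_THREAD, &ru) == 0)
    {
      *voluntary = ru.ru_nvcsw;
      *involuntary = ru.ru_nivcsw;
    }
  else
    *voluntary = *involuntary = -1;
}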