From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Thu, 15 Apr 2021 14:34:14 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #36 from Frank Barrus ---
(In reply to Frank Barrus from comment #35)
> (In reply to Qin Li from comment #34)
> > Hi Frank, I am the original reporter of this bug. Could you share a
> > sneak peek version of the alternate fix that you mentioned below?
> >
> > > FYI, I'm currently testing a different pthreads fix for this issue
> > > that does it without the suggested "broadcast" solution that some
> > > distros appear to be adopting for now.
> >
> > The reason I am asking is that several months after applying the
> > broadcast fix we started to observe a different hang caused either by
> > pthread_cond_signal/pthread_cond_wait or by the constructs they rely
> > on, e.g. the futex. Originally I feared it was related to or caused by
> > the broadcast fix, but I later realized this might be another issue,
> > since it has also been reported independently in this bug by Arun
> > without the broadcast fix applied:
> > https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3
> >
> > The signature of this different issue is this: one thread is blocked
> > "infinitely" in pthread_cond_signal when calling
> > __condvar_quiesce_and_switch_g1:
> >
> > #0 futex_wait
> > #1 futex_wait_simple
> > #2 __condvar_quiesce_and_switch_g1
> > #3 __pthread_cond_signal
> >
> > All the other threads are blocked in pthread_cond_wait waiting for the
> > signal.
> >
> > Another interesting observation is that the "infinitely" blocked thread
> > in pthread_cond_signal can be unblocked if I send a SIGSEGV to the
> > Linux .NET Core process that hit this issue, which has a segfault
> > handler that calls another external executable to take a core dump of
> > the process. I am not exactly sure how much of that special
> > signal-handling logic is needed to get pthread_cond_signal unblocked.
> > It is possible that such a signal causes a spurious wakeup from
> > futex_wait that unblocks __condvar_quiesce_and_switch_g1 and in turn
> > __pthread_cond_signal, but this is pure speculation.
> >
> > It would be nice to know whether the ^^^ hanging pthread_cond_signal
> > signature has also been seen by the community and whether there might
> > be any investigation or fix available.
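
(To make the stack above easier to follow: frames #0-#2 are the signaler
waiting for the reference count of the old G1 group to drain to zero.  A
simplified paraphrase of that loop, condensed from the glibc 2.27-era nptl
condvar code rather than quoted verbatim, looks roughly like this:

   /* __condvar_quiesce_and_switch_g1, wait-for-G1-to-drain part (sketch).
      The waiter count lives in the upper bits of __g_refs[g1]; the LSB is
      a wake-request flag.  */
   unsigned int r =
     atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

   while ((r >> 1) > 0)     /* waiters still referencing the old G1 */
     {
       /* This is the futex_wait in frame #0.  Any POSIX signal delivered
          to the thread also interrupts it, after which the loop re-reads
          the word -- consistent with the observation that sending a
          signal, or attaching and resuming gdb, "unsticks" the blocked
          pthread_cond_signal.  */
       futex_wait_simple (cond->__data.__g_refs + g1, r, private);
       r = atomic_load_relaxed (cond->__data.__g_refs + g1);
     }

The sketch omits the spinning and flag re-checks of the real code; the
relevant atomics are quoted exactly in the reply below.)
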
>
> Hi Qin,
>
> I believe you may be right about the second bug. I have also seen such a
> signature ever since the fix went in to "or" in the 1 bit so that it
> doesn't spin while waiting for g_refs to reach 0:
>
>    r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
>
> Whenever I've encountered threads "stuck" in that futex_wait, they have
> always become unblocked as soon as I resumed from gdb (which makes sense
> if they're just missing a futex wake, since the signal handling unblocks
> them from the futex), so I was never quite sure whether they were truly
> stuck there (which they seemed to be) or whether I was just catching them
> in a transient state. However, since this was always at a time when other
> parts of our system had also become stuck, it seemed quite likely that
> the threads really were stuck in the futex_wait.
>
> I have not seen this happen any more since applying my new fix and
> testing with it, but that is purely anecdotal evidence of fixing the
> secondary issue, since it was quite rare to begin with.
>
> If there's a race in the wait/wake logic itself for the g_refs counting,
> I haven't found it yet, unless there's some very obscure bug in the
> underlying atomics themselves that allows the atomic or and the atomic
> subtract to cross paths somehow, thus missing the wakeup flag, i.e. these
> two atomics:
>
>    r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;
>    (in the quiesce_and_switch)
> and:
>    if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
>    (in the gref_decrefs used from the cond_wait)
>
> I think the more likely cause is related to still having an extra waiter
> that hasn't been woken yet and hasn't seen the group being closed. Note
> that when the group is closed, the LSB of g_signals is set, but there is
> no futex wakeup:
>
>    atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);
>
> This should be fine if there are no other bugs, since every waiter should
> have been woken already. But since we're already dealing with a mix of
> both current waiters and older ones that were pre-empted first and are
> now resuming, possibly not yet aware that they are in an older group,
> there might be some race there. In particular, if a signal did actually
> get stolen and was not properly detected and replaced, there could be an
> extra waiter still stuck in a futex wait on g_signals. Without an
> explicit wakeup when the group is closed, it won't see the "closed" bit
> until it wakes from the futex, which won't happen in that case if no more
> signals are sent to that group. However, an interrupt from normal signal
> handling (SIG* signals) will break it out of the futex, which is why a
> gdb attach and resume gets it unstuck in that case. It then sees the
> closed group on g_signals and proceeds as if nothing was ever wrong.
>
> Until we figure out the exact cause, I cannot guarantee that my new fix
> also addresses this issue, although, as I said, I have not yet seen it
> occur with the fix. Also, since I eliminated the need for handling stolen
> signals and signal replacement, it might remove the cause of this second
> bug as a side effect (if indeed there truly is a second bug).
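
(For concreteness, the other half of that handshake -- again a simplified
paraphrase of the nptl code rather than verbatim source -- is the
waiter-side reference drop that is supposed to wake the quiescing
signaler, contrasted with the group-close step that sets the closed bit
without any futex wake:

   /* Waiter leaving group g: drop our reference (each waiter holds 2,
      since the LSB of __g_refs[g] is the wake-request flag).  A result of
      3 means the old value was 2 (one last reference) + 1 (flag), i.e. we
      were the last waiter out and the signaler asked to be woken.  */
   if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
     futex_wake (cond->__data.__g_refs + g, INT_MAX, private);

   /* Closing the group, by contrast, only sets the LSB of
      __g_signals[g1]; there is no futex_wake here.  */
   atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);

So a waiter still parked in futex_wait on __g_signals[g1] -- for example
one whose signal was stolen and never replaced -- never sees the closed
bit, never reaches the g_refs decrement above, and the signaler in
__condvar_quiesce_and_switch_g1 keeps waiting for a count that will never
reach zero, until an unrelated signal interrupts one of the futex waits.)
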
>
> Do you happen to still have the condvar state from when you've hit this
> bug? Or can it be reproduced often enough to capture this state? Could
> you instrument your calls to pthread_cond_signal to capture the condvar
> state before you send the signal as well? (and perhaps also in your
> waiters?) I ended up adding circular logging of the condvar state (plus a
> timestamp and TID) before and after every pthread_cond_wait and
> pthread_cond_signal to diagnose the lost wakeup, so I could see how the
> events interleaved, including the often 2ms to 3ms pre-emption time in
> the older waiters that were coming back and leading to the bug. I also
> called rusage() and captured the voluntary and involuntary
> context-switch counts and added those to the condvar event logs, to make
> sure I was always seeing a pre-emptive involuntary context switch when
> the bug occurred. You might not need to go that far, but it would help to
> at least find out the current condvar state for all the threads involved
> when you see the futex_wait get stuck.
>
> I'm working on trying to get my patch available soon for further review
> and testing.
>
> If you have a self-contained test case that you can release and that can
> show this new "second bug" (if it is one) even occasionally, let me know.
> Thanks!

I should clarify what I'm looking for here, in particular from the condvar
state in this case, as well as from knowing the state of all the other
waiters and where they are stuck (or spinning). Assuming this second bug
exists, the signaler seems to be stuck in the futex_wait for g_refs to
become 0, and an interrupt to that futex_wait from a signal handler
"fixes" the problem, which implies the underlying condvar state is
currently correct and there is just a missing/lost futex_wake. However, is
the wakeup to this particular futex_wait the one that fixes it? Or is it
an extra wakeup to a waiter blocked on g_signals that is needed to fix it?
We need to see whether the condvar state shows that the g_refs waiter
already has what it needs to continue and just missed a wakeup, or whether
it is legitimately still waiting for g_refs to reach 0 and is still at a
refcount of 1 or more because another waiter is still blocked on g_signals
and has not been woken to see that the group is now closed (the more
likely cause).

Seeing the condvar state would greatly help in figuring out which case it
is.

--
You are receiving this mail because:
You are on the CC list for the bug.