From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Thu, 15 Apr 2021 14:13:55 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #35 from Frank Barrus ---
(In reply to Qin Li from comment #34)
> Hi Frank, I am the original reporter of this bug. Could you share a
> sneak peek version of your alternate fix that you mentioned below?
>
> > FYI, I'm currently testing a different pthreads fix for this issue
> > that does it without the suggested "broadcast" solution that some
> > distros appear to be adopting for now.
>
> The reason I am asking is that several months after applying the
> broadcast fix we started to observe a different hang caused by either
> pthread_cond_signal/pthread_cond_wait, or the constructs they rely on,
> e.g. futex. Originally I feared it was related to or caused by the
> broadcast fix, but later realized this might be another issue, as it
> has also been independently reported in this bug by Arun without the
> broadcast fix applied:
> https://sourceware.org/bugzilla/show_bug.cgi?id=25847#c3
>
> The signature of this different issue is this: 1 thread blocked
> "infinitely" in pthread_cond_signal when calling
> __condvar_quiesce_and_switch_g1:
>
> #0 futex_wait
> #1 futex_wait_simple
> #2 __condvar_quiesce_and_switch_g1
> #3 __pthread_cond_signal
>
> And all the other threads are blocked in pthread_cond_wait waiting for
> the signal.
>
> Another interesting observation is that the "infinitely" blocked thread
> in pthread_cond_signal can be unblocked if I send a SIGSEGV to the
> Linux .NET Core process that hit this issue, which has a segfault
> handler that calls another external executable to take a core dump of
> the process. I am not exactly sure how much of that special signal
> handling logic is important to get pthread_cond_signal unblocked. It is
> possible that such a signal causes a spurious wakeup from futex_wait
> that actually unblocks __condvar_quiesce_and_switch_g1 and later
> __pthread_cond_signal, but this is pure speculation.
>
> It would be nice to know if the ^^^ hanging pthread_cond_signal
> signature has also been discovered by the community and whether there
> might be any investigation/fix available.
Hi Qin,

I believe you may be right about the second bug. I have also seen such a
signature ever since the fix went in to "or" in the 1 bit so it doesn't
spin while waiting for g_refs to reach 0:

  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

Whenever I've encountered threads "stuck" in that futex_wait, they have
always become unblocked as soon as I resumed from gdb (which makes sense
if they're just missing a futex wake, since the signal handling unblocks
them from the futex), so I was never quite sure if they were truly stuck
there (which they seemed to be) or whether I was just catching them in a
transient state. However, since this was always at a time when other
parts of our system had also become stuck, it seemed quite likely that
the threads really were stuck in the futex_wait.

I have not seen this happen anymore since applying my new fix and
testing with it, but that's purely anecdotal evidence of fixing the
secondary issue, since it was quite rare to begin with.

If there's a race in the wait/wake logic itself for the grefs counting,
I haven't found it yet, unless there's some very obscure bug in the
underlying atomics themselves allowing the atomic "or" and atomic
subtract to cross paths somehow, thus missing the wakeup flag, i.e.
these two atomics:

  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1) | 1;

(in the quiesce_and_switch), and:

  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)

(in __condvar_dec_grefs, used from the cond_wait)

I think the more likely cause is related to still having an extra waiter
that didn't get woken yet and didn't see the group being closed. Note
that when the group is closed, the LSB of g_signals is set, but there's
no futex wakeup:

  atomic_fetch_or_relaxed (cond->__data.__g_signals + g1, 1);

This should be fine if there are no other bugs, since every waiter
should have been woken already. But since we're already dealing with a
mix of both current waiters and older ones that were pre-empted first
and which are now resuming, and possibly not yet aware that they are in
an older group, there might be some race there. In particular, if a
signal did actually get stolen and not properly detected and replaced,
there could be an extra waiter stuck in a futex wait on g_signals still.
Without an explicit wakeup when closing the group, it won't see the
"closed" bit until it wakes from the futex, which won't happen in that
case if there are no more signals sent to that group. However, an
interrupt from normal signal handling (SIG* signals) will break it out
of the futex, which is why a gdb attach and resume would get it unstuck
in that case. And then it will see the closed group on g_signals and
proceed as if nothing was ever wrong.

Until we figure out the exact cause, I cannot guarantee that my new fix
also addresses this issue, although as I said, I have not yet seen it
occur with the fix. Also, since I eliminated the need for handling
stolen signals and signal replacement, it might remove the cause of this
second bug as a side effect (if indeed there truly is a second bug).

Do you happen to still have the condvar state from when you hit this
bug? Or can it be reproduced often enough to capture this state? Could
you instrument your calls to pthread_cond_signal to capture the condvar
state before you send the signal (and perhaps also in your waiters)?
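Even fairly minimal instrumentation would capture enough to correlate
the events. Something like the following is the shape I have in mind
(just a sketch, not code from my patch: the cv_wait/cv_signal/
cv_log_event names are made up here, and the raw memcpy of the condvar
is only meant for decoding offline against the struct __pthread_cond_s
layout of the glibc you're actually running):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

enum { CV_WAIT_IN, CV_WAIT_OUT, CV_SIGNAL_IN, CV_SIGNAL_OUT };

struct cv_record
{
  uint64_t t_ns;                                /* CLOCK_MONOTONIC, in ns */
  pid_t tid;                                    /* kernel thread id */
  int ev;                                       /* CV_* event code */
  unsigned char cond[sizeof (pthread_cond_t)];  /* raw condvar snapshot */
};

#define CV_LOG_LEN 4096                         /* power of two */
static struct cv_record cv_log[CV_LOG_LEN];
static atomic_uint cv_log_pos;

static void
cv_log_event (pthread_cond_t *cond, int ev)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  unsigned int i = atomic_fetch_add_explicit (&cv_log_pos, 1,
                                              memory_order_relaxed);
  struct cv_record *r = &cv_log[i & (CV_LOG_LEN - 1)];
  r->t_ns = (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
  r->tid = (pid_t) syscall (SYS_gettid);
  r->ev = ev;
  /* Best-effort raw copy (racy on the signal side, but good enough for
     diagnostics); decode offline against struct __pthread_cond_s
     (__wseq, __g1_start, __g_refs[], __g_signals[], __g_size[], ...).  */
  memcpy (r->cond, cond, sizeof r->cond);
}

/* Call these instead of pthread_cond_wait/pthread_cond_signal directly.  */
int
cv_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  cv_log_event (cond, CV_WAIT_IN);
  int rc = pthread_cond_wait (cond, mutex);
  cv_log_event (cond, CV_WAIT_OUT);
  return rc;
}

int
cv_signal (pthread_cond_t *cond)
{
  cv_log_event (cond, CV_SIGNAL_IN);
  int rc = pthread_cond_signal (cond);
  cv_log_event (cond, CV_SIGNAL_OUT);
  return rc;
}

Dumping the ring buffer (from gdb, or from a signal handler once the
hang is detected) should be enough to see how the waits and signals
interleaved and what the condvar words looked like around the last
signal before the stuck futex_wait.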
I ended up adding circular logging of the condvar state (and a timestamp
and TID) before and after every pthread_cond_wait and pthread_cond_signal
to diagnose the lost wakeup, so I could see how the events interleaved,
including the often 2ms to 3ms pre-emption time in the older waiters
that were coming back and leading to the bug. I also called getrusage()
and captured the voluntary and involuntary context-switch counts and
added those to the condvar event logs to make sure I was always seeing a
pre-emptive involuntary context switch when the bug occurred (a minimal
version of that helper is sketched below). You might not need to go that
far, but it would help to at least find out the current condvar state
for all the threads involved when you see the futex_wait get stuck.

I'm working on getting my patch available soon for further review and
testing.

If you have a self-contained test case that you can release that can
even occasionally show this new "second bug" (if it is one), let me
know.

Thanks!
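For reference, the per-thread context-switch counts I mentioned above
came from getrusage() with RUSAGE_THREAD (Linux-specific, needs
_GNU_SOURCE); roughly this, with the helper name made up for
illustration:

#define _GNU_SOURCE
#include <sys/resource.h>

/* Voluntary vs. involuntary context switches for the calling thread.
   A jump in the involuntary count across a cond_wait is what showed
   the 2ms-3ms pre-emption window in my logs.  */
static void
cv_csw_counts (long *voluntary, long *involuntary)
{
  struct rusage ru;
  if (getrusage (RUSAGE_THREAD, &ru) == 0)
    {
      *voluntary = ru.ru_nvcsw;
      *involuntary = ru.ru_nivcsw;
    }
  else
    *voluntary = *involuntary = -1;
}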