From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Tue, 06 Apr 2021 00:37:49 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

Frank Barrus changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |frankbarrus_sw at shaggy dot cc

--- Comment #28 from Frank Barrus ---
I have also recently been chasing this bug and couldn't find anything about it online before, so I ended up having to figure it out and develop a workaround for now. While trying to find a place to post what I had found, my search led to this thread. So in case it's still helpful in providing more details or confirming what others have found, here is what I had already written up and was trying to find a place to post:

Over the past couple of years, working on software that uses pthreads on multi-core and multi-CPU systems, I have been encountering obscure lost wakeups that sometimes left threads completely stuck. I could not pinpoint the source of the problem and could find nothing online that matched what I was seeing, so I implemented watchdogs to "kick" certain threads once a stall was detected, but that approach was very specific to a particular threadpool/taskqueue usage pattern in the code.

After encountering the problem again in a more general usage pattern that did not lend itself to the watchdog approach, I was finally able to make investigating it a priority and recently got to the bottom of it (after previously looking online and finding nothing that matched what I was seeing at the time).

Here is the series of steps that leads to the bug:

- a waiter thread W1 has been running long enough (CPU-bound) to be eligible for involuntary pre-emption by the kernel soon, since its quantum is about to expire.
- thread W1 is signaled immediately before it blocks in pthread_cond_wait; it consumes the signal, but is then pre-empted before it can read g1_start.

- one or more G1/G2 switches occur such that the current index for G2 is back to where it was when W1 was pre-empted, but it is no longer effectively the same group even though it has the same index.

- thread W1 resumes, sees the change in g1_start, and assumes it consumed a signal that was not meant for it. As a result, it conservatively sends a replacement signal to the current G1 group, which turns out to be an unnecessary spurious signal. From the comments in the code, it would appear this is not considered harmful (although it turns out it is).

- the extra signal wakes up a waiter W2 from group G1. However, since this wakeup came from an extra replacement signal that should not have been sent, g_size[G1] was never decremented, so there is now a mismatch between g_size[G1] and the actual number of remaining waiters. (If multiple waiters follow this entire pattern at once on multiple cores, the mismatch can be more than 1.)

- once the actual count of remaining waiters in G1 hits 0, the next signal sent will increment g_signals[G1] and decrement g_size[G1], but will not wake up any threads, and will not yet run the quiesce_and_switch since it thinks there are still G1 waiters to consume signals. (It does not take wrefs into account to check how many waiters could possibly remain in G1.) This becomes a lost signal, which shows up in g_signals[G1] since it can never be consumed. The condvar can remain in this state for any amount of time and can accumulate additional waiters in G2, but it will still lose the very next signal (or the next several signals, if g_size[G1] is greater than 1). At this point it is possible for software using the condvar to be stuck completely if no additional signals are ever sent.

- if additional signals are sent, then once g_size[G1] reaches 0 the next signal will work properly, but the value of g_signals[G1] at that point represents how many signals were lost completely, i.e. how many additional wakeups never occurred. These will never be made up for unless this is observed and fixed before the quiesce_and_switch occurs, since that will then clear the g_signals count. Note that this means it can require more than one signal to properly fix the problem. There need to be enough signals to bring g_size[G1] to 0 just to become "unstuck", but in addition there should be true replacement signals for however many unconsumed signals are in g_signals[G1], in order to deliver those to all waiters who were already in G2 by the time the last (otherwise lost) signal was sent. It could be argued that we already woke that many extra waiters earlier, since those were the ones that caused the bug to arise in the first place, but if those waiters already awoke earlier, they do not count toward the required number of wakes: they observed an earlier state of whatever the mutex protects, and it could have changed before the signals that were lost. So they only count as spurious wakeups, and we still require the expected number of new wakeups, which is g_signals[G1]. To accomplish that we need g_size[G1] signals to get unstuck, plus g_signals[G1] more to actually perform the wakeups of the waiters still in G2.
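To make the counter bookkeeping in the steps above easier to follow, here is a minimal, single-threaded toy model of just the counters involved (g_size[G1], g_signals[G1], and the real number of G1 waiters). It is not the glibc code, and all names in it are invented for illustration; it only reproduces the arithmetic described above, showing how one spurious replacement signal leaves g_size[G1] one too high and causes a later signal to be swallowed.

#include <stdio.h>

/* Toy bookkeeping only: one G1 group, no threads, no futexes.  */
static unsigned g_size_g1;    /* waiters the condvar *thinks* remain in G1 */
static unsigned g_signals_g1; /* posted but unconsumed signals for G1 */
static unsigned real_g1;      /* waiters actually still waiting in G1 */

static void toy_cond_signal (void)
{
  if (g_size_g1 == 0)
    {
      /* The real code would quiesce G1 and switch groups here; what matters
         for the bug is that this path is NOT taken while g_size_g1 is still
         (wrongly) nonzero.  */
      printf ("signal: g_size[G1]==0, would switch G1/G2 here\n");
      return;
    }
  g_signals_g1 += 1;
  g_size_g1 -= 1;
  if (real_g1 > 0)
    {
      /* A real G1 waiter consumes the signal and leaves.  */
      g_signals_g1 -= 1;
      real_g1 -= 1;
      printf ("signal: woke a G1 waiter\n");
    }
  else
    printf ("signal: LOST (no G1 waiter left to consume it)\n");
}

int main (void)
{
  /* A fresh G1 with three waiters; bookkeeping matches reality.  */
  g_size_g1 = real_g1 = 3;
  g_signals_g1 = 0;

  toy_cond_signal ();   /* normal: 2 real waiters left, g_size[G1]==2 */

  /* W1 wrongly concludes it stole its signal and posts a replacement,
     which wakes W2, but nothing decrements g_size_g1: the counters now
     disagree by one.  */
  g_signals_g1 += 1;
  g_signals_g1 -= 1;    /* W2 consumes the replacement and leaves */
  real_g1 -= 1;
  printf ("after spurious replacement: g_size[G1]=%u, real waiters=%u\n",
          g_size_g1, real_g1);

  toy_cond_signal ();   /* wakes the last real G1 waiter */
  toy_cond_signal ();   /* g_size[G1] is still 1, so this signal is lost */

  printf ("end: g_size[G1]=%u, g_signals[G1]=%u, real waiters=%u\n",
          g_size_g1, g_signals_g1, real_g1);
  return 0;
}

Running this prints one "LOST" signal and ends with g_size[G1]=0 but g_signals[G1]=1, which is exactly the stuck state described above: the group is finally able to switch, but one signal was swallowed and the wakeup it should have caused never happens.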
In practice, most multi-threaded software that can get into this situation in the first place also has enough threads sending additional signals, so most of the time the lost wakeup just results in additional latency until the next signal is sent, or in a slight loss of concurrency if a specific number of workers was meant to be awakened. But every once in a while a scenario can be set up where there cannot be another signal until the wakeup occurs, and then the users of the condvar are completely stuck.

As a temporary workaround I have wrapped pthread_cond_wait(), pthread_cond_timedwait(), and pthread_cond_signal() with code that performs a quick post-check on the state of the condvar and determines whether there is a lost signal (by checking g_size and g_signals, calculating the size of G2, and checking wrefs to see how many waiters could possibly be left in G1 after accounting for all the ones in G2). It then sends additional signals as needed to fix the state and replace any lost signals, once that can be determined. (A rough sketch of the fields such a check looks at is included at the end of this comment.) The post-check has to be done in pthread_cond_wait()/pthread_cond_timedwait() as well as pthread_cond_signal(), since there are many cases where wrefs is temporarily higher until a waiter exits, so it cannot be conclusively determined that the condvar is in the lost-wakeup state until then. Once the last G1 waiter exits, wrefs only represents G2 waiters, so (wrefs - remaining_G2_size) tells whether there are any G1 waiters left who could still consume the signals there, and g_size[G1] can then be safely checked for a non-zero value if no G1 waiters remain. The post-check can also quickly bail out in several cases where it is clear that G1 waiters still exist (i.e. it did not detect the bug), such as when g_refs[G1] is non-zero. (All of this is written in terms of just the count bits, with the low-order flag bits already shifted out, to make it easier to describe.)

I'm looking for a proper pthreads fix to use instead, and am also considering how I might go about solving this myself or contributing to the solution if nothing is available soon.
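For reference, here is a rough sketch of the kind of state inspection such a post-check could rely on. It is not the actual workaround code, just a stripped-down diagnostic that reads the relevant fields, assuming the glibc 2.27-era layout of struct __pthread_cond_s from <bits/thread-shared-types.h> (anonymous unions for __wseq/__g1_start; newer glibc wraps these differently) and shifting out the documented flag bits. The reads are non-atomic, so the output is only a racy snapshot, and the final heuristic is deliberately conservative: as explained above, a conclusive check also has to compare wrefs against the remaining G2 size.

#include <pthread.h>
#include <stdio.h>

static void condvar_dump (pthread_cond_t *cond)
{
  unsigned long long wseq = cond->__data.__wseq;
  unsigned long long g1_start = cond->__data.__g1_start;
  unsigned int g2 = wseq & 1;          /* LSB of __wseq is the index of G2 */
  unsigned int g1 = g2 ^ 1;

  unsigned int signals_g1 = cond->__data.__g_signals[g1] >> 1; /* count bits */
  unsigned int refs_g1    = cond->__data.__g_refs[g1] >> 1;    /* count bits */
  unsigned int size_g1    = cond->__data.__g_size[g1];         /* plain count */
  unsigned int wrefs      = cond->__data.__wrefs >> 3;         /* count bits */

  printf ("wseq=%llu g1_start=%llu G1=%u: g_signals[G1]=%u g_refs[G1]=%u "
          "g_size[G1]=%u wrefs=%u\n",
          wseq >> 1, g1_start >> 1, g1, signals_g1, refs_g1, size_g1, wrefs);

  /* Conservative hint at the stuck state described above: unconsumed
     signals and a nonzero g_size[G1] while nobody is blocked on G1's
     futex.  Runnable-but-not-blocked G1 waiters would also show
     refs_g1 == 0, which is exactly why the real post-check has to wait
     until wrefs can be compared against the remaining G2 size.  */
  if (refs_g1 == 0 && size_g1 > 0 && signals_g1 > 0)
    printf ("  possible lost-signal state: g_size[G1] stuck at %u, "
            "%u unconsumed signal(s)\n", size_g1, signals_g1);
}

Calling something like this after a pthread_cond_signal() that appeared to have no effect makes the mismatch between g_size[G1] and the fields that track actual waiters easy to see.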