public inbox for glibc-bugs@sourceware.org
From: "frankbarrus_sw at shaggy dot cc" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Tue, 06 Apr 2021 00:37:49 +0000	[thread overview]
Message-ID: <bug-25847-131-hljtgmCS0w@http.sourceware.org/bugzilla/> (raw)
In-Reply-To: <bug-25847-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

Frank Barrus <frankbarrus_sw at shaggy dot cc> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |frankbarrus_sw at shaggy dot cc

--- Comment #28 from Frank Barrus <frankbarrus_sw at shaggy dot cc> ---
I have also recently been chasing this bug and couldn't find anything about it
online, so I ended up having to figure it out and develop a workaround for now.
While trying to find a place to post what I had found, my search led to this
thread.

So in case it's still helpful in providing more details or confirming what
others have found, here's what I had already written up and was trying to find
a place to post:


During the course of my work over the past couple of years, on software using
pthreads on multi-core and multi-CPU systems, I’ve been encountering obscure
lost wakeups that sometimes left threads completely stuck.  I could not
pinpoint the source of the problem and could find nothing online that matched
what I was seeing, so I implemented watchdogs to “kick” certain threads out of
the stuck state once it was detected.  That workaround was very specific to a
certain threadpool/taskqueue usage pattern in the code.

After encountering the problem again in a more general usage pattern that
didn’t lend itself well to the watchdog approach, I was finally able to make
investigating it a priority and just recently got to the bottom of it.

Here is the series of steps that leads to the bug:

- a waiter thread W1 has been running long enough (CPU-bound) to be eligible
for involuntary pre-emption by the kernel soon, since its quantum is about to
expire.  W1 is signaled immediately before it blocks; it consumes the signal,
but is then pre-empted before it can read g1_start.

- one or more G1/G2 switches occur such that the current index for G2 is back
to where it was when W1 was pre-empted, but this is effectively no longer the
same group, even though it has the same index.

- thread W1 sees the change in g1_start and assumes it consumed a signal that
was not meant for it.  As a result, it conservatively sends a replacement
signal to the current G1 group, which turns out to be an unnecessary spurious
signal.  From the comments in the code, it would appear this is not considered
harmful (although it turns out it is).

- the extra signal wakes up a waiter W2 from group G1.  However, since this
wakeup came from an extra replacement signal that should not have been sent,
g_size[G1] was never decremented, so now there is a mismatch between g_size[G1]
and the actual number of remaining waiters.  (If multiple waiters (W1) follow
this entire pattern at once on multiple cores, the mismatch can be greater than
1.)  A simplified model of this accounting is sketched after this list.

- once the actual count of remaining waiters in G1 hits 0, the next signal sent
will increment g_signals[G1] and decrement g_size[G1], but will not wake up any
thread, and will not yet run the quiesce_and_switch since it thinks there are
still G1 waiters to consume signals.  (This does not take wrefs into account to
check how many waiters could possibly remain in G1.)  This becomes a lost
signal, which will show up in g_signals[G1] since it can never be consumed.
The condvar can now stay in this state for any amount of time and can
accumulate additional waiters in G2, but it will still lose the very next
signal (or the next several signals, if g_size[G1] is greater than 1).  At this
point, it is possible for software using the condvar to be stuck completely if
no additional signals are ever sent.

- if additional signals are sent, then once g_size[G1] reaches 0, the next
signal will work properly, but the value of g_signals[G1] at that point
represents how many signals were lost completely, i.e. how many additional
wakeups never occurred.  These will never directly be made up for unless the
state is observed and fixed before the quiesce_and_switch occurs, since that
will then clear the g_signals count.  Note that this means it can require more
than one signal to properly fix the problem.  There need to be enough signals
to bring g_size[G1] to 0 just to become “unstuck”, but in addition there should
be true replacement signals for however many unconsumed signals are in
g_signals[G1], in order to deliver those to all waiters who were already in G2
by the time the last (otherwise lost) signal was sent.  It could be argued that
we already woke that many extra waiters earlier, since those were the ones that
caused the bug to arise in the first place.  But if those waiters already awoke
earlier, they do not count toward the required number of wakeups, since they
observed an earlier state of whatever the mutex protects, and it could have
changed before the signals that were lost.  So they only count as spurious
wakeups, and we still require the expected number of new wakeups, which is
g_signals[G1]; to accomplish that we need g_size[G1] signals to become unstuck,
plus g_signals[G1] more to actually perform the wakeups of the waiters still in
G2.
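
To make the accounting concrete, here is a small self-contained model of the
counters involved.  It is only a sketch of the bookkeeping described in the
steps above, not the glibc implementation: the names waiters_g1, g_size_g1 and
g_signals_g1 are simplified stand-ins for the real __g_size/__g_signals fields
(counting by 1, with the flag bits already stripped), and group switching is
not modeled.

  /* Simplified model of the G1 accounting error described above.
     NOT the glibc code; field names are stand-ins for __g_size,
     __g_signals and the real count of waiters blocked in G1.  */

  #include <stdio.h>

  struct model {
    int waiters_g1;    /* real number of waiters still blocked in G1 */
    int g_size_g1;     /* what the condvar believes that number is   */
    int g_signals_g1;  /* signals posted to G1 but not yet consumed  */
  };

  /* pthread_cond_signal, reduced to the accounting that matters here.  */
  static void model_signal (struct model *m)
  {
    if (m->g_size_g1 > 0)
      {
        m->g_signals_g1++;     /* post a signal to G1 */
        m->g_size_g1--;        /* ...assuming a G1 waiter will take it */
        if (m->waiters_g1 > 0)
          {
            m->waiters_g1--;   /* a real waiter wakes and consumes it */
            m->g_signals_g1--;
          }
        /* else: nobody is left to consume it -> lost signal */
      }
    else
      /* g_size reached 0: the group is switched and a G2 waiter is
         woken; not modeled further here.  */
      printf ("group switch: signal delivered normally\n");
  }

  int main (void)
  {
    /* Start with 2 real waiters in G1, correctly mirrored in g_size.  */
    struct model m = { .waiters_g1 = 2, .g_size_g1 = 2, .g_signals_g1 = 0 };

    /* The "replacement signal" step from the list above: W1 wrongly
       concludes it stole a signal and posts a replacement.  W2 wakes,
       but g_size is NOT decremented because pthread_cond_signal was
       never called for this wakeup.  */
    m.waiters_g1--;            /* W2 wakes spuriously; g_size_g1 stays 2 */

    model_signal (&m);         /* wakes the last real G1 waiter */
    model_signal (&m);         /* g_size is still 1, so this is posted to
                                  G1, but nobody is left to consume it */

    printf ("waiters=%d g_size=%d unconsumed g_signals=%d (lost)\n",
            m.waiters_g1, m.g_size_g1, m.g_signals_g1);

    model_signal (&m);         /* g_size has finally reached 0, so this one
                                  is delivered normally, but the signal lost
                                  above is never made up for */
    return 0;
  }

Running this prints one unconsumed signal left behind with no waiters
remaining, which is exactly the stuck state described above.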


In practice, most multi-threaded software that can get into this situation in
the first place also has enough threads to send additional signals, so most of
the time the lost wakeup just results in an additional latency until the next
signal is sent, or in a slight loss of concurrency if a specific number of
workers was meant to be awakened.  But every once in a while, a scenario can be
set up where there can’t be another signal until the wakeup occurs, and then
the users of the condvar are completely stuck.


As a temporary workaround I have wrapped pthread_cond_wait(),
pthread_cond_timedwait(), and pthread_cond_signal() with code that performs a
quick post-check on the state of the condvar and determines whether there’s a
lost signal (by checking g_size and g_signals, calculating the size of G2, and
checking wrefs to see how many waiters could possibly be left in G1 after
accounting for all the ones in G2).  It then sends additional signals as
needed to replace any lost signals, once the lost-wakeup state can be
conclusively determined.  This post-check has to be done in
pthread_cond_wait()/pthread_cond_timedwait() as well as pthread_cond_signal(),
since there are many cases where wrefs is temporarily higher until a waiter
exits, so the condvar cannot be conclusively shown to be in the lost-wakeup
state until then.  Once the last G1 waiter exits, wrefs only represents G2
waiters, so (wrefs - remaining_G2_size) tells whether any G1 waiters are left
who could potentially consume the signals there, and g_size[G1] can then be
safely checked for a non-zero value if no G1 waiters remain.  The post-check
can also bail out quickly in several cases where it is clear that G1 waiters
still exist (i.e. it did not detect the bug), such as when g_refs[G1] is
non-zero.  (All of this is written in terms of just the count bits, with the
LSB already shifted out, to make it easier to describe.)
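
For what it's worth, the decision logic of that post-check can be sketched
roughly as follows.  This is my own illustration, not the actual wrapper code:
the struct and function names are made up, and it assumes the caller has
already extracted consistent counts from wrefs, g_refs, g_size, g_signals and
the G2 size (flag bits shifted out, as described above); taking that snapshot
safely against concurrent updates is the hard part and is not shown.

  /* Hypothetical snapshot of the condvar counters, flag bits removed.  */
  struct condvar_snapshot {
    unsigned g1_refs;      /* waiters currently blocked in G1 (g_refs[G1]) */
    unsigned g1_size;      /* condvar's idea of remaining G1 waiters       */
    unsigned g1_signals;   /* unconsumed signals posted to G1              */
    unsigned g2_size;      /* waiters that have joined the current G2      */
    unsigned total_wrefs;  /* all waiters still referencing the condvar    */
  };

  /* Returns how many extra signals the wrapper should send: g1_size to get
     unstuck, plus g1_signals to replace the wakeups that were lost.
     Returns 0 if the condvar is not, or cannot yet be proven to be, in the
     lost-wakeup state.  */
  static unsigned
  lost_signal_fixup_count (const struct condvar_snapshot *s)
  {
    /* Bail out quickly while G1 waiters demonstrably still exist.  */
    if (s->g1_refs != 0)
      return 0;

    /* wrefs can be temporarily high until an exiting waiter drops its
       reference, so only when every remaining reference is accounted for
       by G2 can we be sure no G1 waiter is left to consume the pending
       signals.  */
    if (s->total_wrefs > s->g2_size)
      return 0;

    /* No G1 waiters remain, yet the condvar still believes some do and/or
       still holds signals that nobody can ever consume.  */
    return s->g1_size + s->g1_signals;
  }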

I’m looking for a proper pthreads fix to use instead, and am also considering
how I might go about solving this myself or contributing to the solution if
nothing is available soon.

-- 
You are receiving this mail because:
You are on the CC list for the bug.


Thread overview: 93+ messages
2020-04-18  8:04 [Bug nptl/25847] New: " qin.li at thetradedesk dot com
2020-04-18  8:07 ` [Bug nptl/25847] " qin.li at thetradedesk dot com
2020-04-20 17:52 ` qin.li at thetradedesk dot com
2020-05-05 11:50 ` arun11299 at gmail dot com
2020-05-05 11:52 ` arun11299 at gmail dot com
2020-05-14  8:52 ` dingxiangfei2009 at protonmail dot ch
2020-05-19 21:15 ` carlos at redhat dot com
2020-10-08 19:07 ` michael.bacarella at gmail dot com
2020-10-10  8:11 ` flo at geekplace dot eu
2020-10-12 20:53 ` anssi.hannula at iki dot fi
2020-10-13 16:50 ` guillaume at morinfr dot org
2020-10-16 20:23 ` kirichenkoga at gmail dot com
2020-10-18 22:55 ` michael.bacarella at gmail dot com
2020-10-20 20:12 ` kirichenkoga at gmail dot com
2020-10-20 20:17 ` kirichenkoga at gmail dot com
2020-11-01 13:58 ` malteskarupke at fastmail dot fm
2020-11-01 15:40 ` michael.bacarella at gmail dot com
2020-11-15  2:45 ` malteskarupke at fastmail dot fm
2020-11-23 10:37 ` ydroneaud at opteya dot com
2020-11-23 16:36 ` mattst88 at gmail dot com
2020-11-23 16:46 ` adhemerval.zanella at linaro dot org
2020-12-07 15:41 ` balint at balintreczey dot hu
2020-12-09  3:22 ` malteskarupke at fastmail dot fm
2020-12-24 20:02 ` triegel at redhat dot com
2020-12-25 16:19 ` triegel at redhat dot com
2021-01-07  1:09 ` manojgupta at google dot com
2021-01-07  7:31 ` balint at balintreczey dot hu
2021-01-07 13:54 ` malteskarupke at fastmail dot fm
2021-01-07 20:43 ` triegel at redhat dot com
2021-01-07 23:31 ` triegel at redhat dot com
2021-01-08  3:45 ` malteskarupke at fastmail dot fm
2021-01-16  0:21 ` triegel at redhat dot com
2021-01-16  0:23 ` triegel at redhat dot com
2021-01-30  0:59 ` michael.bacarella at gmail dot com
2021-02-07 17:38 ` slav.isv at gmail dot com
2021-03-09 15:18 ` bugdal at aerifal dot cx
2021-04-06  0:37 ` frankbarrus_sw at shaggy dot cc [this message]
2021-04-06  1:17 ` frankbarrus_sw at shaggy dot cc
2021-04-06 14:32 ` frankbarrus_sw at shaggy dot cc
2021-04-06 16:49 ` frankbarrus_sw at shaggy dot cc
2021-04-11 12:12 ` carlos at redhat dot com
2021-04-11 12:13 ` carlos at redhat dot com
2021-04-13 12:21 ` frankbarrus_sw at shaggy dot cc
2021-04-14 16:57 ` qin.li at thetradedesk dot com
2021-04-15 14:13 ` frankbarrus_sw at shaggy dot cc
2021-04-15 14:34 ` frankbarrus_sw at shaggy dot cc
2021-04-30 17:41 ` venkiram_be at yahoo dot co.in
2021-05-04 22:48 ` frankbarrus_sw at shaggy dot cc
2021-05-04 22:51 ` frankbarrus_sw at shaggy dot cc
2021-05-04 22:58 ` frankbarrus_sw at shaggy dot cc
2021-05-13 13:25 ` tuliom at ascii dot art.br
2021-06-14 13:31 ` willireamangel at gmail dot com
2021-07-09  1:41 ` uwydoc at gmail dot com
2021-07-15  0:55 ` benh at kernel dot crashing.org
2021-08-16  9:41 ` evgeny+sourceware at loop54 dot com
2021-09-13 16:50 ` dushistov at mail dot ru
2021-09-22 19:03 ` evgeny+sourceware at loop54 dot com
2021-09-22 19:07 ` balint at balintreczey dot hu
2021-09-24  0:18 ` tuliom at ascii dot art.br
2021-09-24  0:58 ` michael.bacarella at gmail dot com
2021-09-29 11:50 ` fweimer at redhat dot com
2021-10-21 15:42 ` fweimer at redhat dot com
2021-10-30 22:17 ` sam at gentoo dot org
2021-11-25 14:49 ` arekm at maven dot pl
2022-09-18  5:38 ` malteskarupke at fastmail dot fm
2022-09-18 20:06 ` carlos at redhat dot com
2022-09-19  3:38 ` malteskarupke at fastmail dot fm
2022-09-24  0:03 ` bugzilla at dimebar dot com
2022-09-24 10:15 ` ismail at i10z dot com
2022-09-26 14:28 ` ehagberg at janestreet dot com
2022-09-26 14:32 ` ehagberg at janestreet dot com
2022-10-06 21:58 ` malteskarupke at fastmail dot fm
2022-10-07 12:01 ` crrodriguez at opensuse dot org
2022-10-15 19:57 ` malteskarupke at fastmail dot fm
2022-11-07 18:23 ` sourceware-bugzilla at djr61 dot uk
2023-01-28 14:57 ` malteskarupke at fastmail dot fm
2023-05-01 12:52 ` carlos at redhat dot com
2023-05-02 12:57 ` carlos at redhat dot com
2023-05-03  3:04 ` malteskarupke at fastmail dot fm
2023-05-04  4:57 ` malteskarupke at fastmail dot fm
2023-05-04 12:24 ` carlos at redhat dot com
2023-05-05 23:44 ` carlos at redhat dot com
2023-05-10 21:29 ` frankbarrus_sw at shaggy dot cc
2023-05-10 21:39 ` frankbarrus_sw at shaggy dot cc
2023-05-11  0:22 ` frankbarrus_sw at shaggy dot cc
2023-05-11 12:01 ` carlos at redhat dot com
2023-05-11 12:05 ` carlos at redhat dot com
2023-05-13  4:10 ` malteskarupke at fastmail dot fm
2023-08-24 20:24 ` jwakely.gcc at gmail dot com
2023-09-26 12:33 ` malteskarupke at fastmail dot fm
2023-09-26 12:38 ` fweimer at redhat dot com
2024-01-05  7:31 ` malteskarupke at fastmail dot fm
2024-02-17  9:44 ` github at kalvdans dot no-ip.org
