public inbox for glibc-bugs@sourceware.org
From: "qin.li at thetradedesk dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] New: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 18 Apr 2020 08:04:10 +0000
Message-ID: <bug-25847-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

            Bug ID: 25847
           Summary: pthread_cond_signal failed to wake up pthread_cond_wait
                    due to a bug in undoing stealing
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: qin.li at thetradedesk dot com
                CC: drepper.fsp at gmail dot com, triegel at redhat dot com
  Target Milestone: ---

Created attachment 12480 --> https://sourceware.org/bugzilla/attachment.cgi?id=12480&action=edit (the C repro mentioned in the description)

Hi, I will try to be concise, though this issue is hard to explain briefly.

The problem is reproducible (with a ~100-line C program that uses only pthreads) on Ubuntu 18.04 with glibc 2.27. I have looked through the recent changes, and it should be applicable to the latest version of glibc as well.

Our application is written in C# and runs on .NET Core on Linux. .NET Core implements a critical-section structure based on pthread_cond: the pthread_cond maintains the waiters for the critical section, and the owner of the critical section calls pthread_cond_signal when leaving, to wake up one waiter that was previously blocked on entering it. Our application hangs regularly, and the culprit of the hang has been traced to one or more threads stuck in pthread_cond_wait (called by the EnterCriticalSection function due to contention).
The evidence that made us believe this is a bug in pthread_cond, rather than in .NET Core or our application, is this signature that we can always confirm from our application's core dump:

  (lldb) p (__pthread_cond_s)0x000055af919331c8
  (__pthread_cond_s) $7 = {
    { __wseq = 5171942476
      __wseq32 = (__low = 876975180, __high = 1) }
    { __g1_start = 5171942452
      __g1_start32 = (__low = 876975156, __high = 1) }
    __g_refs = ([0] = 4, [1] = 0)
    __g_size = ([0] = 0, [1] = 0)
    __g1_orig_size = 40
    __wrefs = 16
    __g_signals = ([0] = 0, [1] = 2)
  }

The above internal state of the pthread_cond is problematic because:

1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at index 0);
2) there is 1 signal in G1, because __g_signals[G1] == 2;
3) there are no futex waiters in G1, because __g_refs[G1] == 0;
4) the last signaler just delivered the last signal to G1 of this pthread_cond, because __g_size[G1] == 0;
5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4.

So the problem is that while there are 2 waiters in G2, the last signal was delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0. Based on lines 77 to 89 of pthread_cond_signal.c, before we increment the last signal on __g_signals[G1] from 0 to 2 (or from some value N to N + 2), __g_size[G1] was 1; __g_size[G1] is then decremented to 0 after the signals are incremented:

  if ((cond->__data.__g_size[g1] != 0)
      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
    {
      /* Add a signal.  Relaxed MO is fine because signaling does not need to
         establish a happens-before relation (see above).  We do not mask the
         release-MO store when initializing a group in
         __condvar_quiesce_and_switch_g1 because we use an atomic
         read-modify-write and thus extend that store's release sequence.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      /* TODO Only set it if there are indeed futex waiters.  */
      do_futex_wake = true;
    }

The critical section uses only pthread_cond_wait, so cancellation is not applicable here, which makes pthread_cond_signal the only place where __g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially delivered a signal to a G1 in which no futex waiter was waiting. The result is a lost signal, while there are still 2 more waiters waiting for a signal in G2.

While there seems to be a lot of speculation in the above reasoning, since the dump only tells us the final state, which is not necessarily the state at the moment the last signaler entered pthread_cond_signal, we do have more supporting evidence.

First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the implementation of .NET Core's critical section and the usage pattern that triggers the problem. Here is the link to the repro: https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e

On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x hyperthreads) that enter and leave the same critical section while doing some dummy computation. In each round, each thread enters and leaves the critical section once, then waits for the rest of the threads to do the same before starting the next round, and this goes on forever. Every 2 seconds, a monitor thread prints the current total number of rounds and the dummy variable used during the calculation. When the same numbers are printed over and over every 2 seconds, the bug has reproduced. It takes < 20 minutes on my machine to repro when running just 1 instance of the repro program. I found it most effective to start 3 instances of the repro simultaneously, which tends to speed up the repro significantly: it can then happen within 1 minute, sometimes within ten seconds. Mileage may vary, especially depending on the number of cores you have.
We have 64-core machines, and I've seen it take 10 hours to repro there.

Second, I believe I found the explanation for why we would ever enter a situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to how signal stealing is handled. Right after a signal is successfully taken by a pthread_cond waiter, we have to check __g1_start to confirm whether the waiter took the signal from the correct G1, the one the waiter belongs to, or from a later G1 that happened to alias on the same slot. See the code snippets below.

Lines 548 to 560 of pthread_cond_wait.c:

  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  In
     particular, if we steal from a more recent group, we will also see a
     more recent __g1_start below.  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

  /* We consumed a signal but we could have consumed from a more recent group
     that aliased with ours due to being in the same group slot.  If this
     might be the case our group must be closed as visible through
     __g1_start.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))

Lines 594 to 600 of pthread_cond_wait.c:

  /* Try to add a signal.  We don't need to acquire the lock because at
     worst we can cause a spurious wake-up.  If the group is in the process
     of being closed (LSB is true), this has an effect similar to us adding
     a signal.  */
  if (((s & 1) != 0)
      || atomic_compare_exchange_weak_relaxed (cond->__data.__g_signals + g,
                                               &s, s + 2))

This is the place where a signal is posted back to G1 without decrementing __g_size[G1] at the same time, which is correct if the waiter is indeed stealing a signal, as __g_size[G1] was already decremented when the signal was originally posted in pthread_cond_signal.
However, the two operations, a) taking a signal and b) checking the current __g1_start, cannot happen atomically. If a thread is unfortunately scheduled out right after this line, having just taken the signal:

  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

and resumes running only a while later, during which time several group rotations happened, the thread can now observe a newer __g1_start at the line below. If the new G1 happens to alias on the same slot, the code incorrectly concludes that the waiter stole a signal from the new G1, when in fact it took the signal from an old G1:

  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);

When we then post the signal back to the new and current G1, we spuriously wake up a futex waiter in that group, causing __g_refs[G1] to become 1 less than __g_size[G1]. And we know __g_size[G1] isn't decremented, because the original signal didn't come from this new G1.

At this point the damage to this G1 is already done, although this is not the final state yet. As the G1 might still have several remaining waiters, waiters from the damaged G1 will still be woken up as new signals come. Only when the last signal is delivered on this G1 do we observe what was shown in the dump above: we posted a signal to a G1 and no futex waiter woke up, because __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains untaken. But in the meantime there are one or more waiters in G2. The signal is lost, when we could indeed have woken up a waiter in G2.

I think the above does give a plausible explanation for the dump, and does suggest there is a bug in handling signal stealing this way.

Third, with both a repro and a theory about why it happens, we tried hacking the code to verify the theory against the repro.
Here is a link to the hack: https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

Basically, if we suspect the current waiter stole a signal, posting a signal back to the current G1 must happen first, because if we are right, it is also possible that a signaler thread is blocked in __condvar_quiesce_and_switch_g1 while acquiring the private lock of the pthread_cond. The next step is to mitigate the case where we are wrong about the stealing, for which I can't think of an easy way to detect. If we can't detect when we are wrong about stealing, the only way I can think of is to dismantle G1 entirely and let every waiter in it wake up spuriously. This broadcast is only possible after we have potentially unblocked a thread that was blocked in __condvar_quiesce_and_switch_g1 holding the pthread_cond private lock.

This is by no means a suggestion for the actual fix, if this bug is confirmed. But when I tried the repro with the "fixed" libpthread.so, what used to repro in 15-20 minutes no longer repros. I have been able to run my repro for 2 days straight without a hang, which gives me good confidence that the theory above holds and is what causes the problem we saw here.

So this is the issue. There are still many details I intentionally left out to keep this concise: details about the repro, our application, .NET Core's critical-section implementation based on pthread_cond, and why we seem to be the only ones hitting this issue. Please let me know if you have any questions and I will be happy to answer. Looking forward to a quick confirmation and a fix; let me know if there is anything I can do to help make it happen faster.

Best Regards,
Qin Li