public inbox for glibc-bugs@sourceware.org
From: "triegel at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 16 Jan 2021 00:21:17 +0000
Message-ID: <bug-25847-131-pk98qHse4t@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-25847-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #25 from Torvald Riegel <triegel at redhat dot com> ---
(In reply to Qin Li from comment #0)
> Created attachment 12480 [details]
> the C repro mentioned in the description
> 
> Hi, I will try to be concise, though this issue is hard to explain briefly.
> The problem is reproducible (with ~100 lines of C code that uses only
> pthread) on Ubuntu 18.04 with glibc 2.27; I have looked through the recent
> changes and it should apply to the latest version of glibc as well.
> 
> Our application is written in C# and runs on .NET Core on Linux, and .NET
> Core implements a critical section structure based on pthread_cond. The
> condition variable maintains the waiters for the critical section, and the
> owner of the critical section calls pthread_cond_signal when leaving, to
> wake up one waiter that had previously blocked on entering it.
> 
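For reference, a critical section built on top of pthread_cond typically has
roughly the following shape.  This is only a minimal sketch with made-up names
(cs_t, cs_enter, cs_leave), not the actual .NET Core code:

  #include <pthread.h>

  typedef struct
  {
    pthread_mutex_t lock;   /* initialize with PTHREAD_MUTEX_INITIALIZER */
    pthread_cond_t cond;    /* initialize with PTHREAD_COND_INITIALIZER  */
    int busy;               /* 1 while some thread owns the section      */
  } cs_t;

  static void cs_enter (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    while (cs->busy)                     /* re-check after every wakeup  */
      pthread_cond_wait (&cs->cond, &cs->lock);
    cs->busy = 1;
    pthread_mutex_unlock (&cs->lock);
  }

  static void cs_leave (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    cs->busy = 0;
    pthread_cond_signal (&cs->cond);     /* wake one blocked waiter      */
    pthread_mutex_unlock (&cs->lock);
  }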
> It happens regularly that our application hangs, and the culprit is traced
> to one or more threads stuck in pthread_cond_wait (called by the
> EnterCriticalSection function due to contention). The evidence that made us
> believe this is a bug in pthread_cond, rather than in .NET Core or our
> application, is this signature that we can always confirm in our
> application's core dump:
> 
> (lldb) p (__pthread_cond_s)0x000055af919331c8
> (__pthread_cond_s) $7 = {
> { __wseq = 5171942476 __wseq32 = (__low = 876975180, __high = 1) }
> { __g1_start = 5171942452 __g1_start32 = (__low = 876975156, __high = 1) }
> __g_refs = ([0] = 4, [1] = 0)
> __g_size = ([0] = 0, [1] = 0)
> __g1_orig_size = 40
> __wrefs = 16
> __g_signals = ([0] = 0, [1] = 2)
> }
> 
> The above internal state of pthread_cond is problematic because:
> 1) current G1 is at slot index 1 (LSB is 0 in wseq, so G2 at index 0)
> 2) there is 1 signal in G1, because __g_signals[G1] == 2
> 3) there are no futex waiters in G1, because __g_refs[G1] == 0
> 4) the last signaler just delivered the last signal to G1 of this
> pthread_cond, because __g_size[G1] == 0.
> 5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4
> 
> So the problem is: while there are 2 waiters in G2, the last signal is
> delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0.
> 
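For readers decoding the dump: most of these fields keep flag bits in their
low-order bits, so the raw values need a small shift.  A rough decoding,
worth double-checking against the comments in glibc's
nptl/pthread_cond_common.c:

  #include <stdio.h>

  int main (void)
  {
    /* Values copied from the dump above.  */
    unsigned long long wseq = 5171942476ULL;
    unsigned long long g1_start = 5171942452ULL;
    unsigned int g2 = (unsigned int) (wseq & 1);   /* 0 -> G2 is slot 0  */
    unsigned int g1 = g2 ^ 1;                      /* 1 -> G1 is slot 1  */
    printf ("G1 slot %u, G2 slot %u\n", g1, g2);
    printf ("waiters since G1 started: %llu\n",
            (wseq >> 1) - (g1_start >> 1));        /* 12 in this dump    */
    /* __g_signals[g1] == 2: one pending signal in G1 (signals count in
       steps of 2; the LSB is the group-closed flag).
       __g_refs[g1] == 0: no futex waiters left in G1.
       __g_refs[g2] == 4: two waiters blocked in G2 (two per waiter).    */
    return 0;
  }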
> Based on lines 77 to 89 of pthread_cond_signal.c, before we add the last
> signal to __g_signals[G1] (incrementing it from 0 to 2, or from some value
> N to N + 2), __g_size[G1] was 1; it is then decremented to 0 after the
> signal has been added.
> 
>   if ((cond->__data.__g_size[g1] != 0)
>       || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
>     {
>       /* Add a signal.  Relaxed MO is fine because signaling does not need to
> 	 establish a happens-before relation (see above).  We do not mask the
> 	 release-MO store when initializing a group in
> 	 __condvar_quiesce_and_switch_g1 because we use an atomic
> 	 read-modify-write and thus extend that store's release sequence.  */
>       atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
>       cond->__data.__g_size[g1]--;
>       /* TODO Only set it if there are indeed futex waiters.  */
>       do_futex_wake = true;
>     }
> 
> The critical section uses only pthread_cond_wait, so cancellation is not
> applicable here, which makes pthread_cond_signal the only place where
> __g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially
> delivered a signal to a G1 in which no futex waiter is waiting on it. The
> result is a lost signal, while there are still 2 more waiters waiting for
> the signal in G2.
> 
> While the above reasoning involves a fair amount of speculation (the dump
> only shows the final state, which isn't necessarily the state at the moment
> the last signaler entered pthread_cond_signal), we do have more supporting
> evidence.
> 
> First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the
> implementation of .NET Core's critical section and the usage pattern of the
> critical section that triggers the issue.
> 
> Here is the link to the repro:
> https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e
> 
> On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x
> hyperthreads) that enter and leave the same critical section while doing
> some dummy computation. In each round, every thread enters and leaves the
> critical section once, then waits for the rest of the threads to do the
> same before starting the next round, and this goes on forever.
> 
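The stress loop in such a test has roughly the following shape per worker
thread.  This is not the code from the gist; the names are invented, and
cs_enter/cs_leave stand for the condvar-based critical section:

  #include <pthread.h>

  void cs_enter (void);                     /* hypothetical               */
  void cs_leave (void);                     /* hypothetical               */

  static pthread_barrier_t round_barrier;   /* initialized for NTHREADS   */
  static volatile unsigned long rounds;
  static volatile double dummy;

  static void *worker (void *arg)
  {
    (void) arg;
    for (;;)
      {
        cs_enter ();
        dummy += 1.0;                       /* dummy computation          */
        cs_leave ();
        /* Every thread goes through the section once per round; the last
           one to arrive advances the counter the monitor thread prints.  */
        if (pthread_barrier_wait (&round_barrier)
            == PTHREAD_BARRIER_SERIAL_THREAD)
          rounds++;
      }
    return NULL;
  }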
> Every 2 seconds, a monitor thread prints the current total number of
> rounds and the dummy variable used in the computation. When the same
> numbers are printed over and over every 2 seconds, the hang has reproduced.
> 
> It takes under 20 minutes on my machine to reproduce when running just one
> instance of the repro program. I have had the most success starting 3
> instances simultaneously, which tends to speed up the repro significantly,
> so that it can happen within a minute, sometimes within ten seconds. Your
> mileage may vary, especially depending on the number of cores: we have
> 64-core machines, and I have seen it take 10 hours to reproduce.
> 
> Second, I believe I found the explanation for how we can ever reach a
> situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to
> how signal stealing is handled.
> 
> Right after a signal is successfully taken by a pthread_cond waiter, we
> have to check __g1_start to confirm whether the waiter took the signal from
> the G1 it belongs to, or from a later G1 that happened to alias onto the
> same slot; see the code snippets below:
> 
> line 548 to 560 from pthread_cond_wait.c:
>   /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
>      of __g1_start below (see spinning above for a similar case).  In
>      particular, if we steal from a more recent group, we will also see a
>      more recent __g1_start below.  */
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
> 						&signals, signals - 2));
> 
>   /* We consumed a signal but we could have consumed from a more recent group
>      that aliased with ours due to being in the same group slot.  If this
>      might be the case our group must be closed as visible through
>      __g1_start.  */
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>   if (seq < (g1_start >> 1))
> 
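(To make the check above concrete: seq is the waiter's own position, i.e. the
value of __wseq >> 1 at the time it started waiting, while g1_start >> 1 is
the position at which the current G1 began.  With the numbers from the dump,
g1_start >> 1 == 5171942452 >> 1 == 2585971226; a waiter whose seq is below
that value started in a group that has since been closed, so the signal it
just consumed may in fact belong to a newer G1 aliased onto the same slot.)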
> line 594 to 560 from pthread_cond_wait.c:
> 
> 	      /* Try to add a signal.  We don't need to acquire the lock
> 		 because at worst we can cause a spurious wake-up.  If the
> 		 group is in the process of being closed (LSB is true), this
> 		 has an effect similar to us adding a signal.  */
> 	      if (((s & 1) != 0)
> 		  || atomic_compare_exchange_weak_relaxed
> 		       (cond->__data.__g_signals + g, &s, s + 2))
> 
> This is the place where a signal is posted back to G1 without __g_size[G1]
> being decremented at the same time, which is correct if the waiter is
> indeed stealing a signal, since the decrement of __g_size[G1] already
> happened when the signal was originally posted in pthread_cond_signal.
> 
> However, the two operations, a) taking a signal and b) checking the
> current __g1_start, cannot happen atomically. If a thread is unluckily
> scheduled out right after this line, having just taken the signal:
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
> 						&signals, signals - 2));
> 
> and resumes running only a while later, during which time several group
> rotations have happened, then the thread will observe a newer g1_start at
> the line below. If the new G1 happens to alias onto the same slot, the code
> incorrectly concludes that the waiter stole a signal from the new G1, when
> in fact it took the signal from an old G1:
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
> 
> When we then post the signal back to the new, current G1, we spuriously
> wake up a futex waiter in that group, causing __g_refs[G1] to become 1 less
> than __g_size[G1]. We know __g_size[G1] is not decremented here, because
> the original signal did not come from this new G1. At this point the damage
> to this G1 is already done, although this is not the final state yet.
> 
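(A compressed timeline of this interleaving, as I understand it; W is the
delayed waiter, and this is a sketch rather than an observed trace:

  W:      in old G1 at slot s, consumes a signal from __g_signals[s]
  W:      is preempted before loading __g1_start
  others: signal/wait cycles rotate the groups; a new G1 ends up in the
          same slot s and __g1_start advances past W's seq
  W:      resumes, sees seq < (g1_start >> 1), and "undoes the steal" by
          adding a signal to __g_signals[s], i.e. to the NEW G1, without
          any matching decrement of __g_size[G1]
  result: one waiter of the new G1 wakes spuriously, and __g_size[G1] is
          now one larger than the waiter count in __g_refs[G1].)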
> Since this G1 might still have several remaining waiters, waiters from the
> damaged G1 will still be woken up as new signals arrive. Once the last
> signal is delivered to this G1, we observe what was shown in the dump
> above: a signal was posted to G1, but no futex waiter woke up, because
> __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains
> untaken. Meanwhile there are one or more waiters in G2. The signal is lost,
> when we could in fact have woken up a waiter in G2.

So the stealing mitigation that adds a signal by just incrementing the
respective slot in __g_signals can effectively mask another, real signal, but
...

> I think the above does give a plausible explanation for the dump and does
> suggest there might be a bug in handling signal stealing this way.
> 
> Third, with both the repro and a theory about why it happened, we hacked
> the code to verify whether the theory is correct, using the repro. Here is
> a link to the hack:
> https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

... why would a broadcast be required instead of just adding one more signal to
fix/replay the masked signal?  (You later said that you found out the patch
above that just closes group 1 isn't sufficient, and then proposed to use the
broadcast.)

I don't have an explanation why the broadcast would be required, but I can
confirm that just using one signal does not work for your test program.  When
it hangs, I get such a state, for example:

__wseq = 10815175, __g1_start = 10815161, __g_refs = {0, 8},
__g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 32, __g_signals = {2, 0}

with the 4 remaining waiters all being in group 2 (i.e., slot 1), and group 1
having one more signal than necessary according to its size (which would be
consistent with one being added unnecessarily).

So maybe adding just one additional signal is insufficient; when I add two
signals, I haven't observed a failure with the reproducer yet.  But why would
it have to be more than one?

> So this is the issue. There are still a lot of details that I intentionally
> left out to keep this concise: details about the repro, our application,
> .NET Core's critical section implementation based on pthread_cond, and why
> we seem to be the only ones hitting this issue.

Are you sure that your reproducer handles spurious wakeups correctly?  I'm
asking because a bug in the reproducer could also explain why a single
additional signal does not work (assuming that a single signal would indeed
be a correct solution): the spuriously woken waiter goes to the back of the
queue, but the additional signal goes in before any waiter, including the
spuriously woken one, is in G2.
The stealing and the additional signals resulting from it are the only
sources of spurious wakeups, given that you don't use cancellation or
timeouts in the reproducer.  A broadcast or more signals could potentially
just make it unlikely that a fault in the reproducer is triggered.
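To be explicit about what correct handling means here: after
pthread_cond_wait returns, the waiter must re-evaluate its own predicate and
go back to waiting if it is not yet allowed to proceed.  A minimal sketch,
with a hypothetical 'ready' flag:

  #include <pthread.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
  static int ready;

  static void wait_until_ready (void)
  {
    pthread_mutex_lock (&m);
    while (!ready)              /* re-check after every (spurious) wakeup */
      pthread_cond_wait (&c, &m);
    pthread_mutex_unlock (&m);
  }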

Thus, if you could point me to docs or a version of your mutex implementation
with comments, this would help me wrap my head around the details of it.

Malte, if you should have some cycles, perhaps your TLA+ model could also help
to find executions in which just adding one signal would not be sufficient?

-- 
You are receiving this mail because:
You are on the CC list for the bug.
