public inbox for glibc-bugs@sourceware.org
From: "qin.li at thetradedesk dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] New: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 18 Apr 2020 08:04:10 +0000
Message-ID: <bug-25847-131@http.sourceware.org/bugzilla/>

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

            Bug ID: 25847
           Summary: pthread_cond_signal failed to wake up pthread_cond_wait
                    due to a bug in undoing stealing
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: qin.li at thetradedesk dot com
                CC: drepper.fsp at gmail dot com, triegel at redhat dot com
  Target Milestone: ---

Created attachment 12480 --> https://sourceware.org/bugzilla/attachment.cgi?id=12480&action=edit (the C repro mentioned in the description)

Hi, I will try to be concise, though this issue is hard to explain briefly.

The problem is reproducible (with a ~100-line C program that uses only pthreads) on Ubuntu 18.04 with glibc 2.27. I have looked through the recent changes, and it should be applicable to the latest version of glibc as well.

Our application is written in C# and runs on .NET Core on Linux. .NET Core implements a critical-section structure based on pthread_cond: the pthread_cond maintains the waiters for the critical section, and the owner of the critical section calls pthread_cond_signal when leaving, to wake up one waiter that was previously blocked on entering it. Our application hangs regularly, and the culprit of the hang has been traced to one or more threads stuck in pthread_cond_wait (called by the EnterCriticalSection function due to contention).
The evidence that made us believe this is a bug in pthread_cond, rather than in .NET Core or our application, is this signature that we can always confirm from our application's core dump:

  (lldb) p (__pthread_cond_s)0x000055af919331c8
  (__pthread_cond_s) $7 = {
    { __wseq = 5171942476
      __wseq32 = (__low = 876975180, __high = 1) }
    { __g1_start = 5171942452
      __g1_start32 = (__low = 876975156, __high = 1) }
    __g_refs = ([0] = 4, [1] = 0)
    __g_size = ([0] = 0, [1] = 0)
    __g1_orig_size = 40
    __wrefs = 16
    __g_signals = ([0] = 0, [1] = 2)
  }

The above internal state of the pthread_cond is problematic because:

1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at index 0);
2) there is 1 signal in G1, because __g_signals[G1] == 2;
3) there are no futex waiters in G1, because __g_refs[G1] == 0;
4) the last signaler just delivered the last signal to G1 of this pthread_cond, because __g_size[G1] == 0;
5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4.

So the problem is that while there are 2 waiters in G2, the last signal was delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0. Based on lines 77 to 89 of pthread_cond_signal.c, before we increment the last signal on __g_signals[G1] from 0 to 2 (or from some value N to N + 2), __g_size[G1] was 1; __g_size[G1] is then decremented to 0 after the signals are incremented:

  if ((cond->__data.__g_size[g1] != 0)
      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
    {
      /* Add a signal.  Relaxed MO is fine because signaling does not need to
         establish a happens-before relation (see above).  We do not mask the
         release-MO store when initializing a group in
         __condvar_quiesce_and_switch_g1 because we use an atomic
         read-modify-write and thus extend that store's release sequence.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      /* TODO Only set it if there are indeed futex waiters.  */
      do_futex_wake = true;
    }

The critical section uses only pthread_cond_wait, so cancellation is not applicable here, which makes pthread_cond_signal the only place where __g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially delivered a signal to a G1 in which no futex waiter was waiting. The result is a lost signal, while there are still 2 more waiters waiting for a signal in G2.

While there seems to be a lot of speculation in the above reasoning, since the dump only tells us the final state, which is not necessarily the state at the moment the last signaler entered pthread_cond_signal, we do have more supporting evidence.

First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the implementation of .NET Core's critical section and the usage pattern that triggers the problem. Here is the link to the repro: https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e

On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x hyperthreads) that enter and leave the same critical section while doing some dummy computation. In each round, each thread enters and leaves the critical section once, then waits for the rest of the threads to do the same before starting the next round, and this goes on forever. Every 2 seconds, a monitor thread prints the current total number of rounds and the dummy variable used during the calculation. When the same numbers are printed over and over every 2 seconds, the bug has reproduced. It takes < 20 minutes on my machine to repro when running just 1 instance of the repro program. I found it most effective to start 3 instances of the repro simultaneously, which tends to speed up the repro significantly: it can then happen within 1 minute, sometimes within ten seconds. Mileage may vary, especially depending on the number of cores you have.
We have 64-core machines, and I've seen it take 10 hours to repro there.

Second, I believe I found the explanation for why we would ever enter a situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to how signal stealing is handled. Right after a signal is successfully taken by a pthread_cond waiter, we have to check __g1_start to confirm whether the waiter took the signal from the correct G1, the one the waiter belongs to, or from a later G1 that happened to alias on the same slot. See the code snippets below.

Lines 548 to 560 of pthread_cond_wait.c:

  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  In
     particular, if we steal from a more recent group, we will also see a
     more recent __g1_start below.  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

  /* We consumed a signal but we could have consumed from a more recent group
     that aliased with ours due to being in the same group slot.  If this
     might be the case our group must be closed as visible through
     __g1_start.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))

Lines 594 to 600 of pthread_cond_wait.c:

  /* Try to add a signal.  We don't need to acquire the lock because at
     worst we can cause a spurious wake-up.  If the group is in the process
     of being closed (LSB is true), this has an effect similar to us adding
     a signal.  */
  if (((s & 1) != 0)
      || atomic_compare_exchange_weak_relaxed (cond->__data.__g_signals + g,
                                               &s, s + 2))

This is the place where a signal is posted back to G1 without decrementing __g_size[G1] at the same time, which is correct if the waiter is indeed stealing a signal, as __g_size[G1] was already decremented when the signal was originally posted in pthread_cond_signal.
However, the two operations, a) taking a signal and b) checking the current __g1_start, cannot happen atomically. If a thread is unfortunately scheduled out right after this line, having just taken the signal:

  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

and resumes running only a while later, during which time several group rotations happened, the thread can now observe a newer __g1_start at the line below. If the new G1 happens to alias on the same slot, the code incorrectly concludes that the waiter stole a signal from the new G1, when in fact it took the signal from an old G1:

  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);

When we then post the signal back to the new and current G1, we spuriously wake up a futex waiter in that group, causing __g_refs[G1] to become 1 less than __g_size[G1]. And we know __g_size[G1] isn't decremented, because the original signal didn't come from this new G1.

At this point the damage to this G1 is already done, although this is not the final state yet. As the G1 might still have several remaining waiters, waiters from the damaged G1 will still be woken up as new signals come. Only when the last signal is delivered on this G1 do we observe what was shown in the dump above: we posted a signal to a G1 and no futex waiter woke up, because __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains untaken. But in the meantime there are one or more waiters in G2. The signal is lost, when we could indeed have woken up a waiter in G2.

I think the above does give a plausible explanation for the dump, and does suggest there is a bug in handling signal stealing this way.

Third, with both a repro and a theory about why it happens, we tried hacking the code to verify the theory against the repro.
Here is a link to the hack: https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

Basically, if we suspect the current waiter stole a signal, posting a signal back to the current G1 must happen first, because if we are right, it is also possible that a signaler thread is blocked in __condvar_quiesce_and_switch_g1 while acquiring the private lock of the pthread_cond. The next step is to mitigate the case where we are wrong about the stealing, for which I can't think of an easy way to detect. If we can't detect when we are wrong about stealing, the only way I can think of is to dismantle G1 entirely and let every waiter in it wake up spuriously. This broadcast is only possible after we have potentially unblocked a thread that was blocked in __condvar_quiesce_and_switch_g1 holding the pthread_cond private lock.

This is by no means a suggestion for the actual fix, if this bug is confirmed. But when I tried the repro with the "fixed" libpthread.so, what used to repro in 15-20 minutes no longer repros. I have been able to run my repro for 2 days straight without a hang, which gives me good confidence that the theory above holds and is what causes the problem we saw here.

So this is the issue. There are still many details I intentionally left out to keep this concise: details about the repro, our application, .NET Core's critical-section implementation based on pthread_cond, and why we seem to be the only ones hitting this issue. Please let me know if you have any questions and I will be happy to answer. Looking forward to a quick confirmation and a fix; let me know if there is anything I can do to help make it happen faster.

Best Regards,
Qin Li