From: "qin.li at thetradedesk dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] New: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 18 Apr 2020 08:04:10 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

            Bug ID: 25847
           Summary: pthread_cond_signal failed to wake up
                    pthread_cond_wait due to a bug in undoing stealing
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: qin.li at thetradedesk dot com
                CC: drepper.fsp at gmail dot com, triegel at redhat dot com
  Target Milestone: ---

Created attachment 12480
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12480&action=edit
the C repro mentioned in the description

Hi, I will try to be concise, although this issue is hard to explain briefly.
The problem is reproducible (with ~100 lines of C code that uses only
pthreads) on Ubuntu 18.04 with glibc 2.27, and from looking through the recent
changes it should be applicable to the latest version of glibc as well.

Our application is written in C# and runs on .NET Core on Linux. .NET Core
implements a critical section structure based on pthread_cond: the condition
variable is used to park the waiters of the critical section, and the owner
calls pthread_cond_signal when leaving, to wake up one waiter that previously
blocked while entering.

Our application hangs regularly, and the culprit of the hang is traced to one
or more threads stuck in pthread_cond_wait (called by the EnterCriticalSection
function due to contention).
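For context, the critical section pattern described above is conceptually
similar to the following minimal sketch. This is only my simplified
illustration of the usage pattern, not .NET Core's actual implementation; the
cs_* names and fields are made up:

  #include <pthread.h>

  /* Hypothetical critical section built on pthread_cond, for illustration
     only.  One pthread_cond_signal per cs_leave is expected to wake one
     thread blocked in cs_enter.  */
  typedef struct
  {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int owned;      /* 1 while some thread holds the critical section.  */
    int waiters;    /* Number of threads blocked in cs_enter.  */
  } cs_t;

  #define CS_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

  static void
  cs_enter (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    while (cs->owned)
      {
        cs->waiters++;
        /* This is where our threads are observed to be stuck, even though
           the owner signaled when it left.  */
        pthread_cond_wait (&cs->cond, &cs->lock);
        cs->waiters--;
      }
    cs->owned = 1;
    pthread_mutex_unlock (&cs->lock);
  }

  static void
  cs_leave (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    cs->owned = 0;
    if (cs->waiters > 0)
      /* Wake exactly one waiter; if this signal is lost, that waiter
         hangs forever.  */
      pthread_cond_signal (&cs->cond);
    pthread_mutex_unlock (&cs->lock);
  }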
The evidence that made us believe this is a bug in pthread_cond, rather than
in .NET Core or in our application, is this signature that we can always
confirm in our application's core dump:

  (lldb) p (__pthread_cond_s)0x000055af919331c8
  (__pthread_cond_s) $7 = {
     {
       __wseq = 5171942476
       __wseq32 = (__low = 876975180, __high = 1)
     }
     {
       __g1_start = 5171942452
       __g1_start32 = (__low = 876975156, __high = 1)
     }
     __g_refs = ([0] = 4, [1] = 0)
     __g_size = ([0] = 0, [1] = 0)
     __g1_orig_size = 40
     __wrefs = 16
     __g_signals = ([0] = 0, [1] = 2)
  }

The above internal state of the pthread_cond is problematic because:

1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at
   index 0)
2) there is 1 signal in G1, because __g_signals[G1] == 2
3) there are no futex waiters in G1, because __g_refs[G1] == 0
4) the last signaler just delivered the last signal to G1 of this
   pthread_cond, because __g_size[G1] == 0
5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4

So the problem is that while there are 2 waiters in G2, the last signal was
delivered to a G1 that had __g_size[G1] == 1 but __g_refs[G1] == 0. Based on
lines 77 to 89 of pthread_cond_signal.c, before the last signal increments
__g_signals[G1] from 0 to 2 (or from some value N to N + 2), __g_size[G1] was
1, and __g_size[G1] is decremented to 0 after the signal is added:

  if ((cond->__data.__g_size[g1] != 0)
      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
    {
      /* Add a signal.  Relaxed MO is fine because signaling does not need to
         establish a happens-before relation (see above).  We do not mask the
         release-MO store when initializing a group in
         __condvar_quiesce_and_switch_g1 because we use an atomic
         read-modify-write and thus extend that store's release sequence.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      /* TODO Only set it if there are indeed futex waiters.  */
      do_futex_wake = true;
    }

The critical section uses only pthread_cond_wait, so cancellation is not
applicable here, which makes pthread_cond_signal the only place where
__g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially
delivered a signal to a G1 in which no futex waiter is waiting. The result is
a lost signal, while there are still 2 more waiters waiting for a signal in
G2.

There is admittedly some speculation in the above reasoning, since the dump
only shows the final state, which is not necessarily the state at the moment
the last signaler entered pthread_cond_signal. However, we do have more
supporting evidence.

First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the
implementation of .NET Core's critical section and the usage pattern of the
critical section that triggers the issue. Here is the link to the repro:
https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e

On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x
hyperthreads) that enter and leave the same critical section while doing some
dummy computation. In each round, every thread tries to enter and leave the
critical section once, then waits for the rest of the threads to do the same
before starting the next round, and this goes on forever. Every 2 seconds, a
monitor thread prints the current total number of rounds and the dummy
variable used during the calculation; a stripped-down sketch of this
structure is shown below.
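The sketch reuses the hypothetical cs_enter/cs_leave helpers from the earlier
sketch, and the main thread plays the role of the monitor; the actual repro
in the gist differs in its details:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NTHREADS 12   /* 1.5 x hyperthreads on a 4-core/8-thread box.  */

  static cs_t cs = CS_INITIALIZER;
  static pthread_mutex_t round_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t round_cond = PTHREAD_COND_INITIALIZER;
  static int arrived;
  static unsigned long long rounds;
  static double dummy;

  static void *
  worker (void *arg)
  {
    (void) arg;
    for (;;)
      {
        /* Enter and leave the contended critical section once per round.  */
        cs_enter (&cs);
        for (int i = 0; i < 1000; i++)
          dummy += i * 0.5;
        cs_leave (&cs);

        /* Wait for every other thread to finish this round before starting
           the next one.  */
        pthread_mutex_lock (&round_lock);
        if (++arrived == NTHREADS)
          {
            arrived = 0;
            rounds++;
            pthread_cond_broadcast (&round_cond);
          }
        else
          {
            unsigned long long r = rounds;
            while (r == rounds)
              pthread_cond_wait (&round_cond, &round_lock);
          }
        pthread_mutex_unlock (&round_lock);
      }
    return NULL;
  }

  int
  main (void)
  {
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&tids[i], NULL, worker, NULL);
    for (;;)
      {
        sleep (2);
        /* Unsynchronized read, good enough for monitoring: if these numbers
           stop changing, the hang has reproduced.  */
        printf ("rounds=%llu dummy=%f\n", rounds, dummy);
      }
  }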
When the same numbers are printed every 2 seconds over and over, the issue
has reproduced. It takes < 20 minutes to reproduce on my machine when running
just one instance of the repro program. I found it most effective to start 3
instances of the repro simultaneously, which tends to speed up the repro
significantly so it can happen within 1 minute, sometimes within ten seconds.
Mileage may vary, especially depending on the number of cores you have: we
have 64-core machines, and there I have seen it take 10 hours to reproduce.

Second, I believe I found the explanation for why we would ever get into a
situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to
how signal stealing is handled.

Right after a signal is successfully taken by a pthread_cond waiter, we have
to check __g1_start to confirm whether the waiter took the signal from the G1
it actually belongs to, or whether it took a signal from a later G1 that
happened to alias onto the same slot; see the code snippets below.

Lines 548 to 560 of pthread_cond_wait.c:

  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  In
     particular, if we steal from a more recent group, we will also see a
     more recent __g1_start below.  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

  /* We consumed a signal but we could have consumed from a more recent group
     that aliased with ours due to being in the same group slot.  If this
     might be the case our group must be closed as visible through
     __g1_start.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))

Lines 594 onward of pthread_cond_wait.c:

  /* Try to add a signal.  We don't need to acquire the lock because at
     worst we can cause a spurious wake-up.  If the group is in the process
     of being closed (LSB is true), this has an effect similar to us adding
     a signal.  */
  if (((s & 1) != 0)
      || atomic_compare_exchange_weak_relaxed (cond->__data.__g_signals + g,
                                               &s, s + 2))

This is the place where a signal is posted back to G1 without __g_size[G1]
being decremented at the same time, which is correct if the waiter is indeed
stealing a signal, because the decrement of __g_size[G1] already happened
when the signal was originally posted in pthread_cond_signal.

However, the two operations, a) taking a signal and b) checking the current
__g1_start, do not happen atomically. If a thread is unfortunately scheduled
out right after this line, i.e. right after it took the signal:

  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

and only resumes running a while later, during which time several group
rotations have happened, the thread can then observe a newer g1_start at the
line below. If the new G1 happens to alias onto the same slot, the code
incorrectly concludes that the waiter is stealing a signal from the new G1,
when in fact it took the signal from an old G1:

  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);

When we then post the signal back to the new, current G1, we spuriously wake
up a futex waiter in that group, causing __g_refs[G1] to become 1 less than
__g_size[G1]; __g_size[G1] is not decremented, because the original signal
did not come from this new G1. At this point the damage to this G1 is already
done, although this is not yet the final state. The interleaving is
summarized below.
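To make the interleaving easier to follow, here is my reconstruction of the
sequence of events (the numbering is mine, not from the code):

  1. Waiter W, which belongs to the old G1 in slot g, successfully takes a
     signal: the CAS decrements __g_signals[g] by 2.
  2. W is scheduled out before it loads __g1_start.
  3. Enough signal/wait traffic happens that G1 is rotated several times,
     and a new G1 ends up on the same slot g.
  4. W resumes, loads __g1_start, sees seq < (g1_start >> 1) and concludes
     it stole a signal from the new G1, although the signal it consumed
     belonged to the old G1.
  5. W posts a signal back to the new G1 (__g_signals[g] += 2), but
     __g_size[g] of the new G1 was never decremented for this signal.
  6. A waiter of the new G1 consumes that extra signal and wakes up
     spuriously, leaving __g_refs[G1] one less than __g_size[G1], which is
     exactly the imbalance seen in the dump once the group drains.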
As this G1 might still have several remaining waiters, waiters from the
damaged G1 will keep being woken up as new signals arrive. Only when the last
signal is delivered to this G1 do we observe what is shown in the dump above:
we posted a signal to a G1, but no futex waiter woke up, because __g_refs[G1]
reached 0 before __g_size[G1] did, and the signal remains untaken. Meanwhile
there are one or more waiters in G2. The signal is lost, when we could indeed
have woken up a waiter in G2.

I think the above gives a plausible explanation for the dump and does suggest
there is a bug in the way signal stealing is undone.

Third, with both the repro and a theory about why it happens, we tried
hacking the code to verify whether the theory is correct against the repro.
Here is a link to the hack:
https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

Basically, if we suspect the current waiter stole a signal, posting a signal
back to the current G1 must still happen first, because if we are right about
the stealing it is also possible that a signaler thread is blocked in
quiesce_and_switch() while acquiring the private lock of the pthread_cond.
What comes next is mitigating the case where we are wrong about the stealing,
and I cannot think of an easy way to detect that. If we cannot detect when we
are wrong about the stealing, the only way I could think of is to dismantle
G1 entirely and let every waiter in G1 wake up spuriously. This broadcast is
only possible after we have potentially unblocked a thread blocked in
quiesce_and_switch via the pthread_cond's private lock.

This is by no means a suggestion that this could be the actual fix, if the
bug is confirmed here. But when I ran the repro with such a "fixed"
libpthread.so, what used to reproduce in 15-20 minutes no longer reproduces.
I have been able to run the repro for 2 days straight without a hang, which
gives me good confidence that the theory above holds and is what is causing
the problem we see.

So this is the issue. There are still many details that I intentionally left
out to keep this concise: details about the repro, about our application and
.NET Core's critical section implementation based on pthread_cond, and about
why we seem to be the only ones hitting this issue. Please let me know if you
have any questions and I will be happy to answer. I am looking forward to a
quick confirmation and a fix for this; let me know if there is anything I can
do to help make it happen faster.

Best Regards,
Qin Li

-- 
You are receiving this mail because:
You are on the CC list for the bug.