From: "triegel at redhat dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 16 Jan 2021 00:21:17 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #25 from Torvald Riegel ---
(In reply to Qin Li from comment #0)
> Created attachment 12480 [details]
> the c repro mentioned in the description
>
> Hi, I will try to be concise, although this issue is hard to explain briefly.
> The problem is reproducible (with ~100 lines of C code that uses only
> pthread) on Ubuntu 18.04 (glibc 2.27), and from looking through the recent
> changes it should be applicable to the latest version of glibc as well.
>
> Our application is written in C# and runs on .NET Core on Linux, and .NET
> Core implements a critical section structure based on pthread_cond. The
> pthread_cond is used for maintaining the waiters of the critical section,
> and the owner of the critical section uses pthread_cond_signal when leaving,
> to wake up one waiter that was previously blocked on entering it.
>
> It happens regularly that our application hangs, and the culprit of the
> hang is traced to one or more threads stuck in pthread_cond_wait (called
> by the EnterCriticalSection function due to contention). The evidence that
> made us believe this is a bug in pthread_cond, rather than in .NET Core or
> our application, is this signature that we can always confirm from our
> application core dump:
>
> (lldb) p (__pthread_cond_s)0x000055af919331c8
> (__pthread_cond_s) $7 = {
>   { __wseq = 5171942476, __wseq32 = (__low = 876975180, __high = 1) }
>   { __g1_start = 5171942452, __g1_start32 = (__low = 876975156, __high = 1) }
>   __g_refs = ([0] = 4, [1] = 0)
>   __g_size = ([0] = 0, [1] = 0)
>   __g1_orig_size = 40
>   __wrefs = 16
>   __g_signals = ([0] = 0, [1] = 2)
> }
>
> The above internal state of the pthread_cond is problematic because:
> 1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at
> index 0)
> 2) there is 1 signal in G1, because __g_signals[G1] == 2
> 3) there are no futex waiters in G1, because __g_refs[G1] == 0
> 4) the last signaler just delivered the last signal to G1 of this
> pthread_cond, because __g_size[G1] == 0
> 5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4
>
> So the problem is: while there are 2 waiters in G2, the last signal was
> delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0.
>
> Based on lines 77 to 89 of pthread_cond_signal.c, before we incremented
> __g_signals[G1] for the last signal from 0 to 2 (or some value N to N + 2),
> __g_size[G1] was 1; __g_size[G1] was then decremented to 0 after the signal
> had been added:
>
>   if ((cond->__data.__g_size[g1] != 0)
>       || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
>     {
>       /* Add a signal.  Relaxed MO is fine because signaling does not need to
>          establish a happens-before relation (see above).  We do not mask the
>          release-MO store when initializing a group in
>          __condvar_quiesce_and_switch_g1 because we use an atomic
>          read-modify-write and thus extend that store's release sequence.  */
>       atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
>       cond->__data.__g_size[g1]--;
>       /* TODO Only set it if there are indeed futex waiters.  */
>       do_futex_wake = true;
>     }
>
> The critical section uses only pthread_cond_wait, so cancellation is not
> applicable here, which makes pthread_cond_signal the only place where
> __g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially
> delivered a signal to a G1 where no futex waiter is waiting on it. The
> result is a lost signal, while there are still 2 more waiters waiting for
> a signal in G2.
>
> While there seems to be a lot of speculation in the above reasoning, as the
> dump only tells us the final state, which is not necessarily the same state
> as when the last signaler entered pthread_cond_signal, we do have more
> supporting evidence.
>
> First, I wrote a C repro that uses only pthread_cond/mutex, by mimicking
> the implementation of .NET Core's critical section and the usage pattern
> of the critical section that causes the repro.
>
> Here is the link to the repro:
> https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e
>
> On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x
> hyperthreads) that enter and leave the same critical section, doing some
> dummy computation. Each round, each thread tries to enter and leave the
> critical section once, then waits for the rest of the threads to finish
> entering and leaving the critical section once as well, before starting
> the next round, and it goes on forever.
>
> Every 2 seconds, a monitor thread prints the current value of the total
> number of rounds and the dummy variable used during the calculation. When
> the same numbers are printed over and over every 2 seconds, it has
> repro'ed.
>
> It takes < 20 minutes on my machine to repro when I run just 1 instance of
> the repro program. I found it most effective to start 3 instances of the
> repro simultaneously, which tends to speed up the repro significantly so
> that it can happen within 1 minute, sometimes within ten seconds. The
> mileage may vary, especially based on the number of cores you have. We have
> 64-core machines, and there I've seen it take 10 hours to repro.
>
> Second, I believe I found the explanation for why we would ever enter a
> situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to
> how we handle signal stealing.
>
> Right after a signal is successfully taken by a pthread_cond waiter, we
> have to check __g1_start to confirm whether the waiter took the signal
> from the correct G1 it belongs to, or took a signal from a later G1 that
> happened to alias on the same slot; see the code snippets below.
>
> Lines 548 to 560 from pthread_cond_wait.c:
>
>   /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
>      of __g1_start below (see spinning above for a similar case).
In
>      particular, if we steal from a more recent group, we will also see a
>      more recent __g1_start below.  */
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
>                                                 &signals, signals - 2));
>
>   /* We consumed a signal but we could have consumed from a more recent group
>      that aliased with ours due to being in the same group slot.  If this
>      might be the case our group must be closed as visible through
>      __g1_start.  */
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>   if (seq < (g1_start >> 1))
>
> Lines 594 onwards from pthread_cond_wait.c:
>
>   /* Try to add a signal.  We don't need to acquire the lock
>      because at worst we can cause a spurious wake-up.  If the
>      group is in the process of being closed (LSB is true), this
>      has an effect similar to us adding a signal.  */
>   if (((s & 1) != 0)
>       || atomic_compare_exchange_weak_relaxed
>          (cond->__data.__g_signals + g, &s, s + 2))
>
> This is the place where a signal is posted back to G1, but __g_size[G1]
> isn't decremented at the same time, which is correct if the waiter is
> indeed stealing a signal, as the decrement of __g_size[G1] already
> happened when the signal was originally posted in pthread_cond_signal.
>
> However, the two operations, a) taking a signal and b) checking the
> current __g1_start, can't happen at the same time. If a thread is
> unfortunately scheduled out right after this line, after it took the
> signal:
>
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
>                                                 &signals, signals - 2));
>
> and resumes running only a while later, during which time several group
> rotations happened, the thread could now observe a newer g1_start at the
> line below.
If the new G1 happens to be aliased on the same slot, the code
> incorrectly thinks the waiter is stealing a signal from the new G1, when
> in fact it took the signal from an old G1:
>
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>
> When we then post the signal back to the new and current G1, we will
> spuriously wake up a futex waiter in this group, causing __g_refs[G1] to
> be 1 less than __g_size[G1]. And we know __g_size[G1] isn't decremented
> because the original signal didn't come from this new G1. At this point
> the damage to this G1 is already done, although this is not the final
> state yet.
>
> As the G1 might still have several remaining waiters, when new signals
> come, waiters from this damaged G1 will still be woken up. Once the last
> signal is delivered to this G1, we would observe what was shown in the
> dump above: we posted a signal to a G1 but no futex waiter woke up, as
> __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains
> untaken, while in the meantime there are one or more waiters in G2. The
> signal is lost, when we could indeed have woken up a waiter in G2.

So the stealing mitigation that adds a signal by just incrementing the
respective slot in __g_signals can effectively mask another, real signal,
but ...

> I think the above code does give a plausible explanation for the dump and
> does suggest there might be a bug in handling signal stealing this way.
>
> Third, with both the repro and a theory about why it happened, we tried
> to hack the code to verify whether the theory is correct with the repro.
> Here is a link to the hack:
> https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

... why would a broadcast be required instead of just adding one more signal
to fix/replay the masked signal?  (You later said that you found out that the
patch above that just closes group 1 isn't sufficient, and then proposed to
use the broadcast.)
I don't have an explanation why the broadcast would be required, but I can
confirm that just using one signal does not work for your test program.  When
it hangs, I get such a state, for example:

__wseq = 10815175, __g1_start = 10815161, __g_refs = {0, 8},
__g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 32, __g_signals = {2, 0}

with the 4 remaining waiters all being in group 2 (ie, slot 1), and group 1
having one more signal than necessary according to its size (which would be
consistent with one being added unnecessarily).

So maybe adding one additional signal is insufficient; if I add two signals,
I haven't observed a failure with the reproducer yet.  But why would it have
to be more than one?

> So this is the issue. There are still a lot of details that I intentionally
> left out to keep this concise: details about the repro, our application,
> .NET Core's critical section implementation based on pthread_cond, and why
> we seem to be the only ones hitting this issue.

Are you sure that your reproducer can correctly handle spurious wakeups?  I'm
asking because if there might be a bug in the reproducer too, this could
explain why a single additional signal does not work (assuming that this
would indeed be a correct solution): the spuriously woken waiter goes to the
back of the queue, but the additional signal goes in before any waiter,
including the spuriously woken one, is in G2.

The stealing and the additional signals resulting from it are the only
sources of spurious wakeups, given that you don't use cancellation or
timeouts in the reproducer.  A broadcast or more signals could potentially
just make it unlikely that a fault in the reproducer triggers.

Thus, if you could point me to docs or a version of your mutex implementation
with comments, this would help me wrap my head around the details of it.
Malte, if you should have some cycles, perhaps your TLA+ model could also
help to find executions in which just adding one signal would not be
sufficient?

-- 
You are receiving this mail because:
You are on the CC list for the bug.