From: "triegel at redhat dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 16 Jan 2021 00:21:17 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #25 from Torvald Riegel ---
(In reply to Qin Li from comment #0)
> Created attachment 12480 [details]
> the c repro mentioned in the description
>
> Hi, I will try to be concise, although this issue is hard to explain briefly.
> The problem is reproducible (with ~100 lines of C code that uses only
> pthread) on Ubuntu 18.04 (glibc 2.27), and from looking through the recent
> changes it should be applicable to the latest version of glibc as well.
>
> Our application is written in C# and runs on .NET Core on Linux, and .NET
> Core implements a critical section structure based on pthread_cond. The
> pthread_cond is used for maintaining the waiters of the critical section,
> and the owner of the critical section uses pthread_cond_signal when leaving,
> to wake up one waiter that was previously blocked on entering it.
>
> It happens regularly that our application hangs, and the culprit of the
> hang is traced to one or more threads stuck in pthread_cond_wait (called
> by the EnterCriticalSection function due to contention). The evidence that
> made us believe this is a bug in pthread_cond, rather than in .NET Core or
> our application, is this signature that we can always confirm from our
> application core dump:
>
> (lldb) p (__pthread_cond_s)0x000055af919331c8
> (__pthread_cond_s) $7 = {
>   { __wseq = 5171942476, __wseq32 = (__low = 876975180, __high = 1) }
>   { __g1_start = 5171942452, __g1_start32 = (__low = 876975156, __high = 1) }
>   __g_refs = ([0] = 4, [1] = 0)
>   __g_size = ([0] = 0, [1] = 0)
>   __g1_orig_size = 40
>   __wrefs = 16
>   __g_signals = ([0] = 0, [1] = 2)
> }
>
> The above internal state of the pthread_cond is problematic because:
> 1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at
> index 0)
> 2) there is 1 signal in G1, because __g_signals[G1] == 2
> 3) there are no futex waiters in G1, because __g_refs[G1] == 0
> 4) the last signaler just delivered the last signal to G1 of this
> pthread_cond, because __g_size[G1] == 0
> 5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4
>
> So the problem is: while there are 2 waiters in G2, the last signal was
> delivered to G1 at a point where __g_size[G1] == 1 but __g_refs[G1] == 0.
>
> Based on lines 77 to 89 of pthread_cond_signal.c, before we incremented
> __g_signals[G1] for the last signal from 0 to 2 (or some value N to N + 2),
> __g_size[G1] was 1; __g_size[G1] was then decremented to 0 after the signal
> had been added:
>
>   if ((cond->__data.__g_size[g1] != 0)
>       || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
>     {
>       /* Add a signal.  Relaxed MO is fine because signaling does not need to
>          establish a happens-before relation (see above).  We do not mask the
>          release-MO store when initializing a group in
>          __condvar_quiesce_and_switch_g1 because we use an atomic
>          read-modify-write and thus extend that store's release sequence.  */
>       atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
>       cond->__data.__g_size[g1]--;
>       /* TODO Only set it if there are indeed futex waiters.  */
>       do_futex_wake = true;
>     }
>
> The critical section uses only pthread_cond_wait, so cancellation is not
> applicable here, which makes pthread_cond_signal the only place where
> __g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially
> delivered a signal to a G1 where no futex waiter is waiting on it. The
> result is a lost signal, while there are still 2 more waiters waiting for
> a signal in G2.
>
> While there seems to be a lot of speculation in the above reasoning, as the
> dump only tells us the final state, which is not necessarily the same state
> as when the last signaler entered pthread_cond_signal, we do have more
> supporting evidence.
>
> First, I wrote a C repro that uses only pthread_cond/mutex, by mimicking
> the implementation of .NET Core's critical section and the usage pattern
> of the critical section that causes the repro.
>
> Here is the link to the repro:
> https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e
>
> On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x
> hyperthreads) that enter and leave the same critical section, doing some
> dummy computation. Each round, each thread tries to enter and leave the
> critical section once, then waits for the rest of the threads to finish
> entering and leaving the critical section once as well, before starting
> the next round, and it goes on forever.
>
> Every 2 seconds, a monitor thread prints the current value of the total
> number of rounds and the dummy variable used during the calculation. When
> the same numbers are printed over and over every 2 seconds, it has
> repro'ed.
>
> It takes < 20 minutes on my machine to repro when I run just 1 instance of
> the repro program. I found it most effective to start 3 instances of the
> repro simultaneously, which tends to speed up the repro significantly so
> that it can happen within 1 minute, sometimes within ten seconds. The
> mileage may vary, especially based on the number of cores you have. We have
> 64-core machines, and there I've seen it take 10 hours to repro.
>
> Second, I believe I found the explanation for why we would ever enter a
> situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to
> how we handle signal stealing.
>
> Right after a signal is successfully taken by a pthread_cond waiter, we
> have to check __g1_start to confirm whether the waiter took the signal
> from the correct G1 it belongs to, or took a signal from a later G1 that
> happened to alias on the same slot; see the code snippets below.
>
> Lines 548 to 560 from pthread_cond_wait.c:
>
>   /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
>      of __g1_start below (see spinning above for a similar case).
In
>      particular, if we steal from a more recent group, we will also see a
>      more recent __g1_start below.  */
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
>                                                 &signals, signals - 2));
>
>   /* We consumed a signal but we could have consumed from a more recent group
>      that aliased with ours due to being in the same group slot.  If this
>      might be the case our group must be closed as visible through
>      __g1_start.  */
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>   if (seq < (g1_start >> 1))
>
> Lines 594 onwards from pthread_cond_wait.c:
>
>   /* Try to add a signal.  We don't need to acquire the lock
>      because at worst we can cause a spurious wake-up.  If the
>      group is in the process of being closed (LSB is true), this
>      has an effect similar to us adding a signal.  */
>   if (((s & 1) != 0)
>       || atomic_compare_exchange_weak_relaxed
>          (cond->__data.__g_signals + g, &s, s + 2))
>
> This is the place where a signal is posted back to G1, but __g_size[G1]
> isn't decremented at the same time, which is correct if the waiter is
> indeed stealing a signal, as the decrement of __g_size[G1] already
> happened when the signal was originally posted in pthread_cond_signal.
>
> However, the two operations, a) taking a signal and b) checking the
> current __g1_start, can't happen at the same time. If a thread is
> unfortunately scheduled out right after this line, after it took the
> signal:
>
>   while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
>                                                 &signals, signals - 2));
>
> and resumes running only a while later, during which time several group
> rotations happened, the thread could now observe a newer g1_start at the
> line below.
If the new G1 happens to be aliased on the same slot, the code
> incorrectly thinks the waiter is stealing a signal from the new G1, when
> in fact it took the signal from an old G1:
>
>   uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
>
> When we then post the signal back to the new and current G1, we will
> spuriously wake up a futex waiter in this group, causing __g_refs[G1] to
> be 1 less than __g_size[G1]. And we know __g_size[G1] isn't decremented
> because the original signal didn't come from this new G1. At this point
> the damage to this G1 is already done, although this is not the final
> state yet.
>
> As the G1 might still have several remaining waiters, when new signals
> come, waiters from this damaged G1 will still be woken up. Once the last
> signal is delivered to this G1, we would observe what was shown in the
> dump above: we posted a signal to a G1 but no futex waiter woke up, as
> __g_refs[G1] reached 0 before __g_size[G1] did, and the signal remains
> untaken, while in the meantime there are one or more waiters in G2. The
> signal is lost, when we could indeed have woken up a waiter in G2.

So the stealing mitigation that adds a signal by just incrementing the
respective slot in __g_signals can effectively mask another, real signal,
but ...

> I think the above code does give a plausible explanation for the dump and
> does suggest there might be a bug in handling signal stealing this way.
>
> Third, with both the repro and a theory about why it happened, we tried
> to hack the code to verify whether the theory is correct with the repro.
> Here is a link to the hack:
> https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

... why would a broadcast be required instead of just adding one more signal
to fix/replay the masked signal?  (You later said that you found out that the
patch above that just closes group 1 isn't sufficient, and then proposed to
use the broadcast.)
I don't have an explanation why the broadcast would be required, but I can
confirm that just using one signal does not work for your test program.  When
it hangs, I get such a state, for example:

__wseq = 10815175, __g1_start = 10815161, __g_refs = {0, 8},
__g_size = {0, 0}, __g1_orig_size = 12, __wrefs = 32, __g_signals = {2, 0}

with the 4 remaining waiters all being in group 2 (ie, slot 1), and group 1
having one more signal than necessary according to its size (which would be
consistent with one being added unnecessarily).

So maybe adding one additional signal is insufficient; if I add two signals,
I haven't observed a failure with the reproducer yet.  But why would it have
to be more than one?

> So this is the issue. There are still a lot of details that I intentionally
> left out to keep this concise: details about the repro, our application,
> .NET Core's critical section implementation based on pthread_cond, and why
> we seem to be the only ones hitting this issue.

Are you sure that your reproducer can correctly handle spurious wakeups?  I'm
asking because if there might be a bug in the reproducer too, this could
explain why a single additional signal does not work (assuming that this
would indeed be a correct solution): the spuriously woken waiter goes to the
back of the queue, but the additional signal goes in before any waiter,
including the spuriously woken one, is in G2.

The stealing and the additional signals resulting from it are the only
sources of spurious wakeups, given that you don't use cancellation or
timeouts in the reproducer.  A broadcast or more signals could potentially
just make it unlikely that a fault in the reproducer triggers.

Thus, if you could point me to docs or a version of your mutex implementation
with comments, this would help me wrap my head around the details of it.
Malte, if you should have some cycles, perhaps your TLA+ model could also
help to find executions in which just adding one signal would not be
sufficient?

-- 
You are receiving this mail because:
You are on the CC list for the bug.