From: "qin.li at thetradedesk dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] New: pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sat, 18 Apr 2020 08:04:10 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

            Bug ID: 25847
           Summary: pthread_cond_signal failed to wake up
                    pthread_cond_wait due to a bug in undoing stealing
           Product: glibc
           Version: 2.27
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: nptl
          Assignee: unassigned at sourceware dot org
          Reporter: qin.li at thetradedesk dot com
                CC: drepper.fsp at gmail dot com, triegel at redhat dot com
  Target Milestone: ---

Created attachment 12480
  --> https://sourceware.org/bugzilla/attachment.cgi?id=12480&action=edit
the C repro mentioned in the description

Hi, I will try to be concise, although this issue is hard to explain briefly.
The problem is reproducible (with ~100 lines of C code that uses only
pthreads) on Ubuntu 18.04 with glibc 2.27, and from looking through the recent
changes it should be applicable to the latest version of glibc as well.

Our application is written in C# and runs on .NET Core on Linux. .NET Core
implements a critical section structure based on pthread_cond: the condition
variable is used to park the waiters of the critical section, and the owner
calls pthread_cond_signal when leaving, to wake up one waiter that previously
blocked while entering.

Our application hangs regularly, and the culprit of the hang is traced to one
or more threads stuck in pthread_cond_wait (called by the EnterCriticalSection
function due to contention).
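For context, the critical section pattern described above is conceptually
similar to the following minimal sketch. This is only my simplified
illustration of the usage pattern, not .NET Core's actual implementation; the
cs_* names and fields are made up:

  #include <pthread.h>

  /* Hypothetical critical section built on pthread_cond, for illustration
     only.  One pthread_cond_signal per cs_leave is expected to wake one
     thread blocked in cs_enter.  */
  typedef struct
  {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int owned;      /* 1 while some thread holds the critical section.  */
    int waiters;    /* Number of threads blocked in cs_enter.  */
  } cs_t;

  #define CS_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

  static void
  cs_enter (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    while (cs->owned)
      {
        cs->waiters++;
        /* This is where our threads are observed to be stuck, even though
           the owner signaled when it left.  */
        pthread_cond_wait (&cs->cond, &cs->lock);
        cs->waiters--;
      }
    cs->owned = 1;
    pthread_mutex_unlock (&cs->lock);
  }

  static void
  cs_leave (cs_t *cs)
  {
    pthread_mutex_lock (&cs->lock);
    cs->owned = 0;
    if (cs->waiters > 0)
      /* Wake exactly one waiter; if this signal is lost, that waiter
         hangs forever.  */
      pthread_cond_signal (&cs->cond);
    pthread_mutex_unlock (&cs->lock);
  }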
The evidence that made us believe this is a bug in pthread_cond, rather than
in .NET Core or in our application, is this signature that we can always
confirm in our application's core dump:

  (lldb) p (__pthread_cond_s)0x000055af919331c8
  (__pthread_cond_s) $7 = {
     {
       __wseq = 5171942476
       __wseq32 = (__low = 876975180, __high = 1)
     }
     {
       __g1_start = 5171942452
       __g1_start32 = (__low = 876975156, __high = 1)
     }
     __g_refs = ([0] = 4, [1] = 0)
     __g_size = ([0] = 0, [1] = 0)
     __g1_orig_size = 40
     __wrefs = 16
     __g_signals = ([0] = 0, [1] = 2)
  }

The above internal state of the pthread_cond is problematic because:

1) the current G1 is at slot index 1 (the LSB of __wseq is 0, so G2 is at
   index 0)
2) there is 1 signal in G1, because __g_signals[G1] == 2
3) there are no futex waiters in G1, because __g_refs[G1] == 0
4) the last signaler just delivered the last signal to G1 of this
   pthread_cond, because __g_size[G1] == 0
5) at the same time there are 2 waiters in G2, as __g_refs[G2] == 4

So the problem is that while there are 2 waiters in G2, the last signal was
delivered to a G1 that had __g_size[G1] == 1 but __g_refs[G1] == 0. Based on
lines 77 to 89 of pthread_cond_signal.c, before the last signal increments
__g_signals[G1] from 0 to 2 (or from some value N to N + 2), __g_size[G1] was
1, and __g_size[G1] is decremented to 0 after the signal is added:

  if ((cond->__data.__g_size[g1] != 0)
      || __condvar_quiesce_and_switch_g1 (cond, wseq, &g1, private))
    {
      /* Add a signal.  Relaxed MO is fine because signaling does not need to
         establish a happens-before relation (see above).  We do not mask the
         release-MO store when initializing a group in
         __condvar_quiesce_and_switch_g1 because we use an atomic
         read-modify-write and thus extend that store's release sequence.  */
      atomic_fetch_add_relaxed (cond->__data.__g_signals + g1, 2);
      cond->__data.__g_size[g1]--;
      /* TODO Only set it if there are indeed futex waiters.  */
      do_futex_wake = true;
    }

The critical section uses only pthread_cond_wait, so cancellation is not
applicable here, which makes pthread_cond_signal the only place where
__g_size[G1] is decremented. And because __g_refs[G1] == 0, we essentially
delivered a signal to a G1 in which no futex waiter is waiting. The result is
a lost signal, while there are still 2 more waiters waiting for a signal in
G2.

There is admittedly some speculation in the above reasoning, since the dump
only shows the final state, which is not necessarily the state at the moment
the last signaler entered pthread_cond_signal. However, we do have more
supporting evidence.

First, I wrote a C repro that uses only pthread_cond/mutex, mimicking the
implementation of .NET Core's critical section and the usage pattern of the
critical section that triggers the issue. Here is the link to the repro:
https://gist.github.com/tqinli/db1b892a97cfa0bc41fb8b0b0b156b7e

On my 4-core, 8-hyperthread machine, the repro creates 12 threads (1.5 x
hyperthreads) that enter and leave the same critical section while doing some
dummy computation. In each round, every thread tries to enter and leave the
critical section once, then waits for the rest of the threads to do the same
before starting the next round, and this goes on forever. Every 2 seconds, a
monitor thread prints the current total number of rounds and the dummy
variable used during the calculation; a stripped-down sketch of this
structure is shown below.
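The sketch reuses the hypothetical cs_enter/cs_leave helpers from the earlier
sketch, and the main thread plays the role of the monitor; the actual repro
in the gist differs in its details:

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  #define NTHREADS 12   /* 1.5 x hyperthreads on a 4-core/8-thread box.  */

  static cs_t cs = CS_INITIALIZER;
  static pthread_mutex_t round_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t round_cond = PTHREAD_COND_INITIALIZER;
  static int arrived;
  static unsigned long long rounds;
  static double dummy;

  static void *
  worker (void *arg)
  {
    (void) arg;
    for (;;)
      {
        /* Enter and leave the contended critical section once per round.  */
        cs_enter (&cs);
        for (int i = 0; i < 1000; i++)
          dummy += i * 0.5;
        cs_leave (&cs);

        /* Wait for every other thread to finish this round before starting
           the next one.  */
        pthread_mutex_lock (&round_lock);
        if (++arrived == NTHREADS)
          {
            arrived = 0;
            rounds++;
            pthread_cond_broadcast (&round_cond);
          }
        else
          {
            unsigned long long r = rounds;
            while (r == rounds)
              pthread_cond_wait (&round_cond, &round_lock);
          }
        pthread_mutex_unlock (&round_lock);
      }
    return NULL;
  }

  int
  main (void)
  {
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
      pthread_create (&tids[i], NULL, worker, NULL);
    for (;;)
      {
        sleep (2);
        /* Unsynchronized read, good enough for monitoring: if these numbers
           stop changing, the hang has reproduced.  */
        printf ("rounds=%llu dummy=%f\n", rounds, dummy);
      }
  }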
When the same numbers are printed every 2 seconds over and over, the issue
has reproduced. It takes < 20 minutes to reproduce on my machine when running
just one instance of the repro program. I found it most effective to start 3
instances of the repro simultaneously, which tends to speed up the repro
significantly so it can happen within 1 minute, sometimes within ten seconds.
Mileage may vary, especially depending on the number of cores you have: we
have 64-core machines, and there I have seen it take 10 hours to reproduce.

Second, I believe I found the explanation for why we would ever get into a
situation where __g_size[G1] is 1 more than __g_refs[G1]. It is related to
how signal stealing is handled.

Right after a signal is successfully taken by a pthread_cond waiter, we have
to check __g1_start to confirm whether the waiter took the signal from the G1
it actually belongs to, or whether it took a signal from a later G1 that
happened to alias onto the same slot; see the code snippets below.

Lines 548 to 560 of pthread_cond_wait.c:

  /* Try to grab a signal.  Use acquire MO so that we see an up-to-date value
     of __g1_start below (see spinning above for a similar case).  In
     particular, if we steal from a more recent group, we will also see a
     more recent __g1_start below.  */
  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

  /* We consumed a signal but we could have consumed from a more recent group
     that aliased with ours due to being in the same group slot.  If this
     might be the case our group must be closed as visible through
     __g1_start.  */
  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);
  if (seq < (g1_start >> 1))

Lines 594 onward of pthread_cond_wait.c:

  /* Try to add a signal.  We don't need to acquire the lock because at
     worst we can cause a spurious wake-up.  If the group is in the process
     of being closed (LSB is true), this has an effect similar to us adding
     a signal.  */
  if (((s & 1) != 0)
      || atomic_compare_exchange_weak_relaxed (cond->__data.__g_signals + g,
                                               &s, s + 2))

This is the place where a signal is posted back to G1 without __g_size[G1]
being decremented at the same time, which is correct if the waiter is indeed
stealing a signal, because the decrement of __g_size[G1] already happened
when the signal was originally posted in pthread_cond_signal.

However, the two operations, a) taking a signal and b) checking the current
__g1_start, do not happen atomically. If a thread is unfortunately scheduled
out right after this line, i.e. right after it took the signal:

  while (!atomic_compare_exchange_weak_acquire (cond->__data.__g_signals + g,
                                                &signals, signals - 2));

and only resumes running a while later, during which time several group
rotations have happened, the thread can then observe a newer g1_start at the
line below. If the new G1 happens to alias onto the same slot, the code
incorrectly concludes that the waiter is stealing a signal from the new G1,
when in fact it took the signal from an old G1:

  uint64_t g1_start = __condvar_load_g1_start_relaxed (cond);

When we then post the signal back to the new, current G1, we spuriously wake
up a futex waiter in that group, causing __g_refs[G1] to become 1 less than
__g_size[G1]; __g_size[G1] is not decremented, because the original signal
did not come from this new G1. At this point the damage to this G1 is already
done, although this is not yet the final state. The interleaving is
summarized below.
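To make the interleaving easier to follow, here is my reconstruction of the
sequence of events (the numbering is mine, not from the code):

  1. Waiter W, which belongs to the old G1 in slot g, successfully takes a
     signal: the CAS decrements __g_signals[g] by 2.
  2. W is scheduled out before it loads __g1_start.
  3. Enough signal/wait traffic happens that G1 is rotated several times,
     and a new G1 ends up on the same slot g.
  4. W resumes, loads __g1_start, sees seq < (g1_start >> 1) and concludes
     it stole a signal from the new G1, although the signal it consumed
     belonged to the old G1.
  5. W posts a signal back to the new G1 (__g_signals[g] += 2), but
     __g_size[g] of the new G1 was never decremented for this signal.
  6. A waiter of the new G1 consumes that extra signal and wakes up
     spuriously, leaving __g_refs[G1] one less than __g_size[G1], which is
     exactly the imbalance seen in the dump once the group drains.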
As this G1 might still have several remaining waiters, waiters from the
damaged G1 will keep being woken up as new signals arrive. Only when the last
signal is delivered to this G1 do we observe what is shown in the dump above:
we posted a signal to a G1, but no futex waiter woke up, because __g_refs[G1]
reached 0 before __g_size[G1] did, and the signal remains untaken. Meanwhile
there are one or more waiters in G2. The signal is lost, when we could indeed
have woken up a waiter in G2.

I think the above gives a plausible explanation for the dump and does suggest
there is a bug in the way signal stealing is undone.

Third, with both the repro and a theory about why it happens, we tried
hacking the code to verify whether the theory is correct against the repro.
Here is a link to the hack:
https://github.com/thetradedesk/glibc/commit/900792b317b5b72a27bb4ec2aec1cd794dca62a5

Basically, if we suspect the current waiter stole a signal, posting a signal
back to the current G1 must still happen first, because if we are right about
the stealing it is also possible that a signaler thread is blocked in
quiesce_and_switch() while acquiring the private lock of the pthread_cond.
What comes next is mitigating the case where we are wrong about the stealing,
and I cannot think of an easy way to detect that. If we cannot detect when we
are wrong about the stealing, the only way I could think of is to dismantle
G1 entirely and let every waiter in G1 wake up spuriously. This broadcast is
only possible after we have potentially unblocked a thread blocked in
quiesce_and_switch via the pthread_cond's private lock.

This is by no means a suggestion that this could be the actual fix, if the
bug is confirmed here. But when I ran the repro with such a "fixed"
libpthread.so, what used to reproduce in 15-20 minutes no longer reproduces.
I have been able to run the repro for 2 days straight without a hang, which
gives me good confidence that the theory above holds and is what is causing
the problem we see.

So this is the issue. There are still many details that I intentionally left
out to keep this concise: details about the repro, about our application and
.NET Core's critical section implementation based on pthread_cond, and about
why we seem to be the only ones hitting this issue. Please let me know if you
have any questions and I will be happy to answer. I am looking forward to a
quick confirmation and a fix for this; let me know if there is anything I can
do to help make it happen faster.

Best Regards,
Qin Li

-- 
You are receiving this mail because:
You are on the CC list for the bug.