public inbox for libc-help@sourceware.org
* Possible hang in pthread_cond_broadcast()
@ 2018-08-15 22:13 Kuchta, Martin
  2018-08-16  2:01 ` Carlos O'Donell
  0 siblings, 1 reply; 4+ messages in thread
From: Kuchta, Martin @ 2018-08-15 22:13 UTC (permalink / raw)
  To: libc-help; +Cc: Kuchta, Martin

Hi,

After upgrading from glibc 2.23 to 2.26, we've been seeing what looks like a hang inside pthread_cond_broadcast calls in our application. Based on inspecting core dumps from the hung processes and the nptl source, I have some speculation about where the issue might be. Since I have no prior experience with this code, I'm hoping to get a sanity check on my understanding of the potential problem before jumping straight to filing a bug report.

Ideally, we'd like to have a standalone reproducible test, but our efforts so far to develop one have been unsuccessful. The problem only seems to show itself in our application very rarely (once per 15-24 hours) and only under very heavy load.

We are also using a version of glibc with Gentoo patches (glibc-2.26-r7), but the sections of code I'm looking at are identical to the current upstream master branch (unchanged since the condition variable rewrite in 2016). 
Our libc was built with:

    gcc-7.3.0
    CFLAGS="-march=westmere -pipe -O2 -fno-strict-aliasing"

We're running on a 4.14.49 Linux kernel.



With all that said, here is what I'm observing:

At the time our application's hang detector kicks in and kills the process, these are always the top frames of the hung thread's stack:

#0  0x00007ff8ef32320c in futex_wait (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/nptl/futex-internal.h:135
#2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7ff8e7433a58) at pthread_cond_common.c:412
#3  __pthread_cond_broadcast (cond=cond@entry=0x7ff8e7433a58) at pthread_cond_broadcast.c:73


And the actual condition variable state always looks something like this:

{__data = {{__wseq = 212413044, __wseq32 = {__low = 212413044, __high = 0}}, {__g1_start = 212413034, __g1_start32 = {__low = 212413034, __high = 0}}, __g_refs = {2, 0}, __g_size = {0, 0}, __g1_orig_size = 17, __wrefs = 10, __g_signals = {0, 5}}, 
  __size = "t*\251\f\000\000\000\000j*\251\f\000\000\000\000\002", '\000' <repeats 15 times>, "\021\000\000\000\n\000\000\000\000\000\000\000\005\000\000", __align = 212413044}


The hung thread is always in a futex_wait call with futex_word being the address of one of the elements of __g_refs and expected=2. The actual value at futex_word is always 0.
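
For orientation, my reading of the __g_refs encoding (inferred from the code quoted below rather than taken from a glibc header, so treat it as an assumption): bit 0 of each element is the wake-request flag and the remaining bits hold the waiter reference count, so 2 means one reference with no wake request, 3 means one reference plus a wake request, and 0 means no references at all. A trivial stand-alone decoder, just to make the numbers above concrete:

#include <stdio.h>

/* Assumed encoding of a __g_refs element: bit 0 is the wake-request flag,
   the remaining bits are the waiter reference count.  This mirrors the
   (r >> 1) tests and the fetch-or of 1 in the code quoted further down.  */
static void
decode_grefs (unsigned int v)
{
  printf ("__g_refs value %u: %u reference(s), wake-request %s\n",
          v, v >> 1, (v & 1) ? "set" : "clear");
}

int
main (void)
{
  decode_grefs (2);   /* what the broadcaster passes as expected */
  decode_grefs (3);   /* one reference plus a pending wake request */
  decode_grefs (0);   /* what the core dumps show at futex_word */
  return 0;
}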



And here is what I think might be happening:

In this loop within __condvar_quiesce_and_switch_g1, the broadcasting thread is supposed to be waiting for all the existing waiters to leave g1:

  while ((r >> 1) > 0)
    {
      for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
	{
	  /* TODO Back off.  */
	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
	}
      if ((r >> 1) > 0)
	{
	  /* There is still a waiter after spinning.  Set the wake-request
	     flag and block.  Relaxed MO is fine because this is just about
	     this futex word.  */
	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1);

	  if ((r >> 1) > 0)
	    futex_wait_simple (cond->__data.__g_refs + g1, r, private);
	  /* Reload here so we eventually see the most recent value even if we
	     do not spin.   */
	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
	}
    }


In a normal situation where there is still a waiter holding a reference, I think the flow could go something like this:

1. The atomic-or reads the value 2 and writes 3.
2. futex_wait_simple is called with expected=2 and returns immediately, because the actual value of 3 does not match the expected value.
3. There is still a waiter, so the loop repeats.
4. The atomic-or reads the value 3 and writes 3.
5. futex_wait_simple is called with expected=3 and blocks.
6. At some point, the last waiter releases its reference with __condvar_dec_grefs, which sees the value of 3 for the reference count, clears the wake-request flag, and calls futex_wake to unblock the broadcasting thread:

static void
__condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
{
  /* Release MO to synchronize-with the acquire load in
     __condvar_quiesce_and_switch_g1.  */
  if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
    {
      /* Clear the wake-up request flag before waking up.  We do not need more
	 than relaxed MO and it doesn't matter if we apply this for an aliased
	 group because we wake all futex waiters right after clearing the
	 flag.  */
      atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned int) 1);
      futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
    }
}


In my situation, the unexpected piece seems to be that futex_wait is blocking on a value without the wake-request flag set. This indicates to me that something must have cleared the wake-request flag between the atomic-or and futex_wait from the broadcasting thread. The only place I see the wake-request flag being cleared is in __condvar_dec_grefs.

The most reasonable explanation I've been able to come up with involves a stale waiter from the _previous_ time this group was g1:

1. That old waiter decremented the reference count in __condvar_dec_grefs and entered the if-block, but was preempted before clearing the wake-request flag.
2. Because the reference count had already been decremented, the broadcast in progress was able to complete.
3. Another broadcast then happened, swapping the active group and making the original group g1 again.
4. On the broadcast after that, the old waiter thread resumed and cleared the wake-request flag right between the broadcaster reading the value 2 from the reference count and calling futex_wait, causing futex_wait to block on a value with the flag clear.
5. The last legitimate waiter on g1 at that point would not call futex_wake when releasing its reference, since the wake-request flag would not be set, leaving the broadcasting thread blocked indefinitely.

I did manage to produce a scenario like this in a manual test involving multiple processes broadcasting and waiting on a condition variable in shared memory, using gdb to pause and resume the processes at the right places. The hang I observed in that experiment looked the same as the one our real system is hitting. Manually controlling the processes makes this less convincing as a reproducer, but it does add more evidence that a hang in pthread_cond_broadcast is possible just by altering timing.
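
For reference, the setup side of that manual experiment is nothing exotic: a mutex and condvar placed in a shared mapping and initialized as process-shared. The sketch below shows roughly the kind of setup involved (simplified, with illustrative names, and not the actual test harness); other processes would shm_open and mmap the same name and use the already-initialized objects.

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared_sync
{
  pthread_mutex_t mutex;
  pthread_cond_t cond;
};

/* Map a small shared-memory object and initialize a process-shared
   mutex/condvar pair in it so that separate processes can broadcast and
   wait on the same condition variable.  Error handling is minimal.  */
static struct shared_sync *
create_shared_sync (void)
{
  int fd = shm_open ("/condvar-hang-demo", O_CREAT | O_RDWR, 0600);
  if (fd < 0 || ftruncate (fd, sizeof (struct shared_sync)) != 0)
    return NULL;
  struct shared_sync *s = mmap (NULL, sizeof (struct shared_sync),
                                PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (s == MAP_FAILED)
    return NULL;

  pthread_mutexattr_t ma;
  pthread_mutexattr_init (&ma);
  pthread_mutexattr_setpshared (&ma, PTHREAD_PROCESS_SHARED);
  pthread_mutex_init (&s->mutex, &ma);

  pthread_condattr_t ca;
  pthread_condattr_init (&ca);
  pthread_condattr_setpshared (&ca, PTHREAD_PROCESS_SHARED);
  pthread_cond_init (&s->cond, &ca);

  return s;
}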



My feeling is that the futex_wait call in __condvar_quiesce_and_switch_g1 should be made with the expected value being the one just written to __g_refs (with the wake-request flag set), instead of the one read. This would guarantee that if the futex_wait sleeps, then the wake-request flag is set. If another thread comes in and clears the wake-request flag before the futex_wait call, the values will no longer match and it will simply return.
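
Concretely, sketched against the loop quoted above (this is only the change we are experimenting with locally, not a reviewed patch), the idea is to pass futex_wait_simple the value the atomic-or just produced, i.e. r with the wake-request flag set:

	  /* There is still a waiter after spinning.  Set the wake-request
	     flag and block.  */
	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1);

	  if ((r >> 1) > 0)
	    /* Wait on the value just written (r | 1) rather than the value
	       read (r).  If another thread clears the wake-request flag
	       before we reach the futex call, the futex word no longer
	       matches and futex_wait returns instead of blocking with the
	       flag clear.  */
	    futex_wait_simple (cond->__data.__g_refs + g1, r | 1, private);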


Please let me know if there are any obvious misunderstandings here, or if there is extra information we could provide about the issue. We're going to try the proposed fix internally to see if it makes a difference - initial tests look somewhat promising.

Thanks,
Martin


* Re: Possible hang in pthread_cond_broadcast()
  2018-08-15 22:13 Possible hang in pthread_cond_broadcast() Kuchta, Martin
@ 2018-08-16  2:01 ` Carlos O'Donell
  2018-08-16 23:18   ` Kuchta, Martin
  2018-08-21 15:51   ` Torvald Riegel
  0 siblings, 2 replies; 4+ messages in thread
From: Carlos O'Donell @ 2018-08-16  2:01 UTC (permalink / raw)
  To: Kuchta, Martin, libc-help, Torvald Riegel

On 08/15/2018 06:13 PM, Kuchta, Martin wrote:
> After upgrading from glibc 2.23 to 2.26

In glibc 2.25 we completely rewrote the POSIX thread condition variable
implementation in order to comply strictly with the signaling requirements
as outlined by POSIX. So there was a significant change in the implementation
between 2.23 and 2.26.

> , we've been seeing what looks like a hang inside
> pthread_cond_broadcast calls in our application. Based on inspecting
> core dumps from the hung processes and the nptl source, I have some
> speculation about where the issue might be. Since I have no prior
> experience with this code, I'm hoping to get a sanity check on my
> understanding of the potential problem before jumping straight to
> filing a bug report.

OK.

> Ideally, we'd like to have a standalone reproducible test, but our
> efforts so far to develop one have been unsuccessful. The problem
> only seems to show itself in our application very rarely (once per
> 15-24 hours) and only under very heavy load.

Right, the ideal is a simple reproducer.

> We are also using a version of glibc with Gentoo patches
> (glibc-2.26-r7), but the sections of code I'm looking at are
> identical to the current upstream master branch (unchanged since the
> condition variable rewrite in 2016).

OK.

> Our libc was built with:
>     gcc-7.3.0
> 
>     CFLAGS "-march=westmere -pipe -O2 -fno-strict-aliasing", and we're running on a 4.14.49 Linux kernel.

Why are you building with -fno-strict-aliasing? This is not needed for
building glibc, and disables optimizations which may change the code
generation in unexpected ways. You are deviating from the normal way
glibc is built in the distributions.

> With all that said, here is what I'm observing:
> 
> At the time our application's hang detector kicks in and kills the process, these are always the top frames of the hung thread's stack:
> 
> #0  0x00007ff8ef32320c in futex_wait (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=<optimized out>, expected=2, futex_word=0x7ff8e7433a6c) at ../sysdeps/nptl/futex-internal.h:135
> #2  __condvar_quiesce_and_switch_g1 (private=<optimized out>, g1index=<synthetic pointer>, wseq=<optimized out>, cond=0x7ff8e7433a58) at pthread_cond_common.c:412
> #3  __pthread_cond_broadcast (cond=cond@entry=0x7ff8e7433a58) at pthread_cond_broadcast.c:73

OK.
 
> 
> And the actual condition variable state always looks something like this:
> 
> {__data = {{__wseq = 212413044, __wseq32 = {__low = 212413044, __high = 0}}, {__g1_start = 212413034, __g1_start32 = {__low = 212413034, __high = 0}}, __g_refs = {2, 0}, __g_size = {0, 0}, __g1_orig_size = 17, __wrefs = 10, __g_signals = {0, 5}}, 
>   __size = "t*\251\f\000\000\000\000j*\251\f\000\000\000\000\002", '\000' <repeats 15 times>, "\021\000\000\000\n\000\000\000\000\000\000\000\005\000\000", __align = 212413044}

OK.
 
> The hung thread is always in a futex_wait call with futex_word being the address of one of the elements of __g_refs and expected=2. The actual value at futex_word is always 0.
 
OK. 
 
> And here is what I think might be happening:
> 
> In this loop within __condvar_quiesce_and_switch_g1, the broadcasting thread is supposed to be waiting for all the existing waiters to leave g1:
> 
>   while ((r >> 1) > 0)
>     {
>       for (unsigned int spin = maxspin; ((r >> 1) > 0) && (spin > 0); spin--)
> 	{
> 	  /* TODO Back off.  */
> 	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
> 	}
>       if ((r >> 1) > 0)
> 	{
> 	  /* There is still a waiter after spinning.  Set the wake-request
> 	     flag and block.  Relaxed MO is fine because this is just about
> 	     this futex word.  */
> 	  r = atomic_fetch_or_relaxed (cond->__data.__g_refs + g1, 1);
> 
> 	  if ((r >> 1) > 0)
> 	    futex_wait_simple (cond->__data.__g_refs + g1, r, private);
> 	  /* Reload here so we eventually see the most recent value even if we
> 	     do not spin.   */
> 	  r = atomic_load_relaxed (cond->__data.__g_refs + g1);
> 	}
>     }
> 
> 
> In a normal situation where there is still a waiter holding a reference, I think the flow could go something like this:
> 
> The atomic-or reads the value 2 and writes 3
> futex_wait_simple is called with expected=2 and immediately returns because the actual value of 3 is not the same as the expected
> (There is still a waiter, so the loop repeats)
> The atomic-or reads the value 3 and writes 3
> futex_wait_simple is called with expected=3 and blocks
> At some point, the last waiter releases its reference with __condvar_dec_grefs, which sees the value of 3 for the reference count, unsets the wake-request flag, and calls futex_wake to unblock the broadcasting thread:
> 
> static void
> __condvar_dec_grefs (pthread_cond_t *cond, unsigned int g, int private)
> {
>   /* Release MO to synchronize-with the acquire load in
>      __condvar_quiesce_and_switch_g1.  */
>   if (atomic_fetch_add_release (cond->__data.__g_refs + g, -2) == 3)
>     {
>       /* Clear the wake-up request flag before waking up.  We do not need more
> 	 than relaxed MO and it doesn't matter if we apply this for an aliased
> 	 group because we wake all futex waiters right after clearing the
> 	 flag.  */
>       atomic_fetch_and_relaxed (cond->__data.__g_refs + g, ~(unsigned int) 1);
>       futex_wake (cond->__data.__g_refs + g, INT_MAX, private);
>     }
> }
> 
> 
> In my situation, the unexpected piece seems to be that futex_wait is blocking on a value without the wake-request flag set. This indicates to me that something must have cleared the wake-request flag between the atomic-or and futex_wait from the broadcasting thread. The only place I see the wake-request flag being cleared is in __condvar_dec_grefs.
> 
> The most reasonable explanation I've been able to come up with is that the last waiter from the _previous_ time this group was g1 decremented the reference count in __condvar_dec_grefs and entered the if-block, but was preempted before clearing the wake-request flag. Because the reference count was decremented, the broadcast was able to complete. Another broadcast then happened, swapping the active group and making the original group g1 again. On the broadcast after that, the old waiter thread resumed and cleared the wake-request flag right between the broadcaster reading the value 2 from the reference count and calling futex_wait, causing the futex_wait to block. The last legitimate waiter on g1 at that point would not call futex_wake when releasing its reference since the wake-request flag would not be set, leaving the broadcasting thread blocked indefinitely.
> 
> I did manage to produce a scenario like this in a manual test involving multiple processes broadcasting and waiting on a shared memory condition variable, using gdb to pause and resume processes in the right places. The hang I observed in that experiment looked the same as the one our real system is hitting. Manually controlling the processes makes this a bit less convincing, but it does seem to add a bit more evidence that a hang in pthread_cond_broadcast is possible by just altering timing.
> 
> 
> 
> My feeling is that the futex_wait call in __condvar_quiesce_and_switch_g1 should be made with the expected value being the one just written to __g_refs (with the wake-request flag set), instead of the one read. This would guarantee that if the futex_wait sleeps, then the wake-request flag is set. If another thread comes in and clears the wake-request flag before the futex_wait call, the values will no longer match and it will simply return.
> 
> 
> Please let me know if there are any obvious misunderstandings here, or if there is extra information we could provide about the issue. We're going to try the proposed fix internally to see if it makes a difference - initial tests look somewhat promising.
 
I would have to go through and re-review the algorithm to prove to myself
that what you are describing matches what is going on.

I'm including Torvald Riegel in this email since he is the author of the new
algorithm and might have some insight into the failure mode.

At this point I think you should file a bug so we don't lose this issue and can
reference it with a bug number.

-- 
Cheers,
Carlos.


* Re: Possible hang in pthread_cond_broadcast()
  2018-08-16  2:01 ` Carlos O'Donell
@ 2018-08-16 23:18   ` Kuchta, Martin
  2018-08-21 15:51   ` Torvald Riegel
  1 sibling, 0 replies; 4+ messages in thread
From: Kuchta, Martin @ 2018-08-16 23:18 UTC (permalink / raw)
  To: Carlos O'Donell, libc-help, Torvald Riegel

Thank you for taking a look, Carlos. I've submitted a bug as you
suggested:

https://sourceware.org/bugzilla/show_bug.cgi?id=23538


* Re: Possible hang in pthread_cond_broadcast()
  2018-08-16  2:01 ` Carlos O'Donell
  2018-08-16 23:18   ` Kuchta, Martin
@ 2018-08-21 15:51   ` Torvald Riegel
  1 sibling, 0 replies; 4+ messages in thread
From: Torvald Riegel @ 2018-08-21 15:51 UTC (permalink / raw)
  To: Carlos O'Donell; +Cc: Kuchta, Martin, libc-help

On Wed, 2018-08-15 at 22:01 -0400, Carlos O'Donell wrote:
> On 08/15/2018 06:13 PM, Kuchta, Martin wrote:
> > Ideally, we'd like to have a standalone reproducible test, but our
> > efforts so far to develop one have been unsuccessful. The problem
> > only seems to show itself in our application very rarely (once per
> > 15-24 hours) and only under very heavy load.
> 
> Right, the ideal is a simple reproducer.

Even if there's no reproducer that triggers the problem reliably, I'd
like to see the code that uses the condvar (not the full application,
but something reduced).

I'll try to find some time to review this bug report and your analysis.

