From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Tue, 06 Apr 2021 00:37:49 +0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

Frank Barrus changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |frankbarrus_sw at shaggy dot cc

--- Comment #28 from Frank Barrus ---
I have also recently been chasing this bug and couldn't find anything about it online before, so I ended up having to figure it out and develop a workaround for now. While trying to find a place to post what I had found, my search led to this thread. So in case it's still helpful in providing more details or confirming what others have found, here is what I had already written up and was trying to find a place to post:

Over the past couple of years, working on software that uses pthreads on multi-core and multi-CPU systems, I have been encountering obscure lost wakeups that sometimes left threads completely stuck. I could not pinpoint the source of the problem and could find nothing online that matched what I was seeing, so I implemented watchdogs to "kick" certain threads once a stall was detected, but that approach was very specific to a particular threadpool/taskqueue usage pattern in the code.

After encountering the problem again in a more general usage pattern that did not lend itself to the watchdog approach, I was finally able to make investigating it a priority and recently got to the bottom of it (after previously looking online and finding nothing that matched what I was seeing at the time).

Here is the series of steps that leads to the bug:

- a waiter thread W1 has been running long enough (CPU-bound) to be eligible for involuntary pre-emption by the kernel soon, since its quantum is about to expire.
- thread W1 is signaled immediately before it blocks in pthread_cond_wait; it consumes the signal, but is then pre-empted before it can read g1_start.

- one or more G1/G2 switches occur such that the current index for G2 is back to where it was when W1 was pre-empted, but it is no longer effectively the same group even though it has the same index.

- thread W1 resumes, sees the change in g1_start, and assumes it consumed a signal that was not meant for it. As a result, it conservatively sends a replacement signal to the current G1 group, which turns out to be an unnecessary spurious signal. From the comments in the code, it would appear this is not considered harmful (although it turns out it is).

- the extra signal wakes up a waiter W2 from group G1. However, since this wakeup came from an extra replacement signal that should not have been sent, g_size[G1] was never decremented, so there is now a mismatch between g_size[G1] and the actual number of remaining waiters. (If multiple waiters follow this entire pattern at once on multiple cores, the mismatch can be more than 1.)

- once the actual count of remaining waiters in G1 hits 0, the next signal sent will increment g_signals[G1] and decrement g_size[G1], but will not wake up any threads, and will not yet run the quiesce_and_switch since it thinks there are still G1 waiters to consume signals. (It does not take wrefs into account to check how many waiters could possibly remain in G1.) This becomes a lost signal, which shows up in g_signals[G1] since it can never be consumed. The condvar can remain in this state for any amount of time and can accumulate additional waiters in G2, but it will still lose the very next signal (or the next several signals, if g_size[G1] is greater than 1). At this point it is possible for software using the condvar to be stuck completely if no additional signals are ever sent.

- if additional signals are sent, then once g_size[G1] reaches 0 the next signal will work properly, but the value of g_signals[G1] at that point represents how many signals were lost completely, i.e. how many additional wakeups never occurred. These will never be made up for unless this is observed and fixed before the quiesce_and_switch occurs, since that will then clear the g_signals count. Note that this means it can require more than one signal to properly fix the problem. There need to be enough signals to bring g_size[G1] to 0 just to become "unstuck", but in addition there should be true replacement signals for however many unconsumed signals are in g_signals[G1], in order to deliver those to all waiters who were already in G2 by the time the last (otherwise lost) signal was sent. It could be argued that we already woke that many extra waiters earlier, since those were the ones that caused the bug to arise in the first place, but if those waiters already awoke earlier, they do not count toward the required number of wakes: they observed an earlier state of whatever the mutex protects, and it could have changed before the signals that were lost. So they only count as spurious wakeups, and we still require the expected number of new wakeups, which is g_signals[G1]. To accomplish that we need g_size[G1] signals to get unstuck, plus g_signals[G1] more to actually perform the wakeups of the waiters still in G2.
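To make the counter bookkeeping in the steps above easier to follow, here is a minimal, single-threaded toy model of just the counters involved (g_size[G1], g_signals[G1], and the real number of G1 waiters). It is not the glibc code, and all names in it are invented for illustration; it only reproduces the arithmetic described above, showing how one spurious replacement signal leaves g_size[G1] one too high and causes a later signal to be swallowed.

#include <stdio.h>

/* Toy bookkeeping only: one G1 group, no threads, no futexes.  */
static unsigned g_size_g1;    /* waiters the condvar *thinks* remain in G1 */
static unsigned g_signals_g1; /* posted but unconsumed signals for G1 */
static unsigned real_g1;      /* waiters actually still waiting in G1 */

static void toy_cond_signal (void)
{
  if (g_size_g1 == 0)
    {
      /* The real code would quiesce G1 and switch groups here; what matters
         for the bug is that this path is NOT taken while g_size_g1 is still
         (wrongly) nonzero.  */
      printf ("signal: g_size[G1]==0, would switch G1/G2 here\n");
      return;
    }
  g_signals_g1 += 1;
  g_size_g1 -= 1;
  if (real_g1 > 0)
    {
      /* A real G1 waiter consumes the signal and leaves.  */
      g_signals_g1 -= 1;
      real_g1 -= 1;
      printf ("signal: woke a G1 waiter\n");
    }
  else
    printf ("signal: LOST (no G1 waiter left to consume it)\n");
}

int main (void)
{
  /* A fresh G1 with three waiters; bookkeeping matches reality.  */
  g_size_g1 = real_g1 = 3;
  g_signals_g1 = 0;

  toy_cond_signal ();   /* normal: 2 real waiters left, g_size[G1]==2 */

  /* W1 wrongly concludes it stole its signal and posts a replacement,
     which wakes W2, but nothing decrements g_size_g1: the counters now
     disagree by one.  */
  g_signals_g1 += 1;
  g_signals_g1 -= 1;    /* W2 consumes the replacement and leaves */
  real_g1 -= 1;
  printf ("after spurious replacement: g_size[G1]=%u, real waiters=%u\n",
          g_size_g1, real_g1);

  toy_cond_signal ();   /* wakes the last real G1 waiter */
  toy_cond_signal ();   /* g_size[G1] is still 1, so this signal is lost */

  printf ("end: g_size[G1]=%u, g_signals[G1]=%u, real waiters=%u\n",
          g_size_g1, g_signals_g1, real_g1);
  return 0;
}

Running this prints one "LOST" signal and ends with g_size[G1]=0 but g_signals[G1]=1, which is exactly the stuck state described above: the group is finally able to switch, but one signal was swallowed and the wakeup it should have caused never happens.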
In practice, most multi-threaded software that can get into this situation in the first place also has enough threads sending additional signals, so most of the time the lost wakeup just results in additional latency until the next signal is sent, or in a slight loss of concurrency if a specific number of workers was meant to be awakened. But every once in a while a scenario can be set up where there cannot be another signal until the wakeup occurs, and then the users of the condvar are completely stuck.

As a temporary workaround I have wrapped pthread_cond_wait(), pthread_cond_timedwait(), and pthread_cond_signal() with code that performs a quick post-check on the state of the condvar and determines whether there is a lost signal (by checking g_size and g_signals, calculating the size of G2, and checking wrefs to see how many waiters could possibly be left in G1 after accounting for all the ones in G2). It then sends additional signals as needed to fix the state and replace any lost signals, once that can be determined. (A rough sketch of the fields such a check looks at is included at the end of this comment.) The post-check has to be done in pthread_cond_wait()/pthread_cond_timedwait() as well as pthread_cond_signal(), since there are many cases where wrefs is temporarily higher until a waiter exits, so it cannot be conclusively determined that the condvar is in the lost-wakeup state until then. Once the last G1 waiter exits, wrefs only represents G2 waiters, so (wrefs - remaining_G2_size) tells whether there are any G1 waiters left who could still consume the signals there, and g_size[G1] can then be safely checked for a non-zero value if no G1 waiters remain. The post-check can also quickly bail out in several cases where it is clear that G1 waiters still exist (i.e. it did not detect the bug), such as when g_refs[G1] is non-zero. (All of this is written in terms of just the count bits, with the low-order flag bits already shifted out, to make it easier to describe.)

I'm looking for a proper pthreads fix to use instead, and am also considering how I might go about solving this myself or contributing to the solution if nothing is available soon.
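For reference, here is a rough sketch of the kind of state inspection such a post-check could rely on. It is not the actual workaround code, just a stripped-down diagnostic that reads the relevant fields, assuming the glibc 2.27-era layout of struct __pthread_cond_s from <bits/thread-shared-types.h> (anonymous unions for __wseq/__g1_start; newer glibc wraps these differently) and shifting out the documented flag bits. The reads are non-atomic, so the output is only a racy snapshot, and the final heuristic is deliberately conservative: as explained above, a conclusive check also has to compare wrefs against the remaining G2 size.

#include <pthread.h>
#include <stdio.h>

static void condvar_dump (pthread_cond_t *cond)
{
  unsigned long long wseq = cond->__data.__wseq;
  unsigned long long g1_start = cond->__data.__g1_start;
  unsigned int g2 = wseq & 1;          /* LSB of __wseq is the index of G2 */
  unsigned int g1 = g2 ^ 1;

  unsigned int signals_g1 = cond->__data.__g_signals[g1] >> 1; /* count bits */
  unsigned int refs_g1    = cond->__data.__g_refs[g1] >> 1;    /* count bits */
  unsigned int size_g1    = cond->__data.__g_size[g1];         /* plain count */
  unsigned int wrefs      = cond->__data.__wrefs >> 3;         /* count bits */

  printf ("wseq=%llu g1_start=%llu G1=%u: g_signals[G1]=%u g_refs[G1]=%u "
          "g_size[G1]=%u wrefs=%u\n",
          wseq >> 1, g1_start >> 1, g1, signals_g1, refs_g1, size_g1, wrefs);

  /* Conservative hint at the stuck state described above: unconsumed
     signals and a nonzero g_size[G1] while nobody is blocked on G1's
     futex.  Runnable-but-not-blocked G1 waiters would also show
     refs_g1 == 0, which is exactly why the real post-check has to wait
     until wrefs can be compared against the remaining G2 size.  */
  if (refs_g1 == 0 && size_g1 > 0 && signals_g1 > 0)
    printf ("  possible lost-signal state: g_size[G1] stuck at %u, "
            "%u unconsumed signal(s)\n", size_g1, signals_g1);
}

Calling something like this after a pthread_cond_signal() that appeared to have no effect makes the mismatch between g_size[G1] and the fields that track actual waiters easy to see.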