From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 7399D394443E; Tue, 13 Apr 2021 12:21:02 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7399D394443E
From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Tue, 13 Apr 2021 12:21:01 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: frankbarrus_sw at shaggy dot cc
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: carlos at redhat dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: glibc-bugs@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Glibc-bugs mailing list
List-Unsubscribe: ,
List-Archive:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 13 Apr 2021 12:21:02 -0000

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #33 from Frank Barrus ---
(In reply to Carlos O'Donell from comment #32)
> I'm working through the review for this issue.

FYI, I'm currently testing a different pthreads fix for this issue that does it without the suggested "broadcast" solution that some distros appear to be adopting for now. It also eliminates the need for the signal-stealing handling that was leading to the bug in the first place. While broadcasting might work as a temporary workaround solution for correctness, it certainly has performance implications.

The workaround I mentioned previously (detecting the lost-wakeup state, or the potential for it, after calling pthread_cond_wait/pthread_cond_timedwait and pthread_cond_signal) does not detect any cases when this new pthreads fix is applied. However, when I run with the broadcast fix instead, it still detects cases that look like a potential lost wakeup. I suspect those are just due to the state the condvar is temporarily left in: all waiters are actually being woken by the broadcast and are merely racing against my detection code, which catches the intermediate state. Still, it makes me wonder whether there is a rare edge case that the broadcast solution isn't actually fixing and which could still get stuck. (Consider all the threads that are not actually blocked, are outside the g_refs-counted region, and may have been preempted, so they will make their attempt to grab g_signals at a later time.)

When I ran these tests, my post-detection logic still included its fix of sending extra signals as necessary. I'll have to re-run it as detection-only, with no fix, to see whether it would have gotten stuck, but my priority has been the work on the new pthreads fix, not proving the correctness of the fix that uses broadcast.

So far the testing of my new fix looks good for both correctness and performance, but of course there are many correctness issues to test properly and thoroughly. I'm hoping to be able to post a patch in the near future.