From: "frankbarrus_sw at shaggy dot cc"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Tue, 06 Apr 2021 16:49:57 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #31 from Frank Barrus ---
(In reply to Michael Bacarella from comment #12)
> [...]
> Given how hard this bug is to isolate (it took about two man-months on my
> side), and that other orgs have gone through and are going through similar
> experiences, it seems right to apply the fix now and discuss improving the
> performance of the fix in a new bug.
>
> From Malte's excellent blog post it appears people have been struggling with
> this since at least 2019 https://github.com/mratsim/weave/issues/56

We've been hitting it enough to start filing internal bugs as of at least June
2019 (perhaps earlier, but that's the oldest I was able to quickly dig up). I'm
also pretty sure we had been seeing cases of it for at least a year before that.

At the time the suspicion was that our own C code using pthreads had to be the
cause, since of course glibc and pthreads are so widely used that they couldn't
possibly have a bug like this that had gone unnoticed, and we could find nothing
online about it when searching back then. I did notice, though, that I was never
able to reproduce the problem when using an alternate futex-based
synchronization library. We ended up making a watchdog workaround solution for
the majority of cases where the lost wakeup was showing up.

It was only very recently that this became enough of a priority to warrant a
deeper investigation into pthreads, after first re-verifying the correctness of
our logic above it and logging the pthreads condvar state with a circular
in-core event log.

Now that we can detect it, with our software just running a normal constant
multi-threaded I/O test load I see about 250 to 500 cases a day (on different
hardware) of the bug. But the majority of these are cases that would never have
been seen normally except for a small latency increase, since there's almost
always another signal coming along soon anyway to "fix" the problem. Of the ones
that wouldn't be caught this way, our previous watchdog solution catches the
rest. The ones that still slip through the cracks (without the detector) are
exceedingly rare and not really reproducible, but they've been happening from
time to time for at least a couple of years now.