From: "michael.bacarella at gmail dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Sun, 01 Nov 2020 15:40:30 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #12 from Michael Bacarella ---
(In reply to Malte Skarupke from comment #11)
> I blogged about trying to understand this bug using TLA+ here:
> https://probablydance.com/2020/10/31/using-tla-in-the-real-world-to-understand-a-glibc-bug/

This is an awesome write-up.

> TLA+ is a tool for doing program verification. I'll make the code available
> in any license you want. (I presume LGPL is good?) I think it would be good
> to have a TLA+ implementation of the condition variables in glibc.

Agree. I will happily contribute to this cause.

> My solution for this bug based on the investigation would be to broaden the
> scope of g_refs, so that it gets incremented in line 411 in
> pthread_cond_wait.c, and gets decremented in line 54. That makes it align
> exactly with the increment and decrement of wrefs, which might make wrefs
> unnecessary. (though I didn't verify that) If it does make wrefs
> unnecessary, this fix could both simplify the code, speed it up a little,
> and also get rid of the "potential stealing" case.
>
> I haven't yet done the work to actually implement and try that change. Let
> me know if there is interest in me trying that. (or also let me know if it's
> unlikely to work because I missed something)

I'm fairly confident that Qin Li's patch stops the deadlock from happening
(see my previous comment). I understand it may not be the maximally efficient
fix, but I consider not deadlocking to outweigh any possible inefficiency
introduced only in this lock-stealing edge case.

Given how hard this bug is to isolate (it took about two man-months on my
side), and that other orgs have gone through and are going through similar
experiences, it seems right to apply the fix now and discuss improving the
performance of the fix in a new bug.

From Malte's excellent blog post, it appears people have been struggling with
this since at least 2019: https://github.com/mratsim/weave/issues/56