From: "malteskarupke at fastmail dot fm"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Fri, 08 Jan 2021 03:45:05 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: unassigned at sourceware dot org

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #24 from Malte Skarupke ---
(In reply to Torvald Riegel from comment #23)
> Consider the case where a signal is stolen by thread A from a group 1 that
> is then fully signaled according to the group size, and then another thread
> B tries to quiesce and switch group 1. The internal lock is held by B,
> which waits for a waiter W whose signal has been stolen by A. W can't make
> progress because A has "its" signal. And A waits for B if one adds the
> empty critical section, which is how I think this can deadlock.
>
> > The mitigation patch
> > will call these functions.
>
> It does, but after giving back the stolen signal (or at least trying to do
> so, incorrectly).

That makes sense, and it would mean that the mitigation patch should be good.

So if stealing happens, a waiter is blocked. And that causes a signaler to
block because it can't close the group. And because of that, I couldn't undo
the stealing by signaling, because two threads can't signal at the same time
because they have to hold the internal mutex. Instead we have to do the
partial signal that we do right now. But that one fails if stealing didn't
happen. Because then we get into a weird state where g_size is incorrect
because we can't change g_size without taking the internal mutex.

This means we have to do something different depending on whether stealing
happened or didn't happen. But we only have this code to begin with because
we can't tell one from the other.

Which means the mitigation patch might be the only way to do this properly:
Do the partial signal in case stealing happens, and then in case stealing
didn't happen do a broadcast to reset the state. And replacing the broadcast
with a signal will be difficult because signaling doesn't reset the state of
the condition variable, and therefore can't get us out of the weird state in
case stealing didn't happen.

I personally think this is unsatisfying. It's very complex, even if we can
find a way to do this correctly. So I won't attempt to work in this direction
again, and I'll instead try to clean up the patch that makes stealing
impossible, which reduces complexity.