From: "triegel at redhat dot com"
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing
Date: Fri, 25 Dec 2020 16:19:07 +0000
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Severity: normal
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P2

https://sourceware.org/bugzilla/show_bug.cgi?id=25847

--- Comment #18 from Torvald Riegel ---
(In reply to Qin Li from comment #0)
> As the G1 might still have several remaining waiters, when new signals come,
> waiters from this damaged G1 will still be woke up. Until the last signal is
> delivered on this G1, we would observe what was shown in the dump above:
> that we posted a signal to a G1, no futex waiter woke up, as __g_refs[G1]
> was already 0 before __g_size[G1] did, and the signal remains not taken. But
> in the meantime there are one or more waiters in G2.
> Signal is lost, when indeed we could have wake up a waiter in G2.

Qin Li, thank you for investigating, isolating the issue, and providing the
reproducer.

I agree that there is a bug: just incrementing a group's number of available
signals isn't sufficient because it doesn't adjust the group size (i.e., the
number of waiters that signalers think are still in this group) accordingly.
The result is that signalers can put a correct signal in a group that is
already effectively empty even though its size doesn't show that, which leads
to this signal being "lost". For comparison, __condvar_cancel_waiting also
updates the group size.

I also think that a solution would be to handle potential stealing in such a
way that it takes all the steps a pthread_cond_signal would, including
updating the group's size. I believe we don't need a full broadcast for that,
which would basically reset everything.

Malte Skarupke's proposed solution of "locking" groups through __g_refs more
broadly could also work in principle. I'll respond to the patch directly on
libc-alpha. I'm wondering about the performance implications of this
approach, even though a full pthread_cond_signal could also be costly because
it adds contention to signalers.

(In reply to Torvald Riegel from comment #17)
> (I'm aware of the first
> reproducer posted, but I'm currently looking at it and am not yet convinced
> that it is correct; it sends out more signals than the number of wake-ups it
> allows through the wait condition, AFAICT, which I find surprising.)

I haven't fully wrapped my head around how the critical section
implementation covered in the reproducer works, but I also don't see any
concrete red flags anymore.
First, what surprised me is that it really depends on blocking through the
condvar. I'd say this differs from many concurrent algorithms, which treat
blocking through OS facilities like futexes as "optional" because the real
functional blocking happens through shared-memory synchronization.

Second, the AWAKENED_WAITER flag seems to be intended to leak into the waiter
ref count (i.e., you can't have the waiter ref count start at a higher bit;
the "overflow" of the flag seems to be intended).
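
The accounting problem discussed in this comment (restoring a signal without
adjusting the group size) can be illustrated with a deliberately simplified,
single-threaded model. This is only a sketch under stated assumptions: the
struct, its fields, and the functions consume_signal, undo_steal, signal_one,
and run_scenario are hypothetical names chosen for illustration. They are not
glibc's data structures or code, and a real condvar does all of this with
atomics and futexes across threads.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not glibc's actual layout. */
struct group {
    unsigned size;     /* waiters that signalers believe are still in the group */
    unsigned signals;  /* signals currently available for consumption */
    unsigned blocked;  /* waiters that are actually still blocked */
};

struct condvar_model {
    struct group g1;   /* group currently being signaled */
    struct group g2;   /* group that collects newer waiters */
};

/* A blocked waiter consumes one available signal, if both exist. */
static void consume_signal(struct group *g)
{
    if (g->signals > 0 && g->blocked > 0) {
        g->signals--;
        g->blocked--;
    }
}

/* Undo of a suspected steal. With adjust_size == false, only the signal
   count is incremented (the problematic accounting). With adjust_size ==
   true, the group size is updated too, as a regular signal would do. */
static void undo_steal(struct group *g, bool adjust_size)
{
    g->signals++;
    if (adjust_size && g->size > 0)
        g->size--;
    consume_signal(g);
}

/* Simplified signaler: switch groups only when G1 looks empty by size. */
static void signal_one(struct condvar_model *cv)
{
    if (cv->g1.size == 0) {
        cv->g1 = cv->g2;                    /* G2's waiters become the new G1 */
        cv->g2 = (struct group){0, 0, 0};
    }
    if (cv->g1.size == 0)
        return;                             /* nobody to signal at all */
    assert(cv->g1.size > 0);                /* post only to a group believed non-empty */
    cv->g1.signals++;
    cv->g1.size--;
    consume_signal(&cv->g1);
}

/* Returns how many waiters remain blocked after one undo and one signal.
   Assumption of this toy: G2's size is already known up front. */
unsigned run_scenario(bool adjust_size)
{
    struct condvar_model cv = {
        .g1 = { .size = 1, .signals = 0, .blocked = 1 },
        .g2 = { .size = 1, .signals = 0, .blocked = 1 },
    };
    undo_steal(&cv.g1, adjust_size);
    signal_one(&cv);
    return cv.g1.blocked + cv.g2.blocked;
}
```

In this model, run_scenario(false) leaves the G2 waiter blocked: the
signaler still sees a non-zero size for G1 and posts its signal into a group
that no longer has any blocked waiter. With run_scenario(true), the size
reaches zero, the signaler switches groups, and the G2 waiter is woken.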