From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: by sourceware.org (Postfix, from userid 48) id 3A3A73857702; Thu, 4 May 2023 12:24:35 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3A3A73857702 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1683203076; bh=P8Ve0CAzSScF9XvowsO25dOWQmvleoxnpSU9XGxb6A0=; h=From:To:Subject:Date:In-Reply-To:References:From; b=FtlqSbhabPu0O/isXAySo/ZkxMX1S935zY9S0k0Fa2VcWeoMuwhuiPGzefpEsOvRd lVteJmn9BKzCC1aBo2Ia7AH4HqkwxkAUwU/speZqmQj4yu66jwP+mYKwmeBrOzLJ7E 7LqCw+pMt5/gOj0rIdSOVLrBmg4hxEh/kpvDnhCk= From: "carlos at redhat dot com" To: glibc-bugs@sourceware.org Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up pthread_cond_wait due to a bug in undoing stealing Date: Thu, 04 May 2023 12:24:33 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: glibc X-Bugzilla-Component: nptl X-Bugzilla-Version: 2.27 X-Bugzilla-Keywords: X-Bugzilla-Severity: normal X-Bugzilla-Who: carlos at redhat dot com X-Bugzilla-Status: ASSIGNED X-Bugzilla-Resolution: X-Bugzilla-Priority: P2 X-Bugzilla-Assigned-To: carlos at redhat dot com X-Bugzilla-Target-Milestone: --- X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: http://sourceware.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 List-Id: https://sourceware.org/bugzilla/show_bug.cgi?id=3D25847 --- Comment #58 from Carlos O'Donell --- (In reply to Malte Skarupke from comment #56) > Thanks for the review. Two points: >=20 > 1. I agree that this one is wrong. I'll fix it. >=20 > 2. It did strike me as odd that all the stores to g_signals are with rela= xed > MO. So if there is no group switching, pthread_cond_wait doesn't actually > synchronize with pthread_cond_signal. I understand that the reason is that > condition variables are always used together with a mutex, and the mutex > forces synchronization, but even knowing that, it's kinda weird. The intent is always to improve performance by avoiding the synchronize-wit= h / happens-before requirement since such a requirement could mean flushing wri= tes from that CPU and invalidating the caches on other CPUs (forcing them to re= ad the updated values). If the mutex already provides that happens-before relationship then we need only have relaxed MO, but in the case of the read of __g_signals we are explicitly attempting to establish an order and make visible to other in-fl= ight threads the relaxed MO load of __g1_start, and so *something* has to create= the happens-before relationship and that is normally a release-acquire pair. The last thing I will do when going through the changes is review the release-acquire pairing, and if they make logical sense. This is something = you didn't cover in the TLA+ analysis, and in fact is *very* difficult to do in TLA+, but there are existing prior-art of having done this in TLA+ but you = have to model it and that creates additional modelling states: An interesting discussion of the topic is here: Verification of Concurrent Data Structures with TLA (Cai 2018): https://openresearch-repository.anu.edu.au/bitstream/1885/151830/1/20181109= -SCNC2102-Zixian.pdf Formally we would need to model weak memory models in TLA+/PlusCal in order= to prove that the implementation we have, with the specific acquire/release wo= rks correctly and there isn't a total global order that observes a logical race condition. > I'm not sure about the subtleties about how CPUs behave without > acquire-release semantics. What happens if someone signals without a mute= x? > What state does the woken thread see? This sounds like you could get real= ly > weird bugs if it sees partial state. If you signal without a mutex then there is no ordering guarantee because without the mutex release to synchronize-with the mutex acquire, the state = of memory on that CPU could be any previous old state of memory. You can only see partial states if you use relaxed MO and expect to see consistent states, but once you use stronger than relaxed you can start to reason about which updates you would or would not have seen. As a primer I recommend Preshing's writeup on the topic: https://preshing.com/20130823/the-synchronizes-with-relation/ > I know it's wrong to signal without locking/unlocking the mutex, but I > wouldn't be 100% confident that there are no valid cases where you'd do > that. Like some kind of job system where you signal a bunch and have a > broadcast at the end to make sure everyone gets woken. POSIX allows calling broadcast or signal without locking/unlock the mutex, = but without the enforced ordering you may not wake any threads i.e. "predictable scheduling" https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_sig= nal.html ~~~ The pthread_cond_broadcast() or pthread_cond_signal() functions may be call= ed by a thread whether or not it currently owns the mutex that threads calling pthread_cond_wait() or pthread_cond_timedwait() have associated with the condition variable during their waits; however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal(). ~~~ > Since I had a hard time reasoning about this, I had a half-finished tenth > patch in the series where I changed this and just made all stores to > g_signals use release MO, just to remove the hard-to-reason-about case. > Would you be interested in that patch? The algorithm must use acquire/release in order to ensure happens-before for certain internal states to become visible. Part of the review of the series= is to review the acquire/release interactions and their impact. It is not corr= ect to change everything to release. --=20 You are receiving this mail because: You are on the CC list for the bug.=