From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sourceware-bugzilla@sourceware.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 3A3A73857702; Thu,  4 May 2023 12:24:35 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3A3A73857702
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1683203076;
	bh=P8Ve0CAzSScF9XvowsO25dOWQmvleoxnpSU9XGxb6A0=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=FtlqSbhabPu0O/isXAySo/ZkxMX1S935zY9S0k0Fa2VcWeoMuwhuiPGzefpEsOvRd
	 lVteJmn9BKzCC1aBo2Ia7AH4HqkwxkAUwU/speZqmQj4yu66jwP+mYKwmeBrOzLJ7E
	 7LqCw+pMt5/gOj0rIdSOVLrBmg4hxEh/kpvDnhCk=
From: "carlos at redhat dot com" <sourceware-bugzilla@sourceware.org>
To: glibc-bugs@sourceware.org
Subject: [Bug nptl/25847] pthread_cond_signal failed to wake up
 pthread_cond_wait due to a bug in undoing stealing
Date: Thu, 04 May 2023 12:24:33 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: glibc
X-Bugzilla-Component: nptl
X-Bugzilla-Version: 2.27
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: normal
X-Bugzilla-Who: carlos at redhat dot com
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P2
X-Bugzilla-Assigned-To: carlos at redhat dot com
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-25847-131-f2943ZARQY@http.sourceware.org/bugzilla/>
In-Reply-To: <bug-25847-131@http.sourceware.org/bugzilla/>
References: <bug-25847-131@http.sourceware.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://sourceware.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <glibc-bugs.sourceware.org>

https://sourceware.org/bugzilla/show_bug.cgi?id=3D25847
--- Comment #58 from Carlos O'Donell <carlos at redhat dot com> ---
(In reply to Malte Skarupke from comment #56)
> Thanks for the review. Two points:
>=20
> 1. I agree that this one is wrong. I'll fix it.
>=20
> 2. It did strike me as odd that all the stores to g_signals are with rela=
xed
> MO. So if there is no group switching, pthread_cond_wait doesn't actually
> synchronize with pthread_cond_signal. I understand that the reason is that
> condition variables are always used together with a mutex, and the mutex
> forces synchronization, but even knowing that, it's kinda weird.

The intent is always to improve performance by avoiding the synchronize-wit=
h /
happens-before requirement since such a requirement could mean flushing wri=
tes
from that CPU and invalidating the caches on other CPUs (forcing them to re=
ad
the updated values).

If the mutex already provides that happens-before relationship then we need
only have relaxed MO, but in the case of the read of __g_signals we are
explicitly attempting to establish an order and make visible to other in-fl=
ight
threads the relaxed MO load of __g1_start, and so *something* has to create=
 the
happens-before relationship and that is normally a release-acquire pair.

The last thing I will do when going through the changes is review the
release-acquire pairing, and if they make logical sense. This is something =
you
didn't cover in the TLA+ analysis, and in fact is *very* difficult to do in
TLA+, but there are existing prior-art of having done this in TLA+ but you =
have
to model it and that creates additional modelling states:

An interesting discussion of the topic is here:
Verification of Concurrent Data Structures with TLA (Cai 2018):
https://openresearch-repository.anu.edu.au/bitstream/1885/151830/1/20181109=
-SCNC2102-Zixian.pdf

Formally we would need to model weak memory models in TLA+/PlusCal in order=
 to
prove that the implementation we have, with the specific acquire/release wo=
rks
correctly and there isn't a total global order that observes a logical race
condition.

> I'm not sure about the subtleties about how CPUs behave without
> acquire-release semantics. What happens if someone signals without a mute=
x?
> What state does the woken thread see? This sounds like you could get real=
ly
> weird bugs if it sees partial state.

If you signal without a mutex then there is no ordering guarantee because
without the mutex release to synchronize-with the mutex acquire, the state =
of
memory on that CPU could be any previous old state of memory.

You can only see partial states if you use relaxed MO and expect to see
consistent states, but once you use stronger than relaxed you can start to
reason about which updates you would or would not have seen.

As a primer I recommend Preshing's writeup on the topic:
https://preshing.com/20130823/the-synchronizes-with-relation/

> I know it's wrong to signal without locking/unlocking the mutex, but I
> wouldn't be 100% confident that there are no valid cases where you'd do
> that. Like some kind of job system where you signal a bunch and have a
> broadcast at the end to make sure everyone gets woken.

POSIX allows calling broadcast or signal without locking/unlock the mutex, =
but
without the enforced ordering you may not wake any threads i.e. "predictable
scheduling"

https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_sig=
nal.html
~~~
The pthread_cond_broadcast() or pthread_cond_signal() functions may be call=
ed
by a thread whether or not it currently owns the mutex that threads calling
pthread_cond_wait() or pthread_cond_timedwait() have associated with the
condition variable during their waits; however, if predictable scheduling
behavior is required, then that mutex shall be locked by the thread calling
pthread_cond_broadcast() or pthread_cond_signal().
~~~

> Since I had a hard time reasoning about this, I had a half-finished tenth
> patch in the series where I changed this and just made all stores to
> g_signals use release MO, just to remove the hard-to-reason-about case.
> Would you be interested in that patch?

The algorithm must use acquire/release in order to ensure happens-before for
certain internal states to become visible. Part of the review of the series=
 is
to review the acquire/release interactions and their impact. It is not corr=
ect
to change everything to release.

--=20
You are receiving this mail because:
You are on the CC list for the bug.=