From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 7A9763858C00;
	Mon, 29 May 2023 15:18:40 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7A9763858C00
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1685373520;
	bh=HzTA1QuifVD8+/KiLIHnt0lXU1ovw9o8y59w2g5TIqQ=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=Zx+GJGl/AlcUoeDOArIuE0j1eD5rOg+Sbl2cpsubDUY99WFRho+UOWZrdyQaI+Irm
	 t74mnkCplQMr3GkHcZfXXEBkkz4Fr4wexIGBCdiCN1oAAPewUg55LG9NaouRcwusqv
	 ZoedtmjhfXMZXVoVn64ydFVNsIJJCnU1QTkrJwTE=
From: "rachel at rachelmant dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/110016] Possible miscodegen when inlining
 std::condition_variable::wait predicate causes deadlock
Date: Mon, 29 May 2023 15:18:40 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 12.2.1
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rachel at rachelmant dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110016

--- Comment #14 from Rachel Mant ---
(In reply to Andrew Pinski from comment #12)
> Let me try again to show the exact sequence of events behind why I think
> this is not a libstdc++/GCC bug:
>
> time  thread/core 1                    thread/core N
>  -1   grab the mutex
>   0   atomically load waitingThreads   atomically increment waitingThreads
>   1   compare waitingThreads           atomically load finished
>   2   atomically set finished to 1     atomically load work.empty() (queueLength)
>   3   start of notify_all              branch on finished/queueLength
>   4   ... (some code before ...)       start on haveWork.wait
>   5   notifies all threads finished    ... (some more code before ...)
>   6   .....                            waiting now
>   7   start of joins                   still inside wait
>   8   joins hit thread N               still inside wait
> etc.
>
> You will notice the ordering of loading finished and the wait (and the
> setting of finished and notify_all) is exactly what you would expect with
> memory_order_seq_cst on each core; that is, there is no reordering going on
> within either thread/core. It is still strictly ordered, even.
>
> The reason libstdc++ perhaps exposes this is that the wait implementation
> checks the predicate before it goes into the wait system call, or that the
> time between the start of the notify_all call and the notifications going
> out is shorter than the time between the atomic load of finished and the
> wait system call.
>
> Since, on thread 1, updating finished to 1 and notify_all are not done
> atomically (together), a thread could have read finished before the update
> and get into the wait loop after the notifications have gone out.
>
> It is very similar to a TOCTOU issue, because the "use" of finished is the
> wait itself rather than the comparison, and the setting of finished and the
> notify would need to be done atomically (together); right now there is only
> an atomic ordering of the two.

Thank you for the clear run-through of the series of events you see leading to
the deadlock. That's very helpful.
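If I sketch the pattern you are describing in code, I think it comes out
roughly like the following. This is a deliberately simplified stand-in for our
pool, not the real implementation: `finished`, `waitingThreads`, `haveWork` and
`work` are the names from your timeline, while the mutex name, the worker loop
and `finish()` are made up here purely for illustration:

```c++
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

std::atomic<bool> finished{false};
std::atomic<std::size_t> waitingThreads{0};
std::mutex workMutex;             // assumed name for "the mutex"
std::condition_variable haveWork;
std::queue<int> work;             // element type is a placeholder

// thread/core N
void workerLoop()
{
	for (;;)
	{
		std::unique_lock<std::mutex> lock{workMutex};
		++waitingThreads;
		// wait() checks the predicate first, under the lock; if it sees
		// finished == false and an empty queue, it releases the lock and
		// blocks. The store to finished and the notify_all() on thread 1
		// can both land in the window between that check and this thread
		// actually blocking, because neither takes workMutex.
		haveWork.wait(lock, [] { return finished || !work.empty(); });
		--waitingThreads;
		if (finished && work.empty())
			return;
		// ... pop an item from work and run it ...
	}
}

// thread/core 1
void finish()
{
	finished = true;       // not done while holding workMutex
	haveWork.notify_all(); // may fire before thread N has blocked
	// ... join the workers ...
}
```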
To properly understand this problem space: why do you think locking the mutex
before setting `finished` is sufficient to fix this? It feels to us like it
shouldn't be, and that it should only mask the bug, making it less likely to
trigger.
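For concreteness, the change we understand is being suggested is roughly the
following, again written against the simplified sketch above rather than our
actual code:

```c++
// Suggested change as we understand it (replacing finish() above): perform
// the store to finished while holding the same mutex the waiters use, so the
// store cannot land in the window between a waiter's predicate check and
// that waiter blocking.
void finish()
{
	{
		std::lock_guard<std::mutex> lock{workMutex};
		finished = true;
	}
	haveWork.notify_all();
	// ... join the workers ...
}
```

Is that the intended change, and if so, what guarantees that it closes the
window completely rather than merely narrowing it?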