From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: by sourceware.org (Postfix, from userid 48) id 7A9763858C00;
	Mon, 29 May 2023 15:18:40 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7A9763858C00
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1685373520;
	bh=HzTA1QuifVD8+/KiLIHnt0lXU1ovw9o8y59w2g5TIqQ=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=Zx+GJGl/AlcUoeDOArIuE0j1eD5rOg+Sbl2cpsubDUY99WFRho+UOWZrdyQaI+Irm
	 t74mnkCplQMr3GkHcZfXXEBkkz4Fr4wexIGBCdiCN1oAAPewUg55LG9NaouRcwusqv
	 ZoedtmjhfXMZXVoVn64ydFVNsIJJCnU1QTkrJwTE=
From: "rachel at rachelmant dot com"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/110016] Possible miscodegen when inlining
 std::condition_variable::wait predicate causes deadlock
Date: Mon, 29 May 2023 15:18:40 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libstdc++
X-Bugzilla-Version: 12.2.1
X-Bugzilla-Keywords:
X-Bugzilla-Severity: normal
X-Bugzilla-Who: rachel at rachelmant dot com
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution:
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110016

--- Comment #14 from Rachel Mant ---
(In reply to Andrew Pinski from comment #12)
> Let me try again to show the exact sequence of events behind why I think
> this is not a libstdc++/GCC bug:
>
> time  thread/core 1                    thread/core N
>  -1   grab the mutex
>   0   atomically load waitingThreads   atomically increment waitingThreads
>   1   compare waitingThreads           atomically load finished
>   2   atomically set finished to 1     atomically load work.empty() (queueLength)
>   3   start of notify_all              branch on finished/queueLength
>   4   ... (some code before ...)       start on haveWork.wait
>   5   notifies all threads finished    ... (some more code before ...)
>   6   .....                            waiting now
>   7   start of joins                   still inside wait
>   8   joins hit thread N               still inside wait
> etc.
>
> You will notice the ordering of loading finished and the wait (and the
> setting of finished and notify_all) is exactly what you would expect with
> memory_order_seq_cst on each core; that is, there is no reordering going on
> within either thread/core. It is still strictly ordered, even.
>
> The reason libstdc++ perhaps exposes this is that the wait implementation
> checks the predicate before it goes into the wait system call, or that the
> time between the start of the notify_all call and the notifications going
> out is shorter than the time between the atomic load of finished and the
> wait system call.
>
> Since, on thread 1, updating finished to 1 and notify_all are not done
> atomically (together), a thread could have read finished before the update
> and get into the wait loop after the notifications have gone out.
>
> It is very similar to a TOCTOU issue, because the "use" of finished is the
> wait itself rather than the comparison, and the setting of finished and the
> notify would need to be done atomically (together); right now there is only
> an atomic ordering of the two.

Thank you for the clear run-through of the series of events you see leading to
the deadlock. That's very helpful.
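If I sketch the pattern you are describing in code, I think it comes out
roughly like the following. This is a deliberately simplified stand-in for our
pool, not the real implementation: `finished`, `waitingThreads`, `haveWork` and
`work` are the names from your timeline, while the mutex name, the worker loop
and `finish()` are made up here purely for illustration:

```c++
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

std::atomic<bool> finished{false};
std::atomic<std::size_t> waitingThreads{0};
std::mutex workMutex;             // assumed name for "the mutex"
std::condition_variable haveWork;
std::queue<int> work;             // element type is a placeholder

// thread/core N
void workerLoop()
{
	for (;;)
	{
		std::unique_lock<std::mutex> lock{workMutex};
		++waitingThreads;
		// wait() checks the predicate first, under the lock; if it sees
		// finished == false and an empty queue, it releases the lock and
		// blocks. The store to finished and the notify_all() on thread 1
		// can both land in the window between that check and this thread
		// actually blocking, because neither takes workMutex.
		haveWork.wait(lock, [] { return finished || !work.empty(); });
		--waitingThreads;
		if (finished && work.empty())
			return;
		// ... pop an item from work and run it ...
	}
}

// thread/core 1
void finish()
{
	finished = true;       // not done while holding workMutex
	haveWork.notify_all(); // may fire before thread N has blocked
	// ... join the workers ...
}
```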
To properly understand this problem space: why do you think locking the mutex
before setting `finished` is sufficient to fix this? It feels to us like it
shouldn't be, and that it should only mask the bug, making it less likely to
trigger.
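For concreteness, the change we understand is being suggested is roughly the
following, again written against the simplified sketch above rather than our
actual code:

```c++
// Suggested change as we understand it (replacing finish() above): perform
// the store to finished while holding the same mutex the waiters use, so the
// store cannot land in the window between a waiter's predicate check and
// that waiter blocking.
void finish()
{
	{
		std::lock_guard<std::mutex> lock{workMutex};
		finished = true;
	}
	haveWork.notify_all();
	// ... join the workers ...
}
```

Is that the intended change, and if so, what guarantees that it closes the
window completely rather than merely narrowing it?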