From mboxrd@z Thu Jan 1 00:00:00 1970
From: "rodgertq at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libstdc++/106772] atomic::wait shouldn't touch waiter pool if used platform wait
Date: Wed, 28 Sep 2022 23:40:36 +0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772

--- Comment #25 from Thomas Rodgers ---
(In reply to Mkkt Bkkt from comment #24)
> (In reply to Thomas Rodgers from comment #22)
> > Your example of '100+ core' systems, especially on NUMA, is certainly a
> > valid one. I would ask, at what point do those collisions and the
> > resulting cache invalidation traffic swamp the cost of just making the
> > syscall? I do plan to put these tests together, because there is another
> > algorithm that I am exploring which I believe will reduce the likelihood
> > of spurious wakeups and achieve the same result as this particular
> > approach, without generating the same invalidation traffic. At this
> > point, I don't anticipate doing that work until after GCC 13 stage 1
> > closes.
>
> Let me try to explain:
>
> Syscall overhead is a roughly constant cost, commonly around 10-30 ns (a
> futex syscall can be more expensive, around 100 ns in your example).
>
> But core counts keep growing, and ARM is becoming more popular
> (fetch_add/sub cost more there than on x86). People have already run into
> situations where a fetch_add costs more than the syscall overhead:
>
> https://pkolaczk.github.io/server-slower-than-a-laptop/
> https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html
>
> I don't think we will hit problems like the ones in those links with
> atomic::wait/notify in real code, but I'm pretty sure that in some cases
> it can be more expensive than the syscall part of atomic::wait/notify.
>
> Of course it would be better to prove it; maybe someday I will do it :(

So, to your previous comment, I don't think the discussion is at all
pointless. I plan to raise some of these issues at the next SG1 meeting in
November. Sure, that doesn't help *you* or any developer with your specific
intent until C++26, and maybe Boost's implementation is a better choice; I
also get how unsatisfying an answer that is.
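For concreteness, a rough sketch of the waiter-pool idea being discussed (not
the actual libstdc++ code; the names, the pool size, and the hashing are made
up for illustration, and it is Linux-only) might look like this: notify()
consults a shared, address-hashed waiter count so it can skip the FUTEX_WAKE
syscall when nobody is waiting, and that shared counter is the cache line
whose invalidation traffic is at issue.

  // Hypothetical sketch of a waiter pool, NOT the libstdc++ implementation.
  #include <atomic>
  #include <cstdint>
  #include <linux/futex.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  namespace sketch {

  constexpr std::uintptr_t pool_size = 16;   // made-up size, for illustration

  struct alignas(64) waiter_bucket {         // one cache line per bucket
    std::atomic<std::uint32_t> waiters{0};   // threads currently blocked here
  };

  inline waiter_bucket pool[pool_size];

  inline waiter_bucket& bucket_for(const void* addr) {
    return pool[(reinterpret_cast<std::uintptr_t>(addr) >> 6) % pool_size];
  }

  inline long futex(void* uaddr, int op, std::uint32_t val) {
    return syscall(SYS_futex, uaddr, op, val, nullptr, nullptr, 0);
  }

  static_assert(sizeof(std::atomic<std::uint32_t>) == sizeof(std::uint32_t));

  // Block until *a no longer holds old_value (lock-free 32-bit atomics only).
  inline void wait(std::atomic<std::uint32_t>& a, std::uint32_t old_value) {
    waiter_bucket& b = bucket_for(&a);
    b.waiters.fetch_add(1, std::memory_order_seq_cst);  // contended on big boxes
    while (a.load(std::memory_order_acquire) == old_value)
      futex(&a, FUTEX_WAIT_PRIVATE, old_value);
    b.waiters.fetch_sub(1, std::memory_order_seq_cst);
  }

  inline void notify_one(std::atomic<std::uint32_t>& a) {
    waiter_bucket& b = bucket_for(&a);
    if (b.waiters.load(std::memory_order_seq_cst) != 0) // the "touch" in question
      futex(&a, FUTEX_WAKE_PRIVATE, 1);
  }

  } // namespace sketch

On a small single-socket machine the extra fetch_add/load on the bucket is
noise next to a ~100 ns syscall; the open question above is at what core and
NUMA count the invalidation traffic on that shared line starts to dominate.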
I'm well aware of the potential scalability problems, and I have a longer-term
plan to get concrete data on how different implementation choices impact
scalability. The barrier implementation (which uses the same algorithm as
libc++'s), for example, spreads this traffic over 64 individual atomic_refs
for this very reason, and that implementation has been shown to scale quite
well on ORNL's Summit. But not all users of libstdc++ have those sorts of
problems either.
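For the record, the "spread the traffic" idea reduced to its simplest form is
just striping a hot counter across cache lines; the sketch below is only an
illustration of that general technique (the name striped_counter and the
thread-id hashing are invented here), not the libc++/libstdc++ barrier
algorithm itself.

  // Stripe one logical counter over 64 cache-line-aligned slots so that
  // concurrent writers mostly touch distinct cache lines.
  #include <atomic>
  #include <cstddef>
  #include <cstdint>
  #include <thread>

  class striped_counter {
    static constexpr std::size_t stripes = 64;
    struct alignas(64) slot { std::atomic<std::uint64_t> n{0}; };
    slot slots_[stripes];

    static std::size_t my_stripe() {
      // Hash the thread id so concurrent threads usually pick different slots.
      return std::hash<std::thread::id>{}(std::this_thread::get_id()) % stripes;
    }

  public:
    void add(std::uint64_t v) {
      slots_[my_stripe()].n.fetch_add(v, std::memory_order_relaxed);
    }

    std::uint64_t read() const {
      std::uint64_t total = 0;
      for (const slot& s : slots_)            // reader pays by summing 64 slots
        total += s.n.load(std::memory_order_relaxed);
      return total;
    }
  };

Writers mostly hit distinct cache lines, so the fetch_add invalidation traffic
stops growing with the thread count; the reader pays by summing the slots,
which is cheap by comparison.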