From: "tschwinge at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
Date: Tue, 08 Feb 2022 11:57:27 +0000
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: minor
X-Bugzilla-Who: tschwinge at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: roger at nextmovesoftware dot com
X-Bugzilla-Target-Milestone: 12.0

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #7 from Thomas Schwinge ---
All three of your patches combined still don't resolve the problem.  And,
what I realized: they don't even change the "used [...] registers" count
reported by the Nvidia/CUDA Driver.

Does that mean that the Driver "doesn't care" whether we feed it simpler PTX
code, using fewer PTX registers -- it seems able to do these optimizations all
by itself?  :-O

(At least regarding the number of registers used -- I didn't verify the SASS
code generated.)
(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much
valuable, of course: if only for our own visual consumption!  ... at least as
long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in
the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the
only '.extern' dependency aside from the '__reduction_lock' global variable
('nvptx-none/libgcc/reduction.o').  (In the following, I'm working around
"'lto-wrapper' invoking 'mkoffload's with duplicated command-line options".)
This means I can conveniently create a minimal nvptx 'libgcc.a' by hand:

    $ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with
'-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1', I verified that
identical PTX code is loaded onto the GPU.)  Then: hand-modify '_muldc3.o',
re-create 'libgcc.a', re-compile, re-execute.

I verified that the '__muldc3' from before "nvptx: Transition nvptx backend to
STORE_FLAG_VALUE = 1" (attached '_muldc3-good.o') works fine, and that the
'__muldc3' from after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE =
1" (attached '_muldc3-bad.o') does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning by eliding
simple changes like renumbered registers etc.), until only one last change was
necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the
"still-good" state:

    [...]
    @@ -1716,8 +1718,16 @@
     cvt.rn.f64.s32 %r84,%r27;
     copysign.f64 %r61,%r61,%r84;
     .loc 2 1981 32
    +// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +/*
     selp.u32 %r86,1,0,%r138;
     mov.u32 %r85,%r86;
    +*/
    +// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
    +neg.s32 %r87,%r86;
    +mov.u32 %r85,%r87;
    +//
     cvt.u16.u8 %r140,%r85;
     mov.u16 %r89,%r140;
     xor.b16 %r88,%r89,1;
    @@ -1770,8 +1780,16 @@
    [...]

That is, things go "bad" if we here use the '%r138' that was computed (and
used) earlier in the code, and things are "good" if we re-compute locally.
Same for '%r139' in the second code block.  (Interestingly, it is tolerated if
one of the long-lived registers is used; things go "bad" only if *both* are
used.  But that's probably just a distraction?  And again, I have not
inspected the actual SASS code, but just looked at the JIT-time "used [...]
registers".)

Do we thus conclude that what happens is that the "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC
such that values that have previously been computed may be re-used later in
the code, without re-computing them?  But on the flip side, of course, this
means that the values have to be kept live in (SASS) registers.  (That's just
my theory; I haven't verified the actual SASS.)

In other words: at least in this case here, it seems preferable to re-compute
instead of keeping registers occupied.  (But I'm of course not claiming that
to be a simple yes/no decision...)

It seems we're now in the territory of tuning CPU vs. GPU code generation?
Certainly, GCC has not seen much care for the latter (GPU code generation).
I mean: verifying the GCC pass pipeline generally, and the parameterization of
individual passes, for GPU code generation.

Impossible to get "right", of course, but maybe some heuristics for CPU vs.
GPU may be discovered and implemented?
I'm sure there must be some literature on that topic?

All that is complicated by the fact that with the (several different versions
of the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization we have
another moving part...
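
To make the re-use vs. re-compute trade-off above concrete, here is a minimal C sketch of the two shapes.  This is not the '__muldc3' source: the names 'not_inf', 'scale_reuse' and 'scale_recompute' are made up for illustration, and the compare only loosely mirrors the 'set.u32.leu.f64 ...,0d7fefffffffffffff' in the diff (true for finite or NaN inputs).  An optimizing compiler is free to turn either shape into the other, so this only shows the idea, not what GCC or the Driver will actually emit.

```c
#include <float.h>

/* Hypothetical helper: 1 if x is finite or NaN, 0 if x is +/-Inf.
   0d7fefffffffffffff in the PTX above is DBL_MAX.  */
static int not_inf(double x)
{
    return !(x > DBL_MAX || x < -DBL_MAX);
}

/* "Re-use" shape: the flag is computed once and stays live across the
   intervening work, occupying a register for the whole stretch.  */
double scale_reuse(double x, double y)
{
    int ok = not_inf(x);            /* computed early ...  */
    double t = ok ? x * y : y;
    t = t * t + 1.0;                /* stand-in for unrelated work  */
    return ok ? t : (x < 0 ? -1.0 : 1.0);   /* ... re-used late  */
}

/* "Re-compute" shape: the cheap compare is repeated at each use, so no
   flag has to be kept live in between.  */
double scale_recompute(double x, double y)
{
    double t = not_inf(x) ? x * y : y;
    t = t * t + 1.0;                /* stand-in for unrelated work  */
    return not_inf(x) ? t : (x < 0 ? -1.0 : 1.0);
}
```

Both functions return the same results; they differ only in whether the flag's live range spans the intermediate work, which is the property that seems to matter for the "used [...] registers" count here.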