From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
 id 3B9C63858C27; Tue,  8 Feb 2022 11:57:28 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 3B9C63858C27
From: "tschwinge at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend
 to STORE_FLAG_VALUE = 1" patch made some code generation worse
Date: Tue, 08 Feb 2022 11:57:27 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: target
X-Bugzilla-Version: 12.0
X-Bugzilla-Keywords: missed-optimization
X-Bugzilla-Severity: minor
X-Bugzilla-Who: tschwinge at gcc dot gnu.org
X-Bugzilla-Status: ASSIGNED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: roger at nextmovesoftware dot com
X-Bugzilla-Target-Milestone: 12.0
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-104345-4-D0WF6BdCU1@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-104345-4@http.gcc.gnu.org/bugzilla/>
References: <bug-104345-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: gcc-bugs@gcc.gnu.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gcc-bugs mailing list <gcc-bugs.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-bugs/>
List-Post: <mailto:gcc-bugs@gcc.gnu.org>
List-Help: <mailto:gcc-bugs-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-bugs>,
 <mailto:gcc-bugs-request@gcc.gnu.org?subject=subscribe>
X-List-Received-Date: Tue, 08 Feb 2022 11:57:28 -0000

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345
--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
All three of your patches combined still don't help resolve the problem.
And, what I realized: they don't even change the Nvidia/CUDA Driver reported
"used [...] registers".
Does that mean that the Driver "doesn't care" that we feed into it simpler PTX
code, using fewer PTX registers -- it seems able to do these optimizations all
by itself?  :-O
(At least regarding number of registers used -- I didn't verify the SASS code
generated.)
(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much
valuable, of course: if only for our own visual consumption!  ... at least as
long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in
the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the
only '.extern' dependency aside from the '__reduction_lock' global variable
('nvptx-none/libgcc/reduction.o').
(In the following, working around <https://gcc.gnu.org/PR104416>
"'lto-wrapper' invoking 'mkoffload's with duplicated command-line options".)

This means, I can conveniently manually create a minimal nvptx 'libgcc.a':

    $ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o \
        build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with
'-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1' verified that identical
PTX code is loaded to GPU.)

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.

Verified that before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
'__muldc3' (attached '_muldc3-good.o') works fine, and after "nvptx:
Transition nvptx backend to STORE_FLAG_VALUE = 1" '__muldc3' (attached
'_muldc3-bad.o') does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning with eliding
simple changes like renumbered registers etc.), until only one last change was
necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the
"still-good" state:

    [...]
    @@ -1716,8 +1718,16 @@
     cvt.rn.f64.s32 %r84,%r27;
     copysign.f64 %r61,%r61,%r84;
     .loc 2 1981 32
    +// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +/*
     selp.u32 %r86,1,0,%r138;
     mov.u32 %r85,%r86;
    +*/
    +// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
    +neg.s32 %r87,%r86;
    +mov.u32 %r85,%r87;
    +//
     cvt.u16.u8 %r140,%r85;
     mov.u16 %r89,%r140;
     xor.b16 %r88,%r89,1;
    @@ -1770,8 +1780,16 @@
    [...]

That is, things go "bad" if we here use the '%r138' that was computed (and
used) earlier in the code, and things are "good" if we re-compute locally.
Same for '%r139' in the second code block.
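
To make the equivalence of the two sequences above concrete, here is a small
C model of them.  This is only a sketch of my reading of the PTX semantics
('set.u32' yields all-ones for "true", 'leu' is "less-or-equal or unordered",
and '0d7fefffffffffffff' is DBL_MAX); the helper names are made up for
illustration, and that '%r138' holds the predicate of the very same comparison
is an assumption, not something verified against the full '_muldc3.o':

```c
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical model of 'set.u32.leu.f64 d,a,b': all-ones if a <= b or if
   either operand is NaN ("unordered"), zero otherwise.  */
static uint32_t set_u32_leu_f64(double a, double b)
{
    return (isnan(a) || isnan(b) || a <= b) ? 0xffffffffu : 0u;
}

/* Hypothetical model of 'selp.u32 d,1,0,p': select 1 or 0 on predicate p.  */
static uint32_t selp_u32_1_0(int p)
{
    return p ? 1u : 0u;
}

/* "Good" variant: re-compute the comparison locally, then negate the
   all-ones/zero mask ('neg.s32') to obtain a 0/1 flag.  */
static uint32_t flag_recomputed(double x)
{
    uint32_t r86 = set_u32_leu_f64(x, DBL_MAX);
    uint32_t r87 = 0u - r86;   /* two's-complement negate: 0xffffffff -> 1 */
    return r87;
}

/* "Bad" variant: re-use a long-lived predicate computed earlier.  */
static uint32_t flag_reused(int p)
{
    return selp_u32_1_0(p);
}

/* ASSUMPTION: the earlier predicate is the same 'leu' comparison against
   DBL_MAX.  Returns nonzero if both variants produce the same flag.  */
static int sequences_agree(double x)
{
    int p = isnan(x) || x <= DBL_MAX;
    return flag_reused(p) == flag_recomputed(x);
}
```

Under these assumptions both variants compute the same 0/1 flag for finite,
infinite and NaN inputs, so the only difference left between "good" and "bad"
is whether the predicate has to stay live between its definition and this use.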

(Interestingly, it is tolerated if one of the long-lived registers is used,
but things go "bad" only if *both* are used.  But that's probably just a
distraction?  And again, I have not inspected the actual SASS code, but just
looked at the JIT-time "used [...] registers".)

Do we thus conclude that what happens is that the "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC
such that values that have previously been computed may be re-used later in
the code, without re-computing them?  But on the flip side, of course, this
means that the values have to be kept live in (SASS) registers.  (That's just
my theory; I haven't verified the actual SASS.)

In other words: at least in this case here, it seems preferable to re-compute
instead of keeping registers occupied.  (But I'm of course not claiming that
to be a simple yes/no decision...)

It seems we're now in the territory of tuning CPU vs. GPU code generation?

Certainly, GCC has not seen much care for the latter (GPU code generation).
I mean: verifying the GCC pass pipeline generally, and the parameterization of
individual passes, for GPU code generation.
Impossible to get "right", of course, but maybe some heuristics for CPU vs. GPU
may be discovered and implemented?
I'm sure there must be some literature on that topic?

All that is complicated by the fact that with the (several different versions
of the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization we have
another moving part...