[Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse

public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed

From: "tschwinge at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
Date: Tue, 08 Feb 2022 11:57:27 +0000	[thread overview]
Message-ID: <bug-104345-4-D0WF6BdCU1@http.gcc.gnu.org/bugzilla/> (raw)
In-Reply-To: <bug-104345-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
All your three patches combined still doesn't help resolve the problem.
And, what I realized: they don't even change the Nvidia/CUDA Driver reported
"used [...] registers".
Does that mean that the Driver "doesn't care" that we feed into it simple PTX
code, using less PTX registers -- it seems able to do these optimizations all
by itself?  :-O
(At least regarding number of registers used -- I didn't verify the SASS code
generated.)
(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much
valuable, of course: if only for our own visual consumption!  ... at least as
long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in
the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the
only '.extern' dependency aside from the '__reduction_lock' global variable
('nvptx-none/libgcc/reduction.o').
(In the following, working around <https://gcc.gnu.org/PR104416> "'lto-wrapper'
invoking 'mkoffload's with duplicated command-line options".)

This means, I can conveniently manually create a minimal nvptx 'libgcc.a':

    $ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o
build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with
'-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1' verified that identical
PTX code is loaded to GPU.)

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.

Verified that before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
'__muldc3' (attached '_muldc3-good.o') works fine, and after "nvptx: Transition
nvptx backend to STORE_FLAG_VALUE = 1" '__muldc3' (attached '_muldc3-bad.o')
does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning with eliding
simple changes like renumbered registers etc.), until only one last change was
necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the
"still-good" state:

    [...]
    @@ -1716,8 +1718,16 @@
     cvt.rn.f64.s32 %r84,%r27;
     copysign.f64 %r61,%r61,%r84;
     .loc 2 1981 32
    +// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE =
1":
    +/*
     selp.u32 %r86,1,0,%r138;
     mov.u32 %r85,%r86;
    +*/
    +// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
    +neg.s32 %r87,%r86;
    +mov.u32 %r85,%r87;
    +//
     cvt.u16.u8 %r140,%r85;
     mov.u16 %r89,%r140;
     xor.b16 %r88,%r89,1;
    @@ -1770,8 +1780,16 @@
    [...]

That is, things go "bad" if we here use the '%r138' that was computed (and
used) earlier in the code, and things are "good" if we re-compute locally. 
Same for '%r139' in the second code block.

(Interestingly, it is tolerated if one of the long-lived registers are used,
but things go "bad" only if *both* are used.  But that's probably just a
distraction?  And again, I have not inspected the actual SASS code, but just
looked at the JIT-time "used [...] registers".)

Do we thus conclude that what happens is that the "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC
such that values that have previously been compute may be re-used later in the
code, without re-computing them.  But on the flip side, of course, this means
that the values have to kept live in (SASS) registers.  (That's just my theory;
I haven't verified the actual SASS.)

In other words: at least in this case here, it seems preferrable to re-compute
instead of keeping registers occupied.  (But I'm of course not claiming that to
be a simple yes/no decision...)

It seem we're now in territory of tuning CPU vs. GPU code generation?

Certainly, GCC has not seen much care for the latter (GPU code generation).
I mean: verify GCC pass pipeline generally, and parameterization of individual
passes for GPU code generation.
Impossible to get "right", of course, but maybe some heuristics for CPU vs. GPU
may be discovered and implemented?
I'm sure there must be some literature on that topic?

All that complicated by the fact the with the (several different versions of
the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization we have another
moving part...

next prev parent reply	other threads:[~2022-02-08 11:57 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
2022-02-02 14:26 ` [Bug target/104345] " roger at nextmovesoftware dot com
2022-02-02 22:37 ` [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse pinskia at gcc dot gnu.org
2022-02-03 14:17 ` vries at gcc dot gnu.org
2022-02-03 21:09 ` roger at nextmovesoftware dot com
2022-02-08 11:53 ` tschwinge at gcc dot gnu.org
2022-02-08 11:54 ` tschwinge at gcc dot gnu.org
2022-02-08 11:55 ` tschwinge at gcc dot gnu.org
2022-02-08 11:57 ` tschwinge at gcc dot gnu.org [this message]
2022-02-08 12:33 ` tschwinge at gcc dot gnu.org
2022-02-08 15:36 ` tschwinge at gcc dot gnu.org
2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
2022-03-02 12:10 ` roger at nextmovesoftware dot com

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bug-104345-4-D0WF6BdCU1@http.gcc.gnu.org/bugzilla/ \
    --to=gcc-bugzilla@gcc.gnu.org \
    --cc=gcc-bugs@gcc.gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).