public inbox for gcc-bugs@sourceware.org
help / color / mirror / Atom feed
* [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
@ 2022-02-02 14:10 tschwinge at gcc dot gnu.org
  2022-02-02 14:26 ` [Bug target/104345] " roger at nextmovesoftware dot com
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-02 14:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

            Bug ID: 104345
           Summary: nvptx: "regression" after "nvptx: Transition nvptx
                    backend to STORE_FLAG_VALUE = 1"
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: openacc
          Severity: minor
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: sayle at gcc dot gnu.org, vries at gcc dot gnu.org
  Target Milestone: ---
            Target: nvptx

First, I do acknowledge that commit beed3f8f60492289ca6211d86c54a2254a642035
"nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" generally does
improve nvptx code generation -- thanks!

I've however run into one case where it causes a regression:

    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  (test
for excess errors)
    [-PASS:-]{+FAIL:+}
libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0 
execution test
    PASS: libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2  (test
for excess errors)
    [-PASS:-]{+FAIL:+}
libgomp.oacc-c/../libgomp.oacc-c-c++-common/reduction-cplx-dbl.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2 
execution test

    libgomp: The Nvidia accelerator has insufficient resources to launch
'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the
program with 'num_workers = x and vector_length = y' on that offloaded region
or '-fopenacc-dim=:x:y' where x * y <= 896.

Same for C++.

That's with a Nvidia Tesla K20c, Driver Version: 346.46 -- so, rather old.

By the way: the subsequent commit 659f8161f61d3f75c3a47cf646147e8f7b4dcb34
"nvptx: Add support for PTX's cnot instruction" is not helpful or even relevant
here; there are no 'cnot's appearing in the PTX code loaded to the GPU (per
'GOMP_DEBUG=1' execution).

Per 'diff' of  'GOMP_DEBUG=1' execution we indeed see *more* registers used
after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" than before. 
For '-O0':

    [...]
     Link log info    : 4 bytes gmem
     info    : Function properties for 'gang$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0],
16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8
bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'worker$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0],
16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 136 bytes smem, 328 bytes cmem[0],
8 bytes cmem[2], 0 bytes lmem
     info    : Function properties for 'vector$_omp_fn$0':
    -info    : used 51 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0],
16 bytes cmem[2], 0 bytes lmem
    +info    : used 68 registers, 112 stack, 0 bytes smem, 328 bytes cmem[0], 8
bytes cmem[2], 0 bytes lmem
       GOMP_OFFLOAD_openacc_exec: prepare mappings
       nvptx_exec: kernel vector$_omp_fn$0: launch gangs=1, workers=1,
vectors=32
       nvptx_exec: kernel vector$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb30,
kinds=0x60bb48
    +GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7fff99653530, size=0x60bad0,
kinds=0x60bae8
       GOMP_OFFLOAD_openacc_exec: prepare mappings
    +
    +libgomp: The Nvidia accelerator has insufficient resources to launch
'worker$_omp_fn$0' with num_workers = 32 and vector_length = 32; recompile the
program with 'num_workers = x and vector_length = y' on that offloaded region
or '-fopenacc-dim=:x:y' where x * y <= 896.
    -  nvptx_exec: kernel worker$_omp_fn$0: launch gangs=1, workers=32,
vectors=32
    -  nvptx_exec: kernel worker$_omp_fn$0: finished
    -GOACC_parallel_keyed: mapnum=3, hostaddrs=0x7ffc760394a0, size=0x60bb50,
kinds=0x60bb68
    -  GOMP_OFFLOAD_openacc_exec: prepare mappings
    -  nvptx_exec: kernel gang$_omp_fn$0: launch gangs=32, workers=1,
vectors=32
    -  nvptx_exec: kernel gang$_omp_fn$0: finished

Similar for '-O2', just with less stack usage.

Cross-checking with a more recent Driver Version: 450.119.03, I'm only seeing
slightly increased register usage; 52 registers after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" compared to 51 registers before:

    [...]
    Link log info    : 4 bytes gmem
    info    : Function properties for 'vector$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes
cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'worker$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 136 bytes smem, 328 bytes
cmem[0], 16 bytes cmem[2], 0 bytes lmem
    info    : Function properties for 'gang$_omp_fn$0':
    info    : used [-51-]{+52+} registers, 112 stack, 0 bytes smem, 328 bytes
cmem[0], 16 bytes cmem[2], 0 bytes lmem
    [...]

This suggests that compared to before, after "nvptx: Transition nvptx backend
to STORE_FLAG_VALUE = 1" GCC is generating certain PTX code sequences that the
Driver/JIT fails to understand/optimize?  While not ideal, the code still
executes fine (with newish Driver/JIT), and I'm thus OK if we classify that as
not worth looking into -- but I at least wanted to report my findings: maybe
there's a way to tune the GCC/nvptx code generation to the PTX -> SASS
compiler's liking?

Possibly (but that's just guessing!), the reason might be around the following
PTX code change:

    [...]
    -setp.leu.f64 %r82,%r25,0d7fefffffffffffff;
    -@ ! %r82 bra $L3;
    +@ %r78 bra $L20;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +bra $L3;
    +$L20:
     .loc 2 1976 21
    -setp.leu.f64 %r83,%r57,0d7fefffffffffffff;
    -@ %r83 bra $L19;
    +setp.leu.f64 %r138,%r57,0d7fefffffffffffff;
    +@ %r138 bra $L19;
     $L3:
    [...]

>From a quick look, I read this to mean that the originally ("before")
unconditional 'setp.leu.f64 %r82,%r25,0d7fefffffffffffff;' is now ("after")
done conditionally.

Maybe related, maybe not: when curiously 'diff'ing the before vs. after
nvptx-none target libraries, I noticed amongst all the "noise" (improved code
generation):

'nvptx-none/libatomic/gcas.o':

    [...]
     atom.cas.b32 %r137,[%r34],%r136,%r139;
     setp.eq.u32 %r140,%r137,%r136;
     selp.u32 %r138,1,0,%r140;
    -setp.ne.u32 %r141,%r138,0;
    -@ %r141 bra $L21;
    +@ %r140 bra $L18;
     st.u32 [%r201],%r137;
    -bra $L19;
    +$L18:
    +setp.eq.u32 %r142,%r138,0;
    +@ %r142 bra $L19;
     $L21:
    [...]

... which again looks like a pattern where an originally ("before")
unconditional 'setp.ne.u32 %r141,%r138,0;' is now ("after") done conditionally.

Similar in other files -- but I certainly didn't look in detail, and I'm
certainly not claiming this to be/cause any actual problem.


And, I've spotted a few cases where we're generating "maybe worse" code:

'nvptx-none/libgomp/openacc.o' (complete 'diff'):

    @@ -25,6 +25,7 @@
     .reg .u64 %r28;
     .reg .u32 %r29;
     .reg .u32 %r30;
    +.reg .pred %r31;
     mov.u64 %r27,%ar0;
     st.u64 [%frame+16],%r27;
     ld.u64 %r28,[%frame+16];
    @@ -38,8 +39,8 @@
     ld.param.u32 %r30,[%value_in];
     }
     mov.u32 %r23,%r30;
    -set.u32.ne.u32 %r24,%r23,0;
    -neg.s32 %r24,%r24;
    +setp.ne.u32 %r31,%r23,0;
    +selp.u32 %r24,1,0,%r31;
     st.u32 [%frame],%r24;
     ld.u32 %r25,[%frame];
     mov.u32 %r26,%r25;

'nvptx-none/newlib/libc/reent/lib_a-renamer.o' (complete 'diff'):

    @@ -28,6 +28,8 @@
     .reg .u32 %r32;
     .reg .pred %r33;
     .reg .u32 %r36;
    +.reg .u32 %r38;
    +.reg .pred %r39;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     mov.u64 %r28,%ar2;
    @@ -58,7 +60,9 @@
     ld.param.u32 %r36,[%value_in];
     }
     .loc 2 57 6
    -set.u32.eq.u32 %r25,%r36,-1;
    +setp.eq.u32 %r39,%r36,-1;
    +selp.u32 %r38,1,0,%r39;
    +neg.s32 %r25,%r38;
     $L1:
     .loc 2 64 1
     mov.u32 %value,%r25;

'nvptx-none/newlib/libc/stdio/lib_a-remove.o' (complete 'diff'):

    @@ -24,6 +24,8 @@
     .reg .u64 %r26;
     .reg .u64 %r27;
     .reg .u32 %r30;
    +.reg .u32 %r33;
    +.reg .pred %r34;
     mov.u64 %r26,%ar0;
     mov.u64 %r27,%ar1;
     .loc 2 65 7
    @@ -37,7 +39,9 @@
     ld.param.u32 %r30,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r30,-1;
    +setp.eq.u32 %r34,%r30,-1;
    +selp.u32 %r33,1,0,%r34;
    +neg.s32 %value,%r33;
     .loc 2 69 1
     st.param.u32 [%value_out],%value;
     ret;
    @@ -51,6 +55,8 @@
     .reg .u64 %r27;
     .reg .u64 %r30;
     .reg .u32 %r31;
    +.reg .u32 %r34;
    +.reg .pred %r35;
     mov.u64 %r27,%ar0;
     .loc 2 65 7
     ld.global.u64 %r30,[_impure_ptr];
    @@ -64,7 +70,9 @@
     ld.param.u32 %r31,[%value_in];
     }
     .loc 2 65 6
    -set.u32.eq.u32 %value,%r31,-1;
    +setp.eq.u32 %r35,%r31,-1;
    +selp.u32 %r34,1,0,%r35;
    +neg.s32 %value,%r34;
     .loc 2 77 1
     st.param.u32 [%value_out],%value;
     ret;

'nvptx-none/newlib/libm/common/lib_a-s_rint.o' (complete 'diff'):

    @@ -80,6 +80,7 @@
     .reg .u32 %r119;
     .reg .pred %r120;
     .reg .u32 %r122;
    +.reg .pred %r123;
     .reg .u32 %r124;
     .reg .u32 %r125;
     .reg .u32 %r126;
    @@ -197,7 +198,8 @@
     setp.eq.u32 %r120,%r41,0;
     @ %r120 bra $L5;
     .loc 2 114 9
    -set.u32.eq.u32 %r122,%r58,19;
    +setp.eq.u32 %r123,%r58,19;
    +selp.u32 %r122,1,0,%r123;
     shl.b32 %r41,%r122,31;
     .loc 2 115 17
     not.b32 %r124,%r64;

I haven't looked if that's "actually worse" in SASS, or just "maybe worse" in
the intermediate PTX representation.  (... and is most certainly not related to
the regression mentioned before.)


It seems unlikely, but I'll report in case anything here changes due to Tom's
several GCC/nvptx back end commits yesterday.


So, please close this PR as "won't fix" unless you see something here that
you'd like to look into.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
@ 2022-02-02 14:26 ` roger at nextmovesoftware dot com
  2022-02-02 22:37 ` [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse pinskia at gcc dot gnu.org
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: roger at nextmovesoftware dot com @ 2022-02-02 14:26 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-02-02
     Ever confirmed|0                           |1

--- Comment #1 from Roger Sayle <roger at nextmovesoftware dot com> ---
Hi Torsten,
Thanks for the bug report.  The STORE_FLAG_VALUE=1 patch was one of a series to
dramatically improve the quality of nvptx code.  Alas not all of them have yet
been reviewed/approved, and it's likely these later improvements address the
quality regression you're seeing.

The other patches in the "nvptx Boolean" series are:
patchq3: nvptx: Expand QI mode operations using SI mode instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html

patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html

[and purely for reference, my other outstanding nvptx patches are]
patchn: nvptx: Improved support for HFMode including neghf2 and abshf2.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html

patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

And one other related patch is that there's also a middle-end SUBREG
patch intended to improve code generation on nvptx is also pending at:

patchs: Simplify paradoxical subreg extensions of TRUNCATE
https://gcc.gnu.org/pipermail/gcc-patches/2021-September/578848.html



My guess is that patchq3+patchq4 above should (hopefully) resolve this
particular regression.  If you could give them a spin on your system
to see if they reduce register pressure sufficiently for this case,
that would be greatly appreciated.  As you can read in the above postings,
the total number of instructions/registers (after all of these changes)
should be dramatically reduced.

I'll see what I can do from my end.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
  2022-02-02 14:26 ` [Bug target/104345] " roger at nextmovesoftware dot com
@ 2022-02-02 22:37 ` pinskia at gcc dot gnu.org
  2022-02-03 14:17 ` vries at gcc dot gnu.org
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: pinskia at gcc dot gnu.org @ 2022-02-02 22:37 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
  2022-02-02 14:26 ` [Bug target/104345] " roger at nextmovesoftware dot com
  2022-02-02 22:37 ` [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse pinskia at gcc dot gnu.org
@ 2022-02-03 14:17 ` vries at gcc dot gnu.org
  2022-02-03 21:09 ` roger at nextmovesoftware dot com
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: vries at gcc dot gnu.org @ 2022-02-03 14:17 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Roger Sayle from comment #1)
> The other patches in the "nvptx Boolean" series are:
> patchq3: nvptx: Expand QI mode operations using SI mode instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587999.html
> 
> patchq4: nvptx: Fix and use BI mode logic instructions (e.g. and.pred).
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588555.html
> 
> [and purely for reference, my other outstanding nvptx patches are]
> patchn: nvptx: Improved support for HFMode including neghf2 and abshf2.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/587949.html
> 
> patchw: nvptx: Add support for 64-bit mul.hi (and other) instructions.
> https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html

FYI, I'm currently testing these.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-02-03 14:17 ` vries at gcc dot gnu.org
@ 2022-02-03 21:09 ` roger at nextmovesoftware dot com
  2022-02-08 11:53 ` tschwinge at gcc dot gnu.org
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: roger at nextmovesoftware dot com @ 2022-02-03 21:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |roger at nextmovesoftware dot com

--- Comment #3 from Roger Sayle <roger at nextmovesoftware dot com> ---
Additional patch proposed:
https://gcc.gnu.org/pipermail/gcc-patches/2022-February/589802.html
I need to figure out how to (re)produce Thomas' "used N registers" reports.
If someone could summarize the effect of this patch (and previous patches) on
register usage, that would be much appreciated (and help reviewers).

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-02-03 21:09 ` roger at nextmovesoftware dot com
@ 2022-02-08 11:53 ` tschwinge at gcc dot gnu.org
  2022-02-08 11:54 ` tschwinge at gcc dot gnu.org
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 11:53 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #4 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 52370
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52370&action=edit
'_muldc3-good.o'

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2022-02-08 11:53 ` tschwinge at gcc dot gnu.org
@ 2022-02-08 11:54 ` tschwinge at gcc dot gnu.org
  2022-02-08 11:55 ` tschwinge at gcc dot gnu.org
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 11:54 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #5 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 52371
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52371&action=edit
'_muldc3-bad.o'

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2022-02-08 11:54 ` tschwinge at gcc dot gnu.org
@ 2022-02-08 11:55 ` tschwinge at gcc dot gnu.org
  2022-02-08 11:57 ` tschwinge at gcc dot gnu.org
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 11:55 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #6 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 52372
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52372&action=edit
'_muldc3-WIP.o'

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (6 preceding siblings ...)
  2022-02-08 11:55 ` tschwinge at gcc dot gnu.org
@ 2022-02-08 11:57 ` tschwinge at gcc dot gnu.org
  2022-02-08 12:33 ` tschwinge at gcc dot gnu.org
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 11:57 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
All your three patches combined still doesn't help resolve the problem.
And, what I realized: they don't even change the Nvidia/CUDA Driver reported
"used [...] registers".
Does that mean that the Driver "doesn't care" that we feed into it simple PTX
code, using less PTX registers -- it seems able to do these optimizations all
by itself?  :-O
(At least regarding number of registers used -- I didn't verify the SASS code
generated.)
(Emitting cleaner, "de-cluttered" code in GCC/nvptx is still very much
valuable, of course: if only for our own visual consumption!  ... at least as
long as it doesn't make 'nvptx.md' etc. much more complex...)

For "good" vs. "bad"/"not-so-good" (before vs. after "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1"), the only code generation difference is in
the '__muldc3' function ('nvptx-none/libgcc/_muldc3.o'), and that one is the
only '.extern' dependency aside from the '__reduction_lock' global variable
('nvptx-none/libgcc/reduction.o').
(In the following, working around <https://gcc.gnu.org/PR104416> "'lto-wrapper'
invoking 'mkoffload's with duplicated command-line options".)

This means, I can conveniently manually create a minimal nvptx 'libgcc.a':

    $ cp build-gcc-offload-nvptx-none/nvptx-none/libgcc/_muldc3.o ./
    $ rm -f libgcc.a && ar q libgcc.a _muldc3.o
build-gcc-offload-nvptx-none/nvptx-none/libgcc/reduction.o

..., and compile 'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c' with
'-foffload=nvptx-none=-Wl,-L.'.  (Via 'GOMP_DEBUG=1' verified that identical
PTX code is loaded to GPU.)

Then, hand-modify '_muldc3.o', re-create 'libgcc.a', re-compile, re-execute.

Verified that before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1"
'__muldc3' (attached '_muldc3-good.o') works fine, and after "nvptx: Transition
nvptx backend to STORE_FLAG_VALUE = 1" '__muldc3' (attached '_muldc3-bad.o')
does show the problem originally reported here.

I then gradually morphed the former into the latter (beginning with eliding
simple changes like renumbered registers etc.), until only one last change was
necessary to turn "good" into "bad" (attached '_muldc3-WIP.o'); showing the
"still-good" state:

    [...]
    @@ -1716,8 +1718,16 @@
     cvt.rn.f64.s32 %r84,%r27;
     copysign.f64 %r61,%r61,%r84;
     .loc 2 1981 32
    +// Current/after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE =
1":
    +/*
     selp.u32 %r86,1,0,%r138;
     mov.u32 %r85,%r86;
    +*/
    +// Before "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1":
    +set.u32.leu.f64 %r86,%r57,0d7fefffffffffffff;
    +neg.s32 %r87,%r86;
    +mov.u32 %r85,%r87;
    +//
     cvt.u16.u8 %r140,%r85;
     mov.u16 %r89,%r140;
     xor.b16 %r88,%r89,1;
    @@ -1770,8 +1780,16 @@
    [...]

That is, things go "bad" if we here use the '%r138' that was computed (and
used) earlier in the code, and things are "good" if we re-compute locally. 
Same for '%r139' in the second code block.

(Interestingly, it is tolerated if one of the long-lived registers are used,
but things go "bad" only if *both* are used.  But that's probably just a
distraction?  And again, I have not inspected the actual SASS code, but just
looked at the JIT-time "used [...] registers".)

Do we thus conclude that what happens is that the "nvptx: Transition nvptx
backend to STORE_FLAG_VALUE = 1" changes here enable an "optimization" in GCC
such that values that have previously been compute may be re-used later in the
code, without re-computing them.  But on the flip side, of course, this means
that the values have to kept live in (SASS) registers.  (That's just my theory;
I haven't verified the actual SASS.)

In other words: at least in this case here, it seems preferrable to re-compute
instead of keeping registers occupied.  (But I'm of course not claiming that to
be a simple yes/no decision...)

It seem we're now in territory of tuning CPU vs. GPU code generation?

Certainly, GCC has not seen much care for the latter (GPU code generation).
I mean: verify GCC pass pipeline generally, and parameterization of individual
passes for GPU code generation.
Impossible to get "right", of course, but maybe some heuristics for CPU vs. GPU
may be discovered and implemented?
I'm sure there must be some literature on that topic?

All that complicated by the fact the with the (several different versions of
the) Nvidia/CUDA Driver's PTX -> SASS translation/optimization we have another
moving part...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (7 preceding siblings ...)
  2022-02-08 11:57 ` tschwinge at gcc dot gnu.org
@ 2022-02-08 12:33 ` tschwinge at gcc dot gnu.org
  2022-02-08 15:36 ` tschwinge at gcc dot gnu.org
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 12:33 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #8 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Well.  Here's another problem.

Re-testing things using a "bad" '__muldc3' from a build with your three patches
applied, I notice that my '_muldc3-WIP.o' "old"/replacement code uses a
'set.u32.leu.f64', 'neg.s32', 'mov.u32' sequence.  That works.
However, if I replace that with a localized (its own '.reg .pred %r138_;')
'setp.leu.f64', 'selp.u32' sequence (as per your code generation changes), then
it doesn't in fact work!  Huh.

Will later continue looking into this.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (8 preceding siblings ...)
  2022-02-08 12:33 ` tschwinge at gcc dot gnu.org
@ 2022-02-08 15:36 ` tschwinge at gcc dot gnu.org
  2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-02-08 15:36 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #9 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
OK!  Putting your "nvptx: Add support for 64-bit mul.hi (and other)
instructions" on top, that considerably changes (simplifies!) the generated
'__muldc3' PTX code; the regression disappears.  :-)

(I have, so far, only manually tested
'libgomp.oacc-c-c++-common/reduction-cplx-dbl.c'.  I'll report later in the
unlikely case that any other/new issues should appear.)


(And, will later test your "nvptx: Tweak constraints on copysign instructions"
on top, too.)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (9 preceding siblings ...)
  2022-02-08 15:36 ` tschwinge at gcc dot gnu.org
@ 2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
  2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
  2022-03-02 12:10 ` roger at nextmovesoftware dot com
  12 siblings, 0 replies; 14+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-02-10  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #10 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:9bacd7af2e3bba9ddad17e7de4e2d299419d819d

commit r12-7167-g9bacd7af2e3bba9ddad17e7de4e2d299419d819d
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Feb 4 04:13:53 2022 +0100

    PR target/104345: Use nvptx "set" instruction for cond ? -1 : 0

    This patch addresses the "increased register pressure" regression on
    nvptx-none caused by my change to transition the backend to a
    STORE_FLAG_VALUE = 1 target.  This improved code generation for the
    more common case of producing 0/1 Boolean values, but unfortunately
    made things marginally worse when a 0/-1 mask value is desired.
    Unfortunately, nvptx kernels are extremely sensitive to changes in
    register usage, which was observable in the reported PR.

    This patch provides optimizations for -(cond ? 1 : 0), effectively
    simplify this into cond ? -1 : 0, where these ternary operators are
    provided by nvptx's selp instruction, and for the specific case of
    SImode, using (restoring) nvptx's "set" instruction (which avoids
    the need for a predicate register).

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.  Unfortunately,
    the exact register usage of a nvptx kernel depends upon the version of
    the Cuda drivers being used (and the hardware), but I believe this
    change should resolve the PR (for Thomas) by improving code generation
    for the cases that regressed.

    gcc/ChangeLog:

            PR target/104345
            * config/nvptx/nvptx.md (sel_true<mode>): Fix indentation.
            (sel_false<mode>): Likewise.
            (define_code_iterator eqne): New code iterator for EQ and NE.
            (*selp<mode>_neg_<code>): New define_insn_and_split to optimize
            the negation of a selp instruction.
            (*selp<mode>_not_<code>): New define_insn_and_split to optimize
            the bitwise not of a selp instruction.
            (*setcc_int<mode>): Use set instruction for neg:SI of a selp.

    gcc/testsuite/ChangeLog:

            PR target/104345
            * gcc.target/nvptx/neg-selp.c: New test case.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (10 preceding siblings ...)
  2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
@ 2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
  2022-03-02 12:10 ` roger at nextmovesoftware dot com
  12 siblings, 0 replies; 14+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2022-02-10  8:02 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

--- Comment #11 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:6d98e83b2c919bd9fba2c61333d613bafc37357f

commit r12-7168-g6d98e83b2c919bd9fba2c61333d613bafc37357f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Feb 8 20:56:55 2022 +0100

    nvptx: Tweak constraints on copysign instructions

    Many thanks to Thomas Schwinge for confirming my hypothesis that the
register
    usage regression, PR target/104345, is solely due to libgcc's _muldc3
function.
    In addition to the isinf functionality in the previously proposed nvptx
patch at
    https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588453.html which
    significantly reduces the number of instructions in _muldc3, the patch
below
    further reduces both the number of instructions and the number of
explicitly
    declared registers, by permitting floating point constant immediate
operands
    in nvptx's copysign instruction.

    Fingers-crossed, the combination with all of the previous proposed nvptx
    patches improves things.  Ultimately, increasing register usage from 50 to
    51 registers, reducing the number of concurrent threads by ~2%, can easily
    be countered if we're now executing significantly fewer instructions in
each
    kernel, for a net performance win.

    This patch has been tested on nvptx-none hosted on x86_64-pc-linux-gnu
    with a "make" and "make -k check" with no new failures.

    gcc/ChangeLog:

            * config/nvptx/nvptx.md (copysign<mode>3): Allow immediate
            floating point constants as operands 1 and/or 2.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse
  2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
                   ` (11 preceding siblings ...)
  2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
@ 2022-03-02 12:10 ` roger at nextmovesoftware dot com
  12 siblings, 0 replies; 14+ messages in thread
From: roger at nextmovesoftware dot com @ 2022-03-02 12:10 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104345

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #12 from Roger Sayle <roger at nextmovesoftware dot com> ---
This should now be fixed on mainline.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-03-02 12:10 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-02 14:10 [Bug target/104345] New: nvptx: "regression" after "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" tschwinge at gcc dot gnu.org
2022-02-02 14:26 ` [Bug target/104345] " roger at nextmovesoftware dot com
2022-02-02 22:37 ` [Bug target/104345] [12 Regression] "nvptx: Transition nvptx backend to STORE_FLAG_VALUE = 1" patch made some code generation worse pinskia at gcc dot gnu.org
2022-02-03 14:17 ` vries at gcc dot gnu.org
2022-02-03 21:09 ` roger at nextmovesoftware dot com
2022-02-08 11:53 ` tschwinge at gcc dot gnu.org
2022-02-08 11:54 ` tschwinge at gcc dot gnu.org
2022-02-08 11:55 ` tschwinge at gcc dot gnu.org
2022-02-08 11:57 ` tschwinge at gcc dot gnu.org
2022-02-08 12:33 ` tschwinge at gcc dot gnu.org
2022-02-08 15:36 ` tschwinge at gcc dot gnu.org
2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
2022-02-10  8:02 ` cvs-commit at gcc dot gnu.org
2022-03-02 12:10 ` roger at nextmovesoftware dot com

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).