public inbox for gcc-patches@gcc.gnu.org
From: Thomas Schwinge <thomas@codesourcery.com>
To: Tom de Vries <tdevries@suse.de>
Cc: <gcc-patches@gcc.gnu.org>, Roger Sayle <roger@nextmovesoftware.com>
Subject: Re: [committed][libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
Date: Wed, 4 May 2022 10:40:31 +0200
Message-ID: <87v8ul7ksg.fsf@euler.schwinge.homeip.net>
In-Reply-To: <20220401112305.GA19166@delia>

Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries <tdevries@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
>   -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  \
>   output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
>                 launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
>   nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

A while ago I observed a similar case, and have now finally looked into
it: Nvidia TITAN V, Driver 455.23.05, GCC/nvptx default multilib.
Looking at 'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time.  Are you OK if I make the
obvious change?
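
To illustrate the trade-off between the two patterns, here is a minimal C
sketch that checks a 'GOMP_DEBUG=1' launch line against the strict and the
relaxed pattern.  It is not part of the testsuite: the helper name
'line_matches' and the two pattern variables are hypothetical, and it uses
POSIX extended regular expressions, whereas DejaGnu uses Tcl regular
expressions (for these simple patterns the two should agree).

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if LINE contains a match for the POSIX extended regular
   expression PATTERN.  Hypothetical helper, for illustration only.  */
static bool
line_matches (const char *pattern, const char *line)
{
  regex_t re;
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;
  bool matched = regexec (&re, line, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}

/* The two candidate patterns, without the Tcl/DejaGnu quoting.  */
static const char strict_pattern[]
  = "launch gangs=1, workers=8, vectors=128";
static const char relaxed_pattern[]
  = "launch gangs=1, workers=[1-8], vectors=128";
```

With these, a line reporting 'workers=6' is accepted by 'relaxed_pattern'
but rejected by 'strict_pattern', so keeping the strict pattern where
'workers=8' is expected would still flag a change in that configuration.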


But that said...  We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' multilib instead of the
default one ('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

The '-O0' word-diff between the GCC/nvptx default and '-mptx=3.1'
multilibs follows:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers, whereas the
'-mptx=3.1' multilib uses 30 registers!  (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}').
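
One way to read the step from 33 vs. 30 registers to 'block = 768' vs.
'block = 1024' is through a simplified occupancy model.  The following C
sketch is neither the CUDA driver's algorithm nor libgomp code.  The per-SM
limits it uses (65536 registers split over 4 sub-partitions, registers
handed out per warp in units of 256, at most 64 resident warps) are
assumptions for a Volta-class GPU and are not taken from this thread; all
function names are hypothetical.  Shared memory and other limits are
ignored.

```c
/* Assumed per-SM limits for a Volta-class GPU (not from the thread).  */
enum
{
  WARP_SIZE = 32,
  MAX_BLOCK_SIZE = 1024,
  MAX_WARPS_PER_SM = 64,
  REGS_PER_SM = 65536,
  SUB_PARTITIONS = 4,
  REG_ALLOC_UNIT = 256	/* Assumed per-warp register allocation unit.  */
};

/* Number of warps that can be resident on one SM, as limited by the
   register usage of the kernel.  */
static int
warps_limited_by_regs (int regs_per_thread)
{
  int regs_per_warp = regs_per_thread * WARP_SIZE;
  /* Round up to the allocation unit.  */
  regs_per_warp = ((regs_per_warp + REG_ALLOC_UNIT - 1)
		   / REG_ALLOC_UNIT * REG_ALLOC_UNIT);
  int warps = ((REGS_PER_SM / SUB_PARTITIONS) / regs_per_warp
	       * SUB_PARTITIONS);
  return warps < MAX_WARPS_PER_SM ? warps : MAX_WARPS_PER_SM;
}

/* Largest block size (a multiple of the warp size) that maximizes the
   number of resident threads per SM under the model above.  */
static int
suggest_block_size (int regs_per_thread)
{
  int warp_limit = warps_limited_by_regs (regs_per_thread);
  int best_block = 0, best_threads = 0;
  for (int block = MAX_BLOCK_SIZE; block > 0; block -= WARP_SIZE)
    {
      int blocks = warp_limit / (block / WARP_SIZE);
      int threads = blocks * block;
      /* Strict '>' keeps the largest block size on ties.  */
      if (threads > best_threads)
	{
	  best_threads = threads;
	  best_block = block;
	}
    }
  return best_block;
}

/* Number of workers that fit into BLOCK threads for a given vector
   length, matching the '6 * 128 == 768' note.  */
static int
workers_for_block (int block, int vector_length)
{
  return block / vector_length;
}
```

Under these assumptions, 'suggest_block_size (33)' gives 768, while
'suggest_block_size (30)' and 'suggest_block_size (32)' give 1024.
Dividing by the vector length of 128 gives 6 and 8 workers, respectively,
which is consistent with the logs above.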

In case that's useful, here is the 'diff' of the PTX code that gets loaded
to the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }
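
For readers less familiar with PTX: the sequence that appears in the
'-mptx=3.1' code in place of 'bar.warp.sync 0xffffffff' can be read as
"abort unless all 32 lanes of the warp are active".  The following C
sketch models that reading on the host.  It is an interpretation of the
PTX shown above, not code from GCC, and the function names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LANES_PER_WARP = 32 };

/* Model of 'vote.ballot.b32 %r_act,1': every active lane contributes
   the constant predicate 1, so bit I of the result is set exactly if
   lane I is active.  */
static uint32_t
ballot_of_active_lanes (const bool active[LANES_PER_WARP])
{
  uint32_t mask = 0;
  for (int lane = 0; lane < LANES_PER_WARP; lane++)
    if (active[lane])
      mask |= UINT32_C (1) << lane;
  return mask;
}

/* Model of 'setp.ne.b32 %r_do_abort,%r_act,0xffffffff': the guarded
   'trap'/'exit' instructions execute only if some lane is missing.  */
static bool
would_abort (uint32_t ballot)
{
  return ballot != UINT32_C (0xffffffff);
}
```

With all lanes active the ballot is 0xffffffff and 'would_abort' is false;
as soon as a single lane is inactive, 'would_abort' becomes true.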

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation?  Does it follow that we should be doing something different
in the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Grüße
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries  <tdevries@suse.de>
>
>       * testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
>       num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
