From mboxrd@z Thu Jan  1 00:00:00 1970
From: Thomas Schwinge <thomas@codesourcery.com>
To: Tom de Vries <tdevries@suse.de>
CC: <gcc-patches@gcc.gnu.org>, Roger Sayle <roger@nextmovesoftware.com>
Subject: Re: [committed][libgomp, testsuite, nvptx] Fix dg-output test in
 vector-length-128-7.c
In-Reply-To: <20220401112305.GA19166@delia>
References: <20220401112305.GA19166@delia>
User-Agent: Notmuch/0.29.3+94~g74c3f1b (https://notmuchmail.org) Emacs/27.1
 (x86_64-pc-linux-gnu)
Date: Wed, 4 May 2022 10:40:31 +0200
Message-ID: <87v8ul7ksg.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>

Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries <tdevries@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
>   -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0  \
>   output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
>                 launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
>   nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

A while ago I observed a similar case, and have now finally looked into
it: Nvidia TITAN V, driver 455.23.05, GCC/nvptx default multilib.
Looking at the 'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time.  Are you OK if I make the
obvious change?
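
To make the intent concrete, here is a small sketch of the two checks
applied to the two log lines shown above.  Python's 're' module merely
stands in for the testsuite's regexp matching, and 'expected_pattern' is
a hypothetical helper of mine, not an existing testsuite directive:

```python
import re

PREFIX = r"nvptx_exec: kernel main\$_omp_fn\$0: launch gangs=1, "

# Original check: exactly 8 workers.
STRICT = re.compile(PREFIX + r"workers=8, vectors=128")
# Relaxed check from the committed patch: 1 to 8 workers.
RELAXED = re.compile(PREFIX + r"workers=[1-8], vectors=128")


def expected_pattern(optimize_level):
    """Hypothetical helper: relax the check for -O0 only."""
    return RELAXED if optimize_level == 0 else STRICT


# Launch lines as seen in the GOMP_DEBUG=1 output quoted above.
O0_LINE = "  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128"
O2_LINE = "  nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128"

if __name__ == "__main__":
    for level, line in ((0, O0_LINE), (2, O2_LINE)):
        matched = expected_pattern(level).search(line) is not None
        print(f"-O{level}: {'PASS' if matched else 'FAIL'}")
```

With this split, an '-O2' run that drops to 6 workers would fail the
strict check instead of silently passing.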


But that said...  We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' multilib instead of the
default multilib ('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

Here is the '-O0' word-diff between the GCC/nvptx default multilib and
the '-mptx=3.1' multilib:

    Link log info    : 0 bytes gmem
    info    : Function properties for 'main$_omp_fn$0':
    info    : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
      GOMP_OFFLOAD_openacc_exec: prepare mappings
     warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
     default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
      nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
      nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers, whereas the
'-mptx=3.1' multilib uses 30 registers!  (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}'.)
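
To illustrate how three registers could plausibly tip the block size,
here is a rough occupancy model.  It is only a sketch and not the
driver's actual algorithm: the limits (65536 registers and 2048 resident
threads per SM, registers allocated per warp in units of 256, four
sub-partitions per SM) are my assumptions for this GPU, shared memory is
ignored, and 'workers = block / vector_length' is inferred from the
'6 * 128 == 768' note above:

```python
WARP_SIZE = 32


def round_up(value, multiple):
    """Round value up to the next multiple."""
    return (value + multiple - 1) // multiple * multiple


def best_block_size(regs_per_thread, regs_per_sm=65536, max_threads_per_sm=2048,
                    max_block=1024, reg_alloc_unit=256, sub_partitions=4):
    """Toy model: pick the block size with the most resident threads per SM.

    All default limits are assumptions, not values taken from the logs.
    """
    # Assumed: registers are allocated per warp, rounded up to the unit.
    regs_per_warp = round_up(regs_per_thread * WARP_SIZE, reg_alloc_unit)
    # Assumed: the register file is split evenly across sub-partitions.
    warps_by_regs = (regs_per_sm // sub_partitions // regs_per_warp) * sub_partitions
    warps_by_threads = max_threads_per_sm // WARP_SIZE
    warps_available = min(warps_by_regs, warps_by_threads)

    best, best_resident = 0, 0
    # Try larger blocks first; keep the first one with the highest occupancy.
    for block in range(max_block, 0, -WARP_SIZE):
        warps_per_block = block // WARP_SIZE
        resident = (warps_available // warps_per_block) * block
        if resident > best_resident:
            best, best_resident = block, resident
    return best


def workers_for(block, vector_length=128):
    """Workers per gang, assuming workers = block size / vector length."""
    return block // vector_length


if __name__ == "__main__":
    for regs in (33, 30):
        block = best_block_size(regs)
        print(regs, "registers: block =", block, "workers =", workers_for(block))
```

Under these assumptions the model yields block 768 (6 workers) for 33
registers and block 1024 (8 workers) for 30 registers, matching the two
logs: 33 registers round up to 40 per thread, so only one 1024-thread
block fits per SM, whereas two 768-thread blocks do.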

In case that's useful, here is the 'diff' of the PTX code that gets loaded onto the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation?  Does it follow that we should be doing something different
in the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Grüße
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries  <tdevries@suse.de>
>
>       * testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
>       num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */