From: Thomas Schwinge
To: Tom de Vries
CC: Roger Sayle
Subject: Re: [committed][libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
In-Reply-To: <20220401112305.GA19166@delia>
References: <20220401112305.GA19166@delia>
Date: Wed, 4 May 2022 10:40:31 +0200
Message-ID: <87v8ul7ksg.fsf@euler.schwinge.homeip.net>
List-Id: Gcc-patches mailing list

Hi Tom!

On 2022-04-01T13:23:06+0200, Tom de Vries wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
>   -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
>   output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
>      launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
>   nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]

I had a while ago observed, and now finally looked into a similar case with
Nvidia TITAN V, Driver 455.23.05, GCC/nvptx default multilib. Looking at
'GOMP_DEBUG=1' output:

'-O2'; all good:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

... vs. '-O0'; similar to your report:

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., so I would've suggested:

> Fix this by updating the check to allow num_workers in the range 1 to 8.

... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time. Are you OK if I make the
obvious change?

But that said... We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' instead of default multilib
('-foffload-options=nvptx-none=-mptx=3.1'), I see:

'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):

    [...]
    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

..., but also for '-O0'; all good:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Are you able to reproduce that?

Follows '-O0' word-diff between GCC/nvptx default vs. '-mptx=3.1' multilib:

    Link log info : 0 bytes gmem
    info : Function properties for 'main$_omp_fn$0':
    info : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
    GOMP_OFFLOAD_openacc_exec: prepare mappings
    warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
    default dimensions [160,32,32]
    cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
    nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
    nvptx_exec: kernel main$_omp_fn$0: finished

Notice that the GCC/nvptx default multilib uses 33 registers vs. the
'-mptx=3.1' multilib uses 30 registers! (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}').

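To spell out the arithmetic behind the 'block'/'workers' pairs in the
word-diff above, here is a minimal sketch. It assumes that, when the number
of workers is left to the runtime, it is the block size suggested by
cuOccupancyMaxPotentialBlockSize divided by the vector length (which is what
the '6 * 128 == 768' note suggests). The helper name 'workers_for_block' is
made up for illustration; this is not the libgomp code.

```python
# Illustrative sketch only; not libgomp code.  Assumption: the worker count
# is the suggested block size divided by the vector length.

def workers_for_block(block_size: int, vector_length: int) -> int:
    """Hypothetical helper: number of workers fitting into 'block_size'."""
    if block_size <= 0 or vector_length <= 0:
        raise ValueError("block_size and vector_length must be positive")
    # Use at least one worker, even if the block is smaller than one vector.
    return max(block_size // vector_length, 1)


if __name__ == "__main__":
    # Block sizes taken from the two '-O0' logs; vector length is 128.
    for label, block in (("default multilib", 768), ("-mptx=3.1 multilib", 1024)):
        workers = workers_for_block(block, 128)
        print(f"{label}: block = {block}, workers = {workers}")
```

Under that assumption, a suggested block size of 768 corresponds to 6
workers and 1024 to 8 workers, matching the two 'nvptx_exec' launch lines.
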
If that's useful, 'diff' of the PTX code that gets loaded to the GPU:

     // BEGIN PREAMBLE
    -.version 6.0
    +.version 3.1
     .target sm_30
     .address_size 64
     // END PREAMBLE
    @@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
     add.u64 %r109,%r109,8;
     @ %r111 bra.uni $L11;
     $L19:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L18:
    -barrier.sync.aligned 0;
    +bar.sync 0;
     // forked 2;
     @ %r113 bra $L12;
     cvta.shared.u64 %r101,__oacc_bcast;
    @@ -179,7 +187,15 @@ mov.u32 %r22,0;
     mov.u32 %r29,1;
     mov.u32 %r30,%ntid.y;
     $L12:
    -bar.warp.sync 0xffffffff;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
     $L7:
     @ %r113 bra $L13;
     mov.u32 %r23,%tid.y;
    @@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
     selp.u32 %r114,1,0,%r62;
     st.u32 [%r93],%r114;
     $L13:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r115,[%r93];
     setp.ne.u32 %r62,%r115,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r62 bra.uni $L2;
     $L6:
     @ %r113 bra $L14;
    @@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
     st.u32 [%r95+40],%r31;
     st.u32 [%r95+44],%r34;
     $L14:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     // forked 4;
     mov.u64 %r87,%r93;
     mov.u64 %r89,%frame;
    @@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
     mov.u32 %r56,%r37;
     st.u32 [%frame+8],%r56;
     // joining 4;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     // join 4;
     @ %r113 bra $L15;
     add.u32 %r23,%r23,%r30;
    @@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
     selp.u32 %r116,1,0,%r84;
     st.u32 [%r93],%r116;
     $L15:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r117,[%r93];
     setp.ne.u32 %r84,%r117,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r84 bra.uni $L6;
     $L2:
     @ %r113 bra $L16;
    @@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
     selp.u32 %r118,1,0,%r85;
     st.u32 [%r93],%r118;
     $L16:
    -bar.warp.sync 0xffffffff;
    -barrier.sync %r94,128;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync %r94,128;
     ld.u32 %r119,[%r93];
     setp.ne.u32 %r85,%r119,0;
    -barrier.sync %r94,128;
    +bar.sync %r94,128;
     @ %r85 bra.uni $L7;
     @ %r113 bra $L17;
     mov.u32 %r86,4;
     st.u32 [%frame+4],%r86;
     // joining 2;
     $L17:
    -bar.warp.sync 0xffffffff;
    -barrier.sync.aligned 0;
    +{
    +.reg .b32 %r_act;
    +vote.ballot.b32 %r_act,1;
    +.reg .pred %r_do_abort;
    +mov.pred %r_do_abort,0;
    +setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
    +@ %r_do_abort trap;
    +@ %r_do_abort exit;
    +}
    +bar.sync 0;
     // join 2;
     ret;
     }

Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation? Does it follow that we should be doing something different in
the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?


Regards
 Thomas


> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01  Tom de Vries
>
> 	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
> 	num_workers check.
>
> ---
>  libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
>  }
>
>  /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company; Managing directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: München, HRB 106955