* [committed][libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
From: Tom de Vries @ 2022-04-01 11:23 UTC (permalink / raw)
To: gcc-patches; +Cc: Thomas Schwinge
Hi,
When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
RTX A2000 (sm_86) with driver 510.60.02 I run into:
...
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
output pattern test
...
The failing check verifies the launch dimensions:
...
/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
launch gangs=1, workers=8, vectors=128" } */
...
which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
is 6:
...
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
...
This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
'a launch configuration with reasonable occupancy') printed just before:
...
cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
...
[ Note: 6 * 128 == 768. ]
Fix this by updating the check to allow num_workers in the range 1 to 8.
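The arithmetic behind the note above can be sketched in C. This is only an illustration, not the libgomp nvptx plugin's actual code: the helper names workers_for_block and worker_count_accepted are hypothetical, and the sketch assumes the worker count is simply the suggested block size divided by the vector length.

```c
#include <assert.h>

/* Hypothetical helper: number of workers that fit into a block of
   BLOCK_SIZE threads when each worker spans VECTOR_LENGTH threads.  */
static int
workers_for_block (int block_size, int vector_length)
{
  assert (vector_length > 0 && block_size >= vector_length);
  return block_size / vector_length;
}

/* Hypothetical helper mirroring the relaxed dg-output pattern
   'workers=[1-8]': any single-digit worker count from 1 to 8 passes.  */
static int
worker_count_accepted (int workers)
{
  return workers >= 1 && workers <= 8;
}
```

With the values from the debug output, workers_for_block (768, 128) gives 6 and workers_for_block (1024, 128) gives 8; both fall within the range accepted by the updated check.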
Tested on x86_64 with nvptx accelerator.
Committed to trunk.
Thanks,
- Tom
[libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
libgomp/ChangeLog:
2022-04-01 Tom de Vries <tdevries@suse.de>
* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
num_workers check.
---
libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
index 4a8c1bf549e..92b3de03636 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -37,4 +37,4 @@ main (void)
}
/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
-/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */
* Re: [committed][libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
From: Thomas Schwinge @ 2022-05-04 8:40 UTC (permalink / raw)
To: Tom de Vries; +Cc: gcc-patches, Roger Sayle
Hi Tom!
On 2022-04-01T13:23:06+0200, Tom de Vries <tdevries@suse.de> wrote:
> When running test-case libgomp.oacc-c-c++-common/vector-length-128-7.c on an
> RTX A2000 (sm_86) with driver 510.60.02 I run into:
> ...
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vector-length-128-7.c \
> -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
> output pattern test
> ...
>
> The failing check verifies the launch dimensions:
> ...
> /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: \
> launch gangs=1, workers=8, vectors=128" } */
> ...
> which fails because (as we can see with GOMP_DEBUG=1) the actual num_workers
> is 6:
> ...
> nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
> ...
>
> This is due to the result of cuOccupancyMaxPotentialBlockSize (which suggests
> 'a launch configuration with reasonable occupancy') printed just before:
> ...
> cuOccupancyMaxPotentialBlockSize: grid = 52, block = 768
> ...
> [ Note: 6 * 128 == 768. ]
A while ago I observed, and have now finally looked into, a similar case
with Nvidia TITAN V, Driver 455.23.05, GCC/nvptx default multilib.
Looking at 'GOMP_DEBUG=1' output:
'-O2'; all good:
[...]
Link log info : 0 bytes gmem
info : Function properties for 'main$_omp_fn$0':
info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
GOMP_OFFLOAD_openacc_exec: prepare mappings
warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
default dimensions [160,32,32]
cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
nvptx_exec: kernel main$_omp_fn$0: finished
... vs. '-O0'; similar to your report:
[...]
Link log info : 0 bytes gmem
info : Function properties for 'main$_omp_fn$0':
info : used 33 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
GOMP_OFFLOAD_openacc_exec: prepare mappings
warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
default dimensions [160,32,32]
cuOccupancyMaxPotentialBlockSize: grid = 160, block = 768
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=6, vectors=128
nvptx_exec: kernel main$_omp_fn$0: finished
..., so I would've suggested:
> Fix this by updating the check to allow num_workers in the range 1 to 8.
... to do this for '-O0' only, to make sure that we'll notice should the
'-O2' case regress at some later point in time. Are you OK if I make the
obvious change?
But that said... We might also generally classify this as a regression,
because when using the GCC/nvptx '-mptx=3.1' instead of default multilib
('-foffload-options=nvptx-none=-mptx=3.1'), I see:
'-O2'; all good (exactly the same launch configuration as with GCC/nvptx
default multilib, see above):
[...]
Link log info : 0 bytes gmem
info : Function properties for 'main$_omp_fn$0':
info : used 32 registers, 0 stack, 288 bytes smem, 360 bytes cmem[0], 0 bytes lmem
GOMP_OFFLOAD_openacc_exec: prepare mappings
warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
default dimensions [160,32,32]
cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
nvptx_exec: kernel main$_omp_fn$0: finished
..., but also for '-O0'; all good:
Link log info : 0 bytes gmem
info : Function properties for 'main$_omp_fn$0':
info : used 30 registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
GOMP_OFFLOAD_openacc_exec: prepare mappings
warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
default dimensions [160,32,32]
cuOccupancyMaxPotentialBlockSize: grid = 160, block = 1024
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, workers=8, vectors=128
nvptx_exec: kernel main$_omp_fn$0: finished
Are you able to reproduce that?
Here is the '-O0' word-diff between the GCC/nvptx default and '-mptx=3.1'
multilibs:
Link log info : 0 bytes gmem
info : Function properties for 'main$_omp_fn$0':
info : used [-33-]{+30+} registers, 32 stack, 432 bytes smem, 360 bytes cmem[0], 0 bytes lmem
GOMP_OFFLOAD_openacc_exec: prepare mappings
warp_size=32, block_size=1024, dev_size=80, cpu_size=2048
default dimensions [160,32,32]
cuOccupancyMaxPotentialBlockSize: grid = 160, block = [-768-]{+1024+}
nvptx_exec: kernel main$_omp_fn$0: launch gangs=1, [-workers=6,-]{+workers=8,+} vectors=128
nvptx_exec: kernel main$_omp_fn$0: finished
Notice that the GCC/nvptx default multilib uses 33 registers vs. the
'-mptx=3.1' multilib uses 30 registers! (..., which then allows for
'block = [-768-]{+1024+}', 'workers=[-6-]{+8+}').
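One plausible way a three-register difference could move the suggested block size is register allocation granularity. The C sketch below is a rough occupancy model, not the CUDA driver's algorithm. All hardware constants in it are assumptions that do not come from this thread (roughly Volta-class: 64K registers per SM split into four partitions, 2048 resident threads per SM, registers allocated per warp in units of 256), and the function names are hypothetical.

```c
#include <assert.h>

/* Assumed hardware parameters, for illustration only.  */
enum
{
  WARP_SIZE = 32,
  MAX_BLOCK_SIZE = 1024,      /* 'block_size' in the debug output */
  MAX_WARPS_PER_SM = 64,      /* assumed: 2048 resident threads per SM */
  PARTITIONS_PER_SM = 4,      /* assumed */
  REGS_PER_PARTITION = 16384, /* assumed: 64K registers per SM */
  REG_ALLOC_UNIT = 256        /* assumed per-warp allocation granularity */
};

/* Resident warps per SM permitted by register usage alone.  */
static int
warps_by_registers (int regs_per_thread)
{
  assert (regs_per_thread > 0);
  int per_warp = regs_per_thread * WARP_SIZE;
  /* Round up to the next multiple of the allocation unit.  */
  per_warp = (per_warp + REG_ALLOC_UNIT - 1) / REG_ALLOC_UNIT * REG_ALLOC_UNIT;
  return PARTITIONS_PER_SM * (REGS_PER_PARTITION / per_warp);
}

/* Largest block size (a multiple of the warp size) that maximizes the
   number of resident threads per SM under this model.  */
static int
model_block_size (int regs_per_thread)
{
  int warp_limit = warps_by_registers (regs_per_thread);
  if (warp_limit > MAX_WARPS_PER_SM)
    warp_limit = MAX_WARPS_PER_SM;

  int best_block = 0, best_resident = 0;
  for (int block = MAX_BLOCK_SIZE; block >= WARP_SIZE; block -= WARP_SIZE)
    {
      int warps_per_block = block / WARP_SIZE;
      int resident = (warp_limit / warps_per_block) * block;
      /* Strict comparison keeps the largest block among ties.  */
      if (resident > best_resident)
        {
          best_resident = resident;
          best_block = block;
        }
    }
  return best_block;
}
```

Under these assumptions, 33 registers round up to 40 per thread, which limits an SM to 48 resident warps, so model_block_size (33) yields 768. With 30 or 32 registers, 64 warps fit and the model yields 1024. That matches the block sizes printed above, but the real driver heuristic may take other resources into account.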
If that's useful, 'diff' of the PTX code that gets loaded to the GPU:
// BEGIN PREAMBLE
-.version 6.0
+.version 3.1
.target sm_30
.address_size 64
// END PREAMBLE
@@ -158,9 +158,17 @@ setp.ne.u32 %r111,%r110,0;
add.u64 %r109,%r109,8;
@ %r111 bra.uni $L11;
$L19:
-bar.warp.sync 0xffffffff;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
$L18:
-barrier.sync.aligned 0;
+bar.sync 0;
// forked 2;
@ %r113 bra $L12;
cvta.shared.u64 %r101,__oacc_bcast;
@@ -179,7 +187,15 @@ mov.u32 %r22,0;
mov.u32 %r29,1;
mov.u32 %r30,%ntid.y;
$L12:
-bar.warp.sync 0xffffffff;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
$L7:
@ %r113 bra $L13;
mov.u32 %r23,%tid.y;
@@ -188,11 +204,19 @@ setp.ge.s32 %r62,%r23,%r31;
selp.u32 %r114,1,0,%r62;
st.u32 [%r93],%r114;
$L13:
-bar.warp.sync 0xffffffff;
-barrier.sync %r94,128;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
+bar.sync %r94,128;
ld.u32 %r115,[%r93];
setp.ne.u32 %r62,%r115,0;
-barrier.sync %r94,128;
+bar.sync %r94,128;
@ %r62 bra.uni $L2;
$L6:
@ %r113 bra $L14;
@@ -220,8 +244,16 @@ st.u32 [%r95+36],%r30;
st.u32 [%r95+40],%r31;
st.u32 [%r95+44],%r34;
$L14:
-bar.warp.sync 0xffffffff;
-barrier.sync %r94,128;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
+bar.sync %r94,128;
// forked 4;
mov.u64 %r87,%r93;
mov.u64 %r89,%frame;
@@ -296,7 +328,7 @@ setp.lt.s32 %r83,%r24,%r38;
mov.u32 %r56,%r37;
st.u32 [%frame+8],%r56;
// joining 4;
-barrier.sync %r94,128;
+bar.sync %r94,128;
// join 4;
@ %r113 bra $L15;
add.u32 %r23,%r23,%r30;
@@ -304,11 +336,19 @@ setp.lt.s32 %r84,%r23,%r31;
selp.u32 %r116,1,0,%r84;
st.u32 [%r93],%r116;
$L15:
-bar.warp.sync 0xffffffff;
-barrier.sync %r94,128;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
+bar.sync %r94,128;
ld.u32 %r117,[%r93];
setp.ne.u32 %r84,%r117,0;
-barrier.sync %r94,128;
+bar.sync %r94,128;
@ %r84 bra.uni $L6;
$L2:
@ %r113 bra $L16;
@@ -317,19 +357,35 @@ setp.lt.s32 %r85,%r22,%r29;
selp.u32 %r118,1,0,%r85;
st.u32 [%r93],%r118;
$L16:
-bar.warp.sync 0xffffffff;
-barrier.sync %r94,128;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
+bar.sync %r94,128;
ld.u32 %r119,[%r93];
setp.ne.u32 %r85,%r119,0;
-barrier.sync %r94,128;
+bar.sync %r94,128;
@ %r85 bra.uni $L7;
@ %r113 bra $L17;
mov.u32 %r86,4;
st.u32 [%frame+4],%r86;
// joining 2;
$L17:
-bar.warp.sync 0xffffffff;
-barrier.sync.aligned 0;
+{
+.reg .b32 %r_act;
+vote.ballot.b32 %r_act,1;
+.reg .pred %r_do_abort;
+mov.pred %r_do_abort,0;
+setp.ne.b32 %r_do_abort,%r_act,0xffffffff;
+@ %r_do_abort trap;
+@ %r_do_abort exit;
+}
+bar.sync 0;
// join 2;
ret;
}
Do the 'trap'/'exit' "no-return" calls allow for optimizing JIT register
allocation? Does it follow that we should be doing something different
in the GCC/nvptx default multilib, to achieve a similar outcome (without
otherwise pessimizing the code, of course)?
Regards
Thomas
> [libgomp, testsuite, nvptx] Fix dg-output test in vector-length-128-7.c
>
> libgomp/ChangeLog:
>
> 2022-04-01 Tom de Vries <tdevries@suse.de>
>
> * testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c: Fix
> num_workers check.
>
> ---
> libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> index 4a8c1bf549e..92b3de03636 100644
> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
> @@ -37,4 +37,4 @@ main (void)
> }
>
> /* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
> -/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
> +/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=\[1-8\], vectors=128" } */