[Bug libgomp/105042] New: [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver

* [Bug libgomp/105042] New: [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
@ 2022-03-24  8:01 vries at gcc dot gnu.org
  2022-03-24  8:59 ` [Bug libgomp/105042] " rguenth at gcc dot gnu.org
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: vries at gcc dot gnu.org @ 2022-03-24  8:01 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

            Bug ID: 105042
           Summary: [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite
                    failures when X runs on nvidia driver
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

I usually have only an nvidia-compute$n driver package installed, but sometimes
(as happened when I updated the system yesterday) also x11-video-nvidia$n,
after which X is run on the nvidia card (instead of on the builtin intel
graphics).

With such a setup, I run into a cluster of FAILs, all for GOMP_NVPTX_JIT=-O0:
...
$ grep ^FAIL: 2/libgomp.sum
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vred2d-128.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/vred2d-128.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/parallel-dims.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/vred2d-128.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/vred2d-128.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O2
-DGOMP_NVPTX_JIT=-O0 execution test
FAIL: libgomp.oacc-fortran/parallel-dims.f90 -DACC_DEVICE_TYPE_nvidia=1
-DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0 -DGOMP_NVPTX_JIT=-O0 execution
test
FAIL: libgomp.oacc-fortran/parallel-dims.f90 -DACC_DEVICE_TYPE_nvidia=1
-DACC_MEM_SHARED=0 -foffload=nvptx-none  -O1 -DGOMP_NVPTX_JIT=-O0 execution
test
FAIL: libgomp.oacc-fortran/parallel-dims.f90 -DACC_DEVICE_TYPE_nvidia=1
-DACC_MEM_SHARED=0 -foffload=nvptx-none  -Os -DGOMP_NVPTX_JIT=-O0 execution
test
...

Note that this is with a patch from PR104423 that runs tests both with default
JIT optimization and GOMP_NVPTX_JIT=-O0, hence the -DGOMP_NVPTX_JIT=-O0 tag. 
But it can be reproduced by just doing:
...
export GOMP_NVPTX_JIT=-O0
...

It could be that the test-cases just need scaling down.  OTOH, it also could be
that there's an underlying problem that only surfaces when other processes are
run in parallel, or specifically, X.

This is on board K2000 with driver 470.103.01.

The board has 2GB of memory.  According to nvidia-smi, the X processes take a
couple of hundred MB and ./parallel-dims.exe takes just 15MiB, so at first
glance it doesn't seem to be an out-of-board-memory thing.

I do observe reduced system responsiveness while running the tests, so maybe
it's the compute capacity rather than memory which is exhausted.

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: rguenth at gcc dot gnu.org @ 2022-03-24  8:59 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Doesn't whatever driver/library API we use from libgomp to invoke workloads
report actual errors?  Maybe we need to improve there.

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: vries at gcc dot gnu.org @ 2022-03-24 10:09 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #2 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> Doesn't whatever driver/library API we use from libgomp to invoke workloads
> report actual errors?  Maybe we need to improve there.

Good point, it reported some form of timeout.  I'll post the exact form once I
reproduce.

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: vries at gcc dot gnu.org @ 2022-03-25  6:41 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #3 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #2)
> (In reply to Richard Biener from comment #1)
> > Doesn't whatever driver/library API we use from libgomp to invoke workloads
> > report actual errors?  Maybe we need to improve there.
> 
> Good point, it reported some form of timeout.  I'll post the exact form once
> I reproduce.

It's:
...
Execution timeout is: 300
spawn [open ...]

libgomp: cuStreamSynchronize error: the launch timed out and was terminated
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c
-DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none  -O0 
execution test
...

Googling a bit about this error message (
https://forums.developer.nvidia.com/t/need-to-remove-timeouts-and-the-launch-timed-out-and-was-terminated-message/16741/2
) shows that running a display manager sets a 5/10 seconds watchdog timer on
any kernel.

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: vries at gcc dot gnu.org @ 2022-03-25  9:19 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #4 from Tom de Vries <vries at gcc dot gnu.org> ---
https://gcc.gnu.org/pipermail/gcc-patches/2022-March/592275.html

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: vries at gcc dot gnu.org @ 2022-03-25  9:40 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #5 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> Doesn't whatever driver/library API we use from libgomp to invoke workloads
> report actual errors?  Maybe we need to improve there.

This:
...
libgomp: cuStreamSynchronize error: the launch timed out and was terminated
...
seems to be the string for cudaErrorLaunchTimeout, which AFAICT is dedicated to
this situation, so we could treat that error code specially in cuda_error in
plugin-nvptx.c and emit a custom message.

Say:
...
libgomp: cuStreamSynchronize error: the launch timed out and was terminated (5
second time-out caused by launching on a device running a display manager)
...

Alternatively, we could detect cudaDeviceProp::kernelExecTimeoutEnabled and
emit a warning when initializing or before launching the first kernel.
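
A minimal sketch of the first option, assuming nothing about the actual layout
of plugin-nvptx.c: sketch_timeout_hint and sketch_format_error are hypothetical
names, the hint wording is only illustrative, and the enum is a local stand-in
for CUDA_ERROR_LAUNCH_TIMEOUT (assumed to be 702 so the sketch builds without
the CUDA headers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Local stand-in for CUDA_ERROR_LAUNCH_TIMEOUT from <cuda.h>; the value
   702 is an assumption made so the sketch builds without CUDA headers.  */
enum { SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT = 702 };

/* Hypothetical helper: extra text for the launch-timeout case, or an
   empty string for every other result code.  */
static const char *
sketch_timeout_hint (int cu_result)
{
  if (cu_result == SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT)
    return " (time-out possibly caused by launching on a device running"
           " a display manager)";
  return "";
}

/* Hypothetical formatter: writes "libgomp: <fn> error: <msg><hint>" into
   BUF and returns whatever snprintf returns.  */
static int
sketch_format_error (char *buf, size_t len, const char *fn,
                     int cu_result, const char *driver_msg)
{
  assert (buf != NULL && len > 0);
  return snprintf (buf, len, "libgomp: %s error: %s%s", fn, driver_msg,
                   sketch_timeout_hint (cu_result));
}
```

Keeping the hint in a separate helper leaves the driver's own error string
untouched for all other result codes.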

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: cvs-commit at gcc dot gnu.org @ 2022-03-25 12:52 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:8570cce7c705f2ec3ffaeb8e47d58af22a075ebd

commit r12-7814-g8570cce7c705f2ec3ffaeb8e47d58af22a075ebd
Author: Tom de Vries <tdevries@suse.de>
Date:   Fri Mar 25 10:06:41 2022 +0100

    [libgomp, testsuite] Scale down some OpenACC test-cases

    When a display manager is running on an nvidia card, all CUDA kernel launches
    get a 5 seconds watchdog timer.

    Consequently, when running the libgomp testsuite with nvptx accelerator and
    GOMP_NVPTX_JIT=-O0 we run into a few FAILs like this:
    ...
    libgomp: cuStreamSynchronize error: the launch timed out and was terminated
    FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/parallel-dims.c \
      -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 -foffload=nvptx-none -O0 \
      execution test
    ...

    Fix this by scaling down the failing test-cases by default, and reverting to
    the original behaviour for GCC_TEST_RUN_EXPENSIVE=1.

    Tested on x86_64-linux with nvptx accelerator.

    libgomp/ChangeLog:

    2022-03-25  Tom de Vries  <tdevries@suse.de>

            PR libgomp/105042
            * testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Reduce
            execution time.
            * testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: Same.
            * testsuite/libgomp.oacc-fortran/parallel-dims.f90: Same.
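
The scaling-down described in the commit message could look roughly like the
sketch below.  It is not the committed change: SKETCH_EXPENSIVE, SKETCH_N and
sketch_workload are hypothetical names, and the sizes are made up.  The
assumption is that the testsuite defines the macro only when expensive tests
are requested, so that the default build uses the smaller problem size.

```c
#include <assert.h>

/* Hypothetical switch: assumed to be defined by the test harness only
   when expensive tests are requested (e.g. GCC_TEST_RUN_EXPENSIVE=1).  */
#ifdef SKETCH_EXPENSIVE
# define SKETCH_N 100000   /* original, long-running size (made up) */
#else
# define SKETCH_N 1000     /* scaled-down default (made up) */
#endif

/* Stand-in for the work a test would offload; sums 0 .. n-1.  */
static long
sketch_workload (int n)
{
  assert (n >= 0);
  long sum = 0;
  for (int i = 0; i < n; i++)
    sum += i;
  return sum;
}
```

With the smaller default, a kernel compiled with GOMP_NVPTX_JIT=-O0 has less
work per launch and is more likely to finish before a watchdog fires.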

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: tschwinge at gcc dot gnu.org @ 2022-03-25 12:55 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tschwinge at gcc dot gnu.org

--- Comment #7 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
By the way, I'm not reproducing this 'GOMP_NVPTX_JIT=-O0' issue on my current
Nvidia Quadro P1000 GPU system (Driver Version: 450.119.03), but what you've
found sounds plausible.


(In reply to Tom de Vries from comment #5)
> (In reply to Richard Biener from comment #1)
> > Doesn't whatever driver/library API we use from libgomp to invoke workloads
> > report actual errors?  Maybe we need to improve there.
> 
> This:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> ...
> seems to be the string for cudaErrorLaunchTimeout, which AFAICT is dedicated
> to this situation, so we could treat that error code specially in cuda_error
> in plugin-nvptx.c and emit a custom message.
> 
> Say:
> ...
> libgomp: cuStreamSynchronize error: the launch timed out and was terminated
> (5 second time-out caused by launching on a device running a display manager)
> ...

Not sure if that's really worth it?  And, "5 second time-out" seems a detail
that we shouldn't rely on.  Is a "display manager" really the only way this
timeout may get enabled?

> Alternatively, we could detect cudaDeviceProp::kernelExecTimeoutEnabled and
> emit a warning when initializing or before launching the first kernel.

That sounds noisy to me, given that most GPU kernel launches still finish
successfully?  A 'GOMP_debug' note for that sounds fine.

But, well, to be helpful to the user: how about we indeed catch the
'CUDA_ERROR_LAUNCH_TIMEOUT' error case, (if that makes sense, then 'assert'
that 'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT' is set), and emit an additional
message like "run time limit for kernels executed on the device" (per
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__DEVICE.html#group__CUDA__DEVICE_1g9c3e1414f0ad901d3278a4d6645fc266>,
'CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT')?  That is, like we have
'maybe_abort_msg'.
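
A rough sketch of that idea, not libgomp code: sketch_maybe_timeout_msg and
the query callback are hypothetical, the callback stands in for a
cuDeviceGetAttribute lookup of CU_DEVICE_ATTRIBUTE_KERNEL_EXEC_TIMEOUT, and
702 is assumed for CUDA_ERROR_LAUNCH_TIMEOUT so that no CUDA headers are
needed.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for CUDA_ERROR_LAUNCH_TIMEOUT; the value is assumed.  */
enum { SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT = 702 };

/* Hypothetical callback: nonzero if DEVICE enforces a run time limit on
   kernels.  A real version would query the driver for the attribute.  */
typedef int (*sketch_timeout_query) (int device);

/* Stand-in queries used only to exercise the sketch.  */
static int sketch_query_limited (int device) { (void) device; return 1; }
static int sketch_query_unlimited (int device) { (void) device; return 0; }

/* Returns an additional message for the launch-timeout case, or NULL
   when the result code is anything else.  */
static const char *
sketch_maybe_timeout_msg (int cu_result, int device,
                          sketch_timeout_query query)
{
  if (cu_result != SKETCH_CUDA_ERROR_LAUNCH_TIMEOUT)
    return NULL;
  /* A launch timeout is only expected when the limit is enabled.  */
  assert (query (device) != 0);
  return "run time limit for kernels executed on the device";
}
```

The message deliberately names neither a number of seconds nor a cause, so it
stays valid however the limit came to be enabled.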

* [Bug libgomp/105042] [libgomp, GOMP_NVPTX_JIT=-O0] Openacc testsuite failures when X runs on nvidia driver
From: vries at gcc dot gnu.org @ 2022-03-28 13:14 UTC
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105042

Tom de Vries <vries at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|normal                      |enhancement

--- Comment #8 from Tom de Vries <vries at gcc dot gnu.org> ---
With the conversation shifted to better error messages, re-classifying as
enhancement.
