[Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs

* [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs
@ 2022-12-14 11:48 tschwinge at gcc dot gnu.org
  2022-12-15  9:09 ` [Bug libgomp/108098] " vries at gcc dot gnu.org
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-12-14 11:48 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

            Bug ID: 108098
           Summary: OpenMP/nvptx reverse offload execution test FAILs
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: openmp
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tschwinge at gcc dot gnu.org
                CC: burnus at gcc dot gnu.org, jakub at gcc dot gnu.org,
                    vries at gcc dot gnu.org
  Target Milestone: ---

With commit r13-4593-gea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8 "libgomp: Handle
OpenMP's reverse offloads", I'm seeing PASSes for a number of testing
configurations (see below), but regressions/FAILs on others:

    PASS: libgomp.c/../libgomp.c-c++-common/reverse-offload-1.c (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/reverse-offload-1.c execution test

    libgomp: cuCtxSynchronize error: unspecified launch failure (perhaps abort was called)

Or:

    libgomp: cuCtxSynchronize error: an illegal instruction was encountered

Same for C++.

    PASS: libgomp.c/../libgomp.c-c++-common/reverse-offload-2.c (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.c/../libgomp.c-c++-common/reverse-offload-2.c execution test

    libgomp: cuModuleGetFunction error: named symbol not found

Same for C++.

    PASS: libgomp.fortran/reverse-offload-1.f90   -O0  (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.fortran/reverse-offload-1.f90   -O0  execution test

    STOP 2

    libgomp: cuCtxSynchronize error: unspecified launch failure (perhaps abort was called)

Or:

    libgomp: cuCtxSynchronize error: an illegal instruction was encountered

Same for other torture testing flags.

..., and for the new test cases:

    +PASS: libgomp.fortran/reverse-offload-2.f90   -O  (test for excess errors)
    +FAIL: libgomp.fortran/reverse-offload-2.f90   -O  execution test

    +PASS: libgomp.fortran/reverse-offload-3.f90   -O  (test for excess errors)
    +FAIL: libgomp.fortran/reverse-offload-3.f90   -O  execution test

    +PASS: libgomp.fortran/reverse-offload-4.f90   -O  (test for excess errors)
    +FAIL: libgomp.fortran/reverse-offload-4.f90   -O  execution test

    +PASS: libgomp.fortran/reverse-offload-5.f90   -O  (test for excess errors)
    +XFAIL: libgomp.fortran/reverse-offload-5.f90   -O  execution test

    +PASS: libgomp.fortran/reverse-offload-5a.f90   -O  (test for excess errors)
    +FAIL: libgomp.fortran/reverse-offload-5a.f90   -O  execution test

These fail with different host-side or device-side STOP codes.

I haven't analyzed further.

That's with standard options on a system with:

    $ nvidia-smi
    [...]
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2
    [...]
    |   0  Tesla K80  [...]
    [...]
    |   1  Tesla K80  [...]

... as well as (separately tested):

    |   2  GeForce GTX 1080  [...]

..., and with '-foffload-options=nvptx-none=-mptx=3.1' on a system with:

    $ nvidia-smi
    [...]
    | NVIDIA-SMI 361.93.02              Driver Version: 361.93.02
    [...]
    |   0  Tesla K20c  [...]

These test cases PASS on a system with:

    $ nvidia-smi
    [...]
    | NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1
    [...]
    |   0  Tesla K20c  [...]

... as well as (separately tested):

    |   1  Tesla K40c  [...]

..., and:

    $ nvidia-smi
    [...]
    | NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1
    [...]
    |   0  TITAN V  [...]

Is CUDA 11 (or rather, the corresponding Nvidia driver version) perhaps a hard
requirement, and if so, how should we deal with that?


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
@ 2022-12-15  9:09 ` vries at gcc dot gnu.org
  2022-12-16  9:40 ` tschwinge at gcc dot gnu.org
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: vries at gcc dot gnu.org @ 2022-12-15  9:09 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Thomas Schwinge from comment #0)
>     $ nvidia-smi
>     [...]
>     | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2
>     [...]
>     |   0  Tesla K80  [...]
>     [...]
>     |   1  Tesla K80  [...]
> 

I'm not sure if it matters for triggering this problem, but if I look up this
board on the Nvidia driver download page and select CUDA 10.2 and the
production branch, I get:
...
version:        440.118.02
Release Date:   2020.9.30
...

Then using the "Beta and Older Drivers" I find the version you're using is:
...
version: 440.33.01
Release date:  November 19, 2019
...

Please always use the latest drivers when reporting a problem.


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
  2022-12-15  9:09 ` [Bug libgomp/108098] " vries at gcc dot gnu.org
@ 2022-12-16  9:40 ` tschwinge at gcc dot gnu.org
  2022-12-16 13:13 ` burnus at gcc dot gnu.org
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: tschwinge at gcc dot gnu.org @ 2022-12-16  9:40 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

Thomas Schwinge <tschwinge at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2022-12-16

--- Comment #2 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #1)
> I'm not sure if it matters for triggering this problem

It doesn't:

> version: 	440.118.02

Same set of FAILs.


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
  2022-12-15  9:09 ` [Bug libgomp/108098] " vries at gcc dot gnu.org
  2022-12-16  9:40 ` tschwinge at gcc dot gnu.org
@ 2022-12-16 13:13 ` burnus at gcc dot gnu.org
  2022-12-16 15:06 ` burnus at gcc dot gnu.org
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: burnus at gcc dot gnu.org @ 2022-12-16 13:13 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
The problem - at least when testing on a system with:
  NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2

seems to be that libgomp/plugin/plugin-nvptx.c's GOMP_OFFLOAD_load_image has:

   fn_entries == 3 and rev_fn_table != NULL (i.e. reverse offload is expected)

and then runs:

      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
                             "$offload_func_table");
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
      assert (bytes == sizeof (uint64_t) * fn_entries);
      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));

So far so good - but all entries are NULL. This then disables the checking for
reverse offload on the host side. (It is not quite clear to me why it doesn't
run into an endless loop on the device side.)
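
To make the "all entries are NULL" condition concrete, here is a minimal
sketch of such a check on the table once it has been copied to the host. The
helper name rev_fn_table_all_null is hypothetical and is not the code used in
libgomp; it only assumes, as in the snippet above, that the table is an array
of fn_entries uint64_t values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libgomp: return 1 if every entry of a
   reverse-offload function table copied back from the device is zero,
   0 otherwise.  An empty table counts as "all NULL".  */
int
rev_fn_table_all_null (const uint64_t *table, size_t fn_entries)
{
  assert (table != NULL || fn_entries == 0);

  for (size_t i = 0; i < fn_entries; i++)
    if (table[i] != 0)
      return 0;

  return 1;
}
```

With a table such as the one observed here (three entries, all zero) the
helper returns 1, which corresponds to the case where the host side concludes
that no reverse offload is in use.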

For reverse-offload-1{,-aux}.c, the generated PTX code for that offload table
is:

        ".version 6.0"
        ".target sm_35"
        ".file 1 \"<dummy>\""
        ".extern .func tg_fn$_omp_fn$0$nohost$0 (.param .u64 %in_ar0);"
        ".extern .func main$_omp_fn$2$nohost$1 (.param .u64 %in_ar0);"
        ".visible .global .align 8 .u64 $offload_func_table[] = {"
                "tg_fn$_omp_fn$0$nohost$0,"
                "main$_omp_fn$2$nohost$1,"
                "0,"
                "0};\n";

which seems to be OK, and it works with CUDA 11.  It looks as if '>= sm_35' is
only one of the required criteria and that there are additional ones.

 * * *

I am relatively sure that it did work before, but it could well be that I only
checked that the device->host notification worked, without trying any actual
offload (and before adding the 'all NULL -> no reverse offload' check). Later,
when doing the actual offload tests, I might have missed that machine. Or I did
something different back then, but I don't know what.

 * * *

In patch "nvptx: Support global constructors/destructors via 'collect2'",
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html , Thomas
uses a dummy entry - possibly that would also be a solution:

+/* For example with old Nvidia Tesla K20c, Driver Version: 361.93.02, the
+   function pointers stored in the '__CTOR_LIST__', '__DTOR_LIST__' arrays
+   evidently evaluate to NULL in JIT compilation.  Avoiding the use of
+   assembler names ('write_list_with_asm') doesn't help, but defining a dummy
+   function next to the arrays apparently does work around this issue...


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
                   ` (2 preceding siblings ...)
  2022-12-16 13:13 ` burnus at gcc dot gnu.org
@ 2022-12-16 15:06 ` burnus at gcc dot gnu.org
  2023-05-05  9:28 ` cvs-commit at gcc dot gnu.org
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: burnus at gcc dot gnu.org @ 2022-12-16 15:06 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #4 from Tobias Burnus <burnus at gcc dot gnu.org> ---
Indeed, the following seems to also help with an older CUDA / JIT compiler.
Motivated by Thomas' work.

If we are sure that CUDA 11.0 fixes it, we could generate that code only for:

  if (version2[0] < 7 || sm_ver2[0] < 8)

given that sm_80 is only supported since CUDA 11.0 and, likewise, CUDA 11.0
introduces PTX ISA version 7.0.

--- a/gcc/config/nvptx/mkoffload.cc
+++ b/gcc/config/nvptx/mkoffload.cc
@@ -358,4 +358,9 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
       fprintf (out, "\"\n\t\".file 1 \\\"<dummy>\\\"\"\n");

+      fprintf (out, "\n\t\".func __dummy$func ( );\"\n");
+      fprintf (out, "\t\".func __dummy$func ( )\"\n");
+      fprintf (out, "\t\"{\"\n");
+      fprintf (out, "\t\"}\"\n");
+
       size_t fidx = 0;
       for (id = func_ids; id; id = id->next)


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
                   ` (3 preceding siblings ...)
  2022-12-16 15:06 ` burnus at gcc dot gnu.org
@ 2023-05-05  9:28 ` cvs-commit at gcc dot gnu.org
  2023-05-08 21:15 ` cvs-commit at gcc dot gnu.org
  2023-05-08 21:16 ` burnus at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-05-05  9:28 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #5 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The master branch has been updated by Tobias Burnus <burnus@gcc.gnu.org>:

https://gcc.gnu.org/g:4359724cba31b2645f6106266bef019c3d6ef16a

commit r14-491-g4359724cba31b2645f6106266bef019c3d6ef16a
Author: Tobias Burnus <tobias@codesourcery.com>
Date:   Fri May 5 11:27:32 2023 +0200

    nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

    Seemingly, the ptx JIT of CUDA <= 10.2 replaces function pointers in global
    variables by NULL if a translation does not contain any executable code. It
    works with CUDA 11.1.  The code of this commit is about reverse offload;
    having NULL values disables the side of reverse offload during image load.

    Solution is the same as found by Thomas for a related issue: Adding a dummy
    procedure. Cf. the PR of this issue and Thomas' patch
    "nvptx: Support global constructors/destructors via 'collect2'"
    https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html

    As that approach also works here:

    Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>

    gcc/
            PR libgomp/108098

            * config/nvptx/mkoffload.cc (process): Emit dummy procedure
            alongside reverse-offload function table to prevent NULL values
            of the function addresses.
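
    For readers untangling the escaping, the sketch below wraps the four
    fprintf calls from the patch posted in comment #4 in a standalone function.
    The wrapper name emit_dummy_func is hypothetical; the committed code in
    mkoffload.cc may differ in detail from that earlier patch.

```c
#include <stdio.h>

/* Hypothetical wrapper around the fprintf calls from the patch in comment #4.
   It writes C string literals into the generated host-side source file; the
   C compiler later concatenates them into the PTX text.  */
void
emit_dummy_func (FILE *out)
{
  /* Declaration of the empty dummy function.  */
  fprintf (out, "\n\t\".func __dummy$func ( );\"\n");
  /* Definition with an empty body, so the module is not free of code.  */
  fprintf (out, "\t\".func __dummy$func ( )\"\n");
  fprintf (out, "\t\"{\"\n");
  fprintf (out, "\t\"}\"\n");
}
```

    Each call writes one tab-indented, double-quoted string literal on its own
    line, so the generated file gains a declaration and an empty definition of
    __dummy$func next to the function table.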


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
                   ` (4 preceding siblings ...)
  2023-05-05  9:28 ` cvs-commit at gcc dot gnu.org
@ 2023-05-08 21:15 ` cvs-commit at gcc dot gnu.org
  2023-05-08 21:16 ` burnus at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: cvs-commit at gcc dot gnu.org @ 2023-05-08 21:15 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #6 from CVS Commits <cvs-commit at gcc dot gnu.org> ---
The releases/gcc-13 branch has been updated by Tobias Burnus
<burnus@gcc.gnu.org>:

https://gcc.gnu.org/g:615b920553fd28e9d4732dedcd799227e82cc011

commit r13-7306-g615b920553fd28e9d4732dedcd799227e82cc011
Author: Tobias Burnus <tobias@codesourcery.com>
Date:   Fri May 5 11:27:32 2023 +0200

    nvptx/mkoffload.cc: Add dummy proc for OpenMP rev-offload table [PR108098]

    Seemingly, the ptx JIT of CUDA <= 10.2 replaces function pointers in global
    variables by NULL if a translation does not contain any executable code. It
    works with CUDA 11.1.  The code of this commit is about reverse offload;
    having NULL values disables the side of reverse offload during image load.

    Solution is the same as found by Thomas for a related issue: Adding a dummy
    procedure. Cf. the PR of this issue and Thomas' patch
    "nvptx: Support global constructors/destructors via 'collect2'"
    https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html

    As that approach also works here:

    Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>

    gcc/
            PR libgomp/108098

            * config/nvptx/mkoffload.cc (process): Emit dummy procedure
            alongside reverse-offload function table to prevent NULL values
            of the function addresses.

    (cherry picked from commit 4359724cba31b2645f6106266bef019c3d6ef16a)


* [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
  2022-12-14 11:48 [Bug libgomp/108098] New: OpenMP/nvptx reverse offload execution test FAILs tschwinge at gcc dot gnu.org
                   ` (5 preceding siblings ...)
  2023-05-08 21:15 ` cvs-commit at gcc dot gnu.org
@ 2023-05-08 21:16 ` burnus at gcc dot gnu.org
  6 siblings, 0 replies; 8+ messages in thread
From: burnus at gcc dot gnu.org @ 2023-05-08 21:16 UTC (permalink / raw)
  To: gcc-bugs

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

Tobias Burnus <burnus at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #7 from Tobias Burnus <burnus at gcc dot gnu.org> ---
FIXED for GCC 13(.2) + mainline/14.
