From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-bugzilla@gcc.gnu.org>
Received: by sourceware.org (Postfix, from userid 48)
	id 8415138417F6; Fri, 16 Dec 2022 13:13:39 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8415138417F6
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gcc.gnu.org;
	s=default; t=1671196419;
	bh=9MAOT6/Rxv17HG6dxdntIy3ZIvMkTVfS3TCBumCQ04A=;
	h=From:To:Subject:Date:In-Reply-To:References:From;
	b=u4QGVvk8GuagRrY4gJ096pZ+q71z8Gft3jkfYjQ9xQWvGsmity8PtnvnoebMfWzT9
	 Axf4JzBwOCamDeP6BYuvkgK5CTNqTEK/VVslv1SBCsDyoQaNz5yQAS6VcP2N3cH+o3
	 ShnrLkGOWiV7F7YKrPTB1pECFZuSmlbxyFcsnES0=
From: "burnus at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org>
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test
 FAILs
Date: Fri, 16 Dec 2022 13:13:38 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libgomp
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: openmp
X-Bugzilla-Severity: normal
X-Bugzilla-Who: burnus at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-108098-4-LROfPdbRmS@http.gcc.gnu.org/bugzilla/>
In-Reply-To: <bug-108098-4@http.gcc.gnu.org/bugzilla/>
References: <bug-108098-4@http.gcc.gnu.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
List-Id: <gcc-bugs.sourceware.org>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098
--- Comment #3 from Tobias Burnus <burnus at gcc dot gnu.org> ---
The problem - at least when testing on a system with:
  NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2

seems to be that libgomp/plugin/plugin-nvptx.c's GOMP_OFFLOAD_load_image
has:

   fn_entries == 3 - and rev_fn_table != NULL (i.e. expect offloading)

and then runs:

      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
                             "$offload_func_table");
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
      assert (bytes == sizeof (uint64_t) * fn_entries);
      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));

So far so good - but all entries are NULL. This then disables the checking for
reverse offload on the host side. (It is not quite clear to me why it doesn't
run into an endless loop on the device side.)
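
To make the failure mode above concrete, here is a minimal sketch of a host-side
check on the table after it has been copied back from the device. The helper
name rev_fn_table_all_null is hypothetical and not part of libgomp; it only
assumes that the table is an array of uint64_t entries, as in the code quoted
above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not libgomp code):
   return true if every entry of the host copy of the device function table
   is zero, i.e. the situation described above where no function address
   survived.  An empty table trivially counts as "all NULL".  */
static bool
rev_fn_table_all_null (const uint64_t *table, size_t n_entries)
{
  for (size_t i = 0; i < n_entries; i++)
    if (table[i] != 0)
      return false;
  return true;
}
```

Such a check could be used, for instance, to emit a diagnostic instead of
silently disabling reverse offload when a non-empty table comes back all zero.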

For reverse-offload-1{,-aux}.c, the generated PTX code for that offload table
is:

        ".version 6.0"
        ".target sm_35"
        ".file 1 \"<dummy>\""
        ".extern .func tg_fn$_omp_fn$0$nohost$0 (.param .u64 %in_ar0);"
        ".extern .func main$_omp_fn$2$nohost$1 (.param .u64 %in_ar0);"
        ".visible .global .align 8 .u64 $offload_func_table[] = {"
                "tg_fn$_omp_fn$0$nohost$0,"
                "main$_omp_fn$2$nohost$1,"
                "0,"
                "0};\n";

which seems to be OK - and works with CUDA 11.  It looks as if '>= sm_35' is
only one of the required criteria and that there are additional ones.

 * * *

I am relatively sure that it did work before, but it could well be that I only
checked that the device->host notification worked w/o trying any actual offload
(and before adding the "all NULL -> no reverse offload" handling). And later,
when doing the actual offload tests, I might have missed that machine. Or I did
something different back then, but I don't know what.

 * * *

In patch "nvptx: Support global constructors/destructors via 'collect2'",
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html , Thomas
uses a dummy entry - possibly that would also be a solution:

+/* For example with old Nvidia Tesla K20c, Driver Version: 361.93.02, the
+   function pointers stored in the '__CTOR_LIST__', '__DTOR_LIST__' arrays
+   evidently evaluate to NULL in JIT compilation.  Avoiding the use of
+   assembler names ('write_list_with_asm') doesn't help, but defining a dummy
+   function next to the arrays apparently does work around this issue...