From: "burnus at gcc dot gnu.org"
To: gcc-bugs@gcc.gnu.org
Subject: [Bug libgomp/108098] OpenMP/nvptx reverse offload execution test FAILs
Date: Fri, 16 Dec 2022 13:13:38 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: gcc
X-Bugzilla-Component: libgomp
X-Bugzilla-Version: 13.0
X-Bugzilla-Keywords: openmp
X-Bugzilla-Severity: normal
X-Bugzilla-Who: burnus at gcc dot gnu.org
X-Bugzilla-Status: UNCONFIRMED
X-Bugzilla-Priority: P3
X-Bugzilla-Assigned-To: unassigned at gcc dot gnu.org
X-Bugzilla-Target-Milestone: ---
X-Bugzilla-URL: http://gcc.gnu.org/bugzilla/

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108098

--- Comment #3 from Tobias Burnus ---
The problem - at least when testing on a system with:
  NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2
seems to be that libgomp/plugin/plugin-nvptx.c's GOMP_OFFLOAD_load_image has:
  fn_entries == 3 - and rev_fn_table != NULL (i.e.
expect offloading) and then runs:

      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
                             "$offload_func_table");
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
      assert (bytes == sizeof (uint64_t) * fn_entries);
      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));

So far so good - but all entries are NULL. This then disables the checking for
reverse offload on the host side. (It is not quite clear to me why it doesn't
run into an endless loop on the device side.)

The generated PTX code for reverse-offload-1{,-aux}.c is for that offload
table:

  ".version 6.0"
  ".target sm_35"
  ".file 1 \"\""
  ".extern .func tg_fn$_omp_fn$0$nohost$0 (.param .u64 %in_ar0);"
  ".extern .func main$_omp_fn$2$nohost$1 (.param .u64 %in_ar0);"
  ".visible .global .align 8 .u64 $offload_func_table[] = {"
  "tg_fn$_omp_fn$0$nohost$0,"
  "main$_omp_fn$2$nohost$1,"
  "0,"
  "0};\n";

which seems to be OK - and works with CUDA 11. It looks as if '>= sm_35' is
only one required criterion and that there are additional ones.

* * *

I am relatively sure that it did work before, but it could well be that I only
checked that the device->host notification worked, without trying any actual
offload (and before adding the "all NULL -> no reverse offload" handling). And
later, when doing the actual offload tests, I might have missed that machine.
Or I did something different back then, but I don't know what.
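
To illustrate the "all entries are NULL" condition mentioned above: a minimal
sketch, assuming the host copy of the table is a plain array of uint64_t as in
the snippet from GOMP_OFFLOAD_load_image. The helper name rev_fn_table_usable
is hypothetical and is not taken from libgomp; the actual host-side check may
look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libgomp): return true if at least one
   entry of the host copy of the device function table is non-zero.  If
   every entry is zero, there is no device function address that the host
   could match against, so reverse offload cannot be serviced.  */
bool
rev_fn_table_usable (const uint64_t *table, size_t n_entries)
{
  /* An empty table may be passed as NULL; anything else needs storage.  */
  assert (table != NULL || n_entries == 0);

  for (size_t i = 0; i < n_entries; i++)
    if (table[i] != 0)
      return true;
  return false;
}
```

With the table observed on the CUDA 10.2 system (three entries, all zero),
such a check would return false, which matches the symptom that reverse
offload gets disabled on the host side.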

* * *

In patch "nvptx: Support global constructors/destructors via 'collect2'",
https://gcc.gnu.org/pipermail/gcc-patches/2022-December/607749.html , Thomas
uses a dummy entry - possibly that would also be a solution:

+/* For example with old Nvidia Tesla K20c, Driver Version: 361.93.02, the
+   function pointers stored in the '__CTOR_LIST__', '__DTOR_LIST__' arrays
+   evidently evaluate to NULL in JIT compilation.  Avoiding the use of
+   assembler names ('write_list_with_asm') doesn't help, but defining a dummy
+   function next to the arrays apparently does work around this issue...