@Tom and Alexander: Better suggestions are welcome for the busy loop in
libgomp/plugin/plugin-nvptx.c, both regarding the placement of the variable
and the way its value is checked.

PRE-REMARK

As nvptx (and all other plugins) return <= 0 for GOMP_OFFLOAD_get_num_devices
if GOMP_REQUIRES_REVERSE_OFFLOAD is set, this patch is currently still a no-op.

The patch is almost stand-alone, except that it needs either a
  void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
  [Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
  https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).

It is not required for compiling, but this patch is based on the ideas/code
from the reverse-offload ME (middle end) patch; the latter adds calls to
GOMP_target_ext (omp_initial_device, ...), which are processed by the normal
target_ext for host fallback code and by the target_ext of this patch for
device code.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html

* * *

This patch adds initial reverse-offload support for nvptx.

When the nvptx device's GOMP_target_ext is called, it creates a lock, fills
a struct with the argument pointers (addr, kinds, sizes) and its device
number, and then sets the function pointer address.

On the host side, that last address is checked: if fn_addr != NULL, the host
passes all arguments on to the generic (target.c) gomp_target_rev, which does
the actual offloading.

CUDA locks up when trying to copy data from the currently running stream;
hence, a new stream is created to do the memory copying. Just having managed
memory is not enough: it also needs to be concurrently accessible; otherwise,
the host will segfault when accessing the memory after it has been migrated
to the device.

OK for mainline?

* * *

Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM,
  and possibly sleeping for less than usleep(1)?
* Set a flag indicating whether there is any reverse-offload function at all,
  to avoid running the more expensive check if there is
  'requires reverse_offload' without any actual reverse-offload functions.
  (Recall that the '2/3' patch, mentioned above, only has fn != NULL for
  reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
  overhead for all target regions, and that reverse offload is run serialized.

And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company (GmbH); Managing directors: Thomas Heurung, Frank Thürauf; Registered office: München; Register court: München, HRB 106955
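
For illustration only, the device/host handshake described above (fill a struct, set the function address last, let the host poll that address) can be sketched in plain host-side C roughly as follows. This is a minimal sketch and not the code of the patch: struct rev_request, post_request, poll_once, demo_handler and rev_offload_demo are hypothetical names, C11 atomics stand in for whatever synchronization the device code really uses, and clearing fn_addr to signal completion is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mailbox shared by "device" and "host"; the layout is
   illustrative and not taken from the patch.  */
struct rev_request
{
  uint64_t mapnum;
  void **addrs;
  size_t *sizes;
  unsigned short *kinds;
  int dev_num;
  /* Written last by the requester; nonzero means "request pending".  */
  _Atomic uintptr_t fn_addr;
};

typedef void (*rev_handler) (uintptr_t fn_addr, uint64_t mapnum, void **addrs,
                             size_t *sizes, unsigned short *kinds,
                             int dev_num);

/* "Device" side: fill in all arguments, then publish fn_addr last so that
   the host never sees a half-filled request.  */
static void
post_request (struct rev_request *req, uintptr_t fn_addr, uint64_t mapnum,
              void **addrs, size_t *sizes, unsigned short *kinds, int dev_num)
{
  req->mapnum = mapnum;
  req->addrs = addrs;
  req->sizes = sizes;
  req->kinds = kinds;
  req->dev_num = dev_num;
  atomic_store_explicit (&req->fn_addr, fn_addr, memory_order_release);
}

/* "Host" side: one iteration of the polling loop.  Returns 1 if a request
   was pending and has been handled, 0 otherwise.  */
static int
poll_once (struct rev_request *req, rev_handler handler)
{
  uintptr_t fn = atomic_load_explicit (&req->fn_addr, memory_order_acquire);
  if (fn == 0)
    return 0;
  handler (fn, req->mapnum, req->addrs, req->sizes, req->kinds, req->dev_num);
  /* Assumption: clearing fn_addr tells the waiting requester it may go on.  */
  atomic_store_explicit (&req->fn_addr, 0, memory_order_release);
  return 1;
}

static int demo_sum;

/* Stand-in for the generic handler: sums the ints the arguments point to.  */
static void
demo_handler (uintptr_t fn_addr, uint64_t mapnum, void **addrs, size_t *sizes,
              unsigned short *kinds, int dev_num)
{
  (void) fn_addr; (void) sizes; (void) kinds; (void) dev_num;
  demo_sum = 0;
  for (uint64_t i = 0; i < mapnum; i++)
    demo_sum += *(int *) addrs[i];
}

/* Single-threaded walk through one request; returns the handler's sum.  */
static int
rev_offload_demo (void)
{
  int a = 40, b = 2;
  void *addrs[] = { &a, &b };
  size_t sizes[] = { sizeof a, sizeof b };
  unsigned short kinds[] = { 0, 0 };
  struct rev_request req = { 0 };

  int handled = poll_once (&req, demo_handler);   /* Nothing pending yet.  */
  assert (handled == 0);
  post_request (&req, (uintptr_t) 0x1000, 2, addrs, sizes, kinds, 0);
  handled = poll_once (&req, demo_handler);       /* Now picked up.  */
  assert (handled == 1);
  assert (atomic_load (&req.fn_addr) == 0);
  return demo_sum;
}
```

In a real busy loop, the host would presumably call something like poll_once repeatedly with a short sleep in between, which is where the usleep(1) tuning question from the future-work list comes in.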