@Alexander/@Tom – can you comment on both libgomp/config/nvptx and libgomp/plugin/plugin-nvptx.c? (Comments on the rest are welcome, too.) (Updated patch enclosed.)

Because Jakub asked:

I'm afraid you need Alexander or Tom here, I don't feel I can review it; I could rubber stamp it if they are ok with it.

Regarding:

How did the standardization process for this feature look, how did it pass if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets, it is enough if it can work on some. As one needs to request the reverse offloading through a declarative directive, it is always possible in that case to just pretend devices that don't support it don't exist.

First, I think it is better to provide a feature, even if it is slow, than not to provide it at all. Secondly, as Jakub mentioned, it is not required that all devices support this feature well; it is sufficient that some do. I believe one of the main use cases is debugging, and for that use performance is not critical. This patch attempts to have no overhead if the feature is not used (no 'omp requires reverse_offload' and no actual reverse offload). Additionally, for GCN, it can be implemented with almost no overhead by using the mechanism already used for I/O. (CUDA implements 'printf' internally – but does not permit piggybacking on this feature.)

* * *

I think in the future we could additionally pass information to GOMP_target_ext on whether a target region is known not to do reverse offload – both by checking what's in the region and by utilizing an 'absent(target)' assumption placed on the outer target region via an '(begin) assume(s)' directive (see the second sketch below). That should at least help with the common case of having no reverse offload – even if it does not help with a large kernel that does use reverse offload for a non-debugging purpose (e.g. to trigger file I/O or inter-node communication).

* * *

Regarding the implementation: I left the 'usleep(1)' in for now – 1µs seems to be not too bad, and I have no idea what would be better. I also have no idea what the overhead is for accessing concurrently accessible managed memory from the host (does it stay on the host until requested by the device – or does it always live on the device and need to be copied/migrated to the host for every access?). Likewise, I do not know how much overhead the D2H copy of the memory via the second CUDA stream causes. Suggestions are welcome. But as this code is strictly confined to a single function, it can easily be modified later.

Documentation: I have not mentioned the caveats in https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html as reverse offload is not yet enabled, even with this patch.

On 26.08.22 11:07, Tobias Burnus wrote:

PRE-REMARK: As nvptx (like all other plugins) returns <= 0 for GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is set, this patch is currently still a no-op.

The patch is almost standalone, except that it needs either a

  void *rev_fn_table = NULL;

in GOMP_OFFLOAD_load_image or the following patch:

[Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch)

Not required for compilability, but the patch is based on the ideas/code of the reverse-offload middle-end (ME) patch; the latter adds calls to 'GOMP_target_ext (omp_initial_device, ...)', which for host-fallback code is processed by the normal GOMP_target_ext and for device code by the GOMP_target_ext of this patch:

→ "[Patch] OpenMP: Support reverse offload (middle end part)"
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html
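To illustrate the above with a rough sketch of mine (not the literal ME-patch output; variable names are placeholders):

  #pragma omp requires reverse_offload

  /* Inside a target region, user code requests reverse offload via: */
  #pragma omp target device (ancestor : 1) map (tofrom : x)
    x = foo (x);

  /* The ME patch lowers this, roughly, to a call matching target.c's
     prototype – which, in device code, ends up in this patch's
     libgomp/config/nvptx implementation:  */
  GOMP_target_ext (omp_initial_device, fn, mapnum, hostaddrs, sizes, kinds,
                   /* flags */ 0, /* depend */ NULL, /* args */ NULL);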
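And for the 'absent(target)' future-work idea mentioned further above, a user-side sketch (OpenMP 5.1 assumption syntax; 'compute' is a hypothetical function):

  #pragma omp target
  {
    /* Promise that no (reverse-offload) 'target' construct occurs in
       this block; a runtime knowing this could skip the host-side
       polling for this kernel.  */
    #pragma omp assume absent (target)
    {
      compute ();
    }
  }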
* * *

This patch adds initial reverse-offload support for nvptx. When nvptx's device-side GOMP_target_ext is called, it takes a lock, fills a struct with the argument pointers (addrs, sizes, kinds) and its device number, and then sets the function-pointer address (a sketch of this handshake record follows at the end of this mail). On the host side, that last address is checked – if fn_addr != NULL, all arguments are passed on to the generic (target.c) gomp_target_rev, which does the actual reverse offload.

CUDA locks up when trying to copy data from the currently running stream; hence, a new stream is created to do the memory copying. Just having managed memory is also not enough – it needs to be concurrently accessible; otherwise, it segfaults on the host once the memory has been migrated to the device. (See the CUDA sketch at the end of this mail.)

OK for mainline?

* * *

Future work for nvptx:

* Adjust the 'sleep', possibly using different values with and without USM, and doing shorter sleeps than usleep(1)?

* Set a flag whether there is any reverse-offload function at all, avoiding running the more expensive check when 'requires reverse_offload' is present but no actual reverse-offload function exists. (Recall that the '2/3' patch, mentioned above, only has fn != NULL for reverse-offload functions.)

* Document in libgomp.texi that reverse offload may cause some performance overhead for all target regions – and that reverse offloads are run serialized.

And obviously: submitting the missing bits to get reverse offload working, but that's mostly not an nvptx topic.
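To make the handshake concrete, here is a rough sketch of the record and of the host-side polling loop – my illustration with made-up field/variable names, not necessarily those of the patch:

  /* Context: libgomp/plugin/plugin-nvptx.c; <stdint.h> and <unistd.h>
     are assumed.  One record in concurrently accessible managed memory: */
  struct rev_offload
  {
    uint64_t fn;      /* Function address; written last – != 0 flags a request. */
    uint64_t mapnum;
    uint64_t addrs;   /* Address of the hostaddrs array.  */
    uint64_t sizes;
    uint64_t kinds;
    int32_t dev_num;
    uint32_t lock;    /* Taken by the device-side GOMP_target_ext.  */
  };

  /* Host side, while the kernel runs: poll 'fn', hand a request on to
     target.c, then clear 'fn' so the spinning device thread continues.  */
  while (!kernel_has_finished)  /* hypothetical termination condition */
    {
      if (__atomic_load_n (&rev_data->fn, __ATOMIC_ACQUIRE) != 0)
        {
          gomp_target_rev (rev_data->fn, rev_data->mapnum, rev_data->addrs,
                           rev_data->sizes, rev_data->kinds,
                           rev_data->dev_num);  /* argument list abridged */
          __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
        }
      usleep (1);  /* the 1 µs discussed above */
    }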
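And a sketch of the two CUDA aspects (the driver-API calls and the attribute are real; the surrounding code and the use of plugin-nvptx.c's CUDA_CALL convenience macro are illustrative, not the patch itself):

  /* The record must be *concurrently* accessible, i.e. usable by the
     host while a kernel runs – plain managed memory segfaults on host
     access once it has migrated to the device.  */
  int attr;
  CUDA_CALL (cuDeviceGetAttribute, &attr,
             CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
  if (!attr)
    return false;  /* Reverse offload not usable on this device.  */
  CUdeviceptr dp;
  CUDA_CALL (cuMemAllocManaged, &dp, sizeof (struct rev_offload),
             CU_MEM_ATTACH_GLOBAL);

  /* D2H copies must not run on the stream executing the kernel – that
     locks up; hence a second, non-blocking stream: */
  CUstream copy_stream;
  CUDA_CALL (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
  CUDA_CALL (cuMemcpyDtoHAsync, host_buf, (CUdeviceptr) dev_ptr, num_bytes,
             copy_stream);
  CUDA_CALL (cuStreamSynchronize, copy_stream);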