From: Tobias Burnus <tobias@codesourcery.com>
To: Jakub Jelinek <jakub@redhat.com>, Tom de Vries <tdevries@suse.de>,
	gcc-patches <gcc-patches@gcc.gnu.org>
Cc: Alexander Monakov <amonakov@ispras.ru>
Subject: Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
Date: Tue, 13 Sep 2022 09:07:51 +0200	[thread overview]
Message-ID: <3f0fc49f-b07f-bee2-51a8-a5d03f1c33ed@codesourcery.com> (raw)
In-Reply-To: <57b3ae5e-8f15-8bea-fa09-39bccbaa2414@codesourcery.com>


[-- Attachment #1.1: Type: text/plain, Size: 5512 bytes --]

@Alexander/@Tom – can you comment on both libgomp/config/nvptx and libgomp/plugin/plugin-nvptx.c? (Comments on the rest are welcome, too.)

(Updated patch enclosed)

Because Jakub asked:

I'm afraid you need Alexander or Tom here, I don't feel I can review it;
I could rubber stamp it if they are ok with it.


Regarding:

How did the standardization process for this feature look like, how did it pass
if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets, it is enough when it can work on some. As one needs to request the reverse offloading through a declarative directive, it is always possible in that case to just pretend devices that don't support it don't exist.

First, I think it is better to provide a feature, even if it is slow, than not to provide it at all. Secondly, as Jakub mentioned, it is not required that all devices support this feature well; it is sufficient that some do.

I believe one of the main uses is debugging, and for that use the performance is not critical. This patch attempts to add no overhead if the feature is not used (no 'omp requires reverse_offload' and no actual reverse offload).

Additionally, for GCN, it can be implemented with almost no overhead by using the mechanism already used for I/O. (CUDA implements 'printf' internally – but does not permit piggybacking on this feature.)
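
For illustration, here is a minimal usage example of the user-facing feature this prepares for (hypothetical code, not part of the patch): reverse offload must be requested via the declarative 'requires' directive, and the nested target region with 'device(ancestor: 1)' then runs back on the host.

  /* Hypothetical example - not part of the patch.  */
  #include <stdio.h>

  #pragma omp requires reverse_offload

  int
  main (void)
  {
    int checksum = 0;

    #pragma omp target map(tofrom: checksum)
    {
      checksum = 42;

      /* Reverse offload: this region runs back on the host, e.g. for
         debugging output or file I/O the device cannot do itself.  */
      #pragma omp target device(ancestor: 1)
        printf ("checksum on device: %d\n", checksum);
    }

    return 0;
  }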

* * *

I think that, in the future, we could additionally pass information to GOMP_target_ext on whether a target region is known not to do reverse offload – both by checking what's in the region and by utilizing an 'absent(target)' assumption placed on the outer target region via an '(begin) assume(s)' directive; a rough sketch of such user code follows below. That should at least help with the common case of having no reverse offload – even if it does not help for some large kernel that does use reverse offload for a non-debugging purpose (e.g. to trigger file I/O or inter-node communication).
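
A rough sketch of what such user code could look like, using the OpenMP 5.1 assumption directives (the function and data below are made up, and the exact placement we would key on is still to be decided):

  /* Sketch only - asserts that no nested 'target' (i.e. no reverse
     offload) occurs inside the target region.  */
  void
  saxpy (int n, float a, float *x, float *y)
  {
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp assume absent(target)
    {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }
  }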

* * *

Regarding the implementation: I left the 'usleep(1)' in for now – 1µs seems not too bad, and I have no idea what would be better.

I also have no idea what the overhead is for accessing concurrently accessible managed memory from the host (does it stay on the host until requested by the device – or does it always live on the device and need to be copied/migrated to the host for every access?). Likewise, I don't know how much overhead the D2H copy of the memory via the second CUDA stream adds.

Suggestions are welcome. But as this code is strictly confined to a single function, it can easily be modified later.
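
For reference, here is a small host-only analogue of the handshake/polling scheme, with two pthreads standing in for the device-side GOMP_target_ext and the plugin's polling loop (illustrative only; all names and values below are made up – the real code is in config/nvptx/target.c and plugin/plugin-nvptx.c of the attached patch):

  /* Illustrative sketch: the "device" thread fills the payload, publishes
     the request by writing 'fn' last and waits for it to be cleared; the
     "host" thread polls, handles the request and clears 'fn'.  */
  #include <inttypes.h>
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  struct rev_req
  {
    _Atomic uint64_t fn;     /* Non-zero: request pending; zero: done.  */
    uint64_t mapnum;         /* Payload; written before 'fn'.  */
    _Atomic uint32_t lock;   /* Serializes concurrent "device" callers.  */
  };

  static struct rev_req req;

  static void *
  device_side (void *arg __attribute__ ((unused)))
  {
    while (atomic_exchange (&req.lock, 1))
      ;                                  /* Spin: acquire the slot.  */
    req.mapnum = 3;                      /* Payload first ...  */
    atomic_store (&req.fn, 0x1234);      /* ... 'fn' last: publish.  */
    while (atomic_load (&req.fn) != 0)
      ;                                  /* Wait until handled on "host".  */
    atomic_store (&req.lock, 0);         /* Release the slot.  */
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t;
    pthread_create (&t, NULL, device_side, NULL);

    /* "Host" side: poll while the "kernel" (here: the thread) runs.  */
    for (;;)
      {
        uint64_t fn = atomic_load (&req.fn);
        if (fn != 0)
          {
            printf ("request: fn=%#" PRIx64 ", mapnum=%" PRIu64 "\n",
                    fn, req.mapnum);
            atomic_store (&req.fn, 0);   /* Signal completion.  */
            break;
          }
        usleep (1);                      /* Same interval as the patch.  */
      }

    pthread_join (t, NULL);
    return 0;
  }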

Documentation: I have not mentioned caveats in https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html as the reverse offload is not yet enabled, even with this patch.

On 26.08.22 11:07, Tobias Burnus wrote:

PRE-REMARK

As nvptx (and all other plugins) return <= 0 from
GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is set,
this patch is currently still a no-op.

The patch is almost stand-alone, except that it either needs a
  void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
  [Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
  https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).

While not required for compiling, the patch is based on the ideas/code from
the reverse-offload ME patch; the latter adds calls to
  GOMP_target_ext (omp_initial_device,
which for host-fallback code are processed by the normal GOMP_target_ext and
for device code by the GOMP_target_ext of this patch.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html

 * * *

This patch adds initial reverse-offload support for nvptx.
When the nvptx device-side GOMP_target_ext is called, it takes a lock,
fills a struct with the argument pointers (addrs, kinds, sizes) and its
device number, and finally sets the function-pointer address.

On the host side, the last-written address (fn) is checked – if fn_addr != NULL,
all arguments are passed on to the generic (target.c) gomp_target_rev
to do the actual reverse offload.

CUDA locks up when trying to copy data from the currently running
stream; hence, a new stream is created to do the memory copying.
Just having managed memory is not enough – it needs to be concurrently
accessible; otherwise, it will segfault on the host once it has been
migrated to the device.

OK for mainline?

 * * *

Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM,
  and possibly doing shorter sleeps than usleep(1)?
* Set a flag indicating whether there is any reverse-offload function at all,
  to avoid running the more expensive check when 'requires reverse_offload'
  is specified but no actual reverse-offload functions are present.
  (Recall that the '2/3' patch, mentioned above, only has fn != NULL for
  reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
  overhead for all target regions, and that reverse offload is run serialized.

And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-nvptx-v2.diff --]
[-- Type: text/x-patch, Size: 17981 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.
	(enum CUmemAttach_flags): New stub with only member
	CU_MEM_ATTACH_GLOBAL.
	(cuMemAllocManaged): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/target.c (GOMP_REV_OFFLOAD_VAR): #define as
	variable-name string and use it to define the variable.
	(GOMP_DEVICE_NUM_VAR): Declare this extern global var.
	(struct rev_offload): Define.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemAllocManaged): Add.
	* plugin/plugin-nvptx.c (GOMP_REV_OFFLOAD_VAR): #define var string.
	(struct rev_offload): New.
	(struct ptx_device): Add concurr_managed_access, rev_data
	and rev_data_dev.
	(nvptx_open_device): Set ptx_device's concurr_managed_access;
	'#if 0' unused async_engines.
	(GOMP_OFFLOAD_load_image): Allocate rev_data variable.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offloading.

 include/cuda/cuda.h               |   8 ++-
 libgomp/config/nvptx/icv-device.c |   2 +-
 libgomp/config/nvptx/target.c     |  52 ++++++++++++--
 libgomp/libgomp-plugin.c          |  12 ++++
 libgomp/libgomp-plugin.h          |   7 ++
 libgomp/libgomp.h                 |   5 ++
 libgomp/libgomp.map               |   5 ++
 libgomp/plugin/cuda-lib.def       |   1 +
 libgomp/plugin/plugin-nvptx.c     | 143 ++++++++++++++++++++++++++++++++++++--
 libgomp/target.c                  |  19 +++++
 10 files changed, 241 insertions(+), 13 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..08e496a 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,9 +77,14 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89
 } CUdevice_attribute;
 
+typedef enum {
+  CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
 enum {
   CU_EVENT_DEFAULT = 0,
   CU_EVENT_DISABLE_TIMING = 2
@@ -169,6 +174,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..2a3fd8f 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -26,7 +26,29 @@
 #include "libgomp.h"
 #include <limits.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
+		    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+    ;  /* spin  */
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..61359c7 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..1bd9ee2 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@
 #include <assert.h>
 #include <errno.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
    than this, we will free the block before attempting another GPU memory
@@ -274,6 +276,17 @@ struct targ_fn_descriptor
   int max_threads_per_block;
 };
 
+/* Reverse offload. Must match version used in config/nvptx/target.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
@@ -302,6 +315,7 @@ struct ptx_device
   bool map;
   bool concur;
   bool mkern;
+  bool concurr_managed_access;
   int mode;
   int clock_khz;
   int num_sms;
@@ -329,6 +343,9 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
+  CUdeviceptr rev_data_dev;
+
   struct ptx_device *next;
 };
 
@@ -423,7 +440,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,11 +536,17 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
 
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
+  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
@@ -1380,7 +1403,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1413,47 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if (rev_fn_table[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && dev->rev_data == NULL)
+    {
+      CUdeviceptr dp = 0;
+      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
+			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
+	  dp = (CUdeviceptr) dev->rev_data;
+	}
+      else
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
+			    sizeof (*dev->rev_data));
+	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
+			    sizeof (*dev->rev_data));
+	  dp = dev->rev_data_dev;
+	}
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2065,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2116,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  bool has_usm = (ptx_dev->concurr_managed_access
+		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2133,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+	if (!has_usm)
+	  {
+	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
+			      ptx_dev->rev_data_dev,
+			      sizeof (*ptx_dev->rev_data), copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	  }
+	if (ptx_dev->rev_data->fn != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    rev_data->fn = 0;
+	    if (!has_usm)
+	      {
+		/* fn is the first element. */
+		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
+				       ptx_dev->rev_data_dev,
+				       ptx_dev->rev_data,
+				       sizeof (ptx_dev->rev_data->fn),
+				       copy_stream);
+		if (r != CUDA_SUCCESS)
+		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	      }
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..9377de0 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void
