From: Tobias Burnus <tobias@codesourcery.com>
To: Jakub Jelinek <jakub@redhat.com>, Tom de Vries <tdevries@suse.de>,
gcc-patches <gcc-patches@gcc.gnu.org>
Cc: Alexander Monakov <amonakov@ispras.ru>
Subject: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
Date: Fri, 26 Aug 2022 11:07:28 +0200 [thread overview]
Message-ID: <57b3ae5e-8f15-8bea-fa09-39bccbaa2414@codesourcery.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 2878 bytes --]
@Tom and Alexander: Better suggestions are welcome for the busy loop in
libgomp/plugin/plugin-nvptx.c, regarding both where the variable is placed
and how its value is checked.
PRE-REMARK
As nvptx (and all other plugins) returns <= 0 for
GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is
set, this patch is currently still a no-op.
The patch is almost stand-alone, except that it either needs a
void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
[Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).
While not required for this patch to compile, the patch is based on the
ideas/code from the reverse-offload middle-end (ME) patch; the latter adds
calls to
GOMP_target_ext (omp_initial_device, ...),
which for host-fallback code are processed by the normal GOMP_target_ext and
for device code by the GOMP_target_ext of this patch.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html
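(For illustration only, not part of this patch: a minimal example of the
kind of user code that ends up on this path, assuming the ME patch above.
The function name 'foo' is hypothetical; the inner 'device(ancestor: 1)'
region is what the ME patch lowers to the GOMP_target_ext
(omp_initial_device, ...) call.)

  #pragma omp requires reverse_offload

  void
  foo (int *arr)  /* Hypothetical example function.  */
  {
    #pragma omp target map(tofrom: arr[0:16])
    {
      arr[0] += 1;  /* Runs on the device.  */
      #pragma omp target device(ancestor : 1) map(always, tofrom: arr[0:16])
        arr[1] += 1;  /* Reverse offload: runs back on the host.  */
    }
  }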
* * *
This patch adds initial reverse-offload support for nvptx.
When the nvptx device's GOMP_target_ext is called, it takes a lock,
fills a struct with the argument pointers (addrs, sizes, kinds) and its
device number, and then sets the function-pointer address.
On the host side, the last address is checked: if fn_addr != NULL,
all arguments are passed on to the generic (target.c) gomp_target_rev
to do the actual reverse offloading.
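In condensed form, the host-side loop looks roughly as follows; this is a
simplified sketch of GOMP_OFFLOAD_run from the attached patch (the full
version additionally handles the non-USM copy path and fatal errors):

  while (CUDA_CALL_NOCHECK (cuStreamQuery, NULL) == CUDA_ERROR_NOT_READY)
    {
      if (ptx_dev->rev_data->fn != 0)  /* A device thread posted a request.  */
        {
          struct rev_offload *rd = ptx_dev->rev_data;
          GOMP_PLUGIN_target_rev (rd->fn, rd->mapnum, rd->addrs, rd->sizes,
                                  rd->kinds, rd->dev_num,
                                  rev_off_dev_to_host_cpy,
                                  rev_off_host_to_dev_cpy, copy_stream);
          rd->fn = 0;  /* Signal completion back to the spinning device.  */
        }
      usleep (1);
    }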
CUDA locks up when trying to copy data from the currently running
stream; hence, a new stream is created to do the memory copying.
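Concretely, the request struct is fetched via that dedicated stream as in
this simplified excerpt from the attached patch:

  CUstream copy_stream = NULL;
  CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
  /* Copying on the stream that runs the kernel would lock up.  */
  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
                    ptx_dev->rev_data_dev, sizeof (*ptx_dev->rev_data),
                    copy_stream);
  CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);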
Just having managed memory is not enough: it needs to be concurrently
accessible; otherwise, it will segfault on the host once it has been
migrated to the device.
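The corresponding capability check and allocation choice, again in
simplified form from the attached patch:

  /* Use managed memory only if it is concurrently accessible from host
     and device; otherwise, fall back to pinned host memory plus a
     separate device mirror that is synchronized explicitly.  */
  int pi = 0;
  CUresult r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
                                  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS,
                                  dev);
  bool concurr_managed_access = (r == CUDA_SUCCESS ? pi : false);
  if (concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
    CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
                      sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
  else
    {
      CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
                        sizeof (*dev->rev_data));
      CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
                        sizeof (*dev->rev_data));
    }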
OK for mainline?
* * *
Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM,
and doing shorter sleeps than usleep(1)?
* Set a flag indicating whether there is any reverse-offload function at all,
to avoid running the more expensive check when 'requires reverse_offload'
is specified but no actual reverse-offload functions are present.
(Recall that the '2/3' patch, mentioned above, only has fn != NULL for
reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
overhead for all target regions, and that reverse offload runs serialized.
And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.
Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
[-- Attachment #2: rev-offload-run-nvptx.diff --]
[-- Type: text/x-patch, Size: 18161 bytes --]
libgomp/nvptx: Prepare for reverse-offload callback handling
This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offloaded GOMP_target_ext call
to the host by setting values in a struct on the device and querying it on
the host, invoking gomp_target_rev on the result.
include/ChangeLog:
* cuda/cuda.h (enum CUdevice_attribute): Add
CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.
(enum CUmemAttach_flags): New stub with only member
CU_MEM_ATTACH_GLOBAL.
(cuMemAllocManaged): Add prototype.
libgomp/ChangeLog:
* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
'static' for this variable.
* config/nvptx/target.c (GOMP_REV_OFFLOAD_VAR): #define as
variable-name string and use it to define the variable.
(GOMP_DEVICE_NUM_VAR): Declare this extern global var.
(struct rev_offload): Define.
(GOMP_target_ext): Handle reverse offload.
* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
* target.c (gomp_target_rev): ... this new stub function.
* libgomp.h (gomp_target_rev): Declare.
* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
* plugin/cuda-lib.def (cuMemAllocManaged): Add.
* plugin/plugin-nvptx.c (GOMP_REV_OFFLOAD_VAR): #define var string.
(struct rev_offload): New.
(struct ptx_device): Add concurr_managed_access, rev_data
and rev_data_dev.
(nvptx_open_device): Set ptx_device's concurr_managed_access;
'#if 0' unused async_engines.
(GOMP_OFFLOAD_load_image): Allocate rev_data variable.
(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
(GOMP_OFFLOAD_run): Handle reverse offloading.
include/cuda/cuda.h | 8 ++-
libgomp/config/nvptx/icv-device.c | 2 +-
libgomp/config/nvptx/target.c | 52 ++++++++++++--
libgomp/libgomp-plugin.c | 12 ++++
libgomp/libgomp-plugin.h | 7 ++
libgomp/libgomp.h | 5 ++
libgomp/libgomp.map | 5 ++
libgomp/plugin/cuda-lib.def | 1 +
libgomp/plugin/plugin-nvptx.c | 148 +++++++++++++++++++++++++++++++++++++-
libgomp/target.c | 18 +++++
10 files changed, 246 insertions(+), 12 deletions(-)
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05d150..08e496a2e98 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,9 +77,14 @@ typedef enum {
CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
- CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+ CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+ CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89
} CUdevice_attribute;
+typedef enum {
+ CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
enum {
CU_EVENT_DEFAULT = 0,
CU_EVENT_DISABLE_TIMING = 2
@@ -169,6 +174,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
CUresult cuMemAlloc (CUdeviceptr *, size_t);
#define cuMemAllocHost cuMemAllocHost_v2
CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
#define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index faf90f9947c..f4f18cdac5e 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -60,7 +60,7 @@ omp_is_initial_device (void)
/* This is set to the device number of current GPU during device initialization,
when the offload image containing this libgomp portion is loaded. */
-static volatile int GOMP_DEVICE_NUM_VAR;
+volatile int GOMP_DEVICE_NUM_VAR;
int
omp_get_device_num (void)
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d20e15..06f6cd8b611 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -26,7 +26,29 @@
#include "libgomp.h"
#include <limits.h>
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
+struct rev_offload {
+ uint64_t fn;
+ uint64_t mapnum;
+ uint64_t addrs;
+ uint64_t sizes;
+ uint64_t kinds;
+ int32_t dev_num;
+ uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+ || __SIZEOF_SIZE_T__ != 8 \
+ || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+
extern int __gomp_team_num __attribute__((shared));
+extern volatile int GOMP_DEVICE_NUM_VAR;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
bool
GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
void **hostaddrs, size_t *sizes, unsigned short *kinds,
unsigned int flags, void **depend, void **args)
{
- (void) device;
- (void) fn;
- (void) mapnum;
- (void) hostaddrs;
- (void) sizes;
- (void) kinds;
(void) flags;
(void) depend;
(void) args;
- __builtin_unreachable ();
+
+ if (device != GOMP_DEVICE_HOST_FALLBACK
+ || fn == NULL
+ || GOMP_REV_OFFLOAD_VAR == NULL)
+ return;
+
+ while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+ ; /* spin */
+
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num, GOMP_DEVICE_NUM_VAR,
+ __ATOMIC_SEQ_CST);
+
+ /* 'fn' must be last. */
+ __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
+
+ /* Processed on the host - when done, fn is set to NULL. */
+ while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+ ; /* spin */
+ __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
}
void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc623a10..316de749f69 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
gomp_vfatal (msg, ap);
va_end (ap);
}
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+ uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+ void (*dev_to_host_cpy) (void *, const void *, size_t,
+ void *),
+ void (*host_to_dev_cpy) (void *, const void *, size_t,
+ void *), void *token)
+{
+ gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+ dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index ab3ed638475..40dfb52e44e 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
extern void GOMP_PLUGIN_fatal (const char *, ...)
__attribute__ ((noreturn, format (printf, 1, 2)));
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+ uint64_t, int,
+ void (*) (void *, const void *, size_t,
+ void *),
+ void (*) (void *, const void *, size_t,
+ void *), void *);
+
/* Prototypes for functions implemented by libgomp plugins. */
extern const char *GOMP_OFFLOAD_get_name (void);
extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c243c4d6cf4..bbab5b4b0af 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1014,6 +1014,11 @@ extern int gomp_pause_host (void);
extern void gomp_init_targets_once (void);
extern int gomp_get_num_devices (void);
extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+ int,
+ void (*) (void *, const void *, size_t, void *),
+ void (*) (void *, const void *, size_t, void *),
+ void *);
/* Splay tree definitions. */
typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10f3e1..12f76f7e48f 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
GOMP_PLUGIN_goacc_profiling_dispatch;
GOMP_PLUGIN_goacc_thread;
} GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+ global:
+ GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39b1d2..61359c7e74e 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
CUDA_ONE_CALL (cuLinkDestroy)
CUDA_ONE_CALL (cuMemAlloc)
CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemAllocManaged)
CUDA_ONE_CALL (cuMemcpy)
CUDA_ONE_CALL (cuMemcpyDtoDAsync)
CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bc63e274cdf..7ab9421b060 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@
#include <assert.h>
#include <errno.h>
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
/* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
block to cache between kernel invocations. For soft-stacks blocks bigger
than this, we will free the block before attempting another GPU memory
@@ -274,6 +276,17 @@ struct targ_fn_descriptor
int max_threads_per_block;
};
+/* Reverse offload. Must match version used in config/nvptx/target.c. */
+struct rev_offload {
+ uint64_t fn;
+ uint64_t mapnum;
+ uint64_t addrs;
+ uint64_t sizes;
+ uint64_t kinds;
+ int32_t dev_num;
+ uint32_t lock;
+};
+
/* A loaded PTX image. */
struct ptx_image_data
{
@@ -302,6 +315,7 @@ struct ptx_device
bool map;
bool concur;
bool mkern;
+ bool concurr_managed_access;
int mode;
int clock_khz;
int num_sms;
@@ -329,6 +343,9 @@ struct ptx_device
pthread_mutex_t lock;
} omp_stacks;
+ struct rev_offload *rev_data;
+ CUdeviceptr rev_data_dev;
+
struct ptx_device *next;
};
@@ -423,7 +440,7 @@ nvptx_open_device (int n)
struct ptx_device *ptx_dev;
CUdevice dev, ctx_dev;
CUresult r;
- int async_engines, pi;
+ int pi;
CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
@@ -519,11 +536,17 @@ nvptx_open_device (int n)
CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
ptx_dev->max_threads_per_multiprocessor = pi;
+#if 0
+ int async_engines;
r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
if (r != CUDA_SUCCESS)
async_engines = 1;
+#endif
+ r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+ CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
+ ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
for (int i = 0; i != GOMP_DIM_MAX; i++)
ptx_dev->default_dims[i] = 0;
@@ -1313,6 +1336,38 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
* fn_entries);
+ if (rev_fn_table && dev->rev_data == NULL)
+ {
+ CUdeviceptr dp = 0;
+ if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
+ {
+ CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
+ sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
+ dp = (CUdeviceptr) dev->rev_data;
+ }
+ else
+ {
+ CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
+ sizeof (*dev->rev_data));
+ memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
+ CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
+ sizeof (*dev->rev_data));
+ dp = dev->rev_data_dev;
+ }
+ CUdeviceptr device_rev_offload_var;
+ size_t device_rev_offload_size;
+ CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+ &device_rev_offload_var,
+ &device_rev_offload_size, module,
+ XSTRING (GOMP_REV_OFFLOAD_VAR));
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+ r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+ sizeof (dp));
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+ }
+
*target_table = targ_tbl;
new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
@@ -1373,6 +1428,22 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
targ_tbl->start = targ_tbl->end = 0;
targ_tbl++;
+ if (rev_fn_table)
+ {
+ CUdeviceptr var;
+ size_t bytes;
+ r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
+ "$offload_func_table");
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+ assert (bytes == sizeof (uint64_t) * fn_entries);
+ *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
+ r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+ }
+
+
nvptx_set_clocktick (module, dev);
return fn_entries + var_entries + other_entries;
@@ -1982,6 +2053,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
return (void *) ptx_dev->omp_stacks.ptr;
}
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+ CUstream stream)
+{
+ CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+ CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+ CUstream stream)
+{
+ CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+ CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
void
GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
{
@@ -2016,6 +2104,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
size_t stack_size = nvptx_stacks_size ();
+ bool reverse_off = ptx_dev->rev_data != NULL;
+ bool has_usm = (ptx_dev->concurr_managed_access
+ && CUDA_CALL_EXISTS (cuMemAllocManaged));
+ CUstream copy_stream = NULL;
pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2029,12 +2121,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
GOMP_PLUGIN_debug (0, " %s: kernel %s: launch"
" [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
__FUNCTION__, fn_name, teams, threads);
+ if (reverse_off)
+ CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
32, threads, 1, 0, NULL, NULL, config);
if (r != CUDA_SUCCESS)
GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
- r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+ if (reverse_off)
+ while (true)
+ {
+ r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+ if (r == CUDA_SUCCESS)
+ break;
+ if (r == CUDA_ERROR_LAUNCH_FAILED)
+ GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+ maybe_abort_msg);
+ else if (r != CUDA_ERROR_NOT_READY)
+ GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+ if (!has_usm)
+ {
+ CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
+ ptx_dev->rev_data_dev,
+ sizeof (*ptx_dev->rev_data), copy_stream);
+ CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+ }
+ if (ptx_dev->rev_data->fn != 0)
+ {
+ struct rev_offload *rev_data = ptx_dev->rev_data;
+ uint64_t fn_ptr = rev_data->fn;
+ uint64_t mapnum = rev_data->mapnum;
+ uint64_t addr_ptr = rev_data->addrs;
+ uint64_t sizes_ptr = rev_data->sizes;
+ uint64_t kinds_ptr = rev_data->kinds;
+ int dev_num = (int) rev_data->dev_num;
+ GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+ kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+ rev_off_host_to_dev_cpy, copy_stream);
+ rev_data->fn = 0;
+ if (!has_usm)
+ {
+ /* fn is the first element. */
+ r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
+ ptx_dev->rev_data_dev,
+ ptx_dev->rev_data,
+ sizeof (ptx_dev->rev_data->fn),
+ copy_stream);
+ if (r != CUDA_SUCCESS)
+ GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+ CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+ }
+ }
+ usleep (1);
+ }
+ else
+ r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+ if (reverse_off)
+ CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
if (r == CUDA_ERROR_LAUNCH_FAILED)
GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 135db1d88ab..0c6fad690f1 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2856,6 +2856,24 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
htab_free (refcount_set);
}
+/* Handle reverse offload. This is not called for the host. */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+ uint64_t mapnum __attribute__ ((unused)),
+ uint64_t devaddrs_ptr __attribute__ ((unused)),
+ uint64_t sizes_ptr __attribute__ ((unused)),
+ uint64_t kinds_ptr __attribute__ ((unused)),
+ int dev_num __attribute__ ((unused)),
+ void (*dev_to_host_cpy) (void *, const void *, size_t,
+ void *) __attribute__ ((unused)),
+ void (*host_to_dev_cpy) (void *, const void *, size_t,
+ void *) __attribute__ ((unused)),
+ void *token __attribute__ ((unused)))
+{
+ __builtin_unreachable ();
+}
+
/* Host fallback for GOMP_target_data{,_ext} routines. */
static void