public inbox for gcc-patches@gcc.gnu.org
* [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
@ 2022-08-26  9:07 Tobias Burnus
  2022-08-26  9:07 ` Tobias Burnus
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Tobias Burnus @ 2022-08-26  9:07 UTC (permalink / raw)
  To: Jakub Jelinek, Tom de Vries, gcc-patches; +Cc: Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 2878 bytes --]

@Tom and Alexander: Better suggestions are welcome for the busy loop in
libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
its value.


PRE-REMARK

As nvptx (and all other plugins) returns <= 0 for
GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is
set, this patch is currently still a no-op.

The patch is almost stand-alone, except that it either needs a
  void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
  [Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
  https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).

While not required for this patch to compile, this patch is based on the
ideas/code from the reverse-offload ME patch; the latter adds calls to
  GOMP_target_ext (omp_initial_device,
which for host-fallback code are processed by the normal GOMP_target_ext
and for device code by the GOMP_target_ext of this patch.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html

 * * *

This patch adds initial reverse-offload support for nvptx.
When the device-side GOMP_target_ext is called, it takes a lock,
fills a struct with the argument pointers (addrs, sizes, kinds) and its
device number, and then sets the function-pointer address.
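
In condensed form, the device side does roughly this (a sketch based on
the config/nvptx/target.c changes in the attached patch; the atomic
stores of the full version are elided here):

  /* Acquire the lock, publish the request, wait until the host is done.  */
  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, 1))
    ;  /* spin  */
  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_DEVICE_NUM_VAR;
  /* 'fn' is written last; a non-zero value signals the host.  */
  GOMP_REV_OFFLOAD_VAR->fn = (uint64_t) fn;
  while (GOMP_REV_OFFLOAD_VAR->fn != 0)
    ;  /* spin - the host clears 'fn' once the region has run.  */
  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);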

On the host side, the last-written address is checked: if fn_addr != NULL,
all arguments are passed on to the generic (target.c) gomp_target_rev
to do the actual offloading.
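
In other words, the polling loop in the plugin's GOMP_OFFLOAD_run amounts
to (a sketch; error handling and the extra device-to-host copy needed
without concurrently accessible managed memory are omitted - see the
attached patch for the full version):

  /* Poll while the kernel runs; a non-zero 'fn' signals a request.  */
  while (cuStreamQuery (NULL) == CUDA_ERROR_NOT_READY)
    {
      if (rev_data->fn != 0)
        {
          GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                  rev_data->addrs, rev_data->sizes,
                                  rev_data->kinds, (int) rev_data->dev_num,
                                  rev_off_dev_to_host_cpy,
                                  rev_off_host_to_dev_cpy, copy_stream);
          rev_data->fn = 0;  /* Tell the device it can continue.  */
        }
      usleep (1);
    }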

CUDA locks up when trying to copy data from the currently running
stream; hence, a new stream is created to do the memory copying.
Just having managed memory is not enough: it needs to be concurrently
accessible; otherwise, it will segfault on the host once the memory has
been migrated to the device.
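
Hence, the plugin sets things up roughly as follows (a sketch; the
fallback via cuMemAllocHost + cuMemAlloc for devices without concurrent
managed access is in the attached patch):

  /* One struct, accessible from host and device while the kernel runs.  */
  CUDA_CALL_ASSERT (cuMemAllocManaged, (CUdeviceptr *) &dev->rev_data,
                    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
  /* A second stream, so copying works while the main stream is busy.  */
  CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);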

OK for mainline?

 * * *

Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM,
  and possibly doing shorter sleeps than usleep(1)?
* Set a flag whether there is any reverse-offload function at all, avoiding
  the more expensive check if there is 'requires reverse_offload' without
  actual reverse-offload functions present.
  (Recall that the '2/3' patch, mentioned above, only has fn != NULL for
  reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
  overhead for all target regions, and that reverse offload is run serialized.

And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.

Tobias



[-- Attachment #2: rev-offload-run-nvptx.diff --]
[-- Type: text/x-patch, Size: 18161 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.
	(enum CUmemAttach_flags): New stub with only member
	CU_MEM_ATTACH_GLOBAL.
	(cuMemAllocManaged): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/target.c (GOMP_REV_OFFLOAD_VAR): #define as
	variable-name string and use it to define the variable.
	(GOMP_DEVICE_NUM_VAR): Declare this extern global var.
	(struct rev_offload): Define.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemAllocManaged): Add.
	* plugin/plugin-nvptx.c (GOMP_REV_OFFLOAD_VAR): #define var string.
	(struct rev_offload): New.
	(struct ptx_device): Add concurr_managed_access, rev_data
	and rev_data_dev.
	(nvptx_open_device): Set ptx_device's concurr_managed_access;
	'#if 0' unused async_engines.
	(GOMP_OFFLOAD_load_image): Allocate rev_data variable.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offloading.

 include/cuda/cuda.h               |   8 ++-
 libgomp/config/nvptx/icv-device.c |   2 +-
 libgomp/config/nvptx/target.c     |  52 ++++++++++++--
 libgomp/libgomp-plugin.c          |  12 ++++
 libgomp/libgomp-plugin.h          |   7 ++
 libgomp/libgomp.h                 |   5 ++
 libgomp/libgomp.map               |   5 ++
 libgomp/plugin/cuda-lib.def       |   1 +
 libgomp/plugin/plugin-nvptx.c     | 148 +++++++++++++++++++++++++++++++++++++-
 libgomp/target.c                  |  18 +++++
 10 files changed, 246 insertions(+), 12 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05d150..08e496a2e98 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,9 +77,14 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89
 } CUdevice_attribute;
 
+typedef enum {
+  CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
 enum {
   CU_EVENT_DEFAULT = 0,
   CU_EVENT_DISABLE_TIMING = 2
@@ -169,6 +174,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index faf90f9947c..f4f18cdac5e 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -60,7 +60,7 @@ omp_is_initial_device (void)
 
 /* This is set to the device number of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile int GOMP_DEVICE_NUM_VAR;
+volatile int GOMP_DEVICE_NUM_VAR;
 
 int
 omp_get_device_num (void)
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d20e15..06f6cd8b611 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -26,7 +26,29 @@
 #include "libgomp.h"
 #include <limits.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+
 extern int __gomp_team_num __attribute__((shared));
+extern volatile int GOMP_DEVICE_NUM_VAR;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num, GOMP_DEVICE_NUM_VAR,
+		    __ATOMIC_SEQ_CST);
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+    ;  /* spin  */
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc623a10..316de749f69 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index ab3ed638475..40dfb52e44e 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c243c4d6cf4..bbab5b4b0af 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1014,6 +1014,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10f3e1..12f76f7e48f 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39b1d2..61359c7e74e 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bc63e274cdf..7ab9421b060 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@
 #include <assert.h>
 #include <errno.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
    than this, we will free the block before attempting another GPU memory
@@ -274,6 +276,17 @@ struct targ_fn_descriptor
   int max_threads_per_block;
 };
 
+/* Reverse offload. Must match version used in config/nvptx/target.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
@@ -302,6 +315,7 @@ struct ptx_device
   bool map;
   bool concur;
   bool mkern;
+  bool concurr_managed_access;
   int mode;
   int clock_khz;
   int num_sms;
@@ -329,6 +343,9 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
+  CUdeviceptr rev_data_dev;
+
   struct ptx_device *next;
 };
 
@@ -423,7 +440,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,11 +536,17 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
 
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
+  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
@@ -1313,6 +1336,38 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   targ_fns = GOMP_PLUGIN_malloc (sizeof (struct targ_fn_descriptor)
 				 * fn_entries);
 
+  if (rev_fn_table && dev->rev_data == NULL)
+    {
+      CUdeviceptr dp = 0;
+      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
+			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
+	  dp = (CUdeviceptr) dev->rev_data;
+	}
+      else
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
+			    sizeof (*dev->rev_data));
+	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
+			    sizeof (*dev->rev_data));
+	  dp = dev->rev_data_dev;
+	}
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+    }
+
   *target_table = targ_tbl;
 
   new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
@@ -1373,6 +1428,22 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
     targ_tbl->start = targ_tbl->end = 0;
   targ_tbl++;
 
+  if (rev_fn_table)
+    {
+      CUdeviceptr var;
+      size_t bytes;
+      r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
+			     "$offload_func_table");
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      assert (bytes == sizeof (uint64_t) * fn_entries);
+      *rev_fn_table = GOMP_PLUGIN_malloc (sizeof (uint64_t) * fn_entries);
+      r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+    }
+
+
   nvptx_set_clocktick (module, dev);
 
   return fn_entries + var_entries + other_entries;
@@ -1982,6 +2053,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2016,6 +2104,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  bool has_usm = (ptx_dev->concurr_managed_access
+		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2029,12 +2121,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+	if (!has_usm)
+	  {
+	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
+			      ptx_dev->rev_data_dev,
+			      sizeof (*ptx_dev->rev_data), copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	  }
+	if (ptx_dev->rev_data->fn != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    rev_data->fn = 0;
+	    if (!has_usm)
+	      {
+		/* fn is the first element. */
+		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
+				       ptx_dev->rev_data_dev,
+				       ptx_dev->rev_data,
+				       sizeof (ptx_dev->rev_data->fn),
+				       copy_stream);
+		if (r != CUDA_SUCCESS)
+		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	      }
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 135db1d88ab..0c6fad690f1 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2856,6 +2856,24 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is not called for the host. */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
  2022-08-26  9:07 ` Tobias Burnus
@ 2022-08-26 14:56 ` Alexander Monakov
  2022-09-09 15:49   ` Jakub Jelinek
  2022-09-09 15:51 ` Jakub Jelinek
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 31+ messages in thread
From: Alexander Monakov @ 2022-08-26 14:56 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches


On Fri, 26 Aug 2022, Tobias Burnus wrote:

> @Tom and Alexander: Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

I think that, to do this without polling, you can use the PTX 'brkpt'
instruction on the device and the CUDA Debugger API on the host (but you'd
have to be careful about interactions with a real debugger).

What did the standardization process for this feature look like; how did it
pass if it's not efficiently implementable for the major offloading targets?

Alexander

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-08-26 14:56 ` Alexander Monakov
@ 2022-09-09 15:49   ` Jakub Jelinek
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Jelinek @ 2022-09-09 15:49 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Tobias Burnus, gcc-patches

On Fri, Aug 26, 2022 at 05:56:09PM +0300, Alexander Monakov via Gcc-patches wrote:
> 
> On Fri, 26 Aug 2022, Tobias Burnus wrote:
> 
> > @Tom and Alexander: Better suggestions are welcome for the busy loop in
> > libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> > its value.
> 
> I think to do that without polling you can use PTX 'brkpt' instruction on the
> device and CUDA Debugger API on the host (but you'd have to be careful about
> interactions with the real debugger).
> 
> What did the standardization process for this feature look like; how did it
> pass if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets; it is
enough when it works on some.  As one needs to request reverse
offloading through a declarative directive, it is always possible in that
case to just pretend that devices which don't support it don't exist.

But it would be really nice to support it even on PTX.

Are there any other implementations of reverse offloading to PTX already?

	Jakub


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
  2022-08-26  9:07 ` Tobias Burnus
  2022-08-26 14:56 ` Alexander Monakov
@ 2022-09-09 15:51 ` Jakub Jelinek
  2022-09-13  7:07 ` Tobias Burnus
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Jakub Jelinek @ 2022-09-09 15:51 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Tom de Vries, gcc-patches, Alexander Monakov

On Fri, Aug 26, 2022 at 11:07:28AM +0200, Tobias Burnus wrote:
> @Tom and Alexander: Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

I'm afraid you need Alexander or Tom here, I don't feel I can review it;
I could rubber stamp it if they are ok with it.

	Jakub


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
                   ` (2 preceding siblings ...)
  2022-09-09 15:51 ` Jakub Jelinek
@ 2022-09-13  7:07 ` Tobias Burnus
  2022-09-21 20:06   ` Alexander Monakov
  2023-03-21 15:53 ` libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] " Thomas Schwinge
  2023-04-04 14:40 ` [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Thomas Schwinge
  5 siblings, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2022-09-13  7:07 UTC (permalink / raw)
  To: Jakub Jelinek, Tom de Vries, gcc-patches; +Cc: Alexander Monakov


[-- Attachment #1.1: Type: text/plain, Size: 5512 bytes --]

@Alexander/@Tom – Can you comment on both libgomp/config/nvptx and libgomp/plugin/plugin-nvptx.c? (Comments on the rest are welcome, too.)

(Updated patch enclosed)

Because Jakub asked:

I'm afraid you need Alexander or Tom here, I don't feel I can review it;
I could rubber stamp it if they are ok with it.


Regarding:

What did the standardization process for this feature look like; how did it
pass if it's not efficiently implementable for the major offloading targets?

It doesn't have to be implementable on all major offloading targets; it is enough when it works on some. As one needs to request reverse offloading through a declarative directive, it is always possible in that case to just pretend that devices which don't support it don't exist.

First, I think it is better to provide a feature, even if it is slow, than not to provide it at all. Secondly, as Jakub mentioned, it is not required that all devices support this feature well; it is sufficient that some do.

I believe one of the main uses is debugging, and for this use the performance is not critical. This patch attempts to have no overhead if the feature is not used (no 'omp requires reverse_offload' and no actual reverse offload).

Additionally, for GCN, it can be implemented with almost no overhead by using the mechanism already used for I/O. (CUDA implements 'printf' internally, but does not permit piggybacking on this mechanism.)

* * *

I think in the future, we could additionally pass information to GOMP_target_ext on whether a target region is known not to do reverse offload – both by checking what's in the region and by utilizing an 'absent(target)' assumption placed on the outer target region via an '(begin) assume(s)' directive (see the sketch below). That should at least help with the common case of having no reverse offload – even if it does not help for some large kernel which does use reverse offload for a non-debugging purpose (e.g. to trigger file I/O or inter-node communication).
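
A hypothetical sketch of the latter, using the OpenMP 5.1 assumption
directives (not part of this patch; 'compute' is a placeholder for
arbitrary device code):

  #pragma omp target
  {
    #pragma omp assume absent(target)
    {
      /* The implementation may assume that no 'target' construct -
         i.e. no reverse offload - is encountered in this region.  */
      compute ();
    }
  }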

* * *

Regarding the implementation: I left in 'usleep(1)' for now – 1µs seems not too bad, and I have no idea what would be better.

I also don't have any idea what the overhead is for accessing concurrently accessible managed memory from the host (does it stay on the host until requested by the device – or is it always on the device, such that it needs to be copied/migrated to the host for every access?). Likewise, I don't know how much overhead the D2H copy of the memory via the second CUDA stream causes.

Suggestions are welcome. But as this code is strictly confined to a single function, it can easily be modified later.

Documentation: I have not mentioned caveats in https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html as the reverse offload is not yet enabled, even with this patch.

On 26.08.22 11:07, Tobias Burnus wrote:

PRE-REMARK

As nvptx (and all other plugins) returns <= 0 for
GOMP_OFFLOAD_get_num_devices if GOMP_REQUIRES_REVERSE_OFFLOAD is
set, this patch is currently still a no-op.

The patch is almost stand-alone, except that it either needs a
  void *rev_fn_table = NULL;
in GOMP_OFFLOAD_load_image or the following patch:
  [Patch][2/3] nvptx: libgomp+mkoffload.cc: Prepare for reverse offload fn lookup
  https://gcc.gnu.org/pipermail/gcc-patches/2022-August/600348.html
(which in turn needs the '[1/3]' patch).

While not required for this patch to compile, this patch is based on the
ideas/code from the reverse-offload ME patch; the latter adds calls to
  GOMP_target_ext (omp_initial_device,
which for host-fallback code are processed by the normal GOMP_target_ext
and for device code by the GOMP_target_ext of this patch.
→ "[Patch] OpenMP: Support reverse offload (middle end part)"
  https://gcc.gnu.org/pipermail/gcc-patches/2022-July/598662.html

 * * *

This patch adds initial reverse-offload support for nvptx.
When the device-side GOMP_target_ext is called, it takes a lock,
fills a struct with the argument pointers (addrs, sizes, kinds) and its
device number, and then sets the function-pointer address.

On the host side, the last-written address is checked: if fn_addr != NULL,
all arguments are passed on to the generic (target.c) gomp_target_rev
to do the actual offloading.

CUDA locks up when trying to copy data from the currently running
stream; hence, a new stream is created to do the memory copying.
Just having managed memory is not enough: it needs to be concurrently
accessible; otherwise, it will segfault on the host once the memory has
been migrated to the device.

OK for mainline?

 * * *

Future work for nvptx:
* Adjust the 'sleep', possibly using different values with and without USM,
  and possibly doing shorter sleeps than usleep(1)?
* Set a flag whether there is any reverse-offload function at all, avoiding
  the more expensive check if there is 'requires reverse_offload' without
  actual reverse-offload functions present.
  (Recall that the '2/3' patch, mentioned above, only has fn != NULL for
  reverse-offload functions.)
* Document in libgomp.texi that reverse offload may cause some performance
  overhead for all target regions, and that reverse offload is run serialized.

And obviously: submitting the missing bits to get reverse offload working,
but that's mostly not an nvptx topic.



[-- Attachment #2: rev-offload-run-nvptx-v2.diff --]
[-- Type: text/x-patch, Size: 17981 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS.
	(enum CUmemAttach_flags): New stub with only member
	CU_MEM_ATTACH_GLOBAL.
	(cuMemAllocManaged): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/target.c (GOMP_REV_OFFLOAD_VAR): #define as
	variable-name string and use it to define the variable.
	(GOMP_DEVICE_NUM_VAR): Declare this extern global var.
	(struct rev_offload): Define.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemAllocManaged): Add.
	* plugin/plugin-nvptx.c (GOMP_REV_OFFLOAD_VAR): #define var string.
	(struct rev_offload): New.
	(struct ptx_device): Add concurr_managed_access, rev_data
	and rev_data_dev.
	(nvptx_open_device): Set ptx_device's concurr_managed_access;
	'#if 0' unused async_engines.
	(GOMP_OFFLOAD_load_image): Allocate rev_data variable.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offloading.

 include/cuda/cuda.h               |   8 ++-
 libgomp/config/nvptx/icv-device.c |   2 +-
 libgomp/config/nvptx/target.c     |  52 ++++++++++++--
 libgomp/libgomp-plugin.c          |  12 ++++
 libgomp/libgomp-plugin.h          |   7 ++
 libgomp/libgomp.h                 |   5 ++
 libgomp/libgomp.map               |   5 ++
 libgomp/plugin/cuda-lib.def       |   1 +
 libgomp/plugin/plugin-nvptx.c     | 143 ++++++++++++++++++++++++++++++++++++--
 libgomp/target.c                  |  19 +++++
 10 files changed, 241 insertions(+), 13 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..08e496a 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,9 +77,14 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
-  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
+  CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82,
+  CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS = 89
 } CUdevice_attribute;
 
+typedef enum {
+  CU_MEM_ATTACH_GLOBAL = 0x1
+} CUmemAttach_flags;
+
 enum {
   CU_EVENT_DEFAULT = 0,
   CU_EVENT_DISABLE_TIMING = 2
@@ -169,6 +174,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemAllocManaged (CUdeviceptr *, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..2a3fd8f 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -26,7 +26,29 @@
 #include "libgomp.h"
 #include <limits.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
+		    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+    ;  /* spin  */
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..61359c7 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemAllocManaged)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..1bd9ee2 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -54,6 +54,8 @@
 #include <assert.h>
 #include <errno.h>
 
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
    than this, we will free the block before attempting another GPU memory
@@ -274,6 +276,17 @@ struct targ_fn_descriptor
   int max_threads_per_block;
 };
 
+/* Reverse offload. Must match version used in config/nvptx/target.c. */
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
@@ -302,6 +315,7 @@ struct ptx_device
   bool map;
   bool concur;
   bool mkern;
+  bool concurr_managed_access;
   int mode;
   int clock_khz;
   int num_sms;
@@ -329,6 +343,9 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
+  CUdeviceptr rev_data_dev;
+
   struct ptx_device *next;
 };
 
@@ -423,7 +440,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,11 +536,17 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
 
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
+  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
 
@@ -1380,7 +1403,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1413,47 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && dev->rev_data == NULL)
+    {
+      CUdeviceptr dp = 0;
+      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
+			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
+	  dp = (CUdeviceptr) dev->rev_data;
+	}
+      else
+	{
+	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
+			    sizeof (*dev->rev_data));
+	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
+			    sizeof (*dev->rev_data));
+	  dp = dev->rev_data_dev;
+	}
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2065,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2116,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  bool has_usm = (ptx_dev->concurr_managed_access
+		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2133,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+	if (!has_usm)
+	  {
+	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
+			      ptx_dev->rev_data_dev,
+			      sizeof (*ptx_dev->rev_data), copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	  }
+	if (ptx_dev->rev_data->fn != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    rev_data->fn = 0;
+	    if (!has_usm)
+	      {
+		/* fn is the first element. */
+		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
+				       ptx_dev->rev_data_dev,
+				       ptx_dev->rev_data,
+				       sizeof (ptx_dev->rev_data->fn),
+				       copy_stream);
+		if (r != CUDA_SUCCESS)
+		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
+		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	      }
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..9377de0 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void


* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-13  7:07 ` Tobias Burnus
@ 2022-09-21 20:06   ` Alexander Monakov
  2022-09-26 15:07     ` Tobias Burnus
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Monakov @ 2022-09-21 20:06 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 16917 bytes --]


Hi.

At a high level, I'd be highly uncomfortable with this. I guess we are in
vague agreement that it cannot be efficiently implemented. It also goes
against the good practice of accelerator programming, which requires queueing
work on the accelerator and letting it run asynchronously with the CPU with high
occupancy.

(I know libgomp still waits for the GPU to finish in each GOMP_OFFLOAD_run,
but maybe it's better to improve *that* instead of piling on new slowness)

What I said above also applies to MPI+GPU scenarios: a well-designed algorithm
should arrange for MPI communications to happen in parallel with some useful
offloaded calculations. I don't see the value in implementing the ability to
invoke an MPI call from the accelerator in such an inefficient fashion.

(so yes, I disagree with "it is better to provide a feature even if it is slow –
than not providing it at all", when it is advertised as a general-purpose
feature, not a purely debugging helper)


On to the patch itself. IIRC one of the questions was use of CUDA managed
memory. I think it is unsafe because device-issued atomics are not guaranteed
to appear atomic to the host, unless compiling for compute capability 6.0 or
above, and using system-scope atomics ("atom.sys").

And for non-USM code path you're relying on cudaMemcpy observing device-side
atomics in the right order.

Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
structure. Did you rule it out for some reason?

Some remarks on the diff below, not intended to be a complete review.

Alexander


> --- a/libgomp/config/nvptx/target.c
> +++ b/libgomp/config/nvptx/target.c
> @@ -26,7 +26,29 @@
>  #include "libgomp.h"
>  #include <limits.h>
>  
> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var

Shouldn't this be in a header (needs to be in sync with the plugin).

> +
> +/* Reverse offload. Must match version used in plugin/plugin-nvptx.c. */
> +struct rev_offload {
> +  uint64_t fn;
> +  uint64_t mapnum;
> +  uint64_t addrs;
> +  uint64_t sizes;
> +  uint64_t kinds;
> +  int32_t dev_num;
> +  uint32_t lock;
> +};

Likewise.

> +
> +#if (__SIZEOF_SHORT__ != 2 \
> +     || __SIZEOF_SIZE_T__ != 8 \
> +     || __SIZEOF_POINTER__ != 8)
> +#error "Data-type conversion required for rev_offload"
> +#endif

Huh? This is not a requirement that is new for reverse offload; it has always
been like that for offloading (all ABI rules regarding type sizes, struct
layout, bitfield layout, endianness must match).

> +
> +
>  extern int __gomp_team_num __attribute__((shared));
> +extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
> +volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
>  
>  bool
>  GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
> @@ -88,16 +110,32 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>  		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
>  		 unsigned int flags, void **depend, void **args)
>  {
> -  (void) device;
> -  (void) fn;
> -  (void) mapnum;
> -  (void) hostaddrs;
> -  (void) sizes;
> -  (void) kinds;
>    (void) flags;
>    (void) depend;
>    (void) args;
> -  __builtin_unreachable ();
> +
> +  if (device != GOMP_DEVICE_HOST_FALLBACK
> +      || fn == NULL
> +      || GOMP_REV_OFFLOAD_VAR == NULL)
> +    return;

Shouldn't this be an 'assert' instead?

> +
> +  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
> +    ;  /* spin  */
> +
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->mapnum, mapnum, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->addrs, hostaddrs, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->sizes, sizes, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->kinds, kinds, __ATOMIC_SEQ_CST);
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
> +		    GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);

Looks like all these can be plain stores, you only need ...

> +
> +  /* 'fn' must be last.  */
> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);

... this to be atomic with 'release' semantics in the usual producer-consumer
pattern.
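
For illustration, a minimal sketch of that pattern (with a stand-in
'rev' pointer and a reduced struct, not the patch's actual code):

  #include <stdint.h>

  struct rev_offload { uint64_t fn, mapnum, addrs, sizes, kinds; };

  /* Producer (device side): plain stores, then one release store
     that publishes them.  */
  void
  publish (volatile struct rev_offload *rev, uint64_t fn, uint64_t mapnum)
  {
    rev->mapnum = mapnum;                       /* plain store  */
    __atomic_store_n (&rev->fn, fn, __ATOMIC_RELEASE);
  }

  /* Consumer (host side): the acquire load pairs with the release
     store, making the earlier plain stores visible.  */
  int
  consume (volatile struct rev_offload *rev, uint64_t *mapnum)
  {
    if (__atomic_load_n (&rev->fn, __ATOMIC_ACQUIRE) == 0)
      return 0;
    *mapnum = rev->mapnum;    /* a plain read suffices here  */
    return 1;
  }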

> +
> +  /* Processed on the host - when done, fn is set to NULL.  */
> +  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
> +    ;  /* spin  */
> +  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
>  }
>  
>  void
> diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
> index 9d4cc62..316de74 100644
> --- a/libgomp/libgomp-plugin.c
> +++ b/libgomp/libgomp-plugin.c
> @@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>    gomp_vfatal (msg, ap);
>    va_end (ap);
>  }
> +
> +void
> +GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
> +			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> +			void (*dev_to_host_cpy) (void *, const void *, size_t,
> +						 void *),
> +			void (*host_to_dev_cpy) (void *, const void *, size_t,
> +						 void *), void *token)
> +{
> +  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
> +		   dev_to_host_cpy, host_to_dev_cpy, token);
> +}
> diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
> index 6ab5ac6..875f967 100644
> --- a/libgomp/libgomp-plugin.h
> +++ b/libgomp/libgomp-plugin.h
> @@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
>  extern void GOMP_PLUGIN_fatal (const char *, ...)
>  	__attribute__ ((noreturn, format (printf, 1, 2)));
>  
> +extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
> +				    uint64_t, int,
> +				    void (*) (void *, const void *, size_t,
> +					      void *),
> +				    void (*) (void *, const void *, size_t,
> +					      void *), void *);
> +
>  /* Prototypes for functions implemented by libgomp plugins.  */
>  extern const char *GOMP_OFFLOAD_get_name (void);
>  extern unsigned int GOMP_OFFLOAD_get_caps (void);
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index 7519274..5803683 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
>  extern void gomp_init_targets_once (void);
>  extern int gomp_get_num_devices (void);
>  extern bool gomp_target_task_fn (void *);
> +extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
> +			     int,
> +			     void (*) (void *, const void *, size_t, void *),
> +			     void (*) (void *, const void *, size_t, void *),
> +			     void *);
>  
>  /* Splay tree definitions.  */
>  typedef struct splay_tree_node_s *splay_tree_node;
> diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
> index 46d5f10..12f76f7 100644
> --- a/libgomp/libgomp.map
> +++ b/libgomp/libgomp.map
> @@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
>  	GOMP_PLUGIN_goacc_profiling_dispatch;
>  	GOMP_PLUGIN_goacc_thread;
>  } GOMP_PLUGIN_1.2;
> +
> +GOMP_PLUGIN_1.4 {
> +  global:
> +	GOMP_PLUGIN_target_rev;
> +} GOMP_PLUGIN_1.3;
> diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
> index cd91b39..61359c7 100644
> --- a/libgomp/plugin/cuda-lib.def
> +++ b/libgomp/plugin/cuda-lib.def
> @@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
>  CUDA_ONE_CALL (cuLinkDestroy)
>  CUDA_ONE_CALL (cuMemAlloc)
>  CUDA_ONE_CALL (cuMemAllocHost)
> +CUDA_ONE_CALL (cuMemAllocManaged)
>  CUDA_ONE_CALL (cuMemcpy)
>  CUDA_ONE_CALL (cuMemcpyDtoDAsync)
>  CUDA_ONE_CALL (cuMemcpyDtoH)
> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index ba6b229..1bd9ee2 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -54,6 +54,8 @@
>  #include <assert.h>
>  #include <errno.h>
>  
> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
> +
>  /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
>     block to cache between kernel invocations.  For soft-stacks blocks bigger
>     than this, we will free the block before attempting another GPU memory
> @@ -274,6 +276,17 @@ struct targ_fn_descriptor
>    int max_threads_per_block;
>  };
>  
> +/* Reverse offload. Must match version used in config/nvptx/target.c. */
> +struct rev_offload {
> +  uint64_t fn;
> +  uint64_t mapnum;
> +  uint64_t addrs;
> +  uint64_t sizes;
> +  uint64_t kinds;
> +  int32_t dev_num;
> +  uint32_t lock;
> +};
> +
>  /* A loaded PTX image.  */
>  struct ptx_image_data
>  {
> @@ -302,6 +315,7 @@ struct ptx_device
>    bool map;
>    bool concur;
>    bool mkern;
> +  bool concurr_managed_access;
>    int mode;
>    int clock_khz;
>    int num_sms;
> @@ -329,6 +343,9 @@ struct ptx_device
>        pthread_mutex_t lock;
>      } omp_stacks;
>  
> +  struct rev_offload *rev_data;
> +  CUdeviceptr rev_data_dev;
> +
>    struct ptx_device *next;
>  };
>  
> @@ -423,7 +440,7 @@ nvptx_open_device (int n)
>    struct ptx_device *ptx_dev;
>    CUdevice dev, ctx_dev;
>    CUresult r;
> -  int async_engines, pi;
> +  int pi;
>  
>    CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
>  
> @@ -519,11 +536,17 @@ nvptx_open_device (int n)
>  		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
>    ptx_dev->max_threads_per_multiprocessor = pi;
>  
> +#if 0
> +  int async_engines;
>    r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
>  			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
>    if (r != CUDA_SUCCESS)
>      async_engines = 1;
> +#endif
>  
> +  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +			 CU_DEVICE_ATTRIBUTE_CONCURRENT_MANAGED_ACCESS, dev);
> +  ptx_dev->concurr_managed_access = r == CUDA_SUCCESS ? pi : false;
>    for (int i = 0; i != GOMP_DIM_MAX; i++)
>      ptx_dev->default_dims[i] = 0;
>  
> @@ -1380,7 +1403,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
>    else if (rev_fn_table)
>      {
>        CUdeviceptr var;
> -      size_t bytes;
> +      size_t bytes, i;
>        r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
>  			     "$offload_func_table");
>        if (r != CUDA_SUCCESS)
> @@ -1390,6 +1413,47 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
>        r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
>        if (r != CUDA_SUCCESS)
>  	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
> +      /* Free if only NULL entries.  */
> +      for (i = 0; i < fn_entries; ++i)
> +	if ((*rev_fn_table)[i] != 0)
> +	  break;
> +      if (i == fn_entries)
> +	{
> +	  free (*rev_fn_table);
> +	  *rev_fn_table = NULL;
> +	}
> +    }
> +
> +  if (rev_fn_table && dev->rev_data == NULL)
> +    {
> +      CUdeviceptr dp = 0;
> +      if (dev->concurr_managed_access && CUDA_CALL_EXISTS (cuMemAllocManaged))
> +	{
> +	  CUDA_CALL_ASSERT (cuMemAllocManaged, (void *) &dev->rev_data,
> +			    sizeof (*dev->rev_data), CU_MEM_ATTACH_GLOBAL);
> +	  dp = (CUdeviceptr) dev->rev_data;
> +	}
> +      else
> +	{
> +	  CUDA_CALL_ASSERT (cuMemAllocHost, (void **) &dev->rev_data,
> +			    sizeof (*dev->rev_data));
> +	  memset (dev->rev_data, '\0', sizeof (*dev->rev_data));
> +	  CUDA_CALL_ASSERT (cuMemAlloc, &dev->rev_data_dev,
> +			    sizeof (*dev->rev_data));
> +	  dp = dev->rev_data_dev;
> +	}
> +      CUdeviceptr device_rev_offload_var;
> +      size_t device_rev_offload_size;
> +      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
> +				      &device_rev_offload_var,
> +				      &device_rev_offload_size, module,
> +				      XSTRING (GOMP_REV_OFFLOAD_VAR));
> +      if (r != CUDA_SUCCESS)
> +	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error: %s", cuda_error (r));
> +      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
> +			     sizeof (dp));
> +      if (r != CUDA_SUCCESS)
> +	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
>      }
>  
>    nvptx_set_clocktick (module, dev);
> @@ -2001,6 +2065,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
>    return (void *) ptx_dev->omp_stacks.ptr;
>  }
>  
> +
> +void
> +rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
> +			 CUstream stream)
> +{
> +  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
> +  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> +}
> +
> +void
> +rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
> +			 CUstream stream)
> +{
> +  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
> +  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> +}
> +
>  void
>  GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>  {
> @@ -2035,6 +2116,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>  
>    size_t stack_size = nvptx_stacks_size ();
> +  bool reverse_off = ptx_dev->rev_data != NULL;
> +  bool has_usm = (ptx_dev->concurr_managed_access
> +		  && CUDA_CALL_EXISTS (cuMemAllocManaged));
> +  CUstream copy_stream = NULL;
>  
>    pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>    void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
> @@ -2048,12 +2133,62 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>    GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
>  		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
>  		     __FUNCTION__, fn_name, teams, threads);
> +  if (reverse_off)
> +    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>    r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>  			 32, threads, 1, 0, NULL, NULL, config);
>    if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
> -
> -  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> +  if (reverse_off)
> +    while (true)
> +      {
> +	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
> +	if (r == CUDA_SUCCESS)
> +	  break;
> +	if (r == CUDA_ERROR_LAUNCH_FAILED)
> +	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
> +			     maybe_abort_msg);
> +	else if (r != CUDA_ERROR_NOT_READY)
> +	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
> +	if (!has_usm)
> +	  {
> +	    CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, ptx_dev->rev_data,
> +			      ptx_dev->rev_data_dev,
> +			      sizeof (*ptx_dev->rev_data), copy_stream);
> +	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
> +	  }
> +	if (ptx_dev->rev_data->fn != 0)

Surely this needs to be an atomic load with 'acquire' semantics in has_usm case?

> +	  {
> +	    struct rev_offload *rev_data = ptx_dev->rev_data;
> +	    uint64_t fn_ptr = rev_data->fn;
> +	    uint64_t mapnum = rev_data->mapnum;
> +	    uint64_t addr_ptr = rev_data->addrs;
> +	    uint64_t sizes_ptr = rev_data->sizes;
> +	    uint64_t kinds_ptr = rev_data->kinds;
> +	    int dev_num = (int) rev_data->dev_num;
> +	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
> +				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
> +				    rev_off_host_to_dev_cpy, copy_stream);
> +	    rev_data->fn = 0;

Atomic store?

> +	    if (!has_usm)
> +	      {
> +		/* fn is the first element. */
> +		r = CUDA_CALL_NOCHECK (cuMemcpyHtoDAsync,
> +				       ptx_dev->rev_data_dev,
> +				       ptx_dev->rev_data,
> +				       sizeof (ptx_dev->rev_data->fn),
> +				       copy_stream);
> +		if (r != CUDA_SUCCESS)
> +		  GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
> +		CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
> +	      }
> +	  }
> +	usleep (1);
> +      }
> +  else
> +    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> +  if (reverse_off)
> +    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>    if (r == CUDA_ERROR_LAUNCH_FAILED)
>      GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
>  		       maybe_abort_msg);
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 5763483..9377de0 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>  
> +/* Handle reverse offload. This is called by the device plugins for a
> +   reverse offload; it is not called if the outer target runs on the host.  */
> +
> +void
> +gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
> +		 uint64_t mapnum __attribute__ ((unused)),
> +		 uint64_t devaddrs_ptr __attribute__ ((unused)),
> +		 uint64_t sizes_ptr __attribute__ ((unused)),
> +		 uint64_t kinds_ptr __attribute__ ((unused)),
> +		 int dev_num __attribute__ ((unused)),
> +		 void (*dev_to_host_cpy) (void *, const void *, size_t,
> +					  void *) __attribute__ ((unused)),
> +		 void (*host_to_dev_cpy) (void *, const void *, size_t,
> +					  void *) __attribute__ ((unused)),
> +		 void *token __attribute__ ((unused)))
> +{
> +  __builtin_unreachable ();
> +}
> +
>  /* Host fallback for GOMP_target_data{,_ext} routines.  */
>  
>  static void


* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-21 20:06   ` Alexander Monakov
@ 2022-09-26 15:07     ` Tobias Burnus
  2022-09-26 17:45       ` Alexander Monakov
  0 siblings, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2022-09-26 15:07 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3543 bytes --]

Hi Alexander,

On 21.09.22 22:06, Alexander Monakov wrote:
> It also goes
> against the good practice of accelerator programming, which requires queueing
> work on the accelerator and letting it run asynchronously with the CPU with high
> occupancy.
> (I know libgomp still waits for the GPU to finish in each GOMP_OFFLOAD_run,
> but maybe it's better to improve *that* instead of piling on new slowness)

Doesn't OpenMP 'nowait' permit this? (+ 'depend' clause if needed).
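
For reference, a minimal sketch of such host-side queueing (the
work_on_device/do_host_work/use_result names are illustrative only):

  extern int work_on_device (int);
  extern void do_host_work (void);
  extern void use_result (int);

  void
  example (int x)
  {
    /* Deferred target task; the host continues immediately.  */
    #pragma omp target nowait depend(out: x) map(tofrom: x)
    x = work_on_device (x);

    do_host_work ();  /* overlaps with the offloaded region  */

    /* Ordered after the target task via the dependence on 'x'.  */
    #pragma omp task depend(in: x) shared(x)
    use_result (x);

    #pragma omp taskwait
  }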

> On to the patch itself.

> And for non-USM code path you're relying on cudaMemcpy observing device-side
> atomics in the right order.
> Atomics aside, CUDA pinned memory would be a natural choice for such a tiny
> structure. Did you rule it out for some reason?

I did use pinned memory (cuMemAllocHost) – but somehow it did escape me
that:

"All host memory allocated in all contexts using cuMemAllocHost() and
cuMemHostAlloc() is always directly accessible from all contexts on all
devices that support unified addressing."

I have now updated the patch (using cuMemHostAlloc instead, with a flag,
in the hope that this choice is a tad faster).
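
For clarity, a sketch of that allocation in plain driver-API calls
('alloc_rev_buffer' is an illustrative name; error handling reduced):

  #include <cuda.h>

  /* Same layout as the patch's struct rev_offload.  */
  struct rev_offload { uint64_t fn, mapnum, addrs, sizes, kinds;
		       int32_t dev_num; uint32_t lock; };

  static CUresult
  alloc_rev_buffer (struct rev_offload **rev, CUdeviceptr *dp)
  {
    /* Page-locked host memory, mapped into the device address space.  */
    CUresult r = cuMemHostAlloc ((void **) rev, sizeof (**rev),
				 CU_MEMHOSTALLOC_DEVICEMAP);
    if (r != CUDA_SUCCESS)
      return r;
    /* With unified addressing, the host pointer can be used on the
       device as-is; otherwise, the device alias must be queried.  */
    return cuMemHostGetDevicePointer (dp, *rev, 0);
  }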

>> +++ b/libgomp/config/nvptx/target.c
>> ...
>> +#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
> Shouldn't this be in a header (needs to be in sync with the plugin).
I have now created one.
>> +
>> +#if (__SIZEOF_SHORT__ != 2 \
>> +     || __SIZEOF_SIZE_T__ != 8 \
>> +     || __SIZEOF_POINTER__ != 8)
>> +#error "Data-type conversion required for rev_offload"
>> +#endif
> Huh? This is not a requirement that is new for reverse offload, it has always
> been like that for offloading (all ABI rules regarding type sizes, struct
> layout, bitfield layout, endianness must match).

In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
-foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.
In practice, -m64 everywhere is required. I just want to make sure that
the sizes are right for this code because, here, I am sure it would
break otherwise. For other parts, I think the 64-bit assumption is
already hard-coded, but I am not completely sure that is really the case.

>> +  if (device != GOMP_DEVICE_HOST_FALLBACK
>> +      || fn == NULL
>> +      || GOMP_REV_OFFLOAD_VAR == NULL)
>> +    return;
> Shouldn't this be an 'assert' instead?

This tries to mimic what was there before – doing nothing. In any case,
this code path is unspecified or implementation defined (I forgot which
of the two), but a user might still be able to construct such code.
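
For reference, a user-level sketch of the reverse-offload construct that
ends up in this device-side GOMP_target_ext ('do_host_work' is an
illustrative name); a nested 'target' with any other device clause would
be the odd case hitting the early return:

  #pragma omp requires reverse_offload

  extern void do_host_work (int);  /* host-only code, e.g. I/O  */

  void
  f (int n)
  {
    #pragma omp target map(to: n)
    {
      /* Reverse offload: this region runs back on the host.  */
      #pragma omp target device(ancestor: 1) map(to: n)
      do_host_work (n);
    }
  }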

I leave it to Jakub whether he likes to have an assert, a error/warning
message, or just the return here.

>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->dev_num,
>> +                GOMP_ADDITIONAL_ICVS.device_num, __ATOMIC_SEQ_CST);
> Looks like all these can be plain stores, you only need ...
>
>> +  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_SEQ_CST);
> ... this to be atomic with 'release' semantics in the usual producer-consumer
> pattern.
>
>> +  if (ptx_dev->rev_data->fn != 0)
> Surely this needs to be an atomic load with 'acquire' semantics in has_usm case?
>> +        rev_data->fn = 0;
>>
>> Atomic store?

Done so – updated patch attached. Thanks for the comments.

Tobias

[-- Attachment #2: rev-offload-run-nvptx-v3.diff --]
[-- Type: text/x-patch, Size: 18834 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member. 
	(nvptx_open_device): #if 0 unused check; add
	unified address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.

 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  52 ++++++++++++++++
 libgomp/config/nvptx/target.c        |  32 +++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 111 +++++++++++++++++++++++++++++++++--
 libgomp/target.c                     |  19 ++++++
 11 files changed, 235 insertions(+), 14 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
@@ -113,6 +114,7 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..9fd1b27
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,52 @@
+/* Copyright (C) 2005-2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needed for this target.  */
+
+#ifndef LIBGOMP_NVPTX_H 
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..7f84cdc 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-nvptx.h"  /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>
 
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,31 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last.  */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_SEQ_CST) != 0)
+    ;  /* spin  */
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..43b34df 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@
 #include "gomp-constants.h"
 #include "oacc-int.h"
 
+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };
 
@@ -423,7 +427,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,10 +523,20 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
+
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64-bit device processes, this should universally be
+     the case; hence, an assert.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);
 
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
-     devices were present.  */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present. Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
+  if (num_devices > 0
+      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
     return -1;
   return num_devices;
 }
@@ -1380,7 +1396,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1406,37 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL)
+    {
+      /* cuMemHostAlloc memory is accessible on the device, if unified-shared
+	 address is supported; this is assumed - see comment in
+	 nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.   */
+      CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data,
+			sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
+      CUdeviceptr dp = (CUdeviceptr) dev->rev_data;
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2048,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2099,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2114,45 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+
+	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..9377de0 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void


* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-26 15:07     ` Tobias Burnus
@ 2022-09-26 17:45       ` Alexander Monakov
  2022-09-27  9:23         ` Tobias Burnus
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Monakov @ 2022-09-26 17:45 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches


Hi.

My main concerns remain not addressed:

1) what I said in the opening paragraphs of my previous email;

2) device-issued atomics are not guaranteed to appear atomic to the host
unless using atom.sys and translating for CUDA compute capability 6.0+.

Item 2 is a correctness issue. Item 1 I think is a matter of policy that
is up to you to hash out with Jakub.

On Mon, 26 Sep 2022, Tobias Burnus wrote:

> In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
> -foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.

I have no words.

Alexander


* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-26 17:45       ` Alexander Monakov
@ 2022-09-27  9:23         ` Tobias Burnus
  2022-09-28 13:16           ` Alexander Monakov
  2022-10-02 18:13           ` Tobias Burnus
  0 siblings, 2 replies; 31+ messages in thread
From: Tobias Burnus @ 2022-09-27  9:23 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches


[-- Attachment #1.1: Type: text/plain, Size: 2383 bytes --]

Hi,

On 26.09.22 19:45, Alexander Monakov wrote:

My main concerns remain not addressed:
1) what I said in the opening paragraphs of my previous email;

(i.e. the general disagreement about whether the feature itself should be implemented for nvptx at all.)

2) device-issued atomics are not guaranteed to appear atomic to the host
unless using atom.sys and translating for CUDA compute capability 6.0+.

As you seem to have no other rough review comments, this can now be addressed :-)

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we can also configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_synchronize()/"membar.sys".)

Ignoring (1), does the overall patch and this part otherwise look okay(ish)?


Caveat: The .sys scope works well with >= sm_60 but does not handle older
versions. For those, the __atomic_{load/store}_n are used.
I do not see a good solution beyond documentation. In the way it is used
(one thread only setting an on/off flag, no atomic increments etc.), I think
it is unlikely to cause races without .sys scope, but as always it is difficult
to rule out some special unfortunate case where it does. At least we now have
some documentation (in general) - which still needs to be expanded and improved.
For this feature, I did not add any wording in this patch: until the feature
is actually enabled, it would be more confusing than helpful.
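
For reference, the handshake this caveat is about boils down to the
following on the device side (fallback path, as in the attached patch):

  /* Publish the arguments first, then set 'fn' last as the "go" flag.  */
  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
  /* The host runs the region and then clears 'fn'; spin until it did.  */
  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
    ;  /* spin  */

i.e. 'fn' is the only flag both agents synchronize on - and without .sys
scope its release/acquire semantics are only guaranteed GPU-internally.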


> On Mon, 26 Sep 2022, Tobias Burnus wrote:
>
> > In theory, compiling with "-m32 -foffload-options=-m64" or "-m32
> > -foffload-options=-m32" or "-m64 -foffload-options=-m32" is supported.
>
> I have no words.

FWIW, the nvptx documentation states:

@node Nvidia PTX Options
...
@item -m64
@opindex m64
Ignored, but preserved for backward compatibility.  Only 64-bit ABI is
supported.

And in config/nvptx/mkoffload.cc you also still find leftovers from -m32.

Tobias


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-nvptx-v4.diff --]
[-- Type: text/x-patch, Size: 19260 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member. 
	(nvptx_open_device): #if 0 unused check; add
	unified address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.

 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  52 ++++++++++++++++
 libgomp/config/nvptx/target.c        |  48 ++++++++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 111 +++++++++++++++++++++++++++++++++--
 libgomp/target.c                     |  19 ++++++
 11 files changed, 251 insertions(+), 14 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
@@ -113,6 +114,7 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..9fd1b27
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,52 @@
+/* Copyright (C) 2005-2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that are only
+   needed for this target.  */
+
+#ifndef LIBGOMP_NVPTX_H 
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+  uint32_t lock;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..0c7bba9 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-nvptx.h"  /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>
 
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,47 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  while (__sync_lock_test_and_set (&GOMP_REV_OFFLOAD_VAR->lock, (uint8_t) 1))
+    ;  /* spin  */
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last.  */
+#if __PTX_SM__ >= 600
+  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  asm volatile ("st.global.release.sys.u64 [%0], %1;"
+		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+#else
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
+#endif
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+#if __PTX_SM__ >= 600
+  uint64_t fn2;
+  do
+    {
+      asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
+		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+    }
+  while (fn2 != 0);
+#else
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+    ;  /* spin  */
+#endif
+  __sync_lock_release (&GOMP_REV_OFFLOAD_VAR->lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..43b34df 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@
 #include "gomp-constants.h"
 #include "oacc-int.h"
 
+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };
 
@@ -423,7 +427,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,10 +523,20 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
+
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64bit device processes, this should universally be
+     the case; hence, an assert.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);
 
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
-     devices were present.  */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present. Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
+  if (num_devices > 0
+      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
     return -1;
   return num_devices;
 }
@@ -1380,7 +1396,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1406,37 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL)
+    {
+      /* cuMemHostAlloc memory is accessible on the device, if unified-shared
+	 address is supported; this is assumed - see comment in
+	 nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.   */
+      CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data,
+			sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
+      CUdeviceptr dp = (CUdeviceptr) dev->rev_data;
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2048,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2099,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_off = ptx_dev->rev_data != NULL;
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2114,45 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+
+	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    uint64_t fn_ptr = rev_data->fn;
+	    uint64_t mapnum = rev_data->mapnum;
+	    uint64_t addr_ptr = rev_data->addrs;
+	    uint64_t sizes_ptr = rev_data->sizes;
+	    uint64_t kinds_ptr = rev_data->kinds;
+	    int dev_num = (int) rev_data->dev_num;
+	    GOMP_PLUGIN_target_rev (fn_ptr, mapnum, addr_ptr, sizes_ptr,
+				    kinds_ptr, dev_num, rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_off)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..9377de0 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-27  9:23         ` Tobias Burnus
@ 2022-09-28 13:16           ` Alexander Monakov
  2022-10-02 18:13           ` Tobias Burnus
  1 sibling, 0 replies; 31+ messages in thread
From: Alexander Monakov @ 2022-09-28 13:16 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches


On Tue, 27 Sep 2022, Tobias Burnus wrote:

> Ignoring (1), does the overall patch and this part otherwise look okay(ish)?
> 
> 
> Caveat: The .sys scope works well with >= sm_60 but does not handle
> older versions. For those, the __atomic_{load/store}_n are used.  I do not
> see a good solution beyond documentation. In the way it is used (one
> thread only setting an on/off flag, no atomic increments etc.), I think
> it is unlikely to cause races without .sys scope, but as always it is
> difficult to rule out some special unfortunate case where it does. At
> least we now have some documentation (in general) - which still needs
> to be expanded and improved.  For this feature, I did not add any wording
> in this patch: until the feature is actually enabled, it would be more
> confusing than helpful.

If the implication is that distros will ship a racy-by-default implementation,
unless they know about the problem and configure for sm_60, then no, that
doesn't look fine to me. A possible solution is not enabling a feature that
has a known correctness issue.

Alexander

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-09-27  9:23         ` Tobias Burnus
  2022-09-28 13:16           ` Alexander Monakov
@ 2022-10-02 18:13           ` Tobias Burnus
  2022-10-07 14:26             ` [Patch][v5] " Tobias Burnus
  1 sibling, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2022-10-02 18:13 UTC (permalink / raw)
  To: Alexander Monakov; +Cc: Jakub Jelinek, Tom de Vries, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2135 bytes --]

On 27.09.22 11:23, Tobias Burnus wrote:

We do support
  #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
and we also can configure GCC with
  --with-arch=sm_70 (or sm_80 or ...)
Thus, adding atomics with .sys scope is possible.

See attached patch. This seems to work fine and I hope I got the
assembly right in terms of atomic use. (And I do believe that the
.release/.acquire do not need an additional __sync_syncronize()/"membar.sys".)

Regarding this:

While 'atom.op' (op = and/or/xor/cas/exch/add/inc/dec/min/max)
with scope is an sm_60 feature, the 'st'/'ld' with scope qualifier
and .relaxed/.release resp. .relaxed/.acquire used here require sm_70.

(This does not really matter in practice, as only ..., sm_53 and sm_70, ...
are currently supported but not sm_60 - still, the #if should obviously be fixed.)
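
I.e. relative to the earlier patch, the sketch of the fix is simply:

  -#if __PTX_SM__ >= 600
  +#if __PTX_SM__ >= 700  /* scoped st/ld with .release/.acquire need sm_70 */

(in both places where the guard occurs in libgomp/config/nvptx/target.c).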

 * * *

Looking at the generated code without the inline assembler, we have instead of
  st.global.release.sys.u64 [%r27],%r39;
and
  ld.acquire.sys.global.u64 %r62,[%r27];
for the older systems (__PTX_SM__ < 700) the code:
  @ %r69 membar.sys;
  @ %r69 atom.exch.b64 _,[%r27],%r41;
and
  ld.global.u64 %r64,[__gomp_rev_offload_var];
  ld.u64 %r36,[%r64];
  membar.sys;

In my understanding, the membar.sys ensures - similar to
  st.release / ld.acquire -
that the memory handling is done in the correct order in scope .sys.
As the 'fn' variable is initially 0 and then only set via the device,
there is eventually a DMA write device->host, which is atomic as the
full int64_t is written at once (and not, e.g., first the lower and
then the upper half). The 'st'/'atom.exch' should work fine, despite
having no .sys scope.

Likewise, the membar.sys applies also in the other direction. Or did I
miss something? If so, would an explicit __sync_synchronize() (= membar.sys)
help between the 'st' and the 'ld'?
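
For reference, the host-side counterpart of this handshake, condensed from
the patch (the real loop also handles cuStreamQuery failures), is:

  /* Poll while the kernel runs; 'fn' != 0 flags a pending reverse offload.  */
  while (CUDA_CALL_NOCHECK (cuStreamQuery, NULL) == CUDA_ERROR_NOT_READY)
    {
      if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
        {
          struct rev_offload *rev_data = ptx_dev->rev_data;
          GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                  rev_data->addrs, rev_data->sizes,
                                  rev_data->kinds, rev_data->dev_num,
                                  rev_off_dev_to_host_cpy,
                                  rev_off_host_to_dev_cpy, copy_stream);
          /* Signal completion back to the spinning device thread.  */
          __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
        }
      usleep (1);
    }

so the ordering question above is precisely about whether the device-side
stores become visible to these host-side loads in the right order.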

Tobias


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-02 18:13           ` Tobias Burnus
@ 2022-10-07 14:26             ` Tobias Burnus
  2022-10-11 10:49               ` Jakub Jelinek
  0 siblings, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2022-10-07 14:26 UTC (permalink / raw)
  To: Jakub Jelinek, gcc-patches; +Cc: Alexander Monakov, Tom de Vries


[-- Attachment #1.1: Type: text/plain, Size: 2504 bytes --]

Updated patch enclosed. Changes:

* Fixes the sm >= 700 issue, I noted before (cf. below)

* The < sm_70 code is still in, but disabled at user-compile time, with a warning, if libgomp.a wasn't compiled with sm_70 or higher. (mkoffload then strips the nvptx offload code.)

* Some minor cleanup

OK for mainline?

Tobias

On 02.10.22 20:13, Tobias Burnus wrote:
> On 27.09.22 11:23, Tobias Burnus wrote:
>
> > We do support
> >   #if __PTX_SM__ >= 600  (CUDA >= 8.0, ptx isa >= 5.0)
> > and we can also configure GCC with
> >   --with-arch=sm_70 (or sm_80 or ...)
> > Thus, adding atomics with .sys scope is possible.
> >
> > See attached patch. This seems to work fine and I hope I got the
> > assembly right in terms of atomic use. (And I do believe that the
> > .release/.acquire do not need an additional __sync_synchronize()/"membar.sys".)
>
> Regarding this:
>
> While 'atom.op' (op = and/or/xor/cas/exch/add/inc/dec/min/max)
> with scope is an sm_60 feature, the 'st'/'ld' with scope qualifier
> and .relaxed/.release resp. .relaxed/.acquire used here require sm_70.
>
> (This does not really matter in practice, as only ..., sm_53 and sm_70, ...
> are currently supported but not sm_60 - still, the #if should obviously be fixed.)
>
>  * * *
>
> Looking at the generated code without the inline assembler, we have instead of
>   st.global.release.sys.u64 [%r27],%r39;
> and
>   ld.acquire.sys.global.u64 %r62,[%r27];
> for the older systems (__PTX_SM__ < 700) the code:
>   @ %r69 membar.sys;
>   @ %r69 atom.exch.b64 _,[%r27],%r41;
> and
>   ld.global.u64 %r64,[__gomp_rev_offload_var];
>   ld.u64 %r36,[%r64];
>   membar.sys;
>
> In my understanding, the membar.sys ensures - similar to
>   st.release / ld.acquire -
> that the memory handling is done in the correct order in scope .sys.
> As the 'fn' variable is initially 0 and then only set via the device,
> there is eventually a DMA write device->host, which is atomic as the
> full int64_t is written at once (and not, e.g., first the lower and
> then the upper half). The 'st'/'atom.exch' should work fine, despite
> having no .sys scope.
>
> Likewise, the membar.sys applies also in the other direction. Or did I
> miss something? If so, would an explicit __sync_synchronize() (= membar.sys)
> help between the 'st' and the 'ld'?
>
> Tobias


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-nvptx-v5.diff --]
[-- Type: text/x-patch, Size: 23685 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

For host-device consistency guarantee reasons, reverse offload is currently
limited to -march=sm_70 (for libgomp).

gcc/ChangeLog:

	* config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
	has not been compiled with sm_70 or higher and disable code gen then.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member. 
	(nvptx_open_device): #if 0 unused check; add
	unified address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.

 gcc/config/nvptx/mkoffload.cc        |  60 +++++++++++++++-----
 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  51 +++++++++++++++++
 libgomp/config/nvptx/target.c        |  61 +++++++++++++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 107 +++++++++++++++++++++++++++++++++--
 libgomp/target.c                     |  19 +++++++
 12 files changed, 304 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/mkoffload.cc b/gcc/config/nvptx/mkoffload.cc
index 854cd72..aa2e042 100644
--- a/gcc/config/nvptx/mkoffload.cc
+++ b/gcc/config/nvptx/mkoffload.cc
@@ -258,6 +258,7 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
   unsigned ix;
   const char *sm_ver = NULL, *version = NULL;
   const char *sm_ver2 = NULL, *version2 = NULL;
+  const char *sm_libgomp = NULL;
   size_t file_cnt = 0;
   size_t *file_idx = XALLOCAVEC (size_t, len);
 
@@ -268,6 +269,7 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
   for (size_t i = 0; i != len;)
     {
       char c;
+      bool is_libgomp = false;
       bool output_fn_ptr = false;
       file_idx[file_cnt++] = i;
 
@@ -291,6 +293,13 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
 		  version = input + i + strlen (".version ");
 		  continue;
 		}
+	      if (UNLIKELY (startswith (input + i,
+					"// BEGIN GLOBAL FUNCTION "
+					"DEF: GOMP_target_ext")))
+		{
+		  is_libgomp = true;
+		  continue;
+		}
 	      while (startswith (input + i, "//:"))
 		{
 		  i += 3;
@@ -319,28 +328,49 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
 	  putc (c, out);
 	}
       fprintf (out, "\";\n\n");
+      if (is_libgomp)
+	sm_libgomp = sm_ver;
       if (output_fn_ptr
 	  && (omp_requires & GOMP_REQUIRES_REVERSE_OFFLOAD) != 0)
 	{
-	  if (sm_ver && sm_ver[0] == '3' && sm_ver[1] == '0'
-	      && sm_ver[2] == '\n')
-	    {
-	      warning_at (input_location, 0,
-			  "%<omp requires reverse_offload%> requires at "
-			  "least %<sm_35%> for "
-			  "%<-foffload-options=nvptx-none=-march=%> - disabling"
-			  " offload-code generation for this device type");
-	      /* As now an empty file is compiled and there is no call to
-		 GOMP_offload_register_ver, this device type is effectively
-		 disabled.  */
-	      fflush (out);
-	      ftruncate (fileno (out), 0);
-	      return;
-	    }
 	  sm_ver2 = sm_ver;
 	  version2 = version;
 	}
     }
+  if (sm_ver2 && sm_libgomp
+      && sm_libgomp[0] < '7' && sm_libgomp[1] && sm_libgomp[2] == '\n')
+    {
+      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
+	 for < sm_70 exists but is disabled here as it is unclear whether there
+	 is the required consistency between host and device.
+	 See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+	 for details.  */
+      warning_at (input_location, 0,
+		  "Disabling offload-code generation for this device type: "
+		  "%<omp requires reverse_offload%> can only be fulfilled "
+		  "for %<sm_70%> or higher");
+      inform (UNKNOWN_LOCATION,
+	      "Reverse offload requires that GCC is configured with "
+	      "%<--with-arch=sm_70%> or higher and not overridden by a lower "
+	      "value for %<-foffload-options=nvptx-none=-march=%>");
+      /* As now an empty file is compiled and there is no call to
+	 GOMP_offload_register_ver, this device type is effectively disabled.  */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }
+  if (sm_ver2 && sm_ver2[0] == '3' && sm_ver2[1] == '0' && sm_ver2[2] == '\n')
+    {
+      warning_at (input_location, 0,
+		  "%<omp requires reverse_offload%> requires at least %<sm_35%> "
+		  "for %<-foffload-options=nvptx-none=-march=%> - disabling "
+		  "offload-code generation for this device type");
+      /* As now an empty file is compiled and there is no call to
+	 GOMP_offload_register_ver, this device type is effectively disabled.  */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }
 
   /* Create function-pointer array, required for reverse
      offload function-pointer lookup.  */
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
@@ -113,6 +114,7 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..5da9aae
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,51 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that are only
+   needed for this target.  */
+
+#ifndef LIBGOMP_NVPTX_H
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..6470ae8 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-nvptx.h"  /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>
 
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,60 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
+  static int lock = 0;  /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  gomp_mutex_lock (&lock);
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last.  */
+#if __PTX_SM__ >= 700
+  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  asm volatile ("st.global.release.sys.u64 [%0], %1;"
+		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+#else
+/* The following has been effectively disabled via mkoffload as it is unclear
+   whether there is the required consistency between host and device.
+   See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+   Note: Using atomic with scope = .sys is already supported since >= 600.
+   The generated code is:
+     @ %r69 membar.sys;
+     @ %r69 atom.exch.b64 _,[%r27],%r41; */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
+#endif
+
+  /* Processed on the host - when done, fn is set to NULL.  */
+#if __PTX_SM__ >= 700
+  uint64_t fn2;
+  do
+    {
+      asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
+		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+    }
+  while (fn2 != 0);
+#else
+/* See remark above. The generated memory-access code is
+     ld.global.u64 %r64,[__gomp_rev_offload_var];
+     ld.u64 %r36,[%r64];
+     membar.sys;  */
+  __sync_synchronize ();  /* membar.sys */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+    ;  /* spin  */
+#endif
+
+  gomp_mutex_unlock (&lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..de24398 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@
 #include "gomp-constants.h"
 #include "oacc-int.h"
 
+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };
 
@@ -423,7 +427,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,10 +523,20 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
 			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
+
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64bit device processes, this should universally be
+     the case; hence, an assert.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);
 
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
-     devices were present.  */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present. Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
+  if (num_devices > 0
+      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
     return -1;
   return num_devices;
 }
@@ -1380,7 +1396,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1406,37 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL)
+    {
+      /* cuMemHostAlloc memory is accessible on the device, if unified-shared
+	 address is supported; this is assumed - see comment in
+	 nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.   */
+      CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data,
+			sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
+      CUdeviceptr dp = (CUdeviceptr) dev->rev_data;
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2048,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2099,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_offload = ptx_dev->rev_data != NULL;
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2114,41 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+
+	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
+				    rev_data->addrs, rev_data->sizes,
+				    rev_data->kinds, rev_data->dev_num,
+				    rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..71bcb05 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload. This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-07 14:26             ` [Patch][v5] " Tobias Burnus
@ 2022-10-11 10:49               ` Jakub Jelinek
  2022-10-11 11:12                 ` Alexander Monakov
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Jelinek @ 2022-10-11 10:49 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: gcc-patches, Alexander Monakov, Tom de Vries

On Fri, Oct 07, 2022 at 04:26:58PM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
> 
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
> 
> For host-device consistency guarantee reasons, reverse offload is currently
> limited to -march=sm_70 (for libgomp).
> 
> gcc/ChangeLog:
> 
> 	* config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
> 	has not been compiled with sm_70 or higher and disable code gen then.
> 
> include/ChangeLog:
> 
> 	* cuda/cuda.h (enum CUdevice_attribute): Add
> 	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
> 	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
> 	(cuMemHostAlloc): Add prototype.
> 
> libgomp/ChangeLog:
> 
> 	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
> 	'static' for this variable.
> 	* config/nvptx/libgomp-nvptx.h: New file.
> 	* config/nvptx/target.c: Include it.
> 	(GOMP_ADDITIONAL_ICVS): Declare extern var.
> 	(GOMP_REV_OFFLOAD_VAR): Declare var.
> 	(GOMP_target_ext): Handle reverse offload.
> 	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
> 	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
> 	* target.c (gomp_target_rev): ... this new stub function.
> 	* libgomp.h (gomp_target_rev): Declare.
> 	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
> 	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
> 	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
> 	(struct ptx_device): Add rev_data member. 
> 	(nvptx_open_device): #if 0 unused check; add
> 	unified address assert check.
> 	(GOMP_OFFLOAD_get_num_devices): Claim unified address
> 	support.
> 	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
> 	offload functions exist. Make offload var available
> 	on host and device.
> 	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
> 	(GOMP_OFFLOAD_run): Handle reverse offload.

So, does this mean one has to have gcc configured --with-arch=sm_70
or later to make reverse offloading work (and then on the other
side no support for older PTX arches at all)?
If yes, I was kind of hoping we could arrange for it to be more
user-friendly, build libgomp.a normally (sm_35 or what is the default),
build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
and arrange for mkoffload to link in the sm_70 stuff only if the user
wants reverse offload (or has requires reverse_offload?).  In that case
ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
in the part that needs sm_70 and make stuff working on sm_35 and later.
Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
mkoffload choose among them.
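
(As a sketch of that last variant only - the file names are hypothetical,
this is not from any posted patch:

  /* mkoffload: link the sm_70 libgomp target.o variant only when the user
     actually requires reverse offload, else the sm_35 variant.  */
  const char *libgomp_target_o
    = (omp_requires & GOMP_REQUIRES_REVERSE_OFFLOAD)
      ? "target-sm_70.o" : "target-sm_35.o";

and then only the selected variant would be linked into the offload image.)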

> +      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
> +	 for < sm_70 exists but is disabled here as it is unclear whether there
> +	 is the required consistency between host and device.
> +	 See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
> +	 for details.  */
> +      warning_at (input_location, 0,
> +		  "Disabling offload-code generation for this device type: "
> +		  "%<omp requires reverse_offload%> can only be fulfilled "
> +		  "for %<sm_70%> or higher");
> +      inform (UNKNOWN_LOCATION,
> +	      "Reverse offload requires that GCC is configured with "
> +	      "%<--with-arch=sm_70%> or higher and not overridden by a lower "
> +	      "value for %<-foffload-options=nvptx-none=-march=%>");

Diagnostics (sure, the Fortran FE is an exception) shouldn't start with capital
letters.

> @@ -519,10 +523,20 @@ nvptx_open_device (int n)
>  		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
>    ptx_dev->max_threads_per_multiprocessor = pi;
>  
> +#if 0
> +  int async_engines;
>    r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
>  			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
>    if (r != CUDA_SUCCESS)
>      async_engines = 1;
> +#endif

Please avoid #if 0 code.

> +
> +  /* Required below for reverse offload as implemented, but with compute
> +     capability >= 2.0 and 64bit device processes, this should universally be
> +     the case; hence, an assert.  */
> +  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
> +  assert (r == CUDA_SUCCESS && pi);
>  
>    for (int i = 0; i != GOMP_DIM_MAX; i++)
>      ptx_dev->default_dims[i] = 0;
> @@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
>  {
>    int num_devices = nvptx_get_num_devices ();
>    /* Return -1 if no omp_requires_mask cannot be fulfilled but
> -     devices were present.  */
> -  if (num_devices > 0 && omp_requires_mask != 0)
> +     devices were present. Unified-shared address: see comment in

2 spaces after . rather than 1.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>  
> +/* Handle reverse offload. This is called by the device plugins for a
> +   reverse offload; it is not called if the outer target runs on the host.  */

Likewise.

	Jakub


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-11 10:49               ` Jakub Jelinek
@ 2022-10-11 11:12                 ` Alexander Monakov
  2022-10-12  8:55                   ` Tobias Burnus
  0 siblings, 1 reply; 31+ messages in thread
From: Alexander Monakov @ 2022-10-11 11:12 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Tobias Burnus, gcc-patches, Tom de Vries

On Tue, 11 Oct 2022, Jakub Jelinek wrote:

> So, does this mean one has to have gcc configured --with-arch=sm_70
> or later to make reverse offloading work (and then on the other
> side no support for older PTX arches at all)?
> If yes, I was kind of hoping we could arrange for it to be more
> user-friendly: build libgomp.a normally (sm_35 or whatever the default is),
> build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70,
> and arrange for mkoffload to link in the sm_70 stuff only if the user
> wants reverse offload (or has requires reverse_offload?).  In that case,
> ignore sm_60 and older devices; if reverse offload isn't wanted, don't link
> in the part that needs sm_70 and make things work on sm_35 and later.
> Or perhaps have 2 versions of target.o, one sm_35 and one sm_70, and let
> mkoffload choose among them.

My understanding is such trickery should not be necessary with
the barrier-based approach, i.e. the sequence of PTX instructions

  st   % plain store
  membar.sys
  st.volatile

should be enough to guarantee that the former store is visible on the host
before the latter, and work all the way back to sm_20.

Alexander

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-11 11:12                 ` Alexander Monakov
@ 2022-10-12  8:55                   ` Tobias Burnus
  2022-10-17  7:35                     ` *ping* / " Tobias Burnus
                                       ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Tobias Burnus @ 2022-10-12  8:55 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2008 bytes --]

On 11.10.22 13:12, Alexander Monakov wrote:
> My understanding is such trickery should not be necessary with
> the barrier-based approach, i.e. the sequence of PTX instructions
>
>    st   % plain store
>    membar.sys
>    st.volatile
>
> should be enough to guarantee that the former store is visible on the host
> before the latter, and work all the way back to sm_20.

If I understand it correctly, you mean:

   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

   __sync_synchronize ();  /* membar.sys */
   asm volatile ("st.volatile.global.u64 [%0], %1;"
                 : : "r"(addr_struct_fn), "r" (fn) : "memory");


And then directly followed by the busy wait:

   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
     ;  /* spin  */

which GCC expands to:

   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
      ld.u64 %r36,[%r64];
      membar.sys;  */

The updated patch is attached.

(This and removing the mkoffload.cc part are the only larger changes;
otherwise, it only addresses the minor comments by Jakub.
The now-removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef, i.e.
it is a really old leftover!)

Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
sm_35 and sm_70.

Tobias

*With some added code; until GOMP_OFFLOAD_get_num_devices accepts
GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
gets passed a non-NULL for rev_fn_table, the current patch is a no op.

Planned next is the related GCN patch – and the actual change
in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: rev-offload-run-nvptx-v6.diff --]
[-- Type: text/x-patch, Size: 19308 bytes --]

libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

include/ChangeLog:

	* cuda/cuda.h (enum CUdevice_attribute): Add
	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
	(cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
	'static' for this variable.
	* config/nvptx/libgomp-nvptx.h: New file.
	* config/nvptx/target.c: Include it.
	(GOMP_ADDITIONAL_ICVS): Declare extern var.
	(GOMP_REV_OFFLOAD_VAR): Declare var.
	(GOMP_target_ext): Handle reverse offload.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
	* target.c (gomp_target_rev): ... this new stub function.
	* libgomp.h (gomp_target_rev): Declare.
	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
	(struct ptx_device): Add rev_data member. 
	(nvptx_open_device): Remove async_engines query, last used in
	r10-304-g1f4c5b9b; add unified-address assert check.
	(GOMP_OFFLOAD_get_num_devices): Claim unified address
	support.
	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
	offload functions exist. Make offload var available
	on host and device.
	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
	(GOMP_OFFLOAD_run): Handle reverse offload.

 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  51 +++++++++++++++++
 libgomp/config/nvptx/target.c        |  54 +++++++++++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 107 +++++++++++++++++++++++++++++++----
 libgomp/target.c                     |  19 +++++++
 11 files changed, 248 insertions(+), 18 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;
 
@@ -113,6 +114,7 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U
 
 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@
 
 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded.  */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
 
 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..5da9aae
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,51 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needed for this target.  */
+
+#ifndef LIBGOMP_NVPTX_H
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif  /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..0e79388 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include "libgomp.h"
+#include "libgomp-nvptx.h"  /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>
 
 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;
 
 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,53 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
+  static int lock = 0;  /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  gomp_mutex_lock (&lock);
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* Set 'fn' to trigger processing on the host; wait for completion,
+     which is flagged by setting 'fn' back to 0 on the host.  */
+  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+#if __PTX_SM__ >= 700
+  asm volatile ("st.global.release.sys.u64 [%0], %1;"
+		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+#else
+  __sync_synchronize ();  /* membar.sys */
+  asm volatile ("st.volatile.global.u64 [%0], %1;"
+		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+#endif
+
+#if __PTX_SM__ >= 700
+  uint64_t fn2;
+  do
+    {
+      asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
+		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+    }
+  while (fn2 != 0);
+#else
+  /* ld.global.u64 %r64,[__gomp_rev_offload_var];
+     ld.u64 %r36,[%r64];
+     membar.sys;  */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+    ;  /* spin  */
+#endif
+
+  gomp_mutex_unlock (&lock);
 }
 
 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+			void (*dev_to_host_cpy) (void *, const void *, size_t,
+						 void *),
+			void (*host_to_dev_cpy) (void *, const void *, size_t,
+						 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+		   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+				    uint64_t, int,
+				    void (*) (void *, const void *, size_t,
+					      void *),
+				    void (*) (void *, const void *, size_t,
+					      void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+			     int,
+			     void (*) (void *, const void *, size_t, void *),
+			     void (*) (void *, const void *, size_t, void *),
+			     void *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
 	GOMP_PLUGIN_goacc_profiling_dispatch;
 	GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+	GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..ad057ed 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@
 #include "gomp-constants.h"
 #include "oacc-int.h"
 
+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@ struct ptx_device
       pthread_mutex_t lock;
     } omp_stacks;
 
+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };
 
@@ -423,7 +427,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;
 
   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);
 
@@ -519,10 +523,12 @@ nvptx_open_device (int n)
 		  CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;
 
-  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
-			 CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
-  if (r != CUDA_SUCCESS)
-    async_engines = 1;
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64bit device processes, this should universally be
+     the case; hence, an assert.  */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+			 CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);
 
   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1185,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if the omp_requires_mask cannot be fulfilled but
-     devices were present.  */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present.  Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.  */
+  if (num_devices > 0
+      && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0)
     return -1;
   return num_devices;
 }
@@ -1380,7 +1388,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
   else if (rev_fn_table)
     {
       CUdeviceptr var;
-      size_t bytes;
+      size_t bytes, i;
       r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module,
 			     "$offload_func_table");
       if (r != CUDA_SUCCESS)
@@ -1390,6 +1398,37 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r));
+      /* Free if only NULL entries.  */
+      for (i = 0; i < fn_entries; ++i)
+	if ((*rev_fn_table)[i] != 0)
+	  break;
+      if (i == fn_entries)
+	{
+	  free (*rev_fn_table);
+	  *rev_fn_table = NULL;
+	}
+    }
+
+  if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL)
+    {
+      /* cuMemHostAlloc memory is accessible on the device, if unified-shared
+	 address is supported; this is assumed - see comment in
+	 nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.   */
+      CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data,
+			sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP);
+      CUdeviceptr dp = (CUdeviceptr) dev->rev_data;
+      CUdeviceptr device_rev_offload_var;
+      size_t device_rev_offload_size;
+      CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal,
+				      &device_rev_offload_var,
+				      &device_rev_offload_size, module,
+				      XSTRING (GOMP_REV_OFFLOAD_VAR));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp,
+			     sizeof (dp));
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r));
     }
 
   nvptx_set_clocktick (module, dev);
@@ -2001,6 +2040,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
   return (void *) ptx_dev->omp_stacks.ptr;
 }
 
+
+void
+rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
+void
+rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
+			 CUstream stream)
+{
+  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
+  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
+}
+
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2035,6 +2091,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
+  bool reverse_offload = ptx_dev->rev_data != NULL;
+  CUstream copy_stream = NULL;
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2048,12 +2106,41 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
-
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    while (true)
+      {
+	r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
+	if (r == CUDA_SUCCESS)
+	  break;
+	if (r == CUDA_ERROR_LAUNCH_FAILED)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
+			     maybe_abort_msg);
+	else if (r != CUDA_ERROR_NOT_READY)
+	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
+
+	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	  {
+	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
+				    rev_data->addrs, rev_data->sizes,
+				    rev_data->kinds, rev_data->dev_num,
+				    rev_off_dev_to_host_cpy,
+				    rev_off_host_to_dev_cpy, copy_stream);
+	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+	  }
+	usleep (1);
+      }
+  else
+    r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+  if (reverse_offload)
+    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
diff --git a/libgomp/target.c b/libgomp/target.c
index 5763483..c7fe741 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
     htab_free (refcount_set);
 }
 
+/* Handle reverse offload.  This is called by the device plugins for a
+   reverse offload; it is not called if the outer target runs on the host.  */
+
+void
+gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)),
+		 uint64_t mapnum __attribute__ ((unused)),
+		 uint64_t devaddrs_ptr __attribute__ ((unused)),
+		 uint64_t sizes_ptr __attribute__ ((unused)),
+		 uint64_t kinds_ptr __attribute__ ((unused)),
+		 int dev_num __attribute__ ((unused)),
+		 void (*dev_to_host_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void (*host_to_dev_cpy) (void *, const void *, size_t,
+					  void *) __attribute__ ((unused)),
+		 void *token __attribute__ ((unused)))
+{
+  __builtin_unreachable ();
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void

^ permalink raw reply	[flat|nested] 31+ messages in thread

* *ping* / Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-12  8:55                   ` Tobias Burnus
@ 2022-10-17  7:35                     ` Tobias Burnus
  2022-10-19 15:53                     ` Alexander Monakov
  2022-10-24 14:07                     ` Jakub Jelinek
  2 siblings, 0 replies; 31+ messages in thread
From: Tobias Burnus @ 2022-10-17  7:35 UTC (permalink / raw)
  To: Alexander Monakov, Jakub Jelinek, gcc-patches


On 12.10.22 10:55, Tobias Burnus wrote:
> On 11.10.22 13:12, Alexander Monakov wrote:
>> My understanding is such trickery should not be necessary with
>> the barrier-based approach, i.e. the sequence of PTX instructions
>>
>>    st   % plain store
>>    membar.sys
>>    st.volatile
>>
>> should be enough to guarantee that the former store is visible on the
>> host
>> before the latter, and work all the way back to sm_20.
>
> If I understand it correctly, you mean:
>
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
>
>
> And then directly followed by the busy wait:
>
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE)
> != 0)
>     ;  /* spin  */
>
> which GCC expands to:
>
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys;  */
>
> The updated patch is attached.
>
> (This and removing the mkoffload.cc part are the only larger changes;
> otherwise, it only addresses the minor comments by Jakub.
> The now-removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef, i.e.
> it is a really old leftover!)
>
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
>
> Tobias
>
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
>
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-12  8:55                   ` Tobias Burnus
  2022-10-17  7:35                     ` *ping* / " Tobias Burnus
@ 2022-10-19 15:53                     ` Alexander Monakov
  2022-10-24 14:07                     ` Jakub Jelinek
  2 siblings, 0 replies; 31+ messages in thread
From: Alexander Monakov @ 2022-10-19 15:53 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2056 bytes --]

On Wed, 12 Oct 2022, Tobias Burnus wrote:

> On 11.10.22 13:12, Alexander Monakov wrote:
> > My understanding is such trickery should not be necessary with
> > the barrier-based approach, i.e. the sequence of PTX instructions
> >
> >    st   % plain store
> >    membar.sys
> >    st.volatile
> >
> > should be enough to guarantee that the former store is visible on the host
> > before the latter, and work all the way back to sm_20.
> 
> If I understand it correctly, you mean:
> 
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
> 
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
> 
> 
> And then directly followed by the busy wait:
> 
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>     ;  /* spin  */
> 
> which GCC expands to:
> 
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys;  */
> 
> The updated patch is attached.

I think the topic for which I was Cc'ed (memory space and access method for
the synchronization variable) has been resolved nicely. I am not satisfied
with some other points raised in the conversation; I hope they are noted.

Alexander

> (This and removing the mkoffload.cc part are the only larger changes;
> otherwise, it only addresses the minor comments by Jakub.
> The now-removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef, i.e.
> it is a really old leftover!)
> 
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
> 
> Tobias
> 
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
> 
> Planned next is the related GCN patch – and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-12  8:55                   ` Tobias Burnus
  2022-10-17  7:35                     ` *ping* / " Tobias Burnus
  2022-10-19 15:53                     ` Alexander Monakov
@ 2022-10-24 14:07                     ` Jakub Jelinek
  2022-10-24 19:05                       ` Thomas Schwinge
  2 siblings, 1 reply; 31+ messages in thread
From: Jakub Jelinek @ 2022-10-24 14:07 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Alexander Monakov, gcc-patches

On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
> 
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
> 
> include/ChangeLog:
> 
> 	* cuda/cuda.h (enum CUdevice_attribute): Add
> 	CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
> 	(CU_MEMHOSTALLOC_DEVICEMAP): Define.
> 	(cuMemHostAlloc): Add prototype.
> 
> libgomp/ChangeLog:
> 
> 	* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
> 	'static' for this variable.
> 	* config/nvptx/libgomp-nvptx.h: New file.
> 	* config/nvptx/target.c: Include it.
> 	(GOMP_ADDITIONAL_ICVS): Declare extern var.
> 	(GOMP_REV_OFFLOAD_VAR): Declare var.
> 	(GOMP_target_ext): Handle reverse offload.
> 	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
> 	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
> 	* target.c (gomp_target_rev): ... this new stub function.
> 	* libgomp.h (gomp_target_rev): Declare.
> 	* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
> 	* plugin/cuda-lib.def (cuMemHostAlloc): Add.
> 	* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
> 	(struct ptx_device): Add rev_data member. 
> 	(nvptx_open_device): Remove async_engines query, last used in
> 	r10-304-g1f4c5b9b; add unified-address assert check.
> 	(GOMP_OFFLOAD_get_num_devices): Claim unified address
> 	support.
> 	(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
> 	offload functions exist. Make offload var available
> 	on host and device.
> 	(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
> 	(GOMP_OFFLOAD_run): Handle reverse offload.

Ok, thanks.

	Jakub


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-24 14:07                     ` Jakub Jelinek
@ 2022-10-24 19:05                       ` Thomas Schwinge
  2022-10-24 19:11                         ` Thomas Schwinge
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Schwinge @ 2022-10-24 19:05 UTC (permalink / raw)
  To: Jakub Jelinek, Tobias Burnus; +Cc: Alexander Monakov, gcc-patches

Hi Tobias!

On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>> libgomp/nvptx: Prepare for reverse-offload callback handling

> Ok, thanks.

Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
I'm seeing a lot of libgomp execution test regressions.  Random
example, 'libgomp.c-c++-common/error-1.c':

    [...]
      GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]

    Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
    0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
    2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
    (gdb) print ptx_dev
    $1 = (struct ptx_device *) 0x6a55a0
    (gdb) print ptx_dev->rev_data
    $2 = (struct rev_offload *) 0xffffffff00000000
    (gdb) print ptx_dev->rev_data->fn
    Cannot access memory at address 0xffffffff00000000

Why is it even taking this 'if (reverse_offload)' code path, which isn't
applicable to this test case (as far as I understand)?  (Well, the answer
is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-24 19:05                       ` Thomas Schwinge
@ 2022-10-24 19:11                         ` Thomas Schwinge
  2022-10-24 19:46                           ` Tobias Burnus
  2022-10-24 19:51                           ` libgomp/nvptx: Prepare for reverse-offload callback handling, resolve spurious SIGSEGVs (was: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling) Thomas Schwinge
  0 siblings, 2 replies; 31+ messages in thread
From: Thomas Schwinge @ 2022-10-24 19:11 UTC (permalink / raw)
  To: Jakub Jelinek, Tobias Burnus; +Cc: Alexander Monakov, gcc-patches

Hi Tobias!

On 2022-10-24T21:05:46+0200, I wrote:
> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>> libgomp/nvptx: Prepare for reverse-offload callback handling
>
>> Ok, thanks.
>
> Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
> "libgomp/nvptx: Prepare for reverse-offload callback handling",
> I'm seeing a lot of libgomp execution test regressions.  Random
> example, 'libgomp.c-c++-common/error-1.c':
>
>     [...]
>       GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]
>
>     Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
>     0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
>     2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>     (gdb) print ptx_dev
>     $1 = (struct ptx_device *) 0x6a55a0
>     (gdb) print ptx_dev->rev_data
>     $2 = (struct rev_offload *) 0xffffffff00000000
>     (gdb) print ptx_dev->rev_data->fn
>     Cannot access memory at address 0xffffffff00000000
>
> Why is it even taking this 'if (reverse_offload)' code path, which isn't
> applicable to this test case (as far as I understand)?  (Well, the answer
> is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)

Well.

    --- a/libgomp/plugin/plugin-nvptx.c
    +++ b/libgomp/plugin/plugin-nvptx.c

    @@ -329,6 +332,7 @@ struct ptx_device
           pthread_mutex_t lock;
         } omp_stacks;

    +  struct rev_offload *rev_data;
       struct ptx_device *next;
     };

... but as far as I can tell, this is never initialized in
'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
Would the following be the correct fix (currently testing)?

    --- libgomp/plugin/plugin-nvptx.c
    +++ libgomp/plugin/plugin-nvptx.c
    @@ -546,6 +546,8 @@ nvptx_open_device (int n)
       ptx_dev->omp_stacks.size = 0;
       pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);

    +  ptx_dev->rev_data = NULL;
    +
       return ptx_dev;
     }
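
(Alternatively, as a sketch only: clear the whole allocation up front with
the plugin's calloc-style helper, at the cost of redundantly zeroing the
members that are initialized explicitly anyway:

    ptx_dev = GOMP_PLUGIN_malloc_cleared (sizeof (struct ptx_device));

That would also protect the next member added to 'struct ptx_device' from
being forgotten in the same way.)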



Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-10-24 19:11                         ` Thomas Schwinge
@ 2022-10-24 19:46                           ` Tobias Burnus
  2022-10-24 19:51                           ` libgomp/nvptx: Prepare for reverse-offload callback handling, resolve spurious SIGSEGVs (was: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling) Thomas Schwinge
  1 sibling, 0 replies; 31+ messages in thread
From: Tobias Burnus @ 2022-10-24 19:46 UTC (permalink / raw)
  To: Thomas Schwinge, Jakub Jelinek; +Cc: Alexander Monakov, gcc-patches

Hi Tobias!

On 24.10.22 21:11, Thomas Schwinge wrote:
> On 2022-10-24T21:05:46+0200, I wrote:
>> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>>> libgomp/nvptx: Prepare for reverse-offload callback handling
> Well.
>      +  struct rev_offload *rev_data;
> ... but as far as I can tell, this is never initialized in
> 'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
> Would the following be the correct fix (currently testing)?
>
>      --- libgomp/plugin/plugin-nvptx.c
>      +++ libgomp/plugin/plugin-nvptx.c
>      @@ -546,6 +546,8 @@ nvptx_open_device (int n)
>         ptx_dev->omp_stacks.size = 0;
>         pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
>
>      +  ptx_dev->rev_data = NULL;
>      +
>         return ptx_dev;
>       }

LGTM, and I think it is obvious – although I am not sure why it did not
fail when testing it here. (Presumably the freshly malloc'ed memory
happened to come from a zero-filled page, so the uninitialized pointer
read as NULL.)

Thanks,

Tobias

-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* libgomp/nvptx: Prepare for reverse-offload callback handling, resolve spurious SIGSEGVs (was: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2022-10-24 19:11                         ` Thomas Schwinge
  2022-10-24 19:46                           ` Tobias Burnus
@ 2022-10-24 19:51                           ` Thomas Schwinge
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2022-10-24 19:51 UTC (permalink / raw)
  To: Jakub Jelinek, Tobias Burnus, gcc-patches; +Cc: Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 2947 bytes --]

Hi!

On 2022-10-24T21:11:04+0200, I wrote:
> On 2022-10-24T21:05:46+0200, I wrote:
>> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>>> libgomp/nvptx: Prepare for reverse-offload callback handling
>>
>>> Ok, thanks.
>>
>> Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
>> "libgomp/nvptx: Prepare for reverse-offload callback handling",
>> I'm seeing a lot of libgomp execution test regressions.  Random
>> example, 'libgomp.c-c++-common/error-1.c':
>>
>>     [...]
>>       GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]
>>
>>     Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
>>     0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
>>     2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>>     (gdb) print ptx_dev
>>     $1 = (struct ptx_device *) 0x6a55a0
>>     (gdb) print ptx_dev->rev_data
>>     $2 = (struct rev_offload *) 0xffffffff00000000
>>     (gdb) print ptx_dev->rev_data->fn
>>     Cannot access memory at address 0xffffffff00000000
>>
>> Why is it even taking this 'if (reverse_offload)' code path, which isn't
>> applicable to this test case (as far as I understand)?  (Well, the answer
>> is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)
>
> Well.
>
>     --- a/libgomp/plugin/plugin-nvptx.c
>     +++ b/libgomp/plugin/plugin-nvptx.c
>
>     @@ -329,6 +332,7 @@ struct ptx_device
>            pthread_mutex_t lock;
>          } omp_stacks;
>
>     +  struct rev_offload *rev_data;
>        struct ptx_device *next;
>      };
>
> ... but as far as I can tell, this is never initialized in
> 'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
> Would the following be the correct fix (currently testing)?
>
>     --- libgomp/plugin/plugin-nvptx.c
>     +++ libgomp/plugin/plugin-nvptx.c
>     @@ -546,6 +546,8 @@ nvptx_open_device (int n)
>        ptx_dev->omp_stacks.size = 0;
>        pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
>
>     +  ptx_dev->rev_data = NULL;
>     +
>        return ptx_dev;
>      }

That did clean up libgomp execution test regressions; pushed to
master branch commit 205538832b7033699047900cf25928f5920d8b93
"libgomp/nvptx: Prepare for reverse-offload callback handling, resolve spurious SIGSEGVs",
see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-libgomp-nvptx-Prepare-for-reverse-offload-callback-h.patch --]
[-- Type: text/x-diff, Size: 1758 bytes --]

From 205538832b7033699047900cf25928f5920d8b93 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Mon, 24 Oct 2022 21:11:47 +0200
Subject: [PATCH] libgomp/nvptx: Prepare for reverse-offload callback handling,
 resolve spurious SIGSEGVs

Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
I'm seeing a lot of libgomp execution test regressions.  Random
example, 'libgomp.c-c++-common/error-1.c':

    [...]
      GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]

    Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
    0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
    2127            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
    (gdb) print ptx_dev
    $1 = (struct ptx_device *) 0x6a55a0
    (gdb) print ptx_dev->rev_data
    $2 = (struct rev_offload *) 0xffffffff00000000
    (gdb) print ptx_dev->rev_data->fn
    Cannot access memory at address 0xffffffff00000000

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_open_device): Initialize
	'ptx_dev->rev_data'.
---
 libgomp/plugin/plugin-nvptx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ad057edabec..0768fca350b 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -546,6 +546,8 @@ nvptx_open_device (int n)
   ptx_dev->omp_stacks.size = 0;
   pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
 
+  ptx_dev->rev_data = NULL;
+
   return ptx_dev;
 }
 
-- 
2.35.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
                   ` (3 preceding siblings ...)
  2022-09-13  7:07 ` Tobias Burnus
@ 2023-03-21 15:53 ` Thomas Schwinge
  2023-03-24 15:43   ` [og12] " Thomas Schwinge
  2023-04-28  8:48   ` Tobias Burnus
  2023-04-04 14:40 ` [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Thomas Schwinge
  5 siblings, 2 replies; 31+ messages in thread
From: Thomas Schwinge @ 2023-03-21 15:53 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches; +Cc: Jakub Jelinek, Tom de Vries, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 929 bytes --]

Hi!

On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> This patch adds initial [OpenMP reverse offload] support for nvptx.

> CUDA does lockup when trying to copy data from the currently running
> stream; hence, a new stream is generated to do the memory copying.

As part of other work, where I had to touch those special code paths, I
found that we may reduce complexity a little bit "by using the existing
'goacc_asyncqueue' instead of re-coding parts of it".  OK to push
"libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation"
(still testing), see attached?


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-libgomp-Simplify-OpenMP-reverse-offload-host-device-.patch --]
[-- Type: text/x-diff, Size: 15223 bytes --]

From 65636e924f69a146e571e7a7009304803e24ca1a Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 21 Mar 2023 16:14:16 +0100
Subject: [PATCH] libgomp: Simplify OpenMP reverse offload host <-> device
 memory copy implementation

... by using the existing 'goacc_asyncqueue' instead of re-coding parts of it.

Follow-up to commit 131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
and commit ea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8
"libgomp: Handle OpenMP's reverse offloads".

	libgomp/
	* target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
	'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
	* libgomp.h (gomp_target_rev): Adjust.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
	* plugin/plugin-gcn.c (process_reverse_offload): Adjust.
	* plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
	(rev_off_host_to_dev_cpy): Remove.
	(GOMP_OFFLOAD_run): Adjust.
---
 libgomp/libgomp-plugin.c      |   7 +--
 libgomp/libgomp-plugin.h      |   6 +-
 libgomp/libgomp.h             |   5 +-
 libgomp/plugin/plugin-gcn.c   |   2 +-
 libgomp/plugin/plugin-nvptx.c |  77 ++++++++++++++-----------
 libgomp/target.c              | 102 +++++++++++++++-------------------
 6 files changed, 96 insertions(+), 103 deletions(-)

diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 27e7c94ba9b..d696515eeb6 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -82,11 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
 void
 GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-			void (*dev_to_host_cpy) (void *, const void *, size_t,
-						 void *),
-			void (*host_to_dev_cpy) (void *, const void *, size_t,
-						 void *), void *token)
+			struct goacc_asyncqueue *aq)
 {
   gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
-		   dev_to_host_cpy, host_to_dev_cpy, token);
+		   aq);
 }
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 28267f75f7a..42ee3d6c7f9 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,11 +121,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
 extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
-				    uint64_t, int,
-				    void (*) (void *, const void *, size_t,
-					      void *),
-				    void (*) (void *, const void *, size_t,
-					      void *), void *);
+				    uint64_t, int, struct goacc_asyncqueue *);
 
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index ba8fe348aba..4d2bfab4b71 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1130,10 +1130,7 @@ extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
 extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
-			     int,
-			     void (*) (void *, const void *, size_t, void *),
-			     void (*) (void *, const void *, size_t, void *),
-			     void *);
+			     int, struct goacc_asyncqueue *);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 347803762eb..2181bf0235f 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1949,7 +1949,7 @@ process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
 {
   int dev_num = dev_num64;
   GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
-			  NULL, NULL, NULL);
+			  NULL);
 }
 
 /* Output any data written to console output from the kernel.  It is expected
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 5bd5a419e0e..4a710851ee5 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -56,6 +56,7 @@
 #include <unistd.h>
 #include <assert.h>
 #include <errno.h>
+#include <stdlib.h>
 
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
@@ -1739,11 +1740,11 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
   return 1;
 }
 
-struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+static struct goacc_asyncqueue *
+nvptx_goacc_asyncqueue_construct (unsigned int flags)
 {
   CUstream stream = NULL;
-  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, CU_STREAM_DEFAULT);
+  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags);
 
   struct goacc_asyncqueue *aq
     = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
@@ -1751,14 +1752,26 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
   return aq;
 }
 
-bool
-GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+{
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+}
+
+static bool
+nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
 {
   CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
   free (aq);
   return true;
 }
 
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+  return nvptx_goacc_asyncqueue_destruct (aq);
+}
+
 int
 GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
@@ -1772,13 +1785,19 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
   return -1;
 }
 
-bool
-GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+static bool
+nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
 {
   CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream);
   return true;
 }
 
+bool
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+  return nvptx_goacc_asyncqueue_synchronize (aq);
+}
+
 bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
@@ -2038,22 +2057,6 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 }
 
 
-void
-rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
-			 CUstream stream)
-{
-  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
-  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
-}
-
-void
-rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
-			 CUstream stream)
-{
-  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
-  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
-}
-
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2087,9 +2090,17 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     }
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
-  size_t stack_size = nvptx_stacks_size ();
   bool reverse_offload = ptx_dev->rev_data != NULL;
-  CUstream copy_stream = NULL;
+  struct goacc_asyncqueue *reverse_offload_aq = NULL;
+  if (reverse_offload)
+    {
+      reverse_offload_aq
+	= nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
+      if (!reverse_offload_aq)
+	exit (EXIT_FAILURE);
+    }
+
+  size_t stack_size = nvptx_stacks_size ();
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2103,8 +2114,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
-  if (reverse_offload)
-    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
@@ -2127,17 +2136,15 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
 				    rev_data->addrs, rev_data->sizes,
 				    rev_data->kinds, rev_data->dev_num,
-				    rev_off_dev_to_host_cpy,
-				    rev_off_host_to_dev_cpy, copy_stream);
-	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+				    reverse_offload_aq);
+	    if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
+	      exit (EXIT_FAILURE);
 	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
 	  }
 	usleep (1);
       }
   else
     r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (reverse_offload)
-    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
@@ -2145,6 +2152,12 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
 
   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+
+  if (reverse_offload)
+    {
+      if (!nvptx_goacc_asyncqueue_destruct (reverse_offload_aq))
+	exit (EXIT_FAILURE);
+    }
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run. */
diff --git a/libgomp/target.c b/libgomp/target.c
index 79ed64a5dc3..e02188cf7e1 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -3312,9 +3312,7 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
 void
 gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-		 void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
-		 void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
-		 void *token)
+		 struct goacc_asyncqueue *aq)
 {
   /* Return early if there is no offload code.  */
   if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
@@ -3356,26 +3354,17 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
       devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
       sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
       kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
-      if (dev_to_host_cpy)
-	{
-	  dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
-			   mapnum * sizeof (uint64_t), token);
-	  dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
-			   mapnum * sizeof (uint64_t), token);
-	  dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
-			   mapnum * sizeof (unsigned short), token);
-	}
-      else
-	{
-	  gomp_copy_dev2host (devicep, NULL, devaddrs,
-			      (const void *) (uintptr_t) devaddrs_ptr,
-			      mapnum * sizeof (uint64_t));
-	  gomp_copy_dev2host (devicep, NULL, sizes,
-			      (const void *) (uintptr_t) sizes_ptr,
-			      mapnum * sizeof (uint64_t));
-	  gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
-			      mapnum * sizeof (unsigned short));
-	}
+      gomp_copy_dev2host (devicep, aq, devaddrs,
+			  (const void *) (uintptr_t) devaddrs_ptr,
+			  mapnum * sizeof (uint64_t));
+      gomp_copy_dev2host (devicep, aq, sizes,
+			  (const void *) (uintptr_t) sizes_ptr,
+			  mapnum * sizeof (uint64_t));
+      gomp_copy_dev2host (devicep, aq, kinds,
+			  (const void *) (uintptr_t) kinds_ptr,
+			  mapnum * sizeof (unsigned short));
+      if (aq && !devicep->openacc.async.synchronize_func (aq))
+	exit (EXIT_FAILURE);
     }
 
   size_t tgt_align = 0, tgt_size = 0;
@@ -3402,13 +3391,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 	    if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	      memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
 		      (size_t) sizes[i]);
-	    else if (dev_to_host_cpy)
-	      dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
-			       (size_t) sizes[i], token);
 	    else
-	      gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
-				  (void *) (uintptr_t) devaddrs[i],
-				  (size_t) sizes[i]);
+	      {
+		gomp_copy_dev2host (devicep, aq, tgt + tgt_size,
+				    (void *) (uintptr_t) devaddrs[i],
+				    (size_t) sizes[i]);
+		if (aq && !devicep->openacc.async.synchronize_func (aq))
+		  exit (EXIT_FAILURE);
+	      }
 	    devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
 	    tgt_size = tgt_size + sizes[i];
 	    if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
@@ -3498,15 +3488,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		    || kind == GOMP_MAP_ALWAYS_TO
 		    || kind == GOMP_MAP_ALWAYS_TOFROM)
 		  {
-		    if (dev_to_host_cpy)
-		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
-				       (void *) (uintptr_t) cdata[i].devaddr,
-				       sizes[i], token);
-		    else
-		      gomp_copy_dev2host (devicep, NULL,
-					  (void *) (uintptr_t) devaddrs[i],
-					  (void *) (uintptr_t) cdata[i].devaddr,
-					  sizes[i]);
+		    gomp_copy_dev2host (devicep, aq,
+					(void *) (uintptr_t) devaddrs[i],
+					(void *) (uintptr_t) cdata[i].devaddr,
+					sizes[i]);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			exit (EXIT_FAILURE);
+		      }
 		  }
 		if (struct_cpy)
 		  struct_cpy--;
@@ -3573,15 +3563,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		    devaddrs[i]
 		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
 								   sizes[i]);
-		    if (dev_to_host_cpy)
-		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
-				       (void *) (uintptr_t) cdata[i].devaddr,
-				       sizes[i], token);
-		    else
-		      gomp_copy_dev2host (devicep, NULL,
-					  (void *) (uintptr_t) devaddrs[i],
-					  (void *) (uintptr_t) cdata[i].devaddr,
-					  sizes[i]);
+		    gomp_copy_dev2host (devicep, aq,
+					(void *) (uintptr_t) devaddrs[i],
+					(void *) (uintptr_t) cdata[i].devaddr,
+					sizes[i]);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			exit (EXIT_FAILURE);
+		      }
 		  }
 		for (j = i + 1; j < mapnum; j++)
 		  {
@@ -3685,15 +3675,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		/* FALLTHRU */
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_TOFROM:
-		if (copy && host_to_dev_cpy)
-		  host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
-				   (void *) (uintptr_t) devaddrs[i],
-				   sizes[i], token);
-		else if (copy)
-		  gomp_copy_host2dev (devicep, NULL,
-				      (void *) (uintptr_t) cdata[i].devaddr,
-				      (void *) (uintptr_t) devaddrs[i],
-				      sizes[i], false, NULL);
+		if (copy)
+		  {
+		    gomp_copy_host2dev (devicep, aq,
+					(void *) (uintptr_t) cdata[i].devaddr,
+					(void *) (uintptr_t) devaddrs[i],
+					sizes[i], false, NULL);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      exit (EXIT_FAILURE);
+		  }
 	      default:
 		break;
 	    }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [og12] libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2023-03-21 15:53 ` libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] " Thomas Schwinge
@ 2023-03-24 15:43   ` Thomas Schwinge
  2023-04-28  8:48   ` Tobias Burnus
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2023-03-24 15:43 UTC (permalink / raw)
  To: Tobias Burnus, gcc-patches; +Cc: Jakub Jelinek, Tom de Vries, Alexander Monakov

[-- Attachment #1: Type: text/plain, Size: 1302 bytes --]

Hi!

On 2023-03-21T16:53:31+0100, I wrote:
> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
>> This patch adds initial [OpenMP reverse offload] support for nvptx.
>
>> CUDA does lockup when trying to copy data from the currently running
>> stream; hence, a new stream is generated to do the memory copying.
>
> As part of other work, where I had to touch those special code paths, I
> found that we may reduce complexity a little bit "by using the existing
> 'goacc_asyncqueue' instead of re-coding parts of it".  OK to push
> "libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation"
> (still testing), see attached?

My other work now actually does depend on this; I've pushed to
devel/omp/gcc-12 branch commit c276fa0616eb79ddc4d0245e775a841e84cbb7dd
"libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation",
see attached.

May this also go into the master branch at this time, or "next year"?


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-libgomp-Simplify-OpenMP-reverse-offload-host-device-.patch --]
[-- Type: text/x-diff, Size: 16160 bytes --]

From c276fa0616eb79ddc4d0245e775a841e84cbb7dd Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 21 Mar 2023 16:14:16 +0100
Subject: [PATCH] libgomp: Simplify OpenMP reverse offload host <-> device
 memory copy implementation

... by using the existing 'goacc_asyncqueue' instead of re-coding parts of it.

Follow-up to commit 131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
and commit ea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8
"libgomp: Handle OpenMP's reverse offloads".

	libgomp/
	* target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
	'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
	* libgomp.h (gomp_target_rev): Adjust.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
	* plugin/plugin-gcn.c (process_reverse_offload): Adjust.
	* plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
	(rev_off_host_to_dev_cpy): Remove.
	(GOMP_OFFLOAD_run): Adjust.
---
 libgomp/ChangeLog.omp         |  10 ++++
 libgomp/libgomp-plugin.c      |   7 +--
 libgomp/libgomp-plugin.h      |   6 +-
 libgomp/libgomp.h             |   5 +-
 libgomp/plugin/plugin-gcn.c   |   2 +-
 libgomp/plugin/plugin-nvptx.c |  77 ++++++++++++++-----------
 libgomp/target.c              | 102 +++++++++++++++-------------------
 7 files changed, 106 insertions(+), 103 deletions(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 9360db66b03..fb352b39a6d 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,15 @@
 2023-03-24  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
+	'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
+	* libgomp.h (gomp_target_rev): Adjust.
+	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
+	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
+	* plugin/plugin-gcn.c (process_reverse_offload): Adjust.
+	* plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
+	(rev_off_host_to_dev_cpy): Remove.
+	(GOMP_OFFLOAD_run): Adjust.
+
 	* target.c (gomp_unmap_vars_internal): Queue splay-tree keys for
 	removal after main loop.
 
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 316de749f69..c76fa63da83 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -82,11 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
 void
 GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-			void (*dev_to_host_cpy) (void *, const void *, size_t,
-						 void *),
-			void (*host_to_dev_cpy) (void *, const void *, size_t,
-						 void *), void *token)
+			struct goacc_asyncqueue *aq)
 {
   gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
-		   dev_to_host_cpy, host_to_dev_cpy, token);
+		   aq);
 }
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 66d995f33e8..ca557a79380 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -122,11 +122,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
 extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
-				    uint64_t, int,
-				    void (*) (void *, const void *, size_t,
-					      void *),
-				    void (*) (void *, const void *, size_t,
-					      void *), void *);
+				    uint64_t, int, struct goacc_asyncqueue *);
 
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 92f6f14960f..3b2b4aa9534 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1127,10 +1127,7 @@ extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
 extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
-			     int,
-			     void (*) (void *, const void *, size_t, void *),
-			     void (*) (void *, const void *, size_t, void *),
-			     void *);
+			     int, struct goacc_asyncqueue *);
 extern void * gomp_usm_alloc (size_t size, int device_num);
 extern void gomp_usm_free (void *device_ptr, int device_num);
 extern bool gomp_page_locked_host_alloc (void **, size_t);
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 64694cdc118..82f5940f970 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -2008,7 +2008,7 @@ process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
 {
   int dev_num = dev_num64;
   GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
-			  NULL, NULL, NULL);
+			  NULL);
 }
 
 /* Output any data written to console output from the kernel.  It is expected
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 6ade34beb67..23f89b6fb34 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -56,6 +56,7 @@
 #include <unistd.h>
 #include <assert.h>
 #include <errno.h>
+#include <stdlib.h>
 
 /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
    block to cache between kernel invocations.  For soft-stacks blocks bigger
@@ -1837,11 +1838,11 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
   return 1;
 }
 
-struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+static struct goacc_asyncqueue *
+nvptx_goacc_asyncqueue_construct (unsigned int flags)
 {
   CUstream stream = NULL;
-  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, CU_STREAM_DEFAULT);
+  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags);
 
   struct goacc_asyncqueue *aq
     = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
@@ -1849,14 +1850,26 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
   return aq;
 }
 
-bool
-GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+{
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+}
+
+static bool
+nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
 {
   CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
   free (aq);
   return true;
 }
 
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+  return nvptx_goacc_asyncqueue_destruct (aq);
+}
+
 int
 GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
@@ -1870,13 +1883,19 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
   return -1;
 }
 
-bool
-GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+static bool
+nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
 {
   CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream);
   return true;
 }
 
+bool
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+  return nvptx_goacc_asyncqueue_synchronize (aq);
+}
+
 bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
@@ -2136,22 +2155,6 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 }
 
 
-void
-rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
-			 CUstream stream)
-{
-  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
-  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
-}
-
-void
-rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
-			 CUstream stream)
-{
-  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
-  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
-}
-
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
@@ -2185,9 +2188,17 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     }
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
-  size_t stack_size = nvptx_stacks_size ();
   bool reverse_offload = ptx_dev->rev_data != NULL;
-  CUstream copy_stream = NULL;
+  struct goacc_asyncqueue *reverse_offload_aq = NULL;
+  if (reverse_offload)
+    {
+      reverse_offload_aq
+	= nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
+      if (!reverse_offload_aq)
+	exit (EXIT_FAILURE);
+    }
+
+  size_t stack_size = nvptx_stacks_size ();
 
   pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
   void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
@@ -2201,8 +2212,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
 		     __FUNCTION__, fn_name, teams, threads);
-  if (reverse_offload)
-    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
 			 32, threads, 1, lowlat_pool_size, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
@@ -2225,17 +2234,15 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
 				    rev_data->addrs, rev_data->sizes,
 				    rev_data->kinds, rev_data->dev_num,
-				    rev_off_dev_to_host_cpy,
-				    rev_off_host_to_dev_cpy, copy_stream);
-	    CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
+				    reverse_offload_aq);
+	    if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
+	      exit (EXIT_FAILURE);
 	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
 	  }
 	usleep (1);
       }
   else
     r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (reverse_offload)
-    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
   if (r == CUDA_ERROR_LAUNCH_FAILED)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
 		       maybe_abort_msg);
@@ -2243,6 +2250,12 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
 
   pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
+
+  if (reverse_offload)
+    {
+      if (!nvptx_goacc_asyncqueue_destruct (reverse_offload_aq))
+	exit (EXIT_FAILURE);
+    }
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run. */
diff --git a/libgomp/target.c b/libgomp/target.c
index 107c3567a30..2f53f056e53 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -3527,9 +3527,7 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
 void
 gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-		 void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
-		 void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
-		 void *token)
+		 struct goacc_asyncqueue *aq)
 {
   /* Return early if there is no offload code.  */
   if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
@@ -3571,26 +3569,17 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
       devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
       sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
       kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
-      if (dev_to_host_cpy)
-	{
-	  dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
-			   mapnum * sizeof (uint64_t), token);
-	  dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
-			   mapnum * sizeof (uint64_t), token);
-	  dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
-			   mapnum * sizeof (unsigned short), token);
-	}
-      else
-	{
-	  gomp_copy_dev2host (devicep, NULL, devaddrs,
-			      (const void *) (uintptr_t) devaddrs_ptr,
-			      mapnum * sizeof (uint64_t));
-	  gomp_copy_dev2host (devicep, NULL, sizes,
-			      (const void *) (uintptr_t) sizes_ptr,
-			      mapnum * sizeof (uint64_t));
-	  gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
-			      mapnum * sizeof (unsigned short));
-	}
+      gomp_copy_dev2host (devicep, aq, devaddrs,
+			  (const void *) (uintptr_t) devaddrs_ptr,
+			  mapnum * sizeof (uint64_t));
+      gomp_copy_dev2host (devicep, aq, sizes,
+			  (const void *) (uintptr_t) sizes_ptr,
+			  mapnum * sizeof (uint64_t));
+      gomp_copy_dev2host (devicep, aq, kinds,
+			  (const void *) (uintptr_t) kinds_ptr,
+			  mapnum * sizeof (unsigned short));
+      if (aq && !devicep->openacc.async.synchronize_func (aq))
+	exit (EXIT_FAILURE);
     }
 
   size_t tgt_align = 0, tgt_size = 0;
@@ -3617,13 +3606,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 	    if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	      memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
 		      (size_t) sizes[i]);
-	    else if (dev_to_host_cpy)
-	      dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
-			       (size_t) sizes[i], token);
 	    else
-	      gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
-				  (void *) (uintptr_t) devaddrs[i],
-				  (size_t) sizes[i]);
+	      {
+		gomp_copy_dev2host (devicep, aq, tgt + tgt_size,
+				    (void *) (uintptr_t) devaddrs[i],
+				    (size_t) sizes[i]);
+		if (aq && !devicep->openacc.async.synchronize_func (aq))
+		  exit (EXIT_FAILURE);
+	      }
 	    devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
 	    tgt_size = tgt_size + sizes[i];
 	    if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
@@ -3735,15 +3725,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		    || kind == GOMP_MAP_FORCE_TOFROM
 		    || GOMP_MAP_ALWAYS_TO_P (kind))
 		  {
-		    if (dev_to_host_cpy)
-		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
-				       (void *) (uintptr_t) cdata[i].devaddr,
-				       sizes[i], token);
-		    else
-		      gomp_copy_dev2host (devicep, NULL,
-					  (void *) (uintptr_t) devaddrs[i],
-					  (void *) (uintptr_t) cdata[i].devaddr,
-					  sizes[i]);
+		    gomp_copy_dev2host (devicep, aq,
+					(void *) (uintptr_t) devaddrs[i],
+					(void *) (uintptr_t) cdata[i].devaddr,
+					sizes[i]);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			exit (EXIT_FAILURE);
+		      }
 		  }
 		if (struct_cpy)
 		  struct_cpy--;
@@ -3810,15 +3800,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		    devaddrs[i]
 		      = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
 								   sizes[i]);
-		    if (dev_to_host_cpy)
-		      dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
-				       (void *) (uintptr_t) cdata[i].devaddr,
-				       sizes[i], token);
-		    else
-		      gomp_copy_dev2host (devicep, NULL,
-					  (void *) (uintptr_t) devaddrs[i],
-					  (void *) (uintptr_t) cdata[i].devaddr,
-					  sizes[i]);
+		    gomp_copy_dev2host (devicep, aq,
+					(void *) (uintptr_t) devaddrs[i],
+					(void *) (uintptr_t) cdata[i].devaddr,
+					sizes[i]);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      {
+			gomp_mutex_unlock (&devicep->lock);
+			exit (EXIT_FAILURE);
+		      }
 		  }
 		for (j = i + 1; j < mapnum; j++)
 		  {
@@ -3926,15 +3916,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 		/* FALLTHRU */
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_TOFROM:
-		if (copy && host_to_dev_cpy)
-		  host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
-				   (void *) (uintptr_t) devaddrs[i],
-				   sizes[i], token);
-		else if (copy)
-		  gomp_copy_host2dev (devicep, NULL,
-				      (void *) (uintptr_t) cdata[i].devaddr,
-				      (void *) (uintptr_t) devaddrs[i],
-				      sizes[i], false, NULL);
+		if (copy)
+		  {
+		    gomp_copy_host2dev (devicep, aq,
+					(void *) (uintptr_t) cdata[i].devaddr,
+					(void *) (uintptr_t) devaddrs[i],
+					sizes[i], false, NULL);
+		    if (aq && !devicep->openacc.async.synchronize_func (aq))
+		      exit (EXIT_FAILURE);
+		  }
 	      default:
 		break;
 	    }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
                   ` (4 preceding siblings ...)
  2023-03-21 15:53 ` libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] " Thomas Schwinge
@ 2023-04-04 14:40 ` Thomas Schwinge
  2023-04-28  8:28   ` Tobias Burnus
  5 siblings, 1 reply; 31+ messages in thread
From: Thomas Schwinge @ 2023-04-04 14:40 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Alexander Monakov, Jakub Jelinek, Tom de Vries, gcc-patches

Hi!

During GCC/OpenMP/nvptx reverse offload investigations, about how to
replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
something re:

On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> Better suggestions are welcome for the busy loop in
> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
> its value.

> On the host side, the last address is checked - if fn_addr != NULL,
> it passes all arguments on to the generic (target.c) gomp_target_rev
> to do the actual offloading.
>
> CUDA does lockup when trying to copy data from the currently running
> stream; hence, a new stream is generated to do the memory copying.

> Future work for nvptx:
> * Adjust 'sleep', possibly [...]
>   to do shorter sleeps than usleep(1)?

... this busy loop.

Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':

    [...]
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
      r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                             32, threads, 1, 0, NULL, NULL, config);
      if (r != CUDA_SUCCESS)
        GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
      if (reverse_offload)
        while (true)
          {
            r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
            if (r == CUDA_SUCCESS)
              break;
            if (r == CUDA_ERROR_LAUNCH_FAILED)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
                                 maybe_abort_msg);
            else if (r != CUDA_ERROR_NOT_READY)
              GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));

            if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
              {
                struct rev_offload *rev_data = ptx_dev->rev_data;
                GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
                                        rev_data->addrs, rev_data->sizes,
                                        rev_data->kinds, rev_data->dev_num,
                                        rev_off_dev_to_host_cpy,
                                        rev_off_host_to_dev_cpy, copy_stream);
                CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
                __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
              }
            usleep (1);
          }
      else
        r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
      if (reverse_offload)
        CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
    [...]

Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
to use "Stream Memory Operations",
<https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
that allow to "Wait on a memory location", "until the given condition on
the memory is satisfied"?
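
A minimal sketch of what I have in mind -- untested, and with made-up
names: 'fn_dev_addr' would be a device-accessible 'CUdeviceptr' for
'ptx_dev->rev_data->fn', and the device would have to report
'CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS':

    [...]
      /* Enqueue a wait until '*fn_dev_addr != 0' (as unsigned: ">= 1";
         note the wrap-around semantics of the GEQ comparison, so this
         assumes 'fn' values stay below 2^63), then block the host on
         that stream instead of polling with 'usleep'.  */
      CUDA_CALL_ASSERT (cuStreamWaitValue64, copy_stream, fn_dev_addr,
                        1, CU_STREAM_WAIT_VALUE_GEQ);
      CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
    [...]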

For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':

    [...]
      GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
      GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
      GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
      GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
      GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

      /* Set 'fn' to trigger processing on the host; wait for completion,
         which is flagged by setting 'fn' back to 0 on the host.  */
      uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
    #if __PTX_SM__ >= 700
      asm volatile ("st.global.release.sys.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #else
      __sync_synchronize ();  /* membar.sys */
      asm volatile ("st.volatile.global.u64 [%0], %1;"
                    : : "r"(addr_struct_fn), "r" (fn) : "memory");
    #endif

    #if __PTX_SM__ >= 700
      uint64_t fn2;
      do
        {
          asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
                        : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
        }
      while (fn2 != 0);
    #else
      /* ld.global.u64 %r64,[__gomp_rev_offload_var];
         ld.u64 %r36,[%r64];
         membar.sys;  */
      while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
        ;  /* spin  */
    #endif
    [...]


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2023-04-04 14:40 ` [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Thomas Schwinge
@ 2023-04-28  8:28   ` Tobias Burnus
  2023-04-28  9:23     ` Thomas Schwinge
  0 siblings, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2023-04-28  8:28 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: Alexander Monakov, Jakub Jelinek, Tom de Vries, gcc-patches

Hi Thomas,

maybe I misunderstood your suggestion, but "Wait on a memory location"
assumes that there will be a change – yet if a target region happens to
have no reverse offload, the memory location will never change, while
the target region should still return to the host.

What we would need: wait on a memory location – and return if either the
kernel stopped *or* the memory location changed.

My impression is that "return if the kernel stopped" is not really
guaranteed. Or did I miss some fine print?

Tobias

On 04.04.23 16:40, Thomas Schwinge wrote:
> Hi!
>
> During GCC/OpenMP/nvptx reverse offload investigations, about how to
> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
> something re:
>
> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
>> Better suggestions are welcome for the busy loop in
>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
>> its value.
>> On the host side, the last address is checked - if fn_addr != NULL,
>> it passes all arguments on to the generic (target.c) gomp_target_rev
>> to do the actual offloading.
>>
>> CUDA does lockup when trying to copy data from the currently running
>> stream; hence, a new stream is generated to do the memory copying.
>> Future work for nvptx:
>> * Adjust 'sleep', possibly [...]
>>    to do shorter sleeps than usleep(1)?
> ... this busy loop.
>
> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>
>      [...]
>        if (reverse_offload)
>          CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>        r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>                               32, threads, 1, 0, NULL, NULL, config);
>        if (r != CUDA_SUCCESS)
>          GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>        if (reverse_offload)
>          while (true)
>            {
>              r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>              if (r == CUDA_SUCCESS)
>                break;
>              if (r == CUDA_ERROR_LAUNCH_FAILED)
>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
>                                   maybe_abort_msg);
>              else if (r != CUDA_ERROR_NOT_READY)
>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>
>              if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>                {
>                  struct rev_offload *rev_data = ptx_dev->rev_data;
>                  GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>                                          rev_data->addrs, rev_data->sizes,
>                                          rev_data->kinds, rev_data->dev_num,
>                                          rev_off_dev_to_host_cpy,
>                                          rev_off_host_to_dev_cpy, copy_stream);
>                  CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>                  __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>                }
>              usleep (1);
>            }
>        else
>          r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>        if (reverse_offload)
>          CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>      [...]
>
> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
> to use "Stream Memory Operations",
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
> that allow to "Wait on a memory location", "until the given condition on
> the memory is satisfied"?
>
> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>
>      [...]
>        GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>        GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>        GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>        GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>        GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>
>        /* Set 'fn' to trigger processing on the host; wait for completion,
>           which is flagged by setting 'fn' back to 0 on the host.  */
>        uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>      #if __PTX_SM__ >= 700
>        asm volatile ("st.global.release.sys.u64 [%0], %1;"
>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>      #else
>        __sync_synchronize ();  /* membar.sys */
>        asm volatile ("st.volatile.global.u64 [%0], %1;"
>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>      #endif
>
>      #if __PTX_SM__ >= 700
>        uint64_t fn2;
>        do
>          {
>            asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>                          : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>          }
>        while (fn2 != 0);
>      #else
>        /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>           ld.u64 %r36,[%r64];
>           membar.sys;  */
>        while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>          ;  /* spin  */
>      #endif
>      [...]
>
>
> Grüße
>   Thomas
> -----------------
> Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2023-03-21 15:53 ` libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] " Thomas Schwinge
  2023-03-24 15:43   ` [og12] " Thomas Schwinge
@ 2023-04-28  8:48   ` Tobias Burnus
  2023-04-28  9:31     ` Thomas Schwinge
  1 sibling, 1 reply; 31+ messages in thread
From: Tobias Burnus @ 2023-04-28  8:48 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: Jakub Jelinek

Hi Thomas,

On 21.03.23 16:53, Thomas Schwinge wrote:
> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com>
> wrote:
>> This patch adds initial [OpenMP reverse offload] support for nvptx.
>> CUDA does lockup when trying to copy data from the currently running
>> stream; hence, a new stream is generated to do the memory copying.
> As part of other work, where I had to touch those special code paths, I
> found that we may reduce complexity a little bit "by using the existing
> 'goacc_asyncqueue' instead of re-coding parts of it".  OK to push
> "libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation"
> (still testing), see attached?

I don't think that just calling "exit (EXIT_FAILURE);" is the proper
way – I think that should be GOMP_PLUGIN_fatal in the plugin and
gomp_fatal in target.c.
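
That is, something like the following in the plugin (exact error wording
made up here):

      if (!reverse_offload_aq)
        GOMP_PLUGIN_fatal ("reverse offload async queue construction failed");

and correspondingly 'gomp_fatal' instead of the bare 'exit (EXIT_FAILURE)'
in 'gomp_target_rev', keeping the 'gomp_mutex_unlock (&devicep->lock)'
where applicable.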

Otherwise, it LGTM.

Tobias

> Subject: [PATCH] libgomp: Simplify OpenMP reverse offload host <-> device
>   memory copy implementation
>
> ... by using the existing 'goacc_asyncqueue' instead of re-coding parts of it.
>
> Follow-up to commit 131d18e928a3ea1ab2d3bf61aa92d68a8a254609
> "libgomp/nvptx: Prepare for reverse-offload callback handling",
> and commit ea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8
> "libgomp: Handle OpenMP's reverse offloads".
>
>       libgomp/
>       * target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
>       'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
>       * libgomp.h (gomp_target_rev): Adjust.
>       * libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
>       * libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
>       * plugin/plugin-gcn.c (process_reverse_offload): Adjust.
>       * plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
>       (rev_off_host_to_dev_cpy): Remove.
>       (GOMP_OFFLOAD_run): Adjust.
> ---
>   libgomp/libgomp-plugin.c      |   7 +--
>   libgomp/libgomp-plugin.h      |   6 +-
>   libgomp/libgomp.h             |   5 +-
>   libgomp/plugin/plugin-gcn.c   |   2 +-
>   libgomp/plugin/plugin-nvptx.c |  77 ++++++++++++++-----------
>   libgomp/target.c              | 102 +++++++++++++++-------------------
>   6 files changed, 96 insertions(+), 103 deletions(-)
>
> diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
> index 27e7c94ba9b..d696515eeb6 100644
> --- a/libgomp/libgomp-plugin.c
> +++ b/libgomp/libgomp-plugin.c
> @@ -82,11 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>   void
>   GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>                       uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> -                     void (*dev_to_host_cpy) (void *, const void *, size_t,
> -                                              void *),
> -                     void (*host_to_dev_cpy) (void *, const void *, size_t,
> -                                              void *), void *token)
> +                     struct goacc_asyncqueue *aq)
>   {
>     gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
> -                dev_to_host_cpy, host_to_dev_cpy, token);
> +                aq);
>   }
> diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
> index 28267f75f7a..42ee3d6c7f9 100644
> --- a/libgomp/libgomp-plugin.h
> +++ b/libgomp/libgomp-plugin.h
> @@ -121,11 +121,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
>       __attribute__ ((noreturn, format (printf, 1, 2)));
>
>   extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
> -                                 uint64_t, int,
> -                                 void (*) (void *, const void *, size_t,
> -                                           void *),
> -                                 void (*) (void *, const void *, size_t,
> -                                           void *), void *);
> +                                 uint64_t, int, struct goacc_asyncqueue *);
>
>   /* Prototypes for functions implemented by libgomp plugins.  */
>   extern const char *GOMP_OFFLOAD_get_name (void);
> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
> index ba8fe348aba..4d2bfab4b71 100644
> --- a/libgomp/libgomp.h
> +++ b/libgomp/libgomp.h
> @@ -1130,10 +1130,7 @@ extern void gomp_init_targets_once (void);
>   extern int gomp_get_num_devices (void);
>   extern bool gomp_target_task_fn (void *);
>   extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
> -                          int,
> -                          void (*) (void *, const void *, size_t, void *),
> -                          void (*) (void *, const void *, size_t, void *),
> -                          void *);
> +                          int, struct goacc_asyncqueue *);
>
>   /* Splay tree definitions.  */
>   typedef struct splay_tree_node_s *splay_tree_node;
> diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
> index 347803762eb..2181bf0235f 100644
> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c
> @@ -1949,7 +1949,7 @@ process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
>   {
>     int dev_num = dev_num64;
>     GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
> -                       NULL, NULL, NULL);
> +                       NULL);
>   }
>
>   /* Output any data written to console output from the kernel.  It is expected
> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index 5bd5a419e0e..4a710851ee5 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -56,6 +56,7 @@
>   #include <unistd.h>
>   #include <assert.h>
>   #include <errno.h>
> +#include <stdlib.h>
>
>   /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
>      block to cache between kernel invocations.  For soft-stacks blocks bigger
> @@ -1739,11 +1740,11 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
>     return 1;
>   }
>
> -struct goacc_asyncqueue *
> -GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
> +static struct goacc_asyncqueue *
> +nvptx_goacc_asyncqueue_construct (unsigned int flags)
>   {
>     CUstream stream = NULL;
> -  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, CU_STREAM_DEFAULT);
> +  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags);
>
>     struct goacc_asyncqueue *aq
>       = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
> @@ -1751,14 +1752,26 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
>     return aq;
>   }
>
> -bool
> -GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
> +struct goacc_asyncqueue *
> +GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
> +{
> +  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
> +}
> +
> +static bool
> +nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
>   {
>     CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
>     free (aq);
>     return true;
>   }
>
> +bool
> +GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
> +{
> +  return nvptx_goacc_asyncqueue_destruct (aq);
> +}
> +
>   int
>   GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
>   {
> @@ -1772,13 +1785,19 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
>     return -1;
>   }
>
> -bool
> -GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
> +static bool
> +nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
>   {
>     CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream);
>     return true;
>   }
>
> +bool
> +GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
> +{
> +  return nvptx_goacc_asyncqueue_synchronize (aq);
> +}
> +
>   bool
>   GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
>                                     struct goacc_asyncqueue *aq2)
> @@ -2038,22 +2057,6 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
>   }
>
>
> -void
> -rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
> -                      CUstream stream)
> -{
> -  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
> -  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> -}
> -
> -void
> -rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
> -                      CUstream stream)
> -{
> -  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
> -  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
> -}
> -
>   void
>   GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>   {
> @@ -2087,9 +2090,17 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>       }
>     nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>
> -  size_t stack_size = nvptx_stacks_size ();
>     bool reverse_offload = ptx_dev->rev_data != NULL;
> -  CUstream copy_stream = NULL;
> +  struct goacc_asyncqueue *reverse_offload_aq = NULL;
> +  if (reverse_offload)
> +    {
> +      reverse_offload_aq
> +     = nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
> +      if (!reverse_offload_aq)
> +     exit (EXIT_FAILURE);
> +    }
> +
> +  size_t stack_size = nvptx_stacks_size ();
>
>     pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>     void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
> @@ -2103,8 +2114,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>     GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
>                    " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
>                    __FUNCTION__, fn_name, teams, threads);
> -  if (reverse_offload)
> -    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>     r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>                        32, threads, 1, 0, NULL, NULL, config);
>     if (r != CUDA_SUCCESS)
> @@ -2127,17 +2136,15 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>           GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>                                   rev_data->addrs, rev_data->sizes,
>                                   rev_data->kinds, rev_data->dev_num,
> -                                 rev_off_dev_to_host_cpy,
> -                                 rev_off_host_to_dev_cpy, copy_stream);
> -         CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
> +                                 reverse_offload_aq);
> +         if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
> +           exit (EXIT_FAILURE);
>           __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>         }
>       usleep (1);
>         }
>     else
>       r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
> -  if (reverse_offload)
> -    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>     if (r == CUDA_ERROR_LAUNCH_FAILED)
>       GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
>                      maybe_abort_msg);
> @@ -2145,6 +2152,12 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>       GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
>
>     pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
> +
> +  if (reverse_offload)
> +    {
> +      if (!nvptx_goacc_asyncqueue_destruct (reverse_offload_aq))
> +     exit (EXIT_FAILURE);
> +    }
>   }
>
>   /* TODO: Implement GOMP_OFFLOAD_async_run. */
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 79ed64a5dc3..e02188cf7e1 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -3312,9 +3312,7 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
>   void
>   gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>                uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
> -              void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
> -              void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
> -              void *token)
> +              struct goacc_asyncqueue *aq)
>   {
>     /* Return early if there is no offload code.  */
>     if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
> @@ -3356,26 +3354,17 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>         devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
>         sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
>         kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
> -      if (dev_to_host_cpy)
> -     {
> -       dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
> -                        mapnum * sizeof (uint64_t), token);
> -       dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
> -                        mapnum * sizeof (uint64_t), token);
> -       dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
> -                        mapnum * sizeof (unsigned short), token);
> -     }
> -      else
> -     {
> -       gomp_copy_dev2host (devicep, NULL, devaddrs,
> -                           (const void *) (uintptr_t) devaddrs_ptr,
> -                           mapnum * sizeof (uint64_t));
> -       gomp_copy_dev2host (devicep, NULL, sizes,
> -                           (const void *) (uintptr_t) sizes_ptr,
> -                           mapnum * sizeof (uint64_t));
> -       gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
> -                           mapnum * sizeof (unsigned short));
> -     }
> +      gomp_copy_dev2host (devicep, aq, devaddrs,
> +                       (const void *) (uintptr_t) devaddrs_ptr,
> +                       mapnum * sizeof (uint64_t));
> +      gomp_copy_dev2host (devicep, aq, sizes,
> +                       (const void *) (uintptr_t) sizes_ptr,
> +                       mapnum * sizeof (uint64_t));
> +      gomp_copy_dev2host (devicep, aq, kinds,
> +                       (const void *) (uintptr_t) kinds_ptr,
> +                       mapnum * sizeof (unsigned short));
> +      if (aq && !devicep->openacc.async.synchronize_func (aq))
> +     exit (EXIT_FAILURE);
>       }
>
>     size_t tgt_align = 0, tgt_size = 0;
> @@ -3402,13 +3391,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>           if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
>             memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
>                     (size_t) sizes[i]);
> -         else if (dev_to_host_cpy)
> -           dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
> -                            (size_t) sizes[i], token);
>           else
> -           gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
> -                               (void *) (uintptr_t) devaddrs[i],
> -                               (size_t) sizes[i]);
> +           {
> +             gomp_copy_dev2host (devicep, aq, tgt + tgt_size,
> +                                 (void *) (uintptr_t) devaddrs[i],
> +                                 (size_t) sizes[i]);
> +             if (aq && !devicep->openacc.async.synchronize_func (aq))
> +               exit (EXIT_FAILURE);
> +           }
>           devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
>           tgt_size = tgt_size + sizes[i];
>           if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> @@ -3498,15 +3488,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>                   || kind == GOMP_MAP_ALWAYS_TO
>                   || kind == GOMP_MAP_ALWAYS_TOFROM)
>                 {
> -                 if (dev_to_host_cpy)
> -                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
> -                                    (void *) (uintptr_t) cdata[i].devaddr,
> -                                    sizes[i], token);
> -                 else
> -                   gomp_copy_dev2host (devicep, NULL,
> -                                       (void *) (uintptr_t) devaddrs[i],
> -                                       (void *) (uintptr_t) cdata[i].devaddr,
> -                                       sizes[i]);
> +                 gomp_copy_dev2host (devicep, aq,
> +                                     (void *) (uintptr_t) devaddrs[i],
> +                                     (void *) (uintptr_t) cdata[i].devaddr,
> +                                     sizes[i]);
> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
> +                   {
> +                     gomp_mutex_unlock (&devicep->lock);
> +                     exit (EXIT_FAILURE);
> +                   }
>                 }
>               if (struct_cpy)
>                 struct_cpy--;
> @@ -3573,15 +3563,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>                   devaddrs[i]
>                     = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
>                                                                  sizes[i]);
> -                 if (dev_to_host_cpy)
> -                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
> -                                    (void *) (uintptr_t) cdata[i].devaddr,
> -                                    sizes[i], token);
> -                 else
> -                   gomp_copy_dev2host (devicep, NULL,
> -                                       (void *) (uintptr_t) devaddrs[i],
> -                                       (void *) (uintptr_t) cdata[i].devaddr,
> -                                       sizes[i]);
> +                 gomp_copy_dev2host (devicep, aq,
> +                                     (void *) (uintptr_t) devaddrs[i],
> +                                     (void *) (uintptr_t) cdata[i].devaddr,
> +                                     sizes[i]);
> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
> +                   {
> +                     gomp_mutex_unlock (&devicep->lock);
> +                     exit (EXIT_FAILURE);
> +                   }
>                 }
>               for (j = i + 1; j < mapnum; j++)
>                 {
> @@ -3685,15 +3675,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>               /* FALLTHRU */
>             case GOMP_MAP_FROM:
>             case GOMP_MAP_TOFROM:
> -             if (copy && host_to_dev_cpy)
> -               host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
> -                                (void *) (uintptr_t) devaddrs[i],
> -                                sizes[i], token);
> -             else if (copy)
> -               gomp_copy_host2dev (devicep, NULL,
> -                                   (void *) (uintptr_t) cdata[i].devaddr,
> -                                   (void *) (uintptr_t) devaddrs[i],
> -                                   sizes[i], false, NULL);
> +             if (copy)
> +               {
> +                 gomp_copy_host2dev (devicep, aq,
> +                                     (void *) (uintptr_t) cdata[i].devaddr,
> +                                     (void *) (uintptr_t) devaddrs[i],
> +                                     sizes[i], false, NULL);
> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
> +                   exit (EXIT_FAILURE);
> +               }
>             default:
>               break;
>           }
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling
  2023-04-28  8:28   ` Tobias Burnus
@ 2023-04-28  9:23     ` Thomas Schwinge
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2023-04-28  9:23 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Alexander Monakov, Jakub Jelinek, Tom de Vries, gcc-patches

Hi Tobias!

On 2023-04-28T10:28:22+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> maybe I misunderstood your suggestion, but

First, note that those CUDA "Stream Memory Operations" are something that
I found by chance, and don't have any actual experience with.  I can't
seem to find much documentation or usage of this API.

By the way, a similar thing also exists for AMD GPUs:
'hipStreamWaitValue32', etc.


> "Wait on a memory location"
> assumes that there will be a change – but if a target region happens to
> have no reverse offload, the memory location will never change, but
> still the target region should return to the host.

Oh indeed.  ;-) Details...

> What we would need: Wait on memory location – and return if either the
> kernel stopped *or* the memory location changed.

Or, have a way to "cancel", from the host, the 'cuStreamWaitValue32',
'cuStreamWaitValue64', after the actual 'target' kernel completed?

> My impression is that "return if the kernel stopped" is not really
> guaranteed. Of did I miss some fineprint?

No, you're right.  I suppose this is as designed: generally, there may
be additional kernel launches, for example, and the "wait" will then
eventually trigger.

Could we, after the actual 'target' kernel completed, issue a host-side
"write" ('cuStreamWriteValue32', 'cuStreamWriteValue64') to that memory
location, to signal end of processing for reverse offloads?

That is:

  - enqueue 'cuLaunchKernel'
  - enqueue 'cuStreamWriteValue' (to signal end of processing for reverse offloads)
  - loop on 'cuStreamWaitValue' (until end of processing for reverse offloads)
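
A rough, untested sketch of what that could look like in
'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run'.  The 32-bit flag
'sync_loc' (device-accessible) and its host mapping 'sync_loc_host', as
well as the REV_OFFLOAD_* values, are assumptions made up for this
sketch; they do not exist in the current implementation:

  /* Hypothetical protocol values; not existing libgomp symbols.  */
  enum { REV_OFFLOAD_IDLE = 0, REV_OFFLOAD_REQ = 1, REV_OFFLOAD_DONE = 2 };

  r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
                         32, threads, 1, 0, NULL, NULL, config);
  if (r != CUDA_SUCCESS)
    GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
  /* Enqueued on the same (default) stream as the kernel, this write only
     executes once the kernel completed; it signals end of processing for
     reverse offloads.  */
  CUDA_CALL_ASSERT (cuStreamWriteValue32, NULL, sync_loc,
                    REV_OFFLOAD_DONE, 0);
  while (true)
    {
      /* Instead of cuStreamQuery plus 'usleep (1)': sleep until
         *sync_loc >= REV_OFFLOAD_REQ, that is, until either the device
         requested a reverse offload or the kernel completed.  */
      CUDA_CALL_ASSERT (cuStreamWaitValue32, copy_stream, sync_loc,
                        REV_OFFLOAD_REQ, CU_STREAM_WAIT_VALUE_GEQ);
      CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
      if (__atomic_load_n (sync_loc_host, __ATOMIC_ACQUIRE)
          == REV_OFFLOAD_DONE)
        break;
      /* ... handle the reverse offload via GOMP_PLUGIN_target_rev, then
         reset *sync_loc to REV_OFFLOAD_IDLE.  N.B.: that reset can race
         with the REV_OFFLOAD_DONE write above; a real implementation
         would need a more careful protocol (a counter, say).  */
    }

Availability of these operations would presumably also need checking
(CU_DEVICE_ATTRIBUTE_CAN_USE_STREAM_MEM_OPS), and the device-side
GOMP_target_ext would have to store to 'sync_loc' instead of just to
'GOMP_REV_OFFLOAD_VAR->fn'.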


Grüße
 Thomas


> On 04.04.23 16:40, Thomas Schwinge wrote:
>> Hi!
>>
>> During GCC/OpenMP/nvptx reverse offload investigations, about how to
>> replace the problematic global 'GOMP_REV_OFFLOAD_VAR', I may have found
>> something re:
>>
>> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
>>> Better suggestions are welcome for the busy loop in
>>> libgomp/plugin/plugin-nvptx.c regarding the variable placement and checking
>>> its value.
>>> On the host side, the last address is checked - if fn_addr != NULL,
>>> it passes all arguments on to the generic (target.c) gomp_target_rev
>>> to do the actual offloading.
>>>
>>> CUDA does lockup when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>>> Future work for nvptx:
>>> * Adjust 'sleep', possibly [...]
>>>    to do shorter sleeps than usleep(1)?
>> ... this busy loop.
>>
>> Current 'libgomp/plugin/plugin-nvptx.c:GOMP_OFFLOAD_run':
>>
>>      [...]
>>        if (reverse_offload)
>>          CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>>        r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>>                               32, threads, 1, 0, NULL, NULL, config);
>>        if (r != CUDA_SUCCESS)
>>          GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
>>        if (reverse_offload)
>>          while (true)
>>            {
>>              r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL);
>>              if (r == CUDA_SUCCESS)
>>                break;
>>              if (r == CUDA_ERROR_LAUNCH_FAILED)
>>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r),
>>                                   maybe_abort_msg);
>>              else if (r != CUDA_ERROR_NOT_READY)
>>                GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
>>
>>              if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>>                {
>>                  struct rev_offload *rev_data = ptx_dev->rev_data;
>>                  GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>>                                          rev_data->addrs, rev_data->sizes,
>>                                          rev_data->kinds, rev_data->dev_num,
>>                                          rev_off_dev_to_host_cpy,
>>                                          rev_off_host_to_dev_cpy, copy_stream);
>>                  CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>>                  __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>>                }
>>              usleep (1);
>>            }
>>        else
>>          r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>>        if (reverse_offload)
>>          CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>>      [...]
>>
>> Instead of this 'while (true)', 'usleep (1)' loop, shouldn't we be able
>> to use "Stream Memory Operations",
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html>,
>> which allow us to "Wait on a memory location", "until the given condition
>> on the memory is satisfied"?
>>
>> For reference, current 'libgomp/config/nvptx/target.c:GOMP_target_ext':
>>
>>      [...]
>>        GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
>>        GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
>>        GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
>>        GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
>>        GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>>
>>        /* Set 'fn' to trigger processing on the host; wait for completion,
>>           which is flagged by setting 'fn' back to 0 on the host.  */
>>        uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
>>      #if __PTX_SM__ >= 700
>>        asm volatile ("st.global.release.sys.u64 [%0], %1;"
>>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>      #else
>>        __sync_synchronize ();  /* membar.sys */
>>        asm volatile ("st.volatile.global.u64 [%0], %1;"
>>                      : : "r"(addr_struct_fn), "r" (fn) : "memory");
>>      #endif
>>
>>      #if __PTX_SM__ >= 700
>>        uint64_t fn2;
>>        do
>>          {
>>            asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
>>                          : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
>>          }
>>        while (fn2 != 0);
>>      #else
>>        /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>>           ld.u64 %r36,[%r64];
>>           membar.sys;  */
>>        while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>>          ;  /* spin  */
>>      #endif
>>      [...]
>>
>>
>> Grüße
>>   Thomas


* Re: libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2023-04-28  8:48   ` Tobias Burnus
@ 2023-04-28  9:31     ` Thomas Schwinge
  2023-04-28 10:51       ` Tobias Burnus
  0 siblings, 1 reply; 31+ messages in thread
From: Thomas Schwinge @ 2023-04-28  9:31 UTC (permalink / raw)
  To: Tobias Burnus; +Cc: Jakub Jelinek, gcc-patches

Hi Tobias!

On 2023-04-28T10:48:31+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> On 21.03.23 16:53, Thomas Schwinge wrote:
>> On 2022-08-26T11:07:28+0200, Tobias Burnus <tobias@codesourcery.com>
>> wrote:
>>> This patch adds initial [OpenMP reverse offload] support for nvptx.
>>> CUDA does lockup when trying to copy data from the currently running
>>> stream; hence, a new stream is generated to do the memory copying.
>> As part of other work, where I had to touch those special code paths, I
>> found that we may reduce complexity a little bit "by using the existing
>> 'goacc_asyncqueue' instead of re-coding parts of it".  OK to push
>> "libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation"
>> (still testing), see attached?
>
> I don't think that just calling "exit (EXIT_FAILURE);" is the proper
> way

The point is, when we run into such an 'exit', we've already issued an
error (in the plugin, via 'GOMP_PLUGIN_fatal'), and then (to replicate
what 'GOMP_PLUGIN_fatal'/'gomp_fatal' do) we just need to 'exit' -- after
unlocking.  The latter is the reason why we can't just do this:

> – I think that should be GOMP_PLUGIN_fatal in the plugin and
> gomp_fatal in target.c.

..., because we'd deadlock due to the 'atexit' shutdown of devices etc.
happening while the devices are still locked.

(Resolving all this differently/"properly" is for another day.)
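
As a minimal illustration, the pattern used in the patch below (where
'devicep->lock' is held) is the following; the error message itself
already got printed, via 'GOMP_PLUGIN_error', by the failing
copy/synchronize callback:

  if (aq && !devicep->openacc.async.synchronize_func (aq))
    {
      /* The plugin has already reported the failure; unlock before
         'exit', as the 'atexit' device shutdown would otherwise
         deadlock on the still-held 'devicep->lock'.  */
      gomp_mutex_unlock (&devicep->lock);
      exit (EXIT_FAILURE);
    }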

> Otherwise, it LGTM.

Thanks.  OK to push then, given the rationale above?


Grüße
 Thomas


>> Subject: [PATCH] libgomp: Simplify OpenMP reverse offload host <-> device
>>   memory copy implementation
>>
>> ... by using the existing 'goacc_asyncqueue' instead of re-coding parts of it.
>>
>> Follow-up to commit 131d18e928a3ea1ab2d3bf61aa92d68a8a254609
>> "libgomp/nvptx: Prepare for reverse-offload callback handling",
>> and commit ea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8
>> "libgomp: Handle OpenMP's reverse offloads".
>>
>>       libgomp/
>>       * target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
>>       'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
>>       * libgomp.h (gomp_target_rev): Adjust.
>>       * libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
>>       * libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
>>       * plugin/plugin-gcn.c (process_reverse_offload): Adjust.
>>       * plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
>>       (rev_off_host_to_dev_cpy): Remove.
>>       (GOMP_OFFLOAD_run): Adjust.
>> ---
>>   libgomp/libgomp-plugin.c      |   7 +--
>>   libgomp/libgomp-plugin.h      |   6 +-
>>   libgomp/libgomp.h             |   5 +-
>>   libgomp/plugin/plugin-gcn.c   |   2 +-
>>   libgomp/plugin/plugin-nvptx.c |  77 ++++++++++++++-----------
>>   libgomp/target.c              | 102 +++++++++++++++-------------------
>>   6 files changed, 96 insertions(+), 103 deletions(-)
>>
>> diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
>> index 27e7c94ba9b..d696515eeb6 100644
>> --- a/libgomp/libgomp-plugin.c
>> +++ b/libgomp/libgomp-plugin.c
>> @@ -82,11 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
>>   void
>>   GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>                       uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
>> -                     void (*dev_to_host_cpy) (void *, const void *, size_t,
>> -                                              void *),
>> -                     void (*host_to_dev_cpy) (void *, const void *, size_t,
>> -                                              void *), void *token)
>> +                     struct goacc_asyncqueue *aq)
>>   {
>>     gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
>> -                dev_to_host_cpy, host_to_dev_cpy, token);
>> +                aq);
>>   }
>> diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
>> index 28267f75f7a..42ee3d6c7f9 100644
>> --- a/libgomp/libgomp-plugin.h
>> +++ b/libgomp/libgomp-plugin.h
>> @@ -121,11 +121,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
>>       __attribute__ ((noreturn, format (printf, 1, 2)));
>>
>>   extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
>> -                                 uint64_t, int,
>> -                                 void (*) (void *, const void *, size_t,
>> -                                           void *),
>> -                                 void (*) (void *, const void *, size_t,
>> -                                           void *), void *);
>> +                                 uint64_t, int, struct goacc_asyncqueue *);
>>
>>   /* Prototypes for functions implemented by libgomp plugins.  */
>>   extern const char *GOMP_OFFLOAD_get_name (void);
>> diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
>> index ba8fe348aba..4d2bfab4b71 100644
>> --- a/libgomp/libgomp.h
>> +++ b/libgomp/libgomp.h
>> @@ -1130,10 +1130,7 @@ extern void gomp_init_targets_once (void);
>>   extern int gomp_get_num_devices (void);
>>   extern bool gomp_target_task_fn (void *);
>>   extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
>> -                          int,
>> -                          void (*) (void *, const void *, size_t, void *),
>> -                          void (*) (void *, const void *, size_t, void *),
>> -                          void *);
>> +                          int, struct goacc_asyncqueue *);
>>
>>   /* Splay tree definitions.  */
>>   typedef struct splay_tree_node_s *splay_tree_node;
>> diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
>> index 347803762eb..2181bf0235f 100644
>> --- a/libgomp/plugin/plugin-gcn.c
>> +++ b/libgomp/plugin/plugin-gcn.c
>> @@ -1949,7 +1949,7 @@ process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
>>   {
>>     int dev_num = dev_num64;
>>     GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
>> -                       NULL, NULL, NULL);
>> +                       NULL);
>>   }
>>
>>   /* Output any data written to console output from the kernel.  It is expected
>> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
>> index 5bd5a419e0e..4a710851ee5 100644
>> --- a/libgomp/plugin/plugin-nvptx.c
>> +++ b/libgomp/plugin/plugin-nvptx.c
>> @@ -56,6 +56,7 @@
>>   #include <unistd.h>
>>   #include <assert.h>
>>   #include <errno.h>
>> +#include <stdlib.h>
>>
>>   /* An arbitrary fixed limit (128MB) for the size of the OpenMP soft stacks
>>      block to cache between kernel invocations.  For soft-stacks blocks bigger
>> @@ -1739,11 +1740,11 @@ GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
>>     return 1;
>>   }
>>
>> -struct goacc_asyncqueue *
>> -GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
>> +static struct goacc_asyncqueue *
>> +nvptx_goacc_asyncqueue_construct (unsigned int flags)
>>   {
>>     CUstream stream = NULL;
>> -  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, CU_STREAM_DEFAULT);
>> +  CUDA_CALL_ERET (NULL, cuStreamCreate, &stream, flags);
>>
>>     struct goacc_asyncqueue *aq
>>       = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
>> @@ -1751,14 +1752,26 @@ GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
>>     return aq;
>>   }
>>
>> -bool
>> -GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
>> +struct goacc_asyncqueue *
>> +GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
>> +{
>> +  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
>> +}
>> +
>> +static bool
>> +nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
>>   {
>>     CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
>>     free (aq);
>>     return true;
>>   }
>>
>> +bool
>> +GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
>> +{
>> +  return nvptx_goacc_asyncqueue_destruct (aq);
>> +}
>> +
>>   int
>>   GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
>>   {
>> @@ -1772,13 +1785,19 @@ GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
>>     return -1;
>>   }
>>
>> -bool
>> -GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
>> +static bool
>> +nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
>>   {
>>     CUDA_CALL_ERET (false, cuStreamSynchronize, aq->cuda_stream);
>>     return true;
>>   }
>>
>> +bool
>> +GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
>> +{
>> +  return nvptx_goacc_asyncqueue_synchronize (aq);
>> +}
>> +
>>   bool
>>   GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
>>                                     struct goacc_asyncqueue *aq2)
>> @@ -2038,22 +2057,6 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
>>   }
>>
>>
>> -void
>> -rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size,
>> -                      CUstream stream)
>> -{
>> -  CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream);
>> -  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
>> -}
>> -
>> -void
>> -rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size,
>> -                      CUstream stream)
>> -{
>> -  CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream);
>> -  CUDA_CALL_ASSERT (cuStreamSynchronize, stream);
>> -}
>> -
>>   void
>>   GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>>   {
>> @@ -2087,9 +2090,17 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>>       }
>>     nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
>>
>> -  size_t stack_size = nvptx_stacks_size ();
>>     bool reverse_offload = ptx_dev->rev_data != NULL;
>> -  CUstream copy_stream = NULL;
>> +  struct goacc_asyncqueue *reverse_offload_aq = NULL;
>> +  if (reverse_offload)
>> +    {
>> +      reverse_offload_aq
>> +     = nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
>> +      if (!reverse_offload_aq)
>> +     exit (EXIT_FAILURE);
>> +    }
>> +
>> +  size_t stack_size = nvptx_stacks_size ();
>>
>>     pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
>>     void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
>> @@ -2103,8 +2114,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>>     GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
>>                    " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n",
>>                    __FUNCTION__, fn_name, teams, threads);
>> -  if (reverse_offload)
>> -    CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING);
>>     r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
>>                        32, threads, 1, 0, NULL, NULL, config);
>>     if (r != CUDA_SUCCESS)
>> @@ -2127,17 +2136,15 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>>           GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
>>                                   rev_data->addrs, rev_data->sizes,
>>                                   rev_data->kinds, rev_data->dev_num,
>> -                                 rev_off_dev_to_host_cpy,
>> -                                 rev_off_host_to_dev_cpy, copy_stream);
>> -         CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream);
>> +                                 reverse_offload_aq);
>> +         if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
>> +           exit (EXIT_FAILURE);
>>           __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
>>         }
>>       usleep (1);
>>         }
>>     else
>>       r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
>> -  if (reverse_offload)
>> -    CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream);
>>     if (r == CUDA_ERROR_LAUNCH_FAILED)
>>       GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
>>                      maybe_abort_msg);
>> @@ -2145,6 +2152,12 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
>>       GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
>>
>>     pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
>> +
>> +  if (reverse_offload)
>> +    {
>> +      if (!nvptx_goacc_asyncqueue_destruct (reverse_offload_aq))
>> +     exit (EXIT_FAILURE);
>> +    }
>>   }
>>
>>   /* TODO: Implement GOMP_OFFLOAD_async_run. */
>> diff --git a/libgomp/target.c b/libgomp/target.c
>> index 79ed64a5dc3..e02188cf7e1 100644
>> --- a/libgomp/target.c
>> +++ b/libgomp/target.c
>> @@ -3312,9 +3312,7 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
>>   void
>>   gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>                uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
>> -              void (*dev_to_host_cpy) (void *, const void *, size_t, void*),
>> -              void (*host_to_dev_cpy) (void *, const void *, size_t, void*),
>> -              void *token)
>> +              struct goacc_asyncqueue *aq)
>>   {
>>     /* Return early if there is no offload code.  */
>>     if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
>> @@ -3356,26 +3354,17 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>         devaddrs = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
>>         sizes = (uint64_t *) gomp_malloc (mapnum * sizeof (uint64_t));
>>         kinds = (unsigned short *) gomp_malloc (mapnum * sizeof (unsigned short));
>> -      if (dev_to_host_cpy)
>> -     {
>> -       dev_to_host_cpy (devaddrs, (const void *) (uintptr_t) devaddrs_ptr,
>> -                        mapnum * sizeof (uint64_t), token);
>> -       dev_to_host_cpy (sizes, (const void *) (uintptr_t) sizes_ptr,
>> -                        mapnum * sizeof (uint64_t), token);
>> -       dev_to_host_cpy (kinds, (const void *) (uintptr_t) kinds_ptr,
>> -                        mapnum * sizeof (unsigned short), token);
>> -     }
>> -      else
>> -     {
>> -       gomp_copy_dev2host (devicep, NULL, devaddrs,
>> -                           (const void *) (uintptr_t) devaddrs_ptr,
>> -                           mapnum * sizeof (uint64_t));
>> -       gomp_copy_dev2host (devicep, NULL, sizes,
>> -                           (const void *) (uintptr_t) sizes_ptr,
>> -                           mapnum * sizeof (uint64_t));
>> -       gomp_copy_dev2host (devicep, NULL, kinds, (const void *) (uintptr_t) kinds_ptr,
>> -                           mapnum * sizeof (unsigned short));
>> -     }
>> +      gomp_copy_dev2host (devicep, aq, devaddrs,
>> +                       (const void *) (uintptr_t) devaddrs_ptr,
>> +                       mapnum * sizeof (uint64_t));
>> +      gomp_copy_dev2host (devicep, aq, sizes,
>> +                       (const void *) (uintptr_t) sizes_ptr,
>> +                       mapnum * sizeof (uint64_t));
>> +      gomp_copy_dev2host (devicep, aq, kinds,
>> +                       (const void *) (uintptr_t) kinds_ptr,
>> +                       mapnum * sizeof (unsigned short));
>> +      if (aq && !devicep->openacc.async.synchronize_func (aq))
>> +     exit (EXIT_FAILURE);
>>       }
>>
>>     size_t tgt_align = 0, tgt_size = 0;
>> @@ -3402,13 +3391,14 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>           if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
>>             memcpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
>>                     (size_t) sizes[i]);
>> -         else if (dev_to_host_cpy)
>> -           dev_to_host_cpy (tgt + tgt_size, (void *) (uintptr_t) devaddrs[i],
>> -                            (size_t) sizes[i], token);
>>           else
>> -           gomp_copy_dev2host (devicep, NULL, tgt + tgt_size,
>> -                               (void *) (uintptr_t) devaddrs[i],
>> -                               (size_t) sizes[i]);
>> +           {
>> +             gomp_copy_dev2host (devicep, aq, tgt + tgt_size,
>> +                                 (void *) (uintptr_t) devaddrs[i],
>> +                                 (size_t) sizes[i]);
>> +             if (aq && !devicep->openacc.async.synchronize_func (aq))
>> +               exit (EXIT_FAILURE);
>> +           }
>>           devaddrs[i] = (uint64_t) (uintptr_t) tgt + tgt_size;
>>           tgt_size = tgt_size + sizes[i];
>>           if ((devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
>> @@ -3498,15 +3488,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>                   || kind == GOMP_MAP_ALWAYS_TO
>>                   || kind == GOMP_MAP_ALWAYS_TOFROM)
>>                 {
>> -                 if (dev_to_host_cpy)
>> -                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
>> -                                    (void *) (uintptr_t) cdata[i].devaddr,
>> -                                    sizes[i], token);
>> -                 else
>> -                   gomp_copy_dev2host (devicep, NULL,
>> -                                       (void *) (uintptr_t) devaddrs[i],
>> -                                       (void *) (uintptr_t) cdata[i].devaddr,
>> -                                       sizes[i]);
>> +                 gomp_copy_dev2host (devicep, aq,
>> +                                     (void *) (uintptr_t) devaddrs[i],
>> +                                     (void *) (uintptr_t) cdata[i].devaddr,
>> +                                     sizes[i]);
>> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
>> +                   {
>> +                     gomp_mutex_unlock (&devicep->lock);
>> +                     exit (EXIT_FAILURE);
>> +                   }
>>                 }
>>               if (struct_cpy)
>>                 struct_cpy--;
>> @@ -3573,15 +3563,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>                   devaddrs[i]
>>                     = (uint64_t) (uintptr_t) gomp_aligned_alloc (align,
>>                                                                  sizes[i]);
>> -                 if (dev_to_host_cpy)
>> -                   dev_to_host_cpy ((void *) (uintptr_t) devaddrs[i],
>> -                                    (void *) (uintptr_t) cdata[i].devaddr,
>> -                                    sizes[i], token);
>> -                 else
>> -                   gomp_copy_dev2host (devicep, NULL,
>> -                                       (void *) (uintptr_t) devaddrs[i],
>> -                                       (void *) (uintptr_t) cdata[i].devaddr,
>> -                                       sizes[i]);
>> +                 gomp_copy_dev2host (devicep, aq,
>> +                                     (void *) (uintptr_t) devaddrs[i],
>> +                                     (void *) (uintptr_t) cdata[i].devaddr,
>> +                                     sizes[i]);
>> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
>> +                   {
>> +                     gomp_mutex_unlock (&devicep->lock);
>> +                     exit (EXIT_FAILURE);
>> +                   }
>>                 }
>>               for (j = i + 1; j < mapnum; j++)
>>                 {
>> @@ -3685,15 +3675,15 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
>>               /* FALLTHRU */
>>             case GOMP_MAP_FROM:
>>             case GOMP_MAP_TOFROM:
>> -             if (copy && host_to_dev_cpy)
>> -               host_to_dev_cpy ((void *) (uintptr_t) cdata[i].devaddr,
>> -                                (void *) (uintptr_t) devaddrs[i],
>> -                                sizes[i], token);
>> -             else if (copy)
>> -               gomp_copy_host2dev (devicep, NULL,
>> -                                   (void *) (uintptr_t) cdata[i].devaddr,
>> -                                   (void *) (uintptr_t) devaddrs[i],
>> -                                   sizes[i], false, NULL);
>> +             if (copy)
>> +               {
>> +                 gomp_copy_host2dev (devicep, aq,
>> +                                     (void *) (uintptr_t) cdata[i].devaddr,
>> +                                     (void *) (uintptr_t) devaddrs[i],
>> +                                     sizes[i], false, NULL);
>> +                 if (aq && !devicep->openacc.async.synchronize_func (aq))
>> +                   exit (EXIT_FAILURE);
>> +               }
>>             default:
>>               break;
>>           }


* Re: libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling)
  2023-04-28  9:31     ` Thomas Schwinge
@ 2023-04-28 10:51       ` Tobias Burnus
  0 siblings, 0 replies; 31+ messages in thread
From: Tobias Burnus @ 2023-04-28 10:51 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Jakub Jelinek, gcc-patches

On 28.04.23 11:31, Thomas Schwinge wrote:
> On 2023-04-28T10:48:31+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
>> I don't think that just calling "exit (EXIT_FAILURE);" is the proper
>> way
> The point is, when we run into such an 'exit', we've already issued an
> error (in the plugin, via 'GOMP_PLUGIN_fatal'),
you meant: GOMP_PLUGIN_error.
> and then (to replicate
> what 'GOMP_PLUGIN_fatal'/'gomp_fatal' do) we just need to 'exit' -- after
> unlocking.  The latter is the reason why we can't just do this:
>
>> – I think that should be GOMP_PLUGIN_fatal in the plugin and
>> gomp_fatal in target.c.
> ..., because we'd dead-lock due to 'atexit' shutdown of devices etc.,
> while still having devices etc. locked.
>
> (Resolving all this differently/"properly" is for another day.)
https://gcc.gnu.org/PR109664
>> Otherwise, it LGTM.
> Thanks.  OK to push then, given the rationale above?

OK.

Tobias



end of thread

Thread overview: 31+ messages
2022-08-26  9:07 [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Tobias Burnus
2022-08-26  9:07 ` Tobias Burnus
2022-08-26 14:56 ` Alexander Monakov
2022-09-09 15:49   ` Jakub Jelinek
2022-09-09 15:51 ` Jakub Jelinek
2022-09-13  7:07 ` Tobias Burnus
2022-09-21 20:06   ` Alexander Monakov
2022-09-26 15:07     ` Tobias Burnus
2022-09-26 17:45       ` Alexander Monakov
2022-09-27  9:23         ` Tobias Burnus
2022-09-28 13:16           ` Alexander Monakov
2022-10-02 18:13           ` Tobias Burnus
2022-10-07 14:26             ` [Patch][v5] " Tobias Burnus
2022-10-11 10:49               ` Jakub Jelinek
2022-10-11 11:12                 ` Alexander Monakov
2022-10-12  8:55                   ` Tobias Burnus
2022-10-17  7:35                     ` *ping* / " Tobias Burnus
2022-10-19 15:53                     ` Alexander Monakov
2022-10-24 14:07                     ` Jakub Jelinek
2022-10-24 19:05                       ` Thomas Schwinge
2022-10-24 19:11                         ` Thomas Schwinge
2022-10-24 19:46                           ` Tobias Burnus
2022-10-24 19:51                           ` libgomp/nvptx: Prepare for reverse-offload callback handling, resolve spurious SIGSEGVs (was: [Patch][v5] libgomp/nvptx: Prepare for reverse-offload callback handling) Thomas Schwinge
2023-03-21 15:53 ` libgomp: Simplify OpenMP reverse offload host <-> device memory copy implementation (was: [Patch] " Thomas Schwinge
2023-03-24 15:43   ` [og12] " Thomas Schwinge
2023-04-28  8:48   ` Tobias Burnus
2023-04-28  9:31     ` Thomas Schwinge
2023-04-28 10:51       ` Tobias Burnus
2023-04-04 14:40 ` [Patch] libgomp/nvptx: Prepare for reverse-offload callback handling Thomas Schwinge
2023-04-28  8:28   ` Tobias Burnus
2023-04-28  9:23     ` Thomas Schwinge
