public inbox for gcc-patches@gcc.gnu.org
* [PATCH] libgomp, nvptx, amdgcn: parallel reverse offload
@ 2023-09-12 14:27 Andrew Stubbs
  2023-09-12 16:32 ` [OG13][committed] " Andrew Stubbs
  2023-09-21  9:22 ` [PATCH] " Tobias Burnus
  0 siblings, 2 replies; 3+ messages in thread
From: Andrew Stubbs @ 2023-09-12 14:27 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]

Hi all,

This patch implements parallel execution of OpenMP reverse offload kernels.

The first problem was that GPU device kernels may request reverse 
offload (via the "ancestor" clause) once for each running offload thread 
-- of which there may be thousands -- and the existing implementation 
ran each request serially, whilst blocking all other I/O from that 
device kernel.

The second problem was that the NVPTX plugin runs the reverse offload 
kernel in the context of whichever host thread sees the request first, 
regardless of which kernel originated the request. This is probably 
logically harmless, but may lead to surprising timing when it blocks the 
wrong kernel from exiting until the reverse offload is done. It was also 
only capable of receiving and processing a single request at a time, 
across all running kernels. (GCN did not have these problems.)

Both problems are now solved by making the reverse offload requests
asynchronous. The host threads still receive the requests in the same
way, but instead of running them inline, each request is queued for
execution later in another thread. The requests are then consumed from
the message passing buffer immediately (allowing I/O to continue, in the
case of GCN). The device threads that sent requests are still blocked
waiting for the completion signal, but any other threads may continue as
usual.
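
In outline, the device side now publishes a request and then blocks on a
private completion flag instead of on the shared slot (a simplified
sketch; the field names are illustrative, and the real code is in the
config/gcn/target.c and config/nvptx/target.c hunks below):

  /* Request slot already claimed; "slot" points into the shared buffer.  */
  volatile int signal = 0;            /* on the device thread's stack */
  slot->fn = (uint64_t) fn;           /* ... plus the other payload fields */
  slot->signal = &signal;             /* host stores 1 here when finished */
  __atomic_store_n (&slot->written, 1, __ATOMIC_RELEASE);
  while (__atomic_load_n (&signal, __ATOMIC_ACQUIRE) == 0)
    ;  /* spin: only the requesting device thread waits */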

The queued requests are processed by a thread pool created on demand and
limited by a new environment variable, GOMP_REVERSE_OFFLOAD_THREADS.
This should make reverse offload much less of a bottleneck.
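
The host-side enqueue is essentially this (simplified from the
libgomp/target.c changes below; "enqueue" stands in for the linked-list
append):

  static gomp_mutex_t target_rev_queue_lock = 0;
  static int target_rev_thread_count = 0;

  /* Called once per reverse-offload request.  */
  gomp_mutex_lock (&target_rev_queue_lock);
  enqueue (newreq);
  if (target_rev_thread_count < gomp_reverse_offload_threads)
    {
      /* Below the limit: spawn a worker; it exits when the queue drains.  */
      target_rev_thread_count++;
      gomp_mutex_unlock (&target_rev_queue_lock);
      pthread_t t;
      pthread_create (&t, NULL, gomp_target_rev_worker_thread, NULL);
    }
  else
    /* At the limit: an existing worker will pick the request up.  */
    gomp_mutex_unlock (&target_rev_queue_lock);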

In the process of this work I have found and fixed a couple of
target-specific issues. NVPTX asynchronous streams were independent of
each other, but still synchronous w.r.t. the default NULL stream. Some
GCN devices (at least gfx908) seem to have a race condition in the
message passing system whereby the cache write-back triggered by
__ATOMIC_RELEASE completes after the atomically written value becomes
visible.
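
The GCN fix is simply to make the payload stores atomic as well, so they
cannot be flushed later than the release store that publishes the slot
(condensed from the config/gcn/target.c hunk below):

  /* On gfx908 plain stores flushed by the release can arrive after the
     atomically written flag, so write the payload atomically too.  */
  __atomic_store_n (&slot->value_u64[0], (uint64_t) fn, __ATOMIC_RELAXED);
  /* ... the remaining payload fields likewise ... */
  __atomic_store_n (&slot->written, 1, __ATOMIC_RELEASE);  /* publish */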

OK for mainline?

Andrew

[-- Attachment #2: 230912-parallel-reverse-offload.patch --]
[-- Type: text/plain, Size: 25694 bytes --]

libgomp: parallel reverse offload

Extend OpenMP reverse offload support to allow running the host kernels
on multiple threads.  The device plugin API for reverse offload is now made
non-blocking, meaning that running the host kernel in the wrong device
context is no longer a problem.  The NVPTX message passing interface now
uses a ring buffer approximately matching GCN.

include/ChangeLog:

	* gomp-constants.h (GOMP_VERSION): Bump.

libgomp/ChangeLog:

	* config/gcn/target.c (GOMP_target_ext): Add "signal" field.
	Fix atomics race condition.
	* config/nvptx/libgomp-nvptx.h (REV_OFFLOAD_QUEUE_SIZE): New define.
	(struct rev_offload): Implement ring buffer.
	* config/nvptx/target.c (GOMP_target_ext): Likewise.
	* env.c (initialize_env): Read GOMP_REVERSE_OFFLOAD_THREADS.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Replace "aq" parameter
	with "signal" and "use_aq".
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Likewise.
	* libgomp.h (gomp_target_rev): Likewise.
	* plugin/plugin-gcn.c (process_reverse_offload): Add "signal".
	(console_output): Pass signal value through.
	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_openacc_async_construct):
	Attach new threads to the numbered device.
	Change the flag to CU_STREAM_NON_BLOCKING.
	(GOMP_OFFLOAD_run): Implement ring-buffer and remove signalling.
	* target.c (gomp_target_rev): Rename to ...
	(gomp_target_rev_internal): ... this, and change "dev_num" to
	"devicep".
	(gomp_target_rev_worker_thread): New function.
	(gomp_target_rev): New function (old name).
	* libgomp.texi: Document GOMP_REVERSE_OFFLOAD_THREADS.
	* testsuite/libgomp.c/reverse-offload-threads-1.c: New test.
	* testsuite/libgomp.c/reverse-offload-threads-2.c: New test.

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 8d4e8e81303..7ce07508e9d 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -314,7 +314,7 @@ enum gomp_map_kind
 /* Versions of libgomp and device-specific plugins.  GOMP_VERSION
    should be incremented whenever an ABI-incompatible change is introduced
    to the plugin interface defined in libgomp/libgomp.h.  */
-#define GOMP_VERSION	2
+#define GOMP_VERSION	3
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_GCN 3
 
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index ea5eb1ff5ed..906b04ca41e 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -103,19 +103,38 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	   <= (index - 1024))
       asm ("s_sleep 64");
 
+  /* In theory, it should be enough to write "written" with __ATOMIC_RELEASE,
+     and have the rest of the data flushed to memory automatically, but some
+     devices (gfx908) seem to have a race condition where the flushed data
+     arrives after the atomic data, and the host does the wrong thing.
+     If we just write everything atomically in the correct order then we're
+     safe.  */
+
   unsigned int slot = index % 1024;
-  data->queue[slot].value_u64[0] = (uint64_t) fn;
-  data->queue[slot].value_u64[1] = (uint64_t) mapnum;
-  data->queue[slot].value_u64[2] = (uint64_t) hostaddrs;
-  data->queue[slot].value_u64[3] = (uint64_t) sizes;
-  data->queue[slot].value_u64[4] = (uint64_t) kinds;
-  data->queue[slot].value_u64[5] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
-
-  data->queue[slot].type = 4; /* Reverse offload.  */
+  __atomic_store_n (&data->queue[slot].value_u64[0], (uint64_t) fn,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[1], (uint64_t) mapnum,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[2], (uint64_t) hostaddrs,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[3], (uint64_t) sizes,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[4], (uint64_t) kinds,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[5],
+		    (uint64_t) GOMP_ADDITIONAL_ICVS.device_num,
+		    __ATOMIC_RELAXED);
+
+  volatile int signal = 0;
+  __atomic_store_n (&data->queue[slot].value_u64[6], (uint64_t) &signal,
+		    __ATOMIC_RELAXED);
+
+  __atomic_store_n (&data->queue[slot].type, 4 /* Reverse offload.  */,
+		    __ATOMIC_RELAXED);
   __atomic_store_n (&data->queue[slot].written, 1, __ATOMIC_RELEASE);
 
-  /* Spinlock while the host catches up.  */
-  while (__atomic_load_n (&data->queue[slot].written, __ATOMIC_ACQUIRE) != 0)
+  /* Spinlock while the host runs the kernel.  */
+  while (__atomic_load_n (&signal, __ATOMIC_ACQUIRE) == 0)
     asm ("s_sleep 64");
 }
 
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
index 96de86977c3..78719894efa 100644
--- a/libgomp/config/nvptx/libgomp-nvptx.h
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -25,20 +25,41 @@
 
 /* This file contains defines and type definitions shared between the
    nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
-   needef for this target.  */
+   needed for this target.  */
 
 #ifndef LIBGOMP_NVPTX_H
 #define LIBGOMP_NVPTX_H 1
 
 #define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+#define REV_OFFLOAD_QUEUE_SIZE 1024
 
 struct rev_offload {
-  uint64_t fn;
-  uint64_t mapnum;
-  uint64_t addrs;
-  uint64_t sizes;
-  uint64_t kinds;
-  int32_t dev_num;
+  /* The target can grab a slot by incrementing "next_slot".
+     Each host thread may claim some slots for processing.
+     When the host processing is completed "consumed" indicates that the
+     corresponding slots in the ring-buffer "queue" are available for reuse.
+
+     Note that "next_slot" is an index, and "consumed"/"claimed" are counters,
+     so beware of the fence-posts.  */
+  unsigned int next_slot;
+  unsigned int consumed;
+  unsigned int claimed;
+
+  struct rev_req {
+    /* The target writes an address to "signal" as the last item, which
+       indicates to the host that the record is completely written.  The target
+       must not assume that it still owns the slot, after that.  The signal
+       address is then used by the host to communicate that the reverse-offload
+       kernel has completed execution.  */
+    volatile int *signal;
+
+    uint64_t fn;
+    uint64_t mapnum;
+    uint64_t addrs;
+    uint64_t sizes;
+    uint64_t kinds;
+    int32_t dev_num;
+  } queue[REV_OFFLOAD_QUEUE_SIZE];
 };
 
 #if (__SIZEOF_SHORT__ != 2 \
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 125d92a2ea9..5593ab62b6d 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -93,7 +93,6 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  static int lock = 0;  /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
@@ -103,43 +102,57 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
       || GOMP_REV_OFFLOAD_VAR == NULL)
     return;
 
-  gomp_mutex_lock (&lock);
-
-  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
-  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
-  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
-  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
-  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
-
-  /* Set 'fn' to trigger processing on the host; wait for completion,
-     which is flagged by setting 'fn' back to 0 on the host.  */
-  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  /* Reserve one slot. */
+  unsigned int index = __atomic_fetch_add (&GOMP_REV_OFFLOAD_VAR->next_slot,
+					   1, __ATOMIC_ACQUIRE);
+
+  if ((unsigned int) (index + 1) < GOMP_REV_OFFLOAD_VAR->consumed)
+    abort ();  /* Overflow.  */
+
+  /* Spinlock while the host catches up.  */
+  if (index >= REV_OFFLOAD_QUEUE_SIZE)
+    while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->consumed, __ATOMIC_ACQUIRE)
+	   <= (index - REV_OFFLOAD_QUEUE_SIZE))
+      ; /* spin  */
+
+  unsigned int slot = index % REV_OFFLOAD_QUEUE_SIZE;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].fn = (uint64_t) fn;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* Set 'signal' to trigger processing on the host; the slot is now consumed
+     by the host, so we should not touch it again.  */
+  volatile int signal = 0;
+  uint64_t addr_struct_signal = (uint64_t) &GOMP_REV_OFFLOAD_VAR->queue[slot].signal;
 #if __PTX_SM__ >= 700
   asm volatile ("st.global.release.sys.u64 [%0], %1;"
-		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+		: : "r"(addr_struct_signal), "r" (&signal) : "memory");
 #else
   __sync_synchronize ();  /* membar.sys */
   asm volatile ("st.volatile.global.u64 [%0], %1;"
-		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+		: : "r"(addr_struct_signal), "r" (&signal) : "memory");
 #endif
 
+  /* The host signals completion by writing a non-zero value to the 'signal'
+     variable.  */
 #if __PTX_SM__ >= 700
-  uint64_t fn2;
+  uint64_t signal2;
   do
     {
       asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
-		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+		    : "=r" (signal2) : "r" (&signal) : "memory");
     }
-  while (fn2 != 0);
+  while (signal2 == 0);
 #else
   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
      ld.u64 %r36,[%r64];
      membar.sys;  */
-  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+  while (__atomic_load_n (&signal, __ATOMIC_ACQUIRE) == 0)
     ;  /* spin  */
 #endif
-
-  gomp_mutex_unlock (&lock);
 }
 
 void
diff --git a/libgomp/env.c b/libgomp/env.c
index f24484d7f70..0c74f441da0 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -123,6 +123,7 @@ size_t gomp_affinity_format_len;
 char *goacc_device_type;
 int goacc_device_num;
 int goacc_default_dims[GOMP_DIM_MAX];
+int gomp_reverse_offload_threads = 8;  /* Reasonable default.  */
 
 #ifndef LIBGOMP_OFFLOADED_ONLY
 
@@ -2483,6 +2484,11 @@ initialize_env (void)
 
   handle_omp_display_env ();
 
+  /* Control the number of background threads reverse offload is permitted
+     to use.  */
+  parse_int_secure ("GOMP_REVERSE_OFFLOAD_THREADS",
+		    &gomp_reverse_offload_threads, false);
+
   /* OpenACC.  */
 
   if (!parse_int ("ACC_DEVICE_NUM", getenv ("ACC_DEVICE_NUM"),
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index d696515eeb6..daff51ba50c 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -82,8 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
 void
 GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-			struct goacc_asyncqueue *aq)
+			volatile int *signal, bool use_aq)
 {
   gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
-		   aq);
+		   signal, use_aq);
 }
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index dc993882c3b..f4f3f9b3a3d 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,7 +121,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
 extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
-				    uint64_t, int, struct goacc_asyncqueue *);
+				    uint64_t, int, volatile int *, bool);
 
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 68f20651fbf..63637da08c3 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -616,6 +616,7 @@ extern struct gomp_offload_icv_list *gomp_offload_icv_list;
 extern int goacc_device_num;
 extern char *goacc_device_type;
 extern int goacc_default_dims[GOMP_DIM_MAX];
+extern int gomp_reverse_offload_threads;
 
 enum gomp_task_kind
 {
@@ -1130,7 +1131,7 @@ extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
 extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
-			     int, struct goacc_asyncqueue *);
+			     int, volatile int *, bool);
 
 /* Splay tree definitions.  */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index c6cd825bbaa..672a49da931 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -2326,6 +2326,7 @@ variable is not set.
 * GOMP_STACKSIZE::          Set default thread stack size
 * GOMP_SPINCOUNT::          Set the busy-wait spin count
 * GOMP_RTEMS_THREAD_POOLS:: Set the RTEMS specific thread pools
+* GOMP_REVERSE_OFFLOAD_THREADS:: Set the maximum number of host threads
 @end menu
 
 
@@ -3072,6 +3073,22 @@ pools available and their worker threads run at priority four.
 
 
 
+@node GOMP_REVERSE_OFFLOAD_THREADS
+@section @env{GOMP_REVERSE_OFFLOAD_THREADS} -- Set the maximum number of host threads
+@cindex Environment Variable
+@table @asis
+@item @emph{Description}
+Set the maximum number of threads that may be used to run reverse offload
+code sections (host code nested within offload regions, declared using
+@code{#pragma omp target device(ancestor:1)}).  The value should be a non-zero
+positive integer.  The default is 8 threads.
+
+The threads are created on demand, up to the maximum number given, and are
+destroyed when no reverse offload requests remain.
+@end table
+
+
+
 @c ---------------------------------------------------------------------
 @c Enabling OpenACC
 @c ---------------------------------------------------------------------
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index ef22d48da79..4ab09c7de21 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1945,11 +1945,12 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
 
 static void
 process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
-			 uint64_t sizes, uint64_t kinds, uint64_t dev_num64)
+			 uint64_t sizes, uint64_t kinds, uint64_t dev_num64,
+			 uint64_t signal)
 {
   int dev_num = dev_num64;
   GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
-			  NULL);
+			  (volatile int *) signal, false);
 }
 
 /* Output any data written to console output from the kernel.  It is expected
@@ -1999,7 +2000,8 @@ console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 	case 4:
 	  process_reverse_offload (data->value_u64[0], data->value_u64[1],
 				   data->value_u64[2], data->value_u64[3],
-				   data->value_u64[4], data->value_u64[5]);
+				   data->value_u64[4], data->value_u64[5],
+				   data->value_u64[6]);
 	  break;
 	default: printf ("GCN print buffer error!\n"); break;
 	}
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 00d4241ae02..18215da0f3d 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1639,9 +1639,10 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
 }
 
 struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+GOMP_OFFLOAD_openacc_async_construct (int device)
 {
-  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+  nvptx_attach_host_thread_to_device (device);
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
 }
 
 static bool
@@ -2134,17 +2135,68 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	else if (r != CUDA_ERROR_NOT_READY)
 	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
 
-	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	struct rev_offload *rev_metadata = ptx_dev->rev_data;
+
+	/* Claim a portion of the ring buffer to process on this iteration.
+	   Don't mark them as consumed until all the data has been read out.  */
+	unsigned int consumed = __atomic_load_n (&rev_metadata->consumed,
+						 __ATOMIC_ACQUIRE);
+	unsigned int from = __atomic_load_n (&rev_metadata->claimed,
+						__ATOMIC_RELAXED);
+	unsigned int to = __atomic_load_n (&rev_metadata->next_slot,
+					   __ATOMIC_RELAXED);
+
+	if (consumed > to)
+	  {
+	    /* Overflow happens when we exceed UINTMAX requests.  */
+	    GOMP_PLUGIN_fatal ("NVPTX reverse offload buffer overflowed.\n");
+	  }
+
+	to = MIN(to, consumed + REV_OFFLOAD_QUEUE_SIZE / 2);
+	if (to <= from)
+	  /* Nothing to do; poll again.  */
+	  goto poll_again;
+
+	if (!__atomic_compare_exchange_n (&rev_metadata->claimed, &from, to,
+					  false,
+					  __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+	  /* Collision with another thread ... go around again.  */
+	  goto poll_again;
+
+	unsigned int index;
+	for (index = from; index < to; index++)
 	  {
-	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    int slot = index % REV_OFFLOAD_QUEUE_SIZE;
+
+	    /* Wait while the target finishes filling in the slot.  */
+	    while (__atomic_load_n (&ptx_dev->rev_data->queue[slot].signal,
+				    __ATOMIC_ACQUIRE) == 0)
+	      ; /* spin  */
+
+	    /* Pass the request to libgomp; this will queue the request and
+	       return right away, without waiting for the kernel to run.  */
+	    struct rev_req *rev_data = &ptx_dev->rev_data->queue[slot];
 	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
 				    rev_data->addrs, rev_data->sizes,
 				    rev_data->kinds, rev_data->dev_num,
-				    reverse_offload_aq);
-	    if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
-	      exit (EXIT_FAILURE);
-	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+				    rev_data->signal, true);
+
+	    /* Ensure that the slot doesn't trigger early, when reused.  */
+	    __atomic_store_n (&rev_data->signal, 0, __ATOMIC_RELEASE);
 	  }
+
+	/* The data is now consumed so release the slots for reuse.  */
+	unsigned int consumed_so_far = from;
+	while (!__atomic_compare_exchange_n (&rev_metadata->consumed,
+					    &consumed_so_far, to, false,
+					    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
+	  {
+	    /* Another thread didn't consume all it claimed yet.... */
+	    consumed_so_far = from;
+	    usleep (1);
+	  }
+
+poll_again:
 	usleep (1);
       }
   else
diff --git a/libgomp/target.c b/libgomp/target.c
index 812674d19a9..24814f4686b 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -3329,16 +3329,18 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
 				    tgt_start, tgt_end);
 }
 
-/* Handle reverse offload.  This is called by the device plugins for a
-   reverse offload; it is not called if the outer target runs on the host.
+/* Handle reverse offload.  This is called by the host worker thread to
+   execute a single reverse offload request; it is not called if the outer
+   target runs on the host.
    The mapping is simplified device-affecting constructs (except for target
    with device(ancestor:1)) must not be encountered; in particular not
    target (enter/exit) data.  */
 
-void
-gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
-		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-		 struct goacc_asyncqueue *aq)
+static void
+gomp_target_rev_internal (uint64_t fn_ptr, uint64_t mapnum,
+			  uint64_t devaddrs_ptr, uint64_t sizes_ptr,
+			  uint64_t kinds_ptr, struct gomp_device_descr *devicep,
+			  struct goacc_asyncqueue *aq)
 {
   /* Return early if there is no offload code.  */
   if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
@@ -3355,7 +3357,6 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
   unsigned short *kinds;
   const bool short_mapkind = true;
   const int typemask = short_mapkind ? 0xff : 0x7;
-  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
 
   reverse_splay_tree_key n;
   struct reverse_splay_tree_key_s k;
@@ -3766,6 +3767,134 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
     }
 }
 
+static struct target_rev_queue_s
+{
+  uint64_t fn_ptr;
+  uint64_t mapnum;
+  uint64_t devaddrs_ptr;
+  uint64_t sizes_ptr;
+  uint64_t kinds_ptr;
+  struct gomp_device_descr *devicep;
+
+  volatile int *signal;
+  bool use_aq;
+
+  struct target_rev_queue_s *next;
+} *target_rev_queue_head = NULL, *target_rev_queue_last = NULL;
+static gomp_mutex_t target_rev_queue_lock = 0;
+static int target_rev_thread_count = 0;
+
+static void *
+gomp_target_rev_worker_thread (void *)
+{
+  struct target_rev_queue_s *rev_kernel = NULL;
+  struct goacc_asyncqueue *aq = NULL;
+  struct gomp_device_descr *aq_devicep;
+
+  while (1)
+    {
+      gomp_mutex_lock (&target_rev_queue_lock);
+
+      /* Take a reverse-offload kernel request from the queue.  */
+      rev_kernel = target_rev_queue_head;
+      if (rev_kernel)
+	{
+	  target_rev_queue_head = rev_kernel->next;
+	  if (target_rev_queue_head == NULL)
+	    target_rev_queue_last = NULL;
+	}
+
+      if (rev_kernel == NULL)
+	{
+	  target_rev_thread_count--;
+	  gomp_mutex_unlock (&target_rev_queue_lock);
+	  break;
+	}
+      gomp_mutex_unlock (&target_rev_queue_lock);
+
+      /* Ensure we have a suitable device queue for the memory transfers.  */
+      if (rev_kernel->use_aq)
+	{
+	  if (aq && aq_devicep != rev_kernel->devicep)
+	    {
+	      aq_devicep->openacc.async.destruct_func (aq);
+	      aq = NULL;
+	    }
+
+	  if (!aq)
+	    {
+	      aq_devicep = rev_kernel->devicep;
+	      aq = aq_devicep->openacc.async.construct_func (aq_devicep->target_id);
+	    }
+	}
+
+      /* Run the kernel on the host.  */
+      gomp_target_rev_internal (rev_kernel->fn_ptr, rev_kernel->mapnum,
+				rev_kernel->devaddrs_ptr, rev_kernel->sizes_ptr,
+				rev_kernel->kinds_ptr, rev_kernel->devicep, aq);
+
+      /* Signal the device that the reverse-offload is completed.  */
+      int one = 1;
+      gomp_copy_host2dev (rev_kernel->devicep, aq, (void*)rev_kernel->signal,
+			  &one, sizeof (one), false, NULL);
+
+      /* We're done with this request.  */
+      free (rev_kernel);
+
+      /* Loop around and see if another request is waiting.  */
+    }
+
+  if (aq)
+    aq_devicep->openacc.async.destruct_func (aq);
+
+  return NULL;
+}
+
+void
+gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+		 volatile int *signal, bool use_aq)
+{
+  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
+
+  /* Create a new queue node.  */
+  struct target_rev_queue_s *newreq = gomp_malloc (sizeof (*newreq));
+  newreq->fn_ptr = fn_ptr;
+  newreq->mapnum = mapnum;
+  newreq->devaddrs_ptr = devaddrs_ptr;
+  newreq->sizes_ptr = sizes_ptr;
+  newreq->kinds_ptr = kinds_ptr;
+  newreq->devicep = devicep;
+  newreq->signal = signal;
+  newreq->use_aq = use_aq;
+  newreq->next = NULL;
+
+  gomp_mutex_lock (&target_rev_queue_lock);
+
+  /* Enqueue the reverse-offload request.  */
+  if (target_rev_queue_last)
+    {
+      target_rev_queue_last->next = newreq;
+      target_rev_queue_last = newreq;
+    }
+  else
+    target_rev_queue_last = target_rev_queue_head = newreq;
+
+  /* Launch a new thread to process the request asynchronously.
+     If the thread pool limit has been reached then an existing thread will
+     pick up the job when it is ready.  */
+  if (target_rev_thread_count < gomp_reverse_offload_threads)
+    {
+      target_rev_thread_count++;
+      gomp_mutex_unlock (&target_rev_queue_lock);
+
+      pthread_t t;
+      pthread_create (&t, NULL, gomp_target_rev_worker_thread, NULL);
+    }
+  else
+    gomp_mutex_unlock (&target_rev_queue_lock);
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void
diff --git a/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c b/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c
new file mode 100644
index 00000000000..fa74a8e9668
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c
@@ -0,0 +1,26 @@
+/* { dg-do run }  */
+/* { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } } */
+
+/* Test that the reverse offload message buffers can cope with a lot of
+   requests.  */
+
+#pragma omp requires reverse_offload
+
+int main ()
+{
+  #pragma omp target teams distribute parallel for collapse(2)
+  for (int i=0; i < 100; i++)
+    for (int j=0; j < 16; j++)
+      {
+	int val = 0;
+	#pragma omp target device ( ancestor:1 ) firstprivate(i,j) map(from:val)
+	{
+	  val = i + j;
+	}
+
+	if (val != i + j)
+	  __builtin_abort ();
+      }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c b/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c
new file mode 100644
index 00000000000..05a2571ed14
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c
@@ -0,0 +1,31 @@
+/* { dg-do run }  */
+/* { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } } */
+
+/* Test that the reverse offload message buffers can cope with multiple
+   requests from multiple kernels.  */
+
+#pragma omp requires reverse_offload
+
+int main ()
+{
+  for (int n=0; n < 5; n++)
+    {
+      #pragma omp target teams distribute parallel for nowait collapse(2)
+      for (int i=0; i < 32; i++)
+	for (int j=0; j < 16; j++)
+	  {
+	    int val = 0;
+	    #pragma omp target device ( ancestor:1 ) firstprivate(i,j) map(from:val)
+	    {
+	      val = i + j;
+	    }
+
+	    if (val != i + j)
+	      __builtin_abort ();
+	  }
+    }
+
+#pragma omp taskwait
+
+  return 0;
+}


* [OG13][committed] libgomp, nvptx, amdgcn: parallel reverse offload
  2023-09-12 14:27 [PATCH] libgomp, nvptx, amdgcn: parallel reverse offload Andrew Stubbs
@ 2023-09-12 16:32 ` Andrew Stubbs
  2023-09-21  9:22 ` [PATCH] " Tobias Burnus
  1 sibling, 0 replies; 3+ messages in thread
From: Andrew Stubbs @ 2023-09-12 16:32 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2459 bytes --]

Here's the same patch, but backported to the OG13 branch.

There was one "difficult" conflict, but after reading around the problem 
I don't think that any actual code changes are required and I've updated 
the comment to explain (see second patch).

Both patches committed to devel/omp/gcc-13.

Andrew

On 12/09/2023 15:27, Andrew Stubbs wrote:
> Hi all,
> 
> This patch implements parallel execution of OpenMP reverse offload kernels.
> 
> The first problem was that GPU device kernels may request reverse 
> offload (via the "ancestor" clause) once for each running offload thread 
> -- of which there may be thousands -- and the existing implementation 
> ran each request serially, whilst blocking all other I/O from that 
> device kernel.
> 
> The second problem was that the NVPTX plugin runs the reverse offload 
> kernel in the context of whichever host thread sees the request first, 
> regardless of which kernel originated the request. This is probably 
> logically harmless, but may lead to surprising timing when it blocks the 
> wrong kernel from exiting until the reverse offload is done. It was also 
> only capable of receiving and processing a single request at a time, 
> across all running kernels. (GCN did not have these problems.)
> 
> Both problems are now solved by making the reverse offload requests 
> asynchronous. The host threads still receive the requests in the same 
> way, but instead of running them inline, each request is queued for 
> execution later in another thread. The requests are then consumed from 
> the message passing buffer immediately (allowing I/O to continue, in the 
> case of GCN). The device threads that sent requests are still blocked 
> waiting for the completion signal, but any other threads may continue as 
> usual.
> 
> The queued requests are processed by a thread pool created on demand and 
> limited by a new environment variable, GOMP_REVERSE_OFFLOAD_THREADS. 
> This should make reverse offload much less of a bottleneck.
> 
> In the process of this work I have found and fixed a couple of 
> target-specific issues. NVPTX asynchronous streams were independent of 
> each other, but still synchronous w.r.t. the default NULL stream. Some 
> GCN devices (at least gfx908) seem to have a race condition in the 
> message passing system whereby the cache write-back triggered by 
> __ATOMIC_RELEASE completes after the atomically written value becomes 
> visible.
> 
> OK for mainline?
> 
> Andrew


[-- Attachment #2: 230912-nvptx-delayed-free-og13.patch --]
[-- Type: text/plain, Size: 1385 bytes --]

nvptx: update comment re delayed free

Polling the delayed frees is roughly the same as freeing them between
reverse offload kernels.

libgomp/ChangeLog:

	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_run): Update comment.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 176bb983bdc..0cf49719515 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -2703,12 +2703,10 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	       a following reverse offload does
 	       'GOMP_OFFLOAD_page_locked_host_alloc', and that then runs the
 	       deferred 'cuMemFreeHost's -- which may dead-lock?!
-	       TODO: This may need more considerations for the case that
-	       different host threads do reverse offload?  We could move
-	       'free_host_blocks' into 'aq' (which is separate per reverse
-	       offload) instead of global, like
-	       'page_locked_host_unregister_blocks', but that doesn't seem the
-	       right thing for OpenACC 'async' generally?  */
+	       Note: even though the reverse offload kernels are now run in
+	       multiple background threads, *this* thread (or one of these
+	       threads, anyway) will live the whole time, so polling
+	       free_host_blocks should be effective.  */
 	    if (!nvptx_run_deferred_page_locked_host_free ())
 	      exit (EXIT_FAILURE);
 	  }

[-- Attachment #3: 230912-parallel-reverse-offload-og13.patch --]
[-- Type: text/plain, Size: 26024 bytes --]

libgomp: parallel reverse offload

Extend OpenMP reverse offload support to allow running the host kernels
on multiple threads.  The device plugin API for reverse offload is now made
non-blocking, meaning that running the host kernel in the wrong device
context is no longer a problem.  The NVPTX message passing interface now
uses a ring buffer approximately matching GCN.

include/ChangeLog:

	* gomp-constants.h (GOMP_VERSION): Bump.

libgomp/ChangeLog:

	* config/gcn/target.c (GOMP_target_ext): Add "signal" field.
	Fix atomics race condition.
	* config/nvptx/libgomp-nvptx.h (REV_OFFLOAD_QUEUE_SIZE): New define.
	(struct rev_offload): Implement ring buffer.
	* config/nvptx/target.c (GOMP_target_ext): Likewise.
	* env.c (initialize_env): Read GOMP_REVERSE_OFFLOAD_THREADS.
	* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Replace "aq" parameter
	with "signal" and "use_aq".
	* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Likewise.
	* libgomp.h (gomp_target_rev): Likewise.
	* plugin/plugin-gcn.c (process_reverse_offload): Add "signal".
	(console_output): Pass signal value through.
	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_openacc_async_construct):
	Attach new threads to the numbered device.
	Change the flag to CU_STREAM_NON_BLOCKING.
	(GOMP_OFFLOAD_run): Implement ring-buffer and remove signalling.
	* target.c (gomp_target_rev): Rename to ...
	(gomp_target_rev_internal): ... this, and change "dev_num" to
	"devicep".
	(gomp_target_rev_worker_thread): New function.
	(gomp_target_rev): New function (old name).
	* libgomp.texi: Document GOMP_REVERSE_OFFLOAD_THREADS.
	* testsuite/libgomp.c/reverse-offload-threads-1.c: New test.
	* testsuite/libgomp.c/reverse-offload-threads-2.c: New test.

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 1c7508bedce..54af0f2e1d6 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -362,7 +362,7 @@ enum gomp_map_kind
 /* Versions of libgomp and device-specific plugins.  GOMP_VERSION
    should be incremented whenever an ABI-incompatible change is introduced
    to the plugin interface defined in libgomp/libgomp.h.  */
-#define GOMP_VERSION	2
+#define GOMP_VERSION	3
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_GCN 3
 
diff --git a/libgomp/config/gcn/target.c b/libgomp/config/gcn/target.c
index ea5eb1ff5ed..906b04ca41e 100644
--- a/libgomp/config/gcn/target.c
+++ b/libgomp/config/gcn/target.c
@@ -103,19 +103,38 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	   <= (index - 1024))
       asm ("s_sleep 64");
 
+  /* In theory, it should be enough to write "written" with __ATOMIC_RELEASE,
+     and have the rest of the data flushed to memory automatically, but some
+     devices (gfx908) seem to have a race condition where the flushed data
+     arrives after the atomic data, and the host does the wrong thing.
+     If we just write everything atomically in the correct order then we're
+     safe.  */
+
   unsigned int slot = index % 1024;
-  data->queue[slot].value_u64[0] = (uint64_t) fn;
-  data->queue[slot].value_u64[1] = (uint64_t) mapnum;
-  data->queue[slot].value_u64[2] = (uint64_t) hostaddrs;
-  data->queue[slot].value_u64[3] = (uint64_t) sizes;
-  data->queue[slot].value_u64[4] = (uint64_t) kinds;
-  data->queue[slot].value_u64[5] = (uint64_t) GOMP_ADDITIONAL_ICVS.device_num;
-
-  data->queue[slot].type = 4; /* Reverse offload.  */
+  __atomic_store_n (&data->queue[slot].value_u64[0], (uint64_t) fn,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[1], (uint64_t) mapnum,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[2], (uint64_t) hostaddrs,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[3], (uint64_t) sizes,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[4], (uint64_t) kinds,
+		    __ATOMIC_RELAXED);
+  __atomic_store_n (&data->queue[slot].value_u64[5],
+		    (uint64_t) GOMP_ADDITIONAL_ICVS.device_num,
+		    __ATOMIC_RELAXED);
+
+  volatile int signal = 0;
+  __atomic_store_n (&data->queue[slot].value_u64[6], (uint64_t) &signal,
+		    __ATOMIC_RELAXED);
+
+  __atomic_store_n (&data->queue[slot].type, 4 /* Reverse offload.  */,
+		    __ATOMIC_RELAXED);
   __atomic_store_n (&data->queue[slot].written, 1, __ATOMIC_RELEASE);
 
-  /* Spinlock while the host catches up.  */
-  while (__atomic_load_n (&data->queue[slot].written, __ATOMIC_ACQUIRE) != 0)
+  /* Spinlock while the host runs the kernel.  */
+  while (__atomic_load_n (&signal, __ATOMIC_ACQUIRE) == 0)
     asm ("s_sleep 64");
 }
 
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
index 96de86977c3..78719894efa 100644
--- a/libgomp/config/nvptx/libgomp-nvptx.h
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -25,20 +25,41 @@
 
 /* This file contains defines and type definitions shared between the
    nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
-   needef for this target.  */
+   needed for this target.  */
 
 #ifndef LIBGOMP_NVPTX_H
 #define LIBGOMP_NVPTX_H 1
 
 #define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+#define REV_OFFLOAD_QUEUE_SIZE 1024
 
 struct rev_offload {
-  uint64_t fn;
-  uint64_t mapnum;
-  uint64_t addrs;
-  uint64_t sizes;
-  uint64_t kinds;
-  int32_t dev_num;
+  /* The target can grab a slot by incrementing "next_slot".
+     Each host thread may claim some slots for processing.
+     When the host processing is completed "consumed" indicates that the
+     corresponding slots in the ring-buffer "queue" are available for reuse.
+
+     Note that "next_slot" is an index, and "consumed"/"claimed" are counters,
+     so beware of the fence-posts.  */
+  unsigned int next_slot;
+  unsigned int consumed;
+  unsigned int claimed;
+
+  struct rev_req {
+    /* The target writes an address to "signal" as the last item, which
+       indicates to the host that the record is completely written.  The target
+       must not assume that it still owns the slot, after that.  The signal
+       address is then used by the host to communicate that the reverse-offload
+       kernel has completed execution.  */
+    volatile int *signal;
+
+    uint64_t fn;
+    uint64_t mapnum;
+    uint64_t addrs;
+    uint64_t sizes;
+    uint64_t kinds;
+    int32_t dev_num;
+  } queue[REV_OFFLOAD_QUEUE_SIZE];
 };
 
 #if (__SIZEOF_SHORT__ != 2 \
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 125d92a2ea9..5593ab62b6d 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -93,7 +93,6 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
 		 unsigned int flags, void **depend, void **args)
 {
-  static int lock = 0;  /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
@@ -103,43 +102,57 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
       || GOMP_REV_OFFLOAD_VAR == NULL)
     return;
 
-  gomp_mutex_lock (&lock);
-
-  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
-  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
-  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
-  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
-  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
-
-  /* Set 'fn' to trigger processing on the host; wait for completion,
-     which is flagged by setting 'fn' back to 0 on the host.  */
-  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  /* Reserve one slot. */
+  unsigned int index = __atomic_fetch_add (&GOMP_REV_OFFLOAD_VAR->next_slot,
+					   1, __ATOMIC_ACQUIRE);
+
+  if ((unsigned int) (index + 1) < GOMP_REV_OFFLOAD_VAR->consumed)
+    abort ();  /* Overflow.  */
+
+  /* Spinlock while the host catches up.  */
+  if (index >= REV_OFFLOAD_QUEUE_SIZE)
+    while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->consumed, __ATOMIC_ACQUIRE)
+	   <= (index - REV_OFFLOAD_QUEUE_SIZE))
+      ; /* spin  */
+
+  unsigned int slot = index % REV_OFFLOAD_QUEUE_SIZE;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].fn = (uint64_t) fn;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->queue[slot].dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* Set 'signal' to trigger processing on the host; the slot is now consumed
+     by the host, so we should not touch it again.  */
+  volatile int signal = 0;
+  uint64_t addr_struct_signal = (uint64_t) &GOMP_REV_OFFLOAD_VAR->queue[slot].signal;
 #if __PTX_SM__ >= 700
   asm volatile ("st.global.release.sys.u64 [%0], %1;"
-		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+		: : "r"(addr_struct_signal), "r" (&signal) : "memory");
 #else
   __sync_synchronize ();  /* membar.sys */
   asm volatile ("st.volatile.global.u64 [%0], %1;"
-		: : "r"(addr_struct_fn), "r" (fn) : "memory");
+		: : "r"(addr_struct_signal), "r" (&signal) : "memory");
 #endif
 
+  /* The host signals completion by writing a non-zero value to the 'signal'
+     variable.  */
 #if __PTX_SM__ >= 700
-  uint64_t fn2;
+  uint64_t signal2;
   do
     {
       asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
-		    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+		    : "=r" (signal2) : "r" (&signal) : "memory");
     }
-  while (fn2 != 0);
+  while (signal2 == 0);
 #else
   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
      ld.u64 %r36,[%r64];
      membar.sys;  */
-  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+  while (__atomic_load_n (&signal, __ATOMIC_ACQUIRE) == 0)
     ;  /* spin  */
 #endif
-
-  gomp_mutex_unlock (&lock);
 }
 
 void
diff --git a/libgomp/env.c b/libgomp/env.c
index f24484d7f70..0c74f441da0 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -123,6 +123,7 @@ size_t gomp_affinity_format_len;
 char *goacc_device_type;
 int goacc_device_num;
 int goacc_default_dims[GOMP_DIM_MAX];
+int gomp_reverse_offload_threads = 8;  /* Reasonable default.  */
 
 #ifndef LIBGOMP_OFFLOADED_ONLY
 
@@ -2483,6 +2484,11 @@ initialize_env (void)
 
   handle_omp_display_env ();
 
+  /* Control the number of background threads reverse offload is permitted
+     to use.  */
+  parse_int_secure ("GOMP_REVERSE_OFFLOAD_THREADS",
+		    &gomp_reverse_offload_threads, false);
+
   /* OpenACC.  */
 
   if (!parse_int ("ACC_DEVICE_NUM", getenv ("ACC_DEVICE_NUM"),
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index d696515eeb6..daff51ba50c 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -82,8 +82,8 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
 void
 GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
 			uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-			struct goacc_asyncqueue *aq)
+			volatile int *signal, bool use_aq)
 {
   gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
-		   aq);
+		   signal, use_aq);
 }
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 59428c1c150..0f0d8b50b24 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,7 +121,7 @@ extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
 extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
-				    uint64_t, int, struct goacc_asyncqueue *);
+				    uint64_t, int, volatile int *, bool);
 
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 0742836127f..482d077dfdc 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -613,6 +613,7 @@ extern struct gomp_offload_icv_list *gomp_offload_icv_list;
 extern int goacc_device_num;
 extern char *goacc_device_type;
 extern int goacc_default_dims[GOMP_DIM_MAX];
+extern int gomp_reverse_offload_threads;
 
 enum gomp_task_kind
 {
@@ -1136,7 +1137,7 @@ extern bool gomp_page_locked_host_unregister_dev (struct gomp_device_descr *,
 						  void *, size_t,
 						  struct goacc_asyncqueue *);
 extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
-			     int, struct goacc_asyncqueue *);
+			     int, volatile int *, bool);
 extern bool gomp_page_locked_host_alloc (void **, size_t);
 extern void gomp_page_locked_host_free (void *);
 
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 8b374bceced..b0a67151bf4 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -2197,6 +2197,7 @@ variable is not set.
 * GOMP_STACKSIZE::          Set default thread stack size
 * GOMP_SPINCOUNT::          Set the busy-wait spin count
 * GOMP_RTEMS_THREAD_POOLS:: Set the RTEMS specific thread pools
+* GOMP_REVERSE_OFFLOAD_THREADS:: Set the maximum number of host threads
 @end menu
 
 
@@ -2923,6 +2924,22 @@ pools available and their worker threads run at priority four.
 
 
 
+@node GOMP_REVERSE_OFFLOAD_THREADS
+@section @env{GOMP_REVERSE_OFFLOAD_THREADS} -- Set the maximum number of host threads
+@cindex Environment Variable
+@table @asis
+@item @emph{Description}
+Set the maximum number of threads that may be used to run reverse offload
+code sections (host code nested within offload regions, declared using
+@code{#pragma omp target device(ancestor:1)}).  The value should be a non-zero
+positive integer.  The default is 8 threads.
+
+The threads are created on demand, up to the maximum number given, and are
+destroyed when no reverse offload requests remain.
+@end table
+
+
+
 @c ---------------------------------------------------------------------
 @c Enabling OpenACC
 @c ---------------------------------------------------------------------
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 3df9fbe8d80..7c0115ae579 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -2004,11 +2004,12 @@ create_kernel_dispatch (struct kernel_info *kernel, int num_teams,
 
 static void
 process_reverse_offload (uint64_t fn, uint64_t mapnum, uint64_t hostaddrs,
-			 uint64_t sizes, uint64_t kinds, uint64_t dev_num64)
+			 uint64_t sizes, uint64_t kinds, uint64_t dev_num64,
+			 uint64_t signal)
 {
   int dev_num = dev_num64;
   GOMP_PLUGIN_target_rev (fn, mapnum, hostaddrs, sizes, kinds, dev_num,
-			  NULL);
+			  (volatile int *) signal, false);
 }
 
 /* Output any data written to console output from the kernel.  It is expected
@@ -2058,7 +2059,8 @@ console_output (struct kernel_info *kernel, struct kernargs *kernargs,
 	case 4:
 	  process_reverse_offload (data->value_u64[0], data->value_u64[1],
 				   data->value_u64[2], data->value_u64[3],
-				   data->value_u64[4], data->value_u64[5]);
+				   data->value_u64[4], data->value_u64[5],
+				   data->value_u64[6]);
 	  break;
 	default: printf ("GCN print buffer error!\n"); break;
 	}
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 41191d31c79..176bb983bdc 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -2153,9 +2153,10 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
 }
 
 struct goacc_asyncqueue *
-GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
+GOMP_OFFLOAD_openacc_async_construct (int device)
 {
-  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
+  nvptx_attach_host_thread_to_device (device);
+  return nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
 }
 
 static bool
@@ -2649,16 +2650,54 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	else if (r != CUDA_ERROR_NOT_READY)
 	  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
 
-	if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
+	struct rev_offload *rev_metadata = ptx_dev->rev_data;
+
+	/* Claim a portion of the ring buffer to process on this iteration.
+	   Don't mark them as consumed until all the data has been read out.  */
+	unsigned int consumed = __atomic_load_n (&rev_metadata->consumed,
+						 __ATOMIC_ACQUIRE);
+	unsigned int from = __atomic_load_n (&rev_metadata->claimed,
+						__ATOMIC_RELAXED);
+	unsigned int to = __atomic_load_n (&rev_metadata->next_slot,
+					   __ATOMIC_RELAXED);
+
+	if (consumed > to)
+	  {
+	    /* Overflow happens when we exceed UINTMAX requests.  */
+	    GOMP_PLUGIN_fatal ("NVPTX reverse offload buffer overflowed.\n");
+	  }
+
+	to = MIN(to, consumed + REV_OFFLOAD_QUEUE_SIZE / 2);
+	if (to <= from)
+	  /* Nothing to do; poll again.  */
+	  goto poll_again;
+
+	if (!__atomic_compare_exchange_n (&rev_metadata->claimed, &from, to,
+					  false,
+					  __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
+	  /* Collision with another thread ... go around again.  */
+	  goto poll_again;
+
+	unsigned int index;
+	for (index = from; index < to; index++)
 	  {
-	    struct rev_offload *rev_data = ptx_dev->rev_data;
+	    int slot = index % REV_OFFLOAD_QUEUE_SIZE;
+
+	    /* Wait while the target finishes filling in the slot.  */
+	    while (__atomic_load_n (&ptx_dev->rev_data->queue[slot].signal,
+				    __ATOMIC_ACQUIRE) == 0)
+	      ; /* spin  */
+
+	    /* Pass the request to libgomp; this will queue the request and
+	       return right away, without waiting for the kernel to run.  */
+	    struct rev_req *rev_data = &ptx_dev->rev_data->queue[slot];
 	    GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum,
 				    rev_data->addrs, rev_data->sizes,
 				    rev_data->kinds, rev_data->dev_num,
-				    reverse_offload_aq);
-	    if (!nvptx_goacc_asyncqueue_synchronize (reverse_offload_aq))
-	      exit (EXIT_FAILURE);
-	    __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE);
+				    rev_data->signal, true);
+
+	    /* Ensure that the slot doesn't trigger early, when reused.  */
+	    __atomic_store_n (&rev_data->signal, 0, __ATOMIC_RELEASE);
 
 	    /* Clean up here; otherwise we may run into the situation that
 	       a following reverse offload does
@@ -2673,6 +2712,19 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 	    if (!nvptx_run_deferred_page_locked_host_free ())
 	      exit (EXIT_FAILURE);
 	  }
+
+	/* The data is now consumed so release the slots for reuse.  */
+	unsigned int consumed_so_far = from;
+	while (!__atomic_compare_exchange_n (&rev_metadata->consumed,
+					    &consumed_so_far, to, false,
+					    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
+	  {
+	    /* Another thread didn't consume all it claimed yet.... */
+	    consumed_so_far = from;
+	    usleep (1);
+	  }
+
+poll_again:
 	usleep (1);
       }
   else
diff --git a/libgomp/target.c b/libgomp/target.c
index 20a7f3776c2..1ab533f214f 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4053,16 +4053,18 @@ gomp_map_cdata_lookup (struct cpy_data *d, uint64_t *devaddrs,
 				    tgt_start, tgt_end);
 }
 
-/* Handle reverse offload.  This is called by the device plugins for a
-   reverse offload; it is not called if the outer target runs on the host.
+/* Handle reverse offload.  This is called by the host worker thread to
+   execute a single reverse offload request; it is not called if the outer
+   target runs on the host.
    The mapping is simplified device-affecting constructs (except for target
    with device(ancestor:1)) must not be encountered; in particular not
    target (enter/exit) data.  */
 
-void
-gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
-		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
-		 struct goacc_asyncqueue *aq)
+static void
+gomp_target_rev_internal (uint64_t fn_ptr, uint64_t mapnum,
+			  uint64_t devaddrs_ptr, uint64_t sizes_ptr,
+			  uint64_t kinds_ptr, struct gomp_device_descr *devicep,
+			  struct goacc_asyncqueue *aq)
 {
   /* Return early if there is no offload code.  */
   if (sizeof (OFFLOAD_PLUGINS) == sizeof (""))
@@ -4079,7 +4081,6 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
   unsigned short *kinds;
   const bool short_mapkind = true;
   const int typemask = short_mapkind ? 0xff : 0x7;
-  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
 
   reverse_splay_tree_key n;
   struct reverse_splay_tree_key_s k;
@@ -4622,6 +4623,134 @@ gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
     }
 }
 
+static struct target_rev_queue_s
+{
+  uint64_t fn_ptr;
+  uint64_t mapnum;
+  uint64_t devaddrs_ptr;
+  uint64_t sizes_ptr;
+  uint64_t kinds_ptr;
+  struct gomp_device_descr *devicep;
+
+  volatile int *signal;
+  bool use_aq;
+
+  struct target_rev_queue_s *next;
+} *target_rev_queue_head = NULL, *target_rev_queue_last = NULL;
+static gomp_mutex_t target_rev_queue_lock = 0;
+static int target_rev_thread_count = 0;
+
+static void *
+gomp_target_rev_worker_thread (void *)
+{
+  struct target_rev_queue_s *rev_kernel = NULL;
+  struct goacc_asyncqueue *aq = NULL;
+  struct gomp_device_descr *aq_devicep;
+
+  while (1)
+    {
+      gomp_mutex_lock (&target_rev_queue_lock);
+
+      /* Take a reverse-offload kernel request from the queue.  */
+      rev_kernel = target_rev_queue_head;
+      if (rev_kernel)
+	{
+	  target_rev_queue_head = rev_kernel->next;
+	  if (target_rev_queue_head == NULL)
+	    target_rev_queue_last = NULL;
+	}
+
+      if (rev_kernel == NULL)
+	{
+	  target_rev_thread_count--;
+	  gomp_mutex_unlock (&target_rev_queue_lock);
+	  break;
+	}
+      gomp_mutex_unlock (&target_rev_queue_lock);
+
+      /* Ensure we have a suitable device queue for the memory transfers.  */
+      if (rev_kernel->use_aq)
+	{
+	  if (aq && aq_devicep != rev_kernel->devicep)
+	    {
+	      aq_devicep->openacc.async.destruct_func (aq);
+	      aq = NULL;
+	    }
+
+	  if (!aq)
+	    {
+	      aq_devicep = rev_kernel->devicep;
+	      aq = aq_devicep->openacc.async.construct_func (aq_devicep->target_id);
+	    }
+	}
+
+      /* Run the kernel on the host.  */
+      gomp_target_rev_internal (rev_kernel->fn_ptr, rev_kernel->mapnum,
+				rev_kernel->devaddrs_ptr, rev_kernel->sizes_ptr,
+				rev_kernel->kinds_ptr, rev_kernel->devicep, aq);
+
+      /* Signal the device that the reverse-offload is completed.  */
+      int one = 1;
+      gomp_copy_host2dev (rev_kernel->devicep, aq, (void*)rev_kernel->signal,
+			  &one, sizeof (one), false, NULL);
+
+      /* We're done with this request.  */
+      free (rev_kernel);
+
+      /* Loop around and see if another request is waiting.  */
+    }
+
+  if (aq)
+    aq_devicep->openacc.async.destruct_func (aq);
+
+  return NULL;
+}
+
+void
+gomp_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+		 uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+		 volatile int *signal, bool use_aq)
+{
+  struct gomp_device_descr *devicep = resolve_device (dev_num, false);
+
+  /* Create a new queue node.  */
+  struct target_rev_queue_s *newreq = gomp_malloc (sizeof (*newreq));
+  newreq->fn_ptr = fn_ptr;
+  newreq->mapnum = mapnum;
+  newreq->devaddrs_ptr = devaddrs_ptr;
+  newreq->sizes_ptr = sizes_ptr;
+  newreq->kinds_ptr = kinds_ptr;
+  newreq->devicep = devicep;
+  newreq->signal = signal;
+  newreq->use_aq = use_aq;
+  newreq->next = NULL;
+
+  gomp_mutex_lock (&target_rev_queue_lock);
+
+  /* Enqueue the reverse-offload request.  */
+  if (target_rev_queue_last)
+    {
+      target_rev_queue_last->next = newreq;
+      target_rev_queue_last = newreq;
+    }
+  else
+    target_rev_queue_last = target_rev_queue_head = newreq;
+
+  /* Launch a new thread to process the request asynchronously.
+     If the thread pool limit has been reached then an existing thread will
+     pick up the job when it is ready.  */
+  if (target_rev_thread_count < gomp_reverse_offload_threads)
+    {
+      target_rev_thread_count++;
+      gomp_mutex_unlock (&target_rev_queue_lock);
+
+      pthread_t t;
+      pthread_create (&t, NULL, gomp_target_rev_worker_thread, NULL);
+    }
+  else
+    gomp_mutex_unlock (&target_rev_queue_lock);
+}
+
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
 
 static void
diff --git a/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c b/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c
new file mode 100644
index 00000000000..fa74a8e9668
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/reverse-offload-threads-1.c
@@ -0,0 +1,26 @@
+/* { dg-do run }  */
+/* { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } } */
+
+/* Test that the reverse offload message buffers can cope with a lot of
+   requests.  */
+
+#pragma omp requires reverse_offload
+
+int main ()
+{
+  #pragma omp target teams distribute parallel for collapse(2)
+  for (int i=0; i < 100; i++)
+    for (int j=0; j < 16; j++)
+      {
+	int val = 0;
+	#pragma omp target device ( ancestor:1 ) firstprivate(i,j) map(from:val)
+	{
+	  val = i + j;
+	}
+
+	if (val != i + j)
+	  __builtin_abort ();
+      }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c b/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c
new file mode 100644
index 00000000000..05a2571ed14
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c/reverse-offload-threads-2.c
@@ -0,0 +1,31 @@
+/* { dg-do run }  */
+/* { dg-additional-options "-foffload-options=nvptx-none=-misa=sm_35" { target { offload_target_nvptx } } } */
+
+/* Test that the reverse offload message buffers can cope with multiple
+   requests from multiple kernels.  */
+
+#pragma omp requires reverse_offload
+
+int main ()
+{
+  for (int n=0; n < 5; n++)
+    {
+      #pragma omp target teams distribute parallel for nowait collapse(2)
+      for (int i=0; i < 32; i++)
+	for (int j=0; j < 16; j++)
+	  {
+	    int val = 0;
+	    #pragma omp target device ( ancestor:1 ) firstprivate(i,j) map(from:val)
+	    {
+	      val = i + j;
+	    }
+
+	    if (val != i + j)
+	      __builtin_abort ();
+	  }
+    }
+
+#pragma omp taskwait
+
+  return 0;
+}


* Re: [PATCH] libgomp, nvptx, amdgcn: parallel reverse offload
  2023-09-12 14:27 [PATCH] libgomp, nvptx, amdgcn: parallel reverse offload Andrew Stubbs
  2023-09-12 16:32 ` [OG13][committed] " Andrew Stubbs
@ 2023-09-21  9:22 ` Tobias Burnus
  1 sibling, 0 replies; 3+ messages in thread
From: Tobias Burnus @ 2023-09-21  9:22 UTC (permalink / raw)
  To: Andrew Stubbs, gcc-patches, Thomas Schwinge

Hi Andrew, hi Thomas, hi all,

@Thomas: I wouldn't mind if you could glance at the nvptx/CUDA bits.

On 12.09.23 16:27, Andrew Stubbs wrote:
> This patch implements parallel execution of OpenMP reverse offload
> kernels.
> ...
> The device threads that sent requests are still blocked waiting for
> the completion signal, but any other threads may continue as usual.
Which matches the spec. (Except that, starting with TR12, a user may
also use the 'nowait' clause on the reverse-offload target directive.)
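
For illustration, something like the following would then be permitted
("host_function" is an invented name; with "nowait" the device thread
need no longer block until the host kernel completes):

  #pragma omp target device(ancestor:1) nowait
    host_function ();
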
> +++ b/libgomp/config/nvptx/target.c
> @@ -93,7 +93,6 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>                void **hostaddrs, size_t *sizes, unsigned short *kinds,
>                unsigned int flags, void **depend, void **args)
>   {
...
> +  if ((unsigned int) (index + 1) < GOMP_REV_OFFLOAD_VAR->consumed)
> +    abort ();  /* Overflow.  */

[I assume the idea is that this gets diagnosed by the host (via
GOMP_PLUGIN_fatal) and that the diagnosis is faster than the
propagation of the abort() from the device to the host, such that the
message is always printed.]

Should there be an "Error message is printed via the nvptx plugin on the
host" comment, or something along those lines?


> +++ b/libgomp/libgomp.texi
...
> +* GOMP_REVERSE_OFFLOAD_THREADS:: Set the maximum number of host threads
>   @end menu
...
> +@node GOMP_REVERSE_OFFLOAD_THREADS
> +@section @env{GOMP_REVERSE_OFFLOAD_THREADS} -- Set the maximum number of host threads

Thanks, but can you also update the gcn/nvptx description? We currently have:
https://gcc.gnu.org/onlinedocs/libgomp/index.html

--------<cut->----------
@section AMD Radeon (GCN)
...
@item Reverse offload regions (i.e. @code{target} regions with
       @code{device(ancestor:1)}) are processed serially per @code{target} region
       such that the next reverse offload region is only executed after the previous
       one returned.
...
@section nvptx
...
@item Reverse offload regions (i.e. @code{target} regions with
       @code{device(ancestor:1)}) are processed serially per @code{target} region
       such that the next reverse offload region is only executed after the previous
       one returned.
--------<cut->----------

Possibly by also adding a @ref{GOMP_REVERSE_OFFLOAD_THREADS} such that a user
can find this.

(I wonder whether the "UINTMAX (= 4294967296)" limit should be documented or
whether that's an implementation detail we do not need to document. But given
that a long-running code taking 4 weeks can still issue 1775 reverse offloads
per second (2^32 / (4 * 7 * 86400 s) ~= 1775) before exhausting the count,
that should be a nonissue.)



> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1639,9 +1639,10 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
>   }
>
>   struct goacc_asyncqueue *
> -GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
> +GOMP_OFFLOAD_openacc_async_construct (int device)
>   {
> -  return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
> +  nvptx_attach_host_thread_to_device (device);
> +  return nvptx_goacc_asyncqueue_construct (CU_STREAM_NON_BLOCKING);
>   }

That's not really new and we have plenty of code of this kind, but isn't
this a race if this is called nearly instantaneously for multiple
devices? (Still, the new code is surely better than the previous one.)

@Thomas: ?

I will have another look after the Cauldron, but I think the patch is
otherwise okay.

Tobias


