public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH 0/4] openacc: Async fixes
@ 2021-06-29 23:42 Julian Brown
  2021-06-29 23:42 ` [PATCH 1/4] openacc: Async fix for lib-94 testcase Julian Brown
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

This patch series contains fixes for various problems with async support
for OpenACC at present:

 - Asynchonous host-to-device copies invoked from within libgomp
   (target.c) could copy bad data to the target -- and the workaround
   for that currently used in the AMD GCN target plugin could lead to
   a different problem (a race condition).

 - The OpenACC profiling-interface implementation did not measure
   asynchronous operations properly.

 - Several test cases misuse OpenACC asynchronous support (more race
   conditions).

Further comments on individual patches. Tested with offloading to AMD
GCN. OK for mainline?

Thanks,

Julian

Julian Brown (4):
  openacc: Async fix for lib-94 testcase
  openacc: Fix async bugs in several OpenACC test cases
  openacc: Fix asynchronous host-to-device copies in libgomp runtime
  openacc: Profiling-interface fixes for asynchronous operations

 libgomp/libgomp.h                             |   2 +-
 libgomp/oacc-host.c                           |   5 +-
 libgomp/oacc-mem.c                            |  36 +++-
 libgomp/oacc-parallel.c                       | 190 ++++++++++++++----
 libgomp/plugin/plugin-gcn.c                   |  20 +-
 libgomp/target.c                              | 111 ++++++----
 .../acc_prof-init-1.c                         |   5 +-
 .../acc_prof-parallel-1.c                     |  64 ++----
 .../libgomp.oacc-c-c++-common/deep-copy-10.c  |  14 +-
 .../libgomp.oacc-c-c++-common/lib-94.c        |   4 +-
 .../libgomp.oacc-fortran/lib-16-2.f90         |   5 +
 .../testsuite/libgomp.oacc-fortran/lib-16.f90 |   5 +
 12 files changed, 289 insertions(+), 172 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] openacc: Async fix for lib-94 testcase
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
@ 2021-06-29 23:42 ` Julian Brown
  2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

The test case performs an asynchronous host-to-device copy and then
immediately clobbers the data on the host via "memset", leading to a race
condition.  This patch moves the memset after an acc_wait call instead.

Tested with offloading to AMD GCN.

I can probably self-approve this as a testcase change only, unless
anyone objects.

Thanks,

Julian

2021-06-29  Julian Brown  <julian@codesourcery.com>

libgomp/
	* testsuite/libgomp.oacc-c-c++-common/lib-94.c: Fix race condition.
---
 libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
index 54497237b0c..baa3ac83f04 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
@@ -22,10 +22,10 @@ main (int argc, char **argv)
 
   acc_copyin_async (h, N, async);
 
-  memset (h, 0, N);
-
   acc_wait (async);
 
+  memset (h, 0, N);
+
   acc_copyout_async (h, N, async + 1);
 
   acc_wait (async + 1);
-- 
2.29.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
  2021-06-29 23:42 ` [PATCH 1/4] openacc: Async fix for lib-94 testcase Julian Brown
@ 2021-06-29 23:42 ` Julian Brown
  2021-06-29 23:52   ` Julian Brown
  2021-06-29 23:42 ` [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime Julian Brown
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

Several OpenACC tests accidentally abuse async semantics, leading to
race conditions & test failures.  This patch fixes those tests.

Tested with offloading to AMD GCN. I can probably self-approve this as
a testcase change only, unless anyone objects.

Thanks,

Julian

2021-06-29  Julian Brown  <julian@codesourcery.com>

libgomp/
	* testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c: Fix async
	behaviour and increase number of iterations.
	* testsuite/libgomp.oacc-fortran/lib-16-2.f90: Fix async behaviour.
	* testsuite/libgomp.oacc-fortran/lib-16.f90: Likewise.
---
 .../libgomp.oacc-c-c++-common/deep-copy-10.c       | 14 ++++++++------
 .../testsuite/libgomp.oacc-fortran/lib-16-2.f90    |  5 +++++
 libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90  |  5 +++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
index 573a8214bf0..dadb6d37942 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
@@ -1,6 +1,8 @@
 #include <stdlib.h>
 
-/* Test asyncronous attach and detach operation.  */
+#define ITERATIONS 1023
+
+/* Test asynchronous attach and detach operation.  */
 
 typedef struct {
   int *a;
@@ -25,13 +27,13 @@ main (int argc, char* argv[])
 
 #pragma acc enter data copyin(m)
 
-  for (int i = 0; i < 99; i++)
+  for (int i = 0; i < ITERATIONS; i++)
     {
       int j;
-#pragma acc parallel loop copy(m.a[0:N]) async(i % 2)
+#pragma acc parallel loop copy(m.a[0:N]) async(0)
       for (j = 0; j < N; j++)
 	m.a[j]++;
-#pragma acc parallel loop copy(m.b[0:N]) async((i + 1) % 2)
+#pragma acc parallel loop copy(m.b[0:N]) async(1)
       for (j = 0; j < N; j++)
 	m.b[j]++;
     }
@@ -40,9 +42,9 @@ main (int argc, char* argv[])
 
   for (i = 0; i < N; i++)
     {
-      if (m.a[i] != 99)
+      if (m.a[i] != ITERATIONS)
 	abort ();
-      if (m.b[i] != 99)
+      if (m.b[i] != ITERATIONS)
 	abort ();
     }
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
index ddd557d3be0..e2e47c967fa 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
index ccd1ce6ee18..ef9a6f6626c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
-- 
2.29.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
  2021-06-29 23:42 ` [PATCH 1/4] openacc: Async fix for lib-94 testcase Julian Brown
  2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
@ 2021-06-29 23:42 ` Julian Brown
  2021-07-27 10:01   ` Thomas Schwinge
  2021-06-29 23:42 ` [PATCH 4/4] openacc: Profiling-interface fixes for asynchronous operations Julian Brown
  2021-06-30  8:28 ` [PATCH 0/4] openacc: Async fixes Thomas Schwinge
  4 siblings, 1 reply; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

This patch fixes several places in libgomp/target.c where "ephemeral" data
(on the stack or in temporary heap locations) may be used as the source of
an asynchronous host-to-device copy that may not complete before the host
data disappears.  Versions of the patch have been posted several times
before, but this one (at Chung-Lin Tang's prior suggesion, IIRC) moves
all logic into target.c rather than pushing it out to each target plugin.

An existing, but flawed, workaround for this problem in the AMD GCN
libgomp offloading plugin is currently present on mainline, and was
posted for the og9 branch here:

  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg00901.html

and previous versions of this patch were posted here (for mainline/og9):

  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg01482.html
  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01026.html

This patch exposes a problem with OpenACC profiling support that is
fixed by the next patch in the series. The acc_prof-parallel-1.c test
is XFAILed for now.

Tested with offloading to AMD GCN. OK?

Julian

2021-06-29  Julian Brown  <julian@codesourcery.com>

libgomp/
	* libgomp.h (gomp_copy_host2dev): Update prototype.
	(memcpy_tofrom_device, update_dev_host): Add new argument to
	gomp_copy_host2dev (false).
	* plugin/plugin-gcn.c (struct copy_data): Remove free_src field.
	(copy_data): Don't free free_src.
	(queue_push_copy): Remove free_src handling.
	(GOMP_OFFLOAD_dev2dev): Update call to queue_push_copy.
	(GOMP_OFFLOAD_openacc_async_host2dev): Remove source-data snapshotting.
	(GOMP_OFFLOAD_openacc_async_dev2host): Update call to queue_push_copy.
	* target.c (goacc_device_copy_async): Remove.
	(gomp_copy_host2dev): Add EPHEMERAL parameter. Snapshot source data
	when true, and set up deferred freeing of temporary buffer.
	(gomp_copy_dev2host): Inline device-to-host copy handling instead of
	calling goacc_device_copy_async.
	(gomp_map_vars_existing): Update calls to gomp_copy_host2dev with
	appropriate ephemeral argument.
	(gomp_map_pointer, gomp_attach_pointer, gomp_detach_pointer,
	gomp_update): Likewise.
	(gomp_map_vars_internal): Likewise. Don't use coalescing buffer for
	async copies.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c: XFAIL for
	now.
---
 libgomp/libgomp.h                             |   2 +-
 libgomp/oacc-mem.c                            |   4 +-
 libgomp/plugin/plugin-gcn.c                   |  20 +---
 libgomp/target.c                              | 111 +++++++++++-------
 .../acc_prof-parallel-1.c                     |   2 +
 5 files changed, 77 insertions(+), 62 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 8d25dc8e2a8..e8901da1069 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1226,7 +1226,7 @@ extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 struct gomp_coalesce_buf;
 extern void gomp_copy_host2dev (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
-				size_t, struct gomp_coalesce_buf *);
+				size_t, bool, struct gomp_coalesce_buf *);
 extern void gomp_copy_dev2host (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
 				size_t);
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index c21508f3739..5988db0b886 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -202,7 +202,7 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
   if (from)
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
-    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
 
   if (profiling_p)
     {
@@ -874,7 +874,7 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
   goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
   else
     gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index cfed42a2d4d..98da48b77cb 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -292,7 +292,6 @@ struct copy_data
   void *dst;
   const void *src;
   size_t len;
-  bool free_src;
   struct goacc_asyncqueue *aq;
 };
 
@@ -2914,8 +2913,6 @@ copy_data (void *data_)
 	     data->aq->agent->device_id, data->aq->id, data->len, data->src,
 	     data->dst);
   hsa_memory_copy_wrapper (data->dst, data->src, data->len);
-  if (data->free_src)
-    free ((void *) data->src);
   free (data);
 }
 
@@ -2934,7 +2931,7 @@ gomp_offload_free (void *ptr)
 
 static void
 queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src,
-		 size_t len, bool free_src)
+		 size_t len)
 {
   if (DEBUG_QUEUES)
     GCN_DEBUG ("queue_push_copy %d:%d: %zu bytes from (%p) to (%p)\n",
@@ -2944,7 +2941,6 @@ queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src,
   data->dst = dst;
   data->src = src;
   data->len = len;
-  data->free_src = free_src;
   data->aq = aq;
   queue_push_callback (aq, copy_data, data);
 }
@@ -3646,7 +3642,7 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
     {
       struct agent_info *agent = get_agent_info (device);
       maybe_init_omp_async (agent);
-      queue_push_copy (agent->omp_async_queue, dst, src, n, false);
+      queue_push_copy (agent->omp_async_queue, dst, src, n);
       return true;
     }
 
@@ -3916,15 +3912,7 @@ GOMP_OFFLOAD_openacc_async_host2dev (int device, void *dst, const void *src,
 {
   struct agent_info *agent = get_agent_info (device);
   assert (agent == aq->agent);
-  /* The source data does not necessarily remain live until the deferred
-     copy happens.  Taking a snapshot of the data here avoids reading
-     uninitialised data later, but means that (a) data is copied twice and
-     (b) modifications to the copied data between the "spawning" point of
-     the asynchronous kernel and when it is executed will not be seen.
-     But, that is probably correct.  */
-  void *src_copy = GOMP_PLUGIN_malloc (n);
-  memcpy (src_copy, src, n);
-  queue_push_copy (aq, dst, src_copy, n, true);
+  queue_push_copy (aq, dst, src, n);
   return true;
 }
 
@@ -3936,7 +3924,7 @@ GOMP_OFFLOAD_openacc_async_dev2host (int device, void *dst, const void *src,
 {
   struct agent_info *agent = get_agent_info (device);
   assert (agent == aq->agent);
-  queue_push_copy (aq, dst, src, n, false);
+  queue_push_copy (aq, dst, src, n);
   return true;
 }
 
diff --git a/libgomp/target.c b/libgomp/target.c
index bb09d501dd6..5e4a80a67e1 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -208,22 +208,6 @@ gomp_device_copy (struct gomp_device_descr *devicep,
     }
 }
 
-static inline void
-goacc_device_copy_async (struct gomp_device_descr *devicep,
-			 bool (*copy_func) (int, void *, const void *, size_t,
-					    struct goacc_asyncqueue *),
-			 const char *dst, void *dstaddr,
-			 const char *src, const void *srcaddr,
-			 size_t size, struct goacc_asyncqueue *aq)
-{
-  if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
-    {
-      gomp_mutex_unlock (&devicep->lock);
-      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
-		  src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
-    }
-}
-
 /* Infrastructure for coalescing adjacent or nearly adjacent (in device addresses)
    host to device memory transfers.  */
 
@@ -317,11 +301,18 @@ gomp_to_device_kind_p (int kind)
     }
 }
 
+/* Copy host memory to an offload device.  In asynchronous mode (if AQ is
+   non-NULL), when the source data is stack or may otherwise be deallocated
+   before the asynchronous copy takes place, EPHEMERAL must be passed as
+   TRUE.  The CBUF isn't used for non-ephemeral asynchronous copies, because
+   the host data might not be computed yet (by an earlier asynchronous compute
+   region).  */
+
 attribute_hidden void
 gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		    struct goacc_asyncqueue *aq,
 		    void *d, const void *h, size_t sz,
-		    struct gomp_coalesce_buf *cbuf)
+		    bool ephemeral, struct gomp_coalesce_buf *cbuf)
 {
   if (cbuf)
     {
@@ -349,8 +340,30 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	}
     }
   if (__builtin_expect (aq != NULL, 0))
-    goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
-			     "dev", d, "host", h, sz, aq);
+    {
+      void *srcdata = (void *) h;
+      if (ephemeral)
+	{
+	  /* We're queueing up an asynchronous copy from data that may
+	     disappear before the transfer takes place (i.e. because it is a
+	     stack local in a function that is no longer executing).  Make a
+	     copy of the data into a temporary buffer in those cases.  */
+	  void *tmpbuf = gomp_malloc (sz);
+	  memcpy (tmpbuf, h, sz);
+	  srcdata = tmpbuf;
+	}
+      if (!devicep->openacc.async.host2dev_func (devicep->target_id, d,
+						 srcdata, sz, aq))
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("Copying of host object [%p..%p) to dev object [%p..%p) "
+		      "failed", h, h + sz, d, d + sz);
+	}
+      /* Free any temporary buffer created above once the transfer has
+	 completed.  */
+      if (srcdata != h)
+	devicep->openacc.async.queue_callback_func (aq, free, srcdata);
+    }
   else
     gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
@@ -361,8 +374,15 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep,
 		    void *h, const void *d, size_t sz)
 {
   if (__builtin_expect (aq != NULL, 0))
-    goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
-			     "host", h, "dev", d, sz, aq);
+    {
+      if (!devicep->openacc.async.dev2host_func (devicep->target_id, h, d, sz,
+						 aq))
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("Copying of dev object [%p..%p) to host object [%p..%p) "
+		      "failed", d, d + sz, h, h + sz);
+	}
+    }
   else
     gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
 }
@@ -521,7 +541,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep,
 			(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
 				  + newn->host_start - oldn->host_start),
 			(void *) newn->host_start,
-			newn->host_end - newn->host_start, cbuf);
+			newn->host_end - newn->host_start, false, cbuf);
 
   gomp_increment_refcount (oldn, refcount_set);
 }
@@ -548,8 +568,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
       cur_node.tgt_offset = (uintptr_t) NULL;
       gomp_copy_host2dev (devicep, aq,
 			  (void *) (tgt->tgt_start + target_offset),
-			  (void *) &cur_node.tgt_offset,
-			  sizeof (void *), cbuf);
+			  (void *) &cur_node.tgt_offset, sizeof (void *),
+			  true, cbuf);
       return;
     }
   /* Add bias to the pointer value.  */
@@ -569,7 +589,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
      to initialize the pointer with.  */
   cur_node.tgt_offset -= bias;
   gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
-		      (void *) &cur_node.tgt_offset, sizeof (void *), cbuf);
+		      (void *) &cur_node.tgt_offset, sizeof (void *), true,
+		      cbuf);
 }
 
 static void
@@ -702,7 +723,7 @@ gomp_attach_pointer (struct gomp_device_descr *devicep,
 		  (void *) (n->tgt->tgt_start + n->tgt_offset), (void *) data);
 
       gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &data,
-			  sizeof (void *), cbufp);
+			  sizeof (void *), true, cbufp);
     }
   else
     gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__,
@@ -755,7 +776,7 @@ gomp_detach_pointer (struct gomp_device_descr *devicep,
 		  (void *) target);
 
       gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &target,
-			  sizeof (void *), cbufp);
+			  sizeof (void *), true, cbufp);
     }
   else
     gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__,
@@ -927,8 +948,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	      for (i = first; i <= last; i++)
 		{
 		  tgt->list[i].key = NULL;
-		  if (gomp_to_device_kind_p (get_kind (short_mapkind, kinds, i)
-					     & typemask))
+		  if (!aq
+		      && gomp_to_device_kind_p (get_kind (short_mapkind, kinds,
+							  i) & typemask))
 		    gomp_coalesce_buf_add (&cbuf,
 					   tgt_size - cur_node.host_end
 					   + (uintptr_t) hostaddrs[i],
@@ -969,8 +991,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  gomp_coalesce_buf_add (&cbuf, tgt_size,
-				 cur_node.host_end - cur_node.host_start);
+	  if (!aq)
+	    gomp_coalesce_buf_add (&cbuf, tgt_size,
+				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
 	  has_firstprivate = true;
 	  continue;
@@ -1063,7 +1086,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  if (gomp_to_device_kind_p (kind & typemask))
+	  if (!aq && gomp_to_device_kind_p (kind & typemask))
 	    gomp_coalesce_buf_add (&cbuf, tgt_size,
 				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
@@ -1218,7 +1241,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		len = sizes[i];
 		gomp_copy_host2dev (devicep, aq,
 				    (void *) (tgt->tgt_start + tgt_size),
-				    (void *) hostaddrs[i], len, cbufp);
+				    (void *) hostaddrs[i], len, false, cbufp);
 		tgt_size += len;
 		continue;
 	      case GOMP_MAP_FIRSTPRIVATE_INT:
@@ -1307,12 +1330,11 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		if (cur_node.tgt_offset)
 		  cur_node.tgt_offset -= sizes[i];
 		gomp_copy_host2dev (devicep, aq,
-				    (void *) (n->tgt->tgt_start
-					      + n->tgt_offset
+				    (void *) (n->tgt->tgt_start + n->tgt_offset
 					      + cur_node.host_start
 					      - n->host_start),
 				    (void *) &cur_node.tgt_offset,
-				    sizeof (void *), cbufp);
+				    sizeof (void *), true, cbufp);
 		cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
 				      + cur_node.host_start - n->host_start;
 		continue;
@@ -1450,7 +1472,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start, false,
+					cbufp);
 		    break;
 		  case GOMP_MAP_POINTER:
 		    gomp_map_pointer (tgt, aq,
@@ -1462,7 +1485,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start, false,
+					cbufp);
 		    tgt->list[i].has_null_ptr_assoc = false;
 
 		    for (j = i + 1; j < mapnum; j++)
@@ -1525,7 +1549,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					sizeof (void *), cbufp);
+					sizeof (void *), false, cbufp);
 		    break;
 		  default:
 		    gomp_mutex_unlock (&devicep->lock);
@@ -1541,7 +1565,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		    /* We intentionally do not use coalescing here, as it's not
 		       data allocated by the current call to this function.  */
 		    gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
-					&tgt_addr, sizeof (void *), NULL);
+					&tgt_addr, sizeof (void *), true, NULL);
 		  }
 		array++;
 	      }
@@ -1556,7 +1580,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  gomp_copy_host2dev (devicep, aq,
 			      (void *) (tgt->tgt_start + i * sizeof (void *)),
 			      (void *) &cur_node.tgt_offset, sizeof (void *),
-			      cbufp);
+			      true, cbufp);
 	}
     }
 
@@ -1568,7 +1592,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 			    (void *) (tgt->tgt_start + cbuf.chunks[c].start),
 			    (char *) cbuf.buf + (cbuf.chunks[c].start
 						 - cbuf.chunks[0].start),
-			    cbuf.chunks[c].end - cbuf.chunks[c].start, NULL);
+			    cbuf.chunks[c].end - cbuf.chunks[c].start, true,
+			    NULL);
       free (cbuf.buf);
       cbuf.buf = NULL;
       cbufp = NULL;
@@ -1892,7 +1917,7 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
 
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
 	      gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size,
-				  NULL);
+				  false, NULL);
 	    if (GOMP_MAP_COPY_FROM_P (kind & typemask))
 	      gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
 	  }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
index a5e9ab3f936..dc1807c6ce4 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -1,3 +1,5 @@
+/* { dg-xfail-run-if "Async profiling bug" { *-*-* } } */
+
 /* Test dispatch of events to callbacks.  */
 
 #undef NDEBUG
-- 
2.29.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 4/4] openacc: Profiling-interface fixes for asynchronous operations
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
                   ` (2 preceding siblings ...)
  2021-06-29 23:42 ` [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime Julian Brown
@ 2021-06-29 23:42 ` Julian Brown
  2021-06-30  8:28 ` [PATCH 0/4] openacc: Async fixes Thomas Schwinge
  4 siblings, 0 replies; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

This patch fixes some problems with the OpenACC profiling interface when
used with asynchronous offload operations. The profiling operations
themselves are now launched asynchronously, as previously they measured
the wrong thing, and/or executed at the same time as the operation they
were supposed to be measuring.

A consequence of this change is that "enqueueing" profiling callbacks
are no longer predictably ordered with respect to the callbacks
relating to the execution of asynchronous operations themselves. The
acc_prof-parallel-1.c test is un-XFAILed and adjusted accordingly.

This patch was posted for the og9 branch here:

  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01024.html

Tested with offloading to AMD GCN. OK for mainline?

Thanks,

Julian

2021-06-29  Julian Brown  <julian@codesourcery.com>

libgomp/
	* oacc-host.c (host_openacc_async_queue_callback): Invoke callback
	function immediately.
	* oacc-mem.c (goacc_enter_exit_data_internal): Call
	queue_async_prof_dispatch for asynchronous profile-event dispatches.
	* oacc-parallel.c (struct async_prof_callback_info,
	async_prof_dispatch, queue_async_prof_dispatch): New.
	(GOACC_parallel_keyed): Call queue_async_prof_dispatch for asynchronous
	profile-event dispatches.
	(GOACC_update): Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
	(cb_compute_construct_start): Remove/fix TODO.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c: Remove
	XFAIL.
	(cb_exit_data_start): Tweak expected state values.
	(cb_exit_data_end): Likewise.
	(cb_compute_construct_start): Remove/fix TODO.
	(cb_compute_construct_end): Don't do adjustments for
	acc_ev_enqueue_launch_start/acc_ev_enqueue_launch_end callbacks.
	(cb_compute_construct_end): Tweak expected state values.
	(cb_enqueue_launch_start, cb_enqueue_launch_end): Don't expect
	launch-enqueue operations to happen synchronously with respect to
	profiling events on async streams.
	(main): Tweak expected state values.
---
 libgomp/oacc-host.c                           |   5 +-
 libgomp/oacc-mem.c                            |  32 ++-
 libgomp/oacc-parallel.c                       | 190 ++++++++++++++----
 .../acc_prof-init-1.c                         |   5 +-
 .../acc_prof-parallel-1.c                     |  66 ++----
 5 files changed, 194 insertions(+), 104 deletions(-)

diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index f3bbd2b9c61..1cbff4caace 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -204,10 +204,9 @@ host_openacc_async_dev2host (int ord __attribute__ ((unused)),
 static void
 host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
 				   __attribute__ ((unused)),
-				   void (*callback_fn)(void *)
-				   __attribute__ ((unused)),
-				   void *userptr __attribute__ ((unused)))
+				   void (*callback_fn)(void *), void *userptr)
 {
+  callback_fn (userptr);
 }
 
 static struct goacc_asyncqueue *
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 5988db0b886..f0bd907cf07 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -1317,6 +1317,12 @@ goacc_exit_data_internal (struct gomp_device_descr *acc_dev, size_t mapnum,
   gomp_mutex_unlock (&acc_dev->lock);
 }
 
+struct async_prof_callback_info *
+queue_async_prof_dispatch (struct gomp_device_descr *devicep, goacc_aq aq,
+			   acc_prof_info *prof_info, acc_event_info *event_info,
+			   acc_api_info *api_info,
+			   struct async_prof_callback_info *prev_info);
+
 static void
 goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
 				size_t *sizes, unsigned short *kinds,
@@ -1327,6 +1333,7 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
 
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
+  struct async_prof_callback_info *data_start_info = NULL;
 
   goacc_lazy_initialize ();
 
@@ -1382,9 +1389,19 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
       api_info.async_handle = NULL;
     }
 
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
   if (profiling_p)
-    goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-			      &api_info);
+    {
+      if (aq)
+	data_start_info
+	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				       &enter_exit_data_event_info, &api_info,
+				       NULL);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
+    }
 
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
@@ -1398,8 +1415,6 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
   if (num_waits)
     goacc_wait (async, num_waits, ap);
 
-  goacc_aq aq = get_goacc_asyncqueue (async);
-
   if (data_enter)
     goacc_enter_data_internal (acc_dev, mapnum, hostaddrs, sizes, kinds, aq);
   else
@@ -1411,8 +1426,13 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
       prof_info.event_type
 	= data_enter ? acc_ev_enter_data_end : acc_ev_exit_data_end;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
-      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				&api_info);
+      if (aq)
+	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				   &enter_exit_data_event_info, &api_info,
+				   data_start_info);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 83625ba8a8e..3cc9f31d23b 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -93,6 +93,62 @@ handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
 }
 
 
+struct async_prof_callback_info {
+  acc_prof_info prof_info;
+  acc_event_info event_info;
+  acc_api_info api_info;
+  struct async_prof_callback_info *start_info;
+};
+
+static void
+async_prof_dispatch (void *ptr)
+{
+  struct async_prof_callback_info *info
+    = (struct async_prof_callback_info *) ptr;
+
+  if (info->start_info)
+    {
+      /* The TOOL_INFO must be preserved from a start event to the
+	 corresponding end event.  Copy that here.  */
+      void *tool_info = info->start_info->event_info.other_event.tool_info;
+      info->event_info.other_event.tool_info = tool_info;
+    }
+
+  goacc_profiling_dispatch (&info->prof_info, &info->event_info,
+			    &info->api_info);
+
+  /* The async_prof_dispatch function is (so far) always used for start/end
+     profiling event pairs: the start and end parts are queued, then each is
+     dispatched (or the dispatches might be interleaved before the end part is
+     queued).
+     In any case, it's not safe to delete either info structure before the
+     whole bracketed event is complete.  */
+
+  if (info->start_info)
+    {
+      free (info->start_info);
+      free (info);
+    }
+}
+
+struct async_prof_callback_info *
+queue_async_prof_dispatch (struct gomp_device_descr *devicep, goacc_aq aq,
+			   acc_prof_info *prof_info, acc_event_info *event_info,
+			   acc_api_info *api_info,
+			   struct async_prof_callback_info *prev_info)
+{
+  struct async_prof_callback_info *info = malloc (sizeof (*info));
+
+  info->prof_info = *prof_info;
+  info->event_info = *event_info;
+  info->api_info = *api_info;
+  info->start_info = prev_info;
+
+  devicep->openacc.async.queue_callback_func (aq, async_prof_dispatch,
+					      (void *) info);
+  return info;
+}
+
 /* Launch a possibly offloaded function with FLAGS.  FN is the host fn
    address.  MAPNUM, HOSTADDRS, SIZES & KINDS  describe the memory
    blocks to be copied to/from the device.  Varadic arguments are
@@ -117,6 +173,8 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
   int async = GOMP_ASYNC_SYNC;
   unsigned dims[GOMP_DIM_MAX];
   unsigned tag;
+  struct async_prof_callback_info *comp_start_info = NULL,
+				  *data_start_info = NULL;
 
 #ifdef HAVE_INTTYPES_H
   gomp_debug (0, "%s: mapnum=%"PRIu64", hostaddrs=%p, size=%p, kinds=%p\n",
@@ -178,28 +236,9 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       api_info.async_handle = NULL;
     }
 
-  if (profiling_p)
-    goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
-			      &api_info);
-
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
-  /* Host fallback if "if" clause is false or if the current device is set to
-     the host.  */
-  if (flags & GOACC_FLAG_HOST_FALLBACK)
-    {
-      prof_info.device_type = acc_device_host;
-      api_info.device_type = prof_info.device_type;
-      goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
-      goacc_restore_bind ();
-      goto out_prof;
-    }
-  else if (acc_device_type (acc_dev->type) == acc_device_host)
-    {
-      fn (hostaddrs);
-      goto out_prof;
-    }
+  goacc_aq aq = NULL;
 
   /* Default: let the runtime choose.  */
   for (i = 0; i != GOMP_DIM_MAX; i++)
@@ -233,11 +272,12 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 	    if (async == GOMP_LAUNCH_OP_MAX)
 	      async = va_arg (ap, unsigned);
 
-	    if (profiling_p)
-	      {
-		prof_info.async = async;
-		prof_info.async_queue = prof_info.async;
-	      }
+	    /* Set async number in profiling data, unless the device is the
+	       host or we're doing host fallback.  */
+	    if (profiling_p
+		&& !(flags & GOACC_FLAG_HOST_FALLBACK)
+		&& acc_device_type (acc_dev->type) != acc_device_host)
+	      prof_info.async = prof_info.async_queue = async;
 
 	    break;
 	  }
@@ -255,7 +295,38 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 	}
     }
   va_end (ap);
-  
+
+  aq = get_goacc_asyncqueue (async);
+
+  if (profiling_p)
+    {
+      if (aq)
+	comp_start_info
+	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				       &compute_construct_event_info,
+				       &api_info, NULL);
+      else
+	goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+				  &api_info);
+    }
+
+  /* Host fallback if "if" clause is false or if the current device is set to
+     the host.  */
+  if (flags & GOACC_FLAG_HOST_FALLBACK)
+    {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+      goacc_save_and_set_bind (acc_device_host);
+      fn (hostaddrs);
+      goacc_restore_bind ();
+      goto out_prof;
+    }
+  else if (acc_device_type (acc_dev->type) == acc_device_host)
+    {
+      fn (hostaddrs);
+      goto out_prof;
+    }
+
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -284,12 +355,16 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 	= compute_construct_event_info.other_event.parent_construct;
       enter_exit_data_event_info.other_event.implicit = 1;
       enter_exit_data_event_info.other_event.tool_info = NULL;
-      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				&api_info);
+      if (aq)
+	data_start_info
+	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				       &enter_exit_data_event_info, &api_info,
+				       NULL);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
     }
 
-  goacc_aq aq = get_goacc_asyncqueue (async);
-
   tgt = goacc_map_vars (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
 			true, 0);
   if (profiling_p)
@@ -297,8 +372,13 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_enter_data_end;
       enter_exit_data_event_info.other_event.event_type
 	= prof_info.event_type;
-      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				&api_info);
+      if (aq)
+	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				   &enter_exit_data_event_info, &api_info,
+				   data_start_info);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
     }
 
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
@@ -317,8 +397,14 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_exit_data_start;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
       enter_exit_data_event_info.other_event.tool_info = NULL;
-      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				&api_info);
+      if (aq)
+	data_start_info
+	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				       &enter_exit_data_event_info, &api_info,
+				       NULL);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
     }
 
   /* If running synchronously (aq == NULL), this will unmap immediately.  */
@@ -328,8 +414,13 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
     {
       prof_info.event_type = acc_ev_exit_data_end;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
-      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				&api_info);
+      if (aq)
+	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				   &enter_exit_data_event_info, &api_info,
+				   data_start_info);
+      else
+	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				  &api_info);
     }
 
  out_prof:
@@ -338,8 +429,13 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_compute_construct_end;
       compute_construct_event_info.other_event.event_type
 	= prof_info.event_type;
-      goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
-				&api_info);
+      if (aq)
+	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				   &compute_construct_event_info, &api_info,
+				   comp_start_info);
+      else
+	goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+				  &api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
@@ -565,6 +661,8 @@ GOACC_update (int flags_m, size_t mapnum,
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
+  goacc_aq aq = NULL;
+  struct async_prof_callback_info *update_start_info = NULL;
 
   bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
 
@@ -614,7 +712,15 @@ GOACC_update (int flags_m, size_t mapnum,
     }
 
   if (profiling_p)
-    goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+    {
+      aq = get_goacc_asyncqueue (async);
+      if (aq)
+	update_start_info
+	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
+				       &update_event_info, &api_info, NULL);
+      else
+	goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+    }
 
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
@@ -701,7 +807,11 @@ GOACC_update (int flags_m, size_t mapnum,
     {
       prof_info.event_type = acc_ev_update_end;
       update_event_info.other_event.event_type = prof_info.event_type;
-      goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+      if (aq)
+	queue_async_prof_dispatch (acc_dev, aq, &prof_info, &update_event_info,
+				   &api_info, update_start_info);
+      else
+	goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
index 7d05f482f46..72cf6305bcc 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
@@ -159,7 +159,10 @@ static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info
   assert (prof_info->device_type == acc_device_type);
   assert (prof_info->device_number == acc_device_num);
   assert (prof_info->thread_id == -1);
-  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
   assert (prof_info->async_queue == prof_info->async);
   assert (prof_info->src_file == NULL);
   assert (prof_info->func_name == NULL);
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
index dc1807c6ce4..9c8af743aba 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "Async profiling bug" { *-*-* } } */
-
 /* Test dispatch of events to callbacks.  */
 
 #undef NDEBUG
@@ -286,9 +284,9 @@ static void cb_exit_data_start (acc_prof_info *prof_info, acc_event_info *event_
 {
   DEBUG_printf ("%s\n", __FUNCTION__);
 
-  assert (state == 7
+  assert (state == 5
 #if ASYNC_EXIT_DATA
-	  || state == 107
+	  || state == 105
 #endif
 	  );
   STATE_OP (state, ++);
@@ -342,9 +340,9 @@ static void cb_exit_data_end (acc_prof_info *prof_info, acc_event_info *event_in
 {
   DEBUG_printf ("%s\n", __FUNCTION__);
 
-  assert (state == 8
+  assert (state == 6
 #if ASYNC_EXIT_DATA
-	  || state == 108
+	  || state == 106
 #endif
 	  );
   STATE_OP (state, ++);
@@ -428,7 +426,10 @@ static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info
   assert (prof_info->device_type == acc_device_type);
   assert (prof_info->device_number == acc_device_num);
   assert (prof_info->thread_id == -1);
-  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
   assert (prof_info->async_queue == prof_info->async);
   assert (prof_info->src_file == NULL);
   assert (prof_info->func_name == NULL);
@@ -469,9 +470,6 @@ static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *
     {
       /* Compensate for the missing 'acc_ev_enter_data_end'.  */
       state += 1;
-      /* Compensate for the missing 'acc_ev_enqueue_launch_start' and
-	 'acc_ev_enqueue_launch_end'.  */
-      state += 2;
       /* Compensate for the missing 'acc_ev_exit_data_start' and
 	 'acc_ev_exit_data_end'.  */
       state += 2;
@@ -484,8 +482,8 @@ static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *
       state += 2;
     }
 #endif
-  assert (state == 9
-	  || state == 109);
+  assert (state == 7
+	  || state == 107);
   STATE_OP (state, ++);
 
   assert (tool_info != NULL);
@@ -539,17 +537,6 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
 
   assert (acc_device_type != acc_device_host);
 
-  assert (state == 5
-	  || state == 105);
-  STATE_OP (state, ++);
-
-  assert (tool_info != NULL);
-  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
-  assert (tool_info->nested == NULL);
-  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
-  assert (tool_info->nested != NULL);
-  tool_info->nested->nested = NULL;
-
   assert (prof_info->event_type == acc_ev_enqueue_launch_start);
   assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
   assert (prof_info->version == _ACC_PROF_INFO_VERSION);
@@ -593,13 +580,6 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
   assert (api_info->device_handle == NULL);
   assert (api_info->context_handle == NULL);
   assert (api_info->async_handle == NULL);
-
-  tool_info->nested->event_info.launch_event.event_type = event_info->launch_event.event_type;
-  tool_info->nested->event_info.launch_event.kernel_name = strdup (event_info->launch_event.kernel_name);
-  tool_info->nested->event_info.launch_event.num_gangs = event_info->launch_event.num_gangs;
-  tool_info->nested->event_info.launch_event.num_workers = event_info->launch_event.num_workers;
-  tool_info->nested->event_info.launch_event.vector_length = event_info->launch_event.vector_length;
-  event_info->other_event.tool_info = tool_info->nested;
 }
 
 static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
@@ -608,19 +588,6 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
 
   assert (acc_device_type != acc_device_host);
 
-  assert (state == 6
-	  || state == 106);
-  STATE_OP (state, ++);
-
-  assert (tool_info != NULL);
-  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
-  assert (tool_info->nested != NULL);
-  assert (tool_info->nested->event_info.launch_event.event_type == acc_ev_enqueue_launch_start);
-  assert (tool_info->nested->event_info.launch_event.kernel_name != NULL);
-  assert (tool_info->nested->event_info.launch_event.num_gangs >= 1);
-  assert (tool_info->nested->event_info.launch_event.num_workers >= 1);
-  assert (tool_info->nested->event_info.launch_event.vector_length >= 1);
-
   assert (prof_info->event_type == acc_ev_enqueue_launch_end);
   assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
   assert (prof_info->version == _ACC_PROF_INFO_VERSION);
@@ -640,12 +607,7 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
   assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
   assert (event_info->launch_event.parent_construct == acc_construct_parallel);
   assert (event_info->launch_event.implicit == 1);
-  assert (event_info->launch_event.tool_info == tool_info->nested);
   assert (event_info->launch_event.kernel_name != NULL);
-  assert (strcmp (event_info->launch_event.kernel_name, tool_info->nested->event_info.launch_event.kernel_name) == 0);
-  assert (event_info->launch_event.num_gangs == tool_info->nested->event_info.launch_event.num_gangs);
-  assert (event_info->launch_event.num_workers == tool_info->nested->event_info.launch_event.num_workers);
-  assert (event_info->launch_event.vector_length == tool_info->nested->event_info.launch_event.vector_length);
 
   if (acc_device_type == acc_device_host)
     assert (api_info->device_api == acc_device_api_none);
@@ -659,10 +621,6 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
   assert (api_info->device_handle == NULL);
   assert (api_info->context_handle == NULL);
   assert (api_info->async_handle == NULL);
-
-  free ((void *) tool_info->nested->event_info.launch_event.kernel_name);
-  free (tool_info->nested);
-  tool_info->nested = NULL;
 }
 
 
@@ -711,7 +669,7 @@ int main()
     }
     assert (state_init == 4);
   }
-  assert (state == 10);
+  assert (state == 8);
 
   STATE_OP (state, = 100);
 
@@ -727,7 +685,7 @@ int main()
 #pragma acc wait
     assert (state_init == 104);
   }
-  assert (state == 110);
+  assert (state == 108);
 
   return 0;
 }
-- 
2.29.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases
  2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
@ 2021-06-29 23:52   ` Julian Brown
  0 siblings, 0 replies; 12+ messages in thread
From: Julian Brown @ 2021-06-29 23:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Thomas Schwinge

On Tue, 29 Jun 2021 16:42:02 -0700
Julian Brown <julian@codesourcery.com> wrote:

> Several OpenACC tests accidentally abuse async semantics, leading to
> race conditions & test failures.  This patch fixes those tests.
> 
> Tested with offloading to AMD GCN. I can probably self-approve this as
> a testcase change only, unless anyone objects.

Forgot to say: this was previously posted as part of the AMD GCN
worker-partitioning series here:

  https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566081.html

But I noticed that the worker-partitioning patches do not (now?) have to
be present for the tests in question to fail.

Thanks,

Julian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] openacc: Async fixes
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
                   ` (3 preceding siblings ...)
  2021-06-29 23:42 ` [PATCH 4/4] openacc: Profiling-interface fixes for asynchronous operations Julian Brown
@ 2021-06-30  8:28 ` Thomas Schwinge
  2021-06-30 10:40   ` Julian Brown
  2023-03-10 11:38   ` Thomas Schwinge
  4 siblings, 2 replies; 12+ messages in thread
From: Thomas Schwinge @ 2021-06-30  8:28 UTC (permalink / raw)
  To: Julian Brown; +Cc: gcc-patches, Jakub Jelinek, Chung-Lin Tang

Hi Julian!

On 2021-06-29T16:42:00-0700, Julian Brown <julian@codesourcery.com> wrote:
> This patch series contains fixes for various problems with async support
> for OpenACC at present:

Thanks, I shall be looking into these in detail, "soonish".

Some quick comments.


>  - Asynchonous host-to-device copies invoked from within libgomp
>    (target.c) could copy bad data to the target -- and the workaround
>    for that currently used in the AMD GCN target plugin could lead to
>    a different problem (a race condition).

As per discussion on Monday, I like to you moved (back?) the "ephemeral"
handling into 'libgomp/target.c:gomp_copy_host2dev'.


>  - The OpenACC profiling-interface implementation did not measure
>    asynchronous operations properly.

We'll need to be careful: (possibly, an older version of) that one we
internally had identified to be causing some issues; see the
"acc_prof-parallel-1.c intermittent failure on og10 branch" emails,
2020-07.


>  - Several test cases misuse OpenACC asynchronous support (more race
>    conditions).

Mostly ACK, but some more changes may be necessary; please see
<http://mid.mail-archive.com/87sg1s9s9l.fsf@euler.schwinge.homeip.net>
(you were CCed).

>  .../libgomp.oacc-c-c++-common/deep-copy-10.c  |  14 +-

Please provide some detail about that one ("Fix async behaviour").  It's
not obvious to me what's wrong with the current version (but I haven't
really spent time on that yet).


Grüße
 Thomas
-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] openacc: Async fixes
  2021-06-30  8:28 ` [PATCH 0/4] openacc: Async fixes Thomas Schwinge
@ 2021-06-30 10:40   ` Julian Brown
  2021-07-02 13:51     ` Julian Brown
  2023-03-10 11:38   ` Thomas Schwinge
  1 sibling, 1 reply; 12+ messages in thread
From: Julian Brown @ 2021-06-30 10:40 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-patches, Jakub Jelinek, Chung-Lin Tang

On Wed, 30 Jun 2021 10:28:00 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> >  - The OpenACC profiling-interface implementation did not measure
> >    asynchronous operations properly.  
> 
> We'll need to be careful: (possibly, an older version of) that one we
> internally had identified to be causing some issues; see the
> "acc_prof-parallel-1.c intermittent failure on og10 branch" emails,
> 2020-07.

Hmm, I'll check those.

> >  - Several test cases misuse OpenACC asynchronous support (more race
> >    conditions).  
> 
> Mostly ACK, but some more changes may be necessary; please see
> <http://mid.mail-archive.com/87sg1s9s9l.fsf@euler.schwinge.homeip.net>
> (you were CCed).

Thanks -- these test changes have been floating around uncommitted for
too long already, I guess...

> >  .../libgomp.oacc-c-c++-common/deep-copy-10.c  |  14 +-  
> 
> Please provide some detail about that one ("Fix async behaviour").
> It's not obvious to me what's wrong with the current version (but I
> haven't really spent time on that yet).

Aha, well I didn't see what was wrong with it either when I wrote the
test!

The problem is that we have copyin/modify-on-target/copyout operations
that process the *same data* from different async streams on successive
loop iterations. Those async streams are independent from one another,
so depending on how they are scheduled, we can be copying-in on one
async stream whilst simultaneously copying-out on another async stream
-- so of course, the data gets corrupted.

So the fix makes sure that each async stream only operates on "its own"
data. The increase in number of loop iterations was specifically to
tickle the flaw in the workaround used for GCN wrt. the ephemeral
copies -- i.e. snapshotting all host data immediately.

HTH,

Julian

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] openacc: Async fixes
  2021-06-30 10:40   ` Julian Brown
@ 2021-07-02 13:51     ` Julian Brown
  0 siblings, 0 replies; 12+ messages in thread
From: Julian Brown @ 2021-07-02 13:51 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: Jakub Jelinek, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1870 bytes --]

On Wed, 30 Jun 2021 11:40:33 +0100
Julian Brown <julian@codesourcery.com> wrote:

> On Wed, 30 Jun 2021 10:28:00 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
> 
> > >  - The OpenACC profiling-interface implementation did not measure
> > >    asynchronous operations properly.    
> > 
> > We'll need to be careful: (possibly, an older version of) that one
> > we internally had identified to be causing some issues; see the
> > "acc_prof-parallel-1.c intermittent failure on og10 branch" emails,
> > 2020-07.  
> 
> Hmm, I'll check those.

The problem here is that the async callbacks now execute in a different
thread to the main program, so the direct sharing of the 'state'
variable isn't safe. (I verified that by observing the
result of "pthread_self ()" calls from the main thread and from the
callback.)

The attached patch appears to make the test run reliably on mainline
(which still exhibits the failure with the parent patch series, very
intermittently). A better solution might be to use the memory-model
builtins for all 'state' variable accesses though.

I think the async profiling callbacks *have to* run in a different
thread to the main program, which would make this a testcase bug (the
spec doesn't explicitly say this as of 3.0 though). However there might
be an argument for making "acc_wait" and friends thread barriers with
respect to the host (i.e. calling __atomic_thread_fence in the
appropriate place in libgomp) -- otherwise you have "break out of the
abstraction" provided by OpenACC and rely on a non-OpenACC API in order
to observe any results measured in the async profiling callbacks. OTOH
the memory-model stuff is part of C now, so maybe that's fine (and also,
I'm doubtful that just adding the barrier and using regular global
variable accesses is sufficient to ensure thread safety anyway).

Thoughts?

Thanks,

Julian

[-- Attachment #2: acc_prof-parallel-barrier-1.diff --]
[-- Type: text/x-patch, Size: 679 bytes --]

commit a24d5c521b66ae88d0ddd05ce7fe247c94802595
Author: Julian Brown <julian@codesourcery.com>
Date:   Fri Jul 2 03:42:41 2021 -0700

    Add barrier, hack

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
index 9c8af743aba..2bba7bd2e55 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -683,6 +683,7 @@ int main()
       state_init = state;
     }
 #pragma acc wait
+   __atomic_thread_fence (__ATOMIC_SEQ_CST);
     assert (state_init == 104);
   }
   assert (state == 108);

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime
  2021-06-29 23:42 ` [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime Julian Brown
@ 2021-07-27 10:01   ` Thomas Schwinge
  2023-03-10 15:22     ` Allow libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral' data (was: [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime) Thomas Schwinge
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Schwinge @ 2021-07-27 10:01 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Chung-Lin Tang, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 5130 bytes --]

Hi!

On 2021-06-29T16:42:03-0700, Julian Brown <julian@codesourcery.com> wrote:
> This patch fixes several places in libgomp/target.c where "ephemeral" data
> (on the stack or in temporary heap locations) may be used as the source of
> an asynchronous host-to-device copy that may not complete before the host
> data disappears.  Versions of the patch have been posted several times
> before, but this one (at Chung-Lin Tang's prior suggesion, IIRC) moves
> all logic into target.c rather than pushing it out to each target plugin.

Thanks for the re-work!

> An existing, but flawed, workaround for this problem in the AMD GCN
> libgomp offloading plugin is currently present on mainline, and was
> posted for the og9 branch here:
>
>   https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg00901.html
>
> and previous versions of this patch were posted here (for mainline/og9):
>
>   https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg01482.html
>   https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01026.html

... but this version here I like best!  ;-)


> This patch exposes a problem with OpenACC profiling support that is
> fixed by the next patch in the series. The acc_prof-parallel-1.c test
> is XFAILed for now.

(Ought to XFAIL 'libgomp.oacc-c-c++-common/acc_prof-parallel-1.c' for GCN
only.  Also, 'libgomp.oacc-c-c++-common/acc_prof-init-1.c' did similarly
FAIL for GCN, intermittently.)

But, actually no XFAILing is necessary, given my recent
commit 29ddaf43f70e19fd1110b539e8b3d0436c757e34 "[OpenACC]
Clarify sequencing of 'async' data copying vs. profiling events
in 'libgomp.oacc-c-c++-common/acc_prof-{init,parallel}-1.c'".


A few more comments:

> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c

> @@ -2934,7 +2931,7 @@ gomp_offload_free (void *ptr)
>
>  static void
>  queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src,
> -              size_t len, bool free_src)
> +              size_t len)

Also have to update function comment.

> @@ -3916,15 +3912,7 @@ GOMP_OFFLOAD_openacc_async_host2dev (int device, void *dst, const void *src,
>  {
>    struct agent_info *agent = get_agent_info (device);
>    assert (agent == aq->agent);
> -  /* The source data does not necessarily remain live until the deferred
> -     copy happens.  Taking a snapshot of the data here avoids reading
> -     uninitialised data later, but means that (a) data is copied twice and
> -     (b) modifications to the copied data between the "spawning" point of
> -     the asynchronous kernel and when it is executed will not be seen.
> -     But, that is probably correct.  */
> -  void *src_copy = GOMP_PLUGIN_malloc (n);
> -  memcpy (src_copy, src, n);
> -  queue_push_copy (aq, dst, src_copy, n, true);
> +  queue_push_copy (aq, dst, src, n);
>    return true;
>  }

:-)

> --- a/libgomp/target.c
> +++ b/libgomp/target.c

> -static inline void
> -goacc_device_copy_async (struct gomp_device_descr *devicep,
> -                      bool (*copy_func) (int, void *, const void *, size_t,
> -                                         struct goacc_asyncqueue *),
> -                      const char *dst, void *dstaddr,
> -                      const char *src, const void *srcaddr,
> -                      size_t size, struct goacc_asyncqueue *aq)
> -{
> -  if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
> -    {
> -      gomp_mutex_unlock (&devicep->lock);
> -      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
> -               src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
> -    }
> -}

For symmetry with 'gomp_device_copy', I prefer to keep (and thus
restored) 'goacc_device_copy_async', adding a 'srcaddr_orig' parameter
for error reporting purposes.

Pushed 'Fix OpenACC "ephemeral" asynchronous host-to-device copies' to
master branch in commit 9c41f5b9cddd93f1b56eb71bff87b255d37d16f4, see
attached.

Removes GCN XFAIL 'libgomp.oacc-c-c++-common/async-data-1-1.c'.


> +/* Copy host memory to an offload device.  In asynchronous mode (if AQ is
> +   non-NULL), when the source data is stack or may otherwise be deallocated
> +   before the asynchronous copy takes place, EPHEMERAL must be passed as
> +   TRUE.  The CBUF isn't used for non-ephemeral asynchronous copies, because
> +   the host data might not be computed yet (by an earlier asynchronous compute
> +   region).  */
> +
>  [gomp_copy_host2dev]

Code changes related to the latter sentence have moved into a separate
"Don't use libgomp 'cbuf' buffering with OpenACC 'async'", pushed to
master branch in commit d88a6951586c7229b25708f4486eaaf4bf4b5bbe, see
attached.

Removes GCN XFAIL 'libgomp.oacc-c-c++-common/async-data-1-2.c'.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Fix-OpenACC-ephemeral-asynchronous-host-to-device-co.patch --]
[-- Type: text/x-diff, Size: 15433 bytes --]

From 9c41f5b9cddd93f1b56eb71bff87b255d37d16f4 Mon Sep 17 00:00:00 2001
From: Julian Brown <julian@codesourcery.com>
Date: Tue, 29 Jun 2021 16:42:03 -0700
Subject: [PATCH 1/2] Fix OpenACC "ephemeral" asynchronous host-to-device
 copies

This patch fixes several places in libgomp/target.c where "ephemeral" data
(on the stack or in temporary heap locations) may be used as the source of
an asynchronous host-to-device copy that may not complete before the host
data disappears.

An existing, but flawed, workaround for this problem in the AMD GCN
libgomp offloading plugin is currently present on mainline, and was
posted for the og9 branch here:

  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg00901.html

and previous versions of this patch were posted here (for mainline/og9):

  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg01482.html
  https://gcc.gnu.org/legacy-ml/gcc-patches/2019-09/msg01026.html

libgomp/
	* libgomp.h (gomp_copy_host2dev): Update prototype.
	* oacc-mem.c (memcpy_tofrom_device, update_dev_host): Add new
	argument to gomp_copy_host2dev (false).
	* plugin/plugin-gcn.c (struct copy_data): Remove free_src field.
	(copy_data): Don't free src.
	(queue_push_copy): Remove free_src handling.
	(GOMP_OFFLOAD_dev2dev): Update call to queue_push_copy.
	(GOMP_OFFLOAD_openacc_async_host2dev): Remove source-data
	snapshotting.
	(GOMP_OFFLOAD_openacc_async_dev2host): Update call to
	queue_push_copy.
	* target.c (goacc_device_copy_async): Add SRCADDR_ORIG parameter.
	(gomp_copy_host2dev): Add EPHEMERAL parameter.  Snapshot source
	data when true, and set up deferred freeing of temporary buffer.
	(gomp_copy_dev2host): Update call to goacc_device_copy_async.
	(gomp_map_vars_existing, gomp_map_pointer, gomp_attach_pointer)
	(gomp_detach_pointer, gomp_map_vars_internal, gomp_update): Update
	calls to gomp_copy_host2dev with appropriate ephemeral argument.
	* testsuite/libgomp.oacc-c-c++-common/async-data-1-1.c: Remove
	XFAIL.

Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
---
 libgomp/libgomp.h                             |  2 +-
 libgomp/oacc-mem.c                            |  4 +-
 libgomp/plugin/plugin-gcn.c                   | 23 ++----
 libgomp/target.c                              | 77 ++++++++++++++-----
 .../async-data-1-1.c                          |  2 -
 5 files changed, 64 insertions(+), 44 deletions(-)

diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 8d25dc8e2a8..e8901da1069 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1226,7 +1226,7 @@ extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 struct gomp_coalesce_buf;
 extern void gomp_copy_host2dev (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
-				size_t, struct gomp_coalesce_buf *);
+				size_t, bool, struct gomp_coalesce_buf *);
 extern void gomp_copy_dev2host (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
 				size_t);
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index c21508f3739..5988db0b886 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -202,7 +202,7 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
   if (from)
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
-    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
 
   if (profiling_p)
     {
@@ -874,7 +874,7 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
   goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
   else
     gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index cfed42a2d4d..2548614a2e5 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -292,7 +292,6 @@ struct copy_data
   void *dst;
   const void *src;
   size_t len;
-  bool free_src;
   struct goacc_asyncqueue *aq;
 };
 
@@ -2914,8 +2913,6 @@ copy_data (void *data_)
 	     data->aq->agent->device_id, data->aq->id, data->len, data->src,
 	     data->dst);
   hsa_memory_copy_wrapper (data->dst, data->src, data->len);
-  if (data->free_src)
-    free ((void *) data->src);
   free (data);
 }
 
@@ -2929,12 +2926,11 @@ gomp_offload_free (void *ptr)
 }
 
 /* Request an asynchronous data copy, to or from a device, on a given queue.
-   The event will be registered as a callback.  If FREE_SRC is true
-   then the source data will be freed following the copy.  */
+   The event will be registered as a callback.  */
 
 static void
 queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src,
-		 size_t len, bool free_src)
+		 size_t len)
 {
   if (DEBUG_QUEUES)
     GCN_DEBUG ("queue_push_copy %d:%d: %zu bytes from (%p) to (%p)\n",
@@ -2944,7 +2940,6 @@ queue_push_copy (struct goacc_asyncqueue *aq, void *dst, const void *src,
   data->dst = dst;
   data->src = src;
   data->len = len;
-  data->free_src = free_src;
   data->aq = aq;
   queue_push_callback (aq, copy_data, data);
 }
@@ -3646,7 +3641,7 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
     {
       struct agent_info *agent = get_agent_info (device);
       maybe_init_omp_async (agent);
-      queue_push_copy (agent->omp_async_queue, dst, src, n, false);
+      queue_push_copy (agent->omp_async_queue, dst, src, n);
       return true;
     }
 
@@ -3916,15 +3911,7 @@ GOMP_OFFLOAD_openacc_async_host2dev (int device, void *dst, const void *src,
 {
   struct agent_info *agent = get_agent_info (device);
   assert (agent == aq->agent);
-  /* The source data does not necessarily remain live until the deferred
-     copy happens.  Taking a snapshot of the data here avoids reading
-     uninitialised data later, but means that (a) data is copied twice and
-     (b) modifications to the copied data between the "spawning" point of
-     the asynchronous kernel and when it is executed will not be seen.
-     But, that is probably correct.  */
-  void *src_copy = GOMP_PLUGIN_malloc (n);
-  memcpy (src_copy, src, n);
-  queue_push_copy (aq, dst, src_copy, n, true);
+  queue_push_copy (aq, dst, src, n);
   return true;
 }
 
@@ -3936,7 +3923,7 @@ GOMP_OFFLOAD_openacc_async_dev2host (int device, void *dst, const void *src,
 {
   struct agent_info *agent = get_agent_info (device);
   assert (agent == aq->agent);
-  queue_push_copy (aq, dst, src, n, false);
+  queue_push_copy (aq, dst, src, n);
   return true;
 }
 
diff --git a/libgomp/target.c b/libgomp/target.c
index bb09d501dd6..5576e57f822 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -214,13 +214,24 @@ goacc_device_copy_async (struct gomp_device_descr *devicep,
 					    struct goacc_asyncqueue *),
 			 const char *dst, void *dstaddr,
 			 const char *src, const void *srcaddr,
+			 const void *srcaddr_orig,
 			 size_t size, struct goacc_asyncqueue *aq)
 {
   if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
     {
       gomp_mutex_unlock (&devicep->lock);
-      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
-		  src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
+      if (srcaddr_orig && srcaddr_orig != srcaddr)
+	gomp_fatal ("Copying of %s object [%p..%p)"
+		    " via buffer %s object [%p..%p)"
+		    " to %s object [%p..%p) failed",
+		    src, srcaddr_orig, srcaddr_orig + size,
+		    src, srcaddr, srcaddr + size,
+		    dst, dstaddr, dstaddr + size);
+      else
+	gomp_fatal ("Copying of %s object [%p..%p)"
+		    " to %s object [%p..%p) failed",
+		    src, srcaddr, srcaddr + size,
+		    dst, dstaddr, dstaddr + size);
     }
 }
 
@@ -317,11 +328,16 @@ gomp_to_device_kind_p (int kind)
     }
 }
 
+/* Copy host memory to an offload device.  In asynchronous mode (if AQ is
+   non-NULL), when the source data is stack or may otherwise be deallocated
+   before the asynchronous copy takes place, EPHEMERAL must be passed as
+   TRUE.  */
+
 attribute_hidden void
 gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		    struct goacc_asyncqueue *aq,
 		    void *d, const void *h, size_t sz,
-		    struct gomp_coalesce_buf *cbuf)
+		    bool ephemeral, struct gomp_coalesce_buf *cbuf)
 {
   if (cbuf)
     {
@@ -349,8 +365,23 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	}
     }
   if (__builtin_expect (aq != NULL, 0))
-    goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
-			     "dev", d, "host", h, sz, aq);
+    {
+      void *h_buf = (void *) h;
+      if (ephemeral)
+	{
+	  /* We're queueing up an asynchronous copy from data that may
+	     disappear before the transfer takes place (i.e. because it is a
+	     stack local in a function that is no longer executing).  Make a
+	     copy of the data into a temporary buffer in those cases.  */
+	  h_buf = gomp_malloc (sz);
+	  memcpy (h_buf, h, sz);
+	}
+      goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+			       "dev", d, "host", h_buf, h, sz, aq);
+      if (ephemeral)
+	/* Free temporary buffer once the transfer has completed.  */
+	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
+    }
   else
     gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
@@ -362,7 +393,7 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep,
 {
   if (__builtin_expect (aq != NULL, 0))
     goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
-			     "host", h, "dev", d, sz, aq);
+			     "host", h, "dev", d, NULL, sz, aq);
   else
     gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
 }
@@ -521,7 +552,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep,
 			(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
 				  + newn->host_start - oldn->host_start),
 			(void *) newn->host_start,
-			newn->host_end - newn->host_start, cbuf);
+			newn->host_end - newn->host_start, false, cbuf);
 
   gomp_increment_refcount (oldn, refcount_set);
 }
@@ -548,8 +579,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
       cur_node.tgt_offset = (uintptr_t) NULL;
       gomp_copy_host2dev (devicep, aq,
 			  (void *) (tgt->tgt_start + target_offset),
-			  (void *) &cur_node.tgt_offset,
-			  sizeof (void *), cbuf);
+			  (void *) &cur_node.tgt_offset, sizeof (void *),
+			  true, cbuf);
       return;
     }
   /* Add bias to the pointer value.  */
@@ -569,7 +600,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
      to initialize the pointer with.  */
   cur_node.tgt_offset -= bias;
   gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
-		      (void *) &cur_node.tgt_offset, sizeof (void *), cbuf);
+		      (void *) &cur_node.tgt_offset, sizeof (void *),
+		      true, cbuf);
 }
 
 static void
@@ -702,7 +734,7 @@ gomp_attach_pointer (struct gomp_device_descr *devicep,
 		  (void *) (n->tgt->tgt_start + n->tgt_offset), (void *) data);
 
       gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &data,
-			  sizeof (void *), cbufp);
+			  sizeof (void *), true, cbufp);
     }
   else
     gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__,
@@ -755,7 +787,7 @@ gomp_detach_pointer (struct gomp_device_descr *devicep,
 		  (void *) target);
 
       gomp_copy_host2dev (devicep, aq, (void *) devptr, (void *) &target,
-			  sizeof (void *), cbufp);
+			  sizeof (void *), true, cbufp);
     }
   else
     gomp_debug (1, "%s: attach count for %p -> %u\n", __FUNCTION__,
@@ -1218,7 +1250,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		len = sizes[i];
 		gomp_copy_host2dev (devicep, aq,
 				    (void *) (tgt->tgt_start + tgt_size),
-				    (void *) hostaddrs[i], len, cbufp);
+				    (void *) hostaddrs[i], len, false, cbufp);
 		tgt_size += len;
 		continue;
 	      case GOMP_MAP_FIRSTPRIVATE_INT:
@@ -1312,7 +1344,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					      + cur_node.host_start
 					      - n->host_start),
 				    (void *) &cur_node.tgt_offset,
-				    sizeof (void *), cbufp);
+				    sizeof (void *), true, cbufp);
 		cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
 				      + cur_node.host_start - n->host_start;
 		continue;
@@ -1450,7 +1482,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start,
+					false, cbufp);
 		    break;
 		  case GOMP_MAP_POINTER:
 		    gomp_map_pointer (tgt, aq,
@@ -1462,7 +1495,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start,
+					false, cbufp);
 		    tgt->list[i].has_null_ptr_assoc = false;
 
 		    for (j = i + 1; j < mapnum; j++)
@@ -1525,7 +1559,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					sizeof (void *), cbufp);
+					sizeof (void *), false, cbufp);
 		    break;
 		  default:
 		    gomp_mutex_unlock (&devicep->lock);
@@ -1541,7 +1575,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		    /* We intentionally do not use coalescing here, as it's not
 		       data allocated by the current call to this function.  */
 		    gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
-					&tgt_addr, sizeof (void *), NULL);
+					&tgt_addr, sizeof (void *), true, NULL);
 		  }
 		array++;
 	      }
@@ -1556,7 +1590,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  gomp_copy_host2dev (devicep, aq,
 			      (void *) (tgt->tgt_start + i * sizeof (void *)),
 			      (void *) &cur_node.tgt_offset, sizeof (void *),
-			      cbufp);
+			      true, cbufp);
 	}
     }
 
@@ -1568,7 +1602,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 			    (void *) (tgt->tgt_start + cbuf.chunks[c].start),
 			    (char *) cbuf.buf + (cbuf.chunks[c].start
 						 - cbuf.chunks[0].start),
-			    cbuf.chunks[c].end - cbuf.chunks[c].start, NULL);
+			    cbuf.chunks[c].end - cbuf.chunks[c].start,
+			    true, NULL);
       free (cbuf.buf);
       cbuf.buf = NULL;
       cbufp = NULL;
@@ -1892,7 +1927,7 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
 
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
 	      gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size,
-				  NULL);
+				  false, NULL);
 	    if (GOMP_MAP_COPY_FROM_P (kind & typemask))
 	      gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
 	  }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-1.c
index cd87aec56ff..9f2bed8aca8 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-1.c
@@ -3,8 +3,6 @@
    Due to one data mapping, this isn't using the libgomp 'cbuf' buffering.
 */
 
-/* { dg-xfail-run-if "TODO" { openacc_radeon_accel_selected } } */
-
 
 #include <stdlib.h>
 
-- 
2.30.2


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-Don-t-use-libgomp-cbuf-buffering-with-OpenACC-async.patch --]
[-- Type: text/x-diff, Size: 6191 bytes --]

From d88a6951586c7229b25708f4486eaaf4bf4b5bbe Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 23 Jul 2021 22:01:32 +0200
Subject: [PATCH 2/2] Don't use libgomp 'cbuf' buffering with OpenACC 'async'

The host data might not be computed yet (by an earlier asynchronous compute
region, for example.

	libgomp/
	* target.c (gomp_coalesce_buf_add): Update comment.
	(gomp_copy_host2dev, gomp_map_vars_internal): Don't expect to see
	'aq && cbuf'.
	(gomp_map_vars_internal): Only 'if (!aq)', do
	'gomp_coalesce_buf_add'.
	* testsuite/libgomp.oacc-c-c++-common/async-data-1-2.c: Remove
	XFAIL.

Co-Authored-By: Julian Brown <julian@codesourcery.com>
---
 libgomp/target.c                              | 71 ++++++++++++-------
 .../async-data-1-2.c                          |  5 +-
 2 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/libgomp/target.c b/libgomp/target.c
index 5576e57f822..453b3210e40 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -275,7 +275,14 @@ struct gomp_coalesce_buf
    host to device (e.g. map(alloc:), map(from:) etc.).  */
 #define MAX_COALESCE_BUF_GAP	(4 * 1024)
 
-/* Add region with device tgt_start relative offset and length to CBUF.  */
+/* Add region with device tgt_start relative offset and length to CBUF.
+
+   This must not be used for asynchronous copies, because the host data might
+   not be computed yet (by an earlier asynchronous compute region, for
+   example).
+   TODO ... but we could allow CBUF usage for EPHEMERAL data?  (Open question:
+   is it more performant to use libgomp CBUF buffering or individual device
+   asyncronous copying?)  */
 
 static inline void
 gomp_coalesce_buf_add (struct gomp_coalesce_buf *cbuf, size_t start, size_t len)
@@ -339,6 +346,30 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		    void *d, const void *h, size_t sz,
 		    bool ephemeral, struct gomp_coalesce_buf *cbuf)
 {
+  if (__builtin_expect (aq != NULL, 0))
+    {
+      /* See 'gomp_coalesce_buf_add'.  */
+      assert (!cbuf);
+
+      void *h_buf = (void *) h;
+      if (ephemeral)
+	{
+	  /* We're queueing up an asynchronous copy from data that may
+	     disappear before the transfer takes place (i.e. because it is a
+	     stack local in a function that is no longer executing).  Make a
+	     copy of the data into a temporary buffer in those cases.  */
+	  h_buf = gomp_malloc (sz);
+	  memcpy (h_buf, h, sz);
+	}
+      goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+			       "dev", d, "host", h_buf, h, sz, aq);
+      if (ephemeral)
+	/* Free temporary buffer once the transfer has completed.  */
+	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
+
+      return;
+    }
+
   if (cbuf)
     {
       uintptr_t doff = (uintptr_t) d - cbuf->tgt->tgt_start;
@@ -364,26 +395,8 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	    }
 	}
     }
-  if (__builtin_expect (aq != NULL, 0))
-    {
-      void *h_buf = (void *) h;
-      if (ephemeral)
-	{
-	  /* We're queueing up an asynchronous copy from data that may
-	     disappear before the transfer takes place (i.e. because it is a
-	     stack local in a function that is no longer executing).  Make a
-	     copy of the data into a temporary buffer in those cases.  */
-	  h_buf = gomp_malloc (sz);
-	  memcpy (h_buf, h, sz);
-	}
-      goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
-			       "dev", d, "host", h_buf, h, sz, aq);
-      if (ephemeral)
-	/* Free temporary buffer once the transfer has completed.  */
-	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
-    }
-  else
-    gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+
+  gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
 
 attribute_hidden void
@@ -959,8 +972,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	      for (i = first; i <= last; i++)
 		{
 		  tgt->list[i].key = NULL;
-		  if (gomp_to_device_kind_p (get_kind (short_mapkind, kinds, i)
-					     & typemask))
+		  if (!aq
+		      && gomp_to_device_kind_p (get_kind (short_mapkind, kinds, i)
+						& typemask))
 		    gomp_coalesce_buf_add (&cbuf,
 					   tgt_size - cur_node.host_end
 					   + (uintptr_t) hostaddrs[i],
@@ -1001,8 +1015,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  gomp_coalesce_buf_add (&cbuf, tgt_size,
-				 cur_node.host_end - cur_node.host_start);
+	  if (!aq)
+	    gomp_coalesce_buf_add (&cbuf, tgt_size,
+				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
 	  has_firstprivate = true;
 	  continue;
@@ -1095,7 +1110,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  if (gomp_to_device_kind_p (kind & typemask))
+	  if (!aq
+	      && gomp_to_device_kind_p (kind & typemask))
 	    gomp_coalesce_buf_add (&cbuf, tgt_size,
 				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
@@ -1596,6 +1612,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 
   if (cbufp)
     {
+      /* See 'gomp_coalesce_buf_add'.  */
+      assert (!aq);
+
       long c = 0;
       for (c = 0; c < cbuf.chunk_cnt; ++c)
 	gomp_copy_host2dev (devicep, aq,
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-2.c
index 385237698e2..3299499312f 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/async-data-1-2.c
@@ -1,10 +1,9 @@
 /* Verify back to back 'async' operations, two data mappings.
 
-   Due to two data mappings, this is using the libgomp 'cbuf' buffering.
+   Make sure that despite two data mappings, this isn't using the libgomp
+   'cbuf' buffering.
 */
 
-/* { dg-xfail-run-if "TODO" { openacc_radeon_accel_selected } } */
-
 
 #include <stdlib.h>
 
-- 
2.30.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/4] openacc: Async fixes
  2021-06-30  8:28 ` [PATCH 0/4] openacc: Async fixes Thomas Schwinge
  2021-06-30 10:40   ` Julian Brown
@ 2023-03-10 11:38   ` Thomas Schwinge
  1 sibling, 0 replies; 12+ messages in thread
From: Thomas Schwinge @ 2023-03-10 11:38 UTC (permalink / raw)
  To: Julian Brown, gcc-patches, Kwok Cheung Yeung

[-- Attachment #1: Type: text/plain, Size: 1873 bytes --]

Hi!

On 2021-06-30T10:28:00+0200, I wrote:
> On 2021-06-29T16:42:00-0700, Julian Brown <julian@codesourcery.com> wrote:
>>  - The OpenACC profiling-interface implementation did not measure
>>    asynchronous operations properly.
>
> We'll need to be careful: (possibly, an older version of) that one we
> internally had identified to be causing some issues; see the
> "acc_prof-parallel-1.c intermittent failure on og10 branch" emails,
> 2020-07.

That's still unresolved (not blaming you!); those intermittent failures
are still seen.  I've not yet been able to look into your follow-on
discussion and WIP patch 'acc_prof-parallel-barrier-1.diff'
"Add barrier, hack" in detail.


As part of the og12 branch setup, Kwok then had to put
og12 commit b845d2f62e7da1c4cfdfee99690de94b648d076d
"Revert changes to acc_prof-init-1.c and acc_prof-parallel-1.c" on top of
your og12 commit 719f93c8618a134f90b5b661ab70c918d659ad05
"OpenACC profiling-interface fixes for asynchronous operations", and that
stuff is now again conflicting with GCC master branch work that I need to
cherry-pick into og12 branch.

Therefore, I'm now reverting this on og12 branch -- with the intention to
resolve that issue on master branch, eventually (but no promises, when).
Pushed to devel/omp/gcc-12 branch
commit 1818bab2ce9f11d8dde5b378f580971b87a5c4ff
'Revert "Revert changes to acc_prof-init-1.c and acc_prof-parallel-1.c"', and
commit b8beaa8447ed3c1637e8f93a08c0e47b5709290f
'Revert "OpenACC profiling-interface fixes for asynchronous operations"',
see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Revert-Revert-changes-to-acc_prof-init-1.c-and-acc_p.patch --]
[-- Type: text/x-diff, Size: 4595 bytes --]

From 1818bab2ce9f11d8dde5b378f580971b87a5c4ff Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 2 Mar 2023 11:24:28 +0100
Subject: [PATCH 1/2] Revert "Revert changes to acc_prof-init-1.c and
 acc_prof-parallel-1.c"

... as a prerequisite for reverting
"OpenACC profiling-interface fixes for asynchronous operations".

This reverts og12 commit b845d2f62e7da1c4cfdfee99690de94b648d076d.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Revert
	"Revert changes to acc_prof-init-1.c and acc_prof-parallel-1.c"
	changes.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
	Likewise.
---
 libgomp/ChangeLog.omp                         |  8 ++++++++
 .../acc_prof-init-1.c                         | 17 ++++++++++++++++
 .../acc_prof-parallel-1.c                     | 20 +++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index 3ed90bb38f2..d55b0503920 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,3 +1,11 @@
+2023-03-10  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Revert
+	"Revert changes to acc_prof-init-1.c and acc_prof-parallel-1.c"
+	changes.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
+	Likewise.
+
 2023-03-01  Tobias Burnus  <tobias@codesourcery.com>
 
 	Backported from master:
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
index 6bbe99df1ff..a33fac7556c 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
@@ -208,6 +208,21 @@ static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *
 
   assert (state == 11
 	  || state == 111);
+#if defined COPYIN
+  /* In an 'async' setting, this event may be triggered before actual 'async'
+     data copying has completed.  Given that 'state' appears in 'COPYIN', we
+     first have to synchronize (that is, let the 'async' 'COPYIN' read the
+     current 'state' value)...  */
+  if (acc_async != acc_async_sync)
+    {
+      /* "We're not yet accounting for the fact that _OpenACC events may occur
+	 during event processing_"; temporarily disable to avoid deadlock.  */
+      unreg (acc_ev_none, NULL, acc_toggle_per_thread);
+      acc_wait (acc_async);
+      reg (acc_ev_none, NULL, acc_toggle_per_thread);
+    }
+  /* ... before modifying it in the following.  */
+#endif
   STATE_OP (state, ++);
 
   assert (tool_info != NULL);
@@ -280,6 +295,7 @@ int main()
     {
       state_init = state;
     }
+    acc_async = acc_async_sync;
 #pragma acc wait
     assert (state_init == 11);
   }
@@ -306,6 +322,7 @@ int main()
     {
       state_init = state;
     }
+    acc_async = acc_async_sync;
 #pragma acc wait
     assert (state_init == 111);
   }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
index 9a542b56fe5..663f7f724d5 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -248,6 +248,25 @@ static void cb_enter_data_end (acc_prof_info *prof_info, acc_event_info *event_i
 
   assert (state == 4
 	  || state == 104);
+#if defined COPYIN
+  /* Conceptually, 'acc_ev_enter_data_end' marks the end of data copying,
+     before 'acc_ev_enqueue_launch_start' marks invoking the compute region.
+     That's the 'state_init = state;' intended to be captured in the compute
+     regions.  */
+  /* In an 'async' setting, this event may be triggered before actual 'async'
+     data copying has completed.  Given that 'state' appears in 'COPYIN', we
+     first have to synchronize (that is, let the 'async' 'COPYIN' read the
+     current 'state' value)...  */
+  if (acc_async != acc_async_sync)
+    {
+      /* "We're not yet accounting for the fact that _OpenACC events may occur
+	 during event processing_"; temporarily disable to avoid deadlock.  */
+      unreg (acc_ev_none, NULL, acc_toggle_per_thread);
+      acc_wait (acc_async);
+      reg (acc_ev_none, NULL, acc_toggle_per_thread);
+    }
+  /* ... before modifying it in the following.  */
+#endif
   STATE_OP (state, ++);
 
   assert (tool_info != NULL);
@@ -679,6 +698,7 @@ int main()
 
       state_init = state;
     }
+    acc_async = acc_async_sync;
 #pragma acc wait
     assert (state_init == 104);
   }
-- 
2.25.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-Revert-OpenACC-profiling-interface-fixes-for-asynchr.patch --]
[-- Type: text/x-diff, Size: 22464 bytes --]

From b8beaa8447ed3c1637e8f93a08c0e47b5709290f Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 2 Mar 2023 11:28:24 +0100
Subject: [PATCH 2/2] Revert "OpenACC profiling-interface fixes for
 asynchronous operations"

There is occasional execution failure; these changes need to be reviewed.

This reverts og12 commit 719f93c8618a134f90b5b661ab70c918d659ad05.

	libgomp/
	* oacc-host.c: Revert
	"OpenACC profiling-interface fixes for asynchronous operations"
	changes.
	* oacc-mem.c: Likewise.
	* oacc-parallel.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
	Likewise.
---
 libgomp/ChangeLog.omp                         |   9 +
 libgomp/oacc-host.c                           |   5 +-
 libgomp/oacc-mem.c                            |  32 +--
 libgomp/oacc-parallel.c                       | 192 ++++--------------
 .../acc_prof-init-1.c                         |   5 +-
 .../acc_prof-parallel-1.c                     |  64 ++++--
 6 files changed, 113 insertions(+), 194 deletions(-)

diff --git a/libgomp/ChangeLog.omp b/libgomp/ChangeLog.omp
index d55b0503920..0e984754bb0 100644
--- a/libgomp/ChangeLog.omp
+++ b/libgomp/ChangeLog.omp
@@ -1,5 +1,14 @@
 2023-03-10  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* oacc-host.c: Revert
+	"OpenACC profiling-interface fixes for asynchronous operations"
+	changes.
+	* oacc-mem.c: Likewise.
+	* oacc-parallel.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
+	Likewise.
+
 	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Revert
 	"Revert changes to acc_prof-init-1.c and acc_prof-parallel-1.c"
 	changes.
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index afc99c5a374..94792abe5ce 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -215,9 +215,10 @@ host_openacc_async_dev2host (int ord __attribute__ ((unused)),
 static void
 host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
 				   __attribute__ ((unused)),
-				   void (*callback_fn)(void *), void *userptr)
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
-  callback_fn (userptr);
 }
 
 static struct goacc_asyncqueue *
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 4b7d306f402..6fb8be98542 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -1431,12 +1431,6 @@ goacc_exit_data_internal (struct gomp_device_descr *acc_dev, size_t mapnum,
   gomp_mutex_unlock (&acc_dev->lock);
 }
 
-struct async_prof_callback_info *
-queue_async_prof_dispatch (struct gomp_device_descr *devicep, goacc_aq aq,
-			   acc_prof_info *prof_info, acc_event_info *event_info,
-			   acc_api_info *api_info,
-			   struct async_prof_callback_info *prev_info);
-
 static void
 goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
 				size_t *sizes, unsigned short *kinds,
@@ -1447,7 +1441,6 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
 
   struct goacc_thread *thr;
   struct gomp_device_descr *acc_dev;
-  struct async_prof_callback_info *data_start_info = NULL;
 
   goacc_lazy_initialize ();
 
@@ -1503,19 +1496,9 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
       api_info.async_handle = NULL;
     }
 
-  goacc_aq aq = get_goacc_asyncqueue (async);
-
   if (profiling_p)
-    {
-      if (aq)
-	data_start_info
-	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				       &enter_exit_data_event_info, &api_info,
-				       NULL);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
-    }
+    goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+			      &api_info);
 
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
@@ -1529,6 +1512,8 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
   if (num_waits)
     goacc_wait (async, num_waits, ap);
 
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
   if (data_enter)
     goacc_enter_data_internal (acc_dev, mapnum, hostaddrs, sizes, kinds, aq);
   else
@@ -1540,13 +1525,8 @@ goacc_enter_exit_data_internal (int flags_m, size_t mapnum, void **hostaddrs,
       prof_info.event_type
 	= data_enter ? acc_ev_enter_data_end : acc_ev_exit_data_end;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
-      if (aq)
-	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				   &enter_exit_data_event_info, &api_info,
-				   data_start_info);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index 81e8eba4225..d66bc882a5f 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -259,62 +259,6 @@ handle_ftn_pointers (size_t mapnum, void **hostaddrs, size_t *sizes,
 }
 
 
-struct async_prof_callback_info {
-  acc_prof_info prof_info;
-  acc_event_info event_info;
-  acc_api_info api_info;
-  struct async_prof_callback_info *start_info;
-};
-
-static void
-async_prof_dispatch (void *ptr)
-{
-  struct async_prof_callback_info *info
-    = (struct async_prof_callback_info *) ptr;
-
-  if (info->start_info)
-    {
-      /* The TOOL_INFO must be preserved from a start event to the
-	 corresponding end event.  Copy that here.  */
-      void *tool_info = info->start_info->event_info.other_event.tool_info;
-      info->event_info.other_event.tool_info = tool_info;
-    }
-
-  goacc_profiling_dispatch (&info->prof_info, &info->event_info,
-			    &info->api_info);
-
-  /* The async_prof_dispatch function is (so far) always used for start/end
-     profiling event pairs: the start and end parts are queued, then each is
-     dispatched (or the dispatches might be interleaved before the end part is
-     queued).
-     In any case, it's not safe to delete either info structure before the
-     whole bracketed event is complete.  */
-
-  if (info->start_info)
-    {
-      free (info->start_info);
-      free (info);
-    }
-}
-
-struct async_prof_callback_info *
-queue_async_prof_dispatch (struct gomp_device_descr *devicep, goacc_aq aq,
-			   acc_prof_info *prof_info, acc_event_info *event_info,
-			   acc_api_info *api_info,
-			   struct async_prof_callback_info *prev_info)
-{
-  struct async_prof_callback_info *info = malloc (sizeof (*info));
-
-  info->prof_info = *prof_info;
-  info->event_info = *event_info;
-  info->api_info = *api_info;
-  info->start_info = prev_info;
-
-  devicep->openacc.async.queue_callback_func (aq, async_prof_dispatch,
-					      (void *) info);
-  return info;
-}
-
 /* Launch a possibly offloaded function with FLAGS.  FN is the host fn
    address.  MAPNUM, HOSTADDRS, SIZES & KINDS  describe the memory
    blocks to be copied to/from the device.  Varadic arguments are
@@ -340,8 +284,6 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
   unsigned dims[GOMP_DIM_MAX];
   unsigned tag;
   struct goacc_ncarray_info *nca_info = NULL;
-  struct async_prof_callback_info *comp_start_info = NULL,
-				  *data_start_info = NULL;
 
 #ifdef HAVE_INTTYPES_H
   gomp_debug (0, "%s: mapnum=%"PRIu64", hostaddrs=%p, size=%p, kinds=%p\n",
@@ -403,8 +345,31 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       api_info.async_handle = NULL;
     }
 
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+			      &api_info);
+
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
+  /* Host fallback if "if" clause is false or if the current device is set to
+     the host.  */
+  if (flags & GOACC_FLAG_HOST_FALLBACK)
+    {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+      goacc_save_and_set_bind (acc_device_host);
+      fn (hostaddrs);
+      goacc_restore_bind ();
+      goto out_prof;
+    }
+  else if (acc_device_type (acc_dev->type) == acc_device_host)
+    {
+      fn (hostaddrs);
+      goto out_prof;
+    }
+  else if (profiling_p)
+    api_info.device_api = acc_device_api_cuda;
+
   /* Default: let the runtime choose.  */
   for (i = 0; i != GOMP_DIM_MAX; i++)
     dims[i] = 0;
@@ -437,12 +402,11 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 	    if (async == GOMP_LAUNCH_OP_MAX)
 	      async = va_arg (ap, unsigned);
 
-	    /* Set async number in profiling data, unless the device is the
-	       host or we're doing host fallback.  */
-	    if (profiling_p
-	        && !(flags & GOACC_FLAG_HOST_FALLBACK)
-		&& acc_device_type (acc_dev->type) != acc_device_host)
-	      prof_info.async = prof_info.async_queue = async;
+	    if (profiling_p)
+	      {
+		prof_info.async = async;
+		prof_info.async_queue = prof_info.async;
+	      }
 
 	    break;
 	  }
@@ -470,39 +434,6 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 
   va_end (ap);
 
-  goacc_aq aq = get_goacc_asyncqueue (async);
-
-  if (profiling_p)
-    {
-      if (aq)
-	comp_start_info
-	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				       &compute_construct_event_info,
-				       &api_info, NULL);
-      else
-	goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
-				  &api_info);
-    }
-
-  /* Host fallback if "if" clause is false or if the current device is set to
-     the host.  */
-  if (flags & GOACC_FLAG_HOST_FALLBACK)
-    {
-      prof_info.device_type = acc_device_host;
-      api_info.device_type = prof_info.device_type;
-      goacc_save_and_set_bind (acc_device_host);
-      fn (hostaddrs);
-      goacc_restore_bind ();
-      goto out_prof;
-    }
-  else if (acc_device_type (acc_dev->type) == acc_device_host)
-    {
-      fn (hostaddrs);
-      goto out_prof;
-    }
-  else if (profiling_p)
-    api_info.device_api = acc_device_api_cuda;
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -531,16 +462,12 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
 	= compute_construct_event_info.other_event.parent_construct;
       enter_exit_data_event_info.other_event.implicit = 1;
       enter_exit_data_event_info.other_event.tool_info = NULL;
-      if (aq)
-	data_start_info
-	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				       &enter_exit_data_event_info, &api_info,
-				       NULL);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
     }
 
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
   tgt = gomp_map_vars_openacc (acc_dev, aq, mapnum, hostaddrs, sizes, kinds,
 			       nca_info);
   free (nca_info);
@@ -550,13 +477,8 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_enter_data_end;
       enter_exit_data_event_info.other_event.event_type
 	= prof_info.event_type;
-      if (aq)
-	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				   &enter_exit_data_event_info, &api_info,
-				   data_start_info);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
     }
 
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
@@ -575,14 +497,8 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_exit_data_start;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
       enter_exit_data_event_info.other_event.tool_info = NULL;
-      if (aq)
-	data_start_info
-	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				       &enter_exit_data_event_info, &api_info,
-				       NULL);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
     }
 
   /* If running synchronously (aq == NULL), this will unmap immediately.  */
@@ -592,13 +508,8 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
     {
       prof_info.event_type = acc_ev_exit_data_end;
       enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
-      if (aq)
-	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				   &enter_exit_data_event_info, &api_info,
-				   data_start_info);
-      else
-	goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
     }
 
  out_prof:
@@ -607,13 +518,8 @@ GOACC_parallel_keyed (int flags_m, void (*fn) (void *),
       prof_info.event_type = acc_ev_compute_construct_end;
       compute_construct_event_info.other_event.event_type
 	= prof_info.event_type;
-      if (aq)
-	queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				   &compute_construct_event_info, &api_info,
-				   comp_start_info);
-      else
-	goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
-				  &api_info);
+      goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+				&api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
@@ -851,8 +757,6 @@ GOACC_update (int flags_m, size_t mapnum,
 
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
-  goacc_aq aq = NULL;
-  struct async_prof_callback_info *update_start_info = NULL;
 
   bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
 
@@ -902,15 +806,7 @@ GOACC_update (int flags_m, size_t mapnum,
     }
 
   if (profiling_p)
-    {
-      aq = get_goacc_asyncqueue (async);
-      if (aq)
-	update_start_info
-	  = queue_async_prof_dispatch (acc_dev, aq, &prof_info,
-				       &update_event_info, &api_info, NULL);
-      else
-	goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
-    }
+    goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
 
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
@@ -997,11 +893,7 @@ GOACC_update (int flags_m, size_t mapnum,
     {
       prof_info.event_type = acc_ev_update_end;
       update_event_info.other_event.event_type = prof_info.event_type;
-      if (aq)
-	queue_async_prof_dispatch (acc_dev, aq, &prof_info, &update_event_info,
-				   &api_info, update_start_info);
-      else
-	goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+      goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
 
       thr->prof_info = NULL;
       thr->api_info = NULL;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
index a33fac7556c..91b373216c9 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
@@ -172,10 +172,7 @@ static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info
   assert (prof_info->device_type == acc_device_type);
   assert (prof_info->device_number == acc_device_num);
   assert (prof_info->thread_id == -1);
-  if (acc_device_type == acc_device_host)
-    assert (prof_info->async == acc_async_sync);
-  else
-    assert (prof_info->async == acc_async);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
   assert (prof_info->async_queue == prof_info->async);
   assert (prof_info->src_file == NULL);
   assert (prof_info->func_name == NULL);
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
index 663f7f724d5..28a47ccc27d 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -316,9 +316,9 @@ static void cb_exit_data_start (acc_prof_info *prof_info, acc_event_info *event_
 {
   DEBUG_printf ("%s\n", __FUNCTION__);
 
-  assert (state == 5
+  assert (state == 7
 #if ASYNC_EXIT_DATA
-	  || state == 105
+	  || state == 107
 #endif
 	  );
   STATE_OP (state, ++);
@@ -372,9 +372,9 @@ static void cb_exit_data_end (acc_prof_info *prof_info, acc_event_info *event_in
 {
   DEBUG_printf ("%s\n", __FUNCTION__);
 
-  assert (state == 6
+  assert (state == 8
 #if ASYNC_EXIT_DATA
-	  || state == 106
+	  || state == 108
 #endif
 	  );
   STATE_OP (state, ++);
@@ -458,10 +458,7 @@ static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info
   assert (prof_info->device_type == acc_device_type);
   assert (prof_info->device_number == acc_device_num);
   assert (prof_info->thread_id == -1);
-  if (acc_device_type == acc_device_host)
-    assert (prof_info->async == acc_async_sync);
-  else
-    assert (prof_info->async == acc_async);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
   assert (prof_info->async_queue == prof_info->async);
   assert (prof_info->src_file == NULL);
   assert (prof_info->func_name == NULL);
@@ -502,6 +499,9 @@ static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *
     {
       /* Compensate for the missing 'acc_ev_enter_data_end'.  */
       state += 1;
+      /* Compensate for the missing 'acc_ev_enqueue_launch_start' and
+	 'acc_ev_enqueue_launch_end'.  */
+      state += 2;
       /* Compensate for the missing 'acc_ev_exit_data_start' and
 	 'acc_ev_exit_data_end'.  */
       state += 2;
@@ -514,8 +514,8 @@ static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *
       state += 2;
     }
 #endif
-  assert (state == 7
-	  || state == 107);
+  assert (state == 9
+	  || state == 109);
   STATE_OP (state, ++);
 
   assert (tool_info != NULL);
@@ -569,6 +569,17 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
 
   assert (acc_device_type != acc_device_host);
 
+  assert (state == 5
+	  || state == 105);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
   assert (prof_info->event_type == acc_ev_enqueue_launch_start);
   assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
   assert (prof_info->version == _ACC_PROF_INFO_VERSION);
@@ -612,6 +623,13 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
   assert (api_info->device_handle == NULL);
   assert (api_info->context_handle == NULL);
   assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.launch_event.event_type = event_info->launch_event.event_type;
+  tool_info->nested->event_info.launch_event.kernel_name = strdup (event_info->launch_event.kernel_name);
+  tool_info->nested->event_info.launch_event.num_gangs = event_info->launch_event.num_gangs;
+  tool_info->nested->event_info.launch_event.num_workers = event_info->launch_event.num_workers;
+  tool_info->nested->event_info.launch_event.vector_length = event_info->launch_event.vector_length;
+  event_info->other_event.tool_info = tool_info->nested;
 }
 
 static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
@@ -620,6 +638,19 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
 
   assert (acc_device_type != acc_device_host);
 
+  assert (state == 6
+	  || state == 106);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.launch_event.event_type == acc_ev_enqueue_launch_start);
+  assert (tool_info->nested->event_info.launch_event.kernel_name != NULL);
+  assert (tool_info->nested->event_info.launch_event.num_gangs >= 1);
+  assert (tool_info->nested->event_info.launch_event.num_workers >= 1);
+  assert (tool_info->nested->event_info.launch_event.vector_length >= 1);
+
   assert (prof_info->event_type == acc_ev_enqueue_launch_end);
   assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
   assert (prof_info->version == _ACC_PROF_INFO_VERSION);
@@ -639,7 +670,12 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
   assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
   assert (event_info->launch_event.parent_construct == acc_construct_parallel);
   assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == tool_info->nested);
   assert (event_info->launch_event.kernel_name != NULL);
+  assert (strcmp (event_info->launch_event.kernel_name, tool_info->nested->event_info.launch_event.kernel_name) == 0);
+  assert (event_info->launch_event.num_gangs == tool_info->nested->event_info.launch_event.num_gangs);
+  assert (event_info->launch_event.num_workers == tool_info->nested->event_info.launch_event.num_workers);
+  assert (event_info->launch_event.vector_length == tool_info->nested->event_info.launch_event.vector_length);
 
   if (acc_device_type == acc_device_host)
     assert (api_info->device_api == acc_device_api_none);
@@ -653,6 +689,10 @@ static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *eve
   assert (api_info->device_handle == NULL);
   assert (api_info->context_handle == NULL);
   assert (api_info->async_handle == NULL);
+
+  free ((void *) tool_info->nested->event_info.launch_event.kernel_name);
+  free (tool_info->nested);
+  tool_info->nested = NULL;
 }
 
 
@@ -685,7 +725,7 @@ int main()
     }
     assert (state_init == 4);
   }
-  assert (state == 8);
+  assert (state == 10);
 
   STATE_OP (state, = 100);
 
@@ -702,7 +742,7 @@ int main()
 #pragma acc wait
     assert (state_init == 104);
   }
-  assert (state == 108);
+  assert (state == 110);
 
   return 0;
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Allow libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral' data (was: [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime)
  2021-07-27 10:01   ` Thomas Schwinge
@ 2023-03-10 15:22     ` Thomas Schwinge
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Schwinge @ 2023-03-10 15:22 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 1987 bytes --]

Hi!

On 2021-07-27T12:01:18+0200, I wrote:
> On 2021-06-29T16:42:03-0700, Julian Brown <julian@codesourcery.com> wrote:
>> This patch fixes several places in libgomp/target.c where "ephemeral" data
>> (on the stack or in temporary heap locations) may be used as the source of
>> an asynchronous host-to-device copy that may not complete before the host
>> data disappears.  Versions of the patch have been posted several times
>> before, but this one (at Chung-Lin Tang's prior suggesion, IIRC) moves
>> all logic into target.c rather than pushing it out to each target plugin.
>
> Thanks for the re-work!

>> +/* Copy host memory to an offload device.  In asynchronous mode (if AQ is
>> +   non-NULL), when the source data is stack or may otherwise be deallocated
>> +   before the asynchronous copy takes place, EPHEMERAL must be passed as
>> +   TRUE.  The CBUF isn't used for non-ephemeral asynchronous copies, because
>> +   the host data might not be computed yet (by an earlier asynchronous compute
>> +   region).  */
>> +
>>  [gomp_copy_host2dev]
>
> Code changes related to the latter sentence have moved into a separate
> "Don't use libgomp 'cbuf' buffering with OpenACC 'async'", pushed to
> master branch in commit d88a6951586c7229b25708f4486eaaf4bf4b5bbe, [...]

Re this TODO comment:

> +   TODO ... but we could allow CBUF usage for EPHEMERAL data?  (Open question:
> +   is it more performant to use libgomp CBUF buffering or individual device
> +   asyncronous copying?)  */

Pushed to master branch commit 2b2340e236c0bba8aaca358ea25a5accd8249fbd
"Allow libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral' data",
see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Allow-libgomp-cbuf-buffering-with-OpenACC-async-for-.patch --]
[-- Type: text/x-diff, Size: 5525 bytes --]

From 2b2340e236c0bba8aaca358ea25a5accd8249fbd Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Mon, 27 Feb 2023 16:41:17 +0100
Subject: [PATCH] Allow libgomp 'cbuf' buffering with OpenACC 'async' for
 'ephemeral' data

This does *allow*, but under no circumstances is this currently going to be
used: all potentially applicable data is non-'ephemeral', and thus not
considered for 'gomp_coalesce_buf_add' for OpenACC 'async'.  (But a use will
emerge later.)

Follow-up to commit r12-2530-gd88a6951586c7229b25708f4486eaaf4bf4b5bbe
"Don't use libgomp 'cbuf' buffering with OpenACC 'async'", addressing this
TODO comment:

    TODO ... but we could allow CBUF usage for EPHEMERAL data?  (Open question:
    is it more performant to use libgomp CBUF buffering or individual device
    asyncronous copying?)

Ephemeral data is small, and therefore individual device asyncronous copying
does seem dubious -- in particular given that for all those, we'd individually
have to allocate and queue for deallocation a temporary buffer to capture the
ephemeral data.  Instead, just let the 'cbuf' *be* the temporary buffer.

	libgomp/
	* target.c (gomp_copy_host2dev, gomp_map_vars_internal): Allow
	libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral'
	data.
---
 libgomp/target.c | 70 +++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/libgomp/target.c b/libgomp/target.c
index 0344f68a936..074caa6a4dc 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -310,10 +310,8 @@ struct gomp_coalesce_buf
 
    This must not be used for asynchronous copies, because the host data might
    not be computed yet (by an earlier asynchronous compute region, for
-   example).
-   TODO ... but we could allow CBUF usage for EPHEMERAL data?  (Open question:
-   is it more performant to use libgomp CBUF buffering or individual device
-   asyncronous copying?)  */
+   example).  The exception is for EPHEMERAL data, that we know is available
+   already "by construction".  */
 
 static inline void
 gomp_coalesce_buf_add (struct gomp_coalesce_buf *cbuf, size_t start, size_t len)
@@ -377,30 +375,6 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		    void *d, const void *h, size_t sz,
 		    bool ephemeral, struct gomp_coalesce_buf *cbuf)
 {
-  if (__builtin_expect (aq != NULL, 0))
-    {
-      /* See 'gomp_coalesce_buf_add'.  */
-      assert (!cbuf);
-
-      void *h_buf = (void *) h;
-      if (ephemeral)
-	{
-	  /* We're queueing up an asynchronous copy from data that may
-	     disappear before the transfer takes place (i.e. because it is a
-	     stack local in a function that is no longer executing).  Make a
-	     copy of the data into a temporary buffer in those cases.  */
-	  h_buf = gomp_malloc (sz);
-	  memcpy (h_buf, h, sz);
-	}
-      goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
-			       "dev", d, "host", h_buf, h, sz, aq);
-      if (ephemeral)
-	/* Free temporary buffer once the transfer has completed.  */
-	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
-
-      return;
-    }
-
   if (cbuf)
     {
       uintptr_t doff = (uintptr_t) d - cbuf->tgt->tgt_start;
@@ -420,6 +394,12 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		      gomp_mutex_unlock (&devicep->lock);
 		      gomp_fatal ("internal libgomp cbuf error");
 		    }
+
+		  /* In an asynchronous context, verify that CBUF isn't used
+		     with non-EPHEMERAL data; see 'gomp_coalesce_buf_add'.  */
+		  if (__builtin_expect (aq != NULL, 0))
+		    assert (ephemeral);
+
 		  memcpy ((char *) cbuf->buf + (doff - cbuf->chunks[0].start),
 			  h, sz);
 		  return;
@@ -430,7 +410,28 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	}
     }
 
-  gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+  if (__builtin_expect (aq != NULL, 0))
+    {
+      void *h_buf = (void *) h;
+      if (ephemeral)
+	{
+	  /* We're queueing up an asynchronous copy from data that may
+	     disappear before the transfer takes place (i.e. because it is a
+	     stack local in a function that is no longer executing).  As we've
+	     not been able to use CBUF, make a copy of the data into a
+	     temporary buffer.  */
+	  h_buf = gomp_malloc (sz);
+	  memcpy (h_buf, h, sz);
+	}
+      goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+			       "dev", d, "host", h_buf, h, sz, aq);
+      if (ephemeral)
+	/* Free once the transfer has completed.  */
+	devicep->openacc.async.queue_callback_func (aq, free, h_buf);
+    }
+  else
+    gomp_device_copy (devicep, devicep->host2dev_func,
+		      "dev", d, "host", h, sz);
 }
 
 attribute_hidden void
@@ -1751,9 +1752,6 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 
   if (cbufp)
     {
-      /* See 'gomp_coalesce_buf_add'.  */
-      assert (!aq);
-
       long c = 0;
       for (c = 0; c < cbuf.chunk_cnt; ++c)
 	gomp_copy_host2dev (devicep, aq,
@@ -1761,8 +1759,12 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 			    (char *) cbuf.buf + (cbuf.chunks[c].start
 						 - cbuf.chunks[0].start),
 			    cbuf.chunks[c].end - cbuf.chunks[c].start,
-			    true, NULL);
-      free (cbuf.buf);
+			    false, NULL);
+      if (aq)
+	/* Free once the transfer has completed.  */
+	devicep->openacc.async.queue_callback_func (aq, free, cbuf.buf);
+      else
+	free (cbuf.buf);
       cbuf.buf = NULL;
       cbufp = NULL;
     }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-03-10 15:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
2021-06-29 23:42 ` [PATCH 1/4] openacc: Async fix for lib-94 testcase Julian Brown
2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
2021-06-29 23:52   ` Julian Brown
2021-06-29 23:42 ` [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime Julian Brown
2021-07-27 10:01   ` Thomas Schwinge
2023-03-10 15:22     ` Allow libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral' data (was: [PATCH 3/4] openacc: Fix asynchronous host-to-device copies in libgomp runtime) Thomas Schwinge
2021-06-29 23:42 ` [PATCH 4/4] openacc: Profiling-interface fixes for asynchronous operations Julian Brown
2021-06-30  8:28 ` [PATCH 0/4] openacc: Async fixes Thomas Schwinge
2021-06-30 10:40   ` Julian Brown
2021-07-02 13:51     ` Julian Brown
2023-03-10 11:38   ` Thomas Schwinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).