[patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

public inbox for gcc-patches@gcc.gnu.org

* [patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
@ 2024-01-22 19:45 Tobias Burnus
  2024-01-23  9:55 ` [v2][patch] " Tobias Burnus
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Burnus @ 2024-01-22 19:45 UTC
  To: gcc-patches, Thomas Schwinge, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 940 bytes --]

Testing showed that the libgomp.c/target-52.c failed with:

libgomp: cuCtxGetDevice error: unknown cuda error

libgomp: device finalization failed

This testcase uses OMP_DISPLAY_ENV=true and
OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only
fails if dg-set-target-env-var is honored.

If both env vars are set, the device initialization occurs earlier as
OMP_DEFAULT_DEVICE is shown due to the display-env env var and its
value (when target-offload-var is 'mandatory') might be either
'omp_invalid_device' or '0'.

It turned out that this had an effect on device finalization, which
caused CUDA to stop earlier than expected. This patch now handles this
case gracefully. For details, see the commit log message in the
attached patch and/or the PR.

Comments, remarks, suggestions?

Does this look sensible? (I would like to see some acknowledgement by
someone who feels more comfortable with CUDA than me.)

Tobias

[-- Attachment #2: fix-nvptx-shutdown.diff --]
[-- Type: text/x-patch, Size: 8887 bytes --]

plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

The following issue was found when running libgomp.c/target-52.c with
nvptx offloading when the dg-set-target-env-var was honored. The issue
occurred for both -foffload=disable and with offloading configured when
an nvidia device is available.

At the end of the program, the offloading parts are shut down via two means:
the callback registered via 'atexit (gomp_target_fini)' and, via code
generated in mkoffload, the '__attribute__((destructor)) fini' function
that calls GOMP_offload_unregister_ver.

In normal processing, first gomp_target_fini is called - which then sets
GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
but that's then a no-op because the state is GOMP_DEVICE_FINALIZED.
If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
the call omp_display_env already invokes gomp_init_targets_once, i.e. it
occurs earlier than usual and is invoked via __attribute__((constructor))
initialize_env.
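
The state guard described above can be modelled in a few lines. The following
is only a sketch under stated assumptions: all sketch_* names are hypothetical
stand-ins and not libgomp code, and the model only tracks which plugin-level
action would run for each of the two shutdown orders.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for libgomp's device states.  */
enum sketch_state
{
  SKETCH_UNINITIALIZED,
  SKETCH_INITIALIZED,
  SKETCH_FINALIZED
};

struct sketch_device
{
  enum sketch_state state;
  char log[64];   /* Records which actions ran, in order.  */
};

static void
sketch_log (struct sketch_device *d, const char *what)
{
  strncat (d->log, what, sizeof d->log - strlen (d->log) - 1);
}

/* Models gomp_target_fini: finalize the device if it is initialized.  */
static void
sketch_target_fini (struct sketch_device *d)
{
  if (d->state == SKETCH_INITIALIZED)
    {
      sketch_log (d, "fini_device;");
      d->state = SKETCH_FINALIZED;
    }
}

/* Models GOMP_offload_unregister_ver: unload only while initialized.  */
static void
sketch_unregister (struct sketch_device *d)
{
  if (d->state == SKETCH_INITIALIZED)
    sketch_log (d, "unload_image;");
}

/* Runs both shutdown hooks in the given order and returns the log.  */
static const char *
sketch_shutdown (struct sketch_device *d, bool unregister_first)
{
  d->state = SKETCH_INITIALIZED;
  d->log[0] = '\0';
  if (unregister_first)
    {
      sketch_unregister (d);
      sketch_target_fini (d);
    }
  else
    {
      sketch_target_fini (d);
      sketch_unregister (d);
    }
  return d->log;
}
```

In this model, the usual order yields only the finalization step, while the
reversed order yields an image unload followed by a finalization that still
has to attach to the device.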

For some unknown reason, while this does not have an effect on the
order of the called plugin functions for initialization, it changes the
order of function calls for shutting down. Namely, when the two environment
variables are set, GOMP_offload_unregister_ver is now called before
gomp_target_fini. And it seems as if CUDA regards a call to cuModuleUnload
(or unloading the last module?) as an indication that the device context should
be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
CUDA_ERROR_DEINITIALIZED.

As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
that result, it called
  GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
causing a fatal error of the program.

This commit now handles CUDA_ERROR_DEINITIALIZED in a special way such
that GOMP_OFFLOAD_fini_device just works.
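
As a rough illustration of the three-way return value, here is a minimal
sketch. It assumes hypothetical sketch_* names and a fake result enum in
place of the CUDA driver API; it mirrors only the control flow of the
finalization path, not the plugin itself.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the CUDA driver result codes involved.  */
enum sketch_result
{
  SKETCH_SUCCESS,
  SKETCH_DEINITIALIZED,
  SKETCH_OTHER_ERROR
};

/* Returns 1 on success, 0 on error (after printing a message), and -1
   if the driver reports that it has already shut down.  */
static int
sketch_attach (enum sketch_result r)
{
  if (r == SKETCH_DEINITIALIZED)
    return -1;
  if (r != SKETCH_SUCCESS)
    {
      fprintf (stderr, "sketch: attach error\n");
      return 0;
    }
  return 1;
}

/* Models the finalization path.  CLOSE_OK stands in for the result of
   closing the device; *CLOSE_CALLED records whether closing was tried.  */
static bool
sketch_fini_device (enum sketch_result r, bool close_ok, bool *close_called)
{
  *close_called = false;
  int attached = sketch_attach (r);
  if (attached == 0)
    return false;              /* Real error: finalization fails.  */
  if (attached == 1)
    {
      *close_called = true;    /* Device still alive: close it.  */
      if (!close_ok)
	return false;
    }
  return true;                 /* -1: already shut down, nothing to do.  */
}
```

The design point is that the "already shut down" case skips the close step
and still reports success, while any other error remains fatal.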

When reading the code, the following was observed in addition:
When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
to ensure that the queue is emptied.  It seems to make sense to do
likewise for GOMP_offload_unregister_ver, which this commit does in
addition.

libgomp/ChangeLog:

	PR libgomp/113513
	* target.c (GOMP_offload_unregister_ver): Call goacc_fini_asyncqueues
	before invoking gomp_unload_image_from_device.
	* plugin/plugin-nvptx.c (nvptx_attach_host_thread_to_device): Change
	return type to int and return -1 for CUDA_ERROR_DEINITIALIZED.
	(GOMP_OFFLOAD_fini_device): Handle the latter gracefully.
	(nvptx_init, GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_alloc,
	GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host, GOMP_OFFLOAD_memcpy2d,
	GOMP_OFFLOAD_memcpy3d, GOMP_OFFLOAD_openacc_async_host2dev,
	GOMP_OFFLOAD_openacc_async_dev2host): Update for return-type change.

Signed-off-by: Tobias Burnus <tburnus@baylibre.com>

 libgomp/plugin/plugin-nvptx.c | 41 +++++++++++++++++++++++++----------------
 libgomp/target.c              |  7 +++++--
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index c04c3acd679..dccbae44abd 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -382,9 +382,11 @@ nvptx_init (void)
 }
 
 /* Select the N'th PTX device for the current host thread.  The device must
-   have been previously opened before calling this function.  */
+   have been previously opened before calling this function.
+   Returns 1 if successful, 0 if an error occurred, and -1 for
+   CUDA_ERROR_DEINITIALIZED.  */
 
-static bool
+static int
 nvptx_attach_host_thread_to_device (int n)
 {
   CUdevice dev;
@@ -393,15 +395,17 @@ nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
+  if (r == CUDA_ERROR_DEINITIALIZED)
+    return -1;
   if (r == CUDA_ERROR_NOT_PERMITTED)
     {
       /* Assume we're in a CUDA callback, just return true.  */
-      return true;
+      return 1;
     }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
-      return false;
+      return 0;
     }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
@@ -414,7 +418,7 @@ nvptx_attach_host_thread_to_device (int n)
       if (!ptx_dev)
 	{
 	  GOMP_PLUGIN_error ("device %d not found", n);
-	  return false;
+	  return 0;
 	}
 
       CUDA_CALL (cuCtxGetCurrent, &thd_ctx);
@@ -426,7 +430,7 @@ nvptx_attach_host_thread_to_device (int n)
 
       CUDA_CALL (cuCtxPushCurrent, ptx_dev->ctx);
     }
-  return true;
+  return 1;
 }
 
 static struct ptx_device *
@@ -1252,8 +1256,11 @@ GOMP_OFFLOAD_fini_device (int n)
 
   if (ptx_devices[n] != NULL)
     {
-      if (!nvptx_attach_host_thread_to_device (n)
-	  || !nvptx_close_device (ptx_devices[n]))
+      /* Returns 1 if successful, 0 if an error occurred, and -1 for
+	 CUDA_ERROR_DEINITIALIZED.  */
+      int r = nvptx_attach_host_thread_to_device (n);
+      if (r == 0
+	  || (r == 1 && !nvptx_close_device (ptx_devices[n])))
 	{
 	  pthread_mutex_unlock (&ptx_dev_lock);
 	  return false;
@@ -1329,7 +1336,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       return -1;
     }
 
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
     return -1;
 
@@ -1568,7 +1575,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
-  if (!nvptx_attach_host_thread_to_device (ord))
+  if (nvptx_attach_host_thread_to_device (ord) != 1)
     return NULL;
 
   struct ptx_device *ptx_dev = ptx_devices[ord];
@@ -1837,7 +1844,7 @@ cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
 bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
@@ -1847,7 +1854,7 @@ GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
@@ -1868,7 +1875,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
 		       const void *src, size_t src_offset1_size,
 		       size_t src_offset0_len, size_t src_dim1_size)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord)
+      != 1)
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -1960,7 +1968,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
 		       size_t src_offset0_len, size_t src_dim2_size,
 		       size_t src_dim1_len)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord)
+      != 1)
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -2050,7 +2059,7 @@ bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
@@ -2061,7 +2070,7 @@ bool
 GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (nvptx_attach_host_thread_to_device (ord) != 1
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
diff --git a/libgomp/target.c b/libgomp/target.c
index 1367e9cce6c..8d05877deb7 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2706,8 +2706,11 @@ GOMP_offload_unregister_ver (unsigned version, const void *host_table,
       gomp_mutex_lock (&devicep->lock);
       if (devicep->type == target_type
 	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
+	{
+	  goacc_fini_asyncqueues (devicep);
+	  gomp_unload_image_from_device (devicep, version,
+					 host_table, target_data);
+	}
       gomp_mutex_unlock (&devicep->lock);
     }
 


* [v2][patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
  2024-01-22 19:45 [patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513] Tobias Burnus
@ 2024-01-23  9:55 ` Tobias Burnus
  2024-01-29 15:53   ` Thomas Schwinge
  0 siblings, 1 reply; 4+ messages in thread
From: Tobias Burnus @ 2024-01-23  9:55 UTC
  To: gcc-patches, Thomas Schwinge, Jakub Jelinek


[-- Attachment #1.1: Type: text/plain, Size: 1204 bytes --]

Slightly changed patch:

nvptx_attach_host_thread_to_device now fails again with an error for 
CUDA_ERROR_DEINITIALIZED, except for GOMP_OFFLOAD_fini_device.

I think it makes more sense that way.
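
The difference to the first version can be sketched as follows. This is an
illustration only: the sketch_* names and the result enum are hypothetical
stand-ins for the plugin function and the CUDA result codes.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the CUDA driver result codes involved.  */
enum sketch_result_v2
{
  SKETCH2_SUCCESS,
  SKETCH2_DEINITIALIZED,
  SKETCH2_OTHER_ERROR
};

/* Returns 1 on success and 0 on error (after printing a message).
   Only if FINI_OKAY is set, an already shut-down driver is reported
   as -1 without a message; otherwise it is an ordinary error.  */
static int
sketch_attach_v2 (enum sketch_result_v2 r, bool fini_okay)
{
  if (fini_okay && r == SKETCH2_DEINITIALIZED)
    return -1;
  if (r != SKETCH2_SUCCESS)
    {
      fprintf (stderr, "sketch: attach error\n");
      return 0;
    }
  return 1;
}

/* Models a regular caller such as an allocation routine: with
   FINI_OKAY false, the result is only ever 0 or 1, so a plain
   truth test remains valid.  */
static bool
sketch_regular_caller_ok (enum sketch_result_v2 r)
{
  return sketch_attach_v2 (r, false) != 0;
}
```

Because -1 can only be returned when the flag is set, callers passing false
can keep their existing boolean-style checks.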

Tobias Burnus wrote:
> Testing showed that the libgomp.c/target-52.c failed with:
>
> libgomp: cuCtxGetDevice error: unknown cuda error
>
> libgomp: device finalization failed
>
> This testcase uses OMP_DISPLAY_ENV=true and 
> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only 
> fails if dg-set-target-env-var is honored.
>
> If both env vars are set, the device initialization occurs earlier as 
> OMP_DEFAULT_DEVICE is shown due to the display-env env var and its 
> value (when target-offload-var is 'mandatory') might be either 
> 'omp_invalid_device' or '0'.
>
> It turned out that this had an effect on device finalization, which 
> caused CUDA to stop earlier than expected. This patch now handles this 
> case gracefully. For details, see the commit log message in the 
> attached patch and/or the PR.
>
> Comments, remarks, suggestions?
>
> Does this look sensible? (I would like to see some acknowledgement by 
> someone who feels more comfortable with CUDA than me.)

Tobias

[-- Attachment #2: fix-nvptx-shutdown-v2.diff --]
[-- Type: text/x-patch, Size: 9399 bytes --]

plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]

The following issue was found when running libgomp.c/target-52.c with
nvptx offloading when the dg-set-target-env-var was honored. The issue
occurred for both -foffload=disable and with offloading configured when
an nvidia device is available.

At the end of the program, the offloading parts are shut down via two means:
the callback registered via 'atexit (gomp_target_fini)' and, via code
generated in mkoffload, the '__attribute__((destructor)) fini' function
that calls GOMP_offload_unregister_ver.

In normal processing, first gomp_target_fini is called - which then sets
GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
but that's then a no-op because the state is GOMP_DEVICE_FINALIZED.
If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
the call omp_display_env already invokes gomp_init_targets_once, i.e. it
occurs earlier than usual and is invoked via __attribute__((constructor))
initialize_env.

For some unknown reason, while this does not have an effect on the
order of the called plugin functions for initialization, it changes the
order of function calls for shutting down. Namely, when the two environment
variables are set, GOMP_offload_unregister_ver is now called before
gomp_target_fini. And it seems as if CUDA regards a call to cuModuleUnload
(or unloading the last module?) as an indication that the device context should
be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
CUDA_ERROR_DEINITIALIZED.

As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
that result, it called
  GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
causing a fatal error of the program.

This commit now handles CUDA_ERROR_DEINITIALIZED in a special way such
that GOMP_OFFLOAD_fini_device just works.

When reading the code, the following was observed in addition:
When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
to ensure that the queue is emptied.  It seems to make sense to do
likewise for GOMP_offload_unregister_ver, which this commit does in
addition.

libgomp/ChangeLog:

	PR libgomp/113513
	* target.c (GOMP_offload_unregister_ver): Call goacc_fini_asyncqueues
	before invoking gomp_unload_image_from_device.
	* plugin/plugin-nvptx.c (nvptx_attach_host_thread_to_device): Change
	return type to int and add new bool arg; if true, return -1 for
	CUDA_ERROR_DEINITIALIZED.
	(GOMP_OFFLOAD_fini_device): Handle the deinitialized case gracefully.
	(nvptx_init, GOMP_OFFLOAD_load_image, GOMP_OFFLOAD_alloc,
	GOMP_OFFLOAD_free, GOMP_OFFLOAD_host2dev, GOMP_OFFLOAD_dev2host,
	GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d,
	GOMP_OFFLOAD_openacc_async_host2dev,
	GOMP_OFFLOAD_openacc_async_dev2host): Update calls.
Signed-off-by: Tobias Burnus <tburnus@baylibre.com>

 libgomp/plugin/plugin-nvptx.c | 46 ++++++++++++++++++++++++++-----------------
 libgomp/target.c              |  7 +++++--
 2 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index c04c3acd679..318d3d2aca6 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -382,10 +382,13 @@ nvptx_init (void)
 }
 
 /* Select the N'th PTX device for the current host thread.  The device must
-   have been previously opened before calling this function.  */
+   have been previously opened before calling this function.
+   Returns 1 if successful, 0 if an error occurred and a message has been
+   issued; if fini_okay, -1 is returned for CUDA_ERROR_DEINITIALIZED and
+   no error message is printed in that case.  */
 
-static bool
-nvptx_attach_host_thread_to_device (int n)
+static int
+nvptx_attach_host_thread_to_device (int n, bool fini_okay)
 {
   CUdevice dev;
   CUresult r;
@@ -393,15 +396,17 @@ nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
+  if (fini_okay && r == CUDA_ERROR_DEINITIALIZED)
+    return -1;
   if (r == CUDA_ERROR_NOT_PERMITTED)
     {
       /* Assume we're in a CUDA callback, just return true.  */
-      return true;
+      return 1;
     }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
-      return false;
+      return 0;
     }
 
   if (r != CUDA_ERROR_INVALID_CONTEXT && dev == n)
@@ -414,7 +419,7 @@ nvptx_attach_host_thread_to_device (int n)
       if (!ptx_dev)
 	{
 	  GOMP_PLUGIN_error ("device %d not found", n);
-	  return false;
+	  return 0;
 	}
 
       CUDA_CALL (cuCtxGetCurrent, &thd_ctx);
@@ -426,7 +431,7 @@ nvptx_attach_host_thread_to_device (int n)
 
       CUDA_CALL (cuCtxPushCurrent, ptx_dev->ctx);
     }
-  return true;
+  return 1;
 }
 
 static struct ptx_device *
@@ -1252,8 +1257,11 @@ GOMP_OFFLOAD_fini_device (int n)
 
   if (ptx_devices[n] != NULL)
     {
-      if (!nvptx_attach_host_thread_to_device (n)
-	  || !nvptx_close_device (ptx_devices[n]))
+      /* Returns 1 if successful, 0 if an error occurred, and -1 for
+	 CUDA_ERROR_DEINITIALIZED.  */
+      int r = nvptx_attach_host_thread_to_device (n, true);
+      if (r == 0
+	  || (r == 1 && !nvptx_close_device (ptx_devices[n])))
 	{
 	  pthread_mutex_unlock (&ptx_dev_lock);
 	  return false;
@@ -1329,7 +1337,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
       return -1;
     }
 
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !link_ptx (&module, img_header->ptx_objs, img_header->ptx_num))
     return -1;
 
@@ -1568,7 +1576,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
-  if (!nvptx_attach_host_thread_to_device (ord))
+  if (!nvptx_attach_host_thread_to_device (ord, false))
     return NULL;
 
   struct ptx_device *ptx_dev = ptx_devices[ord];
@@ -1604,7 +1612,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
 bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
-  return (nvptx_attach_host_thread_to_device (ord)
+  return (nvptx_attach_host_thread_to_device (ord, false)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
@@ -1837,7 +1845,7 @@ cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
 bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
@@ -1847,7 +1855,7 @@ GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
@@ -1868,7 +1876,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
 		       const void *src, size_t src_offset1_size,
 		       size_t src_offset0_len, size_t src_dim1_size)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord,
+					   false))
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -1960,7 +1969,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
 		       size_t src_offset0_len, size_t src_dim2_size,
 		       size_t src_dim1_len)
 {
-  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
+  if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord,
+					   false))
     return false;
 
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
@@ -2050,7 +2060,7 @@ bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
   CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
@@ -2061,7 +2071,7 @@ bool
 GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
-  if (!nvptx_attach_host_thread_to_device (ord)
+  if (!nvptx_attach_host_thread_to_device (ord, false)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
   CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
diff --git a/libgomp/target.c b/libgomp/target.c
index 1367e9cce6c..8d05877deb7 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2706,8 +2706,11 @@ GOMP_offload_unregister_ver (unsigned version, const void *host_table,
       gomp_mutex_lock (&devicep->lock);
       if (devicep->type == target_type
 	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
+	{
+	  goacc_fini_asyncqueues (devicep);
+	  gomp_unload_image_from_device (devicep, version,
+					 host_table, target_data);
+	}
       gomp_mutex_unlock (&devicep->lock);
     }
 


* Re: [v2][patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
  2024-01-23  9:55 ` [v2][patch] " Tobias Burnus
@ 2024-01-29 15:53   ` Thomas Schwinge
  2024-01-29 17:07     ` Tobias Burnus
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Schwinge @ 2024-01-29 15:53 UTC
  To: Tobias Burnus; +Cc: gcc-patches, Jakub Jelinek, Tom de Vries

Hi Tobias!

On 2024-01-23T10:55:16+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
> Slightly changed patch:
>
> nvptx_attach_host_thread_to_device now fails again with an error for 
> CUDA_ERROR_DEINITIALIZED, except for GOMP_OFFLOAD_fini_device.
>
> I think it makes more sense that way.

Agreed.

> Tobias Burnus wrote:
>> Testing showed that the libgomp.c/target-52.c failed with:
>>
>> libgomp: cuCtxGetDevice error: unknown cuda error
>>
>> libgomp: device finalization failed
>>
>> This testcase uses OMP_DISPLAY_ENV=true and 
>> OMP_TARGET_OFFLOAD=mandatory, and those env vars matter, i.e. it only 
>> fails if dg-set-target-env-var is honored.
>>
>> If both env vars are set, the device initialization occurs earlier as 
>> OMP_DEFAULT_DEVICE is shown due to the display-env env var and its 
>> value (when target-offload-var is 'mandatory') might be either 
>> 'omp_invalid_device' or '0'.
>>
>> It turned out that this had an effect on device finalization, which 
>> caused CUDA to stop earlier than expected. This patch now handles this 
>> case gracefully. For details, see the commit log message in the 
>> attached patch and/or the PR.

> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>
> The following issue was found when running libgomp.c/target-52.c with
> nvptx offloading when the dg-set-target-env-var was honored.

Curious, I've never seen this failure mode in my several different
configurations.  :-|

> The issue
> occurred for both -foffload=disable and with offloading configured when
> an nvidia device is available.
>
> At the end of the program, the offloading parts are shutdown via two means:
> The callback registered via 'atexit (gomp_target_fini)' and - via code
> generated in mkoffload, the '__attribute__((destructor)) fini' function
> that calls GOMP_offload_unregister_ver.
>
> In normal processing, first gomp_target_fini is called - which then sets
> GOMP_DEVICE_FINALIZED for the device - and later GOMP_offload_unregister_ver,
> but that's then because the state is GOMP_DEVICE_FINALIZED.
> If both OMP_DISPLAY_ENV=true and OMP_TARGET_OFFLOAD="mandatory" are set,
> the call omp_display_env already invokes gomp_init_targets_once, i.e. it
> occurs earlier than usual and is invoked via __attribute__((constructor))
> initialize_env.
>
> For some unknown reasons, while this does not have an effect on the
> order of the called plugin functions for initialization, it changes the
> order of function calls for shutting down. Namely, when the two environment
> variables are set, GOMP_offload_unregister_ver is called now before
> gomp_target_fini.

Re "unknown reasons", isn't that indeed explained by the different
'atexit' function/'__attribute__((destructor))' sequencing, due to
different order of 'atexit'/'__attribute__((constructor))' calls?

I think I agree that, defensively, we should behave correctly in libgomp
finalization, no matter in which order these calls occur.
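
To illustrate the sequencing argument, here is a toy model of exit-time
handlers that run in reverse order of registration, as the C standard
specifies for 'atexit'.  Treating destructors and 'atexit' callbacks as
entries of one shared list is an assumption of this sketch, not a statement
about any particular C library, and all sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_HANDLERS 8

/* A toy last-in, first-out list of exit-time handler names.  */
struct sketch_exit_list
{
  const char *names[SKETCH_MAX_HANDLERS];
  size_t count;
};

/* Registers a handler; returns 0 on success, -1 if the list is full.  */
static int
sketch_register (struct sketch_exit_list *l, const char *name)
{
  if (l->count == SKETCH_MAX_HANDLERS)
    return -1;
  l->names[l->count++] = name;
  return 0;
}

/* "Runs" the handlers last-registered-first and writes the resulting
   order into OUT, which must have room for at least one byte.  */
static void
sketch_run_exit (struct sketch_exit_list *l, char *out, size_t out_size)
{
  out[0] = '\0';
  while (l->count > 0)
    {
      const char *name = l->names[--l->count];
      strncat (out, name, out_size - strlen (out) - 1);
      strncat (out, ";", out_size - strlen (out) - 1);
    }
}
```

In this model, registering the finalization callback earlier than the
unregistration hook makes it run later at exit, which is the kind of
reordering discussed above.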

> And it seems as if CUDA regards a call to cuModuleUnload
> (or unloading the last module?) as indication that the device context should
> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
> CUDA_ERROR_DEINITIALIZED.

However, this I don't understand -- but would like to.  Are you saying
that for:

    --- libgomp/plugin/plugin-nvptx.c
    +++ libgomp/plugin/plugin-nvptx.c
    @@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
         if (image->target_data == target_data)
           {
     	*prev_p = image->next;
    -	if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
    +	CUresult r;
    +	r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
    +	GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
    +	if (r != CUDA_SUCCESS)
     	  ret = false;
    +	CUdevice dev_;
    +	r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
    +	GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
    +	GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
    +	assert (dev_ == dev->dev);
     	free (image->fns);
     	free (image);
     	break;

..., you're seeing an error for 'libgomp.c/target-52.c' with
'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'?  I get:

    GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
    GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
    GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0

Or, is something else happening in between the 'cuModuleUnload' and your
reportedly failing 'cuCtxGetDevice'?

Re your PR113513 details, I don't see how your failure mode could be
related to (a) the PTX code ('--with-arch=sm_80'), or the GPU hardware
("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
things, of course...), so could this possibly be due to a recent change
in the CUDA Driver/Nvidia Driver?  You say "CUDA Version: 12.3", but
which Nvidia Driver version?  The latest I've now tested are:

    Driver Version: 525.147.05   CUDA Version: 12.0
    Driver Version: 535.154.05   CUDA Version: 12.2

I'll re-try with a more recent version.

> As the previous code in nvptx_attach_host_thread_to_device wasn't expecting
> that result, it called
>   GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
> causing a fatal error of the program.
>
> This commit handles now CUDA_ERROR_DEINITIALIZED in a special way such
> that GOMP_OFFLOAD_fini_device just works.

I'd like to please defer that one until we understand the actual origin
of the misbehavior.


> When reading the code, the following was observed in addition:
> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
> to ensure that the queue is emptied.  It seems to make sense to do
> likewise for GOMP_offload_unregister_ver, which this commit does in
> addition.

I don't understand why offload image unregistration (a) should trigger
'goacc_fini_asyncqueues', and (b) how that relates to PR113513?


Grüße
 Thomas


* Re: [v2][patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
  2024-01-29 15:53   ` Thomas Schwinge
@ 2024-01-29 17:07     ` Tobias Burnus
  0 siblings, 0 replies; 4+ messages in thread
From: Tobias Burnus @ 2024-01-29 17:07 UTC
  To: Thomas Schwinge; +Cc: gcc-patches, Jakub Jelinek, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 6615 bytes --]

Hi Thomas,

Thomas Schwinge wrote:
> On 2024-01-23T10:55:16+0100, Tobias Burnus <tburnus@baylibre.com> wrote:
>> plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513]
>>
>> The following issue was found when running libgomp.c/target-52.c with
>> nvptx offloading when the dg-set-target-env-var was honored.
> Curious, I've never seen this failure mode in my several different
> configurations.  :-|

I think we recently fixed a surprisingly high number of issues that we 
didn't see before but were clearly preexisting for quite a while. 
(Mostly for AMDGPU but still.)

But I concur that this one is a trickier one.

>> For some unknown reasons, while this does not have an effect on the
>> order of the called plugin functions for initialization, it changes the
>> order of function calls for shutting down. Namely, when the two environment
>> variables are set, GOMP_offload_unregister_ver is called now before
>> gomp_target_fini.
> Re "unknown reasons", isn't that indeed explained by the different
> 'atexit' function/'__attribute__((destructor))' sequencing, due to
> different order of 'atexit'/'__attribute__((constructor))' calls?

Maybe, maybe not. First, it does not seem to occur elsewhere, but maybe 
that's because remote setting of environment variables does not work 
with DejaGNU and most code was run that way. And secondly, I have no 
idea how 'atexit' and destructors are implemented internally.

>> And it seems as if CUDA regards a call to cuModuleUnload
>> (or unloading the last module?) as indication that the device context should
>> be destroyed - or, at least, afterwards calling cuCtxGetDevice will return
>> CUDA_ERROR_DEINITIALIZED.
> However, this I don't understand -- but would like to.  Are you saying
> that for:
>
>      --- libgomp/plugin/plugin-nvptx.c
>      +++ libgomp/plugin/plugin-nvptx.c
>      @@ -1556,8 +1556,16 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
>           if (image->target_data == target_data)
>             {
>       	*prev_p = image->next;
>      -	if (CUDA_CALL_NOCHECK (cuModuleUnload, image->module) != CUDA_SUCCESS)
>      +	CUresult r;
>      +	r = CUDA_CALL_NOCHECK (cuModuleUnload, image->module);
>      +	GOMP_PLUGIN_debug (0, "%s: cuModuleUnload: %s\n", __FUNCTION__, cuda_error (r));
>      +	if (r != CUDA_SUCCESS)
>       	  ret = false;
>      +	CUdevice dev_;
>      +	r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev_);
>      +	GOMP_PLUGIN_debug (0, "%s: cuCtxGetDevice: %s\n", __FUNCTION__, cuda_error (r));
>      +	GOMP_PLUGIN_debug (0, "%s: dev_=%d, dev->dev=%d\n", __FUNCTION__, dev_, dev->dev);
>      +	assert (dev_ == dev->dev);
>       	free (image->fns);
>       	free (image);
>       	break;
>
> ..., you're seeing an error for 'libgomp.c/target-52.c' with
> 'env OMP_TARGET_OFFLOAD=mandatory OMP_DISPLAY_ENV=true'?  I get:
>
>      GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
>      GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
>      GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0
>
> Or, is something else happening in between the 'cuModuleUnload' and your
> reportedly failing 'cuCtxGetDevice'?

I cluttered the plugin with "printf" debugging; hence, no other code
is calling *into* the run-time library as far as I can see.

But now I will try it with vanilla code and your patch applied.

Result for the target-52.c with the env vars set:

DEBUG: GOMP_offload_unregister_ver dev=0; state=1
DEBUG: gomp_unload_image_from_device
DEBUG GOMP_OFFLOAD_unload_image, 0, 196609
GOMP_OFFLOAD_unload_image: cuModuleUnload: no error
GOMP_OFFLOAD_unload_image: cuCtxGetDevice: no error
GOMP_OFFLOAD_unload_image: dev_=0, dev->dev=0
DEBUG: gomp_target_fini; dev=0, state=1
DEBUG  0
DEBUG: nvptx_attach_host_thread_to_device - 0
DEBUG: ERROR nvptx_attach_host_thread_to_device - 0

libgomp: cuCtxGetDevice error: unknown cuda error

Hence: Calling cuCtxGetDevice immediately after cuModuleUnload
does not fail.

But calling it shortly afterwards via gomp_target_fini
→ GOMP_OFFLOAD_fini_device → nvptx_attach_host_thread_to_device
does fail.
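
To make the observed sequence concrete, here is a minimal sketch with a
mocked driver. It is not the attached patch and does not call the real
CUDA API: every mock_* name is hypothetical, and the assumption that the
driver shuts itself down between the image unload and gomp_target_fini
is exactly the part that is still unconfirmed. It only shows why a
strict fini fails in that order, and how treating "already
deinitialized" as "nothing left to do" would avoid the error.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the CUDA driver; not the real API.  */
enum mock_result { MOCK_SUCCESS, MOCK_ERROR_DEINITIALIZED };

static bool mock_driver_alive = true;

/* Mimics cuCtxGetDevice: works until the driver has shut down.  */
enum mock_result
mock_ctx_get_device (int *dev)
{
  if (!mock_driver_alive)
    return MOCK_ERROR_DEINITIALIZED;
  *dev = 0;
  return MOCK_SUCCESS;
}

/* Assumption: the driver tears itself down at some point after the
   image was unloaded, e.g. in an exit handler of its own.  */
void
mock_driver_shutdown (void)
{
  mock_driver_alive = false;
}

/* Strict variant: any failure is reported as an error.  */
bool
fini_device_strict (void)
{
  int dev;
  if (mock_ctx_get_device (&dev) != MOCK_SUCCESS)
    {
      fprintf (stderr, "device finalization failed\n");
      return false;
    }
  return true;
}

/* Tolerant variant: a deinitialized driver means there is nothing
   left to finalize, so this is not treated as an error.  */
bool
fini_device_tolerant (void)
{
  int dev;
  enum mock_result r = mock_ctx_get_device (&dev);
  if (r == MOCK_ERROR_DEINITIALIZED)
    return true;
  return r == MOCK_SUCCESS;
}
```

With this model, fini_device_strict () succeeds before
mock_driver_shutdown () and fails afterwards, while
fini_device_tolerant () succeeds in both cases.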

I have attached my printf patch for reference.

* * *

> Re your PR113513 details, I don't see how your failure mode could be
> related to (a) the PTX code ('--with-arch=sm_80'), or (b) the GPU hardware
> ("NVIDIA RTX A1000 6GB") (..., unless the Nvidia Driver is doing "funny"
> things, of course...), so could this possibly be due to a recent change
> in the CUDA Driver/Nvidia Driver?  You say "CUDA Version: 12.3", but
> which Nvidia Driver version?  The latest I've now tested are:
>
>      Driver Version: 525.147.05   CUDA Version: 12.0
>      Driver Version: 535.154.05   CUDA Version: 12.2

My laptop has:

NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3

> I'd like to please defer that one until we understand the actual origin
> of the misbehavior.

(I think that patch still makes sense, but first finding out what goes
wrong is fine nonetheless.)

>> When reading the code, the following was observed in addition:
>> When gomp_fini_device is called, it invokes goacc_fini_asyncqueues
>> to ensure that the queue is emptied.  It seems to make sense to do
>> likewise for GOMP_offload_unregister_ver, which this commit does in
>> addition.
> I don't understand why offload image unregistration (a) should trigger
> 'goacc_fini_asyncqueues', and (b) how that relates to PR113513?

While there is no direct relation, and none to the testcase, this is
affected by the ordering of GOMP_offload_unregister_ver vs.
gomp_target_fini, which is the main issue above.

Assume that for some reason GOMP_offload_unregister_ver gets called
before gomp_target_fini. In that case, the asynchronous queues can
still be running when the variables are removed; goacc_fini_asyncqueues
is invoked only later, when gomp_target_fini is called.

Of course, when gomp_target_fini is called first, it will run
goacc_fini_asyncqueues first, and a later GOMP_offload_unregister_ver
is a no-op as the device is already finalized.

Thus, this part of the patch adds a safeguard against something that is
known to be a problem in a related case.
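
The two orderings can be modelled in a few lines. This is only a sketch
under stated assumptions: struct mock_device and all mock_* functions
are hypothetical and merely stand in for libgomp's device descriptor,
goacc_fini_asyncqueues, GOMP_offload_unregister_ver and
gomp_target_fini; the pending-work counter is an invented
simplification of the real async queues.

```c
#include <stdbool.h>

/* Hypothetical model of one device; not libgomp's data structures.  */
struct mock_device
{
  bool initialized;   /* Models state == GOMP_DEVICE_INITIALIZED.  */
  int pending_async;  /* Work items still sitting in async queues.  */
  bool image_loaded;
};

/* Stand-in for goacc_fini_asyncqueues: drain all queued work.
   Has nothing to do when no work is pending.  */
void
mock_fini_asyncqueues (struct mock_device *d)
{
  d->pending_async = 0;
}

/* Stand-in for GOMP_offload_unregister_ver.  Returns false if the
   image was removed while async work was still pending.  */
bool
mock_unregister (struct mock_device *d, bool drain_first)
{
  if (!d->initialized)
    return true;  /* Device already finalized: no-op.  */
  if (drain_first)
    mock_fini_asyncqueues (d);
  bool safe = d->pending_async == 0;
  d->image_loaded = false;
  return safe;
}

/* Stand-in for gomp_target_fini: drain queues, then finalize.  */
void
mock_target_fini (struct mock_device *d)
{
  if (!d->initialized)
    return;
  mock_fini_asyncqueues (d);
  d->image_loaded = false;
  d->initialized = false;
}
```

In this model, calling mock_unregister (&dev, false) with pending work
reports the unsafe case, mock_unregister (&dev, true) does not, and
after mock_target_fini (&dev) either call is a no-op.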

If we guarantee that gomp_target_fini is always called first, I suggest
removing GOMP_offload_unregister_ver for good, as it will then always
be unreachable ... (Well, the function itself will not be, but it will
not do any actual work.)

If we don't think so and there might be an ordering issue, I would very
much like to see this safeguard in; it is very inexpensive if no
work remains to be completed.

Tobias

[-- Attachment #2: debug_nvptx_fini.diff --]
[-- Type: text/x-patch, Size: 13310 bytes --]

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index c04c3acd679..7fc7f4a5bbf 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -391,7 +391,7 @@ nvptx_attach_host_thread_to_device (int n)
   CUresult r;
   struct ptx_device *ptx_dev;
   CUcontext thd_ctx;
-
+__builtin_fprintf (stderr, "DEBUG: nvptx_attach_host_thread_to_device - %d\n", n);
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
   if (r == CUDA_ERROR_NOT_PERMITTED)
     {
@@ -400,6 +400,7 @@ nvptx_attach_host_thread_to_device (int n)
     }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
+__builtin_fprintf (stderr, "DEBUG: ERROR nvptx_attach_host_thread_to_device - %d\n", n);
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
       return false;
     }
@@ -445,9 +446,11 @@ nvptx_open_device (int n)
   ptx_dev->dev = dev;
   ptx_dev->ctx_shared = false;
 
+__builtin_fprintf (stderr, "DEBUG: nvptx_open_device - %d\n", n);
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &ctx_dev);
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
+__builtin_fprintf (stderr, "DEBUG: ERROR nvptx_open_device - %d\n", n);
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
       return NULL;
     }
@@ -1174,24 +1177,28 @@ nvptx_get_current_cuda_context (void)
 const char *
 GOMP_OFFLOAD_get_name (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_get_name\n");
   return "nvptx";
 }
 
 unsigned int
 GOMP_OFFLOAD_get_caps (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_get_caps\n");
   return GOMP_OFFLOAD_CAP_OPENACC_200 | GOMP_OFFLOAD_CAP_OPENMP_400;
 }
 
 int
 GOMP_OFFLOAD_get_type (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_get_type\n");
   return OFFLOAD_TARGET_TYPE_NVIDIA_PTX;
 }
 
 int
 GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_get_num_devices %u\n", omp_requires_mask);
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
      devices were present.  Unified-shared address: see comment in
@@ -1207,6 +1214,7 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 bool
 GOMP_OFFLOAD_init_device (int n)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_init_device %u\n", n);
   struct ptx_device *dev;
 
   pthread_mutex_lock (&ptx_dev_lock);
@@ -1248,6 +1256,7 @@ GOMP_OFFLOAD_init_device (int n)
 bool
 GOMP_OFFLOAD_fini_device (int n)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_fini_device %u\n", n);
   pthread_mutex_lock (&ptx_dev_lock);
 
   if (ptx_devices[n] != NULL)
@@ -1278,6 +1287,7 @@ GOMP_OFFLOAD_fini_device (int n)
 unsigned
 GOMP_OFFLOAD_version (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_version\n");
   return GOMP_VERSION;
 }
 
@@ -1311,6 +1321,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
 			 uint64_t **rev_fn_table,
 			 uint64_t *host_ind_fn_table)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_load_image, %d, %u\n", ord, version);
   CUmodule module;
   const char *const *var_names;
   const struct targ_fn_launch *fn_descs;
@@ -1538,6 +1549,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data,
 bool
 GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_unload_image, %d, %u\n", ord, version);
   struct ptx_image_data *image, **prev_p;
   struct ptx_device *dev = ptx_devices[ord];
 
@@ -1568,6 +1580,7 @@ GOMP_OFFLOAD_unload_image (int ord, unsigned version, const void *target_data)
 void *
 GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_alloc, %d, %lu\n", ord, (long unsigned)size);
   if (!nvptx_attach_host_thread_to_device (ord))
     return NULL;
 
@@ -1604,6 +1617,7 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
 bool
 GOMP_OFFLOAD_free (int ord, void *ptr)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_free, %d\n", ord);
   return (nvptx_attach_host_thread_to_device (ord)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
@@ -1615,6 +1629,7 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *),
 			   void **devaddrs,
 			   unsigned *dims, void *targ_mem_desc)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_exec\n");
   GOMP_PLUGIN_debug (0, "nvptx %s\n", __FUNCTION__);
 
   CUdeviceptr dp = (CUdeviceptr) devaddrs;
@@ -1637,6 +1652,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *),
 				 unsigned *dims, void *targ_mem_desc,
 				 struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_exec\n");
   GOMP_PLUGIN_debug (0, "nvptx %s\n", __FUNCTION__);
 
   CUdeviceptr dp = (CUdeviceptr) devaddrs;
@@ -1646,6 +1662,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *),
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_create_thread_data\n");
   struct ptx_device *ptx_dev;
   struct nvptx_thread *nvthd
     = GOMP_PLUGIN_malloc (sizeof (struct nvptx_thread));
@@ -1670,18 +1687,21 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 void
 GOMP_OFFLOAD_openacc_destroy_thread_data (void *data)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_destroy_thread_data\n");
   free (data);
 }
 
 void *
 GOMP_OFFLOAD_openacc_cuda_get_current_device (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_cuda_get_current_device\n");
   return nvptx_get_current_cuda_device ();
 }
 
 void *
 GOMP_OFFLOAD_openacc_cuda_get_current_context (void)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_cuda_get_current_context\n");
   return nvptx_get_current_cuda_context ();
 }
 
@@ -1689,6 +1709,7 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void)
 void *
 GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_cuda_get_stream\n");
   return (void *) aq->cuda_stream;
 }
 
@@ -1696,6 +1717,7 @@ GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 int
 GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_cuda_set_stream\n");
   if (aq->cuda_stream)
     {
       CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
@@ -1721,6 +1743,7 @@ nvptx_goacc_asyncqueue_construct (unsigned int flags)
 struct goacc_asyncqueue *
 GOMP_OFFLOAD_openacc_async_construct (int device __attribute__((unused)))
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_construct\n");
   return nvptx_goacc_asyncqueue_construct (CU_STREAM_DEFAULT);
 }
 
@@ -1735,12 +1758,14 @@ nvptx_goacc_asyncqueue_destruct (struct goacc_asyncqueue *aq)
 bool
 GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_destruct\n");
   return nvptx_goacc_asyncqueue_destruct (aq);
 }
 
 int
 GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_test\n");
   CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, aq->cuda_stream);
   if (r == CUDA_SUCCESS)
     return 1;
@@ -1761,6 +1786,7 @@ nvptx_goacc_asyncqueue_synchronize (struct goacc_asyncqueue *aq)
 bool
 GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_synchronize\n");
   return nvptx_goacc_asyncqueue_synchronize (aq);
 }
 
@@ -1768,6 +1794,7 @@ bool
 GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
 				      struct goacc_asyncqueue *aq2)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_serialize\n");
   CUevent e;
   CUDA_CALL_ERET (false, cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
   CUDA_CALL_ERET (false, cuEventRecord, e, aq1->cuda_stream);
@@ -1790,6 +1817,7 @@ GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
 					   void (*callback_fn)(void *),
 					   void *userptr)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_queue_callback\n");
   struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
   b->fn = callback_fn;
   b->ptr = userptr;
@@ -1837,6 +1865,7 @@ cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
 bool
 GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_host2dev\n");
   if (!nvptx_attach_host_thread_to_device (ord)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
@@ -1847,6 +1876,7 @@ GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_dev2host\n");
   if (!nvptx_attach_host_thread_to_device (ord)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
@@ -1857,6 +1887,7 @@ GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
 bool
 GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_dev2dev\n");
   CUDA_CALL (cuMemcpyDtoDAsync, (CUdeviceptr) dst, (CUdeviceptr) src, n, NULL);
   return true;
 }
@@ -1868,6 +1899,7 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
 		       const void *src, size_t src_offset1_size,
 		       size_t src_offset0_len, size_t src_dim1_size)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_memcpy2d\n");
   if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
     return false;
 
@@ -1960,6 +1992,7 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
 		       size_t src_offset0_len, size_t src_dim2_size,
 		       size_t src_dim1_len)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_memcpy3d\n");
   if (!nvptx_attach_host_thread_to_device (src_ord != -1 ? src_ord : dst_ord))
     return false;
 
@@ -2050,6 +2083,7 @@ bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_host2dev\n");
   if (!nvptx_attach_host_thread_to_device (ord)
       || !cuda_memcpy_sanity_check (src, dst, n))
     return false;
@@ -2061,6 +2095,7 @@ bool
 GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
 				     size_t n, struct goacc_asyncqueue *aq)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_async_dev2host\n");
   if (!nvptx_attach_host_thread_to_device (ord)
       || !cuda_memcpy_sanity_check (dst, src, n))
     return false;
@@ -2071,6 +2106,7 @@ GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
 union goacc_property_value
 GOMP_OFFLOAD_openacc_get_property (int n, enum goacc_property prop)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_openacc_get_property\n");
   union goacc_property_value propval = { .val = 0 };
 
   pthread_mutex_lock (&ptx_dev_lock);
@@ -2211,6 +2247,7 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 void
 GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
 {
+__builtin_fprintf (stderr, "DEBUG GOMP_OFFLOAD_run\n");
   struct targ_fn_descriptor *tgt_fn_desc
     = (struct targ_fn_descriptor *) tgt_fn;
   CUfunction function = tgt_fn_desc->fn;
diff --git a/libgomp/target.c b/libgomp/target.c
index 1367e9cce6c..f758b20ba4c 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -2524,6 +2524,8 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
       node = splay_tree_lookup (&devicep->mem_map, &k);
     }
 
+__builtin_fprintf(stderr, "DEBUG: gomp_unload_image_from_device\n");
+
   if (!devicep->unload_image_func (devicep->target_id, version, target_data))
     {
       gomp_mutex_unlock (&devicep->lock);
@@ -2698,12 +2700,14 @@ GOMP_offload_unregister_ver (unsigned version, const void *host_table,
     target_data = data;
 
   gomp_mutex_lock (&register_lock);
+__builtin_fprintf(stderr, "DEBUG: GOMP_offload_unregister_ver\n");
 
   /* Unload image from all initialized devices.  */
   for (i = 0; i < num_devices; i++)
     {
       struct gomp_device_descr *devicep = &devices[i];
       gomp_mutex_lock (&devicep->lock);
+__builtin_fprintf(stderr, "DEBUG: GOMP_offload_unregister_ver dev=%d; state=%d\n", i, devicep->state);
       if (devicep->type == target_type
 	  && devicep->state == GOMP_DEVICE_INITIALIZED)
 	gomp_unload_image_from_device (devicep, version,
@@ -2775,6 +2779,7 @@ gomp_fini_device (struct gomp_device_descr *devicep)
 attribute_hidden void
 gomp_unload_device (struct gomp_device_descr *devicep)
 {
+__builtin_fprintf(stderr, "DEBUG: gomp_unload_device; state=%d\n", devicep->state);
   if (devicep->state == GOMP_DEVICE_INITIALIZED)
     {
       unsigned i;
@@ -5217,6 +5222,7 @@ gomp_target_fini (void)
       bool ret = true;
       struct gomp_device_descr *devicep = &devices[i];
       gomp_mutex_lock (&devicep->lock);
+__builtin_fprintf(stderr, "DEBUG: gomp_target_fini; dev=%d, state=%d\n", i, devicep->state);
       if (devicep->state == GOMP_DEVICE_INITIALIZED)
 	ret = gomp_fini_device (devicep);
       gomp_mutex_unlock (&devicep->lock);

end of thread, other threads:[~2024-01-29 17:07 UTC | newest]

Thread overview: 4+ messages
2024-01-22 19:45 [patch] plugin/plugin-nvptx.c: Fix fini_device call when already shutdown [PR113513] Tobias Burnus
2024-01-23  9:55 ` [v2][patch] " Tobias Burnus
2024-01-29 15:53   ` Thomas Schwinge
2024-01-29 17:07     ` Tobias Burnus
