public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH, nvptx, libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free()
@ 2020-08-20 13:03 Chung-Lin Tang
  2020-08-20 13:33 ` Tom de Vries
  0 siblings, 1 reply; 3+ messages in thread
From: Chung-Lin Tang @ 2020-08-20 13:03 UTC (permalink / raw)
  To: gcc-patches, Tom de Vries; +Cc: Tobias Burnus, Catherine Moore

[-- Attachment #1: Type: text/plain, Size: 1984 bytes --]

Hi Tom,
this patch adjusts nvptx_free() in libgomp/plugin/plugin-nvptx.c to avoid a
"GOMP_PLUGIN_acc_thread() == NULL" check that was causing problems under
OpenMP offloading.

This check was originally used to determine if nvptx_free() was running under
CUDA callback context, when freeing resources from an OpenACC asynchronous compute
region. Since CUDA API calls are not allowed inside callback context, we have
to save the freed block to ptx_dev->free_blocks, and cuMemFree it later.

The check to see if GOMP_PLUGIN_acc_thread() exists to determine normal host thread
vs. callback thread worked under -fopenacc, but since the OpenACC per-thread data
is not created under -fopenmp, and always returned NULL, we have a leak situation
where OpenMP offloading kept accumulating freed device memory blocks until failing;
nvptx_free() never reaches the part where cuMemFree() is actually called.

I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is returned
for such CUDA calls inside callback context, and it appears to be enough to replace
the current check, so the new code sees if this error is returned on the first
cuMemGetAddressRange call to determine callback context. This should now work
for both OpenACC/OpenMP.

(Tobias, Catherine, the earlier internal patch to re-organize this callback context
checking did not work in general, since OpenACC also uses the .queue_callback
functionality to free the struct target_mem_desc asynchronously, so in general we
have to ensure nvptx_free() could be used under both normal/callback context)

This patch has been libgomp tested for x86_64-linux with nvptx offloading without
regressions, and should be applied for mainline and GCC10. Is this okay?

Thanks,
Chung-Lin

2020-08-20  Chung-Lin Tang  <cltang@codesourcery.com>

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_free): Change "GOMP_PLUGIN_acc_thread () == NULL"
	test into check of CUDA_ERROR_NOT_PERMITTED status for cuMemGetAddressRange.
	Adjust comments.

[-- Attachment #2: m.diff --]
[-- Type: text/plain, Size: 1472 bytes --]

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ec103a2f40b..188a34f1d04 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1038,27 +1038,34 @@ goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p)
 }
 
 static bool
 nvptx_free (void *p, struct ptx_device *ptx_dev)
 {
-  /* Assume callback context if this is null.  */
-  if (GOMP_PLUGIN_acc_thread () == NULL)
+  CUdeviceptr pb;
+  size_t ps;
+
+  CUresult r = CUDA_CALL_NOCHECK (cuMemGetAddressRange, &pb, &ps,
+				  (CUdeviceptr) p);
+  if (r == CUDA_ERROR_NOT_PERMITTED)
     {
+      /* We assume that this error indicates we are in a CUDA callback context,
+	 where all CUDA calls are not allowed. Arrange to free this piece of
+	 device memory later.  */
       struct ptx_free_block *n
 	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
       n->ptr = p;
       pthread_mutex_lock (&ptx_dev->free_blocks_lock);
       n->next = ptx_dev->free_blocks;
       ptx_dev->free_blocks = n;
       pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
       return true;
     }
-
-  CUdeviceptr pb;
-  size_t ps;
-
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p);
+  else if (r != CUDA_SUCCESS)
+    {
+      GOMP_PLUGIN_error ("cuMemGetAddressRange error: %s", cuda_error (r));
+      return false;
+    }
   if ((CUdeviceptr) p != pb)
     {
       GOMP_PLUGIN_error ("invalid device address");
       return false;
     }

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH, nvptx, libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free()
  2020-08-20 13:03 [PATCH, nvptx, libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free() Chung-Lin Tang
@ 2020-08-20 13:33 ` Tom de Vries
  2020-08-20 14:32   ` Chung-Lin Tang
  0 siblings, 1 reply; 3+ messages in thread
From: Tom de Vries @ 2020-08-20 13:33 UTC (permalink / raw)
  To: Chung-Lin Tang, gcc-patches; +Cc: Tobias Burnus, Catherine Moore

On 8/20/20 3:03 PM, Chung-Lin Tang wrote:
> Hi Tom,
> this patch adjusts nvptx_free() in libgomp/plugin/plugin-nvptx.c to avoid a
> "GOMP_PLUGIN_acc_thread() == NULL" check that was causing problems under
> OpenMP offloading.
> 
> This check was originally used to determine if nvptx_free() was running
> under
> CUDA callback context, when freeing resources from an OpenACC
> asynchronous compute
> region. Since CUDA API calls are not allowed inside callback context, we
> have
> to save the freed block to ptx_dev->free_blocks, and cuMemFree it later.
> 
> The check to see if GOMP_PLUGIN_acc_thread() exists to determine normal
> host thread
> vs. callback thread worked under -fopenacc, but since the OpenACC
> per-thread data
> is not created under -fopenmp, and always returned NULL, we have a leak
> situation
> where OpenMP offloading kept accumulating freed device memory blocks
> until failing;
> nvptx_free() never reaches the part where cuMemFree() is actually called.
> 
> I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is
> returned
> for such CUDA calls inside callback context,

Right, that's "Callbacks must not make any CUDA API calls. Attempting to
use a CUDA API will result in CUDA_ERROR_NOT_PERMITTED" at
cuStreamAddCallback.  Perhaps mention more precisely where you found this.

> and it appears to be enough
> to replace
> the current check, so the new code sees if this error is returned on the
> first
> cuMemGetAddressRange call to determine callback context. This should now
> work
> for both OpenACC/OpenMP.
> 
> (Tobias, Catherine, the earlier internal patch to re-organize this
> callback context
> checking did not work in general, since OpenACC also uses the
> .queue_callback
> functionality to free the struct target_mem_desc asynchronously, so in
> general we
> have to ensure nvptx_free() could be used under both normal/callback
> context)
> 
> This patch has been libgomp tested for x86_64-linux with nvptx
> offloading without
> regressions, and should be applied for mainline and GCC10. Is this okay?
> 

Ok, thanks for fixing this.

- Tom

> Thanks,
> Chung-Lin
> 
> 2020-08-20  Chung-Lin Tang  <cltang@codesourcery.com>
> 
>     libgomp/
>     * plugin/plugin-nvptx.c (nvptx_free): Change "GOMP_PLUGIN_acc_thread
> () == NULL"
>     test into check of CUDA_ERROR_NOT_PERMITTED status for
> cuMemGetAddressRange.
>     Adjust comments.


> diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
> index ec103a2f40b..188a34f1d04 100644
> --- a/libgomp/plugin/plugin-nvptx.c
> +++ b/libgomp/plugin/plugin-nvptx.c
> @@ -1038,27 +1038,34 @@ goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p)
>  }
>  
>  static bool
>  nvptx_free (void *p, struct ptx_device *ptx_dev)
>  {
> -  /* Assume callback context if this is null.  */
> -  if (GOMP_PLUGIN_acc_thread () == NULL)
> +  CUdeviceptr pb;
> +  size_t ps;
> +
> +  CUresult r = CUDA_CALL_NOCHECK (cuMemGetAddressRange, &pb, &ps,
> +				  (CUdeviceptr) p);
> +  if (r == CUDA_ERROR_NOT_PERMITTED)
>      {
> +      /* We assume that this error indicates we are in a CUDA callback context,
> +	 where all CUDA calls are not allowed. Arrange to free this piece of
> +	 device memory later.  */
>        struct ptx_free_block *n
>  	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
>        n->ptr = p;
>        pthread_mutex_lock (&ptx_dev->free_blocks_lock);
>        n->next = ptx_dev->free_blocks;
>        ptx_dev->free_blocks = n;
>        pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
>        return true;
>      }
> -
> -  CUdeviceptr pb;
> -  size_t ps;
> -
> -  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) p);
> +  else if (r != CUDA_SUCCESS)
> +    {
> +      GOMP_PLUGIN_error ("cuMemGetAddressRange error: %s", cuda_error (r));
> +      return false;
> +    }
>    if ((CUdeviceptr) p != pb)
>      {
>        GOMP_PLUGIN_error ("invalid device address");
>        return false;
>      }

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH, nvptx, libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free()
  2020-08-20 13:33 ` Tom de Vries
@ 2020-08-20 14:32   ` Chung-Lin Tang
  0 siblings, 0 replies; 3+ messages in thread
From: Chung-Lin Tang @ 2020-08-20 14:32 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches; +Cc: Tobias Burnus, Catherine Moore

Hi Tom,
thanks for the extremely quick review :)

On 2020/8/20 9:33 PM, Tom de Vries wrote:
>> I reviewed the CUDA API docs and see that CUDA_ERROR_NOT_PERMITTED is
>> returned
>> for such CUDA calls inside callback context,
> Right, that's "Callbacks must not make any CUDA API calls. Attempting to
> use a CUDA API will result in CUDA_ERROR_NOT_PERMITTED" at
> cuStreamAddCallback.  Perhaps mention more precisely where you found this.

I've added a little bit more in the comments in the final patch pushed.

>> and it appears to be enough
>> to replace
>> the current check, so the new code sees if this error is returned on the
>> first
>> cuMemGetAddressRange call to determine callback context. This should now
>> work
>> for both OpenACC/OpenMP.
>>
>> (Tobias, Catherine, the earlier internal patch to re-organize this
>> callback context
>> checking did not work in general, since OpenACC also uses the
>> .queue_callback
>> functionality to free the struct target_mem_desc asynchronously, so in
>> general we
>> have to ensure nvptx_free() could be used under both normal/callback
>> context)
>>
>> This patch has been libgomp tested for x86_64-linux with nvptx
>> offloading without
>> regressions, and should be applied for mainline and GCC10. Is this okay?
>>
> Ok, thanks for fixing this.

Just pushed to master, releases/gcc-10, and devel/omp/gcc-10.

Chung-Lin

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-08-20 14:32 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-20 13:03 [PATCH, nvptx, libgomp] Avoid use of GOMP_PLUGIN_acc_thread() in nvptx_free() Chung-Lin Tang
2020-08-20 13:33 ` Tom de Vries
2020-08-20 14:32   ` Chung-Lin Tang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).