[patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

From: Tobias Burnus <tobias@codesourcery.com>
To: Thomas Schwinge <thomas@codesourcery.com>, <gcc-patches@gcc.gnu.org>
Cc: Jakub Jelinek <jakub@redhat.com>, Julian Brown <julian@codesourcery.com>
Subject: [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)
Date: Fri, 28 Jul 2023 13:51:41 +0200	[thread overview]
Message-ID: <12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com> (raw)
In-Reply-To: <87cz0ddq8d.fsf@euler.schwinge.homeip.net>

[-- Attachment #1: Type: text/plain, Size: 7464 bytes --]

Hi Thomas,

thanks for proof reading and the suggestions! – Do have comments to the
attached patch?

* * *

Crossref: For further optimizations, see also

https://gcc.gnu.org/PR101581 — [OpenMP] omp_target_memcpy – support
inter-device memcpy
https://gcc.gnu.org/PR110813 — [OpenMP] omp_target_memcpy_rect (+
strided 'target update'): Improve GCN performance and contiguous subranges

and just added based on Thomas' comment:

https://gcc.gnu.org/PR107424 — [OpenMP] Check whether device locking is
really needed for bare memcopy to/from devices (omp_target_memcpy...)

* * *

On 27.07.23 23:00, Thomas Schwinge wrote:
>> +++ b/include/cuda/cuda.h
> I note that you're not actually using everything you're adding here.
> (..., but I understand you're simply adding everying that relates to
> these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.)

Yes. That was on purpose to make it easier to pick something when needed
– especially as we might want to use some of those later on.

For symmetry, I now also added cuMemcpyPeer + ...Async, which also
remain unused. (But could be used as part of the PRs linked above.)

>> +  const void *dstHost;
> That last one isn't 'const'.  ;-)
Fixed - three times.
> A 'cuda.h' that I looked at calls that last one 'reserved0', with comment
> "Must be NULL".
Seems to be unused in real world code and in the documentation. But
let's use this name as it might be exposed in the wild.
>> --- a/libgomp/libgomp-plugin.h
>> +++ b/libgomp/libgomp-plugin.h
>> +extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t,
>> +                               void*, size_t, size_t, size_t,
>> +                               const void*, size_t, size_t, size_t);
>> +extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t, void *,
>> +                               size_t, size_t, size_t, size_t, size_t,
>> +                               const void *, size_t, size_t, size_t, size_t,
>> +                               size_t);
> Oh, wow.  ;-)

Maybe this is not the best ABI. We can consider to modify it before the
GCC 14 release. (And in principle also afterwards, given that libgomp
and its plugins should™ be compiled and installed alongside.)

I think once we know how to implement GCN, we will see whether it was
done smartly or whether other arguments should be used or whether the
two functions should be combined.

[Regarding the reserve0/reserve1 values for cuMemcpy3D and whether it
should be NULL or not; quoting the usage in plugin-nvptx.c:]

> I note that this doesn't adhere to the two "Must be NULL" remarks from
> above -- but I'm confused, because, for example, on
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4b5238975579f002c0199a3800ca44df
> "cuMemcpy3D", there also are no such remarks.  (... in contast to:
> "srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set to 0",
> which you do set accordingly.)
>
> Maybe just 'memset' the whole thing to '0', for good measure?
OK - but then we can remove the LOD init.
>> +      else if (src_devicep != NULL
>> +            && (dst_devicep == NULL
>> +                || (dst_devicep->capabilities
>> +                    & GOMP_OFFLOAD_CAP_SHARED_MEM)))
> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears out
> the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?

I have now undone this change – I did not dig deep enough into the
function calls.


>> +      else if (dst_devicep == NULL && src_devicep == NULL)
>> +     {
>> +       memcpy ((char *) dst + dst_off, (const char *) src + src_off,
>> +               length);
>> +       ret = 1;
>> +     }
>>         else if (src_devicep == dst_devicep)
>>        ret = src_devicep->dev2dev_func (src_devicep->target_id,
>>                                         (char *) dst + dst_off,
>>                                         (const char *) src + src_off,
>>                                         length);
> ..., but also left the intra-device case here -- which should now be dead
> code here?

Why? Unless I missed something, the old, the current, and the proposed
(= old) code do still run this code.

I have not added an assert to confirm, but in any case, it is tested for
in my recently added testcase - thus, we could add a 'printf' to confirm.

>> +       else if (*tmp_size < length)
>> +         {
>> +           *tmp_size = length;
>> +           *tmp = realloc (*tmp, length);
>> +           if (*tmp == NULL)
>> +             return ENOMEM;
> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?
>
> Do we really need here the property here that if the re-allocation can't
> be done in-place, 'realloc' copies the original content to the new?  In
> other words, should we just unconditionally 'free' and re-'malloc' here,
> instead of 'realloc'?
I have now done so – but I am not really sure what's faster on average.
If it can be enlarged, 'realloc' is faster, if it cannot free+malloc is
better.
> I haven't looked whether the re-use of 'tmp' for multiple calls to this
> is then actually useful, or whether we should just always 'malloc', use,
> 'free' the buffer here?

Well, it can run in a hot loop – assume a C-array array[1024][1024][2]
and copying array[:1024,:1024,0:1] (using OpenMP syntax) – i.e. 1048576
times every other element. And therefore I would like to avoid repeated
malloc/free in such a case. (But in general, interdevice copying should
be very rare.)

Actually, I think the realloc case is unreachable: for rectangular
copies, as implied both by 'target update' with strided access and by
'omp_target_memcpy_rect', the size should be constant.

> (For later...)  Is that what
> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a
> "cuMemcpyPeer" would be used for?

I concur that we could/should use cuMemcpyPeer and cuMemcpy3DPeer for
omp_target and ..._rect. I added a comment to the respective PRs (linked
at the top).

>> +  lock_src = (src_devicep
>> +           && (!dst_devicep
>> +               || src_devicep == dst_devicep
>> +               || !(src_devicep->capabilities
>> +                    & GOMP_OFFLOAD_CAP_SHARED_MEM)));
> Similar doubt than above re "'GOMP_OFFLOAD_CAP_SHARED_MEM' actually
> reachable"?
Updated in the attach patch + filed a PR about the need of the lock (see
link at the top).
>> +  DLSYM (memcpy2d);
>> +  DLSYM (memcpy3d);
> With 'DLSYM' used here, won't that fail if these symbols don't actually
> exist (like for 'libgomp/plugin/plugin-gcn.c')?

Hmm, that should be: DLSYM_OPT, but somehow testing on GCN did not show
any fails. (Neither here nor in the testsuite; odd.)

I intent to commit the attached follow-up patch very soon.

[It has been tested without offloading support compiled in, with AMD GCN
offloading (single GPU, no memcpy[23]d available), and on a 2-GPU system
with nvptx offloading.]

Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: fix-cuMemcy-change.diff --]
[-- Type: text/x-patch, Size: 8034 bytes --]

libgomp: cuda.h and omp_target_memcpy_rect cleanup

Fixes for commit r14-2792-g25072a477a56a727b369bf9b20f4d18198ff5894
"OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect",
namely:

In that commit, the code was changed to handle shared-memory devices;
however, as pointed out, omp_target_memcpy_check already set the pointer
to NULL in that case.  Hence, this commit reverts to the prior version.

In cuda.h, it adds cuMemcpyPeer{,Async} for symmetry for cuMemcpy3DPeer
(all currently unused) and in three structs, fixes reserved-member names
and remove a bogus 'const' in three structs.

And it changes a DLSYM to DLSYM_OPT as not all plugins support the new
functions, yet.

include/ChangeLog:

	* cuda/cuda.h (CUDA_MEMCPY2D, CUDA_MEMCPY3D, CUDA_MEMCPY3D_PEER):
	Remove bogus 'const' from 'const void *dst' and fix reserved-name
	name in those structs.
	(cuMemcpyPeer, cuMemcpyPeerAsync): Add.

libgomp/ChangeLog:

	* target.c (omp_target_memcpy_rect_worker): Undo dim=1 change for
	GOMP_OFFLOAD_CAP_SHARED_MEM.
	(omp_target_memcpy_rect_copy): Likewise for lock condition.
	(gomp_load_plugin_for_device): Use DLSYM_OPT not DLSYM for
	memcpy3d/memcpy2d.
        * plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
	GOMP_OFFLOAD_memcpy3d): Use memset 0 to nullify reserved and
	unused src/dst fields for that mem type; remove '{src,dst}LOD = 0'.

Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>

 include/cuda/cuda.h           | 12 +++++-----
 libgomp/plugin/plugin-nvptx.c |  6 +++--
 libgomp/target.c              | 52 ++++++++++++++-----------------------------
 3 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 09c3c2b8dbe..94fc64a488d 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -147,7 +147,7 @@ typedef struct {
 
   size_t dstXInBytes, dstY;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
   size_t dstPitch;
@@ -162,16 +162,16 @@ typedef struct {
   const void *srcHost;
   CUdeviceptr srcDevice;
   CUarray srcArray;
-  void *dummy;
+  void *reserved0;
   size_t srcPitch, srcHeight;
 
   size_t dstXInBytes, dstY, dstZ;
   size_t dstLOD;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
-  void *dummy2;
+  void *reserved1;
   size_t dstPitch, dstHeight;
 
   size_t WidthInBytes, Height, Depth;
@@ -190,7 +190,7 @@ typedef struct {
   size_t dstXInBytes, dstY, dstZ;
   size_t dstLOD;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
   CUcontext dstContext;
@@ -246,6 +246,8 @@ CUresult cuMemAlloc (CUdeviceptr *, size_t);
 CUresult cuMemAllocHost (void **, size_t);
 CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
+CUresult cuMemcpyPeer (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t);
+CUresult cuMemcpyPeerAsync (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t, CUstream);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
 #define cuMemcpyDtoH cuMemcpyDtoH_v2
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 9cdc55cac6b..00d4241ae02 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1794,6 +1794,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
 
   CUDA_MEMCPY2D data;
+
+  memset (&data, 0, sizeof (data));
   data.WidthInBytes = dim1_size;
   data.Height = dim0_len;
 
@@ -1855,6 +1857,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
 
   CUDA_MEMCPY3D data;
+
+  memset (&data, 0, sizeof (data));
   data.WidthInBytes = dim2_size;
   data.Height = dim1_len;
   data.Depth = dim0_len;
@@ -1874,7 +1878,6 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   data.dstXInBytes = dst_offset2_size;
   data.dstY = dst_offset1_len;
   data.dstZ = dst_offset0_len;
-  data.dstLOD = 0;
 
   if (src_ord == -1)
     {
@@ -1891,7 +1894,6 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   data.srcXInBytes = src_offset2_size;
   data.srcY = src_offset1_len;
   data.srcZ = src_offset0_len;
-  data.srcLOD = 0;
 
   CUDA_CALL (cuMemcpy3D, &data);
   return true;
diff --git a/libgomp/target.c b/libgomp/target.c
index 5cf2e8dce37..cd4cc1b01ca 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4540,33 +4540,22 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
 	  || __builtin_mul_overflow (element_size, dst_offsets[0], &dst_off)
 	  || __builtin_mul_overflow (element_size, src_offsets[0], &src_off))
 	return EINVAL;
-      if (src_devicep != NULL && src_devicep == dst_devicep)
-	ret = src_devicep->dev2dev_func (src_devicep->target_id,
-					 (char *) dst + dst_off,
-					 (const char *) src + src_off,
-					 length);
-      else if (src_devicep != NULL
-	       && (dst_devicep == NULL
-		   || (dst_devicep->capabilities
-		       & GOMP_OFFLOAD_CAP_SHARED_MEM)))
-	ret = src_devicep->dev2host_func (src_devicep->target_id,
+      if (dst_devicep == NULL && src_devicep == NULL)
+	{
+	  memcpy ((char *) dst + dst_off, (const char *) src + src_off,
+		  length);
+	  ret = 1;
+	}
+      else if (src_devicep == NULL)
+	ret = dst_devicep->host2dev_func (dst_devicep->target_id,
 					  (char *) dst + dst_off,
 					  (const char *) src + src_off,
 					  length);
-      else if (dst_devicep != NULL
-	       && (src_devicep == NULL
-		   || (src_devicep->capabilities
-			& GOMP_OFFLOAD_CAP_SHARED_MEM)))
-	ret = dst_devicep->host2dev_func (dst_devicep->target_id,
+      else if (dst_devicep == NULL)
+	ret = src_devicep->dev2host_func (src_devicep->target_id,
 					  (char *) dst + dst_off,
 					  (const char *) src + src_off,
 					  length);
-      else if (dst_devicep == NULL && src_devicep == NULL)
-	{
-	  memcpy ((char *) dst + dst_off, (const char *) src + src_off,
-		  length);
-	  ret = 1;
-	}
       else if (src_devicep == dst_devicep)
 	ret = src_devicep->dev2dev_func (src_devicep->target_id,
 					 (char *) dst + dst_off,
@@ -4584,7 +4573,8 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
 	  else if (*tmp_size < length)
 	    {
 	      *tmp_size = length;
-	      *tmp = realloc (*tmp, length);
+	      free (*tmp);
+	      *tmp = malloc (length);
 	      if (*tmp == NULL)
 		return ENOMEM;
 	    }
@@ -4599,7 +4589,7 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
       return ret ? 0 : EINVAL;
     }
 
-  /* host->device, device->host and same-device device->device.  */
+  /* host->device, device->host and intra device.  */
   if (num_dims == 2
       && ((src_devicep
 	   && src_devicep == dst_devicep
@@ -4711,16 +4701,8 @@ omp_target_memcpy_rect_copy (void *dst, const void *src,
   bool lock_src;
   bool lock_dst;
 
-  lock_src = (src_devicep
-	      && (!dst_devicep
-		  || src_devicep == dst_devicep
-		  || !(src_devicep->capabilities
-		       & GOMP_OFFLOAD_CAP_SHARED_MEM)));
-  lock_dst = (dst_devicep
-	      && (!lock_src
-		  || (src_devicep != dst_devicep
-		      && !(dst_devicep->capabilities
-			   & GOMP_OFFLOAD_CAP_SHARED_MEM))));
+  lock_src = src_devicep != NULL;
+  lock_dst = dst_devicep != NULL && src_devicep != dst_devicep;
   if (lock_src)
     gomp_mutex_lock (&src_devicep->lock);
   if (lock_dst)
@@ -5076,8 +5058,8 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (free);
   DLSYM (dev2host);
   DLSYM (host2dev);
-  DLSYM (memcpy2d);
-  DLSYM (memcpy3d);
+  DLSYM_OPT (memcpy2d, memcpy2d);
+  DLSYM_OPT (memcpy3d, memcpy3d);
   device->capabilities = device->get_caps_func ();
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {

next prev parent reply	other threads:[~2023-07-28 11:51 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-25 21:45 [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect Tobias Burnus
2023-07-27 21:00 ` Thomas Schwinge
2023-07-28 11:51   ` Tobias Burnus [this message]
2023-07-29 11:27     ` [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect) Tobias Burnus
2023-08-09 20:48     ` Thomas Schwinge

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com \
    --to=tobias@codesourcery.com \
    --cc=gcc-patches@gcc.gnu.org \
    --cc=jakub@redhat.com \
    --cc=julian@codesourcery.com \
    --cc=thomas@codesourcery.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).