From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=+pMA=FQ=mentor.com=Julian_Brown@sourceware.org>
Received: from esa1.mentor.iphmx.com (esa1.mentor.iphmx.com [68.232.129.153])
	by sourceware.org (Postfix) with ESMTPS id C06913858D33;
	Mon,  2 Oct 2023 14:54:08 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C06913858D33
Authentication-Results: sourceware.org; dmarc=none (p=none dis=none) header.from=codesourcery.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=mentor.com
X-CSE-ConnectionGUID: wzKqdq3mSxKP/KMFsDoAMQ==
X-CSE-MsgGUID: 4CvRqQHHSiGnwILMT8sTKg==
X-IronPort-AV: E=Sophos;i="6.03,194,1694764800"; 
   d="diff'?scan'208";a="20741472"
Received: from orw-gwy-01-in.mentorg.com ([192.94.38.165])
  by esa1.mentor.iphmx.com with ESMTP; 02 Oct 2023 06:54:06 -0800
IronPort-SDR: 1+vCPyl8LvgnVk8voPVcTP7a0K4d0ybKQXBVoS5j03HalqTBZ7wRd3FO+xuAhn50qbShN7hsRb
 XKR+4hSN7sY7yU+w1iR5718c6Nwh61G1gAgtcr42wh9PgSyNJKwB57voEVSgzzt8TqjMtzcHlO
 aT8WIopx97aYXk7t/DU6xooHtU6U9m+FNWd+nA1wMthPYJG7IDv3jEUQX+uiRgVdH6+uL4Y2gX
 HTd2gFVLvVrkIuxGfxErXqH1aAhsq44UkYXDWXTZ0le9ool9FtOtaDkOYtRpf8sOUyog+lESP1
 leo=
Date: Mon, 2 Oct 2023 15:53:59 +0100
From: Julian Brown <julian@codesourcery.com>
To: Thomas Schwinge <thomas@codesourcery.com>
CC: <gcc-patches@gcc.gnu.org>, <fortran@gcc.gnu.org>,
	<tobias@codesourcery.com>, <jakub@redhat.com>, Tom de Vries
	<tdevries@suse.de>
Subject: Re: [PATCH 1/5] OpenMP, NVPTX: memcpy[23]D bias correction
Message-ID: <20231002155359.3a44a582@squid.athome>
In-Reply-To: <87sf704k5l.fsf@euler.schwinge.homeip.net>
References: <cover.1693991758.git.julian@codesourcery.com>
	<c83c9f9f05bf5577eeaf3633c5c2e494ac0a11fd.1693991759.git.julian@codesourcery.com>
	<87sf704k5l.fsf@euler.schwinge.homeip.net>
Organization: Siemens Embedded
X-Mailer: Claws Mail 4.1.1git78 (GTK 3.24.38; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="MP_/yu4Agj1pYA2V1tP8mUj2SRM"
X-Originating-IP: [137.202.0.90]
X-ClientProxiedBy: svr-ies-mbx-14.mgc.mentorg.com (139.181.222.14) To
 svr-ies-mbx-11.mgc.mentorg.com (139.181.222.11)
X-Spam-Status: No, score=-11.8 required=5.0 tests=BAYES_00,GIT_PATCH_0,HEADER_FROM_DIFFERENT_DOMAINS,KAM_DMARC_STATUS,SPF_HELO_PASS,SPF_PASS,TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <fortran.gcc.gnu.org>

--MP_/yu4Agj1pYA2V1tP8mUj2SRM
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Wed, 27 Sep 2023 00:57:58 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> On 2023-09-06T02:34:30-0700, Julian Brown <julian@codesourcery.com>
> wrote:
> > This patch works around behaviour of the 2D and 3D memcpy
> > operations in the CUDA driver runtime.  Particularly in Fortran,
> > the "base pointer" of an array (used for either source or
> > destination of a host/device copy) may lie outside of data that is
> > actually stored on the device.  The fix is to make sure that we use
> > the first element of data to be transferred instead, and adjust
> > parameters accordingly.  
> 
> Do you (a) have a stand-alone test case for this (that is, not
> depending on your other pending patches, so that this could go in
> directly -- together with the before-FAIL test case).

Thanks for the reply! Here's a version with a stand-alone test case.

> Do you (b)
> know if is this a bug in our use of the CUDA Driver API or rather in
> CUDA itself?  If the latter, have you reported this to Nvidia?

I don't think the CUDA behaviour is *wrong*, as such.  To the C/C++ way
of thinking (or indeed a graphics-oriented one), an array normally has a
zero-based origin, and these 2D/3D memory copies are presumably intended
as a way of updating just a part of an array (or texture) that has full
duplicate copies on both the host and device.  Our use case just happens
to be a bit different, both because Fortran (internally) represents an
array by a zero-based origin but may use 1-based (or whatever-based)
indices, and because we support partial mappings of host arrays on the
device in all three supported languages -- which amounts to much the
same thing, actually.

That said, it *could* be fixed in CUDA, though probably not in all the
versions currently deployed out there in the world.  So I guess we'd
still need a patch like this anyway.
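
For illustration, the adjustment in the patch amounts to something like
the sketch below.  It is plain C with no CUDA involved: "struct rect2d",
"fold_offsets_2d" and "element_addr" are made-up names standing in for
the relevant CUDA_MEMCPY2D fields and for the address arithmetic, and
are not part of the plugin or of the CUDA API.  Addresses are kept as
uintptr_t here on the assumption that this is the simplest way to model
a base that lies outside the mapped data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one side (source or destination) of a 2D
   copy descriptor: a base address plus a row/column offset.  */
struct rect2d
{
  uintptr_t base;     /* Host or device address, as an integer.  */
  size_t x_in_bytes;  /* Offset within a row, in bytes.  */
  size_t y;           /* Offset in rows.  */
  size_t pitch;       /* Bytes per row.  */
};

/* Move the origin to the first byte that is actually transferred and
   zero the offsets, so the base itself points into the stored data.  */
static void
fold_offsets_2d (struct rect2d *r)
{
  r->base += r->y * r->pitch + r->x_in_bytes;
  r->x_in_bytes = 0;
  r->y = 0;
}

/* Address of byte COL_BYTES in row ROW of the region being copied.  */
static uintptr_t
element_addr (const struct rect2d *r, size_t row, size_t col_bytes)
{
  return r->base + (r->y + row) * r->pitch + r->x_in_bytes + col_bytes;
}
```

Every address computed by element_addr is the same before and after
fold_offsets_2d; only the base changes.  The 3D case in the patch has
the same shape, with (srcZ * srcHeight + srcY) * srcPitch + srcXInBytes
as the folded byte offset.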

Julian

--MP_/yu4Agj1pYA2V1tP8mUj2SRM
Content-Type: text/x-patch
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="cuda-memcpyxd-bias-2.diff"

commit f6fd3ad060bbe5c57661cd861d009dbc2b415778
Author: Julian Brown <julian@codesourcery.com>
Date:   Wed Aug 23 23:46:29 2023 +0000

    OpenMP, NVPTX: memcpy[23]D bias correction
    
    This patch works around behaviour of the 2D and 3D memcpy operations in
    the CUDA driver runtime.  Particularly in Fortran, the "base pointer"
    of an array (used for either source or destination of a host/device copy)
    may lie outside of data that is actually stored on the device.  The fix
    is to make sure that we use the first element of data to be transferred
    instead, and adjust parameters accordingly.
    
    2023-10-02  Julian Brown  <julian@codesourcery.com>
    
    libgomp/
            * plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d): Adjust parameters to
            avoid out-of-bounds array checks in CUDA runtime.
            (GOMP_OFFLOAD_memcpy3d): Likewise.
            * testsuite/libgomp.c-c++-common/memcpyxd-bias-1.c: New test.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 00d4241ae02..cefe288a8aa 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1827,6 +1827,35 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
   data.srcXInBytes = src_offset1_size;
   data.srcY = src_offset0_len;
 
+  if (data.srcXInBytes != 0 || data.srcY != 0)
+    {
+      /* Adjust origin to the actual array data, else the CUDA 2D memory
+	 copy API calls below may fail to validate source/dest pointers
+	 correctly (especially for Fortran where the "virtual origin" of an
+	 array is often outside the stored data).  */
+      if (src_ord == -1)
+	data.srcHost = (const void *) ((const char *) data.srcHost
+				      + data.srcY * data.srcPitch
+				      + data.srcXInBytes);
+      else
+	data.srcDevice += data.srcY * data.srcPitch + data.srcXInBytes;
+      data.srcXInBytes = 0;
+      data.srcY = 0;
+    }
+
+  if (data.dstXInBytes != 0 || data.dstY != 0)
+    {
+      /* As above.  */
+      if (dst_ord == -1)
+	data.dstHost = (void *) ((char *) data.dstHost
+				 + data.dstY * data.dstPitch
+				 + data.dstXInBytes);
+      else
+	data.dstDevice += data.dstY * data.dstPitch + data.dstXInBytes;
+      data.dstXInBytes = 0;
+      data.dstY = 0;
+    }
+
   CUresult res = CUDA_CALL_NOCHECK (cuMemcpy2D, &data);
   if (res == CUDA_ERROR_INVALID_VALUE)
     /* If pitch > CU_DEVICE_ATTRIBUTE_MAX_PITCH or for device-to-device
@@ -1895,6 +1924,44 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   data.srcY = src_offset1_len;
   data.srcZ = src_offset0_len;
 
+  if (data.srcXInBytes != 0 || data.srcY != 0 || data.srcZ != 0)
+    {
+      /* Adjust origin to the actual array data, else the CUDA 3D memory
+	 copy API call below may fail to validate source/dest pointers
+	 correctly (especially for Fortran where the "virtual origin" of an
+	 array is often outside the stored data).  */
+      if (src_ord == -1)
+	data.srcHost
+	  = (const void *) ((const char *) data.srcHost
+			    + (data.srcZ * data.srcHeight + data.srcY)
+			      * data.srcPitch
+			    + data.srcXInBytes);
+      else
+	data.srcDevice
+	  += (data.srcZ * data.srcHeight + data.srcY) * data.srcPitch
+	     + data.srcXInBytes;
+      data.srcXInBytes = 0;
+      data.srcY = 0;
+      data.srcZ = 0;
+    }
+
+  if (data.dstXInBytes != 0 || data.dstY != 0 || data.dstZ != 0)
+    {
+      /* As above.  */
+      if (dst_ord == -1)
+	data.dstHost = (void *) ((char *) data.dstHost
+				 + (data.dstZ * data.dstHeight + data.dstY)
+				   * data.dstPitch
+				 + data.dstXInBytes);
+      else
+	data.dstDevice
+	  += (data.dstZ * data.dstHeight + data.dstY) * data.dstPitch
+	     + data.dstXInBytes;
+      data.dstXInBytes = 0;
+      data.dstY = 0;
+      data.dstZ = 0;
+    }
+
   CUDA_CALL (cuMemcpy3D, &data);
   return true;
 }
diff --git a/libgomp/testsuite/libgomp.c-c++-common/memcpyxd-bias-1.c b/libgomp/testsuite/libgomp.c-c++-common/memcpyxd-bias-1.c
new file mode 100644
index 00000000000..6aa7b3d614f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.c-c++-common/memcpyxd-bias-1.c
@@ -0,0 +1,61 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <assert.h>
+#include <omp.h>
+
+/* Say this is N rows and M columns.  */
+#define N 1024
+#define M 2048
+
+#define row_offset 256
+#define row_length 512
+#define col_offset 128
+#define col_length 384
+
+int
+main ()
+{
+  int *arr2d = (int *) calloc (N * M, sizeof (int));
+  uintptr_t dstptr;
+  int hostdev = omp_get_initial_device ();
+  int targdev;
+
+#pragma omp target enter data map(to: arr2d[col_offset*M:col_length*M])
+
+#pragma omp target map(from: targdev, dstptr) \
+		   map(present, tofrom: arr2d[col_offset*M:col_length*M])
+  {
+    for (int j = col_offset; j < col_offset + col_length; j++)
+      for (int i = row_offset; i < row_offset + row_length; i++)
+	arr2d[j * M + i]++;
+    targdev = omp_get_device_num ();
+    dstptr = (uintptr_t) arr2d;
+  }
+
+  /* Copy rectangular block back to the host.  */
+  {
+    size_t volume[2] = { col_length, row_length };
+    size_t offsets[2] = { col_offset, row_offset };
+    size_t dimensions[2] = { N, M };
+    omp_target_memcpy_rect ((void *) arr2d, (const void *) dstptr,
+			    sizeof (int), 2, &volume[0], &offsets[0],
+			    &offsets[0], &dimensions[0], &dimensions[0],
+			    hostdev, targdev);
+  }
+
+#pragma omp target exit data map(release: arr2d[col_offset*M:col_length*M])
+
+  for (int j = 0; j < N; j++)
+    for (int i = 0; i < M; i++)
+      if (i >= row_offset && i < row_offset + row_length
+	  && j >= col_offset && j < col_offset + col_length)
+	assert (arr2d[j * M + i] == 1);
+      else
+	assert (arr2d[j * M + i] == 0);
+
+  free (arr2d);
+
+  return 0;
+}

--MP_/yu4Agj1pYA2V1tP8mUj2SRM--