From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-patches-return-492522-listarch-gcc-patches=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 56209 invoked by alias); 14 Dec 2018 21:07:00 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Received: (qmail 56193 invoked by uid 89); 14 Dec 2018 21:06:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-26.9 required=5.0 tests=BAYES_00,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=Due, 2811, informational, refusing
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 14 Dec 2018 21:06:54 +0000
Received: from svr-orw-mbx-03.mgc.mentorg.com ([147.34.90.203])	by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1gXufd-0005Ki-07 from Thomas_Schwinge@mentor.com 	for gcc-patches@gcc.gnu.org; Fri, 14 Dec 2018 13:06:53 -0800
Received: from svr-orw-mbx-03.mgc.mentorg.com (147.34.90.203) by svr-orw-mbx-03.mgc.mentorg.com (147.34.90.203) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Fri, 14 Dec 2018 13:06:50 -0800
Received: from tftp-cs (147.34.91.1) by svr-orw-mbx-03.mgc.mentorg.com (147.34.90.203) with Microsoft SMTP Server id 15.0.1320.4 via Frontend Transport; Fri, 14 Dec 2018 13:06:50 -0800
Received: by tftp-cs (Postfix, from userid 49978)	id 19BD1C2321; Fri, 14 Dec 2018 13:06:50 -0800 (PST)
From: Thomas Schwinge <thomas@codesourcery.com>
To: <gcc-patches@gcc.gnu.org>
CC: Chung-Lin Tang <cltang@codesourcery.com>
Subject: Re: [PR88370] acc_get_cuda_stream/acc_set_cuda_stream: acc_async_sync, acc_async_noval
In-Reply-To: <yxfpd0qg9i2v.fsf@hertz.schwinge.homeip.net>
References: <044c4fdb-e659-6029-1da1-4f6bfc05ca9c@mentor.com> <yxfpk1lbtbdd.fsf@hertz.schwinge.homeip.net> <247a55cf-e6f6-ab71-70b5-5eca5fcc2233@mentor.com> <yxfpd0qg9i2v.fsf@hertz.schwinge.homeip.net>
User-Agent: Notmuch/0.9-125-g4686d11 (http://notmuchmail.org) Emacs/25.2.2 (x86_64-pc-linux-gnu)
Date: Fri, 14 Dec 2018 21:07:00 -0000
Message-ID: <87va3vrf6r.fsf@euler.schwinge.homeip.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-SW-Source: 2018-12/txt/msg01096.txt.bz2

Hi!

On Wed, 5 Dec 2018 15:14:16 +0100, I wrote:
> On Mon, 19 Nov 2018 16:33:30 +0900, Chung-Lin Tang <chunglin_tang@mentor.=
com> wrote:
> > On 2018/11/18 10:36 AM, Thomas Schwinge wrote:
> > > Generally, I envision test cases running a few "acc_get_cuda_stream"
> > > calls with relevant argument values, to see whether the expected
> > > queues/streames are being used.  (Similar for other offload targets.)
> > >=20
> > > But I suppose we might again need to get clarified whether
> > > "acc_get_cuda_stream(acc_async_sync)",
> > > "acc_get_cuda_stream(acc_async_noval)", or
> > > "acc_get_cuda_stream(acc_async_default)" are actually valid calls (gi=
ven
> > > that these argument values are not valid "async value"s), and these w=
ould
> > > then return the respective CUDA stream handles, different from the one
> > > returned for "acc_get_cuda_stream(0)" etc.
> > >=20
> > > That said, we can certainly implement it that way, because that's not
> > > against the specification.
> >=20
> > I think the likely clarification we'll ever get on this is that it's
> > implementation defined :P
>=20
> Well, actually, I've been able to convince myself ;-) to a reading of the
> specification so that this is supported, and filed
> <https://gcc.gnu.org/PR88370>.
>=20
> Does the following look alright to you?
>=20
> Do you agree that 'Refusing request to set CUDA stream associated with
> "acc_async_sync"' should just be an informational debug message, instead
> of a hard error?  (This restriction might disappear in the future.)  (Oh,
> and other negative values will still be diagnosed as errors by
> "select_stream_for_async".)

Not having heard anything against this, and as a prerequisite for other
changes, I have now committed the following in r267147:

commit 815940afeefeeafa49ad3a5d81ef2d273ddeb3d7
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Dec 14 20:42:08 2018 +0000

    [PR88370] acc_get_cuda_stream/acc_set_cuda_stream: acc_async_sync, acc_=
async_noval
=20=20=20=20
    Per my reading of the OpenACC specification (and as supported by second=
ary
    documentation, such as code examples, or presentations), it's valid to =
call
    "acc_get_cuda_stream"/"acc_set_cuda_stream" also with "acc_async_sync",
    "acc_async_noval" arguments, not just with the nonnegative values as cu=
rrently
    implemented.
=20=20=20=20
            libgomp/
            PR libgomp/88370
            * libgomp.texi (acc_get_current_cuda_context, acc_get_cuda_stre=
am)
            (acc_set_cuda_stream): Clarify.
            * oacc-cuda.c (acc_get_cuda_stream, acc_set_cuda_stream): Use
            "async_valid_p".
            * plugin/plugin-nvptx.c (nvptx_set_cuda_stream): Refuse "async =
=3D=3D
            acc_async_sync".
            * testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-1.c: =
New file.
            * testsuite/libgomp.oacc-c-c++-common/async_queue-1.c: Likewise.
            * testsuite/libgomp.oacc-c-c++-common/lib-84.c: Update.
            * testsuite/libgomp.oacc-c-c++-common/lib-85.c: Likewise.
=20=20=20=20
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@267147 138bc75d-0d04-04=
10-961f-82ee72b054a4
---
 libgomp/ChangeLog                                  | 14 ++++
 libgomp/libgomp.texi                               | 17 ++--
 libgomp/oacc-cuda.c                                |  4 +-
 libgomp/plugin/plugin-nvptx.c                      | 10 ++-
 .../acc_set_cuda_stream-1.c                        | 42 ++++++++++
 .../libgomp.oacc-c-c++-common/async_queue-1.c      | 97 ++++++++++++++++++=
++++
 .../testsuite/libgomp.oacc-c-c++-common/lib-84.c   | 31 +++++--
 .../testsuite/libgomp.oacc-c-c++-common/lib-85.c   | 27 +++++-
 8 files changed, 222 insertions(+), 20 deletions(-)

diff --git libgomp/ChangeLog libgomp/ChangeLog
index 4c66021c367d..b4ab6b690553 100644
--- libgomp/ChangeLog
+++ libgomp/ChangeLog
@@ -1,3 +1,17 @@
+2018-12-14  Thomas Schwinge  <thomas@codesourcery.com>
+
+	PR libgomp/88370
+	* libgomp.texi (acc_get_current_cuda_context, acc_get_cuda_stream)
+	(acc_set_cuda_stream): Clarify.
+	* oacc-cuda.c (acc_get_cuda_stream, acc_set_cuda_stream): Use
+	"async_valid_p".
+	* plugin/plugin-nvptx.c (nvptx_set_cuda_stream): Refuse "async =3D=3D
+	acc_async_sync".
+	* testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-1.c: New file.
+	* testsuite/libgomp.oacc-c-c++-common/async_queue-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/lib-84.c: Update.
+	* testsuite/libgomp.oacc-c-c++-common/lib-85.c: Likewise.
+
 2018-12-14  Tom de Vries  <tdevries@suse.de>
=20
 	* testsuite/libgomp.c-c++-common/function-not-offloaded-aux.c: New test.
diff --git libgomp/libgomp.texi libgomp/libgomp.texi
index 3fa8eb8165e5..e6c20525bc0c 100644
--- libgomp/libgomp.texi
+++ libgomp/libgomp.texi
@@ -2768,7 +2768,7 @@ as used by the CUDA Runtime or Driver API's.
=20
 @item @emph{C/C++}:
 @multitable @columnfractions .20 .80
-@item @emph{Prototype}: @tab @code{acc_get_current_cuda_context(void);}
+@item @emph{Prototype}: @tab @code{void *acc_get_current_cuda_context(void=
);}
 @end multitable
=20
 @item @emph{Reference}:
@@ -2782,12 +2782,12 @@ A.2.1.2.
 @section @code{acc_get_cuda_stream} -- Get CUDA stream handle.
 @table @asis
 @item @emph{Description}
-This function returns the CUDA stream handle. This handle is the same
-as used by the CUDA Runtime or Driver API's.
+This function returns the CUDA stream handle for the queue @var{async}.
+This handle is the same as used by the CUDA Runtime or Driver API's.
=20
 @item @emph{C/C++}:
 @multitable @columnfractions .20 .80
-@item @emph{Prototype}: @tab @code{acc_get_cuda_stream(void);}
+@item @emph{Prototype}: @tab @code{void *acc_get_cuda_stream(int async);}
 @end multitable
=20
 @item @emph{Reference}:
@@ -2802,11 +2802,16 @@ A.2.1.3.
 @table @asis
 @item @emph{Description}
 This function associates the stream handle specified by @var{stream} with
-the asynchronous value specified by @var{async}.
+the queue @var{async}.
+
+This cannot be used to change the stream handle associated with
+@code{acc_async_sync}.
+
+The return value is not specified.
=20
 @item @emph{C/C++}:
 @multitable @columnfractions .20 .80
-@item @emph{Prototype}: @tab @code{acc_set_cuda_stream(int async void *str=
eam);}
+@item @emph{Prototype}: @tab @code{int acc_set_cuda_stream(int async, void=
 *stream);}
 @end multitable
=20
 @item @emph{Reference}:
diff --git libgomp/oacc-cuda.c libgomp/oacc-cuda.c
index 20774c1b4876..4ee4c9b08576 100644
--- libgomp/oacc-cuda.c
+++ libgomp/oacc-cuda.c
@@ -58,7 +58,7 @@ acc_get_cuda_stream (int async)
 {
   struct goacc_thread *thr =3D goacc_thread ();
=20
-  if (!async_valid_stream_id_p (async))
+  if (!async_valid_p (async))
     return NULL;
=20
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
@@ -72,7 +72,7 @@ acc_set_cuda_stream (int async, void *stream)
 {
   struct goacc_thread *thr;
=20
-  if (!async_valid_stream_id_p (async) || stream =3D=3D NULL)
+  if (!async_valid_p (async) || stream =3D=3D NULL)
     return 0;
=20
   goacc_lazy_initialize ();
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index 6492e5ffab77..7d0d38e0c2e1 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -1753,8 +1753,14 @@ nvptx_set_cuda_stream (int async, void *stream)
   pthread_t self =3D pthread_self ();
   struct nvptx_thread *nvthd =3D nvptx_thread ();
=20
-  if (async < 0)
-    GOMP_PLUGIN_fatal ("bad async %d", async);
+  /* Due to the "null_stream" usage for "acc_async_sync", this cannot be u=
sed
+     to change the stream handle associated with "acc_async_sync".  */
+  if (async =3D=3D acc_async_sync)
+    {
+      GOMP_PLUGIN_debug (0, "Refusing request to set CUDA stream associate=
d"
+			 " with \"acc_async_sync\"\n");
+      return 0;
+    }
=20
   pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
=20
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-=
1.c libgomp/testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-1.c
new file mode 100644
index 000000000000..93981ff5cb7f
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/acc_set_cuda_stream-1.c
@@ -0,0 +1,42 @@
+/* Verify expected nvptx plugin behavior for "acc_set_cuda_stream" for
+   "acc_async_sync".  */
+
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-set-target-env-var GOMP_DEBUG "1" } */
+
+#undef NDEBUG
+#include <assert.h>
+#include <openacc.h>
+
+int main(void)
+{
+  int async =3D 42;
+
+  /* Initialize.  */
+#pragma acc parallel async(acc_async_sync)
+      ;
+#pragma acc parallel async(async)
+      ;
+#pragma acc wait
+
+  void *cuda_stream_sync =3D acc_get_cuda_stream (acc_async_sync);
+  assert (cuda_stream_sync =3D=3D NULL);
+  void *cuda_stream_async =3D acc_get_cuda_stream (async);
+  assert (cuda_stream_async !=3D NULL);
+  int ret =3D acc_set_cuda_stream (acc_async_sync, cuda_stream_async);
+  assert (ret =3D=3D 0);
+  void *cuda_stream_sync_ =3D acc_get_cuda_stream (acc_async_sync);
+  assert (cuda_stream_sync_ =3D=3D cuda_stream_sync);
+  void *cuda_stream_async_ =3D acc_get_cuda_stream (async);
+  assert (cuda_stream_async_ =3D=3D cuda_stream_async);
+
+#pragma acc parallel async(acc_async_sync)
+      ;
+#pragma acc parallel async(async)
+      ;
+#pragma acc wait
+
+  return 0;
+}
+
+/* { dg-output "Refusing request to set CUDA stream associated with \"acc_=
async_sync\"" } */
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/async_queue-1.c libg=
omp/testsuite/libgomp.oacc-c-c++-common/async_queue-1.c
new file mode 100644
index 000000000000..48e1846a36e3
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/async_queue-1.c
@@ -0,0 +1,97 @@
+/* Test mapping of async values to specific underlying queues.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <openacc.h>
+
+/* This is implemented in terms of the "acc_get_cuda_stream" interface.  */
+
+struct
+{
+  int async;
+  void *cuda_stream;
+} queues[] =3D { { acc_async_sync, NULL },
+	       { acc_async_noval, NULL },
+	       { 0, NULL },
+	       { 1, NULL },
+	       { 2, NULL },
+	       { 36, NULL },
+	       { 1982, NULL } };
+const size_t queues_n =3D sizeof queues / sizeof queues[0];
+
+int main(void)
+{
+  /* Explicitly initialize: it's not clear whether the following OpenACC
+     runtime library calls implicitly initialize;
+     <https://github.com/OpenACC/openacc-spec/issues/102>.  */
+  acc_device_t d;
+#if defined ACC_DEVICE_TYPE_nvidia
+  d =3D acc_device_nvidia;
+#elif defined ACC_DEVICE_TYPE_host
+  d =3D acc_device_host;
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+  acc_init (d);
+
+  for (size_t i =3D 0; i < queues_n; ++i)
+    {
+      /* Before actually being used, there are all NULL.  */
+      queues[i].cuda_stream =3D acc_get_cuda_stream (queues[i].async);
+      assert (queues[i].cuda_stream =3D=3D NULL);
+    }
+
+  for (size_t i =3D 0; i < queues_n; ++i)
+    {
+      /* Use the queue to initialize it.  */
+#pragma acc parallel async(queues[i].async)
+      ;
+#pragma acc wait
+
+      /* Verify CUDA stream used.  */
+      queues[i].cuda_stream =3D acc_get_cuda_stream (queues[i].async);
+#if defined ACC_DEVICE_TYPE_nvidia
+      /* "acc_async_sync" maps to the NULL CUDA default stream.  */
+      if (queues[i].async =3D=3D acc_async_sync)
+	assert (queues[i].cuda_stream =3D=3D NULL);
+      else
+	assert (queues[i].cuda_stream !=3D NULL);
+#elif defined ACC_DEVICE_TYPE_host
+      /* For "acc_device_host" there are no CUDA streams.  */
+      assert (queues[i].cuda_stream =3D=3D NULL);
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+    }
+
+  /* Verify same results.  */
+  for (size_t i =3D 0; i < queues_n; ++i)
+    {
+      void *cuda_stream;
+
+      cuda_stream =3D acc_get_cuda_stream (queues[i].async);
+      assert (cuda_stream =3D=3D queues[i].cuda_stream);
+
+#pragma acc parallel async(queues[i].async)
+      ;
+#pragma acc wait
+
+      cuda_stream =3D acc_get_cuda_stream (queues[i].async);
+      assert (cuda_stream =3D=3D queues[i].cuda_stream);
+    }
+
+  /* Verify individual underlying queues are all different.  */
+  for (size_t i =3D 0; i < queues_n; ++i)
+    {
+      if (queues[i].cuda_stream =3D=3D NULL)
+	continue;
+      for (size_t j =3D i + 1; j < queues_n; ++j)
+	{
+	  if (queues[j].cuda_stream =3D=3D NULL)
+	    continue;
+	  assert (queues[j].cuda_stream !=3D queues[i].cuda_stream);
+	}
+    }
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/lib-84.c libgomp/tes=
tsuite/libgomp.oacc-c-c++-common/lib-84.c
index 786b908f755b..d793c7436300 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-84.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-84.c
@@ -7,6 +7,14 @@
 #include <openacc.h>
 #include <cuda.h>
=20
+#if !defined __cplusplus
+# undef static_assert
+# define static_assert _Static_assert
+#endif
+
+static_assert (acc_async_sync =3D=3D -2, "acc_async_sync?");
+static_assert (acc_async_noval =3D=3D -1, "acc_async_noval?");
+
 int
 main (int argc, char **argv)
 {
@@ -20,9 +28,11 @@ main (int argc, char **argv)
=20
   (void) acc_get_device_num (acc_device_nvidia);
=20
-  streams =3D (CUstream *) malloc (N * sizeof (void *));
+  streams =3D (CUstream *) malloc ((2 + N) * sizeof (void *));
+  streams +=3D 2;
+  /* "streams[i]" is valid for i in [acc_async_sync..N).  */
=20
-  for (i =3D 0; i < N; i++)
+  for (i =3D acc_async_sync; i < N; i++)
     {
       streams[i] =3D (CUstream) acc_get_cuda_stream (i);
       if (streams[i] !=3D NULL)
@@ -35,11 +45,20 @@ main (int argc, char **argv)
 	  abort ();
 	}
=20
-        if (!acc_set_cuda_stream (i, streams[i]))
-	  abort ();
+      int ret =3D acc_set_cuda_stream (i, streams[i]);
+      if (i =3D=3D acc_async_sync)
+	{
+	  if (ret =3D=3D 1)
+	    abort ();
+	}
+      else
+	{
+	  if (ret !=3D 1)
+	    abort ();
+	}
     }
=20
-  for (i =3D 0; i < N; i++)
+  for (i =3D acc_async_sync; i < N; i++)
     {
       int j;
       int cnt;
@@ -48,7 +67,7 @@ main (int argc, char **argv)
=20
       s =3D streams[i];
=20
-      for (j =3D 0; j < N; j++)
+      for (j =3D acc_async_sync; j < N; j++)
 	{
 	  if (s =3D=3D streams[j])
 	    cnt++;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/lib-85.c libgomp/tes=
tsuite/libgomp.oacc-c-c++-common/lib-85.c
index cf925a7b002e..141c83b53dd6 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/lib-85.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/lib-85.c
@@ -7,6 +7,14 @@
 #include <stdio.h>
 #include <cuda.h>
=20
+#if !defined __cplusplus
+# undef static_assert
+# define static_assert _Static_assert
+#endif
+
+static_assert (acc_async_sync =3D=3D -2, "acc_async_sync?");
+static_assert (acc_async_noval =3D=3D -1, "acc_async_noval?");
+
 int
 main (int argc, char **argv)
 {
@@ -20,9 +28,11 @@ main (int argc, char **argv)
=20
   (void) acc_get_device_num (acc_device_nvidia);
=20
-  streams =3D (CUstream *) malloc (N * sizeof (void *));
+  streams =3D (CUstream *) malloc ((2 + N) * sizeof (void *));
+  streams +=3D 2;
+  /* "streams[i]" is valid for i in [acc_async_sync..N).  */
=20
-  for (i =3D 0; i < N; i++)
+  for (i =3D acc_async_sync; i < N; i++)
     {
       streams[i] =3D (CUstream) acc_get_cuda_stream (i);
       if (streams[i] !=3D NULL)
@@ -35,8 +45,17 @@ main (int argc, char **argv)
 	  abort ();
 	}
=20
-        if (!acc_set_cuda_stream (i, streams[i]))
-	  abort ();
+      int ret =3D acc_set_cuda_stream (i, streams[i]);
+      if (i =3D=3D acc_async_sync)
+	{
+	  if (ret =3D=3D 1)
+	    abort ();
+	}
+      else
+	{
+	  if (ret !=3D 1)
+	    abort ();
+	}
     }
=20
   s =3D NULL;


Gr=C3=BC=C3=9Fe
 Thomas