public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [og7, nvptx, PR85486, committed] Force vl32 if calling vector-partitionable routines
@ 2018-04-23 11:24 Tom de Vries
  2018-09-18 20:22 ` [nvptx] vector length patch series Cesar Philippidis
  0 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2018-04-23 11:24 UTC (permalink / raw)
  To: GCC Patches; +Cc: Thomas Schwinge, Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 1282 bytes --]

Hi,

we've recently added the new feature allowing vector length larger than 32.

But when we compile a test-case like this:
...
#pragma acc routine vector
void __attribute__((noinline, noclone))
Vector (int *ptr, int n, const int inc)
{
   #pragma acc loop vector
   for (unsigned ix = 0; ix < n; ix++)
     ptr[ix] += inc;
}

int
main (void)
{
   const int n = 32, m=32;

   int ary[m][n];
   unsigned ix,  iy;

#pragma acc parallel copy(ary) vector_length(128)
   {
     Vector (&ary[0][0], m * n, (1<<24) - (1<<16));
   }

   return 0;
}
...
the offloading region is compiled with vector length 128, but the 
routine is compiled with vector length 32, which leads to runtime failures.

The code for the routine assumes that ntid.x == 32 (because state 
propagation is done using inter-warp shuffle instructions), and calling 
the routine from the offloading region where ntid.x == 128 breaks that 
assumption.

An easy fix would be to make vector_length > 32 the default in routines, 
but for now we don't want to switch it on by default anywhere.

This patch fixes the runtime failure by forcing vector length 32 if an 
offloading function contains calls to vector-partitionable routines.

Build x86_64 with nvptx accelerator, tested libgomp.

Committed to og7.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Force-vl32-if-calling-vector-partitionable-routines.patch --]
[-- Type: text/x-patch, Size: 9995 bytes --]

[nvptx] Force vl32 if calling vector-partitionable routines

2018-04-23  Tom de Vries  <tom@codesourcery.com>

	PR target/85486
	* omp-offload.c (oacc_fn_attrib_level): Remove static.
	* omp-offload.h (oacc_fn_attrib_level): Declare.
	* config/nvptx/nvptx.c (has_vector_partitionable_routine_calls_p): New
	function.
	(nvptx_goacc_validate_dims): Force vector length 32 if offloading
	function calls vector-partitionable routines.

	* testsuite/libgomp.oacc-c-c++-common/pr85486-2.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/pr85486-3.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/pr85486.c: New test.

---
 gcc/config/nvptx/nvptx.c                           | 90 ++++++++++++++++++++--
 gcc/omp-offload.c                                  |  2 +-
 gcc/omp-offload.h                                  |  1 +
 .../libgomp.oacc-c-c++-common/pr85486-2.c          | 53 +++++++++++++
 .../libgomp.oacc-c-c++-common/pr85486-3.c          | 56 ++++++++++++++
 .../testsuite/libgomp.oacc-c-c++-common/pr85486.c  | 52 +++++++++++++
 6 files changed, 247 insertions(+), 7 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 3aee9cc..77c4d71 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5194,6 +5194,40 @@ nvptx_goacc_needs_vl_warp ()
   return attr != NULL_TREE;
 }
 
+/* Return true if FNDECL contains calls to vector-partitionable routines.  */
+
+static bool
+has_vector_partitionable_routine_calls_p (tree fndecl)
+{
+  if (!fndecl)
+    return false;
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (fndecl))
+    for (gimple_stmt_iterator i = gsi_start_bb (bb); !gsi_end_p (i);
+	 gsi_next_nondebug (&i))
+      {
+	gimple *stmt = gsi_stmt (i);
+	if (gimple_code (stmt) != GIMPLE_CALL)
+	  continue;
+
+	tree callee = gimple_call_fndecl (stmt);
+	if (!callee)
+	  continue;
+
+	tree attrs  = oacc_get_fn_attrib (callee);
+	if (attrs == NULL_TREE)
+	  return false;
+
+	int partition_level = oacc_fn_attrib_level (attrs);
+	bool seq_routine_p = partition_level == GOMP_DIM_MAX;
+	if (!seq_routine_p)
+	  return true;
+      }
+
+  return false;
+}
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  If
@@ -5202,13 +5236,45 @@ nvptx_goacc_needs_vl_warp ()
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
-  int default_vector_length = PTX_VECTOR_LENGTH;
+  bool oacc_default_dims_p ATTRIBUTE_UNUSED = false;
+  bool oacc_min_dims_p ATTRIBUTE_UNUSED = false;
+  bool offload_region_p = false;
+  bool routine_p = false;
+  bool routine_seq_p = false;
+
+  if (decl == NULL_TREE)
+    {
+      if (fn_level == -1)
+	oacc_default_dims_p = true;
+      else if (fn_level == -2)
+	oacc_min_dims_p = true;
+      else
+	gcc_unreachable ();
+    }
+  else if (fn_level == -1)
+    offload_region_p = true;
+  else if (0 <= fn_level && fn_level <= GOMP_DIM_MAX)
+    {
+      routine_p = true;
+      if (fn_level == GOMP_DIM_MAX)
+	routine_seq_p = true;
+    }
+  else
+    gcc_unreachable ();
 
+  int default_vector_length = PTX_VECTOR_LENGTH;
   /* For capability reasons, fallback to vl = 32 for runtime values.  */
   if (dims[GOMP_DIM_VECTOR] == 0)
     default_vector_length = PTX_WARP_SIZE;
   else if (decl)
-    default_vector_length = oacc_get_default_dim (GOMP_DIM_VECTOR);
+    {
+      default_vector_length = oacc_get_default_dim (GOMP_DIM_VECTOR);
+      if ((offload_region_p
+	   || (routine_p && !routine_seq_p))
+	  && default_vector_length > PTX_WARP_SIZE
+	  && has_vector_partitionable_routine_calls_p (decl))
+	default_vector_length = PTX_WARP_SIZE;
+    }
 
   /* Detect if a function is unsuitable for offloading.  */
   if (!flag_offload_force && decl)
@@ -5234,12 +5300,24 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 
   bool changed = false;
 
+  if ((offload_region_p
+       || (routine_p && !routine_seq_p))
+      && dims[GOMP_DIM_VECTOR] > PTX_WARP_SIZE
+      && has_vector_partitionable_routine_calls_p (decl))
+    {
+	warning_at (DECL_SOURCE_LOCATION (decl), 0,
+		    G_("using vector_length (%d) due to call to"
+		       " vector-partitionable routine, ignoring %d"),
+		    PTX_WARP_SIZE, dims[GOMP_DIM_VECTOR]);
+      dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+      changed = true;
+    }
   /* The vector size must be a positive multiple of the warp size,
      unless this is a SEQ routine.  */
-  if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
-      && dims[GOMP_DIM_VECTOR] >= 0
-      && (dims[GOMP_DIM_VECTOR] % 32 != 0
-	  || dims[GOMP_DIM_VECTOR] == 0))
+  else if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
+	   && dims[GOMP_DIM_VECTOR] >= 0
+	   && (dims[GOMP_DIM_VECTOR] % 32 != 0
+	       || dims[GOMP_DIM_VECTOR] == 0))
     {
       if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 66c6212..dcd7a87 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -85,7 +85,7 @@ vec<tree, va_gc> *offload_funcs, *offload_vars;
 /* Return level at which oacc routine may spawn a partitioned loop, or
    -1 if it is not a routine (i.e. is an offload fn).  */
 
-static int
+int
 oacc_fn_attrib_level (tree attr)
 {
   tree pos = TREE_VALUE (attr);
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index 014ee52..7507338 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_OMP_DEVICE_H
 
 extern int oacc_get_default_dim (int dim);
+extern int oacc_fn_attrib_level (tree attr);
 
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
new file mode 100644
index 0000000..a92b5dd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
@@ -0,0 +1,53 @@
+/* { dg-do run } */
+/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1" } } */
+/* { dg-additional-options "-fopenacc-dim=-:-:128" } */
+
+/* Minimized from ref-1.C.  */
+
+#include <stdio.h>
+
+#pragma acc routine vector
+void __attribute__((noinline, noclone))
+Vector (int *ptr, int n, const int inc)
+{
+  #pragma acc loop vector
+  for (unsigned ix = 0; ix < n; ix++)
+    ptr[ix] += inc;
+}
+
+int
+main (void)
+{
+  const int n = 32, m=32;
+
+  int ary[m][n];
+  unsigned ix,  iy;
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      ary[ix][iy] = (1 << 16) + (ix << 8) + iy;
+
+  int err = 0;
+
+#pragma acc parallel copy (ary)
+  {
+    Vector (&ary[0][0], m * n, (1 << 24) - (1 << 16));
+  }
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      if (ary[ix][iy] != ((1 << 24) + (ix << 8) + iy))
+	{
+	  printf ("ary[%u][%u] = %x expected %x\n",
+		  ix, iy, ary[ix][iy], ((1 << 24) + (ix << 8) + iy));
+	  err++;
+	}
+
+  if (err)
+    {
+      printf ("%d failed\n", err);
+      return 1;
+    }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
new file mode 100644
index 0000000..ae62206
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1" } } */
+/* { dg-additional-options "-fopenacc-dim=-:-:-" } */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "-:-:128" } */
+
+/* Minimized from ref-1.C.  */
+
+#include <stdio.h>
+
+#pragma acc routine vector
+void __attribute__((noinline, noclone))
+Vector (int *ptr, int n, const int inc)
+{
+  #pragma acc loop vector
+  for (unsigned ix = 0; ix < n; ix++)
+    ptr[ix] += inc;
+}
+
+int
+main (void)
+{
+  const int n = 32, m=32;
+
+  int ary[m][n];
+  unsigned ix,  iy;
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      ary[ix][iy] = (1 << 16) + (ix << 8) + iy;
+
+  int err = 0;
+
+#pragma acc parallel copy (ary)
+  {
+    Vector (&ary[0][0], m * n, (1 << 24) - (1 << 16));
+  }
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      if (ary[ix][iy] != ((1 << 24) + (ix << 8) + iy))
+	{
+	  printf ("ary[%u][%u] = %x expected %x\n",
+		  ix, iy, ary[ix][iy], ((1 << 24) + (ix << 8) + iy));
+	  err++;
+	}
+
+  if (err)
+    {
+      printf ("%d failed\n", err);
+      return 1;
+    }
+
+  return 0;
+}
+
+/* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
new file mode 100644
index 0000000..f91dee0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
@@ -0,0 +1,52 @@
+/* { dg-do run } */
+/* { dg-skip-if "" { *-*-* } { "*" } { "-DACC_DEVICE_TYPE_nvidia=1" } } */
+
+/* Minimized from ref-1.C.  */
+
+#include <stdio.h>
+
+#pragma acc routine vector
+void __attribute__((noinline, noclone))
+Vector (int *ptr, int n, const int inc)
+{
+  #pragma acc loop vector
+  for (unsigned ix = 0; ix < n; ix++)
+    ptr[ix] += inc;
+}
+
+int
+main (void)
+{
+  const int n = 32, m=32;
+
+  int ary[m][n];
+  unsigned ix,  iy;
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      ary[ix][iy] = (1 << 16) + (ix << 8) + iy;
+
+  int err = 0;
+
+#pragma acc parallel copy (ary) vector_length (128) /* { dg-warning "using vector_length \\(32\\) due to call to vector-partitionable routine, ignoring 128" } */
+  {
+    Vector (&ary[0][0], m * n, (1 << 24) - (1 << 16));
+  }
+
+  for (ix = m; ix--;)
+    for (iy = n; iy--;)
+      if (ary[ix][iy] != ((1 << 24) + (ix << 8) + iy))
+	{
+	  printf ("ary[%u][%u] = %x expected %x\n",
+		  ix, iy, ary[ix][iy], ((1 << 24) + (ix << 8) + iy));
+	  err++;
+	}
+
+  if (err)
+    {
+      printf ("%d failed\n", err);
+      return 1;
+    }
+
+  return 0;
+}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx] vector length patch series
@ 2018-09-18 20:22 ` Cesar Philippidis
  2018-10-06  7:49   ` Tom de Vries
  0 siblings, 1 reply; 32+ messages in thread
From: Cesar Philippidis @ 2018-09-18 20:22 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches

Hi Tom,

Here is a link to our nvptx vector length patches on github:

  https://github.com/cesarjp/gcc/tree/trunk-og8-vl-private

Specifically, the code lives in the trunk-og8-vl-private branch. There
are a couple of outstanding dependency patches:

  * Teach gfortran to lower OpenACC routine dims
    https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00368.html
    b186c651f37 [openacc] Make GFC default to -1 for OpenACC routine dims

  * Add target hook TARGET_GOACC_ADJUST_PARALLELISM
    https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00369.html
    49b2039013e [openacc] Add target hook TARGET_GOACC_ADJUST_PARALLELISM

  * Enable firstprivate OpenACC reductions
    https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00370.html
    1f70cdb7cf0 (HEAD -> trunk-og8-vl-private,
github/trunk-og8-vl-private) [OpenACC] Enable firstprivate OpenACC
reductions

  * Adjust offsets for present data clauses
    https://gcc.gnu.org/ml/gcc-patches/2018-07/msg01213.html
    8bcda2f1a2b [libgomp, OpenACC] Adjust offsets for present data clauses

Of the patches in trunk-og8-vl-private, the following are just general
refactors and cleanups which do not change any functionality:

7eb378e9b0c [nvptx] Generalize state propagation and synchronization
10aa1f74d5a [nvptx] Use MAX, MIN, ROUND_UP macros
9dfe611f3d8 [nvptx] Use TARGET_SET_CURRENT_FUNCTION
4fbe0e812bd [nvptx] Add axis_dim
fbe43dac79f [nvptx] Add thread count parm to bar.sync
57d3f8c88ff [nvptx] only use one bar.sync barriers in OpenACC offloaded code
f14d0e882eb [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars
82d81fffb0f [nvptx] make nvptx state propagation function names more generic
95703737e09 [nvptx] consolidate offloaded function attributes into
struct offload_attrs
8c9e897c36d [nvptx] Rename worker_bcast variables oacc_bcast.
45147e7e3f3 [nvptx] update openacc dim macros
caa641ecfb4 [nvptx] Update insufficient launch message to accommodate
large vectors

The following patches actually implement the new vector length
functionality. Note that trunk doesn't support missing arguments between
colons in -fopenacc-dim like -fopenacc-dim=::64, so I had to remove a
couple or adjust a couple of your test cases from og8.

591973d3c3a [nvptx] use user-defined vectors when possible
fb9cefa5b17 [nvptx] Handle large vector reductions
5154d363d07 [nvptx] Force vl32 if calling vector-partitionable routines
f62e3afcf6a [nvptx, openacc] Don't emit barriers for empty loops
4cc408658fb [PR85246] [nvptx] Fix propagation of branch cond in
vw-neutered code
d97ed5fc580 [nvptx] Simplifly logic in nvptx_single
62f0c5df3dd [nvptx] Enable worker partitioning with warp-sized vector_length
f2cf96b0df3 [nvptx] Handle large vectors in libgomp
eba014c260c [nvptx] Enable large vectors
f31d8b98ca1 [nvptx] Add vector_length 128 testcases

Let me know if you encounter any problems with that github branch.

This branch has recently been recently rebased against trunk. Further, I
bootstrapped and regtested it on x86_64 Linux target with nvptx
offloading.

Thanks,
Cesar

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-09-18 20:22 ` [nvptx] vector length patch series Cesar Philippidis
@ 2018-10-06  7:49   ` Tom de Vries
  2018-10-29 20:09     ` Cesar Philippidis
  0 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2018-10-06  7:49 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches

On 9/18/18 10:04 PM, Cesar Philippidis wrote:
> 591973d3c3a [nvptx] use user-defined vectors when possible

If I drop this patch, I get the same test results. Can you find a
testcase for which this patch has an effect?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-10-06  7:49   ` Tom de Vries
@ 2018-10-29 20:09     ` Cesar Philippidis
  2018-12-14 19:58       ` Tom de Vries
  0 siblings, 1 reply; 32+ messages in thread
From: Cesar Philippidis @ 2018-10-29 20:09 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches, Schwinge, Thomas

[-- Attachment #1: Type: text/plain, Size: 1940 bytes --]

On 10/5/18 23:22, Tom de Vries wrote:
> On 9/18/18 10:04 PM, Cesar Philippidis wrote:
>> 591973d3c3a [nvptx] use user-defined vectors when possible
> 
> If I drop this patch, I get the same test results. Can you find a
> testcase for which this patch has an effect?

I just revisited the vector length patch series, and that patch in
specific is bogus and can be safely dropped.

From what I can remember, the intent behind that patch is to allow the
user to override the default vector length using GOMP_OPENACC_DIM. E.g.,

  #pragma acc parallel loop vector
  for (...)

Here, the nvptx BE defaults to vector length = 32. But I had originally
wanted to allow something like

  GOMP_OPENACC_DIM="1:1:128" ./a.out

to use a vector length of 128 in that parallel region. After looking the
rest of the patch series, that's not possible because the nvptx BE
hard-codes the vector length to 128 at compile time. This was done
because large, multi-warp vector reductions are slow (O(n) vs O(ln n)).

Is this patch series OK without that patch? And if so, because that
patch series depends on other patches, can the following patches be
committed independently?

91e5c13b462 [nvptx] Generalize state propagation and synchronization
cb4b27a93e0 [nvptx] Use MAX, MIN, ROUND_UP macros
0af782ae93c [nvptx] Use TARGET_SET_CURRENT_FUNCTION
87cfb384dbe [nvptx] Add axis_dim
d1783939d98 [nvptx] Add thread count parm to bar.sync
47e80fa77a5 [nvptx] only use one bar.sync barriers in OpenACC offloaded
            code
dafc9957ee7 [nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars
a4857b94879 [nvptx] make nvptx state propagation function names more
            generic
b4b85f6e0b5 [nvptx] consolidate offloaded function attributes into
            struct offload_attrs
bcdb1e8afac [nvptx] Rename worker_bcast variables oacc_bcast.
34958a0904d [nvptx] update openacc dim macros

These patches just refactor code in the nvptx BE.

Thanks,
Cesar

[-- Attachment #2: nvptx-vl.tar.gz --]
[-- Type: application/gzip, Size: 39369 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-10-29 20:09     ` Cesar Philippidis
@ 2018-12-14 19:58       ` Tom de Vries
  2018-12-17 21:29         ` [nvptx, committed] Rewrite nvptx_goacc_validate_dims to use predicate vars Tom de Vries
                           ` (21 more replies)
  0 siblings, 22 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-14 19:58 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

On 29-10-18 20:28, Cesar Philippidis wrote:
> On 10/5/18 23:22, Tom de Vries wrote:
>> On 9/18/18 10:04 PM, Cesar Philippidis wrote:
>>> 591973d3c3a [nvptx] use user-defined vectors when possible
>>
>> If I drop this patch, I get the same test results. Can you find a
>> testcase for which this patch has an effect?
> 
> I just revisited the vector length patch series, and that patch in
> specific is bogus and can be safely dropped.
> 

Hi Thomas,

The new vector length patch series contains these patches:
...
0001-libgomp-OpenACC-Adjust-offsets-for-present-data-clau.patch
0002-nvptx-Update-insufficient-launch-message-to-accommod.patch
0003-openacc-Add-target-hook-TARGET_GOACC_ADJUST_PARALLEL.patch
0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch
0005-nvptx-update-openacc-dim-macros.patch
0006-nvptx-Rename-worker_bcast-variables-oacc_bcast.patch
0007-nvptx-consolidate-offloaded-function-attributes-into.patch
0008-nvptx-make-nvptx-state-propagation-function-names-mo.patch
0009-nvptx-Fix-whitespace-in-nvptx_single-and-nvptx_neute.patch
0010-nvptx-only-use-one-bar.sync-barriers-in-OpenACC-offl.patch
0011-nvptx-Add-thread-count-parm-to-bar.sync.patch
0012-nvptx-Add-axis_dim.patch
0013-nvptx-Use-TARGET_SET_CURRENT_FUNCTION.patch
0014-nvptx-Use-MAX-MIN-ROUND_UP-macros.patch
0015-nvptx-Generalize-state-propagation-and-synchronizati.patch
0016-nvptx-Add-vector_length-128-testcases.patch
0017-nvptx-Enable-large-vectors.patch
0018-nvptx-Handle-large-vectors-in-libgomp.patch
0019-nvptx-Enable-worker-partitioning-with-warp-sized-vec.patch
0020-nvptx-Simplifly-logic-in-nvptx_single.patch
0021-PR85246-nvptx-Fix-propagation-of-branch-cond-in-vw-n.patch
0022-nvptx-openacc-Don-t-emit-barriers-for-empty-loops.patch
0023-nvptx-Force-vl32-if-calling-vector-partitionable-rou.patch
0024-nvptx-Handle-large-vector-reductions.patch
0025-OpenACC-Enable-firstprivate-OpenACC-reductions.patch
...

> 0001-libgomp-OpenACC-Adjust-offsets-for-present-data-clau.patch

This patch (well, a variant of it) has been committed (although it's not
clear to me why this was included in this patch series).

> 0025-OpenACC-Enable-firstprivate-OpenACC-reductions.patch

This patch is not required for this patch series. If you remove it,
vred2d-128.c and gemm.f90 start to fail, which is trivially fixable by
adding firstprivate clauses according to the test-cases.

> 0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch

If I remove this, I run into ICEs in the compiler, but I think that
means we need to understand and fix that ICE, instead of concluding that
we need this patch. It looks completely unrelated.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Rewrite nvptx_goacc_validate_dims to use predicate vars
  2018-12-14 19:58       ` Tom de Vries
@ 2018-12-17 21:29         ` Tom de Vries
  2018-12-17 21:46         ` [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims Tom de Vries
                           ` (20 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 21:29 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 239 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0023-nvptx-Force-vl32-if-calling-vector-partitionable-rou.patch

I've factored out this cleanup patch from here.

Committed to trunk.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Rewrite-nvptx_goacc_validate_dims-to-use-predicate-vars.patch --]
[-- Type: text/x-patch, Size: 2448 bytes --]

[nvptx] Rewrite nvptx_goacc_validate_dims to use predicate vars

The function nvptx_goacc_validate_dims has arguments decl and fn_level which
together describe different situations.

Introduce a predicate var for each situation, and use them, allowing to
understand what the function does in each situation without having to know the
way the situations are encoded in the args.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims): Rewrite using
	predicate vars.

---
 gcc/config/nvptx/nvptx.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9903a273863..746d8bfb100 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5187,13 +5187,39 @@ static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
   bool changed = false;
+  bool oacc_default_dims_p = false;
+  bool oacc_min_dims_p = false;
+  bool offload_region_p = false;
+  bool routine_p = false;
+  bool routine_seq_p = false;
+
+  if (decl == NULL_TREE)
+    {
+      if (fn_level == -1)
+	oacc_default_dims_p = true;
+      else if (fn_level == -2)
+	oacc_min_dims_p = true;
+      else
+	gcc_unreachable ();
+    }
+  else if (fn_level == -1)
+    offload_region_p = true;
+  else if (0 <= fn_level && fn_level <= GOMP_DIM_MAX)
+    {
+      routine_p = true;
+      routine_seq_p = fn_level == GOMP_DIM_MAX;
+    }
+  else
+    gcc_unreachable ();
 
   /* The vector size must be 32, unless this is a SEQ routine.  */
-  if (fn_level <= GOMP_DIM_VECTOR && fn_level >= -1
+  if ((offload_region_p || oacc_default_dims_p
+       || (routine_p && !routine_seq_p))
       && dims[GOMP_DIM_VECTOR] >= 0
       && dims[GOMP_DIM_VECTOR] != PTX_VECTOR_LENGTH)
     {
-      if (fn_level < 0 && dims[GOMP_DIM_VECTOR] >= 0)
+      if ((offload_region_p || oacc_default_dims_p)
+	  && dims[GOMP_DIM_VECTOR] >= 0)
 	warning_at (decl ? DECL_SOURCE_LOCATION (decl) : UNKNOWN_LOCATION, 0,
 		    dims[GOMP_DIM_VECTOR]
 		    ? G_("using vector_length (%d), ignoring %d")
@@ -5213,7 +5239,7 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
       changed = true;
     }
 
-  if (!decl)
+  if (oacc_default_dims_p || oacc_min_dims_p)
     {
       dims[GOMP_DIM_VECTOR] = PTX_VECTOR_LENGTH;
       if (dims[GOMP_DIM_WORKER] < 0)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims
  2018-12-14 19:58       ` Tom de Vries
  2018-12-17 21:29         ` [nvptx, committed] Rewrite nvptx_goacc_validate_dims to use predicate vars Tom de Vries
@ 2018-12-17 21:46         ` Tom de Vries
  2019-02-22 11:23           ` Thomas Schwinge
  2018-12-17 21:48         ` [nvptx] vector length patch series Tom de Vries
                           ` (19 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 21:46 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 925 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
>> 0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch
> If I remove this, I run into ICEs in the compiler, but I think that
> means we need to understand and fix that ICE, instead of concluding that
> we need this patch. It looks completely unrelated.

Indeed this
(0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch) patch
is unrelated to the vector length functionality.

However, it fixes a problem in the Fortran front-end which has as
consequence that C and Fortran routines look the same in
nvptx_goacc_validate_dims, which is a good thing to have.

However, the upstreaming of the patch seems to be stuck, so I've
committed an nvptx workaround patch that has the same effect, allowing
us to drop it
(0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch) from
the patch series.

Thanks,
- Tom


[-- Attachment #2: 0002-nvptx-Unify-C-Fortran-routine-handling-in-nvptx_goacc_validate_dims.patch --]
[-- Type: text/x-patch, Size: 2858 bytes --]

[nvptx] Unify C/Fortran routine handling in nvptx_goacc_validate_dims

The Fortran front-end has a bug (PR72741) that means what when
nvptx_goacc_validate_dims is called for a Fortran routine, the dims parameter
is not the same as it would have been if the function would have been called for
an equivalent C routine.

Work around this bug by overriding the dims parameter for routines, allowing the
function to handle routines in Fortran and C the same.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims): Work around Fortran
	bug PR72741 by overriding dims parameter for routines.

---
 gcc/config/nvptx/nvptx.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 746d8bfb100..24727ad5920 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5212,6 +5212,42 @@ nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
   else
     gcc_unreachable ();
 
+  if (routine_p)
+    {
+      /* OpenACC routines in C arrive here with the following attributes
+	 (omitting the 'omp declare target'):
+	 seq   : __attribute__((oacc function (0 1, 0 1, 0 1)))
+	 vector: __attribute__((oacc function (0 1, 0 1, 1 0)))
+	 worker: __attribute__((oacc function (0 1, 1 0, 1 0)))
+	 gang  : __attribute__((oacc function (1 0, 1 0, 1 0)))
+
+	 If we take f.i. the oacc function attribute of the worker routine
+	 (0 1, 1 0, 1 0), then:
+	 - the slice (0, 1, 1) is interpreted by oacc_fn_attrib_level as
+	   meaning: worker routine, that is:
+	   - can't contain gang loop (0),
+	   - can contain worker loop (1),
+	   - can contain vector loop (1).
+	 - the slice (1, 0, 0) is interpreted by oacc_validate_dims as the
+	 dimensions: gang: 1, worker: 0, vector: 0.
+
+	 OTOH, routines in Fortran arrive here with these attributes:
+	 seq   : __attribute__((oacc function (0 0, 0 0, 0 0)))
+	 vector: __attribute__((oacc function (0 0, 0 0, 1 0)))
+	 worker: __attribute__((oacc function (0 0, 1 0, 1 0)))
+	 gang  : __attribute__((oacc function (1 0, 1 0, 1 0)))
+	 that is, the same as for C but with the dimensions set to 0.
+
+	 This is due to a bug in the Fortran front-end: PR72741.  Work around
+	 this bug by forcing the dimensions to be the same in Fortran as for C,
+	 to be able to handle C and Fortran routines uniformly in this
+	 function.  */
+      dims[GOMP_DIM_VECTOR] = fn_level > GOMP_DIM_VECTOR ? 1 : 0;
+      dims[GOMP_DIM_WORKER] = fn_level > GOMP_DIM_WORKER ? 1 : 0;
+      dims[GOMP_DIM_GANG] = fn_level > GOMP_DIM_GANG ? 1 : 0;
+      changed = true;
+    }
+
   /* The vector size must be 32, unless this is a SEQ routine.  */
   if ((offload_region_p || oacc_default_dims_p
        || (routine_p && !routine_seq_p))

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
  2018-12-17 21:29         ` [nvptx, committed] Rewrite nvptx_goacc_validate_dims to use predicate vars Tom de Vries
  2018-12-17 21:46         ` [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims Tom de Vries
@ 2018-12-17 21:48         ` Tom de Vries
  2018-12-17 21:51         ` [nvptx, committed] Add PTX_WARP_SIZE Tom de Vries
                           ` (18 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 21:48 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 171 bytes --]

On 14-12-18 20:58, Tom de Vries wrote:
> 0009-nvptx-Fix-whitespace-in-nvptx_single-and-nvptx_neute.patch

Committed (Could have been committed as obvious).

Thanks,
- Tom

[-- Attachment #2: 0003-nvptx-Fix-whitespace-in-nvptx_single-and-nvptx_neuter_pars.patch --]
[-- Type: text/x-patch, Size: 1103 bytes --]

[nvptx] Fix whitespace in nvptx_single and nvptx_neuter_pars

Fix whitespace in nvptx_single and nvptx_neuter_pars.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_single): Fix whitespace.
	(nvptx_neuter_pars): Likewise.

---
 gcc/config/nvptx/nvptx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 24727ad5920..01505899785 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4224,7 +4224,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    pred = gen_reg_rtx (BImode);
 	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
 	  }
-	
+
 	rtx br;
 	if (mode == GOMP_DIM_VECTOR)
 	  br = gen_br_true (pred, label);
@@ -4554,7 +4554,7 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
     }
 
   if (skip_mask)
-      nvptx_skip_par (skip_mask, par);
+    nvptx_skip_par (skip_mask, par);
   
   if (par->next)
     nvptx_neuter_pars (par->next, modes, outer);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Add PTX_WARP_SIZE
  2018-12-14 19:58       ` Tom de Vries
                           ` (2 preceding siblings ...)
  2018-12-17 21:48         ` [nvptx] vector length patch series Tom de Vries
@ 2018-12-17 21:51         ` Tom de Vries
  2018-12-17 21:52         ` [nvptx, committed] Move macro defs to top of nvptx.c Tom de Vries
                           ` (17 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 21:51 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 186 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0005-nvptx-update-openacc-dim-macros.patch

Factored out this patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0004-nvptx-Add-PTX_WARP_SIZE.patch --]
[-- Type: text/x-patch, Size: 1189 bytes --]

[nvptx] Add PTX_WARP_SIZE

Add PTX_WARP_SIZE constant and use it in nvptx_simt_vf.  The function
nvptx_simt_vf is used for OpenMP, and using PTX_WARP_SIZE here decouples the
OpenMP support from the PTX_VECTOR_LENGTH constant used in OpenACC support.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (PTX_WARP_SIZE): Define.
	(nvptx_simt_vf): Return PTX_WARP_SIZE instead of PTX_VECTOR_LENGTH.

---
 gcc/config/nvptx/nvptx.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 01505899785..9906716890e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -81,6 +81,8 @@
 #define WORKAROUND_PTXJIT_BUG_2 1
 #define WORKAROUND_PTXJIT_BUG_3 1
 
+#define PTX_WARP_SIZE 32
+
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {
@@ -5175,7 +5177,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 static int
 nvptx_simt_vf ()
 {
-  return PTX_VECTOR_LENGTH;
+  return PTX_WARP_SIZE;
 }
 
 /* Validate compute dimensions of an OpenACC offload or routine, fill

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Move macro defs to top of nvptx.c
  2018-12-14 19:58       ` Tom de Vries
                           ` (3 preceding siblings ...)
  2018-12-17 21:51         ` [nvptx, committed] Add PTX_WARP_SIZE Tom de Vries
@ 2018-12-17 21:52         ` Tom de Vries
  2018-12-17 23:52         ` [nvptx] vector length patch series -- openacc parts Tom de Vries
                           ` (16 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 21:52 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 186 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0005-nvptx-update-openacc-dim-macros.patch

Factored out this patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0005-nvptx-Move-macro-defs-to-top-of-nvptx.c.patch --]
[-- Type: text/x-patch, Size: 1289 bytes --]

[nvptx] Move macro defs to top of nvptx.c

Move macro definition to the top of the file, allowing them to be used
there-after.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH,
	PTX_DEFAULT_RUNTIME_DIM): Move to the top of the file.

---
 gcc/config/nvptx/nvptx.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9906716890e..74ca0f585aa 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -82,6 +82,9 @@
 #define WORKAROUND_PTXJIT_BUG_3 1
 
 #define PTX_WARP_SIZE 32
+#define PTX_VECTOR_LENGTH 32
+#define PTX_WORKER_LENGTH 32
+#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
@@ -5166,11 +5169,6 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
     default: gcc_unreachable ();
     }
 }
-\f
-/* Define dimension sizes for known hardware.  */
-#define PTX_VECTOR_LENGTH 32
-#define PTX_WORKER_LENGTH 32
-#define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
 /* Implement TARGET_SIMT_VF target hook: number of threads in a warp.  */
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx] vector length patch series -- openacc parts
  2018-12-14 19:58       ` Tom de Vries
                           ` (4 preceding siblings ...)
  2018-12-17 21:52         ` [nvptx, committed] Move macro defs to top of nvptx.c Tom de Vries
@ 2018-12-17 23:52         ` Tom de Vries
  2019-01-03  9:13           ` Tom de Vries
  2018-12-19 10:28         ` [nvptx, committed] Use TARGET_SET_CURRENT_FUNCTION Tom de Vries
                           ` (15 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2018-12-17 23:52 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

On 14-12-18 20:58, Tom de Vries wrote:

> 0003-openacc-Add-target-hook-TARGET_GOACC_ADJUST_PARALLEL.patch

> 0017-nvptx-Enable-large-vectors.patch

> 0023-nvptx-Force-vl32-if-calling-vector-partitionable-rou.patch

Thomas,

these patches are openacc (0003) or have openacc components (0017, 0023).

Can you review and possibly approve the openacc parts?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Use TARGET_SET_CURRENT_FUNCTION
  2018-12-14 19:58       ` Tom de Vries
                           ` (5 preceding siblings ...)
  2018-12-17 23:52         ` [nvptx] vector length patch series -- openacc parts Tom de Vries
@ 2018-12-19 10:28         ` Tom de Vries
  2018-12-19 10:31         ` [nvptx, committed] Only use one logical barrier resource Tom de Vries
                           ` (14 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:28 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 166 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0013-nvptx-Use-TARGET_SET_CURRENT_FUNCTION.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Use-TARGET_SET_CURRENT_FUNCTION.patch --]
[-- Type: text/x-patch, Size: 1502 bytes --]

[nvptx] Use TARGET_SET_CURRENT_FUNCTION

Implement TARGET_SET_CURRENT_FUNCTION for nvptx.  This gives us a place to
add initialization or reset actions that need to be executed on a per-function
basis.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_previous_fndecl): Declare.
	(nvptx_set_current_function): New function.
	(TARGET_SET_CURRENT_FUNCTION): Define.

---
 gcc/config/nvptx/nvptx.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 74ca0f585aa..9f834d35200 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5980,6 +5980,17 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+static GTY(()) tree nvptx_previous_fndecl;
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6116,6 +6127,9 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Only use one logical barrier resource
  2018-12-14 19:58       ` Tom de Vries
                           ` (6 preceding siblings ...)
  2018-12-19 10:28         ` [nvptx, committed] Use TARGET_SET_CURRENT_FUNCTION Tom de Vries
@ 2018-12-19 10:31         ` Tom de Vries
  2018-12-19 10:33         ` [nvptx, committed] Generalize bar.sync instruction Tom de Vries
                           ` (13 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:31 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 181 bytes --]

[ was: Re: [nvptx] vector length patch series ]
On 14-12-18 20:58, Tom de Vries wrote:
> 0010-nvptx-only-use-one-bar.sync-barriers-in-OpenACC-offl.patch

Committed.

Thanks,
- Tom


[-- Attachment #2: 0002-nvptx-Only-use-one-logical-barrier-resource.patch --]
[-- Type: text/x-patch, Size: 2282 bytes --]

[nvptx] Only use one logical barrier resource

For openacc loops, we generate this style of code:
...
        @%r41   bra.uni $L5;
        @%r40   bra     $L6;
                mov.u64 %r32, %ar0;
                cvta.shared.u64 %r39, __worker_bcast;
                st.u64  [%r39], %r32;
$L6:
$L5:
                bar.sync        0;
        @%r40   bra     $L4;
                cvta.shared.u64 %r38, __worker_bcast;
                ld.u64  %r32, [%r38];
                ...
$L4:
                bar.sync        1;
...

The first barrier is there to ensure that no thread reads the broadcast buffer
before it's written.  The second barrier is there to ensure that no thread
overwrites the broadcast buffer before all threads have read it (as well as
implementing the obligatory synchronization after a worker loop).

We've been using the logical barrier resources '0' and '1' for these two
barriers, but there's no reason why we can't use the same one.

Use logical barrier resource '0' for both barriers, making the openacc
implementation claim less resources.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_single): Always pass false to
	nvptx_wsync.
	(nvptx_process_pars): Likewise.

---
 gcc/config/nvptx/nvptx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9f834d35200..a354811194c 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4351,7 +4351,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_wsync (true), tail);
+	  emit_insn_before (nvptx_wsync (false), tail);
 	}
 
       extract_insn (tail);
@@ -4476,7 +4476,7 @@ nvptx_process_pars (parallel *par)
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_before (nvptx_wsync (false), par->forked_insn);
-	  emit_insn_before (nvptx_wsync (true), par->join_insn);
+	  emit_insn_before (nvptx_wsync (false), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Generalize bar.sync instruction
  2018-12-14 19:58       ` Tom de Vries
                           ` (7 preceding siblings ...)
  2018-12-19 10:31         ` [nvptx, committed] Only use one logical barrier resource Tom de Vries
@ 2018-12-19 10:33         ` Tom de Vries
  2018-12-19 10:35         ` [nvptx, committed] Rename worker_bcast variables to oacc_bcast Tom de Vries
                           ` (12 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:33 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 193 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0011-nvptx-Add-thread-count-parm-to-bar.sync.patch

Factored out this patch, committed.

Thanks,
- Tom

[-- Attachment #2: 0003-nvptx-Generalize-bar.sync-instruction.patch --]
[-- Type: text/x-patch, Size: 1670 bytes --]

[nvptx] Generalize bar.sync instruction

Allow the logical barrier operand of nvptx_barsync to be a register, and add a
thread count operand.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.md (nvptx_barsync): Add and handle operand.
	* config/nvptx/nvptx.c (nvptx_wsync): Update call to gen_nvptx_barsync.

---
 gcc/config/nvptx/nvptx.c  |  2 +-
 gcc/config/nvptx/nvptx.md | 10 ++++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index a354811194c..1ad3ba92caa 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3974,7 +3974,7 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
 static rtx
 nvptx_wsync (bool after)
 {
-  return gen_nvptx_barsync (GEN_INT (after));
+  return gen_nvptx_barsync (GEN_INT (after), GEN_INT (0));
 }
 
 #if WORKAROUND_PTXJIT_BUG
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index ca00b1d8073..f1f6fe0c404 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1454,10 +1454,16 @@
   [(set_attr "atomic" "true")])
 
 (define_insn "nvptx_barsync"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+  [(unspec_volatile [(match_operand:SI 0 "nvptx_nonmemory_operand" "Ri")
+		     (match_operand:SI 1 "const_int_operand")]
 		    UNSPECV_BARSYNC)]
   ""
-  "\\tbar.sync\\t%0;"
+  {
+    if (INTVAL (operands[1]) == 0)
+      return "\\tbar.sync\\t%0;";
+    else
+      return "\\tbar.sync\\t%0, %1;";
+  }
   [(set_attr "predicable" "false")])
 
 (define_expand "memory_barrier"

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Rename worker_bcast variables to oacc_bcast
  2018-12-14 19:58       ` Tom de Vries
                           ` (8 preceding siblings ...)
  2018-12-19 10:33         ` [nvptx, committed] Generalize bar.sync instruction Tom de Vries
@ 2018-12-19 10:35         ` Tom de Vries
  2018-12-19 10:37         ` [nvptx, committed] Make nvptx state propagation function names more generic Tom de Vries
                           ` (11 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:35 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 175 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0006-nvptx-Rename-worker_bcast-variables-oacc_bcast.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0004-nvptx-Rename-worker_bcast-variables-to-oacc_bcast.patch --]
[-- Type: text/x-patch, Size: 6787 bytes --]

[nvptx] Rename worker_bcast variables to oacc_bcast

Rename worker_bcast variables to oacc_bcast, avoiding worker terminology.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (worker_bcast_size): Rename as
	oacc_bcast_size.
	(worker_bcast_align): Rename as oacc_bcast_align.
	(worker_bcast_sym): Rename as oacc_bcast_sym.
	(nvptx_option_override): Update usage of oacc_bcast_*.
	(struct wcast_data_t): Rename as broadcast_data_t.
	(nvptx_gen_wcast): Update type of data argument and usage of
	oacc_bcast_align.
	(wprop_gen): Update type of data_ and usage of oacc_bcast_align.
	(nvptx_wpropagate): Update type of data and usage of
	oacc_bcast_{sym,size}.
	(nvptx_single): Update type of data and usage of oacc_bcast_size.
	(nvptx_file_end): Update usage of oacc_bcast_{sym,align,size}.

---
 gcc/config/nvptx/nvptx.c | 59 ++++++++++++++++++++++++------------------------
 1 file changed, 30 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 1ad3ba92caa..9625ac86aa1 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -127,14 +127,15 @@ struct tree_hasher : ggc_cache_ptr_hash<tree_node>
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
-/* Buffer needed to broadcast across workers.  This is used for both
-   worker-neutering and worker broadcasting.  It is shared by all
-   functions emitted.  The buffer is placed in shared memory.  It'd be
-   nice if PTX supported common blocks, because then this could be
-   shared across TUs (taking the largest size).  */
-static unsigned worker_bcast_size;
-static unsigned worker_bcast_align;
-static GTY(()) rtx worker_bcast_sym;
+/* Buffer needed to broadcast across workers and vectors.  This is
+   used for both worker-neutering and worker broadcasting, and
+   vector-neutering and boardcasting when vector_length > 32.  It is
+   shared by all functions emitted.  The buffer is placed in shared
+   memory.  It'd be nice if PTX supported common blocks, because then
+   this could be shared across TUs (taking the largest size).  */
+static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_align;
+static GTY(()) rtx oacc_bcast_sym;
 
 /* Buffer needed for worker reductions.  This has to be distinct from
    the worker broadcast array, as both may be live concurrently.  */
@@ -207,9 +208,9 @@ nvptx_option_override (void)
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
 
-  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_bcast");
-  SET_SYMBOL_DATA_AREA (worker_bcast_sym, DATA_AREA_SHARED);
-  worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
+  SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
+  oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1754,7 +1755,7 @@ nvptx_gen_vcast (rtx reg)
 
 /* Structure used when generating a worker-level spill or fill.  */
 
-struct wcast_data_t
+struct broadcast_data_t
 {
   rtx base;  /* Register holding base addr of buffer.  */
   rtx ptr;  /* Iteration var,  if needed.  */
@@ -1778,7 +1779,7 @@ enum propagate_mask
    how many loop iterations will be executed (0 for not a loop).  */
    
 static rtx
-nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *data)
 {
   rtx  res;
   machine_mode mode = GET_MODE (reg);
@@ -1808,8 +1809,8 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > worker_bcast_align)
-	      worker_bcast_align = align;
+	    if (align > oacc_bcast_align)
+	      oacc_bcast_align = align;
 	    data->offset = (data->offset + align - 1) & ~(align - 1);
 	    addr = data->base;
 	    if (data->offset)
@@ -3914,15 +3915,15 @@ nvptx_vpropagate (bool is_call, basic_block block, rtx_insn *insn)
 static rtx
 wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
 {
-  wcast_data_t *data = (wcast_data_t *)data_;
+  broadcast_data_t *data = (broadcast_data_t *)data_;
 
   if (pm & PM_loop_begin)
     {
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > worker_bcast_align)
-	worker_bcast_align = align;
+      if (align > oacc_bcast_align)
+	oacc_bcast_align = align;
       data->offset = (data->offset + align - 1) & ~(align - 1);
 
       data->ptr = gen_reg_rtx (Pmode);
@@ -3947,7 +3948,7 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
 static bool
 nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
 {
-  wcast_data_t data;
+  broadcast_data_t data;
 
   data.base = gen_reg_rtx (Pmode);
   data.offset = 0;
@@ -3959,11 +3960,11 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
   if (data.offset)
     {
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, worker_bcast_sym);
+      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (worker_bcast_size < data.offset)
-	worker_bcast_size = data.offset;
+      if (oacc_bcast_size < data.offset)
+	oacc_bcast_size = data.offset;
     }
   return empty;
 }
@@ -4333,13 +4334,13 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	{
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
-	  wcast_data_t data;
+	  broadcast_data_t data;
 
-	  data.base = worker_bcast_sym;
+	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (worker_bcast_size < GET_MODE_SIZE (SImode))
-	    worker_bcast_size = GET_MODE_SIZE (SImode);
+	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
+	    oacc_bcast_size = GET_MODE_SIZE (SImode);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
@@ -4968,9 +4969,9 @@ nvptx_file_end (void)
     nvptx_record_fndecl (decl);
   fputs (func_decls.str().c_str(), asm_out_file);
 
-  if (worker_bcast_size)
-    write_worker_buffer (asm_out_file, worker_bcast_sym,
-			 worker_bcast_align, worker_bcast_size);
+  if (oacc_bcast_size)
+    write_worker_buffer (asm_out_file, oacc_bcast_sym,
+			 oacc_bcast_align, oacc_bcast_size);
 
   if (worker_red_size)
     write_worker_buffer (asm_out_file, worker_red_sym,

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Make nvptx state propagation function names more generic
  2018-12-14 19:58       ` Tom de Vries
                           ` (9 preceding siblings ...)
  2018-12-19 10:35         ` [nvptx, committed] Rename worker_bcast variables to oacc_bcast Tom de Vries
@ 2018-12-19 10:37         ` Tom de Vries
  2018-12-19 10:38         ` [nvptx, committed] Use MAX, MIN, ROUND_UP macros Tom de Vries
                           ` (10 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:37 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 181 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0008-nvptx-make-nvptx-state-propagation-function-names-mo.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0005-nvptx-Make-nvptx-state-propagation-function-names-more-generic.patch --]
[-- Type: text/x-patch, Size: 13872 bytes --]

[nvptx] Make nvptx state propagation function names more generic

Rename state propagation functions to avoid worker/vector terminology.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_gen_wcast): Rename as
	nvptx_gen_warp_bcast.
	(nvptx_gen_wcast): Rename to nvptx_gen_shared_bcast, add bool
	vector argument, and update call to nvptx_gen_shared_bcast.
	(propagator_fn): Add bool argument.
	(nvptx_propagate): New bool argument, pass bool argument to fn.
	(vprop_gen): Rename to warp_prop_gen, update call to
	nvptx_gen_warp_bcast.
	(nvptx_vpropagate): Rename to nvptx_warp_propagate, update call to
	nvptx_propagate.
	(wprop_gen): Rename to shared_prop_gen, update call to
	nvptx_gen_shared_bcast.
	(nvptx_wpropagate): Rename to nvptx_shared_propagate, update call
	to nvptx_propagate.
	(nvptx_wsync): Rename to nvptx_cta_sync.
	(nvptx_single): Update calls to nvptx_gen_warp_bcast,
	nvptx_gen_shared_bcast and nvptx_cta_sync.
	(nvptx_process_pars): Likewise.
	(write_worker_buffer): Rename as write_shared_buffer.
	(nvptx_file_end): Update calls to write_shared_buffer.
	(nvptx_expand_worker_addr): Rename as nvptx_expand_shared_addr.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	(nvptx_get_worker_red_addr): Rename as nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_setup): Update call to
	nvptx_get_shared_red_addr.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.

---
 gcc/config/nvptx/nvptx.c | 96 +++++++++++++++++++++++++++---------------------
 1 file changed, 54 insertions(+), 42 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9625ac86aa1..163f2268e5f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1748,7 +1748,7 @@ nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, nvptx_shuffle_kind kind)
    across the vectors of a single warp.  */
 
 static rtx
-nvptx_gen_vcast (rtx reg)
+nvptx_gen_warp_bcast (rtx reg)
 {
   return nvptx_gen_shuffle (reg, reg, const0_rtx, SHUFFLE_IDX);
 }
@@ -1779,7 +1779,8 @@ enum propagate_mask
    how many loop iterations will be executed (0 for not a loop).  */
    
 static rtx
-nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *data)
+nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
+			broadcast_data_t *data, bool vector)
 {
   rtx  res;
   machine_mode mode = GET_MODE (reg);
@@ -1793,7 +1794,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *dat
 	start_sequence ();
 	if (pm & PM_read)
 	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
-	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	emit_insn (nvptx_gen_shared_bcast (tmp, pm, rep, data, vector));
 	if (pm & PM_write)
 	  emit_insn (gen_rtx_SET (reg, gen_rtx_NE (BImode, tmp, const0_rtx)));
 	res = get_insns ();
@@ -1813,6 +1814,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, broadcast_data_t *dat
 	      oacc_bcast_align = align;
 	    data->offset = (data->offset + align - 1) & ~(align - 1);
 	    addr = data->base;
+	    gcc_assert (data->base != NULL);
 	    if (data->offset)
 	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
 	  }
@@ -3803,11 +3805,11 @@ nvptx_find_sese (auto_vec<basic_block> &blocks, bb_pair_vec_t &regions)
    regions and (b) only propagating stack entries that are used.  The
    latter might be quite hard to determine.  */
 
-typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *);
+typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *, bool);
 
 static bool
 nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
-		 propagate_mask rw, propagator_fn fn, void *data)
+		 propagate_mask rw, propagator_fn fn, void *data, bool vector)
 {
   bitmap live = DF_LIVE_IN (block);
   bitmap_iterator iterator;
@@ -3842,7 +3844,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  
 	  emit_insn (gen_rtx_SET (idx, GEN_INT (fs)));
 	  /* Allow worker function to initialize anything needed.  */
-	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  rtx init = fn (tmp, PM_loop_begin, fs, data, vector);
 	  if (init)
 	    emit_insn (init);
 	  emit_label (label);
@@ -3851,7 +3853,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	}
       if (rw & PM_read)
 	emit_insn (gen_rtx_SET (tmp, gen_rtx_MEM (DImode, ptr)));
-      emit_insn (fn (tmp, rw, fs, data));
+      emit_insn (fn (tmp, rw, fs, data, vector));
       if (rw & PM_write)
 	emit_insn (gen_rtx_SET (gen_rtx_MEM (DImode, ptr), tmp));
       if (fs)
@@ -3859,7 +3861,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 	  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, idx, const0_rtx)));
 	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
 	  emit_insn (gen_br_true_uni (pred, label));
-	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  rtx fini = fn (tmp, PM_loop_end, fs, data, vector);
 	  if (fini)
 	    emit_insn (fini);
 	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
@@ -3879,7 +3881,7 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
 
 	if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
 	  {
-	    rtx bcast = fn (reg, rw, 0, data);
+	    rtx bcast = fn (reg, rw, 0, data, vector);
 
 	    insn = emit_insn_after (bcast, insn);
 	    empty = false;
@@ -3888,16 +3890,17 @@ nvptx_propagate (bool is_call, basic_block block, rtx_insn *insn,
   return empty;
 }
 
-/* Worker for nvptx_vpropagate.  */
+/* Worker for nvptx_warp_propagate.  */
 
 static rtx
-vprop_gen (rtx reg, propagate_mask pm,
-	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+warp_prop_gen (rtx reg, propagate_mask pm,
+	       unsigned ARG_UNUSED (count), void *ARG_UNUSED (data),
+	       bool ARG_UNUSED (vector))
 {
   if (!(pm & PM_read_write))
     return 0;
   
-  return nvptx_gen_vcast (reg);
+  return nvptx_gen_warp_bcast (reg);
 }
 
 /* Propagate state that is live at start of BLOCK across the vectors
@@ -3905,15 +3908,17 @@ vprop_gen (rtx reg, propagate_mask pm,
    IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_vpropagate (bool is_call, basic_block block, rtx_insn *insn)
+nvptx_warp_propagate (bool is_call, basic_block block, rtx_insn *insn)
 {
-  return nvptx_propagate (is_call, block, insn, PM_read_write, vprop_gen, 0);
+  return nvptx_propagate (is_call, block, insn, PM_read_write,
+			  warp_prop_gen, 0, false);
 }
 
-/* Worker for nvptx_wpropagate.  */
+/* Worker for nvptx_shared_propagate.  */
 
 static rtx
-wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
+		 bool vector)
 {
   broadcast_data_t *data = (broadcast_data_t *)data_;
 
@@ -3937,7 +3942,7 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
       return clobber;
     }
   else
-    return nvptx_gen_wcast (reg, pm, rep, data);
+    return nvptx_gen_shared_bcast (reg, pm, rep, data, vector);
 }
 
 /* Spill or fill live state that is live at start of BLOCK.  PRE_P
@@ -3946,7 +3951,8 @@ wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
    INSN.  IS_CALL and return as for nvptx_propagate.  */
 
 static bool
-nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
+nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
+			rtx_insn *insn, bool vector)
 {
   broadcast_data_t data;
 
@@ -3955,7 +3961,8 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
   data.ptr = NULL_RTX;
 
   bool empty = nvptx_propagate (is_call, block, insn,
-				pre_p ? PM_read : PM_write, wprop_gen, &data);
+				pre_p ? PM_read : PM_write, shared_prop_gen,
+				&data, vector);
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
@@ -3973,7 +3980,7 @@ nvptx_wpropagate (bool pre_p, bool is_call, basic_block block, rtx_insn *insn)
    markers for before and after synchronizations.  */
 
 static rtx
-nvptx_wsync (bool after)
+nvptx_cta_sync (bool after)
 {
   return gen_nvptx_barsync (GEN_INT (after), GEN_INT (0));
 }
@@ -4328,7 +4335,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  emit_insn_before (gen_rtx_SET (tmp, pvar), label);
 	  emit_insn_before (gen_rtx_SET (pvar, tmp), tail);
 #endif
-	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	  emit_insn_before (nvptx_gen_warp_bcast (pvar), tail);
 	}
       else
 	{
@@ -4343,16 +4350,18 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	    oacc_bcast_size = GET_MODE_SIZE (SImode);
 
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
+						    false),
 			    before);
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_wsync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
+						    false), tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_wsync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (false), tail);
 	}
 
       extract_insn (tail);
@@ -4469,19 +4478,21 @@ nvptx_process_pars (parallel *par)
 
   if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
     {
-      nvptx_wpropagate (false, is_call, par->forked_block, par->forked_insn);
-      bool empty = nvptx_wpropagate (true, is_call,
-				     par->forked_block, par->fork_insn);
+      nvptx_shared_propagate (false, is_call, par->forked_block,
+			      par->forked_insn, false);
+      bool empty = nvptx_shared_propagate (true, is_call,
+					   par->forked_block, par->fork_insn,
+					   false);
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_wsync (false), par->forked_insn);
-	  emit_insn_before (nvptx_wsync (false), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
-    nvptx_vpropagate (is_call, par->forked_block, par->forked_insn);
+    nvptx_warp_propagate (is_call, par->forked_block, par->forked_insn);
 
   /* Now do siblings.  */
   if (par->next)
@@ -4945,10 +4956,11 @@ nvptx_file_start (void)
   fputs ("// END PREAMBLE\n", asm_out_file);
 }
 
-/* Emit a declaration for a worker-level buffer in .shared memory.  */
+/* Emit a declaration for a worker and vector-level buffer in .shared
+   memory.  */
 
 static void
-write_worker_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
+write_shared_buffer (FILE *file, rtx sym, unsigned align, unsigned size)
 {
   const char *name = XSTR (sym, 0);
 
@@ -4970,11 +4982,11 @@ nvptx_file_end (void)
   fputs (func_decls.str().c_str(), asm_out_file);
 
   if (oacc_bcast_size)
-    write_worker_buffer (asm_out_file, oacc_bcast_sym,
+    write_shared_buffer (asm_out_file, oacc_bcast_sym,
 			 oacc_bcast_align, oacc_bcast_size);
 
   if (worker_red_size)
-    write_worker_buffer (asm_out_file, worker_red_sym,
+    write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
   if (need_softstack_decl)
@@ -5025,7 +5037,7 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
 /* Worker reduction address expander.  */
 
 static rtx
-nvptx_expand_worker_addr (tree exp, rtx target,
+nvptx_expand_shared_addr (tree exp, rtx target,
 			  machine_mode ARG_UNUSED (mode), int ignore)
 {
   if (ignore)
@@ -5161,7 +5173,7 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5330,7 +5342,7 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_worker_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset)
 {
   machine_mode mode = TYPE_MODE (type);
   tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
@@ -5672,7 +5684,7 @@ nvptx_goacc_reduction_setup (gcall *call)
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5814,7 +5826,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -5858,7 +5870,7 @@ nvptx_goacc_reduction_teardown (gcall *call)
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Use MAX, MIN, ROUND_UP macros
  2018-12-14 19:58       ` Tom de Vries
                           ` (10 preceding siblings ...)
  2018-12-19 10:37         ` [nvptx, committed] Make nvptx state propagation function names more generic Tom de Vries
@ 2018-12-19 10:38         ` Tom de Vries
  2018-12-19 17:19         ` [nvptx, committed] Add PTX_CTA_SIZE Tom de Vries
                           ` (9 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 10:38 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 162 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0014-nvptx-Use-MAX-MIN-ROUND_UP-macros.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0006-nvptx-Use-MAX-MIN-ROUND_UP-macros.patch --]
[-- Type: text/x-patch, Size: 3732 bytes --]

[nvptx] Use MAX, MIN, ROUND_UP macros

Use MAX, MIN, and ROUND_UP macros to simplify code.

Build and reg-tested on x86_64 with nvptx accelerator.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_gen_shared_bcast, shared_prop_gen)
	(nvptx_goacc_expand_accel_var): Use MAX and ROUND_UP.
	(nvptx_assemble_value, nvptx_output_skip): Use MIN.
	(nvptx_shared_propagate, nvptx_single, nvptx_expand_shared_addr): Use
	MAX.

---
 gcc/config/nvptx/nvptx.c | 28 ++++++++++------------------
 1 file changed, 10 insertions(+), 18 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 163f2268e5f..2a2d638e6d7 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -1810,9 +1810,8 @@ nvptx_gen_shared_bcast (rtx reg, propagate_mask pm, unsigned rep,
 	  {
 	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
 
-	    if (align > oacc_bcast_align)
-	      oacc_bcast_align = align;
-	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    oacc_bcast_align = MAX (oacc_bcast_align, align);
+	    data->offset = ROUND_UP (data->offset, align);
 	    addr = data->base;
 	    gcc_assert (data->base != NULL);
 	    if (data->offset)
@@ -1934,8 +1933,7 @@ nvptx_assemble_value (unsigned HOST_WIDE_INT val, unsigned size)
     {
       val >>= part * BITS_PER_UNIT;
       part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = size;
+      part = MIN (part, size);
 
       unsigned HOST_WIDE_INT partial
 	= val << (init_frag.offset * BITS_PER_UNIT);
@@ -1998,8 +1996,7 @@ nvptx_output_skip (FILE *, unsigned HOST_WIDE_INT size)
   if (init_frag.offset)
     {
       unsigned part = init_frag.size - init_frag.offset;
-      if (part > size)
-	part = (unsigned) size;
+      part = MIN (part, (unsigned)size);
       size -= part;
       nvptx_assemble_value (0, part);
     }
@@ -3927,9 +3924,8 @@ shared_prop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_,
       /* Starting a loop, initialize pointer.    */
       unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
 
-      if (align > oacc_bcast_align)
-	oacc_bcast_align = align;
-      data->offset = (data->offset + align - 1) & ~(align - 1);
+      oacc_bcast_align = MAX (oacc_bcast_align, align);
+      data->offset = ROUND_UP (data->offset, align);
 
       data->ptr = gen_reg_rtx (Pmode);
 
@@ -3970,8 +3966,7 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
       rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
       emit_insn_after (init, insn);
 
-      if (oacc_bcast_size < data.offset)
-	oacc_bcast_size = data.offset;
+      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
     }
   return empty;
 }
@@ -4346,8 +4341,7 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  if (oacc_bcast_size < GET_MODE_SIZE (SImode))
-	    oacc_bcast_size = GET_MODE_SIZE (SImode);
+	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
@@ -5044,13 +5038,11 @@ nvptx_expand_shared_addr (tree exp, rtx target,
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  if (align > worker_red_align)
-    worker_red_align = align;
+  worker_red_align = MAX (worker_red_align, align);
 
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  if (size + offset > worker_red_size)
-    worker_red_size = size + offset;
+  worker_red_size = MAX (worker_red_size, size + offset);
 
   rtx addr = worker_red_sym;
   if (offset)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Add PTX_CTA_SIZE
  2018-12-14 19:58       ` Tom de Vries
                           ` (11 preceding siblings ...)
  2018-12-19 10:38         ` [nvptx, committed] Use MAX, MIN, ROUND_UP macros Tom de Vries
@ 2018-12-19 17:19         ` Tom de Vries
  2018-12-22  2:19         ` [nvptx] vector length patch series Tom de Vries
                           ` (8 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2018-12-19 17:19 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 185 bytes --]

[ was: Re: [nvptx] vector length patch series ]
On 14-12-18 20:58, Tom de Vries wrote:
> 0005-nvptx-update-openacc-dim-macros.patch

Factored out this patch.

Committed.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Add-PTX_CTA_SIZE.patch --]
[-- Type: text/x-patch, Size: 764 bytes --]

[nvptx] Add PTX_CTA_SIZE

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (PTX_CTA_SIZE): Define.

---
 gcc/config/nvptx/nvptx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2a2d638e6d7..f4095ff5f55 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -86,6 +86,11 @@
 #define PTX_WORKER_LENGTH 32
 #define PTX_DEFAULT_RUNTIME_DIM 0 /* Defer to runtime.  */
 
+/* The PTX concept CTA (Concurrent Thread Array) maps on the CUDA concept thread
+   block, which has had a maximum number of threads of 1024 since CUDA version
+   2.x.  */
+#define PTX_CTA_SIZE 1024
+
 /* The various PTX memory areas an object might reside in.  */
 enum nvptx_data_area
 {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
                           ` (12 preceding siblings ...)
  2018-12-19 17:19         ` [nvptx, committed] Add PTX_CTA_SIZE Tom de Vries
@ 2018-12-22  2:19         ` Tom de Vries
  2019-01-05 11:19           ` Tom de Vries
  2019-01-03 16:08         ` Tom de Vries
                           ` (7 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2018-12-22  2:19 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1103 bytes --]

On 14-12-18 20:58, Tom de Vries wrote:
> 0003-openacc-Add-target-hook-TARGET_GOACC_ADJUST_PARALLEL.patch

> 0017-nvptx-Enable-large-vectors.patch

1.

If I void nvptx_adjust_parallelism like this:
...
static unsigned
nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
{
  return default_goacc_adjust_parallelism (inner_mask, outer_mask);
}
...
I don't run into any failing tests. From what I can tell, the only
test-case that the proposed implementation of the hook has an effect on,
is the worker vector loop in vred2d-128.c, but that one is passing.

Can you confirm that this hook is in fact needed? Does this test fail on
a specific card? Or is there another test-case that exercises this?

2.

If you have a test-case where this is indeed failing without the
proposed hook implementation, then please try to remove the hardcoding
of vector_length > 32 from the test-source and instead set it using
-fopenacc-dim. AFAIU, the proposed hook does not handle that case, so
you should be able to make it fail.
If so, can you test whether attached implementation fixes it?

Thanks,
- Tom

[-- Attachment #2: 0003-nvptx-Add-nvptx_adjust_parallelism.patch --]
[-- Type: text/x-patch, Size: 3589 bytes --]

[nvptx] Add nvptx_adjust_parallelism

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (nvptx_adjust_parallelism): New function.
	(TARGET_GOACC_ADJUST_PARALLELISM): Define.

---
 gcc/config/nvptx/nvptx.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/omp-offload.c        |  7 ++++++
 gcc/omp-offload.h        |  1 +
 3 files changed, 63 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index f4095ff5f55..90bbc5b251e 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5314,6 +5314,58 @@ nvptx_dim_limit (int axis)
   return 0;
 }
 
+/* This is a copy of oacc_validate_dims from omp-offload.c that does not update
+   the function attributes.  */
+
+static void
+oacc_validate_dims_no_update (tree fn, tree attrs, int *dims, int level,
+			      unsigned used)
+{
+  tree purpose[GOMP_DIM_MAX];
+  unsigned ix;
+  tree pos = TREE_VALUE (attrs);
+
+  gcc_assert (pos);
+
+  for (ix = 0; ix != GOMP_DIM_MAX; ix++)
+    {
+      purpose[ix] = TREE_PURPOSE (pos);
+      tree val = TREE_VALUE (pos);
+      dims[ix] = val ? TREE_INT_CST_LOW (val) : -1;
+      pos = TREE_CHAIN (pos);
+    }
+
+  targetm.goacc.validate_dims (fn, dims, level);
+
+  for (ix = 0; ix != GOMP_DIM_MAX; ix++)
+    if (dims[ix] < 0)
+      dims[ix] = (used & GOMP_DIM_MASK (ix)
+		  ? oacc_get_default_dim (ix) : oacc_get_min_dim (ix));
+}
+
+/* Adjust the parallelism available to a loop given vector_length
+   associated with the offloaded function.  */
+
+static unsigned
+nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
+{
+  bool wv = ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	     && (inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+  if (!wv)
+    return default_goacc_adjust_parallelism (inner_mask, outer_mask);
+
+  int dims[GOMP_DIM_MAX];
+  tree attrs = oacc_get_fn_attrib (current_function_decl);
+  int fn_level = oacc_fn_attrib_level (attrs);
+  oacc_validate_dims_no_update (current_function_decl, attrs, dims, fn_level,
+				inner_mask);
+
+  if (dims[GOMP_DIM_VECTOR] > PTX_WARP_SIZE)
+    inner_mask &= ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
+
+  return default_goacc_adjust_parallelism (inner_mask, outer_mask);
+}
+
 /* Determine whether fork & joins are needed.  */
 
 static bool
@@ -6109,6 +6161,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_GOACC_DIM_LIMIT
 #define TARGET_GOACC_DIM_LIMIT nvptx_dim_limit
 
+#undef TARGET_GOACC_ADJUST_PARALLELISM
+#define TARGET_GOACC_ADJUST_PARALLELISM nvptx_adjust_parallelism
+
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 3338e0633a1..80ecda82d24 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -580,6 +580,13 @@ oacc_get_default_dim (int dim)
   return oacc_default_dims[dim];
 }
 
+int
+oacc_get_min_dim (int dim)
+{
+  gcc_assert (0 <= dim && dim < GOMP_DIM_MAX);
+  return oacc_min_dims[dim];
+}
+
 /* Parse the default dimension parameter.  This is a set of
    :-separated optional compute dimensions.  Each specified dimension
    is a positive integer.  When device type support is added, it is
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index 176c4da7e88..08e994abdb9 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -23,6 +23,7 @@ along with GCC; see the file COPYING3.  If not see
 #define GCC_OMP_DEVICE_H
 
 extern int oacc_get_default_dim (int dim);
+extern int oacc_get_min_dim (int dim);
 extern int oacc_fn_attrib_level (tree attr);
 
 extern GTY(()) vec<tree, va_gc> *offload_funcs;

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series -- openacc parts
  2018-12-17 23:52         ` [nvptx] vector length patch series -- openacc parts Tom de Vries
@ 2019-01-03  9:13           ` Tom de Vries
  0 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-03  9:13 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

On 18-12-18 00:52, Tom de Vries wrote:
> On 14-12-18 20:58, Tom de Vries wrote:
> 
>> 0003-openacc-Add-target-hook-TARGET_GOACC_ADJUST_PARALLEL.patch
> 

Hi Thomas,

ping. OK for trunk?

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
                           ` (13 preceding siblings ...)
  2018-12-22  2:19         ` [nvptx] vector length patch series Tom de Vries
@ 2019-01-03 16:08         ` Tom de Vries
  2019-01-03 16:20         ` Tom de Vries
                           ` (6 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-03 16:08 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 5224 bytes --]

On 14-12-18 20:58, Tom de Vries wrote:
> 0007-nvptx-consolidate-offloaded-function-attributes-into.patch


> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index a3169febbb4..dcfa57d10ca 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -2872,6 +2872,17 @@ nvptx_reorg_uniform_simt ()
>      }
>  }
>  
> +/* Offloading function attributes.  */
> +
> +struct offload_attrs
> +{
> +  unsigned mask;
> +  int num_gangs;
> +  int num_workers;
> +  int vector_length;
> +  int max_workers;
> +};
> +

I like the idea of factoring out the extraction of information from the
function attributes.  But max_workers is something derived from that
information, so it doesn't seem proper to add it here.

>  /* Loop structure of the function.  The entire function is described as
>     a NULL loop.  */
>  
> @@ -4569,6 +4580,56 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
>      nvptx_neuter_pars (par->next, modes, outer);
>  }
>  
> +static void
> +populate_offload_attrs (offload_attrs *oa)
> +{
> +  tree attr = oacc_get_fn_attrib (current_function_decl);
> +  tree dims = TREE_VALUE (attr);
> +  unsigned ix;
> +
> +  oa->mask = 0;
> +
> +  for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> +    {
> +      tree t = TREE_VALUE (dims);
> +      int size = (t == NULL_TREE) ? 0 : TREE_INT_CST_LOW (t);

This is very strange.  Why do we represent 'TREE_VALUE (dims) ==
NULL_TREE' with '0' (which means determined at runtime)? In
oacc_validate_dims we use -1 for this, which means 'not set'.

> +      tree allowed = TREE_PURPOSE (dims);
> +
> +      if (size != 1 && !(allowed && integer_zerop (allowed)))
> +       oa->mask |= GOMP_DIM_MASK (ix);
> +
> +      switch (ix)
> +       {
> +       case GOMP_DIM_GANG:
> +         oa->num_gangs = size;
> +         break;
> +
> +       case GOMP_DIM_WORKER:
> +         oa->num_workers = size;
> +         break;
> +
> +       case GOMP_DIM_VECTOR:
> +         oa->vector_length = size;
> +         break;
> +       }
> +    }
> +

> +  if (oa->vector_length == 0)
> +    {
> +      /* FIXME: Need a more graceful way to handle large vector
> +        lengths in OpenACC routines.  */
> +      if (!lookup_attribute ("omp target entrypoint",
> +                            DECL_ATTRIBUTES (current_function_decl)))
> +       oa->vector_length = PTX_WARP_SIZE;
> +      else
> +       oa->vector_length = PTX_VECTOR_LENGTH;
> +    }

The case that 'oa->vector_length == 0' is triggered by calling
populate_offload_attrs from nvptx_adjust_parallelism, which is called
before oacc_validate_dims has updated the function attributes.

This kludge is trying to resolve a circular dependency: after calling
oacc_validate_dims and updating the function attributes, we know the
chosen vector length, which is necessary for nvptx_adjust_parallelism,
which influences the used_mask given as parameter to ... oacc_validate_dims.

The way the kludge tries to cut this circular dependency is by
replicating setting of default dimensions (designed to be done in
nvptx_goacc_validate_dims) here in populate_offload_attrs.

In the counter-proposed nvptx_adjust_parallelism (
https://gcc.gnu.org/ml/gcc-patches/2018-12/msg01619.html ), we cut the
circular dependency in a different way: by reusing
nvptx_goacc_validate_dims in nvptx_adjust_parallelism.

So, we can assume oa->vector_length > 0 here, and declare this dead code.

> +  if (oa->num_workers == 0)
> +    oa->max_workers = PTX_CTA_SIZE / oa->vector_length;
> +  else
> +    oa->max_workers = oa->num_workers;
> +}
> +

I moved this bit to the patch introducing nvptx_mach_max_workers.

>  #if WORKAROUND_PTXJIT_BUG_2
>  /* Variant of pc_set that only requires JUMP_P (INSN) if STRICT.  This variant
>     is needed in the nvptx target because the branches generated for
> @@ -4750,27 +4811,19 @@ nvptx_reorg (void)
>      {
>        /* If we determined this mask before RTL expansion, we could
>          elide emission of some levels of forks and joins.  */
> -      unsigned mask = 0;
> -      tree dims = TREE_VALUE (attr);
> -      unsigned ix;
> +      offload_attrs oa;
>  
> -      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> -       {
> -         int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
> -         tree allowed = TREE_PURPOSE (dims);
> +      populate_offload_attrs (&oa);
>  
> -         if (size != 1 && !(allowed && integer_zerop (allowed)))
> -           mask |= GOMP_DIM_MASK (ix);
> -       }
>        /* If there is worker neutering, there must be vector
>          neutering.  Otherwise the hardware will fail.  */
> -      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> -                 || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
> +      gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> +                 || (oa.mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
>  
>        /* Discover & process partitioned regions.  */
>        parallel *pars = nvptx_discover_pars (&bb_insn_map);
>        nvptx_process_pars (pars);
> -      nvptx_neuter_pars (pars, mask, 0);
> +      nvptx_neuter_pars (pars, oa.mask, 0);
>        delete pars;
>      }
>  

Committed as attached.

Thanks,
- Tom


[-- Attachment #2: 0003-nvptx-Factor-out-populate_offload_attrs.patch --]
[-- Type: text/x-patch, Size: 3230 bytes --]

[nvptx] Factor out populate_offload_attrs

Factor out populate_offload_attrs from nvptx_reorg.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (struct offload_attrs): New.
	(populate_offload_attrs): New function.  Factor mask extraction out of
	nvptx_reorg.  Add extraction of dimensions.
	(nvptx_reorg): Use populate_offload_attrs.

---
 gcc/config/nvptx/nvptx.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 50 insertions(+), 13 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 8cb58341c23..f527429ce2d 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2873,6 +2873,16 @@ nvptx_reorg_uniform_simt ()
     }
 }
 
+/* Offloading function attributes.  */
+
+struct offload_attrs
+{
+  unsigned mask;
+  int num_gangs;
+  int num_workers;
+  int vector_length;
+};
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
@@ -4576,6 +4586,41 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
     nvptx_neuter_pars (par->next, modes, outer);
 }
 
+static void
+populate_offload_attrs (offload_attrs *oa)
+{
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  tree dims = TREE_VALUE (attr);
+  unsigned ix;
+
+  oa->mask = 0;
+
+  for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+    {
+      tree t = TREE_VALUE (dims);
+      int size = (t == NULL_TREE) ? -1 : TREE_INT_CST_LOW (t);
+      tree allowed = TREE_PURPOSE (dims);
+
+      if (size != 1 && !(allowed && integer_zerop (allowed)))
+	oa->mask |= GOMP_DIM_MASK (ix);
+
+      switch (ix)
+	{
+	case GOMP_DIM_GANG:
+	  oa->num_gangs = size;
+	  break;
+
+	case GOMP_DIM_WORKER:
+	  oa->num_workers = size;
+	  break;
+
+	case GOMP_DIM_VECTOR:
+	  oa->vector_length = size;
+	  break;
+	}
+    }
+}
+
 #if WORKAROUND_PTXJIT_BUG_2
 /* Variant of pc_set that only requires JUMP_P (INSN) if STRICT.  This variant
    is needed in the nvptx target because the branches generated for
@@ -4757,27 +4802,19 @@ nvptx_reorg (void)
     {
       /* If we determined this mask before RTL expansion, we could
 	 elide emission of some levels of forks and joins.  */
-      unsigned mask = 0;
-      tree dims = TREE_VALUE (attr);
-      unsigned ix;
+      offload_attrs oa;
 
-      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
-	{
-	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
-	  tree allowed = TREE_PURPOSE (dims);
+      populate_offload_attrs (&oa);
 
-	  if (size != 1 && !(allowed && integer_zerop (allowed)))
-	    mask |= GOMP_DIM_MASK (ix);
-	}
       /* If there is worker neutering, there must be vector
 	 neutering.  Otherwise the hardware will fail.  */
-      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
-		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+      gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+		  || (oa.mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
 
       /* Discover & process partitioned regions.  */
       parallel *pars = nvptx_discover_pars (&bb_insn_map);
       nvptx_process_pars (pars);
-      nvptx_neuter_pars (pars, mask, 0);
+      nvptx_neuter_pars (pars, oa.mask, 0);
       delete pars;
     }
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
                           ` (14 preceding siblings ...)
  2019-01-03 16:08         ` Tom de Vries
@ 2019-01-03 16:20         ` Tom de Vries
  2019-01-03 16:29         ` Tom de Vries
                           ` (5 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-03 16:20 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 2319 bytes --]

On 14-12-18 20:58, Tom de Vries wrote:
> 0012-nvptx-Add-axis_dim.patch

> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index 74a0d4b04d9..02ecf12bd84 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -2885,6 +2885,23 @@ struct offload_attrs
>    int max_workers;
>  };
>  
> +/* Define entries for cfun->machine->axis_dim.  */
> +
> +#define MACH_VECTOR_LENGTH 0
> +#define MACH_MAX_WORKERS 1
> +
> +static int ATTRIBUTE_UNUSED
> +nvptx_mach_max_workers ()
> +{
> +  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
> +}
> +
> +static int ATTRIBUTE_UNUSED
> +nvptx_mach_vector_length ()
> +{
> +  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
> +}
> +
>  /* Loop structure of the function.  The entire function is described as
>     a NULL loop.  */
>  
> @@ -4832,6 +4849,9 @@ nvptx_reorg (void)
>  
>        populate_offload_attrs (&oa);
>  
> +      cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
> +      cfun->machine->axis_dim[MACH_MAX_WORKERS] = oa.max_workers;
> +

This initialization here is done during pass_machine_reorg , but the
data is needed earlier, making it necessary there to call
populate_offload_attrs again, instead of using
nvptx_mach_vector_length/nvptx_mach_max_workers.

I've made the initialization lazy, which fixes that problem.

>        /* If there is worker neutering, there must be vector
>          neutering.  Otherwise the hardware will fail.  */
>        gcc_assert (!(oa.mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
> diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
> index a2fe8b68b22..4059691a609 100644
> --- a/gcc/config/nvptx/nvptx.h
> +++ b/gcc/config/nvptx/nvptx.h
> @@ -218,6 +218,8 @@ struct GTY(()) machine_function
>    int return_mode; /* Return mode of current fn.
>                       (machine_mode not defined yet.) */
>    rtx axis_predicate[2]; /* Neutering predicates.  */
> +  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
> +                     vector_length, dim[1] is num_workers.  */
>    rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
>    rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
>    rtx unisimt_location; /* Mask location for -muniform-simt.  */
> -- 
> 2.17.2

Committed as attached.

Thanks,
- Tom

[-- Attachment #2: 0004-nvptx-Add-nvptx_mach_vector_length-nvptx_mach_max_workers.patch --]
[-- Type: text/x-patch, Size: 2621 bytes --]

[nvptx] Add nvptx_mach_vector_length, nvptx_mach_max_workers

The vector length and maximum number of workers are known compile-time.  Make
these easily available during code generation via new functions.

2019-01-03  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (MACH_VECTOR_LENGTH, MACH_MAX_WORKERS): Define.
	(init_axis_dim, nvptx_mach_max_workers, nvptx_mach_vector_length): New
	function.
	* config/nvptx/nvptx.h (struct machine_function): Add axis_dims.

---
 gcc/config/nvptx/nvptx.c | 41 +++++++++++++++++++++++++++++++++++++++++
 gcc/config/nvptx/nvptx.h |  3 +++
 2 files changed, 44 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index f527429ce2d..52cbac957ce 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -2883,6 +2883,47 @@ struct offload_attrs
   int vector_length;
 };
 
+/* Define entries for cfun->machine->axis_dim.  */
+
+#define MACH_VECTOR_LENGTH 0
+#define MACH_MAX_WORKERS 1
+
+static void populate_offload_attrs (offload_attrs *oa);
+
+static void
+init_axis_dim (void)
+{
+  offload_attrs oa;
+  int max_workers;
+
+  populate_offload_attrs (&oa);
+
+  if (oa.num_workers == 0)
+    max_workers = PTX_CTA_SIZE / oa.vector_length;
+  else
+    max_workers = oa.num_workers;
+
+  cfun->machine->axis_dim[MACH_VECTOR_LENGTH] = oa.vector_length;
+  cfun->machine->axis_dim[MACH_MAX_WORKERS] = max_workers;
+  cfun->machine->axis_dim_init_p = true;
+}
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_max_workers ()
+{
+  if (!cfun->machine->axis_dim_init_p)
+    init_axis_dim ();
+  return cfun->machine->axis_dim[MACH_MAX_WORKERS];
+}
+
+static int ATTRIBUTE_UNUSED
+nvptx_mach_vector_length ()
+{
+  if (!cfun->machine->axis_dim_init_p)
+    init_axis_dim ();
+  return cfun->machine->axis_dim[MACH_VECTOR_LENGTH];
+}
+
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
 
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index a2fe8b68b22..cb4404504c5 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -218,6 +218,9 @@ struct GTY(()) machine_function
   int return_mode; /* Return mode of current fn.
 		      (machine_mode not defined yet.) */
   rtx axis_predicate[2]; /* Neutering predicates.  */
+  int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
+		      vector_length, dim[1] is num_workers.  */
+  bool axis_dim_init_p;
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
                           ` (15 preceding siblings ...)
  2019-01-03 16:20         ` Tom de Vries
@ 2019-01-03 16:29         ` Tom de Vries
  2019-01-07  8:52           ` [nvptx, committed] Fix libgomp.oacc-c-c++-common/vector-length-128-3.c Tom de Vries
  2019-01-07  8:59         ` [nvptx, committed] Add support for a per-worker broadcast buffer and barrier Tom de Vries
                           ` (4 subsequent siblings)
  21 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2019-01-03 16:29 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 889 bytes --]

On 14-12-18 20:58, Tom de Vries wrote:
> 0016-nvptx-Add-vector_length-128-testcases.patch


>         * testsuite/libgomp.oacc-c-c++-common/vred2d-128.c: New test.                                             
>         * testsuite/libgomp.oacc-fortran/gemm.f90: New test.                                                      
>         * testsuite/libgomp.oacc-c-c++-common/vector-length-128-10.c: New test.                                   

These test-cases fail at the point that they're introduced in the patch
series, which is not a good idea.

>         * testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c: New test.                                    
>         * testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c: New test.                                    

These test-cases could have been committed as obvious.

Committed last two as attached.

Thanks,
- Tom

[-- Attachment #2: 0005-nvptx-Add-vector_length-128-testcases.patch --]
[-- Type: text/x-patch, Size: 3203 bytes --]

[nvptx] Add vector_length 128 testcases

Add a couple of test-cases using vector length 128, while checking that we
override to vector length 32.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c: New test.

---
 .../vector-length-128-1.c                          | 39 ++++++++++++++++++++
 .../vector-length-128-3.c                          | 42 ++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
new file mode 100644
index 00000000000..fab5b0d25d1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
@@ -0,0 +1,39 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
+  {
+#pragma acc loop vector
+    for (unsigned int i = 0; i < n; i++)
+      c[i] = a[i] + b[i];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+/* { dg-prune-output "using vector_length \\(32\\), ignoring 128" } */
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
new file mode 100644
index 00000000000..c403e74658b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
@@ -0,0 +1,42 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* We default to warp size 32 for the vector length, so the GOMP_OPENACC_DIM has
+   no effect.  */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "-:-:128" } */
+/* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
+
+
+#include <stdlib.h>
+
+#define N 1024
+
+unsigned int a[N];
+unsigned int b[N];
+unsigned int c[N];
+unsigned int n = N;
+
+int
+main (void)
+{
+  for (unsigned int i = 0; i < n; ++i)
+    {
+      a[i] = i % 3;
+      b[i] = i % 5;
+    }
+
+#pragma acc parallel copyin (a,b) copyout (c)
+  {
+#pragma acc loop vector
+    for (unsigned int i = 0; i < n; i++)
+      c[i] = a[i] + b[i];
+  }
+
+  for (unsigned int i = 0; i < n; ++i)
+    if (c[i] != (i % 3) + (i % 5))
+      abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-22  2:19         ` [nvptx] vector length patch series Tom de Vries
@ 2019-01-05 11:19           ` Tom de Vries
  0 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-05 11:19 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

On 22-12-18 03:13, Tom de Vries wrote:
> If you have a test-case where this is indeed failing without the
> proposed hook implementation, then please try to remove the hardcoding
> of vector_length > 32 from the test-source and instead set it using
> -fopenacc-dim. AFAIU, the proposed hook does not handle that case, so
> you should be able to make it fail.

Filed as PR88706 - "[og8, nvptx, openacc] Inconsistencies when vector
length set using vector_length clause or fopenacc-dim" (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88706 ).

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Fix libgomp.oacc-c-c++-common/vector-length-128-3.c
  2019-01-03 16:29         ` Tom de Vries
@ 2019-01-07  8:52           ` Tom de Vries
  0 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-07  8:52 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 188 bytes --]

[was: Re: [nvptx] vector length patch series]

On 03-01-19 17:29, Tom de Vries wrote:
> +/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "-:-:128" } */

Committed as obvious.

Thanks,
- Tom

[-- Attachment #2: 0002-nvptx-Fix-libgomp.oacc-c-c-common-vector-length-128-3.c.patch --]
[-- Type: text/x-patch, Size: 1162 bytes --]

[nvptx] Fix libgomp.oacc-c-c++-common/vector-length-128-3.c

The vector-length-128-3.c test-case uses GOMP_OPENACC_DIM=-:-:128, but '-' is
not yet supported on trunk.  Use GOMP_OPENACC_DIM=::128 instead.

2019-01-07  Tom de Vries  <tdevries@suse.de>

	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c: Fix
	GOMP_OPENACC_DIM argument.

---
 libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
index c403e74658b..59be37a7c27 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
@@ -2,7 +2,7 @@
 /* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
 /* We default to warp size 32 for the vector length, so the GOMP_OPENACC_DIM has
    no effect.  */
-/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "-:-:128" } */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "::128" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Add support for a per-worker broadcast buffer and barrier
  2018-12-14 19:58       ` Tom de Vries
                           ` (16 preceding siblings ...)
  2019-01-03 16:29         ` Tom de Vries
@ 2019-01-07  8:59         ` Tom de Vries
  2019-01-07  9:01         ` [nvptx] Don't emit barriers for empty loops -- fix Tom de Vries
                           ` (3 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-07  8:59 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 181 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0015-nvptx-Generalize-state-propagation-and-synchronizati.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0007-nvptx-Add-support-for-a-per-worker-broadcast-buffer-and-barrier.patch --]
[-- Type: text/x-patch, Size: 12127 bytes --]

[nvptx] Add support for a per-worker broadcast buffer and barrier

Add support for a per-worker broadcast buffer and barrier, to be used for
openacc vector_length larger than warp-size.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
	(nvptx_option_override): Init oacc_bcast_partition.
	(nvptx_init_oacc_workers): New function.
	(nvptx_declare_function_name): Call nvptx_init_oacc_workers.
	(nvptx_needs_shared_bcast): New function.
	(nvptx_find_par): Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_shared_propagate): Initialize vector bcast partition and
	synchronization state.
	(nvptx_single):  Generalize to enable vectors to use shared-memory
	to propagate state.
	(nvptx_process_pars): Likewise.
	(nvptx_set_current_function): Initialize oacc_broadcast_partition.
	* config/nvptx/nvptx.h (struct machine_function): Add
	bcast_partition and sync_bar members.

---
 gcc/config/nvptx/nvptx.c | 153 +++++++++++++++++++++++++++++++++++++++++------
 gcc/config/nvptx/nvptx.h |   4 ++
 2 files changed, 138 insertions(+), 19 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 6df4d02c4c1..2166f37b182 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -140,6 +140,7 @@ static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
    memory.  It'd be nice if PTX supported common blocks, because then
    this could be shared across TUs (taking the largest size).  */
 static unsigned oacc_bcast_size;
+static unsigned oacc_bcast_partition;
 static unsigned oacc_bcast_align;
 static GTY(()) rtx oacc_bcast_sym;
 
@@ -158,6 +159,8 @@ static bool need_softstack_decl;
 /* True if any function references __nvptx_uni.  */
 static bool need_unisimt_decl;
 
+static int nvptx_mach_max_workers ();
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -217,6 +220,7 @@ nvptx_option_override (void)
   oacc_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, "__oacc_bcast");
   SET_SYMBOL_DATA_AREA (oacc_bcast_sym, DATA_AREA_SHARED);
   oacc_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  oacc_bcast_partition = 0;
 
   worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__worker_red");
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
@@ -1105,6 +1109,40 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
   fprintf (file, "\t}\n");
 }
 
+/* Emit code to initialize OpenACC worker broadcast and synchronization
+   registers.  */
+
+static void
+nvptx_init_oacc_workers (FILE *file)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%tidy;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_bcast;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
+  fprintf (file, "\t\tmov.u32\t\t%%tidy, %%tid.y;\n");
+  if (cfun->machine->bcast_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tidy;\n");
+      fprintf (file, "\t\tadd.u64\t\t%%y64, %%y64, 1; // vector ID\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_bcast, __oacc_bcast;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_bcast; "
+	       "// vector broadcast offset\n",
+	       REGNO (cfun->machine->bcast_partition),
+	       oacc_bcast_partition);
+    }
+  /* Verify oacc_bcast_size.  */
+  gcc_assert (oacc_bcast_partition * (nvptx_mach_max_workers () + 1)
+	      <= oacc_bcast_size);
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
+  fprintf (file, "\t}\n");
+}
+
 /* Emit code to initialize predicate and master lane index registers for
    -muniform-simt code generation variant.  */
 
@@ -1331,6 +1369,8 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
   if (cfun->machine->unisimt_predicate
       || (cfun->machine->has_simtreg && !crtl->is_leaf))
     nvptx_init_unisimt_predicate (file);
+  if (cfun->machine->bcast_partition || cfun->machine->sync_bar)
+    nvptx_init_oacc_workers (file);
 }
 
 /* Output code for switching uniform-simt state.  ENTERING indicates whether
@@ -3072,6 +3112,19 @@ nvptx_split_blocks (bb_insn_map_t *map)
     }
 }
 
+/* Return true if MASK contains parallelism that requires shared
+   memory to broadcast.  */
+
+static bool
+nvptx_needs_shared_bcast (unsigned mask)
+{
+  bool worker = mask & GOMP_DIM_MASK (GOMP_DIM_WORKER);
+  bool large_vector = (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    && nvptx_mach_vector_length () != PTX_WARP_SIZE;
+
+  return worker || large_vector;
+}
+
 /* BLOCK is a basic block containing a head or tail instruction.
    Locate the associated prehead or pretail instruction, which must be
    in the single predecessor block.  */
@@ -3147,7 +3200,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    par = new parallel (par, mask);
 	    par->forked_block = block;
 	    par->forked_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->fork_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
 	  }
@@ -3162,7 +3215,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
 	    gcc_assert (par->mask == mask);
 	    par->join_block = block;
 	    par->join_insn = end;
-	    if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+	    if (nvptx_needs_shared_bcast (mask))
 	      par->joining_insn
 		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
 	    par = par->parent;
@@ -4019,22 +4072,45 @@ nvptx_shared_propagate (bool pre_p, bool is_call, basic_block block,
   gcc_assert (empty == !data.offset);
   if (data.offset)
     {
+      rtx bcast_sym = oacc_bcast_sym;
+
       /* Stuff was emitted, initialize the base pointer now.  */
-      rtx init = gen_rtx_SET (data.base, oacc_bcast_sym);
+      if (vector && nvptx_mach_max_workers () > 1)
+	{
+	  if (!cfun->machine->bcast_partition)
+	    {
+	      /* It would be nice to place this register in
+		 DATA_AREA_SHARED.  */
+	      cfun->machine->bcast_partition = gen_reg_rtx (DImode);
+	    }
+	  if (!cfun->machine->sync_bar)
+	    cfun->machine->sync_bar = gen_reg_rtx (SImode);
+
+	  bcast_sym = cfun->machine->bcast_partition;
+	}
+
+      rtx init = gen_rtx_SET (data.base, bcast_sym);
       emit_insn_after (init, insn);
 
-      oacc_bcast_size = MAX (oacc_bcast_size, data.offset);
+      unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
+      unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			   ? nvptx_mach_max_workers () + 1
+			   : 1);
+
+      oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+      oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
     }
   return empty;
 }
 
-/* Emit a worker-level synchronization barrier.  We use different
-   markers for before and after synchronizations.  */
+/* Emit a CTA-level synchronization barrier.  LOCK is the barrier number,
+   which is an integer or a register.  THREADS is the number of threads
+   controlled by the barrier.  */
 
 static rtx
-nvptx_cta_sync (bool after)
+nvptx_cta_sync (rtx lock, int threads)
 {
-  return gen_nvptx_barsync (GEN_INT (after), GEN_INT (0));
+  return gen_nvptx_barsync (lock, GEN_INT (threads));
 }
 
 #if WORKAROUND_PTXJIT_BUG
@@ -4327,7 +4403,8 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
     {
       rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
 
-      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask
+	  && nvptx_mach_vector_length () == PTX_WARP_SIZE)
 	{
 	  /* Vector mode only, do a shuffle.  */
 #if WORKAROUND_PTXJIT_BUG
@@ -4394,25 +4471,50 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
 	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  broadcast_data_t data;
+	  unsigned size = GET_MODE_SIZE (SImode);
+	  bool vector = (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask) != 0;
+	  bool worker = (GOMP_DIM_MASK (GOMP_DIM_WORKER) == mask) != 0;
+	  rtx barrier = GEN_INT (0);
+	  int threads = 0;
 
 	  data.base = oacc_bcast_sym;
 	  data.ptr = 0;
 
-	  oacc_bcast_size = MAX (oacc_bcast_size, GET_MODE_SIZE (SImode));
+	  bool use_partitioning_p = (vector && !worker
+				     && nvptx_mach_max_workers () > 1
+				     && cfun->machine->bcast_partition);
+	  if (use_partitioning_p)
+	    {
+	      data.base = cfun->machine->bcast_partition;
+	      barrier = cfun->machine->sync_bar;
+	      threads = nvptx_mach_vector_length ();
+	    }
+	  gcc_assert (data.base != NULL);
+	  gcc_assert (barrier);
+
+	  unsigned int psize = ROUND_UP (size, oacc_bcast_align);
+	  unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+			       ? nvptx_mach_max_workers () + 1
+			       : 1);
+
+	  oacc_bcast_partition = MAX (oacc_bcast_partition, psize);
+	  oacc_bcast_size = MAX (oacc_bcast_size, psize * pnum);
 
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_read, 0, &data,
-						    false),
+						    vector),
 			    before);
+
 	  /* Barrier so other workers can see the write.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_shared_bcast (pvar, PM_write, 0, &data,
-						    false), tail);
+						    vector),
+			    tail);
 	  /* This barrier is needed to avoid worker zero clobbering
 	     the broadcast buffer before all the other workers have
 	     had a chance to read this instance of it.  */
-	  emit_insn_before (nvptx_cta_sync (false), tail);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), tail);
 	}
 
       extract_insn (tail);
@@ -4526,20 +4628,32 @@ nvptx_process_pars (parallel *par)
     }
 
   bool is_call = (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX)) != 0;
+  bool worker = (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER));
+  bool large_vector = ((par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+		      && nvptx_mach_vector_length () > PTX_WARP_SIZE);
 
-  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+  if (worker || large_vector)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
-			      par->forked_insn, false);
+			      par->forked_insn, !worker);
       bool empty = nvptx_shared_propagate (true, is_call,
 					   par->forked_block, par->fork_insn,
-					   false);
+					   !worker);
+      rtx barrier = GEN_INT (0);
+      int threads = 0;
+
+      if (!worker && cfun->machine->sync_bar)
+	{
+	  barrier = cfun->machine->sync_bar;
+	  threads = nvptx_mach_vector_length ();
+	}
 
       if (!empty || !is_call)
 	{
 	  /* Insert begin and end synchronizations.  */
-	  emit_insn_before (nvptx_cta_sync (false), par->forked_insn);
-	  emit_insn_before (nvptx_cta_sync (false), par->join_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads),
+			    par->forked_insn);
+	  emit_insn_before (nvptx_cta_sync (barrier, threads), par->join_insn);
 	}
     }
   else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
@@ -6169,6 +6283,7 @@ nvptx_set_current_function (tree fndecl)
     return;
 
   nvptx_previous_fndecl = fndecl;
+  oacc_bcast_partition = 0;
 }
 
 #undef TARGET_OPTION_OVERRIDE
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index a05ff7ec9be..76ce871a731 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -221,6 +221,10 @@ struct GTY(()) machine_function
   int axis_dim[2]; /* Maximum number of threads on each axis, dim[0] is
 		      vector_length, dim[1] is num_workers.  */
   bool axis_dim_init_p;
+  rtx bcast_partition; /* Register containing the size of each
+			  vector's partition of share-memory used to
+			  broadcast state.  */
+  rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
   rtx unisimt_location; /* Mask location for -muniform-simt.  */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx] Don't emit barriers for empty loops -- fix
  2018-12-14 19:58       ` Tom de Vries
                           ` (17 preceding siblings ...)
  2019-01-07  8:59         ` [nvptx, committed] Add support for a per-worker broadcast buffer and barrier Tom de Vries
@ 2019-01-07  9:01         ` Tom de Vries
  2019-01-07  9:03         ` [nvptx] Handle large vector reductions Tom de Vries
                           ` (2 subsequent siblings)
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-07  9:01 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 196 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0022-nvptx-openacc-Don-t-emit-barriers-for-empty-loops.patch

Committed without test-case.

Thanks,
- Tom

[-- Attachment #2: 0008-nvptx-Don-t-emit-barriers-for-empty-loops-fix.patch --]
[-- Type: text/x-patch, Size: 2043 bytes --]

[nvptx] Don't emit barriers for empty loops -- fix

When compiling an empty loop:
...
  long long v1;
  #pragma acc parallel num_gangs (640) num_workers(1) vector_length (128)
  #pragma acc loop
    for (v1 = 0; v1 < 20; v1 += 2)
        ;
...
the compiler emits two subsequent bar.syncs.  This triggers some bug on my
quadro m1200 (I'm assuming in the ptxas/JIT compiler) that hangs the testcase.

This patch works around the bug by doing an optimization: we detect that this is
an empty loop (a forked immediately followed by a joining), and don't emit the
barriers.

The patch does not include the test-case yet, since vector_length (128) is not
yet supported at this point.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	PR target/85381
	* config/nvptx/nvptx.c (nvptx_process_pars): Don't emit barriers for
	empty loops.

---
 gcc/config/nvptx/nvptx.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 2166f37b182..26c80716603 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4636,9 +4636,12 @@ nvptx_process_pars (parallel *par)
     {
       nvptx_shared_propagate (false, is_call, par->forked_block,
 			      par->forked_insn, !worker);
-      bool empty = nvptx_shared_propagate (true, is_call,
-					   par->forked_block, par->fork_insn,
-					   !worker);
+      bool no_prop_p
+	= nvptx_shared_propagate (true, is_call, par->forked_block,
+				  par->fork_insn, !worker);
+      bool empty_loop_p
+	= !is_call && (NEXT_INSN (par->forked_insn)
+		       && NEXT_INSN (par->forked_insn) == par->joining_insn);
       rtx barrier = GEN_INT (0);
       int threads = 0;
 
@@ -4648,7 +4651,11 @@ nvptx_process_pars (parallel *par)
 	  threads = nvptx_mach_vector_length ();
 	}
 
-      if (!empty || !is_call)
+      if (no_prop_p && empty_loop_p)
+	;
+      else if (no_prop_p && is_call)
+	;
+      else
 	{
 	  /* Insert begin and end synchronizations.  */
 	  emit_insn_before (nvptx_cta_sync (barrier, threads),

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx] Handle large vector reductions
  2018-12-14 19:58       ` Tom de Vries
                           ` (18 preceding siblings ...)
  2019-01-07  9:01         ` [nvptx] Don't emit barriers for empty loops -- fix Tom de Vries
@ 2019-01-07  9:03         ` Tom de Vries
  2019-01-07 19:11         ` [nvptx, committed] Force vl32 if calling vector-partitionable routines Tom de Vries
  2019-01-08 23:00         ` [nvptx] vector length patch series Tom de Vries
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-07  9:03 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 165 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0024-nvptx-Handle-large-vector-reductions.patch

Committed.

Thanks,
- Tom

[-- Attachment #2: 0009-nvptx-Handle-large-vector-reductions.patch --]
[-- Type: text/x-patch, Size: 16201 bytes --]

[nvptx] Handle large vector reductions

Add support for vector reductions with openacc vector_length larger than
warp-size.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	* config/nvptx/nvptx-protos.h (nvptx_output_red_partition): Declare.
	* config/nvptx/nvptx.c (vector_red_size, vector_red_align,
	vector_red_partition, vector_red_sym): New global variables.
	(nvptx_option_override): Initialize vector_red_sym.
	(nvptx_declare_function_name): Restore red_partition register.
	(nvptx_file_end): Emit code to declare the vector reduction variables.
	(nvptx_output_red_partition): New function.
	(nvptx_expand_shared_addr): Add vector argument. Use it to handle
	large vector reductions.
	(enum nvptx_builtins): Add NVPTX_BUILTIN_VECTOR_ADDR.
	(nvptx_init_builtins): Add VECTOR_ADDR.
	(nvptx_expand_builtin): Update call to nvptx_expand_shared_addr.
	Handle nvptx_expand_shared_addr.
	(nvptx_get_shared_red_addr): Add vector argument and handle large
	vectors.
	(nvptx_goacc_reduction_setup): Add offload_attrs argument and handle
	large vectors.
	(nvptx_goacc_reduction_init): Likewise.
	(nvptx_goacc_reduction_fini): Likewise.
	(nvptx_goacc_reduction_teardown): Likewise.
	(nvptx_goacc_reduction): Update calls to nvptx_goacc_reduction_{setup,
	init,fini,teardown}.
	(nvptx_init_axis_predicate): Initialize vector_red_partition.
	(nvptx_set_current_function): Init vector_red_partition.
	* config/nvptx/nvptx.md (UNSPECV_RED_PART): New unspecv.
	(nvptx_red_partition): New insn.
	* config/nvptx/nvptx.h (struct machine_function): Add red_partition.

---
 gcc/config/nvptx/nvptx-protos.h |   1 +
 gcc/config/nvptx/nvptx.c        | 154 ++++++++++++++++++++++++++++++++--------
 gcc/config/nvptx/nvptx.h        |   2 +
 gcc/config/nvptx/nvptx.md       |  12 ++++
 4 files changed, 140 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/nvptx-protos.h b/gcc/config/nvptx/nvptx-protos.h
index 1a26d00ab99..be09a15e49c 100644
--- a/gcc/config/nvptx/nvptx-protos.h
+++ b/gcc/config/nvptx/nvptx-protos.h
@@ -56,5 +56,6 @@ extern const char *nvptx_output_return (void);
 extern const char *nvptx_output_set_softstack (unsigned);
 extern const char *nvptx_output_simt_enter (rtx, rtx, rtx);
 extern const char *nvptx_output_simt_exit (rtx);
+extern const char *nvptx_output_red_partition (rtx, rtx);
 #endif
 #endif
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 26c80716603..5a4b38de522 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -150,6 +150,14 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Buffer needed for vector reductions, when vector_length >
+   PTX_WARP_SIZE.  This has to be distinct from the worker broadcast
+   array, as both may be live concurrently.  */
+static unsigned vector_red_size;
+static unsigned vector_red_align;
+static unsigned vector_red_partition;
+static GTY(()) rtx vector_red_sym;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -226,6 +234,11 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  vector_red_sym = gen_rtx_SYMBOL_REF (Pmode, "__vector_red");
+  SET_SYMBOL_DATA_AREA (vector_red_sym, DATA_AREA_SHARED);
+  vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+  vector_red_partition = 0;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -1104,8 +1117,25 @@ nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
 {
   fprintf (file, "\t{\n");
   fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\t.reg.u64\t%%t_red;\n");
+      fprintf (file, "\t\t.reg.u64\t%%y64;\n");
+    }
   fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
   fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  if (strcmp (name, "x") == 0 && cfun->machine->red_partition)
+    {
+      fprintf (file, "\t\tcvt.u64.u32\t%%y64, %%tid.y;\n");
+      fprintf (file, "\t\tcvta.shared.u64\t%%t_red, __vector_red;\n");
+      fprintf (file, "\t\tmad.lo.u64\t%%r%d, %%y64, %d, %%t_red; "
+	       "// vector reduction buffer\n",
+	       REGNO (cfun->machine->red_partition),
+	       vector_red_partition);
+    }
+  /* Verify vector_red_size.  */
+  gcc_assert (vector_red_partition * nvptx_mach_max_workers ()
+	      <= vector_red_size);
   fprintf (file, "\t}\n");
 }
 
@@ -1342,6 +1372,13 @@ nvptx_declare_function_name (FILE *file, const char *name, const_tree decl)
 	fprintf (file, "\t.local.align 8 .b8 %%simtstack_ar["
 		HOST_WIDE_INT_PRINT_DEC "];\n", simtsz);
     }
+
+  /* Restore the vector reduction partition register, if necessary.
+     FIXME: Find out when and why this is necessary, and fix it.  */
+  if (cfun->machine->red_partition)
+    regno_reg_rtx[REGNO (cfun->machine->red_partition)]
+      = cfun->machine->red_partition;
+
   /* Declare the pseudos we have as ptx registers.  */
   int maxregs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < maxregs; i++)
@@ -5188,6 +5225,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (vector_red_size)
+    write_shared_buffer (asm_out_file, vector_red_sym,
+			 vector_red_align, vector_red_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5233,31 +5274,68 @@ nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
   return target;
 }
 
-/* Worker reduction address expander.  */
+const char *
+nvptx_output_red_partition (rtx dst, rtx offset)
+{
+  const char *zero_offset = "\t\tmov.u64\t%%r%d, %%r%d; // vred buffer\n";
+  const char *with_offset = "\t\tadd.u64\t%%r%d, %%r%d, %d; // vred buffer\n";
+
+  if (offset == const0_rtx)
+    fprintf (asm_out_file, zero_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition));
+  else
+    fprintf (asm_out_file, with_offset, REGNO (dst),
+	     REGNO (cfun->machine->red_partition), UINTVAL (offset));
+
+  return "";
+}
+
+/* Shared-memory reduction address expander.  */
 
 static rtx
 nvptx_expand_shared_addr (tree exp, rtx target,
-			  machine_mode ARG_UNUSED (mode), int ignore)
+			  machine_mode ARG_UNUSED (mode), int ignore,
+			  int vector)
 {
   if (ignore)
     return target;
 
   unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
-  worker_red_align = MAX (worker_red_align, align);
-
   unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
   unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
-  worker_red_size = MAX (worker_red_size, size + offset);
-
   rtx addr = worker_red_sym;
-  if (offset)
+
+  if (vector)
     {
-      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
-      addr = gen_rtx_CONST (Pmode, addr);
+      offload_attrs oa;
+
+      populate_offload_attrs (&oa);
+
+      unsigned int psize = ROUND_UP (size + offset, align);
+      unsigned int pnum = nvptx_mach_max_workers ();
+      vector_red_partition = MAX (vector_red_partition, psize);
+      vector_red_size = MAX (vector_red_size, psize * pnum);
+      vector_red_align = MAX (vector_red_align, align);
+
+      if (cfun->machine->red_partition == NULL)
+	cfun->machine->red_partition = gen_reg_rtx (Pmode);
+
+      addr = gen_reg_rtx (Pmode);
+      emit_insn (gen_nvptx_red_partition (addr, GEN_INT (offset)));
     }
+  else
+    {
+      worker_red_align = MAX (worker_red_align, align);
+      worker_red_size = MAX (worker_red_size, size + offset);
 
-  emit_move_insn (target, addr);
+      if (offset)
+	{
+	  addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (offset));
+	  addr = gen_rtx_CONST (Pmode, addr);
+	}
+   }
 
+  emit_move_insn (target, addr);
   return target;
 }
 
@@ -5305,6 +5383,7 @@ enum nvptx_builtins
   NVPTX_BUILTIN_SHUFFLE,
   NVPTX_BUILTIN_SHUFFLELL,
   NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_VECTOR_ADDR,
   NVPTX_BUILTIN_CMP_SWAP,
   NVPTX_BUILTIN_CMP_SWAPLL,
   NVPTX_BUILTIN_MAX
@@ -5342,6 +5421,8 @@ nvptx_init_builtins (void)
   DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
   DEF (WORKER_ADDR, "worker_addr",
        (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (VECTOR_ADDR, "vector_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
   DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
 
@@ -5370,7 +5451,10 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
       return nvptx_expand_shuffle (exp, target, mode, ignore);
 
     case NVPTX_BUILTIN_WORKER_ADDR:
-      return nvptx_expand_shared_addr (exp, target, mode, ignore);
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, false);
+
+    case NVPTX_BUILTIN_VECTOR_ADDR:
+      return nvptx_expand_shared_addr (exp, target, mode, ignore, true);
 
     case NVPTX_BUILTIN_CMP_SWAP:
     case NVPTX_BUILTIN_CMP_SWAPLL:
@@ -5630,10 +5714,13 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
    data at that location.  */
 
 static tree
-nvptx_get_shared_red_addr (tree type, tree offset)
+nvptx_get_shared_red_addr (tree type, tree offset, bool vector)
 {
+  enum nvptx_builtins addr_dim = NVPTX_BUILTIN_WORKER_ADDR;
+  if (vector)
+    addr_dim = NVPTX_BUILTIN_VECTOR_ADDR;
   machine_mode mode = TYPE_MODE (type);
-  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree fndecl = nvptx_builtin_decl (addr_dim, true);
   tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
   tree align = build_int_cst (unsigned_type_node,
 			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
@@ -5949,7 +6036,7 @@ nvptx_reduction_update (location_t loc, gimple_stmt_iterator *gsi,
 /* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
 
 static void
-nvptx_goacc_reduction_setup (gcall *call)
+nvptx_goacc_reduction_setup (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -5968,11 +6055,13 @@ nvptx_goacc_reduction_setup (gcall *call)
 	var = build_simple_mem_ref (ref_to_res);
     }
   
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Store incoming value to worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+					     level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -5991,7 +6080,7 @@ nvptx_goacc_reduction_setup (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_INIT. */
 
 static void
-nvptx_goacc_reduction_init (gcall *call)
+nvptx_goacc_reduction_init (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6005,7 +6094,7 @@ nvptx_goacc_reduction_init (gcall *call)
   
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
       tree tid = make_ssa_name (integer_type_node);
@@ -6075,7 +6164,7 @@ nvptx_goacc_reduction_init (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_FINI.  */
 
 static void
-nvptx_goacc_reduction_fini (gcall *call)
+nvptx_goacc_reduction_fini (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6089,7 +6178,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 
   push_gimplify_context (true);
 
-  if (level == GOMP_DIM_VECTOR)
+  if (level == GOMP_DIM_VECTOR && oa->vector_length == PTX_WARP_SIZE)
     {
       /* Emit binary shuffle tree.  TODO. Emit this as an actual loop,
 	 but that requires a method of emitting a unified jump at the
@@ -6110,11 +6199,12 @@ nvptx_goacc_reduction_fini (gcall *call)
     {
       tree accum = NULL_TREE;
 
-      if (level == GOMP_DIM_WORKER)
+      if (level == GOMP_DIM_WORKER || level == GOMP_DIM_VECTOR)
 	{
 	  /* Get reduction buffer address.  */
 	  tree offset = gimple_call_arg (call, 5);
-	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset);
+	  tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+						 level == GOMP_DIM_VECTOR);
 	  tree ptr = make_ssa_name (TREE_TYPE (call));
 
 	  gimplify_assign (ptr, call, &seq);
@@ -6145,7 +6235,7 @@ nvptx_goacc_reduction_fini (gcall *call)
 /* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
 
 static void
-nvptx_goacc_reduction_teardown (gcall *call)
+nvptx_goacc_reduction_teardown (gcall *call, offload_attrs *oa)
 {
   gimple_stmt_iterator gsi = gsi_for_stmt (call);
   tree lhs = gimple_call_lhs (call);
@@ -6154,11 +6244,13 @@ nvptx_goacc_reduction_teardown (gcall *call)
   gimple_seq seq = NULL;
   
   push_gimplify_context (true);
-  if (level == GOMP_DIM_WORKER)
+  if (level == GOMP_DIM_WORKER
+      || (level == GOMP_DIM_VECTOR && oa->vector_length > PTX_WARP_SIZE))
     {
       /* Read the worker reduction buffer.  */
       tree offset = gimple_call_arg (call, 5);
-      tree call = nvptx_get_shared_red_addr(TREE_TYPE (var), offset);
+      tree call = nvptx_get_shared_red_addr (TREE_TYPE (var), offset,
+					     level == GOMP_DIM_VECTOR);
       tree ptr = make_ssa_name (TREE_TYPE (call));
 
       gimplify_assign (ptr, call, &seq);
@@ -6189,23 +6281,26 @@ static void
 nvptx_goacc_reduction (gcall *call)
 {
   unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  offload_attrs oa;
+
+  populate_offload_attrs (&oa);
 
   switch (code)
     {
     case IFN_GOACC_REDUCTION_SETUP:
-      nvptx_goacc_reduction_setup (call);
+      nvptx_goacc_reduction_setup (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_INIT:
-      nvptx_goacc_reduction_init (call);
+      nvptx_goacc_reduction_init (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_FINI:
-      nvptx_goacc_reduction_fini (call);
+      nvptx_goacc_reduction_fini (call, &oa);
       break;
 
     case IFN_GOACC_REDUCTION_TEARDOWN:
-      nvptx_goacc_reduction_teardown (call);
+      nvptx_goacc_reduction_teardown (call, &oa);
       break;
 
     default:
@@ -6290,6 +6385,7 @@ nvptx_set_current_function (tree fndecl)
     return;
 
   nvptx_previous_fndecl = fndecl;
+  vector_red_partition = 0;
   oacc_bcast_partition = 0;
 }
 
diff --git a/gcc/config/nvptx/nvptx.h b/gcc/config/nvptx/nvptx.h
index 76ce871a731..29e658248ab 100644
--- a/gcc/config/nvptx/nvptx.h
+++ b/gcc/config/nvptx/nvptx.h
@@ -224,6 +224,8 @@ struct GTY(()) machine_function
   rtx bcast_partition; /* Register containing the size of each
 			  vector's partition of share-memory used to
 			  broadcast state.  */
+  rtx red_partition; /* Similar to bcast_partition, except for vector
+			reductions.  */
   rtx sync_bar; /* Synchronization barrier ID for vectors.  */
   rtx unisimt_master; /* 'Master lane index' for -muniform-simt.  */
   rtx unisimt_predicate; /* Predicate for -muniform-simt.  */
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 271b00e1eb0..1a090a47a32 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -68,6 +68,8 @@
 
    UNSPECV_SIMT_ENTER
    UNSPECV_SIMT_EXIT
+
+   UNSPECV_RED_PART
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -1508,3 +1510,13 @@
   ""
   "\\t.pragma \\\"nounroll\\\";"
   [(set_attr "predicable" "false")])
+
+(define_insn "nvptx_red_partition"
+  [(set (match_operand:DI 0 "nonimmediate_operand" "=R")
+	(unspec_volatile [(match_operand:DI 1 "const_int_operand")]
+	 UNSPECV_RED_PART))]
+  ""
+  {
+    return nvptx_output_red_partition (operands[0], operands[1]);
+  }
+  [(set_attr "predicable" "false")])

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [nvptx, committed] Force vl32 if calling vector-partitionable routines
  2018-12-14 19:58       ` Tom de Vries
                           ` (19 preceding siblings ...)
  2019-01-07  9:03         ` [nvptx] Handle large vector reductions Tom de Vries
@ 2019-01-07 19:11         ` Tom de Vries
  2020-10-30 16:53           ` Thomas Schwinge
  2019-01-08 23:00         ` [nvptx] vector length patch series Tom de Vries
  21 siblings, 1 reply; 32+ messages in thread
From: Tom de Vries @ 2019-01-07 19:11 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1566 bytes --]

[ was: Re: [nvptx] vector length patch series ]

On 14-12-18 20:58, Tom de Vries wrote:
> 0023-nvptx-Force-vl32-if-calling-vector-partitionable-rou.patch

> @@ -73,6 +73,7 @@
>  #include "cfgloop.h"
>  #include "fold-const.h"
>  #include "intl.h"
> +#include "tree-hash-traits.h"
>  #include "omp-offload.h"
>  
>  /* This file should be included last.  */

I dropped that include, that's not necessary.

> @@ -5557,19 +5637,6 @@ nvptx_adjust_parallelism (unsigned inner_mask, unsigned outer_mask)
>    if (wv)
>      return inner_mask & ~GOMP_DIM_MASK (GOMP_DIM_WORKER);
>  
> -  /* It's difficult to guarantee that warps in large vector_lengths
> -     will remain convergent when a vector loop is nested inside a
> -     worker loop.  Therefore, fallback to setting vector_length to
> -     PTX_WARP_SIZE.  Hopefully this condition may be relaxed for
> -     sm_70+ targets.  */
> -  if ((inner_mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
> -      && (outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
> -    {
> -      tree attr = tree_cons (get_identifier (NVPTX_GOACC_VL_WARP), NULL_TREE,
> -                             DECL_ATTRIBUTES (current_function_decl));
> -      DECL_ATTRIBUTES (current_function_decl) = attr;
> -    }
> -
>    return inner_mask;
>  }
>  

This patch is removing here some code related to a workaround that was
added earlier in the patch series
(0017-nvptx-Enable-large-vectors.patch). Which means that that submitted
patch should not have contained that code in the first place.

Committed (without test-cases) as attached.

Thanks,
- Tom

[-- Attachment #2: 0001-nvptx-Force-vl32-if-calling-vector-partitionable-routines.patch --]
[-- Type: text/x-patch, Size: 2751 bytes --]

[nvptx] Force vl32 if calling vector-partitionable routines

With PTX_MAX_VECTOR_LENGTH set to larger than PTX_WARP_SIZE, routines can be
called from offloading regions with vector-size set to larger than warp size.
OTOH, vector-partitionable routines assume warp-sized vector length.

Detect if we're calling a vector-partitionable routine from an offloading
region, and if so, fall back to warp-sized vector length in that region.

2018-12-17  Tom de Vries  <tdevries@suse.de>

	PR target/85486
	* config/nvptx/nvptx.c (has_vector_partitionable_routine_calls_p): New
	function.
	(nvptx_goacc_validate_dims): Force vl32 if calling vector-partitionable
	routines.

---
 gcc/config/nvptx/nvptx.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 5a4b38de522..7fdc285b6f8 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -59,6 +59,7 @@
 #include "builtins.h"
 #include "omp-general.h"
 #include "omp-low.h"
+#include "omp-offload.h"
 #include "gomp-constants.h"
 #include "dumpfile.h"
 #include "internal-fn.h"
@@ -5496,6 +5497,40 @@ nvptx_apply_dim_limits (int dims[])
     dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
 }
 
+/* Return true if FNDECL contains calls to vector-partitionable routines.  */
+
+static bool
+has_vector_partitionable_routine_calls_p (tree fndecl)
+{
+  if (!fndecl)
+    return false;
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (fndecl))
+    for (gimple_stmt_iterator i = gsi_start_bb (bb); !gsi_end_p (i);
+	 gsi_next_nondebug (&i))
+      {
+	gimple *stmt = gsi_stmt (i);
+	if (gimple_code (stmt) != GIMPLE_CALL)
+	  continue;
+
+	tree callee = gimple_call_fndecl (stmt);
+	if (!callee)
+	  continue;
+
+	tree attrs  = oacc_get_fn_attrib (callee);
+	if (attrs == NULL_TREE)
+	  return false;
+
+	int partition_level = oacc_fn_attrib_level (attrs);
+	bool seq_routine_p = partition_level == GOMP_DIM_MAX;
+	if (!seq_routine_p)
+	  return true;
+      }
+
+  return false;
+}
+
 /* As nvptx_goacc_validate_dims, but does not return bool to indicate whether
    DIMS has changed.  */
 
@@ -5611,6 +5646,16 @@ nvptx_goacc_validate_dims_1 (tree decl, int dims[], int fn_level)
     old_dims[i] = dims[i];
 
   const char *vector_reason = NULL;
+  if (offload_region_p && has_vector_partitionable_routine_calls_p (decl))
+    {
+      if (dims[GOMP_DIM_VECTOR] > PTX_WARP_SIZE)
+	{
+	  vector_reason = G_("using vector_length (%d) due to call to"
+			     " vector-partitionable routine, ignoring %d");
+	  dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
+	}
+    }
+
   if (dims[GOMP_DIM_VECTOR] == 0)
     {
       vector_reason = G_("using vector_length (%d), ignoring runtime setting");

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx] vector length patch series
  2018-12-14 19:58       ` Tom de Vries
                           ` (20 preceding siblings ...)
  2019-01-07 19:11         ` [nvptx, committed] Force vl32 if calling vector-partitionable routines Tom de Vries
@ 2019-01-08 23:00         ` Tom de Vries
  21 siblings, 0 replies; 32+ messages in thread
From: Tom de Vries @ 2019-01-08 23:00 UTC (permalink / raw)
  To: Schwinge, Thomas; +Cc: gcc-patches

On 14-12-18 20:58, Tom de Vries wrote:
> 0016-nvptx-Add-vector_length-128-testcases.patch

> --- /dev/null                                                                                                     
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vred2d-128.c                                                    

> +gentest (test1, "acc parallel loop gang vector_length (128)",                                                    
> +        "acc loop vector reduction(+:t1) reduction(-:t2)")      

With this I run into PR70895 - "OpenACC: loop reduction does not work.
Output is zero".

Making the implicit firstprivate explicit fixes that.

Same problem and solution for gemm.f90.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims
  2018-12-17 21:46         ` [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims Tom de Vries
@ 2019-02-22 11:23           ` Thomas Schwinge
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Schwinge @ 2019-02-22 11:23 UTC (permalink / raw)
  To: Tom de Vries, gcc-patches, fortran; +Cc: Jakub Jelinek


[-- Attachment #1.1: Type: text/plain, Size: 1996 bytes --]

Hi!

On Mon, 17 Dec 2018 22:46:50 +0100, Tom de Vries <tdevries@suse.de> wrote:
> [ was: Re: [nvptx] vector length patch series ]
> 
> On 14-12-18 20:58, Tom de Vries wrote:
> >> 0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch
> > If I remove this, I run into ICEs in the compiler, but I think that
> > means we need to understand and fix that ICE, instead of concluding that
> > we need this patch. It looks completely unrelated.
> 
> Indeed this
> (0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch) patch
> is unrelated to the vector length functionality.
> 
> However, it fixes a problem in the Fortran front-end which has as
> consequence that C and Fortran routines look the same in
> nvptx_goacc_validate_dims, which is a good thing to have.
> 
> However, the upstreaming of the patch seems to be stuck, so I've
> committed an nvptx workaround patch that has the same effect, allowing
> us to drop it
> (0004-openacc-Make-GFC-default-to-1-for-OpenACC-routine-di.patch) from
> the patch series.

ACK, thanks.

> [nvptx] Unify C/Fortran routine handling in nvptx_goacc_validate_dims
> 
> The Fortran front-end has a bug (PR72741) that means what when
> nvptx_goacc_validate_dims is called for a Fortran routine, the dims parameter
> is not the same as it would have been if the function would have been called for
> an equivalent C routine.
> 
> Work around this bug by overriding the dims parameter for routines, allowing the
> function to handle routines in Fortran and C the same.

I have now finally identified the relevant changes (scattered over
several commits on the OpenACC development branch, each of these trying
to do too many things at once, but also incompletely...), and then have
rewritten most of it anyway, into a more pleasant form, and now committed
to trunk in r269105 "[PR72741] Use 'oacc_build_routine_dims' for Fortran
OpenACC 'routine' directives, too", as attached.


Grüße
 Thomas



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0004-PR72741-Use-oacc_build_routine_dims-for-Fortran-Open.patch --]
[-- Type: text/x-diff, Size: 13395 bytes --]

From 1d740b07b3ba5b15b7ece7fdb25236e32251131a Mon Sep 17 00:00:00 2001
From: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Fri, 22 Feb 2019 10:50:35 +0000
Subject: [PATCH 4/9] [PR72741] Use 'oacc_build_routine_dims' for Fortran
 OpenACC 'routine' directives, too

... instead of having an incomplete local implementation.

With these changes in place, we can then also revert the work-around r267213
"[nvptx] Unify C/Fortran routine handling in nvptx_goacc_validate_dims".

	gcc/fortran/
	PR fortran/72741
	* gfortran.h (oacc_routine_lop): New enum.
	(symbol_attribute): Use it.
	* openmp.c (gfc_oacc_routine_dims): Replace with...
	(gfc_oacc_routine_lop): ... this new function.
	(gfc_match_oacc_routine): Adjust.
	* trans-decl.c (add_attributes_to_decl): Likewise.
	gcc/
	PR fortran/72741
	* omp-general.c (oacc_replace_fn_attrib): Mostly split out into...
	(oacc_replace_fn_attrib_attr): ... this new function.
	* omp-general.h (oacc_replace_fn_attrib_attr): New prototype.
	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims_1): Revert workaround.
	gcc/testsuite/
	PR fortran/72741
	* gfortran.dg/goacc/classify-routine.f95: Adjust.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@269105 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog                                 |  8 ++++
 gcc/config/nvptx/nvptx.c                      | 35 ----------------
 gcc/fortran/ChangeLog                         | 11 +++++
 gcc/fortran/gfortran.h                        | 13 +++++-
 gcc/fortran/openmp.c                          | 41 +++++++++++--------
 gcc/fortran/trans-decl.c                      | 34 ++++++++++-----
 gcc/omp-general.c                             | 18 ++++++--
 gcc/omp-general.h                             |  1 +
 gcc/testsuite/ChangeLog                       |  3 ++
 .../gfortran.dg/goacc/classify-routine.f95    |  4 +-
 10 files changed, 99 insertions(+), 69 deletions(-)

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 4745a1999d9..f14cbbce477 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,11 @@
+2019-02-22  Thomas Schwinge  <thomas@codesourcery.com>
+
+	PR fortran/72741
+	* omp-general.c (oacc_replace_fn_attrib): Mostly split out into...
+	(oacc_replace_fn_attrib_attr): ... this new function.
+	* omp-general.h (oacc_replace_fn_attrib_attr): New prototype.
+	* config/nvptx/nvptx.c (nvptx_goacc_validate_dims_1): Revert workaround.
+
 2019-02-22  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
 
 	* config/arm/arm-cpus.in (ares): Rename to...
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 23459e1c6f4..424b43ac8b4 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5577,41 +5577,6 @@ nvptx_goacc_validate_dims_1 (tree decl, int dims[], int fn_level, unsigned used)
   else
     gcc_unreachable ();
 
-  if (routine_p)
-    {
-      /* OpenACC routines in C arrive here with the following attributes
-	 (omitting the 'omp declare target'):
-	 seq   : __attribute__((oacc function (0 1, 0 1, 0 1)))
-	 vector: __attribute__((oacc function (0 1, 0 1, 1 0)))
-	 worker: __attribute__((oacc function (0 1, 1 0, 1 0)))
-	 gang  : __attribute__((oacc function (1 0, 1 0, 1 0)))
-
-	 If we take f.i. the oacc function attribute of the worker routine
-	 (0 1, 1 0, 1 0), then:
-	 - the slice (0, 1, 1) is interpreted by oacc_fn_attrib_level as
-	   meaning: worker routine, that is:
-	   - can't contain gang loop (0),
-	   - can contain worker loop (1),
-	   - can contain vector loop (1).
-	 - the slice (1, 0, 0) is interpreted by oacc_validate_dims as the
-	 dimensions: gang: 1, worker: 0, vector: 0.
-
-	 OTOH, routines in Fortran arrive here with these attributes:
-	 seq   : __attribute__((oacc function (0 0, 0 0, 0 0)))
-	 vector: __attribute__((oacc function (0 0, 0 0, 1 0)))
-	 worker: __attribute__((oacc function (0 0, 1 0, 1 0)))
-	 gang  : __attribute__((oacc function (1 0, 1 0, 1 0)))
-	 that is, the same as for C but with the dimensions set to 0.
-
-	 This is due to a bug in the Fortran front-end: PR72741.  Work around
-	 this bug by forcing the dimensions to be the same in Fortran as for C,
-	 to be able to handle C and Fortran routines uniformly in this
-	 function.  */
-      dims[GOMP_DIM_VECTOR] = fn_level > GOMP_DIM_VECTOR ? 1 : 0;
-      dims[GOMP_DIM_WORKER] = fn_level > GOMP_DIM_WORKER ? 1 : 0;
-      dims[GOMP_DIM_GANG] = fn_level > GOMP_DIM_GANG ? 1 : 0;
-    }
-
   if (oacc_min_dims_p)
     {
       gcc_assert (dims[GOMP_DIM_VECTOR] == 1);
diff --git a/gcc/fortran/ChangeLog b/gcc/fortran/ChangeLog
index 74a6890ed70..0eb860449ca 100644
--- a/gcc/fortran/ChangeLog
+++ b/gcc/fortran/ChangeLog
@@ -1,3 +1,14 @@
+2019-02-22  Thomas Schwinge  <thomas@codesourcery.com>
+	    Cesar Philippidis  <cesar@codesourcery.com>
+
+	PR fortran/72741
+	* gfortran.h (oacc_routine_lop): New enum.
+	(symbol_attribute): Use it.
+	* openmp.c (gfc_oacc_routine_dims): Replace with...
+	(gfc_oacc_routine_lop): ... this new function.
+	(gfc_match_oacc_routine): Adjust.
+	* trans-decl.c (add_attributes_to_decl): Likewise.
+
 2019-02-22  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* openmp.c (gfc_match_oacc_declare): Revert earlier changes.
diff --git a/gcc/fortran/gfortran.h b/gcc/fortran/gfortran.h
index 6c4e839c489..f0258b39ffd 100644
--- a/gcc/fortran/gfortran.h
+++ b/gcc/fortran/gfortran.h
@@ -317,6 +317,15 @@ enum save_state
 { SAVE_NONE = 0, SAVE_EXPLICIT, SAVE_IMPLICIT
 };
 
+/* OpenACC 'routine' directive's level of parallelism.  */
+enum oacc_routine_lop
+{ OACC_ROUTINE_LOP_NONE = 0,
+  OACC_ROUTINE_LOP_GANG,
+  OACC_ROUTINE_LOP_WORKER,
+  OACC_ROUTINE_LOP_VECTOR,
+  OACC_ROUTINE_LOP_SEQ
+};
+
 /* Strings for all symbol attributes.  We use these for dumping the
    parse tree, in error messages, and also when reading and writing
    modules.  In symbol.c.  */
@@ -904,8 +913,8 @@ typedef struct
   unsigned oacc_declare_device_resident:1;
   unsigned oacc_declare_link:1;
 
-  /* This is an OpenACC acclerator function at level N - 1  */
-  unsigned oacc_function:3;
+  /* OpenACC 'routine' directive's level of parallelism.  */
+  ENUM_BITFIELD (oacc_routine_lop) oacc_routine_lop:3;
 
   /* Attributes set by compiler extensions (!GCC$ ATTRIBUTES).  */
   unsigned ext_attr:EXT_ATTR_NUM;
diff --git a/gcc/fortran/openmp.c b/gcc/fortran/openmp.c
index 8aa4a2f18c4..dfd4be86d50 100644
--- a/gcc/fortran/openmp.c
+++ b/gcc/fortran/openmp.c
@@ -2232,34 +2232,43 @@ gfc_match_oacc_cache (void)
   return MATCH_YES;
 }
 
-/* Determine the loop level for a routine.   */
+/* Determine the OpenACC 'routine' directive's level of parallelism.  */
 
-static int
-gfc_oacc_routine_dims (gfc_omp_clauses *clauses)
+static oacc_routine_lop
+gfc_oacc_routine_lop (gfc_omp_clauses *clauses)
 {
-  int level = -1;
+  oacc_routine_lop ret = OACC_ROUTINE_LOP_SEQ;
 
   if (clauses)
     {
-      unsigned mask = 0;
+      unsigned n_lop_clauses = 0;
 
       if (clauses->gang)
-	level = GOMP_DIM_GANG, mask |= GOMP_DIM_MASK (level);
+	{
+	  ++n_lop_clauses;
+	  ret = OACC_ROUTINE_LOP_GANG;
+	}
       if (clauses->worker)
-	level = GOMP_DIM_WORKER, mask |= GOMP_DIM_MASK (level);
+	{
+	  ++n_lop_clauses;
+	  ret = OACC_ROUTINE_LOP_WORKER;
+	}
       if (clauses->vector)
-	level = GOMP_DIM_VECTOR, mask |= GOMP_DIM_MASK (level);
+	{
+	  ++n_lop_clauses;
+	  ret = OACC_ROUTINE_LOP_VECTOR;
+	}
       if (clauses->seq)
-	level = GOMP_DIM_MAX, mask |= GOMP_DIM_MASK (level);
+	{
+	  ++n_lop_clauses;
+	  ret = OACC_ROUTINE_LOP_SEQ;
+	}
 
-      if (mask != (mask & -mask))
+      if (n_lop_clauses > 1)
 	gfc_error ("Multiple loop axes specified for routine");
     }
 
-  if (level < 0)
-    level = GOMP_DIM_MAX;
-
-  return level;
+  return ret;
 }
 
 match
@@ -2352,8 +2361,8 @@ gfc_match_oacc_routine (void)
 				       gfc_current_ns->proc_name->name,
 				       &old_loc))
 	goto cleanup;
-      gfc_current_ns->proc_name->attr.oacc_function
-	= gfc_oacc_routine_dims (c) + 1;
+      gfc_current_ns->proc_name->attr.oacc_routine_lop
+	= gfc_oacc_routine_lop (c);
     }
 
   if (n)
diff --git a/gcc/fortran/trans-decl.c b/gcc/fortran/trans-decl.c
index 3604cfcf5cb..20d453051a2 100644
--- a/gcc/fortran/trans-decl.c
+++ b/gcc/fortran/trans-decl.c
@@ -46,6 +46,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "trans-stmt.h"
 #include "gomp-constants.h"
 #include "gimplify.h"
+#include "omp-general.h"
 
 #define MAX_LABEL_VALUE 99999
 
@@ -1406,18 +1407,31 @@ add_attributes_to_decl (symbol_attribute sym_attr, tree list)
     list = tree_cons (get_identifier ("omp declare target"),
 		      NULL_TREE, list);
 
-  if (sym_attr.oacc_function)
+  if (sym_attr.oacc_routine_lop != OACC_ROUTINE_LOP_NONE)
     {
-      tree dims = NULL_TREE;
-      int ix;
-      int level = sym_attr.oacc_function - 1;
-
-      for (ix = GOMP_DIM_MAX; ix--;)
-	dims = tree_cons (build_int_cst (boolean_type_node, ix >= level),
-			  integer_zero_node, dims);
+      omp_clause_code code;
+      switch (sym_attr.oacc_routine_lop)
+	{
+	case OACC_ROUTINE_LOP_GANG:
+	  code = OMP_CLAUSE_GANG;
+	  break;
+	case OACC_ROUTINE_LOP_WORKER:
+	  code = OMP_CLAUSE_WORKER;
+	  break;
+	case OACC_ROUTINE_LOP_VECTOR:
+	  code = OMP_CLAUSE_VECTOR;
+	  break;
+	case OACC_ROUTINE_LOP_SEQ:
+	  code = OMP_CLAUSE_SEQ;
+	  break;
+	case OACC_ROUTINE_LOP_NONE:
+	default:
+	  gcc_unreachable ();
+	}
+      tree c = build_omp_clause (UNKNOWN_LOCATION, code);
 
-      list = tree_cons (get_identifier ("oacc function"),
-			dims, list);
+      tree dims = oacc_build_routine_dims (c);
+      list = oacc_replace_fn_attrib_attr (list, dims);
     }
 
   return list;
diff --git a/gcc/omp-general.c b/gcc/omp-general.c
index 0f66ba0c5d8..356772ff458 100644
--- a/gcc/omp-general.c
+++ b/gcc/omp-general.c
@@ -540,16 +540,26 @@ oacc_launch_pack (unsigned code, tree device, unsigned op)
 
 /* Replace any existing oacc fn attribute with updated dimensions.  */
 
-void
-oacc_replace_fn_attrib (tree fn, tree dims)
+/* Variant working on a list of attributes.  */
+
+tree
+oacc_replace_fn_attrib_attr (tree attribs, tree dims)
 {
   tree ident = get_identifier (OACC_FN_ATTRIB);
-  tree attribs = DECL_ATTRIBUTES (fn);
 
   /* If we happen to be present as the first attrib, drop it.  */
   if (attribs && TREE_PURPOSE (attribs) == ident)
     attribs = TREE_CHAIN (attribs);
-  DECL_ATTRIBUTES (fn) = tree_cons (ident, dims, attribs);
+  return tree_cons (ident, dims, attribs);
+}
+
+/* Variant working on a function decl.  */
+
+void
+oacc_replace_fn_attrib (tree fn, tree dims)
+{
+  DECL_ATTRIBUTES (fn)
+    = oacc_replace_fn_attrib_attr (DECL_ATTRIBUTES (fn), dims);
 }
 
 /* Scan CLAUSES for launch dimensions and attach them to the oacc
diff --git a/gcc/omp-general.h b/gcc/omp-general.h
index 0cbbb31e73b..60faa5213a2 100644
--- a/gcc/omp-general.h
+++ b/gcc/omp-general.h
@@ -81,6 +81,7 @@ extern gimple *omp_build_barrier (tree lhs);
 extern poly_uint64 omp_max_vf (void);
 extern int omp_max_simt_vf (void);
 extern tree oacc_launch_pack (unsigned code, tree device, unsigned op);
+extern tree oacc_replace_fn_attrib_attr (tree attribs, tree dims);
 extern void oacc_replace_fn_attrib (tree fn, tree dims);
 extern void oacc_set_fn_attrib (tree fn, tree clauses, vec<tree> *args);
 extern tree oacc_build_routine_dims (tree clauses);
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 12901a9361a..dec48441f30 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,5 +1,8 @@
 2019-02-22  Thomas Schwinge  <thomas@codesourcery.com>
 
+	PR fortran/72741
+	* gfortran.dg/goacc/classify-routine.f95: Adjust.
+
 	* c-c++-common/goacc/routine-5.c: Revert earlier changes.
 	* g++.dg/goacc/template.C: Likewise.
 
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
index 5cf4c13acb8..e435f5d7eae 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
@@ -21,10 +21,10 @@ subroutine ROUTINE
 end subroutine ROUTINE
 
 ! Check the offloaded function's attributes.
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(omp declare target, oacc function \\(0 0, 1 0, 1 0\\)\\)\\)" 1 "ompexp" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(omp declare target, oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "ompexp" } }
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
 ! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
 ! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target, oacc function \\(0 0, 1 0, 1 0\\)\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target, oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccdevlow" } }
-- 
2.17.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 658 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [nvptx, committed] Force vl32 if calling vector-partitionable routines
  2019-01-07 19:11         ` [nvptx, committed] Force vl32 if calling vector-partitionable routines Tom de Vries
@ 2020-10-30 16:53           ` Thomas Schwinge
  0 siblings, 0 replies; 32+ messages in thread
From: Thomas Schwinge @ 2020-10-30 16:53 UTC (permalink / raw)
  To: Tom de Vries; +Cc: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 3784 bytes --]

Hi Tom!

On 2019-01-07T20:11:59+0100, Tom de Vries <tdevries@suse.de> wrote:
> [nvptx] Force vl32 if calling vector-partitionable routines
>
> With PTX_MAX_VECTOR_LENGTH set to larger than PTX_WARP_SIZE, routines can be
> called from offloading regions with vector-size set to larger than warp size.
> OTOH, vector-partitionable routines assume warp-sized vector length.
>
> Detect if we're calling a vector-partitionable routine from an offloading
> region, and if so, fall back to warp-sized vector length in that region.
>
> 2018-12-17  Tom de Vries  <tdevries@suse.de>
>
>       PR target/85486
>       * config/nvptx/nvptx.c (has_vector_partitionable_routine_calls_p): New
>       function.
>       (nvptx_goacc_validate_dims): Force vl32 if calling vector-partitionable
>       routines.

> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c

> +/* Return true if FNDECL contains calls to vector-partitionable routines.  */
> +
> +static bool
> +has_vector_partitionable_routine_calls_p (tree fndecl)
> +{
> +  if (!fndecl)
> +    return false;
> +
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (fndecl))
> +    for (gimple_stmt_iterator i = gsi_start_bb (bb); !gsi_end_p (i);
> +      gsi_next_nondebug (&i))
> +      {
> +     gimple *stmt = gsi_stmt (i);
> +     if (gimple_code (stmt) != GIMPLE_CALL)
> +       continue;

(This might use '!is_gimple_call (stmt)'.)

> +
> +     tree callee = gimple_call_fndecl (stmt);
> +     if (!callee)
> +       continue;

Would there be any other case where this '!callee' conditional doesn't
really mean 'gimple_call_internal_p (stmt)'?  I thought about suggesting
to use that instead, and then maybe 'gcc_assert (callee)' (... which
doesn't trigger for any current testcases), but reviewing 'GIMPLE_CALL',
I now see further 'is_gimple_call_addr' legitimate cases.  What do these
mean, here?

And, should we add a comment why 'continue' is fine then, instead of
fail-safe 'return true'?

Couldn't an 'internal_fn' potentially also make use of OpenACC
parallelism?

> +
> +     tree attrs  = oacc_get_fn_attrib (callee);
> +     if (attrs == NULL_TREE)
> +       return false;

That's not correct, as far as I can tell: if the current callee doesn't
have an 'oacc function' attribute, we *stop* here any further processing,
and 'return false' indicating that there are no "calls to
vector-partitionable routines".  See bug fix and adjusted test case in
attached patch "Force vl32 if calling vector-partitionable routines: fix
case where callee doesn't have 'oacc function' attribute [PR85486]".  OK
to push?

> +
> +     int partition_level = oacc_fn_attrib_level (attrs);
> +     bool seq_routine_p = partition_level == GOMP_DIM_MAX;
> +     if (!seq_routine_p)
> +       return true;
> +      }
> +
> +  return false;
> +}

> @@ -5611,6 +5646,16 @@ nvptx_goacc_validate_dims_1 (tree decl, int dims[], int fn_level)
>      old_dims[i] = dims[i];
>
>    const char *vector_reason = NULL;
> +  if (offload_region_p && has_vector_partitionable_routine_calls_p (decl))
> +    {
> +      if (dims[GOMP_DIM_VECTOR] > PTX_WARP_SIZE)
> +     {
> +       vector_reason = G_("using vector_length (%d) due to call to"
> +                          " vector-partitionable routine, ignoring %d");
> +       dims[GOMP_DIM_VECTOR] = PTX_WARP_SIZE;
> +     }
> +    }
> +
>    if (dims[GOMP_DIM_VECTOR] == 0)
>      {
>        vector_reason = G_("using vector_length (%d), ignoring runtime setting");


Grüße
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Alexander Walter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-Force-vl32-if-calling-vector-partitionable-routines-.patch --]
[-- Type: text/x-diff, Size: 2239 bytes --]

From 0399c9023b717ea686db912ca5c133a2d30752e4 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 28 Oct 2020 12:04:46 +0100
Subject: [PATCH] Force vl32 if calling vector-partitionable routines: fix case
 where callee doesn't have 'oacc function' attribute [PR85486]

	gcc/
	PR target/85486
	* config/nvptx/nvptx.c (has_vector_partitionable_routine_calls_p):
	Fix case where callee doesn't have 'oacc function' attribute.
	libgomp/
	PR target/85486
	* testsuite/libgomp.oacc-c-c++-common/pr85486.c: Extend.
---
 gcc/config/nvptx/nvptx.c                              |  3 ++-
 libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 17349475fff0..61a756fc6448 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -5674,7 +5674,8 @@ has_vector_partitionable_routine_calls_p (tree fndecl)
 
 	tree attrs  = oacc_get_fn_attrib (callee);
 	if (attrs == NULL_TREE)
-	  return false;
+	  /* Implicitly 'seq'.  */
+	  continue;
 
 	int partition_level = oacc_fn_attrib_level (attrs);
 	bool seq_routine_p = partition_level == GOMP_DIM_MAX;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
index 0d98b82f9932..38a61624d9f8 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
@@ -7,6 +7,7 @@
 /* Minimized from ref-1.C.  */
 
 #include <stdio.h>
+#include <stdlib.h>
 
 #pragma acc routine vector
 void __attribute__((noinline, noclone))
@@ -33,6 +34,15 @@ main (void)
 
 #pragma acc parallel copy (ary) VECTOR_LENGTH /* { dg-warning "using vector_length \\(32\\) due to call to vector-partitionable routine, ignoring 128" } */
   {
+    /* Call a routine that is not tagged OpenACC 'routine' (but is still
+       available by default; thus something from libc), thus is implicitly
+       'seq'.  */
+    {
+      void *null = NULL;
+      asm ("" : : "g" (&null) : "memory");  /* Optimization barrier.  */
+      free (null);
+    }
+
     Vector (&ary[0][0], m * n, (1 << 24) - (1 << 16));
   }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2020-10-30 16:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-23 11:24 [og7, nvptx, PR85486, committed] Force vl32 if calling vector-partitionable routines Tom de Vries
2018-09-18 20:22 ` [nvptx] vector length patch series Cesar Philippidis
2018-10-06  7:49   ` Tom de Vries
2018-10-29 20:09     ` Cesar Philippidis
2018-12-14 19:58       ` Tom de Vries
2018-12-17 21:29         ` [nvptx, committed] Rewrite nvptx_goacc_validate_dims to use predicate vars Tom de Vries
2018-12-17 21:46         ` [nvptx, committed] Unify C/Fortran routine handling in nvptx_goacc_validate_dims Tom de Vries
2019-02-22 11:23           ` Thomas Schwinge
2018-12-17 21:48         ` [nvptx] vector length patch series Tom de Vries
2018-12-17 21:51         ` [nvptx, committed] Add PTX_WARP_SIZE Tom de Vries
2018-12-17 21:52         ` [nvptx, committed] Move macro defs to top of nvptx.c Tom de Vries
2018-12-17 23:52         ` [nvptx] vector length patch series -- openacc parts Tom de Vries
2019-01-03  9:13           ` Tom de Vries
2018-12-19 10:28         ` [nvptx, committed] Use TARGET_SET_CURRENT_FUNCTION Tom de Vries
2018-12-19 10:31         ` [nvptx, committed] Only use one logical barrier resource Tom de Vries
2018-12-19 10:33         ` [nvptx, committed] Generalize bar.sync instruction Tom de Vries
2018-12-19 10:35         ` [nvptx, committed] Rename worker_bcast variables to oacc_bcast Tom de Vries
2018-12-19 10:37         ` [nvptx, committed] Make nvptx state propagation function names more generic Tom de Vries
2018-12-19 10:38         ` [nvptx, committed] Use MAX, MIN, ROUND_UP macros Tom de Vries
2018-12-19 17:19         ` [nvptx, committed] Add PTX_CTA_SIZE Tom de Vries
2018-12-22  2:19         ` [nvptx] vector length patch series Tom de Vries
2019-01-05 11:19           ` Tom de Vries
2019-01-03 16:08         ` Tom de Vries
2019-01-03 16:20         ` Tom de Vries
2019-01-03 16:29         ` Tom de Vries
2019-01-07  8:52           ` [nvptx, committed] Fix libgomp.oacc-c-c++-common/vector-length-128-3.c Tom de Vries
2019-01-07  8:59         ` [nvptx, committed] Add support for a per-worker broadcast buffer and barrier Tom de Vries
2019-01-07  9:01         ` [nvptx] Don't emit barriers for empty loops -- fix Tom de Vries
2019-01-07  9:03         ` [nvptx] Handle large vector reductions Tom de Vries
2019-01-07 19:11         ` [nvptx, committed] Force vl32 if calling vector-partitionable routines Tom de Vries
2020-10-30 16:53           ` Thomas Schwinge
2019-01-08 23:00         ` [nvptx] vector length patch series Tom de Vries

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).