* [PATCH 00/13] AMD GCN worker partitioning support
@ 2019-11-15 21:41 Julian Brown
  2019-11-15 21:41 ` [PATCH 02/13] Target-dependent gang-private variable decl rewriting Julian Brown
                   ` (12 more replies)
  0 siblings, 13 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:41 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch series brings together worker-partitioning support for AMD GCN
and various supporting patches needed to avoid testsuite regressions.

Some of these patches have been sent upstream previously. Most are present
on the openacc-gcc-9-branch, and have been tested with both AMD GCN and
NVIDIA GPUs.

The series has been tested as a whole with offloading to AMD GCN.

Further commentary is provided alongside individual patches.

OK for trunk?

Thanks,

Julian

Julian Brown (13):
  Add support for gang local storage allocation in shared memory
  Target-dependent gang-private variable decl rewriting
  Rewrite OpenACC private or reduction reference variables
  OpenACC middle-end worker-partitioning support
  AMD GCN adjustments for middle-end worker partitioning
  Fix up tests for oaccdevlow pass split
  Fix OpenACC "ephemeral" asynchronous host-to-device copies
  Fix host-to-device copies from rodata for AMD GCN
  AMD GCN libgomp plugin queue-full condition locking fix
  Race conditions in OpenACC async tests
  AMD GCN symbol output with null cfun
  Fix parallel-dims.f90 for AMD GCN
  Enable worker partitioning for AMD GCN

 gcc/Makefile.in                               |    1 +
 gcc/config/gcn/gcn-protos.h                   |    4 +-
 gcc/config/gcn/gcn-tree.c                     |   11 +-
 gcc/config/gcn/gcn.c                          |   25 +-
 gcc/config/gcn/gcn.opt                        |    2 +-
 gcc/config/nvptx/nvptx.c                      |  699 +-----
 gcc/doc/tm.texi                               |   23 +
 gcc/doc/tm.texi.in                            |    8 +
 gcc/expr.c                                    |   13 +-
 gcc/gimplify.c                                |  116 +
 gcc/internal-fn.c                             |    2 +
 gcc/internal-fn.h                             |    3 +-
 gcc/omp-builtins.def                          |    8 +
 gcc/omp-low.c                                 |  172 +-
 gcc/omp-offload.c                             |  322 ++-
 gcc/omp-offload.h                             |    1 +
 gcc/omp-sese.c                                | 2086 +++++++++++++++++
 gcc/omp-sese.h                                |   32 +
 gcc/passes.def                                |    2 +
 gcc/target.def                                |   30 +
 gcc/targhooks.h                               |    1 +
 .../goacc/classify-kernels-unparallelized.c   |    8 +-
 .../c-c++-common/goacc/classify-kernels.c     |    8 +-
 .../c-c++-common/goacc/classify-parallel.c    |    8 +-
 .../c-c++-common/goacc/classify-routine.c     |    8 +-
 .../gcc.dg/goacc/loop-processing-1.c          |    4 +-
 .../goacc/classify-kernels-unparallelized.f95 |    8 +-
 .../gfortran.dg/goacc/classify-kernels.f95    |    8 +-
 .../gfortran.dg/goacc/classify-parallel.f95   |    8 +-
 .../gfortran.dg/goacc/classify-routine.f95    |    8 +-
 gcc/tree-core.h                               |    4 +-
 gcc/tree-pass.h                               |    2 +
 gcc/tree.c                                    |   11 +-
 gcc/tree.h                                    |    2 +
 libgomp/libgomp-plugin.h                      |    3 +-
 libgomp/libgomp.h                             |    2 +-
 libgomp/oacc-host.c                           |    1 +
 libgomp/oacc-mem.c                            |    4 +-
 libgomp/plugin/plugin-gcn.c                   |   82 +-
 libgomp/plugin/plugin-nvptx.c                 |   13 +-
 libgomp/target.c                              |   92 +-
 .../libgomp.oacc-c++/privatized-ref-2.C       |   64 +
 .../libgomp.oacc-c++/privatized-ref-3.C       |   64 +
 .../gang-private-1.c                          |   38 +
 .../libgomp.oacc-c-c++-common/lib-94.c        |    4 +-
 .../libgomp.oacc-c-c++-common/loop-gwv-2.c    |   95 +
 .../gangprivate-attrib-1.f90                  |   25 +
 .../gangprivate-attrib-2.f90                  |   25 +
 .../libgomp.oacc-fortran/lib-16-2.f90         |    5 +
 .../libgomp.oacc-fortran/parallel-dims-aux.c  |    9 +-
 .../libgomp.oacc-fortran/privatized-ref-1.f95 |   71 +
 51 files changed, 3426 insertions(+), 819 deletions(-)
 create mode 100644 gcc/omp-sese.c
 create mode 100644 gcc/omp-sese.h
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95

-- 
2.23.0

* [PATCH 02/13] Target-dependent gang-private variable decl rewriting
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
@ 2019-11-15 21:41 ` Julian Brown
  2019-11-15 21:42 ` [PATCH 01/13] Add support for gang local storage allocation in shared memory Julian Brown
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:41 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch supports the case in which the adjust_private_decl hook
introduced by the previous patch changes the type of its decl argument,
e.g. when the offload target forces the variable into another address
space.  Any ADDR_EXPR nodes that take the address of such a decl are then
left with the wrong type: this patch implements a scheme to rewrite those
nodes at oaccdevlow time.  Arguments to sync builtins are handled
specially, since those often have variants that operate on the alternative
address space, so the address of the rewritten decl can be passed to them
directly.
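
As a source-level illustration (a minimal sketch, not part of this patch
and not one of the new tests), consider a gang-private variable whose
address reaches both an atomic operation, lowered to a sync builtin which
at least on AMD GCN can operate directly on the LDS address space, and an
ordinary pointer use, for which the address must be converted back to the
default address space:

  #include <assert.h>

  int
  main (void)
  {
    int ret = 0;

    #pragma acc parallel num_gangs(1) num_workers(8) copyout(ret)
    {
      /* Gang-private: a target hook may retype 'w', e.g. placing it in the
         LDS address space on AMD GCN.  */
      int w = 0;

      /* Ordinary address-taken use; the rewritten address gets converted
         back to the default address space (case 2 in the comment added by
         the patch below).  */
      int *p = &w;

      #pragma acc loop worker
      for (int i = 0; i < 8; i++)
        {
          /* The atomic update lowers to a sync builtin, which can be passed
             the adjusted address directly (case 1 below).  */
          #pragma acc atomic update
          w++;
        }

      ret = *p;
    }

    assert (ret == 8);
    return 0;
  }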

ChangeLog

	gcc/
	* omp-offload.c (struct addr_expr_rewrite_info): Add struct.
	(rewrite_addr_expr): New function.
	(is_sync_builtin_call): New function.
	(execute_oacc_device_lower): Fix up ADDR_EXPR nodes whose argument type
	has changed after calling the OpenACC adjust_private_decl hook.
---
 gcc/omp-offload.c | 130 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 129 insertions(+), 1 deletion(-)

diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index d8291125370..2e56a04a714 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1502,6 +1502,78 @@ default_goacc_reduction (gcall *call)
   gsi_replace_with_seq (&gsi, seq, true);
 }
 
+struct addr_expr_rewrite_info
+{
+  gimple *stmt;
+  hash_set<tree> *adjusted_vars;
+  bool avoid_pointer_conversion;
+  bool modified;
+};
+
+static tree
+rewrite_addr_expr (tree *tp, int *walk_subtrees, void *data)
+{
+  walk_stmt_info *wi = (walk_stmt_info *) data;
+  addr_expr_rewrite_info *info = (addr_expr_rewrite_info *) wi->info;
+
+  if (TREE_CODE (*tp) == ADDR_EXPR)
+    {
+      tree arg = TREE_OPERAND (*tp, 0);
+
+      if (info->adjusted_vars->contains (arg))
+	{
+	  if (info->avoid_pointer_conversion)
+	    {
+	      *tp = build_fold_addr_expr (arg);
+	      info->modified = true;
+	      *walk_subtrees = 0;
+	    }
+	  else
+	    {
+	      gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt);
+	      tree repl = build_fold_addr_expr (arg);
+	      gimple *stmt1
+		= gimple_build_assign (make_ssa_name (TREE_TYPE (repl)), repl);
+	      tree conv = convert_to_pointer (TREE_TYPE (*tp),
+					      gimple_assign_lhs (stmt1));
+	      gimple *stmt2
+		= gimple_build_assign (make_ssa_name (TREE_TYPE (*tp)), conv);
+	      gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
+	      gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
+	      *tp = gimple_assign_lhs (stmt2);
+	      info->modified = true;
+	      *walk_subtrees = 0;
+	    }
+	}
+    }
+
+  return NULL_TREE;
+}
+
+/* Return TRUE if CALL is a call to a builtin atomic/sync operation.  */
+
+static bool
+is_sync_builtin_call (gcall *call)
+{
+  tree callee = gimple_call_fndecl (call);
+
+  if (callee != NULL_TREE
+      && gimple_call_builtin_p (call, BUILT_IN_NORMAL))
+    switch (DECL_FUNCTION_CODE (callee))
+      {
+#undef DEF_SYNC_BUILTIN
+#define DEF_SYNC_BUILTIN(ENUM, NAME, TYPE, ATTRS) case ENUM:
+#include "sync-builtins.def"
+#undef DEF_SYNC_BUILTIN
+	return true;
+
+      default:
+	;
+      }
+
+  return false;
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
@@ -1611,6 +1683,8 @@ execute_oacc_device_lower ()
      dominance information to update SSA.  */
   calculate_dominance_info (CDI_DOMINATORS);
 
+  hash_set<tree> adjusted_vars;
+
   /* Now lower internal loop functions to target-specific code
      sequences.  */
   basic_block bb;
@@ -1714,7 +1788,12 @@ execute_oacc_device_lower ()
 			    fputc ('\n', dump_file);
 			  }
 			if (targetm.goacc.adjust_private_decl)
-			  targetm.goacc.adjust_private_decl (decl, level);
+			  {
+			    tree oldtype = TREE_TYPE (decl);
+			    targetm.goacc.adjust_private_decl (decl, level);
+			    if (TREE_TYPE (decl) != oldtype)
+			      adjusted_vars.add (decl);
+			  }
 		      }
 		    remove = true;
 		  }
@@ -1750,6 +1829,55 @@ execute_oacc_device_lower ()
 	  gsi_next (&gsi);
       }
 
+  /* Make adjustments to gang-private local variables if required by the
+     target, e.g. forcing them into a particular address space.  Afterwards,
+     ADDR_EXPR nodes which have adjusted variables as their argument need to
+     be modified in one of two ways:
+
+       1. They can be recreated, making a pointer to the variable in the new
+	  address space, or
+
+       2. The address of the variable in the new address space can be taken,
+	  converted to the default (original) address space, and the result of
+	  that conversion substituted in place of the original ADDR_EXPR node.
+
+     Which of these is done depends on the gimple statement being processed.
+     At present atomic operations and inline asms use (1), and everything else
+     uses (2).  At least on AMD GCN, there are atomic operations that work
+     directly in the LDS address space.  */
+
+  if (targetm.goacc.adjust_private_decl)
+    {
+      tree var;
+      unsigned i;
+
+      FOR_ALL_BB_FN (bb, cfun)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
+	     !gsi_end_p (gsi);
+	     gsi_next (&gsi))
+	  {
+	    gimple *stmt = gsi_stmt (gsi);
+	    walk_stmt_info wi;
+	    addr_expr_rewrite_info info;
+
+	    info.avoid_pointer_conversion
+	      = (is_gimple_call (stmt)
+		 && is_sync_builtin_call (as_a <gcall *> (stmt)))
+		|| gimple_code (stmt) == GIMPLE_ASM;
+	    info.stmt = stmt;
+	    info.modified = false;
+	    info.adjusted_vars = &adjusted_vars;
+
+	    memset (&wi, 0, sizeof (wi));
+	    wi.info = &info;
+
+	    walk_gimple_op (stmt, rewrite_addr_expr, &wi);
+
+	    if (info.modified)
+	      update_stmt (stmt);
+	  }
+    }
+
   free_oacc_loop (loops);
 
   return 0;
-- 
2.23.0

* [PATCH 01/13] Add support for gang local storage allocation in shared memory
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
  2019-11-15 21:41 ` [PATCH 02/13] Target-dependent gang-private variable decl rewriting Julian Brown
@ 2019-11-15 21:42 ` Julian Brown
  2019-12-05  2:40   ` Julian Brown
  2019-11-15 21:43 ` [PATCH 03/13] Rewrite OpenACC private or reduction reference variables Julian Brown
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch provides support for gang local storage allocation in shared
memory. It is mostly identical to the version posted previously, with one
cosmetic fix (removal of a duplicated condition):

https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00448.html

Tested alongside other patches in this series with offloading to AMD GCN.

OK?

Julian

ChangeLog

	gcc/
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this.  Add and use LEVEL parameter.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this. Add LEVEL parameter.
	* config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Delete.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL): Define using renamed
	gcn_goacc_adjust_private_decl.
	* config/nvptx/nvptx.c (tree-hash-traits.h, tree-pretty-print.h):
	Include.
	(gangprivate_shared_size): New global variable.
	(gangprivate_shared_align): Likewise.
	(gangprivate_shared_sym): Likewise.
	(gangprivate_shared_hmap): Likewise.
	(nvptx_option_override): Initialize gangprivate_shared_sym,
	gangprivate_shared_align.
	(nvptx_file_end): Output gangprivate_shared_sym.
	(nvptx_goacc_adjust_private_decl): New function.
	(nvptx_goacc_expand_accel_var): New function.
	(nvptx_set_current_function): New function.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL, TARGET_GOACC_EXPAND_ACCEL_VAR):
	Define hooks.
	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR,
	TARGET_GOACC_ADJUST_PRIVATE_DECL): Place new documentation hooks.
	* doc/tm.texi: Regenerate.
	* expr.c (expand_expr_real_1): Expand decls using the expand_accel_var
	OpenACC hook if defined.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE.
	* internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE.
	* omp-low.c (omp_context): Add oacc_addressable_var_decls field.
	(new_omp_context): Initialize oacc_addressable_var_decls in new
	omp_context.
	(delete_omp_context): Delete oacc_addressable_var_decls in old
	omp_context.
	(lower_oacc_reductions): Add PRIVATE_MARKER parameter.  Insert private
	marker before fork.
	(lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify private
	marker's gimple call arguments, and pass it to lower_oacc_reductions.
	(oacc_record_private_var_clauses, oacc_record_vars_in_bind,
	make_oacc_private_marker): New functions.
	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
	clauses.  Call oacc_record_vars_in_bind for OpenACC contexts.  Create
	private marker and pass to lower_oacc_head_tail.
	(lower_omp_target): Create private marker and pass to
	lower_oacc_reductions.
	(lower_omp_1): Call oacc_record_vars_in_bind for OpenACC bind contexts.
	* omp-offload.c (convert.h): Include.
	(oacc_loop_xform_head_tail): Treat private-variable markers like
	fork/join when transforming head/tail sequences.
	(execute_oacc_device_lower): Use IFN_UNIQUE_OACC_PRIVATE to determine
	partitioning level of private variables, and process any found via
	adjust_private_decl target hook.
	* target.def (expand_accel_var, adjust_private_decl): New target hooks.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
---
 gcc/config/gcn/gcn-protos.h                   |   2 +-
 gcc/config/gcn/gcn-tree.c                     |   5 +-
 gcc/config/gcn/gcn.c                          |   4 +-
 gcc/config/nvptx/nvptx.c                      |  77 +++++++++++
 gcc/doc/tm.texi                               |  13 ++
 gcc/doc/tm.texi.in                            |   4 +
 gcc/expr.c                                    |  13 +-
 gcc/internal-fn.c                             |   2 +
 gcc/internal-fn.h                             |   3 +-
 gcc/omp-low.c                                 | 125 +++++++++++++++++-
 gcc/omp-offload.c                             |  37 +++++-
 gcc/target.def                                |  17 +++
 .../gang-private-1.c                          |  38 ++++++
 .../libgomp.oacc-c-c++-common/loop-gwv-2.c    |  95 +++++++++++++
 .../gangprivate-attrib-1.f90                  |  25 ++++
 .../gangprivate-attrib-2.f90                  |  25 ++++
 16 files changed, 472 insertions(+), 13 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90

diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
index da7faf29c70..714d51189d9 100644
--- a/gcc/config/gcn/gcn-protos.h
+++ b/gcc/config/gcn/gcn-protos.h
@@ -39,7 +39,7 @@ extern rtx gcn_gen_undef (machine_mode);
 extern bool gcn_global_address_p (rtx);
 extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
 						 const char *name);
-extern void gcn_goacc_adjust_gangprivate_decl (tree var);
+extern void gcn_goacc_adjust_private_decl (tree var, int level);
 extern void gcn_goacc_reduction (gcall *call);
 extern bool gcn_hard_regno_rename_ok (unsigned int from_reg,
 				      unsigned int to_reg);
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
index c6b6302e9ed..aa56e236134 100644
--- a/gcc/config/gcn/gcn-tree.c
+++ b/gcc/config/gcn/gcn-tree.c
@@ -697,8 +697,11 @@ gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
 }
 
 void
-gcn_goacc_adjust_gangprivate_decl (tree var)
+gcn_goacc_adjust_private_decl (tree var, int level)
 {
+  if (level != GOMP_DIM_GANG)
+    return;
+
   tree type = TREE_TYPE (var);
   tree lds_type = build_qualified_type (type,
 		    TYPE_QUALS_NO_ADDR_SPACE (type)
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 1a69737f693..cf2f30413ae 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -6067,8 +6067,8 @@ print_operand (FILE *file, rtx x, int code)
 #undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
 #define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
   gcn_goacc_adjust_propagation_record
-#undef  TARGET_GOACC_ADJUST_GANGPRIVATE_DECL
-#define TARGET_GOACC_ADJUST_GANGPRIVATE_DECL gcn_goacc_adjust_gangprivate_decl
+#undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
+#define TARGET_GOACC_ADJUST_PRIVATE_DECL gcn_goacc_adjust_private_decl
 #undef  TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN gcn_fork_join
 #undef  TARGET_GOACC_REDUCTION
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0d6e8840852..9934a240209 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -74,6 +74,8 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
+#include "tree-pretty-print.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -166,6 +168,12 @@ static unsigned vector_red_align;
 static unsigned vector_red_partition;
 static GTY(()) rtx vector_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -247,6 +255,10 @@ nvptx_option_override (void)
   vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
   vector_red_partition = 0;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -5231,6 +5243,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, vector_red_sym,
 			 vector_red_align, vector_red_size);
 
+  if (gangprivate_shared_size)
+    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -6450,6 +6466,60 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+/* Implement TARGET_GOACC_ADJUST_PRIVATE_DECL.  Set "oacc gangprivate"
+   attribute for gang-private variable declarations.  */
+
+void
+nvptx_goacc_adjust_private_decl (tree decl, int level)
+{
+  if (level != GOMP_DIM_GANG)
+    return;
+
+  if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Setting 'oacc gangprivate' attribute for decl:");
+	  print_generic_decl (dump_file, decl, TDF_SLIM);
+	  fputc ('\n', dump_file);
+	}
+      tree id = get_identifier ("oacc gangprivate");
+      DECL_ATTRIBUTES (decl) = tree_cons (id, NULL, DECL_ATTRIBUTES (decl));
+    }
+}
+
+/* Implement TARGET_GOACC_EXPAND_ACCEL_VAR.  Place "oacc gangprivate"
+   variables in shared memory.  */
+
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (VAR_P (var)
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size
+	    = (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
 static GTY(()) tree nvptx_previous_fndecl;
 
 static void
@@ -6458,6 +6528,7 @@ nvptx_set_current_function (tree fndecl)
   if (!fndecl || fndecl == nvptx_previous_fndecl)
     return;
 
+  gangprivate_shared_hmap.empty ();
   nvptx_previous_fndecl = fndecl;
   vector_red_partition = 0;
   oacc_bcast_partition = 0;
@@ -6602,6 +6673,12 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_ADJUST_PRIVATE_DECL
+#define TARGET_GOACC_ADJUST_PRIVATE_DECL nvptx_goacc_adjust_private_decl
+
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 11c236e1c65..f5b7995705a 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6155,6 +6155,19 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelerator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_GOACC_ADJUST_PRIVATE_DECL (tree @var{var}, @var{int})
+Tweak variable declaration for a private variable at the specified
+parallelism level.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b8c41b5a7aa..d5ed6906e5d 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4213,6 +4213,10 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
+@hook TARGET_GOACC_ADJUST_PRIVATE_DECL
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index 0fd5890f8b4..f2629bfb4fb 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -10044,8 +10044,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may be intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 6a878bde24d..5106c34eca8 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2618,6 +2618,8 @@ expand_UNIQUE (internal_fn, gcall *stmt)
       else
 	gcc_unreachable ();
       break;
+    case IFN_UNIQUE_OACC_PRIVATE:
+      break;
     }
 
   if (pattern)
diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
index 389241a8a06..59844d3eb08 100644
--- a/gcc/internal-fn.h
+++ b/gcc/internal-fn.h
@@ -36,7 +36,8 @@ along with GCC; see the file COPYING3.  If not see
 #define IFN_UNIQUE_CODES				  \
   DEF(UNSPEC),	\
     DEF(OACC_FORK), DEF(OACC_JOIN),		\
-    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK)
+    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK),	\
+    DEF(OACC_PRIVATE)
 
 enum ifn_unique_kind {
 #define DEF(X) IFN_UNIQUE_##X
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 781e7cbf27a..6499cd64770 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -163,6 +163,9 @@ struct omp_context
 
   /* True if there is bind clause on the construct (i.e. a loop construct).  */
   bool loop_p;
+
+  /* Addressable variable decls in this context.  */
+  vec<tree> *oacc_addressable_var_decls;
 };
 
 static splay_tree all_contexts;
@@ -943,6 +946,8 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
 
+  ctx->oacc_addressable_var_decls = new vec<tree> ();
+
   return ctx;
 }
 
@@ -1024,6 +1029,7 @@ delete_omp_context (splay_tree_value value)
     }
 
   delete ctx->lastprivate_conditional_map;
+  delete ctx->oacc_addressable_var_decls;
 
   XDELETE (ctx);
 }
@@ -6667,8 +6673,9 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *body_p,
 
 static void
 lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
-		       gcall *fork, gcall *join, gimple_seq *fork_seq,
-		       gimple_seq *join_seq, omp_context *ctx)
+		       gcall *fork, gcall *private_marker, gcall *join,
+		       gimple_seq *fork_seq, gimple_seq *join_seq,
+		       omp_context *ctx)
 {
   gimple_seq before_fork = NULL;
   gimple_seq after_fork = NULL;
@@ -6866,6 +6873,8 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 
   /* Now stitch things together.  */
   gimple_seq_add_seq (fork_seq, before_fork);
+  if (private_marker)
+    gimple_seq_add_stmt (fork_seq, private_marker);
   if (fork)
     gimple_seq_add_stmt (fork_seq, fork);
   gimple_seq_add_seq (fork_seq, after_fork);
@@ -7581,7 +7590,7 @@ lower_oacc_loop_marker (location_t loc, tree ddvar, bool head,
    HEAD and TAIL.  */
 
 static void
-lower_oacc_head_tail (location_t loc, tree clauses,
+lower_oacc_head_tail (location_t loc, tree clauses, gcall *private_marker,
 		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
 {
   bool inner = false;
@@ -7589,6 +7598,14 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
 
   unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
+
+  if (private_marker)
+    {
+      gimple_set_location (private_marker, loc);
+      gimple_call_set_lhs (private_marker, ddvar);
+      gimple_call_set_arg (private_marker, 1, ddvar);
+    }
+
   tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
@@ -7619,7 +7636,8 @@ lower_oacc_head_tail (location_t loc, tree clauses,
 			      &join_seq);
 
       lower_oacc_reductions (loc, clauses, place, inner,
-			     fork, join, &fork_seq, &join_seq,  ctx);
+			     fork, (count == 1) ? private_marker : NULL,
+			     join, &fork_seq, &join_seq,  ctx);
 
       /* Append this level to head. */
       gimple_seq_add_seq (head, fork_seq);
@@ -9584,6 +9602,32 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	tree decl = OMP_CLAUSE_DECL (c);
+	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
+	  ctx->oacc_addressable_var_decls->safe_push (decl);
+      }
+}
+
+/* Record addressable vars declared in BINDVARS in CTX.  This information is
+   used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    if (VAR_P (v) && TREE_ADDRESSABLE (v))
+      ctx->oacc_addressable_var_decls->safe_push (v);
+}
+
 /* Callback for walk_gimple_seq.  Find #pragma omp scan statement.  */
 
 static tree
@@ -10414,6 +10458,57 @@ lower_omp_for_scan (gimple_seq *body_p, gimple_seq *dlist, gomp_for *stmt,
   *dlist = new_dlist;
 }
 
+/* Build an internal UNIQUE function with type IFN_UNIQUE_OACC_PRIVATE listing
+   the addresses of variables that should be made private at the surrounding
+   parallelism level.  Such functions appear in the gimple code stream in two
+   forms, e.g. for a partitioned loop:
+
+      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6, 1, 68);
+      .data_dep.6 = .UNIQUE (OACC_PRIVATE, .data_dep.6, -1, &w);
+      .data_dep.6 = .UNIQUE (OACC_FORK, .data_dep.6, -1);
+      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6);
+
+   or alternatively, OACC_PRIVATE can appear at the top level of a parallel,
+   not as part of a HEAD_MARK sequence:
+
+      .UNIQUE (OACC_PRIVATE, 0, 0, &w);
+
+   For such stand-alone appearances, the 3rd argument is always 0, denoting
+   gang partitioning.  */
+
+static gcall *
+make_oacc_private_marker (omp_context *ctx)
+{
+  int i;
+  tree decl;
+
+  if (ctx->oacc_addressable_var_decls->length () == 0)
+    return NULL;
+
+  auto_vec<tree, 5> args;
+
+  args.quick_push (build_int_cst (integer_type_node, IFN_UNIQUE_OACC_PRIVATE));
+  args.quick_push (integer_zero_node);
+  args.quick_push (integer_minus_one_node);
+
+  FOR_EACH_VEC_ELT (*ctx->oacc_addressable_var_decls, i, decl)
+    {
+      for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer)
+	{
+	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
+	  if (inner_decl)
+	    {
+	      decl = inner_decl;
+	      break;
+	    }
+	}
+      tree addr = build_fold_addr_expr (decl);
+      args.safe_push (addr);
+    }
+
+  return gimple_build_call_internal_vec (IFN_UNIQUE, args);
+}
+
 /* Lower code for an OMP loop directive.  */
 
 static void
@@ -10430,6 +10525,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -10448,6 +10545,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       gbind *inner_bind
 	= as_a <gbind *> (gimple_seq_first_stmt (omp_for_body));
       tree vars = gimple_bind_vars (inner_bind);
+      if (is_gimple_omp_oacc (ctx->stmt))
+	oacc_record_vars_in_bind (ctx, vars);
       gimple_bind_append_vars (new_stmt, vars);
       /* bind_vars/BLOCK_VARS are being moved to new_stmt/block, don't
 	 keep them on the inner_bind and it's block.  */
@@ -10547,6 +10646,11 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   lower_omp (gimple_omp_body_ptr (stmt), ctx);
 
+  gcall *private_marker = NULL;
+  if (is_gimple_omp_oacc (ctx->stmt)
+      && !gimple_seq_empty_p (omp_for_body))
+    private_marker = make_oacc_private_marker (ctx);
+
   /* Lower the header expressions.  At this point, we can assume that
      the header is of the form:
 
@@ -10583,7 +10687,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   if (is_gimple_omp_oacc (ctx->stmt)
       && !ctx_in_oacc_kernels_region (ctx))
     lower_oacc_head_tail (gimple_location (stmt),
-			  gimple_omp_for_clauses (stmt),
+			  gimple_omp_for_clauses (stmt), private_marker,
 			  &oacc_head, &oacc_tail, ctx);
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
@@ -12525,8 +12629,14 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	     them as a dummy GANG loop.  */
 	  tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG);
 
+	  gcall *private_marker = make_oacc_private_marker (ctx);
+
+	  if (private_marker)
+	    gimple_call_set_arg (private_marker, 2, level);
+
 	  lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level,
-				 false, NULL, NULL, &fork_seq, &join_seq, ctx);
+				 false, NULL, private_marker, NULL, &fork_seq,
+				 &join_seq, ctx);
 	}
 
       gimple_seq_add_seq (&new_body, fork_seq);
@@ -12782,6 +12892,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      if (ctx && is_gimple_omp_oacc (ctx->stmt))
+	oacc_record_vars_in_bind (ctx,
+				  gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 32eacf7863e..d8291125370 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "attribs.h"
 #include "cfgloop.h"
+#include "convert.h"
 
 /* Describe the OpenACC looping structure of a function.  The entire
    function is held in a 'NULL' loop.  */
@@ -1082,7 +1083,9 @@ oacc_loop_xform_head_tail (gcall *from, int level)
 	    = ((enum ifn_unique_kind)
 	       TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
 
-	  if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN)
+	  if (k == IFN_UNIQUE_OACC_FORK
+	      || k == IFN_UNIQUE_OACC_JOIN
+	      || k == IFN_UNIQUE_OACC_PRIVATE)
 	    *gimple_call_arg_ptr (stmt, 2) = replacement;
 	  else if (k == kind && stmt != from)
 	    break;
@@ -1684,6 +1687,38 @@ execute_oacc_device_lower ()
 		case IFN_UNIQUE_OACC_TAIL_MARK:
 		  remove = true;
 		  break;
+
+		case IFN_UNIQUE_OACC_PRIVATE:
+		  {
+		    HOST_WIDE_INT level
+		      = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+		    if (level == -1)
+		      break;
+		    for (unsigned i = 3;
+			 i < gimple_call_num_args (call);
+			 i++)
+		      {
+			tree arg = gimple_call_arg (call, i);
+			gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
+			tree decl = TREE_OPERAND (arg, 0);
+			if (dump_file && (dump_flags & TDF_DETAILS))
+			  {
+			    static char const *const axes[] =
+			      /* Must be kept in sync with GOMP_DIM
+				 enumeration.  */
+			      { "gang", "worker", "vector" };
+			    fprintf (dump_file, "Decl UID %u has %s "
+				     "partitioning:", DECL_UID (decl),
+				     axes[level]);
+			    print_generic_decl (dump_file, decl, TDF_SLIM);
+			    fputc ('\n', dump_file);
+			  }
+			if (targetm.goacc.adjust_private_decl)
+			  targetm.goacc.adjust_private_decl (decl, level);
+		      }
+		    remove = true;
+		  }
+		  break;
 		}
 	      break;
 	    }
diff --git a/gcc/target.def b/gcc/target.def
index 8e83c2c7a71..2cc5d5c46b3 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1734,6 +1734,23 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelerator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
+DEFHOOK
+(adjust_private_decl,
+"Tweak variable declaration for a private variable at the specified\n\
+parallelism level.",
+void, (tree var, int),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 00000000000..28222c25da3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+	#pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+	#pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 00000000000..a4f81a39e24
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
new file mode 100644
index 00000000000..b9293e7d2a4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
@@ -0,0 +1,25 @@
+! Test for "oacc gangprivate" attribute on gang-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
+! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has gang partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang private(w)
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
new file mode 100644
index 00000000000..90e06be24ff
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
@@ -0,0 +1,25 @@
+! Test for worker-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
+! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang worker private(w)
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
-- 
2.23.0

* [PATCH 03/13] Rewrite OpenACC private or reduction reference variables
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
  2019-11-15 21:41 ` [PATCH 02/13] Target-dependent gang-private variable decl rewriting Julian Brown
  2019-11-15 21:42 ` [PATCH 01/13] Add support for gang local storage allocation in shared memory Julian Brown
@ 2019-11-15 21:43 ` Julian Brown
  2019-11-15 21:43 ` [PATCH 04/13] OpenACC middle-end worker-partitioning support Julian Brown
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

Reference-type private variables, and reference-type variables used as
reduction targets, do not work well with the scheme used to implement worker
partitioning on AMD GCN. This patch (originally by Cesar Philippidis, somewhat
modified) replaces such variables with new non-reference temporaries within
partitioned offload regions.
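
As a minimal sketch (a trimmed variant of the new privatized-ref-3.C test
below, not an additional test), the affected code looks like this; within
the partitioned loop the compiler stands in a non-reference temporary of
the referenced type for 'tmpref':

  #include <stdlib.h>

  int
  main (void)
  {
    int res[256];

    #pragma acc parallel copyout(res) num_gangs(1) num_workers(32)
    {
      int tmpvar;
      int &tmpref = tmpvar;   /* Reference-typed variable.  */

      #pragma acc loop worker private(tmpref)
      for (int j = 0; j < 256; j++)
        {
          /* Each worker operates on its own non-reference copy standing in
             for 'tmpref'.  */
          tmpref = j * 3;
          res[j] = tmpref;
        }
    }

    for (int i = 0; i < 256; i++)
      if (res[i] != i * 3)
        abort ();

    return 0;
  }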

Tested alongside other patches in this series.

OK?

Thanks,

Julian

ChangeLog

	gcc/
	* gimplify.c (privatize_reduction): New struct.
	(localize_reductions_r, localize_reductions): New functions.
	(gimplify_omp_for): Call localize_reductions.
	(gimplify_omp_workshare): Likewise.
	* omp-low.c (lower_oacc_reductions): Handle localized reductions.
	Create fewer temp vars.
	* tree-core.h (omp_clause_code): Add OMP_CLAUSE_REDUCTION_PRIVATE_DECL
	documentation.
	* tree.c (omp_clause_num_ops): Bump number of ops for
	OMP_CLAUSE_REDUCTION to 6.
	(walk_tree_1): Adjust accordingly.
	* tree.h (OMP_CLAUSE_REDUCTION_PRIVATE_DECL): Add macro.

	libgomp/
	* testsuite/libgomp.oacc-fortran/privatized-ref-1.f95: New test.
	* testsuite/libgomp.oacc-c++/privatized-ref-2.C: New test.
	* testsuite/libgomp.oacc-c++/privatized-ref-3.C: New test.
---
 gcc/gimplify.c                                | 116 ++++++++++++++++++
 gcc/omp-low.c                                 |  47 +++----
 gcc/tree-core.h                               |   4 +-
 gcc/tree.c                                    |  11 +-
 gcc/tree.h                                    |   2 +
 .../libgomp.oacc-c++/privatized-ref-2.C       |  64 ++++++++++
 .../libgomp.oacc-c++/privatized-ref-3.C       |  64 ++++++++++
 .../libgomp.oacc-fortran/privatized-ref-1.f95 |  71 +++++++++++
 8 files changed, 342 insertions(+), 37 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index 87a64054514..191ade6be3e 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -229,6 +229,11 @@ struct gimplify_omp_ctx
   int defaultmap[4];
 };
 
+struct privatize_reduction
+{
+  tree ref_var, local_var;
+};
+
 static struct gimplify_ctx *gimplify_ctxp;
 static struct gimplify_omp_ctx *gimplify_omp_ctxp;
 static bool in_omp_construct;
@@ -10811,6 +10816,95 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *data)
   return NULL_TREE;
 }
 
+/* Helper function for localize_reductions.  Replace all uses of REF_VAR with
+   LOCAL_VAR.  */
+
+static tree
+localize_reductions_r (tree *tp, int *walk_subtrees, void *data)
+{
+  enum tree_code tc = TREE_CODE (*tp);
+  struct privatize_reduction *pr = (struct privatize_reduction *) data;
+
+  if (TYPE_P (*tp))
+    *walk_subtrees = 0;
+
+  switch (tc)
+    {
+    case INDIRECT_REF:
+    case MEM_REF:
+      if (TREE_OPERAND (*tp, 0) == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    case VAR_DECL:
+    case PARM_DECL:
+    case RESULT_DECL:
+      if (*tp == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    default:
+      break;
+    }
+
+  return NULL_TREE;
+}
+
+/* OpenACC worker and vector loop state propagation requires reductions
+   to be inside local variables.  This function replaces all reference-type
+   reduction variables associated with the loop with a local copy.  It is
+   also used to create private copies of reduction variables for those
+   which are not associated with acc loops.  */
+
+static void
+localize_reductions (tree clauses, tree body)
+{
+  tree c, var, type, new_var;
+  struct privatize_reduction pr;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  {
+	    OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = NULL;
+	    continue;
+	  }
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type, IDENTIFIER_POINTER (DECL_NAME (var)));
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (&body, localize_reductions_r, &pr, NULL);
+
+	OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = new_var;
+      }
+    else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  continue;
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type, IDENTIFIER_POINTER (DECL_NAME (var)));
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (&body, localize_reductions_r, &pr, NULL);
+      }
+}
+
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -11017,6 +11111,23 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  if (ort == ORT_ACC)
+    {
+      gimplify_omp_ctx *outer = gimplify_omp_ctxp;
+
+      while (outer
+	     && outer->region_type != ORT_ACC_PARALLEL
+	     && outer->region_type != ORT_ACC_KERNELS)
+	outer = outer->outer_context;
+
+      /* FIXME: Reductions only work in parallel regions at present.  We avoid
+	 doing the reduction localization transformation in kernels regions
+	 here, because the code to remove reductions in kernels regions cannot
+	 handle that.  */
+      if (outer && outer->region_type == ORT_ACC_PARALLEL)
+	localize_reductions (OMP_FOR_CLAUSES (*expr_p), OMP_FOR_BODY (*expr_p));
+    }
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
@@ -12567,6 +12678,11 @@ gimplify_omp_workshare (tree *expr_p, gimple_seq *pre_p)
       || (ort & ORT_HOST_TEAMS) == ORT_HOST_TEAMS)
     {
       push_gimplify_context ();
+
+      /* FIXME: Reductions are not supported in kernels regions yet.  */
+      if (/*ort == ORT_ACC_KERNELS ||*/ ort == ORT_ACC_PARALLEL)
+	localize_reductions (OMP_CLAUSES (expr), OMP_BODY (expr));
+
       gimple *g = gimplify_and_return_first (OMP_BODY (expr), &body);
       if (gimple_code (g) == GIMPLE_BIND)
 	pop_gimplify_context (g);
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 6499cd64770..61391dfe031 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -6689,9 +6689,9 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
     if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
       {
 	tree orig = OMP_CLAUSE_DECL (c);
-	tree var = maybe_lookup_decl (orig, ctx);
+	tree var;
 	tree ref_to_res = NULL_TREE;
-	tree incoming, outgoing, v1, v2, v3;
+	tree incoming, outgoing;
 	bool is_private = false;
 
 	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
@@ -6703,6 +6703,9 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  rcode = BIT_IOR_EXPR;
 	tree op = build_int_cst (unsigned_type_node, rcode);
 
+	var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+	if (!var)
+	  var = maybe_lookup_decl (orig, ctx);
 	if (!var)
 	  var = orig;
 
@@ -6792,36 +6795,13 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	if (!ref_to_res)
 	  ref_to_res = integer_zero_node;
 
-	if (omp_is_reference (orig))
+	if (omp_is_reference (outgoing))
 	  {
-	    tree type = TREE_TYPE (var);
-	    const char *id = IDENTIFIER_POINTER (DECL_NAME (var));
-
-	    if (!inner)
-	      {
-		tree x = create_tmp_var (TREE_TYPE (type), id);
-		gimplify_assign (var, build_fold_addr_expr (x), fork_seq);
-	      }
-
-	    v1 = create_tmp_var (type, id);
-	    v2 = create_tmp_var (type, id);
-	    v3 = create_tmp_var (type, id);
-
-	    gimplify_assign (v1, var, fork_seq);
-	    gimplify_assign (v2, var, fork_seq);
-	    gimplify_assign (v3, var, fork_seq);
-
-	    var = build_simple_mem_ref (var);
-	    v1 = build_simple_mem_ref (v1);
-	    v2 = build_simple_mem_ref (v2);
-	    v3 = build_simple_mem_ref (v3);
 	    outgoing = build_simple_mem_ref (outgoing);
 
 	    if (!TREE_CONSTANT (incoming))
 	      incoming = build_simple_mem_ref (incoming);
 	  }
-	else
-	  v1 = v2 = v3 = var;
 
 	/* Determine position in reduction buffer, which may be used
 	   by target.  The parser has ensured that this is not a
@@ -6854,20 +6834,21 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, init_code,
 					  unshare_expr (ref_to_res),
-					  v1, level, op, off);
+					  var, level, op, off);
 	tree fini_call
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, fini_code,
 					  unshare_expr (ref_to_res),
-					  v2, level, op, off);
+					  var, level, op, off);
 	tree teardown_call
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
-					  TREE_TYPE (var), 6, teardown_code,
-					  ref_to_res, v3, level, op, off);
+					  TREE_TYPE (var), 6,
+					  teardown_code, ref_to_res, var,
+					  level, op, off);
 
-	gimplify_assign (v1, setup_call, &before_fork);
-	gimplify_assign (v2, init_call, &after_fork);
-	gimplify_assign (v3, fini_call, &before_join);
+	gimplify_assign (var, setup_call, &before_fork);
+	gimplify_assign (var, init_call, &after_fork);
+	gimplify_assign (var, fini_call, &before_join);
 	gimplify_assign (outgoing, teardown_call, &after_join);
       }
 
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index 12e078882da..85fda6cf3f6 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -255,7 +255,9 @@ enum omp_clause_code {
                 placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.
      Operand 4: OMP_CLAUSE_REDUCTION_DECL_PLACEHOLDER: Another dummy
 		VAR_DECL placeholder, used like the above for C/C++ array
-		reductions.  */
+		reductions.
+     Operand 5: OMP_CLAUSE_REDUCTION_PRIVATE_DECL: A private VAR_DECL of
+		the original DECL associated with the reduction clause.  */
   OMP_CLAUSE_REDUCTION,
 
   /* OpenMP clause: task_reduction (operator:variable_list).  */
diff --git a/gcc/tree.c b/gcc/tree.c
index 78c2815028b..1cc887071df 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -284,7 +284,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_SHARED  */
   1, /* OMP_CLAUSE_FIRSTPRIVATE  */
   2, /* OMP_CLAUSE_LASTPRIVATE  */
-  5, /* OMP_CLAUSE_REDUCTION  */
+  6, /* OMP_CLAUSE_REDUCTION  */
   5, /* OMP_CLAUSE_TASK_REDUCTION  */
   5, /* OMP_CLAUSE_IN_REDUCTION  */
   1, /* OMP_CLAUSE_COPYIN  */
@@ -12170,11 +12170,16 @@ walk_tree_1 (tree *tp, walk_tree_fn func, void *data,
 	  WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
 
 	case OMP_CLAUSE_REDUCTION:
+	  {
+	    for (int i = 0; i < 6; i++)
+	      WALK_SUBTREE (OMP_CLAUSE_OPERAND (*tp, i));
+	    WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
+	  }
+
 	case OMP_CLAUSE_TASK_REDUCTION:
 	case OMP_CLAUSE_IN_REDUCTION:
 	  {
-	    int i;
-	    for (i = 0; i < 5; i++)
+	    for (int i = 0; i < 5; i++)
 	      WALK_SUBTREE (OMP_CLAUSE_OPERAND (*tp, i));
 	    WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
 	  }
diff --git a/gcc/tree.h b/gcc/tree.h
index 4bec90d9a72..336b75d8698 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1662,6 +1662,8 @@ class auto_suppress_location_wrappers
 #define OMP_CLAUSE_REDUCTION_DECL_PLACEHOLDER(NODE) \
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (NODE, OMP_CLAUSE_REDUCTION, \
 					      OMP_CLAUSE_IN_REDUCTION), 4)
+#define OMP_CLAUSE_REDUCTION_PRIVATE_DECL(NODE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 5)
 
 /* True if a REDUCTION clause may reference the original list item (omp_orig)
    in its OMP_CLAUSE_REDUCTION_{,GIMPLE_}INIT.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
new file mode 100644
index 00000000000..3884f163132
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
@@ -0,0 +1,64 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+void workers (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(64)
+  {
+    int i, j;
+#pragma acc loop gang
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop worker
+	for (j = 0; j < 256; j++)
+	  {
+	    int tmpvar;
+	    int &tmpref = tmpvar;
+	    tmpref = (i * 256 + j) * 99;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 99)
+      abort ();
+}
+
+void vectors (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(64)
+  {
+    int i, j;
+#pragma acc loop gang worker
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop vector
+	for (j = 0; j < 256; j++)
+	  {
+	    int tmpvar;
+	    int &tmpref = tmpvar;
+	    tmpref = (i * 256 + j) * 101;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 101)
+      abort ();
+}
+
+int main (int argc, char *argv[])
+{
+  workers ();
+  vectors ();
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
new file mode 100644
index 00000000000..c1a10cba31b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
@@ -0,0 +1,64 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+void workers (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(64)
+  {
+    int i, j;
+    int tmpvar;
+    int &tmpref = tmpvar;
+#pragma acc loop gang
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop worker private(tmpref)
+	for (j = 0; j < 256; j++)
+	  {
+	    tmpref = (i * 256 + j) * 99;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 99)
+      abort ();
+}
+
+void vectors (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(64)
+  {
+    int i, j;
+    int tmpvar;
+    int &tmpref = tmpvar;
+#pragma acc loop gang worker
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop vector private(tmpref)
+	for (j = 0; j < 256; j++)
+	  {
+	    tmpref = (i * 256 + j) * 101;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 101)
+      abort ();
+}
+
+int main (int argc, char *argv[])
+{
+  workers ();
+  vectors ();
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95 b/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95
new file mode 100644
index 00000000000..f16f69c1d1b
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95
@@ -0,0 +1,71 @@
+! { dg-do run }
+
+program main
+  implicit none
+  integer :: myint
+  integer :: i
+  real :: res(65536), tmp
+
+  res(:) = 0.0
+
+  myint = 5
+  call workers(myint, res)
+
+  do i=1,65536
+    tmp = i * 99
+    if (res(i) .ne. tmp) stop 1
+  end do
+
+  res(:) = 0.0
+
+  myint = 7
+  call vectors(myint, res)
+
+  do i=1,65536
+    tmp = i * 101
+    if (res(i) .ne. tmp) stop 2
+  end do
+
+contains
+
+  subroutine workers(t1, res)
+    implicit none
+    integer :: t1
+    integer :: i, j
+    real, intent(out) :: res(:)
+
+    !$acc parallel copyout(res) num_gangs(64) num_workers(64)
+
+    !$acc loop gang
+    do i=0,255
+      !$acc loop worker private(t1)
+      do j=1,256
+        t1 = (i * 256 + j) * 99
+        res(i * 256 + j) = t1
+      end do
+    end do
+
+    !$acc end parallel
+  end subroutine workers
+
+  subroutine vectors(t1, res)
+    implicit none
+    integer :: t1
+    integer :: i, j
+    real, intent(out) :: res(:)
+
+    !$acc parallel copyout(res) num_gangs(64) num_workers(64)
+
+    !$acc loop gang worker
+    do i=0,255
+      !$acc loop vector private(t1)
+      do j=1,256
+        t1 = (i * 256 + j) * 101
+        res(i * 256 + j) = t1
+      end do
+    end do
+
+    !$acc end parallel
+  end subroutine vectors
+
+end program main
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 04/13] OpenACC middle-end worker-partitioning support
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (2 preceding siblings ...)
  2019-11-15 21:43 ` [PATCH 03/13] Rewrite OpenACC private or reduction reference variables Julian Brown
@ 2019-11-15 21:43 ` Julian Brown
  2019-11-15 21:43 ` [PATCH 05/13] AMD GCN adjustments for middle-end worker partitioning Julian Brown
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This is the main patch implementing worker-partitioning support on AMD
GCN. The following description is taken from the version of the patch
submitted on the openacc-gcc-9-branch:

This patch implements worker-partitioning support in the middle end by
rewriting gimple. The OpenACC execution model requires that code can run
in either "worker single" mode, where only a single worker per gang is
active, or "worker partitioned" mode, where multiple workers per gang are
active. This means we need to do something equivalent to spawning
additional workers when transitioning from worker-single to
worker-partitioned mode. However, GPUs typically fix the number of threads
of an invoked kernel at launch time, so we need to do something with the
"extra" threads when they are not wanted.

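As a concrete, purely illustrative example (not taken from this patch or
its testsuite): in the fragment below, the computation of "base" in the
gang loop body executes in worker-single mode, whereas the body of the
inner worker loop executes in worker-partitioned mode, so the value of
"base" computed by one worker must somehow reach all workers of the gang.

  #define N 256

  void
  scale_rows (float *restrict out, const float *restrict in)
  {
  #pragma acc parallel loop gang copyin(in[0:N*N]) copyout(out[0:N*N])
    for (int i = 0; i < N; i++)
      {
        float base = in[i * N];     /* Runs in worker-single mode.  */
  #pragma acc loop worker
        for (int j = 0; j < N; j++) /* Runs worker-partitioned.  */
          out[i * N + j] = base + in[i * N + j];
      }
  }
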
The scheme used is -- very briefly! -- to conditionalise each basic block
that executes in "worker single" mode so that it runs on worker 0 only.
Conditional branches are handled specially so that "idle" (non-0) workers
follow along with worker 0. On transitioning to "worker partitioned" mode,
any variables modified by worker 0 are propagated to the other workers via
GPU shared memory. Special care is taken for routine calls, writes through
pointers, and so forth.
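
To give a rough idea of the shape of the rewritten code, here is a
hand-written C sketch (the pass itself emits gimple, and the propagation
record below is only a stand-in for whatever the target's
create_propagation_record hook provides; the builtin declarations are
hand-written for illustration only):

  /* Sketch only.  GOACC_single_start and GOACC_barrier are the new
     builtins added by this patch.  */
  extern _Bool GOACC_single_start (void); /* True for the chosen worker.  */
  extern void GOACC_barrier (void);

  struct prop_rec { int x; };             /* Stand-in propagation record.  */

  int
  worker_single_sketch (struct prop_rec *shared_rec)
  {
    int x;
    if (GOACC_single_start ())
      {
        /* The original worker-single block, now run by one worker only.  */
        x = 42;
        shared_rec->x = x;                /* Broadcast via GPU shared memory...  */
      }
    GOACC_barrier ();
    x = shared_rec->x;                    /* ...picked up by the other workers.  */
    return x;
  }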

Much of omp-sese.c originates from code written for NVPTX by Nathan
Sidwell (adapted to work on gimple instead of RTL) -- though at present,
only the per-basic-block scheme is implemented, and the SESE-finding
algorithm isn't yet used.

OK?

Julian

ChangeLog

	gcc/
	* Makefile.in (OBJS): Add omp-sese.o.
	* config/nvptx/nvptx.c (omp-sese.h): Include.
	(bb_pair_t, bb_pair_vec_t, pseudo_node_t, bracket, bracket_vec_t,
	bb_sese, bb_sese::~bb_sese, bb_sese::append, bb_sese::remove,
	BB_SET_SESE, BB_GET_SESE, nvptx_sese_number, nvptx_sese_pseudo,
	nvptx_sese_color, nvptx_find_sese): Remove.
	(nvptx_neuter_pars): Call omp_find_sese instead of nvptx_find_sese.
	* omp-builtins.def (BUILT_IN_GOACC_BARRIER, BUILT_IN_GOACC_SINGLE_START,
	BUILT_IN_GOACC_SINGLE_COPY_START, BUILT_IN_GOACC_SINGLE_COPY_END): New
	builtins.
	* omp-offload.c (omp-sese.h): Include header.
	(oacc_loop_xform_head_tail): Call update_stmt for modified builtin
	calls.
	(oacc_loop_process): Likewise.
	(default_goacc_create_propagation_record): New default implementation
	for TARGET_GOACC_CREATE_PROPAGATION_RECORD hook.
	(execute_oacc_loop_designation): New.  Split out of oacc_device_lower.
	(execute_oacc_gimple_workers): New.  Likewise.
	(execute_oacc_device_lower): Recreate dims array.
	(pass_data_oacc_loop_designation, pass_data_oacc_gimple_workers): New.
	(pass_oacc_loop_designation, pass_oacc_gimple_workers): New.
	(make_pass_oacc_loop_designation, make_pass_oacc_gimple_workers): New.
	* omp-offload.h (oacc_fn_attrib_level): Add prototype.
	* omp-sese.c: New file.
	* omp-sese.h: New file.
	* passes.def (pass_oacc_loop_designation, pass_oacc_gimple_workers):
	Add passes.
	* target.def (worker_partitioning, create_propagation_record): Add
	target hooks.
	* targhooks.h (default_goacc_create_propagation_record): Add prototype.
	* tree-pass.h (make_pass_oacc_loop_designation,
	make_pass_oacc_gimple_workers): Add prototypes.
	* doc/tm.texi.in (TARGET_GOACC_WORKER_PARTITIONING,
	TARGET_GOACC_CREATE_PROPAGATION_RECORD): Add documentation hooks.
	* doc/tm.texi: Regenerate.
---
 gcc/Makefile.in          |    1 +
 gcc/config/nvptx/nvptx.c |  622 +-----------
 gcc/doc/tm.texi          |   10 +
 gcc/doc/tm.texi.in       |    4 +
 gcc/omp-builtins.def     |    8 +
 gcc/omp-offload.c        |  159 ++-
 gcc/omp-offload.h        |    1 +
 gcc/omp-sese.c           | 2086 ++++++++++++++++++++++++++++++++++++++
 gcc/omp-sese.h           |   32 +
 gcc/passes.def           |    2 +
 gcc/target.def           |   13 +
 gcc/targhooks.h          |    1 +
 gcc/tree-pass.h          |    2 +
 13 files changed, 2302 insertions(+), 639 deletions(-)
 create mode 100644 gcc/omp-sese.c
 create mode 100644 gcc/omp-sese.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 0004d46b93d..eadf235c9f8 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1432,6 +1432,7 @@ OBJS = \
 	omp-expand.o \
 	omp-general.o \
 	omp-grid.o \
+	omp-sese.o \
 	omp-low.o \
 	omp-simd-clone.o \
 	opt-problem.o \
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9934a240209..5ac8b6798cf 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -76,6 +76,7 @@
 #include "intl.h"
 #include "tree-hash-traits.h"
 #include "tree-pretty-print.h"
+#include "omp-sese.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -3327,625 +3328,6 @@ nvptx_discover_pars (bb_insn_map_t *map)
   return par;
 }
 
-/* Analyse a group of BBs within a partitioned region and create N
-   Single-Entry-Single-Exit regions.  Some of those regions will be
-   trivial ones consisting of a single BB.  The blocks of a
-   partitioned region might form a set of disjoint graphs -- because
-   the region encloses a differently partitoned sub region.
-
-   We use the linear time algorithm described in 'Finding Regions Fast:
-   Single Entry Single Exit and control Regions in Linear Time'
-   Johnson, Pearson & Pingali.  That algorithm deals with complete
-   CFGs, where a back edge is inserted from END to START, and thus the
-   problem becomes one of finding equivalent loops.
-
-   In this case we have a partial CFG.  We complete it by redirecting
-   any incoming edge to the graph to be from an arbitrary external BB,
-   and similarly redirecting any outgoing edge to be to  that BB.
-   Thus we end up with a closed graph.
-
-   The algorithm works by building a spanning tree of an undirected
-   graph and keeping track of back edges from nodes further from the
-   root in the tree to nodes nearer to the root in the tree.  In the
-   description below, the root is up and the tree grows downwards.
-
-   We avoid having to deal with degenerate back-edges to the same
-   block, by splitting each BB into 3 -- one for input edges, one for
-   the node itself and one for the output edges.  Such back edges are
-   referred to as 'Brackets'.  Cycle equivalent nodes will have the
-   same set of brackets.
-   
-   Determining bracket equivalency is done by maintaining a list of
-   brackets in such a manner that the list length and final bracket
-   uniquely identify the set.
-
-   We use coloring to mark all BBs with cycle equivalency with the
-   same color.  This is the output of the 'Finding Regions Fast'
-   algorithm.  Notice it doesn't actually find the set of nodes within
-   a particular region, just unorderd sets of nodes that are the
-   entries and exits of SESE regions.
-   
-   After determining cycle equivalency, we need to find the minimal
-   set of SESE regions.  Do this with a DFS coloring walk of the
-   complete graph.  We're either 'looking' or 'coloring'.  When
-   looking, and we're in the subgraph, we start coloring the color of
-   the current node, and remember that node as the start of the
-   current color's SESE region.  Every time we go to a new node, we
-   decrement the count of nodes with thet color.  If it reaches zero,
-   we remember that node as the end of the current color's SESE region
-   and return to 'looking'.  Otherwise we color the node the current
-   color.
-
-   This way we end up with coloring the inside of non-trivial SESE
-   regions with the color of that region.  */
-
-/* A pair of BBs.  We use this to represent SESE regions.  */
-typedef std::pair<basic_block, basic_block> bb_pair_t;
-typedef auto_vec<bb_pair_t> bb_pair_vec_t;
-
-/* A node in the undirected CFG.  The discriminator SECOND indicates just
-   above or just below the BB idicated by FIRST.  */
-typedef std::pair<basic_block, int> pseudo_node_t;
-
-/* A bracket indicates an edge towards the root of the spanning tree of the
-   undirected graph.  Each bracket has a color, determined
-   from the currrent set of brackets.  */
-struct bracket
-{
-  pseudo_node_t back; /* Back target */
-
-  /* Current color and size of set.  */
-  unsigned color;
-  unsigned size;
-
-  bracket (pseudo_node_t back_)
-  : back (back_), color (~0u), size (~0u)
-  {
-  }
-
-  unsigned get_color (auto_vec<unsigned> &color_counts, unsigned length)
-  {
-    if (length != size)
-      {
-	size = length;
-	color = color_counts.length ();
-	color_counts.quick_push (0);
-      }
-    color_counts[color]++;
-    return color;
-  }
-};
-
-typedef auto_vec<bracket> bracket_vec_t;
-
-/* Basic block info for finding SESE regions.    */
-
-struct bb_sese
-{
-  int node;  /* Node number in spanning tree.  */
-  int parent; /* Parent node number.  */
-
-  /* The algorithm splits each node A into Ai, A', Ao. The incoming
-     edges arrive at pseudo-node Ai and the outgoing edges leave at
-     pseudo-node Ao.  We have to remember which way we arrived at a
-     particular node when generating the spanning tree.  dir > 0 means
-     we arrived at Ai, dir < 0 means we arrived at Ao.  */
-  int dir;
-
-  /* Lowest numbered pseudo-node reached via a backedge from thsis
-     node, or any descendant.  */
-  pseudo_node_t high;
-
-  int color;  /* Cycle-equivalence color  */
-
-  /* Stack of brackets for this node.  */
-  bracket_vec_t brackets;
-
-  bb_sese (unsigned node_, unsigned p, int dir_)
-  :node (node_), parent (p), dir (dir_)
-  {
-  }
-  ~bb_sese ();
-
-  /* Push a bracket ending at BACK.  */
-  void push (const pseudo_node_t &back)
-  {
-    if (dump_file)
-      fprintf (dump_file, "Pushing backedge %d:%+d\n",
-	       back.first ? back.first->index : 0, back.second);
-    brackets.safe_push (bracket (back));
-  }
-  
-  void append (bb_sese *child);
-  void remove (const pseudo_node_t &);
-
-  /* Set node's color.  */
-  void set_color (auto_vec<unsigned> &color_counts)
-  {
-    color = brackets.last ().get_color (color_counts, brackets.length ());
-  }
-};
-
-bb_sese::~bb_sese ()
-{
-}
-
-/* Destructively append CHILD's brackets.  */
-
-void
-bb_sese::append (bb_sese *child)
-{
-  if (int len = child->brackets.length ())
-    {
-      int ix;
-
-      if (dump_file)
-	{
-	  for (ix = 0; ix < len; ix++)
-	    {
-	      const pseudo_node_t &pseudo = child->brackets[ix].back;
-	      fprintf (dump_file, "Appending (%d)'s backedge %d:%+d\n",
-		       child->node, pseudo.first ? pseudo.first->index : 0,
-		       pseudo.second);
-	    }
-	}
-      if (!brackets.length ())
-	std::swap (brackets, child->brackets);
-      else
-	{
-	  brackets.reserve (len);
-	  for (ix = 0; ix < len; ix++)
-	    brackets.quick_push (child->brackets[ix]);
-	}
-    }
-}
-
-/* Remove brackets that terminate at PSEUDO.  */
-
-void
-bb_sese::remove (const pseudo_node_t &pseudo)
-{
-  unsigned removed = 0;
-  int len = brackets.length ();
-
-  for (int ix = 0; ix < len; ix++)
-    {
-      if (brackets[ix].back == pseudo)
-	{
-	  if (dump_file)
-	    fprintf (dump_file, "Removing backedge %d:%+d\n",
-		     pseudo.first ? pseudo.first->index : 0, pseudo.second);
-	  removed++;
-	}
-      else if (removed)
-	brackets[ix-removed] = brackets[ix];
-    }
-  while (removed--)
-    brackets.pop ();
-}
-
-/* Accessors for BB's aux pointer.  */
-#define BB_SET_SESE(B, S) ((B)->aux = (S))
-#define BB_GET_SESE(B) ((bb_sese *)(B)->aux)
-
-/* DFS walk creating SESE data structures.  Only cover nodes with
-   BB_VISITED set.  Append discovered blocks to LIST.  We number in
-   increments of 3 so that the above and below pseudo nodes can be
-   implicitly numbered too.  */
-
-static int
-nvptx_sese_number (int n, int p, int dir, basic_block b,
-		   auto_vec<basic_block> *list)
-{
-  if (BB_GET_SESE (b))
-    return n;
-
-  if (dump_file)
-    fprintf (dump_file, "Block %d(%d), parent (%d), orientation %+d\n",
-	     b->index, n, p, dir);
-  
-  BB_SET_SESE (b, new bb_sese (n, p, dir));
-  p = n;
-      
-  n += 3;
-  list->quick_push (b);
-
-  /* First walk the nodes on the 'other side' of this node, then walk
-     the nodes on the same side.  */
-  for (unsigned ix = 2; ix; ix--)
-    {
-      vec<edge, va_gc> *edges = dir > 0 ? b->succs : b->preds;
-      size_t offset = (dir > 0 ? offsetof (edge_def, dest)
-		       : offsetof (edge_def, src));
-      edge e;
-      edge_iterator ei;
-
-      FOR_EACH_EDGE (e, ei, edges)
-	{
-	  basic_block target = *(basic_block *)((char *)e + offset);
-	  
-	  if (target->flags & BB_VISITED)
-	    n = nvptx_sese_number (n, p, dir, target, list);
-	}
-      dir = -dir;
-    }
-  return n;
-}
-
-/* Process pseudo node above (DIR < 0) or below (DIR > 0) ME.
-   EDGES are the outgoing edges and OFFSET is the offset to the src
-   or dst block on the edges.   */
-
-static void
-nvptx_sese_pseudo (basic_block me, bb_sese *sese, int depth, int dir,
-		   vec<edge, va_gc> *edges, size_t offset)
-{
-  edge e;
-  edge_iterator ei;
-  int hi_back = depth;
-  pseudo_node_t node_back (0, depth);
-  int hi_child = depth;
-  pseudo_node_t node_child (0, depth);
-  basic_block child = NULL;
-  unsigned num_children = 0;
-  int usd = -dir * sese->dir;
-
-  if (dump_file)
-    fprintf (dump_file, "\nProcessing %d(%d) %+d\n",
-	     me->index, sese->node, dir);
-
-  if (dir < 0)
-    {
-      /* This is the above pseudo-child.  It has the BB itself as an
-	 additional child node.  */
-      node_child = sese->high;
-      hi_child = node_child.second;
-      if (node_child.first)
-	hi_child += BB_GET_SESE (node_child.first)->node;
-      num_children++;
-    }
-
-  /* Examine each edge.
-     - if it is a child (a) append its bracket list and (b) record
-          whether it is the child with the highest reaching bracket.
-     - if it is an edge to ancestor, record whether it's the highest
-          reaching backlink.  */
-  FOR_EACH_EDGE (e, ei, edges)
-    {
-      basic_block target = *(basic_block *)((char *)e + offset);
-
-      if (bb_sese *t_sese = BB_GET_SESE (target))
-	{
-	  if (t_sese->parent == sese->node && !(t_sese->dir + usd))
-	    {
-	      /* Child node.  Append its bracket list. */
-	      num_children++;
-	      sese->append (t_sese);
-
-	      /* Compare it's hi value.  */
-	      int t_hi = t_sese->high.second;
-
-	      if (basic_block child_hi_block = t_sese->high.first)
-		t_hi += BB_GET_SESE (child_hi_block)->node;
-
-	      if (hi_child > t_hi)
-		{
-		  hi_child = t_hi;
-		  node_child = t_sese->high;
-		  child = target;
-		}
-	    }
-	  else if (t_sese->node < sese->node + dir
-		   && !(dir < 0 && sese->parent == t_sese->node))
-	    {
-	      /* Non-parental ancestor node -- a backlink.  */
-	      int d = usd * t_sese->dir;
-	      int back = t_sese->node + d;
-	
-	      if (hi_back > back)
-		{
-		  hi_back = back;
-		  node_back = pseudo_node_t (target, d);
-		}
-	    }
-	}
-      else
-	{ /* Fallen off graph, backlink to entry node.  */
-	  hi_back = 0;
-	  node_back = pseudo_node_t (0, 0);
-	}
-    }
-
-  /* Remove any brackets that terminate at this pseudo node.  */
-  sese->remove (pseudo_node_t (me, dir));
-
-  /* Now push any backlinks from this pseudo node.  */
-  FOR_EACH_EDGE (e, ei, edges)
-    {
-      basic_block target = *(basic_block *)((char *)e + offset);
-      if (bb_sese *t_sese = BB_GET_SESE (target))
-	{
-	  if (t_sese->node < sese->node + dir
-	      && !(dir < 0 && sese->parent == t_sese->node))
-	    /* Non-parental ancestor node - backedge from me.  */
-	    sese->push (pseudo_node_t (target, usd * t_sese->dir));
-	}
-      else
-	{
-	  /* back edge to entry node */
-	  sese->push (pseudo_node_t (0, 0));
-	}
-    }
-  
- /* If this node leads directly or indirectly to a no-return region of
-     the graph, then fake a backedge to entry node.  */
-  if (!sese->brackets.length () || !edges || !edges->length ())
-    {
-      hi_back = 0;
-      node_back = pseudo_node_t (0, 0);
-      sese->push (node_back);
-    }
-
-  /* Record the highest reaching backedge from us or a descendant.  */
-  sese->high = hi_back < hi_child ? node_back : node_child;
-
-  if (num_children > 1)
-    {
-      /* There is more than one child -- this is a Y shaped piece of
-	 spanning tree.  We have to insert a fake backedge from this
-	 node to the highest ancestor reached by not-the-highest
-	 reaching child.  Note that there may be multiple children
-	 with backedges to the same highest node.  That's ok and we
-	 insert the edge to that highest node.  */
-      hi_child = depth;
-      if (dir < 0 && child)
-	{
-	  node_child = sese->high;
-	  hi_child = node_child.second;
-	  if (node_child.first)
-	    hi_child += BB_GET_SESE (node_child.first)->node;
-	}
-
-      FOR_EACH_EDGE (e, ei, edges)
-	{
-	  basic_block target = *(basic_block *)((char *)e + offset);
-
-	  if (target == child)
-	    /* Ignore the highest child. */
-	    continue;
-
-	  bb_sese *t_sese = BB_GET_SESE (target);
-	  if (!t_sese)
-	    continue;
-	  if (t_sese->parent != sese->node)
-	    /* Not a child. */
-	    continue;
-
-	  /* Compare its hi value.  */
-	  int t_hi = t_sese->high.second;
-
-	  if (basic_block child_hi_block = t_sese->high.first)
-	    t_hi += BB_GET_SESE (child_hi_block)->node;
-
-	  if (hi_child > t_hi)
-	    {
-	      hi_child = t_hi;
-	      node_child = t_sese->high;
-	    }
-	}
-      
-      sese->push (node_child);
-    }
-}
-
-
-/* DFS walk of BB graph.  Color node BLOCK according to COLORING then
-   proceed to successors.  Set SESE entry and exit nodes of
-   REGIONS.  */
-
-static void
-nvptx_sese_color (auto_vec<unsigned> &color_counts, bb_pair_vec_t &regions,
-		  basic_block block, int coloring)
-{
-  bb_sese *sese = BB_GET_SESE (block);
-
-  if (block->flags & BB_VISITED)
-    {
-      /* If we've already encountered this block, either we must not
-	 be coloring, or it must have been colored the current color.  */
-      gcc_assert (coloring < 0 || (sese && coloring == sese->color));
-      return;
-    }
-  
-  block->flags |= BB_VISITED;
-
-  if (sese)
-    {
-      if (coloring < 0)
-	{
-	  /* Start coloring a region.  */
-	  regions[sese->color].first = block;
-	  coloring = sese->color;
-	}
-
-      if (!--color_counts[sese->color] && sese->color == coloring)
-	{
-	  /* Found final block of SESE region.  */
-	  regions[sese->color].second = block;
-	  coloring = -1;
-	}
-      else
-	/* Color the node, so we can assert on revisiting the node
-	   that the graph is indeed SESE.  */
-	sese->color = coloring;
-    }
-  else
-    /* Fallen off the subgraph, we cannot be coloring.  */
-    gcc_assert (coloring < 0);
-
-  /* Walk each successor block.  */
-  if (block->succs && block->succs->length ())
-    {
-      edge e;
-      edge_iterator ei;
-      
-      FOR_EACH_EDGE (e, ei, block->succs)
-	nvptx_sese_color (color_counts, regions, e->dest, coloring);
-    }
-  else
-    gcc_assert (coloring < 0);
-}
-
-/* Find minimal set of SESE regions covering BLOCKS.  REGIONS might
-   end up with NULL entries in it.  */
-
-static void
-nvptx_find_sese (auto_vec<basic_block> &blocks, bb_pair_vec_t &regions)
-{
-  basic_block block;
-  int ix;
-
-  /* First clear each BB of the whole function.  */ 
-  FOR_ALL_BB_FN (block, cfun)
-    {
-      block->flags &= ~BB_VISITED;
-      BB_SET_SESE (block, 0);
-    }
-
-  /* Mark blocks in the function that are in this graph.  */
-  for (ix = 0; blocks.iterate (ix, &block); ix++)
-    block->flags |= BB_VISITED;
-
-  /* Counts of nodes assigned to each color.  There cannot be more
-     colors than blocks (and hopefully there will be fewer).  */
-  auto_vec<unsigned> color_counts;
-  color_counts.reserve (blocks.length ());
-
-  /* Worklist of nodes in the spanning tree.  Again, there cannot be
-     more nodes in the tree than blocks (there will be fewer if the
-     CFG of blocks is disjoint).  */
-  auto_vec<basic_block> spanlist;
-  spanlist.reserve (blocks.length ());
-
-  /* Make sure every block has its cycle class determined.  */
-  for (ix = 0; blocks.iterate (ix, &block); ix++)
-    {
-      if (BB_GET_SESE (block))
-	/* We already met this block in an earlier graph solve.  */
-	continue;
-
-      if (dump_file)
-	fprintf (dump_file, "Searching graph starting at %d\n", block->index);
-      
-      /* Number the nodes reachable from block initial DFS order.  */
-      int depth = nvptx_sese_number (2, 0, +1, block, &spanlist);
-
-      /* Now walk in reverse DFS order to find cycle equivalents.  */
-      while (spanlist.length ())
-	{
-	  block = spanlist.pop ();
-	  bb_sese *sese = BB_GET_SESE (block);
-
-	  /* Do the pseudo node below.  */
-	  nvptx_sese_pseudo (block, sese, depth, +1,
-			     sese->dir > 0 ? block->succs : block->preds,
-			     (sese->dir > 0 ? offsetof (edge_def, dest)
-			      : offsetof (edge_def, src)));
-	  sese->set_color (color_counts);
-	  /* Do the pseudo node above.  */
-	  nvptx_sese_pseudo (block, sese, depth, -1,
-			     sese->dir < 0 ? block->succs : block->preds,
-			     (sese->dir < 0 ? offsetof (edge_def, dest)
-			      : offsetof (edge_def, src)));
-	}
-      if (dump_file)
-	fprintf (dump_file, "\n");
-    }
-
-  if (dump_file)
-    {
-      unsigned count;
-      const char *comma = "";
-      
-      fprintf (dump_file, "Found %d cycle equivalents\n",
-	       color_counts.length ());
-      for (ix = 0; color_counts.iterate (ix, &count); ix++)
-	{
-	  fprintf (dump_file, "%s%d[%d]={", comma, ix, count);
-
-	  comma = "";
-	  for (unsigned jx = 0; blocks.iterate (jx, &block); jx++)
-	    if (BB_GET_SESE (block)->color == ix)
-	      {
-		block->flags |= BB_VISITED;
-		fprintf (dump_file, "%s%d", comma, block->index);
-		comma=",";
-	      }
-	  fprintf (dump_file, "}");
-	  comma = ", ";
-	}
-      fprintf (dump_file, "\n");
-   }
-  
-  /* Now we've colored every block in the subgraph.  We now need to
-     determine the minimal set of SESE regions that cover that
-     subgraph.  Do this with a DFS walk of the complete function.
-     During the walk we're either 'looking' or 'coloring'.  When we
-     reach the last node of a particular color, we stop coloring and
-     return to looking.  */
-
-  /* There cannot be more SESE regions than colors.  */
-  regions.reserve (color_counts.length ());
-  for (ix = color_counts.length (); ix--;)
-    regions.quick_push (bb_pair_t (0, 0));
-
-  for (ix = 0; blocks.iterate (ix, &block); ix++)
-    block->flags &= ~BB_VISITED;
-
-  nvptx_sese_color (color_counts, regions, ENTRY_BLOCK_PTR_FOR_FN (cfun), -1);
-
-  if (dump_file)
-    {
-      const char *comma = "";
-      int len = regions.length ();
-      
-      fprintf (dump_file, "SESE regions:");
-      for (ix = 0; ix != len; ix++)
-	{
-	  basic_block from = regions[ix].first;
-	  basic_block to = regions[ix].second;
-
-	  if (from)
-	    {
-	      fprintf (dump_file, "%s %d{%d", comma, ix, from->index);
-	      if (to != from)
-		fprintf (dump_file, "->%d", to->index);
-
-	      int color = BB_GET_SESE (from)->color;
-
-	      /* Print the blocks within the region (excluding ends).  */
-	      FOR_EACH_BB_FN (block, cfun)
-		{
-		  bb_sese *sese = BB_GET_SESE (block);
-
-		  if (sese && sese->color == color
-		      && block != from && block != to)
-		    fprintf (dump_file, ".%d", block->index);
-		}
-	      fprintf (dump_file, "}");
-	    }
-	  comma = ",";
-	}
-      fprintf (dump_file, "\n\n");
-    }
-  
-  for (ix = 0; blocks.iterate (ix, &block); ix++)
-    delete BB_GET_SESE (block);
-}
-
-#undef BB_SET_SESE
-#undef BB_GET_SESE
-
 /* Propagate live state at the start of a partitioned region.  IS_CALL
    indicates whether the propagation is for a (partitioned) call
    instruction.  BLOCK provides the live register information, and
@@ -4767,7 +4149,7 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
 	  /* Neuter whole SESE regions.  */
 	  bb_pair_vec_t regions;
 
-	  nvptx_find_sese (par->blocks, regions);
+	  omp_find_sese (par->blocks, regions);
 	  len = regions.length ();
 	  for (ix = 0; ix != len; ix++)
 	    {
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f5b7995705a..34558a3b972 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6168,6 +6168,16 @@ Tweak variable declaration for a private variable at the specified
 parallelism level.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_GOACC_WORKER_PARTITIONING
+Use gimple transformation for worker neutering/broadcasting.
+@end deftypevr
+
+@deftypefn {Target Hook} tree TARGET_GOACC_CREATE_PROPAGATION_RECORD (tree @var{rec}, bool @var{sender}, const char *@var{name})
+Create a record used to propagate local-variable state from an active
+worker to other workers.  A possible implementation might adjust the type
+of REC to place the new variable in shared GPU memory.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index d5ed6906e5d..54d1f083ef8 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4217,6 +4217,10 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_ADJUST_PRIVATE_DECL
 
+@hook TARGET_GOACC_WORKER_PARTITIONING
+
+@hook TARGET_GOACC_CREATE_PROPAGATION_RECORD
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 9961c287494..a8f10e3389e 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -73,6 +73,8 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER, "GOMP_barrier",
 		  BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER_CANCEL, "GOMP_barrier_cancel",
 		  BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_BARRIER, "GOACC_barrier",
+		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT, "GOMP_taskwait",
 		  BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT_DEPEND, "GOMP_taskwait_depend",
@@ -410,6 +412,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
 		  BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
 		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_START, "GOACC_single_start",
+		   BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_START, "GOACC_single_copy_start",
+		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_END, "GOACC_single_copy_end",
+		   BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register_ver",
 		  BT_FN_VOID_UINT_PTR_INT_PTR, ATTR_NOTHROW_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER,
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 2e56a04a714..c652bc6dcef 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "attribs.h"
 #include "cfgloop.h"
+#include "omp-sese.h"
 #include "convert.h"
 
 /* Describe the OpenACC looping structure of a function.  The entire
@@ -1093,6 +1094,8 @@ oacc_loop_xform_head_tail (gcall *from, int level)
       else if (gimple_call_internal_p (stmt, IFN_GOACC_REDUCTION))
 	*gimple_call_arg_ptr (stmt, 3) = replacement;
 
+      update_stmt (stmt);
+
       gsi_next (&gsi);
       while (gsi_end_p (gsi))
 	gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
@@ -1117,25 +1120,28 @@ oacc_loop_process (oacc_loop *loop)
       gcall *call;
       
       for (ix = 0; loop->ifns.iterate (ix, &call); ix++)
-	switch (gimple_call_internal_fn (call))
-	  {
-	  case IFN_GOACC_LOOP:
+	{
+	  switch (gimple_call_internal_fn (call))
 	    {
-	      bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
-	      gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
-	      if (!is_e)
-		gimple_call_set_arg (call, 4, chunk_arg);
-	    }
-	    break;
+	    case IFN_GOACC_LOOP:
+	      {
+		bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
+		gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
+		if (!is_e)
+		  gimple_call_set_arg (call, 4, chunk_arg);
+	      }
+	      break;
 
-	  case IFN_GOACC_TILE:
-	    gimple_call_set_arg (call, 3, mask_arg);
-	    gimple_call_set_arg (call, 4, e_mask_arg);
-	    break;
+	    case IFN_GOACC_TILE:
+	      gimple_call_set_arg (call, 3, mask_arg);
+	      gimple_call_set_arg (call, 4, e_mask_arg);
+	      break;
 
-	  default:
-	    gcc_unreachable ();
-	  }
+	    default:
+	      gcc_unreachable ();
+	    }
+	  update_stmt (call);
+	}
 
       unsigned dim = GOMP_DIM_GANG;
       unsigned mask = loop->mask | loop->e_mask;
@@ -1574,12 +1580,27 @@ is_sync_builtin_call (gcall *call)
   return false;
 }
 
+/* Default implementation of the create_propagation_record hook: create a
+   temporary of type RECORD_TYPE if SENDER is true, else of a pointer to it.  */
+
+tree
+default_goacc_create_propagation_record (tree record_type, bool sender,
+					 const char *name)
+{
+  tree type = record_type;
+
+  if (!sender)
+    type = build_pointer_type (type);
+
+  return create_tmp_var (type, name);
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
 
 static unsigned int
-execute_oacc_device_lower ()
+execute_oacc_loop_designation ()
 {
   tree attrs = oacc_get_fn_attrib (current_function_decl);
 
@@ -1679,10 +1700,36 @@ execute_oacc_device_lower ()
 	free_oacc_loop (l);
     }
 
+  free_oacc_loop (loops);
+
   /* Offloaded targets may introduce new basic blocks, which require
      dominance information to update SSA.  */
   calculate_dominance_info (CDI_DOMINATORS);
 
+  return 0;
+}
+
+int
+execute_oacc_gimple_workers (void)
+{
+  oacc_do_neutering ();
+  calculate_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+static unsigned int
+execute_oacc_device_lower ()
+{
+  int dims[GOMP_DIM_MAX];
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+
+  if (!attr)
+    /* Not an offloaded function.  */
+    return 0;
+
+  for (unsigned i = 0; i < GOMP_DIM_MAX; i++)
+    dims[i] = oacc_get_fn_dim_size (current_function_decl, i);
+
   hash_set<tree> adjusted_vars;
 
   /* Now lower internal loop functions to target-specific code
@@ -1878,8 +1925,6 @@ execute_oacc_device_lower ()
 	  }
     }
 
-  free_oacc_loop (loops);
-
   return 0;
 }
 
@@ -1920,6 +1965,70 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 
 namespace {
 
+const pass_data pass_data_oacc_loop_designation =
+{
+  GIMPLE_PASS, /* type */
+  "oaccloops", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg
+  | TODO_rebuild_alias, /* todo_flags_finish */
+};
+
+class pass_oacc_loop_designation : public gimple_opt_pass
+{
+public:
+  pass_oacc_loop_designation (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_loop_designation, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_openacc; };
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_oacc_loop_designation ();
+    }
+
+}; // class pass_oacc_loop_designation
+
+const pass_data pass_data_oacc_gimple_workers =
+{
+  GIMPLE_PASS, /* type */
+  "oaccworkers", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_oacc_gimple_workers : public gimple_opt_pass
+{
+public:
+  pass_oacc_gimple_workers (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_gimple_workers, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return flag_openacc && targetm.goacc.worker_partitioning;
+  };
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_oacc_gimple_workers ();
+    }
+
+}; // class pass_oacc_gimple_workers
+
 const pass_data pass_data_oacc_device_lower =
 {
   GIMPLE_PASS, /* type */
@@ -1952,6 +2061,18 @@ public:
 
 } // anon namespace
 
+gimple_opt_pass *
+make_pass_oacc_loop_designation (gcc::context *ctxt)
+{
+  return new pass_oacc_loop_designation (ctxt);
+}
+
+gimple_opt_pass *
+make_pass_oacc_gimple_workers (gcc::context *ctxt)
+{
+  return new pass_oacc_gimple_workers (ctxt);
+}
+
 gimple_opt_pass *
 make_pass_oacc_device_lower (gcc::context *ctxt)
 {
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index 21c9236b74f..b441854585f 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -29,6 +29,7 @@ extern int oacc_fn_attrib_level (tree attr);
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
 
+extern int oacc_fn_attrib_level (tree attr);
 extern void omp_finish_file (void);
 
 #endif /* GCC_OMP_DEVICE_H */
diff --git a/gcc/omp-sese.c b/gcc/omp-sese.c
new file mode 100644
index 00000000000..4a825d1a136
--- /dev/null
+++ b/gcc/omp-sese.c
@@ -0,0 +1,2086 @@
+/* Find single-entry, single-exit regions for OpenACC.
+   Copyright (C) 2014-2017 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "cgraph.h"
+#include "pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimple-walk.h"
+#include "tree-inline.h"
+#include "langhooks.h"
+#include "omp-general.h"
+#include "omp-low.h"
+#include "omp-grid.h"
+#include "gimple-pretty-print.h"
+#include "cfghooks.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "internal-fn.h"
+#include "bitmap.h"
+#include "tree-nested.h"
+#include "stor-layout.h"
+#include "tree-ssa-threadupdate.h"
+#include "tree-into-ssa.h"
+#include "splay-tree.h"
+#include "target.h"
+#include "cfgloop.h"
+#include "tree-cfg.h"
+#include "omp-offload.h"
+#include "attribs.h"
+#include "omp-sese.h"
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
+struct parallel_g
+{
+  /* Parent parallel.  */
+  parallel_g *parent;
+
+  /* Next sibling parallel.  */
+  parallel_g *next;
+
+  /* First child parallel.  */
+  parallel_g *inner;
+
+  /* Partitioning mask of the parallel.  */
+  unsigned mask;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after the
+     partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  gimple *forked_stmt;
+  gimple *join_stmt;
+
+  gimple *fork_stmt;
+  gimple *joining_stmt;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+  tree record_type;
+  tree sender_decl;
+  tree receiver_decl;
+
+public:
+  parallel_g (parallel_g *parent, unsigned mode);
+  ~parallel_g ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel_g::parallel_g (parallel_g *parent_, unsigned mask_)
+  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_stmt = join_stmt = NULL;
+  fork_stmt = joining_stmt = NULL;
+
+  record_type = NULL_TREE;
+  sender_decl = NULL_TREE;
+  receiver_decl = NULL_TREE;
+
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel_g::~parallel_g ()
+{
+  delete inner;
+  delete next;
+}
+
+static bool
+local_var_based_p (tree decl)
+{
+  switch (TREE_CODE (decl))
+    {
+    case VAR_DECL:
+      return !is_global_var (decl);
+
+    case COMPONENT_REF:
+    case BIT_FIELD_REF:
+    case ARRAY_REF:
+      return local_var_based_p (TREE_OPERAND (decl, 0));
+
+    default:
+      return false;
+    }
+}
+
+/* Map of basic blocks to gimple stmts.  */
+typedef hash_map<basic_block, gimple *> bb_stmt_map_t;
+
+/* Calls to OpenACC routines are made by all workers/wavefronts/warps, since
+   the routine likely contains partitioned loops (otherwise it will do its
+   own neutering and variable propagation).  Return TRUE if a function call
+   CALL should instead be made in (worker) single mode rather than in
+   redundant mode.  */
+
+static bool
+omp_sese_active_worker_call (gcall *call)
+{
+#define GOMP_DIM_SEQ GOMP_DIM_MAX
+  tree fndecl = gimple_call_fndecl (call);
+
+  if (!fndecl)
+    return true;
+
+  tree attrs = oacc_get_fn_attrib (fndecl);
+
+  if (!attrs)
+    return true;
+
+  int level = oacc_fn_attrib_level (attrs);
+
+  /* Neither regular functions nor "seq" routines should be run by all threads
+     in worker-single mode.  */
+  return level == -1 || level == GOMP_DIM_SEQ;
+#undef GOMP_DIM_SEQ
+}
+
+/* Split basic blocks such that each forked and join marker is at the
+   start of its basic block.  Thus afterwards each block will have a
+   single partitioning mode.  We also do the same for statements that
+   must execute in fully-partitioned (redundant) mode, such as returns,
+   since they are executed by every thread.  Populate MAP with head and
+   tail blocks.  We also clear the BB visited flag, which is used when
+   finding partitions.  */
+
+static void
+omp_sese_split_blocks (bb_stmt_map_t *map)
+{
+  auto_vec<gimple *> worklist;
+  basic_block block;
+
+  /* Locate all the statements of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      /* Clear the visited flag, for use by the parallel locator.  */
+      block->flags &= ~BB_VISITED;
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_JOIN)
+		worklist.safe_push (stmt);
+	      else if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  gcc_assert (gsi_one_before_end_p (gsi));
+		  basic_block forked_block = single_succ (block);
+		  gimple_stmt_iterator gsi2 = gsi_start_bb (forked_block);
+
+		  /* We push a NOP as a placeholder for the "forked" stmt.
+		     This is then recognized in omp_sese_find_par.  */
+		  gimple *nop = gimple_build_nop ();
+		  gsi_insert_before (&gsi2, nop, GSI_SAME_STMT);
+
+		  worklist.safe_push (nop);
+		}
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_RETURN
+		   || gimple_code (stmt) == GIMPLE_COND
+		   || gimple_code (stmt) == GIMPLE_SWITCH
+		   || (gimple_code (stmt) == GIMPLE_CALL
+		       && !gimple_call_internal_p (stmt)
+		       && !omp_sese_active_worker_call (as_a <gcall *> (stmt))))
+	    worklist.safe_push (stmt);
+	  else if (is_gimple_assign (stmt))
+	    {
+	      tree lhs = gimple_assign_lhs (stmt);
+
+	      /* Force assignments to components/fields/elements of local
+		 aggregates into fully-partitioned (redundant) mode.  This
+		 avoids having to broadcast the whole aggregate.  The RHS of
+		 the assignment will be propagated using the normal
+		 mechanism.  */
+
+	      switch (TREE_CODE (lhs))
+		{
+		case COMPONENT_REF:
+		case BIT_FIELD_REF:
+		case ARRAY_REF:
+		  {
+		    tree aggr = TREE_OPERAND (lhs, 0);
+
+		    if (local_var_based_p (aggr))
+		      worklist.safe_push (stmt);
+		  }
+		  break;
+
+		default:
+		  ;
+		}
+	    }
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  gimple *stmt;
+
+  for (ix = 0; worklist.iterate (ix, &stmt); ix++)
+    {
+      basic_block block = gimple_bb (stmt);
+
+      if (gimple_code (stmt) == GIMPLE_COND)
+	{
+	  gcond *orig_cond = as_a <gcond *> (stmt);
+	  tree_code code = gimple_expr_code (orig_cond);
+	  tree pred = make_ssa_name (boolean_type_node);
+	  gimple *asgn = gimple_build_assign (pred, code,
+			   gimple_cond_lhs (orig_cond),
+			   gimple_cond_rhs (orig_cond));
+	  gcond *new_cond
+	    = gimple_build_cond (NE_EXPR, pred, boolean_false_node,
+				 gimple_cond_true_label (orig_cond),
+				 gimple_cond_false_label (orig_cond));
+
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_insert_before (&gsi, asgn, GSI_SAME_STMT);
+	  gsi_replace (&gsi, new_cond, true);
+
+	  edge e = split_block (block, asgn);
+	  block = e->dest;
+	  map->get_or_insert (block) = new_cond;
+	}
+      else if ((gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt))
+	       || is_gimple_assign (stmt))
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  edge call = split_block (block, gsi_stmt (gsi));
+
+	  gimple *call_stmt = gsi_stmt (gsi_start_bb (call->dest));
+
+	  edge call_to_ret = split_block (call->dest, call_stmt);
+
+	  map->get_or_insert (call_to_ret->src) = call_stmt;
+	}
+      else
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  if (gsi_end_p (gsi))
+	    map->get_or_insert (block) = stmt;
+	  else
+	    {
+	      /* Split the block before the statement; it ends up in the new
+		 block.  */
+	      edge e = split_block (block, gsi_stmt (gsi));
+
+	      block = e->dest;
+	      map->get_or_insert (block) = stmt;
+	    }
+	}
+    }
+}
+
+static const char *
+mask_name (unsigned mask)
+{
+  switch (mask)
+    {
+    case 0: return "gang redundant";
+    case 1: return "gang partitioned";
+    case 2: return "worker partitioned";
+    case 3: return "gang+worker partitioned";
+    case 4: return "vector partitioned";
+    case 5: return "gang+vector partitioned";
+    case 6: return "worker+vector partitioned";
+    case 7: return "fully partitioned";
+    default: return "<illegal>";
+    }
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+omp_sese_dump_pars (parallel_g *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mask %d (%s) head=%d, tail=%d\n",
+	   depth, par->mask, mask_name (par->mask),
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    omp_sese_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    omp_sese_dump_pars (par->next, depth);
+}
+
+/* If BLOCK contains a fork/join marker, process it to create or
+   terminate a loop structure.  Add this block to the current loop,
+   and then walk successor blocks.   */
+
+static parallel_g *
+omp_sese_find_par (bb_stmt_map_t *map, parallel_g *par, basic_block block)
+{
+  if (block->flags & BB_VISITED)
+    return par;
+  block->flags |= BB_VISITED;
+
+  if (gimple **stmtp = map->get (block))
+    {
+      gimple *stmt = *stmtp;
+
+      if (gimple_code (stmt) == GIMPLE_COND
+	  || gimple_code (stmt) == GIMPLE_SWITCH
+	  || gimple_code (stmt) == GIMPLE_RETURN
+	  || (gimple_code (stmt) == GIMPLE_CALL
+	      && !gimple_call_internal_p (stmt))
+	  || is_gimple_assign (stmt))
+	{
+	  /* A single block that is forced to be at the maximum partition
+	     level.  Make a singleton par for it.  */
+	  par = new parallel_g (par, GOMP_DIM_MASK (GOMP_DIM_GANG)
+				   | GOMP_DIM_MASK (GOMP_DIM_WORKER)
+				   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+	  par->forked_block = block;
+	  par->forked_stmt = stmt;
+	  par->blocks.safe_push (block);
+	  par = par->parent;
+	  goto walk_successors;
+	}
+      else if (gimple_nop_p (stmt))
+	{
+	  basic_block pred = single_pred (block);
+	  gcc_assert (pred);
+	  gimple_stmt_iterator gsi = gsi_last_bb (pred);
+	  gimple *final_stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (final_stmt, IFN_UNIQUE))
+	    {
+	      gcall *call = as_a <gcall *> (final_stmt);
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  HOST_WIDE_INT dim
+		    = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+		  unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+		  par = new parallel_g (par, mask);
+		  par->forked_block = block;
+		  par->forked_stmt = final_stmt;
+		  par->fork_stmt = stmt;
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+	  enum ifn_unique_kind k = ((enum ifn_unique_kind)
+	    TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+	  if (k == IFN_UNIQUE_OACC_JOIN)
+	    {
+	      HOST_WIDE_INT dim = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+	      unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+	      gcc_assert (par->mask == mask);
+	      par->join_block = block;
+	      par->join_stmt = stmt;
+	      par = par->parent;
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else
+	gcc_unreachable ();
+    }
+
+  if (par)
+    /* Add this block onto the current loop's list of blocks.  */
+    par->blocks.safe_push (block);
+  else
+    /* This must be the entry block.  Create a NULL parallel.  */
+    par = new parallel_g (0, 0);
+
+walk_successors:
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    omp_sese_find_par (map, par, e->dest);
+
+  return par;
+}
+
+/* DFS walk the CFG looking for fork & join markers.  Construct
+   loop structures as we go.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel_g *
+omp_sese_discover_pars (bb_stmt_map_t *map)
+{
+  basic_block block;
+
+  /* Mark exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+
+  /* And entry block as not.  */
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+
+  parallel_g *par = omp_sese_find_par (map, 0, block);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      omp_sese_dump_pars (par, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  return par;
+}
+
+static void
+populate_single_mode_bitmaps (parallel_g *par, bitmap worker_single,
+			      bitmap vector_single, unsigned outer_mask,
+			      int depth)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  basic_block block;
+
+  for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+    {
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+	bitmap_set_bit (worker_single, block->index);
+
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) == 0)
+	bitmap_set_bit (vector_single, block->index);
+    }
+
+  if (par->inner)
+    populate_single_mode_bitmaps (par->inner, worker_single, vector_single,
+				  mask, depth + 1);
+  if (par->next)
+    populate_single_mode_bitmaps (par->next, worker_single, vector_single,
+				  outer_mask, depth);
+}
+
+/* A map from SSA names or var decls to record fields.  */
+
+typedef hash_map<tree, tree> field_map_t;
+
+/* For each propagation record type, this is a map from SSA names or var decls
+   to propagate, to the field in the record type that should be used for
+   transmission and reception.  */
+
+typedef hash_map<tree, field_map_t *> record_field_map_t;
+
+static GTY(()) record_field_map_t *field_map;
+
+static void
+install_var_field (tree var, tree record_type)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree name;
+  char tmp[20];
+
+  if (TREE_CODE (var) == SSA_NAME)
+    {
+      name = SSA_NAME_IDENTIFIER (var);
+      if (!name)
+	{
+	  sprintf (tmp, "_%u", (unsigned) SSA_NAME_VERSION (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else if (TREE_CODE (var) == VAR_DECL)
+    {
+      name = DECL_NAME (var);
+      if (!name)
+	{
+	  sprintf (tmp, "D.%u", DECL_UID (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  gcc_assert (!fields->get (var));
+
+  tree type = TREE_TYPE (var);
+
+  if (POINTER_TYPE_P (type)
+      && TYPE_RESTRICT (type))
+    type = build_qualified_type (type, TYPE_QUALS (type) & ~TYPE_QUAL_RESTRICT);
+
+  tree field = build_decl (BUILTINS_LOCATION, FIELD_DECL, name, type);
+
+  if (TREE_CODE (var) == VAR_DECL && type == TREE_TYPE (var))
+    {
+      SET_DECL_ALIGN (field, DECL_ALIGN (var));
+      DECL_USER_ALIGN (field) = DECL_USER_ALIGN (var);
+      TREE_THIS_VOLATILE (field) = TREE_THIS_VOLATILE (var);
+    }
+  else
+    SET_DECL_ALIGN (field, TYPE_ALIGN (type));
+
+  fields->put (var, field);
+
+  insert_field_into_struct (record_type, field);
+}
+
+/* Sets of SSA_NAMES or VAR_DECLs to propagate.  */
+typedef hash_set<tree> propagation_set;
+
+static void
+find_ssa_names_to_propagate (parallel_g *par, unsigned outer_mask,
+			     bitmap worker_single, bitmap vector_single,
+			     vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_ssa_names_to_propagate (par->inner, mask, worker_single,
+				 vector_single, prop_set);
+  if (par->next)
+    find_ssa_names_to_propagate (par->next, outer_mask, worker_single,
+				 vector_single, prop_set);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gphi_iterator psi = gsi_start_phis (block);
+	       !gsi_end_p (psi); gsi_next (&psi))
+	    {
+	      gphi *phi = psi.phi ();
+	      use_operand_p use;
+	      ssa_op_iter iter;
+
+	      FOR_EACH_PHI_ARG (use, phi, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  if (TREE_CODE (var) != SSA_NAME)
+		    continue;
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      use_operand_p use;
+	      ssa_op_iter iter;
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      FOR_EACH_SSA_USE_OPERAND (use, stmt, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Callback for walk_gimple_stmt to find RHS VAR_DECLs (uses) in a
+   statement.  */
+
+static tree
+find_partitioned_var_uses_1 (tree *node, int *, void *data)
+{
+  walk_stmt_info *wi = (walk_stmt_info *) data;
+  hash_set<tree> *partitioned_var_uses = (hash_set<tree> *) wi->info;
+
+  if (!wi->is_lhs && VAR_P (*node))
+    partitioned_var_uses->add (*node);
+
+  return NULL_TREE;
+}
+
+static void
+find_partitioned_var_uses (parallel_g *par, unsigned outer_mask,
+			   hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_partitioned_var_uses (par->inner, mask, partitioned_var_uses);
+  if (par->next)
+    find_partitioned_var_uses (par->next, outer_mask, partitioned_var_uses);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    walk_stmt_info wi;
+	    memset (&wi, 0, sizeof (wi));
+	    wi.info = (void *) partitioned_var_uses;
+	    walk_gimple_stmt (&gsi, NULL, find_partitioned_var_uses_1, &wi);
+	  }
+    }
+}
+
+/* Gang-private variables (typically placed in a GPU's shared memory) do not
+   need to be processed by the worker-propagation mechanism.  Populate the
+   GANGPRIVATE_VARS set with any such variables found in the current
+   function.  */
+
+static void
+find_gangprivate_vars (hash_set<tree> *gangprivate_vars)
+{
+  basic_block block;
+
+  FOR_EACH_BB_FN (block, cfun)
+    {
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+	      if (k == IFN_UNIQUE_OACC_PRIVATE)
+		{
+		  HOST_WIDE_INT level
+		    = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+		  if (level != GOMP_DIM_GANG)
+		    continue;
+		  for (unsigned i = 3; i < gimple_call_num_args (stmt); i++)
+		    {
+		      tree arg = gimple_call_arg (stmt, i);
+		      gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
+		      tree decl = TREE_OPERAND (arg, 0);
+		      gangprivate_vars->add (decl);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+
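+/* For blocks executing in worker-single mode, record in PROP_SET (keyed
+   by block index) any local variable that is used in a worker-partitioned
+   region and may be clobbered by a statement in the block, so that its
+   value can be propagated to the other workers.  */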
+static void
+find_local_vars_to_propagate (parallel_g *par, unsigned outer_mask,
+			      hash_set<tree> *partitioned_var_uses,
+			      hash_set<tree> *gangprivate_vars,
+			      vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_local_vars_to_propagate (par->inner, mask, partitioned_var_uses,
+				  gangprivate_vars, prop_set);
+  if (par->next)
+    find_local_vars_to_propagate (par->next, outer_mask, partitioned_var_uses,
+				  gangprivate_vars, prop_set);
+
+  if (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      tree var;
+	      unsigned i;
+
+	      FOR_EACH_LOCAL_DECL (cfun, i, var)
+		{
+		  if (!VAR_P (var)
+		      || is_global_var (var)
+		      || AGGREGATE_TYPE_P (TREE_TYPE (var))
+		      || !partitioned_var_uses->contains (var)
+		      || gangprivate_vars->contains (var))
+		    continue;
+
+		  if (stmt_may_clobber_ref_p (stmt, var))
+		    {
+		      if (dump_file)
+			{
+			  fprintf (dump_file, "bb %u: local variable may be "
+				   "clobbered in %s mode: ", block->index,
+				   mask_name (mask));
+			  print_generic_expr (dump_file, var, TDF_SLIM);
+			  fprintf (dump_file, "\n");
+			}
+
+		      if (!(*prop_set)[block->index])
+			(*prop_set)[block->index] = new propagation_set;
+
+		      propagation_set *ws_prop
+			= (*prop_set)[block->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Transform basic blocks FROM, TO (which may be the same block) into:
+   if (GOACC_single_start ())
+     BLOCK;
+   GOACC_barrier ();
+			      \  |  /
+			      +----+
+			      |    |        (new) predicate block
+			      +----+--
+   \  |  /   \  |  /	        |t    \
+   +----+    +----+	      +----+  |
+   |	|    |    |	===>  |    |  | f   (old) from block
+   +----+    +----+	      +----+  |
+     |       t/  \f	        |    /
+			      +----+/
+  (split  (split before       |    |        skip block
+  at end)   condition)	      +----+
+			      t/  \f
+*/
+
+static void
+worker_single_simple (basic_block from, basic_block to,
+		      hash_set<tree> *def_escapes_block)
+{
+  gimple *call, *cond;
+  tree lhs, decl;
+  basic_block skip_block;
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    {
+      gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_COND);
+      gsi_prev (&gsi);
+    }
+  edge e = split_block (to, gsi_stmt (gsi));
+  skip_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_START);
+  lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+  call = gimple_build_call (decl, 0);
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  cond = gimple_build_cond (EQ_EXPR, lhs,
+			    fold_convert_loc (UNKNOWN_LOCATION,
+					      TREE_TYPE (lhs),
+					      boolean_true_node),
+			    NULL_TREE, NULL_TREE);
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+  update_stmt (cond);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  edge ef = make_edge (from, skip_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  basic_block neutered = split_edge (ef);
+  gimple_stmt_iterator neut_gsi = gsi_last_bb (neutered);
+
+  for (gsi = gsi_start_bb (et->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      ssa_op_iter iter;
+      tree var;
+
+      FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+	{
+	  if (def_escapes_block->contains (var))
+	    {
+	      gphi *join_phi = create_phi_node (NULL_TREE, skip_block);
+	      create_new_def_for (var, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, var, e, UNKNOWN_LOCATION);
+
+	      tree neutered_def = copy_ssa_name (var, NULL);
+	      /* We really want "don't care" or some value representing
+		 undefined here, but optimizers will probably get rid of the
+		 zero-assignments anyway.  */
+	      gassign *zero = gimple_build_assign (neutered_def,
+				build_zero_cst (TREE_TYPE (neutered_def)));
+
+	      gsi_insert_after (&neut_gsi, zero, GSI_CONTINUE_LINKING);
+	      update_stmt (zero);
+
+	      add_phi_arg (join_phi, neutered_def, single_succ_edge (neutered),
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+	}
+    }
+
+  gsi = gsi_start_bb (skip_block);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gsi_insert_before (&gsi, acc_bar, GSI_SAME_STMT);
+  update_stmt (acc_bar);
+}
+
+/* This is a copied and renamed omp-low.c:omp_build_component_ref.  */
+
+static tree
+oacc_build_component_ref (tree obj, tree field)
+{
+  tree ret = build3 (COMPONENT_REF, TREE_TYPE (field), obj, field, NULL);
+  if (TREE_THIS_VOLATILE (field))
+    TREE_THIS_VOLATILE (ret) |= 1;
+  if (TREE_READONLY (field))
+    TREE_READONLY (ret) |= 1;
+  return ret;
+}
+
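+/* Build a reference to VAR's field in the propagation record of type
+   RECORD_TYPE, as seen by a receiving worker through RECEIVER_DECL.  */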
+static tree
+build_receiver_ref (tree record_type, tree var, tree receiver_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree x = build_simple_mem_ref (receiver_decl);
+  tree field = *fields->get (var);
+  TREE_THIS_NOTRAP (x) = 1;
+  x = oacc_build_component_ref (x, field);
+  return x;
+}
+
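+/* As above, but for the sending side, accessed through SENDER_DECL.  */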
+static tree
+build_sender_ref (tree record_type, tree var, tree sender_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree field = *fields->get (var);
+  return oacc_build_component_ref (sender_decl, field);
+}
+
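+/* qsort comparator: order SSA names before VAR_DECLs, SSA names by
+   version number and VAR_DECLs by UID, giving a deterministic order.  */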
+static int
+sort_by_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+
+  if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return SSA_NAME_VERSION (t1) - SSA_NAME_VERSION (t2);
+  else if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) != SSA_NAME)
+    return -1;
+  else if (TREE_CODE (t1) != SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return 1;
+  else
+    return DECL_UID (t1) - DECL_UID (t2);
+}
+
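+/* qsort comparator: order by decreasing type size, breaking ties with the
+   ordering above.  */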
+static int
+sort_by_size_then_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+  unsigned HOST_WIDE_INT s1 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t1)));
+  unsigned HOST_WIDE_INT s2 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t2)));
+  if (s1 != s2)
+    return s2 - s1;
+  else
+    return sort_by_ssa_version_or_uid (p1, p2);
+}
+
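+/* Neuter the worker-single block(s) FROM..TO, arranging for a single
+   worker to execute the original code and for the values it computes that
+   are needed elsewhere (DEF_ESCAPES_BLOCK, WORKER_PARTITIONED_USES) to be
+   broadcast to the other workers via a propagation record of RECORD_TYPE
+   in shared memory.  */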
+static void
+worker_single_copy (basic_block from, basic_block to,
+		    hash_set<tree> *def_escapes_block,
+		    hash_set<tree> *worker_partitioned_uses,
+		    tree record_type)
+{
+  /* If we only have virtual defs, we'll have no record type, but we still want
+     to emit single_copy_start and (particularly) single_copy_end to act as
+     a vdef source on the neutered edge representing memory writes on the
+     non-neutered edge.  */
+  if (!record_type)
+    record_type = char_type_node;
+
+  tree sender_decl
+    = targetm.goacc.create_propagation_record (record_type, true,
+					       ".oacc_worker_o");
+  tree receiver_decl
+    = targetm.goacc.create_propagation_record (record_type, false,
+					       ".oacc_worker_i");
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    gsi_prev (&gsi);
+  edge e = split_block (to, gsi_stmt (gsi));
+  basic_block barrier_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_START);
+
+  tree lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+
+  gimple *call = gimple_build_call (decl, 1,
+				    build_fold_addr_expr (sender_decl));
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  tree conv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+
+  gimple *conv = gimple_build_assign (conv_tmp,
+				      fold_convert (TREE_TYPE (receiver_decl),
+						    lhs));
+  update_stmt (conv);
+  gsi_insert_after (&start, conv, GSI_NEW_STMT);
+  gimple *asgn = gimple_build_assign (receiver_decl, conv_tmp);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  tree zero_ptr = build_int_cst (TREE_TYPE (receiver_decl), 0);
+
+  tree recv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+  asgn = gimple_build_assign (recv_tmp, receiver_decl);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  gimple *cond = gimple_build_cond (EQ_EXPR, recv_tmp, zero_ptr, NULL_TREE,
+				    NULL_TREE);
+  update_stmt (cond);
+
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  basic_block body = et->dest;
+
+  edge ef = make_edge (from, barrier_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gimple_stmt_iterator bar_gsi = gsi_start_bb (barrier_block);
+  gsi_insert_before (&bar_gsi, acc_bar, GSI_NEW_STMT);
+
+  cond = gimple_build_cond (NE_EXPR, recv_tmp, zero_ptr, NULL_TREE, NULL_TREE);
+  gsi_insert_after (&bar_gsi, cond, GSI_NEW_STMT);
+
+  edge et2 = split_block (barrier_block, cond);
+  et2->flags &= ~EDGE_FALLTHRU;
+  et2->flags |= EDGE_TRUE_VALUE;
+  et2->probability = profile_probability::unlikely ();
+
+  basic_block exit_block = et2->dest;
+
+  basic_block copyout_block = split_edge (et2);
+  edge ef2 = make_edge (barrier_block, exit_block, EDGE_FALSE_VALUE);
+  ef2->probability = et2->probability.invert ();
+
+  gimple_stmt_iterator copyout_gsi = gsi_start_bb (copyout_block);
+
+  edge copyout_to_exit = single_succ_edge (copyout_block);
+
+  gimple_seq sender_seq = NULL;
+
+  /* Make sure we iterate over definitions in a stable order.  */
+  auto_vec<tree> escape_vec (def_escapes_block->elements ());
+  for (hash_set<tree>::iterator it = def_escapes_block->begin ();
+       it != def_escapes_block->end (); ++it)
+    escape_vec.quick_push (*it);
+  escape_vec.qsort (sort_by_ssa_version_or_uid);
+
+  for (unsigned i = 0; i < escape_vec.length (); i++)
+    {
+      tree var = escape_vec[i];
+
+      if (TREE_CODE (var) == SSA_NAME && SSA_NAME_IS_VIRTUAL_OPERAND (var))
+	continue;
+
+      tree barrier_def = 0;
+
+      if (TREE_CODE (var) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	  if (gimple_nop_p (def_stmt))
+	    continue;
+
+	  /* The barrier phi takes one result from the actual work of the
+	     block we're neutering, and the other result is constant zero of
+	     the same type.  */
+
+	  gphi *barrier_phi = create_phi_node (NULL_TREE, barrier_block);
+	  barrier_def = create_new_def_for (var, barrier_phi,
+			  gimple_phi_result_ptr (barrier_phi));
+
+	  add_phi_arg (barrier_phi, var, e, UNKNOWN_LOCATION);
+	  add_phi_arg (barrier_phi, build_zero_cst (TREE_TYPE (var)), ef,
+		       UNKNOWN_LOCATION);
+
+	  update_stmt (barrier_phi);
+	}
+      else
+	gcc_assert (TREE_CODE (var) == VAR_DECL);
+
+      /* If we had no record type, we will have no fields map.  */
+      field_map_t **fields_p = field_map->get (record_type);
+      field_map_t *fields = fields_p ? *fields_p : NULL;
+
+      if (worker_partitioned_uses->contains (var)
+	  && fields
+	  && fields->get (var))
+	{
+	  tree neutered_def = make_ssa_name (TREE_TYPE (var));
+
+	  /* Receive definition from shared memory block.  */
+
+	  tree receiver_ref = build_receiver_ref (record_type, var,
+						  receiver_decl);
+	  gassign *recv = gimple_build_assign (neutered_def,
+					       receiver_ref);
+	  gsi_insert_after (&copyout_gsi, recv, GSI_CONTINUE_LINKING);
+	  update_stmt (recv);
+
+	  if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      /* If it's a VAR_DECL, we only copied to an SSA temporary.  Copy
+		 to the final location now.  */
+	      gassign *asgn = gimple_build_assign (var, neutered_def);
+	      gsi_insert_after (&copyout_gsi, asgn, GSI_CONTINUE_LINKING);
+	      update_stmt (asgn);
+	    }
+	  else
+	    {
+	      /* If it's an SSA name, create a new phi at the join node to
+		 represent either the output from the active worker (the
+		 barrier) or the inactive workers (the copyout block).  */
+	      gphi *join_phi = create_phi_node (NULL_TREE, exit_block);
+	      create_new_def_for (barrier_def, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, barrier_def, ef2, UNKNOWN_LOCATION);
+	      add_phi_arg (join_phi, neutered_def, copyout_to_exit,
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+
+	  /* Send definition to shared memory block.  */
+
+	  tree sender_ref = build_sender_ref (record_type, var, sender_decl);
+
+	  if (TREE_CODE (var) == SSA_NAME)
+	    {
+	      gassign *send = gimple_build_assign (sender_ref, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      tree tmp = make_ssa_name (TREE_TYPE (var));
+	      gassign *send = gimple_build_assign (tmp, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	      send = gimple_build_assign (sender_ref, tmp);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+    }
+
+  /* It's possible for the ET->DEST block (the work done by the active thread)
+     to finish with a control-flow insn, e.g. a UNIQUE function call.  Split
+     the block and add SENDER_SEQ in the latter part to avoid having control
+     flow in the middle of a BB.  */
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_END);
+  call = gimple_build_call (decl, 1, build_fold_addr_expr (sender_decl));
+  gimple_seq_add_stmt (&sender_seq, call);
+
+  gsi = gsi_last_bb (body);
+  gimple *last = gsi_stmt (gsi);
+  basic_block sender_block = split_block (body, last)->dest;
+  gsi = gsi_last_bb (sender_block);
+  gsi_insert_seq_after (&gsi, sender_seq, GSI_CONTINUE_LINKING);
+}
+
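+/* Neuter each worker-single block of PAR, using worker_single_copy when
+   values must be broadcast to the other workers and worker_single_simple
+   otherwise, and place barriers around calls to OpenACC routines made in
+   worker-single mode.  Recurse into inner and sibling parallel regions.  */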
+static void
+neuter_worker_single (parallel_g *par, unsigned outer_mask,
+		      bitmap worker_single, bitmap vector_single,
+		      vec<propagation_set *> *prop_set,
+		      hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	{
+	  bool has_defs = false;
+	  hash_set<tree> def_escapes_block;
+	  hash_set<tree> worker_partitioned_uses;
+	  unsigned j;
+	  tree var;
+
+	  FOR_EACH_SSA_NAME (j, var, cfun)
+	    {
+	      if (SSA_NAME_IS_VIRTUAL_OPERAND (var))
+		{
+		  has_defs = true;
+		  continue;
+		}
+
+	      gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	      if (gimple_nop_p (def_stmt))
+		continue;
+
+	      if (gimple_bb (def_stmt)->index != block->index)
+		continue;
+
+	      gimple *use_stmt;
+	      imm_use_iterator use_iter;
+	      bool uses_outside_block = false;
+	      bool worker_partitioned_use = false;
+
+	      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, var)
+		{
+		  int blocknum = gimple_bb (use_stmt)->index;
+
+		  /* Don't propagate SSA names that are only used in the
+		     current block, unless the usage is in a phi node: that
+		     means the name left the block, then came back in at the
+		     top.  */
+		  if (blocknum != block->index
+		      || gimple_code (use_stmt) == GIMPLE_PHI)
+		    uses_outside_block = true;
+		  if (!bitmap_bit_p (worker_single, blocknum))
+		    worker_partitioned_use = true;
+		}
+
+	      if (uses_outside_block)
+		def_escapes_block.add (var);
+
+	      if (worker_partitioned_use)
+		{
+		  worker_partitioned_uses.add (var);
+		  has_defs = true;
+		}
+	    }
+
+	  propagation_set *ws_prop = (*prop_set)[block->index];
+
+	  if (ws_prop)
+	    {
+	      for (propagation_set::iterator it = ws_prop->begin ();
+		   it != ws_prop->end ();
+		   ++it)
+		{
+		  tree var = *it;
+		  if (TREE_CODE (var) == VAR_DECL)
+		    {
+		      def_escapes_block.add (var);
+		      if (partitioned_var_uses->contains (var))
+			{
+			  worker_partitioned_uses.add (var);
+			  has_defs = true;
+			}
+		    }
+		}
+
+	      delete ws_prop;
+	      (*prop_set)[block->index] = 0;
+	    }
+
+	  tree record_type = (tree) block->aux;
+
+	  if (has_defs)
+	    worker_single_copy (block, block, &def_escapes_block,
+				&worker_partitioned_uses, record_type);
+	  else
+	    worker_single_simple (block, block, &def_escapes_block);
+	}
+    }
+
+  if ((outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi);
+	     gsi_next (&gsi))
+	  {
+	    gimple *stmt = gsi_stmt (gsi);
+
+	    if (gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt)
+		&& !omp_sese_active_worker_call (as_a <gcall *> (stmt)))
+	      {
+		/* If we have an OpenACC routine call in worker-single mode,
+		   place barriers before and afterwards to prevent
+		   clobbering re-used shared memory regions (as are used
+		   for AMDGCN at present, for example).  */
+		tree decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+		gsi_insert_before (&gsi, gimple_build_call (decl, 0),
+				   GSI_SAME_STMT);
+		gsi_insert_after (&gsi, gimple_build_call (decl, 0),
+				  GSI_NEW_STMT);
+	      }
+	  }
+    }
+
+  if (par->inner)
+    neuter_worker_single (par->inner, mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+  if (par->next)
+    neuter_worker_single (par->next, outer_mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+}
+
+
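+/* Main entry point for middle-end worker neutering/broadcasting: split
+   blocks, discover the parallel region structure, determine which SSA
+   names and local variables need propagating, build per-block propagation
+   record types, and neuter worker-single code accordingly.  */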
+void
+oacc_do_neutering (void)
+{
+  bb_stmt_map_t bb_stmt_map;
+  auto_bitmap worker_single, vector_single;
+
+  omp_sese_split_blocks (&bb_stmt_map);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter splitting:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+
+  unsigned mask = 0;
+
+  /* If this is a routine, calculate MASK as if the outer levels are already
+     partitioned.  */
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  if (attr)
+    {
+      tree dims = TREE_VALUE (attr);
+      unsigned ix;
+      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	{
+	  tree allowed = TREE_PURPOSE (dims);
+	  if (allowed && integer_zerop (allowed))
+	    mask |= GOMP_DIM_MASK (ix);
+	}
+    }
+
+  parallel_g *par = omp_sese_discover_pars (&bb_stmt_map);
+  populate_single_mode_bitmaps (par, worker_single, vector_single, mask, 0);
+
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    bb->aux = NULL;
+
+  field_map = record_field_map_t::create_ggc (40);
+
+  vec<propagation_set *> prop_set;
+  prop_set.create (last_basic_block_for_fn (cfun));
+
+  for (int i = 0; i < last_basic_block_for_fn (cfun); i++)
+    prop_set.quick_push (0);
+
+  find_ssa_names_to_propagate (par, mask, worker_single, vector_single,
+			       &prop_set);
+
+  hash_set<tree> partitioned_var_uses;
+  hash_set<tree> gangprivate_vars;
+
+  find_gangprivate_vars (&gangprivate_vars);
+  find_partitioned_var_uses (par, mask, &partitioned_var_uses);
+  find_local_vars_to_propagate (par, mask, &partitioned_var_uses,
+				&gangprivate_vars, &prop_set);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      propagation_set *ws_prop = prop_set[bb->index];
+      if (ws_prop)
+	{
+	  tree record_type = lang_hooks.types.make_type (RECORD_TYPE);
+	  tree name = create_tmp_var_name (".oacc_ws_data_s");
+	  name = build_decl (UNKNOWN_LOCATION, TYPE_DECL, name, record_type);
+	  DECL_ARTIFICIAL (name) = 1;
+	  DECL_NAMELESS (name) = 1;
+	  TYPE_NAME (record_type) = name;
+	  TYPE_ARTIFICIAL (record_type) = 1;
+
+	  auto_vec<tree> field_vec (ws_prop->elements ());
+	  for (hash_set<tree>::iterator it = ws_prop->begin ();
+	       it != ws_prop->end (); ++it)
+	    field_vec.quick_push (*it);
+
+	  field_vec.qsort (sort_by_size_then_ssa_version_or_uid);
+
+	  field_map->put (record_type, field_map_t::create_ggc (17));
+
+	  /* Insert var fields in reverse order, so the last inserted element
+	     is the first in the structure.  */
+	  for (int i = field_vec.length () - 1; i >= 0; i--)
+	    install_var_field (field_vec[i], record_type);
+
+	  layout_type (record_type);
+
+	  bb->aux = (tree) record_type;
+	}
+    }
+
+  neuter_worker_single (par, mask, worker_single, vector_single, &prop_set,
+			&partitioned_var_uses);
+
+  prop_set.release ();
+
+  /* This doesn't seem to make a difference.  */
+  loops_state_clear (LOOP_CLOSED_SSA);
+
+  /* Neutering worker-single blocks will invalidate dominance info.
+     It may be possible to incrementally update just the affected blocks, but
+     obliterate everything for now.  */
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter neutering:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+}
+
+/* Analyse a group of BBs within a partitioned region and create N
+   Single-Entry-Single-Exit regions.  Some of those regions will be
+   trivial ones consisting of a single BB.  The blocks of a
+   partitioned region might form a set of disjoint graphs -- because
+   the region encloses a differently partitioned sub region.
+
+   We use the linear time algorithm described in 'Finding Regions Fast:
+   Single Entry Single Exit and Control Regions in Linear Time'
+   Johnson, Pearson & Pingali.  That algorithm deals with complete
+   CFGs, where a back edge is inserted from END to START, and thus the
+   problem becomes one of finding equivalent loops.
+
+   In this case we have a partial CFG.  We complete it by redirecting
+   any incoming edge to the graph to be from an arbitrary external BB,
+   and similarly redirecting any outgoing edge to be to that BB.
+   Thus we end up with a closed graph.
+
+   The algorithm works by building a spanning tree of an undirected
+   graph and keeping track of back edges from nodes further from the
+   root in the tree to nodes nearer to the root in the tree.  In the
+   description below, the root is up and the tree grows downwards.
+
+   We avoid having to deal with degenerate back-edges to the same
+   block, by splitting each BB into 3 -- one for input edges, one for
+   the node itself and one for the output edges.  Such back edges are
+   referred to as 'Brackets'.  Cycle equivalent nodes will have the
+   same set of brackets.
+
+   Determining bracket equivalency is done by maintaining a list of
+   brackets in such a manner that the list length and final bracket
+   uniquely identify the set.
+
+   We use coloring to mark all BBs with cycle equivalency with the
+   same color.  This is the output of the 'Finding Regions Fast'
+   algorithm.  Notice it doesn't actually find the set of nodes within
+   a particular region, just unordered sets of nodes that are the
+   entries and exits of SESE regions.
+
+   After determining cycle equivalency, we need to find the minimal
+   set of SESE regions.  Do this with a DFS coloring walk of the
+   complete graph.  We're either 'looking' or 'coloring'.  When
+   looking, and we're in the subgraph, we start coloring the color of
+   the current node, and remember that node as the start of the
+   current color's SESE region.  Every time we go to a new node, we
+   decrement the count of nodes with that color.  If it reaches zero,
+   we remember that node as the end of the current color's SESE region
+   and return to 'looking'.  Otherwise we color the node the current
+   color.
+
+   This way we end up with coloring the inside of non-trivial SESE
+   regions with the color of that region.  */
+
+/* A node in the undirected CFG.  The discriminator SECOND indicates just
+   above or just below the BB indicated by FIRST.  */
+typedef std::pair<basic_block, int> pseudo_node_t;
+
+/* A bracket indicates an edge towards the root of the spanning tree of the
+   undirected graph.  Each bracket has a color, determined
+   from the current set of brackets.  */
+struct bracket
+{
+  pseudo_node_t back; /* Back target */
+
+  /* Current color and size of set.  */
+  unsigned color;
+  unsigned size;
+
+  bracket (pseudo_node_t back_)
+  : back (back_), color (~0u), size (~0u)
+  {
+  }
+
+  unsigned get_color (auto_vec<unsigned> &color_counts, unsigned length)
+  {
+    if (length != size)
+      {
+	size = length;
+	color = color_counts.length ();
+	color_counts.quick_push (0);
+      }
+    color_counts[color]++;
+    return color;
+  }
+};
+
+typedef auto_vec<bracket> bracket_vec_t;
+
+/* Basic block info for finding SESE regions.    */
+
+struct bb_sese
+{
+  int node;  /* Node number in spanning tree.  */
+  int parent; /* Parent node number.  */
+
+  /* The algorithm splits each node A into Ai, A', Ao. The incoming
+     edges arrive at pseudo-node Ai and the outgoing edges leave at
+     pseudo-node Ao.  We have to remember which way we arrived at a
+     particular node when generating the spanning tree.  dir > 0 means
+     we arrived at Ai, dir < 0 means we arrived at Ao.  */
+  int dir;
+
+  /* Lowest numbered pseudo-node reached via a backedge from this
+     node, or any descendant.  */
+  pseudo_node_t high;
+
+  int color;  /* Cycle-equivalence color  */
+
+  /* Stack of brackets for this node.  */
+  bracket_vec_t brackets;
+
+  bb_sese (unsigned node_, unsigned p, int dir_)
+  :node (node_), parent (p), dir (dir_)
+  {
+  }
+  ~bb_sese ();
+
+  /* Push a bracket ending at BACK.  */
+  void push (const pseudo_node_t &back)
+  {
+    if (dump_file)
+      fprintf (dump_file, "Pushing backedge %d:%+d\n",
+	       back.first ? back.first->index : 0, back.second);
+    brackets.safe_push (bracket (back));
+  }
+
+  void append (bb_sese *child);
+  void remove (const pseudo_node_t &);
+
+  /* Set node's color.  */
+  void set_color (auto_vec<unsigned> &color_counts)
+  {
+    color = brackets.last ().get_color (color_counts, brackets.length ());
+  }
+};
+
+bb_sese::~bb_sese ()
+{
+}
+
+/* Destructively append CHILD's brackets.  */
+
+void
+bb_sese::append (bb_sese *child)
+{
+  if (int len = child->brackets.length ())
+    {
+      int ix;
+
+      if (dump_file)
+	{
+	  for (ix = 0; ix < len; ix++)
+	    {
+	      const pseudo_node_t &pseudo = child->brackets[ix].back;
+	      fprintf (dump_file, "Appending (%d)'s backedge %d:%+d\n",
+		       child->node, pseudo.first ? pseudo.first->index : 0,
+		       pseudo.second);
+	    }
+	}
+      if (!brackets.length ())
+	std::swap (brackets, child->brackets);
+      else
+	{
+	  brackets.reserve (len);
+	  for (ix = 0; ix < len; ix++)
+	    brackets.quick_push (child->brackets[ix]);
+	}
+    }
+}
+
+/* Remove brackets that terminate at PSEUDO.  */
+
+void
+bb_sese::remove (const pseudo_node_t &pseudo)
+{
+  unsigned removed = 0;
+  int len = brackets.length ();
+
+  for (int ix = 0; ix < len; ix++)
+    {
+      if (brackets[ix].back == pseudo)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Removing backedge %d:%+d\n",
+		     pseudo.first ? pseudo.first->index : 0, pseudo.second);
+	  removed++;
+	}
+      else if (removed)
+	brackets[ix-removed] = brackets[ix];
+    }
+  while (removed--)
+    brackets.pop ();
+}
+
+/* Accessors for BB's aux pointer.  */
+#define BB_SET_SESE(B, S) ((B)->aux = (S))
+#define BB_GET_SESE(B) ((bb_sese *)(B)->aux)
+
+/* DFS walk creating SESE data structures.  Only cover nodes with
+   BB_VISITED set.  Append discovered blocks to LIST.  We number in
+   increments of 3 so that the above and below pseudo nodes can be
+   implicitly numbered too.  */
+
+static int
+omp_sese_number (int n, int p, int dir, basic_block b,
+		   auto_vec<basic_block> *list)
+{
+  if (BB_GET_SESE (b))
+    return n;
+
+  if (dump_file)
+    fprintf (dump_file, "Block %d(%d), parent (%d), orientation %+d\n",
+	     b->index, n, p, dir);
+
+  BB_SET_SESE (b, new bb_sese (n, p, dir));
+  p = n;
+
+  n += 3;
+  list->quick_push (b);
+
+  /* First walk the nodes on the 'other side' of this node, then walk
+     the nodes on the same side.  */
+  for (unsigned ix = 2; ix; ix--)
+    {
+      vec<edge, va_gc> *edges = dir > 0 ? b->succs : b->preds;
+      size_t offset = (dir > 0 ? offsetof (edge_def, dest)
+		       : offsetof (edge_def, src));
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, edges)
+	{
+	  basic_block target = *(basic_block *)((char *)e + offset);
+
+	  if (target->flags & BB_VISITED)
+	    n = omp_sese_number (n, p, dir, target, list);
+	}
+      dir = -dir;
+    }
+  return n;
+}
+
+/* Process pseudo node above (DIR < 0) or below (DIR > 0) ME.
+   EDGES are the outgoing edges and OFFSET is the offset to the src
+   or dst block on the edges.   */
+
+static void
+omp_sese_pseudo (basic_block me, bb_sese *sese, int depth, int dir,
+		   vec<edge, va_gc> *edges, size_t offset)
+{
+  edge e;
+  edge_iterator ei;
+  int hi_back = depth;
+  pseudo_node_t node_back (0, depth);
+  int hi_child = depth;
+  pseudo_node_t node_child (0, depth);
+  basic_block child = NULL;
+  unsigned num_children = 0;
+  int usd = -dir * sese->dir;
+
+  if (dump_file)
+    fprintf (dump_file, "\nProcessing %d(%d) %+d\n",
+	     me->index, sese->node, dir);
+
+  if (dir < 0)
+    {
+      /* This is the above pseudo-child.  It has the BB itself as an
+	 additional child node.  */
+      node_child = sese->high;
+      hi_child = node_child.second;
+      if (node_child.first)
+	hi_child += BB_GET_SESE (node_child.first)->node;
+      num_children++;
+    }
+
+  /* Examine each edge.
+     - if it is a child (a) append its bracket list and (b) record
+       whether it is the child with the highest reaching bracket.
+     - if it is an edge to ancestor, record whether it's the highest
+       reaching backlink.  */
+  FOR_EACH_EDGE (e, ei, edges)
+    {
+      basic_block target = *(basic_block *)((char *)e + offset);
+
+      if (bb_sese *t_sese = BB_GET_SESE (target))
+	{
+	  if (t_sese->parent == sese->node && !(t_sese->dir + usd))
+	    {
+	      /* Child node.  Append its bracket list. */
+	      num_children++;
+	      sese->append (t_sese);
+
+	      /* Compare its hi value.  */
+	      int t_hi = t_sese->high.second;
+
+	      if (basic_block child_hi_block = t_sese->high.first)
+		t_hi += BB_GET_SESE (child_hi_block)->node;
+
+	      if (hi_child > t_hi)
+		{
+		  hi_child = t_hi;
+		  node_child = t_sese->high;
+		  child = target;
+		}
+	    }
+	  else if (t_sese->node < sese->node + dir
+		   && !(dir < 0 && sese->parent == t_sese->node))
+	    {
+	      /* Non-parental ancestor node -- a backlink.  */
+	      int d = usd * t_sese->dir;
+	      int back = t_sese->node + d;
+
+	      if (hi_back > back)
+		{
+		  hi_back = back;
+		  node_back = pseudo_node_t (target, d);
+		}
+	    }
+	}
+      else
+	{ /* Fallen off graph, backlink to entry node.  */
+	  hi_back = 0;
+	  node_back = pseudo_node_t (0, 0);
+	}
+    }
+
+  /* Remove any brackets that terminate at this pseudo node.  */
+  sese->remove (pseudo_node_t (me, dir));
+
+  /* Now push any backlinks from this pseudo node.  */
+  FOR_EACH_EDGE (e, ei, edges)
+    {
+      basic_block target = *(basic_block *)((char *)e + offset);
+      if (bb_sese *t_sese = BB_GET_SESE (target))
+	{
+	  if (t_sese->node < sese->node + dir
+	      && !(dir < 0 && sese->parent == t_sese->node))
+	    /* Non-parental ancestor node - backedge from me.  */
+	    sese->push (pseudo_node_t (target, usd * t_sese->dir));
+	}
+      else
+	{
+	  /* back edge to entry node */
+	  sese->push (pseudo_node_t (0, 0));
+	}
+    }
+
+  /* If this node leads directly or indirectly to a no-return region of
+     the graph, then fake a backedge to entry node.  */
+  if (!sese->brackets.length () || !edges || !edges->length ())
+    {
+      hi_back = 0;
+      node_back = pseudo_node_t (0, 0);
+      sese->push (node_back);
+    }
+
+  /* Record the highest reaching backedge from us or a descendant.  */
+  sese->high = hi_back < hi_child ? node_back : node_child;
+
+  if (num_children > 1)
+    {
+      /* There is more than one child -- this is a Y shaped piece of
+	 spanning tree.  We have to insert a fake backedge from this
+	 node to the highest ancestor reached by not-the-highest
+	 reaching child.  Note that there may be multiple children
+	 with backedges to the same highest node.  That's ok and we
+	 insert the edge to that highest node.  */
+      hi_child = depth;
+      if (dir < 0 && child)
+	{
+	  node_child = sese->high;
+	  hi_child = node_child.second;
+	  if (node_child.first)
+	    hi_child += BB_GET_SESE (node_child.first)->node;
+	}
+
+      FOR_EACH_EDGE (e, ei, edges)
+	{
+	  basic_block target = *(basic_block *)((char *)e + offset);
+
+	  if (target == child)
+	    /* Ignore the highest child. */
+	    continue;
+
+	  bb_sese *t_sese = BB_GET_SESE (target);
+	  if (!t_sese)
+	    continue;
+	  if (t_sese->parent != sese->node)
+	    /* Not a child. */
+	    continue;
+
+	  /* Compare its hi value.  */
+	  int t_hi = t_sese->high.second;
+
+	  if (basic_block child_hi_block = t_sese->high.first)
+	    t_hi += BB_GET_SESE (child_hi_block)->node;
+
+	  if (hi_child > t_hi)
+	    {
+	      hi_child = t_hi;
+	      node_child = t_sese->high;
+	    }
+	}
+
+      sese->push (node_child);
+    }
+}
+
+
+/* DFS walk of BB graph.  Color node BLOCK according to COLORING then
+   proceed to successors.  Set SESE entry and exit nodes of
+   REGIONS.  */
+
+static void
+omp_sese_color (auto_vec<unsigned> &color_counts, bb_pair_vec_t &regions,
+		basic_block block, int coloring)
+{
+  bb_sese *sese = BB_GET_SESE (block);
+
+  if (block->flags & BB_VISITED)
+    {
+      /* If we've already encountered this block, either we must not
+	 be coloring, or it must have been colored the current color.  */
+      gcc_assert (coloring < 0 || (sese && coloring == sese->color));
+      return;
+    }
+
+  block->flags |= BB_VISITED;
+
+  if (sese)
+    {
+      if (coloring < 0)
+	{
+	  /* Start coloring a region.  */
+	  regions[sese->color].first = block;
+	  coloring = sese->color;
+	}
+
+      if (!--color_counts[sese->color] && sese->color == coloring)
+	{
+	  /* Found final block of SESE region.  */
+	  regions[sese->color].second = block;
+	  coloring = -1;
+	}
+      else
+	/* Color the node, so we can assert on revisiting the node
+	   that the graph is indeed SESE.  */
+	sese->color = coloring;
+    }
+  else
+    /* Fallen off the subgraph, we cannot be coloring.  */
+    gcc_assert (coloring < 0);
+
+  /* Walk each successor block.  */
+  if (block->succs && block->succs->length ())
+    {
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, block->succs)
+	omp_sese_color (color_counts, regions, e->dest, coloring);
+    }
+  else
+    gcc_assert (coloring < 0);
+}
+
+/* Find minimal set of SESE regions covering BLOCKS.  REGIONS might
+   end up with NULL entries in it.  */
+
+void
+omp_find_sese (auto_vec<basic_block> &blocks, bb_pair_vec_t &regions)
+{
+  basic_block block;
+  int ix;
+
+  /* First clear each BB of the whole function.  */
+  FOR_EACH_BB_FN (block, cfun)
+    {
+      block->flags &= ~BB_VISITED;
+      BB_SET_SESE (block, 0);
+    }
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+  BB_SET_SESE (block, 0);
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+  BB_SET_SESE (block, 0);
+
+  /* Mark blocks in the function that are in this graph.  */
+  for (ix = 0; blocks.iterate (ix, &block); ix++)
+    block->flags |= BB_VISITED;
+
+  /* Counts of nodes assigned to each color.  There cannot be more
+     colors than blocks (and hopefully there will be fewer).  */
+  auto_vec<unsigned> color_counts;
+  color_counts.reserve (blocks.length ());
+
+  /* Worklist of nodes in the spanning tree.  Again, there cannot be
+     more nodes in the tree than blocks (there will be fewer if the
+     CFG of blocks is disjoint).  */
+  auto_vec<basic_block> spanlist;
+  spanlist.reserve (blocks.length ());
+
+  /* Make sure every block has its cycle class determined.  */
+  for (ix = 0; blocks.iterate (ix, &block); ix++)
+    {
+      if (BB_GET_SESE (block))
+	/* We already met this block in an earlier graph solve.  */
+	continue;
+
+      if (dump_file)
+	fprintf (dump_file, "Searching graph starting at %d\n", block->index);
+
+      /* Number the nodes reachable from block in initial DFS order.  */
+      int depth = omp_sese_number (2, 0, +1, block, &spanlist);
+
+      /* Now walk in reverse DFS order to find cycle equivalents.  */
+      while (spanlist.length ())
+	{
+	  block = spanlist.pop ();
+	  bb_sese *sese = BB_GET_SESE (block);
+
+	  /* Do the pseudo node below.  */
+	  omp_sese_pseudo (block, sese, depth, +1,
+			     sese->dir > 0 ? block->succs : block->preds,
+			     (sese->dir > 0 ? offsetof (edge_def, dest)
+			      : offsetof (edge_def, src)));
+	  sese->set_color (color_counts);
+	  /* Do the pseudo node above.  */
+	  omp_sese_pseudo (block, sese, depth, -1,
+			     sese->dir < 0 ? block->succs : block->preds,
+			     (sese->dir < 0 ? offsetof (edge_def, dest)
+			      : offsetof (edge_def, src)));
+	}
+      if (dump_file)
+	fprintf (dump_file, "\n");
+    }
+
+  if (dump_file)
+    {
+      unsigned count;
+      const char *comma = "";
+
+      fprintf (dump_file, "Found %d cycle equivalents\n",
+	       color_counts.length ());
+      for (ix = 0; color_counts.iterate (ix, &count); ix++)
+	{
+	  fprintf (dump_file, "%s%d[%d]={", comma, ix, count);
+
+	  comma = "";
+	  for (unsigned jx = 0; blocks.iterate (jx, &block); jx++)
+	    if (BB_GET_SESE (block)->color == ix)
+	      {
+		block->flags |= BB_VISITED;
+		fprintf (dump_file, "%s%d", comma, block->index);
+		comma=",";
+	      }
+	  fprintf (dump_file, "}");
+	  comma = ", ";
+	}
+      fprintf (dump_file, "\n");
+   }
+
+  /* Now we've colored every block in the subgraph.  We now need to
+     determine the minimal set of SESE regions that cover that
+     subgraph.  Do this with a DFS walk of the complete function.
+     During the walk we're either 'looking' or 'coloring'.  When we
+     reach the last node of a particular color, we stop coloring and
+     return to looking.  */
+
+  /* There cannot be more SESE regions than colors.  */
+  regions.reserve (color_counts.length ());
+  for (ix = color_counts.length (); ix--;)
+    regions.quick_push (bb_pair_t (0, 0));
+
+  for (ix = 0; blocks.iterate (ix, &block); ix++)
+    block->flags &= ~BB_VISITED;
+
+  omp_sese_color (color_counts, regions, ENTRY_BLOCK_PTR_FOR_FN (cfun), -1);
+
+  if (dump_file)
+    {
+      const char *comma = "";
+      int len = regions.length ();
+
+      fprintf (dump_file, "SESE regions:");
+      for (ix = 0; ix != len; ix++)
+	{
+	  basic_block from = regions[ix].first;
+	  basic_block to = regions[ix].second;
+
+	  if (from)
+	    {
+	      fprintf (dump_file, "%s %d{%d", comma, ix, from->index);
+	      if (to != from)
+		fprintf (dump_file, "->%d", to->index);
+
+	      int color = BB_GET_SESE (from)->color;
+
+	      /* Print the blocks within the region (excluding ends).  */
+	      FOR_EACH_BB_FN (block, cfun)
+		{
+		  bb_sese *sese = BB_GET_SESE (block);
+
+		  if (sese && sese->color == color
+		      && block != from && block != to)
+		    fprintf (dump_file, ".%d", block->index);
+		}
+	      fprintf (dump_file, "}");
+	    }
+	  comma = ",";
+	}
+      fprintf (dump_file, "\n\n");
+    }
+
+  for (ix = 0; blocks.iterate (ix, &block); ix++)
+    delete BB_GET_SESE (block);
+}
+
+#undef BB_SET_SESE
+#undef BB_GET_SESE
diff --git a/gcc/omp-sese.h b/gcc/omp-sese.h
new file mode 100644
index 00000000000..0b82a2417eb
--- /dev/null
+++ b/gcc/omp-sese.h
@@ -0,0 +1,32 @@
+/* Find single-entry, single-exit regions for OpenACC.
+
+   Copyright (C) 2005-2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_OMP_SESE_H
+#define GCC_OMP_SESE_H
+
+/* A pair of BBs.  We use this to represent SESE regions.  */
+typedef std::pair<basic_block, basic_block> bb_pair_t;
+typedef auto_vec<bb_pair_t> bb_pair_vec_t;
+
+extern void omp_find_sese (auto_vec<basic_block> &blocks,
+			   bb_pair_vec_t &regions);
+extern void oacc_do_neutering (void);
+
+#endif
diff --git a/gcc/passes.def b/gcc/passes.def
index 798a391bd35..0d71f6d4a70 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -177,6 +177,8 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_oacc_loop_designation);
+  NEXT_PASS (pass_oacc_gimple_workers);
   NEXT_PASS (pass_oacc_device_lower);
   NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_omp_target_link);
diff --git a/gcc/target.def b/gcc/target.def
index 2cc5d5c46b3..179218271ff 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1751,6 +1751,19 @@ parallelism level.",
 void, (tree var, int),
 NULL)
 
+DEFHOOK
+(create_propagation_record,
+"Create a record used to propagate local-variable state from an active\n\
+worker to other workers.  A possible implementation might adjust the type\n\
+of REC to place the new variable in shared GPU memory.",
+tree, (tree rec, bool sender, const char *name),
+default_goacc_create_propagation_record)
+
+DEFHOOKPOD
+(worker_partitioning,
+"Use gimple transformation for worker neutering/broadcasting.",
+bool, false)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index d4c3563e825..213578088dc 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -128,6 +128,7 @@ extern bool default_goacc_validate_dims (tree, int [], int, unsigned);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern tree default_goacc_create_propagation_record (tree, bool, const char *);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index a987661530e..2df80e8dbae 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -415,6 +415,8 @@ extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_loop_designation (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_gimple_workers (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 05/13] AMD GCN adjustments for middle-end worker partitioning
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (3 preceding siblings ...)
  2019-11-15 21:43 ` [PATCH 04/13] OpenACC middle-end worker-partitioning support Julian Brown
@ 2019-11-15 21:43 ` Julian Brown
  2019-11-15 21:44 ` [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN Julian Brown
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:43 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch provides AMD GCN-specific parts supporting middle-end
worker partitioning. The adjust_propagation_record hook is now called
create_propagation_record. Several builtins are redefined to take an
argument in a special address space (corresponding to GPU shared memory).

Tested alongside other patches in this series. OK?

Thanks,

Julian

ChangeLog

	gcc/
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_propagation_record): Rename
	prototype to...
	(gcn_goacc_create_propagation_record): This.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_propagation_record): Rename
	function to...
	(gcn_goacc_create_propagation_record): This.  Adjust comment.
	* config/gcn/gcn.c (gcn_init_builtins): Override decls for
	BUILT_IN_GOACC_SINGLE_START, BUILT_IN_GOACC_SINGLE_COPY_START,
	BUILT_IN_GOACC_SINGLE_COPY_END and BUILT_IN_GOACC_BARRIER.
	(gcn_fork_join): Remove inaccurate comment.
	(TARGET_GOACC_ADJUST_PROPAGATION_RECORD): Rename to...
	(TARGET_GOACC_CREATE_PROPAGATION_RECORD): This.
---
 gcc/config/gcn/gcn-protos.h |  2 +-
 gcc/config/gcn/gcn-tree.c   |  6 +++---
 gcc/config/gcn/gcn.c        | 11 +++--------
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
index 714d51189d9..e33c0598fee 100644
--- a/gcc/config/gcn/gcn-protos.h
+++ b/gcc/config/gcn/gcn-protos.h
@@ -37,7 +37,7 @@ extern rtx gcn_full_exec ();
 extern rtx gcn_full_exec_reg ();
 extern rtx gcn_gen_undef (machine_mode);
 extern bool gcn_global_address_p (rtx);
-extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+extern tree gcn_goacc_create_propagation_record (tree record_type, bool sender,
 						 const char *name);
 extern void gcn_goacc_adjust_private_decl (tree var, int level);
 extern void gcn_goacc_reduction (gcall *call);
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
index aa56e236134..538034f7372 100644
--- a/gcc/config/gcn/gcn-tree.c
+++ b/gcc/config/gcn/gcn-tree.c
@@ -667,12 +667,12 @@ gcn_goacc_reduction (gcall *call)
     }
 }
 
-/* Implement TARGET_GOACC_ADJUST_PROPAGATION_RECORD.
+/* Implement TARGET_GOACC_CREATE_PROPAGATION_RECORD.
  
-   Tweak (worker) propagation record, e.g. to put it in shared memory.  */
+   Create (worker) propagation record in shared memory.  */
 
 tree
-gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+gcn_goacc_create_propagation_record (tree record_type, bool sender,
 				     const char *name)
 {
   tree type = record_type;
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index cf2f30413ae..2f758ef3ddc 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -3494,8 +3494,6 @@ gcn_init_builtins (void)
       TREE_NOTHROW (gcn_builtin_decls[i]) = 1;
     }
 
-/* FIXME: remove the ifdef once OpenACC support is merged upstream.  */
-#ifdef BUILT_IN_GOACC_SINGLE_START
   /* These builtins need to take/return an LDS pointer: override the generic
      versions here.  */
 
@@ -3512,7 +3510,6 @@ gcn_init_builtins (void)
 
   set_builtin_decl (BUILT_IN_GOACC_BARRIER,
 		    gcn_builtin_decls[GCN_BUILTIN_ACC_BARRIER], false);
-#endif
 }
 
 /* Expand the CMP_SWAP GCN builtins.  We have our own versions that do
@@ -4798,8 +4795,6 @@ static bool
 gcn_fork_join (gcall *ARG_UNUSED (call), const int *ARG_UNUSED (dims),
 	       bool ARG_UNUSED (is_fork))
 {
-  /* GCN does not use the fork/join concept invented for NVPTX.
-     Instead we use standard autovectorization.  */
   return false;
 }
 
@@ -6064,9 +6059,9 @@ print_operand (FILE *file, rtx x, int code)
 #define TARGET_GIMPLIFY_VA_ARG_EXPR gcn_gimplify_va_arg_expr
 #undef TARGET_OMP_DEVICE_KIND_ARCH_ISA
 #define TARGET_OMP_DEVICE_KIND_ARCH_ISA gcn_omp_device_kind_arch_isa
-#undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
-#define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
-  gcn_goacc_adjust_propagation_record
+#undef  TARGET_GOACC_CREATE_PROPAGATION_RECORD
+#define TARGET_GOACC_CREATE_PROPAGATION_RECORD \
+  gcn_goacc_create_propagation_record
 #undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
 #define TARGET_GOACC_ADJUST_PRIVATE_DECL gcn_goacc_adjust_private_decl
 #undef  TARGET_GOACC_FORK_JOIN
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (4 preceding siblings ...)
  2019-11-15 21:43 ` [PATCH 05/13] AMD GCN adjustments for middle-end worker partitioning Julian Brown
@ 2019-11-15 21:44 ` Julian Brown
  2019-11-18 11:00   ` Andrew Stubbs
  2019-11-15 21:44 ` [PATCH 06/13] Fix up tests for oaccdevlow pass split Julian Brown
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

It appears that the hsa_memory_copy API routine
has problems copying from read-only data: in the
libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-8.c test, a "const"
variable cannot be successfully copied to the target. I think the problem
is with read-only page mappings (in the HSA runtime). Luckily (?), when
the copy does fail, the API call returns an HSA_STATUS_ERROR code, which
we can detect and use to trigger a workaround.
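
For illustration, here is a minimal sketch of the kind of code that hits
the problem (this is not the actual reduction-8.c test, just the same
shape): a "const" object ends up in a read-only data section on the host,
and its initial host-to-device copy goes through hsa_memory_copy.

const int coeffs[4] = { 1, 2, 3, 4 };

int
main (void)
{
  int sum = 0;
#pragma acc parallel loop reduction (+:sum) copyin (coeffs)
  for (int i = 0; i < 4; i++)
    sum += coeffs[i];
  return sum != 10;
}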

I've also added return-code checks to several other uses of the
hsa_memory_copy API routine, in an attempt to avoid silent runtime
failures. (A previous fix for a similar problem, removed before upstream
submission, does not appear to apply to this exact situation.)

OK?

Julian

ChangeLog

	libgomp/
	* plugin/plugin-gcn.c (hsa_memory_copy_wrapper): New.
	(copy_data, GOMP_OFFLOAD_host2dev): Use above function.
	(GOMP_OFFLOAD_dev2host, GOMP_OFFLOAD_dev2dev): Check hsa_memory_copy
	return code.
---
 libgomp/plugin/plugin-gcn.c | 35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 7b95a4cef8f..eb016e3fcd6 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -2940,6 +2940,29 @@ maybe_init_omp_async (struct agent_info *agent)
       = GOMP_OFFLOAD_openacc_async_construct (agent->device_id);
 }
 
+/* A wrapper that works around an issue in the HSA runtime with host-to-device
+   copies from read-only pages.  */
+
+static void
+hsa_memory_copy_wrapper (void *dst, const void *src, size_t len)
+{
+  hsa_status_t status = hsa_fns.hsa_memory_copy_fn (dst, src, len);
+
+  if (status == HSA_STATUS_SUCCESS)
+    return;
+
+  /* It appears that the copy fails if the source data is in a read-only page.
+     We can't detect that easily, so try copying the data to a temporary buffer
+     and doing the copy again if we got an error above.  */
+
+  void *src_copy = malloc (len);
+  memcpy (src_copy, src, len);
+  status = hsa_fns.hsa_memory_copy_fn (dst, (const void *) src_copy, len);
+  free (src_copy);
+  if (status != HSA_STATUS_SUCCESS)
+    GOMP_PLUGIN_error ("memory copy failed");
+}
+
 /* Copy data to or from a device.  This is intended for use as an async
    callback event.  */
 
@@ -2950,7 +2973,7 @@ copy_data (void *data_)
   GCN_DEBUG ("Async thread %d:%d: Copying %zu bytes from (%p) to (%p)\n",
 	     data->aq->agent->device_id, data->aq->id, data->len, data->src,
 	     data->dst);
-  hsa_fns.hsa_memory_copy_fn (data->dst, data->src, data->len);
+  hsa_memory_copy_wrapper (data->dst, data->src, data->len);
   if (data->free_src)
     free ((void *) data->src);
   free (data);
@@ -3643,7 +3666,9 @@ GOMP_OFFLOAD_dev2host (int device, void *dst, const void *src, size_t n)
 {
   GCN_DEBUG ("Copying %zu bytes from device %d (%p) to host (%p)\n", n, device,
 	     src, dst);
-  hsa_fns.hsa_memory_copy_fn (dst, src, n);
+  hsa_status_t status = hsa_fns.hsa_memory_copy_fn (dst, src, n);
+  if (status != HSA_STATUS_SUCCESS)
+    GOMP_PLUGIN_error ("memory copy failed");
   return true;
 }
 
@@ -3654,7 +3679,7 @@ GOMP_OFFLOAD_host2dev (int device, void *dst, const void *src, size_t n)
 {
   GCN_DEBUG ("Copying %zu bytes from host (%p) to device %d (%p)\n", n, src,
 	     device, dst);
-  hsa_fns.hsa_memory_copy_fn (dst, src, n);
+  hsa_memory_copy_wrapper (dst, src, n);
   return true;
 }
 
@@ -3675,7 +3700,9 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
 
   GCN_DEBUG ("Copying %zu bytes from device %d (%p) to device %d (%p)\n", n,
 	     device, src, device, dst);
-  hsa_fns.hsa_memory_copy_fn (dst, src, n);
+  hsa_status_t status = hsa_fns.hsa_memory_copy_fn (dst, src, n);
+  if (status != HSA_STATUS_SUCCESS)
+    GOMP_PLUGIN_error ("memory copy failed");
   return true;
 }
 
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 07/13] Fix OpenACC "ephemeral" asynchronous host-to-device copies
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (6 preceding siblings ...)
  2019-11-15 21:44 ` [PATCH 06/13] Fix up tests for oaccdevlow pass split Julian Brown
@ 2019-11-15 21:44 ` Julian Brown
  2019-11-15 21:45 ` [PATCH 10/13] Race conditions in OpenACC async tests Julian Brown
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

The AMD GCN runtime support appears to exercise asynchronous operations
more heavily than other offload targets, so latent problems with
asynchronous host-to-device copies have come to light with GCN. This
patch provides a solution to those problems.

Previously posted for the og9 branch (with some rationale) here:

https://gcc.gnu.org/ml/gcc-patches/2019-09/msg01026.html
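
To make the new EPHEMERAL flag concrete, here is a rough sketch (not the
actual GCN plugin code; enqueue_copy and its free_src argument are made-up
stand-ins for the plugin's internal queueing machinery) of how an
asynchronous host2dev hook can handle an "ephemeral" source buffer, i.e.
one the caller may free or overwrite before the queued copy actually runs:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct goacc_asyncqueue;

/* Hypothetical helper: enqueue an asynchronous copy on AQ; if FREE_SRC is
   true, the completion callback frees SRC once the transfer is done.  */
extern bool enqueue_copy (struct goacc_asyncqueue *aq, int device, void *dst,
                          const void *src, size_t n, bool free_src);

static bool
example_async_host2dev (int device, void *dst, const void *src, size_t n,
                        bool ephemeral, struct goacc_asyncqueue *aq)
{
  if (ephemeral)
    {
      /* The caller does not promise SRC stays live until the copy runs, so
         snapshot it now and let the queued operation copy from (and later
         free) the snapshot instead.  */
      void *src_copy = malloc (n);
      if (!src_copy)
        return false;
      memcpy (src_copy, src, n);
      return enqueue_copy (aq, device, dst, src_copy, n, true);
    }
  return enqueue_copy (aq, device, dst, src, n, false);
}

This is the same idea as the existing free_src handling in the GCN
plugin's copy_data callback.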

This patch implies an ABI change for
GOMP_OFFLOAD_openacc_async_host2dev. I'm not sure what our policy is
there: do we need to introduce a new "v2" plugin entry point?

OK?

Julian

ChangeLog

	libgomp/
	* libgomp-plugin.h (GOMP_OFFLOAD_openacc_async_host2dev): Update
	prototype.
	* libgomp.h (gomp_copy_host2dev): Update prototype.
	* oacc-host.c (host_openacc_async_host2dev): Add ephemeral parameter.
	* oacc-mem.c (memcpy_tofrom_device): Update call to gomp_copy_host2dev.
	(update_dev_host): Likewise.
	* plugin/plugin-gcn.c (GOMP_OFFLOAD_openacc_async_host2dev): Handle
	ephemeral host-to-device copies.
	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_openacc_async_host2dev):
	Add EPHEMERAL parameter, and FIXME function comment.
	* target.c (goacc_device_copy_async): Remove.
	(gomp_copy_host2dev): Add ephemeral parameter. Update function comment.
	Call async host2dev plugin hook directly.
	(gomp_copy_dev2host): Call async dev2host plugin hook directly.
	(gomp_map_vars_existing, gomp_map_pointer): Update calls to
	gomp_copy_host2dev.
	(gomp_map_vars_internal): Don't use coalescing buffer for asynchronous
	copies. Update calls to gomp_copy_host2dev.
	(gomp_update): Update calls to gomp_copy_host2dev.
---
 libgomp/libgomp-plugin.h      |  3 +-
 libgomp/libgomp.h             |  2 +-
 libgomp/oacc-host.c           |  1 +
 libgomp/oacc-mem.c            |  4 +-
 libgomp/plugin/plugin-gcn.c   | 23 +++++----
 libgomp/plugin/plugin-nvptx.c | 13 ++++-
 libgomp/target.c              | 92 +++++++++++++++++++----------------
 7 files changed, 82 insertions(+), 56 deletions(-)

diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 037558c43f5..200f3b594ee 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -126,7 +126,8 @@ extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
 					     struct goacc_asyncqueue *);
 extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
 						 struct goacc_asyncqueue *);
-extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
+extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *,
+						 size_t, bool,
 						 struct goacc_asyncqueue *);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7b46e0a494d..65fa390f4a5 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1153,7 +1153,7 @@ extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 struct gomp_coalesce_buf;
 extern void gomp_copy_host2dev (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
-				size_t, struct gomp_coalesce_buf *);
+				size_t, bool, struct gomp_coalesce_buf *);
 extern void gomp_copy_dev2host (struct gomp_device_descr *,
 				struct goacc_asyncqueue *, void *, const void *,
 				size_t);
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index cbcac9bf7b3..bc11770725d 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -184,6 +184,7 @@ host_openacc_async_host2dev (int ord __attribute__ ((unused)),
 			     void *dst __attribute__ ((unused)),
 			     const void *src __attribute__ ((unused)),
 			     size_t n __attribute__ ((unused)),
+			     bool eph __attribute__ ((unused)),
 			     struct goacc_asyncqueue *aq
 			     __attribute__ ((unused)))
 {
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 2f271009fb8..240ebc5c865 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -205,7 +205,7 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
   if (from)
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
-    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
 
   if (profiling_p)
     {
@@ -856,7 +856,7 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
   goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, false, /* TODO: cbuf? */ NULL);
   else
     gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 583916759a5..7b95a4cef8f 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3933,19 +3933,22 @@ GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
 
 bool
 GOMP_OFFLOAD_openacc_async_host2dev (int device, void *dst, const void *src,
-				     size_t n, struct goacc_asyncqueue *aq)
+				     size_t n, bool ephemeral,
+				     struct goacc_asyncqueue *aq)
 {
   struct agent_info *agent = get_agent_info (device);
   assert (agent == aq->agent);
-  /* The source data does not necessarily remain live until the deferred
-     copy happens.  Taking a snapshot of the data here avoids reading
-     uninitialised data later, but means that (a) data is copied twice and
-     (b) modifications to the copied data between the "spawning" point of
-     the asynchronous kernel and when it is executed will not be seen.
-     But, that is probably correct.  */
-  void *src_copy = GOMP_PLUGIN_malloc (n);
-  memcpy (src_copy, src, n);
-  queue_push_copy (aq, dst, src_copy, n, true);
+
+  if (ephemeral)
+    {
+      /* The source data is on the stack or otherwise may be deallocated
+	 before the asynchronous copy takes place.  Take a copy of the source
+	 data.  */
+      void *src_copy = GOMP_PLUGIN_malloc (n);
+      memcpy (src_copy, src, n);
+      src = src_copy;
+    }
+  queue_push_copy (aq, dst, src, n, ephemeral);
   return true;
 }
 
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 911d0f66a6e..34b25f13dd4 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1711,9 +1711,20 @@ GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n)
   return true;
 }
 
+/* FIXME: It is unknown whether the cuMemcpyHtoDAsync API call caches source
+   data before the asynchronous copy takes place.  Either way there is a data
+   race associated with ignoring the EPHEMERAL parameter here -- either if it
+   is TRUE (because we are copying uncached data that may disappear before the
+   async copy takes place) or if it is FALSE (because the source data may be
+   cached/snapshotted here before it is modified by an earlier async operation,
+   so stale data gets copied to the target).
+   Neither problem has been observed in practice, so far.  */
+
 bool
 GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
-				     size_t n, struct goacc_asyncqueue *aq)
+				     size_t n,
+				     bool ephemeral __attribute__((unused)),
+				     struct goacc_asyncqueue *aq)
 {
   if (!nvptx_attach_host_thread_to_device (ord)
       || !cuda_memcpy_sanity_check (src, dst, n))
diff --git a/libgomp/target.c b/libgomp/target.c
index b6276cade9f..792c70ce331 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -190,22 +190,6 @@ gomp_device_copy (struct gomp_device_descr *devicep,
     }
 }
 
-static inline void
-goacc_device_copy_async (struct gomp_device_descr *devicep,
-			 bool (*copy_func) (int, void *, const void *, size_t,
-					    struct goacc_asyncqueue *),
-			 const char *dst, void *dstaddr,
-			 const char *src, const void *srcaddr,
-			 size_t size, struct goacc_asyncqueue *aq)
-{
-  if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
-    {
-      gomp_mutex_unlock (&devicep->lock);
-      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
-		  src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
-    }
-}
-
 /* Infrastructure for coalescing adjacent or nearly adjacent (in device addresses)
    host to device memory transfers.  */
 
@@ -298,11 +282,18 @@ gomp_to_device_kind_p (int kind)
     }
 }
 
+/* Copy host memory to an offload device.  In asynchronous mode (if AQ is
+   non-NULL), when the source data is stack or may otherwise be deallocated
+   before the asynchronous copy takes place, EPHEMERAL must be passed as
+   TRUE.  The CBUF isn't used for non-ephemeral asynchronous copies, because
+   the host data might not be computed yet (by an earlier asynchronous compute
+   region).  */
+
 attribute_hidden void
 gomp_copy_host2dev (struct gomp_device_descr *devicep,
 		    struct goacc_asyncqueue *aq,
 		    void *d, const void *h, size_t sz,
-		    struct gomp_coalesce_buf *cbuf)
+		    bool ephemeral, struct gomp_coalesce_buf *cbuf)
 {
   if (cbuf)
     {
@@ -330,8 +321,15 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	}
     }
   if (__builtin_expect (aq != NULL, 0))
-    goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
-			     "dev", d, "host", h, sz, aq);
+    {
+      if (!devicep->openacc.async.host2dev_func (devicep->target_id, d, h, sz,
+						 ephemeral, aq))
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("Copying of host object [%p..%p) to dev object [%p..%p) "
+		      "failed", h, h + sz, d, d + sz);
+	}
+    }
   else
     gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
@@ -342,8 +340,15 @@ gomp_copy_dev2host (struct gomp_device_descr *devicep,
 		    void *h, const void *d, size_t sz)
 {
   if (__builtin_expect (aq != NULL, 0))
-    goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
-			     "host", h, "dev", d, sz, aq);
+    {
+      if (!devicep->openacc.async.dev2host_func (devicep->target_id, h, d, sz,
+						 aq))
+	{
+	  gomp_mutex_unlock (&devicep->lock);
+	  gomp_fatal ("Copying of dev object [%p..%p) to host object [%p..%p) "
+		      "failed", d, d + sz, h, h + sz);
+	}
+    }
   else
     gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
 }
@@ -390,7 +395,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep,
 			(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
 				  + newn->host_start - oldn->host_start),
 			(void *) newn->host_start,
-			newn->host_end - newn->host_start, cbuf);
+			newn->host_end - newn->host_start, false, cbuf);
 
   if (oldn->refcount != REFCOUNT_INFINITY)
     oldn->refcount++;
@@ -418,8 +423,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
       cur_node.tgt_offset = (uintptr_t) NULL;
       gomp_copy_host2dev (devicep, aq,
 			  (void *) (tgt->tgt_start + target_offset),
-			  (void *) &cur_node.tgt_offset,
-			  sizeof (void *), cbuf);
+			  (void *) &cur_node.tgt_offset, sizeof (void *),
+			  true, cbuf);
       return;
     }
   /* Add bias to the pointer value.  */
@@ -439,7 +444,8 @@ gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
      to initialize the pointer with.  */
   cur_node.tgt_offset -= bias;
   gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
-		      (void *) &cur_node.tgt_offset, sizeof (void *), cbuf);
+		      (void *) &cur_node.tgt_offset, sizeof (void *), true,
+		      cbuf);
 }
 
 static void
@@ -656,8 +662,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	      for (i = first; i <= last; i++)
 		{
 		  tgt->list[i].key = NULL;
-		  if (gomp_to_device_kind_p (get_kind (short_mapkind, kinds, i)
-					     & typemask))
+		  if (!aq
+		      && gomp_to_device_kind_p (get_kind (short_mapkind, kinds,
+							  i) & typemask))
 		    gomp_coalesce_buf_add (&cbuf,
 					   tgt_size - cur_node.host_end
 					   + (uintptr_t) hostaddrs[i],
@@ -692,8 +699,9 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  gomp_coalesce_buf_add (&cbuf, tgt_size,
-				 cur_node.host_end - cur_node.host_start);
+	  if (!aq)
+	    gomp_coalesce_buf_add (&cbuf, tgt_size,
+				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
 	  has_firstprivate = true;
 	  continue;
@@ -723,7 +731,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  if (tgt_align < align)
 	    tgt_align = align;
 	  tgt_size = (tgt_size + align - 1) & ~(align - 1);
-	  if (gomp_to_device_kind_p (kind & typemask))
+	  if (!aq && gomp_to_device_kind_p (kind & typemask))
 	    gomp_coalesce_buf_add (&cbuf, tgt_size,
 				   cur_node.host_end - cur_node.host_start);
 	  tgt_size += cur_node.host_end - cur_node.host_start;
@@ -825,7 +833,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		len = sizes[i];
 		gomp_copy_host2dev (devicep, aq,
 				    (void *) (tgt->tgt_start + tgt_size),
-				    (void *) hostaddrs[i], len, cbufp);
+				    (void *) hostaddrs[i], len, false, cbufp);
 		tgt_size += len;
 		continue;
 	      case GOMP_MAP_FIRSTPRIVATE_INT:
@@ -895,12 +903,11 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		if (cur_node.tgt_offset)
 		  cur_node.tgt_offset -= sizes[i];
 		gomp_copy_host2dev (devicep, aq,
-				    (void *) (n->tgt->tgt_start
-					      + n->tgt_offset
+				    (void *) (n->tgt->tgt_start + n->tgt_offset
 					      + cur_node.host_start
 					      - n->host_start),
 				    (void *) &cur_node.tgt_offset,
-				    sizeof (void *), cbufp);
+				    sizeof (void *), true, cbufp);
 		cur_node.tgt_offset = n->tgt->tgt_start + n->tgt_offset
 				      + cur_node.host_start - n->host_start;
 		continue;
@@ -972,7 +979,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start, false,
+					cbufp);
 		    break;
 		  case GOMP_MAP_POINTER:
 		    gomp_map_pointer (tgt, aq,
@@ -984,7 +992,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					k->host_end - k->host_start, cbufp);
+					k->host_end - k->host_start, false,
+					cbufp);
 
 		    for (j = i + 1; j < mapnum; j++)
 		      if (!GOMP_MAP_POINTER_P (get_kind (short_mapkind, kinds,
@@ -1035,7 +1044,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
-					sizeof (void *), cbufp);
+					sizeof (void *), false, cbufp);
 		    break;
 		  default:
 		    gomp_mutex_unlock (&devicep->lock);
@@ -1051,7 +1060,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 		    /* We intentionally do not use coalescing here, as it's not
 		       data allocated by the current call to this function.  */
 		    gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
-					&tgt_addr, sizeof (void *), NULL);
+					&tgt_addr, sizeof (void *), true, NULL);
 		  }
 		array++;
 	      }
@@ -1066,7 +1075,7 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 	  gomp_copy_host2dev (devicep, aq,
 			      (void *) (tgt->tgt_start + i * sizeof (void *)),
 			      (void *) &cur_node.tgt_offset, sizeof (void *),
-			      cbufp);
+			      true, cbufp);
 	}
     }
 
@@ -1078,7 +1087,8 @@ gomp_map_vars_internal (struct gomp_device_descr *devicep,
 			    (void *) (tgt->tgt_start + cbuf.chunks[c].start),
 			    (char *) cbuf.buf + (cbuf.chunks[c].start
 						 - cbuf.chunks[0].start),
-			    cbuf.chunks[c].end - cbuf.chunks[c].start, NULL);
+			    cbuf.chunks[c].end - cbuf.chunks[c].start, true,
+			    NULL);
       free (cbuf.buf);
       cbuf.buf = NULL;
       cbufp = NULL;
@@ -1280,7 +1290,7 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
 
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
 	      gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size,
-				  NULL);
+				  false, NULL);
 	    if (GOMP_MAP_COPY_FROM_P (kind & typemask))
 	      gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
 	  }
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 06/13] Fix up tests for oaccdevlow pass split
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (5 preceding siblings ...)
  2019-11-15 21:44 ` [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN Julian Brown
@ 2019-11-15 21:44 ` Julian Brown
  2019-11-15 21:44 ` [PATCH 07/13] Fix OpenACC "ephemeral" asynchronous host-to-device copies Julian Brown
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:44 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch adjusts the expected output of several tests after the
oaccdevlow pass was split into three by an earlier patch in this series.

OK?

Julian

ChangeLog

	gcc/testsuite/
	* c-c++-common/goacc/classify-kernels-unparallelized.c,
	c-c++-common/goacc/classify-kernels.c,
	c-c++-common/goacc/classify-parallel.c,
	c-c++-common/goacc/classify-routine.c,
	gcc.dg/goacc/loop-processing-1.c,
	gfortran.dg/goacc/classify-kernels-unparallelized.f95,
	gfortran.dg/goacc/classify-kernels.f95,
	gfortran.dg/goacc/classify-parallel.f95,
	gfortran.dg/goacc/classify-routine.f95: Scan oaccloops dump instead of
	oaccdevlow pass.
---
 .../c-c++-common/goacc/classify-kernels-unparallelized.c  | 8 ++++----
 gcc/testsuite/c-c++-common/goacc/classify-kernels.c       | 8 ++++----
 gcc/testsuite/c-c++-common/goacc/classify-parallel.c      | 8 ++++----
 gcc/testsuite/c-c++-common/goacc/classify-routine.c       | 8 ++++----
 gcc/testsuite/gcc.dg/goacc/loop-processing-1.c            | 4 ++--
 .../gfortran.dg/goacc/classify-kernels-unparallelized.f95 | 8 ++++----
 gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95      | 8 ++++----
 gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95     | 8 ++++----
 gcc/testsuite/gfortran.dg/goacc/classify-routine.f95      | 8 ++++----
 9 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
index d4c4b2ca237..79b4cad7916 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -35,6 +35,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
index 16e9b9e31d1..8fcfc3f4278 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -31,6 +31,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
index 66a6d133663..4e8f155961e 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -24,6 +24,6 @@ void PARALLEL ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-routine.c b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
index 0b9ba6ea69f..54eddc10b3c 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-routine.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -26,6 +26,6 @@ void ROUTINE ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
index bd4c07e7d81..78b9aed89be 100644
--- a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
+++ b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
@@ -1,5 +1,5 @@
 /* Make sure that OpenACC loop processing happens.  */
-/* { dg-additional-options "-O2 -fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-O2 -fdump-tree-oaccloops" } */
 
 extern int place ();
 
@@ -15,4 +15,4 @@ void vector_1 (int *ary, int size)
   }
 }
 
-/* { dg-final { scan-tree-dump {OpenACC loops.*Loop 0\(0\).*Loop 24\(1\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 0\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 0\);.*Loop 6\(6\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 1\);.*Head-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 2\);.*Tail-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 2\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 2\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 1\);} "oaccdevlow" } } */
+/* { dg-final { scan-tree-dump {OpenACC loops.*Loop 0\(0\).*Loop 24\(1\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 0\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 0\);.*Loop 6\(6\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 1\);.*Head-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 2\);.*Tail-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 2\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 2\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 1\);} "oaccloops" } } */
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
index 08772428c4c..6e4001b4f9b 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -37,6 +37,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
index f2c4736e111..a0a5fd93bbc 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -33,6 +33,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
index a23ea81609b..ae3f322fb63 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -26,6 +26,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
index 401d5270391..ed24cee10d8 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 subroutine ROUTINE
   !$acc routine worker
@@ -25,6 +25,6 @@ end subroutine ROUTINE
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccloops" } }
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (8 preceding siblings ...)
  2019-11-15 21:45 ` [PATCH 10/13] Race conditions in OpenACC async tests Julian Brown
@ 2019-11-15 21:45 ` Julian Brown
  2019-11-18 11:10   ` Andrew Stubbs
  2019-11-15 21:45 ` [PATCH 11/13] AMD GCN symbol output with null cfun Julian Brown
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:45 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch corrects a possible race condition in locking for the
asynchronous queue-full condition check in the AMD GCN libgomp plugin.
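
The crux of the fix is that the mutex is now taken before the full-queue
check rather than inside it, so the queue cannot refill between the wait
and the push. Roughly (illustrative fragment; queue_push_example is a
made-up name, the rest follows the plugin):

static void
queue_push_example (struct goacc_asyncqueue *aq)
{
  pthread_mutex_lock (&aq->mutex);

  /* wait_for_queue_nonfull now expects aq->mutex to be held; the
     pthread_cond_wait inside it releases and re-acquires the mutex
     atomically, so there is no window in which the queue can refill
     unobserved.  */
  wait_for_queue_nonfull (aq);

  /* ... append the new entry while still holding the mutex ...  */

  pthread_mutex_unlock (&aq->mutex);
}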

OK?

Julian

ChangeLog

	libgomp/
	* plugin/plugin-gcn.c (wait_for_queue_nonfull): Don't lock/unlock
	aq->mutex here.
	(queue_push_launch): Lock aq->mutex before calling
	wait_for_queue_nonfull.
	(queue_push_callback): Likewise.
	(queue_push_asyncwait): Likewise.
	(queue_push_placeholder): Likewise.
---
 libgomp/plugin/plugin-gcn.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index eb016e3fcd6..c4347dfa45d 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -2732,13 +2732,9 @@ wait_for_queue_nonfull (struct goacc_asyncqueue *aq)
 {
   if (aq->queue_n == ASYNC_QUEUE_SIZE)
     {
-      pthread_mutex_lock (&aq->mutex);
-
       /* Queue is full.  Wait for it to not be full.  */
       while (aq->queue_n == ASYNC_QUEUE_SIZE)
 	pthread_cond_wait (&aq->queue_cond_out, &aq->mutex);
-
-      pthread_mutex_unlock (&aq->mutex);
     }
 }
 
@@ -2752,10 +2748,10 @@ queue_push_launch (struct goacc_asyncqueue *aq, struct kernel_info *kernel,
 {
   assert (aq->agent == kernel->agent);
 
-  wait_for_queue_nonfull (aq);
-
   pthread_mutex_lock (&aq->mutex);
 
+  wait_for_queue_nonfull (aq);
+
   int queue_last = ((aq->queue_first + aq->queue_n)
 		    % ASYNC_QUEUE_SIZE);
   if (DEBUG_QUEUES)
@@ -2785,10 +2781,10 @@ static void
 queue_push_callback (struct goacc_asyncqueue *aq, void (*fn)(void *),
 		     void *data)
 {
-  wait_for_queue_nonfull (aq);
-
   pthread_mutex_lock (&aq->mutex);
 
+  wait_for_queue_nonfull (aq);
+
   int queue_last = ((aq->queue_first + aq->queue_n)
 		    % ASYNC_QUEUE_SIZE);
   if (DEBUG_QUEUES)
@@ -2818,10 +2814,10 @@ static void
 queue_push_asyncwait (struct goacc_asyncqueue *aq,
 		      struct placeholder *placeholderp)
 {
-  wait_for_queue_nonfull (aq);
-
   pthread_mutex_lock (&aq->mutex);
 
+  wait_for_queue_nonfull (aq);
+
   int queue_last = ((aq->queue_first + aq->queue_n) % ASYNC_QUEUE_SIZE);
   if (DEBUG_QUEUES)
     GCN_DEBUG ("queue_push_asyncwait %d:%d: at %i\n", aq->agent->device_id,
@@ -2849,10 +2845,10 @@ queue_push_placeholder (struct goacc_asyncqueue *aq)
 {
   struct placeholder *placeholderp;
 
-  wait_for_queue_nonfull (aq);
-
   pthread_mutex_lock (&aq->mutex);
 
+  wait_for_queue_nonfull (aq);
+
   int queue_last = ((aq->queue_first + aq->queue_n) % ASYNC_QUEUE_SIZE);
   if (DEBUG_QUEUES)
     GCN_DEBUG ("queue_push_placeholder %d:%d: at %i\n", aq->agent->device_id,
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 10/13] Race conditions in OpenACC async tests
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (7 preceding siblings ...)
  2019-11-15 21:44 ` [PATCH 07/13] Fix OpenACC "ephemeral" asynchronous host-to-device copies Julian Brown
@ 2019-11-15 21:45 ` Julian Brown
  2019-11-15 21:45 ` [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix Julian Brown
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:45 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch fixes race conditions in tests that started failing when
offloading to AMD GCN.

OK?

Thanks,

Julian

ChangeLog

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/lib-94.c: Fix race condition.
	* testsuite/libgomp.oacc-fortran/lib-16-2.f90: Likewise.
---
 libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c | 4 ++--
 libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90  | 5 +++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
index 54497237b0c..baa3ac83f04 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-94.c
@@ -22,10 +22,10 @@ main (int argc, char **argv)
 
   acc_copyin_async (h, N, async);
 
-  memset (h, 0, N);
-
   acc_wait (async);
 
+  memset (h, 0, N);
+
   acc_copyout_async (h, N, async + 1);
 
   acc_wait (async + 1);
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
index ddd557d3be0..e2e47c967fa 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 11/13] AMD GCN symbol output with null cfun
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (9 preceding siblings ...)
  2019-11-15 21:45 ` [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix Julian Brown
@ 2019-11-15 21:45 ` Julian Brown
  2019-11-18 11:12   ` Andrew Stubbs
  2019-11-15 21:46 ` [PATCH 12/13] Fix parallel-dims.f90 for AMD GCN Julian Brown
  2019-11-15 21:58 ` [PATCH 13/13] Enable worker partitioning " Julian Brown
  12 siblings, 1 reply; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:45 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch checks that cfun is valid in the gcn_asm_output_symbol_ref
function. This prevents a crash when that function is called with NULL
cfun, i.e. when outputting debug symbols.

OK?

Thanks,

Julian

ChangeLog

	gcc/
	* config/gcn/gcn.c (gcn_asm_output_symbol_ref): Handle null cfun.
---
 gcc/config/gcn/gcn.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 2f758ef3ddc..3584ac85021 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -5199,7 +5199,8 @@ void
 gcn_asm_output_symbol_ref (FILE *file, rtx x)
 {
   tree decl;
-  if ((decl = SYMBOL_REF_DECL (x)) != 0
+  if (cfun
+      && (decl = SYMBOL_REF_DECL (x)) != 0
       && TREE_CODE (decl) == VAR_DECL
       && AS_LDS_P (TYPE_ADDR_SPACE (TREE_TYPE (decl))))
     {
@@ -5214,7 +5215,8 @@ gcn_asm_output_symbol_ref (FILE *file, rtx x)
     {
       assemble_name (file, XSTR (x, 0));
       /* FIXME: See above -- this condition is unreachable.  */
-      if ((decl = SYMBOL_REF_DECL (x)) != 0
+      if (cfun
+	  && (decl = SYMBOL_REF_DECL (x)) != 0
 	  && TREE_CODE (decl) == VAR_DECL
 	  && AS_LDS_P (TYPE_ADDR_SPACE (TREE_TYPE (decl))))
 	fputs ("@abs32", file);
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 12/13] Fix parallel-dims.f90 for AMD GCN
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (10 preceding siblings ...)
  2019-11-15 21:45 ` [PATCH 11/13] AMD GCN symbol output with null cfun Julian Brown
@ 2019-11-15 21:46 ` Julian Brown
  2019-11-15 21:58 ` [PATCH 13/13] Enable worker partitioning " Julian Brown
  12 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:46 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch provides AMD GCN support for the parallel-dims.f90 test's
parallel-dims-aux.c helper.

OK?

Thanks,

Julian

ChangeLog

	libgomp/
	* testsuite/libgomp.oacc-fortran/parallel-dims-aux.c: Support AMD GCN.
---
 .../testsuite/libgomp.oacc-fortran/parallel-dims-aux.c   | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c b/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
index b5986f4afef..0778081860f 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
+++ b/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
@@ -16,7 +16,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_gcn))
     return __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
   else
     __builtin_abort ();
@@ -27,7 +28,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_gcn))
     return __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
   else
     __builtin_abort ();
@@ -38,7 +40,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_gcn))
     return __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
   else
     __builtin_abort ();
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 13/13] Enable worker partitioning for AMD GCN
  2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
                   ` (11 preceding siblings ...)
  2019-11-15 21:46 ` [PATCH 12/13] Fix parallel-dims.f90 for AMD GCN Julian Brown
@ 2019-11-15 21:58 ` Julian Brown
  2019-11-18 11:24   ` Andrew Stubbs
  12 siblings, 1 reply; 19+ messages in thread
From: Julian Brown @ 2019-11-15 21:58 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

This patch flips the switch to enable worker partitioning on AMD GCN.
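
As a usage illustration (not part of the patch), a gang/worker loop such
as the following can now actually be spread across multiple workers per
gang on GCN, rather than being clamped to a single worker:

void
scale (int n, float *restrict a, const float *restrict b)
{
  /* With worker partitioning enabled, the worker dimension below is
     honoured on GCN (up to the 16-worker default).  */
#pragma acc parallel loop gang worker num_workers(16) \
	copy(a[0:n]) copyin(b[0:n])
  for (int i = 0; i < n; i++)
    a[i] = 2.0f * b[i];
}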

OK?

Thanks,

Julian

ChangeLog

	gcc/
	* config/gcn/gcn.c (gcn_goacc_validate_dims): Remove
	no-flag_worker-partitioning assertion.
	(TARGET_GOACC_WORKER_PARTITIONING): Define target hook to true.
	* config/gcn/gcn.opt (flag_worker_partitioning): Change default to 1.

	libgomp/
	* plugin/plugin-gcn.c (gcn_exec): Change default number of workers to
	16.
---
 gcc/config/gcn/gcn.c        | 4 ++--
 gcc/config/gcn/gcn.opt      | 2 +-
 libgomp/plugin/plugin-gcn.c | 4 +---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 3584ac85021..6e45032ea3a 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -4692,8 +4692,6 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
   /* FIXME: remove -facc-experimental-workers when they're ready.  */
   int max_workers = flag_worker_partitioning ? 16 : 1;
 
-  gcc_assert (!flag_worker_partitioning);
-
   /* The vector size must appear to be 64, to the user, unless this is a
      SEQ routine.  The real, internal value is always 1, which means use
      autovectorization, but the user should not see that.  */
@@ -6072,6 +6070,8 @@ print_operand (FILE *file, rtx x, int code)
 #define TARGET_GOACC_REDUCTION gcn_goacc_reduction
 #undef  TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS gcn_goacc_validate_dims
+#undef  TARGET_GOACC_WORKER_PARTITIONING
+#define TARGET_GOACC_WORKER_PARTITIONING true
 #undef  TARGET_HARD_REGNO_MODE_OK
 #define TARGET_HARD_REGNO_MODE_OK gcn_hard_regno_mode_ok
 #undef  TARGET_HARD_REGNO_NREGS
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index 402deb625bd..bdc878f35ad 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -65,7 +65,7 @@ Target Report RejectNegative Var(flag_bypass_init_error)
 bool flag_worker_partitioning = false
 
 macc-experimental-workers
-Target Report Var(flag_worker_partitioning) Init(0)
+Target Report Var(flag_worker_partitioning) Init(1)
 
 int stack_size_opt = -1
 
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index c4347dfa45d..3368d7e261a 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3097,10 +3097,8 @@ gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
      problem size, so let's do a reasonable number of single-worker gangs.
      64 gangs matches a typical Fiji device.  */
 
-  /* NOTE: Until support for middle-end worker partitioning is merged, use 1
-     for the default number of workers.  */
   if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  if (dims[1] == 0) dims[1] = 1;  /* Workers.  */
+  if (dims[1] == 0) dims[1] = 16;  /* Workers.  */
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
-- 
2.23.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN
  2019-11-15 21:44 ` [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN Julian Brown
@ 2019-11-18 11:00   ` Andrew Stubbs
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Stubbs @ 2019-11-18 11:00 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore

On 15/11/2019 21:44, Julian Brown wrote:
> +static void
> +hsa_memory_copy_wrapper (void *dst, const void *src, size_t len)
> +{
> +  hsa_status_t status = hsa_fns.hsa_memory_copy_fn (dst, src, len);
> +
> +  if (status == HSA_STATUS_SUCCESS)
> +    return;
> +
> +  /* It appears that the copy fails if the source data is in a read-only page.
> +     We can't detect that easily, so try copying the data to a temporary buffer
> +     and doing the copy again if we got an error above.  */
> +
> +  void *src_copy = malloc (len);
> +  memcpy (src_copy, src, len);
> +  status = hsa_fns.hsa_memory_copy_fn (dst, (const void *) src_copy, len);
> +  free (src_copy);
> +  if (status != HSA_STATUS_SUCCESS)
> +    GOMP_PLUGIN_error ("memory copy failed");
> +}

I'd like a GCN_DEBUG (or GCN_WARNING?) in the fallback case, so that we
can see whether it is taken very often.  If it is, I think we should
report it as a bug to ROCm.
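
Something along these lines in the fallback path would do (exact wording
is only a suggestion):

  /* Suggested addition before retrying via the bounce buffer.  */
  GCN_DEBUG ("hsa_memory_copy failed (source may be read-only); retrying "
	     "via a writable temporary buffer\n");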

Otherwise the patch is fine by me.

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix
  2019-11-15 21:45 ` [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix Julian Brown
@ 2019-11-18 11:10   ` Andrew Stubbs
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Stubbs @ 2019-11-18 11:10 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore

On 15/11/2019 21:44, Julian Brown wrote:
> @@ -2732,13 +2732,9 @@ wait_for_queue_nonfull (struct goacc_asyncqueue *aq)
>   {
>     if (aq->queue_n == ASYNC_QUEUE_SIZE)
>       {
> -      pthread_mutex_lock (&aq->mutex);
> -
>         /* Queue is full.  Wait for it to not be full.  */
>         while (aq->queue_n == ASYNC_QUEUE_SIZE)
>   	pthread_cond_wait (&aq->queue_cond_out, &aq->mutex);
> -
> -      pthread_mutex_unlock (&aq->mutex);
>       }
>   }

If wait_for_queue_nonfull requires the mutex to be held on entry, then
the comment above the function should say so.
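
Something like this above the definition would do (wording is only a
suggestion):

/* Wait until the queue attached to AQ has room for another entry.
   The caller must hold AQ->mutex on entry; pthread_cond_wait releases
   and re-acquires it while blocking.  */
static void
wait_for_queue_nonfull (struct goacc_asyncqueue *aq)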

Otherwise this looks fine.

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 11/13] AMD GCN symbol output with null cfun
  2019-11-15 21:45 ` [PATCH 11/13] AMD GCN symbol output with null cfun Julian Brown
@ 2019-11-18 11:12   ` Andrew Stubbs
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Stubbs @ 2019-11-18 11:12 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore

On 15/11/2019 21:44, Julian Brown wrote:
> This patch checks that cfun is valid in the gcn_asm_output_symbol_ref
> function. This prevents a crash when that function is called with NULL
> cfun, i.e. when outputting debug symbols.
> 
> OK?

OK, although that FIXME still baffles me.

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 13/13] Enable worker partitioning for AMD GCN
  2019-11-15 21:58 ` [PATCH 13/13] Enable worker partitioning " Julian Brown
@ 2019-11-18 11:24   ` Andrew Stubbs
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Stubbs @ 2019-11-18 11:24 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore

On 15/11/2019 21:44, Julian Brown wrote:
> This patch flips the switch to enable worker partitioning on AMD GCN.
> 
> OK?

This is OK, although I think we could just remove that flag now.

Andrew

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/13] Add support for gang local storage allocation in shared memory
  2019-11-15 21:42 ` [PATCH 01/13] Add support for gang local storage allocation in shared memory Julian Brown
@ 2019-12-05  2:40   ` Julian Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Julian Brown @ 2019-12-05  2:40 UTC (permalink / raw)
  To: gcc-patches; +Cc: thomas_schwinge, jakub, Catherine_Moore, Andrew Stubbs

[-- Attachment #1: Type: text/plain, Size: 4291 bytes --]

Hi!

On Fri, 15 Nov 2019 13:41:11 -0800
Julian Brown <julian@codesourcery.com> wrote:

> This patch provides support for gang local storage allocation in
> shared memory. It is mostly identical to the version posted
> previously, with one cosmetic fix (a duplicated identical condition):
> 
> https://gcc.gnu.org/ml/gcc-patches/2019-11/msg00448.html
> 
> Tested alongside other patches in this series with offloading to AMD
> GCN.

This is a new version of the patch, which hopefully addresses some review
comments that I had previously missed.

First, Jakub's suggestion from June about the oacc_addressable_var_decls
field not needing to be a pointer has been applied:

https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00097.html

Secondly, Thomas's remarks in the following message, about the naming of
the 'expand_accel_var' target hook and the placement of DejaGNU scanning
patterns in the new tests, have been addressed:

https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00700.html

The above-mentioned hook is now called TARGET_GOACC_EXPAND_VAR_DECL,
which is arguably more correct than the previous name since there isn't
really any such thing as an "accel var".
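
For reference, a target's implementation of the new hook might look
roughly like this (hypothetical sketch only; the attribute name and the
shared_mem_base helper are placeholders, not the nvptx code in the
attached patch):

static rtx
example_goacc_expand_var_decl (tree var)
{
  /* If VAR was marked earlier as gang-private, hand back an RTX that
     addresses it in the per-gang shared-memory block; otherwise return
     NULL_RTX and let the normal expansion path handle it.  */
  if (lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
    return gen_rtx_MEM (DECL_MODE (var),
			shared_mem_base (var));  /* placeholder helper */
  return NULL_RTX;
}

#undef  TARGET_GOACC_EXPAND_VAR_DECL
#define TARGET_GOACC_EXPAND_VAR_DECL example_goacc_expand_var_decl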

This version has been lightly re-tested; full test reruns (together with
the other patches in the worker-partitioning support series), with
offloading to both AMD GCN and NVPTX, are in progress.

OK?

Thanks,

Julian

ChangeLog

2019-12-05  Julian Brown  <julian@codesourcery.com>
	    Chung-Lin Tang  <cltang@codesourcery.com>

	gcc/
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this.  Add LEVEL parameter.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this. Add and use LEVEL parameter.
	* config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Delete.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL): Define using renamed
	gcn_goacc_adjust_private_decl.
	* config/nvptx/nvptx.c (tree-hash-traits.h, tree-pretty-print.h):
	Include.
	(gangprivate_shared_size): New global variable.
	(gangprivate_shared_align): Likewise.
	(gangprivate_shared_sym): Likewise.
	(gangprivate_shared_hmap): Likewise.
	(nvptx_option_override): Initialize gangprivate_shared_sym,
	gangprivate_shared_align.
	(nvptx_file_end): Output gangprivate_shared_sym.
	(nvptx_goacc_adjust_private_decl): New function.
	(nvptx_goacc_expand_var_decl): New function.
	(nvptx_set_current_function): New function.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL, TARGET_GOACC_EXPAND_VAR_DECL):
	Define hooks.
	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR,
	TARGET_GOACC_ADJUST_PRIVATE_DECL): Place new documentation hooks.
	* doc/tm.texi: Regenerate.
	* expr.c (expand_expr_real_1): Expand decls using the expand_var_decl
	OpenACC hook if defined.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE.
	* internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE.
	* omp-low.c (omp_context): Add oacc_addressable_var_decls field.
	(lower_oacc_reductions): Add PRIVATE_MARKER parameter.  Insert private
	marker before fork.
	(lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify private
	marker's gimple call arguments, and pass it to lower_oacc_reductions.
	(oacc_record_private_var_clauses, oacc_record_vars_in_bind,
	make_oacc_private_marker): New functions.
	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
	clauses.  Call oacc_record_vars_in_bind for OpenACC contexts.  Create
	private marker and pass to lower_oacc_head_tail.
	(lower_omp_target): Create private marker and pass to
	lower_oacc_reductions.
	(lower_omp_1): Call oacc_record_vars_in_bind for OpenACC bind contexts.
	* omp-offload.c (convert.h): Include.
	(oacc_loop_xform_head_tail): Treat private-variable markers like
	fork/join when transforming head/tail sequences.
	(execute_oacc_device_lower): Use IFN_UNIQUE_OACC_PRIVATE to determine
	partitioning level of private variables, and process any found via
	adjust_private_decl target hook.
	* target.def (expand_var_decl, adjust_private_decl): New OpenACC target
	hooks.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.

[-- Attachment #2: gang-local-storage-in-shm-11.diff --]
[-- Type: text/x-patch, Size: 29661 bytes --]

commit 24dedf12d09020bf2177074b2456655b89dc625f
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Mar 21 15:09:24 2019 -0700

    Add support for gang local storage allocation in shared memory
    
            gcc/
            * config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl): Rename
            to...
            (gcn_goacc_adjust_private_decl): ...this.  Add LEVEL parameter.
            * config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl): Rename
            to...
            (gcn_goacc_adjust_private_decl): ...this. Add and use LEVEL parameter.
            * config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Delete.
            (TARGET_GOACC_ADJUST_PRIVATE_DECL): Define using renamed
            gcn_goacc_adjust_private_decl.
            * config/nvptx/nvptx.c (tree-hash-traits.h, tree-pretty-print.h):
            Include.
            (gangprivate_shared_size): New global variable.
            (gangprivate_shared_align): Likewise.
            (gangprivate_shared_sym): Likewise.
            (gangprivate_shared_hmap): Likewise.
            (nvptx_option_override): Initialize gangprivate_shared_sym,
            gangprivate_shared_align.
            (nvptx_file_end): Output gangprivate_shared_sym.
            (nvptx_goacc_adjust_private_decl): New function.
            (nvptx_goacc_expand_var_decl): New function.
            (nvptx_set_current_function): New function.
            (TARGET_GOACC_ADJUST_PRIVATE_DECL, TARGET_GOACC_EXPAND_VAR_DECL):
            Define hooks.
            * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR,
            TARGET_GOACC_ADJUST_PRIVATE_DECL): Place new documentation hooks.
            * doc/tm.texi: Regenerate.
            * expr.c (expand_expr_real_1): Expand decls using the expand_var_decl
            OpenACC hook if defined.
            * internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE.
            * internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE.
            * omp-low.c (omp_context): Add oacc_addressable_var_decls field.
            (lower_oacc_reductions): Add PRIVATE_MARKER parameter.  Insert private
            marker before fork.
            (lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify private
            marker's gimple call arguments, and pass it to lower_oacc_reductions.
            (oacc_record_private_var_clauses, oacc_record_vars_in_bind,
            make_oacc_private_marker): New functions.
            (lower_omp_for): Call oacc_record_private_var_clauses with "for"
            clauses.  Call oacc_record_vars_in_bind for OpenACC contexts.  Create
            private marker and pass to lower_oacc_head_tail.
            (lower_omp_target): Create private marker and pass to
            lower_oacc_reductions.
            (lower_omp_1): Call oacc_record_vars_in_bind for OpenACC bind contexts.
            * omp-offload.c (convert.h): Include.
            (oacc_loop_xform_head_tail): Treat private-variable markers like
            fork/join when transforming head/tail sequences.
            (execute_oacc_device_lower): Use IFN_UNIQUE_OACC_PRIVATE to determine
            partitioning level of private variables, and process any found via
            adjust_private_decl target hook.
            * target.def (expand_var_decl, adjust_private_decl): New OpenACC target
            hooks.
    
            libgomp/
            * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
            * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
            * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
            * testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.

diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
index da7faf29c70..714d51189d9 100644
--- a/gcc/config/gcn/gcn-protos.h
+++ b/gcc/config/gcn/gcn-protos.h
@@ -39,7 +39,7 @@ extern rtx gcn_gen_undef (machine_mode);
 extern bool gcn_global_address_p (rtx);
 extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
 						 const char *name);
-extern void gcn_goacc_adjust_gangprivate_decl (tree var);
+extern void gcn_goacc_adjust_private_decl (tree var, int level);
 extern void gcn_goacc_reduction (gcall *call);
 extern bool gcn_hard_regno_rename_ok (unsigned int from_reg,
 				      unsigned int to_reg);
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
index c6b6302e9ed..aa56e236134 100644
--- a/gcc/config/gcn/gcn-tree.c
+++ b/gcc/config/gcn/gcn-tree.c
@@ -697,8 +697,11 @@ gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
 }
 
 void
-gcn_goacc_adjust_gangprivate_decl (tree var)
+gcn_goacc_adjust_private_decl (tree var, int level)
 {
+  if (level != GOMP_DIM_GANG)
+    return;
+
   tree type = TREE_TYPE (var);
   tree lds_type = build_qualified_type (type,
 		    TYPE_QUALS_NO_ADDR_SPACE (type)
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index d2a35c95d7a..f21a83853ea 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -6096,8 +6096,8 @@ print_operand (FILE *file, rtx x, int code)
 #undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
 #define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
   gcn_goacc_adjust_propagation_record
-#undef  TARGET_GOACC_ADJUST_GANGPRIVATE_DECL
-#define TARGET_GOACC_ADJUST_GANGPRIVATE_DECL gcn_goacc_adjust_gangprivate_decl
+#undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
+#define TARGET_GOACC_ADJUST_PRIVATE_DECL gcn_goacc_adjust_private_decl
 #undef  TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN gcn_fork_join
 #undef  TARGET_GOACC_REDUCTION
diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 0d6e8840852..88a04e319db 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -74,6 +74,8 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
+#include "tree-pretty-print.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -166,6 +168,12 @@ static unsigned vector_red_align;
 static unsigned vector_red_partition;
 static GTY(()) rtx vector_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -247,6 +255,10 @@ nvptx_option_override (void)
   vector_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
   vector_red_partition = 0;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -5231,6 +5243,10 @@ nvptx_file_end (void)
     write_shared_buffer (asm_out_file, vector_red_sym,
 			 vector_red_align, vector_red_size);
 
+  if (gangprivate_shared_size)
+    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -6450,6 +6466,60 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+/* Implement TARGET_GOACC_ADJUST_PRIVATE_DECL.  Set "oacc gangprivate"
+   attribute for gang-private variable declarations.  */
+
+void
+nvptx_goacc_adjust_private_decl (tree decl, int level)
+{
+  if (level != GOMP_DIM_GANG)
+    return;
+
+  if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Setting 'oacc gangprivate' attribute for decl:");
+	  print_generic_decl (dump_file, decl, TDF_SLIM);
+	  fputc ('\n', dump_file);
+	}
+      tree id = get_identifier ("oacc gangprivate");
+      DECL_ATTRIBUTES (decl) = tree_cons (id, NULL, DECL_ATTRIBUTES (decl));
+    }
+}
+
+/* Implement TARGET_GOACC_EXPAND_VAR_DECL.  Place "oacc gangprivate"
+   variables in shared memory.  */
+
+static rtx
+nvptx_goacc_expand_var_decl (tree var)
+{
+  if (VAR_P (var)
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size
+	    = (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
 static GTY(()) tree nvptx_previous_fndecl;
 
 static void
@@ -6458,6 +6528,7 @@ nvptx_set_current_function (tree fndecl)
   if (!fndecl || fndecl == nvptx_previous_fndecl)
     return;
 
+  gangprivate_shared_hmap.empty ();
   nvptx_previous_fndecl = fndecl;
   vector_red_partition = 0;
   oacc_bcast_partition = 0;
@@ -6602,6 +6673,12 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_ADJUST_PRIVATE_DECL
+#define TARGET_GOACC_ADJUST_PRIVATE_DECL nvptx_goacc_adjust_private_decl
+
+#undef TARGET_GOACC_EXPAND_VAR_DECL
+#define TARGET_GOACC_EXPAND_VAR_DECL nvptx_goacc_expand_var_decl
+
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 5b8b68bd710..67bb31d1c76 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6185,6 +6185,19 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_VAR_DECL (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelerator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expansion resumes.
+@end deftypefn
+
+@deftypefn {Target Hook} void TARGET_GOACC_ADJUST_PRIVATE_DECL (tree @var{var}, @var{int})
+Tweak variable declaration for a private variable at the specified
+parallelism level.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 1b061d70127..0d023f21fae 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4213,6 +4213,10 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_VAR_DECL
+
+@hook TARGET_GOACC_ADJUST_PRIVATE_DECL
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index ed50586971f..e6e19e7e2bf 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -10069,8 +10069,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle variables that require special
+	 treatment, e.g. if they have been modified in some way earlier in
+	 compilation by the adjust_private_decl OpenACC hook.  */
+      if (flag_openacc && targetm.goacc.expand_var_decl)
+	{
+	  temp = targetm.goacc.expand_var_decl (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 6bbbc9e946e..8b483d1d343 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2624,6 +2624,8 @@ expand_UNIQUE (internal_fn, gcall *stmt)
       else
 	gcc_unreachable ();
       break;
+    case IFN_UNIQUE_OACC_PRIVATE:
+      break;
     }
 
   if (pattern)
diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h
index a1bc0819915..aab3237e9f6 100644
--- a/gcc/internal-fn.h
+++ b/gcc/internal-fn.h
@@ -36,7 +36,8 @@ along with GCC; see the file COPYING3.  If not see
 #define IFN_UNIQUE_CODES				  \
   DEF(UNSPEC),	\
     DEF(OACC_FORK), DEF(OACC_JOIN),		\
-    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK)
+    DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK),	\
+    DEF(OACC_PRIVATE)
 
 enum ifn_unique_kind {
 #define DEF(X) IFN_UNIQUE_##X
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 19132f76da2..30657d0b617 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -163,6 +163,9 @@ struct omp_context
 
   /* True if there is bind clause on the construct (i.e. a loop construct).  */
   bool loop_p;
+
+  /* Addressable variable decls in this context.  */
+  vec<tree> oacc_addressable_var_decls;
 };
 
 static splay_tree all_contexts;
@@ -6663,8 +6666,9 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *body_p,
 
 static void
 lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
-		       gcall *fork, gcall *join, gimple_seq *fork_seq,
-		       gimple_seq *join_seq, omp_context *ctx)
+		       gcall *fork, gcall *private_marker, gcall *join,
+		       gimple_seq *fork_seq, gimple_seq *join_seq,
+		       omp_context *ctx)
 {
   gimple_seq before_fork = NULL;
   gimple_seq after_fork = NULL;
@@ -6862,6 +6866,8 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 
   /* Now stitch things together.  */
   gimple_seq_add_seq (fork_seq, before_fork);
+  if (private_marker)
+    gimple_seq_add_stmt (fork_seq, private_marker);
   if (fork)
     gimple_seq_add_stmt (fork_seq, fork);
   gimple_seq_add_seq (fork_seq, after_fork);
@@ -7577,7 +7583,7 @@ lower_oacc_loop_marker (location_t loc, tree ddvar, bool head,
    HEAD and TAIL.  */
 
 static void
-lower_oacc_head_tail (location_t loc, tree clauses,
+lower_oacc_head_tail (location_t loc, tree clauses, gcall *private_marker,
 		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
 {
   bool inner = false;
@@ -7585,6 +7591,14 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
 
   unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
+
+  if (private_marker)
+    {
+      gimple_set_location (private_marker, loc);
+      gimple_call_set_lhs (private_marker, ddvar);
+      gimple_call_set_arg (private_marker, 1, ddvar);
+    }
+
   tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
@@ -7615,7 +7629,8 @@ lower_oacc_head_tail (location_t loc, tree clauses,
 			      &join_seq);
 
       lower_oacc_reductions (loc, clauses, place, inner,
-			     fork, join, &fork_seq, &join_seq,  ctx);
+			     fork, (count == 1) ? private_marker : NULL,
+			     join, &fork_seq, &join_seq,  ctx);
 
       /* Append this level to head. */
       gimple_seq_add_seq (head, fork_seq);
@@ -9580,6 +9595,32 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	tree decl = OMP_CLAUSE_DECL (c);
+	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
+	  ctx->oacc_addressable_var_decls.safe_push (decl);
+      }
+}
+
+/* Record addressable vars declared in BINDVARS in CTX.  This information is
+   used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    if (VAR_P (v) && TREE_ADDRESSABLE (v))
+      ctx->oacc_addressable_var_decls.safe_push (v);
+}
+
 /* Callback for walk_gimple_seq.  Find #pragma omp scan statement.  */
 
 static tree
@@ -10410,6 +10451,57 @@ lower_omp_for_scan (gimple_seq *body_p, gimple_seq *dlist, gomp_for *stmt,
   *dlist = new_dlist;
 }
 
+/* Build an internal UNIQUE function with type IFN_UNIQUE_OACC_PRIVATE listing
+   the addresses of variables that should be made private at the surrounding
+   parallelism level.  Such functions appear in the gimple code stream in two
+   forms, e.g. for a partitioned loop:
+
+      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6, 1, 68);
+      .data_dep.6 = .UNIQUE (OACC_PRIVATE, .data_dep.6, -1, &w);
+      .data_dep.6 = .UNIQUE (OACC_FORK, .data_dep.6, -1);
+      .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6);
+
+   or alternatively, OACC_PRIVATE can appear at the top level of a parallel,
+   not as part of a HEAD_MARK sequence:
+
+      .UNIQUE (OACC_PRIVATE, 0, 0, &w);
+
+   For such stand-alone appearances, the 3rd argument is always 0, denoting
+   gang partitioning.  */
+
+static gcall *
+make_oacc_private_marker (omp_context *ctx)
+{
+  int i;
+  tree decl;
+
+  if (ctx->oacc_addressable_var_decls.length () == 0)
+    return NULL;
+
+  auto_vec<tree, 5> args;
+
+  args.quick_push (build_int_cst (integer_type_node, IFN_UNIQUE_OACC_PRIVATE));
+  args.quick_push (integer_zero_node);
+  args.quick_push (integer_minus_one_node);
+
+  FOR_EACH_VEC_ELT (ctx->oacc_addressable_var_decls, i, decl)
+    {
+      for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer)
+	{
+	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
+	  if (inner_decl)
+	    {
+	      decl = inner_decl;
+	      break;
+	    }
+	}
+      tree addr = build_fold_addr_expr (decl);
+      args.safe_push (addr);
+    }
+
+  return gimple_build_call_internal_vec (IFN_UNIQUE, args);
+}
+
 /* Lower code for an OMP loop directive.  */
 
 static void
@@ -10426,6 +10518,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -10444,6 +10538,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
       gbind *inner_bind
 	= as_a <gbind *> (gimple_seq_first_stmt (omp_for_body));
       tree vars = gimple_bind_vars (inner_bind);
+      if (is_gimple_omp_oacc (ctx->stmt))
+	oacc_record_vars_in_bind (ctx, vars);
       gimple_bind_append_vars (new_stmt, vars);
       /* bind_vars/BLOCK_VARS are being moved to new_stmt/block, don't
 	 keep them on the inner_bind and it's block.  */
@@ -10543,6 +10639,11 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   lower_omp (gimple_omp_body_ptr (stmt), ctx);
 
+  gcall *private_marker = NULL;
+  if (is_gimple_omp_oacc (ctx->stmt)
+      && !gimple_seq_empty_p (omp_for_body))
+    private_marker = make_oacc_private_marker (ctx);
+
   /* Lower the header expressions.  At this point, we can assume that
      the header is of the form:
 
@@ -10579,7 +10680,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
   if (is_gimple_omp_oacc (ctx->stmt)
       && !ctx_in_oacc_kernels_region (ctx))
     lower_oacc_head_tail (gimple_location (stmt),
-			  gimple_omp_for_clauses (stmt),
+			  gimple_omp_for_clauses (stmt), private_marker,
 			  &oacc_head, &oacc_tail, ctx);
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
@@ -12521,8 +12622,14 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	     them as a dummy GANG loop.  */
 	  tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG);
 
+	  gcall *private_marker = make_oacc_private_marker (ctx);
+
+	  if (private_marker)
+	    gimple_call_set_arg (private_marker, 2, level);
+
 	  lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level,
-				 false, NULL, NULL, &fork_seq, &join_seq, ctx);
+				 false, NULL, private_marker, NULL, &fork_seq,
+				 &join_seq, ctx);
 	}
 
       gimple_seq_add_seq (&new_body, fork_seq);
@@ -12778,6 +12885,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      if (ctx && is_gimple_omp_oacc (ctx->stmt))
+	oacc_record_vars_in_bind (ctx,
+				  gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 32eacf7863e..d8291125370 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -52,6 +52,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "attribs.h"
 #include "cfgloop.h"
+#include "convert.h"
 
 /* Describe the OpenACC looping structure of a function.  The entire
    function is held in a 'NULL' loop.  */
@@ -1082,7 +1083,9 @@ oacc_loop_xform_head_tail (gcall *from, int level)
 	    = ((enum ifn_unique_kind)
 	       TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
 
-	  if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN)
+	  if (k == IFN_UNIQUE_OACC_FORK
+	      || k == IFN_UNIQUE_OACC_JOIN
+	      || k == IFN_UNIQUE_OACC_PRIVATE)
 	    *gimple_call_arg_ptr (stmt, 2) = replacement;
 	  else if (k == kind && stmt != from)
 	    break;
@@ -1684,6 +1687,38 @@ execute_oacc_device_lower ()
 		case IFN_UNIQUE_OACC_TAIL_MARK:
 		  remove = true;
 		  break;
+
+		case IFN_UNIQUE_OACC_PRIVATE:
+		  {
+		    HOST_WIDE_INT level
+		      = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+		    if (level == -1)
+		      break;
+		    for (unsigned i = 3;
+			 i < gimple_call_num_args (call);
+			 i++)
+		      {
+			tree arg = gimple_call_arg (call, i);
+			gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
+			tree decl = TREE_OPERAND (arg, 0);
+			if (dump_file && (dump_flags & TDF_DETAILS))
+			  {
+			    static char const *const axes[] =
+			      /* Must be kept in sync with GOMP_DIM
+				 enumeration.  */
+			      { "gang", "worker", "vector" };
+			    fprintf (dump_file, "Decl UID %u has %s "
+				     "partitioning:", DECL_UID (decl),
+				     axes[level]);
+			    print_generic_decl (dump_file, decl, TDF_SLIM);
+			    fputc ('\n', dump_file);
+			  }
+			if (targetm.goacc.adjust_private_decl)
+			  targetm.goacc.adjust_private_decl (decl, level);
+		      }
+		    remove = true;
+		  }
+		  break;
 		}
 	      break;
 	    }
diff --git a/gcc/target.def b/gcc/target.def
index e0e856979a9..6f871ebc91d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1734,6 +1734,23 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_var_decl,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelerator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expansion resumes.",
+rtx, (tree var),
+NULL)
+
+DEFHOOK
+(adjust_private_decl,
+"Tweak variable declaration for a private variable at the specified\n\
+parallelism level.",
+void, (tree var, int),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 00000000000..28222c25da3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+	#pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+	#pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 00000000000..a4f81a39e24
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
new file mode 100644
index 00000000000..1b2aaea6ac4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
@@ -0,0 +1,25 @@
+! Test for "oacc gangprivate" attribute on gang-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang private(w)
+! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has gang partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } }
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
new file mode 100644
index 00000000000..f4e67b0c708
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
@@ -0,0 +1,25 @@
+! Test for worker-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-oaccdevlow-details" }
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang worker private(w)
+! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning:  integer\\(kind=4\\) w;" 1 "oaccdevlow" } }
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
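
For readers unfamiliar with the two new hooks, here is a rough, hypothetical
sketch (not part of the patch; all "example_*" names are invented) of how an
accelerator back-end might wire them up, modelled on the nvptx implementation
above.  A real implementation would also assign each decl its own offset
within the shared buffer, as the nvptx code does:

/* Hypothetical back-end sketch; illustrative only.  */

/* Invented target-specific shared-memory base symbol.  */
static GTY(()) rtx example_shared_sym;

/* Implement TARGET_GOACC_ADJUST_PRIVATE_DECL: tag gang-private decls so
   that expansion can recognize them later.  */

static void
example_goacc_adjust_private_decl (tree var, int level)
{
  if (level != GOMP_DIM_GANG)
    return;
  if (!lookup_attribute ("example gangprivate", DECL_ATTRIBUTES (var)))
    DECL_ATTRIBUTES (var)
      = tree_cons (get_identifier ("example gangprivate"), NULL_TREE,
                   DECL_ATTRIBUTES (var));
}

/* Implement TARGET_GOACC_EXPAND_VAR_DECL: return a MEM in shared memory
   for tagged decls, or NULL_RTX to resume normal expansion.  */

static rtx
example_goacc_expand_var_decl (tree var)
{
  if (VAR_P (var)
      && lookup_attribute ("example gangprivate", DECL_ATTRIBUTES (var)))
    return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), example_shared_sym);
  return NULL_RTX;
}

#undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
#define TARGET_GOACC_ADJUST_PRIVATE_DECL example_goacc_adjust_private_decl
#undef  TARGET_GOACC_EXPAND_VAR_DECL
#define TARGET_GOACC_EXPAND_VAR_DECL example_goacc_expand_var_decl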


Thread overview: 19+ messages
2019-11-15 21:41 [PATCH 00/13] AMD GCN worker partitioning support Julian Brown
2019-11-15 21:41 ` [PATCH 02/13] Target-dependent gang-private variable decl rewriting Julian Brown
2019-11-15 21:42 ` [PATCH 01/13] Add support for gang local storage allocation in shared memory Julian Brown
2019-12-05  2:40   ` Julian Brown
2019-11-15 21:43 ` [PATCH 03/13] Rewrite OpenACC private or reduction reference variables Julian Brown
2019-11-15 21:43 ` [PATCH 04/13] OpenACC middle-end worker-partitioning support Julian Brown
2019-11-15 21:43 ` [PATCH 05/13] AMD GCN adjustments for middle-end worker partitioning Julian Brown
2019-11-15 21:44 ` [PATCH 08/13] Fix host-to-device copies from rodata for AMD GCN Julian Brown
2019-11-18 11:00   ` Andrew Stubbs
2019-11-15 21:44 ` [PATCH 06/13] Fix up tests for oaccdevlow pass split Julian Brown
2019-11-15 21:44 ` [PATCH 07/13] Fix OpenACC "ephemeral" asynchronous host-to-device copies Julian Brown
2019-11-15 21:45 ` [PATCH 10/13] Race conditions in OpenACC async tests Julian Brown
2019-11-15 21:45 ` [PATCH 09/13] AMD GCN libgomp plugin queue-full condition locking fix Julian Brown
2019-11-18 11:10   ` Andrew Stubbs
2019-11-15 21:45 ` [PATCH 11/13] AMD GCN symbol output with null cfun Julian Brown
2019-11-18 11:12   ` Andrew Stubbs
2019-11-15 21:46 ` [PATCH 12/13] Fix parallel-dims.f90 for AMD GCN Julian Brown
2019-11-15 21:58 ` [PATCH 13/13] Enable worker partitioning " Julian Brown
2019-11-18 11:24   ` Andrew Stubbs
