[PATCH 0/4] openacc: Worker partitioning in the middle end

public inbox for gcc-patches@gcc.gnu.org

* [PATCH 0/4] openacc: Worker partitioning in the middle end
@ 2021-03-02 12:20 Julian Brown
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Julian Brown @ 2021-03-02 12:20 UTC
  To: gcc-patches
  Cc: Thomas Schwinge, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

This series contains updated parts of the patch series that was previously
sent upstream in November 2019:

  https://gcc.gnu.org/pipermail/gcc-patches/2019-November/534547.html

The purpose of the series is to enable multiple workers for OpenACC
(workers being one of the dimensions of parallelism supported by the
standard) on targets such as AMD GCN. (NVPTX uses its own scheme for
supporting multiple workers, implemented mostly in the backend.)

Tested with offloading to AMD GCN and (separately) to NVPTX.

Further commentary is provided alongside individual patches. I'm posting
these patches for review now, but I don't expect to commit them until
stage 1.

Thanks,

Julian

Julian Brown (4):
  openacc: Middle-end worker-partitioning support
  openacc: Fix async bugs in several OpenACC test cases
  amdgcn: Enable OpenACC worker partitioning for AMD GCN
  openacc: Reference-typed reduction and private variable rewriting

 gcc/Makefile.in                               |    1 +
 gcc/config/gcn/gcn-protos.h                   |    2 +-
 gcc/config/gcn/gcn-tree.c                     |    6 +-
 gcc/config/gcn/gcn.c                          |   23 +-
 gcc/config/gcn/gcn.opt                        |    5 -
 gcc/doc/tm.texi                               |   10 +
 gcc/doc/tm.texi.in                            |    4 +
 gcc/gimplify.c                                |  117 ++
 gcc/oacc-neuter-bcast.c                       | 1471 +++++++++++++++++
 gcc/oacc-neuter-bcast.h                       |   26 +
 gcc/omp-builtins.def                          |    8 +
 gcc/omp-low.c                                 |   47 +-
 gcc/omp-offload.c                             |  159 +-
 gcc/omp-offload.h                             |    1 +
 gcc/passes.def                                |    2 +
 gcc/target.def                                |   13 +
 gcc/targhooks.h                               |    1 +
 .../goacc/classify-kernels-unparallelized.c   |    8 +-
 .../c-c++-common/goacc/classify-kernels.c     |    8 +-
 .../c-c++-common/goacc/classify-parallel.c    |    8 +-
 .../c-c++-common/goacc/classify-routine.c     |    8 +-
 .../c-c++-common/goacc/classify-serial.c      |    8 +-
 .../gcc.dg/goacc/loop-processing-1.c          |    2 +-
 .../goacc/classify-kernels-unparallelized.f95 |    8 +-
 .../gfortran.dg/goacc/classify-kernels.f95    |    8 +-
 .../gfortran.dg/goacc/classify-parallel.f95   |    8 +-
 .../gfortran.dg/goacc/classify-routine.f95    |    8 +-
 .../gfortran.dg/goacc/classify-serial.f95     |    8 +-
 gcc/tree-core.h                               |    4 +-
 gcc/tree-pass.h                               |    2 +
 gcc/tree.c                                    |   11 +-
 gcc/tree.h                                    |    2 +
 libgomp/plugin/plugin-gcn.c                   |    4 +-
 .../libgomp.oacc-c++/privatized-ref-2.C       |   64 +
 .../libgomp.oacc-c++/privatized-ref-3.C       |   64 +
 .../libgomp.oacc-c-c++-common/deep-copy-10.c  |   14 +-
 .../loop-dim-default.c                        |   11 +-
 .../libgomp.oacc-c-c++-common/parallel-dims.c |   13 +-
 .../libgomp.oacc-fortran/lib-16-2.f90         |    5 +
 .../testsuite/libgomp.oacc-fortran/lib-16.f90 |    5 +
 .../libgomp.oacc-fortran/parallel-dims-aux.c  |    9 +-
 .../libgomp.oacc-fortran/privatized-ref-1.f95 |   71 +
 42 files changed, 2112 insertions(+), 145 deletions(-)
 create mode 100644 gcc/oacc-neuter-bcast.c
 create mode 100644 gcc/oacc-neuter-bcast.h
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95

-- 
2.29.2


* [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 [PATCH 0/4] openacc: Worker partitioning in the middle end Julian Brown
@ 2021-03-02 12:20 ` Julian Brown
  2021-07-29  7:49   ` [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
                     ` (5 more replies)
  2021-03-02 12:20 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
                   ` (2 subsequent siblings)
  3 siblings, 6 replies; 20+ messages in thread
From: Julian Brown @ 2021-03-02 12:20 UTC
  To: gcc-patches
  Cc: Thomas Schwinge, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

A version of this patch was previously posted here:

  https://gcc.gnu.org/pipermail/gcc-patches/2019-November/534553.html

This patch implements worker-partitioning support in the middle end,
by rewriting gimple. The OpenACC execution model requires that code
can run in either "worker single" mode where only a single worker per
gang is active, or "worker partitioned" mode, where multiple workers
per gang are active. This means we need to do something equivalent
to spawning additional workers when transitioning from worker-single
to worker-partitioned mode. However, GPUs typically fix the number of
threads of invoked kernels at launch time, so we need to do something
with the "extra" threads when they are not wanted.
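
To make the two modes concrete, here is a minimal sketch of user code that
moves between them. It is not taken from the patch series: the function
name scale_rows, the num_workers value and the data layout are all
illustrative assumptions. The assignment to "bias" sits inside the gang
loop but outside the worker loop, so it logically executes in worker-single
mode, while the inner loop body executes in worker-partitioned mode and
needs the value of "bias" in every worker.

```cpp
#include <cassert>

// Illustrative only: a hypothetical kernel, not code from the patch.
// Adds a per-row bias to every element of a rows x cols matrix.
void scale_rows(const int *in, int *out, int rows, int cols)
{
  assert(rows >= 0 && cols >= 0);

#pragma acc parallel num_workers(4) copyin(in[0:rows * cols]) \
                     copyout(out[0:rows * cols])
  {
#pragma acc loop gang
    for (int r = 0; r < rows; r++)
      {
        // Worker-single mode: logically one worker per gang computes this.
        int bias = r * 10;

#pragma acc loop worker
        for (int c = 0; c < cols; c++)
          // Worker-partitioned mode: iterations are shared among the
          // workers of the gang, and each of them reads 'bias'.
          out[r * cols + c] = in[r * cols + c] + bias;
      }
  }
}
```

When built without OpenACC support the pragmas are ignored and the loops
run sequentially, which should give the same result as the offloaded
version.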

The scheme used is to conditionalise each basic block that executes
in "worker single" mode for worker 0 only. Conditional branches
are handled specially so "idle" (non-0) workers follow along with
worker 0. On transitioning to "worker partitioned" mode, any variables
modified by worker 0 are propagated to the other workers via GPU shared
memory. Special care is taken for routine calls, writes through pointers,
and so forth, as follows:

  - There are two types of function calls to consider in worker-single
    mode: "normal" calls (to maths library routines, etc.) are made from
    worker 0 only. OpenACC routines may contain worker-partitioned loops
    themselves, so they are called from all workers, including "idle" ones.

  - SSA names set in worker-single mode, but used in worker-partitioned
    mode, are copied to shared memory in worker 0. Other workers retrieve
    the value from the appropriate shared-memory location after a barrier,
    and new phi nodes are introduced at the convergence point to resolve
    the worker 0/other worker copies of the value.

  - Local scalar variables (on the stack) also need special handling. We
    broadcast any variables that are written in the current worker-single
    block, and that are read in any worker-partitioned block.  (This is
    believed to be safe, and is flow-insensitive to ease analysis.)

  - Local aggregates (arrays and composites) on the stack are *not*
    broadcast. Instead we force gimple stmts modifying elements/fields of
    local aggregates into fully-partitioned mode. The RHS of the
    assignment is a scalar, and is thus subject to broadcasting as above.

  - Writes through pointers may affect any local variable that has
    its address taken. We use points-to analysis to determine the set
    of potentially-affected variables for a given pointer indirection.
    We broadcast any such variable which is used in worker-partitioned
    mode, on a per-block basis for any block containing a write through
    a pointer.
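
The scheme above can be modelled on the CPU roughly as follows. This is
only a sequential sketch of the neutering/broadcasting idea, not the gimple
that the pass generates: the names PropagationRecord and run_gang are
hypothetical, the "workers" are simulated by loops over a worker index, and
the barrier is implied by one loop finishing before the next one starts.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the record placed in GPU shared memory.
struct PropagationRecord
{
  int x;      // scalar set in worker-single mode, used when partitioned
  bool pred;  // branch condition, so idle workers can follow worker 0
};

// Simulate one gang of NUM_WORKERS workers and return what each worker
// computes in the worker-partitioned block.
std::vector<int> run_gang(int num_workers, int input)
{
  assert(num_workers >= 1);

  PropagationRecord shared = {0, false};
  std::vector<int> x(num_workers, 0);         // per-worker copy of x
  std::vector<bool> pred(num_workers, false); // per-worker copy of pred
  std::vector<int> result(num_workers, 0);

  // Worker-single block: the body is neutered for all workers except 0.
  for (int w = 0; w < num_workers; w++)
    if (w == 0)
      {
        x[w] = input * 2 + 1;
        pred[w] = input > 0;
        // Sender side: worker 0 stores the values into shared memory.
        shared.x = x[w];
        shared.pred = pred[w];
      }

  // (A barrier would go here on a real GPU.)

  // Receiver side: the other workers load the broadcast values.
  for (int w = 1; w < num_workers; w++)
    {
      x[w] = shared.x;
      pred[w] = shared.pred;
    }

  // Worker-partitioned block: every worker is active, takes the same
  // branch as worker 0, and sees the same value of x.
  for (int w = 0; w < num_workers; w++)
    result[w] = pred[w] ? x[w] + w : -1;

  return result;
}
```

In this model, calling run_gang (4, 5) makes every worker see x == 11 and
take the "true" branch, so worker w produces 11 + w.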

Some slides about the implementation (from 2018) are available at:

  https://jtb20.github.io/gcnworkers.pdf

This version of the patch includes several follow-on bug fixes by myself
and Kwok. This version also avoids moving SESE-region finding code out
of the NVPTX backend, since that code isn't used by the middle-end worker
partitioning neutering/broadcasting implementation yet.

This patch (and the rest of the series) should be applied on top of the
private variable patches posted previously here:

  https://gcc.gnu.org/pipermail/gcc-patches/2021-February/565925.html

Tested with offloading to AMD GCN. OK for stage 1?

Julian

2021-03-02  Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell  <nathan@acm.org>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>

gcc/
	* Makefile.in (OBJS): Add oacc-neuter-bcast.o.
	* doc/tm.texi.in (TARGET_GOACC_WORKER_PARTITIONING,
	TARGET_GOACC_CREATE_PROPAGATION_RECORD): Add documentation hooks.
	* doc/tm.texi: Regenerate.
	* oacc-neuter-bcast.c: New file.
	* oacc-neuter-bcast.h: New file.
	* omp-builtins.def (BUILT_IN_GOACC_BARRIER, BUILT_IN_GOACC_SINGLE_START,
	BUILT_IN_GOACC_SINGLE_COPY_START, BUILT_IN_GOACC_SINGLE_COPY_END): New
	builtins.
	* omp-offload.c (oacc-neuter-bcast.h): Include header.
	(oacc_loop_xform_head_tail): Call update_stmt for modified builtin
	calls.
	(oacc_loop_process): Likewise.
	(default_goacc_create_propagation_record): New function.
	(execute_oacc_loop_designation): New.  Split out of oacc_device_lower.
	(execute_oacc_gimple_workers): New.  Likewise.
	(execute_oacc_device_lower): Recreate dims array.
	(pass_data_oacc_loop_designation, pass_data_oacc_gimple_workers): New.
	(pass_oacc_loop_designation, pass_oacc_gimple_workers): New.
	(make_pass_oacc_loop_designation, make_pass_oacc_gimple_workers): New.
	* omp-offload.h (oacc_fn_attrib_level): Add prototype.
	* passes.def (pass_oacc_loop_designation, pass_oacc_gimple_workers):
	Add passes.
	* target.def (worker_partitioning, create_propagation_record): Add
	target hooks.
	* targhooks.h (default_goacc_create_propagation_record): Add prototype.
	* tree-pass.h (make_pass_oacc_loop_designation,
	make_pass_oacc_gimple_workers): Add prototypes.

gcc/testsuite/
	* c-c++-common/goacc/classify-kernels-unparallelized.c: Scan oaccloops
	dump instead of oaccdevlow.
	* c-c++-common/goacc/classify-kernels.c: Likewise.
	* c-c++-common/goacc/classify-parallel.c: Likewise.
	* c-c++-common/goacc/classify-routine.c: Likewise.
	* c-c++-common/goacc/classify-serial.c: Likewise.
	* gcc.dg/goacc/loop-processing-1.c: Likewise.
	* gfortran.dg/goacc/classify-kernels-unparallelized.f95: Likewise.
	* gfortran.dg/goacc/classify-kernels.f95: Likewise.
	* gfortran.dg/goacc/classify-parallel.f95: Likewise.
	* gfortran.dg/goacc/classify-routine.f95: Likewise.
	* gfortran.dg/goacc/classify-serial.f95: Likewise.
---
 gcc/Makefile.in                               |    1 +
 gcc/doc/tm.texi                               |   10 +
 gcc/doc/tm.texi.in                            |    4 +
 gcc/oacc-neuter-bcast.c                       | 1471 +++++++++++++++++
 gcc/oacc-neuter-bcast.h                       |   26 +
 gcc/omp-builtins.def                          |    8 +
 gcc/omp-offload.c                             |  159 +-
 gcc/omp-offload.h                             |    1 +
 gcc/passes.def                                |    2 +
 gcc/target.def                                |   13 +
 gcc/targhooks.h                               |    1 +
 .../goacc/classify-kernels-unparallelized.c   |    8 +-
 .../c-c++-common/goacc/classify-kernels.c     |    8 +-
 .../c-c++-common/goacc/classify-parallel.c    |    8 +-
 .../c-c++-common/goacc/classify-routine.c     |    8 +-
 .../c-c++-common/goacc/classify-serial.c      |    8 +-
 .../gcc.dg/goacc/loop-processing-1.c          |    2 +-
 .../goacc/classify-kernels-unparallelized.f95 |    8 +-
 .../gfortran.dg/goacc/classify-kernels.f95    |    8 +-
 .../gfortran.dg/goacc/classify-parallel.f95   |    8 +-
 .../gfortran.dg/goacc/classify-routine.f95    |    8 +-
 .../gfortran.dg/goacc/classify-serial.f95     |    8 +-
 gcc/tree-pass.h                               |    2 +
 23 files changed, 1720 insertions(+), 60 deletions(-)
 create mode 100644 gcc/oacc-neuter-bcast.c
 create mode 100644 gcc/oacc-neuter-bcast.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index a63c5d9cab6..c3be0633932 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1500,6 +1500,7 @@ OBJS = \
 	omp-offload.o \
 	omp-expand.o \
 	omp-general.o \
+	oacc-neuter-bcast.o \
 	omp-low.o \
 	omp-oacc-kernels-decompose.o \
 	omp-simd-clone.o \
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 94927ea7b2b..9936fa6b6f9 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6253,6 +6253,16 @@ adjusted variable declaration needs to be expanded to RTL in a non-standard
 way.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_GOACC_WORKER_PARTITIONING
+Use gimple transformation for worker neutering/broadcasting.
+@end deftypevr
+
+@deftypefn {Target Hook} tree TARGET_GOACC_CREATE_PROPAGATION_RECORD (tree @var{rec}, bool @var{sender}, const char *@var{name})
+Create a record used to propagate local-variable state from an active
+worker to other workers.  A possible implementation might adjust the type
+of REC to place the new variable in shared GPU memory.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b8c23cf6db5..a10aed1d78d 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4223,6 +4223,10 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_ADJUST_PRIVATE_DECL
 
+@hook TARGET_GOACC_WORKER_PARTITIONING
+
+@hook TARGET_GOACC_CREATE_PROPAGATION_RECORD
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/oacc-neuter-bcast.c b/gcc/oacc-neuter-bcast.c
new file mode 100644
index 00000000000..e258b88bf66
--- /dev/null
+++ b/gcc/oacc-neuter-bcast.c
@@ -0,0 +1,1471 @@
+/* Implement worker partitioning for OpenACC via neutering/broadcasting scheme.
+
+   Copyright (C) 2015-2021 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "cgraph.h"
+#include "pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimple-walk.h"
+#include "tree-inline.h"
+#include "langhooks.h"
+#include "omp-general.h"
+#include "omp-low.h"
+#include "gimple-pretty-print.h"
+#include "cfghooks.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "internal-fn.h"
+#include "bitmap.h"
+#include "tree-nested.h"
+#include "stor-layout.h"
+#include "tree-ssa-threadupdate.h"
+#include "tree-into-ssa.h"
+#include "splay-tree.h"
+#include "target.h"
+#include "cfgloop.h"
+#include "tree-cfg.h"
+#include "omp-offload.h"
+#include "attribs.h"
+#include "oacc-neuter-bcast.h"
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
+struct parallel_g
+{
+  /* Parent parallel.  */
+  parallel_g *parent;
+
+  /* Next sibling parallel.  */
+  parallel_g *next;
+
+  /* First child parallel.  */
+  parallel_g *inner;
+
+  /* Partitioning mask of the parallel.  */
+  unsigned mask;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after
+     the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  gimple *forked_stmt;
+  gimple *join_stmt;
+
+  gimple *fork_stmt;
+  gimple *joining_stmt;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+  tree record_type;
+  tree sender_decl;
+  tree receiver_decl;
+
+public:
+  parallel_g (parallel_g *parent, unsigned mode);
+  ~parallel_g ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel_g::parallel_g (parallel_g *parent_, unsigned mask_)
+  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_stmt = join_stmt = NULL;
+  fork_stmt = joining_stmt = NULL;
+
+  record_type = NULL_TREE;
+  sender_decl = NULL_TREE;
+  receiver_decl = NULL_TREE;
+
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel_g::~parallel_g ()
+{
+  delete inner;
+  delete next;
+}
+
+static bool
+local_var_based_p (tree decl)
+{
+  switch (TREE_CODE (decl))
+    {
+    case VAR_DECL:
+      return !is_global_var (decl);
+
+    case COMPONENT_REF:
+    case BIT_FIELD_REF:
+    case ARRAY_REF:
+      return local_var_based_p (TREE_OPERAND (decl, 0));
+
+    default:
+      return false;
+    }
+}
+
+/* Map of basic blocks to gimple stmts.  */
+typedef hash_map<basic_block, gimple *> bb_stmt_map_t;
+
+/* Calls to OpenACC routines are made by all workers/wavefronts/warps, since
+   the routine likely contains partitioned loops (else will do its own
+   neutering and variable propagation). Return TRUE if a function call CALL
+   should be made in (worker) single mode instead, rather than redundant
+   mode.  */
+
+static bool
+omp_sese_active_worker_call (gcall *call)
+{
+#define GOMP_DIM_SEQ GOMP_DIM_MAX
+  tree fndecl = gimple_call_fndecl (call);
+
+  if (!fndecl)
+    return true;
+
+  tree attrs = oacc_get_fn_attrib (fndecl);
+
+  if (!attrs)
+    return true;
+
+  int level = oacc_fn_attrib_level (attrs);
+
+  /* Neither regular functions nor "seq" routines should be run by all threads
+     in worker-single mode.  */
+  return level == -1 || level == GOMP_DIM_SEQ;
+#undef GOMP_DIM_SEQ
+}
+
+/* Split basic blocks such that each forked and join unspecs are at
+   the start of their basic blocks.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Return the
+   partitioning mode of the function as a whole.  Populate MAP with
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding partitions.  */
+
+static void
+omp_sese_split_blocks (bb_stmt_map_t *map)
+{
+  auto_vec<gimple *> worklist;
+  basic_block block;
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      /* Clear visited flag, for use by parallel locator  */
+      block->flags &= ~BB_VISITED;
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_JOIN)
+		worklist.safe_push (stmt);
+	      else if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  gcc_assert (gsi_one_before_end_p (gsi));
+		  basic_block forked_block = single_succ (block);
+		  gimple_stmt_iterator gsi2 = gsi_start_bb (forked_block);
+
+		  /* We push a NOP as a placeholder for the "forked" stmt.
+		     This is then recognized in omp_sese_find_par.  */
+		  gimple *nop = gimple_build_nop ();
+		  gsi_insert_before (&gsi2, nop, GSI_SAME_STMT);
+
+		  worklist.safe_push (nop);
+		}
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_RETURN
+		   || gimple_code (stmt) == GIMPLE_COND
+		   || gimple_code (stmt) == GIMPLE_SWITCH
+		   || (gimple_code (stmt) == GIMPLE_CALL
+		       && !gimple_call_internal_p (stmt)
+		       && !omp_sese_active_worker_call (as_a <gcall *> (stmt))))
+	    worklist.safe_push (stmt);
+	  else if (is_gimple_assign (stmt))
+	    {
+	      tree lhs = gimple_assign_lhs (stmt);
+
+	      /* Force assignments to components/fields/elements of local
+		 aggregates into fully-partitioned (redundant) mode.  This
+		 avoids having to broadcast the whole aggregate.  The RHS of
+		 the assignment will be propagated using the normal
+		 mechanism.  */
+
+	      switch (TREE_CODE (lhs))
+		{
+		case COMPONENT_REF:
+		case BIT_FIELD_REF:
+		case ARRAY_REF:
+		  {
+		    tree aggr = TREE_OPERAND (lhs, 0);
+
+		    if (local_var_based_p (aggr))
+		      worklist.safe_push (stmt);
+		  }
+		  break;
+
+		default:
+		  ;
+		}
+	    }
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  gimple *stmt;
+
+  for (ix = 0; worklist.iterate (ix, &stmt); ix++)
+    {
+      basic_block block = gimple_bb (stmt);
+
+      if (gimple_code (stmt) == GIMPLE_COND)
+	{
+	  gcond *orig_cond = as_a <gcond *> (stmt);
+	  tree_code code = gimple_expr_code (orig_cond);
+	  tree pred = make_ssa_name (boolean_type_node);
+	  gimple *asgn = gimple_build_assign (pred, code,
+			   gimple_cond_lhs (orig_cond),
+			   gimple_cond_rhs (orig_cond));
+	  gcond *new_cond
+	    = gimple_build_cond (NE_EXPR, pred, boolean_false_node,
+				 gimple_cond_true_label (orig_cond),
+				 gimple_cond_false_label (orig_cond));
+
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_insert_before (&gsi, asgn, GSI_SAME_STMT);
+	  gsi_replace (&gsi, new_cond, true);
+
+	  edge e = split_block (block, asgn);
+	  block = e->dest;
+	  map->get_or_insert (block) = new_cond;
+	}
+      else if ((gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt))
+	       || is_gimple_assign (stmt))
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  edge call = split_block (block, gsi_stmt (gsi));
+
+	  gimple *call_stmt = gsi_stmt (gsi_start_bb (call->dest));
+
+	  edge call_to_ret = split_block (call->dest, call_stmt);
+
+	  map->get_or_insert (call_to_ret->src) = call_stmt;
+	}
+      else
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  if (gsi_end_p (gsi))
+	    map->get_or_insert (block) = stmt;
+	  else
+	    {
+	      /* Split block before insn. The insn is in the new block.  */
+	      edge e = split_block (block, gsi_stmt (gsi));
+
+	      block = e->dest;
+	      map->get_or_insert (block) = stmt;
+	    }
+	}
+    }
+}
+
+static const char *
+mask_name (unsigned mask)
+{
+  switch (mask)
+    {
+    case 0: return "gang redundant";
+    case 1: return "gang partitioned";
+    case 2: return "worker partitioned";
+    case 3: return "gang+worker partitioned";
+    case 4: return "vector partitioned";
+    case 5: return "gang+vector partitioned";
+    case 6: return "worker+vector partitioned";
+    case 7: return "fully partitioned";
+    default: return "<illegal>";
+    }
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+omp_sese_dump_pars (parallel_g *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mask %d (%s) head=%d, tail=%d\n",
+	   depth, par->mask, mask_name (par->mask),
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    omp_sese_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    omp_sese_dump_pars (par->next, depth);
+}
+
+/* If BLOCK contains a fork/join marker, process it to create or
+   terminate a loop structure.  Add this block to the current loop,
+   and then walk successor blocks.   */
+
+static parallel_g *
+omp_sese_find_par (bb_stmt_map_t *map, parallel_g *par, basic_block block)
+{
+  if (block->flags & BB_VISITED)
+    return par;
+  block->flags |= BB_VISITED;
+
+  if (gimple **stmtp = map->get (block))
+    {
+      gimple *stmt = *stmtp;
+
+      if (gimple_code (stmt) == GIMPLE_COND
+	  || gimple_code (stmt) == GIMPLE_SWITCH
+	  || gimple_code (stmt) == GIMPLE_RETURN
+	  || (gimple_code (stmt) == GIMPLE_CALL
+	      && !gimple_call_internal_p (stmt))
+	  || is_gimple_assign (stmt))
+	{
+	  /* A single block that is forced to be at the maximum partition
+	     level.  Make a singleton par for it.  */
+	  par = new parallel_g (par, GOMP_DIM_MASK (GOMP_DIM_GANG)
+				   | GOMP_DIM_MASK (GOMP_DIM_WORKER)
+				   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+	  par->forked_block = block;
+	  par->forked_stmt = stmt;
+	  par->blocks.safe_push (block);
+	  par = par->parent;
+	  goto walk_successors;
+	}
+      else if (gimple_nop_p (stmt))
+	{
+	  basic_block pred = single_pred (block);
+	  gcc_assert (pred);
+	  gimple_stmt_iterator gsi = gsi_last_bb (pred);
+	  gimple *final_stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (final_stmt, IFN_UNIQUE))
+	    {
+	      gcall *call = as_a <gcall *> (final_stmt);
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  HOST_WIDE_INT dim
+		    = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+		  unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+		  par = new parallel_g (par, mask);
+		  par->forked_block = block;
+		  par->forked_stmt = final_stmt;
+		  par->fork_stmt = stmt;
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+	  enum ifn_unique_kind k = ((enum ifn_unique_kind)
+	    TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+	  if (k == IFN_UNIQUE_OACC_JOIN)
+	    {
+	      HOST_WIDE_INT dim = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+	      unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+	      gcc_assert (par->mask == mask);
+	      par->join_block = block;
+	      par->join_stmt = stmt;
+	      par = par->parent;
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else
+	gcc_unreachable ();
+    }
+
+  if (par)
+    /* Add this block onto the current loop's list of blocks.  */
+    par->blocks.safe_push (block);
+  else
+    /* This must be the entry block.  Create a NULL parallel.  */
+    par = new parallel_g (0, 0);
+
+walk_successors:
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    omp_sese_find_par (map, par, e->dest);
+
+  return par;
+}
+
+/* DFS walk the CFG looking for fork & join markers.  Construct
+   loop structures as we go.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel_g *
+omp_sese_discover_pars (bb_stmt_map_t *map)
+{
+  basic_block block;
+
+  /* Mark exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+
+  /* And entry block as not.  */
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+
+  parallel_g *par = omp_sese_find_par (map, 0, block);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      omp_sese_dump_pars (par, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  return par;
+}
+
+static void
+populate_single_mode_bitmaps (parallel_g *par, bitmap worker_single,
+			      bitmap vector_single, unsigned outer_mask,
+			      int depth)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  basic_block block;
+
+  for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+    {
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+	bitmap_set_bit (worker_single, block->index);
+
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) == 0)
+	bitmap_set_bit (vector_single, block->index);
+    }
+
+  if (par->inner)
+    populate_single_mode_bitmaps (par->inner, worker_single, vector_single,
+				  mask, depth + 1);
+  if (par->next)
+    populate_single_mode_bitmaps (par->next, worker_single, vector_single,
+				  outer_mask, depth);
+}
+
+/* A map from SSA names or var decls to record fields.  */
+
+typedef hash_map<tree, tree> field_map_t;
+
+/* For each propagation record type, this is a map from SSA names or var decls
+   to propagate, to the field in the record type that should be used for
+   transmission and reception.  */
+
+typedef hash_map<tree, field_map_t *> record_field_map_t;
+
+static GTY(()) record_field_map_t *field_map;
+
+static void
+install_var_field (tree var, tree record_type)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree name;
+  char tmp[20];
+
+  if (TREE_CODE (var) == SSA_NAME)
+    {
+      name = SSA_NAME_IDENTIFIER (var);
+      if (!name)
+	{
+	  sprintf (tmp, "_%u", (unsigned) SSA_NAME_VERSION (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else if (TREE_CODE (var) == VAR_DECL)
+    {
+      name = DECL_NAME (var);
+      if (!name)
+	{
+	  sprintf (tmp, "D_%u", (unsigned) DECL_UID (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  gcc_assert (!fields->get (var));
+
+  tree type = TREE_TYPE (var);
+
+  if (POINTER_TYPE_P (type)
+      && TYPE_RESTRICT (type))
+    type = build_qualified_type (type, TYPE_QUALS (type) & ~TYPE_QUAL_RESTRICT);
+
+  tree field = build_decl (BUILTINS_LOCATION, FIELD_DECL, name, type);
+
+  if (TREE_CODE (var) == VAR_DECL && type == TREE_TYPE (var))
+    {
+      SET_DECL_ALIGN (field, DECL_ALIGN (var));
+      DECL_USER_ALIGN (field) = DECL_USER_ALIGN (var);
+      TREE_THIS_VOLATILE (field) = TREE_THIS_VOLATILE (var);
+    }
+  else
+    SET_DECL_ALIGN (field, TYPE_ALIGN (type));
+
+  fields->put (var, field);
+
+  insert_field_into_struct (record_type, field);
+}
+
+/* Sets of SSA_NAMES or VAR_DECLs to propagate.  */
+typedef hash_set<tree> propagation_set;
+
+static void
+find_ssa_names_to_propagate (parallel_g *par, unsigned outer_mask,
+			     bitmap worker_single, bitmap vector_single,
+			     vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_ssa_names_to_propagate (par->inner, mask, worker_single,
+				 vector_single, prop_set);
+  if (par->next)
+    find_ssa_names_to_propagate (par->next, outer_mask, worker_single,
+				 vector_single, prop_set);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gphi_iterator psi = gsi_start_phis (block);
+	       !gsi_end_p (psi); gsi_next (&psi))
+	    {
+	      gphi *phi = psi.phi ();
+	      use_operand_p use;
+	      ssa_op_iter iter;
+
+	      FOR_EACH_PHI_ARG (use, phi, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  if (TREE_CODE (var) != SSA_NAME)
+		    continue;
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      use_operand_p use;
+	      ssa_op_iter iter;
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      FOR_EACH_SSA_USE_OPERAND (use, stmt, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Callback for walk_gimple_stmt to find RHS VAR_DECLs (uses) in a
+   statement.  */
+
+static tree
+find_partitioned_var_uses_1 (tree *node, int *, void *data)
+{
+  walk_stmt_info *wi = (walk_stmt_info *) data;
+  hash_set<tree> *partitioned_var_uses = (hash_set<tree> *) wi->info;
+
+  if (!wi->is_lhs && VAR_P (*node))
+    partitioned_var_uses->add (*node);
+
+  return NULL_TREE;
+}
+
+static void
+find_partitioned_var_uses (parallel_g *par, unsigned outer_mask,
+			   hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_partitioned_var_uses (par->inner, mask, partitioned_var_uses);
+  if (par->next)
+    find_partitioned_var_uses (par->next, outer_mask, partitioned_var_uses);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    walk_stmt_info wi;
+	    memset (&wi, 0, sizeof (wi));
+	    wi.info = (void *) partitioned_var_uses;
+	    walk_gimple_stmt (&gsi, NULL, find_partitioned_var_uses_1, &wi);
+	  }
+    }
+}
+
+/* Gang-private variables (typically placed in a GPU's shared memory) do not
+   need to be processed by the worker-propagation mechanism.  Populate the
+   GANGPRIVATE_VARS set with any such variables found in the current
+   function.  */
+
+static void
+find_gangprivate_vars (hash_set<tree> *gangprivate_vars)
+{
+  basic_block block;
+
+  FOR_EACH_BB_FN (block, cfun)
+    {
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+	      if (k == IFN_UNIQUE_OACC_PRIVATE)
+		{
+		  HOST_WIDE_INT level
+		    = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+		  if (level != GOMP_DIM_GANG)
+		    continue;
+		  for (unsigned i = 3; i < gimple_call_num_args (stmt); i++)
+		    {
+		      tree arg = gimple_call_arg (stmt, i);
+		      gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
+		      tree decl = TREE_OPERAND (arg, 0);
+		      gangprivate_vars->add (decl);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+static void
+find_local_vars_to_propagate (parallel_g *par, unsigned outer_mask,
+			      hash_set<tree> *partitioned_var_uses,
+			      hash_set<tree> *gangprivate_vars,
+			      vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_local_vars_to_propagate (par->inner, mask, partitioned_var_uses,
+				  gangprivate_vars, prop_set);
+  if (par->next)
+    find_local_vars_to_propagate (par->next, outer_mask, partitioned_var_uses,
+				  gangprivate_vars, prop_set);
+
+  if (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      tree var;
+	      unsigned i;
+
+	      FOR_EACH_LOCAL_DECL (cfun, i, var)
+		{
+		  if (!VAR_P (var)
+		      || is_global_var (var)
+		      || AGGREGATE_TYPE_P (TREE_TYPE (var))
+		      || !partitioned_var_uses->contains (var)
+		      || gangprivate_vars->contains (var))
+		    continue;
+
+		  if (stmt_may_clobber_ref_p (stmt, var))
+		    {
+		      if (dump_file)
+			{
+			  fprintf (dump_file, "bb %u: local variable may be "
+				   "clobbered in %s mode: ", block->index,
+				   mask_name (mask));
+			  print_generic_expr (dump_file, var, TDF_SLIM);
+			  fprintf (dump_file, "\n");
+			}
+
+		      if (!(*prop_set)[block->index])
+			(*prop_set)[block->index] = new propagation_set;
+
+		      propagation_set *ws_prop
+			= (*prop_set)[block->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Transform basic blocks FROM, TO (which may be the same block) into:
+   if (GOACC_single_start ())
+     BLOCK;
+   GOACC_barrier ();
+			      \  |  /
+			      +----+
+			      |    |        (new) predicate block
+			      +----+--
+   \  |  /   \  |  /	        |t    \
+   +----+    +----+	      +----+  |
+   |	|    |    |	===>  |    |  | f   (old) from block
+   +----+    +----+	      +----+  |
+     |       t/  \f	        |    /
+			      +----+/
+  (split  (split before       |    |        skip block
+  at end)   condition)	      +----+
+			      t/  \f
+*/
+
+static void
+worker_single_simple (basic_block from, basic_block to,
+		      hash_set<tree> *def_escapes_block)
+{
+  gimple *call, *cond;
+  tree lhs, decl;
+  basic_block skip_block;
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    {
+      gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_COND);
+      gsi_prev (&gsi);
+    }
+  edge e = split_block (to, gsi_stmt (gsi));
+  skip_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_START);
+  lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+  call = gimple_build_call (decl, 0);
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  cond = gimple_build_cond (EQ_EXPR, lhs,
+			    fold_convert_loc (UNKNOWN_LOCATION,
+					      TREE_TYPE (lhs),
+					      boolean_true_node),
+			    NULL_TREE, NULL_TREE);
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+  update_stmt (cond);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  edge ef = make_edge (from, skip_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  basic_block neutered = split_edge (ef);
+  gimple_stmt_iterator neut_gsi = gsi_last_bb (neutered);
+
+  for (gsi = gsi_start_bb (et->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      ssa_op_iter iter;
+      tree var;
+
+      FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+	{
+	  if (def_escapes_block->contains (var))
+	    {
+	      gphi *join_phi = create_phi_node (NULL_TREE, skip_block);
+	      create_new_def_for (var, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, var, e, UNKNOWN_LOCATION);
+
+	      tree neutered_def = copy_ssa_name (var, NULL);
+	      /* We really want "don't care" or some value representing
+		 undefined here, but optimizers will probably get rid of the
+		 zero-assignments anyway.  */
+	      gassign *zero = gimple_build_assign (neutered_def,
+				build_zero_cst (TREE_TYPE (neutered_def)));
+
+	      gsi_insert_after (&neut_gsi, zero, GSI_CONTINUE_LINKING);
+	      update_stmt (zero);
+
+	      add_phi_arg (join_phi, neutered_def, single_succ_edge (neutered),
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+	}
+    }
+
+  gsi = gsi_start_bb (skip_block);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gsi_insert_before (&gsi, acc_bar, GSI_SAME_STMT);
+  update_stmt (acc_bar);
+}
+
+/* This is a renamed copy of omp-low.c:omp_build_component_ref.  */
+
+static tree
+oacc_build_component_ref (tree obj, tree field)
+{
+  tree field_type = TREE_TYPE (field);
+  tree obj_type = TREE_TYPE (obj);
+  if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (obj_type)))
+    field_type = build_qualified_type
+			(field_type,
+			 KEEP_QUAL_ADDR_SPACE (TYPE_QUALS (obj_type)));
+
+  tree ret = build3 (COMPONENT_REF, field_type, obj, field, NULL);
+  if (TREE_THIS_VOLATILE (field))
+    TREE_THIS_VOLATILE (ret) |= 1;
+  if (TREE_READONLY (field))
+    TREE_READONLY (ret) |= 1;
+  return ret;
+}
+
+static tree
+build_receiver_ref (tree record_type, tree var, tree receiver_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree x = build_simple_mem_ref (receiver_decl);
+  tree field = *fields->get (var);
+  TREE_THIS_NOTRAP (x) = 1;
+  x = oacc_build_component_ref (x, field);
+  return x;
+}
+
+static tree
+build_sender_ref (tree record_type, tree var, tree sender_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree field = *fields->get (var);
+  return oacc_build_component_ref (sender_decl, field);
+}
+
+static int
+sort_by_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+
+  if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return SSA_NAME_VERSION (t1) - SSA_NAME_VERSION (t2);
+  else if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) != SSA_NAME)
+    return -1;
+  else if (TREE_CODE (t1) != SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return 1;
+  else
+    return DECL_UID (t1) - DECL_UID (t2);
+}
+
+static int
+sort_by_size_then_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+  unsigned HOST_WIDE_INT s1 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t1)));
+  unsigned HOST_WIDE_INT s2 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t2)));
+  if (s1 != s2)
+    return s2 - s1;
+  else
+    return sort_by_ssa_version_or_uid (p1, p2);
+}
+
+static void
+worker_single_copy (basic_block from, basic_block to,
+		    hash_set<tree> *def_escapes_block,
+		    hash_set<tree> *worker_partitioned_uses,
+		    tree record_type)
+{
+  /* If we only have virtual defs, we'll have no record type, but we still want
+     to emit single_copy_start and (particularly) single_copy_end to act as
+     a vdef source on the neutered edge representing memory writes on the
+     non-neutered edge.  */
+  if (!record_type)
+    record_type = char_type_node;
+
+  tree sender_decl
+    = targetm.goacc.create_propagation_record (record_type, true,
+					       ".oacc_worker_o");
+  tree receiver_decl
+    = targetm.goacc.create_propagation_record (record_type, false,
+					       ".oacc_worker_i");
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    gsi_prev (&gsi);
+  edge e = split_block (to, gsi_stmt (gsi));
+  basic_block barrier_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_START);
+
+  tree lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+
+  gimple *call = gimple_build_call (decl, 1,
+				    build_fold_addr_expr (sender_decl));
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  tree conv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+
+  gimple *conv = gimple_build_assign (conv_tmp,
+				      fold_convert (TREE_TYPE (receiver_decl),
+						    lhs));
+  update_stmt (conv);
+  gsi_insert_after (&start, conv, GSI_NEW_STMT);
+  gimple *asgn = gimple_build_assign (receiver_decl, conv_tmp);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  tree zero_ptr = build_int_cst (TREE_TYPE (receiver_decl), 0);
+
+  tree recv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+  asgn = gimple_build_assign (recv_tmp, receiver_decl);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  gimple *cond = gimple_build_cond (EQ_EXPR, recv_tmp, zero_ptr, NULL_TREE,
+				    NULL_TREE);
+  update_stmt (cond);
+
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  basic_block body = et->dest;
+
+  edge ef = make_edge (from, barrier_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gimple_stmt_iterator bar_gsi = gsi_start_bb (barrier_block);
+  gsi_insert_before (&bar_gsi, acc_bar, GSI_NEW_STMT);
+
+  cond = gimple_build_cond (NE_EXPR, recv_tmp, zero_ptr, NULL_TREE, NULL_TREE);
+  gsi_insert_after (&bar_gsi, cond, GSI_NEW_STMT);
+
+  edge et2 = split_block (barrier_block, cond);
+  et2->flags &= ~EDGE_FALLTHRU;
+  et2->flags |= EDGE_TRUE_VALUE;
+  et2->probability = profile_probability::unlikely ();
+
+  basic_block exit_block = et2->dest;
+
+  basic_block copyout_block = split_edge (et2);
+  edge ef2 = make_edge (barrier_block, exit_block, EDGE_FALSE_VALUE);
+  ef2->probability = et2->probability.invert ();
+
+  gimple_stmt_iterator copyout_gsi = gsi_start_bb (copyout_block);
+
+  edge copyout_to_exit = single_succ_edge (copyout_block);
+
+  gimple_seq sender_seq = NULL;
+
+  /* Make sure we iterate over definitions in a stable order.  */
+  auto_vec<tree> escape_vec (def_escapes_block->elements ());
+  for (hash_set<tree>::iterator it = def_escapes_block->begin ();
+       it != def_escapes_block->end (); ++it)
+    escape_vec.quick_push (*it);
+  escape_vec.qsort (sort_by_ssa_version_or_uid);
+
+  for (unsigned i = 0; i < escape_vec.length (); i++)
+    {
+      tree var = escape_vec[i];
+
+      if (TREE_CODE (var) == SSA_NAME && SSA_NAME_IS_VIRTUAL_OPERAND (var))
+	continue;
+
+      tree barrier_def = 0;
+
+      if (TREE_CODE (var) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	  if (gimple_nop_p (def_stmt))
+	    continue;
+
+	  /* The barrier phi takes one result from the actual work of the
+	     block we're neutering, and the other result is constant zero of
+	     the same type.  */
+
+	  gphi *barrier_phi = create_phi_node (NULL_TREE, barrier_block);
+	  barrier_def = create_new_def_for (var, barrier_phi,
+			  gimple_phi_result_ptr (barrier_phi));
+
+	  add_phi_arg (barrier_phi, var, e, UNKNOWN_LOCATION);
+	  add_phi_arg (barrier_phi, build_zero_cst (TREE_TYPE (var)), ef,
+		       UNKNOWN_LOCATION);
+
+	  update_stmt (barrier_phi);
+	}
+      else
+	gcc_assert (TREE_CODE (var) == VAR_DECL);
+
+      /* If we had no record type, we will have no fields map.  */
+      field_map_t **fields_p = field_map->get (record_type);
+      field_map_t *fields = fields_p ? *fields_p : NULL;
+
+      if (worker_partitioned_uses->contains (var)
+	  && fields
+	  && fields->get (var))
+	{
+	  tree neutered_def = make_ssa_name (TREE_TYPE (var));
+
+	  /* Receive definition from shared memory block.  */
+
+	  tree receiver_ref = build_receiver_ref (record_type, var,
+						  receiver_decl);
+	  gassign *recv = gimple_build_assign (neutered_def,
+					       receiver_ref);
+	  gsi_insert_after (&copyout_gsi, recv, GSI_CONTINUE_LINKING);
+	  update_stmt (recv);
+
+	  if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      /* If it's a VAR_DECL, we only copied to an SSA temporary.  Copy
+		 to the final location now.  */
+	      gassign *asgn = gimple_build_assign (var, neutered_def);
+	      gsi_insert_after (&copyout_gsi, asgn, GSI_CONTINUE_LINKING);
+	      update_stmt (asgn);
+	    }
+	  else
+	    {
+	      /* If it's an SSA name, create a new phi at the join node to
+		 represent either the output from the active worker (the
+		 barrier) or the inactive workers (the copyout block).  */
+	      gphi *join_phi = create_phi_node (NULL_TREE, exit_block);
+	      create_new_def_for (barrier_def, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, barrier_def, ef2, UNKNOWN_LOCATION);
+	      add_phi_arg (join_phi, neutered_def, copyout_to_exit,
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+
+	  /* Send definition to shared memory block.  */
+
+	  tree sender_ref = build_sender_ref (record_type, var, sender_decl);
+
+	  if (TREE_CODE (var) == SSA_NAME)
+	    {
+	      gassign *send = gimple_build_assign (sender_ref, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      tree tmp = make_ssa_name (TREE_TYPE (var));
+	      gassign *send = gimple_build_assign (tmp, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	      send = gimple_build_assign (sender_ref, tmp);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+    }
+
+  /* It's possible for the ET->DEST block (the work done by the active thread)
+     to finish with a control-flow insn, e.g. a UNIQUE function call.  Split
+     the block and add SENDER_SEQ in the latter part to avoid having control
+     flow in the middle of a BB.  */
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_END);
+  call = gimple_build_call (decl, 1, build_fold_addr_expr (sender_decl));
+  gimple_seq_add_stmt (&sender_seq, call);
+
+  gsi = gsi_last_bb (body);
+  gimple *last = gsi_stmt (gsi);
+  basic_block sender_block = split_block (body, last)->dest;
+  gsi = gsi_last_bb (sender_block);
+  gsi_insert_seq_after (&gsi, sender_seq, GSI_CONTINUE_LINKING);
+}
+
+static void
+neuter_worker_single (parallel_g *par, unsigned outer_mask,
+		      bitmap worker_single, bitmap vector_single,
+		      vec<propagation_set *> *prop_set,
+		      hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	{
+	  bool has_defs = false;
+	  hash_set<tree> def_escapes_block;
+	  hash_set<tree> worker_partitioned_uses;
+	  unsigned j;
+	  tree var;
+
+	  FOR_EACH_SSA_NAME (j, var, cfun)
+	    {
+	      if (SSA_NAME_IS_VIRTUAL_OPERAND (var))
+		{
+		  has_defs = true;
+		  continue;
+		}
+
+	      gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	      if (gimple_nop_p (def_stmt))
+		continue;
+
+	      if (gimple_bb (def_stmt)->index != block->index)
+		continue;
+
+	      gimple *use_stmt;
+	      imm_use_iterator use_iter;
+	      bool uses_outside_block = false;
+	      bool worker_partitioned_use = false;
+
+	      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, var)
+		{
+		  int blocknum = gimple_bb (use_stmt)->index;
+
+		  /* Don't propagate SSA names that are only used in the
+		     current block, unless the usage is in a phi node: that
+		     means the name left the block, then came back in at the
+		     top.  */
+		  if (blocknum != block->index
+		      || gimple_code (use_stmt) == GIMPLE_PHI)
+		    uses_outside_block = true;
+		  if (!bitmap_bit_p (worker_single, blocknum))
+		    worker_partitioned_use = true;
+		}
+
+	      if (uses_outside_block)
+		def_escapes_block.add (var);
+
+	      if (worker_partitioned_use)
+		{
+		  worker_partitioned_uses.add (var);
+		  has_defs = true;
+		}
+	    }
+
+	  propagation_set *ws_prop = (*prop_set)[block->index];
+
+	  if (ws_prop)
+	    {
+	      for (propagation_set::iterator it = ws_prop->begin ();
+		   it != ws_prop->end ();
+		   ++it)
+		{
+		  tree var = *it;
+		  if (TREE_CODE (var) == VAR_DECL)
+		    {
+		      def_escapes_block.add (var);
+		      if (partitioned_var_uses->contains (var))
+			{
+			  worker_partitioned_uses.add (var);
+			  has_defs = true;
+			}
+		    }
+		}
+
+	      delete ws_prop;
+	      (*prop_set)[block->index] = 0;
+	    }
+
+	  tree record_type = (tree) block->aux;
+
+	  if (has_defs)
+	    worker_single_copy (block, block, &def_escapes_block,
+				&worker_partitioned_uses, record_type);
+	  else
+	    worker_single_simple (block, block, &def_escapes_block);
+	}
+    }
+
+  if ((outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi);
+	     gsi_next (&gsi))
+	  {
+	    gimple *stmt = gsi_stmt (gsi);
+
+	    if (gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt)
+		&& !omp_sese_active_worker_call (as_a <gcall *> (stmt)))
+	      {
+		/* If we have an OpenACC routine call in worker-single mode,
+		   place barriers before and afterwards to prevent
+		   clobbering re-used shared memory regions (as are used
+		   for AMDGCN at present, for example).  */
+		tree decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+		gsi_insert_before (&gsi, gimple_build_call (decl, 0),
+				   GSI_SAME_STMT);
+		gsi_insert_after (&gsi, gimple_build_call (decl, 0),
+				  GSI_NEW_STMT);
+	      }
+	  }
+    }
+
+  if (par->inner)
+    neuter_worker_single (par->inner, mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+  if (par->next)
+    neuter_worker_single (par->next, outer_mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+}
+
+
+void
+oacc_do_neutering (void)
+{
+  bb_stmt_map_t bb_stmt_map;
+  auto_bitmap worker_single, vector_single;
+
+  omp_sese_split_blocks (&bb_stmt_map);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter splitting:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+
+  unsigned mask = 0;
+
+  /* If this is a routine, calculate MASK as if the outer levels are already
+     partitioned.  */
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  if (attr)
+    {
+      tree dims = TREE_VALUE (attr);
+      unsigned ix;
+      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	{
+	  tree allowed = TREE_PURPOSE (dims);
+	  if (allowed && integer_zerop (allowed))
+	    mask |= GOMP_DIM_MASK (ix);
+	}
+    }
+
+  parallel_g *par = omp_sese_discover_pars (&bb_stmt_map);
+  populate_single_mode_bitmaps (par, worker_single, vector_single, mask, 0);
+
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    bb->aux = NULL;
+
+  field_map = record_field_map_t::create_ggc (40);
+
+  vec<propagation_set *> prop_set;
+  prop_set.create (last_basic_block_for_fn (cfun));
+
+  for (unsigned i = 0; i < last_basic_block_for_fn (cfun); i++)
+    prop_set.quick_push (0);
+
+  find_ssa_names_to_propagate (par, mask, worker_single, vector_single,
+			       &prop_set);
+
+  hash_set<tree> partitioned_var_uses;
+  hash_set<tree> gangprivate_vars;
+
+  find_gangprivate_vars (&gangprivate_vars);
+  find_partitioned_var_uses (par, mask, &partitioned_var_uses);
+  find_local_vars_to_propagate (par, mask, &partitioned_var_uses,
+				&gangprivate_vars, &prop_set);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      propagation_set *ws_prop = prop_set[bb->index];
+      if (ws_prop)
+	{
+	  tree record_type = lang_hooks.types.make_type (RECORD_TYPE);
+	  tree name = create_tmp_var_name (".oacc_ws_data_s");
+	  name = build_decl (UNKNOWN_LOCATION, TYPE_DECL, name, record_type);
+	  DECL_ARTIFICIAL (name) = 1;
+	  DECL_NAMELESS (name) = 1;
+	  TYPE_NAME (record_type) = name;
+	  TYPE_ARTIFICIAL (record_type) = 1;
+
+	  auto_vec<tree> field_vec (ws_prop->elements ());
+	  for (hash_set<tree>::iterator it = ws_prop->begin ();
+	       it != ws_prop->end (); ++it)
+	    field_vec.quick_push (*it);
+
+	  field_vec.qsort (sort_by_size_then_ssa_version_or_uid);
+
+	  field_map->put (record_type, field_map_t::create_ggc (17));
+
+	  /* Insert var fields in reverse order, so the last inserted element
+	     is the first in the structure.  */
+	  for (int i = field_vec.length () - 1; i >= 0; i--)
+	    install_var_field (field_vec[i], record_type);
+
+	  layout_type (record_type);
+
+	  bb->aux = (tree) record_type;
+	}
+    }
+
+  neuter_worker_single (par, mask, worker_single, vector_single, &prop_set,
+			&partitioned_var_uses);
+
+  prop_set.release ();
+
+  /* This doesn't seem to make a difference.  */
+  loops_state_clear (LOOP_CLOSED_SSA);
+
+  /* Neutering worker-single blocks will invalidate dominance info.
+     It may be possible to incrementally update just the affected blocks, but
+     obliterate everything for now.  */
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter neutering:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+}
diff --git a/gcc/oacc-neuter-bcast.h b/gcc/oacc-neuter-bcast.h
new file mode 100644
index 00000000000..b3f38426e64
--- /dev/null
+++ b/gcc/oacc-neuter-bcast.h
@@ -0,0 +1,26 @@
+/* Implement worker partitioning for OpenACC via neutering/broadcasting scheme.
+
+   Copyright (C) 2015-2021 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_OACC_NEUTER_BCAST_H
+#define GCC_OACC_NEUTER_BCAST_H
+
+extern void oacc_do_neutering (void);
+
+#endif
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index cfbf1e67b8e..1cc807f77c0 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -75,6 +75,8 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER, "GOMP_barrier",
 		  BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER_CANCEL, "GOMP_barrier_cancel",
 		  BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_BARRIER, "GOACC_barrier",
+		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT, "GOMP_taskwait",
 		  BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT_DEPEND, "GOMP_taskwait_depend",
@@ -412,6 +414,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
 		  BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
 		  BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_START, "GOACC_single_start",
+		   BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_START, "GOACC_single_copy_start",
+		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_END, "GOACC_single_copy_end",
+		   BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register_ver",
 		  BT_FN_VOID_UINT_PTR_INT_PTR, ATTR_NOTHROW_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER,
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index b3f543b597a..27d77532737 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "attribs.h"
 #include "cfgloop.h"
 #include "context.h"
+#include "oacc-neuter-bcast.h"
 #include "convert.h"
 
 /* Describe the OpenACC looping structure of a function.  The entire
@@ -1367,6 +1368,8 @@ oacc_loop_xform_head_tail (gcall *from, int level)
       else if (gimple_call_internal_p (stmt, IFN_GOACC_REDUCTION))
 	*gimple_call_arg_ptr (stmt, 3) = replacement;
 
+      update_stmt (stmt);
+
       gsi_next (&gsi);
       while (gsi_end_p (gsi))
 	gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
@@ -1391,25 +1394,28 @@ oacc_loop_process (oacc_loop *loop)
       gcall *call;
       
       for (ix = 0; loop->ifns.iterate (ix, &call); ix++)
-	switch (gimple_call_internal_fn (call))
-	  {
-	  case IFN_GOACC_LOOP:
+	{
+	  switch (gimple_call_internal_fn (call))
 	    {
-	      bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
-	      gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
-	      if (!is_e)
-		gimple_call_set_arg (call, 4, chunk_arg);
-	    }
-	    break;
+	    case IFN_GOACC_LOOP:
+	      {
+		bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
+		gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
+		if (!is_e)
+		  gimple_call_set_arg (call, 4, chunk_arg);
+	      }
+	      break;
 
-	  case IFN_GOACC_TILE:
-	    gimple_call_set_arg (call, 3, mask_arg);
-	    gimple_call_set_arg (call, 4, e_mask_arg);
-	    break;
+	    case IFN_GOACC_TILE:
+	      gimple_call_set_arg (call, 3, mask_arg);
+	      gimple_call_set_arg (call, 4, e_mask_arg);
+	      break;
 
-	  default:
-	    gcc_unreachable ();
-	  }
+	    default:
+	      gcc_unreachable ();
+	    }
+	  update_stmt (call);
+	}
 
       unsigned dim = GOMP_DIM_GANG;
       unsigned mask = loop->mask | loop->e_mask;
@@ -1906,12 +1912,27 @@ is_sync_builtin_call (gcall *call)
   return false;
 }
 
+/* Default implementation: create a temporary variable of type RECORD_TYPE if
+   SENDER is true, otherwise one of type pointer-to-RECORD_TYPE.  */
+
+tree
+default_goacc_create_propagation_record (tree record_type, bool sender,
+					 const char *name)
+{
+  tree type = record_type;
+
+  if (!sender)
+    type = build_pointer_type (type);
+
+  return create_tmp_var (type, name);
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
 
 static unsigned int
-execute_oacc_device_lower ()
+execute_oacc_loop_designation ()
 {
   tree attrs = oacc_get_fn_attrib (current_function_decl);
 
@@ -2051,10 +2072,36 @@ execute_oacc_device_lower ()
 	free_oacc_loop (l);
     }
 
+  free_oacc_loop (loops);
+
   /* Offloaded targets may introduce new basic blocks, which require
      dominance information to update SSA.  */
   calculate_dominance_info (CDI_DOMINATORS);
 
+  return 0;
+}
+
+int
+execute_oacc_gimple_workers (void)
+{
+  oacc_do_neutering ();
+  calculate_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+static unsigned int
+execute_oacc_device_lower ()
+{
+  int dims[GOMP_DIM_MAX];
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+
+  if (!attr)
+    /* Not an offloaded function.  */
+    return 0;
+
+  for (unsigned i = 0; i < GOMP_DIM_MAX; i++)
+    dims[i] = oacc_get_fn_dim_size (current_function_decl, i);
+
   hash_map<tree, tree> adjusted_vars;
 
   /* Now lower internal loop functions to target-specific code
@@ -2252,8 +2299,6 @@ execute_oacc_device_lower ()
 	  }
     }
 
-  free_oacc_loop (loops);
-
   return 0;
 }
 
@@ -2294,6 +2339,70 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 
 namespace {
 
+const pass_data pass_data_oacc_loop_designation =
+{
+  GIMPLE_PASS, /* type */
+  "oaccloops", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg
+  | TODO_rebuild_alias, /* todo_flags_finish */
+};
+
+class pass_oacc_loop_designation : public gimple_opt_pass
+{
+public:
+  pass_oacc_loop_designation (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_loop_designation, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_openacc; };
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_oacc_loop_designation ();
+    }
+
+}; // class pass_oacc_loop_designation
+
+const pass_data pass_data_oacc_gimple_workers =
+{
+  GIMPLE_PASS, /* type */
+  "oaccworkers", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_oacc_gimple_workers : public gimple_opt_pass
+{
+public:
+  pass_oacc_gimple_workers (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_gimple_workers, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return flag_openacc && targetm.goacc.worker_partitioning;
+  };
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_oacc_gimple_workers ();
+    }
+
+}; // class pass_oacc_gimple_workers
+
 const pass_data pass_data_oacc_device_lower =
 {
   GIMPLE_PASS, /* type */
@@ -2326,6 +2435,18 @@ public:
 
 } // anon namespace
 
+gimple_opt_pass *
+make_pass_oacc_loop_designation (gcc::context *ctxt)
+{
+  return new pass_oacc_loop_designation (ctxt);
+}
+
+gimple_opt_pass *
+make_pass_oacc_gimple_workers (gcc::context *ctxt)
+{
+  return new pass_oacc_gimple_workers (ctxt);
+}
+
 gimple_opt_pass *
 make_pass_oacc_device_lower (gcc::context *ctxt)
 {
diff --git a/gcc/omp-offload.h b/gcc/omp-offload.h
index b91d08cd218..a6f26a7c962 100644
--- a/gcc/omp-offload.h
+++ b/gcc/omp-offload.h
@@ -29,6 +29,7 @@ extern int oacc_fn_attrib_level (tree attr);
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
 
+extern int oacc_fn_attrib_level (tree attr);
 extern void omp_finish_file (void);
 extern void omp_discover_implicit_declare_target (void);
 
diff --git a/gcc/passes.def b/gcc/passes.def
index e9ed3c7bc57..f6e99ac1f4e 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -183,6 +183,8 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_oacc_loop_designation);
+  NEXT_PASS (pass_oacc_gimple_workers);
   NEXT_PASS (pass_oacc_device_lower);
   NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_omp_target_link);
diff --git a/gcc/target.def b/gcc/target.def
index 00b6f8f1bc9..35e4ec92ba1 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1742,6 +1742,19 @@ way.",
 tree, (tree var, int level),
 NULL)
 
+DEFHOOK
+(create_propagation_record,
+"Create a record used to propagate local-variable state from an active\n\
+worker to other workers.  A possible implementation might adjust the type\n\
+of REC to place the new variable in shared GPU memory.",
+tree, (tree rec, bool sender, const char *name),
+default_goacc_create_propagation_record)
+
+DEFHOOKPOD
+(worker_partitioning,
+"Use gimple transformation for worker neutering/broadcasting.",
+bool, false)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 39a6f82f143..19cb0e5325d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -130,6 +130,7 @@ extern bool default_goacc_validate_dims (tree, int [], int, unsigned);
 extern int default_goacc_dim_limit (int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
 extern void default_goacc_reduction (gcall *);
+extern tree default_goacc_create_propagation_record (tree, bool, const char *);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
index d4c4b2ca237..79b4cad7916 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -35,6 +35,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
index 16e9b9e31d1..8fcfc3f4278 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -31,6 +31,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
index 933d7664386..34259099bcf 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -24,6 +24,6 @@ void PARALLEL ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-routine.c b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
index 0b9ba6ea69f..54eddc10b3c 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-routine.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -26,6 +26,6 @@ void ROUTINE ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-serial.c b/gcc/testsuite/c-c++-common/goacc/classify-serial.c
index 89b5f5ed9f9..cbf7a331372 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-serial.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-serial.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #define N 1024
 
@@ -26,6 +26,6 @@ void SERIAL ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
index bd4c07e7d81..f234449cff8 100644
--- a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
+++ b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
@@ -1,5 +1,5 @@
 /* Make sure that OpenACC loop processing happens.  */
-/* { dg-additional-options "-O2 -fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-O2 -fdump-tree-oaccloops" } */
 
 extern int place ();
 
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
index 6cca3d6eefb..aa62c63f57b 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -37,6 +37,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
index 715a983bb26..0bc5fb5cf30 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -33,6 +33,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
index 01f06bbcc27..20bbdb0fbd3 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -26,6 +26,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
index 401d5270391..ed24cee10d8 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 subroutine ROUTINE
   !$acc routine worker
@@ -25,6 +25,6 @@ end subroutine ROUTINE
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
index d7052bacbe8..33fbdbc07d0 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   implicit none
@@ -28,6 +28,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 15693fee150..4a575a54c04 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -423,6 +423,8 @@ extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_loop_designation (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_gimple_workers (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
-- 
2.29.2



* [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases
  2021-03-02 12:20 [PATCH 0/4] openacc: Worker partitioning in the middle end Julian Brown
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
@ 2021-03-02 12:20 ` Julian Brown
  2021-03-02 12:20 ` [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN Julian Brown
  2021-03-02 12:20 ` [PATCH 4/4] openacc: Reference-typed reduction and private variable rewriting Julian Brown
  3 siblings, 0 replies; 20+ messages in thread
From: Julian Brown @ 2021-03-02 12:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: Thomas Schwinge, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

Enabling worker-partitioning support in the middle end (for AMD GCN)
exposes several bugs in how existing tests use async queues.
This patch fixes them.
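
To illustrate the kind of bug involved, here is a minimal sketch based on
my reading of the deep-copy-10.c hunk below. OpenACC orders operations only
within a single async queue, so alternating queues with async(i % 2) leaves
consecutive updates of the same array unordered. The sketch pins each array
to one queue and waits before the host reads the results. The function
name, the array size and the use of a data region with present clauses are
assumptions made for this example, not taken from the test. Without
-fopenacc the pragmas are ignored and the code runs sequentially.

```c
#define N 1024          /* array length: assumed, the test's N is not shown */
#define ITERATIONS 1023 /* same value the patch introduces */

static int a[N], b[N];

/* Increment both arrays ITERATIONS times.  Each array always uses the
   same async queue, so successive kernels on one array stay ordered.
   Returns the number of elements that do not equal ITERATIONS.  */
int
run_fixed_queues (void)
{
  int mismatches = 0;

  for (int j = 0; j < N; j++)
    a[j] = b[j] = 0;

#pragma acc data copy(a[0:N], b[0:N])
  {
    for (int i = 0; i < ITERATIONS; i++)
      {
#pragma acc parallel loop present(a[0:N]) async(0)
        for (int j = 0; j < N; j++)
          a[j]++;
#pragma acc parallel loop present(b[0:N]) async(1)
        for (int j = 0; j < N; j++)
          b[j]++;
      }
    /* Drain both queues before the data region copies results back.  */
#pragma acc wait
  }

  for (int j = 0; j < N; j++)
    if (a[j] != ITERATIONS || b[j] != ITERATIONS)
      mismatches++;
  return mismatches;
}
```

The Fortran fixes follow the same principle: the added acc_wait calls make
the host wait for the queue before it touches or checks the data.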

Tested with offloading to AMD GCN. OK for stage 1? (Or now?)

Julian

2021-03-02  Julian Brown  <julian@codesourcery.com>

libgomp/
	* testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c: Fix async
	behaviour and increase number of iterations.
	* testsuite/libgomp.oacc-fortran/lib-16-2.f90: Fix async behaviour.
	* testsuite/libgomp.oacc-fortran/lib-16.f90: Likewise.
---
 .../libgomp.oacc-c-c++-common/deep-copy-10.c       | 14 ++++++++------
 .../testsuite/libgomp.oacc-fortran/lib-16-2.f90    |  5 +++++
 libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90  |  5 +++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
index 573a8214bf0..dadb6d37942 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
@@ -1,6 +1,8 @@
 #include <stdlib.h>
 
-/* Test asyncronous attach and detach operation.  */
+#define ITERATIONS 1023
+
+/* Test asynchronous attach and detach operation.  */
 
 typedef struct {
   int *a;
@@ -25,13 +27,13 @@ main (int argc, char* argv[])
 
 #pragma acc enter data copyin(m)
 
-  for (int i = 0; i < 99; i++)
+  for (int i = 0; i < ITERATIONS; i++)
     {
       int j;
-#pragma acc parallel loop copy(m.a[0:N]) async(i % 2)
+#pragma acc parallel loop copy(m.a[0:N]) async(0)
       for (j = 0; j < N; j++)
 	m.a[j]++;
-#pragma acc parallel loop copy(m.b[0:N]) async((i + 1) % 2)
+#pragma acc parallel loop copy(m.b[0:N]) async(1)
       for (j = 0; j < N; j++)
 	m.b[j]++;
     }
@@ -40,9 +42,9 @@ main (int argc, char* argv[])
 
   for (i = 0; i < N; i++)
     {
-      if (m.a[i] != 99)
+      if (m.a[i] != ITERATIONS)
 	abort ();
-      if (m.b[i] != 99)
+      if (m.b[i] != ITERATIONS)
 	abort ();
     }
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
index ddd557d3be0..e2e47c967fa 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
index ccd1ce6ee18..ef9a6f6626c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
-- 
2.29.2



* [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN
  2021-03-02 12:20 [PATCH 0/4] openacc: Worker partitioning in the middle end Julian Brown
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
  2021-03-02 12:20 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
@ 2021-03-02 12:20 ` Julian Brown
  2021-08-09 13:26   ` Thomas Schwinge
  2021-03-02 12:20 ` [PATCH 4/4] openacc: Reference-typed reduction and private variable rewriting Julian Brown
  3 siblings, 1 reply; 20+ messages in thread
From: Julian Brown @ 2021-03-02 12:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: Thomas Schwinge, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

This patch enables worker-partitioning support via gimple rewriting for
AMD GCN. Older (and currently unused) parts of this support are already
present in the AMD GCN backend: those vestigial parts are enabled or
updated, as appropriate.
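
For readers unfamiliar with the scheme, the following is a rough sequential
simulation of what worker neutering and broadcasting mean, assuming the
description given for the create_propagation_record hook in patch 1: in
worker-single code only one worker is active, and its local state is
copied through a shared record to the other workers before a
worker-partitioned loop starts. It is not the GCC implementation; the
struct, its fields and the function name are hypothetical, and the worker
count of 16 is taken from the defaults in this patch.

```c
#define NUM_WORKERS 16 /* default worker count chosen by this patch */

/* Hypothetical stand-in for a propagation record: the local state that
   the active worker publishes to the others.  */
struct propagation_record
{
  int loop_bound;
  int base;
};

/* Simulate a worker-single section followed by a worker-partitioned
   loop.  Returns the sum accumulated across all simulated workers.  */
int
simulate_worker_single (int input)
{
  struct propagation_record shared; /* would live in shared memory */
  int total = 0;

  /* Worker-single section: only worker 0 runs it, the rest are neutered.  */
  for (int w = 0; w < NUM_WORKERS; w++)
    if (w == 0)
      {
        shared.loop_bound = input * 2;
        shared.base = input + 1;
      }

  /* On a real device a barrier would separate the write from the reads.  */

  /* Broadcast: every worker copies the record into its own locals, then
     takes its share of the partitioned loop.  */
  for (int w = 0; w < NUM_WORKERS; w++)
    {
      int loop_bound = shared.loop_bound;
      int base = shared.base;

      for (int i = w; i < loop_bound; i += NUM_WORKERS)
        total += base + i;
    }

  return total;
}
```

For example, simulate_worker_single (4) sums 5 + i for i from 0 to 7 and
returns 68, the same result a single thread would produce.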

I can probably self-approve this -- I will commit if/when the other
patches in the series are committed in stage 1.

Julian

2021-03-02  Julian Brown  <julian@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>

gcc/
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_propagation_record):
	Rename prototype to...
	(gcn_goacc_create_propagation_record): This.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_propagation_record): Rename
	function to...
	(gcn_goacc_create_propagation_record): This.  Adjust comment.
	* config/gcn/gcn.c (gcn_init_builtins): Override decls for
	BUILT_IN_GOACC_SINGLE_START, BUILT_IN_GOACC_SINGLE_COPY_START,
	BUILT_IN_GOACC_SINGLE_COPY_END and BUILT_IN_GOACC_BARRIER.
	(gcn_goacc_validate_dims): Turn on worker partitioning unconditionally.
	(gcn_fork_join): Update comment.
	(TARGET_GOACC_ADJUST_PROPAGATION_RECORD): Rename to...
	(TARGET_GOACC_CREATE_PROPAGATION_RECORD): This.
	(TARGET_GOACC_WORKER_PARTITIONING): Define target hook.
	* config/gcn/gcn.opt (flag_worker_partitioning): Remove.
	(macc_experimental_workers): Remove unused option.

libgomp/
	* plugin/plugin-gcn.c (gcn_exec): Change default number of workers to
	16.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c (check): Skip
	vector dimension test for AMD GCN.  Enable multiple workers.
	* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Enable multiple
	workers.  Update line numbers for scan tests.
	* testsuite/libgomp.oacc-fortran/parallel-dims-aux.c: Support AMD GCN.
---
 gcc/config/gcn/gcn-protos.h                   |  2 +-
 gcc/config/gcn/gcn-tree.c                     |  6 ++---
 gcc/config/gcn/gcn.c                          | 23 +++++++------------
 gcc/config/gcn/gcn.opt                        |  5 ----
 libgomp/plugin/plugin-gcn.c                   |  4 +---
 .../loop-dim-default.c                        | 11 +++++----
 .../libgomp.oacc-c-c++-common/parallel-dims.c | 13 ++++-------
 .../libgomp.oacc-fortran/parallel-dims-aux.c  |  9 +++++---
 8 files changed, 31 insertions(+), 42 deletions(-)

diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
index 7ef7ae8af46..6238bdc8a96 100644
--- a/gcc/config/gcn/gcn-protos.h
+++ b/gcc/config/gcn/gcn-protos.h
@@ -38,7 +38,7 @@ extern rtx gcn_full_exec ();
 extern rtx gcn_full_exec_reg ();
 extern rtx gcn_gen_undef (machine_mode);
 extern bool gcn_global_address_p (rtx);
-extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+extern tree gcn_goacc_create_propagation_record (tree record_type, bool sender,
 						 const char *name);
 extern tree gcn_goacc_adjust_private_decl (tree var, int level);
 extern void gcn_goacc_reduction (gcall *call);
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
index 75ea50c59dd..a457121c72b 100644
--- a/gcc/config/gcn/gcn-tree.c
+++ b/gcc/config/gcn/gcn-tree.c
@@ -548,12 +548,12 @@ gcn_goacc_reduction (gcall *call)
     }
 }
 
-/* Implement TARGET_GOACC_ADJUST_PROPAGATION_RECORD.
+/* Implement TARGET_GOACC_CREATE_PROPAGATION_RECORD.
  
-   Tweak (worker) propagation record, e.g. to put it in shared memory.  */
+   Create (worker) propagation record in shared memory.  */
 
 tree
-gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
+gcn_goacc_create_propagation_record (tree record_type, bool sender,
 				     const char *name)
 {
   tree type = record_type;
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 1ea919bf058..fe4fa68f4ce 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -3588,8 +3588,6 @@ gcn_init_builtins (void)
       TREE_NOTHROW (gcn_builtin_decls[i]) = 1;
     }
 
-/* FIXME: remove the ifdef once OpenACC support is merged upstream.  */
-#ifdef BUILT_IN_GOACC_SINGLE_START
   /* These builtins need to take/return an LDS pointer: override the generic
      versions here.  */
 
@@ -3606,7 +3604,6 @@ gcn_init_builtins (void)
 
   set_builtin_decl (BUILT_IN_GOACC_BARRIER,
 		    gcn_builtin_decls[GCN_BUILTIN_ACC_BARRIER], false);
-#endif
 }
 
 /* Expand the CMP_SWAP GCN builtins.  We have our own versions that do
@@ -4865,11 +4862,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
 			 unsigned /*used*/)
 {
   bool changed = false;
-
-  /* FIXME: remove -facc-experimental-workers when they're ready.  */
-  int max_workers = flag_worker_partitioning ? 16 : 1;
-
-  gcc_assert (!flag_worker_partitioning);
+  const int max_workers = 16;
 
   /* The vector size must appear to be 64, to the user, unless this is a
      SEQ routine.  The real, internal value is always 1, which means use
@@ -4906,8 +4899,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
     {
       dims[GOMP_DIM_VECTOR] = GCN_DEFAULT_VECTORS;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = (flag_worker_partitioning
-				 ? GCN_DEFAULT_WORKERS : 1);
+	dims[GOMP_DIM_WORKER] = GCN_DEFAULT_WORKERS;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = GCN_DEFAULT_GANGS;
       changed = true;
@@ -4972,8 +4964,7 @@ static bool
 gcn_fork_join (gcall *ARG_UNUSED (call), const int *ARG_UNUSED (dims),
 	       bool ARG_UNUSED (is_fork))
 {
-  /* GCN does not use the fork/join concept invented for NVPTX.
-     Instead we use standard autovectorization.  */
+  /* GCN does not need to expand fork/join markers at the RTL level.  */
   return false;
 }
 
@@ -6314,9 +6305,9 @@ gcn_dwarf_register_span (rtx rtl)
 #define TARGET_GIMPLIFY_VA_ARG_EXPR gcn_gimplify_va_arg_expr
 #undef TARGET_OMP_DEVICE_KIND_ARCH_ISA
 #define TARGET_OMP_DEVICE_KIND_ARCH_ISA gcn_omp_device_kind_arch_isa
-#undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
-#define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
-  gcn_goacc_adjust_propagation_record
+#undef  TARGET_GOACC_CREATE_PROPAGATION_RECORD
+#define TARGET_GOACC_CREATE_PROPAGATION_RECORD \
+  gcn_goacc_create_propagation_record
 #undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
 #define TARGET_GOACC_ADJUST_PRIVATE_DECL gcn_goacc_adjust_private_decl
 #undef  TARGET_GOACC_FORK_JOIN
@@ -6325,6 +6316,8 @@ gcn_dwarf_register_span (rtx rtl)
 #define TARGET_GOACC_REDUCTION gcn_goacc_reduction
 #undef  TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS gcn_goacc_validate_dims
+#undef  TARGET_GOACC_WORKER_PARTITIONING
+#define TARGET_GOACC_WORKER_PARTITIONING true
 #undef  TARGET_HARD_REGNO_MODE_OK
 #define TARGET_HARD_REGNO_MODE_OK gcn_hard_regno_mode_ok
 #undef  TARGET_HARD_REGNO_NREGS
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index 767d45826c2..41cc49095b1 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -62,11 +62,6 @@ bool flag_bypass_init_error = false
 mbypass-init-error
 Target RejectNegative Var(flag_bypass_init_error)
 
-bool flag_worker_partitioning = false
-
-macc-experimental-workers
-Target Var(flag_worker_partitioning) Init(0)
-
 int stack_size_opt = -1
 
 mstack-size=
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 8e6af69988e..b89470199cb 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3041,10 +3041,8 @@ gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
      problem size, so let's do a reasonable number of single-worker gangs.
      64 gangs matches a typical Fiji device.  */
 
-  /* NOTE: Until support for middle-end worker partitioning is merged, use 1
-     for the default number of workers.  */
   if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  if (dims[1] == 0) dims[1] = 1;  /* Workers.  */
+  if (dims[1] == 0) dims[1] = 16; /* Workers.  */
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
index ca771646655..ddf0a29d304 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
@@ -79,13 +79,18 @@ int check (const int *ary, int size, int gp, int wp, int vp)
 	exit = 1;
       }
   
+#ifndef ACC_DEVICE_TYPE_radeon
+  /* AMD GCN uses the autovectorizer for the vector dimension: the use
+     of a function call in vector-partitioned code in this test is not
+     currently supported.  */
   for (ix = 0; ix < vp; ix++)
     if (vectors[ix] != vectors[0])
       {
 	printf ("vector %d not used %d times\n", ix, vectors[0]);
 	exit = 1;
       }
-  
+#endif
+
   return exit;
 }
 
@@ -132,9 +137,7 @@ int main ()
   /* AMD GCN uses the autovectorizer for the vector dimension: the use
      of a function call in vector-partitioned code in this test is not
      currently supported.  */
-  /* AMD GCN does not currently support multiple workers.  This should be
-     set to 16 when that changes.  */
-  return test_1 (16, 1, 1);
+  return test_1 (16, 16, 64);
 #else
   return test_1 (16, 16, 32);
 #endif
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
index 003bcac2413..10bb7b61f50 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
@@ -288,9 +288,8 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* The GCC GCN back end is limited to num_workers (16).
-	     Temporarily set this to 1 until multiple workers are permitted. */
-	  workers_actual = 1; // 16;
+	  /* The GCC GCN back end is limited to num_workers (16).  */
+	  workers_actual = 16;
 	}
       else
 	__builtin_abort ();
@@ -491,8 +490,6 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* Temporary setting, until multiple workers are permitted.  */
-	  workers_actual = 1;
 	  /* See above comments about GCN vectors_actual.  */
 	  vectors_actual = 1;
 	}
@@ -618,9 +615,9 @@ int main ()
     gangs_max = workers_max = vectors_max = INT_MIN;
 #pragma acc serial copy (vectors_actual) /* { dg-warning "using vector_length \\(32\\), ignoring 1" "" { target openacc_nvidia_accel_selected } } */ \
   copy (gangs_min, gangs_max, workers_min, workers_max, vectors_min, vectors_max)
-/* { dg-warning "not gang partitioned" "" { target *-*-* } 619 } */
-/* { dg-warning "not worker partitioned" "" { target *-*-* } 619 } */
-/* { dg-warning "not vector partitioned" "" { target *-*-* } 619 } */
+/* { dg-warning "not gang partitioned" "" { target *-*-* } 616 } */
+/* { dg-warning "not worker partitioned" "" { target *-*-* } 616 } */
+/* { dg-warning "not vector partitioned" "" { target *-*-* } 616 } */
     {
       if (acc_on_device (acc_device_nvidia))
 	{
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c b/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
index b5986f4afef..9810a259f2a 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
+++ b/libgomp/testsuite/libgomp.oacc-fortran/parallel-dims-aux.c
@@ -16,7 +16,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_radeon))
     return __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
   else
     __builtin_abort ();
@@ -27,7 +28,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_radeon))
     return __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
   else
     __builtin_abort ();
@@ -38,7 +40,8 @@
 {
   if (acc_on_device ((int) acc_device_host))
     return 0;
-  else if (acc_on_device ((int) acc_device_nvidia))
+  else if (acc_on_device ((int) acc_device_nvidia)
+	   || acc_on_device ((int) acc_device_radeon))
     return __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
   else
     __builtin_abort ();
-- 
2.29.2



* [PATCH 4/4] openacc: Reference-typed reduction and private variable rewriting
  2021-03-02 12:20 [PATCH 0/4] openacc: Worker partitioning in the middle end Julian Brown
                   ` (2 preceding siblings ...)
  2021-03-02 12:20 ` [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN Julian Brown
@ 2021-03-02 12:20 ` Julian Brown
  3 siblings, 0 replies; 20+ messages in thread
From: Julian Brown @ 2021-03-02 12:20 UTC (permalink / raw)
  To: gcc-patches
  Cc: Thomas Schwinge, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

A version of this patch was previously posted for mainline here:

  https://gcc.gnu.org/pipermail/gcc-patches/2019-November/534552.html

Reference-type private variables or reference-type variables used as
reduction targets do not work well with the scheme to implement worker
partitioning on AMD GCN. This patch (originally by Cesar Philippidis, but
modified somewhat since) provides support for replacing such variables
with new non-reference-typed temporary versions within partitioned
offload regions.
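
As a rough illustration of the rewriting only (this is not the code in
the patch; the Node type and the var, deref, binop and localize helpers
are hypothetical stand-ins for GCC trees and walk_tree), the
replacement of a reference variable by a local temporary can be
sketched as follows. Like the walker in the patch, the sketch does not
descend into dereference nodes that do not match.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mini-IR: a node is a variable, a dereference of its single
// operand, or an operation over its operands.
struct Node
{
  enum Kind { VAR, DEREF, OP } kind;
  std::string name;                        // used for VAR only
  std::vector<std::unique_ptr<Node>> ops;  // used for DEREF and OP
};

inline std::unique_ptr<Node> var (const std::string &name)
{
  auto p = std::make_unique<Node> ();
  p->kind = Node::VAR;
  p->name = name;
  return p;
}

inline std::unique_ptr<Node> deref (std::unique_ptr<Node> operand)
{
  auto p = std::make_unique<Node> ();
  p->kind = Node::DEREF;
  p->ops.push_back (std::move (operand));
  return p;
}

inline std::unique_ptr<Node> binop (std::unique_ptr<Node> a,
                                    std::unique_ptr<Node> b)
{
  auto p = std::make_unique<Node> ();
  p->kind = Node::OP;
  p->ops.push_back (std::move (a));
  p->ops.push_back (std::move (b));
  return p;
}

// Replace each `*ref_var` and each bare `ref_var` with `local_var`.
// Returns the number of replacements made.
inline int localize (std::unique_ptr<Node> &n, const std::string &ref_var,
                     const std::string &local_var)
{
  if (n->kind == Node::DEREF)
    {
      const Node &operand = *n->ops[0];
      if (operand.kind == Node::VAR && operand.name == ref_var)
        {
          n = var (local_var);  // the whole dereference becomes the local
          return 1;
        }
      return 0;                 // non-matching dereferences are left alone
    }

  if (n->kind == Node::VAR)
    {
      if (n->name != ref_var)
        return 0;
      n->name = local_var;
      return 1;
    }

  int count = 0;
  for (auto &child : n->ops)
    count += localize (child, ref_var, local_var);
  return count;
}
```

For example, calling localize on a tree for "*r op x" with ref_var "r"
turns it into "r_local op x" and leaves "x" untouched.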

In more detail, the problem with reductions is as follows.  The expansion
of reduction operations (or similarly use of private variables) may
cause the bits of a reference variable (i.e. a pointer to a stack slot)
formed in worker-single mode to be broadcast to worker-partitioned mode,
and then dereferenced. Thus all workers will try to access the same
variable on worker 0's stack, which is not what was intended -- rather,
the reference in each worker should have been a pointer to a slot in
that worker's own stack.
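
A host-only sketch of the effect, under the assumption that each
simulated "worker" owns one stack slot (kWorkers, Stacks and both
functions are made up for illustration and do not appear in GCC):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWorkers = 4;  // hypothetical worker count

// One simulated stack slot per worker.
using Stacks = std::array<int, kWorkers>;

// Problematic scheme: the address formed by worker 0 is broadcast, so every
// worker dereferences worker 0's slot.
inline Stacks run_broadcast_reference ()
{
  Stacks slot{};
  int *ref = &slot[0];  // formed in worker-single mode, then broadcast
  for (std::size_t w = 0; w < kWorkers; ++w)
    *ref += static_cast<int> (w) + 1;  // all workers update the same slot
  return slot;
}

// Localized scheme: each worker operates on a non-reference temporary of its
// own and stores the result to its own slot.
inline Stacks run_localized ()
{
  Stacks slot{};
  for (std::size_t w = 0; w < kWorkers; ++w)
    {
      int local = 0;  // per-worker temporary
      local += static_cast<int> (w) + 1;
      slot[w] = local;
    }
  return slot;
}
```

With the broadcast reference, all four contributions land in slot 0 and
the other slots stay untouched; with per-worker temporaries, each slot
receives only its own worker's value.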

A better solution to this problem might be to somehow avoid broadcasting
pointers formed by taking the address of a stack slot, but that could
prove tricky or perhaps impossible in the general case.

(I noticed during testing that Tobias has a couple of follow-up patches
to this one on the og10 branch. It might make sense to fold those into
this one too, else they'll need applying separately.)

Tested with offloading to AMD GCN (and separately to NVPTX). OK for
stage 1?

Julian

2021-03-02  Cesar Philippidis  <cesar@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Kwok Cheung Yeung  <kcy@codesourcery.com>

gcc/
	* gimplify.c (privatize_reduction): New struct.
	(localize_reductions_r, localize_reductions): New functions.
	(gimplify_omp_for): Call localize_reductions.
	(gimplify_omp_workshare): Likewise.
	* omp-low.c (lower_oacc_reductions): Handle localized reductions.
	Create fewer temp vars.
	* tree-core.h (omp_clause_code): Add OMP_CLAUSE_REDUCTION_PRIVATE_DECL
	documentation.
	* tree.c (omp_clause_num_ops): Bump number of ops for
	OMP_CLAUSE_REDUCTION to 6.
	(walk_tree_1): Adjust accordingly.
	* tree.h (OMP_CLAUSE_REDUCTION_PRIVATE_DECL): Add macro.

libgomp/
	* testsuite/libgomp.oacc-fortran/privatized-ref-1.f95: New test.
	* testsuite/libgomp.oacc-c++/privatized-ref-2.C: New test.
	* testsuite/libgomp.oacc-c++/privatized-ref-3.C: New test.
---
 gcc/gimplify.c                                | 117 ++++++++++++++++++
 gcc/omp-low.c                                 |  47 +++----
 gcc/tree-core.h                               |   4 +-
 gcc/tree.c                                    |  11 +-
 gcc/tree.h                                    |   2 +
 .../libgomp.oacc-c++/privatized-ref-2.C       |  64 ++++++++++
 .../libgomp.oacc-c++/privatized-ref-3.C       |  64 ++++++++++
 .../libgomp.oacc-fortran/privatized-ref-1.f95 |  71 +++++++++++
 8 files changed, 343 insertions(+), 37 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
 create mode 100644 libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95

diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index caf25ccdd5c..e092b7be723 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -236,6 +236,11 @@ struct gimplify_omp_ctx
   int defaultmap[4];
 };
 
+struct privatize_reduction
+{
+  tree ref_var, local_var;
+};
+
 static struct gimplify_ctx *gimplify_ctxp;
 static struct gimplify_omp_ctx *gimplify_omp_ctxp;
 static bool in_omp_construct;
@@ -11381,6 +11386,95 @@ gimplify_omp_taskloop_expr (tree type, tree *tp, gimple_seq *pre_p,
   OMP_FOR_CLAUSES (orig_for_stmt) = c;
 }
 
+/* Helper function for localize_reductions.  Replace all uses of REF_VAR with
+   LOCAL_VAR.  */
+
+static tree
+localize_reductions_r (tree *tp, int *walk_subtrees, void *data)
+{
+  enum tree_code tc = TREE_CODE (*tp);
+  struct privatize_reduction *pr = (struct privatize_reduction *) data;
+
+  if (TYPE_P (*tp))
+    *walk_subtrees = 0;
+
+  switch (tc)
+    {
+    case INDIRECT_REF:
+    case MEM_REF:
+      if (TREE_OPERAND (*tp, 0) == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    case VAR_DECL:
+    case PARM_DECL:
+    case RESULT_DECL:
+      if (*tp == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    default:
+      break;
+    }
+
+  return NULL_TREE;
+}
+
+/* OpenACC worker and vector loop state propagation requires reductions
+   to be inside local variables.  This function replaces all reference-type
+   reduction variables associated with the loop with a local copy.  It is
+   also used to create private copies of reduction variables for those
+   which are not associated with acc loops.  */
+
+static void
+localize_reductions (tree clauses, tree body)
+{
+  tree c, var, type, new_var;
+  struct privatize_reduction pr;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  {
+	    OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = NULL;
+	    continue;
+	  }
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type, IDENTIFIER_POINTER (DECL_NAME (var)));
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (&body, localize_reductions_r, &pr, NULL);
+
+	OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = new_var;
+      }
+    else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  continue;
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type, IDENTIFIER_POINTER (DECL_NAME (var)));
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (&body, localize_reductions_r, &pr, NULL);
+      }
+}
+
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -11607,6 +11701,24 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  if (ort == ORT_ACC)
+    {
+      gimplify_omp_ctx *outer = gimplify_omp_ctxp;
+
+      while (outer
+	     && outer->region_type != ORT_ACC_PARALLEL
+	     && outer->region_type != ORT_ACC_KERNELS)
+	outer = outer->outer_context;
+
+      /* FIXME: Reductions only work in parallel regions at present.  We avoid
+	 doing the reduction localization transformation in kernels regions
+	 here, because the code to remove reductions in kernels regions cannot
+	 handle that.  */
+      if (outer && outer->region_type == ORT_ACC_PARALLEL)
+	localize_reductions (OMP_FOR_CLAUSES (for_stmt),
+			     OMP_FOR_BODY (for_stmt));
+    }
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (ort == ORT_SIMD && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
@@ -13265,6 +13377,11 @@ gimplify_omp_workshare (tree *expr_p, gimple_seq *pre_p)
       || (ort & ORT_HOST_TEAMS) == ORT_HOST_TEAMS)
     {
       push_gimplify_context ();
+
+      /* FIXME: Reductions are not supported in kernels regions yet.  */
+      if (/*ort == ORT_ACC_KERNELS ||*/ ort == ORT_ACC_PARALLEL)
+        localize_reductions (OMP_CLAUSES (expr), OMP_BODY (expr));
+
       gimple *g = gimplify_and_return_first (OMP_BODY (expr), &body);
       if (gimple_code (g) == GIMPLE_BIND)
 	pop_gimplify_context (g);
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fd8025e0e3f..7fd3b33d41d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -7072,9 +7072,9 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	gcc_checking_assert (!is_oacc_kernels_decomposed_part (ctx));
 
 	tree orig = OMP_CLAUSE_DECL (c);
-	tree var = maybe_lookup_decl (orig, ctx);
+	tree var;
 	tree ref_to_res = NULL_TREE;
-	tree incoming, outgoing, v1, v2, v3;
+	tree incoming, outgoing;
 	bool is_private = false;
 
 	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
@@ -7086,6 +7086,9 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  rcode = BIT_IOR_EXPR;
 	tree op = build_int_cst (unsigned_type_node, rcode);
 
+	var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+	if (!var)
+	  var = maybe_lookup_decl (orig, ctx);
 	if (!var)
 	  var = orig;
 
@@ -7176,36 +7179,13 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	if (!ref_to_res)
 	  ref_to_res = integer_zero_node;
 
-	if (omp_is_reference (orig))
+	if (omp_is_reference (outgoing))
 	  {
-	    tree type = TREE_TYPE (var);
-	    const char *id = IDENTIFIER_POINTER (DECL_NAME (var));
-
-	    if (!inner)
-	      {
-		tree x = create_tmp_var (TREE_TYPE (type), id);
-		gimplify_assign (var, build_fold_addr_expr (x), fork_seq);
-	      }
-
-	    v1 = create_tmp_var (type, id);
-	    v2 = create_tmp_var (type, id);
-	    v3 = create_tmp_var (type, id);
-
-	    gimplify_assign (v1, var, fork_seq);
-	    gimplify_assign (v2, var, fork_seq);
-	    gimplify_assign (v3, var, fork_seq);
-
-	    var = build_simple_mem_ref (var);
-	    v1 = build_simple_mem_ref (v1);
-	    v2 = build_simple_mem_ref (v2);
-	    v3 = build_simple_mem_ref (v3);
 	    outgoing = build_simple_mem_ref (outgoing);
 
 	    if (!TREE_CONSTANT (incoming))
 	      incoming = build_simple_mem_ref (incoming);
 	  }
-	else
-	  v1 = v2 = v3 = var;
 
 	/* Determine position in reduction buffer, which may be used
 	   by target.  The parser has ensured that this is not a
@@ -7238,20 +7218,21 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, init_code,
 					  unshare_expr (ref_to_res),
-					  v1, level, op, off);
+					  var, level, op, off);
 	tree fini_call
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, fini_code,
 					  unshare_expr (ref_to_res),
-					  v2, level, op, off);
+					  var, level, op, off);
 	tree teardown_call
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
-					  TREE_TYPE (var), 6, teardown_code,
-					  ref_to_res, v3, level, op, off);
+					  TREE_TYPE (var), 6,
+					  teardown_code, ref_to_res, var,
+					  level, op, off);
 
-	gimplify_assign (v1, setup_call, &before_fork);
-	gimplify_assign (v2, init_call, &after_fork);
-	gimplify_assign (v3, fini_call, &before_join);
+	gimplify_assign (var, setup_call, &before_fork);
+	gimplify_assign (var, init_call, &after_fork);
+	gimplify_assign (var, fini_call, &before_join);
 	gimplify_assign (outgoing, teardown_call, &after_join);
       }
 
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index d2e6c895e42..01b106b81d7 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -259,7 +259,9 @@ enum omp_clause_code {
                 placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.
      Operand 4: OMP_CLAUSE_REDUCTION_DECL_PLACEHOLDER: Another dummy
 		VAR_DECL placeholder, used like the above for C/C++ array
-		reductions.  */
+		reductions.
+     Operand 5: OMP_CLAUSE_REDUCTION_PRIVATE_DECL: A private VAR_DECL of
+                the original DECL associated with the reduction clause.  */
   OMP_CLAUSE_REDUCTION,
 
   /* OpenMP clause: task_reduction (operator:variable_list).  */
diff --git a/gcc/tree.c b/gcc/tree.c
index c09434d7293..7ff82b91892 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -284,7 +284,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_SHARED  */
   1, /* OMP_CLAUSE_FIRSTPRIVATE  */
   2, /* OMP_CLAUSE_LASTPRIVATE  */
-  5, /* OMP_CLAUSE_REDUCTION  */
+  6, /* OMP_CLAUSE_REDUCTION  */
   5, /* OMP_CLAUSE_TASK_REDUCTION  */
   5, /* OMP_CLAUSE_IN_REDUCTION  */
   1, /* OMP_CLAUSE_COPYIN  */
@@ -12326,11 +12326,16 @@ walk_tree_1 (tree *tp, walk_tree_fn func, void *data,
 	  WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
 
 	case OMP_CLAUSE_REDUCTION:
+	  {
+	    for (int i = 0; i < 6; i++)
+	      WALK_SUBTREE (OMP_CLAUSE_OPERAND (*tp, i));
+	    WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
+	  }
+
 	case OMP_CLAUSE_TASK_REDUCTION:
 	case OMP_CLAUSE_IN_REDUCTION:
 	  {
-	    int i;
-	    for (i = 0; i < 5; i++)
+	    for (int i = 0; i < 5; i++)
 	      WALK_SUBTREE (OMP_CLAUSE_OPERAND (*tp, i));
 	    WALK_SUBTREE_TAIL (OMP_CLAUSE_CHAIN (*tp));
 	  }
diff --git a/gcc/tree.h b/gcc/tree.h
index 4f33868e8e1..baef6a75fa6 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1685,6 +1685,8 @@ class auto_suppress_location_wrappers
 #define OMP_CLAUSE_REDUCTION_DECL_PLACEHOLDER(NODE) \
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_RANGE_CHECK (NODE, OMP_CLAUSE_REDUCTION, \
 					      OMP_CLAUSE_IN_REDUCTION), 4)
+#define OMP_CLAUSE_REDUCTION_PRIVATE_DECL(NODE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 5)
 
 /* True if a REDUCTION clause may reference the original list item (omp_orig)
    in its OMP_CLAUSE_REDUCTION_{,GIMPLE_}INIT.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
new file mode 100644
index 00000000000..052ccc51d6a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-2.C
@@ -0,0 +1,64 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+void workers (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(16)
+  {
+    int i, j;
+#pragma acc loop gang
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop worker
+	for (j = 0; j < 256; j++)
+	  {
+	    int tmpvar;
+	    int &tmpref = tmpvar;
+	    tmpref = (i * 256 + j) * 99;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 99)
+      abort ();
+}
+
+void vectors (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(16)
+  {
+    int i, j;
+#pragma acc loop gang worker
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop vector
+	for (j = 0; j < 256; j++)
+	  {
+	    int tmpvar;
+	    int &tmpref = tmpvar;
+	    tmpref = (i * 256 + j) * 101;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 101)
+      abort ();
+}
+
+int main (int argc, char *argv[])
+{
+  workers ();
+  vectors ();
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
new file mode 100644
index 00000000000..d887178d507
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c++/privatized-ref-3.C
@@ -0,0 +1,64 @@
+/* { dg-do run } */
+
+#include <stdlib.h>
+
+void workers (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(16)
+  {
+    int i, j;
+    int tmpvar;
+    int &tmpref = tmpvar;
+#pragma acc loop gang
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop worker private(tmpref)
+	for (j = 0; j < 256; j++)
+	  {
+	    tmpref = (i * 256 + j) * 99;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 99)
+      abort ();
+}
+
+void vectors (void)
+{
+  double res[65536];
+  int i;
+
+#pragma acc parallel copyout(res) num_gangs(64) num_workers(16)
+  {
+    int i, j;
+    int tmpvar;
+    int &tmpref = tmpvar;
+#pragma acc loop gang worker
+    for (i = 0; i < 256; i++)
+      {
+#pragma acc loop vector private(tmpref)
+	for (j = 0; j < 256; j++)
+	  {
+	    tmpref = (i * 256 + j) * 101;
+	    res[i * 256 + j] = tmpref;
+	  }
+      }
+  }
+
+  for (i = 0; i < 65536; i++)
+    if (res[i] != i * 101)
+      abort ();
+}
+
+int main (int argc, char *argv[])
+{
+  workers ();
+  vectors ();
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95 b/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95
new file mode 100644
index 00000000000..e4b85206cc1
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/privatized-ref-1.f95
@@ -0,0 +1,71 @@
+! { dg-do run }
+
+program main
+  implicit none
+  integer :: myint
+  integer :: i
+  real :: res(65536), tmp
+
+  res(:) = 0.0
+
+  myint = 5
+  call workers(myint, res)
+
+  do i=1,65536
+    tmp = i * 99
+    if (res(i) .ne. tmp) stop 1
+  end do
+
+  res(:) = 0.0
+
+  myint = 7
+  call vectors(myint, res)
+
+  do i=1,65536
+    tmp = i * 101
+    if (res(i) .ne. tmp) stop 2
+  end do
+
+contains
+
+  subroutine workers(t1, res)
+    implicit none
+    integer :: t1
+    integer :: i, j
+    real, intent(out) :: res(:)
+
+    !$acc parallel copyout(res) num_gangs(64) num_workers(16)
+
+    !$acc loop gang
+    do i=0,255
+      !$acc loop worker private(t1)
+      do j=1,256
+        t1 = (i * 256 + j) * 99
+        res(i * 256 + j) = t1
+      end do
+    end do
+
+    !$acc end parallel
+  end subroutine workers
+
+  subroutine vectors(t1, res)
+    implicit none
+    integer :: t1
+    integer :: i, j
+    real, intent(out) :: res(:)
+
+    !$acc parallel copyout(res) num_gangs(64) num_workers(16)
+
+    !$acc loop gang worker
+    do i=0,255
+      !$acc loop vector private(t1)
+      do j=1,256
+        t1 = (i * 256 + j) * 101
+        res(i * 256 + j) = t1
+      end do
+    end do
+
+    !$acc end parallel
+  end subroutine vectors
+
+end program main
-- 
2.29.2



* [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support)
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
@ 2021-07-29  7:49   ` Thomas Schwinge
  2021-08-06 10:20     ` Julian Brown
  2021-08-04 13:13   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Thomas Schwinge @ 2021-07-29  7:49 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3894 bytes --]

Hi Julian!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> This patch implements worker-partitioning support in the middle end,
> [...]

I've first separately pushed the mostly "mechanical changes" re
"[OpenACC] Extract 'pass_oacc_loop_designation' out of
'pass_oacc_device_lower'" to master branch in commit
0829ab79d37be6c59072af0c4f54043f7e9d23ea, see attached.

A few comments there:

> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c

> @@ -1367,6 +1368,8 @@ oacc_loop_xform_head_tail (gcall *from, int level)
>        else if (gimple_call_internal_p (stmt, IFN_GOACC_REDUCTION))
>       *gimple_call_arg_ptr (stmt, 3) = replacement;
>
> +      update_stmt (stmt);
> +
>        gsi_next (&gsi);
>        while (gsi_end_p (gsi))
>       gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
> @@ -1391,25 +1394,28 @@ oacc_loop_process (oacc_loop *loop)
> [...]
> +       update_stmt (call);

Sneaky.  ACK.

>  /* Main entry point for oacc transformations which run on the device
>     compiler after LTO, so we know what the target device is at this
>     point (including the host fallback).  */
>
>  static unsigned int
> -execute_oacc_device_lower ()
> +execute_oacc_loop_designation ()

This does not just do OpenACC loop designation but also includes the general
OpenACC offloaded function classification (diagnostics) as well as
OpenACC 'nohost' clause handling for OpenACC 'routine', meaning that the
"loop designation" name is not totally accurate.  But I couldn't easily
come up with anything more accurate (or an easy way to split out these
things), so I left it at that.

(Also, for later, I wonder whether all the 'oacc_loop' stuff could/should
move into its own new file 'gcc/omp-oacc-loop.cc'.  Also, the tag
'oacc_loop' isn't totally accurate either, for this also deals with
OpenACC 'routine' level of parallelism -- maybe 'oacc_lop' instead of
'oacc_loop' etc.)

> @@ -2051,10 +2072,36 @@ execute_oacc_device_lower ()
>       free_oacc_loop (l);
>      }
>
> +  free_oacc_loop (loops);
> +
>    /* Offloaded targets may introduce new basic blocks, which require
>       dominance information to update SSA.  */
>    calculate_dominance_info (CDI_DOMINATORS);
>
> +  return 0;
> +}

I do confirm that the manual 'calculate_dominance_info (CDI_DOMINATORS)' is
necessary in the original state (where this is in the middle of the two
"passes"), but given 'TODO_cleanup_cfg' as part of 'todo_flags_finish'
for new 'pass_oacc_loop_designation', we no longer need that now, as far
as I can tell.  So I removed the manual
'calculate_dominance_info (CDI_DOMINATORS)' -- but please do tell if
there is a reason to keep it.

>  namespace {
>
> +const pass_data pass_data_oacc_loop_designation =
> +{
> +  GIMPLE_PASS, /* type */
> +  "oaccloops", /* name */
> +  OPTGROUP_OMP, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa | TODO_cleanup_cfg
> +  | TODO_rebuild_alias, /* todo_flags_finish */
> +};

Do you remember why you added 'TODO_rebuild_alias' here?
'pass_oacc_device_lower' doesn't have it, and neither does
'pass_oacc_loop_designation' in your original (2017-11-27) internal
gcn/master branch commit 81ee7ef64cdfa47c01f24c79b8ebd03242c9f3eb
"Split device-lowering/gimple workers into three passes".  So I
removed that -- but please do tell if there is a reason to keep it.


Grüße
 Thomas



[-- Attachment #2: 0001-OpenACC-Extract-pass_oacc_loop_designation-out-of-pa.patch --]
[-- Type: text/x-diff, Size: 65563 bytes --]

From 0829ab79d37be6c59072af0c4f54043f7e9d23ea Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 2 Mar 2021 04:20:11 -0800
Subject: [PATCH] [OpenACC] Extract 'pass_oacc_loop_designation' out of
 'pass_oacc_device_lower'

This really is a separate step -- and another pass to be added between the two,
later on.

	gcc/
	* omp-offload.c (oacc_loop_xform_head_tail, oacc_loop_process):
	'update_stmt' after modification.
	(pass_oacc_loop_designation): New function, extracted out of...
	(pass_oacc_device_lower): ... this.
	(pass_data_oacc_loop_designation, pass_oacc_loop_designation)
	(make_pass_oacc_loop_designation): New.
	* passes.def: Add it.
	* tree-parloops.c (create_parallel_loop): Adjust.
	* tree-pass.h (make_pass_oacc_loop_designation): New.
	gcc/testsuite/
	* c-c++-common/goacc/classify-kernels-unparallelized.c:
	's%oaccdevlow%oaccloops%g'.
	* c-c++-common/goacc/classify-kernels.c: Likewise.
	* c-c++-common/goacc/classify-parallel.c: Likewise.
	* c-c++-common/goacc/classify-routine-nohost.c: Likewise.
	* c-c++-common/goacc/classify-routine.c: Likewise.
	* c-c++-common/goacc/classify-serial.c: Likewise.
	* c-c++-common/goacc/routine-nohost-1.c: Likewise.
	* g++.dg/goacc/template.C: Likewise.
	* gcc.dg/goacc/loop-processing-1.c: Likewise.
	* gfortran.dg/goacc/classify-kernels-unparallelized.f95: Likewise.
	* gfortran.dg/goacc/classify-kernels.f95: Likewise.
	* gfortran.dg/goacc/classify-parallel.f95: Likewise.
	* gfortran.dg/goacc/classify-routine-nohost.f95: Likewise.
	* gfortran.dg/goacc/classify-routine.f95: Likewise.
	* gfortran.dg/goacc/classify-serial.f95: Likewise.
	* gfortran.dg/goacc/routine-multiple-directives-1.f90: Likewise.
	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/pr85486-2.c:
	's%oaccdevlow%oaccloops%g'.
	* testsuite/libgomp.oacc-c-c++-common/pr85486-3.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/pr85486.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c:
	Likewise.
	* testsuite/libgomp.oacc-fortran/routine-nohost-1.f90: Likewise.

Co-Authored-By: Julian Brown <julian@codesourcery.com>
Co-Authored-By: Kwok Cheung Yeung <kcy@codesourcery.com>
---
 gcc/omp-offload.c                             | 98 ++++++++++++++-----
 gcc/passes.def                                |  1 +
 .../goacc/classify-kernels-unparallelized.c   |  8 +-
 .../c-c++-common/goacc/classify-kernels.c     |  8 +-
 .../c-c++-common/goacc/classify-parallel.c    |  8 +-
 .../goacc/classify-routine-nohost.c           | 22 ++---
 .../c-c++-common/goacc/classify-routine.c     | 22 ++---
 .../c-c++-common/goacc/classify-serial.c      |  8 +-
 .../c-c++-common/goacc/routine-nohost-1.c     |  8 +-
 gcc/testsuite/g++.dg/goacc/template.C         | 20 ++--
 .../gcc.dg/goacc/loop-processing-1.c          |  4 +-
 .../goacc/classify-kernels-unparallelized.f95 |  8 +-
 .../gfortran.dg/goacc/classify-kernels.f95    |  8 +-
 .../gfortran.dg/goacc/classify-parallel.f95   |  8 +-
 .../goacc/classify-routine-nohost.f95         | 20 ++--
 .../gfortran.dg/goacc/classify-routine.f95    | 20 ++--
 .../gfortran.dg/goacc/classify-serial.f95     |  8 +-
 .../goacc/routine-multiple-directives-1.f90   | 34 +++----
 gcc/tree-parloops.c                           |  2 +-
 gcc/tree-pass.h                               |  1 +
 .../libgomp.oacc-c-c++-common/pr85486-2.c     |  4 +-
 .../libgomp.oacc-c-c++-common/pr85486-3.c     |  4 +-
 .../libgomp.oacc-c-c++-common/pr85486.c       |  4 +-
 .../routine-nohost-1.c                        |  8 +-
 .../vector-length-128-1.c                     |  4 +-
 .../vector-length-128-2.c                     |  4 +-
 .../vector-length-128-3.c                     |  4 +-
 .../vector-length-128-4.c                     |  4 +-
 .../vector-length-128-5.c                     |  4 +-
 .../vector-length-128-6.c                     |  4 +-
 .../vector-length-128-7.c                     |  4 +-
 .../libgomp.oacc-fortran/routine-nohost-1.f90 |  6 +-
 32 files changed, 213 insertions(+), 157 deletions(-)

diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index bfbb0112e24..d881426ae65 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -1367,6 +1367,7 @@ oacc_loop_xform_head_tail (gcall *from, int level)
 	}
       else if (gimple_call_internal_p (stmt, IFN_GOACC_REDUCTION))
 	*gimple_call_arg_ptr (stmt, 3) = replacement;
+      update_stmt (stmt);
 
       gsi_next (&gsi);
       while (gsi_end_p (gsi))
@@ -1392,25 +1393,28 @@ oacc_loop_process (oacc_loop *loop)
       gcall *call;
       
       for (ix = 0; loop->ifns.iterate (ix, &call); ix++)
-	switch (gimple_call_internal_fn (call))
-	  {
-	  case IFN_GOACC_LOOP:
+	{
+	  switch (gimple_call_internal_fn (call))
 	    {
-	      bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
-	      gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
-	      if (!is_e)
-		gimple_call_set_arg (call, 4, chunk_arg);
-	    }
-	    break;
+	    case IFN_GOACC_LOOP:
+	      {
+		bool is_e = gimple_call_arg (call, 5) == integer_minus_one_node;
+		gimple_call_set_arg (call, 5, is_e ? e_mask_arg : mask_arg);
+		if (!is_e)
+		  gimple_call_set_arg (call, 4, chunk_arg);
+	      }
+	      break;
 
-	  case IFN_GOACC_TILE:
-	    gimple_call_set_arg (call, 3, mask_arg);
-	    gimple_call_set_arg (call, 4, e_mask_arg);
-	    break;
+	    case IFN_GOACC_TILE:
+	      gimple_call_set_arg (call, 3, mask_arg);
+	      gimple_call_set_arg (call, 4, e_mask_arg);
+	      break;
 
-	  default:
-	    gcc_unreachable ();
-	  }
+	    default:
+	      gcc_unreachable ();
+	    }
+	  update_stmt (call);
+	}
 
       unsigned dim = GOMP_DIM_GANG;
       unsigned mask = loop->mask | loop->e_mask;
@@ -1912,7 +1916,7 @@ is_sync_builtin_call (gcall *call)
    point (including the host fallback).  */
 
 static unsigned int
-execute_oacc_device_lower ()
+execute_oacc_loop_designation ()
 {
   tree attrs = oacc_get_fn_attrib (current_function_decl);
 
@@ -1981,6 +1985,8 @@ execute_oacc_device_lower ()
 	gcc_unreachable ();
     }
 
+  /* This doesn't conceptually belong in 'pass_oacc_loop_designation', but
+     it's a convenient place, so...  */
   if (is_oacc_routine)
     {
       tree attr = lookup_attribute ("omp declare target",
@@ -2088,9 +2094,23 @@ execute_oacc_device_lower ()
 	free_oacc_loop (l);
     }
 
-  /* Offloaded targets may introduce new basic blocks, which require
-     dominance information to update SSA.  */
-  calculate_dominance_info (CDI_DOMINATORS);
+  free_oacc_loop (loops);
+
+  return 0;
+}
+
+static unsigned int
+execute_oacc_device_lower ()
+{
+  tree attrs = oacc_get_fn_attrib (current_function_decl);
+
+  if (!attrs)
+    /* Not an offloaded function.  */
+    return 0;
+
+  int dims[GOMP_DIM_MAX];
+  for (unsigned i = 0; i < GOMP_DIM_MAX; i++)
+    dims[i] = oacc_get_fn_dim_size (current_function_decl, i);
 
   hash_map<tree, tree> adjusted_vars;
 
@@ -2355,8 +2375,6 @@ execute_oacc_device_lower ()
 	  }
     }
 
-  free_oacc_loop (loops);
-
   return 0;
 }
 
@@ -2397,6 +2415,36 @@ default_goacc_dim_limit (int ARG_UNUSED (axis))
 
 namespace {
 
+const pass_data pass_data_oacc_loop_designation =
+{
+  GIMPLE_PASS, /* type */
+  "oaccloops", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_oacc_loop_designation : public gimple_opt_pass
+{
+public:
+  pass_oacc_loop_designation (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_loop_designation, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) { return flag_openacc; }
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_oacc_loop_designation ();
+    }
+
+}; // class pass_oacc_loop_designation
+
 const pass_data pass_data_oacc_device_lower =
 {
   GIMPLE_PASS, /* type */
@@ -2429,6 +2477,12 @@ public:
 
 } // anon namespace
 
+gimple_opt_pass *
+make_pass_oacc_loop_designation (gcc::context *ctxt)
+{
+  return new pass_oacc_loop_designation (ctxt);
+}
+
 gimple_opt_pass *
 make_pass_oacc_device_lower (gcc::context *ctxt)
 {
diff --git a/gcc/passes.def b/gcc/passes.def
index e2858368b7d..26d86df2f5a 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -183,6 +183,7 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_oacc_loop_designation);
   NEXT_PASS (pass_oacc_device_lower);
   NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_omp_target_link);
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
index 218f6248062..1d12658790d 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels-unparallelized.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -38,6 +38,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
index 95a150ca9ac..bdf7b4a0641 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-kernels.c
@@ -5,7 +5,7 @@
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
    { dg-additional-options "-fdump-tree-parloops1-all" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -34,6 +34,6 @@ void KERNELS ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
index 230e70c66cd..9056aa69dad 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-parallel.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -27,6 +27,6 @@ void PARALLEL ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-routine-nohost.c b/gcc/testsuite/c-c++-common/goacc/classify-routine-nohost.c
index a58482f7f92..99855822011 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-routine-nohost.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-routine-nohost.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -28,14 +28,14 @@ void ROUTINE ()
    { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(omp declare target \\(nohost worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "ompexp" } } */
 
 /* Check the offloaded function's classification.
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' has 'nohost' clause" 1 "oaccdevlow" { target c } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' has 'nohost' clause" 1 "oaccdevlow" { target { c++ && { ! offloading_enabled } } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' has 'nohost' clause" 1 "oaccdevlow" { target { c++ && offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' discarded" 1 "oaccdevlow" { target c } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' discarded" 1 "oaccdevlow" { target { c++ && { ! offloading_enabled } } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' discarded" 1 "oaccdevlow" { target { c++ && offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' has 'nohost' clause" 1 "oaccloops" { target c } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' has 'nohost' clause" 1 "oaccloops" { target { c++ && { ! offloading_enabled } } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' has 'nohost' clause" 1 "oaccloops" { target { c++ && offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' discarded" 1 "oaccloops" { target c } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' discarded" 1 "oaccloops" { target { c++ && { ! offloading_enabled } } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' discarded" 1 "oaccloops" { target { c++ && offloading_enabled } } } }
    TODO See PR101551 for 'offloading_enabled' differences.
-   { dg-final { scan-tree-dump-not "(?n)Compute dimensions" "oaccdevlow" } }
-   { dg-final { scan-tree-dump-not "(?n)__attribute__\\(.*omp declare target \\(nohost" "oaccdevlow" } }
-   { dg-final { scan-tree-dump-not "(?n)void ROUTINE \\(\\)" "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-not "(?n)Compute dimensions" "oaccloops" } }
+   { dg-final { scan-tree-dump-not "(?n)__attribute__\\(.*omp declare target \\(nohost" "oaccloops" } }
+   { dg-final { scan-tree-dump-not "(?n)void ROUTINE \\(\\)" "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-routine.c b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
index cc0ba2b9a7d..f7f0454009b 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-routine.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-routine.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -29,14 +29,14 @@ void ROUTINE ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' doesn't have 'nohost' clause" 1 "oaccdevlow" { target c } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { c++ && { ! offloading_enabled } } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { c++ && offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' not discarded" 1 "oaccdevlow" { target c } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' not discarded" 1 "oaccdevlow" { target { c++ && { ! offloading_enabled } } } } }
-   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' not discarded" 1 "oaccdevlow" { target { c++ && offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' doesn't have 'nohost' clause" 1 "oaccloops" { target c } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' doesn't have 'nohost' clause" 1 "oaccloops" { target { c++ && { ! offloading_enabled } } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' doesn't have 'nohost' clause" 1 "oaccloops" { target { c++ && offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE' not discarded" 1 "oaccloops" { target c } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'void ROUTINE\\(\\)' not discarded" 1 "oaccloops" { target { c++ && { ! offloading_enabled } } } } }
+   { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'ROUTINE\\(\\)' not discarded" 1 "oaccloops" { target { c++ && offloading_enabled } } } }
    TODO See PR101551 for 'offloading_enabled' differences.
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)void ROUTINE \\(\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\), oacc function \\(0 1, 1 0, 1 0\\)\\)\\)" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)void ROUTINE \\(\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/classify-serial.c b/gcc/testsuite/c-c++-common/goacc/classify-serial.c
index ae052ae6a1c..f41c141bcd5 100644
--- a/gcc/testsuite/c-c++-common/goacc/classify-serial.c
+++ b/gcc/testsuite/c-c++-common/goacc/classify-serial.c
@@ -4,7 +4,7 @@
 /* { dg-additional-options "-O2" }
    { dg-additional-options "-fopt-info-optimized-omp" }
    { dg-additional-options "-fdump-tree-ompexp" }
-   { dg-additional-options "-fdump-tree-oaccdevlow" } */
+   { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
    aspects of that functionality.  */
@@ -32,6 +32,6 @@ void SERIAL ()
 
 /* Check the offloaded function's classification and compute dimensions (will
    always be 1 x 1 x 1 for non-offloading compilation).
-   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccdevlow" } } */
+   { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+   { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccloops" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/routine-nohost-1.c b/gcc/testsuite/c-c++-common/goacc/routine-nohost-1.c
index c8927416efa..59ebb2bc5a9 100644
--- a/gcc/testsuite/c-c++-common/goacc/routine-nohost-1.c
+++ b/gcc/testsuite/c-c++-common/goacc/routine-nohost-1.c
@@ -1,6 +1,6 @@
 /* Test OpenACC 'routine' with 'nohost' clause, valid use.  */
 
-/* { dg-additional-options "-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #pragma acc routine nohost
 int THREE(void)
@@ -13,7 +13,7 @@ int THREE(void)
 #pragma acc routine nohost
 extern int THREE(void);
 
-/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*THREE[^']*' has 'nohost' clause\.$} 1 oaccdevlow } } */
+/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*THREE[^']*' has 'nohost' clause\.$} 1 oaccloops } } */
 
 
 #pragma acc routine nohost
@@ -30,7 +30,7 @@ extern void NOTHING(void);
 
 #pragma acc routine (NOTHING) nohost
 
-/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*NOTHING[^']*' has 'nohost' clause\.$} 1 oaccdevlow } } */
+/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*NOTHING[^']*' has 'nohost' clause\.$} 1 oaccloops } } */
 
 
 extern float ADD(float, float);
@@ -47,4 +47,4 @@ extern float ADD(float, float);
 
 #pragma acc routine (ADD) nohost
 
-/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*ADD[^']*' has 'nohost' clause\.$} 1 oaccdevlow } } */
+/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']*ADD[^']*' has 'nohost' clause\.$} 1 oaccloops } } */
diff --git a/gcc/testsuite/g++.dg/goacc/template.C b/gcc/testsuite/g++.dg/goacc/template.C
index f34fcfea52d..10d3f446da7 100644
--- a/gcc/testsuite/g++.dg/goacc/template.C
+++ b/gcc/testsuite/g++.dg/goacc/template.C
@@ -1,4 +1,4 @@
-/* { dg-additional-options "-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-fdump-tree-oaccloops" } */
 
 #pragma acc routine nohost
 template <typename T> T
@@ -156,13 +156,13 @@ main ()
   return b + c;
 }
 
-/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']+' has 'nohost' clause\.$} 4 oaccdevlow } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = char\]' has 'nohost' clause\.$} 1 oaccdevlow { target { ! offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<char>\(int\)char' has 'nohost' clause\.$} 1 oaccdevlow { target offloading_enabled } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = int\]' has 'nohost' clause\.$} 1 oaccdevlow { target { ! offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<int>\(int\)int' has 'nohost' clause\.$} 1 oaccdevlow { target offloading_enabled } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = float\]' has 'nohost' clause\.$} 1 oaccdevlow { target { ! offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<float>\(int\)float' has 'nohost' clause\.$} 1 oaccdevlow { target offloading_enabled } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = double\]' has 'nohost' clause\.$} 1 oaccdevlow { target { ! offloading_enabled } } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<double>\(int\)double' has 'nohost' clause\.$} 1 oaccdevlow { target offloading_enabled } } }
+/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine '[^']+' has 'nohost' clause\.$} 4 oaccloops } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = char\]' has 'nohost' clause\.$} 1 oaccloops { target { ! offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<char>\(int\)char' has 'nohost' clause\.$} 1 oaccloops { target offloading_enabled } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = int\]' has 'nohost' clause\.$} 1 oaccloops { target { ! offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<int>\(int\)int' has 'nohost' clause\.$} 1 oaccloops { target offloading_enabled } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = float\]' has 'nohost' clause\.$} 1 oaccloops { target { ! offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<float>\(int\)float' has 'nohost' clause\.$} 1 oaccloops { target offloading_enabled } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'T accDouble\(int\) \[with T = double\]' has 'nohost' clause\.$} 1 oaccloops { target { ! offloading_enabled } } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'accDouble<double>\(int\)double' has 'nohost' clause\.$} 1 oaccloops { target offloading_enabled } } }
    TODO See PR101551 for 'offloading_enabled' differences.  */
diff --git a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
index bd4c07e7d81..78b9aed89be 100644
--- a/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
+++ b/gcc/testsuite/gcc.dg/goacc/loop-processing-1.c
@@ -1,5 +1,5 @@
 /* Make sure that OpenACC loop processing happens.  */
-/* { dg-additional-options "-O2 -fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-O2 -fdump-tree-oaccloops" } */
 
 extern int place ();
 
@@ -15,4 +15,4 @@ void vector_1 (int *ary, int size)
   }
 }
 
-/* { dg-final { scan-tree-dump {OpenACC loops.*Loop 0\(0\).*Loop 24\(1\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 0\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 0\);.*Loop 6\(6\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 1\);.*Head-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 2\);.*Tail-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 2\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 2\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 1\);} "oaccdevlow" } } */
+/* { dg-final { scan-tree-dump {OpenACC loops.*Loop 0\(0\).*Loop 24\(1\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 1, 36\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 0\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 0\);.*Loop 6\(6\).*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*Head-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, 0, 2, 6\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 1\);.*Head-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_HEAD_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_FORK, \.data_dep\.[0-9_]+, 2\);.*Tail-1:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 2\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 2\);.*Tail-0:.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_TAIL_MARK, \.data_dep\.[0-9_]+, 1\);.*\.data_dep\.[0-9_]+ = \.UNIQUE \(OACC_JOIN, \.data_dep\.[0-9_]+, 1\);} "oaccloops" } } */
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
index cb5251a2aeb..3fb48b321f2 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels-unparallelized.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -40,6 +40,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is unparallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
index 07aaf065e1d..6c8d298e236 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-kernels.f95
@@ -5,7 +5,7 @@
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
 ! { dg-additional-options "-fdump-tree-parloops1-all" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -36,6 +36,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is parallelized OpenACC kernels offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc kernels parallelized, oacc function \\(, , \\), oacc kernels, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
index a41e0e68b38..ce4c08ff219 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-parallel.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -29,6 +29,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC parallel offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc parallel, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-routine-nohost.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-routine-nohost.f95
index 0e06fb9f0ba..07e2063551f 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-routine-nohost.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-routine-nohost.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -27,13 +27,13 @@ end subroutine ROUTINE
 ! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 0, 1 0\\), omp declare target \\(nohost worker\\)\\)\\)" 1 "ompexp" } }
 
 ! Check the offloaded function's classification.
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' has 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' has 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' discarded" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' discarded" 1 "oaccdevlow" { target offloading_enabled } } }
-! { dg-final { scan-tree-dump-not "(?n)Compute dimensions" "oaccdevlow" } }
-! { dg-final { scan-tree-dump-not "(?n)__attribute__\\(.*omp declare target \\(nohost" "oaccdevlow" } }
-! { dg-final { scan-tree-dump-not "(?n)void routine \\(\\)" "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-not "(?n)void routine_ \\(\\)" "oaccdevlow" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' has 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' has 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' discarded" 1 "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' discarded" 1 "oaccloops" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-not "(?n)Compute dimensions" "oaccloops" } }
+! { dg-final { scan-tree-dump-not "(?n)__attribute__\\(.*omp declare target \\(nohost" "oaccloops" } }
+! { dg-final { scan-tree-dump-not "(?n)void routine \\(\\)" "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-not "(?n)void routine_ \\(\\)" "oaccloops" { target offloading_enabled } } }
 !TODO See PR101551 for 'offloading_enabled' differences.
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
index 92d3243cdcf..b065ccadacd 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-routine.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -28,13 +28,13 @@ end subroutine ROUTINE
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' doesn't have 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' not discarded" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' not discarded" 1 "oaccdevlow" { target offloading_enabled } } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)void routine \\(\\)" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times "(?n)void routine_ \\(\\)" 1 "oaccdevlow" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC routine level 1" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' doesn't have 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' doesn't have 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine' not discarded" 1 "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'routine_' not discarded" 1 "oaccloops" { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(0 1, 1 1, 1 1\\), omp declare target \\(worker\\)\\)\\)" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)void routine \\(\\)" 1 "oaccloops" { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times "(?n)void routine_ \\(\\)" 1 "oaccloops" { target offloading_enabled } } }
 !TODO See PR101551 for 'offloading_enabled' differences.
diff --git a/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95 b/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
index 6dcb1b170f8..f5cb3fe50c5 100644
--- a/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/classify-serial.f95
@@ -4,7 +4,7 @@
 ! { dg-additional-options "-O2" }
 ! { dg-additional-options "-fopt-info-optimized-omp" }
 ! { dg-additional-options "-fdump-tree-ompexp" }
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
 ! aspects of that functionality.
@@ -32,6 +32,6 @@ end program main
 
 ! Check the offloaded function's classification and compute dimensions (will
 ! always be 1 x 1 x 1 for non-offloading compilation).
-! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccdevlow" } }
-! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccdevlow" } }
+! { dg-final { scan-tree-dump-times "(?n)Function is OpenACC serial offload" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)Compute dimensions \\\[1, 1, 1\\\]" 1 "oaccloops" } }
+! { dg-final { scan-tree-dump-times "(?n)__attribute__\\(\\(oacc function \\(1, 1, 1\\), oacc serial, omp target entrypoint\\)\\)" 1 "oaccloops" } }
diff --git a/gcc/testsuite/gfortran.dg/goacc/routine-multiple-directives-1.f90 b/gcc/testsuite/gfortran.dg/goacc/routine-multiple-directives-1.f90
index 44ef4533f04..42bcb0e8d63 100644
--- a/gcc/testsuite/gfortran.dg/goacc/routine-multiple-directives-1.f90
+++ b/gcc/testsuite/gfortran.dg/goacc/routine-multiple-directives-1.f90
@@ -1,6 +1,6 @@
 ! Check for valid cases of multiple OpenACC 'routine' directives.
 
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 !TODO See PR101551 for 'offloading_enabled' differences.
 
 ! { dg-additional-options "-Wopenacc-parallelism" } for testing/documenting
@@ -11,32 +11,32 @@
 !$ACC ROUTINE(s_1) SEQ
 !$ACC ROUTINE SEQ
       END SUBROUTINE s_1
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_' doesn't have 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1' doesn't have 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_' doesn't have 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE s_1_nh
 !$ACC ROUTINE(s_1_nh) NOHOST
 !$ACC ROUTINE(s_1_nh) SEQ NOHOST
 !$ACC ROUTINE NOHOST SEQ
       END SUBROUTINE s_1_nh
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_nh' has 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_nh_' has 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_nh' has 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_1_nh_' has 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE s_2
 !$ACC ROUTINE
 !$ACC ROUTINE SEQ
 !$ACC ROUTINE(s_2)
       END SUBROUTINE s_2
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_' doesn't have 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2' doesn't have 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_' doesn't have 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE s_2_nh
 !$ACC ROUTINE NOHOST
 !$ACC ROUTINE NOHOST SEQ
 !$ACC ROUTINE(s_2_nh) NOHOST
       END SUBROUTINE s_2_nh
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_nh' has 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_nh_' has 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_nh' has 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 's_2_nh_' has 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE v_1
 !$ACC ROUTINE VECTOR
@@ -45,8 +45,8 @@
 !$ACC ROUTINE VECTOR
 ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "" { target *-*-* } .-5 }
       END SUBROUTINE v_1
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_' doesn't have 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1' doesn't have 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_' doesn't have 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE v_1_nh
 !$ACC ROUTINE NOHOST VECTOR
@@ -55,8 +55,8 @@
 !$ACC ROUTINE VECTOR NOHOST
 ! { dg-bogus "region is vector partitioned but does not contain vector partitioned code" "" { target *-*-* } .-5 }
       END SUBROUTINE v_1_nh
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_nh' has 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_nh_' has 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_nh' has 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_1_nh_' has 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE v_2
 !$ACC ROUTINE(v_2) VECTOR
@@ -64,8 +64,8 @@
 !$ACC ROUTINE(v_2) VECTOR
 ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "" { target *-*-* } .-4 }
       END SUBROUTINE v_2
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2' doesn't have 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_' doesn't have 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2' doesn't have 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_' doesn't have 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE v_2_nh
 !$ACC ROUTINE(v_2_nh) VECTOR NOHOST
@@ -73,8 +73,8 @@
 !$ACC ROUTINE(v_2_nh) NOHOST VECTOR
 ! { dg-bogus "region is vector partitioned but does not contain vector partitioned code" "" { target *-*-* } .-4 }
       END SUBROUTINE v_2_nh
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_nh' has 'nohost' clause" 1 "oaccdevlow" { target { ! offloading_enabled } } } }
-      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_nh_' has 'nohost' clause" 1 "oaccdevlow" { target offloading_enabled } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_nh' has 'nohost' clause" 1 "oaccloops" { target { ! offloading_enabled } } } }
+      ! { dg-final { scan-tree-dump-times "(?n)OpenACC routine 'v_2_nh_' has 'nohost' clause" 1 "oaccloops" { target offloading_enabled } } }
 
       SUBROUTINE sub_1
       IMPLICIT NONE
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index bb547572653..4d40c96df7d 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2867,7 +2867,7 @@ create_parallel_loop (class loop *loop, tree loop_fn, tree data,
   /* Emit GIMPLE_OMP_FOR.  */
   if (oacc_kernels_p)
     /* Parallelized OpenACC kernels constructs use gang parallelism.  See also
-       omp-offload.c:execute_oacc_device_lower.  */
+       omp-offload.c:execute_oacc_loop_designation.  */
     t = build_omp_clause (loc, OMP_CLAUSE_GANG);
   else
     {
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 1f5b1370a95..5484ad5eac7 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -424,6 +424,7 @@ extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_loop_designation (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
index d45326488cd..17cc9bd663e 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-2.c
@@ -2,10 +2,10 @@
 /* { dg-additional-options "-DVECTOR_LENGTH=" } */
 /* { dg-additional-options "-fopenacc-dim=::128" } */
 
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include "pr85486.c"
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
index 33480a4ae68..5d05540ce46 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486-3.c
@@ -2,10 +2,10 @@
 /* { dg-additional-options "-DVECTOR_LENGTH=" } */
 /* { dg-set-target-env-var "GOMP_OPENACC_DIM" "::128" } */
 
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include "pr85486.c"
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
index 0d98b82f993..f95f2ee3123 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486.c
@@ -1,7 +1,7 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
 /* { dg-additional-options "-DVECTOR_LENGTH=vector_length(128)" } */
 
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 /* Minimized from ref-1.C.  */
@@ -54,5 +54,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c
index dc92727d5be..7dc7459e5fe 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-nohost-1.c
@@ -4,7 +4,7 @@
    { dg-skip-if "TODO PR82391" { *-*-* } { "-O0" } }
 */
 
-/* { dg-additional-options "-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-fdump-tree-oaccloops" } */
 
 /* { dg-additional-options "-fno-inline" } for stable results regarding OpenACC 'routine'.  */
 
@@ -36,9 +36,9 @@ static int fact_nohost(int n)
 
   return fact(n);
 }
-/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost' has 'nohost' clause\.$} 1 oaccdevlow { target c } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'int fact_nohost\(int\)' has 'nohost' clause\.$} 1 oaccdevlow { target { c++ && { ! offloading_enabled } } } } }
-   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost\(int\)' has 'nohost' clause\.$} 1 oaccdevlow { target { c++ && offloading_enabled } } } }
+/* { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost' has 'nohost' clause\.$} 1 oaccloops { target c } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'int fact_nohost\(int\)' has 'nohost' clause\.$} 1 oaccloops { target { c++ && { ! offloading_enabled } } } } }
+   { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost\(int\)' has 'nohost' clause\.$} 1 oaccloops { target { c++ && offloading_enabled } } } }
    TODO See PR101551 for 'offloading_enabled' differences.  */
 
 int main()
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
index 18d77cc5ecb..5158bb5eb89 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-1.c
@@ -1,5 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -34,5 +34,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
index 8b5b2a4a92d..a3e44ebfbcb 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-2.c
@@ -1,6 +1,6 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
 /* { dg-additional-options "-fopenacc-dim=::128" } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -35,5 +35,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
index 59be37a7c27..a85400d09c5 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-3.c
@@ -1,5 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* We default to warp size 32 for the vector length, so the GOMP_OPENACC_DIM has
    no effect.  */
 /* { dg-set-target-env-var "GOMP_OPENACC_DIM" "::128" } */
@@ -38,5 +38,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 1, 32\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=1, vectors=32" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
index e5d1df09b8a..24c078f377c 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-4.c
@@ -1,5 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -36,5 +36,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
index e60f1c28db4..fcca9f593bb 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-5.c
@@ -1,6 +1,6 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
 /* { dg-additional-options "-fopenacc-dim=:2:128" } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -37,5 +37,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 2, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
index a1f67622f84..0807eab7eee 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-6.c
@@ -1,6 +1,6 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
 /* { dg-set-target-env-var "GOMP_OPENACC_DIM" ":2:" } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -37,5 +37,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=2, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
index c419f6499b5..4a8c1bf549e 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-7.c
@@ -1,5 +1,5 @@
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-foffload=-fdump-tree-oaccdevlow" } */
+/* { dg-additional-options "-foffload=-fdump-tree-oaccloops" } */
 /* { dg-set-target-env-var "GOMP_DEBUG" "1" } */
 
 #include <stdlib.h>
@@ -36,5 +36,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccdevlow" } } */
+/* { dg-final { scan-offload-tree-dump "__attribute__\\(\\(oacc function \\(1, 0, 128\\)" "oaccloops" } } */
 /* { dg-output "nvptx_exec: kernel main\\\$_omp_fn\\\$0: launch gangs=1, workers=8, vectors=128" } */
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90
index cd5bddc8685..b0537b8ff0b 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/routine-nohost-1.f90
@@ -5,7 +5,7 @@
 ! With optimizations disabled, we currently don't expect that 'acc_on_device' "evaluates at compile time to a constant".
 ! { dg-skip-if "TODO PR82391" { *-*-* } { "-O0" } }
 
-! { dg-additional-options "-fdump-tree-oaccdevlow" }
+! { dg-additional-options "-fdump-tree-oaccloops" }
 
 program main
   use openacc
@@ -58,6 +58,6 @@ function fact_nohost(x) result(res)
 
   res = fact(x)
 end function fact_nohost
-! { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost' has 'nohost' clause\.$} 1 oaccdevlow { target { ! offloading_enabled } } } }
-! { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost_' has 'nohost' clause\.$} 1 oaccdevlow { target offloading_enabled } } }
+! { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost' has 'nohost' clause\.$} 1 oaccloops { target { ! offloading_enabled } } } }
+! { dg-final { scan-tree-dump-times {(?n)^OpenACC routine 'fact_nohost_' has 'nohost' clause\.$} 1 oaccloops { target offloading_enabled } } }
 !TODO See PR101551 for 'offloading_enabled' differences.
-- 
2.30.2



* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
  2021-07-29  7:49   ` [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
@ 2021-08-04 13:13   ` Thomas Schwinge
  2021-08-06  8:49     ` Julian Brown
  2021-08-04 13:56   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-04 13:13 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

Hi!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> This patch implements worker-partitioning support in the middle end,
> by rewriting gimple.  [...]

Yay!

Given:

> --- /dev/null
> +++ b/gcc/oacc-neuter-bcast.c

> +/* A map from SSA names or var decls to record fields.  */
> +
> +typedef hash_map<tree, tree> field_map_t;
> +
> +/* For each propagation record type, this is a map from SSA names or var decls
> +   to propagate, to the field in the record type that should be used for
> +   transmission and reception.  */
> +
> +typedef hash_map<tree, field_map_t *> record_field_map_t;
> +
> +static GTY(()) record_field_map_t *field_map;

Per 'gcc/doc/gty.texi': "Whenever you [...] create a new source file
containing 'GTY' markers, [...] add the filename to the 'GTFILES'
variable in 'Makefile.in'.  [...] The generated header file should be
included after everything else in the source file."  Thus:

    --- gcc/Makefile.in
    +++ gcc/Makefile.in
    @@ -2720,2 +2720,3 @@ GTFILES = $(CPPLIB_H) $(srcdir)/input.h $(srcdir)/coretypes.h \
       $(srcdir)/omp-general.h \
    +  $(srcdir)/oacc-neuter-bcast.c \
       @all_gtfiles@
    --- gcc/oacc-neuter-bcast.c
    +++ gcc/oacc-neuter-bcast.c
    @@ -1514 +1514,4 @@ make_pass_oacc_gimple_workers (gcc::context *ctxt)
     }
    +
    +
    +#include "gt-oacc-neuter-bcast.h"

That however results in:

    [...]
    build/gengtype  \
                        -r gtype.state
    warning: structure `field_map_t' used but not defined
    gengtype: Internal error: abort in error_at_line, at gengtype.c:111
    make[2]: *** [Makefile:2796: s-gtype] Error 1
    [...]

I shall try to figure out the right GC annotations to make the 'typedef's
known to the GC machinery (unless somebody can tell me off hand) -- but
is it actually necessary to allocate this as GC memory?

> +void
> +oacc_do_neutering (void)
> +{
> +  [...]
> +  field_map = record_field_map_t::create_ggc (40);
> +  [...]
> +  FOR_ALL_BB_FN (bb, cfun)
> +    {
> +      propagation_set *ws_prop = prop_set[bb->index];
> +      if (ws_prop)
> +     {
> +       tree record_type = lang_hooks.types.make_type (RECORD_TYPE);
> +       [...]
> +       field_map->put (record_type, field_map_t::create_ggc (17));
> +       [...]
> +    }
> +  [...]
> +}

'oacc_do_neutering' is the 'execute' function of the pass, so every time
it executes, a fresh 'field_map' is set up and no state persists across
runs (assuming I understand that correctly).  Why
don't we simply use standard (non-GC) memory management for that?  "For
convenience" shall be fine as an answer ;-) -- but maybe instead of
figuring out the right GC annotations, changing the memory management
will be easier?  (Or, of course, maybe I completely misunderstood that?)
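
To illustrate the non-GC alternative, here is a rough standalone sketch.
It is an assumption-laden analogy, not GCC code: it uses the C++ standard
library instead of GCC's own 'hash_map', plain integer IDs stand in for
'tree' nodes, and 'run_pass_once' is a made-up name for one execution of
the pass.  The point is only that the map is created and destroyed within
a single run:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for GCC's 'tree' nodes; purely illustrative.
using node_id = int;

// Maps a variable (SSA name or decl) to the record field used to transmit it.
using field_map_t = std::unordered_map<node_id, node_id>;

// Maps each propagation record type to its own field map.  The owning
// pointer frees the inner map when the outer map is destroyed.
using record_field_map_t
  = std::unordered_map<node_id, std::unique_ptr<field_map_t>>;

// Hypothetical stand-in for one execution of the pass: builds a fresh map,
// uses it, and lets it go out of scope, so nothing persists across runs.
std::size_t
run_pass_once (const std::vector<node_id> &record_types)
{
  record_field_map_t field_map;

  for (node_id record_type : record_types)
    {
      auto fields = std::make_unique<field_map_t> ();
      // Pretend each record has one field, keyed by a made-up variable ID.
      (*fields)[record_type * 10] = record_type * 10 + 1;
      bool inserted
        = field_map.emplace (record_type, std::move (fields)).second;
      // This sketch assumes each record type is seen only once.
      assert (inserted);
      (void) inserted;
    }

  std::size_t total_fields = 0;
  for (const auto &entry : field_map)
    total_fields += entry.second->size ();

  // 'field_map' and every inner map are released on return.
  return total_fields;
}
```

With ownership scoped to the run like this, no 'GTY' marker (and hence
no 'GTFILES' entry or generated header) would be needed for the map.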


Grüße
 Thomas


* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
  2021-07-29  7:49   ` [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
  2021-08-04 13:13   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
@ 2021-08-04 13:56   ` Thomas Schwinge
  2021-08-06  9:25     ` Julian Brown
  2021-08-09 13:21   ` Thomas Schwinge
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-04 13:56 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

Hi!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> This patch implements worker-partitioning support in the middle end,
> by rewriting gimple.  [...]

Yay!


> This version of the patch [...]
> avoids moving SESE-region finding code out
> of the NVPTX backend

So that's 'struct bb_sese' and following functions.

> since that code isn't used by the middle-end worker
> partitioning neutering/broadcasting implementation yet.

Do I understand correctly that "isn't used [...] yet" means (a) "this
isn't implemented yet" (on og11 etc.), and not (b) "is missing
from this patch submission"?  If so, does it follow from (a) that we may
later also drop these changes from the og11 branch?


Relatedly, a nontrivial amount of data structures/logic/code did get
duplicated from the nvptx back end, and modified slightly or
not-so-slightly (RTL vs. GIMPLE plus certain implementation "details").

We should at least cross reference the two instances, to make sure that
any changes to one are also propagated to the other.  (I'll take care.)

And then, do you (or anyone else, of course) happen to have any clever
idea about how to avoid the duplication, and somehow combine the RTL
vs. GIMPLE implementations?  Given that we nowadays may use C++ -- do you
foresee it feasible to have an abstract base class capturing basically
the data structures, logic, common code, and then RTL-specialized plus
GIMPLE-specialized classes inheriting from that?

For example:

    $ sed -e s%parallel_g%parallel%g < gcc/oacc-neuter-bcast.c > gcc/oacc-neuter-bcast.c_
    $ git diff --no-index --word-diff -b --patience gcc/config/nvptx/nvptx.c gcc/oacc-neuter-bcast.c_
    [...]
    /* Loop structure of the function.  The entire function is described as
       a NULL loop.  */
    @@ -3229,17 +80,21 @@ struct parallel
      basic_block forked_block;
      basic_block join_block;

      [-rtx_insn *forked_insn;-]
    [-  rtx_insn *join_insn;-]{+gimple *forked_stmt;+}
    {+  gimple *join_stmt;+}

      [-rtx_insn *fork_insn;-]
    [-  rtx_insn *joining_insn;-]{+gimple *fork_stmt;+}
    {+  gimple *joining_stmt;+}

      /* Basic blocks in this parallel, but not in child parallels.  The
         FORKED and JOINING blocks are in the partition.  The FORK and JOIN
         blocks are not.  */
      auto_vec<basic_block> blocks;

      {+tree record_type;+}
    {+  tree sender_decl;+}
    {+  tree receiver_decl;+}

    public:
      parallel (parallel *parent, unsigned mode);
      ~parallel ();
    @@ -3252,8 +107,12 @@ parallel::parallel (parallel *parent_, unsigned mask_)
      :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
    {
      forked_block = join_block = 0;
      [-forked_insn-]{+forked_stmt+} = [-join_insn-]{+join_stmt+} = [-0;-]
    [-  fork_insn-]{+NULL;+}
    {+  fork_stmt+} = [-joining_insn-]{+joining_stmt+} = [-0;-]{+NULL;+}

    {+  record_type = NULL_TREE;+}
    {+  sender_decl = NULL_TREE;+}
    {+  receiver_decl = NULL_TREE;+}

      if (parent)
        {
    @@ -3268,12 +127,54 @@ parallel::~parallel ()
      delete next;
    }
    [...]
    /* Split basic blocks such that each forked and join unspecs are at
       the start of their basic blocks.  Thus afterwards each block will
    @@ -3284,111 +185,168 @@ typedef auto_vec<insn_bb_t> insn_bb_vec_t;
       used when finding partitions.  */

    static void
    [-nvptx_split_blocks (bb_insn_map_t-]{+omp_sese_split_blocks (bb_stmt_map_t+} *map)
    {
      [-insn_bb_vec_t-]{+auto_vec<gimple *>+} worklist;
      basic_block block;
    [-  rtx_insn *insn;-]

      /* Locate all the reorg instructions of interest.  */
      FOR_ALL_BB_FN (block, cfun)
        {
    [-      bool seen_insn = false;-]

          /* Clear visited flag, for use by parallel locator  */
          block->flags &= ~BB_VISITED;

          [-FOR_BB_INSNS (block, insn)-]{+for (gimple_stmt_iterator gsi = gsi_start_bb (block);+}
    {+     !gsi_end_p (gsi);+}
    {+     gsi_next (&gsi))+}
        {
    [...]
    /* Dump this parallel and all its inner parallels.  */

    static void
    [-nvptx_dump_pars-]{+omp_sese_dump_pars+} (parallel *par, unsigned depth)
    {
      fprintf (dump_file, "%u: mask %d {+(%s)+} head=%d, tail=%d\n",
           depth, par->mask, {+mask_name (par->mask),+}
           par->forked_block ? par->forked_block->index : -1,
           par->join_block ? par->join_block->index : -1);

    @@ -3399,10 +357,10 @@ nvptx_dump_pars (parallel *par, unsigned depth)
        fprintf (dump_file, " %d", block->index);
      fprintf (dump_file, "\n");
      if (par->inner)
        [-nvptx_dump_pars-]{+omp_sese_dump_pars+} (par->inner, depth + 1);

      if (par->next)
        [-nvptx_dump_pars-]{+omp_sese_dump_pars+} (par->next, depth);
    }

    /* If BLOCK contains a fork/join marker, process it to create or
    @@ -3410,60 +368,84 @@ nvptx_dump_pars (parallel *par, unsigned depth)
       and then walk successor blocks.   */

    static parallel *
    [-nvptx_find_par (bb_insn_map_t-]{+omp_sese_find_par (bb_stmt_map_t+} *map, parallel *par, basic_block block)
    {
      if (block->flags & BB_VISITED)
        return par;
      block->flags |= BB_VISITED;

      if [-(rtx_insn **endp-]{+(gimple **stmtp+} = map->get (block))
        {
    [...]
    static parallel *
    [-nvptx_discover_pars (bb_insn_map_t-]{+omp_sese_discover_pars (bb_stmt_map_t+} *map)
    {
      basic_block block;

    @@ -3502,3468 +485,1033 @@ nvptx_discover_pars (bb_insn_map_t *map)
      block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
      block->flags &= ~BB_VISITED;

      parallel *par = [-nvptx_find_par-]{+omp_sese_find_par+} (map, 0, block);

      if (dump_file)
        {
          fprintf (dump_file, "\nLoops\n");
          [-nvptx_dump_pars-]{+omp_sese_dump_pars+} (par, 0);
          fprintf (dump_file, "\n");
        }

      return par;
    }

(For brevity, I stripped out the parts where implementation "details"
differ considerably.)


Regards
 Thomas


* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-08-04 13:13   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
@ 2021-08-06  8:49     ` Julian Brown
  2021-08-16 10:34       ` Thomas Schwinge
  0 siblings, 1 reply; 20+ messages in thread
From: Julian Brown @ 2021-08-06  8:49 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: gcc-patches, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

On Wed, 4 Aug 2021 15:13:30 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> 'oacc_do_neutering' is the 'execute' function of the pass, so that
> means every time this executes, a fresh 'field_map' is set up, no
> state persists across runs (assuming I'm understanding that
> correctly).  Why don't we simply use standard (non-GC) memory
> management for that?  "For convenience" shall be fine as an answer
> ;-) -- but maybe instead of figuring out the right GC annotations,
> changing the memory management will be easier?  (Or, of course, maybe
> I completely misunderstood that?)

I suspect you're right, and there's no need for this to be GC-allocated
memory. If standard (non-GC) memory allocation will work out fine, we
should probably use that instead.

Thanks,

Julian


* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-08-04 13:56   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
@ 2021-08-06  9:25     ` Julian Brown
  2021-08-09 13:32       ` Thomas Schwinge
  0 siblings, 1 reply; 20+ messages in thread
From: Julian Brown @ 2021-08-06  9:25 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: gcc-patches, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

On Wed, 4 Aug 2021 15:56:49 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> > This version of the patch [...]
> > avoids moving SESE-region finding code out
> > of the NVPTX backend  
> 
> So that's 'struct bb_sese' and following functions.
> 
> > since that code isn't used by the middle-end worker
> > partitioning neutering/broadcasting implementation yet.  
> 
> I understand correctly that "isn't used [...] yet" means that (a)
> "this isn't implemented yet" (on og11 etc.), and doesn't mean (b) "is
> missing from this patch submission"?  ... thus from (a) it follows
> that we may later also drop from the og11 branch these changes?

Yes, the former -- the SESE region-finding code isn't used anywhere for
the middle-end worker partitioning support. Thus if we happen to have
two adjacent blocks that both use worker-single mode, we will
conditionalise them to run on one worker only separately, with
redundant tests/branches.
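
To make the cost concrete, here is a toy model (an assumption-laden sketch:
it treats the function as a straight-line sequence of blocks rather than a
CFG, it is not the SESE algorithm, and both function names are hypothetical)
that counts "am I worker 0?" guards with and without sharing a guard across
adjacent worker-single blocks:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each entry describes one basic block in order: true if it runs in
// worker-single mode, false if it runs worker-partitioned.

// Number of guards when every worker-single block is conditionalised
// separately.
std::size_t
guards_per_block (const std::vector<bool> &worker_single)
{
  std::size_t guards = 0;
  for (bool single : worker_single)
    if (single)
      guards++;
  return guards;
}

// Number of guards if each maximal run of adjacent worker-single blocks
// shared one guard.
std::size_t
guards_per_run (const std::vector<bool> &worker_single)
{
  std::size_t guards = 0;
  bool in_run = false;
  for (bool single : worker_single)
    {
      if (single && !in_run)
        guards++;
      in_run = single;
    }
  return guards;
}
```

The difference between the two counts is the number of redundant
tests/branches in this simplified model.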

I'm not sure how often that happens in practice. We don't need to
handle vector-single mode for GCN, which possibly means that the
SESE-finding code's ability to skip entire inner loop nests (IIRC!) is
unnecessary.

So yes, that code could probably be dropped for og11 too, though
perhaps we should try to evaluate if it would still be useful first.

> Relatedly, a nontrivial amount of data structures/logic/code did get
> duplicated from the nvptx back end, and modified slightly or
> not-so-slightly (RTL vs. GIMPLE plus certain implementation
> "details").
> 
> We should at least cross reference the two instances, to make sure
> that any changes to one are also propagated to the other.  (I'll take
> care.)

OK, thanks,

> And then, do you (or anyone else, of course) happen to have any clever
> idea about how to avoid the duplication, and somehow combine the RTL
> vs. GIMPLE implementations?  Given that we nowadays may use C++ -- do
> you foresee it feasible to have an abstract base class capturing
> basically the data structures, logic, common code, and then
> RTL-specialized plus GIMPLE-specialized classes inheriting from that?

I suppose one could either use "old-style" inheritance, or maybe do
it with templates. There's probably both costs & benefits when it comes
to maintenance, either way -- having this code shared would mean any
changes need testing for both nvptx & GCN targets, and risks making it
harder to follow. OTOH, like you say, changes would only need to be
made in one place.
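
For what it's worth, the template variant might look roughly like the
following. This is only a sketch: 'parallel_sketch', 'count_parallels' and
the two marker types are hypothetical names, the marker types merely stand
in for 'rtx_insn' and 'gimple', and most of the real 'parallel' fields are
left out.

```cpp
#include <cassert>

// Hypothetical marker types standing in for 'rtx_insn' and 'gimple'.
struct rtl_marker_sketch { int uid; };
struct gimple_marker_sketch { int uid; };

// Representation-independent part of the 'parallel' tree, parameterised on
// the type used for the fork/join markers.
template <typename Marker>
struct parallel_sketch
{
  parallel_sketch *parent;
  parallel_sketch *next;   // Next sibling.
  parallel_sketch *inner;  // First child.
  unsigned mask;
  Marker *forked_marker;
  Marker *join_marker;

  parallel_sketch (parallel_sketch *parent_, unsigned mask_)
    : parent (parent_), next (nullptr), inner (nullptr), mask (mask_),
      forked_marker (nullptr), join_marker (nullptr)
  {
    // Link the new node in as the first child of its parent.
    if (parent)
      {
        next = parent->inner;
        parent->inner = this;
      }
  }

  ~parallel_sketch ()
  {
    delete inner;
    delete next;
  }
};

// Shared logic written once for both representations: count the nodes
// reachable from PAR via child and sibling links.
template <typename Marker>
unsigned
count_parallels (const parallel_sketch<Marker> *par)
{
  if (!par)
    return 0;
  return 1 + count_parallels (par->inner) + count_parallels (par->next);
}
```

Code that differs substantially between RTL and GIMPLE (block splitting,
neutering) would still need separate specialisations, which is where the
maintenance trade-off comes in.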

TBH, I'd spend effort on trying to integrate the SESE code (if it'd be
beneficial) first, before trying to de-duplicate those other bits.

Thanks,

Julian


* Re: [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support)
  2021-07-29  7:49   ` [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
@ 2021-08-06 10:20     ` Julian Brown
  0 siblings, 0 replies; 20+ messages in thread
From: Julian Brown @ 2021-08-06 10:20 UTC (permalink / raw)
  To: Thomas Schwinge
  Cc: gcc-patches, Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

On Thu, 29 Jul 2021 09:49:05 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> >  namespace {
> >  
> > +const pass_data pass_data_oacc_loop_designation =
> > +{
> > +  GIMPLE_PASS, /* type */
> > +  "oaccloops", /* name */
> > +  OPTGROUP_OMP, /* optinfo_flags */
> > +  TV_NONE, /* tv_id */
> > +  PROP_cfg, /* properties_required */
> > +  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
> > +  0, /* properties_destroyed */
> > +  0, /* todo_flags_start */
> > +  TODO_update_ssa | TODO_cleanup_cfg
> > +  | TODO_rebuild_alias, /* todo_flags_finish */
> > +};  
> 
> Do you remember why you added 'TODO_rebuild_alias' here?
> 'pass_oacc_device_lower' doesn't have it, and neither does
> 'pass_oacc_loop_designation' in your original (2017-11-27) internal
> gcn/master branch commit 81ee7ef64cdfa47c01f24c79b8ebd03242c9f3eb
> "Split device-lowering/gimple workers into three passes".  So I
> removed that -- but please do tell if there is a reason to keep it.

I do not :-). My suspicion is that it was leftover debug code, or
possibly the result of a bad merge.

(Another possibility is that the alias information needs updating when
we change the address space of certain entities during gimple rewriting
for worker partitioning -- but this is the wrong place for that, I
think?).

Thanks,

Julian


* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
                     ` (2 preceding siblings ...)
  2021-08-04 13:56   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
@ 2021-08-09 13:21   ` Thomas Schwinge
  2021-08-16 10:34   ` Thomas Schwinge
  2021-08-16 10:34   ` Thomas Schwinge
  5 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-09 13:21 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 7958 bytes --]

Hi!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> This patch implements worker-partitioning support in the middle end,
> by rewriting gimple.  [...]

Thanks!

> The OpenACC execution model requires that code
> can run in either "worker single" mode where only a single worker per
> gang is active, or "worker partitioned" mode, where multiple workers
> per gang are active. This means we need to do something equivalent
> to spawning additional workers when transitioning from worker-single
> to worker-partitioned mode. However, GPUs typically fix the number of
> threads of invoked kernels at launch time, so we need to do something
> with the "extra" threads when they are not wanted.

ACK, and we shall (at some later point in time) see whether this new
middle end implementation can't also replace the nvptx back end one,
adapted as necessary.

> The scheme used is [...]

I haven't verified in detail the algorithmic aspects, or most of their
implementation -- it appears to generally work (see also the
additional testsuite changes I've folded in).


> --- /dev/null
> +++ b/gcc/oacc-neuter-bcast.c

For consistency with other recent OpenACC implementation files, I renamed
this to 'gcc/omp-oacc-neuter-broadcast.cc', and renamed the pass from
'oaccworkers' to 'omp_oacc_neuter_broadcast'.

> +void
> +oacc_do_neutering (void)
> +{

> +  for (unsigned i = 0; i < last_basic_block_for_fn (cfun); i++)
> +    prop_set.quick_push (0);

    [...]/gcc/oacc-neuter-bcast.c: In function ‘void oacc_do_neutering()’:
    [...]/gcc/oacc-neuter-bcast.c:1405:26: error: comparison of integer expressions of different signedness: ‘unsigned int’ and ‘int’ [-Werror=sign-compare]
     1405 |   for (unsigned i = 0; i < last_basic_block_for_fn (cfun); i++)

Fixed.
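
One common way to silence this diagnostic is to give the induction variable
the same signedness as the bound. The following is a sketch under that
assumption, not necessarily the fix that was applied; 'last_block_sketch'
and 'make_prop_set' are hypothetical stand-ins:

```cpp
#include <cassert>
#include <vector>

// Stand-in for last_basic_block_for_fn (cfun), which yields a signed 'int'.
int
last_block_sketch ()
{
  return 5;
}

// Pushes one null entry per basic block, as the quoted loop does.
std::vector<void *>
make_prop_set ()
{
  std::vector<void *> prop_set;
  // Declaring 'i' as 'int' matches the type of the bound; writing
  // 'unsigned i' here is what triggers -Wsign-compare.
  for (int i = 0; i < last_block_sketch (); i++)
    prop_set.push_back (nullptr);
  return prop_set;
}
```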

> --- a/gcc/omp-builtins.def
> +++ b/gcc/omp-builtins.def
> @@ -75,6 +75,8 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER, "GOMP_barrier",
>                 BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_BARRIER_CANCEL, "GOMP_barrier_cancel",
>                 BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
> +DEF_GOACC_BUILTIN (BUILT_IN_GOACC_BARRIER, "GOACC_barrier",
> +                BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT, "GOMP_taskwait",
>                 BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_TASKWAIT_DEPEND, "GOMP_taskwait_depend",
> @@ -412,6 +414,12 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_START, "GOMP_single_copy_start",
>                 BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_SINGLE_COPY_END, "GOMP_single_copy_end",
>                 BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
> +DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_START, "GOACC_single_start",
> +                BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
> +DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_START, "GOACC_single_copy_start",
> +                BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
> +DEF_GOACC_BUILTIN (BUILT_IN_GOACC_SINGLE_COPY_END, "GOACC_single_copy_end",
> +                BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_REGISTER, "GOMP_offload_register_ver",
>                 BT_FN_VOID_UINT_PTR_INT_PTR, ATTR_NOTHROW_LIST)
>  DEF_GOMP_BUILTIN (BUILT_IN_GOMP_OFFLOAD_UNREGISTER,

These should use 'DEF_GOACC_BUILTIN_ONLY' instead of 'DEF_GOACC_BUILTIN':
we're expecting them to be all handled inside the compiler, not resulting
in a libgomp function call.

It looks a bit strange to group the 'GOACC' ones in between the 'GOMP'
ones, so I split them out instead.

> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c

> +int
> +execute_oacc_gimple_workers (void)
> +{
> +  oacc_do_neutering ();
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  return 0;
> +}

Similar to discussion re "[OpenACC] Extract 'pass_oacc_loop_designation'
out of 'pass_oacc_device_lower'": "given 'TODO_cleanup_cfg' as part of
'todo_flags_finish' for new [pass], we no longer need [manual
'calculate_dominance_info (CDI_DOMINATORS)'], as far as I can tell.  So I
removed [that] -- but please do tell if there is a reason to keep it".

The guts of the implementation are in a new, separate file (thanks!), and
I see no reason for 'execute_oacc_gimple_workers' plus the following
bits:

> +const pass_data pass_data_oacc_gimple_workers =
> +{
> +  GIMPLE_PASS, /* type */
> +  "oaccworkers", /* name */
> +  OPTGROUP_OMP, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
> +};
> +
> +class pass_oacc_gimple_workers : public gimple_opt_pass
> +{
> +public:
> +  pass_oacc_gimple_workers (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_oacc_gimple_workers, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +  {
> +    return flag_openacc && targetm.goacc.worker_partitioning;
> +  };
> +
> +  virtual unsigned int execute (function *)
> +    {
> +      return execute_oacc_gimple_workers ();
> +    }
> +
> +}; // class pass_oacc_gimple_workers

> +gimple_opt_pass *
> +make_pass_oacc_gimple_workers (gcc::context *ctxt)
> +{
> +  return new pass_oacc_gimple_workers (ctxt);
> +}

... to reside in 'gcc/omp-offload.c', so I moved all that into the new
file, too.  The relationship with related OpenACC passes remains apparent
via 'gcc/passes.def'.

> --- a/gcc/omp-offload.h
> +++ b/gcc/omp-offload.h

> +extern int oacc_fn_attrib_level (tree attr);

Removed, already declared elsewhere in that file.

> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -1742,6 +1742,19 @@ way.",
>  tree, (tree var, int level),
>  NULL)
>
> +DEFHOOK
> +(create_propagation_record,
> +"Create a record used to propagate local-variable state from an active\n\
> +worker to other workers.  A possible implementation might adjust the type\n\
> +of REC to place the new variable in shared GPU memory.",
> +tree, (tree rec, bool sender, const char *name),
> +default_goacc_create_propagation_record)

To make call sites of 'goacc.create_propagation_record' easier to
understand on a cursory reading, I renamed the hook to
'create_worker_broadcast_record'.  It remains to be seen (later) whether
it makes sense for (nvptx) vector state propagation to use this interface
(then to be renamed) or a new additional one.

> +DEFHOOKPOD
> +(worker_partitioning,
> +"Use gimple transformation for worker neutering/broadcasting.",
> +bool, false)
> +
>  HOOK_VECTOR_END (goacc)

As (at least currently) it wouldn't be possible to have
'targetm.goacc.worker_partitioning' set without
'targetm.goacc.create_worker_broadcast_record' being different from its
default implementation, that default would effectively be
'gcc_unreachable' dead code.  I therefore removed both the default
implementation and the 'worker_partitioning' flag, and instead now
"Presence of [targetm.goacc.create_worker_broadcast_record] indicates
that middle end neutering/broadcasting be used".
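
Reduced to its essence, the pattern is "a null hook pointer means the
feature is off".  A minimal sketch, in which the cut-down hook vector, the
simplified hook signature, and all names ending in '_sketch' are
assumptions for illustration rather than GCC's actual declarations:

```cpp
#include <cassert>

// Hypothetical, cut-down model of the 'goacc' hook vector.
struct goacc_hooks_sketch
{
  // Null means the target does not provide the hook.
  int (*create_worker_broadcast_record) (int record_type, bool sender);
};

// A target-provided implementation (placeholder body).
int
example_create_record_sketch (int record_type, bool sender)
{
  return sender ? record_type : -record_type;
}

// Gate for the middle-end pass: enabled exactly when the hook is present,
// so no separate boolean flag is needed.
bool
neuter_broadcast_gate_sketch (const goacc_hooks_sketch &hooks,
                              bool flag_openacc)
{
  return flag_openacc && hooks.create_worker_broadcast_record != nullptr;
}
```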

For consistency, I moved the relevant GCN back end adjustments
('gcn_goacc_adjust_propagation_record' ->
'gcn_goacc_create_worker_broadcast_record') from the forthcoming
"amdgcn: Enable OpenACC worker partitioning for AMD GCN" changes into
this commit here.

Pushed "openacc: Middle-end worker-partitioning support" to master branch
in commit e2a58ed6dc5293602d0d168475109caa81ad0f0d, see attached.


Regards
 Thomas



[-- Attachment #2: 0001-openacc-Middle-end-worker-partitioning-support.patch --]
[-- Type: text/x-diff, Size: 57044 bytes --]

From e2a58ed6dc5293602d0d168475109caa81ad0f0d Mon Sep 17 00:00:00 2001
From: Julian Brown <julian@codesourcery.com>
Date: Tue, 2 Mar 2021 04:20:11 -0800
Subject: [PATCH] openacc: Middle-end worker-partitioning support

This patch implements worker-partitioning support in the middle end,
by rewriting gimple. The OpenACC execution model requires that code
can run in either "worker single" mode where only a single worker per
gang is active, or "worker partitioned" mode, where multiple workers
per gang are active. This means we need to do something equivalent
to spawning additional workers when transitioning from worker-single
to worker-partitioned mode. However, GPUs typically fix the number of
threads of invoked kernels at launch time, so we need to do something
with the "extra" threads when they are not wanted.

The scheme used is to conditionalise each basic block that executes
in "worker single" mode for worker 0 only. Conditional branches
are handled specially so "idle" (non-0) workers follow along with
worker 0. On transitioning to "worker partitioned" mode, any variables
modified by worker 0 are propagated to the other workers via GPU shared
memory. Special care is taken for routine calls, writes through pointers,
and so forth, as follows:

  - There are two types of function calls to consider in worker-single
    mode: "normal" calls to maths library routines, etc. are called from
    worker 0 only. OpenACC routines may contain worker-partitioned loops
    themselves, so are called from all workers, including "idle" ones.

  - SSA names set in worker-single mode, but used in worker-partitioned
    mode, are copied to shared memory in worker 0. Other workers retrieve
    the value from the appropriate shared-memory location after a barrier,
    and new phi nodes are introduced at the convergence point to resolve
    the worker 0/other worker copies of the value.

  - Local scalar variables (on the stack) also need special handling. We
    broadcast any variables that are written in the current worker-single
    block, and that are read in any worker-partitioned block.  (This is
    believed to be safe, and is flow-insensitive to ease analysis.)

  - Local aggregates (arrays and composites) on the stack are *not*
    broadcast. Instead we force gimple stmts modifying elements/fields of
    local aggregates into fully-partitioned mode. The RHS of the
    assignment is a scalar, and is thus subject to broadcasting as above.

  - Writes through pointers may affect any local variable that has
    its address taken. We use points-to analysis to determine the set
    of potentially-affected variables for a given pointer indirection.
    We broadcast any such variable which is used in worker-partitioned
    mode, on a per-block basis for any block containing a write through
    a pointer.
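
The basic neuter-then-broadcast idea for a single scalar can be simulated
on the host as follows.  This is a sequential toy model under stated
assumptions: 'simulate_neuter_broadcast' is a hypothetical name, a plain
'int' stands in for the record in GPU shared memory, and loop ordering
stands in for the barrier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulates one worker-single block followed by the entry to a
// worker-partitioned block.  Returns each worker's private copy of 'x' as
// seen on entry to the partitioned block.
std::vector<int>
simulate_neuter_broadcast (std::size_t n_workers, int initial_x)
{
  std::vector<int> private_x (n_workers, initial_x);
  int shared_x = 0;  // Stand-in for the record in GPU shared memory.

  // Worker-single block: "neutered" so that only worker 0 runs the body.
  for (std::size_t worker = 0; worker < n_workers; worker++)
    if (worker == 0)
      {
        private_x[worker] += 42;       // The block's side effect.
        shared_x = private_x[worker];  // Sender: copy out to shared memory.
      }

  // A barrier would go here; the sequential loop order provides it.

  // Receivers: the previously idle workers pick up worker 0's value.
  for (std::size_t worker = 1; worker < n_workers; worker++)
    private_x[worker] = shared_x;

  return private_x;
}
```

After the simulated broadcast every worker holds the value computed by
worker 0, which is the property the real transformation has to establish
before worker-partitioned code runs.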

Some slides about the implementation (from 2018) are available at:

  https://jtb20.github.io/gcnworkers.pdf

	gcc/
	* Makefile.in (OBJS): Add omp-oacc-neuter-broadcast.o.
	* doc/tm.texi.in (TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD):
	Add documentation hook.
	* doc/tm.texi: Regenerate.
	* omp-oacc-neuter-broadcast.cc: New file.
	* omp-builtins.def (BUILT_IN_GOACC_BARRIER)
	(BUILT_IN_GOACC_SINGLE_START, BUILT_IN_GOACC_SINGLE_COPY_START)
	(BUILT_IN_GOACC_SINGLE_COPY_END): New builtins.
	* passes.def (pass_omp_oacc_neuter_broadcast): Add pass.
	* target.def (goacc.create_worker_broadcast_record): Add target
	hook.
	* tree-pass.h (make_pass_omp_oacc_neuter_broadcast): Add
	prototype.
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_propagation_record):
	Rename prototype to...
	(gcn_goacc_create_worker_broadcast_record): ... this.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_propagation_record): Rename
	function to...
	(gcn_goacc_create_worker_broadcast_record): ... this.
	* config/gcn/gcn.c (TARGET_GOACC_ADJUST_PROPAGATION_RECORD):
	Rename to...
	(TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD): ... this.

Co-Authored-By: Nathan Sidwell <nathan@codesourcery.com> (via 'gcc/config/nvptx/nvptx.c' master)
Co-Authored-By: Kwok Cheung Yeung <kcy@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
---
 gcc/Makefile.in                  |    1 +
 gcc/config/gcn/gcn-protos.h      |    5 +-
 gcc/config/gcn/gcn-tree.c        |   58 +-
 gcc/config/gcn/gcn.c             |    6 +-
 gcc/doc/tm.texi                  |    9 +
 gcc/doc/tm.texi.in               |    2 +
 gcc/omp-builtins.def             |    9 +
 gcc/omp-oacc-neuter-broadcast.cc | 1515 ++++++++++++++++++++++++++++++
 gcc/passes.def                   |    1 +
 gcc/target.def                   |   11 +
 gcc/tree-pass.h                  |    1 +
 11 files changed, 1584 insertions(+), 34 deletions(-)
 create mode 100644 gcc/omp-oacc-neuter-broadcast.cc

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 8baa3b76601..6653e9e2142 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1513,6 +1513,7 @@ OBJS = \
 	omp-general.o \
 	omp-low.o \
 	omp-oacc-kernels-decompose.o \
+	omp-oacc-neuter-broadcast.o \
 	omp-simd-clone.o \
 	opt-problem.o \
 	optabs.o \
diff --git a/gcc/config/gcn/gcn-protos.h b/gcc/config/gcn/gcn-protos.h
index 8bd0b434a84..5d62a845bec 100644
--- a/gcc/config/gcn/gcn-protos.h
+++ b/gcc/config/gcn/gcn-protos.h
@@ -38,9 +38,10 @@ extern rtx gcn_full_exec ();
 extern rtx gcn_full_exec_reg ();
 extern rtx gcn_gen_undef (machine_mode);
 extern bool gcn_global_address_p (rtx);
-extern tree gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
-						 const char *name);
 extern tree gcn_goacc_adjust_private_decl (location_t, tree var, int level);
+extern tree gcn_goacc_create_worker_broadcast_record (tree record_type,
+						      bool sender,
+						      const char *name);
 extern void gcn_goacc_reduction (gcall *call);
 extern bool gcn_hard_regno_rename_ok (unsigned int from_reg,
 				      unsigned int to_reg);
diff --git a/gcc/config/gcn/gcn-tree.c b/gcc/config/gcn/gcn-tree.c
index 1eb8882d4bf..f722d2d3c4e 100644
--- a/gcc/config/gcn/gcn-tree.c
+++ b/gcc/config/gcn/gcn-tree.c
@@ -548,35 +548,6 @@ gcn_goacc_reduction (gcall *call)
     }
 }
 
-/* Implement TARGET_GOACC_ADJUST_PROPAGATION_RECORD.
- 
-   Tweak (worker) propagation record, e.g. to put it in shared memory.  */
-
-tree
-gcn_goacc_adjust_propagation_record (tree record_type, bool sender,
-				     const char *name)
-{
-  tree type = record_type;
-
-  TYPE_ADDR_SPACE (type) = ADDR_SPACE_LDS;
-
-  if (!sender)
-    type = build_pointer_type (type);
-
-  tree decl = create_tmp_var_raw (type, name);
-
-  if (sender)
-    {
-      DECL_CONTEXT (decl) = NULL_TREE;
-      TREE_STATIC (decl) = 1;
-    }
-
-  if (sender)
-    varpool_node::finalize_decl (decl);
-
-  return decl;
-}
-
 tree
 gcn_goacc_adjust_private_decl (location_t, tree var, int level)
 {
@@ -604,4 +575,33 @@ gcn_goacc_adjust_private_decl (location_t, tree var, int level)
   return var;
 }
 
+/* Implement TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD.
+
+   Create OpenACC worker state propagation record in shared memory.  */
+
+tree
+gcn_goacc_create_worker_broadcast_record (tree record_type, bool sender,
+					  const char *name)
+{
+  tree type = record_type;
+
+  TYPE_ADDR_SPACE (type) = ADDR_SPACE_LDS;
+
+  if (!sender)
+    type = build_pointer_type (type);
+
+  tree decl = create_tmp_var_raw (type, name);
+
+  if (sender)
+    {
+      DECL_CONTEXT (decl) = NULL_TREE;
+      TREE_STATIC (decl) = 1;
+    }
+
+  if (sender)
+    varpool_node::finalize_decl (decl);
+
+  return decl;
+}
+
 /* }}}  */
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index d25c4e54e16..87af5d18f42 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -6513,11 +6513,11 @@ gcn_dwarf_register_span (rtx rtl)
 #define TARGET_GIMPLIFY_VA_ARG_EXPR gcn_gimplify_va_arg_expr
 #undef TARGET_OMP_DEVICE_KIND_ARCH_ISA
 #define TARGET_OMP_DEVICE_KIND_ARCH_ISA gcn_omp_device_kind_arch_isa
-#undef  TARGET_GOACC_ADJUST_PROPAGATION_RECORD
-#define TARGET_GOACC_ADJUST_PROPAGATION_RECORD \
-  gcn_goacc_adjust_propagation_record
 #undef  TARGET_GOACC_ADJUST_PRIVATE_DECL
 #define TARGET_GOACC_ADJUST_PRIVATE_DECL gcn_goacc_adjust_private_decl
+#undef  TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD
+#define TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD \
+  gcn_goacc_create_worker_broadcast_record
 #undef  TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN gcn_fork_join
 #undef  TARGET_GOACC_REDUCTION
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index cb015283237..a30fdcbbf3d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6409,6 +6409,15 @@ private variables at OpenACC device-lowering time using the
 @code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook.
 @end deftypefn
 
+@deftypefn {Target Hook} tree TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD (tree @var{rec}, bool @var{sender}, const char *@var{name})
+Create a record used to propagate local-variable state from an active
+worker to other workers.  A possible implementation might adjust the type
+of REC to place the new variable in shared GPU memory.
+
+Presence of this target hook indicates that middle end neutering/broadcasting
+be used.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 4a522ae7e2e..611fc500ac8 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4223,6 +4223,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_EXPAND_VAR_DECL
 
+@hook TARGET_GOACC_CREATE_WORKER_BROADCAST_RECORD
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 4a7e7badd7e..05b555c7fa0 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -59,6 +59,15 @@ DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_PARLEVEL_ID, "goacc_parlevel_id",
 DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_PARLEVEL_SIZE, "goacc_parlevel_size",
 			BT_FN_INT_INT, ATTR_NOTHROW_LEAF_LIST)
 
+DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_BARRIER, "GOACC_barrier",
+			BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_SINGLE_START, "GOACC_single_start",
+			BT_FN_BOOL, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_SINGLE_COPY_START, "GOACC_single_copy_start",
+			BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN_ONLY (BUILT_IN_GOACC_SINGLE_COPY_END, "GOACC_single_copy_end",
+			BT_FN_VOID_PTR, ATTR_NOTHROW_LEAF_LIST)
+
 DEF_GOMP_BUILTIN (BUILT_IN_OMP_GET_THREAD_NUM, "omp_get_thread_num",
 		  BT_FN_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_OMP_GET_NUM_THREADS, "omp_get_num_threads",
diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
new file mode 100644
index 00000000000..0f6ba885c6c
--- /dev/null
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -0,0 +1,1515 @@
+/* OpenACC worker partitioning via middle end neutering/broadcasting scheme
+
+   Copyright (C) 2015-2021 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "cgraph.h"
+#include "pretty-print.h"
+#include "fold-const.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimple-walk.h"
+#include "tree-inline.h"
+#include "langhooks.h"
+#include "omp-general.h"
+#include "omp-low.h"
+#include "gimple-pretty-print.h"
+#include "cfghooks.h"
+#include "insn-config.h"
+#include "recog.h"
+#include "internal-fn.h"
+#include "bitmap.h"
+#include "tree-nested.h"
+#include "stor-layout.h"
+#include "tree-ssa-threadupdate.h"
+#include "tree-into-ssa.h"
+#include "splay-tree.h"
+#include "target.h"
+#include "cfgloop.h"
+#include "tree-cfg.h"
+#include "omp-offload.h"
+#include "attribs.h"
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  */
+
+struct parallel_g
+{
+  /* Parent parallel.  */
+  parallel_g *parent;
+
+  /* Next sibling parallel.  */
+  parallel_g *next;
+
+  /* First child parallel.  */
+  parallel_g *inner;
+
+  /* Partitioning mask of the parallel.  */
+  unsigned mask;
+
+  /* Partitioning used within inner parallels.  */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after
+     the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  gimple *forked_stmt;
+  gimple *join_stmt;
+
+  gimple *fork_stmt;
+  gimple *joining_stmt;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+  tree record_type;
+  tree sender_decl;
+  tree receiver_decl;
+
+public:
+  parallel_g (parallel_g *parent, unsigned mask);
+  ~parallel_g ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel_g::parallel_g (parallel_g *parent_, unsigned mask_)
+  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_stmt = join_stmt = NULL;
+  fork_stmt = joining_stmt = NULL;
+
+  record_type = NULL_TREE;
+  sender_decl = NULL_TREE;
+  receiver_decl = NULL_TREE;
+
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel_g::~parallel_g ()
+{
+  delete inner;
+  delete next;
+}
+
+static bool
+local_var_based_p (tree decl)
+{
+  switch (TREE_CODE (decl))
+    {
+    case VAR_DECL:
+      return !is_global_var (decl);
+
+    case COMPONENT_REF:
+    case BIT_FIELD_REF:
+    case ARRAY_REF:
+      return local_var_based_p (TREE_OPERAND (decl, 0));
+
+    default:
+      return false;
+    }
+}
+
+/* Map of basic blocks to gimple stmts.  */
+typedef hash_map<basic_block, gimple *> bb_stmt_map_t;
+
+/* Calls to OpenACC routines are made by all workers/wavefronts/warps, since
+   the routine likely contains partitioned loops (else will do its own
+   neutering and variable propagation).  Return TRUE if a function call CALL
+   should be made in (worker) single mode rather than in redundant
+   mode.  */
+
+static bool
+omp_sese_active_worker_call (gcall *call)
+{
+#define GOMP_DIM_SEQ GOMP_DIM_MAX
+  tree fndecl = gimple_call_fndecl (call);
+
+  if (!fndecl)
+    return true;
+
+  tree attrs = oacc_get_fn_attrib (fndecl);
+
+  if (!attrs)
+    return true;
+
+  int level = oacc_fn_attrib_level (attrs);
+
+  /* Neither regular functions nor "seq" routines should be run by all threads
+     in worker-single mode.  */
+  return level == -1 || level == GOMP_DIM_SEQ;
+#undef GOMP_DIM_SEQ
+}
+
+/* Split basic blocks such that each forked and join marker is at
+   the start of its basic block.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   statements, conditions, switches and certain calls and
+   assignments, which are forced to run in fully-partitioned mode.
+   Populate MAP with head and tail blocks.  We also clear the BB
+   visited flag, which is used when finding partitions.  */
+
+static void
+omp_sese_split_blocks (bb_stmt_map_t *map)
+{
+  auto_vec<gimple *> worklist;
+  basic_block block;
+
+  /* Locate all the statements of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      /* Clear visited flag, for use by parallel locator.  */
+      block->flags &= ~BB_VISITED;
+
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_JOIN)
+		worklist.safe_push (stmt);
+	      else if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  gcc_assert (gsi_one_before_end_p (gsi));
+		  basic_block forked_block = single_succ (block);
+		  gimple_stmt_iterator gsi2 = gsi_start_bb (forked_block);
+
+		  /* We push a NOP as a placeholder for the "forked" stmt.
+		     This is then recognized in omp_sese_find_par.  */
+		  gimple *nop = gimple_build_nop ();
+		  gsi_insert_before (&gsi2, nop, GSI_SAME_STMT);
+
+		  worklist.safe_push (nop);
+		}
+	    }
+	  else if (gimple_code (stmt) == GIMPLE_RETURN
+		   || gimple_code (stmt) == GIMPLE_COND
+		   || gimple_code (stmt) == GIMPLE_SWITCH
+		   || (gimple_code (stmt) == GIMPLE_CALL
+		       && !gimple_call_internal_p (stmt)
+		       && !omp_sese_active_worker_call (as_a <gcall *> (stmt))))
+	    worklist.safe_push (stmt);
+	  else if (is_gimple_assign (stmt))
+	    {
+	      tree lhs = gimple_assign_lhs (stmt);
+
+	      /* Force assignments to components/fields/elements of local
+		 aggregates into fully-partitioned (redundant) mode.  This
+		 avoids having to broadcast the whole aggregate.  The RHS of
+		 the assignment will be propagated using the normal
+		 mechanism.  */
+
+	      switch (TREE_CODE (lhs))
+		{
+		case COMPONENT_REF:
+		case BIT_FIELD_REF:
+		case ARRAY_REF:
+		  {
+		    tree aggr = TREE_OPERAND (lhs, 0);
+
+		    if (local_var_based_p (aggr))
+		      worklist.safe_push (stmt);
+		  }
+		  break;
+
+		default:
+		  ;
+		}
+	    }
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  gimple *stmt;
+
+  for (ix = 0; worklist.iterate (ix, &stmt); ix++)
+    {
+      basic_block block = gimple_bb (stmt);
+
+      if (gimple_code (stmt) == GIMPLE_COND)
+	{
+	  gcond *orig_cond = as_a <gcond *> (stmt);
+	  tree_code code = gimple_expr_code (orig_cond);
+	  tree pred = make_ssa_name (boolean_type_node);
+	  gimple *asgn = gimple_build_assign (pred, code,
+			   gimple_cond_lhs (orig_cond),
+			   gimple_cond_rhs (orig_cond));
+	  gcond *new_cond
+	    = gimple_build_cond (NE_EXPR, pred, boolean_false_node,
+				 gimple_cond_true_label (orig_cond),
+				 gimple_cond_false_label (orig_cond));
+
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_insert_before (&gsi, asgn, GSI_SAME_STMT);
+	  gsi_replace (&gsi, new_cond, true);
+
+	  edge e = split_block (block, asgn);
+	  block = e->dest;
+	  map->get_or_insert (block) = new_cond;
+	}
+      else if ((gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt))
+	       || is_gimple_assign (stmt))
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  edge call = split_block (block, gsi_stmt (gsi));
+
+	  gimple *call_stmt = gsi_stmt (gsi_start_bb (call->dest));
+
+	  edge call_to_ret = split_block (call->dest, call_stmt);
+
+	  map->get_or_insert (call_to_ret->src) = call_stmt;
+	}
+      else
+	{
+	  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	  gsi_prev (&gsi);
+
+	  if (gsi_end_p (gsi))
+	    map->get_or_insert (block) = stmt;
+	  else
+	    {
+	      /* Split block before stmt.  The stmt is in the new block.  */
+	      edge e = split_block (block, gsi_stmt (gsi));
+
+	      block = e->dest;
+	      map->get_or_insert (block) = stmt;
+	    }
+	}
+    }
+}
+
+static const char *
+mask_name (unsigned mask)
+{
+  switch (mask)
+    {
+    case 0: return "gang redundant";
+    case 1: return "gang partitioned";
+    case 2: return "worker partitioned";
+    case 3: return "gang+worker partitioned";
+    case 4: return "vector partitioned";
+    case 5: return "gang+vector partitioned";
+    case 6: return "worker+vector partitioned";
+    case 7: return "fully partitioned";
+    default: return "<illegal>";
+    }
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+omp_sese_dump_pars (parallel_g *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mask %u (%s) head=%d, tail=%d\n",
+	   depth, par->mask, mask_name (par->mask),
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    omp_sese_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    omp_sese_dump_pars (par->next, depth);
+}
+
+/* If BLOCK contains a fork/join marker, process it to create or
+   terminate a loop structure.  Add this block to the current loop,
+   and then walk successor blocks.  */
+
+static parallel_g *
+omp_sese_find_par (bb_stmt_map_t *map, parallel_g *par, basic_block block)
+{
+  if (block->flags & BB_VISITED)
+    return par;
+  block->flags |= BB_VISITED;
+
+  if (gimple **stmtp = map->get (block))
+    {
+      gimple *stmt = *stmtp;
+
+      if (gimple_code (stmt) == GIMPLE_COND
+	  || gimple_code (stmt) == GIMPLE_SWITCH
+	  || gimple_code (stmt) == GIMPLE_RETURN
+	  || (gimple_code (stmt) == GIMPLE_CALL
+	      && !gimple_call_internal_p (stmt))
+	  || is_gimple_assign (stmt))
+	{
+	  /* A single block that is forced to be at the maximum partition
+	     level.  Make a singleton par for it.  */
+	  par = new parallel_g (par, GOMP_DIM_MASK (GOMP_DIM_GANG)
+				   | GOMP_DIM_MASK (GOMP_DIM_WORKER)
+				   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+	  par->forked_block = block;
+	  par->forked_stmt = stmt;
+	  par->blocks.safe_push (block);
+	  par = par->parent;
+	  goto walk_successors;
+	}
+      else if (gimple_nop_p (stmt))
+	{
+	  basic_block pred = single_pred (block);
+	  gcc_assert (pred);
+	  gimple_stmt_iterator gsi = gsi_last_bb (pred);
+	  gimple *final_stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (final_stmt, IFN_UNIQUE))
+	    {
+	      gcall *call = as_a <gcall *> (final_stmt);
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+
+	      if (k == IFN_UNIQUE_OACC_FORK)
+		{
+		  HOST_WIDE_INT dim
+		    = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+		  unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+		  par = new parallel_g (par, mask);
+		  par->forked_block = block;
+		  par->forked_stmt = final_stmt;
+		  par->fork_stmt = stmt;
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+	  enum ifn_unique_kind k = ((enum ifn_unique_kind)
+	    TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+	  if (k == IFN_UNIQUE_OACC_JOIN)
+	    {
+	      HOST_WIDE_INT dim = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+	      unsigned mask = (dim >= 0) ? GOMP_DIM_MASK (dim) : 0;
+
+	      gcc_assert (par->mask == mask);
+	      par->join_block = block;
+	      par->join_stmt = stmt;
+	      par = par->parent;
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+      else
+	gcc_unreachable ();
+    }
+
+  if (par)
+    /* Add this block onto the current loop's list of blocks.  */
+    par->blocks.safe_push (block);
+  else
+    /* This must be the entry block.  Create a NULL parallel.  */
+    par = new parallel_g (0, 0);
+
+walk_successors:
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    omp_sese_find_par (map, par, e->dest);
+
+  return par;
+}
+
+/* DFS walk the CFG looking for fork & join markers.  Construct
+   loop structures as we go.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel_g *
+omp_sese_discover_pars (bb_stmt_map_t *map)
+{
+  basic_block block;
+
+  /* Mark exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+
+  /* And entry block as not.  */
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+
+  parallel_g *par = omp_sese_find_par (map, 0, block);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      omp_sese_dump_pars (par, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  return par;
+}
+
+static void
+populate_single_mode_bitmaps (parallel_g *par, bitmap worker_single,
+			      bitmap vector_single, unsigned outer_mask,
+			      int depth)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  basic_block block;
+
+  for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+    {
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+	bitmap_set_bit (worker_single, block->index);
+
+      if ((mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)) == 0)
+	bitmap_set_bit (vector_single, block->index);
+    }
+
+  if (par->inner)
+    populate_single_mode_bitmaps (par->inner, worker_single, vector_single,
+				  mask, depth + 1);
+  if (par->next)
+    populate_single_mode_bitmaps (par->next, worker_single, vector_single,
+				  outer_mask, depth);
+}
+
+/* A map from SSA names or var decls to record fields.  */
+
+typedef hash_map<tree, tree> field_map_t;
+
+/* For each propagation record type, this is a map from SSA names or var decls
+   to propagate, to the field in the record type that should be used for
+   transmission and reception.  */
+
+typedef hash_map<tree, field_map_t *> record_field_map_t;
+
+static GTY(()) record_field_map_t *field_map;
+
+static void
+install_var_field (tree var, tree record_type)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree name;
+  char tmp[20];
+
+  if (TREE_CODE (var) == SSA_NAME)
+    {
+      name = SSA_NAME_IDENTIFIER (var);
+      if (!name)
+	{
+	  sprintf (tmp, "_%u", (unsigned) SSA_NAME_VERSION (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else if (TREE_CODE (var) == VAR_DECL)
+    {
+      name = DECL_NAME (var);
+      if (!name)
+	{
+	  sprintf (tmp, "D_%u", (unsigned) DECL_UID (var));
+	  name = get_identifier (tmp);
+	}
+    }
+  else
+    gcc_unreachable ();
+
+  gcc_assert (!fields->get (var));
+
+  tree type = TREE_TYPE (var);
+
+  if (POINTER_TYPE_P (type)
+      && TYPE_RESTRICT (type))
+    type = build_qualified_type (type, TYPE_QUALS (type) & ~TYPE_QUAL_RESTRICT);
+
+  tree field = build_decl (BUILTINS_LOCATION, FIELD_DECL, name, type);
+
+  if (TREE_CODE (var) == VAR_DECL && type == TREE_TYPE (var))
+    {
+      SET_DECL_ALIGN (field, DECL_ALIGN (var));
+      DECL_USER_ALIGN (field) = DECL_USER_ALIGN (var);
+      TREE_THIS_VOLATILE (field) = TREE_THIS_VOLATILE (var);
+    }
+  else
+    SET_DECL_ALIGN (field, TYPE_ALIGN (type));
+
+  fields->put (var, field);
+
+  insert_field_into_struct (record_type, field);
+}
+
+/* Sets of SSA_NAMES or VAR_DECLs to propagate.  */
+typedef hash_set<tree> propagation_set;
+
+static void
+find_ssa_names_to_propagate (parallel_g *par, unsigned outer_mask,
+			     bitmap worker_single, bitmap vector_single,
+			     vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_ssa_names_to_propagate (par->inner, mask, worker_single,
+				 vector_single, prop_set);
+  if (par->next)
+    find_ssa_names_to_propagate (par->next, outer_mask, worker_single,
+				 vector_single, prop_set);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gphi_iterator psi = gsi_start_phis (block);
+	       !gsi_end_p (psi); gsi_next (&psi))
+	    {
+	      gphi *phi = psi.phi ();
+	      use_operand_p use;
+	      ssa_op_iter iter;
+
+	      FOR_EACH_PHI_ARG (use, phi, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  if (TREE_CODE (var) != SSA_NAME)
+		    continue;
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      use_operand_p use;
+	      ssa_op_iter iter;
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      FOR_EACH_SSA_USE_OPERAND (use, stmt, iter, SSA_OP_USE)
+		{
+		  tree var = USE_FROM_PTR (use);
+
+		  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+		  if (gimple_nop_p (def_stmt))
+		    continue;
+
+		  basic_block def_bb = gimple_bb (def_stmt);
+
+		  if (bitmap_bit_p (worker_single, def_bb->index))
+		    {
+		      if (!(*prop_set)[def_bb->index])
+			(*prop_set)[def_bb->index] = new propagation_set;
+
+		      propagation_set *ws_prop = (*prop_set)[def_bb->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Callback for walk_gimple_stmt to find RHS VAR_DECLs (uses) in a
+   statement.  */
+
+static tree
+find_partitioned_var_uses_1 (tree *node, int *, void *data)
+{
+  walk_stmt_info *wi = (walk_stmt_info *) data;
+  hash_set<tree> *partitioned_var_uses = (hash_set<tree> *) wi->info;
+
+  if (!wi->is_lhs && VAR_P (*node))
+    partitioned_var_uses->add (*node);
+
+  return NULL_TREE;
+}
+
+static void
+find_partitioned_var_uses (parallel_g *par, unsigned outer_mask,
+			   hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_partitioned_var_uses (par->inner, mask, partitioned_var_uses);
+  if (par->next)
+    find_partitioned_var_uses (par->next, outer_mask, partitioned_var_uses);
+
+  if (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    walk_stmt_info wi;
+	    memset (&wi, 0, sizeof (wi));
+	    wi.info = (void *) partitioned_var_uses;
+	    walk_gimple_stmt (&gsi, NULL, find_partitioned_var_uses_1, &wi);
+	  }
+    }
+}
+
+/* Gang-private variables (typically placed in a GPU's shared memory) do not
+   need to be processed by the worker-propagation mechanism.  Populate the
+   GANG_PRIVATE_VARS set with any such variables found in the current
+   function.  */
+
+static void
+find_gang_private_vars (hash_set<tree> *gang_private_vars)
+{
+  basic_block block;
+
+  FOR_EACH_BB_FN (block, cfun)
+    {
+      for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	   !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (gimple_call_internal_p (stmt, IFN_UNIQUE))
+	    {
+	      enum ifn_unique_kind k = ((enum ifn_unique_kind)
+		TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+	      if (k == IFN_UNIQUE_OACC_PRIVATE)
+		{
+		  HOST_WIDE_INT level
+		    = TREE_INT_CST_LOW (gimple_call_arg (stmt, 2));
+		  if (level != GOMP_DIM_GANG)
+		    continue;
+		  for (unsigned i = 3; i < gimple_call_num_args (stmt); i++)
+		    {
+		      tree arg = gimple_call_arg (stmt, i);
+		      gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
+		      tree decl = TREE_OPERAND (arg, 0);
+		      gang_private_vars->add (decl);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+static void
+find_local_vars_to_propagate (parallel_g *par, unsigned outer_mask,
+			      hash_set<tree> *partitioned_var_uses,
+			      hash_set<tree> *gang_private_vars,
+			      vec<propagation_set *> *prop_set)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if (par->inner)
+    find_local_vars_to_propagate (par->inner, mask, partitioned_var_uses,
+				  gang_private_vars, prop_set);
+  if (par->next)
+    find_local_vars_to_propagate (par->next, outer_mask, partitioned_var_uses,
+				  gang_private_vars, prop_set);
+
+  if (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+    {
+      basic_block block;
+      int ix;
+
+      for (ix = 0; par->blocks.iterate (ix, &block); ix++)
+	{
+	  for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	       !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      tree var;
+	      unsigned i;
+
+	      FOR_EACH_LOCAL_DECL (cfun, i, var)
+		{
+		  if (!VAR_P (var)
+		      || is_global_var (var)
+		      || AGGREGATE_TYPE_P (TREE_TYPE (var))
+		      || !partitioned_var_uses->contains (var)
+		      || gang_private_vars->contains (var))
+		    continue;
+
+		  if (stmt_may_clobber_ref_p (stmt, var))
+		    {
+		      if (dump_file)
+			{
+			  fprintf (dump_file, "bb %u: local variable may be "
+				   "clobbered in %s mode: ", block->index,
+				   mask_name (mask));
+			  print_generic_expr (dump_file, var, TDF_SLIM);
+			  fprintf (dump_file, "\n");
+			}
+
+		      if (!(*prop_set)[block->index])
+			(*prop_set)[block->index] = new propagation_set;
+
+		      propagation_set *ws_prop
+			= (*prop_set)[block->index];
+
+		      ws_prop->add (var);
+		    }
+		}
+	    }
+	}
+    }
+}
+
+/* Transform basic blocks FROM, TO (which may be the same block) into:
+   if (GOACC_single_start ())
+     BLOCK;
+   GOACC_barrier ();
+			      \  |  /
+			      +----+
+			      |    |        (new) predicate block
+			      +----+--
+   \  |  /   \  |  /	        |t    \
+   +----+    +----+	      +----+  |
+   |	|    |    |	===>  |    |  | f   (old) from block
+   +----+    +----+	      +----+  |
+     |       t/  \f	        |    /
+			      +----+/
+  (split  (split before       |    |        skip block
+  at end)   condition)	      +----+
+			      t/  \f
+*/
+
+static void
+worker_single_simple (basic_block from, basic_block to,
+		      hash_set<tree> *def_escapes_block)
+{
+  gimple *call, *cond;
+  tree lhs, decl;
+  basic_block skip_block;
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    {
+      gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_COND);
+      gsi_prev (&gsi);
+    }
+  edge e = split_block (to, gsi_stmt (gsi));
+  skip_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_START);
+  lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+  call = gimple_build_call (decl, 0);
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  cond = gimple_build_cond (EQ_EXPR, lhs,
+			    fold_convert_loc (UNKNOWN_LOCATION,
+					      TREE_TYPE (lhs),
+					      boolean_true_node),
+			    NULL_TREE, NULL_TREE);
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+  update_stmt (cond);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  edge ef = make_edge (from, skip_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  basic_block neutered = split_edge (ef);
+  gimple_stmt_iterator neut_gsi = gsi_last_bb (neutered);
+
+  for (gsi = gsi_start_bb (et->dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      ssa_op_iter iter;
+      tree var;
+
+      FOR_EACH_SSA_TREE_OPERAND (var, stmt, iter, SSA_OP_DEF)
+	{
+	  if (def_escapes_block->contains (var))
+	    {
+	      gphi *join_phi = create_phi_node (NULL_TREE, skip_block);
+	      create_new_def_for (var, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, var, e, UNKNOWN_LOCATION);
+
+	      tree neutered_def = copy_ssa_name (var, NULL);
+	      /* We really want "don't care" or some value representing
+		 undefined here, but optimizers will probably get rid of the
+		 zero-assignments anyway.  */
+	      gassign *zero = gimple_build_assign (neutered_def,
+				build_zero_cst (TREE_TYPE (neutered_def)));
+
+	      gsi_insert_after (&neut_gsi, zero, GSI_CONTINUE_LINKING);
+	      update_stmt (zero);
+
+	      add_phi_arg (join_phi, neutered_def, single_succ_edge (neutered),
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+	}
+    }
+
+  gsi = gsi_start_bb (skip_block);
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gsi_insert_before (&gsi, acc_bar, GSI_SAME_STMT);
+  update_stmt (acc_bar);
+}
+
+/* This is a copied and renamed omp-low.c:omp_build_component_ref.  */
+
+static tree
+oacc_build_component_ref (tree obj, tree field)
+{
+  tree field_type = TREE_TYPE (field);
+  tree obj_type = TREE_TYPE (obj);
+  if (!ADDR_SPACE_GENERIC_P (TYPE_ADDR_SPACE (obj_type)))
+    field_type = build_qualified_type
+			(field_type,
+			 KEEP_QUAL_ADDR_SPACE (TYPE_QUALS (obj_type)));
+
+  tree ret = build3 (COMPONENT_REF, field_type, obj, field, NULL);
+  if (TREE_THIS_VOLATILE (field))
+    TREE_THIS_VOLATILE (ret) |= 1;
+  if (TREE_READONLY (field))
+    TREE_READONLY (ret) |= 1;
+  return ret;
+}
+
+static tree
+build_receiver_ref (tree record_type, tree var, tree receiver_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree x = build_simple_mem_ref (receiver_decl);
+  tree field = *fields->get (var);
+  TREE_THIS_NOTRAP (x) = 1;
+  x = oacc_build_component_ref (x, field);
+  return x;
+}
+
+static tree
+build_sender_ref (tree record_type, tree var, tree sender_decl)
+{
+  field_map_t *fields = *field_map->get (record_type);
+  tree field = *fields->get (var);
+  return oacc_build_component_ref (sender_decl, field);
+}
+
+static int
+sort_by_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+
+  if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return SSA_NAME_VERSION (t1) - SSA_NAME_VERSION (t2);
+  else if (TREE_CODE (t1) == SSA_NAME && TREE_CODE (t2) != SSA_NAME)
+    return -1;
+  else if (TREE_CODE (t1) != SSA_NAME && TREE_CODE (t2) == SSA_NAME)
+    return 1;
+  else
+    return DECL_UID (t1) - DECL_UID (t2);
+}
+
+static int
+sort_by_size_then_ssa_version_or_uid (const void *p1, const void *p2)
+{
+  const tree t1 = *(const tree *)p1;
+  const tree t2 = *(const tree *)p2;
+  unsigned HOST_WIDE_INT s1 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t1)));
+  unsigned HOST_WIDE_INT s2 = tree_to_uhwi (TYPE_SIZE (TREE_TYPE (t2)));
+  if (s1 != s2)
+    return s2 - s1;
+  else
+    return sort_by_ssa_version_or_uid (p1, p2);
+}
+
+static void
+worker_single_copy (basic_block from, basic_block to,
+		    hash_set<tree> *def_escapes_block,
+		    hash_set<tree> *worker_partitioned_uses,
+		    tree record_type)
+{
+  /* If we only have virtual defs, we'll have no record type, but we still want
+     to emit single_copy_start and (particularly) single_copy_end to act as
+     a vdef source on the neutered edge representing memory writes on the
+     non-neutered edge.  */
+  if (!record_type)
+    record_type = char_type_node;
+
+  tree sender_decl
+    = targetm.goacc.create_worker_broadcast_record (record_type, true,
+						    ".oacc_worker_o");
+  tree receiver_decl
+    = targetm.goacc.create_worker_broadcast_record (record_type, false,
+						    ".oacc_worker_i");
+
+  gimple_stmt_iterator gsi = gsi_last_bb (to);
+  if (EDGE_COUNT (to->succs) > 1)
+    gsi_prev (&gsi);
+  edge e = split_block (to, gsi_stmt (gsi));
+  basic_block barrier_block = e->dest;
+
+  gimple_stmt_iterator start = gsi_after_labels (from);
+
+  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_START);
+
+  tree lhs = create_tmp_var (TREE_TYPE (TREE_TYPE (decl)));
+
+  gimple *call = gimple_build_call (decl, 1,
+				    build_fold_addr_expr (sender_decl));
+  gimple_call_set_lhs (call, lhs);
+  gsi_insert_before (&start, call, GSI_NEW_STMT);
+  update_stmt (call);
+
+  tree conv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+
+  gimple *conv = gimple_build_assign (conv_tmp,
+				      fold_convert (TREE_TYPE (receiver_decl),
+						    lhs));
+  update_stmt (conv);
+  gsi_insert_after (&start, conv, GSI_NEW_STMT);
+  gimple *asgn = gimple_build_assign (receiver_decl, conv_tmp);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  tree zero_ptr = build_int_cst (TREE_TYPE (receiver_decl), 0);
+
+  tree recv_tmp = make_ssa_name (TREE_TYPE (receiver_decl));
+  asgn = gimple_build_assign (recv_tmp, receiver_decl);
+  gsi_insert_after (&start, asgn, GSI_NEW_STMT);
+  update_stmt (asgn);
+
+  gimple *cond = gimple_build_cond (EQ_EXPR, recv_tmp, zero_ptr, NULL_TREE,
+				    NULL_TREE);
+  update_stmt (cond);
+
+  gsi_insert_after (&start, cond, GSI_NEW_STMT);
+
+  edge et = split_block (from, cond);
+  et->flags &= ~EDGE_FALLTHRU;
+  et->flags |= EDGE_TRUE_VALUE;
+  /* Make the active worker the more probable path so we prefer fallthrough
+     (letting the idle workers jump around more).  */
+  et->probability = profile_probability::likely ();
+
+  basic_block body = et->dest;
+
+  edge ef = make_edge (from, barrier_block, EDGE_FALSE_VALUE);
+  ef->probability = et->probability.invert ();
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+  gimple *acc_bar = gimple_build_call (decl, 0);
+
+  gimple_stmt_iterator bar_gsi = gsi_start_bb (barrier_block);
+  gsi_insert_before (&bar_gsi, acc_bar, GSI_NEW_STMT);
+
+  cond = gimple_build_cond (NE_EXPR, recv_tmp, zero_ptr, NULL_TREE, NULL_TREE);
+  gsi_insert_after (&bar_gsi, cond, GSI_NEW_STMT);
+
+  edge et2 = split_block (barrier_block, cond);
+  et2->flags &= ~EDGE_FALLTHRU;
+  et2->flags |= EDGE_TRUE_VALUE;
+  et2->probability = profile_probability::unlikely ();
+
+  basic_block exit_block = et2->dest;
+
+  basic_block copyout_block = split_edge (et2);
+  edge ef2 = make_edge (barrier_block, exit_block, EDGE_FALSE_VALUE);
+  ef2->probability = et2->probability.invert ();
+
+  gimple_stmt_iterator copyout_gsi = gsi_start_bb (copyout_block);
+
+  edge copyout_to_exit = single_succ_edge (copyout_block);
+
+  gimple_seq sender_seq = NULL;
+
+  /* Make sure we iterate over definitions in a stable order.  */
+  auto_vec<tree> escape_vec (def_escapes_block->elements ());
+  for (hash_set<tree>::iterator it = def_escapes_block->begin ();
+       it != def_escapes_block->end (); ++it)
+    escape_vec.quick_push (*it);
+  escape_vec.qsort (sort_by_ssa_version_or_uid);
+
+  for (unsigned i = 0; i < escape_vec.length (); i++)
+    {
+      tree var = escape_vec[i];
+
+      if (TREE_CODE (var) == SSA_NAME && SSA_NAME_IS_VIRTUAL_OPERAND (var))
+	continue;
+
+      tree barrier_def = 0;
+
+      if (TREE_CODE (var) == SSA_NAME)
+	{
+	  gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	  if (gimple_nop_p (def_stmt))
+	    continue;
+
+	  /* The barrier phi takes one result from the actual work of the
+	     block we're neutering, and the other result is constant zero of
+	     the same type.  */
+
+	  gphi *barrier_phi = create_phi_node (NULL_TREE, barrier_block);
+	  barrier_def = create_new_def_for (var, barrier_phi,
+			  gimple_phi_result_ptr (barrier_phi));
+
+	  add_phi_arg (barrier_phi, var, e, UNKNOWN_LOCATION);
+	  add_phi_arg (barrier_phi, build_zero_cst (TREE_TYPE (var)), ef,
+		       UNKNOWN_LOCATION);
+
+	  update_stmt (barrier_phi);
+	}
+      else
+	gcc_assert (TREE_CODE (var) == VAR_DECL);
+
+      /* If we had no record type, we will have no fields map.  */
+      field_map_t **fields_p = field_map->get (record_type);
+      field_map_t *fields = fields_p ? *fields_p : NULL;
+
+      if (worker_partitioned_uses->contains (var)
+	  && fields
+	  && fields->get (var))
+	{
+	  tree neutered_def = make_ssa_name (TREE_TYPE (var));
+
+	  /* Receive definition from shared memory block.  */
+
+	  tree receiver_ref = build_receiver_ref (record_type, var,
+						  receiver_decl);
+	  gassign *recv = gimple_build_assign (neutered_def,
+					       receiver_ref);
+	  gsi_insert_after (&copyout_gsi, recv, GSI_CONTINUE_LINKING);
+	  update_stmt (recv);
+
+	  if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      /* If it's a VAR_DECL, we only copied to an SSA temporary.  Copy
+		 to the final location now.  */
+	      gassign *asgn = gimple_build_assign (var, neutered_def);
+	      gsi_insert_after (&copyout_gsi, asgn, GSI_CONTINUE_LINKING);
+	      update_stmt (asgn);
+	    }
+	  else
+	    {
+	      /* If it's an SSA name, create a new phi at the join node to
+		 represent either the output from the active worker (the
+		 barrier) or the inactive workers (the copyout block).  */
+	      gphi *join_phi = create_phi_node (NULL_TREE, exit_block);
+	      create_new_def_for (barrier_def, join_phi,
+				  gimple_phi_result_ptr (join_phi));
+	      add_phi_arg (join_phi, barrier_def, ef2, UNKNOWN_LOCATION);
+	      add_phi_arg (join_phi, neutered_def, copyout_to_exit,
+			   UNKNOWN_LOCATION);
+	      update_stmt (join_phi);
+	    }
+
+	  /* Send definition to shared memory block.  */
+
+	  tree sender_ref = build_sender_ref (record_type, var, sender_decl);
+
+	  if (TREE_CODE (var) == SSA_NAME)
+	    {
+	      gassign *send = gimple_build_assign (sender_ref, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else if (TREE_CODE (var) == VAR_DECL)
+	    {
+	      tree tmp = make_ssa_name (TREE_TYPE (var));
+	      gassign *send = gimple_build_assign (tmp, var);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	      send = gimple_build_assign (sender_ref, tmp);
+	      gimple_seq_add_stmt (&sender_seq, send);
+	      update_stmt (send);
+	    }
+	  else
+	    gcc_unreachable ();
+	}
+    }
+
+  /* It's possible for the ET->DEST block (the work done by the active thread)
+     to finish with a control-flow insn, e.g. a UNIQUE function call.  Split
+     the block and add SENDER_SEQ in the latter part to avoid having control
+     flow in the middle of a BB.  */
+
+  decl = builtin_decl_explicit (BUILT_IN_GOACC_SINGLE_COPY_END);
+  call = gimple_build_call (decl, 1, build_fold_addr_expr (sender_decl));
+  gimple_seq_add_stmt (&sender_seq, call);
+
+  gsi = gsi_last_bb (body);
+  gimple *last = gsi_stmt (gsi);
+  basic_block sender_block = split_block (body, last)->dest;
+  gsi = gsi_last_bb (sender_block);
+  gsi_insert_seq_after (&gsi, sender_seq, GSI_CONTINUE_LINKING);
+}
+
+static void
+neuter_worker_single (parallel_g *par, unsigned outer_mask,
+		      bitmap worker_single, bitmap vector_single,
+		      vec<propagation_set *> *prop_set,
+		      hash_set<tree> *partitioned_var_uses)
+{
+  unsigned mask = outer_mask | par->mask;
+
+  if ((mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	{
+	  bool has_defs = false;
+	  hash_set<tree> def_escapes_block;
+	  hash_set<tree> worker_partitioned_uses;
+	  unsigned j;
+	  tree var;
+
+	  FOR_EACH_SSA_NAME (j, var, cfun)
+	    {
+	      if (SSA_NAME_IS_VIRTUAL_OPERAND (var))
+		{
+		  has_defs = true;
+		  continue;
+		}
+
+	      gimple *def_stmt = SSA_NAME_DEF_STMT (var);
+
+	      if (gimple_nop_p (def_stmt))
+		continue;
+
+	      if (gimple_bb (def_stmt)->index != block->index)
+		continue;
+
+	      gimple *use_stmt;
+	      imm_use_iterator use_iter;
+	      bool uses_outside_block = false;
+	      bool worker_partitioned_use = false;
+
+	      FOR_EACH_IMM_USE_STMT (use_stmt, use_iter, var)
+		{
+		  int blocknum = gimple_bb (use_stmt)->index;
+
+		  /* Don't propagate SSA names that are only used in the
+		     current block, unless the usage is in a phi node: that
+		     means the name left the block, then came back in at the
+		     top.  */
+		  if (blocknum != block->index
+		      || gimple_code (use_stmt) == GIMPLE_PHI)
+		    uses_outside_block = true;
+		  if (!bitmap_bit_p (worker_single, blocknum))
+		    worker_partitioned_use = true;
+		}
+
+	      if (uses_outside_block)
+		def_escapes_block.add (var);
+
+	      if (worker_partitioned_use)
+		{
+		  worker_partitioned_uses.add (var);
+		  has_defs = true;
+		}
+	    }
+
+	  propagation_set *ws_prop = (*prop_set)[block->index];
+
+	  if (ws_prop)
+	    {
+	      for (propagation_set::iterator it = ws_prop->begin ();
+		   it != ws_prop->end ();
+		   ++it)
+		{
+		  tree var = *it;
+		  if (TREE_CODE (var) == VAR_DECL)
+		    {
+		      def_escapes_block.add (var);
+		      if (partitioned_var_uses->contains (var))
+			{
+			  worker_partitioned_uses.add (var);
+			  has_defs = true;
+			}
+		    }
+		}
+
+	      delete ws_prop;
+	      (*prop_set)[block->index] = 0;
+	    }
+
+	  tree record_type = (tree) block->aux;
+
+	  if (has_defs)
+	    worker_single_copy (block, block, &def_escapes_block,
+				&worker_partitioned_uses, record_type);
+	  else
+	    worker_single_simple (block, block, &def_escapes_block);
+	}
+    }
+
+  if ((outer_mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)) == 0)
+    {
+      basic_block block;
+
+      for (unsigned i = 0; par->blocks.iterate (i, &block); i++)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (block);
+	     !gsi_end_p (gsi);
+	     gsi_next (&gsi))
+	  {
+	    gimple *stmt = gsi_stmt (gsi);
+
+	    if (gimple_code (stmt) == GIMPLE_CALL
+		&& !gimple_call_internal_p (stmt)
+		&& !omp_sese_active_worker_call (as_a <gcall *> (stmt)))
+	      {
+		/* If we have an OpenACC routine call in worker-single mode,
+		   place barriers before and afterwards to prevent
+		   clobbering re-used shared memory regions (as are used
+		   for AMDGCN at present, for example).  */
+		tree decl = builtin_decl_explicit (BUILT_IN_GOACC_BARRIER);
+		gsi_insert_before (&gsi, gimple_build_call (decl, 0),
+				   GSI_SAME_STMT);
+		gsi_insert_after (&gsi, gimple_build_call (decl, 0),
+				  GSI_NEW_STMT);
+	      }
+	  }
+    }
+
+  if (par->inner)
+    neuter_worker_single (par->inner, mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+  if (par->next)
+    neuter_worker_single (par->next, outer_mask, worker_single, vector_single,
+			  prop_set, partitioned_var_uses);
+}
+
+static int
+execute_omp_oacc_neuter_broadcast ()
+{
+  bb_stmt_map_t bb_stmt_map;
+  auto_bitmap worker_single, vector_single;
+
+  omp_sese_split_blocks (&bb_stmt_map);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter splitting:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+
+  unsigned mask = 0;
+
+  /* If this is a routine, calculate MASK as if the outer levels are already
+     partitioned.  */
+  tree attr = oacc_get_fn_attrib (current_function_decl);
+  if (attr)
+    {
+      tree dims = TREE_VALUE (attr);
+      unsigned ix;
+      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	{
+	  tree allowed = TREE_PURPOSE (dims);
+	  if (allowed && integer_zerop (allowed))
+	    mask |= GOMP_DIM_MASK (ix);
+	}
+    }
+
+  parallel_g *par = omp_sese_discover_pars (&bb_stmt_map);
+  populate_single_mode_bitmaps (par, worker_single, vector_single, mask, 0);
+
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    bb->aux = NULL;
+
+  field_map = record_field_map_t::create_ggc (40);
+
+  vec<propagation_set *> prop_set;
+  prop_set.create (last_basic_block_for_fn (cfun));
+
+  for (int i = 0; i < last_basic_block_for_fn (cfun); i++)
+    prop_set.quick_push (0);
+
+  find_ssa_names_to_propagate (par, mask, worker_single, vector_single,
+			       &prop_set);
+
+  hash_set<tree> partitioned_var_uses;
+  hash_set<tree> gang_private_vars;
+
+  find_gang_private_vars (&gang_private_vars);
+  find_partitioned_var_uses (par, mask, &partitioned_var_uses);
+  find_local_vars_to_propagate (par, mask, &partitioned_var_uses,
+				&gang_private_vars, &prop_set);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      propagation_set *ws_prop = prop_set[bb->index];
+      if (ws_prop)
+	{
+	  tree record_type = lang_hooks.types.make_type (RECORD_TYPE);
+	  tree name = create_tmp_var_name (".oacc_ws_data_s");
+	  name = build_decl (UNKNOWN_LOCATION, TYPE_DECL, name, record_type);
+	  DECL_ARTIFICIAL (name) = 1;
+	  DECL_NAMELESS (name) = 1;
+	  TYPE_NAME (record_type) = name;
+	  TYPE_ARTIFICIAL (record_type) = 1;
+
+	  auto_vec<tree> field_vec (ws_prop->elements ());
+	  for (hash_set<tree>::iterator it = ws_prop->begin ();
+	       it != ws_prop->end (); ++it)
+	    field_vec.quick_push (*it);
+
+	  field_vec.qsort (sort_by_size_then_ssa_version_or_uid);
+
+	  field_map->put (record_type, field_map_t::create_ggc (17));
+
+	  /* Insert var fields in reverse order, so the last inserted element
+	     is the first in the structure.  */
+	  for (int i = field_vec.length () - 1; i >= 0; i--)
+	    install_var_field (field_vec[i], record_type);
+
+	  layout_type (record_type);
+
+	  bb->aux = (tree) record_type;
+	}
+    }
+
+  neuter_worker_single (par, mask, worker_single, vector_single, &prop_set,
+			&partitioned_var_uses);
+
+  prop_set.release ();
+
+  /* This doesn't seem to make a difference.  */
+  loops_state_clear (LOOP_CLOSED_SSA);
+
+  /* Neutering worker-single blocks invalidates dominance info.  It may be
+     possible to update just the affected blocks incrementally, but discard
+     everything for now.  */
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n\nAfter neutering:\n\n");
+      dump_function_to_file (current_function_decl, dump_file, dump_flags);
+    }
+
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_omp_oacc_neuter_broadcast =
+{
+  GIMPLE_PASS, /* type */
+  "omp_oacc_neuter_broadcast", /* name */
+  OPTGROUP_OMP, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
+};
+
+class pass_omp_oacc_neuter_broadcast : public gimple_opt_pass
+{
+public:
+  pass_omp_oacc_neuter_broadcast (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_omp_oacc_neuter_broadcast, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return (flag_openacc
+	    && targetm.goacc.create_worker_broadcast_record);
+  };
+
+  virtual unsigned int execute (function *)
+    {
+      return execute_omp_oacc_neuter_broadcast ();
+    }
+
+}; // class pass_omp_oacc_neuter_broadcast
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_omp_oacc_neuter_broadcast (gcc::context *ctxt)
+{
+  return new pass_omp_oacc_neuter_broadcast (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index 26d86df2f5a..d7a1f8c97a6 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -184,6 +184,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
   NEXT_PASS (pass_oacc_loop_designation);
+  NEXT_PASS (pass_omp_oacc_neuter_broadcast);
   NEXT_PASS (pass_oacc_device_lower);
   NEXT_PASS (pass_omp_device_lower);
   NEXT_PASS (pass_omp_target_link);
diff --git a/gcc/target.def b/gcc/target.def
index 68a46aaa832..7676d5e626e 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1756,6 +1756,17 @@ private variables at OpenACC device-lowering time using the\n\
 rtx, (tree var),
 NULL)
 
+DEFHOOK
+(create_worker_broadcast_record,
+"Create a record used to propagate local-variable state from an active\n\
+worker to other workers.  A possible implementation might adjust the type\n\
+of REC to place the new variable in shared GPU memory.\n\
+\n\
+Presence of this target hook indicates that middle end neutering/broadcasting\n\
+should be used.",
+tree, (tree rec, bool sender, const char *name),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 5484ad5eac7..83941bc0cee 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -425,6 +425,7 @@ extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_target_link (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_loop_designation (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_omp_oacc_neuter_broadcast (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_oacc_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_omp_device_lower (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
-- 
2.30.2
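
For readers following the control flow that worker_single_copy builds, here
is a small sequential model of it in plain C++, outside the compiler.  It is
only a sketch under assumptions: BroadcastRecord, single_copy_start and
run_worker_single are made-up names, worker 0 is assumed to be the active
worker, and the barrier is modelled by finishing one loop before starting the
next.  Only the shape is taken from the patch: the start builtin's result is
null on the active path, the active worker stores escaping values into a
shared record, and after the barrier every worker holding a non-null receiver
pointer copies the values out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-block broadcast record
// (the patch names the real type ".oacc_ws_data_s").
struct BroadcastRecord
{
  int x;
};

// Model of the "single copy start" builtin: null for the active worker,
// a pointer to the shared record for everyone else.  Picking worker 0 as
// the active one is an assumption of this sketch.
BroadcastRecord *
single_copy_start (BroadcastRecord *shared, std::size_t worker)
{
  return worker == 0 ? nullptr : shared;
}

// Run one worker-single block for NUM_WORKERS modelled workers and return
// the value of 'x' that each worker sees after the block.
std::vector<int>
run_worker_single (std::size_t num_workers, int input)
{
  assert (num_workers > 0);

  BroadcastRecord shared = { 0 };
  // Idle workers start from zero, like the constant arm of the barrier phi.
  std::vector<int> x (num_workers, 0);
  std::vector<BroadcastRecord *> recv (num_workers, nullptr);

  // Phase 1: everything up to the barrier.
  for (std::size_t w = 0; w < num_workers; w++)
    {
      recv[w] = single_copy_start (&shared, w);
      if (recv[w] == nullptr)
	{
	  x[w] = input * 2 + 1;	// Body of the neutered block.
	  shared.x = x[w];	// Sender sequence: publish the result.
	}
    }

  // The barrier sits here: phase 1 is complete for all workers.

  // Phase 2: copy-out block, taken only by the idle workers.
  for (std::size_t w = 0; w < num_workers; w++)
    if (recv[w] != nullptr)
      x[w] = recv[w]->x;

  return x;
}
```

Calling run_worker_single (4, 20) gives each of the four modelled workers the
value that worker 0 computed once.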



* Re: [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN
  2021-03-02 12:20 ` [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN Julian Brown
@ 2021-08-09 13:26   ` Thomas Schwinge
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-09 13:26 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2787 bytes --]

Hi!

On 2021-03-02T04:20:13-0800, Julian Brown <julian@codesourcery.com> wrote:
> This patch enables worker-partitioning support via gimple rewriting for
> AMD GCN.

Thanks!

> Older (and currently unused) parts of this support are already
> present in the AMD GCN backend: those vestigial parts are enabled or
> updated, as appropriate.

..., and some of that moved into the "openacc: Middle-end
worker-partitioning support" commit.

Some of the test suite changes have already been made via other commits.
Conversely, a few additional ones were necessary now.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
> @@ -79,13 +79,18 @@ int check (const int *ary, int size, int gp, int wp, int vp)
>       exit = 1;
>        }
>
> +#ifndef ACC_DEVICE_TYPE_radeon
> +  /* AMD GCN uses the autovectorizer for the vector dimension: the use
> +     of a function call in vector-partitioned code in this test is not
> +     currently supported.  */
>    for (ix = 0; ix < vp; ix++)
>      if (vectors[ix] != vectors[0])
>        {
>       printf ("vector %d not used %d times\n", ix, vectors[0]);
>       exit = 1;
>        }
> -
> +#endif
> +
>    return exit;
>  }

I removed this change (disabling 'vectors' checking for
'ACC_DEVICE_TYPE_radeon'), because:

> @@ -132,9 +137,7 @@ int main ()
>    /* AMD GCN uses the autovectorizer for the vector dimension: the use
>       of a function call in vector-partitioned code in this test is not
>       currently supported.  */
> -  /* AMD GCN does not currently support multiple workers.  This should be
> -     set to 16 when that changes.  */
> -  return test_1 (16, 1, 1);
> +  return test_1 (16, 16, 64);
>  #else
>    return test_1 (16, 16, 32);
>  #endif

... if we continue to specify 'vp = 1' for 'ACC_DEVICE_TYPE_radeon', the
'vectors' checking above can stay as is.  (I did the same in the other test
case files.)

ACK for the 'wp = 1' to 'wp = 16' change, of course.

I found that 'libgomp.oacc-fortran/optional-reduction.f90' and
'libgomp.oacc-fortran/reduction-7.f90' need to be XFAILed for '-O0'.
I suppose that'll get resolved via the forthcoming "openacc:
Reference-typed reduction and private variable rewriting" changes.

Pushed "amdgcn: Enable OpenACC worker partitioning for AMD GCN" to master
branch in commit c408512e1f7ca07e07794dc13fd6dfd9d2d7e998, see attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Attachment #2: 0001-amdgcn-Enable-OpenACC-worker-partitioning-for-AMD-GC.patch --]
[-- Type: text/x-diff, Size: 9728 bytes --]

From c408512e1f7ca07e07794dc13fd6dfd9d2d7e998 Mon Sep 17 00:00:00 2001
From: Julian Brown <julian@codesourcery.com>
Date: Tue, 2 Mar 2021 04:20:13 -0800
Subject: [PATCH] amdgcn: Enable OpenACC worker partitioning for AMD GCN

	gcc/
	* config/gcn/gcn.c (gcn_init_builtins): Override decls for
	BUILT_IN_GOACC_SINGLE_START, BUILT_IN_GOACC_SINGLE_COPY_START,
	BUILT_IN_GOACC_SINGLE_COPY_END and BUILT_IN_GOACC_BARRIER.
	(gcn_goacc_validate_dims): Turn on worker partitioning unconditionally.
	(gcn_fork_join): Update comment.
	* config/gcn/gcn.opt (flag_worker_partitioning): Remove.
	(macc_experimental_workers): Remove unused option.
	libgomp/
	* plugin/plugin-gcn.c (gcn_exec): Change default number of workers to
	16.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
	[acc_device_radeon]: Update.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
	[ACC_DEVICE_TYPE_radeon]: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
	[acc_device_radeon]: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
	[ACC_DEVICE_TYPE_radeon]: Likewise.
	* testsuite/libgomp.oacc-fortran/optional-reduction.f90: XFAIL for
	'openacc_radeon_accel_selected' and '-O0'.
	* testsuite/libgomp.oacc-fortran/reduction-7.f90: Likewise.

Co-Authored-By: Kwok Cheung Yeung <kcy@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
---
 gcc/config/gcn/gcn.c                              | 15 +++------------
 gcc/config/gcn/gcn.opt                            |  5 -----
 libgomp/plugin/plugin-gcn.c                       |  3 +--
 .../acc_prof-kernels-1.c                          |  3 ---
 .../libgomp.oacc-c-c++-common/loop-dim-default.c  |  4 +---
 .../libgomp.oacc-c-c++-common/parallel-dims.c     | 12 ++++--------
 .../libgomp.oacc-c-c++-common/routine-wv-2.c      |  7 ++++---
 .../libgomp.oacc-fortran/optional-reduction.f90   |  3 +++
 .../libgomp.oacc-fortran/reduction-7.f90          |  3 +++
 9 files changed, 19 insertions(+), 36 deletions(-)

diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 87af5d18f42..9df28277498 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -3712,8 +3712,6 @@ gcn_init_builtins (void)
       TREE_NOTHROW (gcn_builtin_decls[i]) = 1;
     }
 
-/* FIXME: remove the ifdef once OpenACC support is merged upstream.  */
-#ifdef BUILT_IN_GOACC_SINGLE_START
   /* These builtins need to take/return an LDS pointer: override the generic
      versions here.  */
 
@@ -3730,7 +3728,6 @@ gcn_init_builtins (void)
 
   set_builtin_decl (BUILT_IN_GOACC_BARRIER,
 		    gcn_builtin_decls[GCN_BUILTIN_ACC_BARRIER], false);
-#endif
 }
 
 /* Implement TARGET_INIT_LIBFUNCS.  */
@@ -5019,11 +5016,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
 			 unsigned /*used*/)
 {
   bool changed = false;
-
-  /* FIXME: remove -facc-experimental-workers when they're ready.  */
-  int max_workers = flag_worker_partitioning ? 16 : 1;
-
-  gcc_assert (!flag_worker_partitioning);
+  const int max_workers = 16;
 
   /* The vector size must appear to be 64, to the user, unless this is a
      SEQ routine.  The real, internal value is always 1, which means use
@@ -5060,8 +5053,7 @@ gcn_goacc_validate_dims (tree decl, int dims[], int fn_level,
     {
       dims[GOMP_DIM_VECTOR] = GCN_DEFAULT_VECTORS;
       if (dims[GOMP_DIM_WORKER] < 0)
-	dims[GOMP_DIM_WORKER] = (flag_worker_partitioning
-				 ? GCN_DEFAULT_WORKERS : 1);
+	dims[GOMP_DIM_WORKER] = GCN_DEFAULT_WORKERS;
       if (dims[GOMP_DIM_GANG] < 0)
 	dims[GOMP_DIM_GANG] = GCN_DEFAULT_GANGS;
       changed = true;
@@ -5126,8 +5118,7 @@ static bool
 gcn_fork_join (gcall *ARG_UNUSED (call), const int *ARG_UNUSED (dims),
 	       bool ARG_UNUSED (is_fork))
 {
-  /* GCN does not use the fork/join concept invented for NVPTX.
-     Instead we use standard autovectorization.  */
+  /* GCN does not need to expand fork/join markers at the RTL level.  */
   return false;
 }
 
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index b2b10b0794c..6faacca42bb 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -62,11 +62,6 @@ bool flag_bypass_init_error = false
 mbypass-init-error
 Target RejectNegative Var(flag_bypass_init_error)
 
-bool flag_worker_partitioning = false
-
-macc-experimental-workers
-Target Var(flag_worker_partitioning) Init(0)
-
 int stack_size_opt = -1
 
 mstack-size=
diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index f26d7361106..9e7377c91f9 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -3038,8 +3038,7 @@ gcn_exec (struct kernel_info *kernel, size_t mapnum, void **hostaddrs,
      64 gangs matches a typical Fiji device.  */
 
   if (dims[0] == 0) dims[0] = get_cu_count (kernel->agent); /* Gangs.  */
-  /* NOTE: Until support for middle-end worker partitioning is merged, force 'num_workers (1)'.  */
-  if (/*TODO dims[1] == 0*/ true) dims[1] = 1;  /* Workers.  */
+  if (dims[1] == 0) dims[1] = 16; /* Workers.  */
 
   /* The incoming dimensions are expressed in terms of gangs, workers, and
      vectors.  The HSA dimensions are expressed in terms of "work-items",
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
index 6c136c26c93..ad33f72e2fb 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
@@ -93,9 +93,6 @@ static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *e
     }
   if (num_workers < 1)
     assert (event_info->launch_event.num_workers >= 1);
-  /* GCN currently enforces 'num_workers (1)'.  */
-  else if (acc_device_type == acc_device_radeon)
-    assert (event_info->launch_event.num_workers == 1);
   else
     {
 #ifdef __OPTIMIZE__
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
index ca771646655..419bc33ad53 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
@@ -132,9 +132,7 @@ int main ()
   /* AMD GCN uses the autovectorizer for the vector dimension: the use
      of a function call in vector-partitioned code in this test is not
      currently supported.  */
-  /* AMD GCN does not currently support multiple workers.  This should be
-     set to 16 when that changes.  */
-  return test_1 (16, 1, 1);
+  return test_1 (16, 16, 1);
 #else
   return test_1 (16, 16, 32);
 #endif
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
index fe0dacd5aac..9392e1d88c5 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
@@ -261,9 +261,8 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* The GCC GCN back end is limited to num_workers (16).
-	     Temporarily set this to 1 until multiple workers are permitted. */
-	  workers_actual = 1; // 16;
+	  /* The GCC GCN back end is limited to num_workers (16).  */
+	  workers_actual = 16;
 	}
       else
 	__builtin_abort ();
@@ -313,9 +312,8 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* The GCC GCN back end is limited to num_workers (16).
-	     Temporarily set this to 1 until multiple workers are permitted. */
-	  workers_actual = 1; // 16;
+	  /* The GCC GCN back end is limited to num_workers (16).  */
+	  workers_actual = 16;
 	}
       else
 	__builtin_abort ();
@@ -465,8 +463,6 @@ int main ()
 	}
       else if (acc_on_device (acc_device_radeon))
 	{
-	  /* Temporary setting, until multiple workers are permitted.  */
-	  workers_actual = 1;
 	  /* See above comments about GCN vectors_actual.  */
 	  vectors_actual = 1;
 	}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
index 624ec24e437..4f88b1c0779 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-wv-2.c
@@ -5,12 +5,13 @@
 #include <openacc.h>
 #include <gomp-constants.h>
 
+#define NUM_WORKERS 16
 #ifdef ACC_DEVICE_TYPE_radeon
-/* Temporarily set this to 1 until multiple workers are permitted.  */
-#define NUM_WORKERS 1
+/* AMD GCN uses the autovectorizer for the vector dimension: the use
+   of a function call in vector-partitioned code in this test is not
+   currently supported.  */
 #define NUM_VECTORS 1
 #else
-#define NUM_WORKERS 16
 #define NUM_VECTORS 32
 #endif
 #define WIDTH 64
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90 b/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
index 29f92c0d4c3..69b69b66c71 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/optional-reduction.f90
@@ -4,6 +4,9 @@
 
 ! { dg-do run }
 
+!TODO
+! { dg-xfail-run-if TODO { openacc_radeon_accel_selected && { ! __OPTIMIZE__ } } }
+
 program optional_reduction
   implicit none
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
index 8cffac93a22..a8b0c60e420 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
@@ -1,5 +1,8 @@
 ! { dg-do run }
 
+!TODO
+! { dg-xfail-run-if TODO { openacc_radeon_accel_selected && { ! __OPTIMIZE__ } } }
+
 ! subroutine reduction with private and firstprivate variables
 
 program reduction
-- 
2.30.2



* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-08-06  9:25     ` Julian Brown
@ 2021-08-09 13:32       ` Thomas Schwinge
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-09 13:32 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Tobias Burnus, Kwok Cheung Yeung, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2297 bytes --]

Hi!

On 2021-08-06T10:25:22+0100, Julian Brown <julian@codesourcery.com> wrote:
> On Wed, 4 Aug 2021 15:56:49 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>> a nontrivial amount of data structures/logic/code did get
>> duplicated from the nvptx back end, and modified slightly or
>> not-so-slightly (RTL vs. GIMPLE plus certain implementation
>> "details").
>>
>> We should at least cross reference the two instances, to make sure
>> that any changes to one are also propagated to the other.  (I'll take
>> care.)
>
> OK, thanks,

Pushed "Cross-reference parts adapted in
'gcc/omp-oacc-neuter-broadcast.cc'" to master branch in
commit 62f01243fb27030b8d99c671f27349c2e7465edc, see attached.

>> And then, do you (or anyone else, of course) happen to have any clever
>> idea about how to avoid the duplication, and somehow combine the RTL
>> vs. GIMPLE implementations?  Given that we nowadays may use C++ -- do
>> you foresee it feasible to have an abstract base class capturing
>> basically the data structures, logic, common code, and then
>> RTL-specialized plus GIMPLE-specialized classes inheriting from that?
>
> I suppose one could either use "old-style" inheritance, or maybe do
> it with templates.

Or, as my WIP would show: both of these.  ;-) (To be posted later.)
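
To make the "inheritance plus templates" idea concrete, here is a minimal
sketch.  It is not the WIP mentioned above (which is not shown in this
thread), and every name in it is hypothetical: parallel_t, neuter_walker,
gimple_like_walker and demo_neutered_blocks are invented for illustration.
The region structure is a template over the block type, so an RTL and a
GIMPLE flavour could share it, and the traversal lives in an abstract base
class whose only virtual step is the IR-specific one.  The mask handling
mirrors neuter_worker_single: inner regions inherit the combined mask,
sibling regions get the outer mask.

```cpp
#include <memory>
#include <vector>

// IR-independent description of one partitioned region, loosely modelled
// on 'struct parallel' (nvptx) and 'struct parallel_g' (this pass).
template <typename Block>
struct parallel_t
{
  unsigned mask = 0;			// Dimensions partitioned here.
  std::vector<Block> blocks;		// Blocks belonging to this region.
  std::unique_ptr<parallel_t> inner;	// First nested region.
  std::unique_ptr<parallel_t> next;	// Next sibling region.
};

// Shared traversal logic; only the per-block action is IR-specific.
template <typename Block>
class neuter_walker
{
public:
  // Assumed to match GOMP_DIM_MASK (GOMP_DIM_WORKER).
  static constexpr unsigned worker_bit = 1u << 1;

  virtual ~neuter_walker () = default;

  void walk (const parallel_t<Block> &par, unsigned outer_mask)
  {
    unsigned mask = outer_mask | par.mask;

    // Worker-single region: every block needs neutering.
    if ((mask & worker_bit) == 0)
      for (const Block &b : par.blocks)
	neuter_block (b);

    if (par.inner)
      walk (*par.inner, mask);
    if (par.next)
      walk (*par.next, outer_mask);
  }

protected:
  virtual void neuter_block (const Block &b) = 0;
};

// Stand-in for a GIMPLE-specialized walker; blocks are plain indices here
// and "neutering" just records which blocks were visited.
class gimple_like_walker : public neuter_walker<int>
{
public:
  std::vector<int> neutered;

protected:
  void neuter_block (const int &b) override { neutered.push_back (b); }
};

// Build a tiny region tree and report which blocks get neutered.
std::vector<int>
demo_neutered_blocks ()
{
  parallel_t<int> root;
  root.blocks = { 0, 1 };
  root.inner = std::make_unique<parallel_t<int>> ();
  root.inner->mask = neuter_walker<int>::worker_bit;
  root.inner->blocks = { 2 };
  root.inner->next = std::make_unique<parallel_t<int>> ();
  root.inner->next->blocks = { 3 };

  gimple_like_walker w;
  w.walk (root, 0);
  return w.neutered;
}
```

In this toy tree, block 2 sits in a worker-partitioned region and is skipped,
while blocks 0, 1 and 3 are handed to the IR-specific hook.  An RTL flavour
would be a second subclass over a different Block type.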

> There's probably both costs & benefits when it comes
> to maintenance, either way -- having this code shared would mean any
> changes need testing for both nvptx & GCN targets, and risks making it
> harder to follow. OTOH, like you say, changes would only need to be
> made in one place.


> TBH, I'd spend effort on trying to integrate the SESE code (if it'd be
> beneficial) first, before trying to de-duplicate those other bits.

Spending effort on that may make sense, but I can't do it as part of this
task: it would be new development plus the related performance (etc.)
analysis, and I also don't know much about that in the GCN context.


Grüße
 Thomas


[-- Attachment #2: 0001-Cross-reference-parts-adapted-in-gcc-omp-oacc-neuter.patch --]
[-- Type: text/x-diff, Size: 5133 bytes --]

From 62f01243fb27030b8d99c671f27349c2e7465edc Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Mon, 9 Aug 2021 12:21:43 +0200
Subject: [PATCH] Cross-reference parts adapted in
 'gcc/omp-oacc-neuter-broadcast.cc'

	gcc/
	* config/nvptx/nvptx.c: Cross-reference parts adapted in
	'gcc/omp-oacc-neuter-broadcast.cc'.
	* omp-low.c: Likewise.
	* omp-oacc-neuter-broadcast.cc: Cross-reference parts adapted from
	the above files.
---
 gcc/config/nvptx/nvptx.c         | 5 +++++
 gcc/omp-low.c                    | 2 ++
 gcc/omp-oacc-neuter-broadcast.cc | 9 ++++++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 6642bdfa867..4e4909e8c5f 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -3205,6 +3205,7 @@ nvptx_mach_vector_length ()
 
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:struct parallel_g'.  */
 
 struct parallel
 {
@@ -3282,6 +3283,7 @@ typedef auto_vec<insn_bb_t> insn_bb_vec_t;
    partitioning mode of the function as a whole.  Populate MAP with
    head and tail blocks.  We also clear the BB visited flag, which is
    used when finding partitions.  */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:omp_sese_split_blocks'.  */
 
 static void
 nvptx_split_blocks (bb_insn_map_t *map)
@@ -3383,6 +3385,7 @@ nvptx_discover_pre (basic_block block, int expected)
 }
 
 /* Dump this parallel and all its inner parallels.  */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:omp_sese_dump_pars'.  */
 
 static void
 nvptx_dump_pars (parallel *par, unsigned depth)
@@ -3408,6 +3411,7 @@ nvptx_dump_pars (parallel *par, unsigned depth)
 /* If BLOCK contains a fork/join marker, process it to create or
    terminate a loop structure.  Add this block to the current loop,
    and then walk successor blocks.   */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:omp_sese_find_par'.  */
 
 static parallel *
 nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
@@ -3488,6 +3492,7 @@ nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
    to head & tail markers, discovered when splitting blocks.  This
    speeds up the discovery.  We rely on the BB visited flag having
    been cleared when splitting blocks.  */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:omp_sese_discover_pars'.  */
 
 static parallel *
 nvptx_discover_pars (bb_insn_map_t *map)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 2f735bcde9c..926087da701 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -615,6 +615,8 @@ omp_copy_decl_1 (tree var, omp_context *ctx)
 
 /* Build COMPONENT_REF and set TREE_THIS_VOLATILE and TREE_READONLY on it
    as appropriate.  */
+/* See also 'gcc/omp-oacc-neuter-broadcast.cc:oacc_build_component_ref'.  */
+
 static tree
 omp_build_component_ref (tree obj, tree field)
 {
diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
index 0f6ba885c6c..f8555380451 100644
--- a/gcc/omp-oacc-neuter-broadcast.cc
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -56,6 +56,7 @@
 
 /* Loop structure of the function.  The entire function is described as
    a NULL loop.  */
+/* Adapted from 'gcc/config/nvptx/nvptx.c:struct parallel'.  */
 
 struct parallel_g
 {
@@ -183,6 +184,7 @@ omp_sese_active_worker_call (gcall *call)
    partitioning mode of the function as a whole.  Populate MAP with
    head and tail blocks.  We also clear the BB visited flag, which is
    used when finding partitions.  */
+/* Adapted from 'gcc/config/nvptx/nvptx.c:nvptx_split_blocks'.  */
 
 static void
 omp_sese_split_blocks (bb_stmt_map_t *map)
@@ -341,6 +343,7 @@ mask_name (unsigned mask)
 }
 
 /* Dump this parallel and all its inner parallels.  */
+/* Adapted from 'gcc/config/nvptx/nvptx.c:nvptx_dump_pars'.  */
 
 static void
 omp_sese_dump_pars (parallel_g *par, unsigned depth)
@@ -366,6 +369,7 @@ omp_sese_dump_pars (parallel_g *par, unsigned depth)
 /* If BLOCK contains a fork/join marker, process it to create or
    terminate a loop structure.  Add this block to the current loop,
    and then walk successor blocks.   */
+/* Adapted from 'gcc/config/nvptx/nvptx.c:nvptx_find_par'.  */
 
 static parallel_g *
 omp_sese_find_par (bb_stmt_map_t *map, parallel_g *par, basic_block block)
@@ -471,6 +475,7 @@ walk_successors:
    to head & tail markers, discovered when splitting blocks.  This
    speeds up the discovery.  We rely on the BB visited flag having
    been cleared when splitting blocks.  */
+/* Adapted from 'gcc/config/nvptx/nvptx.c:nvptx_discover_pars'.  */
 
 static parallel_g *
 omp_sese_discover_pars (bb_stmt_map_t *map)
@@ -931,7 +936,9 @@ worker_single_simple (basic_block from, basic_block to,
   update_stmt (acc_bar);
 }
 
-/* This is a copied and renamed omp-low.c:omp_build_component_ref.  */
+/* Build COMPONENT_REF and set TREE_THIS_VOLATILE and TREE_READONLY on it
+   as appropriate.  */
+/* Adapted from 'gcc/omp-low.c:omp_build_component_ref'.  */
 
 static tree
 oacc_build_component_ref (tree obj, tree field)
-- 
2.30.2



* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-08-06  8:49     ` Julian Brown
@ 2021-08-16 10:34       ` Thomas Schwinge
  2022-02-22 16:48         ` Further simplify 'gcc/omp-oacc-neuter-broadcast.cc:record_field_map_t' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
  0 siblings, 1 reply; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-16 10:34 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 1369 bytes --]

Hi!

On 2021-08-06T09:49:58+0100, Julian Brown <julian@codesourcery.com> wrote:
> On Wed, 4 Aug 2021 15:13:30 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
>
>> 'oacc_do_neutering' is the 'execute' function of the pass, so that
>> means every time this executes, a fresh 'field_map' is set up, no
>> state persists across runs (assuming I'm understanding that
>> correctly).  Why don't we simply use standard (non-GC) memory
>> management for that?  "For convenience" shall be fine as an answer
>> ;-) -- but maybe instead of figuring out the right GC annotations,
>> changing the memory management will be easier?  (Or, of course, maybe
>> I completely misunderstood that?)
>
> I suspect you're right, and there's no need for this to be GC-allocated
> memory. If non-standard memory allocation will work out fine, we should

("non-GC", I suppose.)

> probably use that instead.

Pushed "Avoid 'GTY' use for 'gcc/omp-oacc-neuter-broadcast.cc:field_map'"
to master branch in commit 049eda8274b7394523238b17ab12c3e2889f253e, see
attached.
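
The shape of that change can be sketched outside of GCC as follows. This is
only an illustration, not the code from the attached patch: 'std::string'
and 'std::unordered_map' stand in for 'tree' and GCC's 'hash_map', and the
names 'install_field' and 'run_pass_once' are hypothetical. The point is
that the map lives for one invocation of the 'execute' function, is handed
to helpers as a parameter, and is freed manually at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Stand-ins for 'hash_map<tree, tree>' and 'hash_map<tree, field_map_t *>';
// 'std::string' plays the role of 'tree' (assumption of this sketch).
using field_map_t = std::unordered_map<std::string, std::string>;
using record_field_map_t = std::unordered_map<std::string, field_map_t *>;

// Hypothetical helper: receives the per-record map explicitly instead of
// looking it up in a file-scope variable.
static void
install_field (const std::string &var, field_map_t *fields)
{
  (*fields)[var] = "field_for_" + var;
}

// Hypothetical 'execute' function: all state is created, used, and freed
// within one invocation.  Returns the number of fields installed.
static std::size_t
run_pass_once ()
{
  record_field_map_t record_field_map;

  field_map_t *fields = new field_map_t;
  bool existed = !record_field_map.emplace ("record_a", fields).second;
  assert (!existed);
  (void) existed;

  install_field ("x", fields);
  install_field ("y", fields);
  std::size_t installed = fields->size ();

  // Manual clean-up takes over from the garbage collector.
  for (auto &entry : record_field_map)
    delete entry.second;
  record_field_map.clear ();

  return installed;
}
```

Because nothing outlives 'run_pass_once', calling it repeatedly starts from
a fresh map each time, which is the property that makes GC allocation
unnecessary here.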


Regards
 Thomas


[-- Attachment #2: 0001-Avoid-GTY-use-for-gcc-omp-oacc-neuter-broadcast.cc-f.patch --]
[-- Type: text/x-diff, Size: 6680 bytes --]

From 049eda8274b7394523238b17ab12c3e2889f253e Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 6 Aug 2021 12:09:12 +0200
Subject: [PATCH] Avoid 'GTY' use for
 'gcc/omp-oacc-neuter-broadcast.cc:field_map'

... and further simplify related things a bit.

Fix-up/clean-up for recent commit e2a58ed6dc5293602d0d168475109caa81ad0f0d
"openacc: Middle-end worker-partitioning support".

	gcc/
	* omp-oacc-neuter-broadcast.cc (field_map): Move variable into...
	(execute_omp_oacc_neuter_broadcast): ... here.
	(install_var_field, build_receiver_ref, build_sender_ref): Take
	'field_map_t *' parameter.  Adjust all users.
	(worker_single_copy, neuter_worker_single): Take a
	'record_field_map_t *' parameter.  Adjust all users.
---
 gcc/omp-oacc-neuter-broadcast.cc | 48 +++++++++++++++++---------------
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
index f8555380451..9bde0aca10f 100644
--- a/gcc/omp-oacc-neuter-broadcast.cc
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -538,12 +538,9 @@ typedef hash_map<tree, tree> field_map_t;
 
 typedef hash_map<tree, field_map_t *> record_field_map_t;
 
-static GTY(()) record_field_map_t *field_map;
-
 static void
-install_var_field (tree var, tree record_type)
+install_var_field (tree var, tree record_type, field_map_t *fields)
 {
-  field_map_t *fields = *field_map->get (record_type);
   tree name;
   char tmp[20];
 
@@ -959,9 +956,8 @@ oacc_build_component_ref (tree obj, tree field)
 }
 
 static tree
-build_receiver_ref (tree record_type, tree var, tree receiver_decl)
+build_receiver_ref (tree var, tree receiver_decl, field_map_t *fields)
 {
-  field_map_t *fields = *field_map->get (record_type);
   tree x = build_simple_mem_ref (receiver_decl);
   tree field = *fields->get (var);
   TREE_THIS_NOTRAP (x) = 1;
@@ -970,9 +966,8 @@ build_receiver_ref (tree record_type, tree var, tree receiver_decl)
 }
 
 static tree
-build_sender_ref (tree record_type, tree var, tree sender_decl)
+build_sender_ref (tree var, tree sender_decl, field_map_t *fields)
 {
-  field_map_t *fields = *field_map->get (record_type);
   tree field = *fields->get (var);
   return oacc_build_component_ref (sender_decl, field);
 }
@@ -1010,7 +1005,7 @@ static void
 worker_single_copy (basic_block from, basic_block to,
 		    hash_set<tree> *def_escapes_block,
 		    hash_set<tree> *worker_partitioned_uses,
-		    tree record_type)
+		    tree record_type, record_field_map_t *record_field_map)
 {
   /* If we only have virtual defs, we'll have no record type, but we still want
      to emit single_copy_start and (particularly) single_copy_end to act as
@@ -1147,7 +1142,7 @@ worker_single_copy (basic_block from, basic_block to,
 	gcc_assert (TREE_CODE (var) == VAR_DECL);
 
       /* If we had no record type, we will have no fields map.  */
-      field_map_t **fields_p = field_map->get (record_type);
+      field_map_t **fields_p = record_field_map->get (record_type);
       field_map_t *fields = fields_p ? *fields_p : NULL;
 
       if (worker_partitioned_uses->contains (var)
@@ -1158,8 +1153,7 @@ worker_single_copy (basic_block from, basic_block to,
 
 	  /* Receive definition from shared memory block.  */
 
-	  tree receiver_ref = build_receiver_ref (record_type, var,
-						  receiver_decl);
+	  tree receiver_ref = build_receiver_ref (var, receiver_decl, fields);
 	  gassign *recv = gimple_build_assign (neutered_def,
 					       receiver_ref);
 	  gsi_insert_after (&copyout_gsi, recv, GSI_CONTINUE_LINKING);
@@ -1189,7 +1183,7 @@ worker_single_copy (basic_block from, basic_block to,
 
 	  /* Send definition to shared memory block.  */
 
-	  tree sender_ref = build_sender_ref (record_type, var, sender_decl);
+	  tree sender_ref = build_sender_ref (var, sender_decl, fields);
 
 	  if (TREE_CODE (var) == SSA_NAME)
 	    {
@@ -1232,7 +1226,8 @@ static void
 neuter_worker_single (parallel_g *par, unsigned outer_mask,
 		      bitmap worker_single, bitmap vector_single,
 		      vec<propagation_set *> *prop_set,
-		      hash_set<tree> *partitioned_var_uses)
+		      hash_set<tree> *partitioned_var_uses,
+		      record_field_map_t *record_field_map)
 {
   unsigned mask = outer_mask | par->mask;
 
@@ -1322,7 +1317,8 @@ neuter_worker_single (parallel_g *par, unsigned outer_mask,
 
 	  if (has_defs)
 	    worker_single_copy (block, block, &def_escapes_block,
-				&worker_partitioned_uses, record_type);
+				&worker_partitioned_uses, record_type,
+				record_field_map);
 	  else
 	    worker_single_simple (block, block, &def_escapes_block);
 	}
@@ -1358,10 +1354,10 @@ neuter_worker_single (parallel_g *par, unsigned outer_mask,
 
   if (par->inner)
     neuter_worker_single (par->inner, mask, worker_single, vector_single,
-			  prop_set, partitioned_var_uses);
+			  prop_set, partitioned_var_uses, record_field_map);
   if (par->next)
     neuter_worker_single (par->next, outer_mask, worker_single, vector_single,
-			  prop_set, partitioned_var_uses);
+			  prop_set, partitioned_var_uses, record_field_map);
 }
 
 static int
@@ -1402,8 +1398,6 @@ execute_omp_oacc_neuter_broadcast ()
   FOR_ALL_BB_FN (bb, cfun)
     bb->aux = NULL;
 
-  field_map = record_field_map_t::create_ggc (40);
-
   vec<propagation_set *> prop_set;
   prop_set.create (last_basic_block_for_fn (cfun));
 
@@ -1421,6 +1415,8 @@ execute_omp_oacc_neuter_broadcast ()
   find_local_vars_to_propagate (par, mask, &partitioned_var_uses,
 				&gang_private_vars, &prop_set);
 
+  record_field_map_t record_field_map;
+
   FOR_ALL_BB_FN (bb, cfun)
     {
       propagation_set *ws_prop = prop_set[bb->index];
@@ -1441,12 +1437,16 @@ execute_omp_oacc_neuter_broadcast ()
 
 	  field_vec.qsort (sort_by_size_then_ssa_version_or_uid);
 
-	  field_map->put (record_type, field_map_t::create_ggc (17));
+	  field_map_t *fields = new field_map_t;
+
+	  bool existed;
+	  existed = record_field_map.put (record_type, fields);
+	  gcc_checking_assert (!existed);
 
 	  /* Insert var fields in reverse order, so the last inserted element
 	     is the first in the structure.  */
 	  for (int i = field_vec.length () - 1; i >= 0; i--)
-	    install_var_field (field_vec[i], record_type);
+	    install_var_field (field_vec[i], record_type, fields);
 
 	  layout_type (record_type);
 
@@ -1455,7 +1455,11 @@ execute_omp_oacc_neuter_broadcast ()
     }
 
   neuter_worker_single (par, mask, worker_single, vector_single, &prop_set,
-			&partitioned_var_uses);
+			&partitioned_var_uses, &record_field_map);
+
+  for (auto it : record_field_map)
+    delete it.second;
+  record_field_map.empty ();
 
   prop_set.release ();
 
-- 
2.30.2



* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
                     ` (3 preceding siblings ...)
  2021-08-09 13:21   ` Thomas Schwinge
@ 2021-08-16 10:34   ` Thomas Schwinge
  2021-08-16 10:34   ` Thomas Schwinge
  5 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-16 10:34 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 2474 bytes --]

Hi!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> --- /dev/null
> +++ b/gcc/oacc-neuter-bcast.c

Allocated here:

> +/* Sets of SSA_NAMES or VAR_DECLs to propagate.  */
> +typedef hash_set<tree> propagation_set;
> +
> +static void
> +find_ssa_names_to_propagate ([...],
> +                          vec<propagation_set *> *prop_set)
> +{

> +                   if (!(*prop_set)[def_bb->index])
> +                     (*prop_set)[def_bb->index] = new propagation_set;

> +                   if (!(*prop_set)[def_bb->index])
> +                     (*prop_set)[def_bb->index] = new propagation_set;

..., and here:

> +static void
> +find_local_vars_to_propagate ([...],
> +                           vec<propagation_set *> *prop_set)
> +{

> +                   if (!(*prop_set)[block->index])
> +                     (*prop_set)[block->index] = new propagation_set;

..., and deallocated here:

> +static void
> +neuter_worker_single ([...],
> +                   vec<propagation_set *> *prop_set,
> +                   [...])
> +{

> +       propagation_set *ws_prop = (*prop_set)[block->index];
> +
> +       if (ws_prop)
> +         {
> +           [...]
> +           delete ws_prop;
> +           (*prop_set)[block->index] = 0;
> +         }

..., and defined here:

> +void
> +oacc_do_neutering (void)
> +{

> +  vec<propagation_set *> prop_set;
> +  prop_set.create (last_basic_block_for_fn (cfun));
> +
> +  for (unsigned i = 0; i < last_basic_block_for_fn (cfun); i++)
> +    prop_set.quick_push (0);

I recently learned about 'safe_grow_cleared', which allows for
simplifying this loop.

> +  find_ssa_names_to_propagate ([...], &prop_set);

> +  find_local_vars_to_propagate ([...], &prop_set);

> +  neuter_worker_single ([...], &prop_set, [...]);
> +
> +  prop_set.release ();

It seems appropriate to add a check that 'neuter_worker_single' has
indeed handled/'delete'd all these.

Pushed "Clarify memory management for 'prop_set' in
'gcc/omp-oacc-neuter-broadcast.cc'" to master branch in
commit 7b9d99e615212c24cecae4202d8def9aa5e71809, see attached.
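
For readers without the GCC 'vec' API at hand, the ownership scheme
discussed above can be sketched with standard containers. This is an
analogy under stated assumptions, not the pass itself: 'std::vector' stands
in for 'vec', constructing it with null entries plays the role of
'safe_grow_cleared', and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using propagation_set = std::set<int>;

// Size the per-block table up front with null entries, instead of pushing
// zeros one at a time in a loop.
static std::vector<propagation_set *>
make_prop_sets (std::size_t n_blocks)
{
  return std::vector<propagation_set *> (n_blocks, nullptr);
}

// Lazily allocate the set for one block, as the 'find_*' functions do.
static void
record_use (std::vector<propagation_set *> &prop_set, std::size_t block,
	    int var)
{
  if (!prop_set[block])
    prop_set[block] = new propagation_set;
  prop_set[block]->insert (var);
}

// Consume every set, free it, and reset its slot.  Returns the total
// number of recorded entries.
static std::size_t
consume_all (std::vector<propagation_set *> &prop_set)
{
  std::size_t total = 0;
  for (auto &entry : prop_set)
    if (entry)
      {
	total += entry->size ();
	delete entry;
	entry = nullptr;
      }
  return total;
}

// The final check: every slot must have been freed and cleared.
static bool
all_released (const std::vector<propagation_set *> &prop_set)
{
  for (const propagation_set *entry : prop_set)
    if (entry)
      return false;
  return true;
}
```

Resetting each slot to null after 'delete' is what makes the final check
possible: a non-null entry at the end means some block was never consumed.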


Regards
 Thomas


[-- Attachment #2: 0001-Clarify-memory-management-for-prop_set-in-gcc-omp-oa.patch --]
[-- Type: text/x-diff, Size: 1640 bytes --]

From 7b9d99e615212c24cecae4202d8def9aa5e71809 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Wed, 11 Aug 2021 22:31:55 +0200
Subject: [PATCH] Clarify memory management for 'prop_set' in
 'gcc/omp-oacc-neuter-broadcast.cc'

Clean-up for recent commit e2a58ed6dc5293602d0d168475109caa81ad0f0d
"openacc: Middle-end worker-partitioning support".

	gcc/
	* omp-oacc-neuter-broadcast.cc
	(execute_omp_oacc_neuter_broadcast): Clarify memory management for
	'prop_set'.
---
 gcc/omp-oacc-neuter-broadcast.cc | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
index 9bde0aca10f..d30867085c3 100644
--- a/gcc/omp-oacc-neuter-broadcast.cc
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -1398,11 +1398,8 @@ execute_omp_oacc_neuter_broadcast ()
   FOR_ALL_BB_FN (bb, cfun)
     bb->aux = NULL;
 
-  vec<propagation_set *> prop_set;
-  prop_set.create (last_basic_block_for_fn (cfun));
-
-  for (int i = 0; i < last_basic_block_for_fn (cfun); i++)
-    prop_set.quick_push (0);
+  vec<propagation_set *> prop_set (vNULL);
+  prop_set.safe_grow_cleared (last_basic_block_for_fn (cfun), true);
 
   find_ssa_names_to_propagate (par, mask, worker_single, vector_single,
 			       &prop_set);
@@ -1461,6 +1458,9 @@ execute_omp_oacc_neuter_broadcast ()
     delete it.second;
   record_field_map.empty ();
 
+  /* These are supposed to have been 'delete'd by 'neuter_worker_single'.  */
+  for (auto it : prop_set)
+    gcc_checking_assert (!it);
   prop_set.release ();
 
   /* This doesn't seem to make a difference.  */
-- 
2.30.2



* Re: [PATCH 1/4] openacc: Middle-end worker-partitioning support
  2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
                     ` (4 preceding siblings ...)
  2021-08-16 10:34   ` Thomas Schwinge
@ 2021-08-16 10:34   ` Thomas Schwinge
  5 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2021-08-16 10:34 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 1496 bytes --]

Hi!

On 2021-03-02T04:20:11-0800, Julian Brown <julian@codesourcery.com> wrote:
> --- /dev/null
> +++ b/gcc/oacc-neuter-bcast.c

Allocated here:

> +static parallel_g *
> +omp_sese_find_par (bb_stmt_map_t *map, parallel_g *par, basic_block block)
> +{

> +       par = new parallel_g ([...]);

> +               par = new parallel_g (par, mask);

> +    par = new parallel_g ([...]);

> +  return par;
> +}

> +static parallel_g *
> +omp_sese_discover_pars (bb_stmt_map_t *map)
> +{

> +  parallel_g *par = omp_sese_find_par (map, 0, block);

> +  return par;
> +}

..., and used here:

> +void
> +oacc_do_neutering (void)
> +{

> +  parallel_g *par = omp_sese_discover_pars (&bb_stmt_map);
> +  populate_single_mode_bitmaps (par, [...]);

> +  find_ssa_names_to_propagate (par, [...]);

> +  find_partitioned_var_uses (par, [...]);
> +  find_local_vars_to_propagate (par, [...]);

> +  neuter_worker_single (par, [...]);

... but never released; memory leak.

Pushed "Plug 'par' memory leak in
'gcc/omp-oacc-neuter-broadcast.cc:execute_omp_oacc_neuter_broadcast'" to
master branch in commit df98015fb7db2ed754a7c154669bc7777f8e1612, see
attached.
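
Why a single 'delete' of the root can be enough is sketched below. This
assumes that the node type's destructor recursively frees its 'inner' and
'next' links; that is an assumption of this sketch and is not shown in the
quoted code. 'par_node', 'build_tree', and 'live_nodes' are hypothetical
names used only for illustration.

```cpp
#include <cassert>

// Counts nodes currently allocated, so the effect of 'delete' is visible.
static int live_nodes = 0;

// Hypothetical stand-in for a tree of parallels: each node may have a
// nested child ('inner') and a sibling ('next').
struct par_node
{
  par_node *inner = nullptr;
  par_node *next = nullptr;

  par_node () { ++live_nodes; }

  // Deleting a node frees everything reachable from it.
  ~par_node ()
  {
    delete inner;
    delete next;
    --live_nodes;
  }

  par_node (const par_node &) = delete;
  par_node &operator= (const par_node &) = delete;
};

// Build a root with one nested child that itself has one sibling.
static par_node *
build_tree ()
{
  par_node *root = new par_node;
  root->inner = new par_node;
  root->inner->next = new par_node;
  return root;
}
```

With such a destructor, omitting the final 'delete' leaks the whole tree on
every invocation, and adding it releases all nodes at once.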


Regards
 Thomas


[-- Attachment #2: 0001-Plug-par-memory-leak-in-gcc-omp-oacc-neuter-broadcas.patch --]
[-- Type: text/x-diff, Size: 1002 bytes --]

From df98015fb7db2ed754a7c154669bc7777f8e1612 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 6 Aug 2021 15:34:25 +0200
Subject: [PATCH] Plug 'par' memory leak in
 'gcc/omp-oacc-neuter-broadcast.cc:execute_omp_oacc_neuter_broadcast'

Fix-up for recent commit e2a58ed6dc5293602d0d168475109caa81ad0f0d
"openacc: Middle-end worker-partitioning support".

	gcc/
	* omp-oacc-neuter-broadcast.cc
	(execute_omp_oacc_neuter_broadcast): Plug 'par' memory leak.
---
 gcc/omp-oacc-neuter-broadcast.cc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
index d30867085c3..d48627a6940 100644
--- a/gcc/omp-oacc-neuter-broadcast.cc
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -1463,6 +1463,8 @@ execute_omp_oacc_neuter_broadcast ()
     gcc_checking_assert (!it);
   prop_set.release ();
 
+  delete par;
+
   /* This doesn't seem to make a difference.  */
   loops_state_clear (LOOP_CLOSED_SSA);
 
-- 
2.30.2



* Further simplify 'gcc/omp-oacc-neuter-broadcast.cc:record_field_map_t' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support)
  2021-08-16 10:34       ` Thomas Schwinge
@ 2022-02-22 16:48         ` Thomas Schwinge
  0 siblings, 0 replies; 20+ messages in thread
From: Thomas Schwinge @ 2022-02-22 16:48 UTC (permalink / raw)
  To: Julian Brown, gcc-patches; +Cc: Jakub Jelinek, Tobias Burnus

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

Hi!

On 2021-08-16T12:34:09+0200, I wrote:
> On 2021-08-06T09:49:58+0100, Julian Brown <julian@codesourcery.com> wrote:
>> On Wed, 4 Aug 2021 15:13:30 +0200
>> Thomas Schwinge <thomas@codesourcery.com> wrote:
>>
>>> 'oacc_do_neutering' is the 'execute' function of the pass, so that
>>> means every time this executes, a fresh 'field_map' is set up, no
>>> state persists across runs (assuming I'm understanding that
>>> correctly).  Why don't we simply use standard (non-GC) memory
>>> management for that?  "For convenience" shall be fine as an answer
>>> ;-) -- but maybe instead of figuring out the right GC annotations,
>>> changing the memory management will be easier?  (Or, of course, maybe
>>> I completely misunderstood that?)
>>
>> I suspect you're right, and there's no need for this to be GC-allocated
>> memory. If non-standard memory allocation will work out fine, we should
>
> ("non-GC", I suppose.)
>
>> probably use that instead.
>
> Pushed "Avoid 'GTY' use for 'gcc/omp-oacc-neuter-broadcast.cc:field_map'"
> to master branch in commit 049eda8274b7394523238b17ab12c3e2889f253e

In commit 0fe9176f410accc767e0abab010aec843b2e7ea6 I've now pushed
"Further simplify 'gcc/omp-oacc-neuter-broadcast.cc:record_field_map_t'"
to master branch, see attached.
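
The idea of storing the inner maps by value rather than through manually
allocated pointers can be sketched with standard containers. This is an
analogy only: 'std::unordered_map' and 'std::string' stand in for GCC's
'hash_map' and 'tree', and the free functions 'get_or_insert' and 'lookup'
are hypothetical helpers written for this sketch, not the GCC member
functions of similar name.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

using field_map_t = std::unordered_map<std::string, std::string>;

// The outer map owns its inner maps directly; no 'new'/'delete' needed.
using record_field_map_t = std::unordered_map<std::string, field_map_t>;

// Return a pointer to the inner map for RECORD, creating an empty one if
// needed.  *EXISTED reports whether an entry was already present.
static field_map_t *
get_or_insert (record_field_map_t &map, const std::string &record,
	       bool *existed)
{
  auto result = map.emplace (record, field_map_t ());
  *existed = !result.second;
  return &result.first->second;
}

// Return a pointer to the inner map for RECORD, or null if there is none.
static field_map_t *
lookup (record_field_map_t &map, const std::string &record)
{
  auto it = map.find (record);
  return it == map.end () ? nullptr : &it->second;
}
```

The lookup now yields either a usable pointer or null in one step, and the
clean-up loop disappears because destroying the outer map destroys the
inner ones.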


Regards
 Thomas


[-- Attachment #2: 0001-Further-simplify-gcc-omp-oacc-neuter-broadcast.cc-re.patch --]
[-- Type: text/x-diff, Size: 2617 bytes --]

From 0fe9176f410accc767e0abab010aec843b2e7ea6 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 13 Aug 2021 21:17:55 +0200
Subject: [PATCH] Further simplify
 'gcc/omp-oacc-neuter-broadcast.cc:record_field_map_t'

Now that I've resolved GCC 'hash_map' issues (a while ago already), we may
further simplify this after commit 049eda8274b7394523238b17ab12c3e2889f253e
"Avoid 'GTY' use for 'gcc/omp-oacc-neuter-broadcast.cc:field_map'": as
'hash_map' Value, directly store 'field_map_t' objects, not pointers to
manually allocated 'field_map_t' objects.

	gcc/
	* omp-oacc-neuter-broadcast.cc (record_field_map_t): Further
	simplify.  Adjust all users.
---
 gcc/omp-oacc-neuter-broadcast.cc | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/gcc/omp-oacc-neuter-broadcast.cc b/gcc/omp-oacc-neuter-broadcast.cc
index 7fb691d7155..314161e38f5 100644
--- a/gcc/omp-oacc-neuter-broadcast.cc
+++ b/gcc/omp-oacc-neuter-broadcast.cc
@@ -538,7 +538,7 @@ typedef hash_map<tree, tree> field_map_t;
    to propagate, to the field in the record type that should be used for
    transmission and reception.  */
 
-typedef hash_map<tree, field_map_t *> record_field_map_t;
+typedef hash_map<tree, field_map_t> record_field_map_t;
 
 static void
 install_var_field (tree var, tree record_type, field_map_t *fields)
@@ -1168,8 +1168,7 @@ worker_single_copy (basic_block from, basic_block to,
 	gcc_assert (TREE_CODE (var) == VAR_DECL);
 
       /* If we had no record type, we will have no fields map.  */
-      field_map_t **fields_p = record_field_map->get (record_type);
-      field_map_t *fields = fields_p ? *fields_p : NULL;
+      field_map_t *fields = record_field_map->get (record_type);
 
       if (worker_partitioned_uses->contains (var)
 	  && fields
@@ -1684,10 +1683,9 @@ oacc_do_neutering (unsigned HOST_WIDE_INT bounds_lo,
 
 	  field_vec.qsort (sort_by_size_then_ssa_version_or_uid);
 
-	  field_map_t *fields = new field_map_t;
-
 	  bool existed;
-	  existed = record_field_map.put (record_type, fields);
+	  field_map_t *fields
+	    = &record_field_map.get_or_insert (record_type, &existed);
 	  gcc_checking_assert (!existed);
 
 	  /* Insert var fields in reverse order, so the last inserted element
@@ -1818,8 +1816,6 @@ oacc_do_neutering (unsigned HOST_WIDE_INT bounds_lo,
 			&partitioned_var_uses, &record_field_map,
 			&blk_offset_map, writes_gang_private);
 
-  for (auto it : record_field_map)
-    delete it.second;
   record_field_map.empty ();
 
   /* These are supposed to have been 'delete'd by 'neuter_worker_single'.  */
-- 
2.34.1



* Re: [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases
  2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
@ 2021-06-29 23:52   ` Julian Brown
  0 siblings, 0 replies; 20+ messages in thread
From: Julian Brown @ 2021-06-29 23:52 UTC (permalink / raw)
  To: gcc-patches; +Cc: Jakub Jelinek, Thomas Schwinge

On Tue, 29 Jun 2021 16:42:02 -0700
Julian Brown <julian@codesourcery.com> wrote:

> Several OpenACC tests accidentally abuse async semantics, leading to
> race conditions & test failures.  This patch fixes those tests.
> 
> Tested with offloading to AMD GCN. I can probably self-approve this as
> a testcase change only, unless anyone objects.

Forgot to say: this was previously posted as part of the AMD GCN
worker-partitioning series here:

  https://gcc.gnu.org/pipermail/gcc-patches/2021-March/566081.html

However, I noticed that the tests in question fail even when the
worker-partitioning patches are not applied (now, at least).

Thanks,

Julian


* [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases
  2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
@ 2021-06-29 23:42 ` Julian Brown
  2021-06-29 23:52   ` Julian Brown
  0 siblings, 1 reply; 20+ messages in thread
From: Julian Brown @ 2021-06-29 23:42 UTC (permalink / raw)
  To: gcc-patches; +Cc: Thomas Schwinge, Jakub Jelinek, Chung-Lin Tang

Several OpenACC tests accidentally abuse async semantics, leading to
race conditions & test failures.  This patch fixes those tests.
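
The kind of mistake involved can be illustrated with a toy model. This is
not OpenACC and not the test code: 'toy_async_queue' and
'copyin_then_clear' are hypothetical names, and the queue simply defers
every operation until the host waits, which is one legal ordering (the
worst case) for an asynchronous operation. On real hardware the un-waited
case is a race rather than a deterministic failure.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model of one async queue: operations are only recorded when
// enqueued, and run in order when the host waits.
class toy_async_queue
{
public:
  void enqueue (std::function<void ()> op)
  {
    pending_.push_back (std::move (op));
  }

  void wait ()
  {
    for (auto &op : pending_)
      op ();
    pending_.clear ();
  }

private:
  std::vector<std::function<void ()>> pending_;
};

// Queue a host-to-device copy, then clear the host buffer.  Returns what
// the "device" copy holds after everything has completed.
static std::vector<int>
copyin_then_clear (bool wait_before_clear)
{
  std::vector<int> host = {1, 2, 3};
  std::vector<int> device;
  toy_async_queue queue;

  queue.enqueue ([&] { device = host; });
  if (wait_before_clear)
    queue.wait ();		/* The missing synchronization point.  */
  host.assign (host.size (), 0);
  queue.wait ();
  return device;
}
```

Without the wait, the deferred copy sees the already-cleared host buffer;
with it, the device receives the original values.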

Tested with offloading to AMD GCN. I can probably self-approve this as
a testcase change only, unless anyone objects.

Thanks,

Julian

2021-06-29  Julian Brown  <julian@codesourcery.com>

libgomp/
	* testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c: Fix async
	behaviour and increase number of iterations.
	* testsuite/libgomp.oacc-fortran/lib-16-2.f90: Fix async behaviour.
	* testsuite/libgomp.oacc-fortran/lib-16.f90: Likewise.
---
 .../libgomp.oacc-c-c++-common/deep-copy-10.c       | 14 ++++++++------
 .../testsuite/libgomp.oacc-fortran/lib-16-2.f90    |  5 +++++
 libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90  |  5 +++++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
index 573a8214bf0..dadb6d37942 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/deep-copy-10.c
@@ -1,6 +1,8 @@
 #include <stdlib.h>
 
-/* Test asyncronous attach and detach operation.  */
+#define ITERATIONS 1023
+
+/* Test asynchronous attach and detach operation.  */
 
 typedef struct {
   int *a;
@@ -25,13 +27,13 @@ main (int argc, char* argv[])
 
 #pragma acc enter data copyin(m)
 
-  for (int i = 0; i < 99; i++)
+  for (int i = 0; i < ITERATIONS; i++)
     {
       int j;
-#pragma acc parallel loop copy(m.a[0:N]) async(i % 2)
+#pragma acc parallel loop copy(m.a[0:N]) async(0)
       for (j = 0; j < N; j++)
 	m.a[j]++;
-#pragma acc parallel loop copy(m.b[0:N]) async((i + 1) % 2)
+#pragma acc parallel loop copy(m.b[0:N]) async(1)
       for (j = 0; j < N; j++)
 	m.b[j]++;
     }
@@ -40,9 +42,9 @@ main (int argc, char* argv[])
 
   for (i = 0; i < N; i++)
     {
-      if (m.a[i] != 99)
+      if (m.a[i] != ITERATIONS)
 	abort ();
-      if (m.b[i] != 99)
+      if (m.b[i] != ITERATIONS)
 	abort ();
     }
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
index ddd557d3be0..e2e47c967fa 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16-2.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
index ccd1ce6ee18..ef9a6f6626c 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-16.f90
@@ -27,6 +27,9 @@ program main
 
   if (acc_is_present (h) .neqv. .TRUE.) stop 1
 
+  ! We must wait for the update to be done.
+  call acc_wait (async)
+
   h(:) = 0
 
   call acc_copyout_async (h, sizeof (h), async)
@@ -45,6 +48,8 @@ program main
   
   if (acc_is_present (h) .neqv. .TRUE.) stop 3
 
+  call acc_wait (async)
+
   do i = 1, N
     if (h(i) /= i + i) stop 4
   end do 
-- 
2.29.2



end of thread, other threads:[~2022-02-22 16:48 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-02 12:20 [PATCH 0/4] openacc: Worker partitioning in the middle end Julian Brown
2021-03-02 12:20 ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Julian Brown
2021-07-29  7:49   ` [OpenACC] Extract 'pass_oacc_loop_designation' out of 'pass_oacc_device_lower' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
2021-08-06 10:20     ` Julian Brown
2021-08-04 13:13   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
2021-08-06  8:49     ` Julian Brown
2021-08-16 10:34       ` Thomas Schwinge
2022-02-22 16:48         ` Further simplify 'gcc/omp-oacc-neuter-broadcast.cc:record_field_map_t' (was: [PATCH 1/4] openacc: Middle-end worker-partitioning support) Thomas Schwinge
2021-08-04 13:56   ` [PATCH 1/4] openacc: Middle-end worker-partitioning support Thomas Schwinge
2021-08-06  9:25     ` Julian Brown
2021-08-09 13:32       ` Thomas Schwinge
2021-08-09 13:21   ` Thomas Schwinge
2021-08-16 10:34   ` Thomas Schwinge
2021-08-16 10:34   ` Thomas Schwinge
2021-03-02 12:20 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
2021-03-02 12:20 ` [PATCH 3/4] amdgcn: Enable OpenACC worker partitioning for AMD GCN Julian Brown
2021-08-09 13:26   ` Thomas Schwinge
2021-03-02 12:20 ` [PATCH 4/4] openacc: Reference-typed reduction and private variable rewriting Julian Brown
2021-06-29 23:42 [PATCH 0/4] openacc: Async fixes Julian Brown
2021-06-29 23:42 ` [PATCH 2/4] openacc: Fix async bugs in several OpenACC test cases Julian Brown
2021-06-29 23:52   ` Julian Brown
