public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [gomp4] fortran testcase
@ 2015-10-18 23:20 ` Nathan Sidwell
  0 siblings, 0 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-10-18 23:20 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

This fortran testcase was passing by not testing what it thought it was testing. 
  The loop over which the reduction is specified  is not partitioned, and hence 
the reduction was being elided.

Fixing that problem  turns out to be more involved, and a distraction from 
completing the patch series I'm working on (which exposed this problem).

I've adjusted the testcase to specify a partitioning, and marked the test as 
xfailing.

nathan

[-- Attachment #2: gomp4-red-test.patch --]
[-- Type: text/x-patch, Size: 833 bytes --]

2015-10-18  Nathan Sidwell  <nathan@codesourcery.com>

	* gfortran.dg/goacc/reduction-2.f95: Force loop partitioning and
	xfail.

Index: gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
===================================================================
--- gcc/testsuite/gfortran.dg/goacc/reduction-2.f95	(revision 228954)
+++ gcc/testsuite/gfortran.dg/goacc/reduction-2.f95	(working copy)
@@ -1,4 +1,6 @@
-! { dg-do compile }
+! { dg-xfail-if "" { *-*-* } { "*" } { "" } }
+! { dg-excess-errors "internal compiler error" }
+! { dg-do compile  }
 
 program reduction
   integer, parameter    :: n = 40, c = 10
@@ -13,7 +15,7 @@ subroutine redsub(sum, n, c)
   sum = 0
 
   !$acc parallel vector_length(n) copyin (n, c)
-  !$acc loop reduction(+:sum)
+  !$acc loop vector reduction(+:sum)
   do i = 1, n
      sum = sum + c
   end do

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [0/3] OpenACC reductions
@ 2015-11-02 16:10 Nathan Sidwell
  2015-10-18 23:20 ` [gomp4] fortran testcase Nathan Sidwell
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-02 16:10 UTC (permalink / raw)
  To: Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

Jakub,
the following patch series implements the reduction handling for OpenACC:

01-trunk-reductions-core-1102.patch  Core  execution changes
02-trunk-reductions-ptx-1102.patch   PTX backend bits
03-trunk-reductions-tests-1102.patch Testcases


The reduction mechanism relies on a new internal builtin -- IFN_GOACC_REDUCTION, 
which is used in 4 different places.  IYR the loop partionining is managed with 
FORK and JOIN unique_fn markers.  The reductions go around these as follows:

IFN_UNIQUE (HEAD_MARKER ...)
IFN_REDUCTION (SETUP ...)
IFN_UNIQUE (FORK ...)
IFN_REDUCTION (INIT ...)
IFN_UNIQUE (HEAD_MARKER)
<loop here>
IFN_UNIQUE (TAIL_MARKER ...)
IFN_REDUCTION (FINI ...)
IFN_UNIQUE (JOIN ...)
IFN_REDUCTION (TEARDOWN ...)
IFN_UNIQUE (TAIL_MARKER)


There's a quad of functions for each reduction variable of the loop.  If a loop 
is partitioned over multiple dimensions, there are additional quads for each 
dimension, surrounding the fork/join for that dimension.

All the reduction calls look similar and are:

V = REDUCTION (KIND, REF_TO_RES, LOCAL_VAR, LEVEL, OP, OFFSET)

REF_TO_RES is a pointer to a reciever object.  it is a null pointer constant if 
there is no such object.
LOCAL_VAR is the executing thread's instance of the reduction variable.
LEVEL is the dimension across which this reduction is partitiong (gang, worker, 
vector).  As with the head/tail markers,this assignment of level is deferred to 
the target compiler.
OP is the reduction operator
OFFSET is an offset into a hypothetical buffer allocated for all the reductions 
of this particular loop.  It's a way of identifying which quad of reductions 
apply to the same logical variable, and happens to be useful in some use cases 
(I'll expand on that in the PTX fragment).

All these functions return a new value for the local variable.

When everything collapses to a single thread (i.e. on the host), the 
implementation of these functions is trivial.

SETUP
    - if REF_TO_RES is not  nullptrconst, return *REF_TO_RES, else return 
LOCAL_VAR (this is  a compile-time check)
INIT & FINI
   - return LOCAL_VAR
TEARDOWN
   - if REF_TO_RES is not nullptrconst *REF_TO_RES = LOCAL_VAR.
     always return LOCAL_VAR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-02 16:10 [0/3] OpenACC reductions Nathan Sidwell
  2015-10-18 23:20 ` [gomp4] fortran testcase Nathan Sidwell
@ 2015-11-02 16:18 ` Nathan Sidwell
  2015-11-03 15:46   ` Jakub Jelinek
                     ` (3 more replies)
  2015-11-02 16:35 ` [2/3] " Nathan Sidwell
  2015-11-02 16:38 ` [3/3] " Nathan Sidwell
  3 siblings, 4 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-02 16:18 UTC (permalink / raw)
  To: Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

This is the core execution bits of OpenACC reductions.

We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook 
goacc.reduction, to lower it on the target compiler.

The omp-low changes are:
1) remove all the existing OpenACC reduction handling

2) when emitting an openacc loop head/tail markers, intersperse that with 
reduction calls.  This emission has to search the context stack to see if 
there's an outer context also reducing the same variable.  In that case this 
instance ignores any reciever object info, because that's handled by the outer 
reduction.  Similarly for loops over multiple  axes -- inner axes ignore  the 
receiver  object info.

3) To deal with reductions at the outermost level, i.e.:
   #pragma acc parallel reduction (+:r)
   {
     ...
   }
we insert REDUCTION calls for a (dummy) gang level around the entire body.  Note 
these calls will lack the surrounding HEAD/TAIL and FORK/JOIN functions of a 
real partitioning level.

4) In the target compiler, we assign axes to the reduction calls in the same 
manner we assigned fork/join axes.

5) In the target compiler we then lower these calls to device-specific gimple. 
The default host behaviour essentially turns them all into copies, with possible 
loads from or stores to the receiver object.

One thing not handled by this patch are reductions of variables of reference 
type.  We have an implementation on gomp4 branch, but I suspect theres going to 
be some simplification we can do there following the openmp merge.  This only 
affected one testcase (which was (a) broken anyway and (b) working by accident).

nathan

[-- Attachment #2: 01-trunk-reductions-core-1102.patch --]
[-- Type: text/x-patch, Size: 37546 bytes --]

2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	* internal-fn.def (GOACC_REDUCTION): New.
	* internal-fn.h (enum ifn_goacc_reduction_kind): New.
	* internal-fn.c (expand_GOACC_REDUCTION): New.
	* target.def (goacc.reduction): New OpenACC hook.
	* targhooks.h (default_goacc_reduction): Declare.
	* doc/tm.texi.in: Add TARGET_GOACC_REDUCTION.
	* doc/tm.texi: Rebuilt.
	* omp-low.c (oacc_get_reduction_array_id, oacc_max_threads,
	scan_sharing_clauses): Remove oacc reduction handling here.
	(lower_rec_input_clauses): Don't handle OpenACC reductions here.
	(oacc_lower_reduction_var_helper): Delete.
	(lower_oacc_reductions): New.
	(lower_reduction_clauses): Don't handle OpenACC reductions here.
	(lower_oacc_head_tail): Call lower_oacc_reductions.
	(oacc_gimple_assign, oacc_init_reduction_array,
	oacc_initialize_reduction_data, oacc_finalize_reduction_data,
	oacc_process_reduction_data): Delete.
	(lower_omp_target): Remove old OpenACC reduction handling.  Insert
	dummy OpenACC gang reduction for reductions at outer level.
	(oacc_loop_xform_head_tail): Transform IFN_GOACC_REDUCTION.
	(default_goacc_reduction): New.
	(execute_oacc_device_lower): Handle IFN_GOACC_REDUCTION.

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 229667)
+++ gcc/doc/tm.texi	(working copy)
@@ -5787,6 +5787,15 @@ gimple has been inserted before it, or t
 The default hook returns false, if there are no RTL expanders for them.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_GOACC_REDUCTION (gcall *@var{call})
+This hook is used by the oacc_transform pass to expand calls to the
+@var{GOACC_REDUCTION} internal function, into a sequence of gimple
+instructions.  @var{call} is gimple statement containing the call to
+the function.  This hook removes statement @var{call} after the
+expanded sequence has been inserted.  This hook is also responsible
+for allocating any storage for reductions when necessary.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 229667)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4264,6 +4264,8 @@ address;  but often a machine-dependent
 
 @hook TARGET_GOACC_FORK_JOIN
 
+@hook TARGET_GOACC_REDUCTION
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229667)
+++ gcc/internal-fn.c	(working copy)
@@ -2045,6 +2045,14 @@ expand_GOACC_LOOP (gcall *stmt ATTRIBUTE
   gcc_unreachable ();
 }
 
+/* This is expanded by oacc_device_lower pass.  */
+
+static void
+expand_GOACC_REDUCTION (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229667)
+++ gcc/internal-fn.def	(working copy)
@@ -83,3 +83,6 @@ DEF_INTERNAL_FN (GOACC_DIM_POS, ECF_PURE
 
 /* OpenACC looping abstraction.  See internal-fn.h for usage.  */
 DEF_INTERNAL_FN (GOACC_LOOP, ECF_PURE | ECF_NOTHROW, NULL)
+
+/* OpenACC reduction abstraction.  See internal-fn.h  for usage.  */
+DEF_INTERNAL_FN (GOACC_REDUCTION, ECF_NOTHROW | ECF_LEAF, NULL)
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229667)
+++ gcc/internal-fn.h	(working copy)
@@ -66,6 +66,28 @@ enum ifn_goacc_loop_kind {
   IFN_GOACC_LOOP_BOUND    /* Limit of iteration value.  */
 };
 
+/* The GOACC_REDUCTION function defines a generic interface to support
+   gang, worker and vector reductions.  All calls are of the following
+   form:
+
+     V = REDUCTION (CODE, REF_TO_RES, LOCAL_VAR, LEVEL, OP, OFFSET)
+
+   REF_TO_RES - is a reference to the original reduction varl, may be NULL
+   LOCAL_VAR is the intermediate reduction variable
+   LEVEL corresponds to the GOMP_DIM of the reduction
+   OP is the tree code of the reduction operation
+   OFFSET may be used as an offset into a reduction array for the
+          reductions occuring at this level.
+   In general the return value is LOCAL_VAR, which creates a data
+   dependency between calls operating on the same reduction.  */
+
+enum ifn_goacc_reduction_kind {
+  IFN_GOACC_REDUCTION_SETUP,
+  IFN_GOACC_REDUCTION_INIT,
+  IFN_GOACC_REDUCTION_FINI,
+  IFN_GOACC_REDUCTION_TEARDOWN
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229667)
+++ gcc/omp-low.c	(working copy)
@@ -305,66 +305,6 @@ is_oacc_kernels (omp_context *ctx)
 	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
 }
 
-/* Helper function to get the name of the array containing the partial
-   reductions for OpenACC reductions.  */
-static const char *
-oacc_get_reduction_array_id (tree node)
-{
-  const char *id = IDENTIFIER_POINTER (DECL_NAME (node));
-  int len = strlen ("OACC") + strlen (id);
-  char *temp_name = XALLOCAVEC (char, len + 1);
-  snprintf (temp_name, len + 1, "OACC%s", id);
-  return IDENTIFIER_POINTER (get_identifier (temp_name));
-}
-
-/* Determine the number of threads OpenACC threads used to determine the
-   size of the array of partial reductions.  Currently, this is num_gangs
-   * vector_length.  This value may be different than GOACC_GET_NUM_THREADS,
-   because it is independed of the device used.  */
-
-static tree
-oacc_max_threads (omp_context *ctx)
-{
-  tree nthreads, vector_length, gangs, clauses;
-
-  gangs = fold_convert (sizetype, integer_one_node);
-  vector_length = gangs;
-
-  /* The reduction clause may be nested inside a loop directive.
-     Scan for the innermost vector_length clause.  */
-  for (omp_context *oc = ctx; oc; oc = oc->outer)
-    {
-      if (gimple_code (oc->stmt) != GIMPLE_OMP_TARGET
-	  || (gimple_omp_target_kind (oc->stmt)
-	      != GF_OMP_TARGET_KIND_OACC_PARALLEL))
-	continue;
-
-      clauses = gimple_omp_target_clauses (oc->stmt);
-
-      vector_length = find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH);
-      if (vector_length)
-	vector_length = fold_convert_loc (OMP_CLAUSE_LOCATION (vector_length),
-					  sizetype,
-					  OMP_CLAUSE_VECTOR_LENGTH_EXPR
-					  (vector_length));
-      else
-	vector_length = fold_convert (sizetype, integer_one_node);
-
-      gangs = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS);
-      if (gangs)
-        gangs = fold_convert_loc (OMP_CLAUSE_LOCATION (gangs), sizetype,
-				  OMP_CLAUSE_NUM_GANGS_EXPR (gangs));
-      else
-	gangs = fold_convert (sizetype, integer_one_node);
-
-      break;
-    }
-
-  nthreads = fold_build2 (MULT_EXPR, sizetype, gangs, vector_length);
-
-  return nthreads;
-}
-
 /* If DECL is the artificial dummy VAR_DECL created for non-static
    data member privatization, return the underlying "this" parameter,
    otherwise return NULL.  */
@@ -2016,27 +1956,6 @@ scan_sharing_clauses (tree clauses, omp_
 		install_var_field (decl, by_ref, 3, ctx);
 	    }
 	  install_var_local (decl, ctx);
-	  if (is_gimple_omp_oacc (ctx->stmt)
-	      && OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
-	    {
-	      /* Create a decl for the reduction array.  */
-	      tree var = OMP_CLAUSE_DECL (c);
-	      tree type = get_base_type (var);
-	      tree ptype = build_pointer_type (type);
-	      tree array = create_tmp_var (ptype,
-					   oacc_get_reduction_array_id (var));
-	      omp_context *octx = (ctx->field_map ? ctx : ctx->outer);
-	      install_var_field (array, true, 3, octx);
-	      install_var_local (array, octx);
-
-	      /* Insert it into the current context.  */
-	      splay_tree_insert (ctx->reduction_map, (splay_tree_key)
-				 oacc_get_reduction_array_id (var),
-				 (splay_tree_value) array);
-	      splay_tree_insert (ctx->reduction_map,
-				 (splay_tree_key) array,
-				 (splay_tree_value) array);
-	    }
 	  break;
 
 	case OMP_CLAUSE_USE_DEVICE_PTR:
@@ -4935,6 +4855,10 @@ lower_rec_input_clauses (tree clauses, g
 	      break;
 
 	    case OMP_CLAUSE_REDUCTION:
+	      /* OpenACC reductions are initialized using the
+		 GOACC_REDUCTION internal function.  */
+	      if (is_gimple_omp_oacc (ctx->stmt))
+		break;
 	      if (OMP_CLAUSE_REDUCTION_PLACEHOLDER (c))
 		{
 		  tree placeholder = OMP_CLAUSE_REDUCTION_PLACEHOLDER (c);
@@ -5348,56 +5272,170 @@ lower_lastprivate_clauses (tree clauses,
     gimple_seq_add_stmt (stmt_list, gimple_build_label (label));
 }
 
-static void
-oacc_lower_reduction_var_helper (gimple_seq *stmt_seqp, omp_context *ctx,
-				 tree tid, tree var, tree new_var)
-{
-  /* The atomic add at the end of the sum creates unnecessary
-     write contention on accelerators.  To work around this,
-     create an array to store the partial reductions. Later, in
-     lower_omp_for (for openacc), the values of array will be
-     combined.  */
+/* Lower the OpenACC reductions of CLAUSES for compute axis LEVEL
+   (which might be a placeholder).  INNER is true if this is an inner
+   axis of a multi-axis loop.  FORK and JOIN are (optional) fork and
+   join markers.  Generate the before-loop forking sequence in
+   FORK_SEQ and the after-loop joining sequence to JOIN_SEQ.  The
+   general form of these sequences is
+
+     GOACC_REDUCTION_SETUP
+     GOACC_FORK
+     GOACC_REDUCTION_INIT
+     ...
+     GOACC_REDUCTION_FINI
+     GOACC_JOIN
+     GOACC_REDUCTION_TEARDOWN.  */
+
+static void
+lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
+		       gcall *fork, gcall *join, gimple_seq *fork_seq,
+		       gimple_seq *join_seq, omp_context *ctx)
+{
+  gimple_seq before_fork = NULL;
+  gimple_seq after_fork = NULL;
+  gimple_seq before_join = NULL;
+  gimple_seq after_join = NULL;
+  tree init_code = NULL_TREE, fini_code = NULL_TREE,
+    setup_code = NULL_TREE, teardown_code = NULL_TREE;
+  unsigned offset = 0;
 
-  tree t = NULL_TREE, array, x;
-  tree type = get_base_type (var);
-  gimple *stmt;
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
+      {
+	tree orig = OMP_CLAUSE_DECL (c);
+	tree var = maybe_lookup_decl (orig, ctx);
+	tree ref_to_res = NULL_TREE;
+	tree incoming, outgoing;
+
+	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
+	if (rcode == MINUS_EXPR)
+	  rcode = PLUS_EXPR;
+	else if (rcode == TRUTH_ANDIF_EXPR)
+	  rcode = BIT_AND_EXPR;
+	else if (rcode == TRUTH_ORIF_EXPR)
+	  rcode = BIT_IOR_EXPR;
+	tree op = build_int_cst (unsigned_type_node, rcode);
+
+	if (!var)
+	  var = orig;
+	gcc_assert (!is_reference (var));
 
-  /* Now insert the partial reductions into the array.  */
+	incoming = outgoing = var;
+	
+	if (!inner)
+	  {
+	    /* See if an outer construct also reduces this variable.  */
+	    omp_context *outer = ctx;
 
-  /* Find the reduction array.  */
+	    while (omp_context *probe = outer->outer)
+	      {
+		enum gimple_code type = gimple_code (probe->stmt);
+		tree cls;
 
-  tree ptype = build_pointer_type (type);
+		switch (type)
+		  {
+		  case GIMPLE_OMP_FOR:
+		    cls = gimple_omp_for_clauses (probe->stmt);
+		    break;
 
-  t = lookup_oacc_reduction (oacc_get_reduction_array_id (var), ctx);
-  t = build_receiver_ref (t, false, ctx->outer);
+		  case GIMPLE_OMP_TARGET:
+		    if (gimple_omp_target_kind (probe->stmt)
+			!= GF_OMP_TARGET_KIND_OACC_PARALLEL)
+		      goto do_lookup;
 
-  array = create_tmp_var (ptype);
-  gimplify_assign (array, t, stmt_seqp);
+		    cls = gimple_omp_target_clauses (probe->stmt);
+		    break;
 
-  tree ptr = create_tmp_var (TREE_TYPE (array));
+		  default:
+		    goto do_lookup;
+		  }
+		
+		outer = probe;
+		for (; cls;  cls = OMP_CLAUSE_CHAIN (cls))
+		  if (OMP_CLAUSE_CODE (cls) == OMP_CLAUSE_REDUCTION
+		      && orig == OMP_CLAUSE_DECL (cls))
+		    goto has_outer_reduction;
+	      }
 
-  /* Find the reduction array.  */
+	  do_lookup:
+	    /* This is the outermost construct with this reduction,
+	       see if there's a mapping for it.  */
+	    if (gimple_code (outer->stmt) == GIMPLE_OMP_TARGET
+		&& maybe_lookup_field (orig, outer))
+	      {
+		ref_to_res = build_receiver_ref (orig, false, outer);
+		if (is_reference (orig))
+		  ref_to_res = build_simple_mem_ref (ref_to_res);
 
-  /* testing a unary conversion.  */
-  tree offset = create_tmp_var (sizetype);
-  gimplify_assign (offset, TYPE_SIZE_UNIT (type),
-		   stmt_seqp);
-  t = create_tmp_var (sizetype);
-  gimplify_assign (t, unshare_expr (fold_build1 (NOP_EXPR, sizetype, tid)),
-		   stmt_seqp);
-  stmt = gimple_build_assign (offset, MULT_EXPR, offset, t);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
+		outgoing = var;
+		incoming = omp_reduction_init_op (loc, rcode, TREE_TYPE (var));
+	      }
+	    else
+	      incoming = outgoing = orig;
+	      
+	  has_outer_reduction:;
+	  }
 
-  /* Offset expression.  Does the POINTER_PLUS_EXPR take care
-     of adding sizeof(var) to the array?  */
-  ptr = create_tmp_var (ptype);
-  stmt = gimple_build_assign (unshare_expr (ptr), POINTER_PLUS_EXPR, array,
-			      offset);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
+	if (!ref_to_res)
+	  ref_to_res = integer_zero_node;
+
+	/* Determine position in reduction buffer, which may be used
+	   by target.  */
+	enum machine_mode mode = TYPE_MODE (TREE_TYPE (var));
+	unsigned align = GET_MODE_ALIGNMENT (mode) /  BITS_PER_UNIT;
+	offset = (offset + align - 1) & ~(align - 1);
+	tree off = build_int_cst (sizetype, offset);
+	offset += GET_MODE_SIZE (mode);
 
-  /* Move the local sum to gfc$sum[i].  */
-  x = unshare_expr (build_simple_mem_ref (ptr));
-  stmt = gimplify_assign (x, new_var, stmt_seqp);
+	if (!init_code)
+	  {
+	    init_code = build_int_cst (integer_type_node,
+				       IFN_GOACC_REDUCTION_INIT);
+	    fini_code = build_int_cst (integer_type_node,
+				       IFN_GOACC_REDUCTION_FINI);
+	    setup_code = build_int_cst (integer_type_node,
+					IFN_GOACC_REDUCTION_SETUP);
+	    teardown_code = build_int_cst (integer_type_node,
+					   IFN_GOACC_REDUCTION_TEARDOWN);
+	  }
+
+	tree setup_call
+	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
+					  TREE_TYPE (var), 6, setup_code,
+					  unshare_expr (ref_to_res),
+					  incoming, level, op, off);
+	tree init_call
+	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
+					  TREE_TYPE (var), 6, init_code,
+					  unshare_expr (ref_to_res),
+					  var, level, op, off);
+	tree fini_call
+	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
+					  TREE_TYPE (var), 6, fini_code,
+					  unshare_expr (ref_to_res),
+					  var, level, op, off);
+	tree teardown_call
+	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
+					  TREE_TYPE (var), 6, teardown_code,
+					  ref_to_res, var, level, op, off);
+
+	gimplify_assign (var, setup_call, &before_fork);
+	gimplify_assign (var, init_call, &after_fork);
+	gimplify_assign (var, fini_call, &before_join);
+	gimplify_assign (outgoing, teardown_call, &after_join);
+      }
+
+  /* Now stitch things together.  */
+  gimple_seq_add_seq (fork_seq, before_fork);
+  if (fork)
+    gimple_seq_add_stmt (fork_seq, fork);
+  gimple_seq_add_seq (fork_seq, after_fork);
+
+  gimple_seq_add_seq (join_seq, before_join);
+  if (join)
+    gimple_seq_add_stmt (join_seq, join);
+  gimple_seq_add_seq (join_seq, after_join);
 }
 
 /* Generate code to implement the REDUCTION clauses.  */
@@ -5410,6 +5448,10 @@ lower_reduction_clauses (tree clauses, g
   tree x, c, tid = NULL_TREE;
   int count = 0;
 
+  /* OpenACC loop reductions are handled elsewhere.  */
+  if (is_gimple_omp_oacc (ctx->stmt))
+    return;
+
   /* SIMD reductions are handled in lower_rec_input_clauses.  */
   if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
       && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD)
@@ -5481,13 +5523,7 @@ lower_reduction_clauses (tree clauses, g
       if (code == MINUS_EXPR)
         code = PLUS_EXPR;
 
-      if (is_gimple_omp_oacc (ctx->stmt))
-	{
-	  gcc_checking_assert (!OMP_CLAUSE_REDUCTION_PLACEHOLDER (c));
-
-	  oacc_lower_reduction_var_helper (stmt_seqp, ctx, tid, var, new_var);
-	}
-      else if (count == 1)
+      if (count == 1)
 	{
 	  tree addr = build_fold_addr_expr_loc (clause_loc, ref);
 
@@ -6052,8 +6088,8 @@ lower_oacc_head_tail (location_t loc, tr
 			      build_int_cst (integer_type_node, done),
 			      &join_seq);
 
-      gimple_seq_add_stmt (&fork_seq, fork);
-      gimple_seq_add_stmt (&join_seq, join);
+      lower_oacc_reductions (loc, clauses, place, inner,
+			     fork, join, &fork_seq, &join_seq,  ctx);
 
       /* Append this level to head. */
       gimple_seq_add_seq (head, fork_seq);
@@ -12945,446 +12981,6 @@ make_pass_expand_omp_ssa (gcc::context *
 \f
 /* Routines to lower OMP directives into OMP-GIMPLE.  */
 
-/* Helper function to preform, potentially COMPLEX_TYPE, operation and
-   convert it to gimple.  */
-static void
-oacc_gimple_assign (tree dest, tree_code op, tree src, gimple_seq *seq)
-{
-  gimple *stmt;
-
-  if (TREE_CODE (TREE_TYPE (dest)) != COMPLEX_TYPE)
-    {
-      stmt = gimple_build_assign (dest, op, dest, src);
-      gimple_seq_add_stmt (seq, stmt);
-      return;
-    }
-
-  tree t = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-  tree rdest = fold_build1 (REALPART_EXPR, TREE_TYPE (TREE_TYPE (dest)), dest);
-  gimplify_assign (t, rdest, seq);
-  rdest = t;
-
-  t = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-  tree idest = fold_build1 (IMAGPART_EXPR, TREE_TYPE (TREE_TYPE (dest)), dest);
-  gimplify_assign (t, idest, seq);
-  idest = t;
-
-  t = create_tmp_var (TREE_TYPE (TREE_TYPE (src)));
-  tree rsrc = fold_build1 (REALPART_EXPR, TREE_TYPE (TREE_TYPE (src)), src);
-  gimplify_assign (t, rsrc, seq);
-  rsrc = t;
-
-  t = create_tmp_var (TREE_TYPE (TREE_TYPE (src)));
-  tree isrc = fold_build1 (IMAGPART_EXPR, TREE_TYPE (TREE_TYPE (src)), src);
-  gimplify_assign (t, isrc, seq);
-  isrc = t;
-
-  tree r = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-  tree i = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-  tree result;
-
-  if (op == PLUS_EXPR)
-    {
-      stmt = gimple_build_assign (r, op, rdest, rsrc);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (i, op, idest, isrc);
-      gimple_seq_add_stmt (seq, stmt);
-    }
-  else if (op == MULT_EXPR)
-    {
-      /* Let x = a + ib = dest, y = c + id = src.
-	 x * y = (ac - bd) + i(ad + bc)  */
-      tree ac = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-      tree bd = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-      tree ad = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-      tree bc = create_tmp_var (TREE_TYPE (TREE_TYPE (dest)));
-
-      stmt = gimple_build_assign (ac, MULT_EXPR, rdest, rsrc);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (bd, MULT_EXPR, idest, isrc);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (r, MINUS_EXPR, ac, bd);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (ad, MULT_EXPR, rdest, isrc);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (bd, MULT_EXPR, idest, rsrc);
-      gimple_seq_add_stmt (seq, stmt);
-
-      stmt = gimple_build_assign (i, PLUS_EXPR, ad, bc);
-      gimple_seq_add_stmt (seq, stmt);
-    }
-  else
-    gcc_unreachable ();
-
-  result = build2 (COMPLEX_EXPR, TREE_TYPE (dest), r, i);
-  gimplify_assign (dest, result, seq);
-}
-
-/* Initialize the reduction array with default values.  */
-
-static void
-oacc_init_reduction_array (tree array, tree init, tree nthreads,
-			   gimple_seq *stmt_seqp)
-{
-  tree type = TREE_TYPE (TREE_TYPE (array));
-  tree x, loop_header, loop_body, loop_exit;
-  gimple *stmt;
-
-  /* Create for loop.
-
-     let var = the original reduction variable
-     let array = reduction variable array
-
-     for (i = 0; i < nthreads; i++)
-       var op= array[i]
- */
-
-  loop_header = create_artificial_label (UNKNOWN_LOCATION);
-  loop_body = create_artificial_label (UNKNOWN_LOCATION);
-  loop_exit = create_artificial_label (UNKNOWN_LOCATION);
-
-  /* Create and initialize an index variable.  */
-  tree ix = create_tmp_var (sizetype);
-  gimplify_assign (ix, fold_build1 (NOP_EXPR, sizetype, integer_zero_node),
-		   stmt_seqp);
-
-  /* Insert the loop header label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_header));
-
-  /* Exit loop if ix >= nthreads.  */
-  x = create_tmp_var (sizetype);
-  gimplify_assign (x, fold_build1 (NOP_EXPR, sizetype, nthreads), stmt_seqp);
-  stmt = gimple_build_cond (GE_EXPR, ix, x, loop_exit, loop_body);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  /* Insert the loop body label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_body));
-
-  /* Calculate the array offset.  */
-  tree offset = create_tmp_var (sizetype);
-  gimplify_assign (offset, TYPE_SIZE_UNIT (type), stmt_seqp);
-  stmt = gimple_build_assign (offset, MULT_EXPR, offset, ix);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  tree ptr = create_tmp_var (TREE_TYPE (array));
-  stmt = gimple_build_assign (ptr, POINTER_PLUS_EXPR, array, offset);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  /* Assign init.  */
-  gimplify_assign (build_simple_mem_ref (ptr), init, stmt_seqp);
-
-  /* Increment the induction variable.  */
-  tree one = fold_build1 (NOP_EXPR, sizetype, integer_one_node);
-  stmt = gimple_build_assign (ix, PLUS_EXPR, ix, one);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  /* Go back to the top of the loop.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_goto (loop_header));
-
-  /* Place the loop exit label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_exit));
-}
-
-/* Helper function to initialize local data for the reduction arrays.
-   The reduction arrays need to be placed inside the calling function
-   for accelerators, or else the host won't be able to preform the final
-   reduction.  */
-
-static void
-oacc_initialize_reduction_data (tree clauses, tree nthreads,
-				gimple_seq *stmt_seqp, omp_context *ctx)
-{
-  tree c, t, oc;
-  gimple *stmt;
-  omp_context *octx;
-
-  /* Find the innermost OpenACC parallel context.  */
-  if (gimple_code (ctx->stmt) == GIMPLE_OMP_TARGET
-      && (gimple_omp_target_kind (ctx->stmt)
-	  == GF_OMP_TARGET_KIND_OACC_PARALLEL))
-    octx = ctx;
-  else
-    octx = ctx->outer;
-  gcc_checking_assert (gimple_code (octx->stmt) == GIMPLE_OMP_TARGET
-		       && (gimple_omp_target_kind (octx->stmt)
-			   == GF_OMP_TARGET_KIND_OACC_PARALLEL));
-
-  /* Extract the clauses.  */
-  oc = gimple_omp_target_clauses (octx->stmt);
-
-  /* Find the last outer clause.  */
-  for (; oc && OMP_CLAUSE_CHAIN (oc); oc = OMP_CLAUSE_CHAIN (oc))
-    ;
-
-  /* Allocate arrays for each reduction variable.  */
-  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-    {
-      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
-	continue;
-
-      tree var = OMP_CLAUSE_DECL (c);
-      tree type = get_base_type (var);
-      tree array = lookup_oacc_reduction (oacc_get_reduction_array_id (var),
-					  ctx);
-      tree size, call;
-
-      /* Calculate size of the reduction array.  */
-      t = create_tmp_var (TREE_TYPE (nthreads));
-      stmt = gimple_build_assign (t, MULT_EXPR, nthreads,
-				  fold_convert (TREE_TYPE (nthreads),
-						TYPE_SIZE_UNIT (type)));
-      gimple_seq_add_stmt (stmt_seqp, stmt);
-
-      size = create_tmp_var (sizetype);
-      gimplify_assign (size, fold_build1 (NOP_EXPR, sizetype, t), stmt_seqp);
-
-      /* Now allocate memory for it.  */
-      call = unshare_expr (builtin_decl_explicit (BUILT_IN_ALLOCA));
-      stmt = gimple_build_call (call, 1, size);
-      gimple_call_set_lhs (stmt, array);
-      gimple_seq_add_stmt (stmt_seqp, stmt);
-
-      /* Initialize array. */
-      tree init = omp_reduction_init_op (OMP_CLAUSE_LOCATION (c),
-					 OMP_CLAUSE_REDUCTION_CODE (c),
-					 type);
-      oacc_init_reduction_array (array, init, nthreads, stmt_seqp);
-
-      /* Map this array into the accelerator.  */
-
-      /* Add the reduction array to the list of clauses.  */
-      tree x = array;
-      t = build_omp_clause (gimple_location (ctx->stmt), OMP_CLAUSE_MAP);
-      OMP_CLAUSE_SET_MAP_KIND (t, GOMP_MAP_FORCE_TOFROM);
-      OMP_CLAUSE_DECL (t) = x;
-      OMP_CLAUSE_CHAIN (t) = NULL;
-      if (oc)
-	OMP_CLAUSE_CHAIN (oc) = t;
-      else
-	gimple_omp_target_set_clauses (as_a <gomp_target *> (octx->stmt), t);
-      OMP_CLAUSE_SIZE (t) = size;
-      oc = t;
-    }
-}
-
-/* Helper function to process the array of partial reductions.  Nthreads
-   indicates the number of threads.  Unfortunately, GOACC_GET_NUM_THREADS
-   cannot be used here, because nthreads on the host may be different than
-   on the accelerator. */
-
-static void
-oacc_finalize_reduction_data (tree clauses, tree nthreads,
-			      gimple_seq *stmt_seqp, omp_context *ctx)
-{
-  tree c, x, var, array, loop_header, loop_body, loop_exit, type;
-  gimple *stmt;
-
-  /* Create for loop.
-
-     let var = the original reduction variable
-     let array = reduction variable array
-
-     for (i = 0; i < nthreads; i++)
-       var op= array[i]
- */
-
-  loop_header = create_artificial_label (UNKNOWN_LOCATION);
-  loop_body = create_artificial_label (UNKNOWN_LOCATION);
-  loop_exit = create_artificial_label (UNKNOWN_LOCATION);
-
-  /* Create and initialize an index variable.  */
-  tree ix = create_tmp_var (sizetype);
-  gimplify_assign (ix, fold_build1 (NOP_EXPR, sizetype, integer_zero_node),
-		   stmt_seqp);
-
-  /* Insert the loop header label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_header));
-
-  /* Exit loop if ix >= nthreads.  */
-  x = create_tmp_var (sizetype);
-  gimplify_assign (x, fold_build1 (NOP_EXPR, sizetype, nthreads), stmt_seqp);
-  stmt = gimple_build_cond (GE_EXPR, ix, x, loop_exit, loop_body);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  /* Insert the loop body label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_body));
-
-  /* Collapse each reduction array, one element at a time.  */
-  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-    {
-      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
-	continue;
-
-      tree_code reduction_code = OMP_CLAUSE_REDUCTION_CODE (c);
-
-      /* reduction(-:var) sums up the partial results, so it acts
-	 identically to reduction(+:var).  */
-      if (reduction_code == MINUS_EXPR)
-        reduction_code = PLUS_EXPR;
-
-      /* Set up reduction variable var.  */
-      var = OMP_CLAUSE_DECL (c);
-      type = get_base_type (var);
-      array = lookup_oacc_reduction (oacc_get_reduction_array_id
-				     (OMP_CLAUSE_DECL (c)), ctx);
-
-      /* Calculate the array offset.  */
-      tree offset = create_tmp_var (sizetype);
-      gimplify_assign (offset, TYPE_SIZE_UNIT (type), stmt_seqp);
-      stmt = gimple_build_assign (offset, MULT_EXPR, offset, ix);
-      gimple_seq_add_stmt (stmt_seqp, stmt);
-
-      tree ptr = create_tmp_var (TREE_TYPE (array));
-      stmt = gimple_build_assign (ptr, POINTER_PLUS_EXPR, array, offset);
-      gimple_seq_add_stmt (stmt_seqp, stmt);
-
-      /* Extract array[ix] into mem.  */
-      tree mem = create_tmp_var (type);
-      gimplify_assign (mem, build_simple_mem_ref (ptr), stmt_seqp);
-
-      /* Find the original reduction variable.  */
-      if (is_reference (var))
-	var = build_simple_mem_ref (var);
-
-      tree t = create_tmp_var (type);
-
-      x = lang_hooks.decls.omp_clause_assign_op (c, t, var);
-      gimplify_and_add (unshare_expr(x), stmt_seqp);
-
-      /* var = var op mem */
-      switch (OMP_CLAUSE_REDUCTION_CODE (c))
-	{
-	case TRUTH_ANDIF_EXPR:
-	case TRUTH_ORIF_EXPR:
-	  t = fold_build2 (OMP_CLAUSE_REDUCTION_CODE (c), integer_type_node,
-			   t, mem);
-	  gimplify_and_add (t, stmt_seqp);
-	  break;
-	default:
-	  /* The lhs isn't a gimple_reg when var is COMPLEX_TYPE.  */
-	  oacc_gimple_assign (t, OMP_CLAUSE_REDUCTION_CODE (c), mem,
-			      stmt_seqp);
-	}
-
-      t = fold_build1 (NOP_EXPR, TREE_TYPE (var), t);
-      x = lang_hooks.decls.omp_clause_assign_op (c, var, t);
-      gimplify_and_add (unshare_expr(x), stmt_seqp);
-    }
-
-  /* Increment the induction variable.  */
-  tree one = fold_build1 (NOP_EXPR, sizetype, integer_one_node);
-  stmt = gimple_build_assign (ix, PLUS_EXPR, ix, one);
-  gimple_seq_add_stmt (stmt_seqp, stmt);
-
-  /* Go back to the top of the loop.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_goto (loop_header));
-
-  /* Place the loop exit label here.  */
-  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_exit));
-}
-
-/* Scan through all of the gimple stmts searching for an OMP_FOR_EXPR, and
-   scan that for reductions.  */
-
-static void
-oacc_process_reduction_data (gimple_seq *body, gimple_seq *in_stmt_seqp,
-			gimple_seq *out_stmt_seqp, omp_context *ctx)
-{
-  gimple_stmt_iterator gsi;
-  gimple_seq inner = NULL;
-
-  /* A collapse clause may have inserted a new bind block.  */
-  gsi = gsi_start (*body);
-  while (!gsi_end_p (gsi))
-    {
-      gimple *stmt = gsi_stmt (gsi);
-      if (gbind *bind_stmt = dyn_cast <gbind *> (stmt))
-	{
-	  inner = gimple_bind_body (bind_stmt);
-	  body = &inner;
-	  gsi = gsi_start (*body);
-	}
-      else if (dyn_cast <gomp_for *> (stmt))
-	break;
-      else
-	gsi_next (&gsi);
-    }
-
-  for (gsi = gsi_start (*body); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      tree clauses, nthreads, t, c, acc_device, acc_device_host, call,
-	enter, exit;
-      bool reduction_found = false;
-
-      gimple *stmt = gsi_stmt (gsi);
-
-      switch (gimple_code (stmt))
-	{
-	case GIMPLE_OMP_FOR:
-	  clauses = gimple_omp_for_clauses (stmt);
-
-	  /* Search for a reduction clause.  */
-	  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
-	      {
-		reduction_found = true;
-		break;
-	      }
-
-	  if (!reduction_found)
-	    break;
-
-	  ctx = maybe_lookup_ctx (stmt);
-	  t = NULL_TREE;
-
-	  /* Extract the number of threads.  */
-	  nthreads = create_tmp_var (sizetype);
-	  t = oacc_max_threads (ctx);
-	  gimplify_assign (nthreads, t, in_stmt_seqp);
-
-	  /* Determine if this is kernel will be executed on the host.  */
-	  call = builtin_decl_explicit (BUILT_IN_ACC_GET_DEVICE_TYPE);
-	  acc_device = create_tmp_var (integer_type_node, ".acc_device_type");
-	  stmt = gimple_build_call (call, 0);
-	  gimple_call_set_lhs (stmt, acc_device);
-	  gimple_seq_add_stmt (in_stmt_seqp, stmt);
-
-	  /* Set nthreads = 1 for ACC_DEVICE_TYPE=host.  */
-	  acc_device_host = create_tmp_var (integer_type_node,
-					    ".acc_device_host");
-	  gimplify_assign (acc_device_host,
-			   build_int_cst (integer_type_node,
-					  GOMP_DEVICE_HOST),
-			   in_stmt_seqp);
-
-	  enter = create_artificial_label (UNKNOWN_LOCATION);
-	  exit = create_artificial_label (UNKNOWN_LOCATION);
-
-	  stmt = gimple_build_cond (EQ_EXPR, acc_device, acc_device_host,
-				    enter, exit);
-	  gimple_seq_add_stmt (in_stmt_seqp, stmt);
-	  gimple_seq_add_stmt (in_stmt_seqp, gimple_build_label (enter));
-	  gimplify_assign (nthreads, fold_build1 (NOP_EXPR, sizetype,
-						  integer_one_node),
-			   in_stmt_seqp);
-	  gimple_seq_add_stmt (in_stmt_seqp, gimple_build_label (exit));
-
-	  oacc_initialize_reduction_data (clauses, nthreads, in_stmt_seqp,
-					  ctx);
-	  oacc_finalize_reduction_data (clauses, nthreads, out_stmt_seqp, ctx);
-	  break;
-	default:
-	  // Scan for other directives which support reduction here.
-	  break;
-	}
-    }
-}
-
 /* If ctx is a worksharing context inside of a cancellable parallel
    region and it isn't nowait, add lhs to its GIMPLE_OMP_RETURN
    and conditional branch to parallel's cancel_label to handle
@@ -15006,12 +14602,9 @@ lower_omp_target (gimple_stmt_iterator *
 
   irlist = NULL;
   orlist = NULL;
-  if (offloaded
-      && is_gimple_omp_oacc (stmt))
-    oacc_process_reduction_data (&tgt_body, &irlist, &orlist, ctx);
 
   for (c = clauses; c ; c = OMP_CLAUSE_CHAIN (c))
     switch (OMP_CLAUSE_CODE (c))
       {
 	tree var, x;
 
@@ -15930,7 +15524,22 @@ lower_omp_target (gimple_stmt_iterator *
 	    break;
 	  }
 
+      gimple_seq fork_seq = NULL;
+      gimple_seq join_seq = NULL;
+
+      if (is_oacc_parallel (ctx))
+	{
+	  /* If there are reductions on the offloaded region itself, treat
+	     them as a dummy GANG loop.  */
+	  tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG);
+
+	  lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level,
+				 false, NULL, NULL, &fork_seq, &join_seq, ctx);
+	}
+
+      gimple_seq_add_seq (&new_body, fork_seq);
       gimple_seq_add_seq (&new_body, tgt_body);
+      gimple_seq_add_seq (&new_body, join_seq);
 
       if (offloaded)
 	new_body = maybe_catch_exception (new_body);
@@ -19019,6 +18628,11 @@ oacc_loop_xform_head_tail (gcall *from,
 	  else if (k == kind && stmt != from)
 	    break;
 	}
+      else if (is_gimple_call (stmt)
+	       && gimple_call_internal_p (stmt)
+	       && gimple_call_internal_fn (stmt) == IFN_GOACC_REDUCTION)
+	*gimple_call_arg_ptr (stmt, 3) = replacement;
+
       gsi_next (&gsi);
       while (gsi_end_p (gsi))
 	gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
@@ -19237,6 +18851,53 @@ default_goacc_fork_join (gcall *ARG_UNUS
     return targetm.have_oacc_join ();
 }
 
+/* Default goacc.reduction early expander.
+
+   LHS-opt = IFN_RED_<foo> (RES_PTR-opt, VAR, LEVEL, OP, LID, RID)
+   If RES_PTR is not integer-zerop:
+       SETUP - emit 'LHS = *RES_PTR', LHS = NULL
+       TEARDOWN - emit '*RES_PTR = VAR'
+   If LHS is not NULL
+       emit 'LHS = VAR'   */
+
+void
+default_goacc_reduction (gcall *call)
+{
+  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  gimple_seq seq = NULL;
+
+  if (code == IFN_GOACC_REDUCTION_SETUP
+      || code == IFN_GOACC_REDUCTION_TEARDOWN)
+    {
+      /* Setup and Teardown need to copy from/to the receiver object,
+	 if there is one.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+      
+      if (!integer_zerop (ref_to_res))
+	{
+	  tree dst = build_simple_mem_ref (ref_to_res);
+	  tree src = var;
+	  
+	  if (code == IFN_GOACC_REDUCTION_SETUP)
+	    {
+	      src = dst;
+	      dst = lhs;
+	      lhs = NULL;
+	    }
+	  gimple_seq_add_stmt (&seq, gimple_build_assign (dst, src));
+	}
+    }
+
+  /* Copy VAR to LHS, if there is an LHS.  */
+  if (lhs)
+    gimple_seq_add_stmt (&seq, gimple_build_assign (lhs, var));
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
@@ -19264,6 +18925,10 @@ execute_oacc_device_lower ()
       fprintf (dump_file, "\n");
     }
 
+  /* Offloaded targets may introduce new basic blocks, which require
+     dominance information to update SSA.  */
+  calculate_dominance_info (CDI_DOMINATORS);
+
   /* Now lower internal loop functions to target-specific code
      sequences.  */
   basic_block bb;
@@ -19298,6 +18963,19 @@ execute_oacc_device_lower ()
 	    rescan = true;
 	    break;
 
+	  case IFN_GOACC_REDUCTION:
+	    /* Mark the function for SSA renaming.  */
+	    mark_virtual_operands_for_renaming (cfun);
+
+	    /* If the level is -1, this ended up being an unused
+	       axis.  Handle as a default.  */
+	    if (integer_minus_onep (gimple_call_arg (call, 3)))
+	      default_goacc_reduction (call);
+	    else
+	      targetm.goacc.reduction (call);
+	    rescan = true;
+	    break;
+
 	  case IFN_UNIQUE:
 	    {
 	      enum ifn_unique_kind kind
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 229667)
+++ gcc/target.def	(working copy)
@@ -1670,6 +1670,17 @@ The default hook returns false, if there
 bool, (gcall *call, const int *dims, bool is_fork),
 default_goacc_fork_join)
 
+DEFHOOK
+(reduction,
+"This hook is used by the oacc_transform pass to expand calls to the\n\
+@var{GOACC_REDUCTION} internal function, into a sequence of gimple\n\
+instructions.  @var{call} is gimple statement containing the call to\n\
+the function.  This hook removes statement @var{call} after the\n\
+expanded sequence has been inserted.  This hook is also responsible\n\
+for allocating any storage for reductions when necessary.",
+void, (gcall *call),
+default_goacc_reduction)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 229667)
+++ gcc/targhooks.h	(working copy)
@@ -111,6 +111,7 @@ extern void default_destroy_cost_data (v
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
+extern void default_goacc_reduction (gcall *);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-02 16:10 [0/3] OpenACC reductions Nathan Sidwell
  2015-10-18 23:20 ` [gomp4] fortran testcase Nathan Sidwell
  2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
@ 2015-11-02 16:35 ` Nathan Sidwell
  2015-11-04 10:01   ` Jakub Jelinek
  2015-11-04 13:27   ` Bernd Schmidt
  2015-11-02 16:38 ` [3/3] " Nathan Sidwell
  3 siblings, 2 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-02 16:35 UTC (permalink / raw)
  To: Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis, Bernd Schmidt

[-- Attachment #1: Type: text/plain, Size: 3143 bytes --]

This patch contains the PTX backend pieces of OpenACC reduction handling.  These 
functions are lowered to gimple, using a couple of PTX-specific builtins for 
some functionality.  Expansion to RTL introduced no new patterns.

We need 3 different schemes for the 3 different partitioning axes, but 
fortunately there is still a lot of commonality.

For the gang level, there is usually a receiver object (if there isn't, the 
program is not well formed).  Each gang of execution has it's own instance of 
the reduction variable, which it initializes with an operator-specific init 
value (a zero or all-bits of appropriate type).  To merge, we use a lockless[1] 
update scheme on the receiver object.  Thus, setup and teardown are simply 
copies of local var.  INIT is a constant initializer and FINI is the update.

For worker level, we need to allocate a buffer in .shared memory local to the 
CTA.  This buffer is reused for each set of reductions, so we only need to size 
it to the maximum value across the program, in a similar manner to the worker 
propagation buffer.  (Unfortunately both need to be live concurrently, so we 
can't share them).

At the setup call, we copy the incoming value (from receiver object or 
LOCAL_VAR) to a slot in the buffer.  At the init call, we initialize to the 
operator-specific value.  At the fini call we do a  lockless update on the 
worker reduction buffer slot.  At the teardown call we read from the reduction 
buffer and possibly write to the receiver object.

The worker reduction buffer slots are the OFFSET argument of the reduction 
calls.  This is the only use we make of this operand for PTX.

For vector level, we  can use shuffle instructions to copy from another vector 
and arrange to use a  binary-tree of combinations to provide a final value.  The 
setup call reads the reciever object, if there is  one.  The init call 
initializes all but vector-zero with an operator-specific value.  Vector zero 
carries the incoming value.  The finalize call expands to the binary tree of 
shuffles.  For PTX this is 5 steps[2].  The teardown call writes back to the 
receiver object, if there is one.

[1] We use a lockless update for gang and worker level that looks somewhat like:

actual = INIT_VAL  (OP)
do
   guess = actual;
   result = guess OP myval
   actual = atomic_cmp_exchange (*REF_TO_RES, guess, result)
while (actual != guess)

The reason for this scheme is that a locking scheme doesn't work across workers 
-- it deadlocks apparently due to resource starvation.  The above is guaranteed 
to make progress.  Further it is easier to optimize to  target-specific 
atomics, should OP be supported directly.

[2] We have to emit an unrolled loop for this tree.   If it were a loop, we'd 
need to mark the loop branch as unified, and don't have a mechanism for that at 
the gimple level.  I have some ideas as to how to do that, but not had time to 
investigate.  We're fixing the vector_length to 32 in this manner, which is what 
other compilers appear to do anyway.

Bernd, I think the builtin fn bits you originally wrote got completely rewritten 
by me several times.

nathan

[-- Attachment #2: 02-trunk-reductions-ptx-1102.patch --]
[-- Type: text/x-patch, Size: 21580 bytes --]

2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	* config/nvptx/nvptx.c: Include gimple headers.
	(worker_red_size, worker_red_align, worker_red_name,
	worker_red_sym): New.
	(nvptx_option_override): Initialize worker reduction buffer.
	(nvptx_file_end): Write out worker reduction buffer var.
	(nvptx_expand_shuffle, nvptx_expand_worker_addr,
	nvptx_expand_cmp_swap): New builtin expanders.
	(enum nvptx_builtins): New.
	(nvptx_builtin_decls): New.
	(nvptx_builtin_decl, nvptx_init_builtins, nvptx_expand_builtin): New
	(PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH): New.
	(nvptx_get_worker_red_addr, nvptx_generate_vector_shuffle,
	nvptx_lockless_update): New helpers.
	(nvptx_goacc_reduction_setup, nvptx_goacc_reduction_init,
	nvptx_goacc_reduction_fini, nvptx_goacc_reduction_teaddown): New.
	(nvptx_goacc_reduction): New.
	(TARGET_INIT_BUILTINS, TARGET_EXPAND_BUILTIN,
	TARGET_BUILTIN_DECL): Override.
	(TARGET_GOACC_REDUCTION): Override.

Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 229667)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -57,6 +57,15 @@
 #include "omp-low.h"
 #include "gomp-constants.h"
 #include "dumpfile.h"
+#include "internal-fn.h"
+#include "gimple-iterator.h"
+#include "stringpool.h"
+#include "tree-ssa-operands.h"
+#include "tree-ssanames.h"
+#include "gimplify.h"
+#include "tree-phinodes.h"
+#include "cfgloop.h"
+#include "fold-const.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -98,6 +107,14 @@ static unsigned worker_bcast_align;
 #define worker_bcast_name "__worker_bcast"
 static GTY(()) rtx worker_bcast_sym;
 
+/* Size of buffer needed for worker reductions.  This has to be
+   distinct from the worker broadcast array, as both may be live
+   concurrently.  */
+static unsigned worker_red_size;
+static unsigned worker_red_align;
+#define worker_red_name "__worker_red"
+static GTY(()) rtx worker_red_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -128,6 +145,9 @@ nvptx_option_override (void)
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
   worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name);
+  worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -3246,8 +3266,199 @@ nvptx_file_end (void)
 	       worker_bcast_align,
 	       worker_bcast_name, worker_bcast_size);
     }
+
+  if (worker_red_size)
+    {
+      /* Define the reduction buffer.  */
+
+      worker_red_size = (worker_red_size + worker_red_align - 1)
+	& ~(worker_red_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_red_name);
+      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
+	       worker_red_align,
+	       worker_red_name, worker_red_size);
+    }
+}
+
+/* Expander for the shuffle builtins.  */
+
+static rtx
+nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
+{
+  if (ignore)
+    return target;
+  
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+
+  rtx idx = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx op = expand_expr (CALL_EXPR_ARG  (exp, 2),
+			NULL_RTX, SImode, EXPAND_NORMAL);
+  
+  if (!REG_P (idx) && GET_CODE (idx) != CONST_INT)
+    idx = copy_to_mode_reg (SImode, idx);
+
+  rtx pat = nvptx_gen_shuffle (target, src, idx, INTVAL (op));
+  if (pat)
+    emit_insn (pat);
+
+  return target;
+}
+
+/* Worker reduction address expander.  */
+
+static rtx
+nvptx_expand_worker_addr (tree exp, rtx target,
+			  machine_mode ARG_UNUSED (mode), int ignore)
+{
+  if (ignore)
+    return target;
+
+  unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
+  if (align > worker_red_align)
+    worker_red_align = align;
+
+  unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
+  unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
+  if (size + offset > worker_red_size)
+    worker_red_size = size + offset;
+
+  emit_insn (gen_rtx_SET (target, worker_red_sym));
+
+  if (offset)
+    emit_insn (gen_rtx_SET (target,
+			    gen_rtx_PLUS (Pmode, target, GEN_INT (offset))));
+
+  emit_insn (gen_rtx_SET (target,
+			  gen_rtx_UNSPEC (Pmode, gen_rtvec (1, target),
+					  UNSPEC_FROM_SHARED)));
+
+  return target;
+}
+
+static rtx
+nvptx_expand_cmp_swap (tree exp, rtx target,
+		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
+{
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  
+  if (!target)
+    target = gen_reg_rtx (mode);
+
+  rtx mem = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, Pmode, EXPAND_NORMAL);
+  rtx cmp = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 2),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx pat;
+
+  mem = gen_rtx_MEM (mode, mem);
+  if (!REG_P (cmp))
+    cmp = copy_to_mode_reg (mode, cmp);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+  
+  if (mode == SImode)
+    pat = gen_atomic_compare_and_swapsi_1 (target, mem, cmp, src, const0_rtx);
+  else
+    pat = gen_atomic_compare_and_swapdi_1 (target, mem, cmp, src, const0_rtx);
+
+  emit_insn (pat);
+
+  return target;
+}
+
+
+/* Codes for all the NVPTX builtins.  */
+enum nvptx_builtins
+{
+  NVPTX_BUILTIN_SHUFFLE,
+  NVPTX_BUILTIN_SHUFFLELL,
+  NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_CMP_SWAP,
+  NVPTX_BUILTIN_CMP_SWAPLL,
+  NVPTX_BUILTIN_MAX
+};
+
+static GTY(()) tree nvptx_builtin_decls[NVPTX_BUILTIN_MAX];
+
+/* Return the NVPTX builtin for CODE.  */
+
+static tree
+nvptx_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
+{
+  if (code >= NVPTX_BUILTIN_MAX)
+    return error_mark_node;
+
+  return nvptx_builtin_decls[code];
+}
+
+/* Set up all builtin functions for this target.  */
+
+static void
+nvptx_init_builtins (void)
+{
+#define DEF(ID, NAME, T)						\
+  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID] =				\
+   add_builtin_function ("__builtin_nvptx_" NAME,			\
+			 build_function_type_list T,			\
+			 NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))
+#define ST sizetype
+#define UINT unsigned_type_node
+#define LLUINT long_long_unsigned_type_node
+#define PTRVOID ptr_type_node
+
+  DEF (SHUFFLE, "shuffle", (UINT, UINT, UINT, UINT, NULL_TREE));
+  DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
+  DEF (WORKER_ADDR, "worker_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
+
+#undef DEF
+#undef ST
+#undef UINT
+#undef LLUINT
+#undef PTRVOID
+}
+
+/* Expand an expression EXP that calls a built-in function,
+   with result going to TARGET if that's convenient
+   (and in mode MODE if that's convenient).
+   SUBTARGET may be used as the target for computing one of EXP's operands.
+   IGNORE is nonzero if the value is to be ignored.  */
+
+static rtx
+nvptx_expand_builtin (tree exp, rtx target, rtx subtarget ATTRIBUTE_UNUSED,
+		      machine_mode mode, int ignore)
+{
+  tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    case NVPTX_BUILTIN_SHUFFLE:
+    case NVPTX_BUILTIN_SHUFFLELL:
+      return nvptx_expand_shuffle (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_WORKER_ADDR:
+      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_CMP_SWAP:
+    case NVPTX_BUILTIN_CMP_SWAPLL:
+      return nvptx_expand_cmp_swap (exp, target, mode, ignore);
+
+    default: gcc_unreachable ();
+    }
 }
 \f
+/* Define dimension sizes for known hardware.  */
+#define PTX_VECTOR_LENGTH 32
+#define PTX_WORKER_LENGTH 32
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  */
@@ -3284,6 +3495,422 @@ nvptx_goacc_fork_join (gcall *call, cons
   return true;
 }
 
+
+static tree
+nvptx_get_worker_red_addr (tree type, tree offset)
+{
+  machine_mode mode = TYPE_MODE (type);
+  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
+  tree align = build_int_cst (unsigned_type_node,
+			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
+  tree call = build_call_expr (fndecl, 3, offset, size, align);
+
+  return fold_convert (build_pointer_type (type), call);
+}
+
+/* Emit a SHFL.DOWN using index SHFL of VAR into DEST_VAR.  This function
+   will cast the variable if necessary.  */
+
+static void
+nvptx_generate_vector_shuffle (location_t loc,
+			       tree dest_var, tree var, unsigned shift,
+			       gimple_seq *seq)
+{
+  unsigned fn = NVPTX_BUILTIN_SHUFFLE;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+
+  switch (TYPE_MODE (TREE_TYPE (var)))
+    {
+    case SFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH */
+    case SImode:
+      break;
+
+    case DFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH  */
+    case DImode:
+      type = long_long_unsigned_type_node;
+      fn = NVPTX_BUILTIN_SHUFFLELL;
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  tree call = nvptx_builtin_decl (fn, true);
+  call = build_call_expr_loc
+    (loc, call, 3, build1 (code, type, var),
+     build_int_cst (unsigned_type_node, shift),
+     build_int_cst (unsigned_type_node, SHUFFLE_DOWN));
+
+  call = fold_build1 (code, TREE_TYPE (dest_var), call);
+
+  gimplify_assign (dest_var, call, seq);
+}
+
+/* Insert code to locklessly update  *PTR with *PTR OP VAR just before
+   GSI.  */
+
+static tree
+nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
+		       tree ptr, tree var, tree_code op)
+{
+  unsigned fn = NVPTX_BUILTIN_CMP_SWAP;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+
+  switch (TYPE_MODE (TREE_TYPE (var)))
+    {
+    case SFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH */
+    case SImode:
+      break;
+
+    case DFmode:
+      code = VIEW_CONVERT_EXPR;
+      /* FALLTHROUGH  */
+    case DImode:
+      type = long_long_unsigned_type_node;
+      fn = NVPTX_BUILTIN_CMP_SWAPLL;
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+
+  gimple_seq init_seq = NULL;
+  tree init_var = make_ssa_name (type);
+  tree init_expr = omp_reduction_init_op (loc, op, TREE_TYPE (var));
+  init_expr = fold_build1 (code, type, init_expr);
+  gimplify_assign (init_var, init_expr, &init_seq);
+  gimple *init_end = gimple_seq_last (init_seq);
+
+  gsi_insert_seq_before (gsi, init_seq, GSI_SAME_STMT);
+  
+  gimple_seq loop_seq = NULL;
+  tree expect_var = make_ssa_name (type);
+  tree actual_var = make_ssa_name (type);
+  tree write_var = make_ssa_name (type);
+  
+  tree write_expr = fold_build1 (code, TREE_TYPE (var), expect_var);
+  write_expr = fold_build2 (op, TREE_TYPE (var), write_expr, var);
+  write_expr = fold_build1 (code, type, write_expr);
+  gimplify_assign (write_var, write_expr, &loop_seq);
+
+  tree swap_expr = nvptx_builtin_decl (fn, true);
+  swap_expr = build_call_expr_loc (loc, swap_expr, 3,
+				   ptr, expect_var, write_var);
+  gimplify_assign (actual_var, swap_expr, &loop_seq);
+
+  gcond *cond = gimple_build_cond (EQ_EXPR, actual_var, expect_var,
+				   NULL_TREE, NULL_TREE);
+  gimple_seq_add_stmt (&loop_seq, cond);
+
+  /* Split the block just after the init stmts.  */
+  basic_block pre_bb = gsi_bb (*gsi);
+  edge pre_edge = split_block (pre_bb, init_end);
+  basic_block loop_bb = pre_edge->dest;
+  pre_bb = pre_edge->src;
+  /* Reset the iterator.  */
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  /* Insert the loop statements.  */
+  gimple *loop_end = gimple_seq_last (loop_seq);
+  gsi_insert_seq_before (gsi, loop_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the loop stmts.  */
+  edge post_edge = split_block (loop_bb, loop_end);
+  basic_block post_bb = post_edge->dest;
+  loop_bb = post_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  post_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
+  edge loop_edge = make_edge (loop_bb, loop_bb, EDGE_FALSE_VALUE);
+  set_immediate_dominator (CDI_DOMINATORS, loop_bb, pre_bb);
+  set_immediate_dominator (CDI_DOMINATORS, post_bb, loop_bb);
+
+  gphi *phi = create_phi_node (expect_var, loop_bb);
+  add_phi_arg (phi, init_var, pre_edge, loc);
+  add_phi_arg (phi, actual_var, loop_edge, loc);
+
+  loop *loop = alloc_loop ();
+  loop->header = loop_bb;
+  loop->latch = loop_bb;
+  add_loop (loop, loop_bb->loop_father);
+
+  return fold_build1 (code, TREE_TYPE (var), write_var);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
+
+static void
+nvptx_goacc_reduction_setup (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Copy the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	var = build_simple_mem_ref (ref_to_res);
+    }
+  
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Store incoming value to worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      tree ref = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (ref) = 1;
+      gimplify_assign (ref, var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_INIT. */
+
+static void
+nvptx_goacc_reduction_init (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code rcode
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  tree init = omp_reduction_init_op (gimple_location (call), rcode,
+				     TREE_TYPE (var));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
+      tree tid = make_ssa_name (integer_type_node);
+      tree dim_vector = gimple_call_arg (call, 3);
+      gimple *tid_call = gimple_build_call_internal (IFN_GOACC_DIM_POS, 1,
+						     dim_vector);
+      gimple *cond_stmt = gimple_build_cond (NE_EXPR, tid, integer_zero_node,
+					     NULL_TREE, NULL_TREE);
+
+      gimple_call_set_lhs (tid_call, tid);
+      gimple_seq_add_stmt (&seq, tid_call);
+      gimple_seq_add_stmt (&seq, cond_stmt);
+
+      /* Split the block just after the call.  */
+      edge init_edge = split_block (gsi_bb (gsi), call);
+      basic_block init_bb = init_edge->dest;
+      basic_block call_bb = init_edge->src;
+
+      /* Fixup flags from call_bb to init_bb.  */
+      init_edge->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+      
+      /* Set the initialization stmts.  */
+      gimple_seq init_seq = NULL;
+      tree init_var = make_ssa_name (TREE_TYPE (var));
+      gimplify_assign (init_var, init, &init_seq);
+      gsi = gsi_start_bb (init_bb);
+      gsi_insert_seq_before (&gsi, init_seq, GSI_SAME_STMT);
+
+      /* Split block just after the init stmt.  */
+      gsi_prev (&gsi);
+      edge inited_edge = split_block (gsi_bb (gsi), gsi_stmt (gsi));
+      basic_block dst_bb = inited_edge->dest;
+      
+      /* Create false edge from call_bb to dst_bb.  */
+      edge nop_edge = make_edge (call_bb, dst_bb, EDGE_FALSE_VALUE);
+
+      /* Create phi node in dst block.  */
+      gphi *phi = create_phi_node (lhs, dst_bb);
+      add_phi_arg (phi, init_var, inited_edge, gimple_location (call));
+      add_phi_arg (phi, var, nop_edge, gimple_location (call));
+
+      /* Reset dominator of dst bb.  */
+      set_immediate_dominator (CDI_DOMINATORS, dst_bb, call_bb);
+
+      /* Reset the gsi.  */
+      gsi = gsi_for_stmt (call);
+    }
+  else
+    {
+      if (level == GOMP_DIM_GANG)
+	{
+	  /* If there's no receiver object, propagate the incoming VAR.  */
+	  tree ref_to_res = gimple_call_arg (call, 1);
+	  if (integer_zerop (ref_to_res))
+	    init = var;
+	}
+
+      gimplify_assign (lhs, init, &seq);
+    }
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_FINI.  */
+
+static void
+nvptx_goacc_reduction_fini (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 1);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code op
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  gimple_seq seq = NULL;
+  tree r = NULL_TREE;;
+
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Emit binary shuffle tree.  TODO. Emit this as an actual loop,
+	 but that requires a method of emitting a unified jump at the
+	 gimple level.  */
+      for (int shfl = PTX_VECTOR_LENGTH / 2; shfl > 0; shfl = shfl >> 1)
+	{
+	  tree other_var = make_ssa_name (TREE_TYPE (var));
+	  nvptx_generate_vector_shuffle (gimple_location (call),
+					 other_var, var, shfl, &seq);
+
+	  r = make_ssa_name (TREE_TYPE (var));
+	  gimplify_assign (r, fold_build2 (op, TREE_TYPE (var),
+					   var, other_var), &seq);
+	  var = r;
+	}
+    }
+  else
+    {
+      tree accum = NULL_TREE;
+
+      if (level == GOMP_DIM_WORKER)
+	{
+	  /* Get reduction buffer address.  */
+	  tree offset = gimple_call_arg (call, 5);
+	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree ptr = make_ssa_name (TREE_TYPE (call));
+
+	  gimplify_assign (ptr, call, &seq);
+	  accum = ptr;
+	}
+      else if (integer_zerop (ref_to_res))
+	r = var;
+      else
+	accum = ref_to_res;
+
+      if (accum)
+	{
+	  /* Locklessly update the accumulator.  */
+	  gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+	  seq = NULL;
+	  r = nvptx_lockless_update (gimple_location (call), &gsi,
+				     accum, var, op);
+	}
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, r, &seq);
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
+
+static void
+nvptx_goacc_reduction_teardown (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Read the worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      var = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (var) = 1;
+    }
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Write to the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	gimplify_assign (build_simple_mem_ref (ref_to_res), var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+  
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX reduction expander.  */
+
+void
+nvptx_goacc_reduction (gcall *call)
+{
+  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+  switch (code)
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      nvptx_goacc_reduction_setup (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_INIT:
+      nvptx_goacc_reduction_init (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_FINI:
+      nvptx_goacc_reduction_fini (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      nvptx_goacc_reduction_teardown (call);
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -3373,12 +4000,22 @@ nvptx_goacc_fork_join (gcall *call, cons
 #undef TARGET_CANNOT_COPY_INSN_P
 #define TARGET_CANNOT_COPY_INSN_P nvptx_cannot_copy_insn_p
 
+#undef TARGET_INIT_BUILTINS
+#define TARGET_INIT_BUILTINS nvptx_init_builtins
+#undef TARGET_EXPAND_BUILTIN
+#define TARGET_EXPAND_BUILTIN nvptx_expand_builtin
+#undef  TARGET_BUILTIN_DECL
+#define TARGET_BUILTIN_DECL nvptx_builtin_decl
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
+#undef TARGET_GOACC_REDUCTION
+#define TARGET_GOACC_REDUCTION nvptx_goacc_reduction
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [3/3] OpenACC reductions
  2015-11-02 16:10 [0/3] OpenACC reductions Nathan Sidwell
                   ` (2 preceding siblings ...)
  2015-11-02 16:35 ` [2/3] " Nathan Sidwell
@ 2015-11-02 16:38 ` Nathan Sidwell
  2015-11-04 10:03   ` Jakub Jelinek
  2015-11-06 10:49   ` [gomp4] " Thomas Schwinge
  3 siblings, 2 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-02 16:38 UTC (permalink / raw)
  To: Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 663 bytes --]

This patch are the initial set of tests.  The libgomp tests use an idiom of 
summing thread identifiers and then checking the expected set of threads 
participated.  They are all derived from the loop tests I recently added for the 
execution model itself.

The fortran test was duplicated in both the gfortran testsuite and the libgomp 
testsuite.   I deleted it from the former.  It was slightly bogus as it asked 
for a vector-length of 40, and appeared to be working by accident by not 
actually partitioning the loop.  I fixed that up and reworked it to avoid 
needing a reduction on a reference variable.  Reference handling will be a later 
patch.

nathan

[-- Attachment #2: 03-trunk-reductions-tests-1102.patch --]
[-- Type: text/x-patch, Size: 11411 bytes --]

2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>

	libgomp/
	* libgomp.oacc-c-c++-common/loop-red-g-1.c: New.
	* libgomp.oacc-c-c++-common/loop-red-gwv-1.c: New.
	* libgomp.oacc-c-c++-common/loop-red-v-1.c: New.
	* libgomp.oacc-c-c++-common/loop-red-v-2.c: New.
	* libgomp.oacc-c-c++-common/loop-red-w-1.c: New.
	* libgomp.oacc-c-c++-common/loop-red-w-2.c: New.
	* libgomp.oacc-c-c++-common/loop-red-wv-1.c: New.
	* libgomp.oacc-fortran/reduction-5.f90: Avoid reference var.

	gcc/testsuite/
	* gfortran.dg/goacc/reduction-2.f95: Delete.

Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c	(working copy)
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int t = 0, h = 0;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(t) copy(ondev)
+  {
+#pragma acc loop gang  reduction (+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if(ondev)
+	{
+	  int g = ix / ((N + 31) / 32);
+	  int w = 0;
+	  int v = 0;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+  if (t != h)
+    {
+      printf ("t=%x expected %x\n", t, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-gwv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-gwv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-gwv-1.c	(working copy)
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int t = 0, h = 0;
+  
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(t) copy(ondev)
+  {
+#pragma acc loop gang worker vector reduction(+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if(ondev)
+	{
+	  int chunk_size = (N + 32*32*32 - 1) / (32*32*32);
+	  
+	  int g = ix / (chunk_size * 32 * 32);
+	  int w = ix / 32 % 32;
+	  int v = ix % 32;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+  if (t != h)
+    {
+      printf ("t=%x expected %x\n", t, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-1.c	(working copy)
@@ -0,0 +1,56 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int t = 0,  h = 0;
+
+#pragma acc parallel vector_length(32) copy(t) copy(ondev)
+  {
+#pragma acc loop vector reduction (+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if (ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+
+  if (t != h)
+    {
+      printf ("t=%x expected %x\n", t, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-2.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-v-2.c	(working copy)
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int q = 0,  h = 0;
+
+#pragma acc parallel vector_length(32) copy(q) copy(ondev)
+  {
+    int t = q;
+    
+#pragma acc loop vector reduction (+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+    q = t;
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if (ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+
+  if (q != h)
+    {
+      printf ("t=%x expected %x\n", q, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c	(working copy)
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int t = 0,  h = 0;
+
+#pragma acc parallel num_workers(32) vector_length(32) copy(t) copy(ondev)
+  {
+#pragma acc loop worker reduction(+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+  if (t != h)
+    {
+      printf ("t=%x expected %x\n", t, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int q = 0,  h = 0;
+
+#pragma acc parallel num_workers(32) vector_length(32) copy(q) copy(ondev)
+  {
+    int t = q;
+    
+#pragma acc loop worker reduction(+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+    q = t;
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+  if (q != h)
+    {
+      printf ("t=%x expected %x\n", q, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-wv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-wv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-wv-1.c	(working copy)
@@ -0,0 +1,54 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ix;
+  int ondev = 0;
+  int t = 0, h = 0;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(t) copy(ondev)
+  {
+#pragma acc loop worker vector reduction (+:t)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int val = ix;
+	
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    val = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	t += val;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int val = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = (ix / 32) % 32;
+	  int v = ix % 32;
+
+	  val = (g << 16) | (w << 8) | v;
+	}
+      h += val;
+    }
+  if (t != h)
+    {
+      printf ("t=%x expected %x\n", t, h);
+      return 1;
+    }
+  
+  return 0;
+}
Index: libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
===================================================================
--- libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90	(revision 229667)
+++ libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90	(working copy)
@@ -21,12 +21,15 @@ end program reduction
 subroutine redsub(sum, n, c)
   integer :: sum, n, c
 
-  sum = 0
+  integer :: s
+  s = 0
 
-  !$acc parallel vector_length(n) copyin (n, c) num_gangs(1)
-  !$acc loop reduction(+:sum)
+  !$acc parallel vector_length(32) copyin (n, c) copy (s) num_gangs(1)
+  !$acc loop reduction(+:s)
   do i = 1, n
-     sum = sum + c
+     s = s + c
   end do
   !$acc end parallel
+
+  sum = s
 end subroutine redsub
Index: gcc/testsuite/gfortran.dg/goacc/reduction-2.f95
===================================================================
--- gcc/testsuite/gfortran.dg/goacc/reduction-2.f95	(revision 229667)
+++ gcc/testsuite/gfortran.dg/goacc/reduction-2.f95	(working copy)
@@ -1,21 +0,0 @@
-! { dg-do compile }
-
-program reduction
-  integer, parameter    :: n = 40, c = 10
-  integer               :: i, sum
-
-  call redsub (sum, n, c)
-end program reduction
-
-subroutine redsub(sum, n, c)
-  integer :: sum, n, c
-
-  sum = 0
-
-  !$acc parallel vector_length(n) copyin (n, c)
-  !$acc loop reduction(+:sum)
-  do i = 1, n
-     sum = sum + c
-  end do
-  !$acc end parallel
-end subroutine redsub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
@ 2015-11-03 15:46   ` Jakub Jelinek
  2015-11-03 16:02     ` Nathan Sidwell
  2015-11-04  9:59   ` Jakub Jelinek
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-03 15:46 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis

On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
> This is the core execution bits of OpenACC reductions.
> 
> We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
> goacc.reduction, to lower it on the target compiler.

So, let me start with a few questions:
1) does OpenACC allow UDRs or only the built-in reductions?  If it
   does not allow UDRs, do you have it covered by testcases that you
   disallow parsing of them (e.g. when you have
#pragma omp declare reduction (xyz: struct S: omp_out.x += omp_in.y) initializer (omp_priv = { 5 })
#pragma acc parallel reduction (xyz: var_with_type_S)
   )?
2) how do you expand the reductions in the end when targetting host fallback
   or when targetting non-PTX targets?

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-03 15:46   ` Jakub Jelinek
@ 2015-11-03 16:02     ` Nathan Sidwell
  2015-11-04 10:31       ` Jakub Jelinek
  0 siblings, 1 reply; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-03 16:02 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Cesar Philippidis

On 11/03/15 10:46, Jakub Jelinek wrote:
> On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
>> This is the core execution bits of OpenACC reductions.
>>
>> We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
>> goacc.reduction, to lower it on the target compiler.
>
> So, let me start with a few questions:
> 1) does OpenACC allow UDRs or only the built-in reductions?  If it
>     does not allow UDRs, do you have it covered by testcases that you
>     disallow parsing of them (e.g. when you have

no UDR reductions.  Will check test cases for that.

> #pragma omp declare reduction (xyz: struct S: omp_out.x += omp_in.y) initializer (omp_priv = { 5 })
> #pragma acc parallel reduction (xyz: var_with_type_S)
>     )?

> 2) how do you expand the reductions in the end when targetting host fallback
>     or when targetting non-PTX targets?

That's what default_goacc_reduction is doing.

(I see its comment hasn't caught up with the changes I made during the merge. 
Will fix)

    LHS-opt = IFN_RED (KIND, RES_PTR, VAR, LEVEL, OP, OFFSET)
    If RES_PTR is not integer-zerop:
        SETUP - emit 'LHS = *RES_PTR', LHS = NULL
        TEARDOWN - emit '*RES_PTR = VAR'
    If LHS is not NULL
        emit 'LHS = VAR'

This is the correct behaviour for a single-threaded  loop.  Of course the loop 
could go on to be parallelized in the normal way -- or additional conversion to 
openmp constructs along the same lines as we discussed for the GOACC_LOOP function.

Does that help?

nathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
  2015-11-03 15:46   ` Jakub Jelinek
@ 2015-11-04  9:59   ` Jakub Jelinek
  2015-11-06 10:47   ` [gomp4] " Thomas Schwinge
  2021-08-09 11:37   ` [1/3] OpenACC reductions Thomas Schwinge
  3 siblings, 0 replies; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-04  9:59 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis

On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
> 	    Cesar Philippidis  <cesar@codesourcery.com>
> 
> 	* internal-fn.def (GOACC_REDUCTION): New.
> 	* internal-fn.h (enum ifn_goacc_reduction_kind): New.
> 	* internal-fn.c (expand_GOACC_REDUCTION): New.
> 	* target.def (goacc.reduction): New OpenACC hook.
> 	* targhooks.h (default_goacc_reduction): Declare.
> 	* doc/tm.texi.in: Add TARGET_GOACC_REDUCTION.
> 	* doc/tm.texi: Rebuilt.
> 	* omp-low.c (oacc_get_reduction_array_id, oacc_max_threads,
> 	scan_sharing_clauses): Remove oacc reduction handling here.
> 	(lower_rec_input_clauses): Don't handle OpenACC reductions here.
> 	(oacc_lower_reduction_var_helper): Delete.
> 	(lower_oacc_reductions): New.
> 	(lower_reduction_clauses): Don't handle OpenACC reductions here.
> 	(lower_oacc_head_tail): Call lower_oacc_reductions.
> 	(oacc_gimple_assign, oacc_init_reduction_array,
> 	oacc_initialize_reduction_data, oacc_finalize_reduction_data,
> 	oacc_process_reduction_data): Delete.
> 	(lower_omp_target): Remove old OpenACC reduction handling.  Insert
> 	dummy OpenACC gang reduction for reductions at outer level.
> 	(oacc_loop_xform_head_tail): Transform IFN_GOACC_REDUCTION.
> 	(default_goacc_reduction): New.
> 	(execute_oacc_device_lower): Handle IFN_GOACC_REDUCTION.

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-02 16:35 ` [2/3] " Nathan Sidwell
@ 2015-11-04 10:01   ` Jakub Jelinek
  2015-11-04 13:57     ` Nathan Sidwell
  2015-11-04 13:27   ` Bernd Schmidt
  1 sibling, 1 reply; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-04 10:01 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis, Bernd Schmidt

On Mon, Nov 02, 2015 at 11:35:34AM -0500, Nathan Sidwell wrote:
> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
> 	    Cesar Philippidis  <cesar@codesourcery.com>
> 
> 	* config/nvptx/nvptx.c: Include gimple headers.
> 	(worker_red_size, worker_red_align, worker_red_name,
...

I think you can approve this yourself, or do you want Bernd to review it?

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [3/3] OpenACC reductions
  2015-11-02 16:38 ` [3/3] " Nathan Sidwell
@ 2015-11-04 10:03   ` Jakub Jelinek
  2015-11-06 10:49   ` [gomp4] " Thomas Schwinge
  1 sibling, 0 replies; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-04 10:03 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis

On Mon, Nov 02, 2015 at 11:38:47AM -0500, Nathan Sidwell wrote:
> This patch are the initial set of tests.  The libgomp tests use an idiom of
> summing thread identifiers and then checking the expected set of threads
> participated.  They are all derived from the loop tests I recently added for
> the execution model itself.
> 
> The fortran test was duplicated in both the gfortran testsuite and the
> libgomp testsuite.   I deleted it from the former.  It was slightly bogus as
> it asked for a vector-length of 40, and appeared to be working by accident
> by not actually partitioning the loop.  I fixed that up and reworked it to
> avoid needing a reduction on a reference variable.  Reference handling will
> be a later patch.
> 
> nathan

> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	libgomp/
> 	* libgomp.oacc-c-c++-common/loop-red-g-1.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-gwv-1.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-v-1.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-v-2.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-w-1.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-w-2.c: New.
> 	* libgomp.oacc-c-c++-common/loop-red-wv-1.c: New.
> 	* libgomp.oacc-fortran/reduction-5.f90: Avoid reference var.
> 
> 	gcc/testsuite/
> 	* gfortran.dg/goacc/reduction-2.f95: Delete.

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-03 16:02     ` Nathan Sidwell
@ 2015-11-04 10:31       ` Jakub Jelinek
  2015-11-04 13:58         ` Nathan Sidwell
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-04 10:31 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis

On Tue, Nov 03, 2015 at 11:01:57AM -0500, Nathan Sidwell wrote:
> On 11/03/15 10:46, Jakub Jelinek wrote:
> >On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
> >>This is the core execution bits of OpenACC reductions.
> >>
> >>We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
> >>goacc.reduction, to lower it on the target compiler.
> >
> >So, let me start with a few questions:
> >1) does OpenACC allow UDRs or only the built-in reductions?  If it
> >    does not allow UDRs, do you have it covered by testcases that you
> >    disallow parsing of them (e.g. when you have
> 
> no UDR reductions.  Will check test cases for that.

BTW, what about min/max reductions for C/C++?  Those were added in OpenMP
3.1, so perhaps OpenACC copied them.

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-02 16:35 ` [2/3] " Nathan Sidwell
  2015-11-04 10:01   ` Jakub Jelinek
@ 2015-11-04 13:27   ` Bernd Schmidt
  2015-11-04 14:09     ` Nathan Sidwell
  2015-11-04 16:59     ` Nathan Sidwell
  1 sibling, 2 replies; 26+ messages in thread
From: Bernd Schmidt @ 2015-11-04 13:27 UTC (permalink / raw)
  To: Nathan Sidwell, Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

On 11/02/2015 05:35 PM, Nathan Sidwell wrote:
>
> +/* Size of buffer needed for worker reductions.  This has to be

Maybe "description" rather than "Size" since there's really four 
variables we're covering with the comment.

> +      worker_red_size = (worker_red_size + worker_red_align - 1)
> +	& ~(worker_red_align - 1);

Formatting. Wrap the entire multi-line expression in parentheses to get 
editors to align the & operator.

> +static rtx
> +nvptx_expand_cmp_swap (tree exp, rtx target,
> +		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))

Add a comment. You're using ATTRIBUTE_UNUSED and ARG_UNUSED in this 
patch, it would be good to be consistent - I'm still not sure which 
style is preferred after the switch to C++, so as far as I'm concerned 
just pick one.

> +{
> +#define DEF(ID, NAME, T)						\
> +  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID] =				\
> +   add_builtin_function ("__builtin_nvptx_" NAME,			\
> +			 build_function_type_list T,			\
> +			 NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))

I think the assignment operator should start the line like all others, 
but other code in gcc is pretty inconsistent in that department.

> +static tree
> +nvptx_get_worker_red_addr (tree type, tree offset)

Add a comment.

> +  switch (TYPE_MODE (TREE_TYPE (var)))
> +    {
> +    case SFmode:
> +      code = VIEW_CONVERT_EXPR;
> +      /* FALLTHROUGH */
> +    case SImode:
> +      break;
> +
> +    case DFmode:
> +      code = VIEW_CONVERT_EXPR;
> +      /* FALLTHROUGH  */
> +    case DImode:
> +      type = long_long_unsigned_type_node;
> +      fn = NVPTX_BUILTIN_CMP_SWAPLL;
> +      break;
> +
> +    default:
> +      gcc_unreachable ();
> +    }

There are two such switch statements, and it's possible to write this 
more compactly:
   if (!INTEGRAL_MODE_P (...))
     code = VIEW_CONVERT_EXPR;
   if (GET_MODE_SIZE (...) == 8)
     fn = CMP_SWAPLL;
Not required, you can decide which you like better.

Otherwise I have no objections.


Bernd

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-04 10:01   ` Jakub Jelinek
@ 2015-11-04 13:57     ` Nathan Sidwell
  0 siblings, 0 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-04 13:57 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Cesar Philippidis, Bernd Schmidt

On 11/04/15 05:01, Jakub Jelinek wrote:
> On Mon, Nov 02, 2015 at 11:35:34AM -0500, Nathan Sidwell wrote:
>> 2015-11-02  Nathan Sidwell  <nathan@codesourcery.com>
>> 	    Cesar Philippidis  <cesar@codesourcery.com>
>>
>> 	* config/nvptx/nvptx.c: Include gimple headers.
>> 	(worker_red_size, worker_red_align, worker_red_name,
> ...
>
> I think you can approve this yourself, or do you want Bernd to review it?

I was posting for completeness, and Bernd often spots things I missed.

thanks all!

nathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-04 10:31       ` Jakub Jelinek
@ 2015-11-04 13:58         ` Nathan Sidwell
  2015-11-04 14:08           ` Jakub Jelinek
  0 siblings, 1 reply; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-04 13:58 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Cesar Philippidis

On 11/04/15 05:31, Jakub Jelinek wrote:
> On Tue, Nov 03, 2015 at 11:01:57AM -0500, Nathan Sidwell wrote:
>> On 11/03/15 10:46, Jakub Jelinek wrote:
>>> On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
>>>> This is the core execution bits of OpenACC reductions.
>>>>
>>>> We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
>>>> goacc.reduction, to lower it on the target compiler.
>>>
>>> So, let me start with a few questions:
>>> 1) does OpenACC allow UDRs or only the built-in reductions?  If it
>>>     does not allow UDRs, do you have it covered by testcases that you
>>>     disallow parsing of them (e.g. when you have
>>
>> no UDR reductions.  Will check test cases for that.
>
> BTW, what about min/max reductions for C/C++?  Those were added in OpenMP
> 3.1, so perhaps OpenACC copied them.

OpenACC has min/max, and this is exercised on gomp4.  we'll get to porting  more 
testcases after this rush is done, ok?  (Or is there something specific about 
min/max?)


nathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-04 13:58         ` Nathan Sidwell
@ 2015-11-04 14:08           ` Jakub Jelinek
  0 siblings, 0 replies; 26+ messages in thread
From: Jakub Jelinek @ 2015-11-04 14:08 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Cesar Philippidis

On Wed, Nov 04, 2015 at 08:58:26AM -0500, Nathan Sidwell wrote:
> On 11/04/15 05:31, Jakub Jelinek wrote:
> >On Tue, Nov 03, 2015 at 11:01:57AM -0500, Nathan Sidwell wrote:
> >>On 11/03/15 10:46, Jakub Jelinek wrote:
> >>>On Mon, Nov 02, 2015 at 11:18:37AM -0500, Nathan Sidwell wrote:
> >>>>This is the core execution bits of OpenACC reductions.
> >>>>
> >>>>We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
> >>>>goacc.reduction, to lower it on the target compiler.
> >>>
> >>>So, let me start with a few questions:
> >>>1) does OpenACC allow UDRs or only the built-in reductions?  If it
> >>>    does not allow UDRs, do you have it covered by testcases that you
> >>>    disallow parsing of them (e.g. when you have
> >>
> >>no UDR reductions.  Will check test cases for that.
> >
> >BTW, what about min/max reductions for C/C++?  Those were added in OpenMP
> >3.1, so perhaps OpenACC copied them.
> 
> OpenACC has min/max, and this is exercised on gomp4.  we'll get to porting
> more testcases after this rush is done, ok?  (Or is there something specific
> about min/max?)

No, just wanted to know what you need to disable in the reduction clause
parsing...
For e.g. C it might be enough to add
if (!openacc)
{
and } around:
            reduc_id = c_parser_peek_token (parser)->value;
	    break;

	Jakub

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-04 13:27   ` Bernd Schmidt
@ 2015-11-04 14:09     ` Nathan Sidwell
  2015-11-04 16:59     ` Nathan Sidwell
  1 sibling, 0 replies; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-04 14:09 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

On 11/04/15 08:27, Bernd Schmidt wrote:
> On 11/02/2015 05:35 PM, Nathan Sidwell wrote:
>>

> There are two such switch statements, and it's possible to write this more
> compactly:
>    if (!INTEGRAL_MODE_P (...))
>      code = VIEW_CONVERT_EXPR;
>    if (GET_MODE_SIZE (...) == 8)
>      fn = CMP_SWAPLL;
> Not required, you can decide which you like better.

thanks, not noticed that (the switch was originally more complex)

nathan

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [2/3] OpenACC reductions
  2015-11-04 13:27   ` Bernd Schmidt
  2015-11-04 14:09     ` Nathan Sidwell
@ 2015-11-04 16:59     ` Nathan Sidwell
  2015-11-06 10:48       ` [gomp4] " Thomas Schwinge
  1 sibling, 1 reply; 26+ messages in thread
From: Nathan Sidwell @ 2015-11-04 16:59 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek, GCC Patches; +Cc: Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 79 bytes --]

On 11/04/15 08:27, Bernd Schmidt wrote:


Adjust and applied, thanks!

nathan


[-- Attachment #2: 02-trunk-reductions-ptx-1104.patch --]
[-- Type: text/x-patch, Size: 22540 bytes --]

2015-11-04  Nathan Sidwell  <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	* config/nvptx/nvptx.c: Include gimple headers.
	(worker_red_size, worker_red_align, worker_red_name,
	worker_red_sym): New.
	(nvptx_option_override): Initialize worker reduction buffer.
	(nvptx_file_end): Write out worker reduction buffer var.
	(nvptx_expand_shuffle, nvptx_expand_worker_addr,
	nvptx_expand_cmp_swap): New builtin expanders.
	(enum nvptx_builtins): New.
	(nvptx_builtin_decls): New.
	(nvptx_builtin_decl, nvptx_init_builtins, nvptx_expand_builtin): New
	(PTX_VECTOR_LENGTH, PTX_WORKER_LENGTH): New.
	(nvptx_get_worker_red_addr, nvptx_generate_vector_shuffle,
	nvptx_lockless_update): New helpers.
	(nvptx_goacc_reduction_setup, nvptx_goacc_reduction_init,
	nvptx_goacc_reduction_fini, nvptx_goacc_reduction_teaddown): New.
	(nvptx_goacc_reduction): New.
	(TARGET_INIT_BUILTINS, TARGET_EXPAND_BUILTIN,
	TARGET_BUILTIN_DECL): Override.
	(TARGET_GOACC_REDUCTION): Override.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 229766)
+++ config/nvptx/nvptx.c	(working copy)
@@ -57,6 +57,15 @@
 #include "omp-low.h"
 #include "gomp-constants.h"
 #include "dumpfile.h"
+#include "internal-fn.h"
+#include "gimple-iterator.h"
+#include "stringpool.h"
+#include "tree-ssa-operands.h"
+#include "tree-ssanames.h"
+#include "gimplify.h"
+#include "tree-phinodes.h"
+#include "cfgloop.h"
+#include "fold-const.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -88,16 +97,23 @@ struct tree_hasher : ggc_cache_ptr_hash<
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
-/* Size of buffer needed to broadcast across workers.  This is used
-   for both worker-neutering and worker broadcasting.   It is shared
-   by all functions emitted.  The buffer is placed in shared memory.
-   It'd be nice if PTX supported common blocks, because then this
-   could be shared across TUs (taking the largest size).  */
+/* Buffer needed to broadcast across workers.  This is used for both
+   worker-neutering and worker broadcasting.  It is shared by all
+   functions emitted.  The buffer is placed in shared memory.  It'd be
+   nice if PTX supported common blocks, because then this could be
+   shared across TUs (taking the largest size).  */
 static unsigned worker_bcast_size;
 static unsigned worker_bcast_align;
 #define worker_bcast_name "__worker_bcast"
 static GTY(()) rtx worker_bcast_sym;
 
+/* Buffer needed for worker reductions.  This has to be distinct from
+   the worker broadcast array, as both may be live concurrently.  */
+static unsigned worker_red_size;
+static unsigned worker_red_align;
+#define worker_red_name "__worker_red"
+static GTY(()) rtx worker_red_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -128,6 +144,9 @@ nvptx_option_override (void)
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
   worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
+  worker_red_sym = gen_rtx_SYMBOL_REF (Pmode, worker_red_name);
+  worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -3246,8 +3265,203 @@ nvptx_file_end (void)
 	       worker_bcast_align,
 	       worker_bcast_name, worker_bcast_size);
     }
+
+  if (worker_red_size)
+    {
+      /* Define the reduction buffer.  */
+
+      worker_red_size = ((worker_red_size + worker_red_align - 1)
+			 & ~(worker_red_align - 1));
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_red_name);
+      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
+	       worker_red_align,
+	       worker_red_name, worker_red_size);
+    }
+}
+
+/* Expander for the shuffle builtins.  */
+
+static rtx
+nvptx_expand_shuffle (tree exp, rtx target, machine_mode mode, int ignore)
+{
+  if (ignore)
+    return target;
+  
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+
+  rtx idx = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, SImode, EXPAND_NORMAL);
+  rtx op = expand_expr (CALL_EXPR_ARG  (exp, 2),
+			NULL_RTX, SImode, EXPAND_NORMAL);
+  
+  if (!REG_P (idx) && GET_CODE (idx) != CONST_INT)
+    idx = copy_to_mode_reg (SImode, idx);
+
+  rtx pat = nvptx_gen_shuffle (target, src, idx, INTVAL (op));
+  if (pat)
+    emit_insn (pat);
+
+  return target;
+}
+
+/* Worker reduction address expander.  */
+
+static rtx
+nvptx_expand_worker_addr (tree exp, rtx target,
+			  machine_mode ARG_UNUSED (mode), int ignore)
+{
+  if (ignore)
+    return target;
+
+  unsigned align = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 2));
+  if (align > worker_red_align)
+    worker_red_align = align;
+
+  unsigned offset = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 0));
+  unsigned size = TREE_INT_CST_LOW (CALL_EXPR_ARG (exp, 1));
+  if (size + offset > worker_red_size)
+    worker_red_size = size + offset;
+
+  emit_insn (gen_rtx_SET (target, worker_red_sym));
+
+  if (offset)
+    emit_insn (gen_rtx_SET (target,
+			    gen_rtx_PLUS (Pmode, target, GEN_INT (offset))));
+
+  emit_insn (gen_rtx_SET (target,
+			  gen_rtx_UNSPEC (Pmode, gen_rtvec (1, target),
+					  UNSPEC_FROM_SHARED)));
+
+  return target;
+}
+
+/* Expand the CMP_SWAP PTX builtins.  We have our own versions that do
+   not require taking the address of any object, other than the memory
+   cell being operated on.  */
+
+static rtx
+nvptx_expand_cmp_swap (tree exp, rtx target,
+		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
+{
+  machine_mode mode = TYPE_MODE (TREE_TYPE (exp));
+  
+  if (!target)
+    target = gen_reg_rtx (mode);
+
+  rtx mem = expand_expr (CALL_EXPR_ARG (exp, 0),
+			 NULL_RTX, Pmode, EXPAND_NORMAL);
+  rtx cmp = expand_expr (CALL_EXPR_ARG (exp, 1),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx src = expand_expr (CALL_EXPR_ARG (exp, 2),
+			 NULL_RTX, mode, EXPAND_NORMAL);
+  rtx pat;
+
+  mem = gen_rtx_MEM (mode, mem);
+  if (!REG_P (cmp))
+    cmp = copy_to_mode_reg (mode, cmp);
+  if (!REG_P (src))
+    src = copy_to_mode_reg (mode, src);
+  
+  if (mode == SImode)
+    pat = gen_atomic_compare_and_swapsi_1 (target, mem, cmp, src, const0_rtx);
+  else
+    pat = gen_atomic_compare_and_swapdi_1 (target, mem, cmp, src, const0_rtx);
+
+  emit_insn (pat);
+
+  return target;
+}
+
+
+/* Codes for all the NVPTX builtins.  */
+enum nvptx_builtins
+{
+  NVPTX_BUILTIN_SHUFFLE,
+  NVPTX_BUILTIN_SHUFFLELL,
+  NVPTX_BUILTIN_WORKER_ADDR,
+  NVPTX_BUILTIN_CMP_SWAP,
+  NVPTX_BUILTIN_CMP_SWAPLL,
+  NVPTX_BUILTIN_MAX
+};
+
+static GTY(()) tree nvptx_builtin_decls[NVPTX_BUILTIN_MAX];
+
+/* Return the NVPTX builtin for CODE.  */
+
+static tree
+nvptx_builtin_decl (unsigned code, bool ARG_UNUSED (initialize_p))
+{
+  if (code >= NVPTX_BUILTIN_MAX)
+    return error_mark_node;
+
+  return nvptx_builtin_decls[code];
+}
+
+/* Set up all builtin functions for this target.  */
+
+static void
+nvptx_init_builtins (void)
+{
+#define DEF(ID, NAME, T)						\
+  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID]				\
+   = add_builtin_function ("__builtin_nvptx_" NAME,			\
+			   build_function_type_list T,			\
+			   NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))
+#define ST sizetype
+#define UINT unsigned_type_node
+#define LLUINT long_long_unsigned_type_node
+#define PTRVOID ptr_type_node
+
+  DEF (SHUFFLE, "shuffle", (UINT, UINT, UINT, UINT, NULL_TREE));
+  DEF (SHUFFLELL, "shufflell", (LLUINT, LLUINT, UINT, UINT, NULL_TREE));
+  DEF (WORKER_ADDR, "worker_addr",
+       (PTRVOID, ST, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAP, "cmp_swap", (UINT, PTRVOID, UINT, UINT, NULL_TREE));
+  DEF (CMP_SWAPLL, "cmp_swapll", (LLUINT, PTRVOID, LLUINT, LLUINT, NULL_TREE));
+
+#undef DEF
+#undef ST
+#undef UINT
+#undef LLUINT
+#undef PTRVOID
+}
+
+/* Expand an expression EXP that calls a built-in function,
+   with result going to TARGET if that's convenient
+   (and in mode MODE if that's convenient).
+   SUBTARGET may be used as the target for computing one of EXP's operands.
+   IGNORE is nonzero if the value is to be ignored.  */
+
+static rtx
+nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
+		      machine_mode mode, int ignore)
+{
+  tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
+  switch (DECL_FUNCTION_CODE (fndecl))
+    {
+    case NVPTX_BUILTIN_SHUFFLE:
+    case NVPTX_BUILTIN_SHUFFLELL:
+      return nvptx_expand_shuffle (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_WORKER_ADDR:
+      return nvptx_expand_worker_addr (exp, target, mode, ignore);
+
+    case NVPTX_BUILTIN_CMP_SWAP:
+    case NVPTX_BUILTIN_CMP_SWAPLL:
+      return nvptx_expand_cmp_swap (exp, target, mode, ignore);
+
+    default: gcc_unreachable ();
+    }
 }
 \f
+/* Define dimension sizes for known hardware.  */
+#define PTX_VECTOR_LENGTH 32
+#define PTX_WORKER_LENGTH 32
+
 /* Validate compute dimensions of an OpenACC offload or routine, fill
    in non-unity defaults.  FN_LEVEL indicates the level at which a
    routine might spawn a loop.  It is negative for non-routines.  */
@@ -3284,6 +3498,404 @@ nvptx_goacc_fork_join (gcall *call, cons
   return true;
 }
 
+/* Generate a PTX builtin function call that returns the address in
+   the worker reduction buffer at OFFSET.  TYPE is the type of the
+   data at that location.  */
+
+static tree
+nvptx_get_worker_red_addr (tree type, tree offset)
+{
+  machine_mode mode = TYPE_MODE (type);
+  tree fndecl = nvptx_builtin_decl (NVPTX_BUILTIN_WORKER_ADDR, true);
+  tree size = build_int_cst (unsigned_type_node, GET_MODE_SIZE (mode));
+  tree align = build_int_cst (unsigned_type_node,
+			      GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT);
+  tree call = build_call_expr (fndecl, 3, offset, size, align);
+
+  return fold_convert (build_pointer_type (type), call);
+}
+
+/* Emit a SHFL.DOWN using index SHFL of VAR into DEST_VAR.  This function
+   will cast the variable if necessary.  */
+
+static void
+nvptx_generate_vector_shuffle (location_t loc,
+			       tree dest_var, tree var, unsigned shift,
+			       gimple_seq *seq)
+{
+  unsigned fn = NVPTX_BUILTIN_SHUFFLE;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+  enum machine_mode mode = TYPE_MODE (TREE_TYPE (var));
+
+  if (!INTEGRAL_MODE_P (mode))
+    code = VIEW_CONVERT_EXPR;
+  if (GET_MODE_SIZE (mode) == GET_MODE_SIZE (DImode))
+    {
+      fn = NVPTX_BUILTIN_SHUFFLELL;
+      type = long_long_unsigned_type_node;
+    }
+
+  tree call = nvptx_builtin_decl (fn, true);
+  call = build_call_expr_loc
+    (loc, call, 3, fold_build1 (code, type, var),
+     build_int_cst (unsigned_type_node, shift),
+     build_int_cst (unsigned_type_node, SHUFFLE_DOWN));
+
+  call = fold_build1 (code, TREE_TYPE (dest_var), call);
+
+  gimplify_assign (dest_var, call, seq);
+}
+
+/* Insert code to locklessly update  *PTR with *PTR OP VAR just before
+   GSI.  */
+
+static tree
+nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
+		       tree ptr, tree var, tree_code op)
+{
+  unsigned fn = NVPTX_BUILTIN_CMP_SWAP;
+  tree_code code = NOP_EXPR;
+  tree type = unsigned_type_node;
+
+  enum machine_mode mode = TYPE_MODE (TREE_TYPE (var));
+
+  if (!INTEGRAL_MODE_P (mode))
+    code = VIEW_CONVERT_EXPR;
+  if (GET_MODE_SIZE (mode) == GET_MODE_SIZE (DImode))
+    {
+      fn = NVPTX_BUILTIN_CMP_SWAPLL;
+      type = long_long_unsigned_type_node;
+    }
+
+  gimple_seq init_seq = NULL;
+  tree init_var = make_ssa_name (type);
+  tree init_expr = omp_reduction_init_op (loc, op, TREE_TYPE (var));
+  init_expr = fold_build1 (code, type, init_expr);
+  gimplify_assign (init_var, init_expr, &init_seq);
+  gimple *init_end = gimple_seq_last (init_seq);
+
+  gsi_insert_seq_before (gsi, init_seq, GSI_SAME_STMT);
+  
+  gimple_seq loop_seq = NULL;
+  tree expect_var = make_ssa_name (type);
+  tree actual_var = make_ssa_name (type);
+  tree write_var = make_ssa_name (type);
+  
+  tree write_expr = fold_build1 (code, TREE_TYPE (var), expect_var);
+  write_expr = fold_build2 (op, TREE_TYPE (var), write_expr, var);
+  write_expr = fold_build1 (code, type, write_expr);
+  gimplify_assign (write_var, write_expr, &loop_seq);
+
+  tree swap_expr = nvptx_builtin_decl (fn, true);
+  swap_expr = build_call_expr_loc (loc, swap_expr, 3,
+				   ptr, expect_var, write_var);
+  gimplify_assign (actual_var, swap_expr, &loop_seq);
+
+  gcond *cond = gimple_build_cond (EQ_EXPR, actual_var, expect_var,
+				   NULL_TREE, NULL_TREE);
+  gimple_seq_add_stmt (&loop_seq, cond);
+
+  /* Split the block just after the init stmts.  */
+  basic_block pre_bb = gsi_bb (*gsi);
+  edge pre_edge = split_block (pre_bb, init_end);
+  basic_block loop_bb = pre_edge->dest;
+  pre_bb = pre_edge->src;
+  /* Reset the iterator.  */
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  /* Insert the loop statements.  */
+  gimple *loop_end = gimple_seq_last (loop_seq);
+  gsi_insert_seq_before (gsi, loop_seq, GSI_SAME_STMT);
+
+  /* Split the block just after the loop stmts.  */
+  edge post_edge = split_block (loop_bb, loop_end);
+  basic_block post_bb = post_edge->dest;
+  loop_bb = post_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  post_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
+  edge loop_edge = make_edge (loop_bb, loop_bb, EDGE_FALSE_VALUE);
+  set_immediate_dominator (CDI_DOMINATORS, loop_bb, pre_bb);
+  set_immediate_dominator (CDI_DOMINATORS, post_bb, loop_bb);
+
+  gphi *phi = create_phi_node (expect_var, loop_bb);
+  add_phi_arg (phi, init_var, pre_edge, loc);
+  add_phi_arg (phi, actual_var, loop_edge, loc);
+
+  loop *loop = alloc_loop ();
+  loop->header = loop_bb;
+  loop->latch = loop_bb;
+  add_loop (loop, loop_bb->loop_father);
+
+  return fold_build1 (code, TREE_TYPE (var), write_var);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_SETUP.  */
+
+static void
+nvptx_goacc_reduction_setup (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Copy the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	var = build_simple_mem_ref (ref_to_res);
+    }
+  
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Store incoming value to worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      tree ref = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (ref) = 1;
+      gimplify_assign (ref, var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_INIT. */
+
+static void
+nvptx_goacc_reduction_init (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code rcode
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  tree init = omp_reduction_init_op (gimple_location (call), rcode,
+				     TREE_TYPE (var));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Initialize vector-non-zeroes to INIT_VAL (OP).  */
+      tree tid = make_ssa_name (integer_type_node);
+      tree dim_vector = gimple_call_arg (call, 3);
+      gimple *tid_call = gimple_build_call_internal (IFN_GOACC_DIM_POS, 1,
+						     dim_vector);
+      gimple *cond_stmt = gimple_build_cond (NE_EXPR, tid, integer_zero_node,
+					     NULL_TREE, NULL_TREE);
+
+      gimple_call_set_lhs (tid_call, tid);
+      gimple_seq_add_stmt (&seq, tid_call);
+      gimple_seq_add_stmt (&seq, cond_stmt);
+
+      /* Split the block just after the call.  */
+      edge init_edge = split_block (gsi_bb (gsi), call);
+      basic_block init_bb = init_edge->dest;
+      basic_block call_bb = init_edge->src;
+
+      /* Fixup flags from call_bb to init_bb.  */
+      init_edge->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+      
+      /* Set the initialization stmts.  */
+      gimple_seq init_seq = NULL;
+      tree init_var = make_ssa_name (TREE_TYPE (var));
+      gimplify_assign (init_var, init, &init_seq);
+      gsi = gsi_start_bb (init_bb);
+      gsi_insert_seq_before (&gsi, init_seq, GSI_SAME_STMT);
+
+      /* Split block just after the init stmt.  */
+      gsi_prev (&gsi);
+      edge inited_edge = split_block (gsi_bb (gsi), gsi_stmt (gsi));
+      basic_block dst_bb = inited_edge->dest;
+      
+      /* Create false edge from call_bb to dst_bb.  */
+      edge nop_edge = make_edge (call_bb, dst_bb, EDGE_FALSE_VALUE);
+
+      /* Create phi node in dst block.  */
+      gphi *phi = create_phi_node (lhs, dst_bb);
+      add_phi_arg (phi, init_var, inited_edge, gimple_location (call));
+      add_phi_arg (phi, var, nop_edge, gimple_location (call));
+
+      /* Reset dominator of dst bb.  */
+      set_immediate_dominator (CDI_DOMINATORS, dst_bb, call_bb);
+
+      /* Reset the gsi.  */
+      gsi = gsi_for_stmt (call);
+    }
+  else
+    {
+      if (level == GOMP_DIM_GANG)
+	{
+	  /* If there's no receiver object, propagate the incoming VAR.  */
+	  tree ref_to_res = gimple_call_arg (call, 1);
+	  if (integer_zerop (ref_to_res))
+	    init = var;
+	}
+
+      gimplify_assign (lhs, init, &seq);
+    }
+
+  pop_gimplify_context (NULL);
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_FINI.  */
+
+static void
+nvptx_goacc_reduction_fini (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 1);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  enum tree_code op
+    = (enum tree_code)TREE_INT_CST_LOW (gimple_call_arg (call, 4));
+  gimple_seq seq = NULL;
+  tree r = NULL_TREE;;
+
+  push_gimplify_context (true);
+
+  if (level == GOMP_DIM_VECTOR)
+    {
+      /* Emit binary shuffle tree.  TODO. Emit this as an actual loop,
+	 but that requires a method of emitting a unified jump at the
+	 gimple level.  */
+      for (int shfl = PTX_VECTOR_LENGTH / 2; shfl > 0; shfl = shfl >> 1)
+	{
+	  tree other_var = make_ssa_name (TREE_TYPE (var));
+	  nvptx_generate_vector_shuffle (gimple_location (call),
+					 other_var, var, shfl, &seq);
+
+	  r = make_ssa_name (TREE_TYPE (var));
+	  gimplify_assign (r, fold_build2 (op, TREE_TYPE (var),
+					   var, other_var), &seq);
+	  var = r;
+	}
+    }
+  else
+    {
+      tree accum = NULL_TREE;
+
+      if (level == GOMP_DIM_WORKER)
+	{
+	  /* Get reduction buffer address.  */
+	  tree offset = gimple_call_arg (call, 5);
+	  tree call = nvptx_get_worker_red_addr (TREE_TYPE (var), offset);
+	  tree ptr = make_ssa_name (TREE_TYPE (call));
+
+	  gimplify_assign (ptr, call, &seq);
+	  accum = ptr;
+	}
+      else if (integer_zerop (ref_to_res))
+	r = var;
+      else
+	accum = ref_to_res;
+
+      if (accum)
+	{
+	  /* Locklessly update the accumulator.  */
+	  gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+	  seq = NULL;
+	  r = nvptx_lockless_update (gimple_location (call), &gsi,
+				     accum, var, op);
+	}
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, r, &seq);
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX implementation of GOACC_REDUCTION_TEARDOWN.  */
+
+static void
+nvptx_goacc_reduction_teardown (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree var = gimple_call_arg (call, 2);
+  int level = TREE_INT_CST_LOW (gimple_call_arg (call, 3));
+  gimple_seq seq = NULL;
+  
+  push_gimplify_context (true);
+  if (level == GOMP_DIM_WORKER)
+    {
+      /* Read the worker reduction buffer.  */
+      tree offset = gimple_call_arg (call, 5);
+      tree call = nvptx_get_worker_red_addr(TREE_TYPE (var), offset);
+      tree ptr = make_ssa_name (TREE_TYPE (call));
+
+      gimplify_assign (ptr, call, &seq);
+      var = build_simple_mem_ref (ptr);
+      TREE_THIS_VOLATILE (var) = 1;
+    }
+
+  if (level != GOMP_DIM_GANG)
+    {
+      /* Write to the receiver object.  */
+      tree ref_to_res = gimple_call_arg (call, 1);
+
+      if (!integer_zerop (ref_to_res))
+	gimplify_assign (build_simple_mem_ref (ref_to_res), var, &seq);
+    }
+
+  if (lhs)
+    gimplify_assign (lhs, var, &seq);
+  
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
+/* NVPTX reduction expander.  */
+
+void
+nvptx_goacc_reduction (gcall *call)
+{
+  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+  switch (code)
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      nvptx_goacc_reduction_setup (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_INIT:
+      nvptx_goacc_reduction_init (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_FINI:
+      nvptx_goacc_reduction_fini (call);
+      break;
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      nvptx_goacc_reduction_teardown (call);
+      break;
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -3373,12 +3985,22 @@ nvptx_goacc_fork_join (gcall *call, cons
 #undef TARGET_CANNOT_COPY_INSN_P
 #define TARGET_CANNOT_COPY_INSN_P nvptx_cannot_copy_insn_p
 
+#undef TARGET_INIT_BUILTINS
+#define TARGET_INIT_BUILTINS nvptx_init_builtins
+#undef TARGET_EXPAND_BUILTIN
+#define TARGET_EXPAND_BUILTIN nvptx_expand_builtin
+#undef  TARGET_BUILTIN_DECL
+#define TARGET_BUILTIN_DECL nvptx_builtin_decl
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
 #undef TARGET_GOACC_FORK_JOIN
 #define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
 
+#undef TARGET_GOACC_REDUCTION
+#define TARGET_GOACC_REDUCTION nvptx_goacc_reduction
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [gomp4] Re: [1/3] OpenACC reductions
  2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
  2015-11-03 15:46   ` Jakub Jelinek
  2015-11-04  9:59   ` Jakub Jelinek
@ 2015-11-06 10:47   ` Thomas Schwinge
  2016-01-07  3:55     ` [gomp4] private reductions Cesar Philippidis
  2021-08-09 11:37   ` [1/3] OpenACC reductions Thomas Schwinge
  3 siblings, 1 reply; 26+ messages in thread
From: Thomas Schwinge @ 2015-11-06 10:47 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Cesar Philippidis, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 9569 bytes --]

Hi Nathan!

On Mon, 2 Nov 2015 11:18:37 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> This is the core execution bits of OpenACC reductions.

> One thing not handled by this patch are reductions of variables of reference 
> type.  We have an implementation on gomp4 branch [...]

Trying to keep the existing code on gomp-4_0-branch alive, I merged your
trunk r229767 into gomp-4_0-branch in r229835.  To avoid regressions in
libgomp reduction execution tests, I had to apply one hack; please have a
look.  For your easier review, here is the merge commit in two variants,
first displayed as a three-way diff by Git's --cc option:

commit 2b76127eebddb59d45e5f068324e14efe77bb05c
Merge: bed2efe 641a0fa
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Nov 6 09:33:40 2015 +0000

    svn merge -r 229764:229767 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@229835 138bc75d-0d04-0410-961f-82ee72b054a4


 gcc/ChangeLog   | 28 +++++++++++++++++++++++++++-
 gcc/omp-low.c   | 58 ++++++++++++++++++++++++++++++++++++++-------------------
 gcc/targhooks.h |  2 +-
 3 files changed, 67 insertions(+), 21 deletions(-)

diff --cc gcc/omp-low.c
index debedb1,6a0915b..da574a9
--- gcc/omp-low.c
+++ gcc/omp-low.c
@@@ -5441,14 -5306,25 +5441,28 @@@ lower_oacc_reductions (location_t loc, 
      if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
        {
  	tree orig = OMP_CLAUSE_DECL (c);
- 	tree var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
 -	tree var = maybe_lookup_decl (orig, ctx);
++	tree var;
  	tree ref_to_res = NULL_TREE;
- 	
+ 	tree incoming, outgoing;
+ 
+ 	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
+ 	if (rcode == MINUS_EXPR)
+ 	  rcode = PLUS_EXPR;
+ 	else if (rcode == TRUTH_ANDIF_EXPR)
+ 	  rcode = BIT_AND_EXPR;
+ 	else if (rcode == TRUTH_ORIF_EXPR)
+ 	  rcode = BIT_IOR_EXPR;
+ 	tree op = build_int_cst (unsigned_type_node, rcode);
+ 
++	var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
 +	if (!var)
 +	  var = maybe_lookup_decl (orig, ctx);
  	if (!var)
  	  var = orig;
+ 	gcc_assert (!is_reference (var));
  
+ 	incoming = outgoing = var;
+ 	
  	if (!inner)
  	  {
  	    /* See if an outer construct also reduces this variable.  */
@@@ -5490,24 -5365,22 +5503,31 @@@
  	       see if there's a mapping for it.  */
  	    if (gimple_code (outer->stmt) == GIMPLE_OMP_TARGET
  		&& maybe_lookup_field (orig, outer))
- 	      ref_to_res = build_receiver_ref (orig, false, outer);
+ 	      {
+ 		ref_to_res = build_receiver_ref (orig, false, outer);
+ 		if (is_reference (orig))
+ 		  ref_to_res = build_simple_mem_ref (ref_to_res);
  
+ 		outgoing = var;
+ 		incoming = omp_reduction_init_op (loc, rcode, TREE_TYPE (var));
+ 	      }
++	    /* This is enabled on trunk, but has been disabled in the merge of
++	       trunk r229767 into gomp-4_0-branch, as otherwise there were a
++	       lot of regressions in libgomp reduction execution tests.  It is
++	       unclear if the problem is in the tests themselves, or here, or
++	       elsewhere.  Given the usage of "var =
++	       OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c)" on gomp-4_0-branch, maybe
++	       we have to consider that here, too, instead of "orig"?  */
++#if 0
+ 	    else
+ 	      incoming = outgoing = orig;
++#endif
+ 	      
  	  has_outer_reduction:;
  	  }
- 	gcc_assert (!is_reference (var));
+ 
  	if (!ref_to_res)
  	  ref_to_res = integer_zero_node;
- 	else if (is_reference (orig))
- 	  ref_to_res = build_simple_mem_ref (ref_to_res);
- 
- 	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
- 	if (rcode == MINUS_EXPR)
- 	  rcode = PLUS_EXPR;
- 	else if (rcode == TRUTH_ANDIF_EXPR)
- 	  rcode = BIT_AND_EXPR;
- 	else if (rcode == TRUTH_ORIF_EXPR)
- 	  rcode = BIT_IOR_EXPR;
- 	tree op = build_int_cst (unsigned_type_node, rcode);
  
  	/* Determine position in reduction buffer, which may be used
  	   by target.  */
diff --cc gcc/targhooks.h
index f8efe47a,c34e4ae..4a4496a
--- gcc/targhooks.h
+++ gcc/targhooks.h
@@@ -109,10 -109,9 +109,10 @@@ extern void default_finish_cost (void *
  extern void default_destroy_cost_data (void *);
  
  /* OpenACC hooks.  */
- extern void default_goacc_reduction (gcall *);
  extern bool default_goacc_validate_dims (tree, int [], int);
 +extern unsigned default_goacc_dim_limit (unsigned);
  extern bool default_goacc_fork_join (gcall *, const int [], bool);
+ extern void default_goacc_reduction (gcall *);
  
  /* These are here, and not in hooks.[ch], because not all users of
     hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */

..., and second, as a "plain patch" (gomp-4_0-branch before vs. after):

--- gcc/ChangeLog
+++ gcc/ChangeLog
@@ -1,3 +1,29 @@
[...]
--- gcc/omp-low.c
+++ gcc/omp-low.c
@@ -5441,14 +5441,28 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
     if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
       {
 	tree orig = OMP_CLAUSE_DECL (c);
-	tree var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+	tree var;
 	tree ref_to_res = NULL_TREE;
-	
+	tree incoming, outgoing;
+
+	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
+	if (rcode == MINUS_EXPR)
+	  rcode = PLUS_EXPR;
+	else if (rcode == TRUTH_ANDIF_EXPR)
+	  rcode = BIT_AND_EXPR;
+	else if (rcode == TRUTH_ORIF_EXPR)
+	  rcode = BIT_IOR_EXPR;
+	tree op = build_int_cst (unsigned_type_node, rcode);
+
+	var = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
 	if (!var)
 	  var = maybe_lookup_decl (orig, ctx);
 	if (!var)
 	  var = orig;
+	gcc_assert (!is_reference (var));
 
+	incoming = outgoing = var;
+	
 	if (!inner)
 	  {
 	    /* See if an outer construct also reduces this variable.  */
@@ -5485,29 +5499,35 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	      }
 
 	  do_lookup:
-	    
 	    /* This is the outermost construct with this reduction,
 	       see if there's a mapping for it.  */
 	    if (gimple_code (outer->stmt) == GIMPLE_OMP_TARGET
 		&& maybe_lookup_field (orig, outer))
-	      ref_to_res = build_receiver_ref (orig, false, outer);
+	      {
+		ref_to_res = build_receiver_ref (orig, false, outer);
+		if (is_reference (orig))
+		  ref_to_res = build_simple_mem_ref (ref_to_res);
 
+		outgoing = var;
+		incoming = omp_reduction_init_op (loc, rcode, TREE_TYPE (var));
+	      }
+	    /* This is enabled on trunk, but has been disabled in the merge of
+	       trunk r229767 into gomp-4_0-branch, as otherwise there were a
+	       lot of regressions in libgomp reduction execution tests.  It is
+	       unclear if the problem is in the tests themselves, or here, or
+	       elsewhere.  Given the usage of "var =
+	       OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c)" on gomp-4_0-branch, maybe
+	       we have to consider that here, too, instead of "orig"?  */
+#if 0
+	    else
+	      incoming = outgoing = orig;
+#endif
+	      
 	  has_outer_reduction:;
 	  }
-	gcc_assert (!is_reference (var));
+
 	if (!ref_to_res)
 	  ref_to_res = integer_zero_node;
-	else if (is_reference (orig))
-	  ref_to_res = build_simple_mem_ref (ref_to_res);
-
-	enum tree_code rcode = OMP_CLAUSE_REDUCTION_CODE (c);
-	if (rcode == MINUS_EXPR)
-	  rcode = PLUS_EXPR;
-	else if (rcode == TRUTH_ANDIF_EXPR)
-	  rcode = BIT_AND_EXPR;
-	else if (rcode == TRUTH_ORIF_EXPR)
-	  rcode = BIT_IOR_EXPR;
-	tree op = build_int_cst (unsigned_type_node, rcode);
 
 	/* Determine position in reduction buffer, which may be used
 	   by target.  */
@@ -5533,7 +5553,7 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, setup_code,
 					  unshare_expr (ref_to_res),
-					  var, level, op, off);
+					  incoming, level, op, off);
 	tree init_call
 	  = build_call_expr_internal_loc (loc, IFN_GOACC_REDUCTION,
 					  TREE_TYPE (var), 6, init_code,
@@ -5552,7 +5572,7 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	gimplify_assign (var, setup_call, &before_fork);
 	gimplify_assign (var, init_call, &after_fork);
 	gimplify_assign (var, fini_call, &before_join);
-	gimplify_assign (var, teardown_call, &after_join);
+	gimplify_assign (outgoing, teardown_call, &after_join);
       }
 
   /* Now stitch things together.  */
@@ -19549,7 +19569,7 @@ default_goacc_fork_join (gcall *ARG_UNUSED (call),
 
 /* Default goacc.reduction early expander.
 
-   LHS-opt = IFN_RED_<foo> (RES_PTR-opt, VAR, LEVEL, OP, LID, RID)
+   LHS-opt = IFN_REDUCTION (KIND, RES_PTR, VAR, LEVEL, OP, OFFSET)
    If RES_PTR is not integer-zerop:
        SETUP - emit 'LHS = *RES_PTR', LHS = NULL
        TEARDOWN - emit '*RES_PTR = VAR'
--- gcc/targhooks.h
+++ gcc/targhooks.h
@@ -109,10 +109,10 @@ extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
 /* OpenACC hooks.  */
-extern void default_goacc_reduction (gcall *);
 extern bool default_goacc_validate_dims (tree, int [], int);
 extern unsigned default_goacc_dim_limit (unsigned);
 extern bool default_goacc_fork_join (gcall *, const int [], bool);
+extern void default_goacc_reduction (gcall *);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [gomp4] Re: [2/3] OpenACC reductions
  2015-11-04 16:59     ` Nathan Sidwell
@ 2015-11-06 10:48       ` Thomas Schwinge
  0 siblings, 0 replies; 26+ messages in thread
From: Thomas Schwinge @ 2015-11-06 10:48 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches
  Cc: Cesar Philippidis, Bernd Schmidt, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 11906 bytes --]

Hi Nathan!

On Wed, 4 Nov 2015 11:59:28 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> [PTX backend pieces of OpenACC reduction handling]

Merged your trunk r229768 into gomp-4_0-branch in r229836:

commit 089a0224af68e30b55f42734de48adc645eb7370
Merge: 2b76127 78a78aa
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Nov 6 09:38:10 2015 +0000

    svn merge -r 229767:229768 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@229836 138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/ChangeLog            |  23 +++++++
 gcc/config/nvptx/nvptx.c | 169 +++++++++++++++++++++++------------------------
 2 files changed, 107 insertions(+), 85 deletions(-)

I hope I did the right thing replacing the existing code on
gomp-4_0-branch with what you committed to trunk: in particular, the
nvptx_lockless_update and nvptx_goacc_reduction_init functions.  That is,
in the merge commit, I effectively applied the following patch
(gomp-4_0-branch before vs. after):

--- gcc/ChangeLog
+++ gcc/ChangeLog
[...]
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -57,21 +57,22 @@
[#include directives reshuffled]
@@ -104,19 +105,18 @@ struct tree_hasher : ggc_cache_ptr_hash<tree_node>
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
-/* Size of buffer needed to broadcast across workers.  This is used
-   for both worker-neutering and worker broadcasting.   It is shared
-   by all functions emitted.  The buffer is placed in shared memory.
-   It'd be nice if PTX supported common blocks, because then this
-   could be shared across TUs (taking the largest size).  */
+/* Buffer needed to broadcast across workers.  This is used for both
+   worker-neutering and worker broadcasting.  It is shared by all
+   functions emitted.  The buffer is placed in shared memory.  It'd be
+   nice if PTX supported common blocks, because then this could be
+   shared across TUs (taking the largest size).  */
 static unsigned worker_bcast_size;
 static unsigned worker_bcast_align;
 #define worker_bcast_name "__worker_bcast"
 static GTY(()) rtx worker_bcast_sym;
 
-/* Size of buffer needed for worker reductions.  This has to be
-   distinct from the worker broadcast array, as both may be live
-   concurrently.  */
+/* Buffer needed for worker reductions.  This has to be distinct from
+   the worker broadcast array, as both may be live concurrently.  */
 static unsigned worker_red_size;
 static unsigned worker_red_align;
 #define worker_red_name "__worker_red"
@@ -3977,8 +3977,8 @@ nvptx_file_end (void)
     {
       /* Define the reduction buffer.  */
 
-      worker_red_size = (worker_red_size + worker_red_align - 1)
-	& ~(worker_red_align - 1);
+      worker_red_size = ((worker_red_size + worker_red_align - 1)
+			 & ~(worker_red_align - 1));
       
       fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_red_name);
       fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
@@ -3986,7 +3986,7 @@ nvptx_file_end (void)
 	       worker_red_name, worker_red_size);
     }
 }
-\f
+
 /* Expander for the shuffle builtins.  */
 
 static rtx
@@ -4046,6 +4046,10 @@ nvptx_expand_worker_addr (tree exp, rtx target,
   return target;
 }
 
+/* Expand the CMP_SWAP PTX builtins.  We have our own versions that do
+   not require taking the address of any object, other than the memory
+   cell being operated on.  */
+
 static rtx
 nvptx_expand_cmp_swap (tree exp, rtx target,
 		       machine_mode ARG_UNUSED (m), int ARG_UNUSED (ignore))
@@ -4096,7 +4100,7 @@ static GTY(()) tree nvptx_builtin_decls[NVPTX_BUILTIN_MAX];
 /* Return the NVPTX builtin for CODE.  */
 
 static tree
-nvptx_builtin_decl (unsigned code, bool initialize_p ATTRIBUTE_UNUSED)
+nvptx_builtin_decl (unsigned code, bool ARG_UNUSED (initialize_p))
 {
   if (code >= NVPTX_BUILTIN_MAX)
     return error_mark_node;
@@ -4110,10 +4114,10 @@ static void
 nvptx_init_builtins (void)
 {
 #define DEF(ID, NAME, T)						\
-  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID] =				\
-   add_builtin_function ("__builtin_nvptx_" NAME,			\
-			 build_function_type_list T,			\
-			 NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))
+  (nvptx_builtin_decls[NVPTX_BUILTIN_ ## ID]				\
+   = add_builtin_function ("__builtin_nvptx_" NAME,			\
+			   build_function_type_list T,			\
+			   NVPTX_BUILTIN_ ## ID, BUILT_IN_MD, NULL, NULL))
 #define ST sizetype
 #define UINT unsigned_type_node
 #define LLUINT long_long_unsigned_type_node
@@ -4140,7 +4144,7 @@ nvptx_init_builtins (void)
    IGNORE is nonzero if the value is to be ignored.  */
 
 static rtx
-nvptx_expand_builtin (tree exp, rtx target, rtx subtarget ATTRIBUTE_UNUSED,
+nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 		      machine_mode mode, int ignore)
 {
   tree fndecl = TREE_OPERAND (CALL_EXPR_FN (exp), 0);
@@ -4239,6 +4243,10 @@ nvptx_goacc_fork_join (gcall *call, const int dims[],
   return true;
 }
 
+/* Generate a PTX builtin function call that returns the address in
+   the worker reduction buffer at OFFSET.  TYPE is the type of the
+   data at that location.  */
+
 static tree
 nvptx_get_worker_red_addr (tree type, tree offset)
 {
@@ -4263,30 +4271,19 @@ nvptx_generate_vector_shuffle (location_t loc,
   unsigned fn = NVPTX_BUILTIN_SHUFFLE;
   tree_code code = NOP_EXPR;
   tree type = unsigned_type_node;
+  enum machine_mode mode = TYPE_MODE (TREE_TYPE (var));
 
-  switch (TYPE_MODE (TREE_TYPE (var)))
+  if (!INTEGRAL_MODE_P (mode))
+    code = VIEW_CONVERT_EXPR;
+  if (GET_MODE_SIZE (mode) == GET_MODE_SIZE (DImode))
     {
-    case SFmode:
-      code = VIEW_CONVERT_EXPR;
-      /* FALLTHROUGH */
-    case SImode:
-      break;
-
-    case DFmode:
-      code = VIEW_CONVERT_EXPR;
-      /* FALLTHROUGH  */
-    case DImode:
-      type = long_long_unsigned_type_node;
       fn = NVPTX_BUILTIN_SHUFFLELL;
-      break;
-
-    default:
-      gcc_unreachable ();
+      type = long_long_unsigned_type_node;
     }
 
   tree call = nvptx_builtin_decl (fn, true);
   call = build_call_expr_loc
-    (loc, call, 3, build1 (code, type, var),
+    (loc, call, 3, fold_build1 (code, type, var),
      build_int_cst (unsigned_type_node, shift),
      build_int_cst (unsigned_type_node, SHUFFLE_DOWN));
 
@@ -4295,6 +4292,9 @@ nvptx_generate_vector_shuffle (location_t loc,
   gimplify_assign (dest_var, call, seq);
 }
 
+/* Insert code to locklessly update  *PTR with *PTR OP VAR just before
+   GSI.  */
+
 static tree
 nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
 		       tree ptr, tree var, tree_code op)
@@ -4303,24 +4303,14 @@ nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
   tree_code code = NOP_EXPR;
   tree type = unsigned_type_node;
 
-  switch (TYPE_MODE (TREE_TYPE (var)))
-    {
-    case SFmode:
-      code = VIEW_CONVERT_EXPR;
-      /* FALLTHROUGH */
-    case SImode:
-      break;
+  enum machine_mode mode = TYPE_MODE (TREE_TYPE (var));
 
-    case DFmode:
-      code = VIEW_CONVERT_EXPR;
-      /* FALLTHROUGH  */
-    case DImode:
-      type = long_long_unsigned_type_node;
+  if (!INTEGRAL_MODE_P (mode))
+    code = VIEW_CONVERT_EXPR;
+  if (GET_MODE_SIZE (mode) == GET_MODE_SIZE (DImode))
+    {
       fn = NVPTX_BUILTIN_CMP_SWAPLL;
-      break;
-
-    default:
-      gcc_unreachable ();
+      type = long_long_unsigned_type_node;
     }
 
   gimple_seq init_seq = NULL;
@@ -4354,21 +4344,26 @@ nvptx_lockless_update (location_t loc, gimple_stmt_iterator *gsi,
   /* Split the block just after the init stmts.  */
   basic_block pre_bb = gsi_bb (*gsi);
   edge pre_edge = split_block (pre_bb, init_end);
-  basic_block post_bb = pre_edge->dest;
+  basic_block loop_bb = pre_edge->dest;
+  pre_bb = pre_edge->src;
   /* Reset the iterator.  */
   *gsi = gsi_for_stmt (gsi_stmt (*gsi));
 
-  basic_block loop_bb = create_empty_bb (pre_bb);
-  gimple_stmt_iterator loop_gsi = gsi_start_bb (loop_bb);
-  gsi_insert_seq_after (&loop_gsi, loop_seq, GSI_CONTINUE_LINKING);
+  /* Insert the loop statements.  */
+  gimple *loop_end = gimple_seq_last (loop_seq);
+  gsi_insert_seq_before (gsi, loop_seq, GSI_SAME_STMT);
 
-  make_edge (loop_bb, post_bb, EDGE_TRUE_VALUE);
-  redirect_edge_succ (pre_edge, loop_bb);
+  /* Split the block just after the loop stmts.  */
+  edge post_edge = split_block (loop_bb, loop_end);
+  basic_block post_bb = post_edge->dest;
+  loop_bb = post_edge->src;
+  *gsi = gsi_for_stmt (gsi_stmt (*gsi));
+
+  post_edge->flags ^= EDGE_TRUE_VALUE | EDGE_FALLTHRU;
   edge loop_edge = make_edge (loop_bb, loop_bb, EDGE_FALSE_VALUE);
-  add_bb_to_loop (loop_bb, pre_bb->loop_father);
   set_immediate_dominator (CDI_DOMINATORS, loop_bb, pre_bb);
   set_immediate_dominator (CDI_DOMINATORS, post_bb, loop_bb);
-  
+
   gphi *phi = create_phi_node (expect_var, loop_bb);
   add_phi_arg (phi, init_var, pre_edge, loc);
   add_phi_arg (phi, actual_var, loop_edge, loc);
@@ -4455,34 +4450,38 @@ nvptx_goacc_reduction_init (gcall *call)
       gimple_seq_add_stmt (&seq, cond_stmt);
 
       /* Split the block just after the call.  */
-      basic_block call_bb = gsi_bb (gsi);
-      edge nop_edge = split_block (call_bb, call);
-      basic_block dst_bb = nop_edge->dest;
+      edge init_edge = split_block (gsi_bb (gsi), call);
+      basic_block init_bb = init_edge->dest;
+      basic_block call_bb = init_edge->src;
 
-      /* Create the initialization block.  */
+      /* Fixup flags from call_bb to init_bb.  */
+      init_edge->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+      
+      /* Set the initialization stmts.  */
       gimple_seq init_seq = NULL;
       tree init_var = make_ssa_name (TREE_TYPE (var));
       gimplify_assign (init_var, init, &init_seq);
-      /* One would think create_basic_block is the right thing to use
-	 here to create a new BB and set its gimple sequence.  Sadly
-	 that doesn't set the stmts' bb field :(  */
-      basic_block init_bb = create_empty_bb (call_bb);
-      gimple_stmt_iterator init_gsi = gsi_start_bb (init_bb);
-      gsi_insert_seq_after (&init_gsi, init_seq, GSI_CONTINUE_LINKING);
-
-      /* Link the init block in between the call and dst blocks.  */
-      make_edge (call_bb, init_bb, EDGE_TRUE_VALUE);
-      edge init_edge = make_edge (init_bb, dst_bb, EDGE_FALLTHRU);
-      add_bb_to_loop (init_bb, call_bb->loop_father);
-      set_immediate_dominator (CDI_DOMINATORS, init_bb, call_bb);
-
-      /* Mark the edge linking call to dst to non-fallthrough false edge.  */
-      nop_edge->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+      gsi = gsi_start_bb (init_bb);
+      gsi_insert_seq_before (&gsi, init_seq, GSI_SAME_STMT);
+
+      /* Split block just after the init stmt.  */
+      gsi_prev (&gsi);
+      edge inited_edge = split_block (gsi_bb (gsi), gsi_stmt (gsi));
+      basic_block dst_bb = inited_edge->dest;
       
+      /* Create false edge from call_bb to dst_bb.  */
+      edge nop_edge = make_edge (call_bb, dst_bb, EDGE_FALSE_VALUE);
+
       /* Create phi node in dst block.  */
       gphi *phi = create_phi_node (lhs, dst_bb);
-      add_phi_arg (phi, init_var, init_edge, gimple_location (call));
+      add_phi_arg (phi, init_var, inited_edge, gimple_location (call));
       add_phi_arg (phi, var, nop_edge, gimple_location (call));
+
+      /* Reset dominator of dst bb.  */
+      set_immediate_dominator (CDI_DOMINATORS, dst_bb, call_bb);
+
+      /* Reset the gsi.  */
+      gsi = gsi_for_stmt (call);
     }
   else
     {


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [gomp4] Re: [3/3] OpenACC reductions
  2015-11-02 16:38 ` [3/3] " Nathan Sidwell
  2015-11-04 10:03   ` Jakub Jelinek
@ 2015-11-06 10:49   ` Thomas Schwinge
  1 sibling, 0 replies; 26+ messages in thread
From: Thomas Schwinge @ 2015-11-06 10:49 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Cesar Philippidis, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2588 bytes --]

Hi Nathan!

On Mon, 2 Nov 2015 11:38:47 -0500, Nathan Sidwell <nathan@acm.org> wrote:
> This patch are the initial set of tests.  The libgomp tests use an idiom of 
> summing thread identifiers and then checking the expected set of threads 
> participated.  They are all derived from the loop tests I recently added for the 
> execution model itself.
> 
> The fortran test was duplicated in both the gfortran testsuite and the libgomp 
> testsuite.   I deleted it from the former.  It was slightly bogus as it asked 
> for a vector-length of 40, and appeared to be working by accident by not 
> actually partitioning the loop.  I fixed that up

On gomp-4_0-branch, you had modified/XFAILed (ICE) that test in r228955,
<http://news.gmane.org/find-root.php?message_id=%3C56240637.6040601%40acm.org%3E>
-- which still needs to be resolved, so I left that as-is, that is, did
not delete the gcc/testsuite/gfortran.dg/goacc/reduction-2.f95 file in
the merge commit.

> and reworked it to avoid 
> needing a reduction on a reference variable.  Reference handling will be a later 
> patch.

As that is -- apparently -- functional on gomp-4_0-branch, I also left
the libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 file as-is;
it's also doing more elaborate testing in its gomp-4_0-branch variant.

Merged your trunk r229769 and r229770 into gomp-4_0-branch in r229837,
effectively just adding your new libgomp testsuite files unmodified:

commit a222b569f0234d219fec69cd13b66446f664440d
Merge: 089a022 06d6724
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Nov 6 09:40:44 2015 +0000

    svn merge -r 229768:229770 svn+ssh://gcc.gnu.org/svn/gcc/trunk
    
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@229837 138bc75d-0d04-0410-961f-82ee72b054a4

 gcc/testsuite/ChangeLog                            |  4 ++
 libgomp/ChangeLog                                  | 11 ++++
 .../libgomp.oacc-c-c++-common/loop-red-g-1.c       | 54 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-gwv-1.c     | 56 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-v-1.c       | 56 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-v-2.c       | 59 ++++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-w-1.c       | 54 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-w-2.c       | 57 +++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/loop-red-wv-1.c      | 54 ++++++++++++++++++++
 9 files changed, 405 insertions(+)


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [gomp4] private reductions
@ 2016-01-07  3:55     ` Cesar Philippidis
  2016-01-07 16:53       ` Cesar Philippidis
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Cesar Philippidis @ 2016-01-07  3:55 UTC (permalink / raw)
  To: gcc-patches, james norris

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

This patch updates the way that private reductions are handled in gomp4
to be more like trunk. Before, omp lower was using a complicated mapping
for private variables, but now it's treating them more like omp, with
the exception of reference-type variables. This complication originated
back when we were using ganglocal memory for private variables. Now that
private variables just regular local variables, I was able to remove a
lot of that old code.

It should be noted that reference-type variables still rely on
gimplifier creating a special OMP_CLAUSE_REDUCTION_PRIVATE_DECL, which
is basically a local copy of the reduction variable. Currently this is
used when the reduction variables are dummy arguments in fortran or
parallel (non-loop) reductions. I want to get rid of the
localize_reductions pass from the gimplifier eventually, but for the
time being this patch does fix pr/68813.

In the process of removing removing that old private code, I noticed
that lower_oacc_reductions couldn't handle reductions of the form

  #pragma acc loop reduction (+:v)
  for (...)
    #pragma acc loop reduction (+:v)
     for (...)

That's fixed now. In addition to teaching lower_oacc_reductions about
private variables, I also taught it how to update any intermediate
reduction variable when present. I'll port over this change to trunk
once I've resolved the localize_reductions issue in the gimplifier.

I don't have recent baseline, but I am seeing these failures:

  g++.sum:c-c++-common/goacc/routine-7.c
  libgomp.oacc-c/../libgomp.oacc-c-c++-common/declare-4.c

I'll work on routine-7.c tomorrow. Jim, can you look at the declare-4.c
failure?

This patch has been applied to gomp-4_0-branch.

Cesar

[-- Attachment #2: pr68813c.diff --]
[-- Type: text/x-patch, Size: 12838 bytes --]

2016-01-06  Cesar Philippidis  <cesar@codesourcery.com>

	PR other/68813

	gcc/
	* omp-low.c (is_oacc_reduction_private): Delete.
	(build_outer_var_ref): Remove special handling for private reductions
	in openacc.
	(scan_sharing_clauses): Likewise.
	(lower_rec_input_clauses): Likewise.
	(lower_oacc_reductions): Update support for private reductions.

	libgomp/
	* testsuite/libgomp.oacc-fortran/pr68813.f90: New test.
	* testsuite/libgomp.oacc-fortran/reduction-7.f90: New test.


diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8a6dc5e..e11cefc 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -316,53 +316,6 @@ is_oacc_kernels (omp_context *ctx)
 	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
 }
 
-/* Return true if VAR is a is private reduction variable.  A reduction
-   variable is considered private if the variable is local to the
-   offloaded region, or if it is the first reduction to use a mapped
-   variable.  E.g., if V is mapped as 'copy', and loops L1 and L2 contain
-   reductions on V, and L2 is nested inside L1, V is not private in L1
-   but is private in L2.  */
-
-static bool
-is_oacc_reduction_private (tree var, omp_context *ctx, bool initial = true)
-{
-  tree c, clauses, decl;
-
-  if (ctx == NULL || !is_gimple_omp_oacc (ctx->stmt))
-    return true;
-
-  if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR)
-    clauses = gimple_omp_for_clauses (ctx->stmt);
-  else
-    clauses = gimple_omp_target_clauses (ctx->stmt);
-
-  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-    {
-      switch (OMP_CLAUSE_CODE (c))
-	{
-	case OMP_CLAUSE_PRIVATE:
-	  decl = OMP_CLAUSE_DECL (c);
-	  if (decl == var)
-	    return true;
-	  break;
-	case OMP_CLAUSE_MAP:
-	  decl = OMP_CLAUSE_DECL (c);
-	  if (decl == var)
-	    return false;
-	  break;
-	case OMP_CLAUSE_REDUCTION:
-	  decl = OMP_CLAUSE_DECL (c);
-	  if (!initial && decl == var)
-	    return true;
-	  break;
-	default:
-	  break;
-	}
-    }
-
-  return is_oacc_reduction_private (var, ctx->outer, false);
-}
-
 /* If DECL is the artificial dummy VAR_DECL created for non-static
    data member privatization, return the underlying "this" parameter,
    otherwise return NULL.  */
@@ -1323,14 +1276,8 @@ static tree
 build_outer_var_ref (tree var, omp_context *ctx, bool lastprivate = false)
 {
   tree x;
-  tree outer_ref = maybe_lookup_decl_in_outer_ctx (var, ctx);
 
-  if (TREE_CODE (outer_ref) == INDIRECT_REF)
-    {
-      gcc_assert (is_gimple_omp_oacc (ctx->stmt));
-      x = outer_ref;
-    }
-  else if (is_global_var (outer_ref))
+  if (is_global_var (maybe_lookup_decl_in_outer_ctx (var, ctx)))
     x = var;
   else if (is_variable_sized (var))
     {
@@ -1384,26 +1331,9 @@ build_outer_var_ref (tree var, omp_context *ctx, bool lastprivate = false)
 	    x = build_simple_mem_ref (x);
 	}
     }
-  else if (is_oacc_parallel (ctx))
-    x = var;
   else if (ctx->outer)
-    {
-      /* OpenACC may have multiple outer contexts (one per loop).  */
-      if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
-	  && gimple_omp_for_kind (ctx->stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  do
-	    {
-	      ctx = ctx->outer;
-	      x = maybe_lookup_decl (var, ctx);
-	    }
-	  while(!x);
-	}
-      else
-	x = lookup_decl (var, ctx->outer);
-    }
-  else if (is_reference (var)
-	   || get_oacc_fn_attrib (current_function_decl))
+    x = lookup_decl (var, ctx->outer);
+  else if (is_reference (var))
     /* This can happen with orphaned constructs.  If var is reference, it is
        possible it is shared and as such valid.  */
     x = var;
@@ -2026,9 +1956,8 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 	case OMP_CLAUSE_LINEAR:
 	  decl = OMP_CLAUSE_DECL (c);
 	do_private:
-	  if (!is_gimple_omp_oacc (ctx->stmt)
-	      && (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE
-		  || OMP_CLAUSE_CODE (c) == OMP_CLAUSE_IS_DEVICE_PTR)
+	  if ((OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE
+	       || OMP_CLAUSE_CODE (c) == OMP_CLAUSE_IS_DEVICE_PTR)
 	      && is_gimple_omp_offloaded (ctx->stmt))
 	    {
 	      if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE)
@@ -2060,27 +1989,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 	      else if (!global)
 		install_var_field (decl, by_ref, 3, ctx);
 	    }
-	  if (!is_gimple_omp_oacc (ctx->stmt)
-	      || !is_oacc_reduction_private (decl, ctx))
-	    install_var_local (decl, ctx);
-	  else
-	    {
-	      gcc_assert (is_gimple_omp_oacc (ctx->stmt));
-	      /* This probably needs to be moved further up, next to the OpenMP
-		 OMP_CLAUSE_FIRSTPRIVATE handling, in order to correctly handle
-		 VLAs.  */
-	      if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE)
-		{
-		  install_var_field (decl, (TREE_CODE (TREE_TYPE (decl))
-					    != REFERENCE_TYPE), 3, ctx);
-		  install_var_local (decl, ctx);
-		}
-	      else
-		/* The gimplifier always includes a OMP_CLAUSE_MAP with
-		   each parallel reduction variable.  So don't install a
-		   local variable here.  */
-		gcc_assert (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION);
-	    }
+	  install_var_local (decl, ctx);
 	  break;
 
 	case OMP_CLAUSE_USE_DEVICE:
@@ -2322,9 +2231,8 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 	  decl = OMP_CLAUSE_DECL (c);
 	  if (is_variable_sized (decl))
 	    {
-	      if (!is_gimple_omp_oacc (ctx->stmt)
-		  && (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE
-		      || OMP_CLAUSE_CODE (c) == OMP_CLAUSE_IS_DEVICE_PTR)
+	      if ((OMP_CLAUSE_CODE (c) == OMP_CLAUSE_FIRSTPRIVATE
+		   || OMP_CLAUSE_CODE (c) == OMP_CLAUSE_IS_DEVICE_PTR)
 		  && is_gimple_omp_offloaded (ctx->stmt))
 		{
 		  tree decl2 = DECL_VALUE_EXPR (decl);
@@ -2336,11 +2244,9 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 		}
 	      install_var_local (decl, ctx);
 	    }
-	  if (!is_gimple_omp_oacc (ctx->stmt)
-	      || !is_oacc_reduction_private (decl, ctx))
-	    fixup_remapped_decl (decl, ctx,
-				 OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE
-				 && OMP_CLAUSE_PRIVATE_DEBUG (c));
+	  fixup_remapped_decl (decl, ctx,
+			       OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE
+			       && OMP_CLAUSE_PRIVATE_DEBUG (c));
 	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_LINEAR
 	      && OMP_CLAUSE_LINEAR_GIMPLE_SEQ (c))
 	    scan_array_reductions = true;
@@ -2352,9 +2258,7 @@ scan_sharing_clauses (tree clauses, omp_context *ctx,
 	    {
 	      if (is_variable_sized (decl))
 		install_var_local (decl, ctx);
-	      if (!(is_gimple_omp_oacc (ctx->stmt)
-		    && is_oacc_reduction_private (decl, ctx)))
-		fixup_remapped_decl (decl, ctx, false);
+	      fixup_remapped_decl (decl, ctx, false);
 	    }
 	  if (OMP_CLAUSE_REDUCTION_PLACEHOLDER (c))
 	    scan_array_reductions = true;
@@ -4614,14 +4518,7 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 	      new_var = var;
 	    }
 	  if (c_kind != OMP_CLAUSE_COPYIN)
-	    {
-	      /* Not all OpenACC reductions require new mappings.  */
-	      if (is_gimple_omp_oacc (ctx->stmt)
-		  && (new_var = maybe_lookup_decl (var, ctx)) == NULL)
-		new_var = var;
-	      else
-		new_var = lookup_decl (var, ctx);
-	    }
+	    new_var = lookup_decl (var, ctx);
 
 	  if (c_kind == OMP_CLAUSE_SHARED || c_kind == OMP_CLAUSE_COPYIN)
 	    {
@@ -5317,13 +5214,7 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 		}
 	      else
 		{
-		  tree type;
-		  if (is_oacc_parallel (ctx) && is_reference (var))
-		    type = TREE_TYPE (TREE_TYPE (new_var));
-		  else
-		    type = TREE_TYPE (new_var);
-
-		  x = omp_reduction_init (c, type);
+		  x = omp_reduction_init (c, TREE_TYPE (new_var));
 		  gcc_assert (TREE_CODE (TREE_TYPE (new_var)) != ARRAY_TYPE);
 		  enum tree_code code = OMP_CLAUSE_REDUCTION_CODE (c);
 
@@ -5689,7 +5580,9 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  var = maybe_lookup_decl (orig, ctx);
 	if (!var)
 	  var = orig;
-	gcc_assert (!is_reference (var));
+
+	if (is_reference (var))
+	  var = build_simple_mem_ref (var);
 
 	incoming = outgoing = var;
 	
@@ -5731,29 +5624,55 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  do_lookup:
 	    /* This is the outermost construct with this reduction,
 	       see if there's a mapping for it.  */
-	    if (gimple_code (outer->stmt) == GIMPLE_OMP_TARGET
-		&& maybe_lookup_field (orig, outer))
+	    if (gimple_code (outer->stmt) == GIMPLE_OMP_TARGET)
 	      {
-		ref_to_res = build_receiver_ref (orig, false, outer);
-		if (is_reference (orig))
-		  ref_to_res = build_simple_mem_ref (ref_to_res);
+		bool is_private = false;
+		bool is_mapped = false;
+
+		/* Check for a private or firstprivate mapping.  */
+		for (tree cls = gimple_omp_target_clauses (outer->stmt);
+		     cls; cls = OMP_CLAUSE_CHAIN (cls))
+		  {
+		    if ((OMP_CLAUSE_CODE (cls) == OMP_CLAUSE_PRIVATE
+			 || OMP_CLAUSE_CODE (cls) == OMP_CLAUSE_FIRSTPRIVATE)
+			&& OMP_CLAUSE_DECL (cls) == orig)
+		      {
+			tree t = lookup_decl (orig, ctx->outer);
+			if (is_reference (t))
+			  incoming = outgoing = build_simple_mem_ref (t);
+			else
+			  incoming = outgoing = t;
+			is_private = true;
+			break;
+		      }
+		  }
 
-		outgoing = var;
-		incoming = omp_reduction_init_op (loc, rcode, TREE_TYPE (var));
+		/* Check for a data mapping.  */
+		if (!is_private && maybe_lookup_field (orig, outer))
+		  {
+		    ref_to_res = build_receiver_ref (orig, false, outer);
+
+		    if (is_reference (orig))
+		      ref_to_res = build_simple_mem_ref (ref_to_res);
+
+		    incoming = omp_reduction_init_op (loc, rcode,
+						      TREE_TYPE (var));
+		    outgoing = var;
+		    is_mapped = true;
+		  }
+
+		/* Update incoming and outgoing for reduction variables
+		   local to the offloaded region.  */
+		if (!is_private && !is_mapped)
+		  incoming = outgoing = orig;
 	      }
-	    /* This is enabled on trunk, but has been disabled in the merge of
-	       trunk r229767 into gomp-4_0-branch, as otherwise there were a
-	       lot of regressions in libgomp reduction execution tests.  It is
-	       unclear if the problem is in the tests themselves, or here, or
-	       elsewhere.  Given the usage of "var =
-	       OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c)" on gomp-4_0-branch, maybe
-	       we have to consider that here, too, instead of "orig"?  */
-#if 0
 	    else
 	      incoming = outgoing = orig;
-#endif
-	      
-	  has_outer_reduction:;
+
+	  has_outer_reduction:
+	    /* We found a reduction variable used by another reduction.  */
+	    if (gimple_code (outer->stmt) != GIMPLE_OMP_TARGET)
+	      incoming = outgoing = lookup_decl (orig, ctx->outer);
 	  }
 
 	if (!ref_to_res)
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/pr68813.f90 b/libgomp/testsuite/libgomp.oacc-fortran/pr68813.f90
new file mode 100644
index 0000000..735350f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/pr68813.f90
@@ -0,0 +1,19 @@
+program foo
+  implicit none
+  integer, parameter :: n = 100
+  integer, dimension(n,n) :: a
+  integer :: i, j, sum = 0
+
+  a = 1
+
+  !$acc parallel copyin(a(1:n,1:n)) firstprivate (sum)
+  !$acc loop gang reduction(+:sum)
+  do i=1, n
+     !$acc loop vector reduction(+:sum)
+     do j=1, n
+        sum = sum + a(i, j)
+     enddo
+  enddo
+  !$acc end parallel
+
+end program foo
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
new file mode 100644
index 0000000..e80004d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-7.f90
@@ -0,0 +1,60 @@
+! { dg-do run }
+! { dg-additional-options "-w" }
+
+! subroutine reduction with firstprivate variables
+
+program reduction
+  integer, parameter    :: n = 100
+  integer               :: i, j, vsum, cs, arr(n)
+
+  call redsub_bogus (cs, n)
+  call redsub_combined (cs, n, arr)
+
+  vsum = 0
+
+  ! Verify the results
+  do i = 1, n
+     vsum = i
+     do j = 1, n
+        vsum = vsum + 1;
+     end do
+     if (vsum .ne. arr(i)) call abort ()
+  end do
+end program reduction
+
+! Bogus reduction on an impliclitly firstprivate variable.  The results do
+! survive the parallel region.  The goal here is to ensure that gfortran
+! doesn't ICE.
+
+subroutine redsub_bogus(sum, n)
+  integer :: sum, n, arr(n)
+  integer :: i
+
+  !$acc parallel
+  !$acc loop gang worker vector reduction (+:sum)
+  do i = 1, n
+     sum = sum + 1
+  end do
+  !$acc end parallel
+end subroutine redsub_bogus
+
+! This reduction involving a firstprivate variable yields legitimate results.
+
+subroutine redsub_combined(sum, n, arr)
+  integer :: sum, n, arr(n)
+  integer :: i, j
+
+  !$acc parallel copy (arr)
+  !$acc loop gang
+  do i = 1, n
+     sum = i;
+
+     !$acc loop reduction(+:sum)
+     do j = 1, n
+        sum = sum + 1
+     end do
+
+     arr(i) = sum
+  end do
+  !$acc end parallel
+end subroutine redsub_combined

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [gomp4] private reductions
  2016-01-07  3:55     ` [gomp4] private reductions Cesar Philippidis
@ 2016-01-07 16:53       ` Cesar Philippidis
  2016-01-09  1:14       ` Cesar Philippidis
  2016-01-11 12:10       ` Thomas Schwinge
  2 siblings, 0 replies; 26+ messages in thread
From: Cesar Philippidis @ 2016-01-07 16:53 UTC (permalink / raw)
  To: gcc-patches, james norris

[-- Attachment #1: Type: text/plain, Size: 668 bytes --]

On 01/06/2016 07:55 PM, Cesar Philippidis wrote:

> I don't have recent baseline, but I am seeing these failures:
> 
>   g++.sum:c-c++-common/goacc/routine-7.c
>   libgomp.oacc-c/../libgomp.oacc-c-c++-common/declare-4.c
> 
> I'll work on routine-7.c tomorrow. Jim, can you look at the declare-4.c
> failure?

Reductions inside routines are a special case because they may not have
outer omp contexts, so in that case the original reduction variable
which needs to be updated by the finalizer is the one specified in the
reduction clause. The fix for routine-7.c involved making
lower_oacc_reductions aware of that.

I've applied this patch to gomp-4_0-branch.

Cesar


[-- Attachment #2: routine7-fix.diff --]
[-- Type: text/x-patch, Size: 804 bytes --]

2016-01-07  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* omp-low.c (lower_oacc_reductions): Properly handle reductions
	inside routines.


diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 55b5da3..1a3c27e 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -5672,7 +5672,12 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
 	  has_outer_reduction:
 	    /* We found a reduction variable used by another reduction.  */
 	    if (gimple_code (outer->stmt) != GIMPLE_OMP_TARGET)
-	      incoming = outgoing = lookup_decl (orig, ctx->outer);
+	      {
+		/* There may be no outer omp context if this reduction is
+		   inside a routine.  */
+		incoming = outgoing = (ctx->outer == NULL)
+		  ? orig : lookup_decl (orig, ctx->outer);
+	      }
 	  }
 
 	if (!ref_to_res)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [gomp4] private reductions
  2016-01-07  3:55     ` [gomp4] private reductions Cesar Philippidis
  2016-01-07 16:53       ` Cesar Philippidis
@ 2016-01-09  1:14       ` Cesar Philippidis
  2016-01-11 12:10       ` Thomas Schwinge
  2 siblings, 0 replies; 26+ messages in thread
From: Cesar Philippidis @ 2016-01-09  1:14 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1127 bytes --]

On 01/06/2016 07:55 PM, Cesar Philippidis wrote:

> @@ -1384,26 +1331,9 @@ build_outer_var_ref (tree var, omp_context *ctx, bool lastprivate = false)
>  	    x = build_simple_mem_ref (x);
>  	}
>      }
> -  else if (is_oacc_parallel (ctx))
> -    x = var;
>    else if (ctx->outer)
> -    {
> -      /* OpenACC may have multiple outer contexts (one per loop).  */
> -      if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
> -	  && gimple_omp_for_kind (ctx->stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
> -	{
> -	  do
> -	    {
> -	      ctx = ctx->outer;
> -	      x = maybe_lookup_decl (var, ctx);
> -	    }
> -	  while(!x);
> -	}
> -      else
> -	x = lookup_decl (var, ctx->outer);
> -    }

It turns out that this is still necessary for openacc. The attached
patch reverts this particular change. This fixes a bug that I
encountered in a program I which I didn't have time to reduce.
Basically, that program was doing something like this

  #pragma acc loop vector(length:var)

where var is some variable declared outside of a kernels region.  I'll
apply the patch to gomp4 now, and add a couple of more test cases next week.

Cesar


[-- Attachment #2: nested_ctx.diff --]
[-- Type: text/x-patch, Size: 971 bytes --]

2016-01-08  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* omp-low.c (build_outer_var_ref): Recursively scan for decls
	in outer omp contexts.


diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 69dabfe..98422fb 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1332,7 +1332,21 @@ build_outer_var_ref (tree var, omp_context *ctx, bool lastprivate = false)
 	}
     }
   else if (ctx->outer)
-    x = lookup_decl (var, ctx->outer);
+    {
+      /* OpenACC may have multiple outer contexts (one per loop).  */
+      if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
+	  && gimple_omp_for_kind (ctx->stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+	{
+	  do
+	    {
+	      ctx = ctx->outer;
+	      x = maybe_lookup_decl (var, ctx);
+	    }
+	  while(!x);
+	}
+      else
+	x = lookup_decl (var, ctx->outer);
+    }
   else if (is_reference (var))
     /* This can happen with orphaned constructs.  If var is reference, it is
        possible it is shared and as such valid.  */

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [gomp4] private reductions
  2016-01-07  3:55     ` [gomp4] private reductions Cesar Philippidis
  2016-01-07 16:53       ` Cesar Philippidis
  2016-01-09  1:14       ` Cesar Philippidis
@ 2016-01-11 12:10       ` Thomas Schwinge
  2016-01-11 14:55         ` Cesar Philippidis
  2 siblings, 1 reply; 26+ messages in thread
From: Thomas Schwinge @ 2016-01-11 12:10 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches; +Cc: Nathan Sidwell, james norris

[-- Attachment #1: Type: text/plain, Size: 4115 bytes --]

Hi!

On Wed, 6 Jan 2016 19:55:02 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
> This patch updates the way that private reductions are handled in gomp4
> to be more like trunk.

Anything to commit to trunk (test cases at least?)?


> This patch has been applied to gomp-4_0-branch.

> 	PR other/68813

Can now close that one?


> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c

> @@ -5731,29 +5624,55 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,

> -	    /* This is enabled on trunk, but has been disabled in the merge of
> -	       trunk r229767 into gomp-4_0-branch, as otherwise there were a
> -	       lot of regressions in libgomp reduction execution tests.  It is
> -	       unclear if the problem is in the tests themselves, or here, or
> -	       elsewhere.  Given the usage of "var =
> -	       OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c)" on gomp-4_0-branch, maybe
> -	       we have to consider that here, too, instead of "orig"?  */
> -#if 0
>  	    else
>  	      incoming = outgoing = orig;
> -#endif

Ah, nice -- does your patch completely resolve the issue (that is, remove
the code divergence between trunk and gomp-4_0-branch), that I once asked
Nathan to look into,
<http://news.gmane.org/find-root.php?message_id=%3C878u6bz9p4.fsf%40kepler.schwinge.homeip.net%3E>?


With you patch applied, I'm seeing some more progressions, so in r232222,
I committed the following to gomp-4_0-branch:

commit 266cf901de75f6cc2d5a7a03635050a0a9b255f1
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Jan 11 12:09:14 2016 +0000

    Remove stale XFAIL markers
    
    	gcc/testsuite/
    	* gfortran.dg/goacc/reduction-3.f95: Remove XFAIL marker.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c: Remove
    	XFAIL marker.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@232222 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/testsuite/ChangeLog.gomp                                 | 2 ++
 gcc/testsuite/gfortran.dg/goacc/reduction-3.f95              | 4 ----
 libgomp/ChangeLog.gomp                                       | 5 +++++
 libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c | 3 ---
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git gcc/testsuite/ChangeLog.gomp gcc/testsuite/ChangeLog.gomp
index 2db11df..1634a7c 100644
--- gcc/testsuite/ChangeLog.gomp
+++ gcc/testsuite/ChangeLog.gomp
@@ -1,5 +1,7 @@
 2016-01-11  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* gfortran.dg/goacc/reduction-3.f95: Remove XFAIL marker.
+
 	* c-c++-common/goacc-gomp/nesting-fail-1.c: Add OpenACC declare
 	directive for "i".
 
diff --git gcc/testsuite/gfortran.dg/goacc/reduction-3.f95 gcc/testsuite/gfortran.dg/goacc/reduction-3.f95
index d7eeb95..c0f6cbf 100644
--- gcc/testsuite/gfortran.dg/goacc/reduction-3.f95
+++ gcc/testsuite/gfortran.dg/goacc/reduction-3.f95
@@ -1,7 +1,3 @@
-! { dg-xfail-if "" { *-*-* } { "*" } { "" } }
-! { dg-excess-errors "internal compiler error" }
-! { dg-do compile  }
-
 program reduction
   integer, parameter    :: n = 40, c = 10
   integer               :: i, sum
diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index efdbfd3..cf91c0a 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2016-01-11  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c: Remove
+	XFAIL marker.
+
 2016-01-07  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c-c++-common/reduction-6.c: New test.
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c
index a186872..672e412 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-2.c
@@ -1,6 +1,3 @@
-/* { dg-do run } */
-/* { dg-xfail-run-if "TODO" { *-*-* } } */
-
 #include  <openacc.h>
 
 int main ()


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [gomp4] private reductions
  2016-01-11 12:10       ` Thomas Schwinge
@ 2016-01-11 14:55         ` Cesar Philippidis
  0 siblings, 0 replies; 26+ messages in thread
From: Cesar Philippidis @ 2016-01-11 14:55 UTC (permalink / raw)
  To: Thomas Schwinge, gcc-patches; +Cc: Nathan Sidwell, james norris

On 01/11/2016 04:10 AM, Thomas Schwinge wrote:

> On Wed, 6 Jan 2016 19:55:02 -0800, Cesar Philippidis <cesar@codesourcery.com> wrote:
>> This patch updates the way that private reductions are handled in gomp4
>> to be more like trunk.
> 
> Anything to commit to trunk (test cases at least?)?

I could possibly apply the support nested acc loops using the same
reduction variable. But I'd like to hold off a little longer until I
have a better solution for reference variables first.

>> This patch has been applied to gomp-4_0-branch.
> 
>> 	PR other/68813
> 
> Can now close that one?

Thanks for reminding me. I closed it.

>> --- a/gcc/omp-low.c
>> +++ b/gcc/omp-low.c
> 
>> @@ -5731,29 +5624,55 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner,
> 
>> -	    /* This is enabled on trunk, but has been disabled in the merge of
>> -	       trunk r229767 into gomp-4_0-branch, as otherwise there were a
>> -	       lot of regressions in libgomp reduction execution tests.  It is
>> -	       unclear if the problem is in the tests themselves, or here, or
>> -	       elsewhere.  Given the usage of "var =
>> -	       OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c)" on gomp-4_0-branch, maybe
>> -	       we have to consider that here, too, instead of "orig"?  */
>> -#if 0
>>  	    else
>>  	      incoming = outgoing = orig;
>> -#endif
> 
> Ah, nice -- does your patch completely resolve the issue (that is, remove
> the code divergence between trunk and gomp-4_0-branch), that I once asked
> Nathan to look into,
> <http://news.gmane.org/find-root.php?message_id=%3C878u6bz9p4.fsf%40kepler.schwinge.homeip.net%3E>?

Yes and no. It's better in the sense that firstprivate and private are
now mostly identical with the exception of this change to
build_outer_var_ref
<https://gcc.gnu.org/ml/gcc-patches/2016-01/msg00497.html>. Other than
that, lower_oacc_reductions in gomp4 has changes for nested reductions.

> With you patch applied, I'm seeing some more progressions, so in r232222,
> I committed the following to gomp-4_0-branch:

Thank you for doing that. I'm horrible at keeping xfails up to date.

Cesar

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [1/3] OpenACC reductions
  2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
                     ` (2 preceding siblings ...)
  2015-11-06 10:47   ` [gomp4] " Thomas Schwinge
@ 2021-08-09 11:37   ` Thomas Schwinge
  3 siblings, 0 replies; 26+ messages in thread
From: Thomas Schwinge @ 2021-08-09 11:37 UTC (permalink / raw)
  To: gcc-patches

[-- Attachment #1: Type: text/plain, Size: 1189 bytes --]

Hi!

On 2015-11-02T11:18:37-0500, Nathan Sidwell <nathan@acm.org> wrote:
> This is the core execution bits of OpenACC reductions.
>
> We have a new internal fn 'IFN_GOACC_REDUCTION' and a new target hook
> goacc.reduction, to lower it on the target compiler.

Yay for proper abstraction!  Long ago committed in r229767 (Git
commit e50146711b7200e8f822c6d8239430c682b76e4f).

> The omp-low changes are:
> 1) remove all the existing OpenACC reduction handling

One more:

> --- gcc/omp-low.c     (revision 229667)
> +++ gcc/omp-low.c     (working copy)

> -[...]
> -       /* Determine if this is kernel will be executed on the host.  */
> -       call = builtin_decl_explicit (BUILT_IN_ACC_GET_DEVICE_TYPE);
> -[...]

Pushed "[OpenACC] Clean up unused 'BUILT_IN_ACC_GET_DEVICE_TYPE'" to
master branch in commit 06870af3e48daf523a973981f053ee5c6f44c871, see
attached.


Grüße
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-OpenACC-Clean-up-unused-BUILT_IN_ACC_GET_DEVICE_TYPE.patch --]
[-- Type: text/x-diff, Size: 1101 bytes --]

From 06870af3e48daf523a973981f053ee5c6f44c871 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Mon, 2 Aug 2021 18:34:47 +0200
Subject: [PATCH] [OpenACC] Clean up unused 'BUILT_IN_ACC_GET_DEVICE_TYPE'

Unused as of r229767 (Git commit e50146711b7200e8f822c6d8239430c682b76e4f)
"OpenACC reductions".

	gcc/
	* omp-builtins.def (BUILT_IN_ACC_GET_DEVICE_TYPE): Remove.
---
 gcc/omp-builtins.def | 2 --
 1 file changed, 2 deletions(-)

diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index 97964f866ec..4a7e7badd7e 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -29,8 +29,6 @@ along with GCC; see the file COPYING3.  If not see
 /* The reason why they aren't in gcc/builtins.def is that the Fortran front end
    doesn't source those.  */
 
-DEF_GOACC_BUILTIN (BUILT_IN_ACC_GET_DEVICE_TYPE, "acc_get_device_type",
-		   BT_FN_INT, ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_START, "GOACC_data_start",
 		   BT_FN_VOID_INT_SIZE_PTR_PTR_PTR, ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DATA_END, "GOACC_data_end",
-- 
2.30.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2021-08-09 11:37 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-02 16:10 [0/3] OpenACC reductions Nathan Sidwell
2015-10-18 23:20 ` [gomp4] fortran testcase Nathan Sidwell
2015-11-02 16:18 ` [1/3] OpenACC reductions Nathan Sidwell
2015-11-03 15:46   ` Jakub Jelinek
2015-11-03 16:02     ` Nathan Sidwell
2015-11-04 10:31       ` Jakub Jelinek
2015-11-04 13:58         ` Nathan Sidwell
2015-11-04 14:08           ` Jakub Jelinek
2015-11-04  9:59   ` Jakub Jelinek
2015-11-06 10:47   ` [gomp4] " Thomas Schwinge
2016-01-07  3:55     ` [gomp4] private reductions Cesar Philippidis
2016-01-07 16:53       ` Cesar Philippidis
2016-01-09  1:14       ` Cesar Philippidis
2016-01-11 12:10       ` Thomas Schwinge
2016-01-11 14:55         ` Cesar Philippidis
2021-08-09 11:37   ` [1/3] OpenACC reductions Thomas Schwinge
2015-11-02 16:35 ` [2/3] " Nathan Sidwell
2015-11-04 10:01   ` Jakub Jelinek
2015-11-04 13:57     ` Nathan Sidwell
2015-11-04 13:27   ` Bernd Schmidt
2015-11-04 14:09     ` Nathan Sidwell
2015-11-04 16:59     ` Nathan Sidwell
2015-11-06 10:48       ` [gomp4] " Thomas Schwinge
2015-11-02 16:38 ` [3/3] " Nathan Sidwell
2015-11-04 10:03   ` Jakub Jelinek
2015-11-06 10:49   ` [gomp4] " Thomas Schwinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).