public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [gomp4] OpenACC vector and worker reductions
@ 2015-07-17 18:26 Cesar Philippidis
  2015-07-17 18:26 ` [gomp4] OpenACC reduction tests Cesar Philippidis
  0 siblings, 1 reply; 7+ messages in thread
From: Cesar Philippidis @ 2015-07-17 18:26 UTC (permalink / raw)
  To: gcc-patches, Nathan Sidwell, Jakub Jelinek; +Cc: cesar

[-- Attachment #1: Type: text/plain, Size: 4935 bytes --]

This patch adds support for OpenACC vector and worker reductions in a
target-independent fashion. It adds quite a bit of machinery to
accomplish that goal. For starters, three internal functions,
GOACC_REDUCTION_INIT, GOACC_REDUCTION and GOACC_REDUCTION_WRITEBACK,
have been introduced. It's probably easiest to explain all of the
changes with an example. Given an acc loop reduction as follows

  red = ...

  #pragma acc loop reduction (+:red) vector
  for (...)
    red++;

the OpenMP way to lower this reduction would be to introduce a new
private variable for 'red', which I'll call red.private. That private
reduction variable gets initialized with some value depending on the
reduction operation. All of the references to the original reduction
variable inside the loop get replaced with the private copy. Immediately
after the loop exits, the original reduction variable is atomically
updated with the private copy.

The code ends up looking something as follows:

  red = ...
  red.private = 0;   // initialize red.internal
  #pragma omp for (...)
    red.internal++;
  #pragma omp continue
    red += red.private // this is an atomic operation
  #pragma omp end

Conceptually, this loop may be decomposed into three sections. The first
section is the reduction initializer, the second is the loop, and the
third is the reduction finalizer.

This get a little more complicated in OpenACC. For starters, there are
three levels of parallelism that may be associated with a single acc
loop. When transferring from one level of parallelism to another, some
targets (e.g. nvptx) may require variable state propagation and
predication due to the constraints of static thread scheduling. Nathan
solved that problem, at least from a high-level, by surrounding acc
loops with GOACC_FORK and GOACC_JOIN function markers.

Furthermore, certain targets have hardware limitations preventing
general atomic operations from being utilized. Specifically, spinlocks
may not be used by threads inside the same warp for nvptx targets. In
gcc 6.0, warps corresponds to vectors, which currently contain 32
threads. That said, spinlocks are usable on nvptx targets if only one
thread within a warp is using it. This patch solves this problem by
breaking up the reduction finalizer into two steps -- a parallel
reduction (a call to GOACC_REDUCTION) and a write-back to the original
variable. In OpenACC, the original loop gets lowered into the following
form:

  red = ...
  red.private = GOACC_REDUCTION_INIT (0)
  GOACC_FORK ()
  #pragma omp for (...)
    red.internal++;
  #pragma omp continue
    red.private = GOACC_REDUCTION (gwv_mask, op, red.private)
    GOACC_WRITEBACK ()
    red += red.private // this is an atomic operation
  #pragma omp end
  GOACC_JOIN ()

First of all, the call to GOACC_REDUCTION_INIT is necessary to ensure
that red.private has a value to propagate to all of the threads
associated with that loop. Without it, in situations where there are
more threads than loop iterations, the threads that didn't enter the
body of the loop would not contain a proper initial value, so the
reduction finalizer would be generating bogus results.

Both GOACC_REDUCTION and GOACC_WRITEBACK get evaluated inside the target
compiler by a new fold_oacc_reductions pass. That pass uses
targetm.goacc.fold_reduction to fold GOACC_REDUCTION in a
target-specific way. That pass also removes the GOACC_WRITEBACK marker
and moves the nearest GOACC_JOIN call at it's place if necessary
(worker-only loops are special). This is guaranteed to work because
OpenACC loops are single-entry, single-exit and there is only one
GOACC_WRITEBACK marker per acc loop (there is one GOACC_REDUCTION per
reduction though). Moving the GOACC_JOIN up allows the reduction
write-back to operate in a corresponds 'single' mode. E.g. since this
example executes the body in vector-partitioned mode, the original
reduction variable must be updated in vector-single mode.

There's one more quirk that I encountered while working on this patch.
All dummy args to fortran subroutine are passed by reference. That
causes problems for loop state propagation, because only the pointer
gets propagated, and not the value being pointed to. To get around this,
I taught the gimplifier to introduce a new local copy of the reduction
variable. Now the reduction clause has five operands associated with it,
with the fifth one being new private reduction variable.

In addition to the above machinery, this patch also implements the
fold_reduction hook on nvptx targets to use a tree-reduction for vector
loops. All other reductions on nvptx targets use atomics.

I hopefully ironed out all of the bugs in this patch, but I am rerunning
the entire regression testsuite again. Any comments are welcome. Is this
reduction scheme too nvptx-specific?

I'll post the test cases in a follow up patch because the patch would be
too big for the mailing list otherwise.

Thanks,
Cesar

[-- Attachment #2: vector-reduction.diff --]
[-- Type: text/x-patch, Size: 29902 bytes --]

2015-07-17  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* config/nvptx/nvptx.c (nvptx_generate_vector_shuffle): New function.
	(nvptx_goacc_fold_reduction): New function.
	(TARGET_GOACC_FOLD_REDUCTION): Define hook.
	* doc/tm.texi: Document TARGET_GOACC_FOLD_REDUCTION.
	* doc/tm.texi.in: Add hook for TARGET_GOACC_FOLD_REDUCTION.
	* gimplify.c (struct privatize_reduction): Declare.
	(localize_reductions_r): New function.
	(localize_reductions): New function.
	(gimplify_omp_for): Call localize_reductions for acc loops.
	* internal-fn.c (expand_GOACC_REDUCTION): New function.
	(expand_GOACC_REDUCTION_INIT): New function.
	(expand_GOACC_REDUCTION_WRITEBACK): New function.
	* internal-fn.def (GOACC_REDUCTION): New internal function.
	(GOACC_REDUCTION_INIT): New internal function.
	(GOACC_REDUCTION_WRITEBACK): New internal function.
	* omp-low.c (lower_rec_input_clauses): Use GOACC_REDUCTION_INIT for
	OpenACC reductions.
	(lower_oacc_reductions): New function.
	(lower_reduction_clauses): Use lower_oacc_reductions for OpenACC
	reductions.
	(find_goacc_join): New function.
	(find_enclosing_join): New function.
	(execute_fold_oacc_reductions): New function.
	(class pass_fold_oacc_reductions): New pass.
	(make_pass_fold_oacc_reductions): New function.
	(default_goacc_fold_reduction): New function.
	* optabs.def (oacc_thread_broadcast_optab): Remove.
	* passes.def (pass_fold_oacc_reductions): Use it.
	* target.def (fold_reduction): New target hook.
	* targhooks.h (default_goacc_fold_reduction): Declare.
	* tree-core.h (enum omp_clause_code): Document argument 4 of
	OMP_CLAUSE_REDUCTION.
	* tree-pass.h (make_pass_oacc_fold_reductions): Declare.
	* tree.c (omp_clause_num_ops): Increase the number of reduction clause
	operands by one.
	* tree.h (OMP_CLAUSE_REDUCTION_PRIVATE_DECL): New macro.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b471890..cdfdf00 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -59,6 +59,15 @@
 #include "dominance.h"
 #include "cfg.h"
 #include "omp-low.h"
+#include "fold-const.h"
+#include "stringpool.h"
+#include "internal-fn.h"
+#include "gimple.h"
+#include "gimple-iterator.h"
+#include "gimple-ssa.h"
+#include "tree-phinodes.h"
+#include "ssa-iterators.h"
+#include "tree-ssanames.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -3190,6 +3199,124 @@ nvptx_expand_builtin (tree exp, rtx target ATTRIBUTE_UNUSED,
 
   gcc_unreachable ();
 }
+
+static void
+nvptx_generate_vector_shuffle (tree dest_var, tree var, int shfl,
+			       gimple_stmt_iterator *where)
+{
+  tree vartype = TREE_TYPE (var);
+  tree call_arg_type = unsigned_type_node;
+  tree_code ccode = SCALAR_FLOAT_TYPE_P (vartype)
+    ? VIEW_CONVERT_EXPR : NOP_EXPR;
+  enum nvptx_builtins fn = NVPTX_BUILTIN_SHUFFLE_DOWN;
+
+  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
+    {
+      fn = NVPTX_BUILTIN_SHUFFLE_DOWNLL;
+      call_arg_type = long_long_unsigned_type_node;
+    }
+
+  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
+  tree casted_var = var;
+
+  if (need_conversion)
+    {
+      casted_var = make_ssa_name (call_arg_type);
+      tree t1 = fold_build1 (ccode, call_arg_type, var);
+      gassign *conv1 = gimple_build_assign (casted_var, t1);
+      update_stmt (conv1);
+      gsi_insert_before (where, conv1, GSI_SAME_STMT);
+    }
+
+  tree fndecl = nvptx_builtin_decl (fn, true);
+  tree shift =  build_int_cst (integer_type_node, shfl);
+  gimple call = gimple_build_call (fndecl, 2, casted_var, shift);
+
+  gsi_insert_before (where, call, GSI_SAME_STMT);
+  tree casted_dest = dest_var;
+
+  if (need_conversion)
+    {
+      casted_dest = make_ssa_name (call_arg_type);
+      tree t2 = fold_build1 (ccode, vartype, casted_dest);
+      gassign *conv2 = gimple_build_assign (dest_var, t2);
+      gsi_insert_before (where, conv2, GSI_SAME_STMT);
+      update_stmt (conv2);
+    }
+
+  update_stmt (call);
+  gimple_call_set_lhs (call, casted_dest);
+}
+
+/* Fold an OpenACC vector reduction shuffle down instructions.  */
+
+static void
+nvptx_goacc_fold_reduction (gimple_stmt_iterator *gsi)
+{
+  /* Generate a sequence of instructions to preform a tree reduction using
+     shfl.down as an intermediate step.  */
+
+  gimple call = gsi_stmt (*gsi);
+  tree arg0 = gimple_call_arg (call, 0); // loop mask
+  tree arg1 = gimple_call_arg (call, 1); // reduction op
+  tree arg2 = gimple_call_arg (call, 2); // reduction decl
+  tree type = TREE_TYPE (arg2);
+  unsigned level = TREE_INT_CST_LOW (arg0);
+  enum tree_code code = (enum tree_code) TREE_INT_CST_LOW (arg1);
+  tree lhs = gimple_call_lhs (call);
+
+  /* Nothing to do here is this isn't a vector loop.  */
+  if ((level & OACC_LOOP_MASK (OACC_vector)) == 0)
+    {
+      gassign *g = gimple_build_assign (lhs, arg2);
+      gsi_replace (gsi, g, true);
+      return;
+    }
+
+  tree new_var = arg2;
+  tree t, t2;
+  gassign *g;
+
+  if (code == TRUTH_ANDIF_EXPR)
+    code = BIT_AND_EXPR;
+  else if (code == TRUTH_ORIF_EXPR)
+    code = BIT_IOR_EXPR;
+
+  if (!is_gimple_val (arg0))
+    {
+      new_var = make_ssa_name (type);
+      gassign *g = gimple_build_assign (new_var, arg2);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+    }
+
+  for (int shfl = 16; shfl > 0; shfl = shfl >> 1)
+    {
+      t = make_ssa_name (type);
+      nvptx_generate_vector_shuffle (t, new_var, shfl, gsi);
+      t2 = make_ssa_name (create_tmp_var (type));
+
+      g = gimple_build_assign (t2, fold_build2 (code, type, new_var, t));
+      update_stmt (g);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+
+      new_var = t2;
+    }
+
+  /* Restore the type of the comparison operand.  */
+  if (code == EQ_EXPR || code == NE_EXPR)
+    {
+      type = TREE_TYPE (lhs);
+      t = make_ssa_name (type);
+      t2 = fold_build1 (NOP_EXPR, type, new_var);
+      gassign *g = gimple_build_assign (t, t2);
+      gsi_insert_before (gsi, g, GSI_SAME_STMT);
+      new_var = t;
+    }
+
+  g = gimple_build_assign (lhs, new_var);
+  gsi_replace (gsi, g, false);
+  update_stmt (g);
+}
 \f
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
@@ -3285,6 +3412,9 @@ nvptx_expand_builtin (tree exp, rtx target ATTRIBUTE_UNUSED,
 #undef  TARGET_BUILTIN_DECL
 #define TARGET_BUILTIN_DECL nvptx_builtin_decl
 
+#undef TARGET_GOACC_FOLD_REDUCTION
+#define TARGET_GOACC_FOLD_REDUCTION nvptx_goacc_fold_reduction
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 3dc51c0..cc42998 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5740,6 +5740,14 @@ usable.  In that case, the smaller the number is, the more desirable it is
 to use it.
 @end deftypefn
 
+@deftypefn {Target Hook} void TARGET_GOACC_FOLD_REDUCTION (gimple_stmt_iterator *@var{gsi})
+This hook is used to expand OpenACC reductions represented by calls to
+the internal function @var{GOACC_REDUCTION} into a sequence of gimple
+instructions.  @var{gsi} points to the gimple statement holding the
+ function call.  By default, targets are assumed to be single-threaded
+although that is not a requirement.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 93fb41c..0936516 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4245,6 +4245,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_SIMD_CLONE_USABLE
 
+@hook TARGET_GOACC_FOLD_REDUCTION
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/gimplify.c b/gcc/gimplify.c
index f5ec04a..833e469 100644
--- a/gcc/gimplify.c
+++ b/gcc/gimplify.c
@@ -181,6 +181,11 @@ struct gimplify_omp_ctx
   bool distribute;
 };
 
+struct privatize_reduction
+{
+  tree ref_var, local_var;
+};
+
 static struct gimplify_ctx *gimplify_ctxp;
 static struct gimplify_omp_ctx *gimplify_omp_ctxp;
 
@@ -7292,6 +7297,97 @@ find_combined_omp_for (tree *tp, int *walk_subtrees, void *)
   return NULL_TREE;
 }
 
+/* Helper function for localize_reductions.  Replace all uses of REF_VAR with
+   LOCAL_VAR.  */
+
+static tree
+localize_reductions_r (tree *tp, int *walk_subtrees, void *data)
+{
+  enum tree_code tc = TREE_CODE (*tp);
+  struct privatize_reduction *pr = (struct privatize_reduction *) data;
+
+  if (TYPE_P (*tp))
+    *walk_subtrees = 0;
+
+  switch (tc)
+    {
+    case INDIRECT_REF:
+    case MEM_REF:
+      if (TREE_OPERAND (*tp, 0) == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    case VAR_DECL:
+    case PARM_DECL:
+    case RESULT_DECL:
+      if (*tp == pr->ref_var)
+	*tp = pr->local_var;
+
+      *walk_subtrees = 0;
+      break;
+
+    default:
+      break;
+    }
+
+  return NULL_TREE;
+}
+
+/* OpenACC worker and vector loop state propagation requires reductions
+   to be inside local variables.  This function replaces all reference-type
+   reductions variables associated with the loop with a local copy.  */
+
+static void
+localize_reductions (tree *expr_p)
+{
+  tree clauses = OMP_FOR_CLAUSES (*expr_p);
+  tree c, var, type, new_var;
+  struct privatize_reduction pr;
+  int gwv_cur = 0;
+  int mask_wv = OACC_LOOP_MASK (OACC_worker) | OACC_LOOP_MASK (OACC_vector);
+
+  /* Non-vector and worker reduction do not need to be localized.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      enum omp_clause_code cc = OMP_CLAUSE_CODE (c);
+
+      if (cc == OMP_CLAUSE_GANG)
+	gwv_cur |= OACC_LOOP_MASK (OACC_gang);
+      else if (cc == OMP_CLAUSE_WORKER)
+	gwv_cur |= OACC_LOOP_MASK (OACC_worker);
+      else if (cc == OMP_CLAUSE_VECTOR)
+	gwv_cur |= OACC_LOOP_MASK (OACC_vector);
+    }
+
+  if (!(gwv_cur & mask_wv))
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_REDUCTION)
+      {
+	var = OMP_CLAUSE_DECL (c);
+
+	if (!lang_hooks.decls.omp_privatize_by_reference (var))
+	  {
+	    OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = NULL;
+	    continue;
+	  }
+
+	type = TREE_TYPE (TREE_TYPE (var));
+	new_var = create_tmp_var (type);
+
+	pr.ref_var = var;
+	pr.local_var = new_var;
+
+	walk_tree (expr_p, localize_reductions_r, &pr, NULL);
+
+	OMP_CLAUSE_DECL (c) = var;
+	OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c) = new_var;
+      }
+}
+
 /* Gimplify the gross structure of an OMP_FOR statement.  */
 
 static enum gimplify_status
@@ -7330,6 +7426,9 @@ gimplify_omp_for (tree *expr_p, gimple_seq *pre_p)
       gcc_unreachable ();
     }
 
+  if (ork == ORK_OACC)
+    localize_reductions (expr_p);
+
   /* Set OMP_CLAUSE_LINEAR_NO_COPYIN flag on explicit linear
      clause for the IV.  */
   if (simd && TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)) == 1)
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0a2c9a1..bc6f23e 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -46,6 +46,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "stringpool.h"
 #include "tree-ssanames.h"
 #include "diagnostic-core.h"
+#include "recog.h"
 
 /* The names of each internal function, indexed by function number.  */
 const char *const internal_fn_name_array[] = {
@@ -1984,6 +1985,42 @@ expand_GOACC_JOIN (gcall *stmt ATTRIBUTE_UNUSED)
 #endif
 }
 
+/* GOACC_REDUCTION is supposed to be expanded at pass_fold_reductions.
+   So this dummy function should never be called.  */
+
+static void
+expand_GOACC_REDUCTION (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This is an optimization barrier.  Propagate call arg0 to the LHS.  */
+
+static void
+expand_GOACC_REDUCTION_INIT (gcall *stmt)
+{
+  tree lhs, arg0;
+  rtx target, val;
+
+  lhs = gimple_call_lhs (stmt);
+  arg0 = gimple_call_arg (stmt, 0);
+  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  val = expand_expr (arg0, target, VOIDmode, EXPAND_NORMAL);
+  emit_move_insn (target, val);
+}
+
+/* GOACC_REDUCTION_WRITEBACK is used as a marker to denote the transition of
+   the execution engine entering into a single-threaded mode from a thread-
+   partitioned mode.  The code following this marker is responsible for
+   updating the original reduction variable.  This function is expanded during
+   fold_oacc_reductions.  */
+
+static void
+expand_GOACC_REDUCTION_WRITEBACK (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index c3374d6..ddd63c9 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -66,3 +66,6 @@ DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
 DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
 DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_REDUCTION, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_INIT, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_WRITEBACK, ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 37b853f..e58394c 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -82,6 +82,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "gomp-constants.h"
 #include "gimple-pretty-print.h"
 #include "set"
+#include "tree-ssa-propagate.h"
+#include "omp-low.h"
 
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
@@ -4394,7 +4396,39 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimple_seq *dlist,
 			handle_simd_reference (clause_loc, new_vard, ilist);
 		      else if (is_oacc_parallel (ctx) && is_reference (var))
 			new_var = build_simple_mem_ref (new_var);
-		      gimplify_assign (new_var, x, ilist);
+
+		      /* OpenACC loops may require loop state propagation.
+			 Using an function call for the reduction initializer
+			 ensures that the initial value for the private
+			 reduction variable is propagated to all of the
+			 threads inside a loop.  */
+		      if (is_gimple_omp_oacc (ctx->stmt)
+			  && (ctx->gwv_this &
+			      (OACC_LOOP_MASK (OACC_worker)
+			       | OACC_LOOP_MASK (OACC_vector))))
+			{
+			  tree t = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+			  if (t == NULL)
+			    t = new_var;
+			  gcall *call = gimple_build_call_internal
+			    (IFN_GOACC_REDUCTION_INIT, 1, x);
+			  if (TREE_CODE (t) != INDIRECT_REF)
+			    {
+			      gimple_call_set_lhs (call, t);
+			      gimple_seq_add_stmt (ilist, call);
+			    }
+			  else
+			    {
+			      tree x = create_tmp_var (TREE_TYPE (t));
+			      gimplify_assign (x, t, ilist);
+			      gimple_call_set_lhs (call, x);
+			      gimple_seq_add_stmt (ilist, call);
+			      gimplify_assign (t, x, ilist);
+			    }
+			}
+		      else
+			  gimplify_assign (new_var, x, ilist);
+
 		      if (is_simd)
 			{
 			  tree ref = build_outer_var_ref (var, ctx);
@@ -4746,6 +4780,158 @@ expand_oacc_get_thread_num (gimple_seq *seq, int gwv_bits)
   return res;
 }
 
+static void
+lower_oacc_reductions (tree clauses, gimple_seq *stmt_seqp, omp_context *ctx)
+{
+  int loop_flags = 0;
+  tree tlevel, c, x, atom;
+  gcall *call;
+  gimple stmt;
+  int reductions = 0;
+  bool use_atomics = false;
+  bool atomic_compatible = true;
+  hash_map<tree, tree> ired_map;
+
+  /* GWV_THIS contains the current level of parallelism the loop nest.
+     Extract the level of parallelism only associated with the current
+     loop.  */
+  if (is_gimple_omp_oacc (ctx->stmt))
+    {
+      omp_context *outer = ctx->outer;
+
+      if (outer && gimple_code (outer->stmt) != GIMPLE_OMP_FOR)
+	outer = NULL;
+
+      loop_flags = outer ? ctx->gwv_this & (~outer->gwv_this)
+	: ctx->gwv_this;
+    }
+
+  /* OpenACC reduction finalizers operate in two stages.  The first
+     stage combines all of the partial reductions values together in
+     a 'partitioned' execution mode.  The second stage updates the
+     original or intermediate reduction variable in a 'single' execution
+     mode.
+
+     The internal function GOACC_REDUCTION handles the first stage, and
+     GOACC_REDUCTION_WRITEBACK acts as a marker for the second stage.
+     Later on, fold_oacc_reductions will move all of the code following
+     GOACC_REDUCTION_WRITEBACK immediately after the nearest GOACC_JOIN.  */
+
+  /* Phase 1: vectorize the reductions.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      tree ired;  // intermediate reduction variable
+      tree var;   // reduction clause decl
+      tree tcode;
+
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
+	continue;
+
+      tcode
+	= build_int_cst (integer_type_node, OMP_CLAUSE_REDUCTION_CODE (c));
+
+      var = OMP_CLAUSE_DECL (c);
+      ired = OMP_CLAUSE_REDUCTION_PRIVATE_DECL (c);
+      if (ired == NULL_TREE)
+	ired = lookup_decl (var, ctx);
+
+      if (is_reference (ired))
+	ired = build_simple_mem_ref (ired);
+
+      if (!is_gimple_reg (ired))
+	{
+	  tree t = create_tmp_var (TREE_TYPE (ired));
+	  gimplify_assign (t, ired, stmt_seqp);
+	  ired = t;
+	}
+
+      ired_map.put (var, ired);
+
+      tlevel = build_int_cst (unsigned_type_node, loop_flags);
+
+      call = gimple_build_call_internal (IFN_GOACC_REDUCTION, 3, tlevel,
+					 tcode, ired);
+      gimple_call_set_lhs (call, ired);
+      gimple_seq_add_stmt (stmt_seqp, call);
+
+      if (!is_atomic_compatible_reduction (var, ctx))
+	atomic_compatible = false;
+
+      reductions++;
+    }
+
+  if (reductions == 0)
+    return;
+
+  /* Phase 2: Update the original reduction variable.  */
+
+  /* Insert the marks for the reduction writeback here.  */
+  call = gimple_build_call_internal (IFN_GOACC_REDUCTION_WRITEBACK, 1,
+				     tlevel);
+  gimple_seq_add_stmt (stmt_seqp, call);
+
+  use_atomics = atomic_compatible
+    && (loop_flags & ~OACC_LOOP_MASK (OACC_vector)) != 0;
+
+  /* Use a spin-lock if multiple reductions are involved.  */
+  if (!atomic_compatible || (reductions > 1 && use_atomics))
+    {
+      atom = builtin_decl_explicit (BUILT_IN_GOMP_ATOMIC_START);
+      stmt = gimple_build_call (atom, 0);
+      gimple_seq_add_stmt (stmt_seqp, stmt);
+    }
+
+  /* Lower individual reduction writebacks.  */
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      tree ired;             // intermediate reduction variable
+      tree ored, ored_addr;  // original reduction variable
+      tree var;              // reduction clause decl
+      enum tree_code tcode;
+
+      if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_REDUCTION)
+	continue;
+
+      tcode = OMP_CLAUSE_REDUCTION_CODE (c);
+      if (tcode == MINUS_EXPR)
+        tcode = PLUS_EXPR;
+
+      var = OMP_CLAUSE_DECL (c);
+      ired = *ired_map.get (var);
+      ored = build_outer_var_ref (var, ctx);
+
+      if (use_atomics && reductions == 1)
+	{
+	  ored_addr = build_fold_addr_expr (ored);
+	  ored_addr = save_expr (ored_addr);
+
+	  ored = build1 (INDIRECT_REF, TREE_TYPE (TREE_TYPE (ored_addr)),
+			 ored_addr);
+	  x = fold_build2 (tcode, TREE_TYPE (ored), ored, ired);
+	  x = build2 (OMP_ATOMIC, void_type_node, ored_addr, x);
+	  gimplify_and_add (x, stmt_seqp);
+	}
+      else
+	{
+	  tree t = ored;
+	  if (is_reference (ored) && !is_reference (ired))
+	    {
+	      t = create_tmp_var (TREE_TYPE (ired));
+	      gimplify_assign (t, ired, stmt_seqp);
+	    }
+	  x = build2 (tcode, TREE_TYPE (ired), t, ired);
+	  gimplify_assign (ored, x, stmt_seqp);
+	}
+    }
+
+  if (!atomic_compatible || (reductions > 1 && use_atomics))
+    {
+      atom = builtin_decl_explicit (BUILT_IN_GOMP_ATOMIC_END);
+      stmt = gimple_build_call (atom, 0);
+      gimple_seq_add_stmt (stmt_seqp, stmt);
+    }
+}
+
 /* Generate code to implement the REDUCTION clauses.  OpenACC reductions
    are usually executed in parallel, but they fallback to sequential code for
    known single-threaded regions.  */
@@ -4758,6 +4944,13 @@ lower_reduction_clauses (tree clauses, gimple_seq *stmt_seqp, omp_context *ctx)
   tree x, c;
   int count = 0;
 
+  /* OpenACC loop reductions are handled elsewhere.  */
+  if (!is_oacc_parallel (ctx) && is_gimple_omp_oacc (ctx->stmt))
+    {
+      lower_oacc_reductions (clauses, stmt_seqp, ctx);
+      return;
+    }
+
   /* SIMD reductions are handled in lower_rec_input_clauses.  */
   if (gimple_code (ctx->stmt) == GIMPLE_OMP_FOR
       && gimple_omp_for_kind (ctx->stmt) & GF_OMP_FOR_SIMD)
@@ -14394,4 +14587,188 @@ make_pass_late_lower_omp (gcc::context *ctxt)
   return new pass_late_lower_omp (ctxt);
 }
 
+static void
+find_goacc_join (gimple_stmt_iterator *gsi)
+{
+  gimple stmt;
+
+  while (!gsi_end_p (*gsi))
+    {
+      stmt = gsi_stmt (*gsi);
+
+      if (is_gimple_call (stmt) && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_fn (stmt) == IFN_GOACC_JOIN)
+	return;
+
+      gsi_next (gsi);
+    }
+}
+
+static gimple_stmt_iterator
+find_enclosing_join (basic_block bb)
+{
+  basic_block son;
+  gimple_stmt_iterator gsi = gsi_start_bb (bb);
+
+  find_goacc_join (&gsi);
+  if (!gsi_end_p (gsi))
+    return gsi;
+
+  for (son = first_dom_son (CDI_DOMINATORS, bb);
+       son;
+       son = next_dom_son (CDI_DOMINATORS, son))
+    {
+      gsi = find_enclosing_join (son);
+      if (!gsi_end_p (gsi))
+	return gsi;
+    }
+
+  return gsi;
+}
+
+/* Main entry point for folding function calls for oacc reductions.  See
+   lower_oacc_reductions for a description on how the internal functions
+   are used.  */
+
+static unsigned int
+execute_fold_oacc_reductions ()
+{
+  basic_block bb;
+  gimple_stmt_iterator gsi, gsi_wb, gsi_join;
+  gimple stmt;
+  tree arg;
+  int reductions = 0;
+  int loop_mask = 0;
+
+  if (!lookup_attribute ("oacc function",
+			 DECL_ATTRIBUTES (current_function_decl)))
+    return 0;
+
+  free_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  FOR_ALL_BB_FN (bb, cfun)
+    {
+      /* Pass 1: Fold GOACC_REDUCTION.  These calls are to be evaluated
+         by targetm.goacc.fold_reduction.  */
+      gsi = gsi_start_bb (bb);
+      reductions = 0;
+
+      while (!gsi_end_p (gsi))
+	{
+	  bool removed = false;
+	  stmt = gsi_stmt (gsi);
+
+	  if (is_gimple_call (stmt) && gimple_call_internal_p (stmt))
+	    {
+	      if (gimple_call_internal_fn (stmt) == IFN_GOACC_REDUCTION)
+		{
+		  targetm.goacc.fold_reduction (&gsi);
+		  stmt = gsi_stmt (gsi);
+		  reductions++;
+		  removed = true;
+		}
+	      else if (gimple_call_internal_fn (stmt)
+		       == IFN_GOACC_REDUCTION_WRITEBACK)
+		break;
+	    }
+
+	  if (!removed)
+	    gsi_next (&gsi);
+	}
+
+      /* Pass 2: Update the placement of the GOACC_JOINs using the
+         GOACC_REDUCTION_WRITEBACK markers for vector reductions.  */
+
+      if (reductions == 0)
+	continue;
+
+      arg = gimple_call_arg (stmt, 0);
+      loop_mask = TREE_INT_CST_LOW (arg);
+
+      /* Only vector reduction writebacks need to placed after the call
+	 to GOACC_JOIN.  */
+      if ((loop_mask & OACC_LOOP_MASK (OACC_vector)) == 0)
+	{
+	  gsi_remove (&gsi, true);
+	  continue;
+	}
+
+      gsi_wb = gsi;
+      gsi_join = find_enclosing_join (bb);
+
+      gcc_assert (!gsi_end_p (gsi_join));
+      stmt = gsi_stmt (gsi_join);
+
+      /* Replace the call go GOACC_REDUCTION_WRITEBACK with a call to
+	 GOACC_JOIN marker.  */
+      tree arg0 = gimple_call_arg (stmt, 0);
+      gcall *call = gimple_build_call_internal (IFN_GOACC_JOIN, 1, arg0);
+      gsi_replace (&gsi_wb, call, true);
+
+      /* Remove the original call to GOACC_JOIN.  */
+      gsi_remove(&gsi_join, true);
+    }
+
+  cleanup_tree_cfg ();
+  mark_virtual_operands_for_renaming (cfun);
+  update_ssa (TODO_update_ssa);
+  verify_ssa (true, true);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_oacc_reductions =
+{
+  GIMPLE_PASS, /* type */
+  "fold_oacc_reductions", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_fold_oacc_reductions : public gimple_opt_pass
+{
+public:
+  pass_fold_oacc_reductions (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_reductions, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual unsigned int execute (function *)
+    {
+      bool gate = (flag_openacc != 0 && !seen_error ());
+
+      if (!gate)
+	return 0;
+
+      return execute_fold_oacc_reductions ();
+    }
+
+}; // class pass_fold_oacc_reductions
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_fold_oacc_reductions (gcc::context *ctxt)
+{
+  return new pass_fold_oacc_reductions (ctxt);
+}
+
+void
+default_goacc_fold_reduction (gimple_stmt_iterator *gsi)
+{
+  gimple call = gsi_stmt (*gsi);
+  tree lhs = gimple_call_lhs (call);
+  tree rhs = gimple_call_arg (call, 2);
+  gassign *g = gimple_build_assign (lhs, rhs);
+
+  gsi_replace (gsi, g, true);
+}
+
 #include "gt-omp-low.h"
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 6018971..888b21c 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -332,5 +332,3 @@ OPTAB_D (atomic_xor_optab, "atomic_xor$I$a")
 
 OPTAB_D (get_thread_pointer_optab, "get_thread_pointer$I$a")
 OPTAB_D (set_thread_pointer_optab, "set_thread_pointer$I$a")
-
-OPTAB_D (oacc_thread_broadcast_optab, "oacc_thread_broadcast$I$a")
diff --git a/gcc/passes.def b/gcc/passes.def
index 43e67df..abb598f 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -165,6 +165,7 @@ along with GCC; see the file COPYING3.  If not see
   INSERT_PASSES_AFTER (all_passes)
   NEXT_PASS (pass_fixup_cfg);
   NEXT_PASS (pass_lower_eh_dispatch);
+  NEXT_PASS (pass_fold_oacc_reductions);
   NEXT_PASS (pass_all_optimizations);
   PUSH_INSERT_PASSES_WITHIN (pass_all_optimizations)
       NEXT_PASS (pass_remove_cgraph_callee_edges);
diff --git a/gcc/target.def b/gcc/target.def
index 4edc209..ecdeb74 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1639,6 +1639,22 @@ int, (struct cgraph_node *), NULL)
 
 HOOK_VECTOR_END (simd_clone)
 
+/* Functions relating to openacc.  */
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_GOACC_"
+HOOK_VECTOR (TARGET_GOACC, goacc)
+
+DEFHOOK
+(fold_reduction,
+"This hook is used to expand OpenACC reductions represented by calls to\n\
+the internal function @var{GOACC_REDUCTION} into a sequence of gimple\n\
+instructions.  @var{gsi} points to the gimple statement holding the\n\ function call.  By default, targets are assumed to be single-threaded\n\
+although that is not a requirement.",
+void, (gimple_stmt_iterator *gsi),
+default_goacc_fold_reduction)
+
+HOOK_VECTOR_END (goacc)
+
 /* Functions relating to vectorization.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_VECTORIZE_"
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 5ae991d..8e3112d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -107,6 +107,8 @@ extern unsigned default_add_stmt_cost (void *, int, enum vect_cost_for_stmt,
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
+extern void default_goacc_fold_reduction (gimple_stmt_iterator *);
+
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */
 
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index 3be9093..cd2a618 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -232,7 +232,9 @@ enum omp_clause_code {
      Operand 2: OMP_CLAUSE_REDUCTION_MERGE: Stmt-list to merge private var
                 into the shared one.
      Operand 3: OMP_CLAUSE_REDUCTION_PLACEHOLDER: A dummy VAR_DECL
-                placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.  */
+                placeholder used in OMP_CLAUSE_REDUCTION_{INIT,MERGE}.
+     Operand 4: OMP_CLAUSE_REDUCTION_PRIVATE_DECL: A private VAR_DECL of
+                the original DECL associated with the reduction clause.  */
   OMP_CLAUSE_REDUCTION,
 
   /* OpenMP clause: copyin (variable_list).  */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 13f20ea..ce31593 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -410,6 +410,7 @@ extern gimple_opt_pass *make_pass_late_lower_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_diagnose_omp_blocks (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_expand_omp_ssa (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_fold_oacc_reductions (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_object_sizes (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_strlen (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fold_builtins (gcc::context *ctxt);
diff --git a/gcc/tree.c b/gcc/tree.c
index ff533a3..395f236 100644
--- a/gcc/tree.c
+++ b/gcc/tree.c
@@ -281,7 +281,7 @@ unsigned const char omp_clause_num_ops[] =
   1, /* OMP_CLAUSE_SHARED  */
   1, /* OMP_CLAUSE_FIRSTPRIVATE  */
   2, /* OMP_CLAUSE_LASTPRIVATE  */
-  4, /* OMP_CLAUSE_REDUCTION  */
+  5, /* OMP_CLAUSE_REDUCTION  */
   1, /* OMP_CLAUSE_COPYIN  */
   1, /* OMP_CLAUSE_COPYPRIVATE  */
   3, /* OMP_CLAUSE_LINEAR  */
diff --git a/gcc/tree.h b/gcc/tree.h
index 189c298..4479790 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1428,6 +1428,8 @@ extern void protected_set_expr_location (tree, location_t);
   (OMP_CLAUSE_CHECK (NODE))->omp_clause.gimple_reduction_merge
 #define OMP_CLAUSE_REDUCTION_PLACEHOLDER(NODE) \
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 3)
+#define OMP_CLAUSE_REDUCTION_PRIVATE_DECL(NODE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_REDUCTION), 4)
 
 /* True if a REDUCTION clause may reference the original list item (omp_orig)
    in its OMP_CLAUSE_REDUCTION_{,GIMPLE_}INIT.  */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [gomp4] OpenACC reduction tests
  2015-07-17 18:26 [gomp4] OpenACC vector and worker reductions Cesar Philippidis
@ 2015-07-17 18:26 ` Cesar Philippidis
  2015-09-18  8:29   ` Thomas Schwinge
  2015-09-18 13:40   ` Thomas Schwinge
  0 siblings, 2 replies; 7+ messages in thread
From: Cesar Philippidis @ 2015-07-17 18:26 UTC (permalink / raw)
  To: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 442 bytes --]

This patch updates the libgomp OpenACC reduction test cases to check
worker, vector and combined gang worker vector reductions. I tried to
use some macros to simplify the c test cases a bit. I probably could
have made them more generic with an additional header file/macro, but
then that makes it too confusing too debug. The fortran tests are a bit
of a lost clause, unless someone knows how to use the preprocessor with
!$acc loops.

Cesar

[-- Attachment #2: vector-reduction-tests.diff --]
[-- Type: text/x-patch, Size: 69485 bytes --]

2015-07-17  Cesar Philippidis  <cesar@codesourcery.com>

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/reduction.h: New file.
	* testsuite/libgomp.oacc-c-c++-common/reduction-1.c: Update tests
	with worker, vector and combined reductions.
	* testsuite/libgomp.oacc-c-c++-common/reduction-2.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/reduction-3.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/reduction-4.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/reduction-5.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/reduction-6.c: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-1.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-2.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-3.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-4.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-5.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/reduction-6.f90: Likewise.


diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-1.c
index bb81759..8738927 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-1.c
@@ -3,44 +3,54 @@
 /* Integer reductions.  */
 
 #include <stdlib.h>
-#include <stdbool.h>
-
-#define ng 32
-
-#define DO_PRAGMA(x) _Pragma (#x)
-
-#define check_reduction_op(type, op, init, b)	\
-  {						\
-    type res, vres;				\
-    res = (init);				\
-DO_PRAGMA (acc parallel num_gangs (ng) copy (res)) \
-DO_PRAGMA (acc loop gang reduction (op:res))	\
-    for (i = 0; i < n; i++)			\
-      res = res op (b);				\
-						\
-    vres = (init);				\
-    for (i = 0; i < n; i++)			\
-      vres = vres op (b);			\
-						\
-    if (res != vres)				\
-      abort ();					\
-  }
+#include "reduction.h"
+
+const int ng = 8;
+const int nw = 4;
+const int vl = 32;
 
 static void
-test_reductions_int (void)
+test_reductions (void)
 {
-  const int n = 1000;
+  const int n = 100;
   int i;
   int array[n];
 
   for (i = 0; i < n; i++)
-    array[i] = i;
-
-  check_reduction_op (int, +, 0, array[i]);
-  check_reduction_op (int, *, 1, array[i]);
-  check_reduction_op (int, &, -1, array[i]);
-  check_reduction_op (int, |, 0, array[i]);
-  check_reduction_op (int, ^, 0, array[i]);
+    array[i] = i+1;
+
+  /* Gang reductions.  */
+  check_reduction_op (int, +, 0, array[i], num_gangs (ng), gang);
+  check_reduction_op (int, *, 1, array[i], num_gangs (ng), gang);
+  check_reduction_op (int, &, -1, array[i], num_gangs (ng), gang);
+  check_reduction_op (int, |, 0, array[i], num_gangs (ng), gang);
+  check_reduction_op (int, ^, 0, array[i], num_gangs (ng), gang);
+
+  /* Worker reductions.  */
+  check_reduction_op (int, +, 0, array[i], num_workers (nw), worker);
+  check_reduction_op (int, *, 1, array[i], num_workers (nw), worker);
+  check_reduction_op (int, &, -1, array[i], num_workers (nw), worker);
+  check_reduction_op (int, |, 0, array[i], num_workers (nw), worker);
+  check_reduction_op (int, ^, 0, array[i], num_workers (nw), worker);
+
+  /* Vector reductions.  */
+  check_reduction_op (int, +, 0, array[i], vector_length (vl), vector);
+  check_reduction_op (int, *, 1, array[i], vector_length (vl), vector);
+  check_reduction_op (int, &, -1, array[i], vector_length (vl), vector);
+  check_reduction_op (int, |, 0, array[i], vector_length (vl), vector);
+  check_reduction_op (int, ^, 0, array[i], vector_length (vl), vector);
+
+  /* Combined reductions.  */
+  check_reduction_op (int, +, 0, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (int, *, 1, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (int, &, -1, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (int, |, 0, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (int, ^, 0, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
 }
 
 static void
@@ -55,32 +65,31 @@ test_reductions_bool (void)
     array[i] = i;
 
   cmp_val = 5;
-#if 0
-  // TODO
-  check_reduction_op (bool, &&, true, (cmp_val > array[i]));
-  check_reduction_op (bool, ||, false, (cmp_val > array[i]));
-#endif
-}
 
-#define check_reduction_macro(type, op, init, b)	\
-  {							\
-    type res, vres;					\
-    res = (init);					\
-DO_PRAGMA (acc parallel num_gangs (ng) copy(res))	\
-DO_PRAGMA (acc loop gang reduction (op:res))		\
-    for (i = 0; i < n; i++)				\
-      res = op (res, (b));				\
-							\
-    vres = (init);					\
-    for (i = 0; i < n; i++)				\
-      vres = op (vres, (b));				\
-							\
-    if (res != vres)					\
-      abort ();						\
-  }
-
-#define max(a, b) (((a) > (b)) ? (a) : (b))
-#define min(a, b) (((a) < (b)) ? (a) : (b))
+  /* Gang reductions.  */
+  check_reduction_op (int, &&, 1, (cmp_val > array[i]), num_gangs (ng),
+		      gang);
+  check_reduction_op (int, ||, 0, (cmp_val > array[i]), num_gangs (ng),
+		      gang);
+
+  /* Worker reductions.  */
+  check_reduction_op (int, &&, 1, (cmp_val > array[i]), num_workers (nw),
+		      worker);
+  check_reduction_op (int, ||, 0, (cmp_val > array[i]), num_workers (nw),
+		      worker);
+
+  /* Vector reductions.  */
+  check_reduction_op (int, &&, 1, (cmp_val > array[i]), vector_length (vl),
+		      vector);
+  check_reduction_op (int, ||, 0, (cmp_val > array[i]), vector_length (vl),
+		      vector);
+
+  /* Combined reductions.  */
+  check_reduction_op (int, &&, 1, (cmp_val > array[i]), num_gangs (ng)
+		      num_workers (nw) vector_length (vl), gang worker vector);
+  check_reduction_op (int, ||, 0, (cmp_val > array[i]), num_gangs (ng)
+		      num_workers (nw) vector_length (vl), gang worker vector);
+}
 
 static void
 test_reductions_minmax (void)
@@ -92,14 +101,32 @@ test_reductions_minmax (void)
   for (i = 0; i < n; i++)
     array[i] = i;
 
-  check_reduction_macro (int, min, n + 1, array[i]);
-  check_reduction_macro (int, max, -1, array[i]);
+  /* Gang reductions.  */
+  check_reduction_macro (int, min, n + 1, array[i], num_gangs (ng), gang);
+  check_reduction_macro (int, max, -1, array[i], num_gangs (ng), gang);
+
+  /* Worker reductions.  */
+  check_reduction_macro (int, min, n + 1, array[i], num_workers (nw), worker);
+  check_reduction_macro (int, max, -1, array[i], num_workers (nw), worker);
+
+  /* Vector reductions.  */
+  check_reduction_macro (int, min, n + 1, array[i], vector_length (vl),
+			 vector);
+  check_reduction_macro (int, max, -1, array[i], vector_length (vl), vector);
+
+  /* Combined reductions.  */
+  check_reduction_macro (int, min, n + 1, array[i], num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+  check_reduction_macro (int, max, -1, array[i], num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
 }
 
 int
 main (void)
 {
-  test_reductions_int ();
+  test_reductions ();
   test_reductions_bool ();
   test_reductions_minmax ();
   return 0;
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
index ba6eb27..2465ddd 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
@@ -3,123 +3,78 @@
 /* float reductions.  */
 
 #include <stdlib.h>
-#include <stdbool.h>
-#include <math.h>
+#include "reduction.h"
 
-#define ng 32
+const int ng = 8;
+const int nw = 4;
+const int vl = 32;
 
-int
-main(void)
+static void
+test_reductions (void)
 {
-  const int n = 1000;
+  const int n = 100;
   int i;
-  float vresult, result, array[n];
-  int lvresult, lresult;
+  float array[n];
 
   for (i = 0; i < n; i++)
-    array[i] = i;
+    array[i] = i+1;
 
-  result = 0;
-  vresult = 0;
+  /* Gang reductions.  */
+  check_reduction_op (float, +, 0, array[i], num_gangs (ng), gang);
+  check_reduction_op (float, *, 1, array[i], num_gangs (ng), gang);
 
-  /* '+' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (+:result) gang
-  for (i = 0; i < n; i++)
-    result += array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult += array[i];
+  /* Worker reductions.  */
+  check_reduction_op (float, +, 0, array[i], num_workers (nw), worker);
+  check_reduction_op (float, *, 1, array[i], num_workers (nw), worker);
 
-  if (result != vresult)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* '*' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (*:result) gang
-  for (i = 0; i < n; i++)
-    result *= array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult *= array[i];
-
-  if (fabs(result - vresult) > .0001)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* 'max' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (max:result) gang
-  for (i = 0; i < n; i++)
-    result = result > array[i] ? result : array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult = vresult > array[i] ? vresult : array[i];
-
-  if (result != vresult)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* 'min' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (min:result) gang
-  for (i = 0; i < n; i++)
-    result = result < array[i] ? result : array[i];
+  /* Vector reductions.  */
+  check_reduction_op (float, +, 0, array[i], vector_length (vl), vector);
+  check_reduction_op (float, *, 1, array[i], vector_length (vl), vector);
 
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult = vresult < array[i] ? vresult : array[i];
-
-  if (result != vresult)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = 0;
-  lvresult = 0;
-
-  /* '&&' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (&&:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult && (result > array[i]);
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    lvresult = lresult && (result > array[i]);
-
-  if (lresult != lvresult)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = 0;
-  lvresult = 0;
+  /* Combined reductions.  */
+  check_reduction_op (float, +, 0, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (float, *, 1, array[i], num_gangs (ng) num_workers (nw)
+		      vector_length (vl), gang worker vector);
+}
 
-  /* '||' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (||:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult || (result > array[i]);
+static void
+test_reductions_minmax (void)
+{
+  const int n = 1000;
+  int i;
+  float array[n];
 
-  /* Verify the reduction.  */
   for (i = 0; i < n; i++)
-    lvresult = lresult || (result > array[i]);
+    array[i] = i;
 
-  if (lresult != lvresult)
-    abort ();
+  /* Gang reductions.  */
+  check_reduction_macro (float, min, n + 1, array[i], num_gangs (ng), gang);
+  check_reduction_macro (float, max, -1, array[i], num_gangs (ng), gang);
+
+  /* Worker reductions.  */
+  check_reduction_macro (float, min, n + 1, array[i], num_workers (nw),
+			 worker);
+  check_reduction_macro (float, max, -1, array[i], num_workers (nw), worker);
+
+  /* Vector reductions.  */
+  check_reduction_macro (float, min, n + 1, array[i], vector_length (vl),
+			 vector);
+  check_reduction_macro (float, max, -1, array[i], vector_length (vl), vector);
+
+  /* Combined reductions.  */
+  check_reduction_macro (float, min, n + 1, array[i], num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+  check_reduction_macro (float, max, -1, array[i], num_gangs (ng)
+			 num_workers (nw)vector_length (vl), gang worker
+			 vector);
+}
 
+int
+main (void)
+{
+  test_reductions ();
+  test_reductions_minmax ();
   return 0;
 }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-3.c
index 5ecc651..091421f 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-3.c
@@ -3,123 +3,79 @@
 /* double reductions.  */
 
 #include <stdlib.h>
-#include <stdbool.h>
-#include <math.h>
+#include "reduction.h"
 
-#define ng 32
+const int ng = 8;
+const int nw = 4;
+const int vl = 32;
 
-int
-main(void)
+static void
+test_reductions (void)
 {
-  const int n = 1000;
+  const int n = 10;
   int i;
-  double vresult, result, array[n];
-  int lvresult, lresult;
+  double array[n];
 
   for (i = 0; i < n; i++)
-    array[i] = i;
+    array[i] = i+1;
 
-  result = 0;
-  vresult = 0;
+  /* Gang reductions.  */
+  check_reduction_op (double, +, 0, array[i], num_gangs (ng), gang);
+  check_reduction_op (double, *, 1, array[i], num_gangs (ng), gang);
 
-  /* '+' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (+:result) gang
-  for (i = 0; i < n; i++)
-    result += array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult += array[i];
+  /* Worker reductions.  */
+  check_reduction_op (double, +, 0, array[i], num_workers (nw), worker);
+  check_reduction_op (double, *, 1, array[i], num_workers (nw), worker);
 
-  if (result != vresult)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* '*' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (*:result) gang
-  for (i = 0; i < n; i++)
-    result *= array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult *= array[i];
-
-  if (fabs(result - vresult) > .0001)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* 'max' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (max:result) gang
-  for (i = 0; i < n; i++)
-      result = result > array[i] ? result : array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-      vresult = vresult > array[i] ? vresult : array[i];
-
-  if (result != vresult)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* 'min' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (min:result) gang
-  for (i = 0; i < n; i++)
-      result = result < array[i] ? result : array[i];
+  /* Vector reductions.  */
+  check_reduction_op (double, +, 0, array[i], vector_length (vl), vector);
+  check_reduction_op (double, *, 1, array[i], vector_length (vl), vector);
 
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-      vresult = vresult < array[i] ? vresult : array[i];
-
-  if (result != vresult)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = 0;
-  lvresult = 0;
-
-  /* '&&' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (&&:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult && (result > array[i]);
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    lvresult = lresult && (result > array[i]);
-
-  if (lresult != lvresult)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = 0;
-  lvresult = 0;
+  /* Combined reductions.  */
+  check_reduction_op (double, +, 0, array[i], num_gangs (ng)  num_workers (nw)
+		      vector_length (vl), gang worker vector);
+  check_reduction_op (double, *, 1, array[i], num_gangs (ng)  num_workers (nw)
+		      vector_length (vl), gang worker vector);
+}
 
-  /* '||' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (||:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult || (result > array[i]);
+static void
+test_reductions_minmax (void)
+{
+  const int n = 1000;
+  int i;
+  double array[n];
 
-  /* Verify the reduction.  */
   for (i = 0; i < n; i++)
-    lvresult = lresult || (result > array[i]);
+    array[i] = i;
 
-  if (lresult != lvresult)
-    abort ();
+  /* Gang reductions.  */
+  check_reduction_macro (double, min, n + 1, array[i], num_gangs (ng), gang);
+  check_reduction_macro (double, max, -1, array[i], num_gangs (ng), gang);
+
+  /* Worker reductions.  */
+  check_reduction_macro (double, min, n + 1, array[i], num_workers (nw),
+			 worker);
+  check_reduction_macro (double, max, -1, array[i], num_workers (nw), worker);
+
+  /* Vector reductions.  */
+  check_reduction_macro (double, min, n + 1, array[i], vector_length (vl),
+			 vector);
+  check_reduction_macro (double, max, -1, array[i], vector_length (vl),
+			 vector);
+
+  /* Combined reductions.  */
+  check_reduction_macro (double, min, n + 1, array[i], num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+  check_reduction_macro (double, max, -1, array[i], num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+}
 
+int
+main (void)
+{
+  test_reductions ();
+  test_reductions_minmax ();
   return 0;
 }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
index c7069e9..816b09f 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
@@ -1,95 +1,53 @@
 /* { dg-do run { target { ! { hppa*-*-hpux* } } } } */
-/* { dg-xfail-run-if "libgomp: cuStreamSynchronize error: launch timeout" { openacc_nvidia_accel_selected } } */
 
 /* complex reductions.  */
 
 #include <stdlib.h>
-#include <stdbool.h>
-#include <math.h>
 #include <complex.h>
+#include "reduction.h"
 
-#define ng 32
+const int ng = 8;
+const int nw = 4;
+const int vl = 32;
 
-int
-main(void)
+static void
+test_reductions (void)
 {
-  const int n = 1000;
+  const int n = 10;
   int i;
-  double _Complex vresult, result, array[n];
-  bool lvresult, lresult;
-
-  for (i = 0; i < n; i++)
-    array[i] = i;
-
-  result = 0;
-  vresult = 0;
-
-  /* '+' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (+:result) gang
-  for (i = 0; i < n; i++)
-    result += array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult += array[i];
-
-  if (result != vresult)
-    abort ();
-
-  result = 0;
-  vresult = 0;
-
-  /* '*' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (result)
-#pragma acc loop reduction (*:result) gang
-  for (i = 0; i < n; i++)
-    result *= array[i];
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    vresult *= array[i];
-
-  if (cabsf (result - vresult) > .0001)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = false;
-  lvresult = false;
-
-  /* '&&' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (lresult)
-#pragma acc loop reduction (&&:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult && (creal(result) > creal(array[i]));
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    lvresult = lvresult && (creal(result) > creal(array[i]));
-
-  if (lresult != lvresult)
-    abort ();
-
-  result = 5;
-  vresult = 5;
-
-  lresult = false;
-  lvresult = false;
-
-  /* '||' reductions.  */
-#pragma acc parallel num_gangs (ng) copy (lresult)
-#pragma acc loop reduction (||:lresult) gang
-  for (i = 0; i < n; i++)
-    lresult = lresult || (creal(result) > creal(array[i]));
-
-  /* Verify the reduction.  */
-  for (i = 0; i < n; i++)
-    lvresult = lvresult || (creal(result) > creal(array[i]));
-
-  if (lresult != lvresult)
-    abort ();
+  double _Complex array[n];
+
+  for (i = 0; i < n; i++)
+    array[i] = i+1;
+
+  /* Gang reductions.  */
+  check_reduction_op (double, +, 0, creal (array[i]), num_gangs (ng), gang);
+  check_reduction_op (double, *, 1, creal (array[i]), num_gangs (ng), gang);
+
+  /* Worker reductions.  */
+  check_reduction_op (double, +, 0, creal (array[i]), num_workers (nw),
+		      worker);
+  check_reduction_op (double, *, 1, creal (array[i]), num_workers (nw),
+		      worker);
+
+  /* Vector reductions.  */
+  check_reduction_op (double, +, 0, creal (array[i]), vector_length (vl),
+		      vector);
+  check_reduction_op (double, *, 1, creal (array[i]), vector_length (vl),
+		      vector);
+
+  /* Combined reductions.  */
+  check_reduction_op (double, +, 0, creal (array[i]), num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+  check_reduction_op (double, *, 1, creal (array[i]), num_gangs (ng)
+			 num_workers (nw) vector_length (vl), gang worker
+			 vector);
+}
 
+int
+main (void)
+{
+  test_reductions ();
   return 0;
 }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-5.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-5.c
index 23a194c..e979ab6 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-5.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-5.c
@@ -1,32 +1,53 @@
+/* { dg-do run } */
+
+/* Multiple reductions.  */
+
 #include <stdio.h>
 #include <stdlib.h>
 
+const int ng = 8;
+const int nw = 4;
+const int vl = 32;
+
+const int n = 100;
+
+#define DO_PRAGMA(x) _Pragma (#x)
+
+#define check_reduction(gwv_par, gwv_loop)		\
+  {							\
+  s1 = 2; s2 = 5;					\
+DO_PRAGMA (acc parallel gwv_par copy (s1, s2))		\
+DO_PRAGMA (acc loop gwv_loop reduction (+:s1, s2))	\
+    for (i = 0; i < n; i++)				\
+      {							\
+         s1 = s1 + 3;					\
+         s2 = s2 + 5;					\
+      }							\
+							\
+    if (s1 != v1 && s2 != v2)				\
+      abort ();						\
+  }
+
 int
 main (void)
 {
   int s1 = 2, s2 = 5, v1 = 2, v2 = 5;
-  int n = 100;
   int i;
 
-#pragma acc parallel num_gangs (1000)  copy (s1, s2)
-#pragma acc loop reduction (+:s1, s2) gang
-  for (i = 0; i < n; i++)
-    {
-      s1 = s1 + 3;
-      s2 = s2 + 2;
-    }
-
   for (i = 0; i < n; i++)
     {
       v1 = v1 + 3;
       v2 = v2 + 2;
     }
-  
-  if (s1 != v1)
-    abort ();
-  
-  if (s2 != v2)
-    abort ();
-    
+
+  check_reduction (num_gangs (ng), gang);
+
+  /* Nvptx targets require a vector_length or 32 in to allow spinlocks with
+     gangs.  */
+  check_reduction (num_workers (nw) vector_length (vl), worker);
+  check_reduction (vector_length (vl), vector);
+  check_reduction (num_gangs (ng) num_workers (nw) vector_length (vl), gang
+		   worker vector);
+
   return 0;
 }
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-6.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-6.c
new file mode 100644
index 0000000..17fa951
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-6.c
@@ -0,0 +1,29 @@
+/* { dg-do run } */
+
+/* Check nested reductions.  */
+
+#include <assert.h>
+
+#define n 1000
+
+int
+main ()
+{
+  int i, j, red = 0, vred = 0;
+  int chunksize = 10;
+
+#pragma acc parallel num_gangs (10) vector_length (32) copy (red)
+#pragma acc loop reduction (+:red) gang
+  for (i = 0; i < n/chunksize; i++)
+#pragma acc loop reduction (+:red) vector
+    for (j = 0; j < chunksize; j++)
+      red += j;
+
+  for (i = 0; i < n/chunksize; i++)
+    for (j = 0; j < chunksize; j++)
+      vred += j;
+
+  assert (red == vred);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction.h b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction.h
new file mode 100644
index 0000000..1b3f8d4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction.h
@@ -0,0 +1,43 @@
+#ifndef REDUCTION_H
+#define REDUCTION_H
+
+#define DO_PRAGMA(x) _Pragma (#x)
+
+#define check_reduction_op(type, op, init, b, gwv_par, gwv_loop)	\
+  {									\
+    type res, vres;							\
+    res = (init);							\
+DO_PRAGMA (acc parallel gwv_par copy (res))				\
+DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
+    for (i = 0; i < n; i++)						\
+      res = res op (b);							\
+									\
+    vres = (init);							\
+    for (i = 0; i < n; i++)						\
+      vres = vres op (b);						\
+									\
+    if (res != vres)							\
+      abort ();								\
+  }
+
+#define check_reduction_macro(type, op, init, b, gwv_par, gwv_loop)	\
+  {									\
+    type res, vres;							\
+    res = (init);							\
+    DO_PRAGMA (acc parallel gwv_par copy(res))				\
+DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
+    for (i = 0; i < n; i++)						\
+      res = op (res, (b));						\
+									\
+    vres = (init);							\
+    for (i = 0; i < n; i++)						\
+      vres = op (vres, (b));						\
+									\
+    if (res != vres)							\
+      abort ();								\
+  }
+
+#define max(a, b) (((a) > (b)) ? (a) : (b))
+#define min(a, b) (((a) < (b)) ? (a) : (b))
+
+#endif
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-1.f90
index 3419ffd..03cca04 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-1.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-1.f90
@@ -5,24 +5,50 @@
 program reduction_1
   implicit none
 
-  integer, parameter    :: n = 10, gangs = 20
-  integer               :: i, vresult, result
-  logical               :: lresult, lvresult
+  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
+  integer               :: i, vresult, rg, rw, rv, rc
+  logical               :: lrg, lrw, lrv, lrc, lvresult
   integer, dimension (n) :: array
 
   do i = 1, n
      array(i) = i
   end do
 
-  result = 0
+  !
+  ! '+' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! '+' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(+:rg) gang
+  do i = 1, n
+     rg = rg + array(i)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(+:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(+:rw) worker
   do i = 1, n
-     result = result + array(i)
+     rw = rw + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(+:rv) vector
+  do i = 1, n
+     rv = rv + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(+:rc) gang worker vector
+  do i = 1, n
+     rc = rc + array(i)
   end do
   !$acc end parallel
 
@@ -31,17 +57,46 @@ program reduction_1
      vresult = vresult + array(i)
   end do
 
-  if (result.ne.vresult) call abort
-
-  result = 0
-  vresult = 0
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
+  !
   ! '*' reductions
+  !
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(*:result) gang
+  rg = 1
+  rw = 1
+  rv = 1
+  rc = 1
+  vresult = 1
+
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(*:rg) gang
   do i = 1, n
-     result = result * array(i)
+     rg = rg * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(*:rw) worker
+  do i = 1, n
+     rw = rw * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(*:rv) vector
+  do i = 1, n
+     rv = rv * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(*:rc) gang worker vector
+  do i = 1, n
+     rc = rc * array(i)
   end do
   !$acc end parallel
 
@@ -50,17 +105,46 @@ program reduction_1
      vresult = vresult * array(i)
   end do
 
-  if (result.ne.vresult) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
+
+  !
+  ! 'max' reductions
+  !
 
-  result = 0
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! 'max' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(max:rg) gang
+  do i = 1, n
+     rg = max (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(max:rw) worker
+  do i = 1, n
+     rw = max (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(max:rv) vector
+  do i = 1, n
+     rv = max (rv, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(max:result) gang
+  !$acc parallel num_gangs(ng) Num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(max:rc) gang worker vector
   do i = 1, n
-     result = max (result, array(i))
+     rc = max (rc, array(i))
   end do
   !$acc end parallel
 
@@ -69,17 +153,46 @@ program reduction_1
      vresult = max (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  result = 1
-  vresult = 1
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
+  !
   ! 'min' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
+  vresult = 0
+
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(min:rg) gang
+  do i = 1, n
+     rg = min (rg, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(min:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(min:rw) worker
   do i = 1, n
-     result = min (result, array(i))
+     rw = min (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(min:rv) vector
+  do i = 1, n
+     rv = min (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(min:rc) gang worker vector
+  do i = 1, n
+     rc = min (rc, array(i))
   end do
   !$acc end parallel
 
@@ -88,17 +201,46 @@ program reduction_1
      vresult = min (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
+
+  !
+  ! 'iand' reductions
+  !
 
-  result = 1
+  rg = 1
+  rw = 1
+  rv = 1
+  rc = 1
   vresult = 1
 
-  ! 'iand' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(iand:rg) gang
+  do i = 1, n
+     rg = iand (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(iand:rw) worker
+  do i = 1, n
+     rw = iand (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(iand:rv) vector
+  do i = 1, n
+     rv = iand (rv, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(iand:result) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(iand:rc) gang worker vector
   do i = 1, n
-     result = iand (result, array(i))
+     rc = iand (rc, array(i))
   end do
   !$acc end parallel
 
@@ -107,17 +249,46 @@ program reduction_1
      vresult = iand (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  result = 1
-  vresult = 1
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
+  !
   ! 'ior' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
+  vresult = 0
+
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(ior:rg) gang
+  do i = 1, n
+     rg = ior (rg, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(ior:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(ior:rw) worker
   do i = 1, n
-     result = ior (result, array(i))
+     rw = ior (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(ior:rv) gang
+  do i = 1, n
+     rv = ior (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(ior:rc) gang worker vector
+  do i = 1, n
+     rc = ior (rc, array(i))
   end do
   !$acc end parallel
 
@@ -126,17 +297,46 @@ program reduction_1
      vresult = ior (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
-  result = 0
+  !
+  ! 'ieor' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! 'ieor' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(ieor:rg) gang
+  do i = 1, n
+     rg = ieor (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(ieor:rw) worker
+  do i = 1, n
+     rw = ieor (rw, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(ieor:result) gang
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(ieor:rv) vector
   do i = 1, n
-     result = ieor (result, array(i))
+     rv = ieor (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(ieor:rc) gang worker vector
+  do i = 1, n
+     rc = ieor (rc, array(i))
   end do
   !$acc end parallel
 
@@ -145,17 +345,46 @@ program reduction_1
      vresult = ieor (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
+  !
   ! '.and.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.and.:lrg) gang
+  do i = 1, n
+     lrg = lrg .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.and.:lrw) worker
+  do i = 1, n
+     lrw = lrw .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.and.:lrv) vector
+  do i = 1, n
+     lrv = lrv .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.and.:lresult) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.and.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .and. (array(i) .ge. 5)
+     lrc = lrc .and. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -164,17 +393,46 @@ program reduction_1
      lvresult = lvresult .and. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
+
+  !
+  ! '.or.' reductions
+  !
 
-  lresult = .false.
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
   lvresult = .false.
 
-  ! '.or.' reductions
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.or.:lrg) gang
+  do i = 1, n
+     lrg = lrg .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.or.:lrw) worker
+  do i = 1, n
+     lrw = lrw .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.or.:lrv) vector
+  do i = 1, n
+     lrv = lrv .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.or.:lresult) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.or.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .or. (array(i) .ge. 5)
+     lrc = lrc .or. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -183,17 +441,46 @@ program reduction_1
      lvresult = lvresult .or. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.eqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.eqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.eqv.:lrw) worker
+  do i = 1, n
+     lrw = lrw .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.eqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.eqv.:lresult) gang
+  !$acc parallel num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.eqv.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .eqv. (array(i) .ge. 5)
+     lrc = lrc .eqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -202,17 +489,46 @@ program reduction_1
      lvresult = lvresult .eqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.neqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.neqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.neqv.:lrw) worker
+  do i = 1, n
+     lrw = lrw .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.neqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.neqv.:lresult) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.neqv.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .neqv. (array(i) .ge. 5)
+     lrc = lrc .neqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -221,5 +537,8 @@ program reduction_1
      lvresult = lvresult .neqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 end program reduction_1
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-2.f90
index fe6a9c3..cd09099 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-2.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-2.f90
@@ -5,26 +5,52 @@
 program reduction_2
   implicit none
 
-  integer, parameter    :: n = 10, gangs = 20
+  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
   integer               :: i
-  real, parameter       :: e = .001
-  real                  :: vresult, result
-  logical               :: lresult, lvresult
-  real, dimension (n) :: array
+  real                  :: vresult, rg, rw, rv, rc
+  real, parameter       :: e = 0.001
+  logical               :: lrg, lrw, lrv, lrc, lvresult
+  real, dimension (n)   :: array
 
   do i = 1, n
      array(i) = i
   end do
 
-  result = 0
+  !
+  ! '+' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! '+' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(+:rg) gang
+  do i = 1, n
+     rg = rg + array(i)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(+:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(+:rw) worker
   do i = 1, n
-     result = result + array(i)
+     rw = rw + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(+:rv) vector
+  do i = 1, n
+     rv = rv + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(+:rc) gang worker vector
+  do i = 1, n
+     rc = rc + array(i)
   end do
   !$acc end parallel
 
@@ -33,17 +59,46 @@ program reduction_2
      vresult = vresult + array(i)
   end do
 
-  if (abs (result - vresult) .ge. e) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
+
+  !
+  ! '*' reductions
+  !
 
-  result = 1
+  rg = 1
+  rw = 1
+  rv = 1
+  rc = 1
   vresult = 1
 
-  ! '*' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(*:rg) gang
+  do i = 1, n
+     rg = rg * array(i)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(*:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(*:rw) worker
   do i = 1, n
-     result = result * array(i)
+     rw = rw * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(*:rv) vector
+  do i = 1, n
+     rv = rv * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(*:rc) gang worker vector
+  do i = 1, n
+     rc = rc * array(i)
   end do
   !$acc end parallel
 
@@ -52,17 +107,46 @@ program reduction_2
      vresult = vresult * array(i)
   end do
 
-  if (result.ne.vresult) call abort
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
+
+  !
+  ! 'max' reductions
+  !
 
-  result = 0
+  rg = 0
+  rw = 0
+  rg = 0
+  rc = 0
   vresult = 0
 
-  ! 'max' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(max:rg) gang
+  do i = 1, n
+     rg = max (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(max:rw) worker
+  do i = 1, n
+     rw = max (rw, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(max:result) gang
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(max:rv) vector
   do i = 1, n
-     result = max (result, array(i))
+     rv = max (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(max:rc) gang worker vector
+  do i = 1, n
+     rc = max (rc, array(i))
   end do
   !$acc end parallel
 
@@ -71,17 +155,46 @@ program reduction_2
      vresult = max (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  result = 1
-  vresult = 1
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
 
+  !
   ! 'min' reductions
+  !
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(min:result) gang
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
+  vresult = 0
+
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(min:rg) gang
+  do i = 1, n
+     rg = min (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(min:rw) worker
+  do i = 1, n
+     rw = min (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(min:rv) vector
   do i = 1, n
-     result = min (result, array(i))
+     rv = min (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(min:rc) gang worker vector
+  do i = 1, n
+     rc = min (rc, array(i))
   end do
   !$acc end parallel
 
@@ -90,17 +203,46 @@ program reduction_2
      vresult = min (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
 
+  !
   ! '.and.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.and.:lrg) gang
+  do i = 1, n
+     lrg = lrg .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.and.:lresult) gang
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.and.:lrw) worker
   do i = 1, n
-     lresult = lresult .and. (array(i) .ge. 5)
+     lrw = lrw .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.and.:lrv) vector
+  do i = 1, n
+     lrv = lrv .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.and.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .and. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -109,17 +251,46 @@ program reduction_2
      lvresult = lvresult .and. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
-  lresult = .false.
+  !
+  ! '.or.' reductions
+  !
+
+  lrg = .false.
+  lrw = .false.
+  lrv = .false.
+  lrc = .false.
   lvresult = .false.
 
-  ! '.or.' reductions
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.or.:lrg) gang
+  do i = 1, n
+     lrg = lrg .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.or.:lrw) worker
+  do i = 1, n
+     lrw = lrw .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.or.:lresult) gang
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.or.:lrv) vector
   do i = 1, n
-     lresult = lresult .or. (array(i) .ge. 5)
+     lrv = lrv .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.or.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .or. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -128,17 +299,46 @@ program reduction_2
      lvresult = lvresult .or. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.eqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.eqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.eqv.:lresult) gang
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.eqv.:lrw) worker
   do i = 1, n
-     lresult = lresult .eqv. (array(i) .ge. 5)
+     lrw = lrw .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.eqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.eqv.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .eqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -147,17 +347,46 @@ program reduction_2
      lvresult = lvresult .eqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.neqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.neqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.neqv.:lrw) worker
+  do i = 1, n
+     lrw = lrw .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.neqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.neqv.:lresult) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.neqv.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .neqv. (array(i) .ge. 5)
+     lrc = lrc .neqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -166,5 +395,8 @@ program reduction_2
      lvresult = lvresult .neqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 end program reduction_2
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-3.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-3.f90
index 155b903..a7dbf2b 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-3.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-3.f90
@@ -5,26 +5,52 @@
 program reduction_3
   implicit none
 
-  integer, parameter    :: n = 10, gangs = 20
+  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
   integer               :: i
-  double precision, parameter :: e = .001
-  double precision      :: vresult, result
-  logical               :: lresult, lvresult
+  double precision      :: vresult, rg, rw, rv, rc
+  double precision, parameter :: e = 0.001
+  logical               :: lrg, lrw, lrv, lrc, lvresult
   double precision, dimension (n) :: array
 
   do i = 1, n
      array(i) = i
   end do
 
-  result = 0
+  !
+  ! '+' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! '+' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(+:rg) gang
+  do i = 1, n
+     rg = rg + array(i)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(+:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(+:rw) worker
   do i = 1, n
-     result = result + array(i)
+     rw = rw + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(+:rv) vector
+  do i = 1, n
+     rv = rv + array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(+:rc) gang worker vector
+  do i = 1, n
+     rc = rc + array(i)
   end do
   !$acc end parallel
 
@@ -33,17 +59,46 @@ program reduction_3
      vresult = vresult + array(i)
   end do
 
-  if (abs (result - vresult) .ge. e) call abort
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
+
+  !
+  ! '*' reductions
+  !
 
-  result = 1
+  rg = 1
+  rw = 1
+  rv = 1
+  rc = 1
   vresult = 1
 
-  ! '*' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(*:rg) gang
+  do i = 1, n
+     rg = rg * array(i)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(*:result) gang
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(*:rw) worker
   do i = 1, n
-     result = result * array(i)
+     rw = rw * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(*:rv) vector
+  do i = 1, n
+     rv = rv * array(i)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(*:rc) gang worker vector
+  do i = 1, n
+     rc = rc * array(i)
   end do
   !$acc end parallel
 
@@ -52,17 +107,46 @@ program reduction_3
      vresult = vresult * array(i)
   end do
 
-  if (result.ne.vresult) call abort
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
+
+  !
+  ! 'max' reductions
+  !
 
-  result = 0
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! 'max' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(max:rg) gang
+  do i = 1, n
+     rg = max (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(max:rw) worker
+  do i = 1, n
+     rw = max (rw, array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(max:result) gang
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(max:rv) vector
   do i = 1, n
-     result = max (result, array(i))
+     rv = max (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(max:rc) gang worker vector
+  do i = 1, n
+     rc = max (rc, array(i))
   end do
   !$acc end parallel
 
@@ -71,17 +155,46 @@ program reduction_3
      vresult = max (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  result = 1
-  vresult = 1
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
 
+  !
   ! 'min' reductions
+  !
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(min:result) gang
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
+  vresult = 0
+
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(min:rg) gang
+  do i = 1, n
+     rg = min (rg, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(min:rw) worker
+  do i = 1, n
+     rw = min (rw, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(min:rv) vector
   do i = 1, n
-     result = min (result, array(i))
+     rv = min (rv, array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(min:rc) gang worker vector
+  do i = 1, n
+     rc = min (rc, array(i))
   end do
   !$acc end parallel
 
@@ -90,17 +203,46 @@ program reduction_3
      vresult = min (vresult, array(i))
   end do
 
-  if (result.ne.vresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (abs (rg - vresult) .ge. e) call abort
+  if (abs (rw - vresult) .ge. e) call abort
+  if (abs (rv - vresult) .ge. e) call abort
+  if (abs (rc - vresult) .ge. e) call abort
 
+  !
   ! '.and.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.and.:lrg) gang
+  do i = 1, n
+     lrg = lrg .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.and.:lresult) gang
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.and.:lrw) worker
   do i = 1, n
-     lresult = lresult .and. (array(i) .ge. 5)
+     lrw = lrw .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.and.:lrv) vector
+  do i = 1, n
+     lrv = lrv .and. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.and.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .and. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -109,17 +251,46 @@ program reduction_3
      lvresult = lvresult .and. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
-  lresult = .false.
+  !
+  ! '.or.' reductions
+  !
+
+  lrg = .false.
+  lrw = .false.
+  lrv = .false.
+  lrc = .false.
   lvresult = .false.
 
-  ! '.or.' reductions
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.or.:lrg) gang
+  do i = 1, n
+     lrg = lrg .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.or.:lrw) worker
+  do i = 1, n
+     lrw = lrw .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.or.:lresult) gang
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.or.:lrv) vector
   do i = 1, n
-     lresult = lresult .or. (array(i) .ge. 5)
+     lrv = lrv .or. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.or.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .or. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -128,17 +299,46 @@ program reduction_3
      lvresult = lvresult .or. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.eqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.eqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.eqv.:lresult) gang
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.eqv.:lrw) worker
   do i = 1, n
-     lresult = lresult .eqv. (array(i) .ge. 5)
+     lrw = lrw .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.eqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .eqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.eqv.:lrc) gang worker vector
+  do i = 1, n
+     lrc = lrc .eqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -147,17 +347,46 @@ program reduction_3
      lvresult = lvresult .eqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
-
-  lresult = .false.
-  lvresult = .false.
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 
+  !
   ! '.neqv.' reductions
+  !
+
+  lrg = .true.
+  lrw = .true.
+  lrv = .true.
+  lrc = .true.
+  lvresult = .true.
+
+  !$acc parallel num_gangs(ng) copy(lrg)
+  !$acc loop reduction(.neqv.:lrg) gang
+  do i = 1, n
+     lrg = lrg .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(lrw)
+  !$acc loop reduction(.neqv.:lrw) worker
+  do i = 1, n
+     lrw = lrw .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(lrv)
+  !$acc loop reduction(.neqv.:lrv) vector
+  do i = 1, n
+     lrv = lrv .neqv. (array(i) .ge. 5)
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(lresult)
-  !$acc loop reduction(.neqv.:lresult) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(lrc)
+  !$acc loop reduction(.neqv.:lrc) gang worker vector
   do i = 1, n
-     lresult = lresult .neqv. (array(i) .ge. 5)
+     lrc = lrc .neqv. (array(i) .ge. 5)
   end do
   !$acc end parallel
 
@@ -166,5 +395,8 @@ program reduction_3
      lvresult = lvresult .neqv. (array(i) .ge. 5)
   end do
 
-  if (lresult .neqv. lvresult) call abort
+  if (lrg .neqv. lvresult) call abort
+  if (lrw .neqv. lvresult) call abort
+  if (lrv .neqv. lvresult) call abort
+  if (lrc .neqv. lvresult) call abort
 end program reduction_3
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
index 8d4f6c1..c3bdaf6 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
@@ -5,50 +5,108 @@
 program reduction_4
   implicit none
 
-  integer, parameter    :: n = 10, gangs = 20
+  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
   integer               :: i
-  complex               :: vresult, result
+  real                  :: vresult, rg, rw, rv, rc
   complex, dimension (n) :: array
 
   do i = 1, n
      array(i) = i
   end do
 
-  result = 0
+  !
+  ! '+' reductions
+  !
+
+  rg = 0
+  rw = 0
+  rv = 0
+  rc = 0
   vresult = 0
 
-  ! '+' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(+:rg) gang
+  do i = 1, n
+     rg = rg + REAL(array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(+:rw) worker
+  do i = 1, n
+     rw = rw + REAL(array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(+:rv) vector
+  do i = 1, n
+     rv = rv + REAL(array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs(gangs) copy(result)
-  !$acc loop reduction(+:result) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(+:rc) gang worker vector
   do i = 1, n
-     result = result + array(i)
+     rc = rc + REAL(array(i))
   end do
   !$acc end parallel
 
   ! Verify the results
   do i = 1, n
-     vresult = vresult + array(i)
+     vresult = vresult + REAL(array(i))
   end do
 
-  if (result .ne. vresult) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 
-  result = 1
+  !
+  ! '*' reductions
+  !
+
+  rg = 1
+  rw = 1
+  rv = 1
+  rc = 1
   vresult = 1
 
-  ! '*' reductions
+  !$acc parallel num_gangs(ng) copy(rg)
+  !$acc loop reduction(*:rg) gang
+  do i = 1, n
+     rg = rg * REAL(array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel num_workers(nw) copy(rw)
+  !$acc loop reduction(*:rw) worker
+  do i = 1, n
+     rw = rw * REAL(array(i))
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length(vl) copy(rv)
+  !$acc loop reduction(*:rv) vector
+  do i = 1, n
+     rv = rv * REAL(array(i))
+  end do
+  !$acc end parallel
 
-  !$acc parallel num_gangs (gangs) copy(result)
-  !$acc loop reduction(*:result) gang
+  !$acc parallel num_gangs(ng) num_workers(nw) vector_length(vl) copy(rc)
+  !$acc loop reduction(*:rc) gang worker vector
   do i = 1, n
-     result = result * array(i)
+     rc = rc * REAL(array(i))
   end do
   !$acc end parallel
 
   ! Verify the results
   do i = 1, n
-     vresult = vresult * array(i)
+     vresult = vresult * REAL(array(i))
   end do
 
-  if (result .ne. vresult) call abort
+  if (rg .ne. vresult) call abort
+  if (rw .ne. vresult) call abort
+  if (rv .ne. vresult) call abort
+  if (rc .ne. vresult) call abort
 end program reduction_4
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
index 1066fa7..304fe7f 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
@@ -4,9 +4,12 @@
 
 program reduction
   integer, parameter    :: n = 40, c = 10
-  integer               :: i, vsum, sum
+  integer               :: i, vsum, gs, ws, vs, cs
 
-  call redsub (sum, n, c)
+  call redsub_gang (gs, n, c)
+  call redsub_worker (gs, n, c)
+  call redsub_vector (vs, n, c)
+  call redsub_combined (cs, n, c)
 
   vsum = 0
 
@@ -15,10 +18,11 @@ program reduction
      vsum = vsum + c
   end do
 
-  if (sum.ne.vsum) call abort ()
+  if (gs .ne. vsum) call abort ()
+  if (vs .ne. vsum) call abort ()
 end program reduction
 
-subroutine redsub(sum, n, c)
+subroutine redsub_gang(sum, n, c)
   integer :: sum, n, c
 
   sum = 0
@@ -29,4 +33,43 @@ subroutine redsub(sum, n, c)
      sum = sum + c
   end do
   !$acc end parallel
-end subroutine redsub
+end subroutine redsub_gang
+
+subroutine redsub_worker(sum, n, c)
+  integer :: sum, n, c
+
+  sum = 0
+
+  !$acc parallel copyin (n, c) num_workers(4) vector_length (32) copy(sum)
+  !$acc loop reduction(+:sum) worker
+  do i = 1, n
+     sum = sum + c
+  end do
+  !$acc end parallel
+end subroutine redsub_worker
+
+subroutine redsub_vector(sum, n, c)
+  integer :: sum, n, c
+
+  sum = 0
+
+  !$acc parallel copyin (n, c) vector_length(32) copy(sum)
+  !$acc loop reduction(+:sum) vector
+  do i = 1, n
+     sum = sum + c
+  end do
+  !$acc end parallel
+end subroutine redsub_vector
+
+subroutine redsub_combined(sum, n, c)
+  integer :: sum, n, c
+
+  sum = 0
+
+  !$acc parallel num_gangs (8) num_workers (4) vector_length(32) copy(sum)
+  !$acc loop reduction(+:sum) gang worker vector
+  do i = 1, n
+     sum = sum + c
+  end do
+  !$acc end parallel
+end subroutine redsub_combined
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/reduction-6.f90 b/libgomp/testsuite/libgomp.oacc-fortran/reduction-6.f90
index 2733968..990faac 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-6.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-6.f90
@@ -3,28 +3,91 @@
 program reduction
   implicit none
 
-  integer, parameter    :: n = 100
-  integer               :: i, s1, s2, vs1, vs2
+  integer, parameter    :: n = 100, n2 = 1000, chunksize = 10
+  integer               :: i, gs1, gs2, ws1, ws2, vs1, vs2, cs1, cs2, hs1, hs2
+  integer               :: j, red, vred
 
-  s1 = 0
-  s2 = 0
+  gs1 = 0
+  gs2 = 0
+  ws1 = 0
+  ws2 = 0
   vs1 = 0
   vs2 = 0
+  cs1 = 0
+  cs2 = 0
+  hs1 = 0
+  hs2 = 0
 
-  !$acc parallel num_gangs (1000) copy(s1, s2)
-  !$acc loop reduction(+:s1, s2) gang
+  !$acc parallel num_gangs (1000) copy(gs1, gs2)
+  !$acc loop reduction(+:gs1, gs2) gang
   do i = 1, n
-     s1 = s1 + 1
-     s2 = s2 + 2
+     gs1 = gs1 + 1
+     gs2 = gs2 + 2
   end do
   !$acc end parallel
 
-  ! Verify the results
+  !$acc parallel num_workers (4) vector_length (32) copy(ws1, ws2)
+  !$acc loop reduction(+:ws1, ws2) worker
+  do i = 1, n
+     ws1 = ws1 + 1
+     ws2 = ws2 + 2
+  end do
+  !$acc end parallel
+
+  !$acc parallel vector_length (32) copy(vs1, vs2)
+  !$acc loop reduction(+:vs1, vs2) vector
   do i = 1, n
      vs1 = vs1 + 1
      vs2 = vs2 + 2
   end do
+  !$acc end parallel
+
+  !$acc parallel num_gangs(8) num_workers(4) vector_length(32) copy(cs1, cs2)
+  !$acc loop reduction(+:cs1, cs2) gang worker vector
+  do i = 1, n
+     cs1 = cs1 + 1
+     cs2 = cs2 + 2
+  end do
+  !$acc end parallel
+
+  ! Verify the results on the host
+  do i = 1, n
+     hs1 = hs1 + 1
+     hs2 = hs2 + 2
+  end do
+
+  if (gs1 .ne. hs1) call abort ()
+  if (gs2 .ne. hs2) call abort ()
+
+  if (ws1 .ne. hs1) call abort ()
+  if (ws2 .ne. hs2) call abort ()
+
+  if (vs1 .ne. hs1) call abort ()
+  if (vs2 .ne. hs2) call abort ()
+
+  if (cs1 .ne. hs1) call abort ()
+  if (cs2 .ne. hs2) call abort ()
+
+  ! Nested reductions.
+
+  red = 0
+  vred = 0
+
+  !$acc parallel num_gangs(10) vector_length(32) copy(red)
+  !$acc loop reduction(+:red) gang
+  do i = 1, n/chunksize
+     !$acc loop reduction(+:red) vector
+     do j = 1, chunksize
+        red = red + chunksize
+     end do
+  end do
+  !$acc end parallel
+
+  do i = 1, n/chunksize
+     do j = 1, chunksize
+        vred = vred + chunksize
+     end do
+  end do
 
-  if (s1.ne.vs1) call abort ()
-  if (s2.ne.vs2) call abort ()
+  if (red .ne. vred) call abort ()
 end program reduction

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gomp4] OpenACC reduction tests
  2015-07-17 18:26 ` [gomp4] OpenACC reduction tests Cesar Philippidis
@ 2015-09-18  8:29   ` Thomas Schwinge
  2015-09-23  8:50     ` Thomas Schwinge
  2015-09-18 13:40   ` Thomas Schwinge
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Schwinge @ 2015-09-18  8:29 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 6160 bytes --]

Hi Cesar!

Great progress with your OpenACC reductions work!

On Fri, 17 Jul 2015 11:13:59 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
> This patch updates the libgomp OpenACC reduction test cases [...]

> --- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
> @@ -4,9 +4,12 @@
>  
>  program reduction
>    integer, parameter    :: n = 40, c = 10
> -  integer               :: i, vsum, sum
> +  integer               :: i, vsum, gs, ws, vs, cs
>  
> -  call redsub (sum, n, c)
> +  call redsub_gang (gs, n, c)
> +  call redsub_worker (gs, n, c)
> +  call redsub_vector (vs, n, c)
> +  call redsub_combined (cs, n, c)
>  
>    vsum = 0
>  
> @@ -15,10 +18,11 @@ program reduction
>       vsum = vsum + c
>    end do
>  
> -  if (sum.ne.vsum) call abort ()
> +  if (gs .ne. vsum) call abort ()
> +  if (vs .ne. vsum) call abort ()
>  end program reduction

This looks incomplete to me, so I extended it as follows.

With -O0, I frequently see this test FAIL (thus XFAILed), both for nvptx
offloading and host-fallback execution.  Adding a few printfs, I observe
redsub_gang compute "random" results.  Given the following
-Wuninitialized/-Wmaybe-uninitialized warnings (for -O1, for example),
maybe there's some initialization of (internal) variables missing?
(These user-visible warnings about compiler internals need to be
addressed regardless.)  Would you please have a look at that?

    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_combined_._omp_fn.0':
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:73:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
       !$acc loop reduction(+:sum) gang worker vector
    ^
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_vector_._omp_fn.1':
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:60:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
       !$acc loop reduction(+:sum) vector
    ^
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_worker_._omp_fn.2':
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:47:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
       !$acc loop reduction(+:sum) worker
    ^
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_gang_._omp_fn.3':
    source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:34:0: warning: 'sum.43' may be used uninitialized in this function [-Wmaybe-uninitialized]
       !$acc loop reduction(+:sum) gang
    ^

Committed to gomp-4_0-branch in r227897:

commit 0a1cca2cc3c1d1e2310c6438299e63a7bd99396b
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Sep 18 08:07:47 2015 +0000

    Extend OpenACC reduction test case, XFAIL for -O0
    
    	libgomp/
    	* testsuite/libgomp.oacc-fortran/reduction-5.f90: Extend.  XFAIL
    	execution test for -O0.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@227897 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                                 |    5 +++++
 libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 |    5 ++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 63bc7dc..0c0e697 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2015-09-18  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-fortran/reduction-5.f90: Extend.  XFAIL
+	execution test for -O0.
+
 2015-09-15  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* oacc-parallel.c (GOACC_parallel_keyed): Use GOMP_DIM constants.
diff --git libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
index 304fe7f..f787e7d 100644
--- libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-xfail-run-if "TODO" { *-*-* } { "-O0" } }
 
 ! subroutine reduction
 
@@ -7,7 +8,7 @@ program reduction
   integer               :: i, vsum, gs, ws, vs, cs
 
   call redsub_gang (gs, n, c)
-  call redsub_worker (gs, n, c)
+  call redsub_worker (ws, n, c)
   call redsub_vector (vs, n, c)
   call redsub_combined (cs, n, c)
 
@@ -19,7 +20,9 @@ program reduction
   end do
 
   if (gs .ne. vsum) call abort ()
+  if (ws .ne. vsum) call abort ()
   if (vs .ne. vsum) call abort ()
+  if (cs .ne. vsum) call abort ()
 end program reduction
 
 subroutine redsub_gang(sum, n, c)

> -subroutine redsub(sum, n, c)
> +subroutine redsub_gang(sum, n, c)
>    integer :: sum, n, c
>  
>    sum = 0
> @@ -29,4 +33,43 @@ subroutine redsub(sum, n, c)
>       sum = sum + c
>    end do
>    !$acc end parallel
> -end subroutine redsub
> +end subroutine redsub_gang
> +
> +subroutine redsub_worker(sum, n, c)
> +  integer :: sum, n, c
> +
> +  sum = 0
> +
> +  !$acc parallel copyin (n, c) num_workers(4) vector_length (32) copy(sum)
> +  !$acc loop reduction(+:sum) worker
> +  do i = 1, n
> +     sum = sum + c
> +  end do
> +  !$acc end parallel
> +end subroutine redsub_worker
> +
> +subroutine redsub_vector(sum, n, c)
> +  integer :: sum, n, c
> +
> +  sum = 0
> +
> +  !$acc parallel copyin (n, c) vector_length(32) copy(sum)
> +  !$acc loop reduction(+:sum) vector
> +  do i = 1, n
> +     sum = sum + c
> +  end do
> +  !$acc end parallel
> +end subroutine redsub_vector
> +
> +subroutine redsub_combined(sum, n, c)
> +  integer :: sum, n, c
> +
> +  sum = 0
> +
> +  !$acc parallel num_gangs (8) num_workers (4) vector_length(32) copy(sum)
> +  !$acc loop reduction(+:sum) gang worker vector
> +  do i = 1, n
> +     sum = sum + c
> +  end do
> +  !$acc end parallel
> +end subroutine redsub_combined


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gomp4] OpenACC reduction tests
  2015-07-17 18:26 ` [gomp4] OpenACC reduction tests Cesar Philippidis
  2015-09-18  8:29   ` Thomas Schwinge
@ 2015-09-18 13:40   ` Thomas Schwinge
  2016-04-12 11:39     ` [PR testsuite/68242] FAIL: libgomp.oacc-c-c++-common/reduction-2.c, and other OpenACC reduction test case "oddities" (was: [gomp4] OpenACC reduction tests) Thomas Schwinge
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Schwinge @ 2015-09-18 13:40 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 5194 bytes --]

Hi Cesar!

On Fri, 17 Jul 2015 11:13:59 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
> This patch updates the libgomp OpenACC reduction test cases to check
> worker, vector and combined gang worker vector reductions. I tried to
> use some macros to simplify the c test cases a bit. I probably could
> have made them more generic with an additional header file/macro, but
> then that makes it too confusing too debug. The fortran tests are a bit
> of a lost clause, unless someone knows how to use the preprocessor with
> !$acc loops.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c

> +static void
> +test_reductions (void)
>  {

> -  [...]
> +  const int n = 100;
>    int i;
> -  [...]
> +  float array[n];
>  
>    for (i = 0; i < n; i++)
> -    [...]
> +    array[i] = i+1;
>  
> -  [...]
> +  /* Gang reductions.  */
> +  check_reduction_op (float, +, 0, array[i], num_gangs (ng), gang);
> +  check_reduction_op (float, *, 1, array[i], num_gangs (ng), gang);

I see this one reproducibly FAIL in the x86_64 -m32 multilib's
host-fallback testing (there is no nvptx offloading for 32-bit
configurations).  (The -m32 multilib is configured/enabled by default, so
fixing this is a prerequisite for trunk integration.)  From a very quick
glance, might it be that we're overflowing the float data type with the
"1 * 2 * 3 * [...] * 1000" computation?  The OpenACC reduction computes
"inf" which is then compared against a very high finite reference value
-- or the other way round (I lost my debugging session).  Instead of
multiplying these "big" numbers, I guess we should just do a more
idiomatic floating point computation?

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c

>  /* complex reductions.  */

> +static void
> +test_reductions (void)
>  {

> +  double _Complex array[n];
> +
> +  for (i = 0; i < n; i++)
> +    array[i] = i+1;
> +
> +  /* Gang reductions.  */
> +  check_reduction_op (double, +, 0, creal (array[i]), num_gangs (ng), gang);

Given that in the check_reduction_op instantiations you're specifying a
"double" data type (instead of "double _Complex", for example), and
"creal (array[i])" reduction operands (instead of "array[i]", for
example), we're not actually testing reductions with complex data types,
so I guess that should be changed.  :-)

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction.h
> @@ -0,0 +1,43 @@
> +#ifndef REDUCTION_H
> +#define REDUCTION_H
> +
> +#define DO_PRAGMA(x) _Pragma (#x)
> +
> +#define check_reduction_op(type, op, init, b, gwv_par, gwv_loop)	\
> +  {									\
> +    type res, vres;							\
> +    res = (init);							\
> +DO_PRAGMA (acc parallel gwv_par copy (res))				\
> +DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
> +    for (i = 0; i < n; i++)						\
> +      res = res op (b);							\
> +									\
> +    vres = (init);							\
> +    for (i = 0; i < n; i++)						\
> +      vres = vres op (b);						\
> +									\
> +    if (res != vres)							\
> +      abort ();								\
> +  }

It's the right thing for integer data types, but for anything floating
point, we should be allowing for some small difference (epsilon) between
res and vres, due to rounding differences in the OpenACC reduction
(possibly offloaded) and reference value computation, and similar.

> +#define check_reduction_macro(type, op, init, b, gwv_par, gwv_loop)	\
> +  {									\
> +    type res, vres;							\
> +    res = (init);							\
> +    DO_PRAGMA (acc parallel gwv_par copy(res))				\
> +DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
> +    for (i = 0; i < n; i++)						\
> +      res = op (res, (b));						\
> +									\
> +    vres = (init);							\
> +    for (i = 0; i < n; i++)						\
> +      vres = op (vres, (b));						\
> +									\
> +    if (res != vres)							\
> +      abort ();								\
> +  }

Likewise.

> +#define max(a, b) (((a) > (b)) ? (a) : (b))
> +#define min(a, b) (((a) < (b)) ? (a) : (b))
> +
> +#endif

> --- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
> @@ -5,50 +5,108 @@
>  program reduction_4
>    implicit none
>  
> -  integer, parameter    :: n = 10, gangs = 20
> +  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
>    integer               :: i
> -  complex               :: vresult, result
> +  real                  :: vresult, rg, rw, rv, rc
>    complex, dimension (n) :: array

Same problem as in the C test case: not actually testing complex data
types:

>    do i = 1, n
>       array(i) = i
>    end do
>  
> -[...]
> +  !
> +  ! '+' reductions
> +  !
> +
> +  rg = 0
> +  rw = 0
> +  rv = 0
> +  rc = 0
>    vresult = 0
>  
> -[...]
> +  !$acc parallel num_gangs(ng) copy(rg)
> +  !$acc loop reduction(+:rg) gang
> +  do i = 1, n
> +     rg = rg + REAL(array(i))
> +  end do
> +  !$acc end parallel


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gomp4] OpenACC reduction tests
  2015-09-18  8:29   ` Thomas Schwinge
@ 2015-09-23  8:50     ` Thomas Schwinge
  2015-11-07 11:15       ` Thomas Schwinge
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Schwinge @ 2015-09-23  8:50 UTC (permalink / raw)
  To: Cesar Philippidis, Nathan Sidwell; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4005 bytes --]

Hi!

On Fri, 18 Sep 2015 10:11:25 +0200, I wrote:
> On Fri, 17 Jul 2015 11:13:59 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
> > This patch updates the libgomp OpenACC reduction test cases [...]
> 
> > --- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
> > +++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90

> With -O0, I frequently see this test FAIL (thus XFAILed), both for nvptx
> offloading and host-fallback execution.  Adding a few printfs, I observe
> redsub_gang compute "random" results.

This seems to have gotten fixed with Nathan's recent "Another oacc
reduction simplification",
<http://news.gmane.org/find-root.php?message_id=%3C560173E1.9030801%40acm.org%3E>,
so I'm removing the XFAIL.


The following issue however remains to be addressed:

> Given the following
> -Wuninitialized/-Wmaybe-uninitialized warnings (for -O1, for example),
> maybe there's some initialization of (internal) variables missing?
> (These user-visible warnings about compiler internals need to be
> addressed regardless.)  Would you please have a look at that?
> 
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_combined_._omp_fn.0':
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:73:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
>        !$acc loop reduction(+:sum) gang worker vector
>     ^
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_vector_._omp_fn.1':
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:60:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
>        !$acc loop reduction(+:sum) vector
>     ^
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_worker_._omp_fn.2':
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:47:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
>        !$acc loop reduction(+:sum) worker
>     ^
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_gang_._omp_fn.3':
>     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:34:0: warning: 'sum.43' may be used uninitialized in this function [-Wmaybe-uninitialized]
>        !$acc loop reduction(+:sum) gang
>     ^


Committed to gomp-4_0-branch in r228035:

commit 705169947333655ded3427985b34b758a5bc6cf5
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Wed Sep 23 07:54:15 2015 +0000

    Remove XFAIL of OpenACC reduction execution test case for -O0
    
    	libgomp/
    	* testsuite/libgomp.oacc-fortran/reduction-5.f90: Remove XFAIL of
    	execution test for -O0.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@228035 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                                 | 5 +++++
 libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 | 1 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 46c1a05..47db0d4 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2015-09-23  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-fortran/reduction-5.f90: Remove XFAIL of
+	execution test for -O0.
+
 2015-09-22  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* testsuite/libgomp.oacc-fortran/dummy-array.f90: New test.
diff --git libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90 libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
index f787e7d..180c9a2 100644
--- libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90
@@ -1,5 +1,4 @@
 ! { dg-do run }
-! { dg-xfail-run-if "TODO" { *-*-* } { "-O0" } }
 
 ! subroutine reduction
 


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [gomp4] OpenACC reduction tests
  2015-09-23  8:50     ` Thomas Schwinge
@ 2015-11-07 11:15       ` Thomas Schwinge
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Schwinge @ 2015-11-07 11:15 UTC (permalink / raw)
  To: Cesar Philippidis, Nathan Sidwell; +Cc: gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 2053 bytes --]

Hi!

On Wed, 23 Sep 2015 09:56:44 +0200, I wrote:
> On Fri, 18 Sep 2015 10:11:25 +0200, I wrote:
> > On Fri, 17 Jul 2015 11:13:59 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
> > > This patch updates the libgomp OpenACC reduction test cases [...]

> > Given the following
> > -Wuninitialized/-Wmaybe-uninitialized warnings (for -O1, for example),
> > maybe there's some initialization of (internal) variables missing?
> > (These user-visible warnings about compiler internals need to be
> > addressed regardless.)  Would you please have a look at that?
> > 
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_combined_._omp_fn.0':
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:73:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
> >        !$acc loop reduction(+:sum) gang worker vector
> >     ^
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_vector_._omp_fn.1':
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:60:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
> >        !$acc loop reduction(+:sum) vector
> >     ^
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_worker_._omp_fn.2':
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:47:0: warning: '<anonymous>' is used uninitialized in this function [-Wuninitialized]
> >        !$acc loop reduction(+:sum) worker
> >     ^
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90: In function 'redsub_gang_._omp_fn.3':
> >     source-gcc/libgomp/testsuite/libgomp.oacc-fortran/reduction-5.f90:34:0: warning: 'sum.43' may be used uninitialized in this function [-Wmaybe-uninitialized]
> >        !$acc loop reduction(+:sum) gang
> >     ^

I didn't see anyone explicitly claim to have fixed that; however, the
warnings are gone.


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PR testsuite/68242] FAIL: libgomp.oacc-c-c++-common/reduction-2.c, and other OpenACC reduction test case "oddities" (was: [gomp4] OpenACC reduction tests)
  2015-09-18 13:40   ` Thomas Schwinge
@ 2016-04-12 11:39     ` Thomas Schwinge
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Schwinge @ 2016-04-12 11:39 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 6040 bytes --]

Hi Cesar!

(At least several of) the issues that I pointed out (see below) have
never been fixed on gomp-4_0-branch, but the test cases have now been
merged from gomp-4_0-branch into trunk, so the regression (PASS -> FAIL
for libgomp.oacc-c-c++-common/reduction-2.c) as well as the other
"oddities" are now to be fixed in trunk.  I re-assigned
<https://gcc.gnu.org/PR68242> from Nathan to Cesar.  (I didn't verify
that the following list of items is conclusive/complete.)

On Fri, 18 Sep 2015 15:37:58 +0200, I wrote:
> Hi Cesar!
> 
> On Fri, 17 Jul 2015 11:13:59 -0700, Cesar Philippidis <cesar@codesourcery.com> wrote:
> > This patch updates the libgomp OpenACC reduction test cases to check
> > worker, vector and combined gang worker vector reductions. I tried to
> > use some macros to simplify the c test cases a bit. I probably could
> > have made them more generic with an additional header file/macro, but
> > then that makes it too confusing too debug. The fortran tests are a bit
> > of a lost clause, unless someone knows how to use the preprocessor with
> > !$acc loops.
> 
> > --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
> > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-2.c
> 
> > +static void
> > +test_reductions (void)
> >  {
> 
> > -  [...]
> > +  const int n = 100;
> >    int i;
> > -  [...]
> > +  float array[n];
> >  
> >    for (i = 0; i < n; i++)
> > -    [...]
> > +    array[i] = i+1;
> >  
> > -  [...]
> > +  /* Gang reductions.  */
> > +  check_reduction_op (float, +, 0, array[i], num_gangs (ng), gang);
> > +  check_reduction_op (float, *, 1, array[i], num_gangs (ng), gang);
> 
> I see this one reproducibly FAIL in the x86_64 -m32 multilib's
> host-fallback testing (there is no nvptx offloading for 32-bit
> configurations).  (The -m32 multilib is configured/enabled by default, so
> fixing this is a prerequisite for trunk integration.)  From a very quick
> glance, might it be that we're overflowing the float data type with the
> "1 * 2 * 3 * [...] * 1000" computation?  The OpenACC reduction computes
> "inf" which is then compared against a very high finite reference value
> -- or the other way round (I lost my debugging session).  Instead of
> multiplying these "big" numbers, I guess we should just do a more
> idiomatic floating point computation?
> 
> > --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
> > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-4.c
> 
> >  /* complex reductions.  */
> 
> > +static void
> > +test_reductions (void)
> >  {
> 
> > +  double _Complex array[n];
> > +
> > +  for (i = 0; i < n; i++)
> > +    array[i] = i+1;
> > +
> > +  /* Gang reductions.  */
> > +  check_reduction_op (double, +, 0, creal (array[i]), num_gangs (ng), gang);
> 
> Given that in the check_reduction_op instantiations you're specifying a
> "double" data type (instead of "double _Complex", for example), and
> "creal (array[i])" reduction operands (instead of "array[i]", for
> example), we're not actually testing reductions with complex data types,
> so I guess that should be changed.  :-)
> 
> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction.h
> > @@ -0,0 +1,43 @@
> > +#ifndef REDUCTION_H
> > +#define REDUCTION_H
> > +
> > +#define DO_PRAGMA(x) _Pragma (#x)
> > +
> > +#define check_reduction_op(type, op, init, b, gwv_par, gwv_loop)	\
> > +  {									\
> > +    type res, vres;							\
> > +    res = (init);							\
> > +DO_PRAGMA (acc parallel gwv_par copy (res))				\
> > +DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
> > +    for (i = 0; i < n; i++)						\
> > +      res = res op (b);							\
> > +									\
> > +    vres = (init);							\
> > +    for (i = 0; i < n; i++)						\
> > +      vres = vres op (b);						\
> > +									\
> > +    if (res != vres)							\
> > +      abort ();								\
> > +  }
> 
> It's the right thing for integer data types, but for anything floating
> point, we should be allowing for some small difference (epsilon) between
> res and vres, due to rounding differences in the OpenACC reduction
> (possibly offloaded) and reference value computation, and similar.
> 
> > +#define check_reduction_macro(type, op, init, b, gwv_par, gwv_loop)	\
> > +  {									\
> > +    type res, vres;							\
> > +    res = (init);							\
> > +    DO_PRAGMA (acc parallel gwv_par copy(res))				\
> > +DO_PRAGMA (acc loop gwv_loop reduction (op:res))			\
> > +    for (i = 0; i < n; i++)						\
> > +      res = op (res, (b));						\
> > +									\
> > +    vres = (init);							\
> > +    for (i = 0; i < n; i++)						\
> > +      vres = op (vres, (b));						\
> > +									\
> > +    if (res != vres)							\
> > +      abort ();								\
> > +  }
> 
> Likewise.
> 
> > +#define max(a, b) (((a) > (b)) ? (a) : (b))
> > +#define min(a, b) (((a) < (b)) ? (a) : (b))
> > +
> > +#endif
> 
> > --- a/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
> > +++ b/libgomp/testsuite/libgomp.oacc-fortran/reduction-4.f90
> > @@ -5,50 +5,108 @@
> >  program reduction_4
> >    implicit none
> >  
> > -  integer, parameter    :: n = 10, gangs = 20
> > +  integer, parameter    :: n = 10, ng = 8, nw = 4, vl = 32
> >    integer               :: i
> > -  complex               :: vresult, result
> > +  real                  :: vresult, rg, rw, rv, rc
> >    complex, dimension (n) :: array
> 
> Same problem as in the C test case: not actually testing complex data
> types:
> 
> >    do i = 1, n
> >       array(i) = i
> >    end do
> >  
> > -[...]
> > +  !
> > +  ! '+' reductions
> > +  !
> > +
> > +  rg = 0
> > +  rw = 0
> > +  rv = 0
> > +  rc = 0
> >    vresult = 0
> >  
> > -[...]
> > +  !$acc parallel num_gangs(ng) copy(rg)
> > +  !$acc loop reduction(+:rg) gang
> > +  do i = 1, n
> > +     rg = rg + REAL(array(i))
> > +  end do
> > +  !$acc end parallel


Grüße
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-04-12 11:39 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-17 18:26 [gomp4] OpenACC vector and worker reductions Cesar Philippidis
2015-07-17 18:26 ` [gomp4] OpenACC reduction tests Cesar Philippidis
2015-09-18  8:29   ` Thomas Schwinge
2015-09-23  8:50     ` Thomas Schwinge
2015-11-07 11:15       ` Thomas Schwinge
2015-09-18 13:40   ` Thomas Schwinge
2016-04-12 11:39     ` [PR testsuite/68242] FAIL: libgomp.oacc-c-c++-common/reduction-2.c, and other OpenACC reduction test case "oddities" (was: [gomp4] OpenACC reduction tests) Thomas Schwinge

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).