public inbox for gcc-patches@gcc.gnu.org
* [gomp4] New reduction infrastructure for OpenACC
@ 2015-08-19 19:43 Cesar Philippidis
       [not found] ` <55D61EE2.4030402@acm.org>
  2015-09-01 12:53 ` Tom de Vries
  0 siblings, 2 replies; 4+ messages in thread
From: Cesar Philippidis @ 2015-08-19 19:43 UTC (permalink / raw)
  To: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4966 bytes --]

This patch introduces an infrastructure for reductions in OpenACC. This
infrastructure consists of four internal functions,
GOACC_REDUCTION_SETUP, GOACC_REDUCTION_INIT, GOACC_REDUCTION_FINI, and
GOACC_REDUCTION_TEARDOWN, along with a new target hook goacc.reduction.
Each internal function shares a common interface:

  var = ifn (*ref_to_res, local_var, level, op, lid, rid)

var is the intermediate and private result of the reduction. Usually,
var = local_var.

*ref_to_res is a pointer to the resulting reduction. This is only
non-NULL for gang reductions. All other reductions operate on local
variables, for which var suffices.

local_var is a local (private) copy of the reduction variable.

level is the GOMP_DIM of the reduction. Each function call may only
name one dim. If a loop combines gang, worker and vector partitioning,
then the ifn must be called once per dim.

op is the reduction operation.

lid is a unique loop ID. It's not 100% unique, because the counter may
get reset in different TUs.

rid is the reduction ID within a loop. E.g., if a loop has two
reductions associated with it, the first could be designated 0 and the
second 1.
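
For example, under the stated rule that *ref_to_res is non-NULL only at
the gang level, a loop partitioned over both gang and worker with a
single '+' reduction might expand to nested per-dim call pairs, sketched
below (the exact placement and the &res spelling are illustrative, not
taken from the patch):

```
  red = GOACC_REDUCTION_SETUP (&res, red, GOMP_DIM_GANG, '+', 0, 0);
  GOACC_FORK (GOMP_DIM_GANG);
  red = GOACC_REDUCTION_INIT (&res, red, GOMP_DIM_GANG, '+', 0, 0);

  red = GOACC_REDUCTION_SETUP (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_FORK (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_INIT (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

  ... loop body ...

  red = GOACC_REDUCTION_FINI (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_JOIN (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_TEARDOWN (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

  red = GOACC_REDUCTION_FINI (&res, red, GOMP_DIM_GANG, '+', 0, 0);
  GOACC_JOIN (GOMP_DIM_GANG);
  red = GOACC_REDUCTION_TEARDOWN (&res, red, GOMP_DIM_GANG, '+', 0, 0);
```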

The target hook takes one argument, the gimple statement containing
the call to the internal reduction function, and it returns true if the
expansion introduces any calls to other target functions. This is
necessary for the nvptx backend, specifically for vector INIT, which
needs the thread ID.

Each internal function is expanded during execute_oacc_transform using
that goacc reduction target hook. This allows us to generate
target-specific code while lowering it in a target-independent manner.

There are a couple of significant changes in this patch over the
existing OpenMP reduction implementation. The first change is that
reductions no longer rely on special ganglocal mappings. Certain
targets, such as nvptx GPUs, have a distributed memory hierarchy. On
nvptx targets, all of the processors are partitioned into blocks, and
each block has a limited amount of shared memory. Because of the way the
OpenACC spec is written, we were initially mapping nvptx's shared memory into
gang-local memory. However, Nathan's worker and vector state propagator
is robust enough that we were able to eliminate the ganglocal mappings
altogether.

While this new infrastructure allows us to eliminate the ganglocal
mappings, nvptx still needs to use shared memory for worker reductions.
Consider the following example where red is private:

  #pragma acc loop worker reduction (+:red)
  for (...)
    red++;

This loop would expand to this during omp-lower:

  red = GOACC_REDUCTION_SETUP (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_FORK (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_INIT (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

  for (...)
    red++;

  red = GOACC_REDUCTION_FINI (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);
  GOACC_JOIN (GOMP_DIM_WORKER);
  red = GOACC_REDUCTION_TEARDOWN (NULL, red, GOMP_DIM_WORKER, '+', 0, 0);

For nvptx targets, SETUP and TEARDOWN are responsible for allocating and
freeing shared memory. INIT is responsible for initializing the private
reduction variable. This is necessary for vector reductions because we
want thread 0 to contain the original value of local_var, and the other
threads to be initialized to the proper value for 'op'. All of the
intermediate reduction results are combined in FINI and written back to
var or *ref_to_res, whichever is necessary, in TEARDOWN.
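
For contrast with the nvptx behavior just described, the default hooks
added later in this patch target a single-threaded device, where every
internal function collapses to a plain assignment. Here is a small C
model of those semantics (the helper names are mine, not the patch's):

```c
#include <assert.h>
#include <stddef.h>

/* C model of the default, single-threaded hook semantics: each internal
   function degenerates to a simple assignment.  */

static int
red_setup (int *res_ptr, int local)
{
  /* V = RES_PTR ? *RES_PTR : LOCAL  */
  return res_ptr ? *res_ptr : local;
}

static int
red_init_fini (int *res_ptr, int local)
{
  /* V = LOCAL, for both INIT and FINI  */
  (void) res_ptr;
  return local;
}

static int
red_teardown (int *res_ptr, int local)
{
  /* if (RES_PTR) *RES_PTR = LOCAL;  V = LOCAL  */
  if (res_ptr)
    *res_ptr = local;
  return local;
}

/* Run a '+' reduction of COUNT increments, mimicking the expansion
   sequence shown above (FORK/JOIN are no-ops on one thread).  */
static int
run_reduction (int *res_ptr, int red, int count)
{
  red = red_setup (res_ptr, red);
  red = red_init_fini (res_ptr, red);   /* INIT */
  for (int ix = 0; ix < count; ix++)
    red++;
  red = red_init_fini (res_ptr, red);   /* FINI */
  return red_teardown (res_ptr, red);   /* write back to *RES_PTR */
}
```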

I don't want to delve too much into the use of this infrastructure right
now. We do have a design for that, and I intend to present more details
when I post the lowering patch. The next patch will likely be the nvptx
changes though.

One of the reasons why we needed to create this generic interface was to
implement vector reductions on nvptx targets. On nvptx targets, we're
mapping vectors to warps. That's fine, but warps cannot use spinlocks,
or the warp will deadlock. As a consequence, we can't use the existing
OpenMP atomic reductions in OpenACC. The way I got around the spinlock
problem in 5.0 was by allocating an array of length vector_length and
stashing all of the intermediate reductions in there. Then, later on,
one thread would merge all of those reductions together.
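
That vector_length-array trick can be sketched in plain C; this is a
serialized simulation of the idea, not the actual nvptx code, and the
names are illustrative:

```c
#include <assert.h>

#define VECTOR_LENGTH 32   /* illustrative; on nvptx this is the warp size */

/* Each vector lane stashes its partial reduction in a per-lane slot, then
   a single lane merges the slots, so no lock is ever taken in the warp.  */
static int
vector_sum (const int *data, int n)
{
  int partial[VECTOR_LENGTH] = { 0 };

  /* Phase 1: each "lane" accumulates a strided slice.  On the real
     target all lanes run in parallel; here they are serialized.  */
  for (int lane = 0; lane < VECTOR_LENGTH; lane++)
    for (int ix = lane; ix < n; ix += VECTOR_LENGTH)
      partial[lane] += data[ix];

  /* Phase 2: one lane merges the per-lane partials.  */
  int sum = 0;
  for (int lane = 0; lane < VECTOR_LENGTH; lane++)
    sum += partial[lane];
  return sum;
}
```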

This new reduction infrastructure provides a more elegant solution for
OpenACC reductions. And while we're still using atomic operations for
gang and worker reductions, we're no longer using a global lock for
workers. This API allows us to use a lock in shared memory for workers.
That said, this infrastructure does provide sufficient flexibility to
implement tree reductions for gangs and workers later on.
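
For reference, a tree reduction here means a log-step pairwise combine.
A minimal C sketch of the shape (assuming a power-of-two thread count;
this is only an illustration of the technique, not code from the patch):

```c
#include <assert.h>

/* Tree (log-step) reduction over N partial values: at each step, slot i
   (for i < stride) combines with slot i + stride.  On a real target each
   i is a separate thread and each step is a parallel round.  */
static int
tree_reduce (int *vals, int n)   /* n must be a power of two */
{
  for (int stride = n / 2; stride > 0; stride /= 2)
    for (int i = 0; i < stride; i++)
      vals[i] += vals[i + stride];
  return vals[0];
}
```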

It should be noted that this is not a replacement for the existing
OpenMP reductions. Rather, OpenMP will continue to use
lower_reduction_clauses and friends, while OpenACC will use this
infrastructure. That said, OpenMP could be taught to use this
infrastructure.

Is this patch OK for gomp-4_0-branch?

Thanks,
Cesar

[-- Attachment #2: reduction-infrastructure-gomp4.diff --]
[-- Type: text/x-patch, Size: 14052 bytes --]

2015-08-19  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Add a placeholder for TARGET_GOACC_REDUCTION.
	* internal-fn.c (expand_GOACC_REDUCTION_SETUP): New function.
	(expand_GOACC_REDUCTION_INIT): New function.
	(expand_GOACC_REDUCTION_FINI): New function.
	(expand_GOACC_REDUCTION_TEARDOWN): New function.
	* internal-fn.def (GOACC_REDUCTION_SETUP, GOACC_REDUCTION_INIT,
	GOACC_REDUCTION_FINI, GOACC_REDUCTION_TEARDOWN): New internal
	functions.
	* omp-low.c (execute_oacc_transform): Expand those new internal
	functions.
	(make_pass_oacc_transform): Add TODO_cleanup_cfg to todo_flags_finish.
	(default_goacc_reduction_setup): New function.
	(default_goacc_reduction_init_fini): New function.
	(default_goacc_reduction_teardown): New function.
	(default_goacc_reduction): New function.
	* target.def (reduction): New goacc target hook.
	* targhooks.h (default_goacc_reduction): Declare.


diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 12e57a7..0c8ba5d 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5767,6 +5767,19 @@ pass.  It should return true, if the functions should be deleted.  The
 default hook returns true, if there is no RTL expanders for them.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_GOACC_REDUCTION (gimple @var{call})
+This hook is used by the oacc_transform pass to expand calls to the
+internal functions @var{GOACC_REDUCTION_SETUP},
+@var{GOACC_REDUCTION_INIT},
+ @var{GOACC_REDUCTION_FINI} and
+ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.
+ @var{call} is gimple statement containing the call to the function.  This
+ hook removes statement @var{call} after the expanded sequence has been
+ inserted.  This hook is also responsible for allocating any storage for
+ reductions when necessary.  It returns @var{true} if the expanded
+sequence introduces any calls to OpenACC-specific internal functions.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 3bf98be..dfd32a7 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4253,6 +4253,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_GOACC_LOCK_UNLOCK
 
+@hook TARGET_GOACC_REDUCTION
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index fcb9c47..9c923ed 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -2051,6 +2051,39 @@ expand_GOACC_UNLOCK (gcall *ARG_UNUSED (stmt))
 #endif
 }
 
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_SETUP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_INIT (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_FINI (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+/* This should get expanded in oacc_transform.  */
+
+static void
+expand_GOACC_REDUCTION_TEARDOWN (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 0bb8a91..6c5db37 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -89,3 +89,21 @@ DEF_INTERNAL_FN (GOACC_DIM_POS, ECF_PURE | ECF_NOTHROW | ECF_LEAF, ".")
    argument is a loop identifer.  */
 DEF_INTERNAL_FN (GOACC_LOCK, ECF_NOTHROW | ECF_LEAF, "..")
 DEF_INTERNAL_FN (GOACC_UNLOCK, ECF_NOTHROW | ECF_LEAF, "..")
+
+/* REDUCTION_SETUP, REDUCTION_INIT, REDUCTION_FINI and REDUCTION_TEARDOWN
+   together define a generic interface to support gang, worker and vector
+   reductions. All of the functions take the following form
+
+     V = goacc_reduction_foo (REF_TO_RES, LOCAL_VAR, LEVEL, OP, LID, RID)
+
+   where REF_TO_RES is a reference to the original reduction variable for
+   that particular reduction, and LOCAL_VAR is the intermediate reduction
+   variable.  LEVEL corresponds to the GOMP_DIM of the reduction, and OP
+   is the tree code of the reduction operation.  LID is a unique identifier
+   of the loop within a TU, and RID is a unique id for a reduction within a
+   loop.  V is the resulting intermediate reduction variable returned by
+   the function.  In general, V should equal LOCAL_VAR.  */
+DEF_INTERNAL_FN (GOACC_REDUCTION_SETUP, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_INIT, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_FINI, ECF_NOTHROW, NULL)
+DEF_INTERNAL_FN (GOACC_REDUCTION_TEARDOWN, ECF_NOTHROW, NULL)
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index fcf037e..2049eea 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -14671,11 +14671,16 @@ execute_oacc_transform ()
   tree attrs = get_oacc_fn_attrib (current_function_decl);
   int dims[GOMP_DIM_MAX];
   tree purpose[GOMP_DIM_MAX];
+  bool needs_rescan;
   
   if (!attrs)
     /* Not an offloaded function.  */
     return 0;
 
+  /* Offloaded targets may introduce new basic blocks, which require
+     dominance information to update SSA.  */
+  calculate_dominance_info (CDI_DOMINATORS);
+
   {
     unsigned ix;
     tree pos = TREE_VALUE (attrs);
@@ -14725,59 +14730,74 @@ execute_oacc_transform ()
 	replace_oacc_fn_attrib (current_function_decl, pos);
       }
   }
-  
-  FOR_ALL_BB_FN (bb, cfun)
-    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
-      {
-	gimple stmt = gsi_stmt (gsi);
 
-	if (!is_gimple_call (stmt))
-	  ; /* Nothing.  */
-	else if (gimple_call_builtin_p (stmt, BUILT_IN_ACC_ON_DEVICE))
-	  /* acc_on_device must be evaluated at compile time for
-	     constant arguments.  */
-	  {
-	    gsi_next (&gsi);
-	    oacc_xform_on_device (stmt);
-	    continue;
-	  }
-	else if (gimple_call_internal_p (stmt))
+  do
+    {
+      needs_rescan = false;
+
+      FOR_ALL_BB_FN (bb, cfun)
+	for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
 	  {
-	    unsigned ifn_code = gimple_call_internal_fn (stmt);
-	    switch (ifn_code)
-	      {
-	      default: break;
+	    gimple stmt = gsi_stmt (gsi);
 
-	      case IFN_GOACC_DIM_POS:
-	      case IFN_GOACC_DIM_SIZE:
+	    if (!is_gimple_call (stmt))
+	      ; /* Nothing.  */
+	    else if (gimple_call_builtin_p (stmt, BUILT_IN_ACC_ON_DEVICE))
+	      /* acc_on_device must be evaluated at compile time for
+		 constant arguments.  */
+	      {
 		gsi_next (&gsi);
-		oacc_xform_dim (stmt, dims, ifn_code == IFN_GOACC_DIM_POS);
+		oacc_xform_on_device (stmt);
 		continue;
+	      }
+	    else if (gimple_call_internal_p (stmt))
+	      {
+		unsigned ifn_code = gimple_call_internal_fn (stmt);
+		int retval = 0;
+		switch (ifn_code)
+		  {
+		  default: break;
 
-	      case IFN_GOACC_LOCK:
-	      case IFN_GOACC_UNLOCK:
-		if (targetm.goacc.lock_unlock
-		    (stmt, dims, ifn_code == IFN_GOACC_LOCK))
-		  goto remove;
-		break;
+		  case IFN_GOACC_DIM_POS:
+		  case IFN_GOACC_DIM_SIZE:
+		    gsi_next (&gsi);
+		    oacc_xform_dim (stmt, dims, ifn_code == IFN_GOACC_DIM_POS);
+		    continue;
 
-	      case IFN_GOACC_FORK:
-	      case IFN_GOACC_JOIN:
-		if (targetm.goacc.fork_join
-		    (stmt, dims, ifn_code == IFN_GOACC_FORK))
-		  {
-		  remove:
-		    replace_uses_by (gimple_vdef (stmt),
-				     gimple_vuse (stmt));
-		    gsi_remove (&gsi, true);
-		    /* Removal will have advanced the iterator.  */
+		  case IFN_GOACC_LOCK:
+		  case IFN_GOACC_UNLOCK:
+		    if (targetm.goacc.lock_unlock
+			(stmt, dims, ifn_code == IFN_GOACC_LOCK))
+		      goto remove;
+		    break;
+
+		  case IFN_GOACC_REDUCTION_SETUP:
+		  case IFN_GOACC_REDUCTION_INIT:
+		  case IFN_GOACC_REDUCTION_FINI:
+		  case IFN_GOACC_REDUCTION_TEARDOWN:
+		    gsi_next (&gsi);
+		    if (targetm.goacc.reduction (stmt))
+		      needs_rescan = true;
 		    continue;
+
+		  case IFN_GOACC_FORK:
+		  case IFN_GOACC_JOIN:
+		    if (targetm.goacc.fork_join
+			(stmt, dims, ifn_code == IFN_GOACC_FORK))
+		      {
+		      remove:
+			replace_uses_by (gimple_vdef (stmt),
+					 gimple_vuse (stmt));
+			gsi_remove (&gsi, true);
+			/* Removal will have advanced the iterator.  */
+			continue;
+		      }
+		    break;
 		  }
-		break;
 	      }
+	    gsi_next (&gsi);
 	  }
-	gsi_next (&gsi);
-      }
+    } while (needs_rescan);
 
   return 0;
 }
@@ -14875,7 +14895,7 @@ const pass_data pass_data_oacc_transform =
   0 /* Possibly PROP_gimple_eomp.  */, /* properties_provided */
   0, /* properties_destroyed */
   0, /* todo_flags_start */
-  TODO_update_ssa, /* todo_flags_finish */
+  TODO_update_ssa | TODO_cleanup_cfg, /* todo_flags_finish */
 };
 
 class pass_oacc_transform : public gimple_opt_pass
@@ -14906,4 +14926,145 @@ make_pass_oacc_transform (gcc::context *ctxt)
   return new pass_oacc_transform (ctxt);
 }
 
+/* Default implementation of targetm.goacc.reduction_setup.  This hook
+   provides a baseline implementation for the internal function
+   GOACC_REDUCTION_SETUP for a single-threaded target.  I.e. num_gangs =
+   num_workers = vector_length = 1.
+
+   Given:
+
+     V = IFN_RED_SETUP (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   Expand to:
+
+     V = RES_PTR ? *RES_PTR : LOCAL;
+*/
+
+static bool
+default_goacc_reduction_setup (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree v = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 0);
+  tree local_var = gimple_call_arg (call, 1);
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (!integer_zerop (ref_to_res))
+    {
+      tree x = build_simple_mem_ref (ref_to_res);
+      gimplify_assign (v, x, &seq);
+    }
+  else
+    gimplify_assign (v, local_var, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+
+  return false;
+}
+
+/* Default implementation for both targetm.goacc.reduction_init and
+   reduction_fini.  This hook provides a baseline implementation for the
+   internal functions GOACC_REDUCTION_INIT and GOACC_REDUCTION_FINI for a
+   single-threaded target.
+
+   Given:
+
+     V = IFN_RED_INIT (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   or
+
+     V = IFN_RED_FINI (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   Expand to:
+
+     V = LOCAL;
+*/
+
+static bool
+default_goacc_reduction_init_fini (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree v = gimple_call_lhs (call);
+  tree local_var = gimple_call_arg (call, 1);
+  gimple g;
+
+  g = gimple_build_assign (v, local_var);
+  gsi_replace (&gsi, g, true);
+
+  return false;
+}
+
+/* Default implementation of targetm.goacc.reduction_teardown.  This hook
+   provides a baseline implementation for the internal function
+   GOACC_REDUCTION_TEARDOWN for a single-threaded target.
+
+   Given:
+
+     IFN_RED_TEARDOWN (RES_PTR, LOCAL, LEVEL, OP, LID, RID)
+
+   Expand to:
+
+     if (RES_PTR)
+       *RES_PTR = LOCAL;
+
+    V = LOCAL;
+*/
+
+static bool
+default_goacc_reduction_teardown (gimple call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  tree lhs = gimple_call_lhs (call);
+  tree ref_to_res = gimple_call_arg (call, 0);
+  tree var = gimple_call_arg (call, 1);
+  gimple_seq seq = NULL;
+
+  push_gimplify_context (true);
+
+  if (!integer_zerop (ref_to_res))
+    {
+      tree x = build_simple_mem_ref (ref_to_res);
+      gimplify_assign (x, var, &seq);
+    }
+
+  if (lhs != NULL_TREE)
+    gimplify_assign (lhs, var, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+
+  return false;
+}
+
+/* Default goacc.reduction early expander.  */
+
+bool
+default_goacc_reduction (gimple call)
+{
+  /* Reductions modify the SSA names in complicated ways.  Let update_ssa
+     correct it.  */
+  mark_virtual_operands_for_renaming (cfun);
+
+  switch (gimple_call_internal_fn (call))
+    {
+    case IFN_GOACC_REDUCTION_SETUP:
+      return default_goacc_reduction_setup (call);
+
+    case IFN_GOACC_REDUCTION_INIT:
+    case IFN_GOACC_REDUCTION_FINI:
+      return default_goacc_reduction_init_fini (call);
+
+    case IFN_GOACC_REDUCTION_TEARDOWN:
+      return default_goacc_reduction_teardown (call);
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
 #include "gt-omp-low.h"
diff --git a/gcc/target.def b/gcc/target.def
index fa5670a..550db6a 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1679,6 +1679,21 @@ default hook returns true, if there is no RTL expanders for them.",
 bool, (gimple, const int[], bool),
 default_goacc_lock_unlock)
 
+DEFHOOK
+(reduction,
+"This hook is used by the oacc_transform pass to expand calls to the\n\
+internal functions @var{GOACC_REDUCTION_SETUP},\n\
+@var{GOACC_REDUCTION_INIT},\n\
+ @var{GOACC_REDUCTION_FINI} and\n\
+ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\
+ @var{call} is gimple statement containing the call to the function.  This\n\
+ hook removes statement @var{call} after the expanded sequence has been\n\
+ inserted.  This hook is also responsible for allocating any storage for\n\
+ reductions when necessary.  It returns @var{true} if the expanded\n\
+sequence introduces any calls to OpenACC-specific internal functions.",
+bool, (gimple call),
+default_goacc_reduction)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 0e7f13d..ddde78d 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -107,6 +107,7 @@ extern unsigned default_add_stmt_cost (void *, int, enum vect_cost_for_stmt,
 extern void default_finish_cost (void *, unsigned *, unsigned *, unsigned *);
 extern void default_destroy_cost_data (void *);
 
+extern bool default_goacc_reduction (gimple);
 extern bool default_goacc_validate_dims (tree, int [], int);
 extern unsigned default_goacc_dim_limit (unsigned);
 extern bool default_goacc_fork_join (gimple, const int [], bool);


* Re: [gomp4] New reduction infrastructure for OpenACC
       [not found] ` <55D61EE2.4030402@acm.org>
@ 2015-08-20 20:21   ` Nathan Sidwell
  0 siblings, 0 replies; 4+ messages in thread
From: Nathan Sidwell @ 2015-08-20 20:21 UTC (permalink / raw)
  To: Cesar Philippidis, gcc-patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 123 bytes --]

Sigh, pdf's get filtered.  Let's try some raw tex ...

Here's the design document for the reduction implementation

nathan

[-- Attachment #2: Reductions.tex --]
[-- Type: text/x-tex, Size: 32347 bytes --]

\documentclass[a4paper]{article}

\newcommand{\brk}{\linebreak[0]}
\newcommand{\codespc}{\pagebreak[2]\\[1ex plus2pt minus1pt]}
\newcommand{\codebrk}{\pagebreak[2]\\}
\newcommand{\define}[1]{{\bf #1}}
\begingroup \catcode`\~ = 12 \global\def\twiddle{~} \endgroup
\newenvironment{codeblock}%
{\begin{quote}\tt\begin{tabbing}%
\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=%
\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=%
\kill}%
{\end{tabbing}\end{quote}}

\begin{document}

\title{OpenACC Reductions}
\author{Nathan Sidwell}
\date{2015-08-13}

\maketitle

\begin{abstract}
This document describes the design of the OpenACC reduction
implementation in GCC. It is intended to be sufficiently general in the
early part of the compiler, becoming more target-specific once we have
entered the target-specific device compiler.
\end{abstract}

\section{Changes}

\begin{itemize}
\item[2015-07-28] Initial version
\item[2015-07-30] Change internal builtins to not take addresses.
\item[2015-08-03] Note about templated builtin expansion.
\item[2015-08-07] Discuss reductions at outer parallel construct
\item[2015-08-13] Note reductions at outer parallel consume inner gang
reduction. Comment on memory barriers.
\end{itemize}

\section{General Features}
We cannot emit anything that depends on a device feature before
we've entered the device compiler.   This means that
anything happening in gimplify or omp-low has to be generic.  It has
to be sufficiently generic to permit other architectures to implement
reductions.  Thus, anything emitted here, beyond simply noting the
gang/worker/vector level of the execution environment cannot know
anything about gang/worker or vector beyond what the abstract
specification describes.

\subsection{Compiler Passes}
The following passes are relevant to OpenACC compilation:

\begin{enumerate}
\item[Gimplify] This is where variables used in a parallel region are
noted, and the required transformation determined -- for instance
copy/private/firstprivate etc.  These are all made explicit by
augmenting the parallel clause itself.

\item[Omp-lower] This is where a parallel  region is broken out into
a separate function, variables are rewritten according to copy, private
or whatever. The structure describing where copied variables are
located is created for both host and target sides.

\item[Omp-expand] This is where loops are converted from serial form
into a form to be executed in parallel on multiple threads. The
multiple threads are implicit -- abstract functions provide the number
of threads and the thread number of the current thread.

\item[LTO write/read] The offloaded functions are written out to disc
and read back into the device-specific compiler.  The host-side
compiler continues with its regular compilation. Following this point
there are essentially two threads of compilation -- one for host and
one for device.

\item[Oacc-xform] This new pass is responsible for lowering the
abstractions created in omp-lower and/or omp-expand into
device-specific code sequences. Such sequences should be regular gimple
as much as possible, but may include device-specific builtins. The
expansions are done via target-specific hooks and default fallbacks.
This step is also responsible for checking launch dimensions.

\item[Expand-RTL] This pass expands any remaining internal functions and
device-specific builtins into device-specific RTL.

\item[Mach-dep-reorg] This pass may perform device specific
reorganization.

\end{enumerate}

\subsection[Fork/Join]{Fork/Join}
Omp-expand emits fork and join builtins around gang, worker and vector
partitioned regions.

In oacc-xform these are deleted by targets that do not need them.  For
PTX the hook deletes gang-level forks and joins, as they are irrelevant
to the PTX execution model.

They are expanded to PTX pseudo instructions at RTL expansion time, and
those are propagated all the way through to the PTX reorg pass where
they get turned into actual code.  In the PTX case they are not real
forks and joins.  In the non-forked region we neuter all but one
thread at the appropriate level -- they're emulated
forks and joins for us.

The reduction machinery will hang off these markers, by inserting
additional builtins before and after them.  These builtins will be
expanded by device-specific code in the oacc-xform pass.  Whether the
expansion of those builtins generates device-specific builtins is a
property of the particular backend.  In the case of PTX, what each
builtin expands to depends on the level of the reduction.

\subsection{Reduction objects}
The object used for a reduction has two separate lifetimes:

\begin{itemize}

\item The lifetime before (and after) the reduction loop.  We refer to
that as the RESULT object.

\item The lifetime within the reduction loop. We refer to this as the
LOCAL object.

\end{itemize}

There is one instance of the former, but N instances of the latter --
one per executing thread.

The RESULT object might be a copy object received from the host, or it
might be a local private object within an outer loop. Some observations
can be made:

\begin{itemize}

\item The LOCAL object can be a private object. Indeed, this is desired
so that each thread has its own local instance (hence the name).

\item Because reductions over gang-level loops are only specified to
produce a final result once the enclosing parallel is completed, the
RESULT object must be a copy object. That is the only way for a result
to be returned to the program. If it is not, the behavior is undefined
and the programmer is confused.

\item The RESULT object will most often be a private object for worker
and vector level loops. However, it could be a copy object, most likely
when a single gang is executing, though it could also be one within a
gang loop.

\item A reduction over one loop may be contained as a reduction within
an outer loop. The inner reduction's RESULT object is
the outer loop's LOCAL object.

\end{itemize}

We can take advantage of these latter two observations by surrounding
worker and vector reductions to copy objects with a dummy gang-level
reduction of the same variable. This means that worker and vector level
reductions only have to deal with the case where RESULT is a private
variable. Gang-level reductions will have to deal with the case of
RESULT being a copy object or a private object.  However, as we shall
see, the latter case turns out to be a NOP.

\subsubsection{Reductions on Parallel}
OpenACC permits a reduction to be specified on a parallel construct.
Such a reduction is operated on in the redundant execution mode, outside
of any gang-level loops. While strange, there is no reason for this not
to work.  We can support it by creating a null gang loop at the
outermost level and eliding any reduction machinery on an inner gang
loop that might concern the same variable.

\subsection{Atomics}
Atomic operations may not be available for all required data types and
operators. Indeed, this is true for PTX, so we must address it. This
document treats the lack of atomics as the general case and then treats
the presence of an atomic operation as an optimization. Indeed, at the
omp-expand point we do not know what the target properties are, so the
general machinery does not rely on specific atomics being available.

\subsection{Variable Rewriting}
The OMP machinery has the capability of rewriting a variable used
within an offload region.  This is necessary for copy objects, so
that, say `a' is turned into something like
`*omp\_struct->copied\_a\_addr'.  We shall see that for reductions of
copy objects we will need to inhibit that rewriting.

We do not have to create additional rewriting.

The copy object rewriting must be inhibited within the loops in which
the reduction is active. Outside of such loops, the copy rewriting must
occur.  For instance:

\begin{codeblock}%
\#pragma acc parallel copy (a) num\_workers(5)\\*
\{\\*
\>a = 5; // must be rewritten\\*
\#pragma acc loop worker reduction(+:a)\\*
\>for (ix {\dots} ) \{\\*
\>\>a += ix; // must not be rewritten\\*
\>\}\\*
\>if (a) {\dots} // must be rewritten\\*
\}\end{codeblock}

Fortunately each basic block is mapped to the region in which it occurs,
thus the rewriting code can determine whether it is contained in a
region that has an active reduction on any particular variable.

\subsubsection{Fortran Pass-by-Reference Parameters}
Fortran subroutines pass scalars by implicit reference.  If such a
parameter is used as a reduction variable, the referenced scalar needs
privatizing, so that different threads have their own instance.  This
is actually an instance of a different problem, and punting on it for
initial implementation of reductions is acceptable.  This is ok, as it
only affects subroutines that are called via use of the routine
directive, which is itself deferred.

\section{Early Builtins}
We shall define 4 new internal builtins to deal with reductions in an
abstract manner. We shall also define a lock/unlock pair, to define an
atomic region.

The four builtins correspond to the four locations:

\begin{itemize}
\item before the partitioned region
\item start of the partitioned region
\item end of the partitioned region
\item after the partitioned region.
\end{itemize}
All these functions take the same six arguments:

\begin{itemize}
\item[RES\_PTR] Address of copy RESULT object. If the result is to a
private var, this is NULL.
\item[LOCAL] The LOCAL object -- the private variable used to hold the
individual reductions.
\item[LEVEL] Loop level (a constant)
\item[OP] Reduction operator (a constant)
\item[LID] Loop identifier (a constant)
\item[RID] Reduction identifier (a constant)
\end{itemize}
For worker and vector level reductions, the RES\_PTR argument will be
NULL.  For gang-level loops, RES\_PTR will be the rewritten reference to
the copied object (or NULL if there is none).

These functions all return a new LOCAL object value -- thus they all
appear as part of an assignment.
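
As an illustrative sketch (argument names abbreviated; this mirrors the
lowering example in the accompanying patch mail), the four calls bracket
a partitioned region like so:

\begin{codeblock}%
local = IFN\_RED\_SETUP (res\_ptr, local, LEVEL, OP, LID, RID);\\*
FORK (LEVEL);\\*
local = IFN\_RED\_INIT (res\_ptr, local, LEVEL, OP, LID, RID);\\*
\>{\dots} partitioned loop {\dots}\\*
local = IFN\_RED\_FINI (res\_ptr, local, LEVEL, OP, LID, RID);\\*
JOIN (LEVEL);\\*
local = IFN\_RED\_TEARDOWN (res\_ptr, local, LEVEL, OP, LID, RID)
\end{codeblock}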

\subsection[Loop \& Reduction Identifiers]{Loop \& Reduction
Identifiers}
We need to uniquely identify the set of functions referring to the same
reduction.  This is needed so the later expanders (at oacc-xform,
expand-rtl and perhaps mach-dep-reorg) can determine the members of the
same set. Those expanders might encounter these builtins in an
arbitrary order -- there's no guarantee the
basic blocks will be iterated over in program order (which itself can
be an ill-defined concept). Note that the same variable may participate
in multiple reductions (sequentially), so neither RESULT nor LOCAL
guarantees uniqueness. We could have simply used a single counter to
uniquely identify all the reductions in the same function, but it
turned out that knowing the set of different reductions for a
particular loop is useful (an example is shown later).

Thus we have the loop identifier and reduction identifier arguments.
Both of these will be integer constants, allocated from local counters.
The loop identifier counter has a different value for each loop (with
or without reductions, whichever is simplest).  The reduction
identifier increments for each reduction within the same loop.  The
integer values themselves have no meaning (so, for instance, the
reduction identifier could be unique over all reductions of the
function).

As we shall see, the same loop identifier number may be used at
different levels. Thus the $\langle$LID, LEVEL$\rangle$ tuple must be
used to uniquely identify a specific loop \& level. This will fail if a
routine is inlined, because both the caller and the inlined callee may
use the same loop id.  It would be insufficient to simply use unique
loop ids throughout the compilation, as a function might be inlined
multiple times into the same function. For the moment we can simply
inhibit inlining functions containing reductions.

\subsection{Reduction Internal Builtin Functions}
We define internal builtins for the reductions, as these are
specifically designed for functions intended to be expanded later in
the compilation and not related to any fallback library call.

\begin{codeblock}%
T IFN\_RED\_SETUP (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

This is placed before the partitioned region. It is used to implement
any single-mode setup that is needed, typically initializing temporary
storage.

\begin{codeblock}%
T IFN\_RED\_TEARDOWN (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

This is placed after the partitioned region. It is used to implement
any single-mode teardown that is needed, typically updating the RESULT
object.

\begin{codeblock}%
T IFN\_RED\_INIT (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

This is placed at the start of the partitioned region, before the loop
is entered. It is used to implement any partitioned-mode
initialization. Typically initializing the LOCAL object in each thread.

\begin{codeblock}%
T IFN\_RED\_FINI (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

This is placed at the end of the partitioned region, after the loop is
exited. It is used to implement any partitioned-mode finalization.
Typically combining the per-thread LOCAL objects into a single value.

The two lock builtins are:

\begin{codeblock}%
void IFN\_RED\_LOCK (LEVEL, LID)\\*
void IFN\_RED\_UNLOCK (LEVEL, LID)
\end{codeblock}

These are placed around all the IFN\_RED\_FINI calls that happen at the
end of a loop containing reductions. If it is too tricky to place, they
can be placed around each individual IFN\_RED\_FINI call at the expense
of performance.

The IFN\_RED\_FINI function must be marked unique, so it is neither
cloned nor subject to tail commoning. (It is possible all these
functions should be so marked in the general case.) The FORK and JOIN functions
already have this property.

\subsection{Examples}
Here are some examples of the code after omp-expand has inserted the
above functions:

\begin{codeblock}%
// gang level loop with copy object\\*
\#parallel copy(a) gang\\*
\{\\*
\>\#loop gang reduction(+:a)\\*
\>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\*
\>IFN\_FORK (GANG)\\*
\>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\*
\>for (\dots) \{ \dots \}\\*
\>IFN\_LOCK (GANG, 0)\\*
\>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\*
\>IFN\_UNLOCK (GANG, 0)\\*
\>IFN\_JOIN (GANG)\\*
\>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\*
\}\end{codeblock}

\begin{codeblock}%
// worker level loop with private object\\*
\#parallel worker\\*
\{\\*
\>int a = 0;\\*
\>\#loop worker reduction(+:a)\\*
\>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_FORK (WORKER)\\*
\>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\*
\>for ({\dots}) \{ {\dots} \}\\*
\>IFN\_LOCK (WORKER, 0)\\*
\>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_UNLOCK (WORKER, 0)\\*
\>IFN\_JOIN (WORKER)\\*
\>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\*
\}\end{codeblock}

\begin{codeblock}%
// worker level loop with copy object\\*
\#parallel worker copy (a)\\*
\{\\*
\>// Insert dummy gang reduction at start.\\*
\>// Note this uses the same RID \& LID as the inner worker loop.\\*
\>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\*
\>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\*
\>\#loop worker reduction(+:a)\\*
\>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_FORK (WORKER)\\*
\>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\*
\>for ({\dots}) \{ {\dots} \}\\*
\>IFN\_LOCK (WORKER, 0)\\*
\>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_UNLOCK (WORKER, 0)\\*
\>IFN\_JOIN (WORKER)\\*
\>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\*
\>// Dummy gang reduction at end\\*
\>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\*
\>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\*
\}\end{codeblock}

Note that if the above worker reduction were inside a gang loop, the
inserted dummy gang reduction could be placed inside or outside the
gang-level loop. Placing it outside the loop would be more efficient, as
there would then only be one execution of the gang reduction per gang,
rather than one execution per iteration. This can of course be
implemented as a later optimization.

The vector cases are exactly like the worker cases.

\begin{codeblock}%
// combined worker \& vector level loop with private object\\*
\#parallel worker vector\\*
\{\\*
\>int a = 0;\\*
\>\#loop worker vector reduction(+:a)\\*
\>// Insert worker setup and init around worker fork\\*
\>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_FORK (WORKER)\\*
\>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\*
\>// Insert vector setup and init around vector fork\\*
\>a = IFN\_SETUP (NULL, a, VECTOR, +, 0, 0)\\*
\>IFN\_FORK (VECTOR)\\*
\>a = IFN\_INIT (NULL, a, VECTOR, +, 0, 0)\\*
\>for ({\dots}) \{ {\dots} \}\\*
\>// Vector fini \& teardown around vector join\\*
\>IFN\_LOCK (VECTOR, 0)\\*
\>a = IFN\_FINI (NULL, a, VECTOR, +, 0, 0)\\*
\>IFN\_UNLOCK (VECTOR, 0)\\*
\>IFN\_JOIN (VECTOR)\\*
\>// Worker fini \& teardown around worker join\\*
\>IFN\_LOCK (WORKER, 0)\\*
\>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\*
\>IFN\_UNLOCK (WORKER, 0)\\*
\>IFN\_JOIN (WORKER)\\*
\>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\*
\}\end{codeblock}

\begin{codeblock}%
// reduction on parallel (with copy object)\\*
\#parallel copy(a) gang reduction (+:a)\\*
\{\\*
\>// NULL outermost `loop'\\*
\>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\*
\>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\*
\>{\dots} gang redundant execution\\*
\>IFN\_LOCK (GANG, 0)\\*
\>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\*
\>IFN\_UNLOCK (GANG, 0)\\*
\>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\*
\}\end{codeblock}

\begin{codeblock}%
// reduction on parallel with inner gang loop\\*
\#parallel copy(a) gang reduction (+:a)\\*
\{\\*
\>// NULL outermost `loop'\\*
\>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\*
\>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\*
\>{\dots} gang redundant execution\\*
\>\#loop gang reduction(+:a)\\*
\>// no insertion of IFN\_SETUP or IFN\_INIT before loop\\*
\>for ({\dots}) \{ {\dots} \}\\*
\>// no insertion of IFN\_FINI or IFN\_TEARDOWN after loop.\\*
\>{\dots} more gang redundant execution\\*
\>IFN\_LOCK (GANG, 0)\\*
\>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\*
\>IFN\_UNLOCK (GANG, 0)\\*
\>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\*
\}\end{codeblock}

\section{OACC XFORM}
The OACC-xform pass runs early in the device compiler, after LTO
readback.  This is where the above internal functions are expanded to a
combination of regular code and device-specific builtins.
We need to provide two implementations:

\begin{itemize}

\item A default implementation for host-expansion

\item A PTX implementation.

\end{itemize}

These transformations should be done early in the XFORM pass, so that
other optimizations it performs can be applied to the sequences emitted
here.

As is typical when describing compiler code transformations, one needs
to show both expanded code sequences and internal compiler helper
routines. The following expansions try to make it clear when a piece
of code is part of the compiler, producing a result that is fed into
the expansion.  For instance:

\begin{codeblock}%
\{tmp = const\_int (5);\}\\*
LOCAL = \{tmp\}
\end{codeblock}

Here the first line is part of the compiler, generating a tmp value, and
the second line is part of the emitted code sequence, which uses that
tmp value.

I also use a template notation, using T to describe the type of the
reduction object.

\subsection{Host Implementation}
The host fallback (\& default implementation) executes the loops as a
single thread. This makes reductions trivial.

\subsubsection{Setup}
\begin{codeblock}%
V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = RES\_PTR ? *RES\_PTR : LOCAL;
\end{codeblock}

\subsubsection{Init}
\begin{codeblock}%
V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = LOCAL;
\end{codeblock}

\subsubsection{Fini}
\begin{codeblock}%
V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = LOCAL;
\end{codeblock}

\subsubsection{Teardown}
\begin{codeblock}%
V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
if (RES\_PTR) *RES\_PTR = LOCAL;\\*
V = LOCAL;
\end{codeblock}

\subsubsection{Lock \& Unlock}
The lock and unlock builtins expand to nothing.
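Putting the four host expansions together, the following C sketch shows what a gang \texttt{+} reduction collapses to on the host. This is illustrative only: the function name, the loop body, and the bounds are assumptions, not GCC output.

```c
/* Illustrative rendering of the fully expanded host fallback for a
   gang (+:a) reduction.  res_ptr plays the role of RES_PTR (NULL for
   a private reduction); on the host everything runs single-threaded,
   so the four expansions collapse to simple assignments.  */
static int
host_reduction (int *res_ptr)
{
  int a = 0;                       /* LOCAL */
  a = res_ptr ? *res_ptr : a;      /* IFN_RED_SETUP */
  /* IFN_RED_INIT and IFN_RED_FINI are both the identity V = LOCAL.  */
  for (int i = 1; i <= 10; i++)    /* the partitioned loop body */
    a += i;
  if (res_ptr)                     /* IFN_RED_TEARDOWN */
    *res_ptr = a;
  return a;
}
```

With a copy object initialized to 5, the loop adds 1..10 and the teardown stores 60 back; with a NULL RES\_PTR the setup and teardown disappear entirely.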

\subsection{PTX Implementation}
The expansion for PTX differs for the different loop levels. The
IFN\_RED\_FINI function is the most complex, and the worker level is the
most involved. We introduce a number of PTX-specific builtins, which
are documented later.

\subsubsection{Vector Level}
At the vector level, RES\_PTR is known to be NULL.

\paragraph{Setup}
\begin{codeblock}%
V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = LOCAL;
\end{codeblock}

\paragraph{Init}
\begin{codeblock}%
V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = IFN\_OACC\_DIM\_POS (VECTOR) ? \{init\_val<T> (OP)\} : LOCAL
\end{codeblock}

This sets the initialization value for all but vector zero, which simply
copies the incoming LOCAL value.

\paragraph{Fini}
\begin{codeblock}%
V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to a binary tree of shuffles:

\begin{codeblock}%
for (ix = IFN\_OACC\_DIM\_SIZE (VECTOR) / 2; ix >= 1; ix >>= 1) \{\\*
 \ T tmp = ptx\_shuffle\_down<T> (LOCAL, ix);\\*
 \ LOCAL = OP (LOCAL, tmp);\\*
\}\\*
V = LOCAL
\end{codeblock}

This relies on unified execution of the vector engines (and
IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE returning a power of 2).
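As a sanity check of the tree, here is a host-side C simulation of the shuffle-based FINI expansion for a \texttt{+} reduction over a hypothetical 8-lane vector. All names are illustrative stand-ins for the builtins above, not GCC identifiers.

```c
/* Host-side simulation of the shuffle-down reduction tree emitted by
   the vector FINI expansion.  VLEN stands in for
   IFN_OACC_DIM_SIZE (VECTOR) and must be a power of 2.  */
#define VLEN 8

/* ptx_shuffle_down returns the value held SHIFT lanes above; lanes
   that would read out of range see their own value, which is harmless
   because only lane 0's final value is consumed.  */
static int
shuffle_down (const int lanes[VLEN], int pos, unsigned shift)
{
  return lanes[pos + shift < VLEN ? pos + shift : pos];
}

/* Reduce lanes holding 1..VLEN with the halving-shift loop from the
   expansion above; lane 0 ends up with the full sum.  */
static int
vector_reduce_sum (void)
{
  int lanes[VLEN], next[VLEN];
  for (int i = 0; i < VLEN; i++)
    lanes[i] = i + 1;                 /* 1 + 2 + ... + 8 = 36 */

  for (unsigned shift = VLEN / 2; shift >= 1; shift /= 2)
    {
      /* All lanes shuffle in lock-step ("unified execution"), so
         compute each round into next[] before committing it.  */
      for (int pos = 0; pos < VLEN; pos++)
        next[pos] = lanes[pos] + shuffle_down (lanes, pos, shift);
      for (int pos = 0; pos < VLEN; pos++)
        lanes[pos] = next[pos];
    }
  return lanes[0];
}
```

After rounds with shifts 4, 2 and 1, lane 0 holds the sum of all eight lanes, mirroring how log2(VLEN) shuffle steps replace a serial accumulation.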

\paragraph{Teardown}
\begin{codeblock}%
V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = LOCAL;
\end{codeblock}

\paragraph{Lock \& Unlock}
The lock and unlock functions expand to nothing at the vector level.

\subsubsection{Gang Level}
At the gang level, RES\_PTR may or may not be NULL. This is known at
compile time -- there is no need for runtime checking.

Memory barriers are only needed within the locked region, and will be
inserted by the lock expansions themselves.  We do not need additional
barriers at the setup and teardown phases, because accessing the
reduction object within the parallel region (outside of the reduction
scope itself) is undefined behaviour.

\paragraph{Setup}
\begin{codeblock}%
V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
\{if RES\_PTR not NULL\} V = LOCAL\\*
\{else\} Nothing
\end{codeblock}

\paragraph{Init}
\begin{codeblock}%
V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = \{init\_val<T> (OP)\}
\end{codeblock}

\paragraph{Fini}
\begin{codeblock}%
V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to a simple update of the *RES\_PTR object:

\begin{codeblock}%
V = OP (*RES\_PTR, LOCAL);\\*
*RES\_PTR = V;
\end{codeblock}

The thread-safety of this relies on the lock expansions below.

\paragraph{Teardown}
\begin{codeblock}%
V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
\{if RES\_PTR not NULL\} V = LOCAL\\*
\{else\} Nothing
\end{codeblock}

\paragraph{Gang Locking}
The lock and unlock functions use a lock in global memory. We can
allocate a single lock for the whole program.

The expansions are:

\begin{codeblock}%
ptx\_lock (GANG, LID)
\end{codeblock}

\begin{codeblock}%
ptx\_unlock (GANG, LID)
\end{codeblock}

respectively.

\subsubsection{Worker Level}
The worker level is the most complicated because we need to create temporary
storage in .shared memory to hold accumulation values.  This relies
heavily on PTX-specific builtins and helper routines.

The approach is to allocate a buffer in .shared memory to hold the set
of worker reductions for a worker loop.  This buffer must be live from
the first worker setup call of the loop to the last worker teardown
call. This buffer must be distinct from the worker spill buffer used to
implement worker state propagation at entry to worker partitioned mode,
because both need to be concurrently live.

However, we do not have to have separate buffers for worker reductions
on different loops, as those are never executed concurrently (on a
single CTA). This means we can allocate the buffer in the same way as
the worker spill buffer: a single buffer can be reused for all worker
loop reductions in the compilation.  We just have to make sure it is
the largest size needed.

To determine the size needed, we need to know the set of reductions for
each worker loop.  This is where the LID argument to the builtins
comes in -- that will be the same for each reduction in the same worker
loop. Actually allocating the buffer can be deferred to the RTL
expansion stage, which gives the optimizers an opportunity to delete
unreachable code, possibly changing the set of reductions needed.

Thus expansion of worker level reduction functions will propagate the
LID and RID arguments to PTX-specific builtins, which will themselves
perform the .shared buffer allocation.

As with vector-level reductions, RES\_PTR is known to be NULL at compile
time.

Memory barriers are needed within the locked region used by the
finalization.  These are inserted by the lock and unlock builtins
themselves. We do not need explicit memory barriers after the setup
store or before the teardown read because they are inserted by the fork
and join expansions themselves.

\paragraph{Setup}
\begin{codeblock}%
V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
*ptx\_work\_red\_addr<T> (LID, RID) = LOCAL;
\end{codeblock}

This will eventually expand to a write of the incoming reduction value
into the .shared memory buffer.

\paragraph{Init}
\begin{codeblock}%
V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = \{init\_val<T> (OP)\}
\end{codeblock}

This initializes each worker's instance of the reduction variable.

\paragraph{Fini}
\begin{codeblock}%
V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
T tmp = *ptx\_work\_red\_addr<T> (LID, RID);\\*
tmp = OP (tmp, LOCAL);\\*
*ptx\_work\_red\_addr<T> (LID, RID) = tmp;
\end{codeblock}

The thread-safety of this relies on the lock expansions below.

\paragraph{Teardown}
\begin{codeblock}%
V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID)
\end{codeblock}

expands to:

\begin{codeblock}%
V = *ptx\_work\_red\_addr<T> (LID, RID);
\end{codeblock}

\paragraph{Worker Locking}
The lock and unlock functions use a lock in .shared memory. We can
allocate a single lock for the whole program. The expansions are:

\begin{codeblock}%
ptx\_lock (WORKER, LID)
\end{codeblock}

\begin{codeblock}%
ptx\_unlock (WORKER, LID)
\end{codeblock}

respectively.

\subsubsection{OpenACC Builtins}

The above used two existing OpenACC builtins,
IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE and
IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}POS.  Note that the DIM\_SIZE
builtin is often expanded to a constant within the OACC\_XFORM pass
itself. Thus we must ensure that the transform is run over the
expansions done here.

\subsubsection{PTX Builtins}
The following PTX builtins were used above:

\begin{codeblock}%
T ptx\_shuffle\_down (T \&obj, unsigned SHIFT)
\end{codeblock}

This function does a shuffle down, returning the SHIFT'th next vector
engine's value of `obj'.

\begin{codeblock}%
T *ptx\_work\_red\_addr <T> (unsigned LID, unsigned RID)
\end{codeblock}

This function returns the address of the reduction accumulation
variable for loop LID, reduction RID. When this builtin is expanded at
RTL generation time, the PTX backend will need to maintain a
per-function table of mappings from <LID, RID> to address.  Entries
will be inserted as encountered, and per-LID counters of the current
allocation point need to be maintained.  The high water mark across
all LIDs needs to be stashed in a global variable examined at the end
of compilation, in the same manner as the worker\_spill buffer.
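That bookkeeping can be sketched as a flat offset table in C. The fixed bounds, the flat arrays, and the function name are illustrative simplifications, not the actual GCC data structures.

```c
#include <stddef.h>

/* Sketch of the bookkeeping ptx_work_red_addr implies: a per-function
   map from <LID, RID> to an offset in one reusable .shared buffer,
   with per-LID allocation counters and a high-water mark across all
   LIDs.  */
#define MAX_LOOPS 8
#define MAX_REDS  8

static int allocated[MAX_LOOPS][MAX_REDS];   /* entry seen yet?  */
static size_t offsets[MAX_LOOPS][MAX_REDS];
static size_t next_off[MAX_LOOPS];  /* per-LID current allocation point */
static size_t high_water;           /* buffer size to emit at the end */

static size_t
work_red_offset (unsigned lid, unsigned rid, size_t size)
{
  if (!allocated[lid][rid])
    {
      /* Insert the entry on first encounter, bumping this loop's
         allocation point and the global high-water mark.  */
      allocated[lid][rid] = 1;
      offsets[lid][rid] = next_off[lid];
      next_off[lid] += size;
      if (next_off[lid] > high_water)
        high_water = next_off[lid];
    }
  return offsets[lid][rid];
}
```

For example, a loop with two 4-byte reductions uses offsets 0 and 4, while a later loop's single 8-byte reduction reuses offset 0; the emitted buffer need only be the maximum over all loops, here 8 bytes.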

\begin{codeblock}%
void ptx\_lock (unsigned LEVEL, unsigned LID)\\*
void ptx\_unlock (unsigned LEVEL, unsigned LID)
\end{codeblock}

These two functions implement a spin-lock and associated unlock at a
particular level. The only two levels needed are gang and worker level
-- vector reductions do not require locking.  The only difference
between the two is the location of the lock variable. These expand to an
unspec, whose presence will be noted during PTX reorg. That will cause
the applicable .global or .shared lock variable to be emitted at the
end of compilation.

They also insert memory barriers before and after the protected region,
to ensure external state changes are properly observed within the region,
and changes made in the region are properly observed outside the
region.

The lock itself will look like:

\begin{codeblock}%
1:\>\>atom.\$space.cas.u32 \$res,[\$lock\_var],0,1\\*
\>\>set.ne.u32 \$pred,\$res,0\\*
\>\>@\$pred bra.uni 1b\\*
\>\>membar.\$space
\end{codeblock}

This returns the original value of \$lock\_var; thus, when the lock is
obtained, \$res will be zero.  When the lock is already taken, \$res
will be 1 and the predicated branch retries, forming a simple spin loop.

The unlock will be:

\begin{codeblock}%
\>\>membar.\$space\\*
\>\>atom.\$space.cas.u32 \$res,[\$lock\_var],1,0
\end{codeblock}
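For illustration, the same lock protocol can be sketched with C11 atomics, with thread fences standing in for membar.\$space. This is an analogue of the PTX sequences, not the expansion itself, and the names are hypothetical.

```c
#include <stdatomic.h>

/* C11 analogue of the PTX spin-lock: compare-and-swap a 0/1 lock word
   until the old value was 0.  One such word per level (gang or
   worker) suffices, differing only in where it lives (.global vs
   .shared).  */
static atomic_uint lock_word;        /* 0 = free, 1 = taken */

static void
ptx_style_lock (atomic_uint *lock)
{
  unsigned expected;
  do
    expected = 0;                    /* cas succeeds only if *lock was 0 */
  while (!atomic_compare_exchange_weak (lock, &expected, 1u));
  atomic_thread_fence (memory_order_acquire);  /* membar after acquiring */
}

static void
ptx_style_unlock (atomic_uint *lock)
{
  atomic_thread_fence (memory_order_release);  /* membar before release */
  atomic_store (lock, 0u);                     /* the cas 1 -> 0 above */
}
```

The fence placement mirrors the PTX: a barrier after taking the lock so the region sees prior external stores, and one before releasing it so the region's stores are visible outside.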

\begin{codeblock}%
void ptx\_mem\_bar (unsigned LEVEL)
\end{codeblock}

This inserts a memory barrier at the appropriate level.

\subsubsection{Templated Builtins}
The high-level reduction machinery must handle boolean, integer
(including character), float and complex types. Rather than providing
each PTX builtin for the full set of types, only a fundamental subset
shall be provided:

\begin{itemize}
\item SImode (uint)
\item DImode (ulong)
\item SFmode (float)
\item DFmode (double)
\end{itemize}
The OACC\_XFORM pass must deal with composing these to process other
types.

\begin{itemize}
\item For smaller integer types, promote them to SImode.
\item For complex types, decompose into two instances of the underlying
type.
\end{itemize}
Note that in the latter case, the underlying ptx\_work\_red\_addr call,
if used, must have distinguishing RID values for the different parts of
the object. This could be achieved by left-shifting all incoming RID
values and using bit zero as the distinguisher (all such compound types
have only two underlying components).

\subsubsection{Atomics Optimization}
The existence of atomic operations can be dealt with later.  They are
an optimization.  The PTX backend provides a target hook that can be
queried by the OACC\_XFORM pass when it is expanding the RED\_FINI
function.  Before emitting the non-atomic code for worker or gang
level, the hook should be queried to determine if an atomic operation
exists for the operation and type required.  If it does, the
appropriate builtin should be emitted rather than the non-atomic
sequence.

There is a wrinkle, though.  If atomic operations are used, barriers
need to be inserted before and after the atomic operations, as they are
asynchronous with respect to normal reads and writes. It would be
inefficient to insert barriers around each atomic operation if there
are several reductions at the same level (although that is a rare
case).

We can use the LID identifier to record whether atomic operations are
used at any particular loop finalization and augment the ptx reorg pass
to insert barriers at the lock and unlock instructions where they are
on such loops. We can also record whether all reductions for a
particular loop use atomics and elide the locking in that case.

More strictly it would be sufficient to place barriers after the setup
write and before the teardown read.  This could be achieved by
defining additional PTX builtins and emitting them during OACC\_XFORM,
expanding to RTL and then processing during PTX reorg.

All told, the whole use of atomics may be excessive and not worth the
effort.
\end{document}


* Re: [gomp4] New reduction infrastructure for OpenACC
  2015-08-19 19:43 [gomp4] New reduction infrastructure for OpenACC Cesar Philippidis
       [not found] ` <55D61EE2.4030402@acm.org>
@ 2015-09-01 12:53 ` Tom de Vries
  2015-09-01 16:23   ` Tom de Vries
  1 sibling, 1 reply; 4+ messages in thread
From: Tom de Vries @ 2015-09-01 12:53 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 999 bytes --]

On 19/08/15 21:42, Cesar Philippidis wrote:
> +DEFHOOK
> +(reduction,
> +"This hook is used by the oacc_transform pass to expand calls to the\n\
> +internal functions @var{GOACC_REDUCTION_SETUP},\n\
> +@var{GOACC_REDUCTION_INIT},\n\ @var{GOACC_REDUCTION_FINI} and\n\ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\ @var{call} is gimple statement containing the call to the function.  This\n\ hook removes statement @var{call} after the expanded sequence has been\n\ inserted.  This hook is also responsible for allocating any storage for\n\ reductions when necessary.  It returns @var{true} if the expanded\n\
> +sequence introduces any calls to OpenACC-specific internal functions.",
> +bool, (gimple call),
> +default_goacc_reduction)
> +

This causes a gomp-4_0-branch bootstrap error:
...
src/gcc/target.def:1685:1: error: unknown escape sequence: '\040' [-Werror]
...

Currently testing attached patch, which allows 'make build/genhooks.o' 
to succeed.

Thanks,
- Tom

[-- Attachment #2: fix-target-def-reduction.patch --]
[-- Type: text/x-patch, Size: 1268 bytes --]

--- gcc/target.def	2015-09-01 01:45:41.851372425 -0700
+++ gcc/target.def	2015-09-01 05:41:48.280961137 -0700
@@ -1684,7 +1684,13 @@
 (reduction,
 "This hook is used by the oacc_transform pass to expand calls to the\n\
 internal functions @var{GOACC_REDUCTION_SETUP},\n\
-@var{GOACC_REDUCTION_INIT},\n\ @var{GOACC_REDUCTION_FINI} and\n\ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\ @var{call} is gimple statement containing the call to the function.  This\n\ hook removes statement @var{call} after the expanded sequence has been\n\ inserted.  This hook is also responsible for allocating any storage for\n\ reductions when necessary.  It returns @var{true} if the expanded\n\
+@var{GOACC_REDUCTION_INIT},\n\
+@var{GOACC_REDUCTION_FINI} and\n\
+@var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\
+@var{call} is gimple statement containing the call to the function.  This\n\
+hook removes statement @var{call} after the expanded sequence has been\n\
+inserted.  This hook is also responsible for allocating any storage for\n\
+reductions when necessary.  It returns @var{true} if the expanded\n\
 sequence introduces any calls to OpenACC-specific internal functions.",
 bool, (gimple call),
 default_goacc_reduction)


* Re: [gomp4] New reduction infrastructure for OpenACC
  2015-09-01 12:53 ` Tom de Vries
@ 2015-09-01 16:23   ` Tom de Vries
  0 siblings, 0 replies; 4+ messages in thread
From: Tom de Vries @ 2015-09-01 16:23 UTC (permalink / raw)
  To: Cesar Philippidis; +Cc: gcc-patches, Nathan Sidwell, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1107 bytes --]

On 01/09/15 14:53, Tom de Vries wrote:
> On 19/08/15 21:42, Cesar Philippidis wrote:
>> +DEFHOOK
>> +(reduction,
>> +"This hook is used by the oacc_transform pass to expand calls to the\n\
>> +internal functions @var{GOACC_REDUCTION_SETUP},\n\
>> +@var{GOACC_REDUCTION_INIT},\n\ @var{GOACC_REDUCTION_FINI} and\n\
>> @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple
>> instructions.\n\ @var{call} is gimple statement containing the call to
>> the function.  This\n\ hook removes statement @var{call} after the
>> expanded sequence has been\n\ inserted.  This hook is also responsible
>> for allocating any storage for\n\ reductions when necessary.  It
>> returns @var{true} if the expanded\n\
>> +sequence introduces any calls to OpenACC-specific internal functions.",
>> +bool, (gimple call),
>> +default_goacc_reduction)
>> +
>
> This causes a gomp-4_0-branch bootstrap error:
> ...
> src/gcc/target.def:1685:1: error: unknown escape sequence: '\040' [-Werror]
> ...
>
> Currently testing attached patch, which allows 'make build/genhooks.o'
> to succeed.
>

Committed as attached.

Thanks,
- Tom


[-- Attachment #2: 0001-Fix-unknown-escape-sequence-error-in-DEFHOOK-reducti.patch --]
[-- Type: text/x-patch, Size: 2826 bytes --]

Fix unknown escape sequence error in DEFHOOK reduction

2015-09-01  Tom de Vries  <tom@codesourcery.com>

	* target.def: Fix unknown escape sequence \040 error in reduction
	DEFHOOK.
	* doc/tm.texi: Regenerate.
---
 gcc/doc/tm.texi | 12 ++++++------
 gcc/target.def  |  8 +++++++-
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index f53a54e..a151a10 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5772,12 +5772,12 @@ is no RTL expanders for them.
 This hook is used by the oacc_transform pass to expand calls to the
 internal functions @var{GOACC_REDUCTION_SETUP},
 @var{GOACC_REDUCTION_INIT},
- @var{GOACC_REDUCTION_FINI} and
- @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.
- @var{call} is gimple statement containing the call to the function.  This
- hook removes statement @var{call} after the expanded sequence has been
- inserted.  This hook is also responsible for allocating any storage for
- reductions when necessary.  It returns @var{true} if the expanded
+@var{GOACC_REDUCTION_FINI} and
+@var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.
+@var{call} is gimple statement containing the call to the function.  This
+hook removes statement @var{call} after the expanded sequence has been
+inserted.  This hook is also responsible for allocating any storage for
+reductions when necessary.  It returns @var{true} if the expanded
 sequence introduces any calls to OpenACC-specific internal functions.
 @end deftypefn
 
diff --git a/gcc/target.def b/gcc/target.def
index 22e04f6..80b401f 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1684,7 +1684,13 @@ DEFHOOK
 (reduction,
 "This hook is used by the oacc_transform pass to expand calls to the\n\
 internal functions @var{GOACC_REDUCTION_SETUP},\n\
-@var{GOACC_REDUCTION_INIT},\n\ @var{GOACC_REDUCTION_FINI} and\n\ @var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\ @var{call} is gimple statement containing the call to the function.  This\n\ hook removes statement @var{call} after the expanded sequence has been\n\ inserted.  This hook is also responsible for allocating any storage for\n\ reductions when necessary.  It returns @var{true} if the expanded\n\
+@var{GOACC_REDUCTION_INIT},\n\
+@var{GOACC_REDUCTION_FINI} and\n\
+@var{GOACC_REDUCTION_TEARDOWN} into a sequence of gimple instructions.\n\
+@var{call} is gimple statement containing the call to the function.  This\n\
+hook removes statement @var{call} after the expanded sequence has been\n\
+inserted.  This hook is also responsible for allocating any storage for\n\
+reductions when necessary.  It returns @var{true} if the expanded\n\
 sequence introduces any calls to OpenACC-specific internal functions.",
 bool, (gimple call),
 default_goacc_reduction)
-- 
1.9.1


