\documentclass[a4paper]{article} \newcommand{\brk}{\linebreak[0]} \newcommand{\codespc}{\pagebreak[2]\\[1ex plus2pt minus1pt]} \newcommand{\codebrk}{\pagebreak[2]\\} \newcommand{\define}[1]{{\bf #1}} \begingroup \catcode`\~ = 12 \global\def\twiddle{~} \endgroup \newenvironment{codeblock}% {\begin{quote}\tt\begin{tabbing}% \hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=% \hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=% \kill}% {\end{tabbing}\end{quote}} \begin{document} \title{OpenACC Reductions} \author{Nathan Sidwell} \date{2015-08-13} \maketitle \begin{abstract} This document describes the design of the OpenACC reduction implementation in GCC. It is intended to be sufficiently general in the early part of the compiler, becoming more target-specific once we have entered the target-specific device compiler. \end{abstract} \section{Changes} \begin{itemize} \item[2015-07-28] Initial version. \item[2015-07-30] Change internal builtins to not take addresses. \item[2015-08-03] Note about templated builtin expansion. \item[2015-08-07] Discuss reductions at outer parallel construct. \item[2015-08-13] Note reductions at outer parallel consume inner gang reduction. Comment on memory barriers. \end{itemize} \section{General Features} We cannot emit anything that depends on a device feature before we've entered the device compiler. This means that anything happening in gimplify or omp-low has to be generic. It has to be sufficiently generic to permit other architectures to implement reductions. Thus, anything emitted here, beyond simply noting the gang/worker/vector level of the execution environment, cannot know anything about gang, worker or vector beyond what the abstract specification describes. \subsection{Compiler Passes} The following passes are relevant to OpenACC compilation: \begin{enumerate} \item[Gimplify] This is where variables used in a parallel region are noted, and the required transformation is determined -- for instance copy/private/firstprivate etc. These are all made explicit by augmenting the parallel clause itself. \item[Omp-lower] This is where a parallel region is broken out into a separate function, and variables are rewritten according to copy, private or whatever. The structure describing where copied variables are located is created for both host and target sides. \item[Omp-expand] This is where loops are converted from serial form into a form to be executed in parallel on multiple threads. The multiple threads are implicit -- abstract functions provide the number of threads and the thread number of the current thread. \item[LTO write/read] The offloaded functions are written out to disc and read back into the device-specific compiler. The host-side compiler continues with its regular compilation. Following this point there are essentially two threads of compilation -- one for host and one for device. \item[Oacc-xform] This new pass is responsible for lowering the abstractions created in omp-lower and/or omp-expand into device-specific code sequences. Such sequences should be regular gimple as much as possible, but may include device-specific builtins. The expansions are done via target-specific hooks and default fallbacks. This step is also responsible for checking launch dimensions. \item[Expand-RTL] This pass expands any remaining internal functions and device-specific builtins into device-specific RTL. \item[Mach-dep-reorg] This pass may perform device-specific reorganization.
\end{enumerate} \subsection[Fork/Join]{Fork/Join} Omp-expand emits fork and join builtins around gang, worker and vector partitioned regions. In oacc-xform these are deleted by targets that do not need them. For PTX the hook deletes gang-level forks and joins, as they are irrelevant to the PTX execution model. The remaining forks and joins are expanded to PTX pseudo instructions at RTL expansion time, and those are propagated all the way through to the PTX reorg pass, where they get turned into actual code. In the PTX case they are not real forks and joins: in the non-forked region we neuter all but one thread at the appropriate level, so they are emulated forks and joins for us. The reduction machinery will hang off these markers by inserting additional builtins before and after them. These builtins will be expanded by device-specific code in the oacc-xform pass. Whether the expansion of those builtins generates device-specific builtins is a property of the particular backend. In the case of PTX, what each builtin expands to depends on the level of the reduction. \subsection{Reduction objects} The object used for a reduction has two separate lifetimes: \begin{itemize} \item The lifetime before (and after) the reduction loop. We refer to this as the RESULT object. \item The lifetime within the reduction loop. We refer to this as the LOCAL object. \end{itemize} There is one instance of the former, but N instances of the latter -- one per executing thread. The RESULT object might be a copy object received from the host, or it might be a local private object within an outer loop. Some observations can be made: \begin{itemize} \item The LOCAL object can be a private object. Indeed, this is desired so that each thread has its own local instance (hence the name). \item Because reductions over gang-level loops are only specified to produce a final result once the enclosing parallel is completed, the RESULT object must be a copy object. That is the only way for a result to be returned to the program. If it is not a copy object, the behaviour is undefined and the programmer is confused. \item The RESULT object will most often be a private object for worker- and vector-level loops. However, it could be a copy object, most likely when only a single gang is executing, and it could even be a copy object within a gang loop. \item A reduction over one loop may be contained as a reduction within an outer loop. The inner reduction's RESULT object is the outer loop's LOCAL object. \end{itemize} We can take advantage of these latter two observations by surrounding worker and vector reductions to copy objects with a dummy gang-level reduction of the same variable. This means that worker- and vector-level reductions only have to deal with the case where RESULT is a private variable. Gang-level reductions will have to deal with the case of RESULT being a copy object or a private object. However, as we shall see, the latter case turns out to be a NOP. \subsubsection{Reductions on Parallel} OpenACC permits a reduction to be specified on a parallel construct. Such a reduction can be operated on in redundant execution mode, outside of any gang-level loop. While strange, there is no reason for this not to work. We can achieve that by creating a null gang loop at the outermost level and eliding any reduction machinery on an inner gang loop that might concern the same variable. \subsection{Atomics} Atomic operations may not be available for all required data types and operators. Indeed, this is true for PTX, so we must address it.
This document treats the lack of atomics as the general case and then treats the presence of an atomic operation as an optimization. Indeed, at the omp-expand point we do not know what the target properties are, so the general machinery does not rely on specific atomics being available. \subsection{Variable Rewriting} The OMP machinery has the capability of rewriting a variable used within an offload region. This is necessary for copy objects, so that, say, `a' is turned into something like `*omp\_struct->copied\_a\_addr'. We shall see that for reductions of copy objects we will need to inhibit that rewriting. We do not have to create additional rewriting. The copy object rewriting must be inhibited within the loops in which the reduction is active. Outside of such loops, the copy rewriting must occur. For instance: \begin{codeblock}% \#pragma acc parallel copy (a) num\_workers(5)\\* \{\\* \>a = 5; // must be rewritten\\* \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots} ) \{\\* \>\>a += ix; // must not be rewritten\\* \>\}\\* \>if (a) {\dots} // must be rewritten\\* \}\end{codeblock} Fortunately each basic block is mapped to the region in which it occurs, so the rewriting code can determine whether it is contained in a region that has an active reduction on any particular variable. \subsubsection{Fortran Pass-by-Reference Parameters} Fortran subroutines pass scalars by implicit reference. If such a parameter is used as a reduction variable, the referenced scalar needs privatizing, so that different threads have their own instance. This is actually an instance of a different problem, and punting on it for the initial implementation of reductions is acceptable, as it only affects subroutines that are called via the routine directive, which is itself deferred. \section{Early Builtins} We shall define four new internal builtins to deal with reductions in an abstract manner. We shall also define a lock/unlock pair to delimit an atomic region. The four builtins correspond to the four locations: \begin{itemize} \item before the partitioned region \item start of the partitioned region \item end of the partitioned region \item after the partitioned region. \end{itemize} All these functions take the same six arguments: \begin{itemize} \item[RES\_PTR] Address of the copy RESULT object, or NULL if the RESULT is a private variable. \item[LOCAL] The LOCAL object -- the private variable used to hold the individual reductions. \item[LEVEL] Loop level (a constant) \item[OP] Reduction operator (a constant) \item[LID] Loop identifier (a constant) \item[RID] Reduction identifier (a constant) \end{itemize} For worker- and vector-level reductions, the RES\_PTR argument will be NULL. For gang-level loops, RES\_PTR will be the rewritten reference to the copied object (or NULL if there is none). These functions all return a new LOCAL object value -- thus they all appear as part of an assignment. \subsection[Loop \& Reduction Identifiers]{Loop \& Reduction Identifiers} We need to uniquely identify the set of functions referring to the same reduction. This is needed so the later expanders (at oacc-xform, expand-rtl and perhaps mach-dep-reorg) can determine the members of the same set. Those expanders might encounter these builtins in an arbitrary order -- there's no guarantee the basic-blocks will be iterated over in program order (which itself can be an ill-defined concept).
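As a concrete illustration, consider a sketch of two successive reduction loops over the same variable within one offloaded function (the array names and the LID/RID numbering shown in the comments are hypothetical): \begin{codeblock}% \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots}) // builtins for this loop get, say, LID 0, RID 0\\* \>\>a += x[ix];\\* \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots}) // builtins for this loop get LID 1, RID 0\\* \>\>a += y[ix]; \end{codeblock}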
Note that the same variable may participate in multiple reductions (sequentially), so neither RESULT nor LOCAL guarantee uniqueness. We could have simply used a single counter to uniquely identify all the reductions in the same function, but it turned out that knowing the set of different reductions for a particular loop is useful (an example is shown later). Thus we have the loop identifier and reduction identifier arguments. Both of these will be integer constants, allocated from local counters. The loop identifier counter has a different value for each loop (with or without reductions, whichever is simplest). The reduction identifier has a different value for each reduction within the same loop. The integer values themselves have no meaning (so, for instance, the reduction identifier could be unique over all reductions of the function). As we shall see, the same loop identifier number may be used at different levels. Thus the \{loop identifier, level\} tuple must be used to uniquely identify a specific loop \& level. This will fail if a routine is inlined, because both the inliner and inlinee may use the same loop id. It would be insufficient to simply use unique loop ids throughout the compilation, as a function might be inlined multiple times into the same function. For the moment we can simply inhibit inlining of functions containing reductions. \subsection{Reduction Internal Builtin Functions} We define internal builtins for the reductions, as internal builtins are specifically designed for functions intended to be expanded later in the compilation and are not related to any fallback library call. \begin{codeblock}% T IFN\_RED\_SETUP (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed before the partitioned region. It is used to implement any single-mode setup that is needed, typically initializing temporary storage. \begin{codeblock}% T IFN\_RED\_TEARDOWN (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed after the partitioned region. It is used to implement any single-mode teardown that is needed, typically updating the RESULT object. \begin{codeblock}% T IFN\_RED\_INIT (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed at the start of the partitioned region, before the loop is entered. It is used to implement any partitioned-mode initialization, typically initializing the LOCAL object in each thread. \begin{codeblock}% T IFN\_RED\_FINI (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed at the end of the partitioned region, after the loop is exited. It is used to implement any partitioned-mode finalization, typically combining the per-thread LOCAL objects into a single value. The two lock builtins are: \begin{codeblock}% void IFN\_RED\_LOCK (LEVEL, LID)\\* void IFN\_RED\_UNLOCK (LEVEL, LID) \end{codeblock} These are placed around all the IFN\_RED\_FINI calls that happen at the end of a loop containing reductions. If that placement is too tricky, they can be placed around each individual IFN\_RED\_FINI call, at the expense of performance. The IFN\_RED\_FINI function must be marked unique, so that it is neither cloned nor subject to tail commoning. (It is possible all these functions should be so marked in the general case.) The FORK and JOIN functions already have this property.
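To illustrate the bracketing of the finalization calls, here is a sketch of the end of a worker loop containing two hypothetical reductions, `a' and `b'; they share a LID but get distinct RIDs: \begin{codeblock}% IFN\_RED\_LOCK (WORKER, 0)\\* a = IFN\_RED\_FINI (NULL, a, WORKER, +, 0, 0)\\* b = IFN\_RED\_FINI (NULL, b, WORKER, *, 0, 1)\\* IFN\_RED\_UNLOCK (WORKER, 0) \end{codeblock}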
\subsection{Examples} Here are some examples of the code after omp-expand has inserted the above functions: \begin{codeblock}% // gang level loop with copy object\\* \#parallel copy(a) gang\\* \{\\* \>\#loop gang reduction(+:a)\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_FORK (GANG)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>for (\dots) \{ \dots \}\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>IFN\_JOIN (GANG)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // worker level loop with private object\\* \#parallel worker\\* \{\\* \>int a = 0;\\* \>\#loop worker reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // worker level loop with copy object\\* \#parallel worker copy (a)\\* \{\\* \>// Insert dummy gang reduction at start.\\* \>// Note this uses the same RID \& LID as the inner worker loop.\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>\#loop worker reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \>// Dummy gang reduction at end\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} Note that if the above worker reduction were inside a gang loop, the inserted dummy gang reduction could be placed inside or outside the gang-level loop. Placing it outside the loop would be more efficient, as there would then only be one execution of the gang reduction per gang, rather than one execution per iteration. This can of course be implemented as a later optimization. The vector cases are exactly like the worker cases.
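For completeness, here is a sketch of the vector analogue of the worker example above; it simply substitutes VECTOR for WORKER: \begin{codeblock}% // vector level loop with private object\\* \#parallel vector\\* \{\\* \>int a = 0;\\* \>\#loop vector reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_FORK (VECTOR)\\* \>a = IFN\_INIT (NULL, a, VECTOR, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (VECTOR, 0)\\* \>a = IFN\_FINI (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_UNLOCK (VECTOR, 0)\\* \>IFN\_JOIN (VECTOR)\\* \>a = IFN\_TEARDOWN (NULL, a, VECTOR, +, 0, 0)\\* \}\end{codeblock}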
\begin{codeblock}% // combined worker \& vector level loop with private object\\* \#parallel worker vector\\* \{\\* \>int a = 0;\\* \>\#loop worker vector reduction(+:a)\\* \>// Insert worker setup and init around worker fork\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>// Insert vector setup and init around vector fork\\* \>a = IFN\_SETUP (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_FORK (VECTOR)\\* \>a = IFN\_INIT (NULL, a, VECTOR, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>// Vector fini \& teardown around vector join\\* \>IFN\_LOCK (VECTOR, 0)\\* \>a = IFN\_FINI (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_UNLOCK (VECTOR, 0)\\* \>IFN\_JOIN (VECTOR)\\* \>a = IFN\_TEARDOWN (NULL, a, VECTOR, +, 0, 0)\\* \>// Worker fini \& teardown around worker join\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // reduction on parallel (with copy object)\\* \#parallel copy(a) gang reduction (+:a)\\* \{\\* \>// NULL outermost `loop'\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>{\dots} redundant mode gang execution\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // reduction on parallel with inner gang loop\\* \#parallel copy(a) gang reduction (+:a)\\* \{\\* \>// NULL outermost `loop'\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>{\dots} gang redundant execution\\* \>\#loop gang reduction(+:a)\\* \>// no insertion of IFN\_SETUP or IFN\_INIT before loop\\* \>for ({\dots}) \{ {\dots} \}\\* \>// no insertion of IFN\_FINI or IFN\_TEARDOWN after loop.\\* \>{\dots} more gang redundant execution\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \section{OACC XFORM} The OACC-xform pass runs early in the device compiler, after LTO readback. This is where the above internal functions are expanded to a combination of regular code and device-specific builtins. We'll need to provide two implementations: \begin{itemize} \item A default implementation for host expansion. \item A PTX implementation. \end{itemize} These transformations should be done early in the XFORM pass, so that other optimizations it performs can be applied to the sequences emitted here. As is typical when describing compiler code transformations, one needs to show both expanded code sequences and internal compiler helper routines. The following expansions try to make it clear when a piece of code is part of the compiler, producing a result that is fed into the expansion. For instance: \begin{codeblock}% \{tmp = const\_int (5);\}\\* LOCAL = \{tmp\} \end{codeblock} Here the first line is part of the compiler, generating a tmp value, and the second line is part of the emitted code sequence, which uses that tmp value. I also use a template notation, using T to describe the type of the reduction object. \subsection{Host Implementation} The host fallback (\& default implementation) executes the loops as a single thread. This makes reductions trivial. \subsubsection{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP,
LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = RES\_PTR ? *RES\_PTR : LOCAL; \end{codeblock} \subsubsection[Init]{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \subsubsection{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \subsubsection{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% if (RES\_PTR) *RES\_PTR = LOCAL;\\* V = LOCAL; \end{codeblock} \subsubsection{Lock \& Unlock} The lock and unlock builtins expand to nothing. \subsection{PTX Implementation} The expansion for PTX differs for the different loop levels. The IFN\_RED\_FINI function is the most complex, and the worker level is the most involved. We introduce a number of PTX-specific builtins, which are documented later. \subsubsection{Vector Level} At the vector level, RES\_PTR is known to be NULL. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = IFN\_OACC\_DIM\_POS (VECTOR) ? \{init\_val (OP)\} : LOCAL \end{codeblock} This sets the initialization value for all but vector zero, which simply copies the incoming LOCAL value. \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to a binary tree of shuffles: \begin{codeblock}% for (ix = IFN\_OACC\_DIM\_SIZE (VECTOR) / 2; ix >= 1; ix >>= 1) \{\\* \ T tmp = ptx\_shuffle\_down (LOCAL, ix);\\* \ LOCAL = OP (LOCAL, tmp);\\* \}\\* V = LOCAL \end{codeblock} This relies on unified execution of the vector engines (and IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE returning a power of 2). \paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \paragraph{Lock \& Unlock} The lock and unlock functions expand to nothing at the vector level. \subsubsection{Gang Level} At the gang level, RES\_PTR might or might not be NULL. We will know at compile time -- there is no need for runtime checking. Memory barriers are only needed within the locked region, and will be inserted by the lock expansions themselves. We do not need additional barriers at the setup and teardown phases, because accessing the reduction object within the parallel region (outside of the reduction scope itself) is undefined behaviour. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% \{if RES\_PTR not NULL\} V = LOCAL\\* \{else\} Nothing \end{codeblock} \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = \{init\_val (OP)\} \end{codeblock} \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to a simple update of the *RES\_PTR object: \begin{codeblock}% V = OP (*RES\_PTR, LOCAL);\\* *RES\_PTR = V; \end{codeblock} The thread-safety of this relies on the lock expansions below.
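For example, taking the gang-level reduction on copy object `a' from the earlier examples (LID 0, RID 0), the locked finalization region would expand along these lines (a sketch; the lock expansion itself is described under Gang Locking below): \begin{codeblock}% ptx\_lock (GANG, 0)\\* a = *ompstruct->a + a;\\* *ompstruct->a = a;\\* ptx\_unlock (GANG, 0) \end{codeblock}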
\paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% \{if RES\_PTR not NULL\} V = LOCAL\\* \{else\} Nothing \end{codeblock} \paragraph{Gang Locking} The lock and unlock functions use a lock in global memory. We can allocate a single lock for the whole program. The expansions are: \begin{codeblock}% ptx\_lock (GANG, LID) \end{codeblock} \begin{codeblock}% ptx\_unlock (GANG, LID) \end{codeblock} respectively. \subsubsection{Worker Level} The worker level is the most complicated, because we need to create temporary storage in .shared memory to hold accumulation values. This relies heavily on PTX-specific builtins and helper routines. The approach is to allocate a buffer in .shared memory to hold the set of worker reductions for a worker loop. This buffer must be live from the first worker setup call of the loop to the last worker teardown call. This buffer must be distinct from the worker spill buffer used to implement worker state propagation at entry to worker partitioned mode, because both need to be concurrently live. However, we do not have to have separate buffers for worker reductions on different loops. Those are never executed concurrently (on a single CTA). This means we can allocate the buffer in a similar manner to the worker spill buffer: a single buffer can be reused for all worker loop reductions in the compilation, provided we make sure it is the largest size needed. To determine the size needed, we need to know the set of reductions for each worker loop. This is where the LID argument to the builtins comes in -- that will be the same for each reduction in the same worker loop. Actually allocating the buffer can be deferred to the RTL expansion stage, which gives the optimizers an opportunity to delete unreachable code, which might change the set of reductions needed. Thus expansion of worker-level reduction functions will propagate the LID and RID arguments to PTX-specific builtins, which will themselves perform the .shared buffer allocation. As with vector-level reductions, RES\_PTR is known to be NULL at compile time. Memory barriers are needed within the locked region used by the finalization. These are inserted by the lock and unlock builtins themselves. We do not need explicit memory barriers after the setup store or before the teardown read because they are inserted by the fork and join expansions themselves. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% *ptx\_work\_red\_addr (LID, RID) = LOCAL; \end{codeblock} This will eventually expand to a write into the .shared memory buffer of the incoming reduction value. \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = \{init\_val (OP)\} \end{codeblock} This initializes each worker's instance of the reduction variable. \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% T tmp = *ptx\_work\_red\_addr (LID, RID);\\* tmp = OP (tmp, LOCAL);\\* *ptx\_work\_red\_addr (LID, RID) = tmp; \end{codeblock} The thread-safety of this relies on the lock expansions below.
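For example, a worker-level `+' reduction on a private variable `a' (LID 0, RID 0) would end up with a locked finalization region of roughly this shape (a sketch; ptx\_lock and ptx\_work\_red\_addr are described below): \begin{codeblock}% ptx\_lock (WORKER, 0)\\* T tmp = *ptx\_work\_red\_addr (0, 0);\\* tmp = tmp + a;\\* *ptx\_work\_red\_addr (0, 0) = tmp;\\* ptx\_unlock (WORKER, 0) \end{codeblock}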
\paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = *ptx\_work\_red\_addr (LID, RID); \end{codeblock} \paragraph{Worker Locking} The lock and unlock functions use a lock in .shared memory. We can allocate a single lock for the whole program. The expansions are: \begin{codeblock}% ptx\_lock (WORKER, LID) \end{codeblock} \begin{codeblock}% ptx\_unlock (WORKER, LID) \end{codeblock} respectively. \subsubsection[OpenACC Builtins]{OpenACC Builtins} The above used two existing OpenACC builtins, IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE and IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}POS. Note that the DIM\_SIZE builtin is often expanded to a constant within the OACC\_XFORM pass itself. Thus we must ensure that this transformation is also run over the expansions emitted here. \subsubsection{PTX Builtins} The following PTX builtins were used above: \begin{codeblock}% T ptx\_shuffle\_down (T \&obj, unsigned SHIFT) \end{codeblock} This function does a shuffle down, returning the SHIFT'th next vector engine's value of `obj'. \begin{codeblock}% T *ptx\_work\_red\_addr (unsigned LID, unsigned RID) \end{codeblock} This function returns the address of the reduction accumulation variable for loop LID, reduction RID. When this builtin is expanded at RTL generation time, the PTX backend will need to maintain a per-function table of mappings from \{LID, RID\} pairs to addresses. Entries will be inserted as encountered, and per-LID counters of the current allocation point need to be maintained. The high water mark across all LIDs needs to be stashed in a global variable examined at the end of compilation, in the same manner as the worker\_spill buffer. \begin{codeblock}% void ptx\_lock (unsigned LEVEL, unsigned LID)\\* void ptx\_unlock (unsigned LEVEL, unsigned LID) \end{codeblock} These two functions implement a spin-lock and associated unlock at a particular level. The only two levels needed are gang and worker level -- vector reductions do not require locking. The only difference between the two is the location of the lock variable. These expand to an unspec, whose presence will be noted during ptx reorg. That will cause the applicable .global or .shared lock variable to be emitted at end of compilation. They also insert memory barriers before and after the protected region, to ensure external state changes are properly observed within the region, and changes made in the region are properly observed outside the region. The lock itself will look like: \begin{codeblock}% 1:\>\>atom.\$space.cas.u32 \$res,[\$lock\_var],0,1\\* \>\>setp.ne.u32 \$pred,\$res,0\\* \>\>@\$pred bra.uni 1b\\* \>\>membar.\$space \end{codeblock} The compare-and-swap returns the original value of \$lock\_var; thus, when the lock is obtained, \$res will be zero, and when the lock is already taken, it will be 1. The expander emits this as a simple loop. The unlock will be: \begin{codeblock}% \>\>membar.\$space\\* \>\>atom.\$space.cas.u32 \$res,[\$lock\_var],1,0 \end{codeblock} \begin{codeblock}% void ptx\_mem\_bar (unsigned LEVEL) \end{codeblock} This inserts a memory barrier at the appropriate level. \subsubsection{Templated Builtins} The high-level reduction machinery must handle boolean, integer (including character), float and complex types.
Rather than provide each ptx builtin for the full set of types, only a fundamental subset shall be provided: \begin{itemize} \item SImode (uint) \item DImode (ulong) \item SFmode (float) \item DFmode (double) \end{itemize} The OACC\_XFORM pass must deal with composing these to process other types. \begin{itemize} \item For smaller integer types, promote them to an SImode type. \item For complex types, decompose into two instances of the underlying type. \end{itemize} Note that in the latter case, the underlying ptx\_work\_red\_addr call, if used, must have distinguishing RID values for the different parts of the object. This could be achieved by left-shifting all incoming RID values and using bit zero as the distinguisher (all such compound types have only two underlying components). \subsubsection{Atomics Optimization} The existence of atomic operations can be dealt with later. They are an optimization. The PTX backend provides a target hook that can be queried by the OACC\_XFORM pass when it is expanding the RED\_FINI function. Before emitting the non-atomic code for worker or gang level, the hook should be queried to determine if an atomic operation exists for the operation and type required. If it does, the appropriate builtin should be emitted rather than the non-atomic sequence. There is a wrinkle though. If atomic operations are used, barriers need inserting before and after the atomic operations, as they are asynchronous with respect to normal reads and writes. It would be inefficient to insert barriers around each atomic operation if there are several reductions at the same level (although that is a rare case). We can use the LID identifier to record whether atomic operations are used in any particular loop finalization, and augment the ptx reorg pass to insert barriers at the lock and unlock instructions of such loops. We can also record whether all reductions for a particular loop use atomics and elide the locking in that case. More strictly, it would be sufficient to place barriers after the setup write and before the teardown read. This could be achieved by defining additional ptx builtins and emitting them during OACC\_XFORM, expanding to RTL and then processing during PTX reorg. All told, the whole use of atomics may be excessive and not worth the effort. \end{document}