\documentclass[a4paper]{article} \newcommand{\brk}{\linebreak[0]} \newcommand{\codespc}{\pagebreak[2]\\[1ex plus2pt minus1pt]} \newcommand{\codebrk}{\pagebreak[2]\\} \newcommand{\define}[1]{{\bf #1}} \begingroup \catcode`\~ = 12 \global\def\twiddle{~} \endgroup \newenvironment{codeblock}% {\begin{quote}\tt\begin{tabbing}% \hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=% \hspace{1em}\=\hspace{1em}\=\hspace{1em}\=\hspace{1em}\=% \kill}% {\end{tabbing}\end{quote}} \begin{document} \title{OpenACC Reductions} \author{Nathan Sidwell} \date{2015-08-13} \maketitle \begin{abstract} This document describes the design of the OpenACC reduction implementation in GCC. It is intended to be sufficiently general in the early part of the compiler, becoming more target-specific once we have entered the target-specific device compiler. \end{abstract} \section{Changes} \begin{itemize} \item[2015-07-28] Initial version. \item[2015-07-30] Change internal builtins to not take addresses. \item[2015-08-03] Note about templated builtin expansion. \item[2015-08-07] Discuss reductions at outer parallel construct. \item[2015-08-13] Note reductions at outer parallel consume inner gang reduction. Comment on memory barriers. \end{itemize} \section{General Features} We cannot emit anything that depends on a device feature before we've entered the device compiler. This means that anything happening in gimplify or omp-low has to be generic. It has to be sufficiently generic to permit other architectures to implement reductions. Thus, anything emitted here, beyond simply noting the gang/worker/vector level of the execution environment, cannot know anything about gang, worker or vector beyond what the abstract specification describes. \subsection{Compiler Passes} The following passes are relevant to OpenACC compilation: \begin{enumerate} \item[Gimplify] This is where variables used in a parallel region are noted, and the required transformation is determined -- for instance copy/private/firstprivate etc. These are all made explicit by augmenting the parallel clause itself. \item[Omp-lower] This is where a parallel region is broken out into a separate function, and variables are rewritten according to copy, private or whatever. The structure describing where copied variables are located is created for both host and target sides. \item[Omp-expand] This is where loops are converted from serial form into a form to be executed in parallel on multiple threads. The multiple threads are implicit -- abstract functions provide the number of threads and the thread number of the current thread. \item[LTO write/read] The offloaded functions are written out to disc and read back into the device-specific compiler. The host-side compiler continues with its regular compilation. Following this point there are essentially two threads of compilation -- one for host and one for device. \item[Oacc-xform] This new pass is responsible for lowering the abstractions created in omp-lower and/or omp-expand into device-specific code sequences. Such sequences should be regular gimple as much as possible, but may include device-specific builtins. The expansions are done via target-specific hooks and default fallbacks. This step is also responsible for checking launch dimensions. \item[Expand-RTL] This pass expands any remaining internal functions and device-specific builtins into device-specific RTL. \item[Mach-dep-reorg] This pass may perform device-specific reorganization.
\end{enumerate} \subsection[Fork/Join]{Fork/Join} Omp-expand emits fork and join builtins around gang, worker and vector partitioned regions. In oacc-xform these are deleted by targets that do not need them. For PTX the hook deletes gang-level forks and joins, as they are irrelevant to the PTX execution model. The remaining forks and joins are expanded to PTX pseudo instructions at RTL expansion time, and those are propagated all the way through to the PTX reorg pass, where they get turned into actual code. In the PTX case they are not real forks and joins: in the non-forked region we neuter all but one thread at the appropriate level, so they are emulated forks and joins for us. The reduction machinery will hang off these markers by inserting additional builtins before and after them. These builtins will be expanded by device-specific code in the oacc-xform pass. Whether the expansion of those builtins generates device-specific builtins is a property of the particular backend. In the case of PTX, what each builtin expands to depends on the level of the reduction. \subsection{Reduction objects} The object used for a reduction has two separate lifetimes: \begin{itemize} \item The lifetime before (and after) the reduction loop. We refer to this as the RESULT object. \item The lifetime within the reduction loop. We refer to this as the LOCAL object. \end{itemize} There is one instance of the former, but N instances of the latter -- one per executing thread. The RESULT object might be a copy object received from the host, or it might be a local private object within an outer loop. Some observations can be made: \begin{itemize} \item The LOCAL object can be a private object. Indeed, this is desired so that each thread has its own local instance (hence the name). \item Because reductions over gang-level loops are only specified to produce a final result once the enclosing parallel is completed, the RESULT object must be a copy object. That is the only way for a result to be returned to the program. If it is not a copy object, the behaviour is undefined and the programmer is confused. \item The RESULT object will most often be a private object for worker- and vector-level loops. However, it could be a copy object, most likely when only a single gang is executing, and it could even be a copy object within a gang loop. \item A reduction over one loop may be contained as a reduction within an outer loop. The inner reduction's RESULT object is the outer loop's LOCAL object. \end{itemize} We can take advantage of these latter two observations by surrounding worker and vector reductions to copy objects with a dummy gang-level reduction of the same variable. This means that worker- and vector-level reductions only have to deal with the case where RESULT is a private variable. Gang-level reductions will have to deal with the case of RESULT being a copy object or a private object. However, as we shall see, the latter case turns out to be a NOP. \subsubsection{Reductions on Parallel} OpenACC permits a reduction to be specified on a parallel construct. Such a reduction can be operated on in redundant execution mode, outside of any gang-level loop. While strange, there is no reason for this not to work. We can achieve that by creating a null gang loop at the outermost level and eliding any reduction machinery on an inner gang loop that might concern the same variable. \subsection{Atomics} Atomic operations may not be available for all required data types and operators. Indeed, this is true for PTX, so we must address it.
This document treats the lack of atomics as the general case and then treats the presence of an atomic operation as an optimization. Indeed, at the omp-expand point we do not know what the target properties are, so the general machinery does not rely on specific atomics being available. \subsection{Variable Rewriting} The OMP machinery has the capability of rewriting a variable used within an offload region. This is necessary for copy objects, so that, say, `a' is turned into something like `*omp\_struct->copied\_a\_addr'. We shall see that for reductions of copy objects we will need to inhibit that rewriting. We do not have to create additional rewriting. The copy object rewriting must be inhibited within the loops in which the reduction is active. Outside of such loops, the copy rewriting must occur. For instance: \begin{codeblock}% \#pragma acc parallel copy (a) num\_workers(5)\\* \{\\* \>a = 5; // must be rewritten\\* \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots} ) \{\\* \>\>a += ix; // must not be rewritten\\* \>\}\\* \>if (a) {\dots} // must be rewritten\\* \}\end{codeblock} Fortunately each basic block is mapped to the region in which it occurs, so the rewriting code can determine whether it is contained in a region that has an active reduction on any particular variable. \subsubsection{Fortran Pass-by-Reference Parameters} Fortran subroutines pass scalars by implicit reference. If such a parameter is used as a reduction variable, the referenced scalar needs privatizing, so that different threads have their own instance. This is actually an instance of a different problem, and punting on it for the initial implementation of reductions is acceptable, as it only affects subroutines that are called via the routine directive, which is itself deferred. \section{Early Builtins} We shall define four new internal builtins to deal with reductions in an abstract manner. We shall also define a lock/unlock pair to delimit an atomic region. The four builtins correspond to the four locations: \begin{itemize} \item before the partitioned region \item start of the partitioned region \item end of the partitioned region \item after the partitioned region. \end{itemize} All these functions take the same six arguments: \begin{itemize} \item[RES\_PTR] Address of the copy RESULT object, or NULL if the RESULT is a private variable. \item[LOCAL] The LOCAL object -- the private variable used to hold the individual reductions. \item[LEVEL] Loop level (a constant) \item[OP] Reduction operator (a constant) \item[LID] Loop identifier (a constant) \item[RID] Reduction identifier (a constant) \end{itemize} For worker- and vector-level reductions, the RES\_PTR argument will be NULL. For gang-level loops, RES\_PTR will be the rewritten reference to the copied object (or NULL if there is none). These functions all return a new LOCAL object value -- thus they all appear as part of an assignment. \subsection[Loop \& Reduction Identifiers]{Loop \& Reduction Identifiers} We need to uniquely identify the set of functions referring to the same reduction. This is needed so the later expanders (at oacc-xform, expand-rtl and perhaps mach-dep-reorg) can determine the members of the same set. Those expanders might encounter these builtins in an arbitrary order -- there's no guarantee the basic-blocks will be iterated over in program order (which itself can be an ill-defined concept).
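As a concrete illustration, consider a sketch of two successive reduction loops over the same variable within one offloaded function (the array names and the LID/RID numbering shown in the comments are hypothetical): \begin{codeblock}% \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots}) // builtins for this loop get, say, LID 0, RID 0\\* \>\>a += x[ix];\\* \#pragma acc loop worker reduction(+:a)\\* \>for (ix {\dots}) // builtins for this loop get LID 1, RID 0\\* \>\>a += y[ix]; \end{codeblock}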
Note that the same variable may participate in multiple reductions (sequentially), so neither RESULT nor LOCAL guarantee uniqueness. We could have simply used a single counter to uniquely identify all the reductions in the same function, but it turned out that knowing the set of different reductions for a particular loop is useful (an example is shown later). Thus we have the loop identifier and reduction identifier arguments. Both of these will be integer constants, allocated from local counters. The loop identifier counter has a different value for each loop (with or without reductions, whichever is simplest). The reduction identifier has a different value for each reduction within the same loop. The integer values themselves have no meaning (so, for instance, the reduction identifier could be unique over all reductions of the function). As we shall see, the same loop identifier number may be used at different levels. Thus the \{loop identifier, level\} tuple must be used to uniquely identify a specific loop \& level. This will fail if a routine is inlined, because both the inliner and inlinee may use the same loop id. It would be insufficient to simply use unique loop ids throughout the compilation, as a function might be inlined multiple times into the same function. For the moment we can simply inhibit inlining of functions containing reductions. \subsection{Reduction Internal Builtin Functions} We define internal builtins for the reductions, as internal builtins are specifically designed for functions intended to be expanded later in the compilation and are not related to any fallback library call. \begin{codeblock}% T IFN\_RED\_SETUP (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed before the partitioned region. It is used to implement any single-mode setup that is needed, typically initializing temporary storage. \begin{codeblock}% T IFN\_RED\_TEARDOWN (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed after the partitioned region. It is used to implement any single-mode teardown that is needed, typically updating the RESULT object. \begin{codeblock}% T IFN\_RED\_INIT (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed at the start of the partitioned region, before the loop is entered. It is used to implement any partitioned-mode initialization, typically initializing the LOCAL object in each thread. \begin{codeblock}% T IFN\_RED\_FINI (T *RES\_PTR, T LOCAL, LEVEL, OP, LID, RID) \end{codeblock} This is placed at the end of the partitioned region, after the loop is exited. It is used to implement any partitioned-mode finalization, typically combining the per-thread LOCAL objects into a single value. The two lock builtins are: \begin{codeblock}% void IFN\_RED\_LOCK (LEVEL, LID)\\* void IFN\_RED\_UNLOCK (LEVEL, LID) \end{codeblock} These are placed around all the IFN\_RED\_FINI calls that happen at the end of a loop containing reductions. If that placement is too tricky, they can be placed around each individual IFN\_RED\_FINI call, at the expense of performance. The IFN\_RED\_FINI function must be marked unique, so that it is neither cloned nor subject to tail commoning. (It is possible all these functions should be so marked in the general case.) The FORK and JOIN functions already have this property.
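To illustrate the bracketing of the finalization calls, here is a sketch of the end of a worker loop containing two hypothetical reductions, `a' and `b'; they share a LID but get distinct RIDs: \begin{codeblock}% IFN\_RED\_LOCK (WORKER, 0)\\* a = IFN\_RED\_FINI (NULL, a, WORKER, +, 0, 0)\\* b = IFN\_RED\_FINI (NULL, b, WORKER, *, 0, 1)\\* IFN\_RED\_UNLOCK (WORKER, 0) \end{codeblock}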
\subsection{Examples} Here are some examples of the code after omp-expand has inserted the above functions: \begin{codeblock}% // gang level loop with copy object\\* \#parallel copy(a) gang\\* \{\\* \>\#loop gang reduction(+:a)\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_FORK (GANG)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>for (\dots) \{ \dots \}\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>IFN\_JOIN (GANG)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // worker level loop with private object\\* \#parallel worker\\* \{\\* \>int a = 0;\\* \>\#loop worker reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // worker level loop with copy object\\* \#parallel worker copy (a)\\* \{\\* \>// Insert dummy gang reduction at start.\\* \>// Note this uses the same RID \& LID as the inner worker loop.\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>\#loop worker reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \>// Dummy gang reduction at end\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} Note that if the above worker reduction were inside a gang loop, the inserted dummy gang reduction could be placed inside or outside the gang-level loop. Placing it outside the loop would be more efficient, as there would then only be one execution of the gang reduction per gang, rather than one execution per iteration. This can of course be implemented as a later optimization. The vector cases are exactly like the worker cases.
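For completeness, here is a sketch of the vector analogue of the worker example above; it simply substitutes VECTOR for WORKER: \begin{codeblock}% // vector level loop with private object\\* \#parallel vector\\* \{\\* \>int a = 0;\\* \>\#loop vector reduction(+:a)\\* \>a = IFN\_SETUP (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_FORK (VECTOR)\\* \>a = IFN\_INIT (NULL, a, VECTOR, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>IFN\_LOCK (VECTOR, 0)\\* \>a = IFN\_FINI (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_UNLOCK (VECTOR, 0)\\* \>IFN\_JOIN (VECTOR)\\* \>a = IFN\_TEARDOWN (NULL, a, VECTOR, +, 0, 0)\\* \}\end{codeblock}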
\begin{codeblock}% // combined worker \& vector level loop with private object\\* \#parallel worker vector\\* \{\\* \>int a = 0;\\* \>\#loop worker vector reduction(+:a)\\* \>// Insert worker setup and init around worker fork\\* \>a = IFN\_SETUP (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_FORK (WORKER)\\* \>a = IFN\_INIT (NULL, a, WORKER, +, 0, 0)\\* \>// Insert vector setup and init around vector fork\\* \>a = IFN\_SETUP (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_FORK (VECTOR)\\* \>a = IFN\_INIT (NULL, a, VECTOR, +, 0, 0)\\* \>for ({\dots}) \{ {\dots} \}\\* \>// Vector fini \& teardown around vector join\\* \>IFN\_LOCK (VECTOR, 0)\\* \>a = IFN\_FINI (NULL, a, VECTOR, +, 0, 0)\\* \>IFN\_UNLOCK (VECTOR, 0)\\* \>IFN\_JOIN (VECTOR)\\* \>a = IFN\_TEARDOWN (NULL, a, VECTOR, +, 0, 0)\\* \>// Worker fini \& teardown around worker join\\* \>IFN\_LOCK (WORKER, 0)\\* \>a = IFN\_FINI (NULL, a, WORKER, +, 0, 0)\\* \>IFN\_UNLOCK (WORKER, 0)\\* \>IFN\_JOIN (WORKER)\\* \>a = IFN\_TEARDOWN (NULL, a, WORKER, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // reduction on parallel (with copy object)\\* \#parallel copy(a) gang reduction (+:a)\\* \{\\* \>// NULL outermost `loop'\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>{\dots} redundant mode gang execution\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \begin{codeblock}% // reduction on parallel with inner gang loop\\* \#parallel copy(a) gang reduction (+:a)\\* \{\\* \>// NULL outermost `loop'\\* \>a = IFN\_SETUP (ompstruct->a, a, GANG, +, 0, 0)\\* \>a = IFN\_INIT (ompstruct->a, a, GANG, +, 0, 0)\\* \>{\dots} gang redundant execution\\* \>\#loop gang reduction(+:a)\\* \>// no insertion of IFN\_SETUP or IFN\_INIT before loop\\* \>for ({\dots}) \{ {\dots} \}\\* \>// no insertion of IFN\_FINI or IFN\_TEARDOWN after loop.\\* \>{\dots} more gang redundant execution\\* \>IFN\_LOCK (GANG, 0)\\* \>a = IFN\_FINI (ompstruct->a, a, GANG, +, 0, 0)\\* \>IFN\_UNLOCK (GANG, 0)\\* \>a = IFN\_TEARDOWN (ompstruct->a, a, GANG, +, 0, 0)\\* \}\end{codeblock} \section{OACC XFORM} The OACC-xform pass runs early in the device compiler, after LTO readback. This is where the above internal functions are expanded to a combination of regular code and device-specific builtins. We'll need to provide two implementations: \begin{itemize} \item A default implementation for host expansion. \item A PTX implementation. \end{itemize} These transformations should be done early in the XFORM pass, so that other optimizations it performs can be applied to the sequences emitted here. As is typical when describing compiler code transformations, one needs to show both expanded code sequences and internal compiler helper routines. The following expansions try to make it clear when a piece of code is part of the compiler, producing a result that is fed into the expansion. For instance: \begin{codeblock}% \{tmp = const\_int (5);\}\\* LOCAL = \{tmp\} \end{codeblock} Here the first line is part of the compiler, generating a tmp value, and the second line is part of the emitted code sequence, which uses that tmp value. I also use a template notation, using T to describe the type of the reduction object. \subsection{Host Implementation} The host fallback (\& default implementation) executes the loops as a single thread. This makes reductions trivial. \subsubsection{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP,
LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = RES\_PTR ? *RES\_PTR : LOCAL; \end{codeblock} \subsubsection[Init]{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \subsubsection{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \subsubsection{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% if (RES\_PTR) *RES\_PTR = LOCAL;\\* V = LOCAL; \end{codeblock} \subsubsection{Lock \& Unlock} The lock and unlock builtins expand to nothing. \subsection{PTX Implementation} The expansion for PTX differs for the different loop levels. The IFN\_RED\_FINI function is the most complex, and the worker level is the most involved. We introduce a number of PTX-specific builtins, which are documented later. \subsubsection{Vector Level} At the vector level, RES\_PTR is known to be NULL. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = IFN\_OACC\_DIM\_POS (VECTOR) ? \{init\_val (OP)\} : LOCAL \end{codeblock} This sets the initialization value for all but vector zero, which simply copies the incoming LOCAL value. \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to a binary tree of shuffles: \begin{codeblock}% for (ix = IFN\_OACC\_DIM\_SIZE (VECTOR) / 2; ix >= 1; ix >>= 1) \{\\* \ T tmp = ptx\_shuffle\_down (LOCAL, ix);\\* \ LOCAL = OP (LOCAL, tmp);\\* \}\\* V = LOCAL \end{codeblock} This relies on unified execution of the vector engines (and IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE returning a power of 2). \paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = LOCAL; \end{codeblock} \paragraph{Lock \& Unlock} The lock and unlock functions expand to nothing at the vector level. \subsubsection{Gang Level} At the gang level, RES\_PTR might or might not be NULL. We will know at compile time -- there is no need for runtime checking. Memory barriers are only needed within the locked region, and will be inserted by the lock expansions themselves. We do not need additional barriers at the setup and teardown phases, because accessing the reduction object within the parallel region (outside of the reduction scope itself) is undefined behaviour. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% \{if RES\_PTR not NULL\} V = LOCAL\\* \{else\} Nothing \end{codeblock} \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = \{init\_val (OP)\} \end{codeblock} \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to a simple update of the *RES\_PTR object: \begin{codeblock}% V = OP (*RES\_PTR, LOCAL);\\* *RES\_PTR = V; \end{codeblock} The thread-safety of this relies on the lock expansions below.
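For example, taking the gang-level reduction on copy object `a' from the earlier examples (LID 0, RID 0), the locked finalization region would expand along these lines (a sketch; the lock expansion itself is described under Gang Locking below): \begin{codeblock}% ptx\_lock (GANG, 0)\\* a = *ompstruct->a + a;\\* *ompstruct->a = a;\\* ptx\_unlock (GANG, 0) \end{codeblock}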
\paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% \{if RES\_PTR not NULL\} V = LOCAL\\* \{else\} Nothing \end{codeblock} \paragraph{Gang Locking} The lock and unlock functions use a lock in global memory. We can allocate a single lock for the whole program. The expansions are: \begin{codeblock}% ptx\_lock (GANG, LID) \end{codeblock} \begin{codeblock}% ptx\_unlock (GANG, LID) \end{codeblock} respectively. \subsubsection{Worker Level} The worker level is the most complicated, because we need to create temporary storage in .shared memory to hold accumulation values. This relies heavily on PTX-specific builtins and helper routines. The approach is to allocate a buffer in .shared memory to hold the set of worker reductions for a worker loop. This buffer must be live from the first worker setup call of the loop to the last worker teardown call. This buffer must be distinct from the worker spill buffer used to implement worker state propagation at entry to worker partitioned mode, because both need to be concurrently live. However, we do not have to have separate buffers for worker reductions on different loops. Those are never executed concurrently (on a single CTA). This means we can allocate the buffer in a similar manner to the worker spill buffer: a single buffer can be reused for all worker loop reductions in the compilation, provided we make sure it is the largest size needed. To determine the size needed, we need to know the set of reductions for each worker loop. This is where the LID argument to the builtins comes in -- that will be the same for each reduction in the same worker loop. Actually allocating the buffer can be deferred to the RTL expansion stage, which gives the optimizers an opportunity to delete unreachable code, which might change the set of reductions needed. Thus expansion of worker-level reduction functions will propagate the LID and RID arguments to PTX-specific builtins, which will themselves perform the .shared buffer allocation. As with vector-level reductions, RES\_PTR is known to be NULL at compile time. Memory barriers are needed within the locked region used by the finalization. These are inserted by the lock and unlock builtins themselves. We do not need explicit memory barriers after the setup store or before the teardown read because they are inserted by the fork and join expansions themselves. \paragraph{Setup} \begin{codeblock}% V = IFN\_RED\_SETUP (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% *ptx\_work\_red\_addr (LID, RID) = LOCAL; \end{codeblock} This will eventually expand to a write into the .shared memory buffer of the incoming reduction value. \paragraph{Init} \begin{codeblock}% V = IFN\_RED\_INIT (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = \{init\_val (OP)\} \end{codeblock} This initializes each worker's instance of the reduction variable. \paragraph{Fini} \begin{codeblock}% V = IFN\_RED\_FINI (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% T tmp = *ptx\_work\_red\_addr (LID, RID);\\* tmp = OP (tmp, LOCAL);\\* *ptx\_work\_red\_addr (LID, RID) = tmp; \end{codeblock} The thread-safety of this relies on the lock expansions below.
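For example, a worker-level `+' reduction on a private variable `a' (LID 0, RID 0) would end up with a locked finalization region of roughly this shape (a sketch; ptx\_lock and ptx\_work\_red\_addr are described below): \begin{codeblock}% ptx\_lock (WORKER, 0)\\* T tmp = *ptx\_work\_red\_addr (0, 0);\\* tmp = tmp + a;\\* *ptx\_work\_red\_addr (0, 0) = tmp;\\* ptx\_unlock (WORKER, 0) \end{codeblock}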
\paragraph{Teardown} \begin{codeblock}% V = IFN\_RED\_TEARDOWN (RES\_PTR, LOCAL, LEVEL, OP, LID, RID) \end{codeblock} expands to: \begin{codeblock}% V = *ptx\_work\_red\_addr (LID, RID); \end{codeblock} \paragraph{Worker Locking} The lock and unlock functions use a lock in .shared memory. We can allocate a single lock for the whole program. The expansions are: \begin{codeblock}% ptx\_lock (WORKER, LID) \end{codeblock} \begin{codeblock}% ptx\_unlock (WORKER, LID) \end{codeblock} respectively. \subsubsection[OpenACC Builtins]{OpenACC Builtins} The above used two existing OpenACC builtins, IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}SIZE and IFN\_\brk{}OACC\_\brk{}DIM\_\brk{}POS. Note that the DIM\_SIZE builtin is often expanded to a constant within the OACC\_XFORM pass itself. Thus we must ensure that this transformation is also run over the expansions emitted here. \subsubsection{PTX Builtins} The following PTX builtins were used above: \begin{codeblock}% T ptx\_shuffle\_down (T \&obj, unsigned SHIFT) \end{codeblock} This function does a shuffle down, returning the SHIFT'th next vector engine's value of `obj'. \begin{codeblock}% T *ptx\_work\_red\_addr (unsigned LID, unsigned RID) \end{codeblock} This function returns the address of the reduction accumulation variable for loop LID, reduction RID. When this builtin is expanded at RTL generation time, the PTX backend will need to maintain a per-function table of mappings from \{LID, RID\} pairs to addresses. Entries will be inserted as encountered, and per-LID counters of the current allocation point need to be maintained. The high water mark across all LIDs needs to be stashed in a global variable examined at the end of compilation, in the same manner as the worker\_spill buffer. \begin{codeblock}% void ptx\_lock (unsigned LEVEL, unsigned LID)\\* void ptx\_unlock (unsigned LEVEL, unsigned LID) \end{codeblock} These two functions implement a spin-lock and associated unlock at a particular level. The only two levels needed are gang and worker level -- vector reductions do not require locking. The only difference between the two is the location of the lock variable. These expand to an unspec, whose presence will be noted during ptx reorg. That will cause the applicable .global or .shared lock variable to be emitted at end of compilation. They also insert memory barriers before and after the protected region, to ensure external state changes are properly observed within the region, and changes made in the region are properly observed outside the region. The lock itself will look like: \begin{codeblock}% 1:\>\>atom.\$space.cas.u32 \$res,[\$lock\_var],0,1\\* \>\>setp.ne.u32 \$pred,\$res,0\\* \>\>@\$pred bra.uni 1b\\* \>\>membar.\$space \end{codeblock} The compare-and-swap returns the original value of \$lock\_var; thus, when the lock is obtained, \$res will be zero, and when the lock is already taken, it will be 1. The expander emits this as a simple loop. The unlock will be: \begin{codeblock}% \>\>membar.\$space\\* \>\>atom.\$space.cas.u32 \$res,[\$lock\_var],1,0 \end{codeblock} \begin{codeblock}% void ptx\_mem\_bar (unsigned LEVEL) \end{codeblock} This inserts a memory barrier at the appropriate level. \subsubsection{Templated Builtins} The high-level reduction machinery must handle boolean, integer (including character), float and complex types.
Rather than provide each ptx builtin for the full set of types, only a fundamental subset shall be provided: \begin{itemize} \item SImode (uint) \item DImode (ulong) \item SFmode (float) \item DFmode (double) \end{itemize} The OACC\_XFORM pass must deal with composing these to process other types. \begin{itemize} \item For smaller integer types, promote them to an SImode type. \item For complex types, decompose into two instances of the underlying type. \end{itemize} Note that in the latter case, the underlying ptx\_work\_red\_addr call, if used, must have distinguishing RID values for the different parts of the object. This could be achieved by left-shifting all incoming RID values and using bit zero as the distinguisher (all such compound types have only two underlying components). \subsubsection{Atomics Optimization} The existence of atomic operations can be dealt with later. They are an optimization. The PTX backend provides a target hook that can be queried by the OACC\_XFORM pass when it is expanding the RED\_FINI function. Before emitting the non-atomic code for worker or gang level, the hook should be queried to determine if an atomic operation exists for the operation and type required. If it does, the appropriate builtin should be emitted rather than the non-atomic sequence. There is a wrinkle though. If atomic operations are used, barriers need inserting before and after the atomic operations, as they are asynchronous with respect to normal reads and writes. It would be inefficient to insert barriers around each atomic operation if there are several reductions at the same level (although that is a rare case). We can use the LID identifier to record whether atomic operations are used in any particular loop finalization, and augment the ptx reorg pass to insert barriers at the lock and unlock instructions of such loops. We can also record whether all reductions for a particular loop use atomics and elide the locking in that case. More strictly, it would be sufficient to place barriers after the setup write and before the teardown read. This could be achieved by defining additional ptx builtins and emitting them during OACC\_XFORM, expanding to RTL and then processing during PTX reorg. All told, the whole use of atomics may be excessive and not worth the effort. \end{document}