public inbox for gcc-patches@gcc.gnu.org
* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
@ 2015-10-21 19:00 ` Nathan Sidwell
  2015-10-22  7:49   ` Richard Biener
  2015-10-22  8:05   ` Jakub Jelinek
  2015-10-21 19:11 ` [OpenACC 2/11] PTX backend changes Nathan Sidwell
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:00 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2347 bytes --]

This patch implements a new internal function that has a 'uniqueness'
property.  Jump-threading cannot clone it and tail-merging cannot combine
multiple instances.

The uniqueness is implemented by a new gimple predicate,
gimple_call_internal_unique_p.  Routines that check for identical or
cloneable calls are augmented to check this property.  These are:

* tree-ssa-threadedge, which works out whether jump threading is a win.
Jump threading through a block containing such a call is inhibited.

* gimple_call_same_target_p, used for tail merging and similar transforms.
Two calls of IFN_UNIQUE will never be the same target.

* tracer.c, which determines whether to clone a region.

Interestingly, jump threading avoids cloning volatile asms (which it admits
is conservatively safe), but the tracer does not.  I wonder if there's a
latent problem in the tracer?

The reason I needed a function with this property is to preserve the looping
structure of a function's CFG.  As mentioned in the intro, we mark up loops
(using this builtin), so the example I gave has the following inserts:

#pragma acc parallel ...
{
  // single mode here
#pragma acc loop ...
IFN_UNIQUE (FORKING  ...)
for (i = 0; i < N; i++) // loop 1
   ... // partitioned mode here
IFN_UNIQUE (JOINING ...)

if (expr) // single mode here
#pragma acc loop ...
   IFN_UNIQUE (FORKING ...)
   for (i = 0; i < N; i++) // loop 2
     ... // partitioned mode here
   IFN_UNIQUE (JOINING ...)
}

The properly nested loop property of the CFG is preserved through the
compilation.  This is important as (a) it allows later passes to reconstruct
this looping structure and (b) hardware constraints require that a
partitioned region end for all partitioned threads at a single instruction.

Until I added this unique property, the original bring-up of partitioned
execution would hit cases of split loops ending in multiple cloned JOINING
markers, and the like.

To distinguish different uses of the UNIQUE function, I use the first
argument, which is expected to be an INTEGER_CST.  I figured this was better
than using multiple new internal fns, all with the unique property, as the
latter would need (at least) a range check in gimple_call_internal_unique_p
rather than a simple equality check.
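
For illustration, a pass emitting such a marker might build it along these
lines.  This is only a sketch: the insertion point, the argument type
(unsigned_type_node) and the surrounding iterator are assumptions made for
the example, not part of the patch.

  /* Sketch: insert an undifferentiated UNIQUE marker before GSI.  The
     INTEGER_CST first argument selects the use; other uses would pass a
     different code.  */
  tree code = build_int_cst (unsigned_type_node, IFN_UNIQUE_UNSPEC);
  gcall *mark = gimple_build_call_internal (IFN_UNIQUE, 1, code);
  gsi_insert_before (&gsi, mark, GSI_SAME_STMT);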

Jakub, if you recall, I originally had IFN_FORK and IFN_JOIN as such
distinct internal fns.  This replaces that scheme.

ok?

nathan

[-- Attachment #2: 01-trunk-unique.patch --]
[-- Type: text/x-patch, Size: 4820 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>
	
	* internal-fn.c (expand_UNIQUE): New.
	* internal-fn.def (IFN_UNIQUE): New.
	(IFN_UNIQUE_UNSPEC): Define.
	* gimple.h (gimple_call_internal_unique_p): New.
	* gimple.c (gimple_call_same_target_p): Check internal fn
	uniqueness.
	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
	* tree-ssa-threadedge.c
	(record_temporary_equivalences_from_stmts): Likewise.

Index: gimple.c
===================================================================
--- gimple.c	(revision 229096)
+++ gimple.c	(working copy)
@@ -1346,7 +1346,8 @@ gimple_call_same_target_p (const gimple
 {
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2)
+	    && !gimple_call_internal_unique_p (as_a <const gcall *> (c1)));
   else
     return (gimple_call_fn (c1) == gimple_call_fn (c2)
 	    || (gimple_call_fndecl (c1)
Index: gimple.h
===================================================================
--- gimple.h	(revision 229096)
+++ gimple.h	(working copy)
@@ -2895,6 +2895,14 @@ gimple_call_internal_fn (const gimple *g
   return gimple_call_internal_fn (gc);
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+static inline bool
+gimple_call_internal_unique_p (const gcall *gs)
+{
+  return gimple_call_internal_fn (gs) == IFN_UNIQUE;
+}
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 229096)
+++ internal-fn.c	(working copy)
@@ -1958,6 +1958,30 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
+/* Expand the IFN_UNIQUE function according to its first argument.  */
+
+static void
+expand_UNIQUE (gcall *stmt)
+{
+  rtx pattern = NULL_RTX;
+
+  switch (TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)))
+    {
+    default:
+      gcc_unreachable ();
+      break;
+
+    case IFN_UNIQUE_UNSPEC:
+#ifdef HAVE_unique
+      pattern = gen_unique ();
+#endif
+      break;
+    }
+
+  if (pattern)
+    emit_insn (pattern);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 229096)
+++ internal-fn.def	(working copy)
@@ -65,3 +65,11 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
+
+/* An unduplicable, uncombinable function.  Generally used to preserve
+   a CFG property in the face of jump threading, tail merging or
+   other such optimizations.  The first argument distinguishes
+   between uses.  Other arguments are as needed for use.  The return
+   type depends on use too.  */
+DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)
+#define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
Index: tracer.c
===================================================================
--- tracer.c	(revision 229096)
+++ tracer.c	(working copy)
@@ -93,6 +93,7 @@ bb_seen_p (basic_block bb)
 static bool
 ignore_bb_p (const_basic_block bb)
 {
+  gimple_stmt_iterator gsi;
   gimple *g;
 
   if (bb->index < NUM_FIXED_BLOCKS)
@@ -106,6 +107,17 @@ ignore_bb_p (const_basic_block bb)
   if (g && gimple_code (g) == GIMPLE_TRANSACTION)
     return true;
 
+  /* Ignore blocks containing non-clonable function calls.  */
+  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
+       !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      g = gsi_stmt (gsi);
+
+      if (is_gimple_call (g) && gimple_call_internal_p (g)
+	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
+	return true;
+    }
+
   return false;
 }
 
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 229096)
+++ tree-ssa-threadedge.c	(working copy)
@@ -283,6 +283,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;


* [OpenACC 0/11] execution model
@ 2015-10-21 19:00 Nathan Sidwell
  2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
                   ` (10 more replies)
  0 siblings, 11 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:00 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

I'll be posting a patch series for trunk, which implements the core of the 
OpenACC execution model.  This is split into the following patches:

01-trunk-unique.patch
   Internal function with a 'uniqueness' property
02-trunk-nvptx-partition.patch
   NVPTX backend patch set for partitioned execution
03-trunk-hook.patch
   OpenACC hook
04-trunk-c.patch
   C FE changes
05-trunk-cxx.patch
   C++ FE changes
06-trunk-red-init.patch
   Placeholder to keep reductions functioning
07-trunk-loop-mark.patch
   Annotate OpenACC loops in device-agnostic manner
08-trunk-dev-lower.patch
   Device-specific lowering of loop markers
09-trunk-lower-gate.patch
   Run oacc_device_lower pass regardless of errors
10-trunk-libgomp.patch
   Libgomp change (remove dimension check)
11-trunk-tests.patch
   Initial set of execution tests
[let's try that again, after slapping my mail agent for using an old address]

With the exception of patch 6, these are all on the gomp4 branch.  This patch 
set does not change reduction handling, which will be dealt with in a subsequent 
set.

An offloaded region is spawned on a set of execution engines.  These are
organized as a cube, with specific axes controlled by the programmer.  The
engines may operate in a 'partitioned' mode, where each engine executes as a
separate thread, or they may operate in a 'single' mode, where one engine of
a particular set executes the program and the other engines are idled (in an
implementation-specific manner).
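
The cube's axes correspond to the GOMP_DIM_* constants in gomp-constants.h,
and the later patches describe a partitioning as a bit mask over them.
Purely for orientation, a mask selecting the worker and vector axes (the two
the PTX backend partitions on) is formed as below; GOMP_DIM_MASK is the
existing helper macro:

  /* Axes, outermost to innermost: GOMP_DIM_GANG, GOMP_DIM_WORKER,
     GOMP_DIM_VECTOR.  A partitioning over worker and vector:  */
  unsigned mask = (GOMP_DIM_MASK (GOMP_DIM_WORKER)
                   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));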

A driving example is the following:
#pragma acc parallel ...
{
  // single mode here
#pragma acc loop ...
for (i = 0; i < N; i++) // loop 1
   ... // partitioned mode here

if (expr) // single mode here
#pragma acc loop ...
   for (i = 0; i < N; i++) // loop 2
     ... // partitioned mode here
}

While it's clear all paths lead to loop 1, it's not statically determinable 
whether loop 2 is executed or not.

This implementation marks the head and tail of partitioned execution regions
with builtin functions indicating the axes of partitioning.  After
device-specific lowering, these will eventually make it to RTL expansion
time, where they get expanded to backend-specific RTL.  In the PTX
implementation 'single' mode is implemented by a 'neutering' mechanism, where
the non-active execution engines skip each basic block and 'follow along'
conditional branches to get to a subsequent block.  In this manner all
engines can reach a dynamically determinable partitioned region.

On entry to a partitioned region, we execute a 'fork' operation, cloning
live state from the single active engine before the region into the other
threads that become activated.

This patchset has been tested on x86_64-linux & ptx accelerator.

nathan


* Re: [OpenACC 2/11] PTX backend changes
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
  2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
@ 2015-10-21 19:11 ` Nathan Sidwell
  2015-10-22  8:16   ` Jakub Jelinek
  2015-10-22 14:05   ` Bernd Schmidt
  2015-10-21 19:16 ` [OpenACC 3/11] new target hook Nathan Sidwell
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:11 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1667 bytes --]

This patch is the PTX backend changes for partitioned execution.

We implement some new expanders:

* oacc_dim_size, oacc_dim_pos -- span and location within the compute cube
* oacc_fork, oacc_join -- expanders for IFN_UNIQUE (OACC_FORK) & IFN_UNIQUE
  (OACC_JOIN)

The fork & join markers are non-copyable instructions, and continue
preserving the nested loop structure of the CFG.  In mach_dep_reorg we scan
for these instructions, reconstructing the partitioned loop structure and
determining which BBs reside in which loops.  Once that is determined we can
apply the neutering algorithm, which in a single-partitioned region forces
all but engine-zero to skip to the end of the block.  For blocks that end in
a branch, it skips to just before the branch.  For blocks that end in a
conditional branch we insert code to propagate the branch condition from
engine-zero to the other engines, so they all go the same way at the branch.
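
In C-like pseudocode, and following the comment above nvptx_single in the
attached patch, a neutered single-mode block ends up shaped roughly as below
(the names are illustrative, not generated identifiers):

  if (tid != 0) goto skip;   /* non-zero engines bypass the body */
    ... body, executed by engine zero only ...
 skip:
    ... branch-condition propagation, if the block ends in one ...
  if (cond) goto target;     /* unified branch: every engine goes the same way */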

There are two axes of interest:
* vector, these can propagate via a machine 'shuffle' instruction
* worker, these can propagate via a location in local shared memory

At the beginning of a partitioned region, we have to propagate live register
state and stack frame from engine-zero to the other engines (just as would
happen on a regular 'fork' call).  Again, how this is done depends on the
axis of propagation (sketched below):
* vector, use shuffle instructions just after the fork
* worker, spill to buffer in shared memory just before the fork and then fill
  from that buffer just after the fork.
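
For the vector axis the broadcast amounts to one warp shuffle per live
register.  A minimal sketch, where shfl_idx stands in for the operation the
backend's shuffle pattern emits:

  /* Every vector lane overwrites its copy of a live register with
     engine (lane) zero's value.  */
  reg = shfl_idx (reg, 0 /* source lane */);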

For the worker axis, explicit sync instructions are needed before and after 
accessing the shared memory state.
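
A C-like sketch of the worker-axis sequence (bar_sync and __worker_bcast
stand for the bar.sync instruction and the shared broadcast buffer the patch
emits; worker_id is illustrative -- in the generated code the test is
implicit, because the spill sits in the part of the block only engine zero
executes):

  if (worker_id == 0)
    __worker_bcast = value;  /* spill: engine zero writes the buffer */
  bar_sync ();               /* make the write visible to all workers */
  value = __worker_bcast;    /* fill: every worker reads the value */
  bar_sync ();               /* don't let the buffer be clobbered before
                                everyone has read this instance of it */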

Bernd, any comments?

nathan

[-- Attachment #2: 02-trunk-nvptx-partition.patch --]
[-- Type: text/x-patch, Size: 46614 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* config/nvptx/nvptx.h (struct machine_function): Add
	axis_predicate.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_NTID, UNSPEC_TID): Delete.
	(UNSPEC_DIM_SIZE, UNSPEC_SHARED_DATA, UNSPEC_BIT_CONV,
	UNSPEC_SHUFFLE, UNSPEC_BR_UNIFIED): New.
	(UNSPECV_BARSYNC, UNSPECV_DIM_POS, UNSPECV_FORK, UNSPECV_FORKED,
	UNSPECV_JOINING, UNSPECV_JOIN): New.
	(BITS, BITD): New mode iterators.
	(br_true_uni, br_false_uni): New.
	(*oacc_ntid_insn,  oacc_ntid, *oacc_tid_insn, oacc_tid): Delete.
	(oacc_dim_size, oacc_dim_pos): New.
	(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New.
	(oacc_fork, oacc_join): New.
	(nvptx_shuffle<mode>, unpack<mode>si2, packsi<mode>2): New.
	(worker_load<mode>, worker_store<mode>): New.
	(nvptx_barsync): New.
	* config/nvptx/nvptx.c: Include gimple.h & dumpfile.h.
	(SHUFFLE_UP, SHUFFLE_DOWN, SHUFFLE_BFLY, SHUFFLE_IDX): Define.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker broadcast buffer.
	(nvptx_emit_forking, nvptx_emit_joining): New.
	(nvptx_init_axis_predicate): New.
	(nvptx_declare_function_name): Init axis predicates.
	(nvptx_expand_call): Add fork/join markers around routine call.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack, nvptx_gen_shuffle): New.
	(nvptx_gen_vcast): New.
	(struct wcast_data_t): New.
	(enum propagate_mask): New.
	(nvptx_gen_wcast): New.
	(nvptx_print_operand): Add 'S' case.
	(struct parallel): New.
	(parallel::parallel, parallel::~parallel): New.
	(bb_insn_map_t, insn_bb_t, insn_bb_vec_t): New typedefs.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_pars,
	nvptx_find_par, nvptx_discover_pars): New.
	(nvptx_propagate): New.
	(vprop_gen, nvptx_vpropagate): New.
	(wprop_gen, nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars, nvptx_neuter_pars): New.
	(ntptx_reorg): Split blocks, generate parallel structure, apply
	neutering.
	(nvptx_cannot_copy_insn_p): New.
	(nvptx_file_end): Emit worker broadcast decl.
	(TARGET_CANNOT_COPY_INSN_P): Override.

Index: gcc/config/nvptx/nvptx-protos.h
===================================================================
--- gcc/config/nvptx/nvptx-protos.h	(revision 229096)
+++ gcc/config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (unsigned);
+extern void nvptx_expand_oacc_join (unsigned);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 229096)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -51,14 +51,21 @@
 #include "langhooks.h"
 #include "dbxout.h"
 #include "cfgrtl.h"
+#include "gimple.h"
 #include "stor-layout.h"
 #include "builtins.h"
 #include "omp-low.h"
 #include "gomp-constants.h"
+#include "dumpfile.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
 
+#define SHUFFLE_UP 0
+#define SHUFFLE_DOWN 1
+#define SHUFFLE_BFLY 2
+#define SHUFFLE_IDX 3
+
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
 static std::stringstream func_decls;
@@ -81,6 +88,16 @@ struct tree_hasher : ggc_cache_ptr_hash<
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -108,6 +125,9 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
+  worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -194,6 +214,44 @@ nvptx_split_reg_p (machine_mode mode)
   return false;
 }
 
+/* Emit forking instructions for MASK.  */
+
+static void
+nvptx_emit_forking (unsigned mask, bool is_call)
+{
+  mask &= (GOMP_DIM_MASK (GOMP_DIM_WORKER)
+	   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  if (mask)
+    {
+      rtx op = GEN_INT (mask | (is_call << GOMP_DIM_MAX));
+      
+      /* Emit fork at all levels, this helps form SESE regions..  */
+      if (!is_call)
+	emit_insn (gen_nvptx_fork (op));
+      emit_insn (gen_nvptx_forked (op));
+    }
+}
+
+/* Emit joining instructions for MASK.  */
+
+static void
+nvptx_emit_joining (unsigned mask, bool is_call)
+{
+  mask &= (GOMP_DIM_MASK (GOMP_DIM_WORKER)
+	   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  if (mask)
+    {
+      rtx op = GEN_INT (mask | (is_call << GOMP_DIM_MAX));
+
+      /* Emit joining for all non-call pars to ensure there's a single
+	 predecessor for the block the join insn ends up in.  This is
+	 needed for skipping entire loops.  */
+      if (!is_call)
+	emit_insn (gen_nvptx_joining (op));
+      emit_insn (gen_nvptx_join (op));
+    }
+}
+
 #define PASS_IN_REG_P(MODE, TYPE)				\
   ((GET_MODE_CLASS (MODE) == MODE_INT				\
     || GET_MODE_CLASS (MODE) == MODE_FLOAT			\
@@ -500,6 +558,19 @@ nvptx_record_needed_fndecl (tree decl)
     *slot = decl;
 }
 
+/* Emit code to initialize the REGNO predicate register to indicate
+   whether we are not lane zero on the NAME axis.  */
+
+static void
+nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
+  fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
+  fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  fprintf (file, "\t}\n");
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -623,6 +694,14 @@ nvptx_declare_function_name (FILE *file,
   if (stdarg_p (fntype))
     fprintf (file, "\tld.param.u%d %%argp, [%%in_argp];\n",
 	     GET_MODE_BITSIZE (Pmode));
+
+  /* Emit axis predicates. */
+  if (cfun->machine->axis_predicate[0])
+    nvptx_init_axis_predicate (file,
+			       REGNO (cfun->machine->axis_predicate[0]), "y");
+  if (cfun->machine->axis_predicate[1])
+    nvptx_init_axis_predicate (file,
+			       REGNO (cfun->machine->axis_predicate[1]), "x");
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -779,6 +858,7 @@ nvptx_expand_call (rtx retval, rtx addre
   bool external_decl = false;
   rtx varargs = NULL_RTX;
   tree decl_type = NULL_TREE;
+  unsigned parallel = 0;
 
   for (t = cfun->machine->call_args; t; t = XEXP (t, 1))
     nargs++;
@@ -799,6 +879,22 @@ nvptx_expand_call (rtx retval, rtx addre
 	    cfun->machine->has_call_with_sc = true;
 	  if (DECL_EXTERNAL (decl))
 	    external_decl = true;
+	  tree attr = get_oacc_fn_attrib (decl);
+	  if (attr)
+	    {
+	      tree dims = TREE_VALUE (attr);
+
+	      parallel = GOMP_DIM_MASK (GOMP_DIM_MAX) - 1;
+	      for (int ix = 0; ix != GOMP_DIM_MAX; ix++)
+		{
+		  if (TREE_PURPOSE (dims)
+		      && !integer_zerop (TREE_PURPOSE (dims)))
+		    break;
+		  /* Not on this axis.  */
+		  parallel ^= GOMP_DIM_MASK (ix);
+		  dims = TREE_CHAIN (dims);
+		}
+	    }
 	}
     }
 
@@ -860,7 +956,11 @@ nvptx_expand_call (rtx retval, rtx addre
 	  write_func_decl_from_insn (func_decls, retval, pat, callee);
 	}
     }
+
+  nvptx_emit_forking (parallel, true);
   emit_call_insn (pat);
+  nvptx_emit_joining (parallel, true);
+
   if (tmp_retval != retval)
     emit_move_insn (retval, tmp_retval);
 }
@@ -1069,6 +1169,214 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+/* Expand the oacc fork & join primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (unsigned mode)
+{
+  nvptx_emit_forking (GOMP_DIM_MASK (mode), false);
+}
+
+void
+nvptx_expand_oacc_join (unsigned mode)
+{
+  nvptx_emit_joining (GOMP_DIM_MASK (mode), false);
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, unsigned kind)
+{
+  rtx res;
+
+  switch (GET_MODE (dst))
+    {
+    case SImode:
+      res = gen_nvptx_shufflesi (dst, src, idx, GEN_INT (kind));
+      break;
+    case SFmode:
+      res = gen_nvptx_shufflesf (dst, src, idx, GEN_INT (kind));
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, src));
+	emit_insn (nvptx_gen_shuffle (tmp0, tmp0, idx, kind));
+	emit_insn (nvptx_gen_shuffle (tmp1, tmp1, idx, kind));
+	emit_insn (nvptx_gen_pack (dst, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, src, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_shuffle (tmp, tmp, idx, kind));
+	emit_insn (gen_rtx_SET (dst, gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    default:
+      gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  return nvptx_gen_shuffle (reg, reg, const0_rtx, SHUFFLE_IDX);
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (reg, gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm == PM_read)
+	  res = gen_rtx_SET (addr, reg);
+	else if (pm == PM_write)
+	  res = gen_rtx_SET (reg, addr);
+	else
+	  gcc_unreachable ();
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (reg)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1660,6 +1968,7 @@ nvptx_print_operand_address (FILE *file,
    c -- print an opcode suffix for a comparison operator, including a type code
    d -- print a CONST_INT as a vector dimension (x, y, or z)
    f -- print a full reg even for something that must always be split
+   S -- print a shuffle kind specified by CONST_INT
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.  */
@@ -1723,6 +2032,15 @@ nvptx_print_operand (FILE *file, rtx x,
       fprintf (file, "%s", nvptx_ptx_type_from_mode (op_mode, false));
       break;
 
+    case 'S':
+      {
+	unsigned kind = UINTVAL (x);
+	static const char *const kinds[] = 
+	  {"up", "down", "bfly", "idx"};
+	fprintf (file, ".%s", kinds[kind]);
+      }
+      break;
+
     case 'T':
       fprintf (file, "%d", GET_MODE_BITSIZE (GET_MODE (x)));
       break;
@@ -1973,10 +2291,744 @@ nvptx_reorg_subreg (void)
     }
 }
 
+/* Loop structure of the function.The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mask of the parallel.  */
+  unsigned mask;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after of
+     the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
+
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into it's parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mask_)
+  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks such that each forked and join unspecs are at
+   the start of their basic blocks.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Return the
+   partitioning mode of the function as a whole.  Populate MAP with
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding partitions.  */
+
+static void
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      // Clear visited flag, for use by parallel locator  */
+      block->flags &= ~BB_VISITED;
+      
+      FOR_BB_INSNS (block, insn)
+	{
+	  if (!INSN_P (insn))
+	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
+
+	  if (seen_insn)
+	    /* We've found an instruction that  must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split block before insn. The insn is in the new block  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mask %d head=%d, tail=%d\n",
+	   depth, par->mask,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+/* If BLOCK contains a fork/join marker, process it to create or
+   terminate a loop structure.  Add this block to the current loop,
+   and then walk successor blocks.   */
+
+static parallel *
+nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
+{
+  if (block->flags & BB_VISITED)
+    return par;
+  block->flags |= BB_VISITED;
+
+  if (rtx_insn **endp = map->get (block))
+    {
+      rtx_insn *end = *endp;
+
+      /* This is a block head or tail, or return instruction.  */
+      switch (recog_memoized (end))
+	{
+	case CODE_FOR_return:
+	  /* Return instructions are in their own block, and we
+	     don't need to do anything more.  */
+	  return par;
+
+	case CODE_FOR_nvptx_forked:
+	  /* Loop head, create a new inner loop and add it into
+	     our parent's child list.  */
+	  {
+	    unsigned mask = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+	    gcc_assert (mask);
+	    par = new parallel (par, mask);
+	    par->forked_block = block;
+	    par->forked_insn = end;
+	    if (!(mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+		&& (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+	      par->fork_insn
+		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	  }
+	  break;
+
+	case CODE_FOR_nvptx_join:
+	  /* A loop tail.  Finish the current loop and return to
+	     parent.  */
+	  {
+	    unsigned mask = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+	    gcc_assert (par->mask == mask);
+	    par->join_block = block;
+	    par->join_insn = end;
+	    if (!(mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+		&& (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+	      par->joining_insn
+		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+	    par = par->parent;
+	  }
+	  break;
+
+	default:
+	  gcc_unreachable ();
+	}
+    }
+
+  if (par)
+    /* Add this block onto the current loop's list of blocks.  */
+    par->blocks.safe_push (block);
+  else
+    /* This must be the entry block.  Create a NULL parallel.  */
+    par = new parallel (0, 0);
+
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    nvptx_find_par (map, par, e->dest);
+
+  return par;
+}
+
+/* DFS walk the CFG looking for fork & join markers.  Construct
+   loop structures as we go.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  basic_block block;
+
+  /* Mark exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+
+  /* And entry block as not.  */
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+
+  parallel *par = nvptx_find_par (map, 0, block);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN. Propagation is inserted just after INSN. RW indicates whether
+   we are reading and/or writing state.  This
+   separation is needed for worker-level proppagation where we
+   essentially do a spill & fill.  FN is the underlying worker
+   function to generate the propagation instructions for single
+   register.  DATA is user data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (idx, GEN_INT (fs)));
+	  /* Allow worker function to initialize anything needed */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_uni (pred, label));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
+}
+
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
+
+static void
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
+{
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.    */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill live state that is live at start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill). Sequence is inserted just after
+   INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  We use different
+   markers for before and after synchronizations.  */
+
+static rtx
+nvptx_wsync (bool after)
+{
+  return gen_nvptx_barsync (GEN_INT (after));
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use differnt FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find first insn of from block */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find last insn of to block */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default:break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	  return;
+	}
+
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = GOMP_DIM_WORKER; mode <= GOMP_DIM_VECTOR; mode++)
+    if (GOMP_DIM_MASK (mode) & skip_mask)
+      {
+	rtx_code_label *label = gen_label_rtx ();
+	rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
+
+	if (!pred)
+	  {
+	    pred = gen_reg_rtx (BImode);
+	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
+	  }
+	
+	rtx br;
+	if (mode == GOMP_DIM_VECTOR)
+	  br = gen_br_true (pred, label);
+	else
+	  br = gen_br_true_uni (pred, label);
+	emit_insn_before (br, head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only. */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  /* Barrier so other workers can see the write.  */
+	  emit_insn_before (nvptx_wsync (false), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  /* This barrier is needed to avoid worker zero clobbering
+	     the broadcast buffer before all the other workers have
+	     had a chance to read this instance of it.  */
+	  emit_insn_before (nvptx_wsync (true), tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
+
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = par->mask;
+
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+
+  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+    { /* No propagation needed for a call.  */ }
+  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      nvptx_wpropagate (false, par->forked_block, par->forked_insn);
+      nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+      /* Insert begin and end synchronizations.  */
+      emit_insn_after (nvptx_wsync (false), par->forked_insn);
+      emit_insn_before (nvptx_wsync (true), par->joining_insn);
+    }
+  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    nvptx_vpropagate (par->forked_block, par->forked_insn);
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES are the partitioning of the execution and OUTER is
+   the partitioning of the parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = par->mask
+    & (GOMP_DIM_MASK (GOMP_DIM_WORKER) | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  unsigned  skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = GOMP_DIM_WORKER; mode <= GOMP_DIM_VECTOR; mode++)
+    {
+      if ((outer | me) & GOMP_DIM_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & GOMP_DIM_MASK (mode)))
+	{ /* Mode  is not used: nothing to do.  */ }
+      else if (par->inner_mask & GOMP_DIM_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not a partitioned
+	   at all: neuter individual blocks.  */
+	neuter_mask |= GOMP_DIM_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & GOMP_DIM_MASK (mode))
+	/* Parent isn't a parallel or contains this paralleling: skip
+	   parallel at this level.  */
+	skip_mask |= GOMP_DIM_MASK (mode);
+      else
+	{ /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      int ix;
+      int len = par->blocks.length ();
+
+      for (ix = 0; ix != len; ix++)
+	{
+	  basic_block block = par->blocks[ix];
+
+	  nvptx_single (neuter_mask, block, block);
+	}
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
 /* PTX-specific reorganization
+   - Scan and release reduction buffers
+   - Split blocks at fork and join instructions
    - Compute live registers
    - Mark now-unused registers, so function begin doesn't declare
    unused registers.
+   - Insert state propagation when entering partitioned mode
+   - Insert neutering instructions when in single mode
    - Replace subregs with suitable sequences.
 */
 
@@ -1989,19 +3041,60 @@ nvptx_reorg (void)
 
   thread_prologue_and_epilogue_insns ();
 
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+
+  nvptx_split_blocks (&bb_insn_map);
+
   /* Compute live regs */
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  df_live_set_all_dirty ();
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
+  /* Determine launch dimensions of the function.  If it is not an
+     offloaded function  (i.e. this is a regular compiler), the
+     function has no neutering.  */
+  tree attr = get_oacc_fn_attrib (current_function_decl);
+  if (attr)
+    {
+      /* If we determined this mask before RTL expansion, we could
+	 elide emission of some levels of forks and joins.  */
+      unsigned mask = 0;
+      tree dims = TREE_VALUE (attr);
+      unsigned ix;
+
+      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	{
+	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
+	  tree allowed = TREE_PURPOSE (dims);
+
+	  if (size != 1 && !(allowed && integer_zerop (allowed)))
+	    mask |= GOMP_DIM_MASK (ix);
+	}
+      /* If there is worker neutering, there must be vector
+	 neutering.  Otherwise the hardware will fail.  */
+      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+
+      /* Discover & process partitioned regions.  */
+      parallel *pars = nvptx_discover_pars (&bb_insn_map);
+      nvptx_process_pars (pars);
+      nvptx_neuter_pars (pars, mask, 0);
+      delete pars;
+    }
+
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
@@ -2052,6 +3145,26 @@ nvptx_vector_alignment (const_tree type)
 
   return MIN (align, BIGGEST_ALIGNMENT);
 }
+
+/* Indicate that INSN cannot be duplicated.   */
+
+static bool
+nvptx_cannot_copy_insn_p (rtx_insn *insn)
+{
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_shufflesi:
+    case CODE_FOR_nvptx_shufflesf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
+}
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
 
@@ -2129,6 +3242,19 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 /* Validate compute dimensions of an OpenACC offload or routine, fill
@@ -2233,6 +3359,9 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
 #undef TARGET_VECTOR_ALIGNMENT
 #define TARGET_VECTOR_ALIGNMENT nvptx_vector_alignment
 
+#undef TARGET_CANNOT_COPY_INSN_P
+#define TARGET_CANNOT_COPY_INSN_P nvptx_cannot_copy_insn_p
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
Index: gcc/config/nvptx/nvptx.h
===================================================================
--- gcc/config/nvptx/nvptx.h	(revision 229096)
+++ gcc/config/nvptx/nvptx.h	(working copy)
@@ -230,6 +230,7 @@ struct GTY(()) machine_function
   HOST_WIDE_INT outgoing_stdarg_size;
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
+  rtx axis_predicate[2];
 };
 #endif
 \f
Index: gcc/config/nvptx/nvptx.md
===================================================================
--- gcc/config/nvptx/nvptx.md	(revision 229096)
+++ gcc/config/nvptx/nvptx.md	(working copy)
@@ -49,14 +49,27 @@
 
    UNSPEC_ALLOCA
 
-   UNSPEC_NTID
-   UNSPEC_TID
+   UNSPEC_DIM_SIZE
+
+   UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_SHUFFLE
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
+   UNSPECV_BARSYNC
+   UNSPECV_DIM_POS
+
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -246,6 +259,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -817,6 +832,23 @@
   ""
   "%J0\\tbra\\t%l1;")
 
+;; unified conditional branch
+(define_insn "br_true_uni"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_false_uni"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
+
 (define_expand "cbranch<mode>4"
   [(set (pc)
 	(if_then_else (match_operator 0 "nvptx_comparison_operator"
@@ -1308,36 +1340,126 @@
   DONE;
 })
 
-(define_insn "*oacc_ntid_insn"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "n")] UNSPEC_NTID))]
+(define_insn "oacc_dim_size"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")]
+		   UNSPEC_DIM_SIZE))]
   ""
-  "%.\\tmov.u32 %0, %%ntid%d1;")
+{
+  static const char *const asms[] =
+{ /* Must match oacc_loop_levels ordering.  */
+  "%.\\tmov.u32\\t%0, %%nctaid.x;",	/* gang */
+  "%.\\tmov.u32\\t%0, %%ntid.y;",	/* worker */
+  "%.\\tmov.u32\\t%0, %%ntid.x;",	/* vector */
+};
+  return asms[INTVAL (operands[1])];
+})
 
-(define_expand "oacc_ntid"
+(define_insn "oacc_dim_pos"
   [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")] UNSPEC_NTID))]
+	(unspec_volatile:SI [(match_operand:SI 1 "const_int_operand" "")]
+			    UNSPECV_DIM_POS))]
   ""
 {
-  if (INTVAL (operands[1]) < 0 || INTVAL (operands[1]) > 2)
-    FAIL;
+  static const char *const asms[] =
+{ /* Must match oacc_loop_levels ordering.  */
+  "%.\\tmov.u32\\t%0, %%ctaid.x;",	/* gang */
+  "%.\\tmov.u32\\t%0, %%tid.y;",	/* worker */
+  "%.\\tmov.u32\\t%0, %%tid.x;",	/* vector */
+};
+  return asms[INTVAL (operands[1])];
 })
 
-(define_insn "*oacc_tid_insn"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "n")] UNSPEC_TID))]
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
   ""
-  "%.\\tmov.u32 %0, %%tid%d1;")
+  "// fork %0;"
+)
 
-(define_expand "oacc_tid"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")] UNSPEC_TID))]
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
   ""
 {
-  if (INTVAL (operands[1]) < 0 || INTVAL (operands[1]) > 2)
-    FAIL;
+  nvptx_expand_oacc_fork (INTVAL (operands[0]));
+  DONE;
 })
 
+(define_expand "oacc_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+{
+  nvptx_expand_oacc_join (INTVAL (operands[0]));
+  DONE;
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_shuffle<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "=R")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "R")
+		 (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")
+		 (match_operand:SI 3 "const_int_operand" "n")]
+		  UNSPEC_SHUFFLE))]
+  ""
+  "%.\\tshfl%S3.b32\\t%0, %1, %2, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "R")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "=R")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64\\t{%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "=R")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "R")
+		      (match_operand:SI 2 "nvptx_register_operand" "R")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64\\t%0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0, %1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1, %0;")
+
 ;; Atomic insns.
 
 (define_expand "atomic_compare_and_swap<mode>"
@@ -1423,3 +1545,9 @@
 	(match_dup 1))]
   "0"
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
+  ""
+  "\\tbar.sync\\t%0;")


* Re: [OpenACC 3/11] new target hook
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
  2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
  2015-10-21 19:11 ` [OpenACC 2/11] PTX backend changes Nathan Sidwell
@ 2015-10-21 19:16 ` Nathan Sidwell
  2015-10-22  8:23   ` Jakub Jelinek
  2015-10-21 19:19 ` [OpenACC 5/11] C++ FE changes Nathan Sidwell
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:16 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 476 bytes --]

This patch implements a new OpenACC target hook.  It is used during 
device-specific lowering and allows a backend to indicate whether it needs to 
know about a fork/join for a particular axis.

For instance, in PTX we don't care about gang-level fork and join.  We also 
don't care about any axis whose dimension size is 1.

The default implementation of the hook only cares whether the oacc_fork and 
oacc_join RTL expanders exist (and they don't on the host compiler).
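
To make the hook's contract concrete: returning true asks for the marker call 
to be deleted, returning false keeps it for the RTL expanders.  Below is a 
minimal sketch of a hypothetical override (it simply mirrors the nvptx 
implementation in the attached patch; the axis is the second argument of the 
internal call):

static bool
example_goacc_fork_join (gcall *call, const int dims[],
                         bool ARG_UNUSED (is_fork))
{
  unsigned axis = TREE_INT_CST_LOW (gimple_call_arg (call, 1));

  /* Delete gang-level markers and markers on any axis of size 1.  */
  if (axis < GOMP_DIM_WORKER || dims[axis] == 1)
    return true;

  /* Keep everything else for RTL expansion.  */
  return false;
}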

nathan

[-- Attachment #2: 03-trunk-hook.patch --]
[-- Type: text/x-patch, Size: 5540 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* target.def (fork_join): New GOACC hook.
	* targhooks.h (default_goacc_fork_join): Declare.
	* omp-low.c (default_goacc_fork_join): New.
	* doc/tm.texi.in (TARGET_GOACC_FORK_JOIN): Add.
	* doc/tm.texi: Regenerate.
	* config/nvptx/nvptx.c (nvptx_xform_fork_join): New.
	(TARGET_GOACC_FORK_JOIN): Override.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 229096)
+++ config/nvptx/nvptx.c	(working copy)
@@ -2146,7 +2146,26 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
 
   return changed;
 }
-\f
+
+/* Determine whether fork & joins are needed.  */
+
+static bool
+nvptx_xform_fork_join (gcall *call, const int dims[],
+		       bool ARG_UNUSED (is_fork))
+{
+  tree arg = gimple_call_arg (call, 1);
+  unsigned axis = TREE_INT_CST_LOW (arg);
+
+  /* We only care about worker and vector partitioning.  */
+  if (axis < GOMP_DIM_WORKER)
+    return true;
+
+  /* If the size is 1, there's no partitioning.  */
+  if (dims[axis] == 1)
+    return true;
+
+  return false;
+}
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -2236,6 +2255,9 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
+#undef TARGET_GOACC_FORK_JOIN
+#define TARGET_GOACC_FORK_JOIN nvptx_xform_fork_join
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 229096)
+++ doc/tm.texi	(working copy)
@@ -5748,7 +5748,7 @@ usable.  In that case, the smaller the n
 to use it.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int @var{dims[]}, int @var{fn_level})
+@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
 and non-constant values as 0. The @var{fn_level} is negative for the
@@ -5760,6 +5760,13 @@ true, if changes have been made.  You mu
 provide dimensions larger than 1.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
+This hook should convert IFN_GOACC_FORK and IFN_GOACC_JOIN function
+calls to target-specific gimple.  It is executed during the oacc_xform
+pass.  It should return true, if the functions should be deleted.  The
+default hook returns true, if there are no RTL expanders for them.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in	(revision 229096)
+++ doc/tm.texi.in	(working copy)
@@ -4249,6 +4249,8 @@ address;  but often a machine-dependent
 
 @hook TARGET_GOACC_VALIDATE_DIMS
 
+@hook TARGET_GOACC_FORK_JOIN
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 229096)
+++ omp-low.c	(working copy)
@@ -17532,6 +17532,29 @@ oacc_validate_dims (tree fn, tree attrs,
   return fn_level;
 }
 
+/* Default fork/join early expander.  Delete the function calls if
+   there is no RTL expander.  */
+
+bool
+default_goacc_fork_join (gcall *ARG_UNUSED (call),
+			 const int *ARG_UNUSED (dims), bool is_fork)
+{
+  if (is_fork)
+    {
+#ifndef HAVE_oacc_fork
+      return true;
+#endif
+    }
+  else
+    {
+#ifndef HAVE_oacc_join
+      return true;
+#endif
+    }
+
+  return false;
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
Index: target.def
===================================================================
--- target.def	(revision 229096)
+++ target.def	(working copy)
@@ -1655,9 +1655,19 @@ should fill in anything that needs to de
 non-defaults.  Diagnostics should be issued as appropriate.  Return\n\
 true, if changes have been made.  You must override this hook to\n\
 provide dimensions larger than 1.",
-bool, (tree decl, int dims[], int fn_level),
+bool, (tree decl, int *dims, int fn_level),
 default_goacc_validate_dims)
 
+DEFHOOK
+(fork_join,
+"This hook should convert IFN_UNIQUE calls for IFN_UNIQUE_GOACC_FORK\n\
+and IFN_UNIQUE_GOACC_JOIN   to target-specific gimple.  It is executed\n\
+during the oacc_device_lower pass.  It should return true, if the\n\
+functions should be deleted.  The default hook returns true, if there\n\
+are no RTL expanders for these actions.", 
+bool, (gcall *call, const int *dims, bool is_fork),
+default_goacc_fork_join)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
Index: targhooks.h
===================================================================
--- targhooks.h	(revision 229096)
+++ targhooks.h	(working copy)
@@ -109,6 +109,7 @@ extern void default_destroy_cost_data (v
 
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int);
+extern bool default_goacc_fork_join (gcall *, const int [], bool);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */


* Re: [OpenACC 5/11] C++ FE changes
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (2 preceding siblings ...)
  2015-10-21 19:16 ` [OpenACC 3/11] new target hook Nathan Sidwell
@ 2015-10-21 19:19 ` Nathan Sidwell
  2015-10-22  8:58   ` Jakub Jelinek
  2015-10-21 19:19 ` [OpenACC 4/11] C " Nathan Sidwell
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:19 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 295 bytes --]

This patch contains the C++ changes matching the C ones of patch 4.  In 
finish_omp_clauses, the gang, worker and vector clauses are handled the same as 
OpenMP's 'num_threads' clause.  One change to num_threads itself is augmenting 
its diagnostic to add %<...%> markers around the clause name.
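
For illustration, a hypothetical C++ snippet (not from the testsuite) showing 
the shared handling; both diagnostics come out of the common num_threads-style 
code path added here:

extern double d ();

void f ()
{
#pragma acc parallel num_gangs (d ())   // error: 'num_gangs' expression must be integral
  ;
#pragma acc parallel vector_length (0)  // warning: 'vector_length' value must be positive
  ;
}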

nathan


[-- Attachment #2: 05-trunk-cxx.patch --]
[-- Type: text/x-patch, Size: 14292 bytes --]

2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>

	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker.

2015-10-20  Nathan Sidwell <nathan@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.

Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 228969)
+++ gcc/cp/parser.c	(working copy)
@@ -29058,7 +29058,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29116,7 +29118,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29206,6 +29210,8 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29232,7 +29238,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29240,6 +29248,8 @@ cp_parser_omp_clause_name (cp_parser *pa
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29576,6 +29586,160 @@ cp_parser_oacc_data_clause_deviceptr (cp
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser *ARG_UNUSED (parser),
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+   gang [( gang_expr_list )]
+   worker [( expression )]
+   vector [( expression )] */
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, pragma_omp_clause c_kind,
+			     const char *str, tree list)
+{
+  omp_clause_code kind;
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+
+  switch (c_kind)
+    {
+    default:
+      gcc_unreachable ();
+    case PRAGMA_OACC_CLAUSE_GANG:
+      kind = OMP_CLAUSE_GANG;
+      break;
+    case PRAGMA_OACC_CLAUSE_VECTOR:
+      kind = OMP_CLAUSE_VECTOR;
+      id = "length";
+      break;
+    case PRAGMA_OACC_CLAUSE_WORKER:
+      kind = OMP_CLAUSE_WORKER;
+      break;
+    }
+
+  tree op0 = NULL_TREE, op1 = NULL_TREE;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  if (cp_lexer_next_token_is (lexer, CPP_NAME)
+	      || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
+	    {
+	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
+	      const char *p = IDENTIFIER_POINTER (name_kind);
+	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)
+		{
+		  cp_lexer_consume_token (lexer);
+		  if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		    {
+		      cp_parser_skip_to_closing_parenthesis (parser, false,
+							     false, true);
+		      return list;
+		    }
+		  op_to_parse = &op1;
+		  if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		    {
+		      if (*op_to_parse != NULL_TREE)
+			{
+			  cp_parser_error (parser,
+					   "duplicate %<num%> argument");
+			  cp_parser_skip_to_closing_parenthesis (parser,
+								 false, false,
+								 true);
+			  return list;
+			}
+		      cp_lexer_consume_token (lexer);
+		      *op_to_parse = integer_minus_one_node;
+		      if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+			cp_lexer_consume_token (lexer);
+		      continue;
+		    }
+		}
+	      else if (strcmp (id, p) == 0)
+		{
+		  op_to_parse = &op0;
+		  cp_lexer_consume_token (lexer);
+		  if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		    {
+		      cp_parser_skip_to_closing_parenthesis (parser, false,
+							     false, true);
+		      return list;
+		    }
+		}
+	      else
+		{
+		  if (kind == OMP_CLAUSE_GANG)
+		    cp_parser_error (parser,
+				     "expected %<num%> or %<static%>");
+		  else if (kind == OMP_CLAUSE_VECTOR)
+		    cp_parser_error (parser, "expected %<length%>");
+		  else
+		    cp_parser_error (parser, "expected %<num%>");
+		  cp_parser_skip_to_closing_parenthesis (parser, false, false,
+							 true);
+		  return list;
+		}
+	    }
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "duplicate operand to clause");
+	      cp_parser_skip_to_closing_parenthesis (parser, false, false,
+						     true);
+	      return list;
+	    }
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    {
+	      cp_parser_skip_to_closing_parenthesis (parser, false, false,
+						     true);
+	      return list;
+	    }
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+	  op_to_parse = &op0;
+
+	  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    cp_lexer_consume_token (lexer);
+	}
+      while (!cp_lexer_next_token_is (lexer, CPP_CLOSE_PAREN));
+      cp_lexer_consume_token (lexer);
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  tree c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31300,6 +31464,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31332,6 +31501,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, c_kind, c_name,
+						 clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31376,6 +31550,16 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, c_kind, c_name,
+						 clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31384,6 +31568,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, c_kind, c_name,
+						clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34333,6 +34522,11 @@ cp_parser_oacc_kernels (cp_parser *parse
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION))
 
 static tree
Index: gcc/cp/semantics.c
===================================================================
--- gcc/cp/semantics.c	(revision 228969)
+++ gcc/cp/semantics.c	(working copy)
@@ -5904,6 +5904,37 @@ finish_omp_clauses (tree clauses, bool a
 	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
 	  goto handle_field_decl;
 
+	case OMP_CLAUSE_GANG:
+	case OMP_CLAUSE_VECTOR:
+	case OMP_CLAUSE_WORKER:
+	  /* Operand 0 is the num: or length: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 0);
+	  if (t == NULL_TREE)
+	    break;
+
+	  t = maybe_convert_cond (t);
+	  if (t == error_mark_node)
+	    remove = true;
+	  else if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 0) = t;
+
+	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
+	    break;
+
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t == NULL_TREE)
+	    break;
+
+	  t = maybe_convert_cond (t);
+	  if (t == error_mark_node)
+	    remove = true;
+	  else if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 1) = t;
+	  break;
+
 	case OMP_CLAUSE_LASTPRIVATE:
 	  t = omp_clause_decl_field (OMP_CLAUSE_DECL (c));
 	  if (t)
@@ -5959,32 +5990,58 @@ finish_omp_clauses (tree clauses, bool a
 	  break;
 
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("num_threads expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
-	    }
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  {
+	    const char *name = NULL;
+	      
+	    switch (OMP_CLAUSE_CODE (c))
+	      {
+	      case OMP_CLAUSE_NUM_THREADS:
+		name = "num_threads";
+		break;
+	      case OMP_CLAUSE_NUM_GANGS:
+		name = "num_gangs";
+		break;
+	      case OMP_CLAUSE_NUM_WORKERS:
+		name = "num_workers";
+		break;
+	      case OMP_CLAUSE_VECTOR_LENGTH:
+		name = "vector_length";
+		break;
+	      default:
+		gcc_unreachable ();
+	      }
+
+	    t = OMP_CLAUSE_OPERAND (c, 0);
+	    if (t == error_mark_node)
+	      remove = true;
+	    else if (!type_dependent_expression_p (t)
+		     && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
+	      {
+		error_at (OMP_CLAUSE_LOCATION (c),
+			  "%<%s%> expression must be integral", name);
+		remove = true;
+	      }
+	    else
+	      {
+		t = mark_rvalue_use (t);
+		if (!processing_template_decl)
+		  {
+		    t = maybe_constant_value (t);
+		    if (TREE_CODE (t) == INTEGER_CST
+			&& tree_int_cst_sgn (t) != 1)
+		      {
+			warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				    "%<%s%> value must be positive", name);
+			t = integer_one_node;
+		      }
+		    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+		  }
+		OMP_CLAUSE_OPERAND (c, 0) = t;
+	      }
+	  }
 	  break;
 
 	case OMP_CLAUSE_SCHEDULE:
@@ -6103,16 +6160,6 @@ finish_omp_clauses (tree clauses, bool a
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6687,6 +6734,8 @@ finish_omp_clauses (tree clauses, bool a
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:
Index: gcc/testsuite/g++.dg/gomp/pr33372-1.C
===================================================================
--- gcc/testsuite/g++.dg/gomp/pr33372-1.C	(revision 229101)
+++ gcc/testsuite/g++.dg/gomp/pr33372-1.C	(working copy)
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   extern T n ();
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }


* Re: [OpenACC 4/11] C FE changes
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (3 preceding siblings ...)
  2015-10-21 19:19 ` [OpenACC 5/11] C++ FE changes Nathan Sidwell
@ 2015-10-21 19:19 ` Nathan Sidwell
  2015-10-22  8:25   ` Jakub Jelinek
  2015-10-21 19:32 ` [OpenACC 6/11] Reduction initialization Nathan Sidwell
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:19 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

This patch implements changes to the C parser to deal with the 'gang', 
'worker', 'vector', 'seq' and 'auto' clauses on an OpenACC loop directive.

The first three can take a numeric argument, which is only used within a kernels 
offload region.  In addition, the gang clause can take a 'static' argument, to 
force static loop iteration assignment.
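
A hypothetical example (not from the testsuite) of the accepted syntax; the 
numeric arguments appear only inside the kernels region, and 'seq' keeps a loop 
sequential:

void g (float *x, float *y, int n)
{
#pragma acc parallel
  {
#pragma acc loop gang worker vector
    for (int i = 0; i < n; i++)
      x[i] += y[i];
  }

#pragma acc kernels
  {
#pragma acc loop gang(static:4) worker(2) vector(length:64)
    for (int i = 0; i < n; i++)
      y[i] *= 2.0f;

#pragma acc loop seq
    for (int i = 1; i < n; i++)
      y[i] += y[i-1];
  }
}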

nathan


[-- Attachment #2: 04-trunk-c.patch --]
[-- Type: text/x-patch, Size: 6589 bytes --]

2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>

	* c-parser.c (c_parser_oacc_shape_clause): New.
	(c_parser_oacc_simple_clause): New.
	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.

Index: gcc/c/c-parser.c
===================================================================
--- gcc/c/c-parser.c	(revision 228969)
+++ gcc/c/c-parser.c	(working copy)
@@ -11185,6 +11185,138 @@ c_parser_omp_clause_num_workers (c_parse
 }
 
 /* OpenACC:
+   gang [( gang_expr_list )]
+   worker [( expression )]
+   vector [( expression )] */
+
+static tree
+c_parser_oacc_shape_clause (c_parser *parser, pragma_omp_clause c_kind,
+			    const char *str, tree list)
+{
+  omp_clause_code kind;
+  const char *id = "num";
+
+  switch (c_kind)
+    {
+    default:
+      gcc_unreachable ();
+    case PRAGMA_OACC_CLAUSE_GANG:
+      kind = OMP_CLAUSE_GANG;
+      break;
+    case PRAGMA_OACC_CLAUSE_VECTOR:
+      kind = OMP_CLAUSE_VECTOR;
+      id = "length";
+      break;
+    case PRAGMA_OACC_CLAUSE_WORKER:
+      kind = OMP_CLAUSE_WORKER;
+      break;
+    }
+
+  tree op0 = NULL_TREE, op1 = NULL_TREE;
+  location_t loc = c_parser_peek_token (parser)->location;
+
+  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      c_parser_consume_token (parser);
+
+      do
+	{
+	  if (c_parser_next_token_is (parser, CPP_NAME)
+	      || c_parser_next_token_is (parser, CPP_KEYWORD))
+	    {
+	      tree name_kind = c_parser_peek_token (parser)->value;
+	      const char *p = IDENTIFIER_POINTER (name_kind);
+	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)
+		{
+		  c_parser_consume_token (parser);
+		  if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
+		    {
+		      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+		      return list;
+		    }
+		  op_to_parse = &op1;
+		  if (c_parser_next_token_is (parser, CPP_MULT))
+		    {
+		      c_parser_consume_token (parser);
+		      *op_to_parse = integer_minus_one_node;
+		      continue;
+		    }
+		}
+	      else if (strcmp (id, p) == 0)
+		{
+		  c_parser_consume_token (parser);
+		  if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
+		    {
+		      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+		      return list;
+		    }
+		}
+	      else
+		{
+		  if (kind == OMP_CLAUSE_GANG)
+		    c_parser_error (parser, "expected %<num%> or %<static%>");
+		  else if (kind == OMP_CLAUSE_VECTOR)
+		    c_parser_error (parser, "expected %<length%>");
+		  else
+		    c_parser_error (parser, "expected %<num%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+		  return list;
+		}
+	    }
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      c_parser_error (parser, "duplicate operand to clause");
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+	      return list;
+	    }
+
+	  tree expr = c_parser_expression (parser).value;
+	  if (expr == error_mark_node)
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+	      return list;
+	    }
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+	  op_to_parse = &op0;
+	}
+      while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN));
+      c_parser_consume_token (parser);
+    }
+
+  check_no_duplicate_clause (list, kind, str);
+
+  tree c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+c_parser_oacc_simple_clause (c_parser *parser ATTRIBUTE_UNUSED,
+			     enum omp_clause_code code, tree list)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
+
+  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+}
+
+/* OpenACC:
    async [( int-expr )] */
 
 static tree
@@ -12390,6 +12522,11 @@ c_parser_oacc_all_clauses (c_parser *par
 	  clauses = c_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						clauses);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = c_parser_omp_clause_collapse (parser, clauses);
 	  c_name = "collapse";
@@ -12426,6 +12563,11 @@ c_parser_oacc_all_clauses (c_parser *par
 	  clauses = c_parser_omp_clause_firstprivate (parser, clauses);
 	  c_name = "firstprivate";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = c_parser_oacc_shape_clause (parser, c_kind, c_name,
+						clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -12474,6 +12616,16 @@ c_parser_oacc_all_clauses (c_parser *par
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						clauses);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = c_parser_oacc_shape_clause (parser, c_kind, c_name,
+						clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = c_parser_omp_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -12482,6 +12634,11 @@ c_parser_oacc_all_clauses (c_parser *par
 	  clauses = c_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = c_parser_oacc_shape_clause (parser, c_kind, c_name,
+						clauses);
+	  break;
 	default:
 	  c_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -13012,6 +13169,11 @@ c_parser_oacc_enter_exit_data (c_parser
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree


* Re: [OpenACC 6/11] Reduction initialization
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (4 preceding siblings ...)
  2015-10-21 19:19 ` [OpenACC 4/11] C " Nathan Sidwell
@ 2015-10-21 19:32 ` Nathan Sidwell
  2015-10-22  9:11   ` Jakub Jelinek
  2015-10-21 19:47 ` [OpenACC 7/11] execution model Nathan Sidwell
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:32 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

This patch is a temporary measure to avoid breaking reductions, until I post the 
reductions patch set (which builds on this).

Currently OpenACC reductions are handled by
(a) spawning all threads throughout the offload region
(b) having them each individually write to an allocated slot in a 'reductions 
array', according to their thread number.
(c) having the host collate the reduction values after the region.

This is clearly a rather restricted implementation of reductions.  With loop 
partitioning implemented, though, not every thread executes every loop -- in 
fact, a loop lacking any gang, worker or vector specifier won't be partitioned 
at all (until I commit the 'auto' implementation).  This leaves entries in the 
reduction array uninitialized.

This patch takes the brute-force approach of initializing the reduction array 
on the host before offloading and then copying it to the device.  Thus, at the 
end of the region, any slots that weren't used hold a sensible initial value 
which will not destroy the reduction result.
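
In (hypothetical) host-side pseudo-C, the emitted initialization amounts to:

  /* array has nthreads slots of the reduction type; init is the
     reduction's neutral value, e.g. 0 for +, 1 for *.  */
  for (size_t ix = 0; ix < nthreads; ix++)
    array[ix] = init;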

This code should be short lived ...

nathan

[-- Attachment #2: 06-trunk-red-init.patch --]
[-- Type: text/x-patch, Size: 3712 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* omp-low.c (oacc_init_reduction_array): New.
	(oacc_initialize_reduction_data): Initialize array.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229101)
+++ gcc/omp-low.c	(working copy)
@@ -12202,6 +13008,71 @@ oacc_gimple_assign (tree dest, tree_code
   gimplify_assign (dest, result, seq);
 }
 
+/* Initialize the reduction array with default values.  */
+
+static void
+oacc_init_reduction_array (tree array, tree init, tree nthreads,
+			   gimple_seq *stmt_seqp)
+{
+  tree type = TREE_TYPE (TREE_TYPE (array));
+  tree x, loop_header, loop_body, loop_exit;
+  gimple *stmt;
+
+  /* Create for loop.
+
+     let array = reduction variable array
+     let init = the reduction's initial (neutral) value
+
+     for (i = 0; i < nthreads; i++)
+       array[i] = init
+ */
+
+  loop_header = create_artificial_label (UNKNOWN_LOCATION);
+  loop_body = create_artificial_label (UNKNOWN_LOCATION);
+  loop_exit = create_artificial_label (UNKNOWN_LOCATION);
+
+  /* Create and initialize an index variable.  */
+  tree ix = create_tmp_var (sizetype);
+  gimplify_assign (ix, fold_build1 (NOP_EXPR, sizetype, integer_zero_node),
+		   stmt_seqp);
+
+  /* Insert the loop header label here.  */
+  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_header));
+
+  /* Exit loop if ix >= nthreads.  */
+  x = create_tmp_var (sizetype);
+  gimplify_assign (x, fold_build1 (NOP_EXPR, sizetype, nthreads), stmt_seqp);
+  stmt = gimple_build_cond (GE_EXPR, ix, x, loop_exit, loop_body);
+  gimple_seq_add_stmt (stmt_seqp, stmt);
+
+  /* Insert the loop body label here.  */
+  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_body));
+
+  /* Calculate the array offset.  */
+  tree offset = create_tmp_var (sizetype);
+  gimplify_assign (offset, TYPE_SIZE_UNIT (type), stmt_seqp);
+  stmt = gimple_build_assign (offset, MULT_EXPR, offset, ix);
+  gimple_seq_add_stmt (stmt_seqp, stmt);
+
+  tree ptr = create_tmp_var (TREE_TYPE (array));
+  stmt = gimple_build_assign (ptr, POINTER_PLUS_EXPR, array, offset);
+  gimple_seq_add_stmt (stmt_seqp, stmt);
+
+  /* Assign init.  */
+  gimplify_assign (build_simple_mem_ref (ptr), init, stmt_seqp);
+
+  /* Increment the induction variable.  */
+  tree one = fold_build1 (NOP_EXPR, sizetype, integer_one_node);
+  stmt = gimple_build_assign (ix, PLUS_EXPR, ix, one);
+  gimple_seq_add_stmt (stmt_seqp, stmt);
+
+  /* Go back to the top of the loop.  */
+  gimple_seq_add_stmt (stmt_seqp, gimple_build_goto (loop_header));
+
+  /* Place the loop exit label here.  */
+  gimple_seq_add_stmt (stmt_seqp, gimple_build_label (loop_exit));
+}
+
 /* Helper function to initialize local data for the reduction arrays.
    The reduction arrays need to be placed inside the calling function
    for accelerators, or else the host won't be able to preform the final
@@ -12261,12 +13132,18 @@ oacc_initialize_reduction_data (tree cla
       gimple_call_set_lhs (stmt, array);
       gimple_seq_add_stmt (stmt_seqp, stmt);
 
+      /* Initialize array. */
+      tree init = omp_reduction_init_op (OMP_CLAUSE_LOCATION (c),
+					 OMP_CLAUSE_REDUCTION_CODE (c),
+					 type);
+      oacc_init_reduction_array (array, init, nthreads, stmt_seqp);
+
       /* Map this array into the accelerator.  */
 
       /* Add the reduction array to the list of clauses.  */
       tree x = array;
       t = build_omp_clause (gimple_location (ctx->stmt), OMP_CLAUSE_MAP);
-      OMP_CLAUSE_SET_MAP_KIND (t, GOMP_MAP_FORCE_FROM);
+      OMP_CLAUSE_SET_MAP_KIND (t, GOMP_MAP_FORCE_TOFROM);
       OMP_CLAUSE_DECL (t) = x;
       OMP_CLAUSE_CHAIN (t) = NULL;
       if (oc)


* Re: [OpenACC 7/11] execution model
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (5 preceding siblings ...)
  2015-10-21 19:32 ` [OpenACC 6/11] Reduction initialization Nathan Sidwell
@ 2015-10-21 19:47 ` Nathan Sidwell
  2015-10-22  9:32   ` Jakub Jelinek
  2020-11-24 10:34   ` Thomas Schwinge
  2015-10-21 19:50 ` [OpenACC 8/11] device-specific lowering Nathan Sidwell
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:47 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1787 bytes --]

This patch is the early lowering part of OpenACC loops.  Rather than piggy-back 
onto expand_omp_for_static_nochunk & expand_omp_for_static_chunk, we have a new 
function 'expand_oacc_for', which does the OpenACC equivalent expansion, except 
that it uses a new internal builtin to abstract the actual loop step, initial 
value, bound and chunking.  We end up turning a loop of the form

for (i = b; i < e; i += s)
   {...}

into

chunk=0
num_chunks = IFN_LOOP (CHUNKS, ...);
step = IFN_LOOP (STEP, ...);

head:
offset = IFN_LOOP (OFFSET, ...);
bound = IFN_LOOP (BOUND, ...);

if (!(offset < bound)) goto bottom;

body:
i = offset + b;
{...}
offset += step;
if (offset < bound) goto body;

bottom:
chunk++;
if (chunk < num_chunks) goto head;

In addition to marking up the loop itself like that, we also emit partitioning 
markers around the loop.  The whole of the above sequence will be wrapped with:

IFN_UNIQUE (HEAD_MARK, <partitioning flags>)
IFN_UNIQUE (FORK)  // repeat for each potential axis
IFN_UNIQUE (HEAD_MARK)
<loop here>
IFN_UNIQUE (TAIL_MARK, ...)
IFN_UNIQUE (JOIN)  // repeat for each potential axis
IFN_UNIQUE (TAIL_MARK)

The reason for the rather heavyweight head and tail marker sequence is 
reductions.  Those will insert more internal fns just before and just after the 
fork and join functions.

The fork and join pattern is duplicated for each potential axis, but does not 
specify which axis --  that is all determined in the device-specific oacc 
lowering pass.

The patch introduces 4 variants of the IFN_UNIQUE function along with a new 
IFN_GOACC_LOOP function (the IFN_LOOP of the sketch above).  I see I also 
included IFN_GOACC_DIM_SIZE and IFN_GOACC_DIM_POS, which provide the size of a 
compute axis and a position along it respectively.  Those are actually used in 
the next patch and not here.
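
For reference, assuming those internal functions expand via the oacc_dim_size 
and oacc_dim_pos patterns added in the PTX backend patch, the mapping on nvptx 
is roughly:

  axis     IFN_GOACC_DIM_SIZE      IFN_GOACC_DIM_POS
  gang     mov.u32 %r, %nctaid.x   mov.u32 %r, %ctaid.x
  worker   mov.u32 %r, %ntid.y     mov.u32 %r, %tid.y
  vector   mov.u32 %r, %ntid.x     mov.u32 %r, %tid.x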

nathan

[-- Attachment #2: 07-trunk-loop-mark.patch --]
[-- Type: text/x-patch, Size: 50383 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* omp-low.c (struct omp_context): Remove gwv_below, gwv_this
	fields.
	(enum oacc_loop_flags): New.
	(enclosing_target_ctx): May return NULL.
	(ctx_in_oacc_kernels_region): New.
	(is_oacc_parallel, is_oacc_kernels): New.
	(check_oacc_kernel_gwv): New.
	(oacc_loop_or_target_p): Delete.
	(scan_omp_for): Don't calculate gwv mask.  Check parallel clause
	operands.  Strip reductions from kernels.
	(scan_omp_target): Don't calculate gwv mask.
	(lower_oacc_head_mark, lower_oacc_loop_marker,
	lower_oacc_head_tail): New.
	(expand_omp_for_static_nochunk, expand_omp_for_static_chunk):
	Remove OpenACC handling.
	(struct oacc_collapse): New.
	(expand_oacc_collapse_init, expand_oacc_collapse_vars): New.
	(expand_oacc_for): New.
	(expand_omp_for): Call expand_oacc_for.
	(lower_omp_for): Call lower_oacc_head_tail.
	* internal-fn.def (IFN_UNIQUE_OACC_FORK, IFN_UNIQUE_OACC_JOIN,
	IFN_OACC_HEAD_MARK, IFN_OACC_TAIL_MARK): Define.
	(IFN_GOACC_DIM_SIZE, IFN_GOACC_DIM_POS): New.
	(IFN_GOACC_LOOP): New.
	(IFN_GOACC_LOOP_CHUNKS, IFN_GOACC_LOOP_STEP,
	IFN_GOACC_LOOP_OFFSET, IFN_GOACC_LOOP_BOUND): New.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN.
	(expand_GOACC_DIM_SIZE, expand_GOACC_DIM_POS, expand_GOACC_LOOP): New.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229101)
+++ gcc/omp-low.c	(working copy)
@@ -199,14 +200,6 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
-
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     levels below this one.  */
-  int gwv_below;
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     this level and above.  For parallel and kernels clauses, a mask
-     indicating which of num_gangs/num_workers/num_vectors was used.  */
-  int gwv_this;
 };
 
 /* A structure holding the elements of:
@@ -233,6 +226,24 @@ struct omp_for_data
   struct omp_for_data_loop *loops;
 };
 
+/*  Flags for an OpenACC loop.  */
+
+enum oacc_loop_flags
+  {
+    OLF_SEQ	= 1u << 0,  /* Explicitly sequential  */
+    OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
+    OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
+    OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
+
+    /* Explicitly specified loop axes.  */
+    OLF_DIM_BASE = 4,
+    OLF_DIM_GANG   = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG),
+    OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER),
+    OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR),
+
+    OLF_MAX = OLF_DIM_BASE + GOMP_DIM_MAX
+  };
+
 
 static splay_tree all_contexts;
 static int taskreg_nesting_level;
@@ -255,6 +292,28 @@ static gphi *find_phi_with_arg_on_edge (
       *handled_ops_p = false; \
       break;
 
+/* Return true if CTX corresponds to an oacc parallel region.  */
+
+static bool
+is_oacc_parallel (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_PARALLEL));
+}
+
+/* Return true if CTX corresponds to an oacc kernels region.  */
+
+static bool
+is_oacc_kernels (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
+}
+
 /* Helper function to get the name of the array containing the partial
    reductions for OpenACC reductions.  */
 static const char *
@@ -2889,28 +2948,92 @@ finish_taskreg_scan (omp_context *ctx)
     }
 }
 
+/* Find the enclosing offload context.  */
 
 static omp_context *
 enclosing_target_ctx (omp_context *ctx)
 {
-  while (ctx != NULL
-	 && gimple_code (ctx->stmt) != GIMPLE_OMP_TARGET)
-    ctx = ctx->outer;
-  gcc_assert (ctx != NULL);
+  for (; ctx; ctx = ctx->outer)
+    if (gimple_code (ctx->stmt) == GIMPLE_OMP_TARGET)
+      break;
+
   return ctx;
 }
 
+/* Return true if ctx is part of an oacc kernels region.  */
+
 static bool
-oacc_loop_or_target_p (gimple *stmt)
+ctx_in_oacc_kernels_region (omp_context *ctx)
+{
+  for (;ctx != NULL; ctx = ctx->outer)
+    {
+      gimple *stmt = ctx->stmt;
+      if (gimple_code (stmt) == GIMPLE_OMP_TARGET
+	  && gimple_omp_target_kind (stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+	return true;
+    }
+
+  return false;
+}
+
+/* Check the parallelism clauses inside a kernels regions.
+   Until kernels handling moves to use the same loop indirection
+   scheme as parallel, we need to do this checking early.  */
+
+static unsigned
+check_oacc_kernel_gwv (gomp_for *stmt, omp_context *ctx)
 {
-  enum gimple_code outer_type = gimple_code (stmt);
-  return ((outer_type == GIMPLE_OMP_TARGET
-	   && ((gimple_omp_target_kind (stmt)
-		== GF_OMP_TARGET_KIND_OACC_PARALLEL)
-	       || (gimple_omp_target_kind (stmt)
-		   == GF_OMP_TARGET_KIND_OACC_KERNELS)))
-	  || (outer_type == GIMPLE_OMP_FOR
-	      && gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_OACC_LOOP));
+  bool checking = true;
+  unsigned outer_mask = 0;
+  unsigned this_mask = 0;
+  bool has_seq = false, has_auto = false;
+
+  if (ctx->outer)
+    outer_mask = check_oacc_kernel_gwv (NULL,  ctx->outer);
+  if (!stmt)
+    {
+      checking = false;
+      if (gimple_code (ctx->stmt) != GIMPLE_OMP_FOR)
+	return outer_mask;
+      stmt = as_a <gomp_for *> (ctx->stmt);
+    }
+
+  for (tree c = gimple_omp_for_clauses (stmt); c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_GANG);
+	  break;
+	case OMP_CLAUSE_WORKER:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_WORKER);
+	  break;
+	case OMP_CLAUSE_VECTOR:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_VECTOR);
+	  break;
+	case OMP_CLAUSE_SEQ:
+	  has_seq = true;
+	  break;
+	case OMP_CLAUSE_AUTO:
+	  has_auto = true;
+	  break;
+	default:
+	  break;
+	}
+    }
+
+  if (checking)
+    {
+      if (has_seq && (this_mask || has_auto))
+	error_at (gimple_location (stmt), "%<seq%> overrides other OpenACC loop specifiers");
+      else if (has_auto && this_mask)
+	error_at (gimple_location (stmt), "%<auto%> conflicts with other OpenACC loop specifiers");
+
+      if (this_mask & outer_mask)
+	error_at (gimple_location (stmt), "inner loop uses same OpenACC parallelism as containing loop");
+    }
+
+  return outer_mask | this_mask;
 }
 
 /* Scan a GIMPLE_OMP_FOR.  */
@@ -2918,52 +3041,62 @@ oacc_loop_or_target_p (gimple *stmt)
 static void
 scan_omp_for (gomp_for *stmt, omp_context *outer_ctx)
 {
-  enum gimple_code outer_type = GIMPLE_ERROR_MARK;
   omp_context *ctx;
   size_t i;
   tree clauses = gimple_omp_for_clauses (stmt);
 
-  if (outer_ctx)
-    outer_type = gimple_code (outer_ctx->stmt);
-
   ctx = new_omp_context (stmt, outer_ctx);
 
   if (is_gimple_omp_oacc (stmt))
     {
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	ctx->gwv_this = outer_ctx->gwv_this;
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  int val;
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_GANG)
-	    val = MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WORKER)
-	    val = MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR)
-	    val = MASK_VECTOR;
-	  else
-	    continue;
-	  ctx->gwv_this |= val;
-	  if (!outer_ctx)
-	    {
-	      /* Skip; not nested inside a region.  */
-	      continue;
-	    }
-	  if (!oacc_loop_or_target_p (outer_ctx->stmt))
+      omp_context *tgt = enclosing_target_ctx (outer_ctx);
+
+      if (!tgt || is_oacc_parallel (tgt))
+	for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	  {
+	    char const *check = NULL;
+
+	    switch (OMP_CLAUSE_CODE (c))
+	      {
+	      case OMP_CLAUSE_GANG:
+		check = "gang";
+		break;
+
+	      case OMP_CLAUSE_WORKER:
+		check = "worker";
+		break;
+
+	      case OMP_CLAUSE_VECTOR:
+		check = "vector";
+		break;
+
+	      default:
+		break;
+	      }
+
+	    if (check && OMP_CLAUSE_OPERAND (c, 0))
+	      error_at (gimple_location (stmt),
+			"argument not permitted on %<%s%> clause in"
+			" OpenACC %<parallel%>", check);
+	  }
+
+      if (tgt && is_oacc_kernels (tgt))
+	{
+	  /* Strip out reductions, as they are not  handled yet.  */
+	  tree *prev_ptr = &clauses;
+
+	  while (tree probe = *prev_ptr)
 	    {
-	      /* Skip; not nested inside an OpenACC region.  */
-	      continue;
-	    }
-	  if (outer_type == GIMPLE_OMP_FOR)
-	    outer_ctx->gwv_below |= val;
-	  if (OMP_CLAUSE_OPERAND (c, 0) != NULL_TREE)
-	    {
-	      omp_context *enclosing = enclosing_target_ctx (outer_ctx);
-	      if (gimple_omp_target_kind (enclosing->stmt)
-		  == GF_OMP_TARGET_KIND_OACC_PARALLEL)
-		error_at (gimple_location (stmt),
-			  "no arguments allowed to gang, worker and vector clauses inside parallel");
+	      tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
+	      
+	      if (OMP_CLAUSE_CODE (probe) == OMP_CLAUSE_REDUCTION)
+		*prev_ptr = *next_ptr;
+	      else
+		prev_ptr = next_ptr;
 	    }
+
+	  gimple_omp_for_set_clauses (stmt, clauses);
+	  check_oacc_kernel_gwv (stmt, ctx);
 	}
     }
 
@@ -2978,19 +3111,6 @@ scan_omp_for (gomp_for *stmt, omp_contex
       scan_omp_op (gimple_omp_for_incr_ptr (stmt, i), ctx);
     }
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
-
-  if (is_gimple_omp_oacc (stmt))
-    {
-      if (ctx->gwv_this & ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector may occur only once in a loop nest");
-      else if (ctx->gwv_below != 0
-	       && ctx->gwv_this > ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector must occur in this order in a loop nest");
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	outer_ctx->gwv_below |= ctx->gwv_below;
-    }
 }
 
 /* Scan an OpenMP sections directive.  */
@@ -3061,19 +3181,6 @@ scan_omp_target (gomp_target *stmt, omp_
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
     }
 
-  if (is_gimple_omp_oacc (stmt))
-    {
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_GANGS)
-	    ctx->gwv_this |= MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_WORKERS)
-	    ctx->gwv_this |= MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR_LENGTH)
-	    ctx->gwv_this |= MASK_VECTOR;
-	}
-    }
-
   scan_sharing_clauses (clauses, ctx);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
@@ -5769,6 +5885,166 @@ lower_send_shared_vars (gimple_seq *ilis
     }
 }
 
+/* Emit an OpenACC head marker call, encapsulating the partitioning and
+   other information that must be processed by the target compiler.
+   Return the maximum number of dimensions the associated loop might
+   be partitioned over.  */
+
+static unsigned
+lower_oacc_head_mark (location_t loc, tree clauses,
+		      gimple_seq *seq, omp_context *ctx)
+{
+  unsigned levels = 0;
+  unsigned tag = 0;
+  tree gang_static = NULL_TREE;
+  auto_vec<tree, 1> args;
+
+  args.quick_push (build_int_cst
+		   (integer_type_node, IFN_UNIQUE_OACC_HEAD_MARK));
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  tag |= OLF_DIM_GANG;
+	  gang_static = OMP_CLAUSE_GANG_STATIC_EXPR (c);
+	  /* static:* is represented by -1, and we can ignore it, as
+	     scheduling is always static.  */
+	  if (gang_static && integer_minus_onep (gang_static))
+	    gang_static = NULL_TREE;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_WORKER:
+	  tag |= OLF_DIM_WORKER;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_VECTOR:
+	  tag |= OLF_DIM_VECTOR;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_SEQ:
+	  tag |= OLF_SEQ;
+	  break;
+
+	case OMP_CLAUSE_AUTO:
+	  tag |= OLF_AUTO;
+	  break;
+
+	case OMP_CLAUSE_INDEPENDENT:
+	  tag |= OLF_INDEPENDENT;
+	  break;
+
+	default:
+	  continue;
+	}
+    }
+
+  if (gang_static)
+    {
+      if (DECL_P  (gang_static))
+	gang_static = build_outer_var_ref (gang_static, ctx);
+      tag |= OLF_GANG_STATIC;
+    }
+
+  /* In a parallel region, loops are implicitly INDEPENDENT.  */
+  omp_context *tgt = enclosing_target_ctx (ctx);
+  if (!tgt || is_oacc_parallel (tgt))
+    tag |= OLF_INDEPENDENT;
+
+  /* A loop lacking SEQ, GANG, WORKER and/or VECTOR is implicitly AUTO.  */
+  if (!(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE)
+	       | OLF_SEQ)))
+      tag |= OLF_AUTO;
+
+  /* Ensure at least one level.  */
+  if (!levels)
+    levels++;
+
+  args.safe_push (build_int_cst (integer_type_node, levels));
+  args.safe_push (build_int_cst (integer_type_node, tag));
+  if (gang_static)
+    args.safe_push (gang_static);
+
+  gcall *call = gimple_build_call_internal_vec (IFN_UNIQUE, args);
+  gimple_set_location (call, loc);
+  gimple_seq_add_stmt (seq, call);
+
+  return levels;
+}
+
+/* Emit an OpenACC loop head or tail marker to SEQ.  TOFOLLOW is the
+   partitioning level of the enclosed region, if any.  */
+
+static void
+lower_oacc_loop_marker (location_t loc, bool head, tree tofollow,
+			gimple_seq *seq)
+{
+  tree marker = build_int_cst
+    (integer_type_node, (head ? IFN_UNIQUE_OACC_HEAD_MARK
+			 : IFN_UNIQUE_OACC_TAIL_MARK));
+  gcall *call = gimple_build_call_internal
+    (IFN_UNIQUE, 1 + (tofollow != NULL_TREE), marker, tofollow);
+  gimple_set_location (call, loc);
+  gimple_seq_add_stmt (seq, call);
+}
+
+/* Generate the before and after OpenACC loop sequences.  CLAUSES are
+   the loop clauses, from which we extract reductions.  Initialize
+   HEAD and TAIL.  */
+
+static void
+lower_oacc_head_tail (location_t loc, tree clauses,
+		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
+{
+  bool inner = false;
+  unsigned count = lower_oacc_head_mark (loc, clauses, head, ctx);
+  
+  if (!count)
+    lower_oacc_loop_marker (loc, false, integer_zero_node, tail);
+
+  for (unsigned done = 1; count; count--, done++)
+    {
+      tree place = build_int_cst (integer_type_node, -1);
+      gcall *fork = gimple_build_call_internal
+	(IFN_UNIQUE, 2,
+	 build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK), place);
+      gcall *join = gimple_build_call_internal
+	(IFN_UNIQUE, 2,
+	 build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN), place);
+      gimple_seq fork_seq = NULL;
+      gimple_seq join_seq = NULL;
+
+      gimple_set_location (fork, loc);
+      gimple_set_location (join, loc);
+
+      /* Mark the beginning of this level sequence.  */
+      if (inner)
+	lower_oacc_loop_marker (loc, true,
+				build_int_cst (integer_type_node, count),
+				&fork_seq);
+      lower_oacc_loop_marker (loc, false,
+			      build_int_cst (integer_type_node, done),
+			      &join_seq);
+
+      gimple_seq_add_stmt (&fork_seq, fork);
+      gimple_seq_add_stmt (&join_seq, join);
+
+      /* Append this level to head. */
+      gimple_seq_add_seq (head, fork_seq);
+      /* Prepend it to tail.  */
+      gimple_seq_add_seq (&join_seq, *tail);
+      *tail = join_seq;
+
+      inner = true;
+    }
+
+  /* Mark the end of the sequence.  */
+  lower_oacc_loop_marker (loc, true, NULL_TREE, head);
+  lower_oacc_loop_marker (loc, false, NULL_TREE, tail);
+}
 
 /* A convenience function to build an empty GIMPLE_COND with just the
    condition.  */
@@ -8327,10 +8603,6 @@ expand_omp_for_static_nochunk (struct om
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8423,10 +8695,6 @@ expand_omp_for_static_nochunk (struct om
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -8653,10 +8921,7 @@ expand_omp_for_static_nochunk (struct om
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -8794,10 +9059,6 @@ expand_omp_for_static_chunk (struct omp_
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8894,10 +9155,6 @@ expand_omp_for_static_chunk (struct omp_
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -9157,10 +9414,7 @@ expand_omp_for_static_chunk (struct omp_
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -10210,93 +10464,645 @@ expand_omp_taskloop_for_inner (struct om
     }
 }
 
-/* Expand the OMP loop defined by REGION.  */
+/* Information about members of an OpenACC collapsed loop nest.  */
 
-static void
-expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+struct oacc_collapse
 {
-  struct omp_for_data fd;
-  struct omp_for_data_loop *loops;
+  tree base;  /* Base value. */
+  tree iters; /* Number of steps.  */
+  tree step;  /* step size.  */
+};
 
-  loops
-    = (struct omp_for_data_loop *)
-      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
-	      * sizeof (struct omp_for_data_loop));
-  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
-			&fd, loops);
-  region->sched_kind = fd.sched_kind;
+/* Helper for expand_oacc_for.  Determine collapsed loop information.
+   Fill in COUNTS array.  Emit any initialization code before GSI.
+   Return the calculated outer loop bound of BOUND_TYPE.  */
 
-  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
-  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  if (region->cont)
-    {
-      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
-      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-    }
-  else
-    /* If there isn't a continue then this is a degerate case where
-       the introduction of abnormal edges during lowering will prevent
-       original loops from being detected.  Fix that up.  */
-    loops_state_set (LOOPS_NEED_FIXUP);
+static tree
+expand_oacc_collapse_init (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   oacc_collapse *counts, tree bound_type)
+{
+  tree total = build_int_cst (bound_type, 1);
+  int ix;
+  
+  gcc_assert (integer_onep (fd->loop.step));
+  gcc_assert (integer_zerop (fd->loop.n1));
 
-  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
-    expand_omp_simd (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
-    expand_cilk_for (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
-    {
-      if (gimple_omp_for_combined_into_p (fd.for_stmt))
-	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
-      else
-	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
-    }
-  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
-	   && !fd.have_ordered)
-    {
-      if (fd.chunk_size == NULL)
-	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
-      else
-	expand_omp_for_static_chunk (region, &fd, inner_stmt);
-    }
-  else
+  for (ix = 0; ix != fd->collapse; ix++)
     {
-      int fn_index, start_ix, next_ix;
+      const omp_for_data_loop *loop = &fd->loops[ix];
 
-      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
-		  == GF_OMP_FOR_KIND_FOR);
-      if (fd.chunk_size == NULL
-	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
-	fd.chunk_size = integer_zero_node;
-      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
-      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
-		  ? 3 : fd.sched_kind;
-      if (!fd.ordered)
-	fn_index += fd.have_ordered * 4;
-      if (fd.ordered)
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
-      else
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
-      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
-      if (fd.iter_type == long_long_unsigned_type_node)
-	{
-	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
-			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
-	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
-		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
-	}
-      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
-			      (enum built_in_function) next_ix, inner_stmt);
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = iter_type;
+      tree plus_type = iter_type;
+
+      gcc_assert (loop->cond_code == fd->loop.cond_code);
+      
+      if (POINTER_TYPE_P (iter_type))
+	plus_type = sizetype;
+      if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+	diff_type = signed_type_for (diff_type);
+
+      tree b = loop->n1;
+      tree e = loop->n2;
+      tree s = loop->step;
+      bool up = loop->cond_code == LT_EXPR;
+      tree dir = build_int_cst (diff_type, up ? +1 : -1);
+      bool negating;
+      tree expr;
+
+      b = force_gimple_operand_gsi (gsi, b, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+      e = force_gimple_operand_gsi (gsi, e, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Convert the step, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+      s = fold_convert (diff_type, s);
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, diff_type, s);
+      s = force_gimple_operand_gsi (gsi, s, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Determine the range, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (iter_type);
+      expr = fold_build2 (MINUS_EXPR, plus_type,
+			  fold_convert (plus_type, negating ? b : e),
+			  fold_convert (plus_type, negating ? e : b));
+      expr = fold_convert (diff_type, expr);
+      if (negating)
+	expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+      tree range = force_gimple_operand_gsi
+	(gsi, expr, true, NULL_TREE, true, GSI_SAME_STMT);
+
+      /* Determine number of iterations.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+
+      tree iters = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					     true, GSI_SAME_STMT);
+
+      counts[ix].base = b;
+      counts[ix].iters = iters;
+      counts[ix].step = s;
+
+      total = fold_build2 (MULT_EXPR, bound_type, total,
+			   fold_convert (bound_type, iters));
     }
 
-  if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_only_virtuals);
+  return total;
 }
 
+/* Emit initializers for collapsed loop members.  IVAR is the outer
+   loop iteration variable, from which collapsed loop iteration values
+   are calculated.  The COUNTS array has been initialized by
+   expand_oacc_collapse_init.  */
+
+static void
+expand_oacc_collapse_vars (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   const oacc_collapse *counts, tree ivar)
+{
+  tree ivar_type = TREE_TYPE (ivar);
+
+  /*  The most rapidly changing iteration variable is the innermost
+      one.  */
+  for (int ix = fd->collapse; ix--;)
+    {
+      const omp_for_data_loop *loop = &fd->loops[ix];
+      const oacc_collapse *collapse = &counts[ix];
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = TREE_TYPE (collapse->step);
+      tree plus_type = iter_type;
+      enum tree_code plus_code = PLUS_EXPR;
+      tree expr;
+
+      if (POINTER_TYPE_P (iter_type))
+	{
+	  plus_code = POINTER_PLUS_EXPR;
+	  plus_type = sizetype;
+	}
+
+      expr = build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
+		     fold_convert (ivar_type, collapse->iters));
+      expr = build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
+		     collapse->step);
+      expr = build2 (plus_code, iter_type, collapse->base,
+		     fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      gassign *ass = gimple_build_assign (loop->v, expr);
+      gsi_insert_before (gsi, ass, GSI_SAME_STMT);
+
+      if (ix)
+	{
+	  expr = build2 (TRUNC_DIV_EXPR, ivar_type, ivar,
+			 fold_convert (ivar_type, collapse->iters));
+	  ivar = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					   true, GSI_SAME_STMT);
+	}
+    }
+}
+
+/* A subroutine of expand_omp_for.  Generate code for an OpenACC
+   partitioned loop.  The lowering here is abstracted, in that the
+   loop parameters are passed through internal functions, which are
+   further lowered by oacc_device_lower, once we get to the target
+   compiler.  The loop is of the form:
+
+   for (V = B; V LTGT E; V += S) {BODY}
+
+   where LTGT is < or >.  We may have a specified chunking size, CHUNK_SIZE
+   (constant 0 for no chunking) and we will have a GWV partitioning
+   mask, specifying dimensions over which the loop is to be
+   partitioned (see note below).  We generate code that looks like:
+
+   <entry_bb> [incoming FALL->body, BRANCH->exit]
+     typedef signedintify (typeof (V)) T;  // underlying signed integral type
+     T range = E - B;
+     T chunk_no = 0;
+     T DIR = LTGT == '<' ? +1 : -1;
+     T chunk_max = GOACC_LOOP_CHUNK (DIR, range, S, CHUNK_SIZE, GWV);
+     T step = GOACC_LOOP_STEP (DIR, range, S, CHUNK_SIZE, GWV);
+
+   <head_bb> [created by splitting end of entry_bb]
+     T offset = GOACC_LOOP_OFFSET (DIR, range, S, CHUNK_SIZE, GWV, chunk_no);
+     T bound = GOACC_LOOP_BOUND (DIR, range, S, CHUNK_SIZE, GWV, offset);
+     if (!(offset LTGT bound)) goto bottom_bb;
+
+   <body_bb> [incoming]
+     V = B + offset;
+     {BODY}
+
+   <cont_bb> [incoming, may == body_bb FALL->exit_bb, BRANCH->body_bb]
+     offset += step;
+     if (offset LTGT bound) goto body_bb; [*]
+
+   <bottom_bb> [created by splitting start of exit_bb] insert BRANCH->head_bb
+     chunk_no++;
+     if (chunk_no < chunk_max) goto head_bb;
+
+   <exit_bb> [incoming]
+     V = B + ((range -/+ 1) / S +/- 1) * S [*]
+
+   [*] Needed if V live at end of loop
+
+   Note: CHUNK_SIZE & the GWV mask are specified explicitly here.  This is a
+   transition, and will be specified by a more general mechanism shortly.
+ */
+
+static void
+expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
+{
+  tree v = fd->loop.v;
+  enum tree_code cond_code = fd->loop.cond_code;
+  enum tree_code plus_code = PLUS_EXPR;
+
+  tree chunk_size = integer_minus_one_node;
+  tree gwv = integer_zero_node;
+  tree iter_type = TREE_TYPE (v);
+  tree diff_type = iter_type;
+  tree plus_type = iter_type;
+  struct oacc_collapse *counts = NULL;
+
+  gcc_checking_assert (gimple_omp_for_kind (fd->for_stmt)
+		       == GF_OMP_FOR_KIND_OACC_LOOP);
+  gcc_assert (!gimple_omp_for_combined_into_p (fd->for_stmt));
+  gcc_assert (cond_code == LT_EXPR || cond_code == GT_EXPR);
+
+  if (POINTER_TYPE_P (iter_type))
+    {
+      plus_code = POINTER_PLUS_EXPR;
+      plus_type = sizetype;
+    }
+  if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+    diff_type = signed_type_for (diff_type);
+
+  basic_block entry_bb = region->entry; /* BB ending in OMP_FOR */
+  basic_block exit_bb = region->exit; /* BB ending in OMP_RETURN */
+  basic_block cont_bb = region->cont; /* BB ending in OMP_CONTINUE  */
+  basic_block bottom_bb = NULL;
+
+  /* entry_bb has two successors; the branch edge is to the exit
+     block, the fallthrough edge to the body.  */
+  gcc_assert (EDGE_COUNT (entry_bb->succs) == 2
+	      && BRANCH_EDGE (entry_bb)->dest == exit_bb);
+
+  /* If cont_bb non-NULL, it has 2 successors.  The branch successor is
+     body_bb, or to a block whose only successor is the body_bb.  Its
+     fallthrough successor is the final block (same as the branch
+     successor of the entry_bb).  */
+  if (cont_bb)
+    {
+      basic_block body_bb = FALLTHRU_EDGE (entry_bb)->dest;
+      basic_block bed = BRANCH_EDGE (cont_bb)->dest;
+
+      gcc_assert (FALLTHRU_EDGE (cont_bb)->dest == exit_bb);
+      gcc_assert (bed == body_bb || single_succ_edge (bed)->dest == body_bb);
+    }
+  else
+    gcc_assert (!gimple_in_ssa_p (cfun));
+
+  /* The exit block only has entry_bb and cont_bb as predecessors.  */
+  gcc_assert (EDGE_COUNT (exit_bb->preds) == 1 + (cont_bb != NULL));
+
+  tree chunk_no;
+  tree chunk_max = NULL_TREE;
+  tree bound, offset;
+  tree step = create_tmp_var (diff_type, ".step");
+  bool up = cond_code == LT_EXPR;
+  tree dir = build_int_cst (diff_type, up ? +1 : -1);
+  bool chunking = !gimple_in_ssa_p (cfun);
+  bool negating;
+
+  /* SSA instances.  */
+  tree offset_incr = NULL_TREE;
+  tree offset_init = NULL_TREE;
+
+  gimple_stmt_iterator gsi;
+  gassign *ass;
+  gcall *call;
+  gimple *stmt;
+  tree expr;
+  location_t loc;
+  edge split, be, fte;
+
+  /* Split the end of entry_bb to create head_bb.  */
+  split = split_block (entry_bb, last_stmt (entry_bb));
+  basic_block head_bb = split->dest;
+  entry_bb = split->src;
+
+  /* Chunk setup goes at end of entry_bb, replacing the omp_for.  */
+  gsi = gsi_last_bb (entry_bb);
+  gomp_for *for_stmt = as_a <gomp_for *> (gsi_stmt (gsi));
+  loc = gimple_location (for_stmt);
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      offset_init = gimple_omp_for_index (for_stmt, 0);
+      gcc_assert (integer_zerop (fd->loop.n1));
+      /* The SSA parallelizer does gang parallelism.  */
+      gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG));
+    }
+
+  if (fd->collapse > 1)
+    {
+      counts = XALLOCAVEC (struct oacc_collapse, fd->collapse);
+      tree total = expand_oacc_collapse_init (fd, &gsi, counts,
+					      TREE_TYPE (fd->loop.n2));
+
+      if (SSA_VAR_P (fd->loop.n2))
+	{
+	  total = force_gimple_operand_gsi (&gsi, total, false, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	  ass = gimple_build_assign (fd->loop.n2, total);
+	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+	}
+      
+    }
+
+  tree b = fd->loop.n1;
+  tree e = fd->loop.n2;
+  tree s = fd->loop.step;
+
+  b = force_gimple_operand_gsi (&gsi, b, true, NULL_TREE, true, GSI_SAME_STMT);
+  e = force_gimple_operand_gsi (&gsi, e, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  /* Convert the step, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+  s = fold_convert (diff_type, s);
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, diff_type, s);
+  s = force_gimple_operand_gsi (&gsi, s, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  if (!chunking)
+    chunk_size = integer_zero_node;
+  expr = fold_convert (diff_type, chunk_size);
+  chunk_size = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+  /* Determine the range, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (iter_type);
+  expr = fold_build2 (MINUS_EXPR, plus_type,
+		      fold_convert (plus_type, negating ? b : e),
+		      fold_convert (plus_type, negating ? e : b));
+  expr = fold_convert (diff_type, expr);
+  if (negating)
+    expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+  tree range = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+
+  chunk_no = build_int_cst (diff_type, 0);
+  if (chunking)
+    {
+      gcc_assert (!gimple_in_ssa_p (cfun));
+
+      expr = chunk_no;
+      chunk_max = create_tmp_var (diff_type, ".chunk_max");
+      chunk_no = create_tmp_var (diff_type, ".chunk_no");
+
+      ass = gimple_build_assign (chunk_no, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+
+      call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+					 build_int_cst (integer_type_node,
+							IFN_GOACC_LOOP_CHUNKS),
+					 dir, range, s, chunk_size, gwv);
+      gimple_call_set_lhs (call, chunk_max);
+      gimple_set_location (call, loc);
+      gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+    }
+  else
+    chunk_size = chunk_no;
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_STEP),
+				     dir, range, s, chunk_size, gwv);
+  gimple_call_set_lhs (call, step);
+  gimple_set_location (call, loc);
+  gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+
+  /* Remove the GIMPLE_OMP_FOR.  */
+  gsi_remove (&gsi, true);
+
+  /* Fixup edges from head_bb */
+  be = BRANCH_EDGE (head_bb);
+  fte = FALLTHRU_EDGE (head_bb);
+  be->flags |= EDGE_FALSE_VALUE;
+  fte->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+
+  basic_block body_bb = fte->dest;
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+
+      offset = gimple_omp_continue_control_use (cont_stmt);
+      offset_incr = gimple_omp_continue_control_def (cont_stmt);
+    }
+  else
+    {
+      offset = create_tmp_var (diff_type, ".offset");
+      offset_init = offset_incr = offset;
+    }
+  bound = create_tmp_var (TREE_TYPE (offset), ".bound");
+
+  /* Loop offset & bound go into head_bb.  */
+  gsi = gsi_start_bb (head_bb);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_OFFSET),
+				     dir, range, s,
+				     chunk_size, gwv, chunk_no);
+  gimple_call_set_lhs (call, offset_init);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_BOUND),
+				     dir, range, s,
+				     chunk_size, gwv, offset_init);
+  gimple_call_set_lhs (call, bound);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  expr = build2 (cond_code, boolean_type_node, offset_init, bound);
+  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+		    GSI_CONTINUE_LINKING);
+
+  /* V assignment goes into body_bb.  */
+  if (!gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_start_bb (body_bb);
+
+      expr = build2 (plus_code, iter_type, b,
+		     fold_convert (plus_type, offset));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (fd->collapse > 1)
+	expand_oacc_collapse_vars (fd, &gsi, counts, v);
+    }
+
+  /* Loop increment goes into cont_bb.  If this is not a loop, we
+     will have spawned threads as if it was, and each one will
+     execute one iteration.  The specification is not explicit about
+     whether such constructs are ill-formed or not, and they can
+     occur, especially when noreturn routines are involved.  */
+  if (cont_bb)
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+      loc = gimple_location (cont_stmt);
+
+      /* Increment offset.  */
+      if (gimple_in_ssa_p (cfun))
+	expr = build2 (plus_code, iter_type, offset,
+		       fold_convert (plus_type, step));
+      else
+	expr = build2 (PLUS_EXPR, diff_type, offset, step);
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (offset_incr, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      expr = build2 (cond_code, boolean_type_node, offset_incr, bound);
+      gsi_insert_before (&gsi, gimple_build_cond_empty (expr), GSI_SAME_STMT);
+
+      /*  Remove the GIMPLE_OMP_CONTINUE.  */
+      gsi_remove (&gsi, true);
+
+      /* Fixup edges from cont_bb */
+      be = BRANCH_EDGE (cont_bb);
+      fte = FALLTHRU_EDGE (cont_bb);
+      be->flags |= EDGE_TRUE_VALUE;
+      fte->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+
+      if (chunking)
+	{
+	  /* Split the beginning of exit_bb to make bottom_bb.  We
+	     need to insert a nop at the start, because splitting is
+  	     after a stmt, not before.  */
+	  gsi = gsi_start_bb (exit_bb);
+	  stmt = gimple_build_nop ();
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  split = split_block (exit_bb, stmt);
+	  bottom_bb = split->src;
+	  exit_bb = split->dest;
+	  gsi = gsi_last_bb (bottom_bb);
+
+	  /* Chunk increment and test goes into bottom_bb.  */
+	  expr = build2 (PLUS_EXPR, diff_type, chunk_no,
+			 build_int_cst (diff_type, 1));
+	  ass = gimple_build_assign (chunk_no, expr);
+	  gsi_insert_after (&gsi, ass, GSI_CONTINUE_LINKING);
+
+	  /* Chunk test at end of bottom_bb.  */
+	  expr = build2 (LT_EXPR, boolean_type_node, chunk_no, chunk_max);
+	  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+			    GSI_CONTINUE_LINKING);
+
+	  /* Fixup edges from bottom_bb. */
+	  split->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+	  make_edge (bottom_bb, head_bb, EDGE_TRUE_VALUE);
+	}
+    }
+
+  gsi = gsi_last_bb (exit_bb);
+  gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  loc = gimple_location (gsi_stmt (gsi));
+
+  if (!gimple_in_ssa_p (cfun))
+    {
+      /* Insert the final value of V, in case it is live.  This is the
+	 value for the only thread that survives past the join.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+      expr = fold_build2 (MULT_EXPR, diff_type, expr, s);
+      expr = build2 (plus_code, iter_type, b, fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+    }
+
+  /* Remove the OMP_RETURN. */
+  gsi_remove (&gsi, true);
+
+  if (cont_bb)
+    {
+      /* We now have one or two nested loops.  Update the loop
+	 structures.  */
+      struct loop *parent = entry_bb->loop_father;
+      struct loop *body = body_bb->loop_father;
+      
+      if (chunking)
+	{
+	  struct loop *chunk_loop = alloc_loop ();
+	  chunk_loop->header = head_bb;
+	  chunk_loop->latch = bottom_bb;
+	  add_loop (chunk_loop, parent);
+	  parent = chunk_loop;
+	}
+      else if (parent != body)
+	{
+	  gcc_assert (body->header == body_bb);
+	  gcc_assert (body->latch == cont_bb
+		      || single_pred (body->latch) == cont_bb);
+	  parent = NULL;
+	}
+
+      if (parent)
+	{
+	  struct loop *body_loop = alloc_loop ();
+	  body_loop->header = body_bb;
+	  body_loop->latch = cont_bb;
+	  add_loop (body_loop, parent);
+	}
+    }
+}
+
+/* Expand the OMP loop defined by REGION.  */
+
+static void
+expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+{
+  struct omp_for_data fd;
+  struct omp_for_data_loop *loops;
+
+  loops
+    = (struct omp_for_data_loop *)
+      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
+	      * sizeof (struct omp_for_data_loop));
+  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
+			&fd, loops);
+  region->sched_kind = fd.sched_kind;
+
+  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
+  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  if (region->cont)
+    {
+      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
+      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+    }
+  else
+    /* If there isn't a continue then this is a degerate case where
+       the introduction of abnormal edges during lowering will prevent
+       original loops from being detected.  Fix that up.  */
+    loops_state_set (LOOPS_NEED_FIXUP);
+
+  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
+    expand_omp_simd (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
+    expand_cilk_for (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gcc_assert (!inner_stmt);
+      expand_oacc_for (region, &fd);
+    }
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
+    {
+      if (gimple_omp_for_combined_into_p (fd.for_stmt))
+	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
+      else
+	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
+    }
+  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
+	   && !fd.have_ordered)
+    {
+      if (fd.chunk_size == NULL)
+	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
+      else
+	expand_omp_for_static_chunk (region, &fd, inner_stmt);
+    }
+  else
+    {
+      int fn_index, start_ix, next_ix;
+
+      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
+		  == GF_OMP_FOR_KIND_FOR);
+      if (fd.chunk_size == NULL
+	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
+	fd.chunk_size = integer_zero_node;
+      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
+      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
+		  ? 3 : fd.sched_kind;
+      if (!fd.ordered)
+	fn_index += fd.have_ordered * 4;
+      if (fd.ordered)
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
+      else
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
+      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
+      if (fd.iter_type == long_long_unsigned_type_node)
+	{
+	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
+			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
+	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
+		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
+	}
+      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
+			      (enum built_in_function) next_ix, inner_stmt);
+    }
+
+  if (gimple_in_ssa_p (cfun))
+    update_ssa (TODO_update_ssa_only_virtuals);
+}
+
+
+/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
 
-/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
-
 	v = GOMP_sections_start (n);
     L0:
 	switch (v)
@@ -13375,6 +14252,7 @@ lower_omp_for (gimple_stmt_iterator *gsi
   gomp_for *stmt = as_a <gomp_for *> (gsi_stmt (*gsi_p));
   gbind *new_stmt;
   gimple_seq omp_for_body, body, dlist;
+  gimple_seq oacc_head = NULL, oacc_tail = NULL;
   size_t i;
 
   push_gimplify_context ();
@@ -13483,6 +14361,16 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Once lowered, extract the bounds and clauses.  */
   extract_omp_for_data (stmt, &fd, NULL);
 
+  if (is_gimple_omp_oacc (ctx->stmt)
+      && !ctx_in_oacc_kernels_region (ctx))
+    lower_oacc_head_tail (gimple_location (stmt),
+			  gimple_omp_for_clauses (stmt),
+			  &oacc_head, &oacc_tail, ctx);
+
+  /* Add OpenACC partitioning markers just before the loop.  */
+  if (oacc_head)
+    gimple_seq_add_seq (&body, oacc_head);
+  
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
   if (gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_FOR)
@@ -13516,6 +14404,11 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Region exit marker goes at the end of the loop body.  */
   gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
   maybe_add_implicit_barrier_cancel (ctx, &body);
+
+  /* Add OpenACC joining and reduction markers just after the loop.  */
+  if (oacc_tail)
+    gimple_seq_add_seq (&body, oacc_tail);
+
   pop_gimplify_context (new_stmt);
 
   gimple_bind_append_vars (new_stmt, ctx->block_vars);
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229101)
+++ gcc/internal-fn.def	(working copy)
@@ -65,11 +65,56 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
    other such optimizations.  The first argument distinguishes
    between uses.  Other arguments are as needed for use.  The return
    type depends on use too.  */
 DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)
 #define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
+
+/* FORK and JOIN mark the points at which OpenACC partitioned
+   execution is entered or exited.  They take an INTEGER_CST argument,
+   indicating the axis of forking or joining and return nothing.  */
+#define IFN_UNIQUE_OACC_FORK 1
+#define IFN_UNIQUE_OACC_JOIN 2
+/* HEAD_MARK and TAIL_MARK are used to demarcate the sequences entering
+   and leaving partitioned execution.  */
+#define IFN_UNIQUE_OACC_HEAD_MARK 3
+#define IFN_UNIQUE_OACC_TAIL_MARK 4
+
+/* DIM_SIZE and DIM_POS return the size of a particular compute
+   dimension and the executing thread's position within that
+   dimension.  DIM_POS is pure (and not const) so that it isn't
+   thought to clobber memory and can be gcse'd within a single
+   parallel region, but not across FORK/JOIN boundaries.  They take a
+   single INTEGER_CST argument.  */
+DEF_INTERNAL_FN (GOACC_DIM_SIZE, ECF_CONST | ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_DIM_POS, ECF_PURE | ECF_NOTHROW | ECF_LEAF, ".")
+
+/* OpenACC looping abstraction.  Allows the precise stepping of
+   the compute geometry over the loop iterations to be deferred until
+   it is known which compiler is generating the code.  The action is
+   encoded in a constant first argument.
+
+     CHUNK_MAX = LOOP (CODE_CHUNKS, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     STEP = LOOP (CODE_STEP, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     OFFSET = LOOP (CODE_OFFSET, DIR, RANGE, STEP, CHUNK_SIZE, MASK, CHUNK_NO)
+     BOUND = LOOP (CODE_BOUND, DIR, RANGE, STEP, CHUNK_SIZE, MASK, OFFSET)
+
+     DIR - +1 for up loop, -1 for down loop
+     RANGE - Range of loop (END - BASE)
+     STEP - iteration step size
+     CHUNK_SIZE - size of each chunk (constant zero for no chunking)
+     CHUNK_NO - chunk number
+     MASK - partitioning mask.
+
+   TODO: The partitioning mask and chunk size are a transition stage,
+   they will be removed once the required infrastructure is in place.  */
+
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_PURE | ECF_NOTHROW, NULL)
+#define IFN_GOACC_LOOP_CHUNKS 0  /* Number  of chunks.  */
+#define IFN_GOACC_LOOP_STEP 1    /* Size of each thread's step.  */
+#define IFN_GOACC_LOOP_OFFSET 2  /* Initial iteration value.  */
+#define IFN_GOACC_LOOP_BOUND 3   /* Limit of iteration value.  */
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229101)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,28 +1958,95 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
 expand_UNIQUE (gcall *stmt)
 {
   rtx pattern = NULL_RTX;
 
   switch (TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)))
     {
     default:
       gcc_unreachable ();
 
     case IFN_UNIQUE_UNSPEC:
 #ifdef HAVE_unique
       pattern = gen_unique ();
 #endif
       break;
+
+    case IFN_UNIQUE_OACC_FORK:
+#ifdef HAVE_oacc_fork
+      pattern = expand_normal (gimple_call_arg (stmt, 1));
+      pattern = gen_oacc_fork (pattern);
+#else
+      gcc_unreachable ();
+#endif
+      break;
+
+    case IFN_UNIQUE_OACC_JOIN:
+#ifdef HAVE_oacc_join
+      pattern = expand_normal (gimple_call_arg (stmt, 1));
+      pattern = gen_oacc_join (pattern);
+#else
+      gcc_unreachable ();
+#endif
+      break;
     }
 
   if (pattern)
     emit_insn (pattern);
 }
+
+/* The size of an OpenACC compute dimension.  */
+
+static void
+expand_GOACC_DIM_SIZE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  if (!lhs)
+    return;
+  
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+#ifdef HAVE_oacc_dim_size
+  rtx dim = expand_expr (gimple_call_arg (stmt, 0), NULL_RTX,
+			 VOIDmode, EXPAND_NORMAL);
+  emit_insn (gen_oacc_dim_size (target, dim));
+#else
+  emit_move_insn (target, GEN_INT (1));
+#endif
+}
+
+/* The position of an OpenACC execution engine along one compute axis.  */
+
+static void
+expand_GOACC_DIM_POS (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  if (!lhs)
+    return;
+  
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+#ifdef HAVE_oacc_dim_pos
+  rtx dim = expand_expr (gimple_call_arg (stmt, 0), NULL_RTX,
+			 VOIDmode, EXPAND_NORMAL);
+  emit_insn (gen_oacc_dim_pos (target, dim));
+#else
+  emit_move_insn (target, const0_rtx);
+#endif
+}
+
+/* This is expanded by oacc_device_lower pass.  */
+
+static void
+expand_GOACC_LOOP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (6 preceding siblings ...)
  2015-10-21 19:47 ` [OpenACC 7/11] execution model Nathan Sidwell
@ 2015-10-21 19:50 ` Nathan Sidwell
  2015-10-22  9:32   ` Jakub Jelinek
  2015-10-26 15:21   ` Jakub Jelinek
  2015-10-21 19:51 ` [OpenACC 9/11] oacc_device_lower pass gate Nathan Sidwell
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:50 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 1183 bytes --]

This patch is the device-specific half of the previous patch.  It processes the 
partition head & tail markers and loop abstraction functions inserted during omp 
lowering.

In the oacc_device_lower pass we scan the CFG, reconstructing the set of nested
loops demarcated by the IFN_UNIQUE (HEAD_MARK) & IFN_UNIQUE (TAIL_MARK) functions.
The HEAD_MARK function carries the loop partitioning information specified by the
user.  Once constructed, we iterate over that structure checking partitioning
consistency (for instance an inner loop must use a dimension 'inside' that of an
outer loop; see the sketch below).  We also assign specific partitioning axes here.
Partitioning updates the parameters of the IFN_GOACC_LOOP and IFN_UNIQUE
fork/join functions appropriately.
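
As an illustration (not part of the patch; the code and names below are made
up), this is the kind of mis-nesting the consistency check rejects, where an
inner loop requests the same axis as its containing loop:

extern float a[100][100];

void
clear (void)
{
#pragma acc parallel
  {
#pragma acc loop gang
    for (int i = 0; i < 100; i++)
      {
	/* Rejected on the host compiler (where 'noisy' is true) with
	   "inner loop uses same OpenACC parallelism as containing loop",
	   diagnosed by oacc_loop_fixed_partitions in the patch.  */
#pragma acc loop gang
	for (int j = 0; j < 100; j++)
	  a[i][j] = 0.0f;
      }
  }
}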

Once partitioning has been determined, we iterate over the CFG scanning for the
marker, fork/join and loop functions.  The marker functions are deleted, the
fork & join functions are conditionally deleted (using the target hook of patch
3), and the loop function is expanded into code calculating the loop parameters
depending on how the loop has been partitioned.  This uses the GOACC_DIM_POS and
GOACC_DIM_SIZE internal functions included in patch 7.
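
To make that concrete, here is a rough sketch (not taken from the patch;
assume a vector-partitioned loop with default chunking, so oacc_xform_loop
takes the striding path) of what the GOACC_LOOP calls in expand_oacc_for's
skeleton reduce to for an 'up' loop:

   chunk_max = 1;                                 // GOACC_LOOP (CHUNKS, ...)
   step = GOACC_DIM_SIZE (GOMP_DIM_VECTOR) * S;   // GOACC_LOOP (STEP, ...)
   offset = GOACC_DIM_POS (GOMP_DIM_VECTOR) * S;  // GOACC_LOOP (OFFSET, ...)
   bound = range;                                 // GOACC_LOOP (BOUND, ...)
   for (; offset < bound; offset += step)
     {
       V = B + offset;
       {BODY}
     }

so adjacent vector lanes execute adjacent iterations, which is the striding
behaviour described in the oacc_xform_loop comment.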

nathan

[-- Attachment #2: 08-trunk-dev-lower.patch --]
[-- Type: text/x-patch, Size: 26385 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* omp-low.c: Include gimple-pretty-print.h.
	(struct oacc_loop): New.
	(oacc_thread_numbers): New.
	(oacc_xform_loop): New.
	(new_oacc_loop_raw, new_oacc_loop_outer, new_oacc_loop,
	new_oacc_loop_routine, finish_oacc_loop, free_oacc_loop): New.
	(dump_oacc_loop_part, dump_oacc_loop, debug_oacc_loop): New.
	(oacc_loop_discover_walk, oacc_loop_sibling_nreverse,
	oacc_loop_discovery): New.
	(oacc_loop_xform_head_tail, oacc_loop_xform_loop,
	oacc_loop_process): New.
	(oacc_loop_fixed_partitions, oacc_loop_partition): New.
	(execute_oacc_device_lower): Discover & process loops.  Process
	internal fns.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 228969)
+++ gcc/omp-low.c	(working copy)
@@ -81,6 +81,7 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "lto-section-names.h"
 #include "gomp-constants.h"
+#include "gimple-pretty-print.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -233,6 +226,32 @@ struct omp_for_data
   struct omp_for_data_loop *loops;
 };
 
+/* Describe the OpenACC looping structure of a function.  The entire
+   function is held in a 'NULL' loop.  */
+
+struct oacc_loop
+{
+  oacc_loop *parent; /* Containing loop.  */
+
+  oacc_loop *child; /* First inner loop.  */
+
+  oacc_loop *sibling; /* Next loop within same parent.  */
+
+  location_t loc; /* Location of the loop start.  */
+
+  gcall *marker; /* Initial head marker.  */
+  
+  gcall *heads[GOMP_DIM_MAX];  /* Head marker functions. */
+  gcall *tails[GOMP_DIM_MAX];  /* Tail marker functions. */
+
+  tree routine;  /* Pseudo-loop enclosing a routine.  */
+
+  unsigned mask;   /* Partitioning mask.  */
+  unsigned flags;   /* Partitioning flags.  */
+  tree chunk_size;   /* Chunk size.  */
+  gcall *head_end; /* Final marker of head sequence.  */
+};
+
 
 static splay_tree all_contexts;
 static int taskreg_nesting_level;
@@ -17474,6 +18357,240 @@ omp_finish_file (void)
     }
 }
 
+/* Find the number of threads (POS = false), or thread number (POS =
+   true) for an OpenACC region partitioned as MASK.  Setup code
+   required for the calculation is added to SEQ.  */
+
+static tree
+oacc_thread_numbers (bool pos, int mask, gimple_seq *seq)
+{
+  tree res = pos ? NULL_TREE :  build_int_cst (unsigned_type_node, 1);
+  unsigned ix;
+
+  /* Start at gang level, and examine relevant dimension indices.  */
+  for (ix = GOMP_DIM_GANG; ix != GOMP_DIM_MAX; ix++)
+    if (GOMP_DIM_MASK (ix) & mask)
+      {
+	tree arg = build_int_cst (unsigned_type_node, ix);
+
+	if (res)
+	  {
+	    /* We had an outer index, so scale that by the size of
+	       this dimension.  */
+	    tree n = create_tmp_var (integer_type_node);
+	    gimple *call
+	      = gimple_build_call_internal (IFN_GOACC_DIM_SIZE, 1, arg);
+	    
+	    gimple_call_set_lhs (call, n);
+	    gimple_seq_add_stmt (seq, call);
+	    res = fold_build2 (MULT_EXPR, integer_type_node, res, n);
+	  }
+	if (pos)
+	  {
+	    /* Determine index in this dimension.  */
+	    tree id = create_tmp_var (integer_type_node);
+	    gimple *call = gimple_build_call_internal
+	      (IFN_GOACC_DIM_POS, 1, arg);
+
+	    gimple_call_set_lhs (call, id);
+	    gimple_seq_add_stmt (seq, call);
+	    if (res)
+	      res = fold_build2 (PLUS_EXPR, integer_type_node, res, id);
+	    else
+	      res = id;
+	  }
+      }
+
+  if (res == NULL_TREE)
+    res = build_int_cst (integer_type_node, 0);
+
+  return res;
+}
+
+/* Transform IFN_GOACC_LOOP calls to actual code.  See
+   expand_oacc_for for where these are generated.  At the vector
+   level, we stride loops, such that each member of a warp will
+   operate on adjacent iterations.  At the worker and gang level,
+   each gang/warp executes a set of contiguous iterations.  Chunking
+   can override this such that each iteration engine executes a
+   contiguous chunk, and then moves on to stride to the next chunk.   */
+
+static void
+oacc_xform_loop (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  tree dir = gimple_call_arg (call, 1);
+  tree range = gimple_call_arg (call, 2);
+  tree step = gimple_call_arg (call, 3);
+  tree chunk_size = NULL_TREE;
+  unsigned mask = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 5));
+  tree lhs = gimple_call_lhs (call);
+  tree type = TREE_TYPE (lhs);
+  tree diff_type = TREE_TYPE (range);
+  tree r = NULL_TREE;
+  gimple_seq seq = NULL;
+  bool chunking = false, striding = true;
+  unsigned outer_mask = mask & (~mask + 1); // Outermost partitioning
+  unsigned inner_mask = mask & ~outer_mask; // Inner partitioning (if any)
+
+#ifdef ACCEL_COMPILER
+  chunk_size = gimple_call_arg (call, 4);
+  if (integer_minus_onep (chunk_size)  /* Force static allocation.  */
+      || integer_zerop (chunk_size))   /* Default (also static).  */
+    {
+      /* If we're at the gang level, we want each to execute a
+	 contiguous run of iterations.  Otherwise we want each element
+	 to stride.  */
+      striding = !(outer_mask & GOMP_DIM_MASK (GOMP_DIM_GANG));
+      chunking = false;
+    }
+  else
+    {
+      /* Chunk of size 1 is striding.  */
+      striding = integer_onep (chunk_size);
+      chunking = !striding;
+    }
+#endif
+
+  /* striding=true, chunking=true
+       -> invalid.
+     striding=true, chunking=false
+       -> chunks=1
+     striding=false,chunking=true
+       -> chunks=ceil (range/(chunksize*threads*step))
+     striding=false,chunking=false
+       -> chunk_size=ceil(range/(threads*step)),chunks=1  */
+  push_gimplify_context (true);
+
+  switch (code)
+    {
+    default: gcc_unreachable ();
+
+    case IFN_GOACC_LOOP_CHUNKS:
+      if (!chunking)
+	r = build_int_cst (type, 1);
+      else
+	{
+	  /* chunk_max
+	     = (range - dir) / (chunks * step * num_threads) + dir  */
+	  tree per = oacc_thread_numbers (false, mask, &seq);
+	  per = fold_convert (type, per);
+	  chunk_size = fold_convert (type, chunk_size);
+	  per = fold_build2 (MULT_EXPR, type, per, chunk_size);
+	  per = fold_build2 (MULT_EXPR, type, per, step);
+	  r = build2 (MINUS_EXPR, type, range, dir);
+	  r = build2 (PLUS_EXPR, type, r, per);
+	  r = build2 (TRUNC_DIV_EXPR, type, r, per);
+	}
+      break;
+
+    case IFN_GOACC_LOOP_STEP:
+      {
+	/* If striding, step by the entire compute volume, otherwise
+	   step by the inner volume.  */
+	unsigned volume = striding ? mask : inner_mask;
+
+	r = oacc_thread_numbers (false, volume, &seq);
+	r = build2 (MULT_EXPR, type, fold_convert (type, r), step);
+      }
+      break;
+
+    case IFN_GOACC_LOOP_OFFSET:
+      if (striding)
+	{
+	  r = oacc_thread_numbers (true, mask, &seq);
+	  r = fold_convert (diff_type, r);
+	}
+      else
+	{
+	  tree inner_size = oacc_thread_numbers (false, inner_mask, &seq);
+	  tree outer_size = oacc_thread_numbers (false, outer_mask, &seq);
+	  tree volume = fold_build2 (MULT_EXPR, TREE_TYPE (inner_size),
+				     inner_size, outer_size);
+
+	  volume = fold_convert (diff_type, volume);
+	  if (chunking)
+	    chunk_size = fold_convert (diff_type, chunk_size);
+	  else
+	    {
+	      tree per = fold_build2 (MULT_EXPR, diff_type, volume, step);
+
+	      chunk_size = build2 (MINUS_EXPR, diff_type, range, dir);
+	      chunk_size = build2 (PLUS_EXPR, diff_type, chunk_size, per);
+	      chunk_size = build2 (TRUNC_DIV_EXPR, diff_type, chunk_size, per);
+	    }
+
+	  tree span = build2 (MULT_EXPR, diff_type, chunk_size,
+			      fold_convert (diff_type, inner_size));
+	  r = oacc_thread_numbers (true, outer_mask, &seq);
+	  r = fold_convert (diff_type, r);
+	  r = build2 (MULT_EXPR, diff_type, r, span);
+
+	  tree inner = oacc_thread_numbers (true, inner_mask, &seq);
+	  inner = fold_convert (diff_type, inner);
+	  r = fold_build2 (PLUS_EXPR, diff_type, r, inner);
+
+	  if (chunking)
+	    {
+	      tree chunk = fold_convert (diff_type, gimple_call_arg (call, 6));
+	      tree per
+		= fold_build2 (MULT_EXPR, diff_type, volume, chunk_size);
+	      per = build2 (MULT_EXPR, diff_type, per, chunk);
+
+	      r = build2 (PLUS_EXPR, diff_type, r, per);
+	    }
+	}
+      r = fold_build2 (MULT_EXPR, diff_type, r, step);
+      if (type != diff_type)
+	r = fold_convert (type, r);
+      break;
+
+    case IFN_GOACC_LOOP_BOUND:
+      if (striding)
+	r = range;
+      else
+	{
+	  tree inner_size = oacc_thread_numbers (false, inner_mask, &seq);
+	  tree outer_size = oacc_thread_numbers (false, outer_mask, &seq);
+	  tree volume = fold_build2 (MULT_EXPR, TREE_TYPE (inner_size),
+				     inner_size, outer_size);
+
+	  volume = fold_convert (diff_type, volume);
+	  if (chunking)
+	    chunk_size = fold_convert (diff_type, chunk_size);
+	  else
+	    {
+	      tree per = fold_build2 (MULT_EXPR, diff_type, volume, step);
+
+	      chunk_size = build2 (MINUS_EXPR, diff_type, range, dir);
+	      chunk_size = build2 (PLUS_EXPR, diff_type, chunk_size, per);
+	      chunk_size = build2 (TRUNC_DIV_EXPR, diff_type, chunk_size, per);
+	    }
+
+	  tree span = build2 (MULT_EXPR, diff_type, chunk_size,
+			      fold_convert (diff_type, inner_size));
+
+	  r = fold_build2 (MULT_EXPR, diff_type, span, step);
+
+	  tree offset = gimple_call_arg (call, 6);
+	  r = build2 (PLUS_EXPR, diff_type, r,
+		      fold_convert (diff_type, offset));
+	  r = build2 (integer_onep (dir) ? MIN_EXPR : MAX_EXPR,
+		      diff_type, r, range);
+	}
+      if (diff_type != type)
+	r = fold_convert (type, r);
+      break;
+    }
+
+  gimplify_assign (lhs, r, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
 /* Validate and update the dimensions for offloaded FN.  ATTRS is the
    raw attribute.  DIMS is an array of dimensions, which is returned.
    Returns the function level dimensionality --  the level at which an
@@ -17532,6 +18681,554 @@ oacc_validate_dims (tree fn, tree attrs,
   return fn_level;
 }
 
+/* Create an empty OpenACC loop structure at LOC.  */
+
+static oacc_loop *
+new_oacc_loop_raw (oacc_loop *parent, location_t loc)
+{
+  oacc_loop *loop = XCNEW (oacc_loop);
+
+  loop->parent = parent;
+  loop->child = loop->sibling = NULL;
+
+  if (parent)
+    {
+      loop->sibling = parent->child;
+      parent->child = loop;
+    }
+
+  loop->loc = loc;
+  loop->marker = NULL;
+  memset (loop->heads, 0, sizeof (loop->heads));
+  memset (loop->tails, 0, sizeof (loop->tails));
+  loop->routine = NULL_TREE;
+
+  loop->mask = loop->flags = 0;
+  loop->chunk_size = 0;
+  loop->head_end = NULL;
+
+  return loop;
+}
+
+/* Create an outermost, dummy OpenACC loop for offloaded function
+   DECL.  */
+
+static oacc_loop *
+new_oacc_loop_outer (tree decl)
+{
+  return new_oacc_loop_raw (NULL, DECL_SOURCE_LOCATION (decl));
+}
+
+/* Start a new OpenACC loop  structure beginning at head marker HEAD.
+   Link into PARENT loop.  Return the new loop.  */
+
+static oacc_loop *
+new_oacc_loop (oacc_loop *parent, gcall *marker)
+{
+  oacc_loop *loop = new_oacc_loop_raw (parent, gimple_location (marker));
+
+  loop->marker = marker;
+  
+  /* TODO: This is where device_type flattening would occur for the loop
+     flags.   */
+
+  loop->flags = TREE_INT_CST_LOW (gimple_call_arg (marker, 2));
+
+  tree chunk_size = integer_zero_node;
+  if (loop->flags & OLF_GANG_STATIC)
+    chunk_size = gimple_call_arg (marker, 3);
+  loop->chunk_size = chunk_size;
+
+  return loop;
+}
+
+/* Create a dummy loop encompassing a call to an OpenACC routine.
+   Extract the routine's partitioning requirements.  */
+
+static void
+new_oacc_loop_routine (oacc_loop *parent, gcall *call, tree decl, tree attrs)
+{
+  oacc_loop *loop = new_oacc_loop_raw (parent, gimple_location (call));
+  int dims[GOMP_DIM_MAX];
+  int level = oacc_validate_dims (decl, attrs, dims);
+
+  gcc_assert (level >= 0);
+
+  loop->marker = call;
+  loop->routine = decl;
+  loop->mask = ((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1)
+		^ (GOMP_DIM_MASK (level) - 1));
+}
+
+/* Finish off the current OpenACC loop once its tail marker has been
+   seen.  Return the parent loop.  */
+
+static oacc_loop *
+finish_oacc_loop (oacc_loop *loop)
+{
+  return loop->parent;
+}
+
+/* Free all OpenACC loop structures within LOOP (inclusive).  */
+
+static void
+free_oacc_loop (oacc_loop *loop)
+{
+  if (loop->sibling)
+    free_oacc_loop (loop->sibling);
+  if (loop->child)
+    free_oacc_loop (loop->child);
+
+  free (loop);
+}
+
+/* Dump out the OpenACC loop head or tail beginning at FROM.  */
+
+static void
+dump_oacc_loop_part (FILE *file, gcall *from, int depth,
+		     const char *title, int level)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (from);
+  unsigned code = TREE_INT_CST_LOW (gimple_call_arg (from, 0));
+
+  fprintf (file, "%*s%s-%d:\n", depth * 2, "", title, level);
+  for (gimple *stmt = from; ;)
+    {
+      print_gimple_stmt (file, stmt, depth * 2 + 2, 0);
+      gsi_next (&gsi);
+      stmt = gsi_stmt (gsi);
+
+      if (!is_gimple_call (stmt))
+	continue;
+
+      gcall *call = as_a <gcall *> (stmt);
+      
+      if (gimple_call_internal_p (call)
+	  && gimple_call_internal_fn (call) == IFN_UNIQUE
+	  && code == TREE_INT_CST_LOW (gimple_call_arg (call, 0)))
+	break;
+    }
+}
+
+/* Dump OpenACC loops LOOP, its siblings and its children.  */
+
+static void
+dump_oacc_loop (FILE *file, oacc_loop *loop, int depth)
+{
+  int ix;
+  
+  fprintf (file, "%*sLoop %x(%x) %s:%u\n", depth * 2, "",
+	   loop->flags, loop->mask,
+	   LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc));
+
+  if (loop->marker)
+    print_gimple_stmt (file, loop->marker, depth * 2, 0);
+
+  if (loop->routine)
+    fprintf (file, "%*sRoutine %s:%u:%s\n",
+	     depth * 2, "", DECL_SOURCE_FILE (loop->routine),
+	     DECL_SOURCE_LINE (loop->routine),
+	     IDENTIFIER_POINTER (DECL_NAME (loop->routine)));
+
+  for (ix = GOMP_DIM_GANG; ix != GOMP_DIM_MAX; ix++)
+    if (loop->heads[ix])
+      dump_oacc_loop_part (file, loop->heads[ix], depth, "Head", ix);
+  for (ix = GOMP_DIM_MAX; ix--;)
+    if (loop->tails[ix])
+      dump_oacc_loop_part (file, loop->tails[ix], depth, "Tail", ix);
+
+  if (loop->child)
+    dump_oacc_loop (file, loop->child, depth + 1);
+  if (loop->sibling)
+    dump_oacc_loop (file, loop->sibling, depth);
+}
+
+void debug_oacc_loop (oacc_loop *);
+
+/* Dump loops to stderr.  */
+
+DEBUG_FUNCTION void
+debug_oacc_loop (oacc_loop *loop)
+{
+  dump_oacc_loop (stderr, loop, 0);
+}
+
+/* DFS walk of basic blocks BB onwards, creating OpenACC loop
+   structures as we go.  By construction these loops are properly
+   nested.  */
+
+static void
+oacc_loop_discover_walk (oacc_loop *loop, basic_block bb)
+{
+  if (bb->flags & BB_VISITED)
+    return;
+  bb->flags |= BB_VISITED;
+
+  int marker = 0;
+  int remaining = 0;
+
+  /* Scan for loop markers.  */
+  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+       gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+
+      if (!is_gimple_call (stmt))
+	continue;
+
+      gcall *call = as_a <gcall *> (stmt);
+      
+      /* If this is a routine, make a dummy loop for it.  */
+      if (tree decl = gimple_call_fndecl (call))
+	if (tree attrs = get_oacc_fn_attrib (decl))
+	  {
+	    gcc_assert (!marker);
+	    new_oacc_loop_routine (loop, call, decl, attrs);
+	  }
+
+      if (!gimple_call_internal_p (call))
+	continue;
+
+      if (gimple_call_internal_fn (call) != IFN_UNIQUE)
+	continue;
+
+      unsigned code = TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+      if (code == IFN_UNIQUE_OACC_HEAD_MARK
+	  || code == IFN_UNIQUE_OACC_TAIL_MARK)
+	{
+	  if (gimple_call_num_args (call) == 1)
+	    {
+	      gcc_assert (marker && !remaining);
+	      marker = 0;
+	      if (code == IFN_UNIQUE_OACC_TAIL_MARK)
+		loop = finish_oacc_loop (loop);
+	      else
+		loop->head_end = call;
+	    }
+	  else
+	    {
+	      int count = TREE_INT_CST_LOW (gimple_call_arg (call, 1));
+
+	      if (!marker)
+		{
+		  if (code == IFN_UNIQUE_OACC_HEAD_MARK)
+		    loop = new_oacc_loop (loop, call);
+		  remaining = count;
+		}
+	      gcc_assert (count == remaining);
+	      if (remaining)
+		{
+		  remaining--;
+		  if (code == IFN_UNIQUE_OACC_HEAD_MARK)
+		    loop->heads[marker] = call;
+		  else
+		    loop->tails[remaining] = call;
+		}
+	      marker++;
+	    }
+	}
+    }
+  gcc_assert (!remaining && !marker);
+
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    oacc_loop_discover_walk (loop, e->dest);
+}
+
+/* LOOP is the first sibling.  Reverse the order in place and return
+   the new first sibling.  Recurse to child loops.  */
+
+static oacc_loop *
+oacc_loop_sibling_nreverse (oacc_loop *loop)
+{
+  oacc_loop *last = NULL;
+  do
+    {
+      if (loop->child)
+	loop->child = oacc_loop_sibling_nreverse  (loop->child);
+
+      oacc_loop *next = loop->sibling;
+      loop->sibling = last;
+      last = loop;
+      loop = next;
+    }
+  while (loop);
+
+  return last;
+}
+
+/* Discover the OpenACC loops marked up by HEAD and TAIL markers for
+   the current function.  */
+
+static oacc_loop *
+oacc_loop_discovery ()
+{
+  basic_block bb;
+  
+  oacc_loop *top = new_oacc_loop_outer (current_function_decl);
+  oacc_loop_discover_walk (top, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  /* The siblings were constructed in reverse order, reverse them so
+     that diagnostics come out in an unsurprising order.  */
+  top = oacc_loop_sibling_nreverse (top);
+
+  /* Reset the visited flags.  */
+  FOR_ALL_BB_FN (bb, cfun)
+    bb->flags &= ~BB_VISITED;
+
+  return top;
+}
+
+/* Transform the abstract internal function markers starting at FROM
+   to be for partitioning level LEVEL.  Stop when we meet another HEAD
+   or TAIL marker.  */
+
+static void
+oacc_loop_xform_head_tail (gcall *from, int level)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (from);
+  unsigned code = TREE_INT_CST_LOW (gimple_call_arg (from, 0));
+  tree replacement  = build_int_cst (unsigned_type_node, level);
+
+  for (gimple *stmt = from; ;)
+    {
+      gsi_next (&gsi);
+      stmt = gsi_stmt (gsi);
+
+      if (!is_gimple_call (stmt))
+	continue;
+
+      gcall *call = as_a <gcall *> (stmt);
+      
+      if (!gimple_call_internal_p (call))
+	continue;
+
+      switch (gimple_call_internal_fn (call))
+	{
+	case IFN_UNIQUE:
+	  {
+	    unsigned c = TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+	    if (c == code)
+	      goto break2;
+
+	    if (c == IFN_UNIQUE_OACC_FORK || c == IFN_UNIQUE_OACC_JOIN)
+	      *gimple_call_arg_ptr (call, 1) = replacement;
+	  }
+	  break;
+
+	default:
+	  break;
+	}
+    }
+
+ break2:;
+}
+
+/* Transform the IFN_GOACC_LOOP internal functions by providing the
+   determined partitioning mask and chunking argument.  */
+
+static void
+oacc_loop_xform_loop (gcall *end_marker, tree mask_arg, tree chunk_arg)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (end_marker);
+  
+  for (;;)
+    {
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!is_gimple_call (stmt))
+	    continue;
+
+	  gcall *call = as_a <gcall *> (stmt);
+      
+	  if (!gimple_call_internal_p (call))
+	    continue;
+
+	  if (gimple_call_internal_fn (call) != IFN_GOACC_LOOP)
+	    continue;
+
+	  *gimple_call_arg_ptr (call, 5) = mask_arg;
+	  *gimple_call_arg_ptr (call, 4) = chunk_arg;
+	  if (TREE_INT_CST_LOW (gimple_call_arg (call, 0))
+	      == IFN_GOACC_LOOP_BOUND)
+	    goto break2;
+	}
+
+      /* If we didn't see LOOP_BOUND, it should be in the single
+	 successor block.  */
+      basic_block bb = single_succ (gsi_bb (gsi));
+      gsi = gsi_start_bb (bb);
+    }
+
+ break2:;
+}
+
+/* Process the discovered OpenACC loops, setting the correct
+   partitioning level etc.  */
+
+static void
+oacc_loop_process (oacc_loop *loop)
+{
+  if (loop->child)
+    oacc_loop_process (loop->child);
+
+  if (loop->mask && !loop->routine)
+    {
+      int ix;
+      unsigned mask = loop->mask;
+      unsigned dim = GOMP_DIM_GANG;
+      tree mask_arg = build_int_cst (unsigned_type_node, mask);
+      tree chunk_arg = loop->chunk_size;
+
+      oacc_loop_xform_loop (loop->head_end, mask_arg, chunk_arg);
+
+      for (ix = 0; ix != GOMP_DIM_MAX && loop->heads[ix]; ix++)
+	{
+	  gcc_assert (mask);
+
+	  while (!(GOMP_DIM_MASK (dim) & mask))
+	    dim++;
+
+	  oacc_loop_xform_head_tail (loop->heads[ix], dim);
+	  oacc_loop_xform_head_tail (loop->tails[ix], dim);
+
+	  mask ^= GOMP_DIM_MASK (dim);
+	}
+    }
+
+  if (loop->sibling)
+    oacc_loop_process (loop->sibling);
+}
+
+/* Walk the OpenACC loop hierarchy checking and assigning the
+   programmer-specified partitionings.  OUTER_MASK is the partitioning
+   this loop is contained within.  Return the partitioning mask used within
+   this loop nest.  */
+
+static unsigned
+oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
+{
+  unsigned this_mask = loop->mask;
+  bool has_auto = false;
+  bool noisy = true;
+
+#ifdef ACCEL_COMPILER
+  /* When device_type is supported, we want the device compiler to be
+     noisy if the loop parameters are device_type-specific.  */
+  noisy = false;
+#endif
+
+  if (!loop->routine)
+    {
+      bool auto_par = (loop->flags & OLF_AUTO) != 0;
+      bool seq_par = (loop->flags & OLF_SEQ) != 0;
+
+      this_mask = ((loop->flags >> OLF_DIM_BASE)
+		   & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1));
+
+      if ((this_mask != 0) + auto_par + seq_par > 1)
+	{
+	  if (noisy)
+	    error_at (loop->loc,
+		      seq_par
+		      ? "%<seq%> overrides other OpenACC loop specifiers"
+		      : "%<auto%> conflicts with other OpenACC loop specifiers");
+	  auto_par = false;
+	  loop->flags &= ~OLF_AUTO;
+	  if (seq_par)
+	    {
+	      loop->flags &=
+		~((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE);
+	      this_mask = 0;
+	    }
+	}
+      if (auto_par && (loop->flags & OLF_INDEPENDENT))
+	has_auto = true;
+    }
+
+  if (this_mask & outer_mask)
+    {
+      const oacc_loop *outer;
+      for (outer = loop->parent; outer; outer = outer->parent)
+	if (outer->mask & this_mask)
+	  break;
+
+      if (noisy)
+	{
+	  if (outer)
+	    {
+	      error_at (loop->loc,
+			"%s uses same OpenACC parallelism as containing loop",
+			loop->routine ? "routine call" : "inner loop");
+	      inform (outer->loc, "containing loop here");
+	    }
+	  else
+	    error_at (loop->loc,
+		      "%s uses OpenACC parallelism disallowed by containing routine",
+		      loop->routine ? "routine call" : "loop");
+      
+	  if (loop->routine)
+	    inform (DECL_SOURCE_LOCATION (loop->routine),
+		    "routine %qD declared here", loop->routine);
+	}
+      this_mask &= ~outer_mask;
+    }
+  else
+    {
+      unsigned outermost = this_mask & -this_mask;
+
+      if (outermost && outermost <= outer_mask)
+	{
+	  if (noisy)
+	    {
+	      error_at (loop->loc,
+			"incorrectly nested OpenACC loop parallelism");
+
+	      const oacc_loop *outer;
+	      for (outer = loop->parent;
+		   outer->flags && outer->flags < outermost;
+		   outer = outer->parent)
+		continue;
+	      inform (outer->loc, "containing loop here");
+	    }
+
+	  this_mask &= ~outermost;
+	}
+    }
+
+  loop->mask = this_mask;
+
+  if (loop->child
+      && oacc_loop_fixed_partitions (loop->child, outer_mask | this_mask))
+    has_auto = true;
+
+  if (loop->sibling
+      && oacc_loop_fixed_partitions (loop->sibling, outer_mask))
+    has_auto = true;
+
+  return has_auto;
+}
+
+/* Walk the OpenACC loop hierarchy to check and assign partitioning
+   axes.  */
+
+static void
+oacc_loop_partition (oacc_loop *loop, int fn_level)
+{
+  unsigned outer_mask = 0;
+
+  if (fn_level >= 0)
+    outer_mask = GOMP_DIM_MASK (fn_level) - 1;
+
+  oacc_loop_fixed_partitions (loop, outer_mask);
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
@@ -17546,8 +19266,98 @@ execute_oacc_device_lower ()
     /* Not an offloaded function.  */
     return 0;
 
-  oacc_validate_dims (current_function_decl, attrs, dims);
-  
+  int fn_level = oacc_validate_dims (current_function_decl, attrs, dims);
+
+  /* Discover, partition and process the loops.  */
+  oacc_loop *loops = oacc_loop_discovery ();
+  oacc_loop_partition (loops, fn_level);
+  oacc_loop_process (loops);
+  if (dump_file)
+    {
+      fprintf (dump_file, "OpenACC loops\n");
+      dump_oacc_loop (dump_file, loops, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  /* Now lower internal loop functions to target-specific code
+     sequences.  */
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	if (!is_gimple_call (stmt))
+	  {
+	    gsi_next (&gsi);
+	    continue;
+	  }
+
+	gcall *call = as_a <gcall *> (stmt);
+	if (!gimple_call_internal_p (call))
+	  {
+	    gsi_next (&gsi);
+	    continue;
+	  }
+
+	/* Rewind to allow rescan.  */
+	gsi_prev (&gsi);
+	int rescan = 0;
+	unsigned ifn_code = gimple_call_internal_fn (call);
+
+	switch (ifn_code)
+	  {
+	  default: break;
+
+	  case IFN_GOACC_LOOP:
+	    oacc_xform_loop (call);
+	    rescan = 1;
+	    break;
+
+	  case IFN_UNIQUE:
+	    {
+	      unsigned code = TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+
+	      switch (code)
+		{
+		case IFN_UNIQUE_OACC_FORK:
+		case IFN_UNIQUE_OACC_JOIN:
+		  if (integer_minus_onep (gimple_call_arg (call, 1)))
+		    rescan = -1;
+		  else if (targetm.goacc.fork_join
+			   (call, dims, code == IFN_UNIQUE_OACC_FORK))
+		    rescan = -1;
+		  break;
+
+		case IFN_UNIQUE_OACC_HEAD_MARK:
+		case IFN_UNIQUE_OACC_TAIL_MARK:
+		  rescan = -1;
+		  break;
+		}
+	      break;
+	    }
+	  }
+
+	if (gsi_end_p (gsi))
+	  /* We rewound past the beginning of the BB.  */
+	  gsi = gsi_start_bb (bb);
+	else
+	  /* Undo the rewind.  */
+	  gsi_next (&gsi);
+
+	if (!rescan)
+	  /* If not rescanning, advance over the call.  */
+	  gsi_next (&gsi);
+	else if (rescan < 0)
+	  {
+	    if (gimple_vdef (call))
+	      replace_uses_by (gimple_vdef (call),
+			       gimple_vuse (call));
+	    gsi_remove (&gsi, true);
+	  }
+      }
+
+  free_oacc_loop (loops);
+
   return 0;
 }
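
(Aside on the partition-assignment code above: oacc_loop_fixed_partitions
relies on the usual lowest-set-bit idiom, this_mask & -this_mask, to find
the outermost axis a loop requests, and on a numeric comparison of masks to
detect bad nesting, because the gang axis occupies the lowest bit and the
vector axis the highest.  A minimal standalone sketch of that check follows;
the GOMP_DIM_* values are restated here as assumptions mirroring
gomp-constants.h, and the snippet is illustrative rather than part of the
patch.)

#include <stdio.h>

/* Assumed values, mirroring gomp-constants.h; gang is the outermost
   axis and gets the lowest bit.  */
#define GOMP_DIM_GANG    0
#define GOMP_DIM_WORKER  1
#define GOMP_DIM_VECTOR  2
#define GOMP_DIM_MASK(X) (1u << (X))

int
main (void)
{
  /* The containing loop is already partitioned over workers...  */
  unsigned outer_mask = GOMP_DIM_MASK (GOMP_DIM_WORKER);
  /* ... and the inner loop asks for gang partitioning.  */
  unsigned this_mask = GOMP_DIM_MASK (GOMP_DIM_GANG);

  /* Lowest set bit == outermost axis requested by this loop.  */
  unsigned outermost = this_mask & -this_mask;

  /* The nesting test from oacc_loop_fixed_partitions: the requested
     outermost axis must be strictly inner to (numerically greater than)
     everything the enclosing loops already use.  */
  if (outermost && outermost <= outer_mask)
    printf ("incorrectly nested OpenACC loop parallelism\n");

  return 0;
}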
 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 9/11] oacc_device_lower pass gate
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (7 preceding siblings ...)
  2015-10-21 19:50 ` [OpenACC 8/11] device-specific lowering Nathan Sidwell
@ 2015-10-21 19:51 ` Nathan Sidwell
  2015-10-22  9:33   ` Jakub Jelinek
  2015-10-21 19:52 ` [OpenACC 10/11] remove plugin restriction Nathan Sidwell
  2015-10-21 19:59 ` [OpenACC 11/11] execution tests Nathan Sidwell
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:51 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 232 bytes --]


This patch is obvious, but included for completeness. We always want to run the 
device lowering pass (when openacc is enabled), in order to delete the marker 
and loop functions that should never be seen after this point.

nathan

[-- Attachment #2: 09-trunk-lower-gate.patch --]
[-- Type: text/x-patch, Size: 516 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* omp-low.c (pass_oacc_device_lower::execute): Ignore errors.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229101)
+++ gcc/omp-low.c	(working copy)
@@ -17598,7 +19386,7 @@ public:
   /* opt_pass methods: */
   virtual unsigned int execute (function *)
     {
-      bool gate = (flag_openacc != 0 && !seen_error ());
+      bool gate = flag_openacc != 0;
 
       if (!gate)
 	return 0;

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 10/11] remove plugin restriction
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (8 preceding siblings ...)
  2015-10-21 19:51 ` [OpenACC 9/11] oacc_device_lower pass gate Nathan Sidwell
@ 2015-10-21 19:52 ` Nathan Sidwell
  2015-10-22  9:38   ` Jakub Jelinek
  2015-10-21 19:59 ` [OpenACC 11/11] execution tests Nathan Sidwell
  10 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:52 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 124 bytes --]

Here's another obvious patch.  The ptx plugin no longer needs to barf on gang or 
worker dimensions of non-unity.

nathan



[-- Attachment #2: 10-trunk-libgomp.patch --]
[-- Type: text/x-patch, Size: 949 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* plugin/plugin-nvptx.c (nvptx_exec): Remove check on compute
	dimensions.

Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 228969)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -902,13 +902,6 @@ nvptx_exec (void (*fn), size_t mapnum, v
     if (targ_fn->launch->dim[i])
       dims[i] = targ_fn->launch->dim[i];
 
-  if (dims[GOMP_DIM_GANG] != 1)
-    GOMP_PLUGIN_fatal ("non-unity num_gangs (%d) not supported",
-		       dims[GOMP_DIM_GANG]);
-  if (dims[GOMP_DIM_WORKER] != 1)
-    GOMP_PLUGIN_fatal ("non-unity num_workers (%d) not supported",
-		       dims[GOMP_DIM_WORKER]);
-
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
                   ` (9 preceding siblings ...)
  2015-10-21 19:52 ` [OpenACC 10/11] remove plugin restriction Nathan Sidwell
@ 2015-10-21 19:59 ` Nathan Sidwell
  2015-10-21 20:15   ` Ilya Verbin
  2015-10-22  9:54   ` Jakub Jelinek
  10 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 19:59 UTC (permalink / raw)
  To: GCC Patches; +Cc: Jakub Jelinek, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 237 bytes --]

This patch has some new execution tests, verifying loop partitioning is behaving 
as expected.
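
(Editorial note, restating the scheme the tests below use: every iteration
records which compute element executed it by reading the PTX id registers
and packing them into one int, and the host then recomputes the value it
expects from the loop's schedule.  The register-to-axis mapping assumed for
the nvptx target is %ctaid.x for gang, %tid.y for worker and %tid.x for
vector.  A minimal sketch of the host-side check, using the gang-only
static schedule of loop-g-1.c; the helper names here are made up.)

static inline int
pack_ids (int gang, int worker, int vector)
{
  /* Same encoding the device side writes into ary[ix].  */
  return (gang << 16) | (worker << 8) | vector;
}

static inline int
expected_gang_only (int ix, int n)
{
  /* 32 gangs, each covering a contiguous chunk of (n + 31) / 32
     iterations; worker and vector stay 0 in a gang-only loop.  */
  int gang = ix / ((n + 31) / 32);
  return pack_ids (gang, 0, 0);
}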

There are more execution tests on the gomp4 branch, but many of them use 
reductions.  We'll merge those once reductions are merged.

nathan

[-- Attachment #2: 11-trunk-tests.patch --]
[-- Type: text/x-patch, Size: 26177 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.s: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: New.

Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix / ((N + 31) / 32);
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.s
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.s	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.s	(working copy)
@@ -0,0 +1,386 @@
+	.file	"loop-g-1.c"
+	.section	.gnu.offload_lto_.inline.f031cb8759bb7418,"e",@progbits
+	.string	"x\234ce\200"
+	.string	"i\006\004`d`P\220g``\262zp\205\231\201\205\201\t,\306\310\004\022\007"
+	.ascii	"-\377\002_"
+	.text
+	.section	.gnu.offload_lto_main._omp_fn.0.f031cb8759bb7418,"e",@progbits
+	.ascii	"x\234\215W\373STG\026\276\347\366\2357\f\304\020b\310h0\331\321"
+	.ascii	"$\204\031\311V\355\243\222\335\252\255\375i\377\201T~\263\310"
+	.ascii	"@\b\273\300X0\032\363\023\027\034^E6\006bH\234\302\205\215(\020"
+	.ascii	"\214H\214\213\016.\316\202\n\270\3403\260\242\361\301#\nb\214"
+	.ascii	"\004\\$\b\3549\335\227a\200Y\365V\335\236\333\347~\337\351\257"
+	.ascii	"O\237>}G'\211K\266J\222\003$)\233:f\td\to`\037\237\301_&+\222"
+	.ascii	"\244\312Fl\024E6\320\263\236\236u!v=\350\020i\340\255Q6\021O"
+	.ascii	"U\030\2756\001\272\225@\312\225\231\242\323\033\214&\310eL\033"
+	.ascii	"U\321~?d;\031\264\025\335<*\277R\253\342u\243\264~U\236Z\260"
+	.ascii	"\272\022\201\025\032\030\366\300+\210\362\265L\236\326\225\262"
+	.ascii	"J\006`\304n\363\335\206\346H\352\312\020\220\260\377`\376r\233"
+	.ascii	"\201\025\342\230\330\351\274t\266\n6\354U?\316W\253;U(F\030\203"
+	.ascii	"}\300\320%\354\005\033\324\200M\252d\212\004\265\260\201\360"
+	.ascii	"\347\275\007+\231\360\255`w\274\3077d\025\276\343\261\373\311"
+	.ascii	"x\211\314\312\205\347\233\245\027K\242\354{\204\326\256\350\362"
+	.ascii	"\374\202NU\251d:\370Rx\257G\357u\350\235.\204\227\r"
+	.string	"\365\324D\bx\365\324\250RV\027\360\306#Zz\f\364v\311\037\362\312Ux\002\344\334\023#\017<1\262e\234=Jh\343\235\243\323<\362z\036\221\212\202\266\263\206R\036a\352n\357\231\373S\025\354\316S\243+\231\201\342\013_\023\225\3028\263w4\322\231H\274CP\005\2372\315_\351\341\341V\206\024\222\006\337\200\023\2324\202\267cG\263\314>\023\201o\236\275x\fP`\200\002_i-S\003\336uD\370\307\322\270\303at\2559\016*3>F\331\350\376\313\n\t3\302Q.\f\004\277\255\243\267V'\204\231\300\217\302\216h\370\017?\272\327\250\210t9\241R2V6\365\316\311\"a\252\275<\035g\206\232\017\033\330n\241\275\245\355\314\016\235\320^\375\275Q-\367r\217\255B\371?Q\3711R\336\262\250|\260i\356\323H\275\002\307\301\016\315x\037\304\273\021\357\003x\177\205\367~\274\033\300\316wG"
+	.string	"p\035\244Xd\235\034\367W"
+	.ascii	"\373\273\030\263~\276\244\"b!Q\033V\225i\211j\206\177\255LTh"
+	.ascii	"\003;R\246\206\317\317\232\331W\202_\323<\277S\257%\304\211j"
+	.ascii	"\320\310\355a\310'8Y\235\2716\027\305\374\202\\\374\267\276\021"
+	.ascii	"\266\202|2\f\371\024'O\\\237h\265\2626A\036\031\236\270$\257"
+	.ascii	" w\204!wr\262\377\352\221[Q\254[\220\217\237\275\360\302\nnW"
+	.ascii	"\030\356i\316\275\032\270^``}\202\273\353\312\314\004W\235\257"
+	.ascii	"\006\212w\306\323\006\260\300\277\303p\2739\267qv\270O\317\006"
+	.ascii	"\004\267\244a\327\035\203\340\336(\335\037[\236\257\362\261-"
+	.ascii	"\320\023\206\177F[\322\006\021\366\376\257k,\202Z]{\001\312\353"
+	.ascii	"\002\237\023\325(\205V+\316;\313\361\355u\375\275z6,\306\355"
+	.ascii	"\n\324\345Gj\023\036(\001U\354p8\027f\324\363\234\335\177\356"
+	.ascii	"`"
+	.string	"\251\225\335\024\354\351\2077\2765k\354}\213\354\013a\330\0279\273pO\373t$\033\021l\357\241\003G\264ZX=\270\310\3766\f\273we\002/\233\371\324\300\243f\336'F\357\231\274\246h\370\231\351 ^\277\022\377\037\034C\333<K\027\247\372P\277L\265#\376\377\255\315%\2607\226C@\235\237\337\256\252\215\277(g\t\200G\205Y\246\306BM\0045\221\324X\251\211\242&\032\033\313S|aW\341\243\356i\036\355\030\036\265g\370\354c\271\345YnY\315-\317\0210\216\232\347\251\241\003I\267\006\250.\255\305Gc<6\372\027\311\370\022u\3554\314zj6\360a$\251\210\327\317b\206\0076\033\203\302\342\366O\336Z\254\034\274,,)%8/\t\372\227\031\311\021\\&#\\\241V\202\327l\220`\243#\360\341\356f\023U4\035\304\300w`\213\223\306\240\240\370\217\032\f\001\337\234\350(4\022\300\002"
+	.string	"W1\202\211\034\343\rb\320U\2344\n\371\036\311\307,\241\325\330\f\022\\\303p\217\301.\265\311Hh\037\213@\374\2534\356\341\242\356&\263@\001\\\0072\265\236\372\342ea)Se\270\201C\t\373|\375]\277^\274\310\205\0014\277j[\263\304i$:M\340\350\005V\020F\201c<p\276 \334\312\347ur~*\020%&.\303 \204\243\373\030\320@\213\314(\021\270pH\353\032i\030\274g\030\006\301\312\201\222t\037"
+	.ascii	"#\"\243e\275\221t\340\267\020\3038\375d@K\264\017\327s\024\266"
+	.ascii	"W\350\261\363T\350ak\301\250\r\001z\214\246(\327\035\0338\031"
+	.ascii	"\025\272\334W\226\256\254\004\303|U\277'y\253\350\360\373\371"
+	.ascii	"\370\2045wV\305/\253\210\n:\340\340&\344\362\002R\343\177\030"
+	.ascii	"!\346\253\203[`\253\304\260\241\342+\3551\032\031\247!T\3138"
+	.ascii	"\005v\355R9\3631\273\216\204\353\270p\241\325\216\302}\202\312"
+	.ascii	"'k\327&\2130\2756\277\242\306\2326Y8}\021'G\271\341\357\212\023"
+	.ascii	"\206\227PI\355\271\300>s\360\200\366\2615\301\363\271\242\370"
+	.ascii	"\257vq>\217\004\317g\030\005'\271\250\350z\231\\P5\3518]\034"
+	.ascii	"\241\035\327\3501xZ\267\364\336\337!'5\3622\365g\232\347mtB\002"
+	.ascii	"\310\313\030$\241\034\256\323 tN\216\034y3d\213\340\256\245="
+	.ascii	"K;\226\207\0239wH\361\323\024\027\202w&\nCL\320\360\2460<C\206"
+	.ascii	"\333\220\367\305X\277\201,\370)\214]\232\316G\025U\312\272<\271"
+	.ascii	"J\375\034\313\220\2254\316\016\026\f\231\024\372\366\376\001"
+	.ascii	"\326q\325\247\006\017\371\365h\201\273\240\370X,\276\371\021"
+	.ascii	"\024\356\356\307\256\240\273\002\371q\356\356-w7N\356\236\305"
+	.ascii	"7?i\356\306a\301\233\357\261\336&\226{\233$o\253\361\315}\364"
+	.ascii	"\206\271p\257\360\352-\263\230\376s\250\032\241\027\356u\357"
+	.ascii	"\266\210\374\212\206\377\202M\240\246\r\002\024\207ZB\277\355"
+	.ascii	"u`\204)\001zPX\257\201\236\347\316\342\204\261\337$\2146n_-\262"
+	.ascii	"\310&lk\2370\213\036,f\321\364B\026\275M.Vd\321\332\260Y\224"
+	.ascii	"MB\177\016\311\242\031\312\"\276\372\376\321\326\337\0131\361"
+	.ascii	"\217\024\323}\252\366K&>\206\037\006?\206aV\250\2314jb\232\275"
+	.ascii	"\307&\025\355\323\030=\212/c\230C\002\0266\236\264XA\330\024"
+	.ascii	"\315da#\256\307\352D\025(\270\023\327\363\352\312\301\246\220"
+	.ascii	"\r\273\201\252\t\375\365\351\333\327\244\210\002C\345u\236\027"
+	.ascii	"\346\202\342\366\317\372R\026\213}\360\377\221\016\024Pe:#V\034"
+	.ascii	"\b\301\302i\301\302\231'\2079\020\220\030'\375@\3770\351*\242"
+	.ascii	"\005\300\347|p\346\270\262\223=\256\367\234Y9\351)\357\247fd"
+	.ascii	"8\335\233S\263\222].\247'{K\326_\0223S\263\323R\2359\331.g\232"
+	.ascii	"\313\225\230\231\234\236\225\221\236\225\352\314H\177'\315\235"
+	.ascii	"\271\331\351I\315\361\344lI\367\004-\0167r\023]\t\tN\207c\231"
+	.ascii	"\215\254\211.wf\246;\313\231\341voNLKLr\270$\213\003!\233R\222"
+	.ascii	"=\311\233\322\225\364m\216\215:wVJ\352V\226\234\375\201\305\341"
+	.ascii	"z\017El\312r;~\023\241=g&os\374\332\340\310\361\244nv\374\312"
+	.ascii	"\344p\277\373nN\252\307\361[\223\343\035\367\226\254\024G\322"
+	.ascii	"F\363\202-)IN\337\006i\360"
+	.ascii	">l\215\315toul\371\345\353\361\366\215\257\331\355.Orz\212c\333"
+	.ascii	"\033\022\373]\266\024\263\344\225\007_|\360F\030#\242\377\007"
+	.ascii	"!Kr\335"
+	.text
+	.section	.gnu.offload_lto_.symbol_nodes.f031cb8759bb7418,"e",@progbits
+	.string	"x\234ce``\320\003b\006&\236z\006\206\t\347\030\030\200\324\212\205\013\0170300\362\3263\202\205\030\030\032\032\024\030\030\230\031\030\031\216\264\277\231\317\301"
+	.string	""
+	.ascii	"\004N\0139"
+	.text
+	.section	.gnu.offload_lto_.refs.f031cb8759bb7418,"e",@progbits
+	.string	"x\234ce```\004b\006"
+	.string	""
+	.string	";"
+	.ascii	"\007"
+	.text
+	.section	.gnu.offload_lto_.offload_table.f031cb8759bb7418,"e",@progbits
+	.string	"x\234ce```\006bF\006\006"
+	.string	""
+	.string	"Z"
+	.ascii	"\n"
+	.text
+	.section	.gnu.offload_lto_.decls.f031cb8759bb7418,"e",@progbits
+	.ascii	"x\234\215T\337O\333U\024\377\236\336oa\226\2262@C\f\017d!\031"
+	.ascii	"\311\322v\350\037\240\017>\360\270\355\3057I\375\322\261F\370"
+	.ascii	"\226\264_4{\362\322\221XA\035L\030Jp\351\346F\221!k\351X\367"
+	.ascii	"\013\2500\030l\300\306&\242\213 \272\200/\023\331d\262lq\365"
+	.ascii	"\334{\373\205\002\242\336\344\334~\317\271\237\3639\347\334{"
+	.ascii	"N\215\222X?\247I\322\f\376~\206\222\300e\300\337\003\322\372"
+	.ascii	"bz\030%\"m]\314nM\342\263QrQ\206Q\366\243\354C)A\031B\351G\351"
+	.ascii	"C\031E\331\2152\2012\216r/\311?\200\022\377\217XM\004\306\347"
+	.ascii	"\342\2473\346\340G@\025\330\302}#\b\222E\360\005\034\245\327"
+	.ascii	" \255A7;I lI\360\026\312T\302u\216d|\236\213\b\260\201\017\203"
+	.ascii	"nJ\006\326\230ak\246\234\b\364L6\227$ID\277\003\332\031x\f?\301"
+	.ascii	",p\255g52b&\030\t\277\037?j9m%;\371\367\330\350\261\363f\331"
+	.ascii	" A;\020\341\367p\242\365~f=\201}&(@\365\344\374\303f\006\220"
+	.ascii	"\240\003\3629\340\346\344\314\220\221Y$\370\ndfy\336\031\355"
+	.ascii	"5\350\030\350\024\306\225\313\253\263F\262\233GI4v\f\247\261"
+	.ascii	"(]@\340,\310\020\022\230\356\256ga\023w\334o\206\267\270\351"
+	.ascii	"\367\225\346\017w\344\035\247\270\226kh\026\300\201\227\340\325"
+	.ascii	"|\330\213\261\341\034\344\341\336\215\276<\221\341\253\023\027"
+	.ascii	"\255\273\2024\036\244\363\261O\215\237\237\210\323#\224\326\310"
+	.ascii	"\f\007gx=\3605\006\343\237\360%\314\341\265\205a\027z\266/\236"
+	.ascii	"\250I\253#$\232&A\204[Z\257\254\3340b\325\331&\330\201j"
+	.string	"l\351l\314\202j\256\t\342,T\355\365\2431\003\271\004\274\232?f\273\226\254d@(w\226\307\277\310\020\216Y\250^X\034\0331\tG\n\250?\273\037\353M'\023\002{t\241\277\305\"\260\r\324\200\206\245\310jM\232)\300\212\r\372\351\316\232bv\024\305;\354aw\r\227\301\204\273#\037.\241\212\360\217>Y>'\013\377k\224\245\331\026\231~n\020\321\202\265<\321\370J\364\003\310n`\214\270\025\310\354\031\257\"\341\025N\330\007\331:-\007O\266\005\300\\G\251_\017\217\200\224\370\375`\026\315$7O\306#2\024\026\346\024\235b\350\371\330\020\t\371i\340H:\224\230\341<>\3527\220\263\346\210\031\243s1n\022G\007\027\032\200\3762\017\214\017.\376#8\312\342I\215\224\006h\330O_\317\223 \206\347\027\330\371 \0241\007\004\r@!\304\241\2207x;\253\220\277:\215gq\302^ )\377"
+	.ascii	"\277u\207Z2r\222\204\301\377A\210>\017\236\034{\323R\237r\033"
+	.ascii	"C)yJp\r,I\340\355\216!j\325\221\376m\2210\f\026DO\325\206\333"
+	.ascii	"\210x59e\276r\371|\221\256xS\263\025L\371\200\336\300;\241\301"
+	.ascii	"O_(\221\340:\032F8\335(o\203\033\250\217\241\336\212%\321\345"
+	.ascii	"D\342\375D\"\235\037q\003m<\365F\240\361en`\333\206\341\302\036"
+	.ascii	"\330f\270\360D\037\256q>\\7\365\341\252\3778\324\277\2617\020"
+	.ascii	"\220R\343\204\350\r\366\313F1\261\032\267\212\032Y_\327>8^g\331"
+	.ascii	"\276\257o\351}=)\372z\360Q8d\026\336\204ME\355\302|\206\270\241"
+	.ascii	"<T'Cg\232LdJ\214\320\364\223\276\247fr\227+\362X\357\323\036"
+	.ascii	"\302\357\256H\277\274\345\223\224\340c\213\242\340;|\352oy\260"
+	.ascii	"i\f>\243\337\036E2fn\027c\322 KL\205\273\311\016\202;\354C\334"
+	.ascii	"\363k\034\310ns\365\327\2772\327\007\213\023\300\024\262\336"
+	.ascii	"\346%}\317'\353\226>Y-=\177\006eS2\243\300\277f\204\350\305p"
+	.ascii	"t\320\222\312\375\303\266\334p\017\225\314J\247[\265\227z*\253"
+	.ascii	"J\017\252\366\275/\342G\201\346\364\226\273\264\002\227\252y"
+	.ascii	"\017Wy\334\252f\3618\025\245\340`\265\252hn\217\352\007\207O"
+	.ascii	"\361:5\345\220C\365\271\313\336sUT8<U.\0251\016\315[\255\276"
+	.ascii	"c\253t!\201\303\347U\034\345\212bc\021*\334\252\313Q\341~\273"
+	.ascii	"\034\371\035\232\313\247\371\252\335\332\232\305\316\370m\312"
+	.ascii	"\236=\016\273}\223\215Ym\212\247\262\322\243:*<\236*[\271\255"
+	.ascii	"\330\256HF\217Z\346z\327bgy\22795g\251f\177\2058\275\207A\221"
+	.ascii	"\225CN\357\337(\266/\210"
+	.text
+	.section	.gnu.offload_lto_.symtab.f031cb8759bb7418,"e",@progbits
+	.text
+	.section	.gnu.offload_lto_.opts,"e",@progbits
+	.string	"'-fexceptions' '-fmath-errno' '-fsigned-zeros' '-ftrapping-math' '-fno-trapv' '-fno-strict-overflow' '-fno-openmp' '-foffload-abi=lp64' '-fopenacc'"
+	.text
+	.section	.gnu.offload_lto_.mode_table.f031cb8759bb7418,"e",@progbits
+	.string	"x\234ce\200"
+	.string	"e \026"
+	.string	"\342\376#\r\035\r{:\004&\2664-h8\322\0210\251\245\345@\303\211\216\t\314\223[:\032\032\317t\\`f`\016\364d`\016\006b\027 \016\361d"
+	.string	""
+	.ascii	"\225\020\024\253"
+	.text
+	.section	.rodata
+.LC0:
+	.string	"ary[%d]=%x expected %x\n"
+	.text
+	.globl	main
+	.type	main, @function
+main:
+.LFB11:
+	.cfi_startproc
+	pushq	%rbp
+	.cfi_def_cfa_offset 16
+	.cfi_offset 6, -16
+	movq	%rsp, %rbp
+	.cfi_def_cfa_register 6
+	subq	$131216, %rsp
+	movl	$0, -8(%rbp)
+	movl	$0, -131204(%rbp)
+	movl	$0, -4(%rbp)
+.L3:
+	cmpl	$32784, -4(%rbp)
+	jg	.L2
+	movl	-4(%rbp), %eax
+	cltq
+	movl	$-1, -131200(%rbp,%rax,4)
+	addl	$1, -4(%rbp)
+	jmp	.L3
+.L2:
+	leaq	-131204(%rbp), %rax
+	movq	%rax, -48(%rbp)
+	leaq	-131200(%rbp), %rax
+	movq	%rax, -40(%rbp)
+	leaq	-48(%rbp), %rax
+	subq	$8, %rsp
+	pushq	$0
+	movl	$_ZZ4mainE17.omp_data_kinds.5, %r9d
+	movl	$_ZZ4mainE17.omp_data_sizes.4, %r8d
+	movq	%rax, %rcx
+	movl	$2, %edx
+	movl	$main._omp_fn.0, %esi
+	movl	$-1, %edi
+	movl	$0, %eax
+	call	GOACC_parallel_keyed
+	addq	$16, %rsp
+	movl	$0, -4(%rbp)
+.L7:
+	cmpl	$32784, -4(%rbp)
+	jg	.L4
+	movl	-4(%rbp), %eax
+	movl	%eax, -12(%rbp)
+	movl	-131204(%rbp), %eax
+	testl	%eax, %eax
+	je	.L5
+	movl	-4(%rbp), %eax
+	movslq	%eax, %rdx
+	imulq	$2145388543, %rdx, %rdx
+	shrq	$32, %rdx
+	sarl	$9, %edx
+	sarl	$31, %eax
+	subl	%eax, %edx
+	movl	%edx, %eax
+	movl	%eax, -16(%rbp)
+	movl	$0, -20(%rbp)
+	movl	$0, -24(%rbp)
+	movl	-16(%rbp), %eax
+	sall	$16, %eax
+	movl	%eax, %edx
+	movl	-20(%rbp), %eax
+	sall	$8, %eax
+	orl	%edx, %eax
+	orl	-24(%rbp), %eax
+	movl	%eax, -12(%rbp)
+.L5:
+	movl	-4(%rbp), %eax
+	cltq
+	movl	-131200(%rbp,%rax,4), %eax
+	cmpl	-12(%rbp), %eax
+	je	.L6
+	movl	$1, -8(%rbp)
+	movl	-4(%rbp), %eax
+	cltq
+	movl	-131200(%rbp,%rax,4), %edx
+	movl	-12(%rbp), %ecx
+	movl	-4(%rbp), %eax
+	movl	%eax, %esi
+	movl	$.LC0, %edi
+	movl	$0, %eax
+	call	printf
+.L6:
+	addl	$1, -4(%rbp)
+	jmp	.L7
+.L4:
+	movl	-8(%rbp), %eax
+	leave
+	.cfi_def_cfa 7, 8
+	ret
+	.cfi_endproc
+.LFE11:
+	.size	main, .-main
+	.type	main._omp_fn.0, @function
+main._omp_fn.0:
+.LFB12:
+	.cfi_startproc
+	pushq	%rbp
+	.cfi_def_cfa_offset 16
+	.cfi_offset 6, -16
+	movq	%rsp, %rbp
+	.cfi_def_cfa_register 6
+	pushq	%r15
+	pushq	%r14
+	pushq	%r13
+	pushq	%r12
+	pushq	%rbx
+	subq	$40, %rsp
+	.cfi_offset 15, -24
+	.cfi_offset 14, -32
+	.cfi_offset 13, -40
+	.cfi_offset 12, -48
+	.cfi_offset 3, -56
+	movq	%rdi, -72(%rbp)
+	movl	$0, %r12d
+	movl	$1, %r14d
+	movl	$1, %eax
+	movl	%eax, %r15d
+.L15:
+	movl	$0, %eax
+	movl	%eax, %ebx
+	movl	$32785, %r13d
+	cmpl	%r13d, %ebx
+	jge	.L10
+.L13:
+	movl	%ebx, %eax
+	movl	%eax, -52(%rbp)
+	movl	$5, %edi
+	call	acc_on_device
+	testl	%eax, %eax
+	jne	.L11
+	jmp	.L16
+.L14:
+	addl	%r15d, %ebx
+	cmpl	%r13d, %ebx
+	jl	.L13
+	jmp	.L10
+.L16:
+	movl	-52(%rbp), %ecx
+	movq	-72(%rbp), %rax
+	movq	8(%rax), %rax
+	movl	-52(%rbp), %edx
+	movl	%ecx, (%rax,%rdx,4)
+	jmp	.L14
+.L11:
+	movl	$0, -56(%rbp)
+	movl	$0, -60(%rbp)
+	movl	$0, -64(%rbp)
+#APP
+# 26 "/scratch/nsidwell/openacc/trunk-merge/src/gcc-mainline/libgomp/testsuite/libgomp.oacc-c++/../libgomp.oacc-c-c++-common/loop-g-1.c" 1
+	mov.u32 %eax,%ctaid.x;
+# 0 "" 2
+#NO_APP
+	movl	%eax, -56(%rbp)
+#APP
+# 27 "/scratch/nsidwell/openacc/trunk-merge/src/gcc-mainline/libgomp/testsuite/libgomp.oacc-c++/../libgomp.oacc-c-c++-common/loop-g-1.c" 1
+	mov.u32 %eax,%tid.y;
+# 0 "" 2
+#NO_APP
+	movl	%eax, -60(%rbp)
+#APP
+# 28 "/scratch/nsidwell/openacc/trunk-merge/src/gcc-mainline/libgomp/testsuite/libgomp.oacc-c++/../libgomp.oacc-c-c++-common/loop-g-1.c" 1
+	mov.u32 %eax,%tid.x;
+# 0 "" 2
+#NO_APP
+	movl	%eax, -64(%rbp)
+	movl	-56(%rbp), %eax
+	sall	$16, %eax
+	movl	%eax, %edx
+	movl	-60(%rbp), %eax
+	sall	$8, %eax
+	orl	%edx, %eax
+	orl	-64(%rbp), %eax
+	movl	%eax, %ecx
+	movq	-72(%rbp), %rax
+	movq	8(%rax), %rax
+	movl	-52(%rbp), %edx
+	movl	%ecx, (%rax,%rdx,4)
+	movq	-72(%rbp), %rax
+	movq	(%rax), %rax
+	movl	$1, (%rax)
+	jmp	.L14
+.L10:
+	addl	$1, %r12d
+	cmpl	%r14d, %r12d
+	jl	.L15
+	movl	$32785, -52(%rbp)
+	addq	$40, %rsp
+	popq	%rbx
+	popq	%r12
+	popq	%r13
+	popq	%r14
+	popq	%r15
+	popq	%rbp
+	.cfi_def_cfa 7, 8
+	ret
+	.cfi_endproc
+.LFE12:
+	.size	main._omp_fn.0, .-main._omp_fn.0
+	.data
+	.align 16
+	.type	_ZZ4mainE17.omp_data_sizes.4, @object
+	.size	_ZZ4mainE17.omp_data_sizes.4, 16
+_ZZ4mainE17.omp_data_sizes.4:
+	.quad	4
+	.quad	131140
+	.align 2
+	.type	_ZZ4mainE17.omp_data_kinds.5, @object
+	.size	_ZZ4mainE17.omp_data_kinds.5, 4
+_ZZ4mainE17.omp_data_kinds.5:
+	.value	643
+	.value	643
+	.section	.gnu.offload_vars,"aw",@progbits
+	.align 8
+	.type	.offload_var_table, @object
+	.size	.offload_var_table, 0
+.offload_var_table:
+	.section	.gnu.offload_funcs,"aw",@progbits
+	.align 8
+	.type	.offload_func_table, @object
+	.size	.offload_func_table, 8
+.offload_func_table:
+	.quad	main._omp_fn.0
+	.comm	__gnu_lto_v1,1,1
+	.ident	"GCC: (Sourcery CodeBench (OpenACC/PTX) Lite 2016.05-999999) 6.0.0 20151019 (experimental)"
+	.section	.note.GNU-stack,"",@progbits
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang (static:1)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix % 32;
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c	(working copy)
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int chunk_size = (N + 32*32*32 - 1) / (32*32*32);
+	  
+	  int g = ix / (chunk_size * 32 * 32);
+	  int w = ix / 32 % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = (ix / 32) % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-21 19:59 ` [OpenACC 11/11] execution tests Nathan Sidwell
@ 2015-10-21 20:15   ` Ilya Verbin
  2015-10-21 20:17     ` Nathan Sidwell
  2015-10-22  9:54   ` Jakub Jelinek
  1 sibling, 1 reply; 120+ messages in thread
From: Ilya Verbin @ 2015-10-21 20:15 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: GCC Patches, Jakub Jelinek, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers



> On 21 Oct 2015, at 22:53, Nathan Sidwell <nathan@acm.org> wrote:
> 
> This patch has some new execution tests, verifying loop partitioning is behaving as expected.
> 
> There are more execution tests on the gomp4 branch, but many of them use reductions.  We'll merge those once reductions are merged.
> 
> nathan
> <11-trunk-tests.patch>

Does the testcase with offload IR appear here accidentally?

  -- Ilya

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-21 20:15   ` Ilya Verbin
@ 2015-10-21 20:17     ` Nathan Sidwell
  2015-10-28 14:30       ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-21 20:17 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: GCC Patches, Jakub Jelinek, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 157 bytes --]

On 10/21/15 16:14, Ilya Verbin wrote:

>> <11-trunk-tests.patch>
>
> Does the testcase with offload IR appear here accidentally?

D'oh!  yup, fixed.

nathan

[-- Attachment #2: 11-trunk-tests.patch --]
[-- Type: text/x-patch, Size: 8862 bytes --]

2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: New.

Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix / ((N + 31) / 32);
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang (static:1)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix % 32;
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c	(working copy)
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int chunk_size = (N + 32*32*32 - 1) / (32*32*32);
+	  
+	  int g = ix / (chunk_size * 32 * 32);
+	  int w = ix / 32 % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = (ix / 32) % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
@ 2015-10-22  7:49   ` Richard Biener
  2015-10-22  7:55     ` Richard Biener
  2015-10-22  8:05   ` Jakub Jelinek
  1 sibling, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-22  7:49 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: GCC Patches, Jakub Jelinek, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Wed, Oct 21, 2015 at 9:00 PM, Nathan Sidwell <nathan@acm.org> wrote:
> This patch implements a new internal function that has a 'uniqueness'
> property.   Jump-threading cannot clone it and tail-merging cannot combine
> multiple instances.
>
> The uniqueness is implemented by a new gimple fn,
> gimple_call_internal_unique_p.  Routines that check for identical or
> cloneable calls are augmented to check this property.  These are:
>
> * tree-ssa-threadedge, which is figuring out if jump threading is a win.
> Jump threading is inhibited.
>
> * gimple_call_same_target_p, used for tail merging and similar transforms.
> Two calls of IFN_UNIQUE will never be  the same target.
>
> * tracer.c, which is determining whether to clone a region.
>
> Interestingly jump threading avoids cloning volatile asms (which it admits
> is conservatively safe), but the tracer does not. I wonder if there's a
> latent problem in tracer?
>
> The reason I needed a function with this property is to  preserve the
> looping structure of a function's CFG.  As mentioned in the intro, we mark
> up loops (using this builtin), so the example I gave has the following
> inserts:
>
> #pragma acc parallel ...
> {
>  // single mode here
> #pragma acc loop ...
> IFN_UNIQUE (FORKING  ...)
> for (i = 0; i < N; i++) // loop 1
>   ... // partitioned mode here
> IFN_UNIQUE (JOINING ...)
>
> if (expr) // single mode here
> #pragma acc loop ...
>   IFN_UNIQUE (FORKING ...)
>   for (i = 0; i < N; i++) // loop 2
>     ... // partitioned mode here
>   IFN_UNIQUE (JOINING ...)
> }
>
> The properly nested loop property of the CFG is preserved through the
> compilation.  This is important as (a) it allows later passes to reconstruct
> this looping structure and (b) hardware constraints require a partioned
> region end for all partitioned threads at a single instruction.
>
> Until I added this unique property, original bring-up  of partitioned
> execution would hit cases of split loops ending in multiple cloned JOINING
> markers and similar cases.
>
> To distinguish different uses of the UNIQUE function, I use the first
> argument, which is expected to be an INTEGER_CST.  I figured this better
> than using multiple new internal fns, all with the unique property, as the
> latter would need (at least) a range check in gimple_call_internal_unique_p
> rather than a simple equality.
>
> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> fns.  This replaces that scheme.
>
> ok?

Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
should have the
desired effects.

Richard.

> nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  7:49   ` Richard Biener
@ 2015-10-22  7:55     ` Richard Biener
  2015-10-22  8:04       ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-22  7:55 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: GCC Patches, Jakub Jelinek, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 9:48 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Oct 21, 2015 at 9:00 PM, Nathan Sidwell <nathan@acm.org> wrote:
>> This patch implements a new internal function that has a 'uniqueness'
>> property.   Jump-threading cannot clone it and tail-merging cannot combine
>> multiple instances.
>>
>> The uniqueness is implemented by a new gimple fn,
>> gimple_call_internal_unique_p.  Routines that check for identical or
>> cloneable calls are augmented to check this property.  These are:
>>
>> * tree-ssa-threadedge, which is figuring out if jump threading is a win.
>> Jump threading is inhibited.
>>
>> * gimple_call_same_target_p, used for tail merging and similar transforms.
>> Two calls of IFN_UNIQUE will never be  the same target.
>>
>> * tracer.c, which is determining whether to clone a region.
>>
>> Interestingly jump threading avoids cloning volatile asms (which it admits
>> is conservatively safe), but the tracer does not. I wonder if there's a
>> latent problem in tracer?
>>
>> The reason I needed a function with this property is to  preserve the
>> looping structure of a function's CFG.  As mentioned in the intro, we mark
>> up loops (using this builtin), so the example I gave has the following
>> inserts:
>>
>> #pragma acc parallel ...
>> {
>>  // single mode here
>> #pragma acc loop ...
>> IFN_UNIQUE (FORKING  ...)
>> for (i = 0; i < N; i++) // loop 1
>>   ... // partitioned mode here
>> IFN_UNIQUE (JOINING ...)
>>
>> if (expr) // single mode here
>> #pragma acc loop ...
>>   IFN_UNIQUE (FORKING ...)
>>   for (i = 0; i < N; i++) // loop 2
>>     ... // partitioned mode here
>>   IFN_UNIQUE (JOINING ...)
>> }
>>
>> The properly nested loop property of the CFG is preserved through the
>> compilation.  This is important as (a) it allows later passes to reconstruct
>> this looping structure and (b) hardware constraints require a partioned
>> region end for all partitioned threads at a single instruction.
>>
>> Until I added this unique property, original bring-up  of partitioned
>> execution would hit cases of split loops ending in multiple cloned JOINING
>> markers and similar cases.
>>
>> To distinguish different uses of the UNIQUE function, I use the first
>> argument, which is expected to be an INTEGER_CST.  I figured this better
>> than using multiple new internal fns, all with the unique property, as the
>> latter would need (at least) a range check in gimple_call_internal_unique_p
>> rather than a simple equality.
>>
>> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> fns.  This replaces that scheme.
>>
>> ok?
>
> Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
> should have the
> desired effects.

That is, whatever new IFNs you need are ok, but special-casing them is not
necessary if you properly mark the calls as volatile.

Richard.

> Richard.
>
>> nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  7:55     ` Richard Biener
@ 2015-10-22  8:04       ` Jakub Jelinek
  2015-10-22  8:07         ` Richard Biener
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:04 UTC (permalink / raw)
  To: Richard Biener
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:
> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> >> fns.  This replaces that scheme.
> >>
> >> ok?
> >
> > Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
> > should have the
> > desired effects.
> 
> That is, whatever new IFNs you need are ok, but special-casing them is not
> necessary if you properly mark the calls as volatile.

I don't see gimple_has_volatile_ops used in tracer.c or
tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those IFNs is
fine, but I think they are even stronger than that.
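
(For concreteness, that suggestion amounts to something like the sketch
below at the point the marker call is built.  This is not part of the
posted patch; gimple_build_call_internal and gimple_set_has_volatile_ops
are existing GIMPLE APIs, while the gsi iterator and the IFN_UNIQUE_UNSPEC
argument are simply assumed for illustration.)

  /* Sketch only: emit an IFN_UNIQUE marker and flag it volatile, so that
     passes which already respect gimple_has_volatile_ops leave the call
     alone without any IFN-specific checks.  */
  tree kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_UNSPEC);
  gcall *call = gimple_build_call_internal (IFN_UNIQUE, 1, kind);
  gimple_set_has_volatile_ops (call, true);
  gsi_insert_before (&gsi, call, GSI_SAME_STMT);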

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
  2015-10-22  7:49   ` Richard Biener
@ 2015-10-22  8:05   ` Jakub Jelinek
  2015-10-22  8:12     ` Richard Biener
  2015-10-22 20:25     ` Nathan Sidwell
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:05 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:00:47PM -0400, Nathan Sidwell wrote:
> To distinguish different uses of the UNIQUE function, I use the first
> argument, which is expected to be an INTEGER_CST.  I figured this better
> than using multiple new internal fns, all with the unique property, as the
> latter would need (at least) a range check in gimple_call_internal_unique_p
> rather than a simple equality.
> 
> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> fns.  This replaces that scheme.
> 
> ok?
> 
> nathan

> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 	    Cesar Philippidis  <cesar@codesourcery.com>
> 	
> 	* internal-fn.c (expand_UNIQUE): New.
> 	* internal-fn.def (IFN_UNIQUE): New.
> 	(IFN_UNIQUE_UNSPEC): Define.
> 	* gimple.h (gimple_call_internal_unique_p): New.
> 	* gimple.c (gimple_call_same_target_p): Check internal fn
> 	uniqueness.
> 	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
> 	* tree-ssa-threadedge.c
> 	(record_temporary_equivalences_from_stmts): Likewise.

This is generally fine with me, but please work with Richi to find
something acceptable to him too.

> +DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)

Are you sure about the ECF_LEAF?  I mean, while the function can't
call back to your code, I'd expect you want it as kind of strong
optimization barrier too.

> +#define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
> Index: tracer.c
> ===================================================================
> --- tracer.c	(revision 229096)
> +++ tracer.c	(working copy)
> @@ -93,6 +93,7 @@ bb_seen_p (basic_block bb)
>  static bool
>  ignore_bb_p (const_basic_block bb)
>  {
> +  gimple_stmt_iterator gsi;
>    gimple *g;
>  
>    if (bb->index < NUM_FIXED_BLOCKS)
> @@ -106,6 +107,17 @@ ignore_bb_p (const_basic_block bb)
>    if (g && gimple_code (g) == GIMPLE_TRANSACTION)
>      return true;
>  
> +  /* Ignore blocks containing non-clonable function calls.  */
> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
> +       !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      g = gsi_stmt (gsi);
> +
> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
> +	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
> +	return true;
> +    }

Do you have to scan the whole bb?  E.g. shouldn't those unique IFNs
force the end of a bb?

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:04       ` Jakub Jelinek
@ 2015-10-22  8:07         ` Richard Biener
  2015-10-22 11:42           ` Julian Brown
  0 siblings, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-22  8:07 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 9:59 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:
>> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> >> fns.  This replaces that scheme.
>> >>
>> >> ok?
>> >
>> > Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
>> > should have the
>> > desired effects.
>>
>> That is, whatever new IFNs you need are ok, but special-casing them is not
>> necessary if you properly mark the calls as volatile.
>
> I don't see gimple_has_volatile_ops used in tracer.c or
> tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those IFNs is
> fine, but I think they are even stronger than that.

Hmm, indeed.  Now I fail to see how the implemented property "preserves
the CFG looping structure".  And I would have expected can_copy_bbs_p
to be adjusted instead (catching more cases and the threading and tracer
case as well).

As far as I can see nothing would prevent dissolving the loop by completely
unrolling it, for example.  Or deleting it because it has no side-effects.

So you'd need to be more precise as to what properties you are trying to
preserve by placing a single stmt somewhere.

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:05   ` Jakub Jelinek
@ 2015-10-22  8:12     ` Richard Biener
  2015-10-22 13:08       ` Nathan Sidwell
                         ` (2 more replies)
  2015-10-22 20:25     ` Nathan Sidwell
  1 sibling, 3 replies; 120+ messages in thread
From: Richard Biener @ 2015-10-22  8:12 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 10:04 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Oct 21, 2015 at 03:00:47PM -0400, Nathan Sidwell wrote:
>> To distinguish different uses of the UNIQUE function, I use the first
>> argument, which is expected to be an INTEGER_CST.  I figured this better
>> than using multiple new internal fns, all with the unique property, as the
>> latter would need (at least) a range check in gimple_call_internal_unique_p
>> rather than a simple equality.
>>
>> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> fns.  This replaces that scheme.
>>
>> ok?
>>
>> nathan
>
>> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
>>           Cesar Philippidis  <cesar@codesourcery.com>
>>
>>       * internal-fn.c (expand_UNIQUE): New.
>>       * internal-fn.def (IFN_UNIQUE): New.
>>       (IFN_UNIQUE_UNSPEC): Define.
>>       * gimple.h (gimple_call_internal_unique_p): New.
>>       * gimple.c (gimple_call_same_target_p): Check internal fn
>>       uniqueness.
>>       * tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
>>       * tree-ssa-threadedge.c
>>       (record_temporary_equivalences_from_stmts): Likewise.
>
> This is generally fine with me, but please work with Richi to find
> something acceptable to him too.
>
>> +DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)
>
> Are you sure about the ECF_LEAF?  I mean, while the function can't
> call back to your code, I'd expect you want it as kind of strong
> optimization barrier too.
>
>> +#define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
>> Index: tracer.c
>> ===================================================================
>> --- tracer.c  (revision 229096)
>> +++ tracer.c  (working copy)
>> @@ -93,6 +93,7 @@ bb_seen_p (basic_block bb)
>>  static bool
>>  ignore_bb_p (const_basic_block bb)
>>  {
>> +  gimple_stmt_iterator gsi;
>>    gimple *g;
>>
>>    if (bb->index < NUM_FIXED_BLOCKS)
>> @@ -106,6 +107,17 @@ ignore_bb_p (const_basic_block bb)
>>    if (g && gimple_code (g) == GIMPLE_TRANSACTION)
>>      return true;
>>
>> +  /* Ignore blocks containing non-clonable function calls.  */
>> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
>> +       !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      g = gsi_stmt (gsi);
>> +
>> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
>> +       && gimple_call_internal_unique_p (as_a <gcall *> (g)))
>> +     return true;
>> +    }
>
> Do you have to scan the whole bb?  E.g. shouldn't those unique IFNs
> force the end of a bb?

Yeah, please make them either end or start a BB so we have to check
at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
it also makes it a code motion barrier.
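
With that, the tracer.c check shrinks to looking only at the block's last
statement.  A sketch under that assumption (not the posted patch), reusing
the G that ignore_bb_p already gets from last_stmt above:

  if (g
      && is_gimple_call (g)
      && gimple_call_internal_p (g)
      && gimple_call_internal_unique_p (as_a <gcall *> (g)))
    return true;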

Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-21 19:11 ` [OpenACC 2/11] PTX backend changes Nathan Sidwell
@ 2015-10-22  8:16   ` Jakub Jelinek
  2015-10-22  9:58     ` Bernd Schmidt
  2015-10-22 14:05   ` Bernd Schmidt
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:16 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:09:55PM -0400, Nathan Sidwell wrote:
> Bernd, any comments?

Just a few questions, otherwise it is a PTX territory you PTX maintainers
should review.

> 	(*oacc_ntid_insn,  oacc_ntid, *oacc_tid_insn, oacc_tid): Delete.

Extra space.

> +/* Size of buffer needed to broadcast across workers.  This is used
> +   for both worker-neutering and worker broadcasting.   It is shared
> +   by all functions emitted.  The buffer is placed in shared memory.
> +   It'd be nice if PTX supported common blocks, because then this
> +   could be shared across TUs (taking the largest size).  */
> +static unsigned worker_bcast_hwm;

As discussed in another thread for another patch, is hwm the best acronym
here?  If it is the size, then why not worker_bcast_size?

> @@ -2129,6 +3242,19 @@ nvptx_file_end (void)
>    FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
>      nvptx_record_fndecl (decl, true);
>    fputs (func_decls.str().c_str(), asm_out_file);
> +
> +  if (worker_bcast_hwm)
> +    {
> +      /* Define the broadcast buffer.  */
> +
> +      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
> +	& ~(worker_bcast_align - 1);
> +      
> +      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
> +      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
> +	       worker_bcast_align,
> +	       worker_bcast_name, worker_bcast_hwm);
> +    }

So, is the worker broadcast buffer effectively a file scope .shared
variable?  My worry is that as .shared is quite limited resource, if you
compile many TUs and each allocates its own broadcast buffer you run out of
shared memory.  Is there any way how to share the broadcast buffers in
between different TUs (other than LTO)?

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 3/11] new target hook
  2015-10-21 19:16 ` [OpenACC 3/11] new target hook Nathan Sidwell
@ 2015-10-22  8:23   ` Jakub Jelinek
  2015-10-22 13:17     ` Nathan Sidwell
  2015-10-27 22:15     ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:23 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:13:26PM -0400, Nathan Sidwell wrote:
> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	* target.def (fork_join): New GOACC hook.
> 	* targhooks.h (default_goacc_fork_join): Declare.
> 	* omp-low.c (default_goacc_forkjoin): New.
> 	* doc/tm.texi.in (TARGET_GOACC_FORK_JOIN): Add.
> 	* doc/tm.texi: Regenerate.
> 	* config/nvptx/nvptx.c (nvptx_xform_fork_join): New.
> 	(TARGET_GOACC_FORK_JOIN): Override.

This is ok, with nits.

> --- config/nvptx/nvptx.c	(revision 229096)
> +++ config/nvptx/nvptx.c	(working copy)
> @@ -2146,7 +2146,26 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
>  
>    return changed;
>  }
> -\f
> +
> +/* Determine whether fork & joins are needed.  */
> +
> +static bool
> +nvptx_xform_fork_join (gcall *call, const int dims[],
> +		       bool ARG_UNUSED (is_fork))

Why is this not called nvptx_goacc_fork_join when that is the name of
the target hook?

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-21 19:19 ` [OpenACC 4/11] C " Nathan Sidwell
@ 2015-10-22  8:25   ` Jakub Jelinek
  2015-10-23 20:20     ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:25 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:16:20PM -0400, Nathan Sidwell wrote:
> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
> 	    Thomas Schwinge  <thomas@codesourcery.com>
> 	    James Norris  <jnorris@codesourcery.com>
> 	    Joseph Myers  <joseph@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 
> 	* c-parser.c (c_parser_oacc_shape_clause): New.
> 	(c_parser_oacc_simple_clause): New.
> 	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
> 	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.

Ok, with one nit.

>  /* OpenACC:
> +   gang [( gang_expr_list )]
> +   worker [( expression )]
> +   vector [( expression )] */
> +
> +static tree
> +c_parser_oacc_shape_clause (c_parser *parser, pragma_omp_clause c_kind,
> +			    const char *str, tree list)

I think it would be better to remove the c_kind argument and pass to this
function omp_clause_code kind instead.  The callers are already in a big
switch, with a separate call for each of the clauses.
After all, e.g. for c_parser_oacc_simple_clause you already do it that way
too.

> +{
> +  omp_clause_code kind;
> +  const char *id = "num";
> +
> +  switch (c_kind)
> +    {
> +    default:
> +      gcc_unreachable ();
> +    case PRAGMA_OACC_CLAUSE_GANG:
> +      kind = OMP_CLAUSE_GANG;
> +      break;
> +    case PRAGMA_OACC_CLAUSE_VECTOR:
> +      kind = OMP_CLAUSE_VECTOR;
> +      id = "length";
> +      break;
> +    case PRAGMA_OACC_CLAUSE_WORKER:
> +      kind = OMP_CLAUSE_WORKER;
> +      break;
> +    }

Then you can replace this switch with just if (kind == OMP_CLAUSE_VECTOR)
id = "length";

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 5/11] C++ FE changes
  2015-10-21 19:19 ` [OpenACC 5/11] C++ FE changes Nathan Sidwell
@ 2015-10-22  8:58   ` Jakub Jelinek
  2015-10-23 20:26     ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  8:58 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:18:55PM -0400, Nathan Sidwell wrote:
> This patch is the C++ changes matching the C ones of patch 4.  In
> finish_omp_clauses, the gang, worker, & vector clauses are handled the same
> as OpenMP's 'num_threads' clause.  One change to num_threads is the
> augmentation of a diagnostic to add %<...%>  markers to the clause name.

Indeed, lots of older OpenMP diagnostics are missing %<...%> markers around
keywords.  Something to fix eventually.

> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
> 	    Thomas Schwinge  <thomas@codesourcery.com>
> 	    James Norris  <jnorris@codesourcery.com>
> 	    Joseph Myers  <joseph@codesourcery.com>
> 	    Julian Brown  <julian@codesourcery.com>
> 	    Nathan Sidwell <nathan@codesourcery.com>
> 
> 	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
> 	vector, worker.
> 	(cp_parser_oacc_simple_clause): New.
> 	(cp_parser_oacc_shape_clause): New.

What I've said for the C FE patch, plus:

> +	  if (cp_lexer_next_token_is (lexer, CPP_NAME)
> +	      || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
> +	    {
> +	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
> +	      const char *p = IDENTIFIER_POINTER (name_kind);
> +	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)

As static is a keyword, wouldn't it be better to just handle that case
using cp_lexer_next_token_is_keyword (lexer, RID_STATIC)?

Also, what is the exact grammar of the shape arguments?
It would be nice to describe it; in the grammar comment you just say
expression, at least for vector/worker, which is clearly not accurate.

It seems the intent is that num: or length: or static: is optional, right?
But if that is the case, you should treat those as parsed only if followed
by :.  While static is a keyword (so you can't have a variable with that
name), vector(length) or vector(num) should not be rejected.
So, I would have expected that it should test if it is RID_STATIC
followed by CPP_COLON (and only in that case consume those tokens),
or CPP_NAME of id followed by CPP_COLON (and only in that case consume those
tokens), otherwise parse it as assignment expression.
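
I.e. roughly this shape of lookahead, committing to the prefix only when a
colon follows.  An illustrative sketch only (using the lexer/id locals from
the quoted hunks), not the actual patch:

  if (cp_lexer_next_token_is_keyword (lexer, RID_STATIC)
      && cp_lexer_peek_nth_token (lexer, 2)->type == CPP_COLON)
    {
      cp_lexer_consume_token (lexer);  /* static  */
      cp_lexer_consume_token (lexer);  /* :  */
      /* ... handle the gang(static: ...) argument ...  */
    }
  else if (cp_lexer_next_token_is (lexer, CPP_NAME)
           && cp_lexer_peek_nth_token (lexer, 2)->type == CPP_COLON
           && !strcmp (IDENTIFIER_POINTER
                       (cp_lexer_peek_token (lexer)->u.value), id))
    {
      cp_lexer_consume_token (lexer);  /* num or length  */
      cp_lexer_consume_token (lexer);  /* :  */
    }
  /* Whatever remains is parsed as an assignment expression.  */
  tree expr = cp_parser_assignment_expression (parser);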

The C FE may have a similar issue.  Plus of course there should be testsuite
coverage for all the weird cases.

> +	case OMP_CLAUSE_GANG:
> +	case OMP_CLAUSE_VECTOR:
> +	case OMP_CLAUSE_WORKER:
> +	  /* Operand 0 is the num: or length: argument.  */
> +	  t = OMP_CLAUSE_OPERAND (c, 0);
> +	  if (t == NULL_TREE)
> +	    break;
> +
> +	  t = maybe_convert_cond (t);

Can you explain the maybe_convert_cond calls (in both cases here,
plus the preexisting in OMP_CLAUSE_VECTOR_LENGTH)?
The reason why it is used for OpenMP if and final clauses is that those have
a condition argument, either the condition is zero or non-zero (so
effectively it is turned into a bool).
But aren't the gang/vector/worker/vector_length arguments integers rather
than conditions?  I'd expect that finish_omp_clauses should verify
those operands are indeed integral expressions (if that is the requirement
in the standard), as it is something that for C++ can't be verified during
parsing, if arbitrary expressions are parsed there.
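
I.e. something next to the existing num_threads handling along these lines;
just a sketch of the kind of check meant, not actual patch text:

  t = OMP_CLAUSE_OPERAND (c, 0);
  if (t != NULL_TREE
      && !type_dependent_expression_p (t)
      && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
    {
      error ("%qs expression must be integral",
             omp_clause_code_name[OMP_CLAUSE_CODE (c)]);
      remove = true;
    }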

> @@ -5959,32 +5990,58 @@ finish_omp_clauses (tree clauses, bool a
>  	  break;
>  
>  	case OMP_CLAUSE_NUM_THREADS:
> -	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
> -	  if (t == error_mark_node)
> -	    remove = true;
> -	  else if (!type_dependent_expression_p (t)
> -		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
> -	    {
> -	      error ("num_threads expression must be integral");
> -	      remove = true;
> -	    }
> -	  else
> -	    {
> -	      t = mark_rvalue_use (t);
> -	      if (!processing_template_decl)
> -		{
> -		  t = maybe_constant_value (t);
> -		  if (TREE_CODE (t) == INTEGER_CST
> -		      && tree_int_cst_sgn (t) != 1)
> -		    {
> -		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
> -				  "%<num_threads%> value must be positive");
> -		      t = integer_one_node;
> -		    }
> -		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
> -		}
> -	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
> -	    }
> +	case OMP_CLAUSE_NUM_GANGS:
> +	case OMP_CLAUSE_NUM_WORKERS:
> +	case OMP_CLAUSE_VECTOR_LENGTH:

If you are already merging some of the similar handling, please
handle OMP_CLAUSE_NUM_TEAMS and OMP_CLAUSE_NUM_TASKS the same way.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 6/11] Reduction initialization
  2015-10-21 19:32 ` [OpenACC 6/11] Reduction initialization Nathan Sidwell
@ 2015-10-22  9:11   ` Jakub Jelinek
  2015-10-27 22:27     ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:11 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:24:13PM -0400, Nathan Sidwell wrote:
> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	* omp-low.c (oacc_init_reduction_array): New.
> 	(oacc_initialize_reduction_data): Initialize array.

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-21 19:47 ` [OpenACC 7/11] execution model Nathan Sidwell
@ 2015-10-22  9:32   ` Jakub Jelinek
  2015-10-22 12:51     ` Nathan Sidwell
  2015-10-25 15:03     ` Nathan Sidwell
  2020-11-24 10:34   ` Thomas Schwinge
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:32 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:42:26PM -0400, Nathan Sidwell wrote:
> +/*  Flags for an OpenACC loop.  */
> +
> +enum oacc_loop_flags
> +  {

Weird formatting.  I see either
enum foobarbaz {
  e1 = ...,
  e2 = ...
};
or
enum foobarbaz
{
  e1 = ...,
  e2 = ...
};
styles being used heavily, but not this one.

> +    OLF_SEQ	= 1u << 0,  /* Explicitly sequential  */
> +    OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
> +    OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
> +    OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
> +
> +    /* Explicitly specified loop axes.  */
> +    OLF_DIM_BASE = 4,
> +    OLF_DIM_GANG   = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG),
> +    OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER),
> +    OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR),
> +
> +    OLF_MAX = OLF_DIM_BASE + GOMP_DIM_MAX
> +  };
> +

> +  if (checking)
> +    {
> +      if (has_seq && (this_mask || has_auto))
> +	error_at (gimple_location (stmt), "%<seq%> overrides other OpenACC loop specifiers");
> +      else if (has_auto && this_mask)
> +	error_at (gimple_location (stmt), "%<auto%> conflicts with other OpenACC loop specifiers");
> +
> +      if (this_mask & outer_mask)
> +	error_at (gimple_location (stmt), "inner loop uses same  OpenACC parallelism as containing loop");

Too long lines.  Plus 2 spaces into one.
> +	    if (check && OMP_CLAUSE_OPERAND (c, 0))
> +	      error_at (gimple_location (stmt),
> +			"argument not permitted on %<%s%> clause in"

%qs instead of %<%s%> ?

> @@ -5769,6 +5885,166 @@ lower_send_shared_vars (gimple_seq *ilis
>      }
>  }
>  
> +/* Emit an OpenACC head marker call, encapulating the partitioning and
> +   other information that must be processed by the target compiler.
> +   Return the maximum number of dimensions the associated loop might
> +   be partitioned over.  */
> +
> +static unsigned
> +lower_oacc_head_mark (location_t loc, tree clauses,
> +		      gimple_seq *seq, omp_context *ctx)
> +{
> +  unsigned levels = 0;
> +  unsigned tag = 0;
> +  tree gang_static = NULL_TREE;
> +  auto_vec<tree, 1> args;

If you usually push there 3 or 4 arguments, wouldn't it be better to
just use auto_vec<tree, 4> args; instead?

> +  if (gang_static)
> +    {
> +      if (DECL_P  (gang_static))

Formatting, too many spaces.

> +  tree marker = build_int_cst
> +    (integer_type_node, (head ? IFN_UNIQUE_OACC_HEAD_MARK
> +			 : IFN_UNIQUE_OACC_TAIL_MARK));

I really don't like putting the arguments on a different line from
the function name, unless you have to.
Here you can easily do say
  enum internal_fn marker_val = head ? IFN_UNIQUE_OACC_HEAD_MARK
				     : IFN_UNIQUE_OACC_TAIL_MARK;
  tree marker = build_int_cst (integer_type_node, marker_val);
same number of lines, easier to read.

> +  gcall *call = gimple_build_call_internal
> +    (IFN_UNIQUE, 1 + (tofollow != NULL_TREE), marker, tofollow);

Similarly.

> +      gcall *fork = gimple_build_call_internal
> +	(IFN_UNIQUE, 2,
> +	 build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK), place);
> +      gcall *join = gimple_build_call_internal
> +	(IFN_UNIQUE, 2,
> +	 build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN), place);

Likewise.  Just use a 
    tree t = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
    gcall *fork = gimple_build_call_internal (IFN_UNIQUE, 2, t, place);
etc.

> +      expr = build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
> +		     fold_convert (ivar_type, collapse->iters));
> +      expr = build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
> +		     collapse->step);
> +      expr = build2 (plus_code, iter_type, collapse->base,
> +		     fold_convert (plus_type, expr));

Shouldn't these be fold_build2 instead?
Of course Richi would prefer gimple_build, but omp-low.c already has
so much fold_build2 + force_gimple_operand_gsi code around
that it is fine with me this way.
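
That is, simply the same statements as quoted above, folded (sketch):

      expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
                          fold_convert (ivar_type, collapse->iters));
      expr = fold_build2 (MULT_EXPR, diff_type,
                          fold_convert (diff_type, expr), collapse->step);
      expr = fold_build2 (plus_code, iter_type, collapse->base,
                          fold_convert (plus_type, expr));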

>  /* An unduplicable, uncombinable function.  Generally used to preserve
>     a CFG property in the face of jump threading, tail merging or
>     other such optimizations.  The first argument distinguishes
>     between uses.  Other arguments are as needed for use.  The return
>     type depends on use too.  */
>  DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)
>  #define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
> +
> +/* FORK and JOIN mark the points at which OpenACC partitioned
> +   execution is entered or exited.  They take an INTEGER_CST argument,
> +   indicating the axis of forking or joining and return nothing.  */
> +#define IFN_UNIQUE_OACC_FORK 1
> +#define IFN_UNIQUE_OACC_JOIN 2
> +/* HEAD_MARK and TAIL_MARK are used to demark the sequence entering or
> +   leaving partitioned execution.  */
> +#define IFN_UNIQUE_OACC_HEAD_MARK 3
> +#define IFN_UNIQUE_OACC_TAIL_MARK 4

Shouldn't these be in an enum, to make debugging easier?

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-21 19:50 ` [OpenACC 8/11] device-specific lowering Nathan Sidwell
@ 2015-10-22  9:32   ` Jakub Jelinek
  2015-10-22 12:59     ` Nathan Sidwell
  2015-10-26 15:21   ` Jakub Jelinek
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:32 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:49:08PM -0400, Nathan Sidwell wrote:
> This patch is the device-specific half of the previous patch.  It processes
> the partition head & tail markers and loop abstraction functions inserted
> during omp lowering.
> 
> In the oacc_device_lower pass we scan the CFG reconstructing the set of
> nested loops demarked by IFN_UNIQUE (HEAD_MARK) & IFN_UNIQUE (TAIL_MARK)
> functions. The HEAD_MARK function provides  the loop partition information
> provided by the user.  Once constructed we can iterate over that structure
> checking partitioning consistency (for instance an inner loop must use a
> dimension 'inside' an outer loop). We also assign specific partitioning axes
> here.  Partitioning updates the parameters of the IFN_LOOP and IFN_FORK/JOIN
> functions appropriately.
> 
> Once partitioning has been determined, we iterate over the CFG scanning for
> the marker, fork/join and loop functions.  The marker functions are deleted,
> the fork & join functions are conditionally deleted (using the target hook
> of patch 3), and the loop function is expanded into code calculating the
> loop parameters depending on how the loop has been partitioned.  This  uses
> the OACC_DIM_POS and OACC_DIM_SIZE builtins included in patch 7.

So, how do you expand the OACC loops on non-PTX devices (host, or say
XeonPhi)?  Do you drop the IFNs and replace stuff with normal loops?
I don't see anything that would e.g. set the various flags that e.g. OpenMP
#pragma omp simd or Cilk+ #pragma simd sets, like loop->safelen,
loop->force_vectorize, maybe loop->simduid and promote some vars to simduid
arrays if that is relevant to OpenACC.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 9/11] oacc_device_lower pass gate
  2015-10-21 19:51 ` [OpenACC 9/11] oacc_device_lower pass gate Nathan Sidwell
@ 2015-10-22  9:33   ` Jakub Jelinek
  2015-10-27 20:31     ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:33 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:50:31PM -0400, Nathan Sidwell wrote:
> 
> This patch is obvious, but included for completeness. We always want to run
> the device lowering pass (when openacc is enabled), in order to delete the
> marker and loop functions that should never be seen after this point.
> 
> nathan

> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	* omp-low.c (pass_oacc_device_lower::execute): Ignore errors.

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 10/11] remove plugin restriction
  2015-10-21 19:52 ` [OpenACC 10/11] remove plugin restriction Nathan Sidwell
@ 2015-10-22  9:38   ` Jakub Jelinek
  0 siblings, 0 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:38 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:51:42PM -0400, Nathan Sidwell wrote:
> Here's another obvious patch.  The ptx plugin no longer needs to barf on
> gang or worker dimensions of non-unity.
> 
> nathan
> 
> 

> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	* plugin/plugin-nvptx.c (nvptx_exec): Remove check on compute
> 	dimensions.

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-21 19:59 ` [OpenACC 11/11] execution tests Nathan Sidwell
  2015-10-21 20:15   ` Ilya Verbin
@ 2015-10-22  9:54   ` Jakub Jelinek
  2015-10-22 14:02     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22  9:54 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:53:17PM -0400, Nathan Sidwell wrote:
> This patch has some new execution tests, verifying loop partitioning is
> behaving as expected.
> 
> There are more execution tests on the gomp4 branch, but many of them use
> reductions.  We'll merge those once reductions are merged.
> 
> nathan

> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 
> 	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: New.
> 	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: New.
> 	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.s: New.

As Ilya mentioned, this one should go.

> 	* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: New.
> 	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: New.
> 	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: New.
> 	* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: New.

And, I must say I'm missing testcases that check the parsing and also the
runtime behavior of the vector or worker clause arguments (there
is one gang (static:1) clause, but not the other clauses nor other styles of
gang arguments).

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22  8:16   ` Jakub Jelinek
@ 2015-10-22  9:58     ` Bernd Schmidt
  2015-10-22 13:02       ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Bernd Schmidt @ 2015-10-22  9:58 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell; +Cc: GCC Patches, Jason Merrill, Joseph S. Myers

On 10/22/2015 10:12 AM, Jakub Jelinek wrote:
>
>> @@ -2129,6 +3242,19 @@ nvptx_file_end (void)
>>     FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
>>       nvptx_record_fndecl (decl, true);
>>     fputs (func_decls.str().c_str(), asm_out_file);
>> +
>> +  if (worker_bcast_hwm)
>> +    {
>> +      /* Define the broadcast buffer.  */
>> +
>> +      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
>> +	& ~(worker_bcast_align - 1);
>> +
>> +      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
>> +      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
>> +	       worker_bcast_align,
>> +	       worker_bcast_name, worker_bcast_hwm);
>> +    }
>
> So, is the worker broadcast buffer effectively a file scope .shared
> variable?  My worry is that as .shared is quite limited resource, if you
> compile many TUs and each allocates its own broadcast buffer you run out of
> shared memory.  Is there any way how to share the broadcast buffers in
> between different TUs (other than LTO)?

I think LTO is the mechanism, nvptx-lto1 only ever produces one assembly 
file. So I'm not really concerned about this.

One other thing about this occurred to me yesterday - I was worried 
about thread-safety with a single static buffer - couldn't code execute 
multiple kernels at the same time? I googled a bit, and could not 
actually find a definitive answer as to whether all shared memory is 
allocated at kernel launch, or just the dynamic portion?


Bernd

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:07         ` Richard Biener
@ 2015-10-22 11:42           ` Julian Brown
  2015-10-22 13:12             ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Julian Brown @ 2015-10-22 11:42 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, Nathan Sidwell, GCC Patches, Bernd Schmidt,
	Jason Merrill, Joseph S. Myers

On Thu, 22 Oct 2015 10:05:30 +0200
Richard Biener <richard.guenther@gmail.com> wrote:

> On Thu, Oct 22, 2015 at 9:59 AM, Jakub Jelinek <jakub@redhat.com>
> wrote:
> > On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:  
> >> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such
> >> >> distinct internal fns.  This replaces that scheme.
> >> >>
> >> >> ok?  
> >> >
> >> > Hmm, I'd just have used gimple_has_volatile_ops on the call?
> >> > That should have the
> >> > desired effects.  
> >>
> >> That is, whatever new IFNs you need are ok, but special-casing
> >> them is not necessary if you properly mark the calls as volatile.  
> >
> > I don't see gimple_has_volatile_ops used in tracer.c or
> > tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those
> > IFNs is fine, but I think they are even stronger than that.  
> 
> Hmm, indeed.  Now I fail to see how the implemented property
> "preserves the CFG looping structure".  And I would have expected
> can_copy_bbs_p to be adjusted instead (catching more cases and the
> threading and tracer case as well).
> 
> As far as I can see nothing would prevent dissolving the loop by
> completely unolling it for example.  Or deleting it because it has no
> side-effects.
> 
> So you'd need to be more precise as to what properties you are trying
> to preserve by placing a single stmt somewhere.

FWIW an earlier, abandoned attempt at solving the same problem was
discussed in the following thread, continuing through June:

  https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html

Though the details of lowering of OpenACC constructs have changed with
Nathan's current patches, the underlying problem remains the same. PTX
requires certain operations (bar.sync) to be executed uniformly by all
threads in a CTA. IIUC this affects "JOIN" points across all
workers/vectors in a gang, in particular (though this is generic code,
other -- particularly GPU -- targets may have similar restrictions).

HTH,

Julian

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-22  9:32   ` Jakub Jelinek
@ 2015-10-22 12:51     ` Nathan Sidwell
  2015-10-22 13:01       ` Jakub Jelinek
  2015-10-25 15:03     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 12:51 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 05:23, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:42:26PM -0400, Nathan Sidwell wrote:
>> +/*  Flags for an OpenACC loop.  */
>> +
>> +enum oacc_loop_flags
>> +  {
>
> Weird formatting.  I see either

Blame emacs (I thought  it was configured for GNU formatting ...)


>> +      expr = build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
>> +		     fold_convert (ivar_type, collapse->iters));
>> +      expr = build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
>> +		     collapse->step);
>> +      expr = build2 (plus_code, iter_type, collapse->base,
>> +		     fold_convert (plus_type, expr));
>
> Shouldn't these be fold_build2 instead?

I don't think fold_build2 makes a difference here, as (at least) one operand
is a variable?

>> +/* FORK and JOIN mark the points at which OpenACC partitioned
>> +   execution is entered or exited.  They take an INTEGER_CST argument,
>> +   indicating the axis of forking or joining and return nothing.  */
>> +#define IFN_UNIQUE_OACC_FORK 1
>> +#define IFN_UNIQUE_OACC_JOIN 2
>> +/* HEAD_MARK and TAIL_MARK are used to demark the sequence entering or
>> +   leaving partitioned execution.  */
>> +#define IFN_UNIQUE_OACC_HEAD_MARK 3
>> +#define IFN_UNIQUE_OACC_TAIL_MARK 4
>
> Shouldn't these be in an enum, to make debugging easier?

internal-fn.def can be included multiple times in one file (probably only 
internal-fn.c).  Thus an enum would either need to go somewhere else (and I'd 
like to keep it close to the ifn def), or need to be protected in some manner. 
Hence I went with #defs, which are safe to duplicate.  Any thoughts on how to 
resolve that contention?

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-22  9:32   ` Jakub Jelinek
@ 2015-10-22 12:59     ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 12:59 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 05:31, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:49:08PM -0400, Nathan Sidwell wrote:

> So, how do you expand the OACC loops on non-PTX devices (host, or say
> XeonPhi)?  Do you drop the IFNs and replace stuff with normal loops?

On a non-PTX target (the canonical example being the host), the IFN head/tail
markers get deleted.  The IFN_LOOP builtin gets expanded to code that essentially
restores the original loop structure.

> I don't see anything that would e.g. set the various flags that e.g. OpenMP
> #pragma omp simd or Cilk+ #pragma simd sets, like loop->safelen,
> loop->force_vectorize, maybe loop->simduid and promote some vars to simduid
> arrays if that is relevant to OpenACC.

It won't convert them into such representations.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-22 12:51     ` Nathan Sidwell
@ 2015-10-22 13:01       ` Jakub Jelinek
  2015-10-22 13:08         ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22 13:01 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 08:50:23AM -0400, Nathan Sidwell wrote:
> >>+      expr = build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
> >>+		     fold_convert (ivar_type, collapse->iters));
> >>+      expr = build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
> >>+		     collapse->step);
> >>+      expr = build2 (plus_code, iter_type, collapse->base,
> >>+		     fold_convert (plus_type, expr));
> >
> >Shouldn't these be fold_build2 instead?
> 
> I don't think fold_build2 makes a difference here, as (at least) one operand
> is a variable?

Fold does tons of optimizations, some could be relevant even for this case.
But if you think build2 is fine, I can live with it.

> >>+/* FORK and JOIN mark the points at which OpenACC partitioned
> >>+   execution is entered or exited.  They take an INTEGER_CST argument,
> >>+   indicating the axis of forking or joining and return nothing.  */
> >>+#define IFN_UNIQUE_OACC_FORK 1
> >>+#define IFN_UNIQUE_OACC_JOIN 2
> >>+/* HEAD_MARK and TAIL_MARK are used to demark the sequence entering or
> >>+   leaving partitioned execution.  */
> >>+#define IFN_UNIQUE_OACC_HEAD_MARK 3
> >>+#define IFN_UNIQUE_OACC_TAIL_MARK 4
> >
> >Shouldn't these be in an enum, to make debugging easier?
> 
> internal-fn.def can be included multiple times in one file (probably only
> internal-fn.c).  Thus an enum would either need to go somewhere else (and
> I'd like to keep it close to the ifn def), or need to be protected in some
> manner. Hence I went with #defs, which are safe to duplicate.  Any thoughts
> on how to resolve that contention?

I think an enum in internal-fn.h is better; the IFN_UNIQUE comment can just
say that it uses this and this enum from internal-fn.h and the description
can go there.  Being able to just p (enum ifn_unique_kind) gimple_call_arg (call, 0)
is valuable (though perhaps you could even tweak the gimple-pretty-print.c
dumper to dump the first argument to IFN_UNIQUE symbolically instead of
numerically).
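
Something like the following in internal-fn.h would do; a sketch reusing the
current #define names, exact comments and placement up to you:

  /* The uses of IFN_UNIQUE are distinguished by its first, INTEGER_CST,
     argument.  */
  enum ifn_unique_kind {
    IFN_UNIQUE_UNSPEC,          /* Undifferentiated UNIQUE.  */
    IFN_UNIQUE_OACC_FORK,       /* Enter partitioned execution.  */
    IFN_UNIQUE_OACC_JOIN,       /* Leave partitioned execution.  */
    IFN_UNIQUE_OACC_HEAD_MARK,  /* Head of partitioning sequence.  */
    IFN_UNIQUE_OACC_TAIL_MARK   /* Tail of partitioning sequence.  */
  };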

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22  9:58     ` Bernd Schmidt
@ 2015-10-22 13:02       ` Nathan Sidwell
  2015-10-22 13:23         ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:02 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: GCC Patches, Jason Merrill, Joseph S. Myers

On 10/22/15 05:55, Bernd Schmidt wrote:
> On 10/22/2015 10:12 AM, Jakub Jelinek wrote:

>> So, is the worker broadcast buffer effectively a file scope .shared
>> variable?  My worry is that as .shared is quite limited resource, if you
>> compile many TUs and each allocates its own broadcast buffer you run out of
>> shared memory.  Is there any way how to share the broadcast buffers in
>> between different TUs (other than LTO)?
>
> I think LTO is the mechanism, nvptx-lto1 only ever produces one assembly file.
> So I'm not really concerned about this.

Correct.  PTX has no equivalent of common or weak, so we can't do the elf thing 
of emitting a common defn and having the linking process pick the largest.

>
> One other thing about this occurred to me yesterday - I was worried about
> thread-safety with a single static buffer - couldn't code execute multiple
> kernels at the same time? I googled a bit, and could not actually find a
> definitive answer as to whether all shared memory is allocated at kernel launch,
> or just the dynamic portion?

AFAICT a single CTA doesn't execute multiple kernels concurrently.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:12     ` Richard Biener
@ 2015-10-22 13:08       ` Nathan Sidwell
  2015-10-22 14:04       ` Nathan Sidwell
  2015-10-22 17:39       ` Nathan Sidwell
  2 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:08 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 04:07, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 10:04 AM, Jakub Jelinek <jakub@redhat.com> wrote:

>> Do you have to scan the whole bb?  E.g. don't or should not those
>> unique IFNs force end of bb?
>
> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

Thanks, I'd not thought of doing it like that.  Will try.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-22 13:01       ` Jakub Jelinek
@ 2015-10-22 13:08         ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:08 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 08:59, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 08:50:23AM -0400, Nathan Sidwell wrote:
>>>> +      expr = build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
>>>> +		     fold_convert (ivar_type, collapse->iters));
>>>> +      expr = build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
>>>> +		     collapse->step);
>>>> +      expr = build2 (plus_code, iter_type, collapse->base,
>>>> +		     fold_convert (plus_type, expr));
>>>
>>> Shouldn't these be fold_build2 instead?
>>
>> I don't think fold_build2 makes a difference here, as (at least) one operand
>> is  a variable?
>
> Fold does tons of optimizations, some could be relevant even for this case.
> But if you think build2 is fine, I can live with it.

I don't mind.  I just thought it would do a lot of checking and no actual
transformation.

> I think an enum in internal-fn.h is better, the IFN_UNIQUE comment can just
> say that it uses this and this enum from internal-fn.h and the description
> go there.  Being able to just p (enum ifn_unique_kind) gimple_call_arg (call, 0)
> is valuable (though, perhaps you could even tweak the gimple-pretty-print.c
> dumper to dump the first argument to IFN_UNIQUE symbolically instead of
> numerically.

ok

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 11:42           ` Julian Brown
@ 2015-10-22 13:12             ` Nathan Sidwell
  2015-10-22 13:20               ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:12 UTC (permalink / raw)
  To: Julian Brown, Richard Biener
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/22/15 07:10, Julian Brown wrote:
> On Thu, 22 Oct 2015 10:05:30 +0200
> Richard Biener <richard.guenther@gmail.com> wrote:

>> So you'd need to be more precise as to what properties you are trying
>> to preserve by placing a single stmt somewhere.
>
> FWIW an earlier, abandoned attempt at solving the same problem was
> discussed in the following thread, continuing through June:
>
>    https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html
>
> Though the details of lowering of OpenACC constructs have changed with
> Nathan's current patches, the underlying problem remains the same. PTX
> requires certain operations (bar.sync) to be executed uniformly by all
> threads in a CTA. IIUC this affects "JOIN" points across all
> workers/vectors in a gang, in particular (though this is generic code,
> other -- particularly GPU -- targets may have similar restrictions).


Richard, does this answer your question?

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 3/11] new target hook
  2015-10-22  8:23   ` Jakub Jelinek
@ 2015-10-22 13:17     ` Nathan Sidwell
  2015-10-27 22:15     ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:17 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 04:15, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:13:26PM -0400, Nathan Sidwell wrote:

>> +/* Determine whether fork & joins are needed.  */
>> +
>> +static bool
>> +nvptx_xform_fork_join (gcall *call, const int dims[],
>> +		       bool ARG_UNUSED (is_fork))
>
> Why is this not called nvptx_goacc_fork_join when that is the name of
> the target hook?

don't want to make things too clear ... (will fix).

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 13:12             ` Nathan Sidwell
@ 2015-10-22 13:20               ` Jakub Jelinek
  2015-10-22 13:27                 ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22 13:20 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Julian Brown, Richard Biener, GCC Patches, Bernd Schmidt,
	Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:
> On 10/22/15 07:10, Julian Brown wrote:
> >On Thu, 22 Oct 2015 10:05:30 +0200
> >Richard Biener <richard.guenther@gmail.com> wrote:
> 
> >>So you'd need to be more precise as to what properties you are trying
> >>to preserve by placing a single stmt somewhere.
> >
> >FWIW an earlier, abandoned attempt at solving the same problem was
> >discussed in the following thread, continuing through June:
> >
> >   https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html
> >
> >Though the details of lowering of OpenACC constructs have changed with
> >Nathan's current patches, the underlying problem remains the same. PTX
> >requires certain operations (bar.sync) to be executed uniformly by all
> >threads in a CTA. IIUC this affects "JOIN" points across all
> >workers/vectors in a gang, in particular (though this is generic code,
> >other -- particularly GPU -- targets may have similar restrictions).
> 
> 
> Richard, does  this answer your question?

I agree with Richard that it would be better to write more about what kind
of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (I bet as long
as all IFN_UNIQUE calls stay in one or the other part, but not both)?
Various loop optimizations, ...

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 13:02       ` Nathan Sidwell
@ 2015-10-22 13:23         ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:23 UTC (permalink / raw)
  To: Bernd Schmidt, Jakub Jelinek; +Cc: GCC Patches, Jason Merrill, Joseph S. Myers

On 10/22/15 09:01, Nathan Sidwell wrote:
> On 10/22/15 05:55, Bernd Schmidt wrote:
>> On 10/22/2015 10:12 AM, Jakub Jelinek wrote:
>
>>> So, is the worker broadcast buffer effectively a file scope .shared
>>> variable?  My worry is that as .shared is quite limited resource, if you
>>> compile many TUs and each allocates its own broadcast buffer you run out of
>>> shared memory.  Is there any way how to share the broadcast buffers in
>>> between different TUs (other than LTO)?
>>
>> I think LTO is the mechanism, nvptx-lto1 only ever produces one assembly file.
>> So I'm not really concerned about this.
>
> Correct.  PTX has no equivalent of common or weak, so we can't do the elf thing
> of emitting a common defn and having the linking process pick the largest.

oh, and I even thought of having a bunch of defns in a library of the form
long worker_buf_<n>:

and then having the emitted code reference the set that it needed so that the 
linker would concatenate them into a single object.  But PTX has no concept of 
sections, so couldn't gatheer those decls into contiguous memory,

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 13:20               ` Jakub Jelinek
@ 2015-10-22 13:27                 ` Nathan Sidwell
  2015-10-22 14:31                   ` Richard Biener
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 13:27 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Julian Brown, Richard Biener, GCC Patches, Bernd Schmidt,
	Jason Merrill, Joseph S. Myers

On 10/22/15 09:17, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:

> I agree with Richard that it would be better to write more about what kind
> of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
> E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (bet as long
> as all IFN_UNIQUE calls stay in one or the other part, but not both)?

Essentially, yes.  A set of IFN_UNIQUE calls forms a group whose members must
not be separated from each other.  The set is discovered implicitly by following
the CFG (though I suppose we could add an identifying INT_CST operand or
something equivalent).

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22  9:54   ` Jakub Jelinek
@ 2015-10-22 14:02     ` Nathan Sidwell
  2015-10-22 14:07       ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:02 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 05:37, Jakub Jelinek wrote:

> And, I must say I'm at least missing testcases that check parsing but also
> runtime behavior of the vector or worker clause arguments (there
> is one gang (static:1) clause, but not the other clauses nor other styles of
> gang arguments.

The static clause is only valid on gang.  But you're right, some error tests
would be good to include in this patch set.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:12     ` Richard Biener
  2015-10-22 13:08       ` Nathan Sidwell
@ 2015-10-22 14:04       ` Nathan Sidwell
  2015-10-22 14:28         ` Richard Biener
  2015-10-22 17:39       ` Nathan Sidwell
  2 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:04 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 04:07, Richard Biener wrote:

> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call being 
duplicated though?

Thinking a little further, a code motion barrier is stronger than I need (but
conservatively safe).  For instance:

UNIQUE (HEAD)
for (...)
{
   a = <loop_invariant_expr>
}
UNIQUE (TAIL)

It would be safe and desirable to move that loop invariant to before the UNIQUE.
Perhaps it won't matter in practice -- after all having N physical threads
calculate it in parallel (just after the HEAD marker, but before the loop) will 
probably take no longer than a single thread doing it while the others wait.[*]

nathan

[*] But it will take more power.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-21 19:11 ` [OpenACC 2/11] PTX backend changes Nathan Sidwell
  2015-10-22  8:16   ` Jakub Jelinek
@ 2015-10-22 14:05   ` Bernd Schmidt
  2015-10-22 14:26     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Bernd Schmidt @ 2015-10-22 14:05 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek, Jason Merrill, Joseph S. Myers

On 10/21/2015 09:09 PM, Nathan Sidwell wrote:
> At the beginning of a partitioned region, we have to  propagate live
> register state and stack frame from engine-zero to the other engines
> (just as would happen on a regular 'fork' call).

This is something I'm not terribly happy about, but since I have no 
alternative to offer, it's fine. I expect it could turn out to be a 
glass jaw of the implementation though. Speaking of which - do you have 
any recent speedup numbers for nontrivial OpenACC programs?

Some other minor nitpicks:

> +      /* Emit fork at all levels, this helps form SESE regions..  */

Could expand the comment, it doesn't help me understand the issue. Also, 
punctuation.

> +/* Structure used when generating a worker-level spill or fill.  */
> +
> +struct wcast_data_t
> +{
> +  rtx base;
> +  rtx ptr;
> +  unsigned offset;
> +};

Could document the members.

> +/* Loop structure of the function.The entire function is described as

Whitespace.

> +      // Clear visited flag, for use by parallel locator  */

Odd comment style :-)

> +
> +static void
> +nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
> +		 rtx (*fn) (rtx, propagate_mask,
> +			    unsigned, void *), void *data)

Break that out into a typedef?

> +	  /* Allow worker function to initialize anything needed */

Punctuation.

> +	default:break;

Formatting.

> +  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
> +    { /* No propagation needed for a call.  */ }
> +  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))

Ok that looks weird with the open brace on the line before the else. I 
think the standard practice is to just use "/* .. */;", but possibly 
just invert the if condition and move the else branches into it.

> +  unsigned me = par->mask
> +    & (GOMP_DIM_MASK (GOMP_DIM_WORKER) | GOMP_DIM_MASK (GOMP_DIM_VECTOR));

Formatting. Maybe have extra defines for the masks so you don't have to
spell GOMP_DIM_MASK every time?

> +      if ((outer | me) & GOMP_DIM_MASK (mode))
> +	{ /* Mode is partitioned: no neutering.  */ }
> +      else if (!(modes & GOMP_DIM_MASK (mode)))
> +	{ /* Mode  is not used: nothing to do.  */ }

Same issue as above. Whitespace in comment.

> +      else
> +	{ /* Parent will skip this parallel itself.  */ }

Here too - actually no need to have an empty else at all.

> +  "%.\\tmov.b64\\t{%0,%1}, %2;")
 > +  "%.\\tmov.b64\\t%0, {%1,%2};")

Might want to add a space after the comma. I can see arguments for and 
against, so do what you like.


Bernd

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 14:02     ` Nathan Sidwell
@ 2015-10-22 14:07       ` Jakub Jelinek
  2015-10-22 14:23         ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22 14:07 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 09:53:46AM -0400, Nathan Sidwell wrote:
> On 10/22/15 05:37, Jakub Jelinek wrote:
> 
> >And, I must say I'm at least missing testcases that check parsing but also
> >runtime behavior of the vector or worker clause arguments (there
> >is one gang (static:1) clause, but not the other clauses nor other styles of
> >gang arguments.
> 
> the static clause is only valid on gang.

That is what I've figured out.
But it is unclear from the parsing what from these is allowed:
int v, w;
...
gang(26)
gang(v)
vector(length: 16)
vector(length: v)
vector(16)
vector(v)
worker(num: 16)
worker(num: v)
worker(16)
worker(v)
gang(16, 24)
gang(v, w)
gang(static: 16, num: 5)
gang(static: v, num: w)
gang(num: 5, static: 4)
gang(num: v, static: w)

and if the length: or num: part is really optional, then
int length, num;
vector(length)
worker(num)
gang(num, static: 6)
gang(static: 5, num)
should be also accepted (or subset thereof?).

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 14:07       ` Jakub Jelinek
@ 2015-10-22 14:23         ` Nathan Sidwell
  2015-10-22 14:47           ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:23 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Cesar Philippidis

On 10/22/15 10:05, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 09:53:46AM -0400, Nathan Sidwell wrote:
>> On 10/22/15 05:37, Jakub Jelinek wrote:
>>
>>> And, I must say I'm at least missing testcases that check parsing but also
>>> runtime behavior of the vector or worker clause arguments (there
>>> is one gang (static:1) clause, but not the other clauses nor other styles of
>>> gang arguments.
>>
>> the static clause is only valid on gang.
>
> That is what I've figured out.
> But it is unclear from the parsing what from these is allowed:

Good questions.  As you may have guessed, I'm not the primary author of the
parsing code.  Cesar's stepped up to address this.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:05   ` Bernd Schmidt
@ 2015-10-22 14:26     ` Nathan Sidwell
  2015-10-22 14:30       ` Bernd Schmidt
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:26 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches; +Cc: Jakub Jelinek, Jason Merrill, Joseph S. Myers

On 10/22/15 10:04, Bernd Schmidt wrote:

>> +  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
>> +    { /* No propagation needed for a call.  */ }
>> +  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
>
> Ok that looks weird with the open brace on the line before the else. I think the
> standard practice is to just use "/* .. */;", but possibly just invert the if
> condition and move the else branches into it.

I find it more obviously an empty if -- that ';' can get lost with the comment 
(I have vague memories of a compiler warning too; I'll give it a try).  Inverting 
the condition makes the sequence confusing, IMHO.


>> +      else
>> +    { /* Parent will skip this parallel itself.  */ }
>
> Here too - actually no need to have an empty else at all.

I wanted somewhere clear for the comment to go. (Actually, I think this is the 
one the compiler warns about -- empty dangling else).

>> +  "%.\\tmov.b64\\t{%0,%1}, %2;")
>  > +  "%.\\tmov.b64\\t%0, {%1,%2};")
>
> Might want to add a space after the comma. I can see arguments for and against,
> so do what you like.

Yeah, same thoughts ran through my head ...

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 14:04       ` Nathan Sidwell
@ 2015-10-22 14:28         ` Richard Biener
  2015-10-22 14:31           ` Nathan Sidwell
  2015-10-22 18:08           ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Richard Biener @ 2015-10-22 14:28 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 04:07, Richard Biener wrote:
>
>> Yeah, please make them either end or start a BB so we have to check
>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>> it also makes it a code motion barrier.
>
>
> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
> being duplicated though?

It will in practice.  RETURNS_TWICE will get you an abnormal edge from
entry (I think)

> thinking a little further, a code motion barrier is stronger than I need
> (but conservatively safe).  For instance:
>
> UNIQUE (HEAD)
> for (...)
> {
>   a = <loop_invariant_expr>
> }
> UNIQUE (TAIL)
>
> It would be safe and desirable to move that loop invariant to before the
> UNIQUE.  Perhaps it won't matter in practice -- after all having N physical
> threads calculate it in parallel (just after the HEAD marker, but before the
> loop) will probably take no longer than a single thread doing it while the
> others wait.[*]

RETURNS_TWICE will make the invariant motion stop at UNIQUE (HEAD),
but it would have done that anyway.  It will also be a CSE barrier, thus

tem = global;
UNIQUE(HEAD)
tem2 = global;

will not CSE tem2 to tem.

Richard.

> nathan
>
> [*] but it will take more power.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:26     ` Nathan Sidwell
@ 2015-10-22 14:30       ` Bernd Schmidt
  2015-10-22 14:36         ` Jakub Jelinek
  2015-10-22 14:42         ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Bernd Schmidt @ 2015-10-22 14:30 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek, Jason Merrill, Joseph S. Myers

On 10/22/2015 04:24 PM, Nathan Sidwell wrote:
>>> +      else
>>> +    { /* Parent will skip this parallel itself.  */ }
>>
>> Here too - actually no need to have an empty else at all.
>
> I wanted somewhere clear for the comment to go. (Actually, I think this
> is the one the compiler warns about -- empty dangling else).

Don't recall such a warning, but that doesn't mean it isn't there. About 
the comment, it could either go just after the if, or above it, along 
the lines of
  /* We handle cases X, Y, Z here.  If none of them apply, then
     the parent will skip this parallel itself.  */

In this particular loop you could also use continue; for the empty if 
conditions.


Bernd

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 13:27                 ` Nathan Sidwell
@ 2015-10-22 14:31                   ` Richard Biener
  2015-10-22 14:47                     ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-22 14:31 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, Julian Brown, GCC Patches, Bernd Schmidt,
	Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 3:24 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 09:17, Jakub Jelinek wrote:
>>
>> On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:
>
>
>> I agree with Richard that it would be better to write more about what kind
>> of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
>> E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (bet as long
>> as all IFN_UNIQUE calls stay in one or the other part, but not both)?
>
>
> Essentially, yes.  a set of IFN_UNIQUE form a group  which must not be
> separated  from each other.  The set is discovered implicitly by following
> the CFG (though I suppose we could add an identifying INT_CST operand or
> something equivalent).

I don't see how this is achieved though.  To achieve this you'd need data
dependences between them, sth like

token_1 = IFN_UNIQUE (HEAD);
...
token_2 = IFN_UNIQUE (TAIL, token_1);

not sure if that is enough (what is "separate from each other"?), for example
partial inlining might simply pass token_1 to the split part where only
IFN_UNIQUE (TAIL, token_1) would be in.  At least the above provides
ordering between the two IFN calls (which you achieve by having VDEFs
I guess, but then they are also barriers for memory optimizations).

Richard.

> nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 14:28         ` Richard Biener
@ 2015-10-22 14:31           ` Nathan Sidwell
  2015-10-22 18:08           ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:31 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/22/15 10:26, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:

> RETURNS_TWICE will make the invariant motion stop at UNIQUE (HEAD),
> but it would have done that anyway.  It will also be a CSE barrier, thus
>
> tem = global;
> UNIQUE(HEAD)
> tem2 = global;
>
> will not CSE tem2 to tem.

Yes, I can see it would behave like that for something globally visible.  What 
about state that isn't so visible?  (perhaps I'm worrying about something that 
doesn't matter, but I'd like to understand)

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:30       ` Bernd Schmidt
@ 2015-10-22 14:36         ` Jakub Jelinek
  2015-10-22 14:52           ` Nathan Sidwell
  2015-10-22 14:42         ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22 14:36 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Nathan Sidwell, GCC Patches, Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 04:28:17PM +0200, Bernd Schmidt wrote:
> On 10/22/2015 04:24 PM, Nathan Sidwell wrote:
> >>>+      else
> >>>+    { /* Parent will skip this parallel itself.  */ }
> >>
> >>Here too - actually no need to have an empty else at all.
> >
> >I wanted somewhere clear for the comment to go. (Actually, I think this
> >is the one the compiler warns about -- empty dangling else).
> 
> Don't recall such a warning, but that doesn't mean it isn't there. About the
> comment, it could either go just after the if, or above it, along the lines
> of

There is a warning for
  if (cond);
but not for
  if (cond)
    ;
or
  if (cond)
    /* comment */ ;
which is the style used in various places throughout the compiler.

>  /* We handle cases X, Y, Z here.  If none of them apply, then
>     the parent will skip this parallel itself.  */
> 
> In this particular loop you could also use continue; for the empty if
> conditions.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:30       ` Bernd Schmidt
  2015-10-22 14:36         ` Jakub Jelinek
@ 2015-10-22 14:42         ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:42 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches; +Cc: Jakub Jelinek, Jason Merrill, Joseph S. Myers

On 10/22/15 10:28, Bernd Schmidt wrote:
> On 10/22/2015 04:24 PM, Nathan Sidwell wrote:
>>>> +      else
>>>> +    { /* Parent will skip this parallel itself.  */ }
>>>
>>> Here too - actually no need to have an empty else at all.
>>
>> I wanted somewhere clear for the comment to go. (Actually, I think this
>> is the one the compiler warns about -- empty dangling else).
>
> Don't recall such a warning, but that doesn't mean it isn't there.

found it:
.../nvptx.c:3007:47: warning: suggest braces around empty body in an 'else' 
statement [-Wempty-body]

I did what the compiler told me!  Always obey the compiler![*]

> About the
> comment, it could either go just after the if, or above it, along the lines of
>   /* We handle cases X, Y, Z here.  If none of them apply, then
>      the parent will skip this parallel itself.  */

Again I find that confusing -- much clearer to annotate the if cascade itself. 
(IIRC, I originally had a comment at the top, but it was demonstrably too 
confusing for me as the code was wrong)

> In this particular loop you could also use continue; for the empty if conditions.

I think that's going to be confusing too.

nathan

[*] and hypnotoad.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 14:23         ` Nathan Sidwell
@ 2015-10-22 14:47           ` Cesar Philippidis
  2015-10-22 14:58             ` Nathan Sidwell
  2015-10-22 15:03             ` Jakub Jelinek
  0 siblings, 2 replies; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-22 14:47 UTC (permalink / raw)
  To: Nathan Sidwell, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/2015 07:23 AM, Nathan Sidwell wrote:
> On 10/22/15 10:05, Jakub Jelinek wrote:
>> On Thu, Oct 22, 2015 at 09:53:46AM -0400, Nathan Sidwell wrote:
>>> On 10/22/15 05:37, Jakub Jelinek wrote:
>>>
>>>> And, I must say I'm at least missing testcases that check parsing
>>>> but also
>>>> runtime behavior of the vector or worker clause arguments (there
>>>> is one gang (static:1) clause, but not the other clauses nor other
>>>> styles of
>>>> gang arguments.
>>>
>>> the static clause is only valid on gang.
>>
>> That is what I've figured out.
>> But it is unclear from the parsing what from these is allowed:
> 
> good questions.  As you may have guessed, I'm not the primary author of
> the parsing code.  Cesar's stepped up to address this.

I'll go into more detail later when I post the revised patch, but for
the time being, in response to your to your earlier question I've
inlined how the clauses should be translated in comments below:

> But it is unclear from the parsing what from these is allowed:

int v, w;
...
gang(26)  // equivalent to gang(num:26)
gang(v)   // gang(num:v)
vector(length: 16)  // vector(length: 16)
vector(length: v)  // vector(length: v)
vector(16)  // vector(length: 16)
vector(v)   // vector(length: v)
worker(num: 16)  // worker(num: 16)
worker(num: v)   // worker(num: v)
worker(16)  // worker(num: 16)
worker(v)   // worker(num: v)
gang(16, 24)  // technically gang(num:16, num:24) is acceptable but it
              // should be an error
gang(v, w)  // likewise
gang(static: 16, num: 5)  // gang(static: 16, num: 5)
gang(static: v, num: w)   // gang(static: v, num: w)
gang(num: 5, static: 4)   // gang(num: 5, static: 4)
gang(num: v, static: w)   // gang(num: v, static: w)

Also note that the static argument can accept '*'.
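
For instance (purely illustrative, not one of the testcases being
discussed), this asks the implementation to pick the chunk size itself:

  void
  init (int n, float *a)
  {
  #pragma acc parallel loop gang(static: *)
    for (int i = 0; i < n; i++)
      a[i] = 0.0f;
  }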

> and if the length: or num: part is really optional, then
> int length, num;
> vector(length)
> worker(num)
> gang(num, static: 6)
> gang(static: 5, num)
> should be also accepted (or subset thereof?).

Interesting question. The spec is unclear. It defines gang, worker and
vector as follows in section 2.7 in the OpenACC 2.0a spec:

  gang [( gang-arg-list )]
  worker [( [num:] int-expr )]
  vector [( [length:] int-expr )]

where gang-arg is one of:

  [num:] int-expr
  static: size-expr

and gang-arg-list may have at most one num and one static argument,
and where size-expr is one of:

  *
  int-expr

So I've interpreted that as a requirement that length and num must be
followed by an int-expr, whatever that is.

I've been meaning to clean up the C and C++ front ends for a while
now, but I've been bogged down by other things. This is next on my todo
list.

Cesar

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 14:31                   ` Richard Biener
@ 2015-10-22 14:47                     ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:47 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, Julian Brown, GCC Patches, Bernd Schmidt,
	Jason Merrill, Joseph S. Myers

On 10/22/15 10:30, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 3:24 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>
>> Essentially, yes.  a set of IFN_UNIQUE form a group  which must not be
>> separated  from each other.  The set is discovered implicitly by following
>> the CFG (though I suppose we could add an identifying INT_CST operand or
>> something equivalent).
>
> I don't see how this is achieved though.

Well, in practice it does.

>  To achieve this you'd need data
> dependences between them, sth like
>
> token_1 = IFN_UNIQUE (HEAD);
> ...
> token_2 = IFN_UNIQUE (TAIL, token_1);
>
> not sure if that is enough (what is "separate from each other"?), for example
> partial inlining might simply pass token_1 to the split part where only
> IFN_UNIQUE (TAIL, token_1) would be in.

Yeah, such partial inlining will break.  I've not encountered it happening though.

> At least the above provides
> ordering between the two IFN calls (which you achieve by having VDEFs
> I guess, but then they are also barriers for memory optimizations).

Right, I think I'm relying on the compiler's lack of knowledge about what global 
state might be affected by the two calls to prevent it reordering them WRT 
each other.  Is that what you meant?

(I did wonder about the need to add the kind of data dependency you describe, 
but found it unnecessary.)

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:36         ` Jakub Jelinek
@ 2015-10-22 14:52           ` Nathan Sidwell
  2015-10-28 14:28             ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:52 UTC (permalink / raw)
  To: Jakub Jelinek, Bernd Schmidt; +Cc: GCC Patches, Jason Merrill, Joseph S. Myers

On 10/22/15 10:32, Jakub Jelinek wrote:
> There is a warning for
>    if (cond);
> but not for
>    if (cond)
>      ;
> or
>    if (cond)
>      /* comment */ ;
> which is the style used in various places throughout the compiler.

Sadly, that's not quite accurate.  The warning occurs for all the empty if's you 
describe.  It doesn't occur for

   if (a)
     ;
   else ...

This is a case where I really want to say something about that else.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 14:47           ` Cesar Philippidis
@ 2015-10-22 14:58             ` Nathan Sidwell
  2015-10-22 15:03             ` Jakub Jelinek
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 14:58 UTC (permalink / raw)
  To: Cesar Philippidis, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 10:47, Cesar Philippidis wrote:

> Interesting question. The spec is unclear. It defines gang, worker and
> vector as follows in section 2.7 in the OpenACC 2.0a spec:
>
>    gang [( gang-arg-list )]
>    worker [( [num:] int-expr )]
>    vector [( [length:] int-expr )]
>
> where gang-arg is one of:
>
>    [num:] int-expr
>    static: size-expr
>

The spec is intentionally unspecific about whether the exprs are 
integer-constant-expressions or integer-expressions, leaving it as an 
implementation choice.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 14:47           ` Cesar Philippidis
  2015-10-22 14:58             ` Nathan Sidwell
@ 2015-10-22 15:03             ` Jakub Jelinek
  2015-10-22 15:08               ` Cesar Philippidis
  2015-10-23 20:32               ` Cesar Philippidis
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-22 15:03 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 07:47:01AM -0700, Cesar Philippidis wrote:
> > But it is unclear from the parsing what from these is allowed:
> 
> int v, w;
> ...
> gang(26)  // equivalent to gang(num:26)
> gang(v)   // gang(num:v)
> vector(length: 16)  // vector(length: 16)
> vector(length: v)  // vector(length: v)
> vector(16)  // vector(length: 16)
> vector(v)   // vector(length: v)
> worker(num: 16)  // worker(num: 16)
> worker(num: v)   // worker(num: v)
> worker(16)  // worker(num: 16)
> worker(v)   // worker(num: v)
> gang(16, 24)  // technically gang(num:16, num:24) is acceptable but it
>               // should be an error
> gang(v, w)  // likewise
> gang(static: 16, num: 5)  // gang(static: 16, num: 5)
> gang(static: v, num: w)   // gang(static: v, num: w)
> gang(num: 5, static: 4)   // gang(num: 5, static: 4)
> gang(num: v, static: w)   // gang(num: v, static: w)
> 
> Also note that the static argument can accept '*'.
> 
> > and if the length: or num: part is really optional, then
> > int length, num;
> > vector(length)
> > worker(num)
> > gang(num, static: 6)
> > gang(static: 5, num)
> > should be also accepted (or subset thereof?).
> 
> Interesting question. The spec is unclear. It defines gang, worker and
> vector as follows in section 2.7 in the OpenACC 2.0a spec:
> 
>   gang [( gang-arg-list )]
>   worker [( [num:] int-expr )]
>   vector [( [length:] int-expr )]
> 
> where gang-arg is one of:
> 
>   [num:] int-expr
>   static: size-expr
> 
> and gang-arg-list may have at most one num and one static argument,
> and where size-expr is one of:
> 
>   *
>   int-expr
> 
> So I've interpreted that as a requirement that length and num must be
> followed by an int-expr, whatever that is.

My reading of the above is that
vector(length)
is equivalent to
vector(length: length)
and
worker(num)
is equivalent to
worker(num: num)
etc.  Basically, neither length nor num is a reserved identifier,
so you can use them for variable names, and if
vector(v) is equivalent to vector(length: v), then
vector(length) should be equivalent to vector(length:length)
or
vector(length + 1) should be equivalent to vector(length: length+1)
static is a keyword that can't start an integral expression, so I guess
it is fine if you issue an expected : diagnostics after it.
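
To make that concrete, a hypothetical snippet (not one of the testcases
I'm asking for below): with an ordinary variable called length in scope,

  void
  scale (int n, float *x)
  {
    int length = 32;
  #pragma acc kernels loop vector(length)
    for (int i = 0; i < n; i++)
      x[i] *= 2.0f;
  }

should, under this reading, mean the same as vector(length: length).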

In any case, please add a testcase (both C and C++) which covers all these
allowed variants (ideally one testcase) and rejected variants (another
testcase with dg-error).

This is still an easy case, as even the C FE has 2 tokens lookup.
E.g. for OpenMP map clause where
map (always, tofrom: x)
means one thing and
map (always, tofrom, y)
another one (map (tofrom: always, tofrom, y))
I had to do quite ugly things to get around this.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 15:03             ` Jakub Jelinek
@ 2015-10-22 15:08               ` Cesar Philippidis
  2015-10-23 20:32               ` Cesar Philippidis
  1 sibling, 0 replies; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-22 15:08 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/22/2015 08:00 AM, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 07:47:01AM -0700, Cesar Philippidis wrote:
>>> But it is unclear from the parsing what from these is allowed:
>>
>> int v, w;
>> ...
>> gang(26)  // equivalent to gang(num:26)
>> gang(v)   // gang(num:v)
>> vector(length: 16)  // vector(length: 16)
>> vector(length: v)  // vector(length: v)
>> vector(16)  // vector(length: 16)
>> vector(v)   // vector(length: v)
>> worker(num: 16)  // worker(num: 16)
>> worker(num: v)   // worker(num: v)
>> worker(16)  // worker(num: 16)
>> worker(v)   // worker(num: v)
>> gang(16, 24)  // technically gang(num:16, num:24) is acceptable but it
>>               // should be an error
>> gang(v, w)  // likewise
>> gang(static: 16, num: 5)  // gang(static: 16, num: 5)
>> gang(static: v, num: w)   // gang(static: v, num: w)
>> gang(num: 5, static: 4)   // gang(num: 5, static: 4)
>> gang(num: v, static: w)   // gang(num: v, static: w)
>>
>> Also note that the static argument can accept '*'.
>>
>>> and if the length: or num: part is really optional, then
>>> int length, num;
>>> vector(length)
>>> worker(num)
>>> gang(num, static: 6)
>>> gang(static: 5, num)
>>> should be also accepted (or subset thereof?).
>>
>> Interesting question. The spec is unclear. It defines gang, worker and
>> vector as follows in section 2.7 in the OpenACC 2.0a spec:
>>
>>   gang [( gang-arg-list )]
>>   worker [( [num:] int-expr )]
>>   vector [( [length:] int-expr )]
>>
>> where gang-arg is one of:
>>
>>   [num:] int-expr
>>   static: size-expr
>>
>> and gang-arg-list may have at most one num and one static argument,
>> and where size-expr is one of:
>>
>>   *
>>   int-expr
>>
>> So I've interpreted that as a requirement that length and num must be
>> followed by an int-expr, whatever that is.
> 
> My reading of the above is that
> vector(length)
> is equivalent to
> vector(length: length)
> and
> worker(num)
> is equivalent to
> worker(num: num)
> etc.  Basically, neither length nor num is a reserved identifier,
> so you can use them for variable names, and if
> vector(v) is equivalent to vector(length: v), then
> vector(length) should be equivalent to vector(length:length)
> or
> vector(length + 1) should be equivalent to vector(length: length+1)
> static is a keyword that can't start an integral expression, so I guess
> it is fine if you issue an expected : diagnostics after it.

You're correct. I overlooked that 'int length, num' declaration.

> In any case, please add a testcase (both C and C++) which covers all these
> allowed variants (ideally one testcase) and rejected variants (another
> testcase with dg-error).
> 
> This is still an easy case, as even the C FE has 2 tokens lookup.
> E.g. for OpenMP map clause where
> map (always, tofrom: x)
> means one thing and
> map (always, tofrom, y)
> another one (map (tofrom: always, tofrom, y))
> I had to do quite ugly things to get around this.

I'll add more test cases.

Thanks,
Cesar

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:12     ` Richard Biener
  2015-10-22 13:08       ` Nathan Sidwell
  2015-10-22 14:04       ` Nathan Sidwell
@ 2015-10-22 17:39       ` Nathan Sidwell
  2 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 17:39 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 04:07, Richard Biener wrote:

> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

I'm having a hard time making  UNIQUE the end of a BB.

I'm emitting code to a gimple sequence, which later gets processed by the OMP 
machinery.  Just doing that doesn't cause the block to be split after the 
ECF_RETURNS_TWICE function.

In all the below, the label is generated by:

   tree label = create_artificial_label (loc);
   FORCED_LABEL (label) = 1;

I tried doing
   UNIQUE (...)
   goto label
   label:

but that is apparently optimized away, leaving UNIQUE in the middle of a bb. 
Next I tried:

   UNIQUE (..., &label)
   goto label
label:

but the goto is elided and label migrates to the start of the bb, again leaving 
UNIQUE in the middle.

Any suggestions?

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 14:28         ` Richard Biener
  2015-10-22 14:31           ` Nathan Sidwell
@ 2015-10-22 18:08           ` Nathan Sidwell
  2015-10-23  8:46             ` Jakub Jelinek
  2015-10-23  9:40             ` Richard Biener
  1 sibling, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 18:08 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/22/15 10:26, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>> On 10/22/15 04:07, Richard Biener wrote:
>>
>>> Yeah, please make them either end or start a BB so we have to check
>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>> it also makes it a code motion barrier.
>>
>>
>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
>> being duplicated though?
>
> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
> entry (I think)

Won't that interfere with the OMP  machinery, which expects correctly nested 
loops?  (no in-to or out-of loop jumps)

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22  8:05   ` Jakub Jelinek
  2015-10-22  8:12     ` Richard Biener
@ 2015-10-22 20:25     ` Nathan Sidwell
  2015-10-23  8:05       ` Jakub Jelinek
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-22 20:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 04:04, Jakub Jelinek wrote:

>> +  /* Ignore blocks containing non-clonable function calls.  */
>> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
>> +       !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      g = gsi_stmt (gsi);
>> +
>> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
>> +	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
>> +	return true;
>> +    }
>
> Do you have to scan the whole bb?  E.g. don't or should not those
> unique IFNs force end of bb?

What about adding a flag to struct function?

   /* Nonzero if this function contains IFN_UNIQUE markers.  */
   unsigned int has_unique_calls : 1;

Then the tracer could either skip it, or do the search?

(I notice there are cilk flags already in struct function, instead of the above, 
we could add an openacc-specific one with  a similar behaviour?)
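
The 'skip' variant would then be a one-liner near the top of ignore_bb_p, 
something like this (sketch only, using the flag name above):

  /* Blocks in functions containing IFN_UNIQUE markers must not be
     duplicated by the tracer.  */
  if (cfun->has_unique_calls)
    return true;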

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 20:25     ` Nathan Sidwell
@ 2015-10-23  8:05       ` Jakub Jelinek
  0 siblings, 0 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23  8:05 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Thu, Oct 22, 2015 at 04:17:32PM -0400, Nathan Sidwell wrote:
> On 10/22/15 04:04, Jakub Jelinek wrote:
> 
> >>+  /* Ignore blocks containing non-clonable function calls.  */
> >>+  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
> >>+       !gsi_end_p (gsi); gsi_next (&gsi))
> >>+    {
> >>+      g = gsi_stmt (gsi);
> >>+
> >>+      if (is_gimple_call (g) && gimple_call_internal_p (g)
> >>+	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
> >>+	return true;
> >>+    }
> >
> >Do you have to scan the whole bb?  E.g. don't or should not those
> >unique IFNs force end of bb?
> 
> What about adding a flag to struct function?
> 
>   /* Nonzero if this function contains IFN_UNIQUE markers.  */
>   unsigned int has_unique_calls : 1;
> 
> Then the tracer could either skip it, or do the search?
> 
> (I notice there are cilk flags already in struct function, instead of the
> above, we could add an openacc-specific one with  a similar behaviour?)

If you want to force end of a BB after the IFN_UNIQUE call, then you can just
gimple_call_set_ctrl_altering (gcall, true);
on it, and probably tweak gimple_call_initialize_ctrl_altering
so that it does that by default.  Plus of course split the blocks after it
when you emit it.
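
I.e. roughly the following, as an untested sketch of the tree-cfg.c tweak:

  /* In gimple_call_initialize_ctrl_altering, also treat the unique
     internal fn as ending its block.  */
  if (gimple_call_internal_p (stmt)
      && gimple_call_internal_unique_p (as_a <gcall *> (stmt)))
    gimple_call_set_ctrl_altering (stmt, true);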

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 18:08           ` Nathan Sidwell
@ 2015-10-23  8:46             ` Jakub Jelinek
  2015-10-23 13:03               ` Nathan Sidwell
  2015-10-23  9:40             ` Richard Biener
  1 sibling, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23  8:46 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
> On 10/22/15 10:26, Richard Biener wrote:
> >On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
> >>On 10/22/15 04:07, Richard Biener wrote:
> >>
> >>>Yeah, please make them either end or start a BB so we have to check
> >>>at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> >>>it also makes it a code motion barrier.
> >>
> >>
> >>Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
> >>being duplicated though?
> >
> >It will in practice.  RETURNS_TWICE will get you an abnormal edge from
> >entry (I think)
> 
> Won't that interfere with the OMP  machinery, which expects correctly nested
> loops?  (no in-to or out-of loop jumps)

I bet it will, the region with the abnormal edges is no longer SESE.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-22 18:08           ` Nathan Sidwell
  2015-10-23  8:46             ` Jakub Jelinek
@ 2015-10-23  9:40             ` Richard Biener
  1 sibling, 0 replies; 120+ messages in thread
From: Richard Biener @ 2015-10-23  9:40 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Thu, Oct 22, 2015 at 8:06 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 10:26, Richard Biener wrote:
>>
>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>
>>> On 10/22/15 04:07, Richard Biener wrote:
>>>
>>>> Yeah, please make them either end or start a BB so we have to check
>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>> it also makes it a code motion barrier.
>>>
>>>
>>>
>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the
>>> call
>>> being duplicated though?
>>
>>
>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>> entry (I think)
>
>
> Won't that interfere with the OMP  machinery, which expects correctly nested
> loops?  (no in-to or out-of loop jumps)

Probably yes.

> nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23  8:46             ` Jakub Jelinek
@ 2015-10-23 13:03               ` Nathan Sidwell
  2015-10-23 13:03                 ` Richard Biener
  2015-10-23 13:12                 ` Jakub Jelinek
  0 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-23 13:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/23/15 04:25, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
>> On 10/22/15 10:26, Richard Biener wrote:
>>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>> On 10/22/15 04:07, Richard Biener wrote:
>>>>
>>>>> Yeah, please make them either end or start a BB so we have to check
>>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>>> it also makes it a code motion barrier.
>>>>
>>>>
>>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
>>>> being duplicated though?
>>>
>>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>>> entry (I think)
>>
>> Won't that interfere with the OMP  machinery, which expects correctly nested
>> loops?  (no in-to or out-of loop jumps)
>
> I bet it will, the region with the abnormal edges is no longer SESE.

Hm, it seems like a bad plan to try RETURNS_TWICE then.


> If you want to force end of a BB after the IFN_UNIQUE call, then you can just
> gimple_call_set_ctrl_altering (gcall, true);
> on it, and probably tweak gimple_call_initialize_ctrl_altering
> so that it does that by default.  Plus of course split the blocks after it
> when you emit it.

IIUC this won't require RETURNS_TWICE, correct?  We generate these seqs to a 
gimple sequence that eventually gets attached to the graph at the end of lower 
omp_for with:

   gimple_bind_set_body (new_stmt, body);
   gimple_omp_set_body (stmt, NULL);
   gimple_omp_for_set_pre_body (stmt, NULL);

Presumably that sequence will have to be split in the manner you describe 
somewhere else.  Not sure where that might be?

Any thoughts on the approach of adding a flag to struct function, and having 
the tracer skip such functions?

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:03               ` Nathan Sidwell
@ 2015-10-23 13:03                 ` Richard Biener
  2015-10-23 13:16                   ` Nathan Sidwell
  2015-10-23 13:12                 ` Jakub Jelinek
  1 sibling, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-23 13:03 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 2:57 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/23/15 04:25, Jakub Jelinek wrote:
>>
>> On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
>>>
>>> On 10/22/15 10:26, Richard Biener wrote:
>>>>
>>>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>>>
>>>>> On 10/22/15 04:07, Richard Biener wrote:
>>>>>
>>>>>> Yeah, please make them either end or start a BB so we have to check
>>>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>>>> it also makes it a code motion barrier.
>>>>>
>>>>>
>>>>>
>>>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the
>>>>> call
>>>>> being duplicated though?
>>>>
>>>>
>>>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>>>> entry (I think)
>>>
>>>
>>> Won't that interfere with the OMP  machinery, which expects correctly
>>> nested
>>> loops?  (no in-to or out-of loop jumps)
>>
>>
>> I bet it will, the region with the abnormal edges is no longer SESE.
>
>
> Hm, it seems like a bad plan to try RETURNS_TWICE then.
>
>
>> If you want to force end of a BB after the IFN_UNIQUE call, then you can
>> just
>> gimple_call_set_ctrl_altering (gcall, true);
>> on it, and probably tweak gimple_call_initialize_ctrl_altering
>> so that it does that by default.  Plus of course split the blocks after it
>> when you emit it.
>
>
> IIUC this won't require RETURNS_TWICE, correct?  We're generate these seqs
> to a gimple sequence that eventually gets attached to the graph at the end
> of lower omp_for with:
>
>   gimple_bind_set_body (new_stmt, body);
>   gimple_omp_set_body (stmt, NULL);
>   gimple_omp_for_set_pre_body (stmt, NULL);
>
> Presumably that sequence will have to be split in the manner you describe
> somewhere else.  Not sure where that might be?
>
> Any thoughts on the approach of adding a flag to struct function, and having
> tracer to skip such functions?

It's a hack.  I don't like hacks.  I think the requirement "don't duplicate me"
but inlining is ok is somewhat broken.  The requirement seems to be
sth like the "important" paris of such functions need to dominate/post-dominate
each other (technically not even in the same function)?

Richard.

> nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:03               ` Nathan Sidwell
  2015-10-23 13:03                 ` Richard Biener
@ 2015-10-23 13:12                 ` Jakub Jelinek
  2015-10-23 13:38                   ` Nathan Sidwell
  2015-10-25 14:29                   ` Nathan Sidwell
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23 13:12 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 08:57:17AM -0400, Nathan Sidwell wrote:
> >If you want to force end of a BB after the IFN_UNIQUE call, then you can just
> >gimple_call_set_ctrl_altering (gcall, true);
> >on it, and probably tweak gimple_call_initialize_ctrl_altering
> >so that it does that by default.  Plus of course split the blocks after it
> >when you emit it.
> 
> IIUC this won't require RETURNS_TWICE, correct?  We generate these seqs

It doesn't require that, sure.

> to a gimple sequence that eventually gets attached to the graph at the end
> of lower omp_for with:
> 
>   gimple_bind_set_body (new_stmt, body);
>   gimple_omp_set_body (stmt, NULL);
>   gimple_omp_for_set_pre_body (stmt, NULL);
> 
> Presumably that sequence will have to be split in the manner you describe
> somewhere else.  Not sure where that might be?

If this is during the omplower pass, then it is before cfg pass and
therefore all you need is tweak the gimple_call_initialize_ctrl_altering
function and the cfg pass will DTRT.

> Any thoughts on the approach of adding a flag to struct function, and having
> the tracer skip such functions?

It could still be expensive if functions with that flag set contain very
large basic blocks.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:16                   ` Nathan Sidwell
@ 2015-10-23 13:16                     ` Jakub Jelinek
  2015-10-23 14:46                       ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23 13:16 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 09:13:43AM -0400, Nathan Sidwell wrote:
> You're correct that the SESE region could be split across a function
> boundary in the manner you describe, but the  complexity of dealing with
> that in the backend's partitioning code would be high.  Let's not try and
> enable that from the get-go.

Sure, but then you probably need to tweak the fnsplit pass to guarantee
that.
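
Something of this shape, as a rough sketch only (where exactly it belongs
in ipa-split.c is for you to check; can_split here just stands for
whatever the pass uses to veto a candidate split point):

  /* Never let fnsplit separate the unique internal fn markers from
     the region they delimit.  */
  if (is_gimple_call (stmt)
      && gimple_call_internal_p (stmt)
      && gimple_call_internal_unique_p (as_a <gcall *> (stmt)))
    can_split = false;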

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:03                 ` Richard Biener
@ 2015-10-23 13:16                   ` Nathan Sidwell
  2015-10-23 13:16                     ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-23 13:16 UTC (permalink / raw)
  To: Richard Biener
  Cc: Jakub Jelinek, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/23/15 09:03, Richard Biener wrote:

> It's a hack.  I don't like hacks.

One person's hack can be another person's pragmatism :)

>  I think the requirement "don't duplicate me"
> but inlining is ok is somewhat broken.

The requirement is that the SESE region formed by the markers remains as an SESE 
region with those markers as the entry & exit paths. We don't have a way of 
expressing exactly that in the compiler.  What we do have is the ability to say 
'don't duplicate this insn'.

> The requirement seems to be
> sth like the "important" pairs of such functions need to dominate/post-dominate
> each other (technically not even in the same function)?

You're correct that the SESE region could be split across a function boundary in 
the manner you describe, but the  complexity of dealing with that in the 
backend's partitioning code would be high.  Let's not try and enable that from 
the get-go.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:12                 ` Jakub Jelinek
@ 2015-10-23 13:38                   ` Nathan Sidwell
  2015-10-25 14:29                   ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-23 13:38 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/23/15 09:03, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 08:57:17AM -0400, Nathan Sidwell wrote:

> If this is during the omplower pass, then it is before cfg pass and
> therefore all you need is tweak the gimple_call_initialize_ctrl_altering
> function and the cfg pass will DTRT.

ok, thanks

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:16                     ` Jakub Jelinek
@ 2015-10-23 14:46                       ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-23 14:46 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers, Cesar Philippidis

On 10/23/15 09:16, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 09:13:43AM -0400, Nathan Sidwell wrote:
>> You're correct that the SESE region could be split across a function
>> boundary in the manner you describe, but the  complexity of dealing with
>> that in the backend's partitioning code would be high.  Let's not try and
>> enable that from the get-go.
>
> Sure, but then you probably need to tweak the fnsplit pass to guarantee
> that.

Ok, I'll take a look at that too.

The gimple_call_set_ctrl_altering approach is looking good for the moment.

Richard, if that works out, so we only have to check unique_p on the last insn 
of a bb, does that satisfy your concerns?  (Of course I'll repost patch 1 for 
review).

WRT the other patches I think the status is:

01-trunk-unique.patch
   Internal function with a 'uniqueness' property
   * reworking as described.
02-trunk-nvptx-partition.patch
   NVPTX backend patch set for partitioned execution
   * approved with minor edits
03-trunk-hook.patch
   OpenACC hook
   * approved with minor edit
04-trunk-c.patch
   C FE changes
   * Being addressed by Cesar
05-trunk-cxx.patch
   C++ FE changes
   * Being addressed by Cesar
06-trunk-red-init.patch
   Placeholder to keep reductions functioning
   * Approved
07-trunk-loop-mark.patch
   Annotate OpenACC loops in device-agnostic manner
   * Addressing minor comments
08-trunk-dev-lower.patch
   Device-specific lowering of loop markers
   * Question asked & answered about non-ptx behaviour
09-trunk-lower-gate.patch
   Run oacc_device_lower pass regardless of errors
   * Approved
10-trunk-libgomp.patch
   Libgomp change (remove dimension check)
   * Approved
11-trunk-tests.patch
   Initial set of execution tests
   * Approved, but C & C++ error tests needed

I'll repost:
01-trunk-unique.patch
   Internal function with a 'uniqueness' property

That has some obvious knock on changes to 02, 07 and 08, do you want those 
reposted for review?

Cesar will repost:
04-trunk-c.patch
   C FE changes
05-trunk-cxx.patch
   C++ FE changes

The remaining patch:
08-trunk-dev-lower.patch
   Device-specific lowering of loop markers

seems to be waiting on Jakub?

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Re: [OpenACC 4/11] C FE changes
  2015-10-22  8:25   ` Jakub Jelinek
@ 2015-10-23 20:20     ` Cesar Philippidis
  2015-10-23 20:40       ` Jakub Jelinek
  2015-10-23 21:25       ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-23 20:20 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2266 bytes --]

On 10/22/2015 01:22 AM, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:16:20PM -0400, Nathan Sidwell wrote:
>> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
>> 	    Thomas Schwinge  <thomas@codesourcery.com>
>> 	    James Norris  <jnorris@codesourcery.com>
>> 	    Joseph Myers  <joseph@codesourcery.com>
>> 	    Julian Brown  <julian@codesourcery.com>
>>
>> 	* c-parser.c (c_parser_oacc_shape_clause): New.
>> 	(c_parser_oacc_simple_clause): New.
>> 	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
>> 	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.
> 
> Ok, with one nit.
> 
>>  /* OpenACC:
>> +   gang [( gang_expr_list )]
>> +   worker [( expression )]
>> +   vector [( expression )] */
>> +
>> +static tree
>> +c_parser_oacc_shape_clause (c_parser *parser, pragma_omp_clause c_kind,
>> +			    const char *str, tree list)
> 
> I think it would be better to remove the c_kind argument and pass to this
> function omp_clause_code kind instead.  The callers are already in a big
> switch, with a separate call for each of the clauses.
> After all, e.g. for c_parser_oacc_simple_clause you already do it that way
> too.
> 
>> +{
>> +  omp_clause_code kind;
>> +  const char *id = "num";
>> +
>> +  switch (c_kind)
>> +    {
>> +    default:
>> +      gcc_unreachable ();
>> +    case PRAGMA_OACC_CLAUSE_GANG:
>> +      kind = OMP_CLAUSE_GANG;
>> +      break;
>> +    case PRAGMA_OACC_CLAUSE_VECTOR:
>> +      kind = OMP_CLAUSE_VECTOR;
>> +      id = "length";
>> +      break;
>> +    case PRAGMA_OACC_CLAUSE_WORKER:
>> +      kind = OMP_CLAUSE_WORKER;
>> +      break;
>> +    }
> 
> Then you can replace this switch with just if (kind == OMP_CLAUSE_VECTOR)
> id = "length";

Good idea, thanks. This patch also corrects the problems parsing weird
combinations of num, static and length arguments that you mentioned
elsewhere.

Is this OK for trunk?

Nathan, can you try out this patch with your updated patch set? I saw
some test cases getting stuck when expanding expand_GOACC_DIM_SIZE on
the host compiler, which is wrong. I don't see that happening in
gomp-4_0-branch with this patch. Also, can you merge this patch along
with the c++ and new test case patches to trunk? I'll handle the gomp4
backport.

Cesar


[-- Attachment #2: 04-cfe-cjp.diff --]
[-- Type: text/x-patch, Size: 6801 bytes --]

2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	* c-parser.c (c_parser_oacc_shape_clause): New.
	(c_parser_oacc_simple_clause): New.
	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.


diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index c8c6a2d..1e3c333 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -11188,6 +11188,142 @@ c_parser_omp_clause_num_workers (c_parser *parser, tree list)
 }
 
 /* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
+			    const char *str, tree list)
+{
+  const char *id = "num";
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  tree op0 = NULL_TREE, op1 = NULL_TREE;
+  location_t loc = c_parser_peek_token (parser)->location;
+
+  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      c_parser_consume_token (parser);
+
+      do
+	{
+	  loc = c_parser_peek_token (parser)->location;
+	  op_to_parse = &op0;
+
+	  if ((c_parser_next_token_is (parser, CPP_NAME)
+	       || c_parser_next_token_is (parser, CPP_KEYWORD))
+	      && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+	    {
+	      tree name_kind = c_parser_peek_token (parser)->value;
+	      const char *p = IDENTIFIER_POINTER (name_kind);
+	      if (kind == OMP_CLAUSE_GANG
+		  && c_parser_next_token_is_keyword (parser, RID_STATIC))
+		{
+		  c_parser_consume_token (parser); /* static  */
+		  c_parser_consume_token (parser); /* ':'  */
+
+		  op_to_parse = &op1;
+		  if (c_parser_next_token_is (parser, CPP_MULT))
+		    {
+		      c_parser_consume_token (parser);
+		      *op_to_parse = integer_minus_one_node;
+
+		      /* Consume a comma if present.  */
+		      if (c_parser_next_token_is (parser, CPP_COMMA))
+			c_parser_consume_token (parser);
+
+		      continue;
+		    }
+		}
+	      else if (strcmp (id, p) == 0)
+		{
+		  c_parser_consume_token (parser);  /* id  */
+		  c_parser_consume_token (parser);  /* ':'  */
+		}
+	      else
+		{
+		  if (kind == OMP_CLAUSE_GANG)
+		    c_parser_error (parser, "expected %<num%> or %<static%>");
+		  else if (kind == OMP_CLAUSE_VECTOR)
+		    c_parser_error (parser, "expected %<length%>");
+		  else
+		    c_parser_error (parser, "expected %<num%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+		  return list;
+		}
+	    }
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      c_parser_error (parser, "unexpected argument");
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+	      return list;
+	    }
+
+	  tree expr = c_parser_expr_no_commas (parser, NULL).value;
+	  if (expr == error_mark_node)
+	    {
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+	      return list;
+	    }
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+
+	  /* Consume a comma if present.  */
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    c_parser_consume_token (parser);
+	}
+      while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN));
+      c_parser_consume_token (parser);
+    }
+
+  check_no_duplicate_clause (list, kind, str);
+
+  tree c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+c_parser_oacc_simple_clause (c_parser *parser ATTRIBUTE_UNUSED,
+			     enum omp_clause_code code, tree list)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
+
+  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+}
+
+/* OpenACC:
    async [( int-expr )] */
 
 static tree
@@ -12393,6 +12529,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						clauses);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = c_parser_omp_clause_collapse (parser, clauses);
 	  c_name = "collapse";
@@ -12429,6 +12570,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_omp_clause_firstprivate (parser, clauses);
 	  c_name = "firstprivate";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -12477,6 +12623,16 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						clauses);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						c_name,	clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = c_parser_omp_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -12485,6 +12641,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						c_name, clauses);
+	  break;
 	default:
 	  c_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -13015,6 +13176,11 @@ c_parser_oacc_enter_exit_data (c_parser *parser, bool enter)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Re: [OpenACC 5/11] C++ FE changes
  2015-10-22  8:58   ` Jakub Jelinek
@ 2015-10-23 20:26     ` Cesar Philippidis
  2015-10-24  2:39       ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-23 20:26 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 5004 bytes --]

On 10/22/2015 01:52 AM, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:18:55PM -0400, Nathan Sidwell wrote:
>> This patch is the C++ changes matching the C ones of patch 4.  In
>> finish_omp_clauses, the gang, worker, & vector clauses are handled the same
>> as OpenMP's 'num_threads' clause.  One change to num_threads is the
>> augmentation of a diagnostic to add %<...%>  markers to the clause name.
> 
> Indeed, lots of older OpenMP diagnostics is missing %<...%> markers around
> keywords.  Something to fix eventually.

I updated omp tasks and teams in semantics.c.

>> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
>> 	    Thomas Schwinge  <thomas@codesourcery.com>
>> 	    James Norris  <jnorris@codesourcery.com>
>> 	    Joseph Myers  <joseph@codesourcery.com>
>> 	    Julian Brown  <julian@codesourcery.com>
>> 	    Nathan Sidwell <nathan@codesourcery.com>
>>
>> 	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
>> 	vector, worker.
>> 	(cp_parser_oacc_simple_clause): New.
>> 	(cp_parser_oacc_shape_clause): New.
> 
> What I've said for the C FE patch, plus:
> 
>> +	  if (cp_lexer_next_token_is (lexer, CPP_NAME)
>> +	      || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
>> +	    {
>> +	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
>> +	      const char *p = IDENTIFIER_POINTER (name_kind);
>> +	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)
> 
> As static is a keyword, wouldn't it be better to just handle that case
> using cp_lexer_next_token_is_keyword (lexer, RID_STATIC)?
> 
> Also, what is the exact grammar of the shape arguments?
> Would be nice to describe the grammar, in the grammar you just say
> expression, at least for vector/worker, which is clearly not accurate.
> 
> It seems the intent is that num: or length: or static: is optional, right?
> But if that is the case, you should treat those as parsed only if followed
> by :.  While static is a keyword, so you can't have a variable called like
> that, having vector(length) or vector(num) should not be rejected.
> So, I would have expected that it should test if it is RID_STATIC
> followed by CPP_COLON (and only in that case consume those tokens),
> or CPP_NAME of id followed by CPP_COLON (and only in that case consume those
> tokens), otherwise parse it as assignment expression.

That function now peeks ahead to look for a colon, so it can now handle
variables that share the name of a clause keyword.
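
For illustration only (the names below are made up, not from the patch),
this is the sort of input the colon peeking is meant to accept, where the
clause argument is an ordinary variable named like the keyword:

  int i, length = 32, num = 4;
  #pragma acc kernels
  /* Parsed as worker(num: num) vector(length: length).  */
  #pragma acc loop worker(num) vector(length)
  for (i = 0; i < 1024; i++)
    ;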

> The C FE may have similar issue.  Plus of course there should be testsuite
> coverage for all the weird cases.

I included a new test in a different patch because it's common to both C
and C++.

>> +	case OMP_CLAUSE_GANG:
>> +	case OMP_CLAUSE_VECTOR:
>> +	case OMP_CLAUSE_WORKER:
>> +	  /* Operand 0 is the num: or length: argument.  */
>> +	  t = OMP_CLAUSE_OPERAND (c, 0);
>> +	  if (t == NULL_TREE)
>> +	    break;
>> +
>> +	  t = maybe_convert_cond (t);
> 
> Can you explain the maybe_convert_cond calls (in both cases here,
> plus the preexisting in OMP_CLAUSE_VECTOR_LENGTH)?
> The reason why it is used for OpenMP if and final clauses is that those have
> a condition argument, either the condition is zero or non-zero (so
> effectively it is turned into a bool).
> But aren't the gang/vector/worker/vector_length arguments integers rather
> than conditions?  I'd expect that finish_omp_clauses should verify
> those operands are indeed integral expressions (if that is the requirement
> in the standard), as it is something that for C++ can't be verified during
> parsing, if arbitrary expressions are parsed there.

It's probably a copy-and-paste error. This functionality was added
incrementally. I removed that check.
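
As a rough sketch of what the integral check now catches (this exact
snippet isn't in the testsuite, it's just for illustration):

  float f = 2.0f;
  #pragma acc parallel num_gangs(f) /* error: 'num_gangs' expression must be integral */
  {
  }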

>> @@ -5959,32 +5990,58 @@ finish_omp_clauses (tree clauses, bool a
>>  	  break;
>>  
>>  	case OMP_CLAUSE_NUM_THREADS:
>> -	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
>> -	  if (t == error_mark_node)
>> -	    remove = true;
>> -	  else if (!type_dependent_expression_p (t)
>> -		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
>> -	    {
>> -	      error ("num_threads expression must be integral");
>> -	      remove = true;
>> -	    }
>> -	  else
>> -	    {
>> -	      t = mark_rvalue_use (t);
>> -	      if (!processing_template_decl)
>> -		{
>> -		  t = maybe_constant_value (t);
>> -		  if (TREE_CODE (t) == INTEGER_CST
>> -		      && tree_int_cst_sgn (t) != 1)
>> -		    {
>> -		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
>> -				  "%<num_threads%> value must be positive");
>> -		      t = integer_one_node;
>> -		    }
>> -		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
>> -		}
>> -	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
>> -	    }
>> +	case OMP_CLAUSE_NUM_GANGS:
>> +	case OMP_CLAUSE_NUM_WORKERS:
>> +	case OMP_CLAUSE_VECTOR_LENGTH:
> 
> If you are already merging some of the similar handling, please
> handle OMP_CLAUSE_NUM_TEAMS and OMP_CLAUSE_NUM_TASKS the same way.

I did that, but I also had to adjust the expected errors in a couple of
existing gomp test cases.

Is this patch OK for trunk?

Cesar


[-- Attachment #2: 05-cpfe-cjp.diff --]
[-- Type: text/x-patch, Size: 17319 bytes --]

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker.  Unify the handling of num_teams, num_tasks and vector_length with
	the other loop shape clauses.

2015-10-23  Nathan Sidwell <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.
	* g++.dg/gomp/pr33372-3.C: Likewise.


diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index 7555bf3..28cfdc9 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -29064,7 +29064,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29122,7 +29124,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29212,6 +29216,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29238,7 +29244,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29246,6 +29254,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29582,6 +29592,147 @@ cp_parser_oacc_data_clause_deviceptr (cp_parser *parser, tree list)
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser *ARG_UNUSED (parser),
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, omp_clause_code kind,
+			     const char *str, tree list)
+{
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  tree op0 = NULL_TREE, op1 = NULL_TREE;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  loc = cp_lexer_peek_token (lexer)->location;
+	  op_to_parse = &op0;
+
+	  if ((cp_lexer_next_token_is (lexer, CPP_NAME)
+	       || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
+	      && cp_lexer_nth_token_is (lexer, 2, CPP_COLON))
+	    {
+	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
+	      const char *p = IDENTIFIER_POINTER (name_kind);
+	      if (kind == OMP_CLAUSE_GANG
+		  && cp_lexer_next_token_is_keyword (lexer, RID_STATIC))
+		{
+		  cp_lexer_consume_token (lexer); /* static  */
+		  cp_lexer_consume_token (lexer); /* ':'  */
+
+		  op_to_parse = &op1;
+		  if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		    {
+		      cp_lexer_consume_token (lexer);
+		      *op_to_parse = integer_minus_one_node;
+
+		      /* Consume a comma if present.  */
+		      if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+			cp_lexer_consume_token (lexer);
+
+		      continue;
+		    }
+		}
+	      else if (strcmp (id, p) == 0)
+		{
+		  cp_lexer_consume_token (lexer); /* id  */
+		  cp_lexer_consume_token (lexer); /* ':'  */
+		}
+	      else
+		{
+		  if (kind == OMP_CLAUSE_GANG)
+		    cp_parser_error (parser,
+				     "expected %<num%> or %<static%>");
+		  else if (kind == OMP_CLAUSE_VECTOR)
+		    cp_parser_error (parser, "expected %<length%>");
+		  else
+		    cp_parser_error (parser, "expected %<num%>");
+		  cp_parser_skip_to_closing_parenthesis (parser, false, false,
+							 true);
+		  return list;
+		}
+	    }
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "unexpected argument");
+	      cp_parser_skip_to_closing_parenthesis (parser, false, false,
+						     true);
+	      return list;
+	    }
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    {
+	      cp_parser_skip_to_closing_parenthesis (parser, false, false,
+						     true);
+	      return list;
+	    }
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+
+	  /* Consume a comma if present.  */
+	  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    cp_lexer_consume_token (lexer);
+	}
+      while (!cp_lexer_next_token_is (lexer, CPP_CLOSE_PAREN));
+      cp_lexer_consume_token (lexer);
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  tree c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31306,6 +31457,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31338,6 +31494,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31382,6 +31543,16 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31390,6 +31561,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						 c_name, clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34339,6 +34515,11 @@ cp_parser_oacc_kernels (cp_parser *parser, cp_token *pragma_tok)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION))
 
 static tree
diff --git a/gcc/cp/semantics.c b/gcc/cp/semantics.c
index 11315d9..153a970 100644
--- a/gcc/cp/semantics.c
+++ b/gcc/cp/semantics.c
@@ -5911,6 +5911,31 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
 	  goto handle_field_decl;
 
+	case OMP_CLAUSE_GANG:
+	case OMP_CLAUSE_VECTOR:
+	case OMP_CLAUSE_WORKER:
+	  /* Operand 0 is the num: or length: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 0);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 0) = t;
+
+	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
+	    break;
+
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 1) = t;
+	  break;
+
 	case OMP_CLAUSE_LASTPRIVATE:
 	  t = omp_clause_decl_field (OMP_CLAUSE_DECL (c));
 	  if (t)
@@ -5965,14 +5990,37 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	  OMP_CLAUSE_FINAL_EXPR (c) = t;
 	  break;
 
+	case OMP_CLAUSE_NUM_TASKS:
+	case OMP_CLAUSE_NUM_TEAMS:
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  t = OMP_CLAUSE_OPERAND (c, 0);
 	  if (t == error_mark_node)
 	    remove = true;
 	  else if (!type_dependent_expression_p (t)
 		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
 	    {
-	      error ("num_threads expression must be integral");
+	     switch (OMP_CLAUSE_CODE (c))
+		{
+		case OMP_CLAUSE_NUM_TASKS:
+		  error ("%<num_tasks%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_TEAMS:
+		  error ("%<num_teams%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_THREADS:
+		  error ("%<num_threads%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_GANGS:
+		  error ("%<num_gangs%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_WORKERS:
+		  error ("%<num_workers%> expression must be integral");
+		  break;
+		case OMP_CLAUSE_VECTOR_LENGTH:
+		  error ("%<vector_length%> expression must be integral");
+		  break;
+		default:
+		  error ("invalid argument");
+		}
 	      remove = true;
 	    }
 	  else
@@ -5984,13 +6032,40 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 		  if (TREE_CODE (t) == INTEGER_CST
 		      && tree_int_cst_sgn (t) != 1)
 		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
+		      switch (OMP_CLAUSE_CODE (c))
+			{
+			case OMP_CLAUSE_NUM_TASKS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_tasks%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_TEAMS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_teams%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_THREADS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_threads%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_NUM_GANGS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_gangs%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_WORKERS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_workers%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_VECTOR_LENGTH:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<vector_length%> value must be "
+				      "positive"); break;
+			default:
+			  error ("invalid argument");
+			}
 		      t = integer_one_node;
 		    }
 		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
 		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
+	      OMP_CLAUSE_OPERAND (c, 0) = t;
 	    }
 	  break;
 
@@ -6062,35 +6137,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_NUM_TEAMS:
-	  t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_teams%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_teams%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TEAMS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_ASYNC:
 	  t = OMP_CLAUSE_ASYNC_EXPR (c);
 	  if (t == error_mark_node)
@@ -6110,16 +6156,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6547,35 +6583,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  goto check_dup_generic;
 
-	case OMP_CLAUSE_NUM_TASKS:
-	  t = OMP_CLAUSE_NUM_TASKS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_tasks%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_tasks%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TASKS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_GRAINSIZE:
 	  t = OMP_CLAUSE_GRAINSIZE_EXPR (c);
 	  if (t == error_mark_node)
@@ -6694,6 +6701,8 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-1.C b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
index 62900bf..e9da259 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-1.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   extern T n ();
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-3.C b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
index 8220f3c..f0a1910 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-3.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   T n = 6;
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-22 15:03             ` Jakub Jelinek
  2015-10-22 15:08               ` Cesar Philippidis
@ 2015-10-23 20:32               ` Cesar Philippidis
  2015-10-24  2:56                 ` Cesar Philippidis
  1 sibling, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-23 20:32 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3078 bytes --]

On 10/22/2015 08:00 AM, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 07:47:01AM -0700, Cesar Philippidis wrote:
>>> But it is unclear from the parsing what from these is allowed:
>>
>> int v, w;
>> ...
>> gang(26)  // equivalent to gang(num:26)
>> gang(v)   // gang(num:v)
>> vector(length: 16)  // vector(length: 16)
>> vector(length: v)  // vector(length: v)
>> vector(16)  // vector(length: 16)
>> vector(v)   // vector(length: v)
>> worker(num: 16)  // worker(num: 16)
>> worker(num: v)   // worker(num: v)
>> worker(16)  // worker(num: 16)
>> worker(v)   // worker(num: v)
>> gang(16, 24)  // technically gang(num:16, num:24) is acceptable but it
>>               // should be an error
>> gang(v, w)  // likewise
>> gang(static: 16, num: 5)  // gang(static: 16, num: 5)
>> gang(static: v, num: w)   // gang(static: v, num: w)
>> gang(num: 5, static: 4)   // gang(num: 5, static: 4)
>> gang(num: v, static: w)   // gang(num: v, static: w)
>>
>> Also note that the static argument can accept '*'.
>>
>>> and if the length: or num: part is really optional, then
>>> int length, num;
>>> vector(length)
>>> worker(num)
>>> gang(num, static: 6)
>>> gang(static: 5, num)
>>> should be also accepted (or subset thereof?).
>>
>> Interesting question. The spec is unclear. It defines gang, worker and
>> vector as follows in section 2.7 in the OpenACC 2.0a spec:
>>
>>   gang [( gang-arg-list )]
>>   worker [( [num:] int-expr )]
>>   vector [( [length:] int-expr )]
>>
>> where gang-arg is one of:
>>
>>   [num:] int-expr
>>   static: size-expr
>>
>> and gang-arg-list may have at most one num and one static argument,
>> and where size-expr is one of:
>>
>>   *
>>   int-expr
>>
>> So I've interpreted that as a requirement that length and num must be
>> followed by an int-expr, whatever that is.
> 
> My reading of the above is that
> vector(length)
> is equivalent to
> vector(length: length)
> and
> worker(num)
> is equivalent to
> worker(num: num)
> etc.  Basically, neither length nor num is a reserved identifier,
> so you can use them for variable names, and if
> vector(v) is equivalent to vector(length: v), then
> vector(length) should be equivalent to vector(length:length)
> or
> vector(length + 1) should be equivalent to vector(length: length+1)
> static is a keyword that can't start an integral expression, so I guess
> it is fine if you issue an 'expected :' diagnostic after it.
> 
> In any case, please add a testcase (both C and C++) which covers all these
> allowed variants (ideally one testcase) and rejected variants (another
> testcase with dg-error).
> 
> This is still an easy case, as even the C FE has 2 tokens lookup.
> E.g. for OpenMP map clause where
> map (always, tofrom: x)
> means one thing and
> map (always, tofrom, y)
> another one (map (tofrom: always, tofrom, y))
> I had to do quite ugly things to get around this.

Here are the updated test cases. Besides adding a new test to exercise
the loop shape parsing, I also removed the assembly file that Ilya
noticed was included in the original patch.

Is this OK for trunk?

Cesar


[-- Attachment #2: 11-testsuite-cjp.diff --]
[-- Type: text/x-patch, Size: 13272 bytes --]

2015-10-23  Nathan Sidwell  <nathan@codesourcery.com>

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: New.

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* c-c++-common/goacc/loop-shape.c: New.


diff --git a/gcc/testsuite/c-c++-common/goacc/loop-shape.c b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
new file mode 100644
index 0000000..3cb3006
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
@@ -0,0 +1,197 @@
+/* Exercise *_parser_oacc_shape_clause by checking various combinations
+   of gang, worker and vector clause arguments.  */
+
+/* { dg-compile } */
+
+int main ()
+{
+  int i;
+  int v, w;
+  int length, num;
+
+  /* Valid uses.  */
+
+  #pragma acc kernels
+  #pragma acc loop gang worker vector
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(26)
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 16, num: 5)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: v, num: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static: 6)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 5, num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*, 1)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5, static: 4)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: v, static: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static:num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:length)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:num)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  /* Invalid uses.  */
+  
+  #pragma acc kernels
+  #pragma acc loop gang(16, 24) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(v, w) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, num:5) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop gang(length:num) /* { dg-error "expected .num. or .static." } */
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop vector(5, length:length) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop vector(num:length) /* { dg-error "expected .length. before" } */
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(length:5) /* { dg-error "expected .num. before" } */
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(1, num:2) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;  
+  
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
new file mode 100644
index 0000000..58545d0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix / ((N + 31) / 32);
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
new file mode 100644
index 0000000..c01c6fa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang (static:1)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix % 32;
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
new file mode 100644
index 0000000..f23e2f3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int chunk_size = (N + 32*32*32 - 1) / (32*32*32);
+	  
+	  int g = ix / (chunk_size * 32 * 32);
+	  int w = ix / 32 % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
new file mode 100644
index 0000000..70c6292
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
new file mode 100644
index 0000000..5473c2d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
new file mode 100644
index 0000000..85e4476
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = (ix / 32) % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Re: [OpenACC 4/11] C FE changes
  2015-10-23 20:20     ` Cesar Philippidis
@ 2015-10-23 20:40       ` Jakub Jelinek
  2015-10-23 21:31         ` Jakub Jelinek
  2015-10-23 21:32         ` Cesar Philippidis
  2015-10-23 21:25       ` Nathan Sidwell
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23 20:40 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 01:17:07PM -0700, Cesar Philippidis wrote:
> Good idea, thanks. This patch also corrects the problems parsing weird
> combinations of num, static and length arguments that you mentioned
> elsewhere.
> 
> Is this OK for trunk?

I'd strongly prefer to always see patches accompanied by testcases.

> +	  loc = c_parser_peek_token (parser)->location;
> +	  op_to_parse = &op0;
> +
> +	  if ((c_parser_next_token_is (parser, CPP_NAME)
> +	       || c_parser_next_token_is (parser, CPP_KEYWORD))
> +	      && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
> +	    {
> +	      tree name_kind = c_parser_peek_token (parser)->value;
> +	      const char *p = IDENTIFIER_POINTER (name_kind);

I think I'd prefer not to peek at this at all if it is RID_STATIC,
so perhaps just have (and name_kind is weird):
	      else
		{
		  tree val = c_parser_peek_token (parser)->value;
		  if (strcmp (id, IDENTIFIER_POINTER (val)) == 0)
		    {
		      c_parser_consume_token (parser);  /* id  */
		      c_parser_consume_token (parser);  /* ':'  */
		    }
		  else
		    {
...
		    }
		}
?

> +	      if (kind == OMP_CLAUSE_GANG
> +		  && c_parser_next_token_is_keyword (parser, RID_STATIC))
> +		{
> +		  c_parser_consume_token (parser); /* static  */
> +		  c_parser_consume_token (parser); /* ':'  */
> +
> +		  op_to_parse = &op1;
> +		  if (c_parser_next_token_is (parser, CPP_MULT))
> +		    {
> +		      c_parser_consume_token (parser);
> +		      *op_to_parse = integer_minus_one_node;
> +
> +		      /* Consume a comma if present.  */
> +		      if (c_parser_next_token_is (parser, CPP_COMMA))
> +			c_parser_consume_token (parser);

Doesn't this mean that you happily parse
gang (static: * abc)
or
gang (static:*num:1)
etc.?  I'd say the comma should be non-optional (i.e. either accept
CPP_COMMA, or CPP_CLOSE_PAREN, but nothing else) in that case (at least,
when in OpenMP grammar something is *-list it is meant to be comma
separated).

> +	  /* Consume a comma if present.  */
> +	  if (c_parser_next_token_is (parser, CPP_COMMA))
> +	    c_parser_consume_token (parser);

Similarly this means
gang (num: 5 static: *)
is accepted.  If it is valid, then again it should have testsuite coverage.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-23 20:20     ` Cesar Philippidis
  2015-10-23 20:40       ` Jakub Jelinek
@ 2015-10-23 21:25       ` Nathan Sidwell
  2015-10-25 14:18         ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-23 21:25 UTC (permalink / raw)
  To: Cesar Philippidis, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/23/15 16:17, Cesar Philippidis wrote:

> Nathan, can you try out this patch with your updated patch set? I saw
> some test cases getting stuck when expanding expand_GOACC_DIM_SIZE on
> the host compiler, which is wrong. I don't see that happening in
> gomp-4_0-branch with this patch. Also, can you merge this patch along
> with the c++ and new test case patches to trunk? I'll handle the gomp4
> backport.

Wilco.


nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: Re: [OpenACC 4/11] C FE changes
  2015-10-23 20:40       ` Jakub Jelinek
@ 2015-10-23 21:31         ` Jakub Jelinek
  2015-10-23 21:32         ` Cesar Philippidis
  1 sibling, 0 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-23 21:31 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 10:31:55PM +0200, Jakub Jelinek wrote:
> Doesn't this mean that you happily parse
> gang (static: * abc)
> or
> gang (static:*num:1)
> etc.?  I'd say the comma should be non-optional (i.e. either accept
> CPP_COMMA, or CPP_CLOSE_PAREN, but nothing else) in that case (at least,
> when in OpenMP grammar something is *-list it is meant to be comma
> separated).

Looking at the OpenACC standard, gang-arg-list is indeed a comma-separated
list of gang-arg, so the above are not valid, and you really should just
error out and skip to the close paren if what's there isn't a CPP_COMMA or
CPP_CLOSE_PAREN.  And for vector/worker arguments, which don't accept a
*-list, IMNSHO you shouldn't even try to accept CPP_COMMA, just require
CPP_CLOSE_PAREN.
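
To spell that out with a few examples (my reading of the grammar, not
tested):

  gang(num: 2, static: 4)  // OK, comma separated gang-arg-list
  gang(static: * num: 1)   // invalid, missing comma between gang-args
  worker(num: 4)           // OK, single argument
  worker(num: 4, 8)        // invalid, worker doesn't take a list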

> 
> > +	  /* Consume a comma if present.  */
> > +	  if (c_parser_next_token_is (parser, CPP_COMMA))
> > +	    c_parser_consume_token (parser);
> 
> Similarly this means
> gang (num: 5 static: *)
> is accepted.  If it is valid, then again it should have testsuite coverage.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-23 20:40       ` Jakub Jelinek
  2015-10-23 21:31         ` Jakub Jelinek
@ 2015-10-23 21:32         ` Cesar Philippidis
  2015-10-24  2:37           ` Cesar Philippidis
  1 sibling, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-23 21:32 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/23/2015 01:31 PM, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 01:17:07PM -0700, Cesar Philippidis wrote:
>> Good idea, thanks. This patch also corrects the problems parsing weird
>> combinations of num, static and length arguments that you mentioned
>> elsewhere.
>>
>> Is this OK for trunk?
> 
> I'd strongly prefer to always see patches accompanied by testcases.
> 
>> +	  loc = c_parser_peek_token (parser)->location;
>> +	  op_to_parse = &op0;
>> +
>> +	  if ((c_parser_next_token_is (parser, CPP_NAME)
>> +	       || c_parser_next_token_is (parser, CPP_KEYWORD))
>> +	      && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
>> +	    {
>> +	      tree name_kind = c_parser_peek_token (parser)->value;
>> +	      const char *p = IDENTIFIER_POINTER (name_kind);
> 
> I think I'd prefer not to peek at this at all if it is RID_STATIC,
> so perhaps just have (and name_kind is weird):
> 	      else
> 		{
> 		  tree val = c_parser_peek_token (parser)->value;
> 		  if (strcmp (id, IDENTIFIER_POINTER (val)) == 0)
> 		    {
> 		      c_parser_consume_token (parser);  /* id  */
> 		      c_parser_consume_token (parser);  /* ':'  */
> 		    }
> 		  else
> 		    {
> ...
> 		    }
> 		}
> ?

My plan here was to try to catch any arguments with a colon. But that
fell through because...

>> +	      if (kind == OMP_CLAUSE_GANG
>> +		  && c_parser_next_token_is_keyword (parser, RID_STATIC))
>> +		{
>> +		  c_parser_consume_token (parser); /* static  */
>> +		  c_parser_consume_token (parser); /* ':'  */
>> +
>> +		  op_to_parse = &op1;
>> +		  if (c_parser_next_token_is (parser, CPP_MULT))
>> +		    {
>> +		      c_parser_consume_token (parser);
>> +		      *op_to_parse = integer_minus_one_node;
>> +
>> +		      /* Consume a comma if present.  */
>> +		      if (c_parser_next_token_is (parser, CPP_COMMA))
>> +			c_parser_consume_token (parser);
> 
> Doesn't this mean that you happily parse
> gang (static: * abc)
> or
> gang (static:*num:1)
> etc.?  I'd say the comma should be non-optional (i.e. either accept
> CPP_COMMA, or CPP_CLOSE_PAREN, but nothing else) in that case (at least,
> when in OpenMP grammar something is *-list it is meant to be comma
> separated).

I'm not handling commas properly. My next patch is going to handle the
static argument separately.

>> +	  /* Consume a comma if present.  */
>> +	  if (c_parser_next_token_is (parser, CPP_COMMA))
>> +	    c_parser_consume_token (parser);
> 
> Similarly this means
> gang (num: 5 static: *)
> is accepted.  If it is valid, then again it should have testsuite coverage.

I'll include a test case for this with the next patch.

Cesar

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-23 21:32         ` Cesar Philippidis
@ 2015-10-24  2:37           ` Cesar Philippidis
  2015-10-24 13:08             ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-24  2:37 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 2893 bytes --]

On 10/23/2015 02:31 PM, Cesar Philippidis wrote:
> On 10/23/2015 01:31 PM, Jakub Jelinek wrote:
>> On Fri, Oct 23, 2015 at 01:17:07PM -0700, Cesar Philippidis wrote:
>>> Good idea, thanks. This patch also corrects the problems parsing weird
>>> combinations of num, static and length arguments that you mentioned
>>> elsewhere.
>>>
>>> Is this OK for trunk?
>>
>> I'd strongly prefer to always see patches accompanied by testcases.
>>
>>> +	  loc = c_parser_peek_token (parser)->location;
>>> +	  op_to_parse = &op0;
>>> +
>>> +	  if ((c_parser_next_token_is (parser, CPP_NAME)
>>> +	       || c_parser_next_token_is (parser, CPP_KEYWORD))
>>> +	      && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
>>> +	    {
>>> +	      tree name_kind = c_parser_peek_token (parser)->value;
>>> +	      const char *p = IDENTIFIER_POINTER (name_kind);
>>
>> I think I'd prefer not to peek at this at all if it is RID_STATIC,
>> so perhaps just have (and name_kind is weird):
>> 	      else
>> 		{
>> 		  tree val = c_parser_peek_token (parser)->value;
>> 		  if (strcmp (id, IDENTIFIER_POINTER (val)) == 0)
>> 		    {
>> 		      c_parser_consume_token (parser);  /* id  */
>> 		      c_parser_consume_token (parser);  /* ':'  */
>> 		    }
>> 		  else
>> 		    {
>> ...
>> 		    }
>> 		}
>> ?
> 
> My plan here was to try to catch any arguments with a colon. But that
> fell through because...
> 
>>> +	      if (kind == OMP_CLAUSE_GANG
>>> +		  && c_parser_next_token_is_keyword (parser, RID_STATIC))
>>> +		{
>>> +		  c_parser_consume_token (parser); /* static  */
>>> +		  c_parser_consume_token (parser); /* ':'  */
>>> +
>>> +		  op_to_parse = &op1;
>>> +		  if (c_parser_next_token_is (parser, CPP_MULT))
>>> +		    {
>>> +		      c_parser_consume_token (parser);
>>> +		      *op_to_parse = integer_minus_one_node;
>>> +
>>> +		      /* Consume a comma if present.  */
>>> +		      if (c_parser_next_token_is (parser, CPP_COMMA))
>>> +			c_parser_consume_token (parser);
>>
>> Doesn't this mean that you happily parse
>> gang (static: * abc)
>> or
>> gang (static:*num:1)
>> etc.?  I'd say the comma should be non-optional (i.e. either accept
>> CPP_COMMA, or CPP_CLOSE_PAREN, but nothing else) in that case (at least,
>> when in OpenMP grammar something is *-list it is meant to be comma
>> separated).
> 
> I'm not handling commas properly. My next patch is going to handle the
> static argument separately.
> 
>>> +	  /* Consume a comma if present.  */
>>> +	  if (c_parser_next_token_is (parser, CPP_COMMA))
>>> +	    c_parser_consume_token (parser);
>>
>> Similarly this means
>> gang (num: 5 static: *)
>> is accepted.  If it is valid, then again it should have testsuite coverage.
> 
> I'll include a test case for this with the next patch.

Here's the updated patch. Hopefully I addressed everything. Thank you
for suggesting all of those test cases.

Is this OK for trunk?

Cesar


[-- Attachment #2: 04-cfe-cjp-2.diff --]
[-- Type: text/x-patch, Size: 12089 bytes --]

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	gcc/c/
	* c-parser.c (c_parser_oacc_shape_clause): New.
	(c_parser_oacc_simple_clause): New.
	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* c-c++-common/goacc/loop-shape.c: New test.

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index c8c6a2d..7d2baa9 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -11188,6 +11188,156 @@ c_parser_omp_clause_num_workers (c_parser *parser, tree list)
 }
 
 /* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
+			    const char *str, tree list)
+{
+  const char *id = "num";
+  tree op0 = NULL_TREE, op1 = NULL_TREE, c;
+  location_t loc = c_parser_peek_token (parser)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      c_token *next;
+
+      c_parser_consume_token (parser);
+
+      do
+	{
+	  next = c_parser_peek_token (parser);
+	  loc = next->location;
+	  op_to_parse = &op0;
+
+	  /* Consume a comma if present.  */
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      if (op0 == NULL && op1 == NULL)
+		{
+		  c_parser_error (parser, "unexpected argument");
+		  goto cleanup_error;
+		}
+
+	      c_parser_consume_token (parser);
+	    }
+
+	  /* First extract any length:, num: and static: arguments.  */
+
+	  /* Gang static argument.  */
+	  if (c_parser_next_token_is (parser, CPP_KEYWORD)
+	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
+	    {
+	      if (kind != OMP_CLAUSE_GANG)
+		{
+		  c_parser_error (parser, "invalid %<static%> argument");
+		  goto cleanup_error;
+		}
+
+	      c_parser_consume_token (parser);
+
+	      if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
+		goto cleanup_error;
+
+	      op_to_parse = &op1;
+	      if (*op_to_parse != NULL)
+		{
+		  c_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (c_parser_next_token_is (parser, CPP_MULT))
+		{
+		  c_parser_consume_token (parser);
+		  *op_to_parse = integer_minus_one_node;
+		  continue;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (c_parser_next_token_is (parser, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->value)) == 0
+		   && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+	    {
+	      c_parser_consume_token (parser);  /* id  */
+	      c_parser_consume_token (parser);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      c_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  tree expr = c_parser_expr_no_commas (parser, NULL).value;
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+	}
+      while (c_parser_next_token_is (parser, CPP_COMMA)
+	     && !c_parser_next_token_is (parser, CPP_CLOSE_PAREN));
+
+      if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str);
+
+  c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+  return list;
+}
+
+/* OpenACC:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+c_parser_oacc_simple_clause (c_parser *parser ATTRIBUTE_UNUSED,
+			     enum omp_clause_code code, tree list)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
+
+  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+}
+
+/* OpenACC:
    async [( int-expr )] */
 
 static tree
@@ -12393,6 +12543,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						clauses);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = c_parser_omp_clause_collapse (parser, clauses);
 	  c_name = "collapse";
@@ -12429,6 +12584,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_omp_clause_firstprivate (parser, clauses);
 	  c_name = "firstprivate";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -12477,6 +12637,16 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						clauses);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						c_name,	clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = c_parser_omp_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -12485,6 +12655,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						c_name, clauses);
+	  break;
 	default:
 	  c_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -13015,6 +13190,11 @@ c_parser_oacc_enter_exit_data (c_parser *parser, bool enter)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-shape.c b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
new file mode 100644
index 0000000..1053361
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
@@ -0,0 +1,218 @@
+/* Exercise *_parser_oacc_shape_clause by checking various combinations
+   of gang, worker and vector clause arguments.  */
+
+/* { dg-compile } */
+
+int main ()
+{
+  int i;
+  int v, w;
+  int length, num;
+
+  /* Valid uses.  */
+
+  #pragma acc kernels
+  #pragma acc loop gang worker vector
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(26)
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 16, num: 5)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: v, num: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static: 6)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 5, num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*, 1)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5, static: 4)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: v, static: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static:num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:length)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:num)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  /* Invalid uses.  */
+  
+  #pragma acc kernels
+  #pragma acc loop gang(16, 24) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(v, w) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1 num:2, num:3, 4) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, num:5) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(length:num) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(5, length:length) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(num:length) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(length:5) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(1, num:2) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: * abc) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*num:1) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5 static: *) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  
+  return 0;
+}


* Re: [OpenACC 5/11] C++ FE changes
  2015-10-23 20:26     ` Cesar Philippidis
@ 2015-10-24  2:39       ` Cesar Philippidis
  2015-10-24 21:15         ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-24  2:39 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 5646 bytes --]

On 10/23/2015 01:25 PM, Cesar Philippidis wrote:
> On 10/22/2015 01:52 AM, Jakub Jelinek wrote:
>> On Wed, Oct 21, 2015 at 03:18:55PM -0400, Nathan Sidwell wrote:
>>> This patch is the C++ changes matching the C ones of patch 4.  In
>>> finish_omp_clauses, the gang, worker, & vector clauses are handled the same
>>> as OpenMP's 'num_threads' clause.  One change to num_threads is the
>>> augmentation of a diagnostic to add %<...%>  markers to the clause name.
>>
>> Indeed, lots of older OpenMP diagnostics is missing %<...%> markers around
>> keywords.  Something to fix eventually.
> 
> I updated omp tasks and teams in semantics.c.
> 
>>> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
>>> 	    Thomas Schwinge  <thomas@codesourcery.com>
>>> 	    James Norris  <jnorris@codesourcery.com>
>>> 	    Joseph Myers  <joseph@codesourcery.com>
>>> 	    Julian Brown  <julian@codesourcery.com>
>>> 	    Nathan Sidwell <nathan@codesourcery.com>
>>>
>>> 	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
>>> 	vector, worker.
>>> 	(cp_parser_oacc_simple_clause): New.
>>> 	(cp_parser_oacc_shape_clause): New.
>>
>> What I've said for the C FE patch, plus:
>>
>>> +	  if (cp_lexer_next_token_is (lexer, CPP_NAME)
>>> +	      || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
>>> +	    {
>>> +	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
>>> +	      const char *p = IDENTIFIER_POINTER (name_kind);
>>> +	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)
>>
>> As static is a keyword, wouldn't it be better to just handle that case
>> using cp_lexer_next_token_is_keyword (lexer, RID_STATIC)?
>>
>> Also, what is the exact grammar of the shape arguments?
>> Would be nice to describe the grammar, in the grammar you just say
>> expression, at least for vector/worker, which is clearly not accurate.
>>
>> It seems the intent is that num: or length: or static: is optional, right?
>> But if that is the case, you should treat those as parsed only if followed
>> by :.  While static is a keyword, so you can't have a variable called like
>> that, having vector(length) or vector(num) should not be rejected.
>> So, I would have expected that it should test if it is RID_STATIC
>> followed by CPP_COLON (and only in that case consume those tokens),
>> or CPP_NAME of id followed by CPP_COLON (and only in that case consume those
>> tokens), otherwise parse it as assignment expression.
> 
> That function now peeks ahead to look for a colon, so now it can handle
> variables with the name of clause keywords.
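
For illustration, the effect of that lookahead at the source level: the
identifier is treated as the clause keyword only when a ':' follows, so
both of the following parse (these exact forms are exercised in the
shared loop-shape.c test):

  int i, length = 16;

  #pragma acc kernels
  #pragma acc loop vector(length)           /* 'length' is the length expression */
  for (i = 0; i < 10; i++)
    ;

  #pragma acc kernels
  #pragma acc loop vector(length: length)   /* explicit 'length:' prefix */
  for (i = 0; i < 10; i++)
    ;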
> 
>> The C FE may have similar issue.  Plus of course there should be testsuite
>> coverage for all the weird cases.
> 
> I included a new test in a different patch because it's common to both c
> and c++.
> 
>>> +	case OMP_CLAUSE_GANG:
>>> +	case OMP_CLAUSE_VECTOR:
>>> +	case OMP_CLAUSE_WORKER:
>>> +	  /* Operand 0 is the num: or length: argument.  */
>>> +	  t = OMP_CLAUSE_OPERAND (c, 0);
>>> +	  if (t == NULL_TREE)
>>> +	    break;
>>> +
>>> +	  t = maybe_convert_cond (t);
>>
>> Can you explain the maybe_convert_cond calls (in both cases here,
>> plus the preexisting in OMP_CLAUSE_VECTOR_LENGTH)?
>> The reason why it is used for OpenMP if and final clauses is that those have
>> a condition argument, either the condition is zero or non-zero (so
>> effectively it is turned into a bool).
>> But aren't the gang/vector/worker/vector_length arguments integers rather
>> than conditions?  I'd expect that finish_omp_clauses should verify
>> those operands are indeed integral expressions (if that is the requirement
>> in the standard), as it is something that for C++ can't be verified during
>> parsing, if arbitrary expressions are parsed there.
> 
> It's probably a copy-and-paste error. This functionality was added
> incrementally. I removed that check.
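
(For reference, the kind of integral check being discussed, as it appears
for the num_gangs/num_workers/vector_length clauses in the semantics.c
hunk below; condensed sketch only, the real code selects the clause name
via a switch:)

	  else if (!type_dependent_expression_p (t)
		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
	    {
	      error ("%<num_gangs%> expression must be integral");
	      remove = true;
	    }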
> 
>>> @@ -5959,32 +5990,58 @@ finish_omp_clauses (tree clauses, bool a
>>>  	  break;
>>>  
>>>  	case OMP_CLAUSE_NUM_THREADS:
>>> -	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
>>> -	  if (t == error_mark_node)
>>> -	    remove = true;
>>> -	  else if (!type_dependent_expression_p (t)
>>> -		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
>>> -	    {
>>> -	      error ("num_threads expression must be integral");
>>> -	      remove = true;
>>> -	    }
>>> -	  else
>>> -	    {
>>> -	      t = mark_rvalue_use (t);
>>> -	      if (!processing_template_decl)
>>> -		{
>>> -		  t = maybe_constant_value (t);
>>> -		  if (TREE_CODE (t) == INTEGER_CST
>>> -		      && tree_int_cst_sgn (t) != 1)
>>> -		    {
>>> -		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
>>> -				  "%<num_threads%> value must be positive");
>>> -		      t = integer_one_node;
>>> -		    }
>>> -		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
>>> -		}
>>> -	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
>>> -	    }
>>> +	case OMP_CLAUSE_NUM_GANGS:
>>> +	case OMP_CLAUSE_NUM_WORKERS:
>>> +	case OMP_CLAUSE_VECTOR_LENGTH:
>>
>> If you are already merging some of the similar handling, please
>> handle OMP_CLAUSE_NUM_TEAMS and OMP_CLAUSE_NUM_TASKS the same way.
> 
> I did that, but I also had to adjust the expected errors in a couple of
> existing gomp test cases.
> 
> Is this patch OK for trunk?

This patch teaches the loop shape parsing function how to be more careful
with commas. Most of the errors in the c++ front end are similar to their
counterparts in c, except for those where the c++ front end thinks a
stray colon is an errant scope resolution operator. For those types of
failures, I just used a generic regex for the dg-error. Consequently,
this patch shares the same loop-shape.c test case that I included in the
c patch.
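
An empty dg-error string, as used in loop-shape.c for those lines, is a
regex that matches any diagnostic reported on that line, so the shared
test passes with either front end's wording, e.g.:

  #pragma acc loop vector(num:length) /* { dg-error "" } */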

Is this OK for trunk?

Cesar


[-- Attachment #2: 05-cpfe-cjp-2.diff --]
[-- Type: text/x-patch, Size: 23136 bytes --]

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker.  Unify the handling of num_teams, num_tasks and
	vector_length with the other loop shape clauses.

2015-10-23  Nathan Sidwell <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.
	* g++.dg/gomp/pr33372-3.C: Likewise.


diff --git a/gcc/cp/ChangeLog b/gcc/cp/ChangeLog
index 8dbff11..2970ec2 100644
--- a/gcc/cp/ChangeLog
+++ b/gcc/cp/ChangeLog
@@ -1,3 +1,14 @@
+2015-10-22  Jason Merrill  <jason@redhat.com>
+
+	* call.c (add_template_conv_candidate): Pass DEDUCE_CALL.
+	(add_template_candidate_real): Handle it.
+	* pt.c (fn_type_unification): Handle it.
+
+	* call.c (add_conv_candidate): Remove first_arg parm.
+	(add_template_conv_candidate): Likewise.
+	(add_template_candidate_real): Don't pass it.
+	(build_op_call_1): Likewise.
+
 2015-10-22  Richard Biener  <rguenther@suse.de>
 
 	* semantics.c (cp_finish_omp_clause_depend_sink): Properly convert
diff --git a/gcc/cp/call.c b/gcc/cp/call.c
index 55b3c8c..5b57dc9 100644
--- a/gcc/cp/call.c
+++ b/gcc/cp/call.c
@@ -178,9 +178,6 @@ static struct z_candidate *add_template_candidate
 static struct z_candidate *add_template_candidate_real
 	(struct z_candidate **, tree, tree, tree, tree, const vec<tree, va_gc> *,
 	 tree, tree, tree, int, tree, unification_kind_t, tsubst_flags_t);
-static struct z_candidate *add_template_conv_candidate
-	(struct z_candidate **, tree, tree, tree, const vec<tree, va_gc> *,
-	 tree, tree, tree, tsubst_flags_t);
 static void add_builtin_candidates
 	(struct z_candidate **, enum tree_code, enum tree_code,
 	 tree, tree *, int, tsubst_flags_t);
@@ -192,7 +189,7 @@ static void build_builtin_candidate
 	(struct z_candidate **, tree, tree, tree, tree *, tree *,
 	 int, tsubst_flags_t);
 static struct z_candidate *add_conv_candidate
-	(struct z_candidate **, tree, tree, tree, const vec<tree, va_gc> *, tree,
+	(struct z_candidate **, tree, tree, const vec<tree, va_gc> *, tree,
 	 tree, tsubst_flags_t);
 static struct z_candidate *add_function_candidate
 	(struct z_candidate **, tree, tree, tree, const vec<tree, va_gc> *, tree,
@@ -2176,7 +2173,7 @@ add_function_candidate (struct z_candidate **candidates,
 
 static struct z_candidate *
 add_conv_candidate (struct z_candidate **candidates, tree fn, tree obj,
-		    tree first_arg, const vec<tree, va_gc> *arglist,
+		    const vec<tree, va_gc> *arglist,
 		    tree access_path, tree conversion_path,
 		    tsubst_flags_t complain)
 {
@@ -2190,7 +2187,7 @@ add_conv_candidate (struct z_candidate **candidates, tree fn, tree obj,
     parmlist = TREE_TYPE (parmlist);
   parmlist = TYPE_ARG_TYPES (parmlist);
 
-  len = vec_safe_length (arglist) + (first_arg != NULL_TREE ? 1 : 0) + 1;
+  len = vec_safe_length (arglist) + 1;
   convs = alloc_conversions (len);
   parmnode = parmlist;
   viable = 1;
@@ -2208,10 +2205,8 @@ add_conv_candidate (struct z_candidate **candidates, tree fn, tree obj,
 
       if (i == 0)
 	arg = obj;
-      else if (i == 1 && first_arg != NULL_TREE)
-	arg = first_arg;
       else
-	arg = (*arglist)[i - (first_arg != NULL_TREE ? 1 : 0) - 1];
+	arg = (*arglist)[i - 1];
       argtype = lvalue_type (arg);
 
       if (i == 0)
@@ -2260,7 +2255,7 @@ add_conv_candidate (struct z_candidate **candidates, tree fn, tree obj,
       reason = arity_rejection (NULL_TREE, i + remaining, len);
     }
 
-  return add_candidate (candidates, totype, first_arg, arglist, len, convs,
+  return add_candidate (candidates, totype, obj, arglist, len, convs,
 			access_path, conversion_path, viable, reason, flags);
 }
 
@@ -3032,6 +3027,9 @@ add_template_candidate_real (struct z_candidate **candidates, tree tmpl,
     {
       if (first_arg_without_in_chrg != NULL_TREE)
 	first_arg_without_in_chrg = NULL_TREE;
+      else if (return_type && strict == DEDUCE_CALL)
+	/* We're deducing for a call to the result of a template conversion
+	   function, so the args don't contain 'this'; leave them alone.  */;
       else
 	++skip_without_in_chrg;
     }
@@ -3122,7 +3120,7 @@ add_template_candidate_real (struct z_candidate **candidates, tree tmpl,
 
   if (obj != NULL_TREE)
     /* Aha, this is a conversion function.  */
-    cand = add_conv_candidate (candidates, fn, obj, first_arg, arglist,
+    cand = add_conv_candidate (candidates, fn, obj, arglist,
 			       access_path, conversion_path, complain);
   else
     cand = add_function_candidate (candidates, fn, ctype,
@@ -3172,18 +3170,23 @@ add_template_candidate (struct z_candidate **candidates, tree tmpl, tree ctype,
 				 flags, NULL_TREE, strict, complain);
 }
 
+/* Create an overload candidate for the conversion function template TMPL,
+   returning RETURN_TYPE, which will be invoked for expression OBJ to produce a
+   pointer-to-function which will in turn be called with the argument list
+   ARGLIST, and add it to CANDIDATES.  This does not change ARGLIST.  FLAGS is
+   passed on to implicit_conversion.  */
 
 static struct z_candidate *
 add_template_conv_candidate (struct z_candidate **candidates, tree tmpl,
-			     tree obj, tree first_arg,
+			     tree obj,
 			     const vec<tree, va_gc> *arglist,
 			     tree return_type, tree access_path,
 			     tree conversion_path, tsubst_flags_t complain)
 {
   return
     add_template_candidate_real (candidates, tmpl, NULL_TREE, NULL_TREE,
-				 first_arg, arglist, return_type, access_path,
-				 conversion_path, 0, obj, DEDUCE_CONV,
+				 NULL_TREE, arglist, return_type, access_path,
+				 conversion_path, 0, obj, DEDUCE_CALL,
 				 complain);
 }
 
@@ -4335,11 +4338,11 @@ build_op_call_1 (tree obj, vec<tree, va_gc> **args, tsubst_flags_t complain)
 
 	    if (TREE_CODE (fn) == TEMPLATE_DECL)
 	      add_template_conv_candidate
-		(&candidates, fn, obj, NULL_TREE, *args, totype,
+		(&candidates, fn, obj, *args, totype,
 		 /*access_path=*/NULL_TREE,
 		 /*conversion_path=*/NULL_TREE, complain);
 	    else
-	      add_conv_candidate (&candidates, fn, obj, NULL_TREE,
+	      add_conv_candidate (&candidates, fn, obj,
 				  *args, /*conversion_path=*/NULL_TREE,
 				  /*access_path=*/NULL_TREE, complain);
 	  }
diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index 7555bf3..eaccf4b 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -29064,7 +29064,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29122,7 +29124,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29212,6 +29216,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29238,7 +29244,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29246,6 +29254,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29582,6 +29592,158 @@ cp_parser_oacc_data_clause_deviceptr (cp_parser *parser, tree list)
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser *ARG_UNUSED (parser),
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, omp_clause_code kind,
+			     const char *str, tree list)
+{
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+  tree op0 = NULL_TREE, op1 = NULL_TREE, c;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      tree *op_to_parse = &op0;
+      cp_token *next;
+
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  /* Consume a comma if present.  */
+	  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    {
+	      if (op0 == NULL && op1 == NULL)
+		{
+		  cp_parser_error (parser, "unexpected argument");
+		  goto cleanup_error;
+		}
+
+	      cp_lexer_consume_token (lexer);
+	    }
+
+	  next = cp_lexer_peek_token (lexer);
+	  loc = next->location;
+	  op_to_parse = &op0;
+
+	  /* First extract any length:, num: and static: arguments.  */
+
+	  /* Gang static argument.  */
+	  if (cp_lexer_next_token_is (lexer, CPP_KEYWORD)
+	      && cp_lexer_next_token_is_keyword (lexer, RID_STATIC))
+	    {
+	      if (kind != OMP_CLAUSE_GANG)
+		{
+		  cp_parser_error (parser, "invalid %<static%> argument");
+		  goto cleanup_error;
+		}
+
+	      cp_lexer_consume_token (lexer);
+
+	      if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		goto cleanup_error;
+
+	      op_to_parse = &op1;
+	      if (*op_to_parse != NULL)
+		{
+		  cp_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		{
+		  cp_lexer_consume_token (lexer);
+		  *op_to_parse = integer_minus_one_node;
+		  continue;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (cp_lexer_next_token_is (lexer, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->u.value)) == 0
+		   && cp_lexer_nth_token_is (lexer, 2, CPP_COLON))
+	    {
+	      cp_lexer_consume_token (lexer);  /* id  */
+	      cp_lexer_consume_token (lexer);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+
+	  if (*op_to_parse != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  *op_to_parse = expr;
+	}
+      while (cp_lexer_next_token_is (lexer, CPP_COMMA)
+	     && !cp_lexer_next_token_is (lexer, CPP_CLOSE_PAREN));
+
+      if (!cp_parser_require (parser, CPP_CLOSE_PAREN, RT_CLOSE_PAREN))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  c = build_omp_clause (loc, kind);
+  if (op0)
+    OMP_CLAUSE_OPERAND (c, 0) = op0;
+  if (op1)
+    OMP_CLAUSE_OPERAND (c, 1) = op1;
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  cp_parser_skip_to_closing_parenthesis (parser, false, false, true);
+  return list;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31306,6 +31468,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31338,6 +31505,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31382,6 +31554,16 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31390,6 +31572,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						 c_name, clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34339,6 +34526,11 @@ cp_parser_oacc_kernels (cp_parser *parser, cp_token *pragma_tok)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION))
 
 static tree
diff --git a/gcc/cp/pt.c b/gcc/cp/pt.c
index 142245a..ffe02da 100644
--- a/gcc/cp/pt.c
+++ b/gcc/cp/pt.c
@@ -17235,7 +17235,9 @@ pack_deducible_p (tree parm, tree fn)
 
    DEDUCE_CALL:
      We are deducing arguments for a function call, as in
-     [temp.deduct.call].
+     [temp.deduct.call].  If RETURN_TYPE is non-null, we are
+     deducing arguments for a call to the result of a conversion
+     function template, as in [over.call.object].
 
    DEDUCE_CONV:
      We are deducing arguments for a conversion function, as in
@@ -17402,7 +17404,15 @@ fn_type_unification (tree fn,
   /* Never do unification on the 'this' parameter.  */
   parms = skip_artificial_parms_for (fn, TYPE_ARG_TYPES (fntype));
 
-  if (return_type)
+  if (return_type && strict == DEDUCE_CALL)
+    {
+      /* We're deducing for a call to the result of a template conversion
+         function.  The parms we really want are in return_type.  */
+      if (POINTER_TYPE_P (return_type))
+	return_type = TREE_TYPE (return_type);
+      parms = TYPE_ARG_TYPES (return_type);
+    }
+  else if (return_type)
     {
       tree *new_args;
 
diff --git a/gcc/cp/semantics.c b/gcc/cp/semantics.c
index 11315d9..153a970 100644
--- a/gcc/cp/semantics.c
+++ b/gcc/cp/semantics.c
@@ -5911,6 +5911,31 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
 	  goto handle_field_decl;
 
+	case OMP_CLAUSE_GANG:
+	case OMP_CLAUSE_VECTOR:
+	case OMP_CLAUSE_WORKER:
+	  /* Operand 0 is the num: or length: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 0);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 0) = t;
+
+	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
+	    break;
+
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 1) = t;
+	  break;
+
 	case OMP_CLAUSE_LASTPRIVATE:
 	  t = omp_clause_decl_field (OMP_CLAUSE_DECL (c));
 	  if (t)
@@ -5965,14 +5990,37 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	  OMP_CLAUSE_FINAL_EXPR (c) = t;
 	  break;
 
+	case OMP_CLAUSE_NUM_TASKS:
+	case OMP_CLAUSE_NUM_TEAMS:
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  t = OMP_CLAUSE_OPERAND (c, 0);
 	  if (t == error_mark_node)
 	    remove = true;
 	  else if (!type_dependent_expression_p (t)
 		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
 	    {
-	      error ("num_threads expression must be integral");
+	     switch (OMP_CLAUSE_CODE (c))
+		{
+		case OMP_CLAUSE_NUM_TASKS:
+		  error ("%<num_tasks%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_TEAMS:
+		  error ("%<num_teams%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_THREADS:
+		  error ("%<num_threads%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_GANGS:
+		  error ("%<num_gangs%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_WORKERS:
+		  error ("%<num_workers%> expression must be integral");
+		  break;
+		case OMP_CLAUSE_VECTOR_LENGTH:
+		  error ("%<vector_length%> expression must be integral");
+		  break;
+		default:
+		  error ("invalid argument");
+		}
 	      remove = true;
 	    }
 	  else
@@ -5984,13 +6032,40 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 		  if (TREE_CODE (t) == INTEGER_CST
 		      && tree_int_cst_sgn (t) != 1)
 		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
+		      switch (OMP_CLAUSE_CODE (c))
+			{
+			case OMP_CLAUSE_NUM_TASKS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_tasks%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_TEAMS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_teams%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_THREADS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_threads%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_NUM_GANGS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_gangs%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_WORKERS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_workers%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_VECTOR_LENGTH:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<vector_length%> value must be "
+				      "positive"); break;
+			default:
+			  error ("invalid argument");
+			}
 		      t = integer_one_node;
 		    }
 		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
 		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
+	      OMP_CLAUSE_OPERAND (c, 0) = t;
 	    }
 	  break;
 
@@ -6062,35 +6137,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_NUM_TEAMS:
-	  t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_teams%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_teams%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TEAMS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_ASYNC:
 	  t = OMP_CLAUSE_ASYNC_EXPR (c);
 	  if (t == error_mark_node)
@@ -6110,16 +6156,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6547,35 +6583,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  goto check_dup_generic;
 
-	case OMP_CLAUSE_NUM_TASKS:
-	  t = OMP_CLAUSE_NUM_TASKS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_tasks%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_tasks%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TASKS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_GRAINSIZE:
 	  t = OMP_CLAUSE_GRAINSIZE_EXPR (c);
 	  if (t == error_mark_node)
@@ -6694,6 +6701,8 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:


* Re: [OpenACC 11/11] execution tests
  2015-10-23 20:32               ` Cesar Philippidis
@ 2015-10-24  2:56                 ` Cesar Philippidis
  0 siblings, 0 replies; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-24  2:56 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3395 bytes --]

On 10/23/2015 01:29 PM, Cesar Philippidis wrote:
> On 10/22/2015 08:00 AM, Jakub Jelinek wrote:
>> On Thu, Oct 22, 2015 at 07:47:01AM -0700, Cesar Philippidis wrote:
>>>> But it is unclear from the parsing what from these is allowed:
>>>
>>> int v, w;
>>> ...
>>> gang(26)  // equivalent to gang(num:26)
>>> gang(v)   // gang(num:v)
>>> vector(length: 16)  // vector(length: 16)
>>> vector(length: v)  // vector(length: v)
>>> vector(16)  // vector(length: 16)
>>> vector(v)   // vector(length: v)
>>> worker(num: 16)  // worker(num: 16)
>>> worker(num: v)   // worker(num: v)
>>> worker(16)  // worker(num: 16)
>>> worker(v)   // worker(num: v)
>>> gang(16, 24)  // technically gang(num:16, num:24) is acceptable but it
>>>               // should be an error
>>> gang(v, w)  // likewise
>>> gang(static: 16, num: 5)  // gang(static: 16, num: 5)
>>> gang(static: v, num: w)   // gang(static: v, num: w)
>>> gang(num: 5, static: 4)   // gang(num: 5, static: 4)
>>> gang(num: v, static: w)   // gang(num: v, static: w)
>>>
>>> Also note that the static argument can accept '*'.
>>>
>>>> and if the length: or num: part is really optional, then
>>>> int length, num;
>>>> vector(length)
>>>> worker(num)
>>>> gang(num, static: 6)
>>>> gang(static: 5, num)
>>>> should be also accepted (or subset thereof?).
>>>
>>> Interesting question. The spec is unclear. It defines gang, worker and
>>> vector as follows in section 2.7 in the OpenACC 2.0a spec:
>>>
>>>   gang [( gang-arg-list )]
>>>   worker [( [num:] int-expr )]
>>>   vector [( [length:] int-expr )]
>>>
>>> where gang-arg is one of:
>>>
>>>   [num:] int-expr
>>>   static: size-expr
>>>
>>> and gang-arg-list may have at most one num and one static argument,
>>> and where size-expr is one of:
>>>
>>>   *
>>>   int-expr
>>>
>>> So I've interpreted that as a requirement that length and num must be
>>> followed by an int-expr, whatever that is.
>>
>> My reading of the above is that
>> vector(length)
>> is equivalent to
>> vector(length: length)
>> and
>> worker(num)
>> is equivalent to
>> worker(num: num)
>> etc.  Basically, neither length nor num aren't reserved identifiers,
>> so you can use them for variable names, and if
>> vector(v) is equivalent to vector(length: v), then
>> vector(length) should be equivalent to vector(length:length)
>> or
>> vector(length + 1) should be equivalent to vector(length: length+1)
>> static is a keyword that can't start an integral expression, so I guess
>> it is fine if you issue an expected : diagnostics after it.
>>
>> In any case, please add a testcase (both C and C++) which covers all these
>> allowed variants (ideally one testcase) and rejected variants (another
>> testcase with dg-error).
>>
>> This is still an easy case, as even the C FE has 2 tokens lookup.
>> E.g. for OpenMP map clause where
>> map (always, tofrom: x)
>> means one thing and
>> map (always, tofrom, y)
>> another one (map (tofrom: always, tofrom, y))
>> I had to do quite ugly things to get around this.
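
To make that concrete, a hedged sketch (hypothetical variables, not taken
from any test in this series):

  int always, tofrom, x, y;

  #pragma omp target map(always, tofrom: x)  /* 'always' modifier, map-type 'tofrom' */
  x = 1;

  #pragma omp target map(always, tofrom, y)  /* map(tofrom: always, tofrom, y) */
  y = 1;

The two forms only diverge at the token after 'tofrom', a ':' versus a
',' or ')', which is further ahead than the C front end's two-token
lookup can see.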
> 
> Here are the updated test cases. Besides for adding a new test to
> exercise the loop shape parsing, I also removed that assembly file
> included in the original patch that Ilya noticed.
> 
> Is this OK for trunk?

This patch is mostly the same as the one I posted earlier, with the
exclusion of the loop-shape parser test. That test was included with the
c parser changes.

Is this OK for trunk?

Cesar


[-- Attachment #2: 11-testsuite-cjp-2.diff --]
[-- Type: text/x-patch, Size: 9233 bytes --]

2015-10-23  Nathan Sidwell  <nathan@codesourcery.com>

	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/loop-v-1.c: New.

diff --git a/libgomp/testsuite/libgomp.c++/member-2.C b/libgomp/testsuite/libgomp.c++/member-2.C
index bb348d8..bbe2bdf4 100644
--- a/libgomp/testsuite/libgomp.c++/member-2.C
+++ b/libgomp/testsuite/libgomp.c++/member-2.C
@@ -154,7 +154,7 @@ A<Q>::m1 ()
     {
       f = false;
     #pragma omp single
-    #pragma omp taskloop lastprivate (a, T<Q>::t, b, n)
+    #pragma omp taskloop lastprivate (a, T<Q>::t, b, n) private (R::r)
       for (int i = 0; i < 30; i++)
 	{
 	  int q = omp_get_thread_num ();
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
new file mode 100644
index 0000000..58545d0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix / ((N + 31) / 32);
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
new file mode 100644
index 0000000..c01c6fa
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang (static:1)
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = ix % 32;
+	  int w = 0;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
new file mode 100644
index 0000000..f23e2f3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-1.c
@@ -0,0 +1,59 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int chunk_size = (N + 32*32*32 - 1) / (32*32*32);
+	  
+	  int g = ix / (chunk_size * 32 * 32);
+	  int w = ix / 32 % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
new file mode 100644
index 0000000..70c6292
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-v-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = 0;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
new file mode 100644
index 0000000..5473c2d
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = ix % 32;
+	  int v = 0;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
new file mode 100644
index 0000000..85e4476
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-wv-1.c
@@ -0,0 +1,57 @@
+/* { dg-do run } */
+/* { dg-additional-options "-O2" } */
+
+#include <stdio.h>
+
+#define N (32*32*32+17)
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+  
+#pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  {
+#pragma acc loop worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (__builtin_acc_on_device (5))
+	  {
+	    int g = 0, w = 0, v = 0;
+
+	    __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+	    __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+	    __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+  }
+
+  for (ix = 0; ix < N; ix++)
+    {
+      int expected = ix;
+      if(ondev)
+	{
+	  int g = 0;
+	  int w = (ix / 32) % 32;
+	  int v = ix % 32;
+
+	  expected = (g << 16) | (w << 8) | v;
+	}
+      
+      if (ary[ix] != expected)
+	{
+	  exit = 1;
+	  printf ("ary[%d]=%x expected %x\n", ix, ary[ix], expected);
+	}
+    }
+  
+  return exit;
+}


* Re: [OpenACC 4/11] C FE changes
  2015-10-24  2:37           ` Cesar Philippidis
@ 2015-10-24 13:08             ` Jakub Jelinek
  2015-10-24 21:11               ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-24 13:08 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Fri, Oct 23, 2015 at 07:31:51PM -0700, Cesar Philippidis wrote:

> +static tree
> +c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
> +			    const char *str, tree list)
> +{
> +  const char *id = "num";
> +  tree op0 = NULL_TREE, op1 = NULL_TREE, c;
> +  location_t loc = c_parser_peek_token (parser)->location;
> +
> +  if (kind == OMP_CLAUSE_VECTOR)
> +    id = "length";
> +
> +  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
> +    {
> +      tree *op_to_parse = &op0;
> +      c_token *next;
> +
> +      c_parser_consume_token (parser);
> +
> +      do
> +	{
> +	  op_to_parse = &op0;
> +
> +	  /* Consume a comma if present.  */
> +	  if (c_parser_next_token_is (parser, CPP_COMMA))
> +	    {
> +	      if (op0 == NULL && op1 == NULL)
> +		{
> +		  c_parser_error (parser, "unexpected argument");
> +		  goto cleanup_error;
> +		}
> +
> +	      c_parser_consume_token (parser);
> +	    }

This means you parse
gang (, static: *)
vector (, 5)
etc., even when you error on it afterwards with unexpected argument,
it is still different diagnostics from other invalid tokens immediately
after the opening (.
Also, loc and next are wrong if there is a valid comma.
So I'm really wondering why
gang (static: *, num: 5)
works, because next is the CPP_COMMA token, so while
c_parser_next_token_is (parser, CPP_NAME) matches the actual name,
what exactly next->value contains is unclear.

I think it would be better to:

  tree ops[2] = { NULL_TREE, NULL_TREE };

      do
	{
// Note, declare these here
	  c_token *next = c_parser_peek_token (parser);
	  location_t loc = next->location;
// Just use ops[idx] instead of *op_to_parse etc., though if you strongly
// prefer *op_to_parse, I won't object.
	  int idx = 0;
// Note it seems generally the C parser doesn't check for CPP_KEYWORD
// before calling c_parser_next_token_is_keyword.  And I'd just do it
// for OMP_CLAUSE_GANG, which has it in the grammar.
	  if (kind == OMP_CLAUSE_GANG
	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
	    {
// ...
	      // Your current code, except that for 
	      if (c_parser_next_token_is (parser, CPP_MULT))
		{
		  c_parser_consume_token (parser);
		  if (c_parser_next_token_is (parser, CPP_COMMA))
		    {
		      c_parser_consume_token (parser);
		      continue;
		    }
		  break;
		}
	    }
	  else if (... num: / length: )
	    {
// ...
	    }
// ...
	  mark_exp_read (expr);
	  ops[idx] = expr;

	  if (kind == OMP_CLAUSE_GANG
	      && c_parser_next_token_is (parser, CPP_COMMA))
	    {
	      c_parser_consume_token (parser);
	      continue;
	    }
	  break;
	}
      while (1);

      if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
	goto cleanup_error;

That way you don't parse something that is not in the grammar.

	Jakub


* Re: [OpenACC 4/11] C FE changes
  2015-10-24 13:08             ` Jakub Jelinek
@ 2015-10-24 21:11               ` Cesar Philippidis
  2015-10-26  9:47                 ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-24 21:11 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3506 bytes --]

On 10/24/2015 01:03 AM, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 07:31:51PM -0700, Cesar Philippidis wrote:
> 
>> +static tree
>> +c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
>> +			    const char *str, tree list)
>> +{
>> +  const char *id = "num";
>> +  tree op0 = NULL_TREE, op1 = NULL_TREE, c;
>> +  location_t loc = c_parser_peek_token (parser)->location;
>> +
>> +  if (kind == OMP_CLAUSE_VECTOR)
>> +    id = "length";
>> +
>> +  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
>> +    {
>> +      tree *op_to_parse = &op0;
>> +      c_token *next;
>> +
>> +      c_parser_consume_token (parser);
>> +
>> +      do
>> +	{
>> +	  op_to_parse = &op0;
>> +
>> +	  /* Consume a comma if present.  */
>> +	  if (c_parser_next_token_is (parser, CPP_COMMA))
>> +	    {
>> +	      if (op0 == NULL && op1 == NULL)
>> +		{
>> +		  c_parser_error (parser, "unexpected argument");
>> +		  goto cleanup_error;
>> +		}
>> +
>> +	      c_parser_consume_token (parser);
>> +	    }
> 
> This means you parse
> gang (, static: *)
> vector (, 5)
> etc., even when you error on it afterwards with unexpected argument,
> it is still different diagnostics from other invalid tokens immediately
> after the opening (.

So you didn't like how the error messages are inconsistent? It was
catching those errors.

I've added those new test cases. Unfortunately, c and c++ report
different error messages, so I had to make the dg-error directives
generic for the lines containing those types of errors.

> Also, loc and next are wrong if there is a valid comma.

Yeah, I don't think it needs to be adjusted in the loop. c_parser_error
already knows where to report the error anyway.

> So I'm really wondering why
> gang (static: *, num: 5)
> works, because next is the CPP_COMMA token, so while
> c_parser_next_token_is (parser, CPP_NAME) matches the actual name,
> what exactly next->value contains is unclear.
> 
> I think it would be better to:
> 
>   tree ops[2] = { NULL_TREE, NULL_TREE };
> 
>       do
> 	{
> // Note, declare these here
> 	  c_token *next = c_parser_peek_token (parser);
> 	  location_t loc = next->location;
> // Just use ops[idx] instead of *op_to_parse etc., though if you strongly
> // prefer *op_to_parse, I won't object.
> 	  int idx = 0;
> // Note it seems generally the C parser doesn't check for CPP_KEYWORD
> // before calling c_parser_next_token_is_keyword.  And I'd just do it
> // for OMP_CLAUSE_GANG, which has it in the grammar.
> 	  if (kind == OMP_CLAUSE_GANG
> 	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
> 	    {
> // ...
> 	      // Your current code, except that for 
> 	      if (c_parser_next_token_is (parser, CPP_MULT))
> 		{
> 		  c_parser_consume_token (parser);
> 		  if (c_parser_next_token_is (parser, CPP_COMMA))
> 		    {
> 		      c_parser_consume_token (parser);
> 		      continue;
> 		    }
> 		  break;
> 		}
> 	    }
> 	  else if (... num: / length: )
> 	    {
> // ...
> 	    }
> // ...
> 	  mark_exp_read (expr);
> 	  ops[idx] = expr;
> 
> 	  if (kind == OMP_CLAUSE_GANG
> 	      && c_parser_next_token_is (parser, CPP_COMMA))
> 	    {
> 	      c_parser_consume_token (parser);
> 	      continue;
> 	    }
> 	  break;
> 	}
>       while (1);
> 
>       if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
> 	goto cleanup_error;
> 
> That way you don't parse something that is not in the grammar.

I did that. It turned out to be a little more compact than what I had
before. Is this OK for trunk?
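
For reference, a sketch of what the reworked parser below produces for a
combined argument list (read off the code, not separately verified):

  #pragma acc loop gang(num: 5, static: *)
  /* OMP_CLAUSE_GANG with OMP_CLAUSE_OPERAND (c, 0) set to 5 and
     OMP_CLAUSE_OPERAND (c, 1) set to integer_minus_one_node, the
     internal marker for '*'.  */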

Cesar


[-- Attachment #2: 04-cfe-cjp-3.diff --]
[-- Type: text/x-patch, Size: 12386 bytes --]

2015-10-24  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	gcc/c/
	* c-parser.c (c_parser_oacc_shape_clause): New.
	(c_parser_oacc_simple_clause): New.
	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.

2015-10-24  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* c-c++-common/goacc/loop-shape.c: New test.

diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index c8c6a2d..2ad3825 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -11188,6 +11188,144 @@ c_parser_omp_clause_num_workers (c_parser *parser, tree list)
 }
 
 /* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
+			    const char *str, tree list)
+{
+  const char *id = "num";
+  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
+  location_t loc = c_parser_peek_token (parser)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_consume_token (parser);
+
+      do
+	{
+	  c_token *next = c_parser_peek_token (parser);
+	  int idx = 0;
+
+	  /* Gang static argument.  */
+	  if (kind == OMP_CLAUSE_GANG
+	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
+	    {
+	      c_parser_consume_token (parser);
+
+	      if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
+		goto cleanup_error;
+
+	      idx = 1;
+	      if (ops[idx] != NULL_TREE )
+		{
+		  c_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (c_parser_next_token_is (parser, CPP_MULT))
+		{
+		  c_parser_consume_token (parser);
+		  ops[idx] = integer_minus_one_node;
+
+		  if (c_parser_next_token_is (parser, CPP_COMMA))
+		    {
+		      c_parser_consume_token (parser);
+		      continue;
+		    }
+		  else
+		    break;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (c_parser_next_token_is (parser, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->value)) == 0
+		   && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+	    {
+	      c_parser_consume_token (parser);  /* id  */
+	      c_parser_consume_token (parser);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+	  if (ops[idx] != NULL_TREE)
+	    {
+	      c_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  tree expr = c_parser_expr_no_commas (parser, NULL).value;
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  ops[idx] = expr;
+
+	  if (kind == OMP_CLAUSE_GANG
+	      && c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      continue;
+	    }
+	  break;
+	}
+      while (1);
+
+      if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str);
+
+  c = build_omp_clause (loc, kind);
+  OMP_CLAUSE_OPERAND (c, 0) = ops[0];
+  if (ops[1])
+    OMP_CLAUSE_OPERAND (c, 1) = ops[1];
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+  return list;
+}
+
+/* OpenACC:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+c_parser_oacc_simple_clause (c_parser *parser ATTRIBUTE_UNUSED,
+			     enum omp_clause_code code, tree list)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
+
+  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+}
+
+/* OpenACC:
    async [( int-expr )] */
 
 static tree
@@ -12393,6 +12531,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						clauses);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = c_parser_omp_clause_collapse (parser, clauses);
 	  c_name = "collapse";
@@ -12429,6 +12572,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_omp_clause_firstprivate (parser, clauses);
 	  c_name = "firstprivate";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -12477,6 +12625,16 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						clauses);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						c_name,	clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = c_parser_omp_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -12485,6 +12643,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						c_name, clauses);
+	  break;
 	default:
 	  c_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -13015,6 +13178,11 @@ c_parser_oacc_enter_exit_data (c_parser *parser, bool enter)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-shape.c b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
new file mode 100644
index 0000000..aa75ac9
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
@@ -0,0 +1,247 @@
+/* Exercise *_parser_oacc_shape_clause by checking various combinations
+   of gang, worker and vector clause arguments.  */
+
+/* { dg-do compile } */
+
+int main ()
+{
+  int i;
+  int v, w;
+  int length, num;
+
+  /* Valid uses.  */
+
+  #pragma acc kernels
+  #pragma acc loop gang worker vector
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(26)
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 16, num: 5)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: v, num: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static: 6)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 5, num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*, 1)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5, static: 4)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: v, static: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static:num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:length)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:num)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  /* Invalid uses.  */
+  
+  #pragma acc kernels
+  #pragma acc loop gang(16, 24) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(v, w) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1 num:2, num:3, 4) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, num:5) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(length:num) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(5, length:length) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(num:length) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(length:5) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(1, num:2) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: * abc) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*num:1) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5 static: *) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(,static: *) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(,length:5) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(,num:10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  return 0;
+}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 5/11] C++ FE changes
  2015-10-24  2:39       ` Cesar Philippidis
@ 2015-10-24 21:15         ` Cesar Philippidis
  2015-10-26 10:30           ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-24 21:15 UTC (permalink / raw)
  To: Jakub Jelinek, Nathan Sidwell
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 5904 bytes --]

On 10/23/2015 07:37 PM, Cesar Philippidis wrote:
> On 10/23/2015 01:25 PM, Cesar Philippidis wrote:
>> On 10/22/2015 01:52 AM, Jakub Jelinek wrote:
>>> On Wed, Oct 21, 2015 at 03:18:55PM -0400, Nathan Sidwell wrote:
>>>> This patch is the C++ changes matching the C ones of patch 4.  In
>>>> finish_omp_clauses, the gang, worker, & vector clauses are handled the same
>>>> as OpenMP's 'num_threads' clause.  One change to num_threads is the
>>>> augmentation of a diagnostic to add %<...%>  markers to the clause name.
>>>
>>> Indeed, lots of older OpenMP diagnostics is missing %<...%> markers around
>>> keywords.  Something to fix eventually.
>>
>> I updated omp tasks and teams in semantics.c.
>>
>>>> 2015-10-20  Cesar Philippidis  <cesar@codesourcery.com>
>>>> 	    Thomas Schwinge  <thomas@codesourcery.com>
>>>> 	    James Norris  <jnorris@codesourcery.com>
>>>> 	    Joseph Myers  <joseph@codesourcery.com>
>>>> 	    Julian Brown  <julian@codesourcery.com>
>>>> 	    Nathan Sidwell <nathan@codesourcery.com>
>>>>
>>>> 	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
>>>> 	vector, worker.
>>>> 	(cp_parser_oacc_simple_clause): New.
>>>> 	(cp_parser_oacc_shape_clause): New.
>>>
>>> What I've said for the C FE patch, plus:
>>>
>>>> +	  if (cp_lexer_next_token_is (lexer, CPP_NAME)
>>>> +	      || cp_lexer_next_token_is (lexer, CPP_KEYWORD))
>>>> +	    {
>>>> +	      tree name_kind = cp_lexer_peek_token (lexer)->u.value;
>>>> +	      const char *p = IDENTIFIER_POINTER (name_kind);
>>>> +	      if (kind == OMP_CLAUSE_GANG && strcmp ("static", p) == 0)
>>>
>>> As static is a keyword, wouldn't it be better to just handle that case
>>> using cp_lexer_next_token_is_keyword (lexer, RID_STATIC)?
>>>
>>> Also, what is the exact grammar of the shape arguments?
>>> Would be nice to describe the grammar, in the grammar you just say
>>> expression, at least for vector/worker, which is clearly not accurate.
>>>
>>> It seems the intent is that num: or length: or static: is optional, right?
>>> But if that is the case, you should treat those as parsed only if followed
>>> by :.  While static is a keyword, so you can't have a variable called like
>>> that, having vector(length) or vector(num) should not be rejected.
>>> So, I would have expected that it should test if it is RID_STATIC
>>> followed by CPP_COLON (and only in that case consume those tokens),
>>> or CPP_NAME of id followed by CPP_COLON (and only in that case consume those
>>> tokens), otherwise parse it as assignment expression.
>>
>> That function now peeks ahead to look for a colon, so now it can handle
>> variables with the name of clause keywords.
>>
>>> The C FE may have similar issue.  Plus of course there should be testsuite
>>> coverage for all the weird cases.
>>
>> I included a new test in a different patch because it's common to both c
>> and c++.
>>
>>>> +	case OMP_CLAUSE_GANG:
>>>> +	case OMP_CLAUSE_VECTOR:
>>>> +	case OMP_CLAUSE_WORKER:
>>>> +	  /* Operand 0 is the num: or length: argument.  */
>>>> +	  t = OMP_CLAUSE_OPERAND (c, 0);
>>>> +	  if (t == NULL_TREE)
>>>> +	    break;
>>>> +
>>>> +	  t = maybe_convert_cond (t);
>>>
>>> Can you explain the maybe_convert_cond calls (in both cases here,
>>> plus the preexisting in OMP_CLAUSE_VECTOR_LENGTH)?
>>> The reason why it is used for OpenMP if and final clauses is that those have
>>> a condition argument, either the condition is zero or non-zero (so
>>> effectively it is turned into a bool).
>>> But aren't the gang/vector/worker/vector_length arguments integers rather
>>> than conditions?  I'd expect that finish_omp_clauses should verify
>>> those operands are indeed integral expressions (if that is the requirement
>>> in the standard), as it is something that for C++ can't be verified during
>>> parsing, if arbitrary expressions are parsed there.
>>
>> It's probably a copy-and-paste error. This functionality was added
>> incrementally. I removed that check.
>>
>>>> @@ -5959,32 +5990,58 @@ finish_omp_clauses (tree clauses, bool a
>>>>  	  break;
>>>>  
>>>>  	case OMP_CLAUSE_NUM_THREADS:
>>>> -	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
>>>> -	  if (t == error_mark_node)
>>>> -	    remove = true;
>>>> -	  else if (!type_dependent_expression_p (t)
>>>> -		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
>>>> -	    {
>>>> -	      error ("num_threads expression must be integral");
>>>> -	      remove = true;
>>>> -	    }
>>>> -	  else
>>>> -	    {
>>>> -	      t = mark_rvalue_use (t);
>>>> -	      if (!processing_template_decl)
>>>> -		{
>>>> -		  t = maybe_constant_value (t);
>>>> -		  if (TREE_CODE (t) == INTEGER_CST
>>>> -		      && tree_int_cst_sgn (t) != 1)
>>>> -		    {
>>>> -		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
>>>> -				  "%<num_threads%> value must be positive");
>>>> -		      t = integer_one_node;
>>>> -		    }
>>>> -		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
>>>> -		}
>>>> -	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
>>>> -	    }
>>>> +	case OMP_CLAUSE_NUM_GANGS:
>>>> +	case OMP_CLAUSE_NUM_WORKERS:
>>>> +	case OMP_CLAUSE_VECTOR_LENGTH:
>>>
>>> If you are already merging some of the similar handling, please
>>> handle OMP_CLAUSE_NUM_TEAMS and OMP_CLAUSE_NUM_TASKS the same way.
>>
>> I did that, but I also had to adjust the expected errors in a couple of
>> existing gomp test cases.
>>
>> Is this patch OK for trunk?
> 
> This patch teaches the loop shape function how to be more careful
> with commas. Most of the errors in the c++ front end are similar to their
> counterparts in c, except for those where the c++ front end thinks a
> stray colon is an errant scope resolution operator. For those types of
> failures, I just used a generic regex for the dg-error. Consequently,
> this patch shares the same loop-shape.c test case that I included in the
> c patch.
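(For example, cases like "gang(length:num)" or "vector(,length:5)" in that
shared test just carry a bare dg-error "" so the c and c++ diagnostics are
free to differ.)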

I've adjusted this version of the patch to behave like the latest c version.

Is this OK for trunk?

Cesar


[-- Attachment #2: 05-cpfe-cjp-3.diff --]
[-- Type: text/x-patch, Size: 17011 bytes --]

2015-10-23  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker.  Unify the handling of teams, tasks and vector_length with
	the other loop shape clauses.

2015-10-23  Nathan Sidwell <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.
	* g++.dg/gomp/pr33372-3.C: Likewise.

diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index 7555bf3..92d1495 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -29064,7 +29064,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29122,7 +29124,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29212,6 +29216,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29238,7 +29244,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29246,6 +29254,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29582,6 +29592,144 @@ cp_parser_oacc_data_clause_deviceptr (cp_parser *parser, tree list)
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser *ARG_UNUSED (parser),
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, omp_clause_code kind,
+			     const char *str, tree list)
+{
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  cp_token *next = cp_lexer_peek_token (lexer);
+	  int idx = 0;
+
+	  /* Gang static argument.  */
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is_keyword (lexer, RID_STATIC))
+	    {
+	      cp_lexer_consume_token (lexer);
+
+	      if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		goto cleanup_error;
+
+	      idx = 1;
+	      if (ops[idx] != NULL)
+		{
+		  cp_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		{
+		  cp_lexer_consume_token (lexer);
+		  ops[idx] = integer_minus_one_node;
+
+		  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+		    {
+		      cp_lexer_consume_token (lexer);
+		      continue;
+		    }
+		  else break;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (cp_lexer_next_token_is (lexer, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->u.value)) == 0
+		   && cp_lexer_nth_token_is (lexer, 2, CPP_COLON))
+	    {
+	      cp_lexer_consume_token (lexer);  /* id  */
+	      cp_lexer_consume_token (lexer);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+	  if (ops[idx] != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  ops[idx] = expr;
+
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    {
+	      cp_lexer_consume_token (lexer);
+	      continue;
+	    }
+	  break;
+	}
+      while (1);
+
+      if (!cp_parser_require (parser, CPP_CLOSE_PAREN, RT_CLOSE_PAREN))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  c = build_omp_clause (loc, kind);
+  OMP_CLAUSE_OPERAND (c, 0) = ops[0];
+  if (ops[1])
+    OMP_CLAUSE_OPERAND (c, 1) = ops[1];
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  cp_parser_skip_to_closing_parenthesis (parser, false, false, true);
+  return list;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31306,6 +31454,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31338,6 +31491,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31382,6 +31540,16 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31390,6 +31558,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						 c_name, clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34339,6 +34512,11 @@ cp_parser_oacc_kernels (cp_parser *parser, cp_token *pragma_tok)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION))
 
 static tree
diff --git a/gcc/cp/semantics.c b/gcc/cp/semantics.c
index 11315d9..153a970 100644
--- a/gcc/cp/semantics.c
+++ b/gcc/cp/semantics.c
@@ -5911,6 +5911,31 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
 	  goto handle_field_decl;
 
+	case OMP_CLAUSE_GANG:
+	case OMP_CLAUSE_VECTOR:
+	case OMP_CLAUSE_WORKER:
+	  /* Operand 0 is the num: or length: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 0);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 0) = t;
+
+	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
+	    break;
+
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t == NULL_TREE)
+	    break;
+
+	  if (!processing_template_decl)
+	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+	  OMP_CLAUSE_OPERAND (c, 1) = t;
+	  break;
+
 	case OMP_CLAUSE_LASTPRIVATE:
 	  t = omp_clause_decl_field (OMP_CLAUSE_DECL (c));
 	  if (t)
@@ -5965,14 +5990,37 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	  OMP_CLAUSE_FINAL_EXPR (c) = t;
 	  break;
 
+	case OMP_CLAUSE_NUM_TASKS:
+	case OMP_CLAUSE_NUM_TEAMS:
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  t = OMP_CLAUSE_OPERAND (c, 0);
 	  if (t == error_mark_node)
 	    remove = true;
 	  else if (!type_dependent_expression_p (t)
 		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
 	    {
-	      error ("num_threads expression must be integral");
+	      switch (OMP_CLAUSE_CODE (c))
+		{
+		case OMP_CLAUSE_NUM_TASKS:
+		  error ("%<num_tasks%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_TEAMS:
+		  error ("%<num_teams%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_THREADS:
+		  error ("%<num_threads%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_GANGS:
+		  error ("%<num_gangs%> expression must be integral"); break;
+		case OMP_CLAUSE_NUM_WORKERS:
+		  error ("%<num_workers%> expression must be integral");
+		  break;
+		case OMP_CLAUSE_VECTOR_LENGTH:
+		  error ("%<vector_length%> expression must be integral");
+		  break;
+		default:
+		  error ("invalid argument");
+		}
 	      remove = true;
 	    }
 	  else
@@ -5984,13 +6032,40 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 		  if (TREE_CODE (t) == INTEGER_CST
 		      && tree_int_cst_sgn (t) != 1)
 		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
+		      switch (OMP_CLAUSE_CODE (c))
+			{
+			case OMP_CLAUSE_NUM_TASKS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_tasks%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_TEAMS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_teams%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_THREADS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_threads%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_NUM_GANGS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_gangs%> value must be positive");
+			  break;
+			case OMP_CLAUSE_NUM_WORKERS:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<num_workers%> value must be "
+				      "positive"); break;
+			case OMP_CLAUSE_VECTOR_LENGTH:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<vector_length%> value must be "
+				      "positive"); break;
+			default:
+			  error ("invalid argument");
+			}
 		      t = integer_one_node;
 		    }
 		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
 		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
+	      OMP_CLAUSE_OPERAND (c, 0) = t;
 	    }
 	  break;
 
@@ -6062,35 +6137,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_NUM_TEAMS:
-	  t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_teams%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_teams%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TEAMS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_ASYNC:
 	  t = OMP_CLAUSE_ASYNC_EXPR (c);
 	  if (t == error_mark_node)
@@ -6110,16 +6156,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6547,35 +6583,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  goto check_dup_generic;
 
-	case OMP_CLAUSE_NUM_TASKS:
-	  t = OMP_CLAUSE_NUM_TASKS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_tasks%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_tasks%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TASKS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_GRAINSIZE:
 	  t = OMP_CLAUSE_GRAINSIZE_EXPR (c);
 	  if (t == error_mark_node)
@@ -6694,6 +6701,8 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-1.C b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
index 62900bf..e9da259 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-1.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   extern T n ();
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-3.C b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
index 8220f3c..f0a1910 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-3.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   T n = 6;
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-23 21:25       ` Nathan Sidwell
@ 2015-10-25 14:18         ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-25 14:18 UTC (permalink / raw)
  To: Cesar Philippidis, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/23/15 17:25, Nathan Sidwell wrote:
> On 10/23/15 16:17, Cesar Philippidis wrote:
>
>> Nathan, can you try out this patch with your updated patch set? I saw
>> some test cases getting stuck when expanding expand_GOACC_DIM_SIZE in on
>> the host compiler, which is wrong. I don't see that happening in
>> gomp-4_0-branch with this patch. Also, can you merge this patch along
>> with the c++ and new test case patches to trunk? I'll handle the gomp4
>> backport.
>
> Wilco.

testing  your patch on trunk along with my IFN_UNIQUE changes shows good results.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-23 13:12                 ` Jakub Jelinek
  2015-10-23 13:38                   ` Nathan Sidwell
@ 2015-10-25 14:29                   ` Nathan Sidwell
  2015-10-26 22:35                     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-25 14:29 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 602 bytes --]

Richard, Jakub,
here is an updated patch.  Changes from previous version

1) Moved the subcodes to an enumeration in internal-fn.h

2) Remove ECF_LEAF

3) Added check in initialize_ctrl_altering

4) tracer code now (continues) to only look in last stmt of block

I looked at fnsplit and do not believe I need changes there.  That's changing 
things like:
   if (cheap test)
     do cheap thing
   else
     do complex thing

to break out the else part into a separate function.   That's fine -- it'll copy 
the whole CFG of interest.
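
In other words, after splitting you end up with something like (a rough
sketch, the .part name is only illustrative):

   if (cheap test)
     do cheap thing
   else
     foo.part.0 ();  /* the outlined 'complex thing', its CFG copied whole */

so a group of IFN_UNIQUE markers inside the complex part stays together rather
than being split across the boundary.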

I'll  be posting an updated 7/11 patch shortly.

comments?

nathan

[-- Attachment #2: 01-trunk-unique-1025.patch --]
[-- Type: text/x-patch, Size: 6560 bytes --]

2015-10-25  Nathan Sidwell  <nathan@codesourcery.com>
	
	* internal-fn.c (expand_UNIQUE): New.
	* internal-fn.h (enum ifn_unique_kind): New.
	* internal-fn.def (IFN_UNIQUE): New.
	* gimple.h (gimple_call_internal_unique_p): New.
	* gimple.c (gimple_call_same_target_p): Check internal fn
	uniqueness.
	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
	* tree-ssa-threadedge.c
	(record_temporary_equivalences_from_stmts): Likewise.
	* tree-cfg.c (gimple_call_initialize_ctrl_altering): Likewise.

Index: gcc/tree-ssa-threadedge.c
===================================================================
--- gcc/tree-ssa-threadedge.c	(revision 229276)
+++ gcc/tree-ssa-threadedge.c	(working copy)
@@ -283,6 +283,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229276)
+++ gcc/internal-fn.def	(working copy)
@@ -65,3 +65,10 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
+
+/* An unduplicable, uncombinable function.  Generally used to preserve
+   a CFG property in the face of jump threading, tail merging or
+   other such optimizations.  The first argument distinguishes
+   between uses. See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 229276)
+++ gcc/gimple.c	(working copy)
@@ -1346,7 +1346,8 @@ gimple_call_same_target_p (const gimple
 {
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2)
+	    && !gimple_call_internal_unique_p (as_a <const gcall *> (c1)));
   else
     return (gimple_call_fn (c1) == gimple_call_fn (c2)
 	    || (gimple_call_fndecl (c1)
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 229276)
+++ gcc/gimple.h	(working copy)
@@ -2895,6 +2895,21 @@ gimple_call_internal_fn (const gimple *g
   return gimple_call_internal_fn (gc);
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+static inline bool
+gimple_call_internal_unique_p (const gcall *gs)
+{
+  return gimple_call_internal_fn (gs) == IFN_UNIQUE;
+}
+
+static inline bool
+gimple_call_internal_unique_p (const gimple *gs)
+{
+  const gcall *gc = GIMPLE_CHECK2<const gcall *> (gs);
+  return gimple_call_internal_unique_p (gc);
+}
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229276)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,6 +1958,30 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
+/* Expand the IFN_UNIQUE function according to its first argument.  */
+
+static void
+expand_UNIQUE (gcall *stmt)
+{
+  rtx pattern = NULL_RTX;
+  int code = TREE_INT_CST_LOW (gimple_call_arg (stmt, 0));
+
+  switch (code)
+    {
+    default:
+      gcc_unreachable ();
+
+    case IFN_UNIQUE_UNSPEC:
+#ifdef HAVE_unique
+      pattern = gen_unique ();
+#endif
+      break;
+    }
+
+  if (pattern)
+    emit_insn (pattern);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229276)
+++ gcc/internal-fn.h	(working copy)
@@ -20,6 +20,11 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_INTERNAL_FN_H
 #define GCC_INTERNAL_FN_H
 
+/* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
+enum ifn_unique_kind {
+  IFN_UNIQUE_UNSPEC   /* Undifferentiated UNIQUE.  */
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/tracer.c
===================================================================
--- gcc/tracer.c	(revision 229276)
+++ gcc/tracer.c	(working copy)
@@ -93,18 +93,24 @@ bb_seen_p (basic_block bb)
 static bool
 ignore_bb_p (const_basic_block bb)
 {
-  gimple *g;
-
   if (bb->index < NUM_FIXED_BLOCKS)
     return true;
   if (optimize_bb_for_size_p (bb))
     return true;
 
-  /* A transaction is a single entry multiple exit region.  It must be
-     duplicated in its entirety or not at all.  */
-  g = last_stmt (CONST_CAST_BB (bb));
-  if (g && gimple_code (g) == GIMPLE_TRANSACTION)
-    return true;
+  if (gimple *g = last_stmt (CONST_CAST_BB (bb)))
+    {
+      /* A transaction is a single entry multiple exit region.  It
+	 must be duplicated in its entirety or not at all.  */
+      if (gimple_code (g) == GIMPLE_TRANSACTION)
+	return true;
+
+      /* An IFN_UNIQUE call must be duplicated as part of its group,
+	 or not at all.  */
+      if (is_gimple_call (g) && gimple_call_internal_p (g)
+	  && gimple_call_internal_unique_p (g))
+	return true;
+    }
 
   return false;
 }
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 229276)
+++ gcc/tree-cfg.c	(working copy)
@@ -487,7 +487,11 @@ gimple_call_initialize_ctrl_altering (gi
       || ((flags & ECF_TM_BUILTIN)
 	  && is_tm_ending_fndecl (gimple_call_fndecl (stmt)))
       /* BUILT_IN_RETURN call is same as return statement.  */
-      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN))
+      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN)
+      /* IFN_UNIQUE should be the last insn, to make checking for it
+	 as cheap as possible.  */
+      || (gimple_call_internal_p (stmt)
+	  && gimple_call_internal_unique_p (stmt)))
     gimple_call_set_ctrl_altering (stmt, true);
   else
     gimple_call_set_ctrl_altering (stmt, false);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-22  9:32   ` Jakub Jelinek
  2015-10-22 12:51     ` Nathan Sidwell
@ 2015-10-25 15:03     ` Nathan Sidwell
  2015-10-26 23:39       ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-25 15:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 671 bytes --]

Jakub, Richard,
here's an updated version of patch 7, the early half of OpenACC lowering.  I've 
addressed all of Jakub's earlier comments.

The significant change is that now the head/tail unique markers are threaded
on a data dependency variable.  I'd not noticed its lack being a problem, but
this is certainly more robust in showing the ordering dependency between calls.
The dependency var is the 2nd parameter, and all others are simply shifted
along by one.

At RTL generation time the data dependency is exposed to the RTL expander, which
in the PTX case simply does a src->dst move, which will eventually be deleted as
unnecessary.
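
So for a single partitioned loop the marker sequence now looks roughly like
this (a GIMPLE sketch, names illustrative, trailing arguments elided):

  .dep = UNIQUE (OACC_HEAD_MARK, .dep, ...);
  .dep = UNIQUE (OACC_FORK, .dep, axis);
  ... partitioned loop body ...
  .dep = UNIQUE (OACC_JOIN, .dep, axis);
  .dep = UNIQUE (OACC_TAIL_MARK, .dep, ...);

with each call reading and redefining the dependency variable, so the ordering
between the markers stays explicit.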

comments?

nathan


[-- Attachment #2: 07-trunk-loop-mark-1025.patch --]
[-- Type: text/x-patch, Size: 49006 bytes --]

2015-10-25  Nathan Sidwell  <nathan@codesourcery.com>
	
	* internal-fn.def (IFN_GOACC_LOOP): New.
	* internal-fn.h (enum ifn_unique_kind): Add IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN, IFN_UNIQUE_OACC_HEAD_MARK,
	IFN_UNIQUE_OACC_TAIL_MARK.
	(enum ifn_goacc_loop_kind): New.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN.
	(expand_GOACC_DIM_SIZE, expand_GOACC_DIM_POS, expand_GOACC_LOOP): New.
	* omp-low.c (struct omp_context): Remove gwv_below, gwv_this
	fields.
	(enum oacc_loop_flags): New.
	(enclosing_target_ctx): May return NULL.
	(ctx_in_oacc_kernels_region): New.
	(is_oacc_parallel, is_oacc_kernels): New.
	(check_oacc_kernel_gwv): New.
	(oacc_loop_or_target_p): Delete.
	(scan_omp_for): Don't calculate gwv mask.  Check parallel clause
	operands.  Strip reductions from kernels.
	(scan_omp_target): Don't calculate gwv mask.
	(lower_oacc_head_mark, lower_oacc_loop_marker,
	lower_oacc_head_tail): New.
	(expand_omp_for_static_nochunk, expand_omp_for_static_chunk):
	Remove OpenACC handling.
	(struct oacc_collapse): New.
	(expand_oacc_collapse_init, expand_oacc_collapse_vars): New.
	(expand_oacc_for): New.
	(expand_omp_for): Call expand_oacc_for.
	(lower_omp_for): Call lower_oacc_head_tail.

Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229276)
+++ gcc/internal-fn.def	(working copy)
@@ -65,9 +65,12 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
    other such optimizations.  The first argument distinguishes
    between uses. See internal-fn.h for usage.  */
 DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
+
+/* OpenACC looping abstraction.  See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_PURE | ECF_NOTHROW, NULL)
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229276)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,30 +1958,69 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
 expand_UNIQUE (gcall *stmt)
 {
   rtx pattern = NULL_RTX;
   int code = TREE_INT_CST_LOW (gimple_call_arg (stmt, 0));
 
   switch (code)
     {
     default:
       gcc_unreachable ();
 
     case IFN_UNIQUE_UNSPEC:
 #ifdef HAVE_unique
       pattern = gen_unique ();
 #endif
       break;
+
+    case IFN_UNIQUE_OACC_FORK:
+    case IFN_UNIQUE_OACC_JOIN:
+      {
+	tree lhs = gimple_call_lhs (stmt);
+	rtx target = const0_rtx;
+
+	if (lhs)
+	  target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+	rtx data_dep = expand_normal (gimple_call_arg (stmt, 1));
+	rtx axis = expand_normal (gimple_call_arg (stmt, 2));
+
+	if (code == IFN_UNIQUE_OACC_FORK)
+	  {
+#ifdef HAVE_oacc_fork
+	    pattern = gen_oacc_fork (target, data_dep, axis);
+#else
+	    gcc_unreachable ();
+#endif
+	  }
+	else
+	  {
+#ifdef HAVE_oacc_join
+	    pattern = gen_oacc_join (target, data_dep, axis);
+#else
+	    gcc_unreachable ();
+#endif
+	  }
+      }
+      break;
     }
 
   if (pattern)
     emit_insn (pattern);
 }
 
+/* This is expanded by oacc_device_lower pass.  */
+
+static void
+expand_GOACC_LOOP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229276)
+++ gcc/internal-fn.h	(working copy)
@@ -20,10 +20,52 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_INTERNAL_FN_H
 #define GCC_INTERNAL_FN_H
 
 /* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
 enum ifn_unique_kind {
   IFN_UNIQUE_UNSPEC,  /* Undifferentiated UNIQUE.  */
+
+  /* FORK and JOIN mark the points at which OpenACC partitioned
+     execution is entered or exited.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, indicating the axis.  */
+  IFN_UNIQUE_OACC_FORK,
+  IFN_UNIQUE_OACC_JOIN,
+
+  /* HEAD_MARK and TAIL_MARK are used to demark the sequence entering
+     or leaving partitioned execution.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, remaining markers in this sequence
+     arg-3...: varargs on primary header  */
+  IFN_UNIQUE_OACC_HEAD_MARK,
+  IFN_UNIQUE_OACC_TAIL_MARK
 };
+
+/* INTEGER_CST values for IFN_GOACC_LOOP arg-0.  Allows the precise
+   stepping of the compute geometry over the loop iterations to be
+   deferred until it is known which compiler is generating the code.
+   The action is encoded in a constant first argument.
+
+     CHUNK_MAX = LOOP (CODE_CHUNKS, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     STEP = LOOP (CODE_STEP, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     OFFSET = LOOP (CODE_OFFSET, DIR, RANGE, STEP, CHUNK_SIZE, MASK, CHUNK_NO)
+     BOUND = LOOP (CODE_BOUND, DIR, RANGE, STEP, CHUNK_SIZE, MASK, OFFSET)
+
+     DIR - +1 for up loop, -1 for down loop
+     RANGE - Range of loop (END - BASE)
+     STEP - iteration step size
+     CHUNK_SIZE - size of chunking (constant zero for no chunking)
+     CHUNK_NO - chunk number
+     MASK - partitioning mask.  */
+
+enum ifn_goacc_loop_kind {
+  IFN_GOACC_LOOP_CHUNKS,  /* Number of chunks.  */
+  IFN_GOACC_LOOP_STEP,    /* Size of each thread's step.  */
+  IFN_GOACC_LOOP_OFFSET,  /* Initial iteration value.  */
+  IFN_GOACC_LOOP_BOUND    /* Limit of iteration value.  */
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229276)
+++ gcc/omp-low.c	(working copy)
@@ -199,14 +200,6 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
-
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     levels below this one.  */
-  int gwv_below;
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     this level and above.  For parallel and kernels clauses, a mask
-     indicating which of num_gangs/num_workers/num_vectors was used.  */
-  int gwv_this;
 };
 
 /* A structure holding the elements of:
@@ -233,6 +226,23 @@ struct omp_for_data
   struct omp_for_data_loop *loops;
 };
 
+/*  Flags for an OpenACC loop.  */
+
+enum oacc_loop_flags {
+  OLF_SEQ	= 1u << 0,  /* Explicitly sequential  */
+  OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
+  OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
+  OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
+
+  /* Explicitly specified loop axes.  */
+  OLF_DIM_BASE = 4,
+  OLF_DIM_GANG   = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG),
+  OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER),
+  OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR),
+
+  OLF_MAX = OLF_DIM_BASE + GOMP_DIM_MAX
+};
+
 
 static splay_tree all_contexts;
 static int taskreg_nesting_level;
@@ -255,6 +291,28 @@ static gphi *find_phi_with_arg_on_edge (
       *handled_ops_p = false; \
       break;
 
+/* Return true if CTX corresponds to an oacc parallel region.  */
+
+static bool
+is_oacc_parallel (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_PARALLEL));
+}
+
+/* Return true if CTX corresponds to an oacc kernels region.  */
+
+static bool
+is_oacc_kernels (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
+}
+
 /* Helper function to get the name of the array containing the partial
    reductions for OpenACC reductions.  */
 static const char *
@@ -2889,28 +2947,95 @@ finish_taskreg_scan (omp_context *ctx)
     }
 }
 
+/* Find the enclosing offload context.  */
 
 static omp_context *
 enclosing_target_ctx (omp_context *ctx)
 {
-  while (ctx != NULL
-	 && gimple_code (ctx->stmt) != GIMPLE_OMP_TARGET)
-    ctx = ctx->outer;
-  gcc_assert (ctx != NULL);
+  for (; ctx; ctx = ctx->outer)
+    if (gimple_code (ctx->stmt) == GIMPLE_OMP_TARGET)
+      break;
+
   return ctx;
 }
 
+/* Return true if ctx is part of an oacc kernels region.  */
+
 static bool
-oacc_loop_or_target_p (gimple *stmt)
+ctx_in_oacc_kernels_region (omp_context *ctx)
+{
+  for (;ctx != NULL; ctx = ctx->outer)
+    {
+      gimple *stmt = ctx->stmt;
+      if (gimple_code (stmt) == GIMPLE_OMP_TARGET
+	  && gimple_omp_target_kind (stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+	return true;
+    }
+
+  return false;
+}
+
+/* Check the parallelism clauses inside a kernels region.
+   Until kernels handling moves to use the same loop indirection
+   scheme as parallel, we need to do this checking early.  */
+
+static unsigned
+check_oacc_kernel_gwv (gomp_for *stmt, omp_context *ctx)
 {
-  enum gimple_code outer_type = gimple_code (stmt);
-  return ((outer_type == GIMPLE_OMP_TARGET
-	   && ((gimple_omp_target_kind (stmt)
-		== GF_OMP_TARGET_KIND_OACC_PARALLEL)
-	       || (gimple_omp_target_kind (stmt)
-		   == GF_OMP_TARGET_KIND_OACC_KERNELS)))
-	  || (outer_type == GIMPLE_OMP_FOR
-	      && gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_OACC_LOOP));
+  bool checking = true;
+  unsigned outer_mask = 0;
+  unsigned this_mask = 0;
+  bool has_seq = false, has_auto = false;
+
+  if (ctx->outer)
+    outer_mask = check_oacc_kernel_gwv (NULL,  ctx->outer);
+  if (!stmt)
+    {
+      checking = false;
+      if (gimple_code (ctx->stmt) != GIMPLE_OMP_FOR)
+	return outer_mask;
+      stmt = as_a <gomp_for *> (ctx->stmt);
+    }
+
+  for (tree c = gimple_omp_for_clauses (stmt); c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_GANG);
+	  break;
+	case OMP_CLAUSE_WORKER:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_WORKER);
+	  break;
+	case OMP_CLAUSE_VECTOR:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_VECTOR);
+	  break;
+	case OMP_CLAUSE_SEQ:
+	  has_seq = true;
+	  break;
+	case OMP_CLAUSE_AUTO:
+	  has_auto = true;
+	  break;
+	default:
+	  break;
+	}
+    }
+
+  if (checking)
+    {
+      if (has_seq && (this_mask || has_auto))
+	error_at (gimple_location (stmt), "%<seq%> overrides other"
+		  " OpenACC loop specifiers");
+      else if (has_auto && this_mask)
+	error_at (gimple_location (stmt), "%<auto%> conflicts with other"
+		  " OpenACC loop specifiers");
+
+      if (this_mask & outer_mask)
+	error_at (gimple_location (stmt), "inner loop uses same"
+		  " OpenACC parallelism as containing loop");
+    }
+
+  return outer_mask | this_mask;
 }
 
 /* Scan a GIMPLE_OMP_FOR.  */
@@ -2918,52 +3043,62 @@ oacc_loop_or_target_p (gimple *stmt)
 static void
 scan_omp_for (gomp_for *stmt, omp_context *outer_ctx)
 {
-  enum gimple_code outer_type = GIMPLE_ERROR_MARK;
   omp_context *ctx;
   size_t i;
   tree clauses = gimple_omp_for_clauses (stmt);
 
-  if (outer_ctx)
-    outer_type = gimple_code (outer_ctx->stmt);
-
   ctx = new_omp_context (stmt, outer_ctx);
 
   if (is_gimple_omp_oacc (stmt))
     {
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	ctx->gwv_this = outer_ctx->gwv_this;
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  int val;
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_GANG)
-	    val = MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WORKER)
-	    val = MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR)
-	    val = MASK_VECTOR;
-	  else
-	    continue;
-	  ctx->gwv_this |= val;
-	  if (!outer_ctx)
-	    {
-	      /* Skip; not nested inside a region.  */
-	      continue;
-	    }
-	  if (!oacc_loop_or_target_p (outer_ctx->stmt))
+      omp_context *tgt = enclosing_target_ctx (outer_ctx);
+
+      if (!tgt || is_oacc_parallel (tgt))
+	for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	  {
+	    char const *check = NULL;
+
+	    switch (OMP_CLAUSE_CODE (c))
+	      {
+	      case OMP_CLAUSE_GANG:
+		check = "gang";
+		break;
+
+	      case OMP_CLAUSE_WORKER:
+		check = "worker";
+		break;
+
+	      case OMP_CLAUSE_VECTOR:
+		check = "vector";
+		break;
+
+	      default:
+		break;
+	      }
+
+	    if (check && OMP_CLAUSE_OPERAND (c, 0))
+	      error_at (gimple_location (stmt),
+			"argument not permitted on %qs clause in"
+			" OpenACC %<parallel%>", check);
+	  }
+
+      if (tgt && is_oacc_kernels (tgt))
+	{
+	  /* Strip out reductions, as they are not  handled yet.  */
+	  tree *prev_ptr = &clauses;
+
+	  while (tree probe = *prev_ptr)
 	    {
-	      /* Skip; not nested inside an OpenACC region.  */
-	      continue;
-	    }
-	  if (outer_type == GIMPLE_OMP_FOR)
-	    outer_ctx->gwv_below |= val;
-	  if (OMP_CLAUSE_OPERAND (c, 0) != NULL_TREE)
-	    {
-	      omp_context *enclosing = enclosing_target_ctx (outer_ctx);
-	      if (gimple_omp_target_kind (enclosing->stmt)
-		  == GF_OMP_TARGET_KIND_OACC_PARALLEL)
-		error_at (gimple_location (stmt),
-			  "no arguments allowed to gang, worker and vector clauses inside parallel");
+	      tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
+	      
+	      if (OMP_CLAUSE_CODE (probe) == OMP_CLAUSE_REDUCTION)
+		*prev_ptr = *next_ptr;
+	      else
+		prev_ptr = next_ptr;
 	    }
+
+	  gimple_omp_for_set_clauses (stmt, clauses);
+	  check_oacc_kernel_gwv (stmt, ctx);
 	}
     }
 
@@ -2978,19 +3113,6 @@ scan_omp_for (gomp_for *stmt, omp_contex
       scan_omp_op (gimple_omp_for_incr_ptr (stmt, i), ctx);
     }
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
-
-  if (is_gimple_omp_oacc (stmt))
-    {
-      if (ctx->gwv_this & ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector may occur only once in a loop nest");
-      else if (ctx->gwv_below != 0
-	       && ctx->gwv_this > ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector must occur in this order in a loop nest");
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	outer_ctx->gwv_below |= ctx->gwv_below;
-    }
 }
 
 /* Scan an OpenMP sections directive.  */
@@ -3061,19 +3183,6 @@ scan_omp_target (gomp_target *stmt, omp_
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
     }
 
-  if (is_gimple_omp_oacc (stmt))
-    {
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_GANGS)
-	    ctx->gwv_this |= MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_WORKERS)
-	    ctx->gwv_this |= MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR_LENGTH)
-	    ctx->gwv_this |= MASK_VECTOR;
-	}
-    }
-
   scan_sharing_clauses (clauses, ctx);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
@@ -5806,6 +5915,176 @@ lower_send_shared_vars (gimple_seq *ilis
     }
 }
 
+/* Emit an OpenACC head marker call, encapsulating the partitioning and
+   other information that must be processed by the target compiler.
+   Return the maximum number of dimensions the associated loop might
+   be partitioned over.  */
+
+static unsigned
+lower_oacc_head_mark (location_t loc, tree ddvar, tree clauses,
+		      gimple_seq *seq, omp_context *ctx)
+{
+  unsigned levels = 0;
+  unsigned tag = 0;
+  tree gang_static = NULL_TREE;
+  auto_vec<tree, 5> args;
+
+  args.quick_push (build_int_cst
+		   (integer_type_node, IFN_UNIQUE_OACC_HEAD_MARK));
+  args.quick_push (ddvar);
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  tag |= OLF_DIM_GANG;
+	  gang_static = OMP_CLAUSE_GANG_STATIC_EXPR (c);
+	  /* static:* is represented by -1, and we can ignore it, as
+	     scheduling is always static.  */
+	  if (gang_static && integer_minus_onep (gang_static))
+	    gang_static = NULL_TREE;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_WORKER:
+	  tag |= OLF_DIM_WORKER;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_VECTOR:
+	  tag |= OLF_DIM_VECTOR;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_SEQ:
+	  tag |= OLF_SEQ;
+	  break;
+
+	case OMP_CLAUSE_AUTO:
+	  tag |= OLF_AUTO;
+	  break;
+
+	case OMP_CLAUSE_INDEPENDENT:
+	  tag |= OLF_INDEPENDENT;
+	  break;
+
+	default:
+	  continue;
+	}
+    }
+
+  if (gang_static)
+    {
+      if (DECL_P (gang_static))
+	gang_static = build_outer_var_ref (gang_static, ctx);
+      tag |= OLF_GANG_STATIC;
+    }
+
+  /* In a parallel region, loops are implicitly INDEPENDENT.  */
+  omp_context *tgt = enclosing_target_ctx (ctx);
+  if (!tgt || is_oacc_parallel (tgt))
+    tag |= OLF_INDEPENDENT;
+
+  /* A loop lacking SEQ, GANG, WORKER and/or VECTOR is implicitly AUTO.  */
+  if (!(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE)
+	       | OLF_SEQ)))
+      tag |= OLF_AUTO;
+
+  /* Ensure at least one level.  */
+  if (!levels)
+    levels++;
+
+  args.quick_push (build_int_cst (integer_type_node, levels));
+  args.quick_push (build_int_cst (integer_type_node, tag));
+  if (gang_static)
+    args.quick_push (gang_static);
+
+  gcall *call = gimple_build_call_internal_vec (IFN_UNIQUE, args);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+
+  return levels;
+}
+
+/* Emit an OpenACC loop head or tail marker to SEQ.  LEVEL is the
+   partitioning level of the enclosed region.  */
+
+static void
+lower_oacc_loop_marker (location_t loc, tree ddvar, bool head,
+			tree tofollow, gimple_seq *seq)
+{
+  int marker_kind = (head ? IFN_UNIQUE_OACC_HEAD_MARK
+		     : IFN_UNIQUE_OACC_TAIL_MARK);
+  tree marker = build_int_cst (integer_type_node, marker_kind);
+  int nargs = 2 + (tofollow != NULL_TREE);
+  gcall *call = gimple_build_call_internal (IFN_UNIQUE, nargs,
+					    marker, ddvar, tofollow);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+}
+
+/* Generate the before and after OpenACC loop sequences.  CLAUSES are
+   the loop clauses, from which we extract reductions.  Initialize
+   HEAD and TAIL.  */
+
+static void
+lower_oacc_head_tail (location_t loc, tree clauses,
+		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
+{
+  bool inner = false;
+  tree ddvar = create_tmp_var (integer_type_node, ".data_dep");
+  gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
+
+  unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
+  if (!count)
+    lower_oacc_loop_marker (loc, ddvar, false, integer_zero_node, tail);
+  
+  tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
+  tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
+
+  for (unsigned done = 1; count; count--, done++)
+    {
+      gimple_seq fork_seq = NULL;
+      gimple_seq join_seq = NULL;
+
+      tree place = build_int_cst (integer_type_node, -1);
+      gcall *fork = gimple_build_call_internal (IFN_UNIQUE, 3,
+						fork_kind, ddvar, place);
+      gimple_set_location (fork, loc);
+      gimple_set_lhs (fork, ddvar);
+
+      gcall *join = gimple_build_call_internal (IFN_UNIQUE, 3,
+						join_kind, ddvar, place);
+      gimple_set_location (join, loc);
+      gimple_set_lhs (join, ddvar);
+
+      /* Mark the beginning of this level sequence.  */
+      if (inner)
+	lower_oacc_loop_marker (loc, ddvar, true,
+				build_int_cst (integer_type_node, count),
+				&fork_seq);
+      lower_oacc_loop_marker (loc, ddvar, false,
+			      build_int_cst (integer_type_node, done),
+			      &join_seq);
+
+      gimple_seq_add_stmt (&fork_seq, fork);
+      gimple_seq_add_stmt (&join_seq, join);
+
+      /* Append this level to head. */
+      gimple_seq_add_seq (head, fork_seq);
+      /* Prepend it to tail.  */
+      gimple_seq_add_seq (&join_seq, *tail);
+      *tail = join_seq;
+
+      inner = true;
+    }
+
+  /* Mark the end of the sequence.  */
+  lower_oacc_loop_marker (loc, ddvar, true, NULL_TREE, head);
+  lower_oacc_loop_marker (loc, ddvar, false, NULL_TREE, tail);
+}
 
 /* A convenience function to build an empty GIMPLE_COND with just the
    condition.  */
@@ -8364,10 +8643,6 @@ expand_omp_for_static_nochunk (struct om
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8460,10 +8735,6 @@ expand_omp_for_static_nochunk (struct om
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -8690,10 +8961,7 @@ expand_omp_for_static_nochunk (struct om
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -8831,10 +9099,6 @@ expand_omp_for_static_chunk (struct omp_
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8931,10 +9195,6 @@ expand_omp_for_static_chunk (struct omp_
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -9194,10 +9454,7 @@ expand_omp_for_static_chunk (struct omp_
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -10247,95 +10504,647 @@ expand_omp_taskloop_for_inner (struct om
     }
 }
 
-/* Expand the OMP loop defined by REGION.  */
+/* Information about members of an OpenACC collapsed loop nest.  */
 
-static void
-expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+struct oacc_collapse
 {
-  struct omp_for_data fd;
-  struct omp_for_data_loop *loops;
+  tree base;  /* Base value.  */
+  tree iters; /* Number of steps.  */
+  tree step;  /* Step size.  */
+};
 
-  loops
-    = (struct omp_for_data_loop *)
-      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
-	      * sizeof (struct omp_for_data_loop));
-  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
-			&fd, loops);
-  region->sched_kind = fd.sched_kind;
+/* Helper for expand_oacc_for.  Determine collapsed loop information.
+   Fill in COUNTS array.  Emit any initialization code before GSI.
+   Return the calculated outer loop bound of BOUND_TYPE.  */
 
-  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
-  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  if (region->cont)
-    {
-      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
-      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-    }
-  else
-    /* If there isn't a continue then this is a degerate case where
-       the introduction of abnormal edges during lowering will prevent
-       original loops from being detected.  Fix that up.  */
-    loops_state_set (LOOPS_NEED_FIXUP);
+static tree
+expand_oacc_collapse_init (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   oacc_collapse *counts, tree bound_type)
+{
+  tree total = build_int_cst (bound_type, 1);
+  int ix;
+  
+  gcc_assert (integer_onep (fd->loop.step));
+  gcc_assert (integer_zerop (fd->loop.n1));
 
-  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
-    expand_omp_simd (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
-    expand_cilk_for (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
-    {
-      if (gimple_omp_for_combined_into_p (fd.for_stmt))
-	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
-      else
-	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
-    }
-  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
-	   && !fd.have_ordered)
-    {
-      if (fd.chunk_size == NULL)
-	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
-      else
-	expand_omp_for_static_chunk (region, &fd, inner_stmt);
-    }
-  else
+  for (ix = 0; ix != fd->collapse; ix++)
     {
-      int fn_index, start_ix, next_ix;
+      const omp_for_data_loop *loop = &fd->loops[ix];
 
-      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
-		  == GF_OMP_FOR_KIND_FOR);
-      if (fd.chunk_size == NULL
-	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
-	fd.chunk_size = integer_zero_node;
-      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
-      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
-		  ? 3 : fd.sched_kind;
-      if (!fd.ordered)
-	fn_index += fd.have_ordered * 4;
-      if (fd.ordered)
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
-      else
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
-      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
-      if (fd.iter_type == long_long_unsigned_type_node)
-	{
-	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
-			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
-	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
-		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
-	}
-      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
-			      (enum built_in_function) next_ix, inner_stmt);
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = iter_type;
+      tree plus_type = iter_type;
+
+      gcc_assert (loop->cond_code == fd->loop.cond_code);
+      
+      if (POINTER_TYPE_P (iter_type))
+	plus_type = sizetype;
+      if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+	diff_type = signed_type_for (diff_type);
+
+      tree b = loop->n1;
+      tree e = loop->n2;
+      tree s = loop->step;
+      bool up = loop->cond_code == LT_EXPR;
+      tree dir = build_int_cst (diff_type, up ? +1 : -1);
+      bool negating;
+      tree expr;
+
+      b = force_gimple_operand_gsi (gsi, b, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+      e = force_gimple_operand_gsi (gsi, e, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Convert the step, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+      s = fold_convert (diff_type, s);
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, diff_type, s);
+      s = force_gimple_operand_gsi (gsi, s, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Determine the range, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (iter_type);
+      expr = fold_build2 (MINUS_EXPR, plus_type,
+			  fold_convert (plus_type, negating ? b : e),
+			  fold_convert (plus_type, negating ? e : b));
+      expr = fold_convert (diff_type, expr);
+      if (negating)
+	expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+      tree range = force_gimple_operand_gsi
+	(gsi, expr, true, NULL_TREE, true, GSI_SAME_STMT);
+
+      /* Determine number of iterations.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+
+      tree iters = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					     true, GSI_SAME_STMT);
+
+      counts[ix].base = b;
+      counts[ix].iters = iters;
+      counts[ix].step = s;
+
+      total = fold_build2 (MULT_EXPR, bound_type, total,
+			   fold_convert (bound_type, iters));
     }
 
-  if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_only_virtuals);
+  return total;
 }
 
+/* Emit initializers for collapsed loop members.  IVAR is the outer
+   loop iteration variable, from which collapsed loop iteration values
+   are  calculated.  COUNTS array has been initialized by
+   expand_oacc_collapse_inits.  */
 
-/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
+static void
+expand_oacc_collapse_vars (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   const oacc_collapse *counts, tree ivar)
+{
+  tree ivar_type = TREE_TYPE (ivar);
 
-	v = GOMP_sections_start (n);
-    L0:
+  /*  The most rapidly changing iteration variable is the innermost
+      one.  */
+  for (int ix = fd->collapse; ix--;)
+    {
+      const omp_for_data_loop *loop = &fd->loops[ix];
+      const oacc_collapse *collapse = &counts[ix];
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = TREE_TYPE (collapse->step);
+      tree plus_type = iter_type;
+      enum tree_code plus_code = PLUS_EXPR;
+      tree expr;
+
+      if (POINTER_TYPE_P (iter_type))
+	{
+	  plus_code = POINTER_PLUS_EXPR;
+	  plus_type = sizetype;
+	}
+
+      expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
+			  fold_convert (ivar_type, collapse->iters));
+      expr = fold_build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
+			  collapse->step);
+      expr = fold_build2 (plus_code, iter_type, collapse->base,
+			  fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      gassign *ass = gimple_build_assign (loop->v, expr);
+      gsi_insert_before (gsi, ass, GSI_SAME_STMT);
+
+      if (ix)
+	{
+	  expr = fold_build2 (TRUNC_DIV_EXPR, ivar_type, ivar,
+			      fold_convert (ivar_type, collapse->iters));
+	  ivar = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					   true, GSI_SAME_STMT);
+	}
+    }
+}
+
+/* A subroutine of expand_omp_for.  Generate code for an OpenACC
+   partitioned loop.  The lowering here is abstracted, in that the
+   loop parameters are passed through internal functions, which are
+   further lowered by oacc_device_lower, once we get to the target
+   compiler.  The loop is of the form:
+
+   for (V = B; V LTGT E; V += S) {BODY}
+
+   where LTGT is < or >.  We may have a specified chunking size, CHUNKING
+   (constant 0 for no chunking) and we will have a GWV partitioning
+   mask, specifying dimensions over which the loop is to be
+   partitioned (see note below).  We generate code that looks like:
+
+   <entry_bb> [incoming FALL->body, BRANCH->exit]
+     typedef signedintify (typeof (V)) T;  // underlying signed integral type
+     T range = E - B;
+     T chunk_no = 0;
+     T DIR = LTGT == '<' ? +1 : -1;
+     T chunk_max = GOACC_LOOP_CHUNK (dir, range, S, CHUNK_SIZE, GWV);
+     T step = GOACC_LOOP_STEP (dir, range, S, CHUNK_SIZE, GWV);
+
+   <head_bb> [created by splitting end of entry_bb]
+     T offset = GOACC_LOOP_OFFSET (dir, range, S, CHUNK_SIZE, GWV, chunk_no);
+     T bound = GOACC_LOOP_BOUND (dir, range, S, CHUNK_SIZE, GWV, offset);
+     if (!(offset LTGT bound)) goto bottom_bb;
+
+   <body_bb> [incoming]
+     V = B + offset;
+     {BODY}
+
+   <cont_bb> [incoming, may == body_bb FALL->exit_bb, BRANCH->body_bb]
+     offset += step;
+     if (offset LTGT bound) goto body_bb; [*]
+
+   <bottom_bb> [created by splitting start of exit_bb] insert BRANCH->head_bb
+     chunk_no++;
+     if (chunk < chunk_max) goto head_bb;
+
+   <exit_bb> [incoming]
+     V = B + ((range -/+ 1) / S +/- 1) * S [*]
+
+   [*] Needed if V live at end of loop
+
+   Note: CHUNKING & GWV mask are specified explicitly here.  This is a
+   transition, and will be specified by a more general mechanism shortly.
+ */
+
+static void
+expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
+{
+  tree v = fd->loop.v;
+  enum tree_code cond_code = fd->loop.cond_code;
+  enum tree_code plus_code = PLUS_EXPR;
+
+  tree chunk_size = integer_minus_one_node;
+  tree gwv = integer_zero_node;
+  tree iter_type = TREE_TYPE (v);
+  tree diff_type = iter_type;
+  tree plus_type = iter_type;
+  struct oacc_collapse *counts = NULL;
+
+  gcc_checking_assert (gimple_omp_for_kind (fd->for_stmt)
+		       == GF_OMP_FOR_KIND_OACC_LOOP);
+  gcc_assert (!gimple_omp_for_combined_into_p (fd->for_stmt));
+  gcc_assert (cond_code == LT_EXPR || cond_code == GT_EXPR);
+
+  if (POINTER_TYPE_P (iter_type))
+    {
+      plus_code = POINTER_PLUS_EXPR;
+      plus_type = sizetype;
+    }
+  if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+    diff_type = signed_type_for (diff_type);
+
+  basic_block entry_bb = region->entry; /* BB ending in OMP_FOR */
+  basic_block exit_bb = region->exit; /* BB ending in OMP_RETURN */
+  basic_block cont_bb = region->cont; /* BB ending in OMP_CONTINUE  */
+  basic_block bottom_bb = NULL;
+
+  /* entry_bb has two successors; the branch edge is to the exit
+     block, fallthrough edge to body.  */
+  gcc_assert (EDGE_COUNT (entry_bb->succs) == 2
+	      && BRANCH_EDGE (entry_bb)->dest == exit_bb);
+
+  /* If cont_bb non-NULL, it has 2 successors.  The branch successor is
+     body_bb, or to a block whose only successor is the body_bb.  Its
+     fallthrough successor is the final block (same as the branch
+     successor of the entry_bb).  */
+  if (cont_bb)
+    {
+      basic_block body_bb = FALLTHRU_EDGE (entry_bb)->dest;
+      basic_block bed = BRANCH_EDGE (cont_bb)->dest;
+
+      gcc_assert (FALLTHRU_EDGE (cont_bb)->dest == exit_bb);
+      gcc_assert (bed == body_bb || single_succ_edge (bed)->dest == body_bb);
+    }
+  else
+    gcc_assert (!gimple_in_ssa_p (cfun));
+
+  /* The exit block only has entry_bb and cont_bb as predecessors.  */
+  gcc_assert (EDGE_COUNT (exit_bb->preds) == 1 + (cont_bb != NULL));
+
+  tree chunk_no;
+  tree chunk_max = NULL_TREE;
+  tree bound, offset;
+  tree step = create_tmp_var (diff_type, ".step");
+  bool up = cond_code == LT_EXPR;
+  tree dir = build_int_cst (diff_type, up ? +1 : -1);
+  bool chunking = !gimple_in_ssa_p (cfun);
+  bool negating;
+
+  /* SSA instances.  */
+  tree offset_incr = NULL_TREE;
+  tree offset_init = NULL_TREE;
+
+  gimple_stmt_iterator gsi;
+  gassign *ass;
+  gcall *call;
+  gimple *stmt;
+  tree expr;
+  location_t loc;
+  edge split, be, fte;
+
+  /* Split the end of entry_bb to create head_bb.  */
+  split = split_block (entry_bb, last_stmt (entry_bb));
+  basic_block head_bb = split->dest;
+  entry_bb = split->src;
+
+  /* Chunk setup goes at end of entry_bb, replacing the omp_for.  */
+  gsi = gsi_last_bb (entry_bb);
+  gomp_for *for_stmt = as_a <gomp_for *> (gsi_stmt (gsi));
+  loc = gimple_location (for_stmt);
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      offset_init = gimple_omp_for_index (for_stmt, 0);
+      gcc_assert (integer_zerop (fd->loop.n1));
+      /* The SSA parallelizer does gang parallelism.  */
+      gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG));
+    }
+
+  if (fd->collapse > 1)
+    {
+      counts = XALLOCAVEC (struct oacc_collapse, fd->collapse);
+      tree total = expand_oacc_collapse_init (fd, &gsi, counts,
+					      TREE_TYPE (fd->loop.n2));
+
+      if (SSA_VAR_P (fd->loop.n2))
+	{
+	  total = force_gimple_operand_gsi (&gsi, total, false, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	  ass = gimple_build_assign (fd->loop.n2, total);
+	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+	}
+      
+    }
+
+  tree b = fd->loop.n1;
+  tree e = fd->loop.n2;
+  tree s = fd->loop.step;
+
+  b = force_gimple_operand_gsi (&gsi, b, true, NULL_TREE, true, GSI_SAME_STMT);
+  e = force_gimple_operand_gsi (&gsi, e, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  /* Convert the step, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+  s = fold_convert (diff_type, s);
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, diff_type, s);
+  s = force_gimple_operand_gsi (&gsi, s, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  if (!chunking)
+    chunk_size = integer_zero_node;
+  expr = fold_convert (diff_type, chunk_size);
+  chunk_size = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+  /* Determine the range, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (iter_type);
+  expr = fold_build2 (MINUS_EXPR, plus_type,
+		      fold_convert (plus_type, negating ? b : e),
+		      fold_convert (plus_type, negating ? e : b));
+  expr = fold_convert (diff_type, expr);
+  if (negating)
+    expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+  tree range = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+
+  chunk_no = build_int_cst (diff_type, 0);
+  if (chunking)
+    {
+      gcc_assert (!gimple_in_ssa_p (cfun));
+
+      expr = chunk_no;
+      chunk_max = create_tmp_var (diff_type, ".chunk_max");
+      chunk_no = create_tmp_var (diff_type, ".chunk_no");
+
+      ass = gimple_build_assign (chunk_no, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+
+      call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+					 build_int_cst (integer_type_node,
+							IFN_GOACC_LOOP_CHUNKS),
+					 dir, range, s, chunk_size, gwv);
+      gimple_call_set_lhs (call, chunk_max);
+      gimple_set_location (call, loc);
+      gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+    }
+  else
+    chunk_size = chunk_no;
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_STEP),
+				     dir, range, s, chunk_size, gwv);
+  gimple_call_set_lhs (call, step);
+  gimple_set_location (call, loc);
+  gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+
+  /* Remove the GIMPLE_OMP_FOR.  */
+  gsi_remove (&gsi, true);
+
+  /* Fixup edges from head_bb */
+  be = BRANCH_EDGE (head_bb);
+  fte = FALLTHRU_EDGE (head_bb);
+  be->flags |= EDGE_FALSE_VALUE;
+  fte->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+
+  basic_block body_bb = fte->dest;
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+
+      offset = gimple_omp_continue_control_use (cont_stmt);
+      offset_incr = gimple_omp_continue_control_def (cont_stmt);
+    }
+  else
+    {
+      offset = create_tmp_var (diff_type, ".offset");
+      offset_init = offset_incr = offset;
+    }
+  bound = create_tmp_var (TREE_TYPE (offset), ".bound");
+
+  /* Loop offset & bound go into head_bb.  */
+  gsi = gsi_start_bb (head_bb);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_OFFSET),
+				     dir, range, s,
+				     chunk_size, gwv, chunk_no);
+  gimple_call_set_lhs (call, offset_init);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_BOUND),
+				     dir, range, s,
+				     chunk_size, gwv, offset_init);
+  gimple_call_set_lhs (call, bound);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  expr = build2 (cond_code, boolean_type_node, offset_init, bound);
+  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+		    GSI_CONTINUE_LINKING);
+
+  /* V assignment goes into body_bb.  */
+  if (!gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_start_bb (body_bb);
+
+      expr = build2 (plus_code, iter_type, b,
+		     fold_convert (plus_type, offset));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (fd->collapse > 1)
+	expand_oacc_collapse_vars (fd, &gsi, counts, v);
+    }
+
+  /* Loop increment goes into cont_bb.  If this is not a loop, we
+     will have spawned threads as if it was, and each one will
+     execute one iteration.  The specification is not explicit about
+     whether such constructs are ill-formed or not, and they can
+     occur, especially when noreturn routines are involved.  */
+  if (cont_bb)
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+      loc = gimple_location (cont_stmt);
+
+      /* Increment offset.  */
+      if (gimple_in_ssa_p (cfun))
+	expr= build2 (plus_code, iter_type, offset,
+		      fold_convert (plus_type, step));
+      else
+	expr = build2 (PLUS_EXPR, diff_type, offset, step);
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (offset_incr, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      expr = build2 (cond_code, boolean_type_node, offset_incr, bound);
+      gsi_insert_before (&gsi, gimple_build_cond_empty (expr), GSI_SAME_STMT);
+
+      /*  Remove the GIMPLE_OMP_CONTINUE.  */
+      gsi_remove (&gsi, true);
+
+      /* Fixup edges from cont_bb */
+      be = BRANCH_EDGE (cont_bb);
+      fte = FALLTHRU_EDGE (cont_bb);
+      be->flags |= EDGE_TRUE_VALUE;
+      fte->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+
+      if (chunking)
+	{
+	  /* Split the beginning of exit_bb to make bottom_bb.  We
+	     need to insert a nop at the start, because splitting is
+  	     after a stmt, not before.  */
+	  gsi = gsi_start_bb (exit_bb);
+	  stmt = gimple_build_nop ();
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  split = split_block (exit_bb, stmt);
+	  bottom_bb = split->src;
+	  exit_bb = split->dest;
+	  gsi = gsi_last_bb (bottom_bb);
+
+	  /* Chunk increment and test goes into bottom_bb.  */
+	  expr = build2 (PLUS_EXPR, diff_type, chunk_no,
+			 build_int_cst (diff_type, 1));
+	  ass = gimple_build_assign (chunk_no, expr);
+	  gsi_insert_after (&gsi, ass, GSI_CONTINUE_LINKING);
+
+	  /* Chunk test at end of bottom_bb.  */
+	  expr = build2 (LT_EXPR, boolean_type_node, chunk_no, chunk_max);
+	  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+			    GSI_CONTINUE_LINKING);
+
+	  /* Fixup edges from bottom_bb. */
+	  split->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+	  make_edge (bottom_bb, head_bb, EDGE_TRUE_VALUE);
+	}
+    }
+
+  gsi = gsi_last_bb (exit_bb);
+  gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  loc = gimple_location (gsi_stmt (gsi));
+
+  if (!gimple_in_ssa_p (cfun))
+    {
+      /* Insert the final value of V, in case it is live.  This is the
+	 value for the only thread that survives past the join.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+      expr = fold_build2 (MULT_EXPR, diff_type, expr, s);
+      expr = build2 (plus_code, iter_type, b, fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+    }
+
+  /* Remove the OMP_RETURN. */
+  gsi_remove (&gsi, true);
+
+  if (cont_bb)
+    {
+      /* We now have one or two nested loops.  Update the loop
+	 structures.  */
+      struct loop *parent = entry_bb->loop_father;
+      struct loop *body = body_bb->loop_father;
+      
+      if (chunking)
+	{
+	  struct loop *chunk_loop = alloc_loop ();
+	  chunk_loop->header = head_bb;
+	  chunk_loop->latch = bottom_bb;
+	  add_loop (chunk_loop, parent);
+	  parent = chunk_loop;
+	}
+      else if (parent != body)
+	{
+	  gcc_assert (body->header == body_bb);
+	  gcc_assert (body->latch == cont_bb
+		      || single_pred (body->latch) == cont_bb);
+	  parent = NULL;
+	}
+
+      if (parent)
+	{
+	  struct loop *body_loop = alloc_loop ();
+	  body_loop->header = body_bb;
+	  body_loop->latch = cont_bb;
+	  add_loop (body_loop, parent);
+	}
+    }
+}
+
+/* Expand the OMP loop defined by REGION.  */
+
+static void
+expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+{
+  struct omp_for_data fd;
+  struct omp_for_data_loop *loops;
+
+  loops
+    = (struct omp_for_data_loop *)
+      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
+	      * sizeof (struct omp_for_data_loop));
+  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
+			&fd, loops);
+  region->sched_kind = fd.sched_kind;
+
+  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
+  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  if (region->cont)
+    {
+      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
+      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+    }
+  else
+    /* If there isn't a continue then this is a degerate case where
+       the introduction of abnormal edges during lowering will prevent
+       original loops from being detected.  Fix that up.  */
+    loops_state_set (LOOPS_NEED_FIXUP);
+
+  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
+    expand_omp_simd (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
+    expand_cilk_for (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gcc_assert (!inner_stmt);
+      expand_oacc_for (region, &fd);
+    }
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
+    {
+      if (gimple_omp_for_combined_into_p (fd.for_stmt))
+	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
+      else
+	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
+    }
+  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
+	   && !fd.have_ordered)
+    {
+      if (fd.chunk_size == NULL)
+	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
+      else
+	expand_omp_for_static_chunk (region, &fd, inner_stmt);
+    }
+  else
+    {
+      int fn_index, start_ix, next_ix;
+
+      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
+		  == GF_OMP_FOR_KIND_FOR);
+      if (fd.chunk_size == NULL
+	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
+	fd.chunk_size = integer_zero_node;
+      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
+      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
+		  ? 3 : fd.sched_kind;
+      if (!fd.ordered)
+	fn_index += fd.have_ordered * 4;
+      if (fd.ordered)
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
+      else
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
+      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
+      if (fd.iter_type == long_long_unsigned_type_node)
+	{
+	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
+			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
+	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
+		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
+	}
+      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
+			      (enum built_in_function) next_ix, inner_stmt);
+    }
+
+  if (gimple_in_ssa_p (cfun))
+    update_ssa (TODO_update_ssa_only_virtuals);
+}
+
+
+/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
+
+	v = GOMP_sections_start (n);
+    L0:
 	switch (v)
 	  {
 	  case 0:

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-24 21:11               ` Cesar Philippidis
@ 2015-10-26  9:47                 ` Jakub Jelinek
  2015-10-26 10:09                   ` Jakub Jelinek
  2015-10-26 22:32                   ` Cesar Philippidis
  0 siblings, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-26  9:47 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Sat, Oct 24, 2015 at 02:10:26PM -0700, Cesar Philippidis wrote:
> +static tree
> +c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
> +			    const char *str, tree list)
> +{
> +  const char *id = "num";
> +  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
> +  location_t loc = c_parser_peek_token (parser)->location;
> +
> +  if (kind == OMP_CLAUSE_VECTOR)
> +    id = "length";
> +
> +  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
> +    {
> +      c_parser_consume_token (parser);
> +
> +      do
> +	{
> +	  c_token *next = c_parser_peek_token (parser);
> +	  int idx = 0;
> +
> +	  /* Gang static argument.  */
> +	  if (kind == OMP_CLAUSE_GANG
> +	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
> +	    {
> +	      c_parser_consume_token (parser);
> +
> +	      if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
> +		goto cleanup_error;
> +
> +	      idx = 1;
> +	      if (ops[idx] != NULL_TREE )

Spurious space before ).

> +		{
> +		  c_parser_error (parser, "too many %<static%> arguements");

Typo, arguments.

> +static tree
> +c_parser_oacc_simple_clause (c_parser *parser ATTRIBUTE_UNUSED,
> +			     enum omp_clause_code code, tree list)

Please remove the useless ATTRIBUTE_UNUSED, you are using that parameter
unconditionally in c_parser_peek_token (parser).

> +{
> +  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
> +
> +  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
> +  OMP_CLAUSE_CHAIN (c) = list;
> +
> +  return c;
> +}
> +

> +int main ()
> +{
> +  int i;
> +  int v, w;
> +  int length, num;

Can you please initialize the v/w/length/num variables?
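E.g. just

  int v = 32, w = 19;
  int length = 1, num = 5;

or any other values; the point is only that they aren't used uninitialized.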

> +  #pragma acc kernels
> +#pragma acc loop gang(16, 24) /* { dg-error "unexpected argument" } */
> +  for (i = 0; i < 10; i++)

Missing indentation of the acc loop line.

Ok for trunk with those changes fixed.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-26  9:47                 ` Jakub Jelinek
@ 2015-10-26 10:09                   ` Jakub Jelinek
  2015-10-26 22:32                   ` Cesar Philippidis
  1 sibling, 0 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-26 10:09 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Mon, Oct 26, 2015 at 09:59:49AM +0100, Jakub Jelinek wrote:
> Ok for trunk with those changes fixed.

Oops, I've missed that there is no checking of the type (that the
expressions have INTEGRAL_TYPE_P); in the C FE, this is sometimes done
already during the parsing, sometimes during c_finish_omp_clauses.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 5/11] C++ FE changes
  2015-10-24 21:15         ` Cesar Philippidis
@ 2015-10-26 10:30           ` Jakub Jelinek
  2015-10-26 22:44             ` Cesar Philippidis
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-26 10:30 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Sat, Oct 24, 2015 at 02:11:41PM -0700, Cesar Philippidis wrote:
> @@ -29582,6 +29592,144 @@ cp_parser_oacc_data_clause_deviceptr (cp_parser *parser, tree list)
>    return list;
>  }
>  
> +/* OpenACC 2.0:
> +   auto
> +   independent
> +   nohost
> +   seq */
> +
> +static tree
> +cp_parser_oacc_simple_clause (cp_parser *ARG_UNUSED (parser),
> +			      enum omp_clause_code code,
> +			      tree list, location_t location)
> +{
> +  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
> +  tree c = build_omp_clause (location, code);
> +  OMP_CLAUSE_CHAIN (c) = list;
> +  return c;

Here the PARSER argument is unconditionally unused, I'd use what is used
elsewhere, i.e. cp_parser * /* parser */,

> +	      idx = 1;
> +	      if (ops[idx] != NULL)
> +		{
> +		  cp_parser_error (parser, "too many %<static%> arguements");

Typo, arguments.

> --- a/gcc/cp/semantics.c
> +++ b/gcc/cp/semantics.c
> @@ -5911,6 +5911,31 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
>  	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
>  	  goto handle_field_decl;
>  
> +	case OMP_CLAUSE_GANG:
> +	case OMP_CLAUSE_VECTOR:
> +	case OMP_CLAUSE_WORKER:
> +	  /* Operand 0 is the num: or length: argument.  */
> +	  t = OMP_CLAUSE_OPERAND (c, 0);
> +	  if (t == NULL_TREE)
> +	    break;
> +
> +	  if (!processing_template_decl)
> +	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
> +	  OMP_CLAUSE_OPERAND (c, 0) = t;
> +
> +	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
> +	    break;

I think it would be better to do the Operand 1 stuff first for
case OMP_CLAUSE_GANG: only, and then have /* FALLTHRU */ into
case OMP_CLAUSE_{VECTOR,WORKER}: which would handle the first argument.
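I.e. structured roughly like this (just a sketch of the shape, with the
actual checking of the operands as discussed below):

  case OMP_CLAUSE_GANG:
    /* Operand 1 is the gang static: argument.  */
    t = OMP_CLAUSE_OPERAND (c, 1);
    if (t != NULL_TREE)
      {
        /* ... same checks and cleanup point wrapping as for
           operand 0 below ...  */
      }
    /* FALLTHRU */
  case OMP_CLAUSE_WORKER:
  case OMP_CLAUSE_VECTOR:
    /* Operand 0 is the num: or length: argument.  */
    t = OMP_CLAUSE_OPERAND (c, 0);
    /* ... checks and cleanup point wrapping ...  */
    break;

or so.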

You should add testing that the operand has INTEGRAL_TYPE_P type
(except that for processing_template_decl it can be
type_dependent_expression_p instead of INTEGRAL_TYPE_P).

Also, the if (t == NULL_TREE) stuff looks fishy, because e.g. right now
if you have OMP_CLAUSE_GANG gang (static: expr) or similar,
you wouldn't wrap the expr into cleanup point.
So, instead it should be
  if (t)
    {
      if (t == error_mark_node)
	remove = true;
      else if (!type_dependent_expression_p (t)
                   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
	{
	  error_at (OMP_CLAUSE_LOCATION (c), ...);
	  remove = true;
        }
      else
	{
	  t = mark_rvalue_use (t);
	  if (!processing_template_decl)
	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
	  OMP_CLAUSE_OPERAND (c, 0) = t;
	}
    }
or so.  Also, can the expressions be arbitrary integers, or just
non-negative, or positive?  If it is INTEGER_CST, that is something that
could be checked here too.
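E.g. something like (purely illustrative, assuming the same rule as the C FE,
i.e. the value must be positive):

      else if (TREE_CODE (t) == INTEGER_CST
               && tree_int_cst_sgn (t) != 1)
        {
          warning_at (OMP_CLAUSE_LOCATION (c), 0, "%qs value must be positive",
                      omp_clause_code_name[OMP_CLAUSE_CODE (c)]);
          OMP_CLAUSE_OPERAND (c, 0) = integer_one_node;
        }

as an extra branch in the if chain sketched above.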

>  	  else if (!type_dependent_expression_p (t)
>  		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
>  	    {
> -	      error ("num_threads expression must be integral");
> +	     switch (OMP_CLAUSE_CODE (c))
> +		{
> +		case OMP_CLAUSE_NUM_TASKS:
> +		  error ("%<num_tasks%> expression must be integral"); break;
> +		case OMP_CLAUSE_NUM_TEAMS:
> +		  error ("%<num_teams%> expression must be integral"); break;
> +		case OMP_CLAUSE_NUM_THREADS:
> +		  error ("%<num_threads%> expression must be integral"); break;
> +		case OMP_CLAUSE_NUM_GANGS:
> +		  error ("%<num_gangs%> expression must be integral"); break;
> +		case OMP_CLAUSE_NUM_WORKERS:
> +		  error ("%<num_workers%> expression must be integral");
> +		  break;
> +		case OMP_CLAUSE_VECTOR_LENGTH:
> +		  error ("%<vector_length%> expression must be integral");
> +		  break;

When touching these, can you please use error_at (OMP_CLAUSE_LOCATION (c),
instead of error ( ?

> +		default:
> +		  error ("invalid argument");

What invalid argument?  I'd say that is clearly gcc_unreachable (); case.

But, I think it would be better to just use
  error_at (OMP_CLAUSE_LOCATION (c), "%qs expression must be integral",
	    omp_clause_code_name[OMP_CLAUSE_CODE (c)]);

> -		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
> -				  "%<num_threads%> value must be positive");
> +		      switch (OMP_CLAUSE_CODE (c))
> +			{
> +			case OMP_CLAUSE_NUM_TASKS:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<num_tasks%> value must be positive");
> +			  break;
> +			case OMP_CLAUSE_NUM_TEAMS:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<num_teams%> value must be positive");
> +			  break;
> +			case OMP_CLAUSE_NUM_THREADS:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<num_threads%> value must be"
> +				      "positive"); break;
> +			case OMP_CLAUSE_NUM_GANGS:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<num_gangs%> value must be positive");
> +			  break;
> +			case OMP_CLAUSE_NUM_WORKERS:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<num_workers%> value must be"
> +				      "positive"); break;
> +			case OMP_CLAUSE_VECTOR_LENGTH:
> +			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
> +				      "%<vector_length%> value must be"
> +				      "positive"); break;
> +			default:
> +			  error ("invalid argument");
> +			}

And similarly here.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-21 19:50 ` [OpenACC 8/11] device-specific lowering Nathan Sidwell
  2015-10-22  9:32   ` Jakub Jelinek
@ 2015-10-26 15:21   ` Jakub Jelinek
  2015-10-26 16:23     ` Nathan Sidwell
  2015-10-28  1:06     ` Nathan Sidwell
  1 sibling, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-26 15:21 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Wed, Oct 21, 2015 at 03:49:08PM -0400, Nathan Sidwell wrote:
> This patch is the device-specific half of the previous patch.  It processes
> the partition head & tail markers and loop abstraction functions inserted
> during omp lowering.

> > I don't see anything that would e.g. set the various flags that e.g. OpenMP
> > #pragma omp simd or Cilk+ #pragma simd sets, like loop->safelen,
> > loop->force_vectorize, maybe loop->simduid and promote some vars to simduid
> > arrays if that is relevant to OpenACC.

> It won't convert them into such representations.

Can you fix that incrementally?  I'd expect that code marked with acc loop vector 
can't have loop carried backward lexical dependencies, at least not within
the adjacent number of iterations specified in vector clause?
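I.e. roughly what expand_omp_simd already does for the innermost loop,
something like (sketch only; where exactly in expand_oacc_for or the later
device lowering this belongs is a separate question):

  struct loop *loop = body_bb->loop_father;
  loop->safelen = INT_MAX;  /* or the vector length, if one was specified  */
  loop->force_vectorize = true;
  cfun->has_force_vectorize_loops = true;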

> +/* Find the number of threads (POS = false), or thread number (POS =
> +   tre) for an OpenACC region partitioned as MASK.  Setup code

Typo, tre -> true.

> +static tree
> +oacc_thread_numbers (bool pos, int mask, gimple_seq *seq)
> +{
> +  tree res = pos ? NULL_TREE :  build_int_cst (unsigned_type_node, 1);

Formatting, too many spaces.

> +  if (res == NULL_TREE)
> +    res = build_int_cst (integer_type_node, 0);

integer_zero_node ?

> +/* Transform IFN_GOACC_LOOP calls to actual code.  See
> +   expand_oacc_for for where these are generated.  At the vector
> +   level, we stride loops, such that each  member of a warp will

Too many spaces before member.

> +  gimple_stmt_iterator gsi = gsi_for_stmt (call);
> +  unsigned code = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 0));

Missing space before T.

> +  tree dir = gimple_call_arg (call, 1);
> +  tree range = gimple_call_arg (call, 2);
> +  tree step = gimple_call_arg (call, 3);
> +  tree chunk_size = NULL_TREE;
> +  unsigned mask = (unsigned)TREE_INT_CST_LOW (gimple_call_arg (call, 5));

Ditto.

> +static void
> +oacc_loop_xform_head_tail (gcall *from, int level)
> +{
> +  gimple_stmt_iterator gsi = gsi_for_stmt (from);
> +  unsigned code = TREE_INT_CST_LOW (gimple_call_arg (from, 0));
> +  tree replacement  = build_int_cst (unsigned_type_node, level);

Too many spaces.

> +      switch (gimple_call_internal_fn (call))
> +	{
> +	case IFN_UNIQUE:
> +	  {
> +	    unsigned c = TREE_INT_CST_LOW (gimple_call_arg (call, 0));

Shouldn't c be of type enum ifn_unique_kind ?
What about code?
> +
> +	default:
> +	  break;
> +	}
> +    }
> +
> + break2:;

Can't you replace goto break2; with return; and
remove break2:; ?

> +	  if (TREE_INT_CST_LOW (gimple_call_arg (call, 0))
> +	      == IFN_GOACC_LOOP_BOUND)
> +	    goto break2;
> +	}
> +
> +      /* If we didn't see LOOP_BOUND, it should be in the single
> +	 successor block.  */
> +      basic_block bb = single_succ (gsi_bb (gsi));
> +      gsi = gsi_start_bb (bb);
> +    }
> +
> + break2:;

Similarly.
> +	    if (gimple_vdef (call))
> +	      replace_uses_by (gimple_vdef (call),
> +			       gimple_vuse (call));

Why the line break in between the arguments?  The line wouldn't be really
long.

Otherwise LGTM.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-26 15:21   ` Jakub Jelinek
@ 2015-10-26 16:23     ` Nathan Sidwell
  2015-10-26 16:56       ` Jakub Jelinek
  2015-10-28  1:06     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-26 16:23 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/26/15 08:13, Jakub Jelinek wrote:

>> It won't convert them into such representations.
>
> Can you fix that incrementally?  I'd expect that code marked with acc loop vector
> can't have loop carried backward lexical dependencies, at least not within
> the adjacent number of iterations specified in vector clause?

Sure.  I was using 'won't' to  describe the patch,  not claiming it could never 
be changed to do that kind of thing.


> Otherwise LGTM.

I think all your other comments are spot on and will address.  Do you want 
another review with them fixed?

If not, I think the  only thing remaining is  the IFN_UNIQUE patch, which (At 
least) needs an update to use targetm.have... conversion.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-26 16:23     ` Nathan Sidwell
@ 2015-10-26 16:56       ` Jakub Jelinek
  2015-10-26 18:10         ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-26 16:56 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On Mon, Oct 26, 2015 at 09:13:28AM -0700, Nathan Sidwell wrote:
> On 10/26/15 08:13, Jakub Jelinek wrote:
> 
> >>It won't convert them into such representations.
> >
> >Can you fix that incrementally?  I'd expect that code marked with acc loop vector
> >can't have loop carried backward lexical dependencies, at least not within
> >the adjacent number of iterations specified in vector clause?
> 
> Sure.  I was using 'won't' to  describe the patch,  not claiming it could
> never be changed to do that kind of thing.

Ok.

> >Otherwise LGTM.
> 
> I think all your other comments are spot on and will address.  Do you want
> another review with them fixed?

Just committing the fixed version (and posting what you've committed for patches
that changed since the patch that was posted earlier) is enough.

> If not, I think the  only thing remaining is  the IFN_UNIQUE patch, which
> (At least) needs an update to use targetm.have... conversion.

Ok, will wait till you make those changes then?

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-26 16:56       ` Jakub Jelinek
@ 2015-10-26 18:10         ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-26 18:10 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/26/15 09:51, Jakub Jelinek wrote:

>> If not, I think the  only thing remaining is  the IFN_UNIQUE patch, which
>> (At least) needs an update to use targetm.have... conversion.
>
> Ok, will wait till you make those changes then?

Hope to have that later today.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-26  9:47                 ` Jakub Jelinek
  2015-10-26 10:09                   ` Jakub Jelinek
@ 2015-10-26 22:32                   ` Cesar Philippidis
  2015-10-27 20:23                     ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-26 22:32 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 220 bytes --]

On 10/26/2015 01:59 AM, Jakub Jelinek wrote:

> Ok for trunk with those changes fixed.

Here's the patch with those changes. Nathan will commit this patch along with
the rest of the OpenACC execution model patches.

Thanks,
Cesar


[-- Attachment #2: 04-cfe-cjp-4.diff --]
[-- Type: text/x-patch, Size: 14773 bytes --]

2015-10-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	gcc/c/
	* c-parser.c (c_parser_oacc_shape_clause): New.
	(c_parser_oacc_simple_clause): New.
	(c_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Add gang, worker, vector, auto, seq.

2015-10-26  Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* c-c++-common/goacc/loop-shape.c: New test.


diff --git a/gcc/c/c-parser.c b/gcc/c/c-parser.c
index c8c6a2d..13f09d8 100644
--- a/gcc/c/c-parser.c
+++ b/gcc/c/c-parser.c
@@ -11188,6 +11188,167 @@ c_parser_omp_clause_num_workers (c_parser *parser, tree list)
 }
 
 /* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+c_parser_oacc_shape_clause (c_parser *parser, omp_clause_code kind,
+			    const char *str, tree list)
+{
+  const char *id = "num";
+  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
+  location_t loc = c_parser_peek_token (parser)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_consume_token (parser);
+
+      do
+	{
+	  c_token *next = c_parser_peek_token (parser);
+	  int idx = 0;
+
+	  /* Gang static argument.  */
+	  if (kind == OMP_CLAUSE_GANG
+	      && c_parser_next_token_is_keyword (parser, RID_STATIC))
+	    {
+	      c_parser_consume_token (parser);
+
+	      if (!c_parser_require (parser, CPP_COLON, "expected %<:%>"))
+		goto cleanup_error;
+
+	      idx = 1;
+	      if (ops[idx] != NULL_TREE)
+		{
+		  c_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (c_parser_next_token_is (parser, CPP_MULT))
+		{
+		  c_parser_consume_token (parser);
+		  ops[idx] = integer_minus_one_node;
+
+		  if (c_parser_next_token_is (parser, CPP_COMMA))
+		    {
+		      c_parser_consume_token (parser);
+		      continue;
+		    }
+		  else
+		    break;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (c_parser_next_token_is (parser, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->value)) == 0
+		   && c_parser_peek_2nd_token (parser)->type == CPP_COLON)
+	    {
+	      c_parser_consume_token (parser);  /* id  */
+	      c_parser_consume_token (parser);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+	  if (ops[idx] != NULL_TREE)
+	    {
+	      c_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  location_t expr_loc = c_parser_peek_token (parser)->location;
+	  tree expr = c_parser_expr_no_commas (parser, NULL).value;
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  expr = c_fully_fold (expr, false, NULL);
+
+	  /* Attempt to statically determine when the number isn't a
+	     positive integer.  */
+
+	  if (!INTEGRAL_TYPE_P (TREE_TYPE (expr)))
+	    {
+	      c_parser_error (parser, "expected integer expression");
+	      return list;
+	    }
+
+	  tree c = fold_build2_loc (expr_loc, LE_EXPR, boolean_type_node, expr,
+				    build_int_cst (TREE_TYPE (expr), 0));
+	  if (c == boolean_true_node)
+	    {
+	      warning_at (loc, 0,
+			  "%<%s%> value must be positive", str);
+	      expr = integer_one_node;
+	    }
+
+	  ops[idx] = expr;
+
+	  if (kind == OMP_CLAUSE_GANG
+	      && c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      continue;
+	    }
+	  break;
+	}
+      while (1);
+
+      if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str);
+
+  c = build_omp_clause (loc, kind);
+
+  if (ops[1])
+    OMP_CLAUSE_OPERAND (c, 1) = ops[1];
+
+  OMP_CLAUSE_OPERAND (c, 0) = ops[0];
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, 0);
+  return list;
+}
+
+/* OpenACC:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+c_parser_oacc_simple_clause (c_parser *parser, enum omp_clause_code code,
+			     tree list)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code]);
+
+  tree c = build_omp_clause (c_parser_peek_token (parser)->location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+}
+
+/* OpenACC:
    async [( int-expr )] */
 
 static tree
@@ -12393,6 +12554,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						clauses);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = c_parser_omp_clause_collapse (parser, clauses);
 	  c_name = "collapse";
@@ -12429,6 +12595,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_omp_clause_firstprivate (parser, clauses);
 	  c_name = "firstprivate";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -12477,6 +12648,16 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = c_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						clauses);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						c_name,	clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = c_parser_omp_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -12485,6 +12666,11 @@ c_parser_oacc_all_clauses (c_parser *parser, omp_clause_mask mask,
 	  clauses = c_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = c_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						c_name, clauses);
+	  break;
 	default:
 	  c_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -13015,6 +13201,11 @@ c_parser_oacc_enter_exit_data (c_parser *parser, bool enter)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree
diff --git a/gcc/testsuite/c-c++-common/goacc/loop-shape.c b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
new file mode 100644
index 0000000..b6d3156
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/loop-shape.c
@@ -0,0 +1,322 @@
+/* Exercise *_parser_oacc_shape_clause by checking various combinations
+   of gang, worker and vector clause arguments.  */
+
+/* { dg-do compile } */
+
+int main ()
+{
+  int i;
+  int v = 32, w = 19;
+  int length = 1, num = 5;
+
+  /* Valid uses.  */
+
+  #pragma acc kernels
+  #pragma acc loop gang worker vector
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(26)
+  for (i = 0; i < 10; i++)
+    ;
+  
+  #pragma acc kernels
+  #pragma acc loop gang(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: 16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num: v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(16)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(v)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 16, num: 5)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: v, num: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static: 6)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: 5, num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*, 1)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, static:*)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5, static: 4)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: v, static: w)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, static:num)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:length)
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:length)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:num)
+  for (i = 0; i < 10; i++)
+    ;  
+
+  /* Invalid uses.  */
+  
+  #pragma acc kernels
+  #pragma acc loop gang(16, 24) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(v, w) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 1 num:2, num:3, 4) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1, num:2, num:3, 4) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num, num:5) /* { dg-error "unexpected argument" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(length:num) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(5, length:length) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(num:length) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(length:5) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(1, num:2) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static: * abc) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:*num:1) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num: 5 static: *) /* { dg-error "expected '.' before" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(,static: *) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(,length:5) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(,num:10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(,10) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(-12) /* { dg-warning "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num:-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(num:1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop gang(static:1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop worker(num:1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:-1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  #pragma acc kernels
+  #pragma acc loop vector(length:1.0) /* { dg-error "" } */
+  for (i = 0; i < 10; i++)
+    ;
+
+  return 0;
+}


* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-25 14:29                   ` Nathan Sidwell
@ 2015-10-26 22:35                     ` Nathan Sidwell
  2015-10-27  8:18                       ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-26 22:35 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 662 bytes --]

Richard, Jakub,
this updates patch 1 to use the target-insns.def mechanism of detecting 
conditionally-implemented instructions.  Otherwise it's the same as yesterday's 
patch.  To recap:

1) Moved the subcodes to an enumeration in internal-fn.h

2) Remove ECF_LEAF

3) Added check in initialize_ctrl_altering

4) tracer code now (continues) to only look in last stmt of block

I looked at fnsplit and do not believe I need changes there.  That pass changes 
things like:
   if (cheap test)
     do cheap thing
   else
     do complex thing

to break out the else part into a separate function.   That's fine -- it'll copy 
the whole CFG of interest.
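
For concreteness, here's a hand-written sketch (not from the patch; cheap_p and 
expensive are made-up names) of the shape fnsplit targets:

/* Hypothetical illustration only: fnsplit would outline the cold branch
   of foo into an artificial foo.part.0 function.  The whole branch,
   complete sub-CFG included, moves in one piece, so any paired
   IFN_UNIQUE markers inside it are never separated.  */
extern int cheap_p (int);    /* made-up cheap predicate */
extern int expensive (int);  /* made-up expensive helper */

int
foo (int x)
{
  if (cheap_p (x))
    return x;            /* cheap thing, stays in foo */
  return expensive (x);  /* complex thing, candidate for foo.part.0 */
}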

ok?

nathan

[-- Attachment #2: 01-trunk-unique-1026.patch --]
[-- Type: text/x-patch, Size: 7113 bytes --]

2015-10-26  Nathan Sidwell  <nathan@codesourcery.com>
	
	* internal-fn.c (expand_UNIQUE): New.
	* internal-fn.h (enum ifn_unique_kind): New.
	* internal-fn.def (IFN_UNIQUE): New.
	* target-insns.def (unique): Define.
	* gimple.h (gimple_call_internal_unique_p): New.
	* gimple.c (gimple_call_same_target_p): Check internal fn
	uniqueness.
	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
	* tree-ssa-threadedge.c
	(record_temporary_equivalences_from_stmts): Likewise.
	* tree-cfg.c (gimple_call_initialize_ctrl_altering): Likewise.

Index: gcc/target-insns.def
===================================================================
--- gcc/target-insns.def	(revision 229276)
+++ gcc/target-insns.def	(working copy)
@@ -89,5 +93,6 @@ DEF_TARGET_INSN (stack_protect_test, (rt
 DEF_TARGET_INSN (store_multiple, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (tablejump, (rtx x0, rtx x1))
 DEF_TARGET_INSN (trap, (void))
+DEF_TARGET_INSN (unique, (void))
 DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 229276)
+++ gcc/gimple.c	(working copy)
@@ -1346,7 +1346,8 @@ gimple_call_same_target_p (const gimple
 {
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2)
+	    && !gimple_call_internal_unique_p (as_a <const gcall *> (c1)));
   else
     return (gimple_call_fn (c1) == gimple_call_fn (c2)
 	    || (gimple_call_fndecl (c1)
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 229276)
+++ gcc/gimple.h	(working copy)
@@ -2895,6 +2895,21 @@ gimple_call_internal_fn (const gimple *g
   return gimple_call_internal_fn (gc);
 }
 
+/* Return true if this internal gimple call is unique.  */
+
+static inline bool
+gimple_call_internal_unique_p (const gcall *gs)
+{
+  return gimple_call_internal_fn (gs) == IFN_UNIQUE;
+}
+
+static inline bool
+gimple_call_internal_unique_p (const gimple *gs)
+{
+  const gcall *gc = GIMPLE_CHECK2<const gcall *> (gs);
+  return gimple_call_internal_unique_p (gc);
+}
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229276)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,6 +1958,30 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
+/* Expand the IFN_UNIQUE function according to its first argument.  */
+
+static void
+expand_UNIQUE (gcall *stmt)
+{
+  rtx pattern = NULL_RTX;
+  enum ifn_unique_kind kind
+    = (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (stmt, 0));
+
+  switch (kind)
+    {
+    default:
+      gcc_unreachable ();
+
+    case IFN_UNIQUE_UNSPEC:
+      if (targetm.have_unique ())
+	pattern = targetm.gen_unique ();
+      break;
+    }
+
+  if (pattern)
+    emit_insn (pattern);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229276)
+++ gcc/internal-fn.h	(working copy)
@@ -20,6 +20,11 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_INTERNAL_FN_H
 #define GCC_INTERNAL_FN_H
 
+/* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
+enum ifn_unique_kind {
+  IFN_UNIQUE_UNSPEC   /* Undifferentiated UNIQUE.  */
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229276)
+++ gcc/internal-fn.def	(working copy)
@@ -65,3 +65,10 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
+
+/* An unduplicable, uncombinable function.  Generally used to preserve
+   a CFG property in the face of jump threading, tail merging or
+   other such optimizations.  The first argument distinguishes
+   between uses. See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
Index: gcc/tracer.c
===================================================================
--- gcc/tracer.c	(revision 229276)
+++ gcc/tracer.c	(working copy)
@@ -93,18 +93,24 @@ bb_seen_p (basic_block bb)
 static bool
 ignore_bb_p (const_basic_block bb)
 {
-  gimple *g;
-
   if (bb->index < NUM_FIXED_BLOCKS)
     return true;
   if (optimize_bb_for_size_p (bb))
     return true;
 
-  /* A transaction is a single entry multiple exit region.  It must be
-     duplicated in its entirety or not at all.  */
-  g = last_stmt (CONST_CAST_BB (bb));
-  if (g && gimple_code (g) == GIMPLE_TRANSACTION)
-    return true;
+  if (gimple *g = last_stmt (CONST_CAST_BB (bb)))
+    {
+      /* A transaction is a single entry multiple exit region.  It
+	 must be duplicated in its entirety or not at all.  */
+      if (gimple_code (g) == GIMPLE_TRANSACTION)
+	return true;
+
+      /* An IFN_UNIQUE call must be duplicated as part of its group,
+	 or not at all.  */
+      if (is_gimple_call (g) && gimple_call_internal_p (g)
+	  && gimple_call_internal_unique_p (g))
+	return true;
+    }
 
   return false;
 }
Index: gcc/tree-ssa-threadedge.c
===================================================================
--- gcc/tree-ssa-threadedge.c	(revision 229276)
+++ gcc/tree-ssa-threadedge.c	(working copy)
@@ -283,6 +283,13 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL
+	  && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_unique_p (stmt))
+	return NULL;
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 229276)
+++ gcc/tree-cfg.c	(working copy)
@@ -487,7 +487,11 @@ gimple_call_initialize_ctrl_altering (gi
       || ((flags & ECF_TM_BUILTIN)
 	  && is_tm_ending_fndecl (gimple_call_fndecl (stmt)))
       /* BUILT_IN_RETURN call is same as return statement.  */
-      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN))
+      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN)
+      /* IFN_UNIQUE should be the last insn, to make checking for it
+	 as cheap as possible.  */
+      || (gimple_call_internal_p (stmt)
+	  && gimple_call_internal_unique_p (stmt)))
     gimple_call_set_ctrl_altering (stmt, true);
   else
     gimple_call_set_ctrl_altering (stmt, false);


* Re: [OpenACC 5/11] C++ FE changes
  2015-10-26 10:30           ` Jakub Jelinek
@ 2015-10-26 22:44             ` Cesar Philippidis
  2015-10-27  8:03               ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Cesar Philippidis @ 2015-10-26 22:44 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 3859 bytes --]

On 10/26/2015 03:20 AM, Jakub Jelinek wrote:
> On Sat, Oct 24, 2015 at 02:11:41PM -0700, Cesar Philippidis wrote:

>> --- a/gcc/cp/semantics.c
>> +++ b/gcc/cp/semantics.c
>> @@ -5911,6 +5911,31 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
>>  	    bitmap_set_bit (&firstprivate_head, DECL_UID (t));
>>  	  goto handle_field_decl;
>>  
>> +	case OMP_CLAUSE_GANG:
>> +	case OMP_CLAUSE_VECTOR:
>> +	case OMP_CLAUSE_WORKER:
>> +	  /* Operand 0 is the num: or length: argument.  */
>> +	  t = OMP_CLAUSE_OPERAND (c, 0);
>> +	  if (t == NULL_TREE)
>> +	    break;
>> +
>> +	  if (!processing_template_decl)
>> +	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
>> +	  OMP_CLAUSE_OPERAND (c, 0) = t;
>> +
>> +	  if (OMP_CLAUSE_CODE (c) != OMP_CLAUSE_GANG)
>> +	    break;
> 
> I think it would be better to do the Operand 1 stuff first for
> case OMP_CLAUSE_GANG: only, and then have /* FALLTHRU */ into
> case OMP_CLAUSE_{VECTOR,WORKER}: which would handle the first argument.
> 
> You should add testing that the operand has INTEGRAL_TYPE_P type
> (except that for processing_template_decl it can be
> type_dependent_expression_p instead of INTEGRAL_TYPE_P).
>
> Also, the if (t == NULL_TREE) stuff looks fishy, because e.g. right now
> if you have OMP_CLAUSE_GANG gang (static: expr) or similar,
> you wouldn't wrap the expr into cleanup point.
> So, instead it should be
>   if (t)
>     {
>       if (t == error_mark_node)
> 	remove = true;
>       else if (!type_dependent_expression_p (t)
>                    && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
> 	{
> 	  error_at (OMP_CLAUSE_LOCATION (c), ...);
> 	  remove = true;
>         }
>       else
> 	{
> 	  t = mark_rvalue_use (t);
> 	  if (!processing_template_decl)
> 	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
> 	  OMP_CLAUSE_OPERAND (c, 0) = t;
> 	}
>     }
> or so.  Also, can the expressions be arbitrary integers, or just
> non-negative, or positive?  If it is INTEGER_CST, that is something that
> could be checked here too.

I ended up handling these with OMP_CLAUSE_NUM_*, since they all require
positive integer expressions. The only exception was OMP_CLAUSE_GANG,
which has two optional arguments.

>>  	  else if (!type_dependent_expression_p (t)
>>  		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
>>  	    {
>> -	      error ("num_threads expression must be integral");
>> +	     switch (OMP_CLAUSE_CODE (c))
>> +		{
>> +		case OMP_CLAUSE_NUM_TASKS:
>> +		  error ("%<num_tasks%> expression must be integral"); break;
>> +		case OMP_CLAUSE_NUM_TEAMS:
>> +		  error ("%<num_teams%> expression must be integral"); break;
>> +		case OMP_CLAUSE_NUM_THREADS:
>> +		  error ("%<num_threads%> expression must be integral"); break;
>> +		case OMP_CLAUSE_NUM_GANGS:
>> +		  error ("%<num_gangs%> expression must be integral"); break;
>> +		case OMP_CLAUSE_NUM_WORKERS:
>> +		  error ("%<num_workers%> expression must be integral");
>> +		  break;
>> +		case OMP_CLAUSE_VECTOR_LENGTH:
>> +		  error ("%<vector_length%> expression must be integral");
>> +		  break;
> 
> When touching these, can you please use error_at (OMP_CLAUSE_LOCATION (c),
> instead of error ( ?

Done

>> +		default:
>> +		  error ("invalid argument");
> 
> What invalid argument?  I'd say that is clearly gcc_unreachable (); case.
> 
> But, I think it would be better to just use
>   error_at (OMP_CLAUSE_LOCATION (c), "%qs expression must be integral",
> 	    omp_clause_code_name[c]);

I used that generic message for all of those clauses except for _GANG,
_WORKER and _VECTOR.  The gang clause, at the very least, needs a specific
message to disambiguate its static and num arguments.  If you want, I can
handle _WORKER and _VECTOR with the generic message; I only gave them
specific messages because their arguments are optional, whereas they are
mandatory for the other clauses.
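
For illustration, here's an untested hand-written snippet (not part of the
submitted testsuite) of the input that motivates the separate gang messages;
with a non-integral argument the C++ front end would be expected to pick the
matching diagnostic from the semantics.c hunk below:

/* Untested illustration: the gang clause can carry both a num: and a
   static: argument, so a single generic "expression must be integral"
   message would not say which operand is at fault.  */
void
f (float x)
{
  int i;
  #pragma acc kernels
  #pragma acc loop gang(num: x)    /* 'gang' num expression must be integral */
  for (i = 0; i < 10; i++)
    ;
  #pragma acc kernels
  #pragma acc loop gang(static: x) /* 'gang' static expression must be integral */
  for (i = 0; i < 10; i++)
    ;
}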

Is this patch OK for trunk?

Cesar


[-- Attachment #2: 05-cpfe-cjp-4.diff --]
[-- Type: text/x-patch, Size: 16721 bytes --]

2015-10-26  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	gcc/cp/
	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker. Unify the handling of teams, tasks and vector_length with
	the other loop shape clauses.

2015-10-26  Nathan Sidwell <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.
	* g++.dg/gomp/pr33372-3.C: Likewise.

diff --git a/gcc/cp/parser.c b/gcc/cp/parser.c
index 7555bf3..5d07487 100644
--- a/gcc/cp/parser.c
+++ b/gcc/cp/parser.c
@@ -29064,7 +29064,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29122,7 +29124,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29212,6 +29216,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29238,7 +29244,9 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29246,6 +29254,8 @@ cp_parser_omp_clause_name (cp_parser *parser)
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29582,6 +29592,146 @@ cp_parser_oacc_data_clause_deviceptr (cp_parser *parser, tree list)
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser * /* parser  */,
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, omp_clause_code kind,
+			     const char *str, tree list)
+{
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  cp_token *next = cp_lexer_peek_token (lexer);
+	  int idx = 0;
+
+	  /* Gang static argument.  */
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is_keyword (lexer, RID_STATIC))
+	    {
+	      cp_lexer_consume_token (lexer);
+
+	      if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		goto cleanup_error;
+
+	      idx = 1;
+	      if (ops[idx] != NULL)
+		{
+		  cp_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		{
+		  cp_lexer_consume_token (lexer);
+		  ops[idx] = integer_minus_one_node;
+
+		  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+		    {
+		      cp_lexer_consume_token (lexer);
+		      continue;
+		    }
+		  else break;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (cp_lexer_next_token_is (lexer, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->u.value)) == 0
+		   && cp_lexer_nth_token_is (lexer, 2, CPP_COLON))
+	    {
+	      cp_lexer_consume_token (lexer);  /* id  */
+	      cp_lexer_consume_token (lexer);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+	  if (ops[idx] != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  ops[idx] = expr;
+
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    {
+	      cp_lexer_consume_token (lexer);
+	      continue;
+	    }
+	  break;
+	}
+      while (1);
+
+      if (!cp_parser_require (parser, CPP_CLOSE_PAREN, RT_CLOSE_PAREN))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  c = build_omp_clause (loc, kind);
+
+  if (ops[1])
+    OMP_CLAUSE_OPERAND (c, 1) = ops[1];
+
+  OMP_CLAUSE_OPERAND (c, 0) = ops[0];
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  cp_parser_skip_to_closing_parenthesis (parser, false, false, true);
+  return list;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31306,6 +31456,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31338,6 +31493,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31382,6 +31542,16 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31390,6 +31560,11 @@ cp_parser_oacc_all_clauses (cp_parser *parser, omp_clause_mask mask,
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						 c_name, clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34339,6 +34514,11 @@ cp_parser_oacc_kernels (cp_parser *parser, cp_token *pragma_tok)
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION))
 
 static tree
diff --git a/gcc/cp/semantics.c b/gcc/cp/semantics.c
index 11315d9..2abc73d 100644
--- a/gcc/cp/semantics.c
+++ b/gcc/cp/semantics.c
@@ -5965,14 +5965,76 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	  OMP_CLAUSE_FINAL_EXPR (c) = t;
 	  break;
 
+	case OMP_CLAUSE_GANG:
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t != NULL_TREE)
+	    {
+	      if (t == error_mark_node)
+		remove = true;
+	      else if (!type_dependent_expression_p (t)
+		       && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
+		{
+		  error ("%<gang%> static expression must be integral");
+		  remove = true;
+		}
+	      else
+		{
+		  t = mark_rvalue_use (t);
+		  if (!processing_template_decl)
+		    {
+		      t = maybe_constant_value (t);
+		      if (TREE_CODE (t) == INTEGER_CST
+			  && tree_int_cst_sgn (t) != 1
+			  && t != integer_minus_one_node)
+			{
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<gang%> static value must be "
+				      "positive");
+			  t = integer_one_node;
+			}
+		    }
+		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+		}
+	      OMP_CLAUSE_OPERAND (c, 1) = t;
+	    }
+	  /* Check operand 0, the num argument.  */
+
+	case OMP_CLAUSE_WORKER:
+	case OMP_CLAUSE_VECTOR:
+	  if (OMP_CLAUSE_OPERAND (c, 0) == NULL_TREE)
+	    break;
+
+	case OMP_CLAUSE_NUM_TASKS:
+	case OMP_CLAUSE_NUM_TEAMS:
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  t = OMP_CLAUSE_OPERAND (c, 0);
 	  if (t == error_mark_node)
 	    remove = true;
 	  else if (!type_dependent_expression_p (t)
 		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
 	    {
-	      error ("num_threads expression must be integral");
+	     switch (OMP_CLAUSE_CODE (c))
+		{
+		case OMP_CLAUSE_GANG:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<gang%> num expression must be integral"); break;
+		case OMP_CLAUSE_VECTOR:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<vector%> length expression must be integral");
+		  break;
+		case OMP_CLAUSE_WORKER:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<worker%> num expression must be integral");
+		  break;
+		default:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%qs expression must be integral",
+			    omp_clause_code_name[OMP_CLAUSE_CODE (c)]);
+		}
 	      remove = true;
 	    }
 	  else
@@ -5984,13 +6046,33 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 		  if (TREE_CODE (t) == INTEGER_CST
 		      && tree_int_cst_sgn (t) != 1)
 		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
+		      switch (OMP_CLAUSE_CODE (c))
+			{
+			case OMP_CLAUSE_GANG:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<gang%> num value must be positive");
+			  break;
+			case OMP_CLAUSE_VECTOR:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<vector%> length value must be "
+				      "positive");
+			  break;
+			case OMP_CLAUSE_WORKER:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<worker%> num value must be "
+				      "positive");
+			  break;
+			default:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%qs value must be positive",
+				      omp_clause_code_name
+				      [OMP_CLAUSE_CODE (c)]);
+			}
 		      t = integer_one_node;
 		    }
 		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
 		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
+	      OMP_CLAUSE_OPERAND (c, 0) = t;
 	    }
 	  break;
 
@@ -6062,35 +6144,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_NUM_TEAMS:
-	  t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_teams%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_teams%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TEAMS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_ASYNC:
 	  t = OMP_CLAUSE_ASYNC_EXPR (c);
 	  if (t == error_mark_node)
@@ -6110,16 +6163,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6547,35 +6590,6 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	    }
 	  goto check_dup_generic;
 
-	case OMP_CLAUSE_NUM_TASKS:
-	  t = OMP_CLAUSE_NUM_TASKS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_tasks%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_tasks%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TASKS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_GRAINSIZE:
 	  t = OMP_CLAUSE_GRAINSIZE_EXPR (c);
 	  if (t == error_mark_node)
@@ -6694,6 +6708,8 @@ finish_omp_clauses (tree clauses, bool allow_fields, bool declare_simd)
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-1.C b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
index 62900bf..e9da259 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-1.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-1.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   extern T n ();
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }
diff --git a/gcc/testsuite/g++.dg/gomp/pr33372-3.C b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
index 8220f3c..f0a1910 100644
--- a/gcc/testsuite/g++.dg/gomp/pr33372-3.C
+++ b/gcc/testsuite/g++.dg/gomp/pr33372-3.C
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   T n = 6;
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }


* Re: [OpenACC 7/11] execution model
  2015-10-25 15:03     ` Nathan Sidwell
@ 2015-10-26 23:39       ` Nathan Sidwell
  2015-10-27  8:33         ` Jakub Jelinek
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-26 23:39 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

Jakub, Richard,
This is the updated version of patch 7, using target-insns.def for the new 
insns.  Otherwise same as yesterday's, which had the following changes:

The significant change is that the head/tail unique markers are now threaded on 
a data dependency variable.  I had not noticed the lack of one causing a problem, 
but this is certainly more robust in making the ordering dependency between the 
calls explicit.  The dependency variable is the 2nd parameter, and all the other 
arguments are simply shifted along by one.

At RTL generation time the data dependency is exposed to the RTL expander, which 
in the PTX case simply emits a src->dst move; that move is eventually deleted as 
unnecessary.
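
As a compilable stand-in (a hand-written model, not the real internal
functions and not compiler output), the chaining looks like this; the
UNIQUE_* declarations are placeholder names for the IFN_UNIQUE variants, and
their trailing arguments mirror the kind/axis/tag operands described in
internal-fn.h:

/* Hand-written model of the data-dependency chaining: every marker
   consumes the current value of the dependency variable and returns a
   new one, so the head marks, the fork/join pair and the tail marks
   form an explicit use-def chain that cannot be reordered or split.  */
extern int UNIQUE_head_mark (int dd, int levels, int tag);
extern int UNIQUE_fork (int dd, int axis);
extern int UNIQUE_join (int dd, int axis);
extern int UNIQUE_tail_mark (int dd, int remaining);

void
model (int n)
{
  int dd = 0;
  dd = UNIQUE_head_mark (dd, 1 /* levels */, 0 /* tag */);
  dd = UNIQUE_fork (dd, 0 /* axis */);
  for (int i = 0; i < n; i++)
    ;  /* partitioned loop body */
  dd = UNIQUE_join (dd, 0 /* axis */);
  dd = UNIQUE_tail_mark (dd, 1 /* remaining */);
  (void) dd;
}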

ok?

nathan

[-- Attachment #2: 07-trunk-loop-mark-1026.patch --]
[-- Type: text/x-patch, Size: 51213 bytes --]

2015-10-26  Nathan Sidwell  <nathan@codesourcery.com>

	* internal-fn.def (IFN_GOACC_LOOP): New.
	* internal-fn.h (enum ifn_unique_kind): Add IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN, IFN_UNIQUE_OACC_HEAD_MARK,
	IFN_UNIQUE_OACC_TAIL_MARK.
	(enum ifn_goacc_loop_kind): New.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_FORK and
	IFN_UNIQUE_OACC_JOIN.
	(expand_GOACC_DIM_SIZE, expand_GOACC_DIM_POS, expand_GOACC_LOOP): New.
	* target-insns.def (oacc_dim_pos, oacc_dim_size, oacc_fork,
	oacc_join): New.
	* omp-low.c (struct omp_context): Remove gwv_below, gwv_this
	fields.
	(enum oacc_loop_flags): New.
	(enclosing_target_ctx): May return NULL.
	(ctx_in_oacc_kernels_region): New.
	(is_oacc_parallel, is_oacc_kernels): New.
	(check_oacc_kernel_gwv): New.
	(oacc_loop_or_target_p): Delete.
	(scan_omp_for): Don't calculate gwv mask.  Check parallel clause
	operands.  Strip reductions from kernels.
	(scan_omp_target): Don't calculate gwv mask.
	(lower_oacc_head_mark, lower_oacc_loop_marker,
	lower_oacc_head_tail): New.
	(expand_omp_for_static_nochunk, expand_omp_for_static_chunk):
	Remove OpenACC handling.
	(struct oacc_collapse): New.
	(expand_oacc_collapse_init, expand_oacc_collapse_vars): New.
	(expand_oacc_for): New.
	(expand_omp_for): Call expand_oacc_for.
	(lower_omp_for): Call lower_oacc_head_tail.

Index: gcc/target-insns.def
===================================================================
--- gcc/target-insns.def	(revision 229276)
+++ gcc/target-insns.def	(working copy)
@@ -64,6 +64,8 @@ DEF_TARGET_INSN (memory_barrier, (void))
 DEF_TARGET_INSN (movstr, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (nonlocal_goto, (rtx x0, rtx x1, rtx x2, rtx x3))
 DEF_TARGET_INSN (nonlocal_goto_receiver, (void))
+DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (probe_stack, (rtx x0))
 DEF_TARGET_INSN (probe_stack_address, (rtx x0))
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229276)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,30 +1958,60 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
 /* Expand the IFN_UNIQUE function according to its first argument.  */
 
 static void
 expand_UNIQUE (gcall *stmt)
 {
   rtx pattern = NULL_RTX;
   enum ifn_unique_kind kind
     = (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (stmt, 0));
 
   switch (kind)
     {
     default:
       gcc_unreachable ();
 
     case IFN_UNIQUE_UNSPEC:
       if (targetm.have_unique ())
 	pattern = targetm.gen_unique ();
       break;
+
+    case IFN_UNIQUE_OACC_FORK:
+    case IFN_UNIQUE_OACC_JOIN:
+      if (targetm.have_oacc_fork () && targetm.have_oacc_join ())
+	{
+	  tree lhs = gimple_call_lhs (stmt);
+	  rtx target = const0_rtx;
+
+	  if (lhs)
+	    target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+	  rtx data_dep = expand_normal (gimple_call_arg (stmt, 1));
+	  rtx axis = expand_normal (gimple_call_arg (stmt, 2));
+
+	  if (kind == IFN_UNIQUE_OACC_FORK)
+	    pattern = targetm.gen_oacc_fork (target, data_dep, axis);
+	  else
+	    pattern = targetm.gen_oacc_join (target, data_dep, axis);
+	}
+      else
+	gcc_unreachable ();
+      break;
     }
 
   if (pattern)
     emit_insn (pattern);
 }
 
+/* This is expanded by oacc_device_lower pass.  */
+
+static void
+expand_GOACC_LOOP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229276)
+++ gcc/internal-fn.h	(working copy)
@@ -20,9 +20,52 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_INTERNAL_FN_H
 #define GCC_INTERNAL_FN_H
 
 /* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
 enum ifn_unique_kind {
-  IFN_UNIQUE_UNSPEC   /* Undifferentiated UNIQUE.  */
+  IFN_UNIQUE_UNSPEC,  /* Undifferentiated UNIQUE.  */
+
+  /* FORK and JOIN mark the points at which OpenACC partitioned
+     execution is entered or exited.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, indicating the axis.  */
+  IFN_UNIQUE_OACC_FORK,
+  IFN_UNIQUE_OACC_JOIN,
+
+  /* HEAD_MARK and TAIL_MARK are used to demark the sequence entering
+     or leaving partitioned execution.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, remaining markers in this sequence
+     arg-3...: varargs on primary header  */
+  IFN_UNIQUE_OACC_HEAD_MARK,
+  IFN_UNIQUE_OACC_TAIL_MARK
+};
+
+/* INTEGER_CST values for IFN_GOACC_LOOP arg-0.  Allows the precise
+   stepping of the compute geometry over the loop iterations to be
+   deferred until it is known which compiler is generating the code.
+   The action is encoded in a constant first argument.
+
+     CHUNK_MAX = LOOP (CODE_CHUNKS, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     STEP = LOOP (CODE_STEP, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     OFFSET = LOOP (CODE_OFFSET, DIR, RANGE, STEP, CHUNK_SIZE, MASK, CHUNK_NO)
+     BOUND = LOOP (CODE_BOUND, DIR, RANGE, STEP, CHUNK_SIZE, MASK, OFFSET)
+
+     DIR - +1 for up loop, -1 for down loop
+     RANGE - Range of loop (END - BASE)
+     STEP - iteration step size
+     CHUNKING - size of chunking, (constant zero for no chunking)
+     CHUNK_NO - chunk number
+     MASK - partitioning mask.  */
+
+enum ifn_goacc_loop_kind {
+  IFN_GOACC_LOOP_CHUNKS,  /* Number of chunks.  */
+  IFN_GOACC_LOOP_STEP,    /* Size of each thread's step.  */
+  IFN_GOACC_LOOP_OFFSET,  /* Initial iteration value.  */
+  IFN_GOACC_LOOP_BOUND    /* Limit of iteration value.  */
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229276)
+++ gcc/internal-fn.def	(working copy)
@@ -65,9 +65,12 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 
 /* An unduplicable, uncombinable function.  Generally used to preserve
    a CFG property in the face of jump threading, tail merging or
    other such optimizations.  The first argument distinguishes
    between uses. See internal-fn.h for usage.  */
 DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
+
+/* OpenACC looping abstraction.  See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_PURE | ECF_NOTHROW, NULL)
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229276)
+++ gcc/omp-low.c	(working copy)
@@ -199,14 +200,6 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
-
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     levels below this one.  */
-  int gwv_below;
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     this level and above.  For parallel and kernels clauses, a mask
-     indicating which of num_gangs/num_workers/num_vectors was used.  */
-  int gwv_this;
 };
 
 /* A structure holding the elements of:
@@ -233,6 +226,23 @@ struct omp_for_data
   struct omp_for_data_loop *loops;
 };
 
+/*  Flags for an OpenACC loop.  */
+
+enum oacc_loop_flags {
+  OLF_SEQ	= 1u << 0,  /* Explicitly sequential  */
+  OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
+  OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
+  OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
+
+  /* Explicitly specified loop axes.  */
+  OLF_DIM_BASE = 4,
+  OLF_DIM_GANG   = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG),
+  OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER),
+  OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR),
+
+  OLF_MAX = OLF_DIM_BASE + GOMP_DIM_MAX
+};
+
 
 static splay_tree all_contexts;
 static int taskreg_nesting_level;
@@ -255,6 +291,28 @@ static gphi *find_phi_with_arg_on_edge (
       *handled_ops_p = false; \
       break;
 
+/* Return true if CTX corresponds to an oacc parallel region.  */
+
+static bool
+is_oacc_parallel (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_PARALLEL));
+}
+
+/* Return true if CTX corresponds to an oacc kernels region.  */
+
+static bool
+is_oacc_kernels (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
+}
+
 /* Helper function to get the name of the array containing the partial
    reductions for OpenACC reductions.  */
 static const char *
@@ -2889,28 +2947,95 @@ finish_taskreg_scan (omp_context *ctx)
     }
 }
 
+/* Find the enclosing offload context.  */
 
 static omp_context *
 enclosing_target_ctx (omp_context *ctx)
 {
-  while (ctx != NULL
-	 && gimple_code (ctx->stmt) != GIMPLE_OMP_TARGET)
-    ctx = ctx->outer;
-  gcc_assert (ctx != NULL);
+  for (; ctx; ctx = ctx->outer)
+    if (gimple_code (ctx->stmt) == GIMPLE_OMP_TARGET)
+      break;
+
   return ctx;
 }
 
+/* Return true if ctx is part of an oacc kernels region.  */
+
 static bool
-oacc_loop_or_target_p (gimple *stmt)
+ctx_in_oacc_kernels_region (omp_context *ctx)
+{
+  for (;ctx != NULL; ctx = ctx->outer)
+    {
+      gimple *stmt = ctx->stmt;
+      if (gimple_code (stmt) == GIMPLE_OMP_TARGET
+	  && gimple_omp_target_kind (stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+	return true;
+    }
+
+  return false;
+}
+
+/* Check the parallelism clauses inside a kernels regions.
+   Until kernels handling moves to use the same loop indirection
+   scheme as parallel, we need to do this checking early.  */
+
+static unsigned
+check_oacc_kernel_gwv (gomp_for *stmt, omp_context *ctx)
 {
-  enum gimple_code outer_type = gimple_code (stmt);
-  return ((outer_type == GIMPLE_OMP_TARGET
-	   && ((gimple_omp_target_kind (stmt)
-		== GF_OMP_TARGET_KIND_OACC_PARALLEL)
-	       || (gimple_omp_target_kind (stmt)
-		   == GF_OMP_TARGET_KIND_OACC_KERNELS)))
-	  || (outer_type == GIMPLE_OMP_FOR
-	      && gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_OACC_LOOP));
+  bool checking = true;
+  unsigned outer_mask = 0;
+  unsigned this_mask = 0;
+  bool has_seq = false, has_auto = false;
+
+  if (ctx->outer)
+    outer_mask = check_oacc_kernel_gwv (NULL,  ctx->outer);
+  if (!stmt)
+    {
+      checking = false;
+      if (gimple_code (ctx->stmt) != GIMPLE_OMP_FOR)
+	return outer_mask;
+      stmt = as_a <gomp_for *> (ctx->stmt);
+    }
+
+  for (tree c = gimple_omp_for_clauses (stmt); c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_GANG);
+	  break;
+	case OMP_CLAUSE_WORKER:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_WORKER);
+	  break;
+	case OMP_CLAUSE_VECTOR:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_VECTOR);
+	  break;
+	case OMP_CLAUSE_SEQ:
+	  has_seq = true;
+	  break;
+	case OMP_CLAUSE_AUTO:
+	  has_auto = true;
+	  break;
+	default:
+	  break;
+	}
+    }
+
+  if (checking)
+    {
+      if (has_seq && (this_mask || has_auto))
+	error_at (gimple_location (stmt), "%<seq%> overrides other"
+		  " OpenACC loop specifiers");
+      else if (has_auto && this_mask)
+	error_at (gimple_location (stmt), "%<auto%> conflicts with other"
+		  " OpenACC loop specifiers");
+
+      if (this_mask & outer_mask)
+	error_at (gimple_location (stmt), "inner loop uses same"
+		  " OpenACC parallelism as containing loop");
+    }
+
+  return outer_mask | this_mask;
 }
 
 /* Scan a GIMPLE_OMP_FOR.  */
@@ -2918,52 +3043,62 @@ oacc_loop_or_target_p (gimple *stmt)
 static void
 scan_omp_for (gomp_for *stmt, omp_context *outer_ctx)
 {
-  enum gimple_code outer_type = GIMPLE_ERROR_MARK;
   omp_context *ctx;
   size_t i;
   tree clauses = gimple_omp_for_clauses (stmt);
 
-  if (outer_ctx)
-    outer_type = gimple_code (outer_ctx->stmt);
-
   ctx = new_omp_context (stmt, outer_ctx);
 
   if (is_gimple_omp_oacc (stmt))
     {
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	ctx->gwv_this = outer_ctx->gwv_this;
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  int val;
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_GANG)
-	    val = MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WORKER)
-	    val = MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR)
-	    val = MASK_VECTOR;
-	  else
-	    continue;
-	  ctx->gwv_this |= val;
-	  if (!outer_ctx)
-	    {
-	      /* Skip; not nested inside a region.  */
-	      continue;
-	    }
-	  if (!oacc_loop_or_target_p (outer_ctx->stmt))
+      omp_context *tgt = enclosing_target_ctx (outer_ctx);
+
+      if (!tgt || is_oacc_parallel (tgt))
+	for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	  {
+	    char const *check = NULL;
+
+	    switch (OMP_CLAUSE_CODE (c))
+	      {
+	      case OMP_CLAUSE_GANG:
+		check = "gang";
+		break;
+
+	      case OMP_CLAUSE_WORKER:
+		check = "worker";
+		break;
+
+	      case OMP_CLAUSE_VECTOR:
+		check = "vector";
+		break;
+
+	      default:
+		break;
+	      }
+
+	    if (check && OMP_CLAUSE_OPERAND (c, 0))
+	      error_at (gimple_location (stmt),
+			"argument not permitted on %qs clause in"
+			" OpenACC %<parallel%>", check);
+	  }
+
+      if (tgt && is_oacc_kernels (tgt))
+	{
+	  /* Strip out reductions, as they are not  handled yet.  */
+	  tree *prev_ptr = &clauses;
+
+	  while (tree probe = *prev_ptr)
 	    {
-	      /* Skip; not nested inside an OpenACC region.  */
-	      continue;
-	    }
-	  if (outer_type == GIMPLE_OMP_FOR)
-	    outer_ctx->gwv_below |= val;
-	  if (OMP_CLAUSE_OPERAND (c, 0) != NULL_TREE)
-	    {
-	      omp_context *enclosing = enclosing_target_ctx (outer_ctx);
-	      if (gimple_omp_target_kind (enclosing->stmt)
-		  == GF_OMP_TARGET_KIND_OACC_PARALLEL)
-		error_at (gimple_location (stmt),
-			  "no arguments allowed to gang, worker and vector clauses inside parallel");
+	      tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
+	      
+	      if (OMP_CLAUSE_CODE (probe) == OMP_CLAUSE_REDUCTION)
+		*prev_ptr = *next_ptr;
+	      else
+		prev_ptr = next_ptr;
 	    }
+
+	  gimple_omp_for_set_clauses (stmt, clauses);
+	  check_oacc_kernel_gwv (stmt, ctx);
 	}
     }
 
@@ -2978,19 +3113,6 @@ scan_omp_for (gomp_for *stmt, omp_contex
       scan_omp_op (gimple_omp_for_incr_ptr (stmt, i), ctx);
     }
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
-
-  if (is_gimple_omp_oacc (stmt))
-    {
-      if (ctx->gwv_this & ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector may occur only once in a loop nest");
-      else if (ctx->gwv_below != 0
-	       && ctx->gwv_this > ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector must occur in this order in a loop nest");
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	outer_ctx->gwv_below |= ctx->gwv_below;
-    }
 }
 
 /* Scan an OpenMP sections directive.  */
@@ -3061,19 +3183,6 @@ scan_omp_target (gomp_target *stmt, omp_
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
     }
 
-  if (is_gimple_omp_oacc (stmt))
-    {
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_GANGS)
-	    ctx->gwv_this |= MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_WORKERS)
-	    ctx->gwv_this |= MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR_LENGTH)
-	    ctx->gwv_this |= MASK_VECTOR;
-	}
-    }
-
   scan_sharing_clauses (clauses, ctx);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
@@ -5806,6 +5915,176 @@ lower_send_shared_vars (gimple_seq *ilis
     }
 }
 
+/* Emit an OpenACC head marker call, encapsulating the partitioning and
+   other information that must be processed by the target compiler.
+   Return the maximum number of dimensions the associated loop might
+   be partitioned over.  */
+
+static unsigned
+lower_oacc_head_mark (location_t loc, tree ddvar, tree clauses,
+		      gimple_seq *seq, omp_context *ctx)
+{
+  unsigned levels = 0;
+  unsigned tag = 0;
+  tree gang_static = NULL_TREE;
+  auto_vec<tree, 5> args;
+
+  args.quick_push (build_int_cst
+		   (integer_type_node, IFN_UNIQUE_OACC_HEAD_MARK));
+  args.quick_push (ddvar);
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  tag |= OLF_DIM_GANG;
+	  gang_static = OMP_CLAUSE_GANG_STATIC_EXPR (c);
+	  /* static:* is represented by -1, and we can ignore it, as
+	     scheduling is always static.  */
+	  if (gang_static && integer_minus_onep (gang_static))
+	    gang_static = NULL_TREE;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_WORKER:
+	  tag |= OLF_DIM_WORKER;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_VECTOR:
+	  tag |= OLF_DIM_VECTOR;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_SEQ:
+	  tag |= OLF_SEQ;
+	  break;
+
+	case OMP_CLAUSE_AUTO:
+	  tag |= OLF_AUTO;
+	  break;
+
+	case OMP_CLAUSE_INDEPENDENT:
+	  tag |= OLF_INDEPENDENT;
+	  break;
+
+	default:
+	  continue;
+	}
+    }
+
+  if (gang_static)
+    {
+      if (DECL_P (gang_static))
+	gang_static = build_outer_var_ref (gang_static, ctx);
+      tag |= OLF_GANG_STATIC;
+    }
+
+  /* In a parallel region, loops are implicitly INDEPENDENT.  */
+  omp_context *tgt = enclosing_target_ctx (ctx);
+  if (!tgt || is_oacc_parallel (tgt))
+    tag |= OLF_INDEPENDENT;
+
+  /* A loop lacking SEQ, GANG, WORKER and/or VECTOR is implicitly AUTO.  */
+  if (!(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE)
+	       | OLF_SEQ)))
+      tag |= OLF_AUTO;
+
+  /* Ensure at least one level.  */
+  if (!levels)
+    levels++;
+
+  args.quick_push (build_int_cst (integer_type_node, levels));
+  args.quick_push (build_int_cst (integer_type_node, tag));
+  if (gang_static)
+    args.quick_push (gang_static);
+
+  gcall *call = gimple_build_call_internal_vec (IFN_UNIQUE, args);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+
+  return levels;
+}
+
+/* Emit an OpenACC loop head or tail marker to SEQ.  LEVEL is the
+   partitioning level of the enclosed region.  */ 
+
+static void
+lower_oacc_loop_marker (location_t loc, tree ddvar, bool head,
+			tree tofollow, gimple_seq *seq)
+{
+  int marker_kind = (head ? IFN_UNIQUE_OACC_HEAD_MARK
+		     : IFN_UNIQUE_OACC_TAIL_MARK);
+  tree marker = build_int_cst (integer_type_node, marker_kind);
+  int nargs = 2 + (tofollow != NULL_TREE);
+  gcall *call = gimple_build_call_internal (IFN_UNIQUE, nargs,
+					    marker, ddvar, tofollow);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+}
+
+/* Generate the before and after OpenACC loop sequences.  CLAUSES are
+   the loop clauses, from which we extract reductions.  Initialize
+   HEAD and TAIL.  */
+
+static void
+lower_oacc_head_tail (location_t loc, tree clauses,
+		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
+{
+  bool inner = false;
+  tree ddvar = create_tmp_var (integer_type_node, ".data_dep");
+  gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
+
+  unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
+  if (!count)
+    lower_oacc_loop_marker (loc, ddvar, false, integer_zero_node, tail);
+  
+  tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
+  tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
+
+  for (unsigned done = 1; count; count--, done++)
+    {
+      gimple_seq fork_seq = NULL;
+      gimple_seq join_seq = NULL;
+
+      tree place = build_int_cst (integer_type_node, -1);
+      gcall *fork = gimple_build_call_internal (IFN_UNIQUE, 3,
+						fork_kind, ddvar, place);
+      gimple_set_location (fork, loc);
+      gimple_set_lhs (fork, ddvar);
+
+      gcall *join = gimple_build_call_internal (IFN_UNIQUE, 3,
+						join_kind, ddvar, place);
+      gimple_set_location (join, loc);
+      gimple_set_lhs (join, ddvar);
+
+      /* Mark the beginning of this level sequence.  */
+      if (inner)
+	lower_oacc_loop_marker (loc, ddvar, true,
+				build_int_cst (integer_type_node, count),
+				&fork_seq);
+      lower_oacc_loop_marker (loc, ddvar, false,
+			      build_int_cst (integer_type_node, done),
+			      &join_seq);
+
+      gimple_seq_add_stmt (&fork_seq, fork);
+      gimple_seq_add_stmt (&join_seq, join);
+
+      /* Append this level to head. */
+      gimple_seq_add_seq (head, fork_seq);
+      /* Prepend it to tail.  */
+      gimple_seq_add_seq (&join_seq, *tail);
+      *tail = join_seq;
+
+      inner = true;
+    }
+
+  /* Mark the end of the sequence.  */
+  lower_oacc_loop_marker (loc, ddvar, true, NULL_TREE, head);
+  lower_oacc_loop_marker (loc, ddvar, false, NULL_TREE, tail);
+}
 
 /* A convenience function to build an empty GIMPLE_COND with just the
    condition.  */
@@ -8364,10 +8643,6 @@ expand_omp_for_static_nochunk (struct om
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8460,10 +8735,6 @@ expand_omp_for_static_nochunk (struct om
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -8690,10 +8961,7 @@ expand_omp_for_static_nochunk (struct om
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -8831,10 +9099,6 @@ expand_omp_for_static_chunk (struct omp_
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8931,10 +9195,6 @@ expand_omp_for_static_chunk (struct omp_
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -9194,10 +9454,7 @@ expand_omp_for_static_chunk (struct omp_
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -10247,95 +10504,647 @@ expand_omp_taskloop_for_inner (struct om
     }
 }
 
-/* Expand the OMP loop defined by REGION.  */
+/* Information about members of an OpenACC collapsed loop nest.  */
 
-static void
-expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+struct oacc_collapse
 {
-  struct omp_for_data fd;
-  struct omp_for_data_loop *loops;
+  tree base;  /* Base value. */
+  tree iters; /* Number of steps.  */
+  tree step;  /* step size.  */
+};
 
-  loops
-    = (struct omp_for_data_loop *)
-      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
-	      * sizeof (struct omp_for_data_loop));
-  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
-			&fd, loops);
-  region->sched_kind = fd.sched_kind;
+/* Helper for expand_oacc_for.  Determine collapsed loop information.
+   Fill in COUNTS array.  Emit any initialization code before GSI.
+   Return the calculated outer loop bound of BOUND_TYPE.  */
 
-  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
-  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
-  if (region->cont)
-    {
-      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
-      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
-    }
-  else
-    /* If there isn't a continue then this is a degerate case where
-       the introduction of abnormal edges during lowering will prevent
-       original loops from being detected.  Fix that up.  */
-    loops_state_set (LOOPS_NEED_FIXUP);
+static tree
+expand_oacc_collapse_init (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   oacc_collapse *counts, tree bound_type)
+{
+  tree total = build_int_cst (bound_type, 1);
+  int ix;
+  
+  gcc_assert (integer_onep (fd->loop.step));
+  gcc_assert (integer_zerop (fd->loop.n1));
 
-  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
-    expand_omp_simd (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
-    expand_cilk_for (region, &fd);
-  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
-    {
-      if (gimple_omp_for_combined_into_p (fd.for_stmt))
-	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
-      else
-	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
-    }
-  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
-	   && !fd.have_ordered)
-    {
-      if (fd.chunk_size == NULL)
-	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
-      else
-	expand_omp_for_static_chunk (region, &fd, inner_stmt);
-    }
-  else
+  for (ix = 0; ix != fd->collapse; ix++)
     {
-      int fn_index, start_ix, next_ix;
+      const omp_for_data_loop *loop = &fd->loops[ix];
 
-      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
-		  == GF_OMP_FOR_KIND_FOR);
-      if (fd.chunk_size == NULL
-	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
-	fd.chunk_size = integer_zero_node;
-      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
-      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
-		  ? 3 : fd.sched_kind;
-      if (!fd.ordered)
-	fn_index += fd.have_ordered * 4;
-      if (fd.ordered)
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
-      else
-	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
-      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
-      if (fd.iter_type == long_long_unsigned_type_node)
-	{
-	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
-			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
-	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
-		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
-	}
-      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
-			      (enum built_in_function) next_ix, inner_stmt);
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = iter_type;
+      tree plus_type = iter_type;
+
+      gcc_assert (loop->cond_code == fd->loop.cond_code);
+      
+      if (POINTER_TYPE_P (iter_type))
+	plus_type = sizetype;
+      if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+	diff_type = signed_type_for (diff_type);
+
+      tree b = loop->n1;
+      tree e = loop->n2;
+      tree s = loop->step;
+      bool up = loop->cond_code == LT_EXPR;
+      tree dir = build_int_cst (diff_type, up ? +1 : -1);
+      bool negating;
+      tree expr;
+
+      b = force_gimple_operand_gsi (gsi, b, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+      e = force_gimple_operand_gsi (gsi, e, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Convert the step, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+      s = fold_convert (diff_type, s);
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, diff_type, s);
+      s = force_gimple_operand_gsi (gsi, s, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Determine the range, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (iter_type);
+      expr = fold_build2 (MINUS_EXPR, plus_type,
+			  fold_convert (plus_type, negating ? b : e),
+			  fold_convert (plus_type, negating ? e : b));
+      expr = fold_convert (diff_type, expr);
+      if (negating)
+	expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+      tree range = force_gimple_operand_gsi
+	(gsi, expr, true, NULL_TREE, true, GSI_SAME_STMT);
+
+      /* Determine number of iterations.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+
+      tree iters = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					     true, GSI_SAME_STMT);
+
+      counts[ix].base = b;
+      counts[ix].iters = iters;
+      counts[ix].step = s;
+
+      total = fold_build2 (MULT_EXPR, bound_type, total,
+			   fold_convert (bound_type, iters));
     }
 
-  if (gimple_in_ssa_p (cfun))
-    update_ssa (TODO_update_ssa_only_virtuals);
+  return total;
 }
 
+/* Emit initializers for collapsed loop members.  IVAR is the outer
+   loop iteration variable, from which collapsed loop iteration values
+   are calculated.  COUNTS array has been initialized by
+   expand_oacc_collapse_init.  */
 
-/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
+static void
+expand_oacc_collapse_vars (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   const oacc_collapse *counts, tree ivar)
+{
+  tree ivar_type = TREE_TYPE (ivar);
 
-	v = GOMP_sections_start (n);
-    L0:
+  /*  The most rapidly changing iteration variable is the innermost
+      one.  */
+  for (int ix = fd->collapse; ix--;)
+    {
+      const omp_for_data_loop *loop = &fd->loops[ix];
+      const oacc_collapse *collapse = &counts[ix];
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = TREE_TYPE (collapse->step);
+      tree plus_type = iter_type;
+      enum tree_code plus_code = PLUS_EXPR;
+      tree expr;
+
+      if (POINTER_TYPE_P (iter_type))
+	{
+	  plus_code = POINTER_PLUS_EXPR;
+	  plus_type = sizetype;
+	}
+
+      expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
+			  fold_convert (ivar_type, collapse->iters));
+      expr = fold_build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
+			  collapse->step);
+      expr = fold_build2 (plus_code, iter_type, collapse->base,
+			  fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      gassign *ass = gimple_build_assign (loop->v, expr);
+      gsi_insert_before (gsi, ass, GSI_SAME_STMT);
+
+      if (ix)
+	{
+	  expr = fold_build2 (TRUNC_DIV_EXPR, ivar_type, ivar,
+			      fold_convert (ivar_type, collapse->iters));
+	  ivar = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					   true, GSI_SAME_STMT);
+	}
+    }
+}
+
+/* A subroutine of expand_omp_for.  Generate code for an OpenACC
+   partitioned loop.  The lowering here is abstracted, in that the
+   loop parameters are passed through internal functions, which are
+   further lowered by oacc_device_lower, once we get to the target
+   compiler.  The loop is of the form:
+
+   for (V = B; V LTGT E; V += S) {BODY}
+
+   where LTGT is < or >.  We may have a specified chunking size, CHUNKING
+   (constant 0 for no chunking) and we will have a GWV partitioning
+   mask, specifying dimensions over which the loop is to be
+   partitioned (see note below).  We generate code that looks like:
+
+   <entry_bb> [incoming FALL->body, BRANCH->exit]
+     typedef signedintify (typeof (V)) T;  // underlying signed integral type
+     T range = E - B;
+     T chunk_no = 0;
+     T DIR = LTGT == '<' ? +1 : -1;
+     T chunk_max = GOACC_LOOP_CHUNK (dir, range, S, CHUNK_SIZE, GWV);
+     T step = GOACC_LOOP_STEP (dir, range, S, CHUNK_SIZE, GWV);
+
+   <head_bb> [created by splitting end of entry_bb]
+     T offset = GOACC_LOOP_OFFSET (dir, range, S, CHUNK_SIZE, GWV, chunk_no);
+     T bound = GOACC_LOOP_BOUND (dir, range, S, CHUNK_SIZE, GWV, offset);
+     if (!(offset LTGT bound)) goto bottom_bb;
+
+   <body_bb> [incoming]
+     V = B + offset;
+     {BODY}
+
+   <cont_bb> [incoming, may == body_bb FALL->exit_bb, BRANCH->body_bb]
+     offset += step;
+     if (offset LTGT bound) goto body_bb; [*]
+
+   <bottom_bb> [created by splitting start of exit_bb] insert BRANCH->head_bb
+     chunk_no++;
+     if (chunk_no < chunk_max) goto head_bb;
+
+   <exit_bb> [incoming]
+     V = B + ((range -/+ 1) / S +/- 1) * S [*]
+
+   [*] Needed if V live at end of loop
+
+   Note: CHUNKING & GWV mask are specified explicitly here.  This is a
+   transition, and will be specified by a more general mechanism shortly.
+ */
+
+static void
+expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
+{
+  tree v = fd->loop.v;
+  enum tree_code cond_code = fd->loop.cond_code;
+  enum tree_code plus_code = PLUS_EXPR;
+
+  tree chunk_size = integer_minus_one_node;
+  tree gwv = integer_zero_node;
+  tree iter_type = TREE_TYPE (v);
+  tree diff_type = iter_type;
+  tree plus_type = iter_type;
+  struct oacc_collapse *counts = NULL;
+
+  gcc_checking_assert (gimple_omp_for_kind (fd->for_stmt)
+		       == GF_OMP_FOR_KIND_OACC_LOOP);
+  gcc_assert (!gimple_omp_for_combined_into_p (fd->for_stmt));
+  gcc_assert (cond_code == LT_EXPR || cond_code == GT_EXPR);
+
+  if (POINTER_TYPE_P (iter_type))
+    {
+      plus_code = POINTER_PLUS_EXPR;
+      plus_type = sizetype;
+    }
+  if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+    diff_type = signed_type_for (diff_type);
+
+  basic_block entry_bb = region->entry; /* BB ending in OMP_FOR */
+  basic_block exit_bb = region->exit; /* BB ending in OMP_RETURN */
+  basic_block cont_bb = region->cont; /* BB ending in OMP_CONTINUE  */
+  basic_block bottom_bb = NULL;
+
+  /* entry_bb has two successors; the branch edge is to the exit
+     block, fallthrough edge to body.  */
+  gcc_assert (EDGE_COUNT (entry_bb->succs) == 2
+	      && BRANCH_EDGE (entry_bb)->dest == exit_bb);
+
+  /* If cont_bb non-NULL, it has 2 successors.  The branch successor is
+     body_bb, or to a block whose only successor is the body_bb.  Its
+     fallthrough successor is the final block (same as the branch
+     successor of the entry_bb).  */
+  if (cont_bb)
+    {
+      basic_block body_bb = FALLTHRU_EDGE (entry_bb)->dest;
+      basic_block bed = BRANCH_EDGE (cont_bb)->dest;
+
+      gcc_assert (FALLTHRU_EDGE (cont_bb)->dest == exit_bb);
+      gcc_assert (bed == body_bb || single_succ_edge (bed)->dest == body_bb);
+    }
+  else
+    gcc_assert (!gimple_in_ssa_p (cfun));
+
+  /* The exit block only has entry_bb and cont_bb as predecessors.  */
+  gcc_assert (EDGE_COUNT (exit_bb->preds) == 1 + (cont_bb != NULL));
+
+  tree chunk_no;
+  tree chunk_max = NULL_TREE;
+  tree bound, offset;
+  tree step = create_tmp_var (diff_type, ".step");
+  bool up = cond_code == LT_EXPR;
+  tree dir = build_int_cst (diff_type, up ? +1 : -1);
+  bool chunking = !gimple_in_ssa_p (cfun);
+  bool negating;
+
+  /* SSA instances.  */
+  tree offset_incr = NULL_TREE;
+  tree offset_init = NULL_TREE;
+
+  gimple_stmt_iterator gsi;
+  gassign *ass;
+  gcall *call;
+  gimple *stmt;
+  tree expr;
+  location_t loc;
+  edge split, be, fte;
+
+  /* Split the end of entry_bb to create head_bb.  */
+  split = split_block (entry_bb, last_stmt (entry_bb));
+  basic_block head_bb = split->dest;
+  entry_bb = split->src;
+
+  /* Chunk setup goes at end of entry_bb, replacing the omp_for.  */
+  gsi = gsi_last_bb (entry_bb);
+  gomp_for *for_stmt = as_a <gomp_for *> (gsi_stmt (gsi));
+  loc = gimple_location (for_stmt);
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      offset_init = gimple_omp_for_index (for_stmt, 0);
+      gcc_assert (integer_zerop (fd->loop.n1));
+      /* The SSA parallelizer does gang parallelism.  */
+      gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG));
+    }
+
+  if (fd->collapse > 1)
+    {
+      counts = XALLOCAVEC (struct oacc_collapse, fd->collapse);
+      tree total = expand_oacc_collapse_init (fd, &gsi, counts,
+					      TREE_TYPE (fd->loop.n2));
+
+      if (SSA_VAR_P (fd->loop.n2))
+	{
+	  total = force_gimple_operand_gsi (&gsi, total, false, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	  ass = gimple_build_assign (fd->loop.n2, total);
+	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+	}
+      
+    }
+
+  tree b = fd->loop.n1;
+  tree e = fd->loop.n2;
+  tree s = fd->loop.step;
+
+  b = force_gimple_operand_gsi (&gsi, b, true, NULL_TREE, true, GSI_SAME_STMT);
+  e = force_gimple_operand_gsi (&gsi, e, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  /* Convert the step, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+  s = fold_convert (diff_type, s);
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, diff_type, s);
+  s = force_gimple_operand_gsi (&gsi, s, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  if (!chunking)
+    chunk_size = integer_zero_node;
+  expr = fold_convert (diff_type, chunk_size);
+  chunk_size = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+  /* Determine the range, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (iter_type);
+  expr = fold_build2 (MINUS_EXPR, plus_type,
+		      fold_convert (plus_type, negating ? b : e),
+		      fold_convert (plus_type, negating ? e : b));
+  expr = fold_convert (diff_type, expr);
+  if (negating)
+    expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+  tree range = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+
+  chunk_no = build_int_cst (diff_type, 0);
+  if (chunking)
+    {
+      gcc_assert (!gimple_in_ssa_p (cfun));
+
+      expr = chunk_no;
+      chunk_max = create_tmp_var (diff_type, ".chunk_max");
+      chunk_no = create_tmp_var (diff_type, ".chunk_no");
+
+      ass = gimple_build_assign (chunk_no, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+
+      call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+					 build_int_cst (integer_type_node,
+							IFN_GOACC_LOOP_CHUNKS),
+					 dir, range, s, chunk_size, gwv);
+      gimple_call_set_lhs (call, chunk_max);
+      gimple_set_location (call, loc);
+      gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+    }
+  else
+    chunk_size = chunk_no;
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_STEP),
+				     dir, range, s, chunk_size, gwv);
+  gimple_call_set_lhs (call, step);
+  gimple_set_location (call, loc);
+  gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+
+  /* Remove the GIMPLE_OMP_FOR.  */
+  gsi_remove (&gsi, true);
+
+  /* Fixup edges from head_bb */
+  be = BRANCH_EDGE (head_bb);
+  fte = FALLTHRU_EDGE (head_bb);
+  be->flags |= EDGE_FALSE_VALUE;
+  fte->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+
+  basic_block body_bb = fte->dest;
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+
+      offset = gimple_omp_continue_control_use (cont_stmt);
+      offset_incr = gimple_omp_continue_control_def (cont_stmt);
+    }
+  else
+    {
+      offset = create_tmp_var (diff_type, ".offset");
+      offset_init = offset_incr = offset;
+    }
+  bound = create_tmp_var (TREE_TYPE (offset), ".bound");
+
+  /* Loop offset & bound go into head_bb.  */
+  gsi = gsi_start_bb (head_bb);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_OFFSET),
+				     dir, range, s,
+				     chunk_size, gwv, chunk_no);
+  gimple_call_set_lhs (call, offset_init);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_BOUND),
+				     dir, range, s,
+				     chunk_size, gwv, offset_init);
+  gimple_call_set_lhs (call, bound);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  expr = build2 (cond_code, boolean_type_node, offset_init, bound);
+  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+		    GSI_CONTINUE_LINKING);
+
+  /* V assignment goes into body_bb.  */
+  if (!gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_start_bb (body_bb);
+
+      expr = build2 (plus_code, iter_type, b,
+		     fold_convert (plus_type, offset));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (fd->collapse > 1)
+	expand_oacc_collapse_vars (fd, &gsi, counts, v);
+    }
+
+  /* Loop increment goes into cont_bb.  If this is not a loop, we
+     will have spawned threads as if it was, and each one will
+     execute one iteration.  The specification is not explicit about
+     whether such constructs are ill-formed or not, and they can
+     occur, especially when noreturn routines are involved.  */
+  if (cont_bb)
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+      loc = gimple_location (cont_stmt);
+
+      /* Increment offset.  */
+      if (gimple_in_ssa_p (cfun))
+	expr = build2 (plus_code, iter_type, offset,
+		       fold_convert (plus_type, step));
+      else
+	expr = build2 (PLUS_EXPR, diff_type, offset, step);
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (offset_incr, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      expr = build2 (cond_code, boolean_type_node, offset_incr, bound);
+      gsi_insert_before (&gsi, gimple_build_cond_empty (expr), GSI_SAME_STMT);
+
+      /*  Remove the GIMPLE_OMP_CONTINUE.  */
+      gsi_remove (&gsi, true);
+
+      /* Fixup edges from cont_bb */
+      be = BRANCH_EDGE (cont_bb);
+      fte = FALLTHRU_EDGE (cont_bb);
+      be->flags |= EDGE_TRUE_VALUE;
+      fte->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+
+      if (chunking)
+	{
+	  /* Split the beginning of exit_bb to make bottom_bb.  We
+	     need to insert a nop at the start, because splitting is
+  	     after a stmt, not before.  */
+	  gsi = gsi_start_bb (exit_bb);
+	  stmt = gimple_build_nop ();
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  split = split_block (exit_bb, stmt);
+	  bottom_bb = split->src;
+	  exit_bb = split->dest;
+	  gsi = gsi_last_bb (bottom_bb);
+
+	  /* Chunk increment and test goes into bottom_bb.  */
+	  expr = build2 (PLUS_EXPR, diff_type, chunk_no,
+			 build_int_cst (diff_type, 1));
+	  ass = gimple_build_assign (chunk_no, expr);
+	  gsi_insert_after (&gsi, ass, GSI_CONTINUE_LINKING);
+
+	  /* Chunk test at end of bottom_bb.  */
+	  expr = build2 (LT_EXPR, boolean_type_node, chunk_no, chunk_max);
+	  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+			    GSI_CONTINUE_LINKING);
+
+	  /* Fixup edges from bottom_bb. */
+	  split->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+	  make_edge (bottom_bb, head_bb, EDGE_TRUE_VALUE);
+	}
+    }
+
+  gsi = gsi_last_bb (exit_bb);
+  gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  loc = gimple_location (gsi_stmt (gsi));
+
+  if (!gimple_in_ssa_p (cfun))
+    {
+      /* Insert the final value of V, in case it is live.  This is the
+	 value for the only thread that survives past the join.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+      expr = fold_build2 (MULT_EXPR, diff_type, expr, s);
+      expr = build2 (plus_code, iter_type, b, fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+    }
+
+  /* Remove the OMP_RETURN. */
+  gsi_remove (&gsi, true);
+
+  if (cont_bb)
+    {
+      /* We now have one or two nested loops.  Update the loop
+	 structures.  */
+      struct loop *parent = entry_bb->loop_father;
+      struct loop *body = body_bb->loop_father;
+      
+      if (chunking)
+	{
+	  struct loop *chunk_loop = alloc_loop ();
+	  chunk_loop->header = head_bb;
+	  chunk_loop->latch = bottom_bb;
+	  add_loop (chunk_loop, parent);
+	  parent = chunk_loop;
+	}
+      else if (parent != body)
+	{
+	  gcc_assert (body->header == body_bb);
+	  gcc_assert (body->latch == cont_bb
+		      || single_pred (body->latch) == cont_bb);
+	  parent = NULL;
+	}
+
+      if (parent)
+	{
+	  struct loop *body_loop = alloc_loop ();
+	  body_loop->header = body_bb;
+	  body_loop->latch = cont_bb;
+	  add_loop (body_loop, parent);
+	}
+    }
+}
+
+/* Expand the OMP loop defined by REGION.  */
+
+static void
+expand_omp_for (struct omp_region *region, gimple *inner_stmt)
+{
+  struct omp_for_data fd;
+  struct omp_for_data_loop *loops;
+
+  loops
+    = (struct omp_for_data_loop *)
+      alloca (gimple_omp_for_collapse (last_stmt (region->entry))
+	      * sizeof (struct omp_for_data_loop));
+  extract_omp_for_data (as_a <gomp_for *> (last_stmt (region->entry)),
+			&fd, loops);
+  region->sched_kind = fd.sched_kind;
+
+  gcc_assert (EDGE_COUNT (region->entry->succs) == 2);
+  BRANCH_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  FALLTHRU_EDGE (region->entry)->flags &= ~EDGE_ABNORMAL;
+  if (region->cont)
+    {
+      gcc_assert (EDGE_COUNT (region->cont->succs) == 2);
+      BRANCH_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+      FALLTHRU_EDGE (region->cont)->flags &= ~EDGE_ABNORMAL;
+    }
+  else
+    /* If there isn't a continue then this is a degenerate case where
+       the introduction of abnormal edges during lowering will prevent
+       original loops from being detected.  Fix that up.  */
+    loops_state_set (LOOPS_NEED_FIXUP);
+
+  if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
+    expand_omp_simd (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
+    expand_cilk_for (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gcc_assert (!inner_stmt);
+      expand_oacc_for (region, &fd);
+    }
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
+    {
+      if (gimple_omp_for_combined_into_p (fd.for_stmt))
+	expand_omp_taskloop_for_inner (region, &fd, inner_stmt);
+      else
+	expand_omp_taskloop_for_outer (region, &fd, inner_stmt);
+    }
+  else if (fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC
+	   && !fd.have_ordered)
+    {
+      if (fd.chunk_size == NULL)
+	expand_omp_for_static_nochunk (region, &fd, inner_stmt);
+      else
+	expand_omp_for_static_chunk (region, &fd, inner_stmt);
+    }
+  else
+    {
+      int fn_index, start_ix, next_ix;
+
+      gcc_assert (gimple_omp_for_kind (fd.for_stmt)
+		  == GF_OMP_FOR_KIND_FOR);
+      if (fd.chunk_size == NULL
+	  && fd.sched_kind == OMP_CLAUSE_SCHEDULE_STATIC)
+	fd.chunk_size = integer_zero_node;
+      gcc_assert (fd.sched_kind != OMP_CLAUSE_SCHEDULE_AUTO);
+      fn_index = (fd.sched_kind == OMP_CLAUSE_SCHEDULE_RUNTIME)
+		  ? 3 : fd.sched_kind;
+      if (!fd.ordered)
+	fn_index += fd.have_ordered * 4;
+      if (fd.ordered)
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_DOACROSS_STATIC_START) + fn_index;
+      else
+	start_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_START) + fn_index;
+      next_ix = ((int)BUILT_IN_GOMP_LOOP_STATIC_NEXT) + fn_index;
+      if (fd.iter_type == long_long_unsigned_type_node)
+	{
+	  start_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_START
+			- (int)BUILT_IN_GOMP_LOOP_STATIC_START);
+	  next_ix += ((int)BUILT_IN_GOMP_LOOP_ULL_STATIC_NEXT
+		      - (int)BUILT_IN_GOMP_LOOP_STATIC_NEXT);
+	}
+      expand_omp_for_generic (region, &fd, (enum built_in_function) start_ix,
+			      (enum built_in_function) next_ix, inner_stmt);
+    }
+
+  if (gimple_in_ssa_p (cfun))
+    update_ssa (TODO_update_ssa_only_virtuals);
+}
+
+
+/* Expand code for an OpenMP sections directive.  In pseudo code, we generate
+
+	v = GOMP_sections_start (n);
+    L0:
 	switch (v)
 	  {
 	  case 0:
@@ -13412,6 +14292,7 @@ lower_omp_for (gimple_stmt_iterator *gsi
   gomp_for *stmt = as_a <gomp_for *> (gsi_stmt (*gsi_p));
   gbind *new_stmt;
   gimple_seq omp_for_body, body, dlist;
+  gimple_seq oacc_head = NULL, oacc_tail = NULL;
   size_t i;
 
   push_gimplify_context ();
@@ -13520,6 +14401,16 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Once lowered, extract the bounds and clauses.  */
   extract_omp_for_data (stmt, &fd, NULL);
 
+  if (is_gimple_omp_oacc (ctx->stmt)
+      && !ctx_in_oacc_kernels_region (ctx))
+    lower_oacc_head_tail (gimple_location (stmt),
+			  gimple_omp_for_clauses (stmt),
+			  &oacc_head, &oacc_tail, ctx);
+
+  /* Add OpenACC partitioning markers just before the loop.  */
+  if (oacc_head)
+    gimple_seq_add_seq (&body, oacc_head);
+  
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
   if (gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_FOR)
@@ -13553,6 +14444,11 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Region exit marker goes at the end of the loop body.  */
   gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
   maybe_add_implicit_barrier_cancel (ctx, &body);
+
+  /* Add OpenACC joining and reduction markers just after the loop.  */
+  if (oacc_tail)
+    gimple_seq_add_seq (&body, oacc_tail);
+
   pop_gimplify_context (new_stmt);
 
   gimple_bind_append_vars (new_stmt, ctx->block_vars);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 5/11] C++ FE changes
  2015-10-26 22:44             ` Cesar Philippidis
@ 2015-10-27  8:03               ` Jakub Jelinek
  2015-10-27 20:21                 ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-27  8:03 UTC (permalink / raw)
  To: Cesar Philippidis
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Mon, Oct 26, 2015 at 03:35:20PM -0700, Cesar Philippidis wrote:
> I used that generic message for all of those clauses except for _GANG,
> _WORKER and _VECTOR. The gang clause, at the very least, needed it to
> disambiguate the static and num arguments. If you want I can handle
> _WORKER and _VECTOR with the generic message. I only included it because
> those arguments are optional, whereas they are mandatory for the other
> clauses.
> 
> Is this patch OK for trunk?

Ok.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-26 22:35                     ` Nathan Sidwell
@ 2015-10-27  8:18                       ` Jakub Jelinek
  2015-10-27 13:47                         ` Richard Biener
  2015-10-27 14:15                         ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-27  8:18 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Mon, Oct 26, 2015 at 03:32:45PM -0700, Nathan Sidwell wrote:
> Richard, Jakub,
> this updates patch 1 to use the target-insns.def mechanism of detecting
> conditionally-implemented instructions.  Otherwise it's the same as
> yesterday's patch.  To recap:
> 
> 1) Moved the subcodes to an enumeration in internal-fn.h
> 
> 2) Remove ECF_LEAF
> 
> 3) Added check in initialize_ctrl_altering
> 
> 4) tracer code now (continues) to only look in last stmt of block
> 
> I looked at fnsplit and do not believe I need changes there.  That's
> changing things like:
>   if (cheap test)
>     do cheap thing
>   else
>     do complex thing
> 
> to break out the else part into a separate function.   That's fine -- it'll
> copy the whole CFG of interest.

The question is if some UNIQUE call could be ever considered as part of the
cheap test or do cheap thing.  If not, everything is fine of course for
fnsplit.

> ok?

Ok for me, but please wait for Richi's ack too.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-26 23:39       ` Nathan Sidwell
@ 2015-10-27  8:33         ` Jakub Jelinek
  2015-10-27 14:03           ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-27  8:33 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Richard Guenther

On Mon, Oct 26, 2015 at 04:11:20PM -0700, Nathan Sidwell wrote:
> Jakub, Richard,
> This is the updated version of patch 7, using target-insns.def for the new
> insns.  Otherwise same as yesterday's, which had the following changes:
> 
> The significant change is that now the head/tail unique markers are
> threaded on a data dependency variable.  I'd not  noticed its lack being a
> problem, but this is certainly more robust in showing the ordering
> dependency between calls.  The dependency var is the 2nd parameter, and all
> others are simply shifted along by one.
> 
> At RTL generation time the data dependency is exposed to the RTL expander,
> which in the PTX case simply does a src->dst move, which will eventually be
> deleted as unnecessary.
> 
> ok?

LGTM, though could I ask you to try to try to move the
struct oacc_collapse
expand_oacc_collapse_init
expand_oacc_collapse_vars
expand_oacc_for
additions somewhere else
(e.g. in between expand_omp_taskreg and expand_omp_for_init_counts),
because it seems patch just got too confused and gave up, so most of
expand_omp_for which I assume is unchanged except for
> +  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
> +    {
> +      gcc_assert (!inner_stmt);
> +      expand_oacc_for (region, &fd);
> +    }
addition is considered to be deleted in one place and added into another
one; if patch does this, I'd be afraid svn blame or git blame would do so
too, and thus lose history for expand_omp_for.  If moving it around doesn't
help, no big deal, but if it helps, it would be appreciated.

> 	(is_oacc_parallel, is_oaccc_kernels): New.

One too many Cs.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-27  8:18                       ` Jakub Jelinek
@ 2015-10-27 13:47                         ` Richard Biener
  2015-10-27 14:06                           ` Nathan Sidwell
  2015-10-27 14:15                         ` Nathan Sidwell
  1 sibling, 1 reply; 120+ messages in thread
From: Richard Biener @ 2015-10-27 13:47 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Nathan Sidwell, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Tue, Oct 27, 2015 at 9:03 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Mon, Oct 26, 2015 at 03:32:45PM -0700, Nathan Sidwell wrote:
>> Richard, Jakub,
>> this updates patch 1 to use the target-insns.def mechanism of detecting
>> conditionally-implemented instructions.  Otherwise it's the same as
>> yesterday's patch.  To recap:
>>
>> 1) Moved the subcodes to an enumeration in internal-fn.h
>>
>> 2) Remove ECF_LEAF
>>
>> 3) Added check in initialize_ctrl_altering
>>
>> 4) tracer code now (continues) to only look in last stmt of block
>>
>> I looked at fnsplit and do not believe I need changes there.  That's
>> changing things like:
>>   if (cheap test)
>>     do cheap thing
>>   else
>>     do complex thing
>>
>> to break out the else part into a separate function.   That's fine -- it'll
>> copy the whole CFG of interest.
>
> The question is if some UNIQUE call could be ever considered as part of the
> cheap test or do cheap thing.  If not, everything is fine of course for
> fnsplit.
>
>> ok?
>
> Ok for me, but please wait for Richi's ack too.

+      /* An IFN_UNIQUE call must be duplicated as part of its group,
+        or not at all.  */
+      if (is_gimple_call (g) && gimple_call_internal_p (g)
+         && gimple_call_internal_unique_p (g))

&&s always to the next line

Otherwise looks ok to me now.

Thanks,
Richard.

>         Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-27  8:33         ` Jakub Jelinek
@ 2015-10-27 14:03           ` Nathan Sidwell
  2015-10-28  5:45             ` Nathan Sidwell
  0 siblings, 1 reply; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 14:03 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Richard Guenther

On 10/27/15 01:18, Jakub Jelinek wrote:

> LGTM, though could I ask you to try to try to move the
> struct oacc_collapse
> expand_oacc_collapse_init
> expand_oacc_collapse_vars
> expand_oacc_for
> additions somewhere else
> (e.g. in between expand_omp_taskreg and expand_omp_for_init_counts),

ok,  I wasn't sure of the best placement.

> because it seems patch just got too confused and gave up, so most of
> expand_omp_for which I assume is unchanged except for
>> +  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
>> +    {
>> +      gcc_assert (!inner_stmt);
>> +      expand_oacc_for (region, &fd);
>> +    }
> addition is considered to be deleted in one place and added into another
> one; if patch does this, I'd be afraid svn blame or git blame would do so
> too, and thus lose history for expand_omp_for.  If moving it around doesn't
> help, no big deal, but if it helps, it would be appreciated.

yeah, I noticed diff got confused.  (I'm not sure the above suggestion will 
resolve it, but we can give it a go.)

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-27 13:47                         ` Richard Biener
@ 2015-10-27 14:06                           ` Nathan Sidwell
  2015-10-27 14:07                             ` Jakub Jelinek
  2015-10-27 20:18                             ` Nathan Sidwell
  0 siblings, 2 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 14:06 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/27/15 06:45, Richard Biener wrote:
> On Tue, Oct 27, 2015 at 9:03 AM, Jakub Jelinek <jakub@redhat.com> wrote:

>> Ok for me, but please wait for Richi's ack too.
>
> +      /* An IFN_UNIQUE call must be duplicated as part of its group,
> +        or not at all.  */
> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
> +         && gimple_call_internal_unique_p (g))
>
> &&s always to the next line

oh, did not know that.

> Otherwise looks ok to me now.

Great thanks!

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-27 14:06                           ` Nathan Sidwell
@ 2015-10-27 14:07                             ` Jakub Jelinek
  2015-10-27 20:18                             ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Jakub Jelinek @ 2015-10-27 14:07 UTC (permalink / raw)
  To: Nathan Sidwell
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On Tue, Oct 27, 2015 at 07:03:40AM -0700, Nathan Sidwell wrote:
> On 10/27/15 06:45, Richard Biener wrote:
> >On Tue, Oct 27, 2015 at 9:03 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> 
> >>Ok for me, but please wait for Richi's ack too.
> >
> >+      /* An IFN_UNIQUE call must be duplicated as part of its group,
> >+        or not at all.  */
> >+      if (is_gimple_call (g) && gimple_call_internal_p (g)
> >+         && gimple_call_internal_unique_p (g))
> >
> >&&s always to the next line
> 
> oh, did not know that.

I believe the general rule is if all the conditions are short enough
that everything fits on a single line, you can write it as
  if (a && b && c && d)
but as soon as you need to wrap, it should be one && per line, so
  if (a
      && b
      && c
      && d)
style in that case rather than
  if (a && b
      && c && d)

But, lots of code doesn't do it this way.

	Jakub

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-27  8:18                       ` Jakub Jelinek
  2015-10-27 13:47                         ` Richard Biener
@ 2015-10-27 14:15                         ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 14:15 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Richard Biener, GCC Patches, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers

On 10/27/15 01:03, Jakub Jelinek wrote:
> On Mon, Oct 26, 2015 at 03:32:45PM -0700, Nathan Sidwell wrote:

>> to break out the else part into a separate function.   That's fine -- it'll
>> copy the whole CFG of interest.
>
> The question is if some UNIQUE call could be ever considered as part of the
> cheap test or do cheap thing.  If not, everything is fine of course for
> fnsplit.

It doesn't matter for how I'm using it (although I doubt the CFG it's attached 
to will be considered cheap).  We never generate a CFG where part of the UNIQUE 
sequence is in the cheap-thing block and another part is not.
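
For concreteness, the group of calls that the execution-model patch's
lower_oacc_head_tail emits around a single-level partitioned loop looks
roughly like this (a sketch only; kind names are shortened from the
IFN_UNIQUE_OACC_* enumerators and the tag/static operands are elided):

  .data_dep = 0;
  .data_dep = UNIQUE (HEAD_MARK, .data_dep, 1, tag);
  .data_dep = UNIQUE (FORK, .data_dep, -1);
  .data_dep = UNIQUE (HEAD_MARK, .data_dep);
  ... partitioned loop body ...
  .data_dep = UNIQUE (TAIL_MARK, .data_dep, 1);
  .data_dep = UNIQUE (JOIN, .data_dep, -1);
  .data_dep = UNIQUE (TAIL_MARK, .data_dep);

Cloning or merging any one of those calls independently of the rest is
exactly what the uniqueness property is there to prevent.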

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 1/11] UNIQUE internal function
  2015-10-27 14:06                           ` Nathan Sidwell
  2015-10-27 14:07                             ` Jakub Jelinek
@ 2015-10-27 20:18                             ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 20:18 UTC (permalink / raw)
  To: Richard Biener, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 46 bytes --]

This is the patch that was committed.

nathan

[-- Attachment #2: 01-trunk-unique-1027.patch --]
[-- Type: text/x-patch, Size: 7120 bytes --]

2015-10-27  Nathan Sidwell  <nathan@codesourcery.com>
	
	* internal-fn.c (expand_UNIQUE): New.
	* internal-fn.h (enum ifn_unique_kind): New.
	* internal-fn.def (IFN_UNIQUE): New.
	* target-insns.def (unique): Define.
	* gimple.h (gimple_call_internal_unique_p): New.
	* gimple.c (gimple_call_same_target_p): Check internal fn
	uniqueness.
	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
	* tree-ssa-threadedge.c
	(record_temporary_equivalences_from_stmts): Likewise.
	* tree-cfg.c (gimple_call_initialize_ctrl_altering): Likewise.

Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229443)
+++ gcc/internal-fn.c	(working copy)
@@ -1958,6 +1958,30 @@ expand_VA_ARG (gcall *stmt ATTRIBUTE_UNU
   gcc_unreachable ();
 }
 
+/* Expand the IFN_UNIQUE function according to its first argument.  */
+
+static void
+expand_UNIQUE (gcall *stmt)
+{
+  rtx pattern = NULL_RTX;
+  enum ifn_unique_kind kind
+    = (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (stmt, 0));
+
+  switch (kind)
+    {
+    default:
+      gcc_unreachable ();
+
+    case IFN_UNIQUE_UNSPEC:
+      if (targetm.have_unique ())
+	pattern = targetm.gen_unique ();
+      break;
+    }
+
+  if (pattern)
+    emit_insn (pattern);
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229443)
+++ gcc/internal-fn.h	(working copy)
@@ -20,6 +20,11 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_INTERNAL_FN_H
 #define GCC_INTERNAL_FN_H
 
+/* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
+enum ifn_unique_kind {
+  IFN_UNIQUE_UNSPEC   /* Undifferentiated UNIQUE.  */
+};
+
 /* Initialize internal function tables.  */
 
 extern void init_internal_fns ();
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229443)
+++ gcc/internal-fn.def	(working copy)
@@ -65,3 +65,10 @@ DEF_INTERNAL_FN (SUB_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
+
+/* An unduplicable, uncombinable function.  Generally used to preserve
+   a CFG property in the face of jump threading, tail merging or
+   other such optimizations.  The first argument distinguishes
+   between uses.  See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
+
Index: gcc/target-insns.def
===================================================================
--- gcc/target-insns.def	(revision 229443)
+++ gcc/target-insns.def	(working copy)
@@ -89,5 +89,6 @@ DEF_TARGET_INSN (stack_protect_test, (rt
 DEF_TARGET_INSN (store_multiple, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (tablejump, (rtx x0, rtx x1))
 DEF_TARGET_INSN (trap, (void))
+DEF_TARGET_INSN (unique, (void))
 DEF_TARGET_INSN (untyped_call, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (untyped_return, (rtx x0, rtx x1))
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 229443)
+++ gcc/gimple.c	(working copy)
@@ -1346,7 +1346,8 @@ gimple_call_same_target_p (const gimple
 {
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2)
+	    && !gimple_call_internal_unique_p (as_a <const gcall *> (c1)));
   else
     return (gimple_call_fn (c1) == gimple_call_fn (c2)
 	    || (gimple_call_fndecl (c1)
Index: gcc/gimple.h
===================================================================
--- gcc/gimple.h	(revision 229443)
+++ gcc/gimple.h	(working copy)
@@ -2895,6 +2895,21 @@ gimple_call_internal_fn (const gimple *g
   return gimple_call_internal_fn (gc);
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+static inline bool
+gimple_call_internal_unique_p (const gcall *gs)
+{
+  return gimple_call_internal_fn (gs) == IFN_UNIQUE;
+}
+
+static inline bool
+gimple_call_internal_unique_p (const gimple *gs)
+{
+  const gcall *gc = GIMPLE_CHECK2<const gcall *> (gs);
+  return gimple_call_internal_unique_p (gc);
+}
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
Index: gcc/tracer.c
===================================================================
--- gcc/tracer.c	(revision 229443)
+++ gcc/tracer.c	(working copy)
@@ -93,18 +93,25 @@ bb_seen_p (basic_block bb)
 static bool
 ignore_bb_p (const_basic_block bb)
 {
-  gimple *g;
-
   if (bb->index < NUM_FIXED_BLOCKS)
     return true;
   if (optimize_bb_for_size_p (bb))
     return true;
 
-  /* A transaction is a single entry multiple exit region.  It must be
-     duplicated in its entirety or not at all.  */
-  g = last_stmt (CONST_CAST_BB (bb));
-  if (g && gimple_code (g) == GIMPLE_TRANSACTION)
-    return true;
+  if (gimple *g = last_stmt (CONST_CAST_BB (bb)))
+    {
+      /* A transaction is a single entry multiple exit region.  It
+	 must be duplicated in its entirety or not at all.  */
+      if (gimple_code (g) == GIMPLE_TRANSACTION)
+	return true;
+
+      /* An IFN_UNIQUE call must be duplicated as part of its group,
+	 or not at all.  */
+      if (is_gimple_call (g)
+	  && gimple_call_internal_p (g)
+	  && gimple_call_internal_unique_p (g))
+	return true;
+    }
 
   return false;
 }
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 229443)
+++ gcc/tree-cfg.c	(working copy)
@@ -487,7 +487,11 @@ gimple_call_initialize_ctrl_altering (gi
       || ((flags & ECF_TM_BUILTIN)
 	  && is_tm_ending_fndecl (gimple_call_fndecl (stmt)))
       /* BUILT_IN_RETURN call is same as return statement.  */
-      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN))
+      || gimple_call_builtin_p (stmt, BUILT_IN_RETURN)
+      /* IFN_UNIQUE should be the last insn, to make checking for it
+	 as cheap as possible.  */
+      || (gimple_call_internal_p (stmt)
+	  && gimple_call_internal_unique_p (stmt)))
     gimple_call_set_ctrl_altering (stmt, true);
   else
     gimple_call_set_ctrl_altering (stmt, false);
Index: gcc/tree-ssa-threadedge.c
===================================================================
--- gcc/tree-ssa-threadedge.c	(revision 229443)
+++ gcc/tree-ssa-threadedge.c	(working copy)
@@ -247,6 +247,13 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL
+	  && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_unique_p (stmt))
+	return NULL;
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;

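
(For reference, a backend opts in to expanding the marker by providing an
insn named "unique" in its machine description; the target-insns.def entry
above is what generates the targetm.have_unique / targetm.gen_unique hooks
that expand_UNIQUE tests.  A minimal sketch of such a pattern -- the
unspec_volatile code name here is a placeholder, not something this patch
defines:

  (define_c_enum "unspecv" [UNSPECV_UNIQUE])

  (define_insn "unique"
    [(unspec_volatile [(const_int 0)] UNSPECV_UNIQUE)]
    ""
    "")

A target that provides no such pattern is unaffected: targetm.have_unique ()
is false and expand_UNIQUE emits nothing.)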
^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 5/11] C++ FE changes
  2015-10-27  8:03               ` Jakub Jelinek
@ 2015-10-27 20:21                 ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 20:21 UTC (permalink / raw)
  To: Jakub Jelinek, Cesar Philippidis
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

On 10/27/15 00:59, Jakub Jelinek wrote:
> On Mon, Oct 26, 2015 at 03:35:20PM -0700, Cesar Philippidis wrote:
>> I used that generic message for all of those clauses except for _GANG,
>> _WORKER and _VECTOR. The gang clause, at the very least, needed it to
>> disambiguate the static and num arguments. If you want I can handle
>> _WORKER and _VECTOR with the generic message. I only included it because
>> those arguments are optional, whereas they are mandatory for the other
>> clauses.
>>
>> Is this patch OK for trunk?
>
> Ok.

this is the patch that I've committed.

nathan



[-- Attachment #2: 05-trunk-cpfe-1027.patch --]
[-- Type: text/x-patch, Size: 16472 bytes --]

2015-10-27  Cesar Philippidis  <cesar@codesourcery.com>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    James Norris  <jnorris@codesourcery.com>
	    Joseph Myers  <joseph@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>
	    Nathan Sidwell <nathan@codesourcery.com>
	    Bernd Schmidt  <bschmidt@redhat.com>

	gcc/cp/
	* parser.c (cp_parser_omp_clause_name): Add auto, gang, seq,
	vector, worker.
	(cp_parser_oacc_simple_clause): New.
	(cp_parser_oacc_shape_clause): New.
	(cp_parser_oacc_all_clauses): Add auto, gang, seq, vector, worker.
	(OACC_LOOP_CLAUSE_MASK): Likewise.
	* semantics.c (finish_omp_clauses): Add auto, gang, seq, vector,
	worker. Unify the handling of teams, tasks and vector_length with
	the other loop shape clauses.

2015-10-27  Nathan Sidwell <nathan@codesourcery.com>
	    Cesar Philippidis  <cesar@codesourcery.com>

	gcc/testsuite/
	* g++.dg/gomp/pr33372-1.C: Adjust diagnostic.
	* g++.dg/gomp/pr33372-3.C: Likewise.
			  
Index: gcc/cp/semantics.c
===================================================================
--- gcc/cp/semantics.c	(revision 229443)
+++ gcc/cp/semantics.c	(working copy)
@@ -5965,14 +5965,76 @@ finish_omp_clauses (tree clauses, bool a
 	  OMP_CLAUSE_FINAL_EXPR (c) = t;
 	  break;
 
+	case OMP_CLAUSE_GANG:
+	  /* Operand 1 is the gang static: argument.  */
+	  t = OMP_CLAUSE_OPERAND (c, 1);
+	  if (t != NULL_TREE)
+	    {
+	      if (t == error_mark_node)
+		remove = true;
+	      else if (!type_dependent_expression_p (t)
+		       && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
+		{
+		  error ("%<gang%> static expression must be integral");
+		  remove = true;
+		}
+	      else
+		{
+		  t = mark_rvalue_use (t);
+		  if (!processing_template_decl)
+		    {
+		      t = maybe_constant_value (t);
+		      if (TREE_CODE (t) == INTEGER_CST
+			  && tree_int_cst_sgn (t) != 1
+			  && t != integer_minus_one_node)
+			{
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<gang%> static value must be"
+				      "positive");
+			  t = integer_one_node;
+			}
+		    }
+		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
+		}
+	      OMP_CLAUSE_OPERAND (c, 1) = t;
+	    }
+	  /* Check operand 0, the num argument.  */
+
+	case OMP_CLAUSE_WORKER:
+	case OMP_CLAUSE_VECTOR:
+	  if (OMP_CLAUSE_OPERAND (c, 0) == NULL_TREE)
+	    break;
+
+	case OMP_CLAUSE_NUM_TASKS:
+	case OMP_CLAUSE_NUM_TEAMS:
 	case OMP_CLAUSE_NUM_THREADS:
-	  t = OMP_CLAUSE_NUM_THREADS_EXPR (c);
+	case OMP_CLAUSE_NUM_GANGS:
+	case OMP_CLAUSE_NUM_WORKERS:
+	case OMP_CLAUSE_VECTOR_LENGTH:
+	  t = OMP_CLAUSE_OPERAND (c, 0);
 	  if (t == error_mark_node)
 	    remove = true;
 	  else if (!type_dependent_expression_p (t)
 		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
 	    {
-	      error ("num_threads expression must be integral");
+	     switch (OMP_CLAUSE_CODE (c))
+		{
+		case OMP_CLAUSE_GANG:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<gang%> num expression must be integral"); break;
+		case OMP_CLAUSE_VECTOR:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<vector%> length expression must be integral");
+		  break;
+		case OMP_CLAUSE_WORKER:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%<worker%> num expression must be integral");
+		  break;
+		default:
+		  error_at (OMP_CLAUSE_LOCATION (c),
+			    "%qs expression must be integral",
+			    omp_clause_code_name[OMP_CLAUSE_CODE (c)]);
+		}
 	      remove = true;
 	    }
 	  else
@@ -5984,13 +6046,33 @@ finish_omp_clauses (tree clauses, bool a
 		  if (TREE_CODE (t) == INTEGER_CST
 		      && tree_int_cst_sgn (t) != 1)
 		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_threads%> value must be positive");
+		      switch (OMP_CLAUSE_CODE (c))
+			{
+			case OMP_CLAUSE_GANG:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<gang%> num value must be positive");
+			  break;
+			case OMP_CLAUSE_VECTOR:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<vector%> length value must be"
+				      "positive");
+			  break;
+			case OMP_CLAUSE_WORKER:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%<worker%> num value must be"
+				      "positive");
+			  break;
+			default:
+			  warning_at (OMP_CLAUSE_LOCATION (c), 0,
+				      "%qs value must be positive",
+				      omp_clause_code_name
+				      [OMP_CLAUSE_CODE (c)]);
+			}
 		      t = integer_one_node;
 		    }
 		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
 		}
-	      OMP_CLAUSE_NUM_THREADS_EXPR (c) = t;
+	      OMP_CLAUSE_OPERAND (c, 0) = t;
 	    }
 	  break;
 
@@ -6062,35 +6144,6 @@ finish_omp_clauses (tree clauses, bool a
 	    }
 	  break;
 
-	case OMP_CLAUSE_NUM_TEAMS:
-	  t = OMP_CLAUSE_NUM_TEAMS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_teams%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_teams%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TEAMS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_ASYNC:
 	  t = OMP_CLAUSE_ASYNC_EXPR (c);
 	  if (t == error_mark_node)
@@ -6110,16 +6163,6 @@ finish_omp_clauses (tree clauses, bool a
 	    }
 	  break;
 
-	case OMP_CLAUSE_VECTOR_LENGTH:
-	  t = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	  t = maybe_convert_cond (t);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!processing_template_decl)
-	    t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-	  OMP_CLAUSE_VECTOR_LENGTH_EXPR (c) = t;
-	  break;
-
 	case OMP_CLAUSE_WAIT:
 	  t = OMP_CLAUSE_WAIT_EXPR (c);
 	  if (t == error_mark_node)
@@ -6547,35 +6590,6 @@ finish_omp_clauses (tree clauses, bool a
 	    }
 	  goto check_dup_generic;
 
-	case OMP_CLAUSE_NUM_TASKS:
-	  t = OMP_CLAUSE_NUM_TASKS_EXPR (c);
-	  if (t == error_mark_node)
-	    remove = true;
-	  else if (!type_dependent_expression_p (t)
-		   && !INTEGRAL_TYPE_P (TREE_TYPE (t)))
-	    {
-	      error ("%<num_tasks%> expression must be integral");
-	      remove = true;
-	    }
-	  else
-	    {
-	      t = mark_rvalue_use (t);
-	      if (!processing_template_decl)
-		{
-		  t = maybe_constant_value (t);
-		  if (TREE_CODE (t) == INTEGER_CST
-		      && tree_int_cst_sgn (t) != 1)
-		    {
-		      warning_at (OMP_CLAUSE_LOCATION (c), 0,
-				  "%<num_tasks%> value must be positive");
-		      t = integer_one_node;
-		    }
-		  t = fold_build_cleanup_point_expr (TREE_TYPE (t), t);
-		}
-	      OMP_CLAUSE_NUM_TASKS_EXPR (c) = t;
-	    }
-	  break;
-
 	case OMP_CLAUSE_GRAINSIZE:
 	  t = OMP_CLAUSE_GRAINSIZE_EXPR (c);
 	  if (t == error_mark_node)
@@ -6694,6 +6708,8 @@ finish_omp_clauses (tree clauses, bool a
 	case OMP_CLAUSE_SIMD:
 	case OMP_CLAUSE_DEFAULTMAP:
 	case OMP_CLAUSE__CILK_FOR_COUNT_:
+	case OMP_CLAUSE_AUTO:
+	case OMP_CLAUSE_SEQ:
 	  break;
 
 	case OMP_CLAUSE_INBRANCH:
Index: gcc/cp/parser.c
===================================================================
--- gcc/cp/parser.c	(revision 229443)
+++ gcc/cp/parser.c	(working copy)
@@ -29064,7 +29064,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 {
   pragma_omp_clause result = PRAGMA_OMP_CLAUSE_NONE;
 
-  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
+  if (cp_lexer_next_token_is_keyword (parser->lexer, RID_AUTO))
+    result = PRAGMA_OACC_CLAUSE_AUTO;
+  else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_IF))
     result = PRAGMA_OMP_CLAUSE_IF;
   else if (cp_lexer_next_token_is_keyword (parser->lexer, RID_DEFAULT))
     result = PRAGMA_OMP_CLAUSE_DEFAULT;
@@ -29122,7 +29124,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_FROM;
 	  break;
 	case 'g':
-	  if (!strcmp ("grainsize", p))
+	  if (!strcmp ("gang", p))
+	    result = PRAGMA_OACC_CLAUSE_GANG;
+	  else if (!strcmp ("grainsize", p))
 	    result = PRAGMA_OMP_CLAUSE_GRAINSIZE;
 	  break;
 	case 'h':
@@ -29212,6 +29216,8 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_SECTIONS;
 	  else if (!strcmp ("self", p))
 	    result = PRAGMA_OACC_CLAUSE_SELF;
+	  else if (!strcmp ("seq", p))
+	    result = PRAGMA_OACC_CLAUSE_SEQ;
 	  else if (!strcmp ("shared", p))
 	    result = PRAGMA_OMP_CLAUSE_SHARED;
 	  else if (!strcmp ("simd", p))
@@ -29238,7 +29244,9 @@ cp_parser_omp_clause_name (cp_parser *pa
 	    result = PRAGMA_OMP_CLAUSE_USE_DEVICE_PTR;
 	  break;
 	case 'v':
-	  if (!strcmp ("vector_length", p))
+	  if (!strcmp ("vector", p))
+	    result = PRAGMA_OACC_CLAUSE_VECTOR;
+	  else if (!strcmp ("vector_length", p))
 	    result = PRAGMA_OACC_CLAUSE_VECTOR_LENGTH;
 	  else if (flag_cilkplus && !strcmp ("vectorlength", p))
 	    result = PRAGMA_CILK_CLAUSE_VECTORLENGTH;
@@ -29246,6 +29254,8 @@ cp_parser_omp_clause_name (cp_parser *pa
 	case 'w':
 	  if (!strcmp ("wait", p))
 	    result = PRAGMA_OACC_CLAUSE_WAIT;
+	  else if (!strcmp ("worker", p))
+	    result = PRAGMA_OACC_CLAUSE_WORKER;
 	  break;
 	}
     }
@@ -29582,6 +29592,146 @@ cp_parser_oacc_data_clause_deviceptr (cp
   return list;
 }
 
+/* OpenACC 2.0:
+   auto
+   independent
+   nohost
+   seq */
+
+static tree
+cp_parser_oacc_simple_clause (cp_parser * /* parser  */,
+			      enum omp_clause_code code,
+			      tree list, location_t location)
+{
+  check_no_duplicate_clause (list, code, omp_clause_code_name[code], location);
+  tree c = build_omp_clause (location, code);
+  OMP_CLAUSE_CHAIN (c) = list;
+  return c;
+}
+
+/* OpenACC:
+
+    gang [( gang-arg-list )]
+    worker [( [num:] int-expr )]
+    vector [( [length:] int-expr )]
+
+  where gang-arg is one of:
+
+    [num:] int-expr
+    static: size-expr
+
+  and size-expr may be:
+
+    *
+    int-expr
+*/
+
+static tree
+cp_parser_oacc_shape_clause (cp_parser *parser, omp_clause_code kind,
+			     const char *str, tree list)
+{
+  const char *id = "num";
+  cp_lexer *lexer = parser->lexer;
+  tree ops[2] = { NULL_TREE, NULL_TREE }, c;
+  location_t loc = cp_lexer_peek_token (lexer)->location;
+
+  if (kind == OMP_CLAUSE_VECTOR)
+    id = "length";
+
+  if (cp_lexer_next_token_is (lexer, CPP_OPEN_PAREN))
+    {
+      cp_lexer_consume_token (lexer);
+
+      do
+	{
+	  cp_token *next = cp_lexer_peek_token (lexer);
+	  int idx = 0;
+
+	  /* Gang static argument.  */
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is_keyword (lexer, RID_STATIC))
+	    {
+	      cp_lexer_consume_token (lexer);
+
+	      if (!cp_parser_require (parser, CPP_COLON, RT_COLON))
+		goto cleanup_error;
+
+	      idx = 1;
+	      if (ops[idx] != NULL)
+		{
+		  cp_parser_error (parser, "too many %<static%> arguments");
+		  goto cleanup_error;
+		}
+
+	      /* Check for the '*' argument.  */
+	      if (cp_lexer_next_token_is (lexer, CPP_MULT))
+		{
+		  cp_lexer_consume_token (lexer);
+		  ops[idx] = integer_minus_one_node;
+
+		  if (cp_lexer_next_token_is (lexer, CPP_COMMA))
+		    {
+		      cp_lexer_consume_token (lexer);
+		      continue;
+		    }
+		  else break;
+		}
+	    }
+	  /* Worker num: argument and vector length: arguments.  */
+	  else if (cp_lexer_next_token_is (lexer, CPP_NAME)
+		   && strcmp (id, IDENTIFIER_POINTER (next->u.value)) == 0
+		   && cp_lexer_nth_token_is (lexer, 2, CPP_COLON))
+	    {
+	      cp_lexer_consume_token (lexer);  /* id  */
+	      cp_lexer_consume_token (lexer);  /* ':'  */
+	    }
+
+	  /* Now collect the actual argument.  */
+	  if (ops[idx] != NULL_TREE)
+	    {
+	      cp_parser_error (parser, "unexpected argument");
+	      goto cleanup_error;
+	    }
+
+	  tree expr = cp_parser_assignment_expression (parser, NULL, false,
+						       false);
+	  if (expr == error_mark_node)
+	    goto cleanup_error;
+
+	  mark_exp_read (expr);
+	  ops[idx] = expr;
+
+	  if (kind == OMP_CLAUSE_GANG
+	      && cp_lexer_next_token_is (lexer, CPP_COMMA))
+	    {
+	      cp_lexer_consume_token (lexer);
+	      continue;
+	    }
+	  break;
+	}
+      while (1);
+
+      if (!cp_parser_require (parser, CPP_CLOSE_PAREN, RT_CLOSE_PAREN))
+	goto cleanup_error;
+    }
+
+  check_no_duplicate_clause (list, kind, str, loc);
+
+  c = build_omp_clause (loc, kind);
+
+  if (ops[1])
+    OMP_CLAUSE_OPERAND (c, 1) = ops[1];
+
+  OMP_CLAUSE_OPERAND (c, 0) = ops[0];
+  OMP_CLAUSE_CHAIN (c) = list;
+
+  return c;
+
+ cleanup_error:
+  cp_parser_skip_to_closing_parenthesis (parser, false, false, true);
+  return list;
+}
+
 /* OpenACC:
    vector_length ( expression ) */
 
@@ -31306,6 +31456,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_clause_async (parser, clauses);
 	  c_name = "async";
 	  break;
+	case PRAGMA_OACC_CLAUSE_AUTO:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_AUTO,
+						 clauses, here);
+	  c_name = "auto";
+	  break;
 	case PRAGMA_OACC_CLAUSE_COLLAPSE:
 	  clauses = cp_parser_omp_clause_collapse (parser, clauses, here);
 	  c_name = "collapse";
@@ -31338,6 +31493,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_data_clause_deviceptr (parser, clauses);
 	  c_name = "deviceptr";
 	  break;
+	case PRAGMA_OACC_CLAUSE_GANG:
+	  c_name = "gang";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_GANG,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_HOST:
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "host";
@@ -31382,6 +31542,16 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_data_clause (parser, c_kind, clauses);
 	  c_name = "self";
 	  break;
+	case PRAGMA_OACC_CLAUSE_SEQ:
+	  clauses = cp_parser_oacc_simple_clause (parser, OMP_CLAUSE_SEQ,
+						 clauses, here);
+	  c_name = "seq";
+	  break;
+	case PRAGMA_OACC_CLAUSE_VECTOR:
+	  c_name = "vector";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_VECTOR,
+						 c_name, clauses);
+	  break;
 	case PRAGMA_OACC_CLAUSE_VECTOR_LENGTH:
 	  clauses = cp_parser_oacc_clause_vector_length (parser, clauses);
 	  c_name = "vector_length";
@@ -31390,6 +31560,11 @@ cp_parser_oacc_all_clauses (cp_parser *p
 	  clauses = cp_parser_oacc_clause_wait (parser, clauses);
 	  c_name = "wait";
 	  break;
+	case PRAGMA_OACC_CLAUSE_WORKER:
+	  c_name = "worker";
+	  clauses = cp_parser_oacc_shape_clause (parser, OMP_CLAUSE_WORKER,
+						 c_name, clauses);
+	  break;
 	default:
 	  cp_parser_error (parser, "expected %<#pragma acc%> clause");
 	  goto saw_error;
@@ -34303,6 +34478,11 @@ cp_parser_oacc_enter_exit_data (cp_parse
 
 #define OACC_LOOP_CLAUSE_MASK						\
 	( (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_COLLAPSE)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_GANG)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_VECTOR)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_WORKER)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_AUTO)		\
+	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_SEQ)			\
 	| (OMP_CLAUSE_MASK_1 << PRAGMA_OACC_CLAUSE_REDUCTION) )
 
 static tree
Index: gcc/testsuite/g++.dg/gomp/pr33372-3.C
===================================================================
--- gcc/testsuite/g++.dg/gomp/pr33372-3.C	(revision 229443)
+++ gcc/testsuite/g++.dg/gomp/pr33372-3.C	(working copy)
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   T n = 6;
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }
Index: gcc/testsuite/g++.dg/gomp/pr33372-1.C
===================================================================
--- gcc/testsuite/g++.dg/gomp/pr33372-1.C	(revision 229443)
+++ gcc/testsuite/g++.dg/gomp/pr33372-1.C	(working copy)
@@ -6,7 +6,7 @@ template <typename T>
 void f ()
 {
   extern T n ();
-#pragma omp parallel num_threads(n)	// { dg-error "num_threads expression must be integral" }
+#pragma omp parallel num_threads(n)	// { dg-error "'num_threads' expression must be integral" }
   ;
 #pragma omp parallel for schedule(static, n)
   for (int i = 0; i < 10; i++)		// { dg-error "chunk size expression must be integral" }

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 4/11] C FE changes
  2015-10-26 22:32                   ` Cesar Philippidis
@ 2015-10-27 20:23                     ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 20:23 UTC (permalink / raw)
  To: Cesar Philippidis, Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 300 bytes --]

On 10/26/15 15:27, Cesar Philippidis wrote:
> On 10/26/2015 01:59 AM, Jakub Jelinek wrote:
>
>> Ok for trunk with those changes fixed.
>
> Here's the patch with those changes. Nathan will commit this patch with
> the rest of the OpenACC execution model patches.

This is the patch I've committed.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 9/11] oacc_device_lower pass gate
  2015-10-22  9:33   ` Jakub Jelinek
@ 2015-10-27 20:31     ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 20:31 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 02:32, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:50:31PM -0400, Nathan Sidwell wrote:
>>
>> This patch is obvious, but included for completeness. We always want to run
>> the device lowering pass (when openacc is enabled), in order to delete the
>> marker and loop functions that should never be seen after this point.
>>
>> nathan
>
>> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
>>
>> 	* omp-low.c (pass_oacc_device_lower::execute): Ignore errors.
>
> Ok.

The patch is committed.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 3/11] new target hook
  2015-10-22  8:23   ` Jakub Jelinek
  2015-10-22 13:17     ` Nathan Sidwell
@ 2015-10-27 22:15     ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 22:15 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 718 bytes --]

On 10/22/15 01:15, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:13:26PM -0400, Nathan Sidwell wrote:
>> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
>>
>> 	* target.def (fork_join): New GOACC hook.
>> 	* targhooks.h (default_goacc_fork_join): Declare.
>> 	* omp-low.c (default_goacc_forkjoin): New.
>> 	* doc/tm.texi.in (TARGET_GOACC_FORK_JOIN): Add.
>> 	* doc/tm.texi: Regenerate.
>> 	* config/nvptx/nvptx.c (nvptx_xform_fork_join): New.
>> 	(TARGET_GOACC_FORK_JOIN): Override.

This is what I've committed.  Other than nits, the changes are:

1) use targetm.have_... rather than #ifdef HAVE_...

2) don't include the nvptx hunk.  I'll apply that with the nvptx bits to turn
all this stuff on.
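
For reference, an illustrative contrast of the two styles in change 1), using
the new oacc_fork pattern as the example (the attached patch itself only
queries the have_ hook in default_goacc_fork_join):

   /* Old style: compile-time insn-flags.h macros.  */
   #ifdef HAVE_oacc_fork
     emit_insn (gen_oacc_fork (target, data_dep, axis));
   #endif

   /* New style: hooks generated from target-insns.def.  */
   if (targetm.have_oacc_fork ())
     emit_insn (targetm.gen_oacc_fork (target, data_dep, axis));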

nathan


[-- Attachment #2: 03-trunk-hook-1027.patch --]
[-- Type: text/x-patch, Size: 5027 bytes --]

2015-10-27  Nathan Sidwell  <nathan@codesourcery.com>

	* target-insns.def (oacc_fork, oacc_join): Define.
	* target.def (goacc.validate_dims): Adjust doc to avoid warning.
	(goacc.fork_join): New GOACC hook.
	* targhooks.h (default_goacc_fork_join): Declare.
	* omp-low.c (default_goacc_forkjoin): New.
	* doc/tm.texi.in (TARGET_GOACC_FORK_JOIN): Add.
	* doc/tm.texi: Regenerate.

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 229447)
+++ gcc/doc/tm.texi	(working copy)
@@ -5754,7 +5754,7 @@ usable.  In that case, the smaller the n
 to use it.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int @var{dims[]}, int @var{fn_level})
+@deftypefn {Target Hook} bool TARGET_GOACC_VALIDATE_DIMS (tree @var{decl}, int *@var{dims}, int @var{fn_level})
 This hook should check the launch dimensions provided for an OpenACC
 compute region, or routine.  Defaulted values are represented as -1
 and non-constant values as 0. The @var{fn_level} is negative for the
@@ -5766,6 +5766,14 @@ true, if changes have been made.  You mu
 provide dimensions larger than 1.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
+This hook should convert IFN_GOACC_FORK and IFN_GOACC_JOIN function
+calls to target-specific gimple.  It is executed during the
+oacc_device_lower pass.  It should return true, if the functions
+should be deleted.  The default hook returns true, if there are no
+RTL expanders for them.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/doc/tm.texi.in
===================================================================
--- gcc/doc/tm.texi.in	(revision 229447)
+++ gcc/doc/tm.texi.in	(working copy)
@@ -4251,6 +4251,8 @@ address;  but often a machine-dependent
 
 @hook TARGET_GOACC_VALIDATE_DIMS
 
+@hook TARGET_GOACC_FORK_JOIN
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229462)
+++ gcc/omp-low.c	(working copy)
@@ -17571,6 +17642,19 @@ oacc_validate_dims (tree fn, tree attrs,
   return fn_level;
 }
 
+/* Default fork/join early expander.  Delete the function calls if
+   there is no RTL expander.  */
+
+bool
+default_goacc_fork_join (gcall *ARG_UNUSED (call),
+			 const int *ARG_UNUSED (dims), bool is_fork)
+{
+  if (is_fork)
+    return targetm.have_oacc_fork ();
+  else
+    return targetm.have_oacc_join ();
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback).  */
Index: gcc/target-insns.def
===================================================================
--- gcc/target-insns.def	(revision 229459)
+++ gcc/target-insns.def	(working copy)
@@ -64,6 +64,8 @@ DEF_TARGET_INSN (memory_barrier, (void))
 DEF_TARGET_INSN (movstr, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (nonlocal_goto, (rtx x0, rtx x1, rtx x2, rtx x3))
 DEF_TARGET_INSN (nonlocal_goto_receiver, (void))
+DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
+DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (probe_stack, (rtx x0))
 DEF_TARGET_INSN (probe_stack_address, (rtx x0))
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 229447)
+++ gcc/target.def	(working copy)
@@ -1655,9 +1655,19 @@ should fill in anything that needs to de
 non-defaults.  Diagnostics should be issued as appropriate.  Return\n\
 true, if changes have been made.  You must override this hook to\n\
 provide dimensions larger than 1.",
-bool, (tree decl, int dims[], int fn_level),
+bool, (tree decl, int *dims, int fn_level),
 default_goacc_validate_dims)
 
+DEFHOOK
+(fork_join,
+"This hook should convert IFN_GOACC_FORK and IFN_GOACC_JOIN function\n\
+calls to target-specific gimple.  It is executed during the\n\
+oacc_device_lower pass.  It should return true, if the functions\n\
+should be deleted.  The default hook returns true, if there are no\n\
+RTL expanders for them.",
+bool, (gcall *call, const int *dims, bool is_fork),
+default_goacc_fork_join)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
Index: gcc/targhooks.h
===================================================================
--- gcc/targhooks.h	(revision 229447)
+++ gcc/targhooks.h	(working copy)
@@ -110,6 +110,7 @@ extern void default_destroy_cost_data (v
 
 /* OpenACC hooks.  */
 extern bool default_goacc_validate_dims (tree, int [], int);
+extern bool default_goacc_fork_join (gcall *, const int [], bool);
 
 /* These are here, and not in hooks.[ch], because not all users of
    hooks.h include tm.h, and thus we don't have CUMULATIVE_ARGS.  */

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 6/11] Reduction initialization
  2015-10-22  9:11   ` Jakub Jelinek
@ 2015-10-27 22:27     ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-27 22:27 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

On 10/22/15 01:58, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:24:13PM -0400, Nathan Sidwell wrote:
>> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
>>
>> 	* omp-low.c (oacc_init_reduction_array): New.
>> 	(oacc_initialize_reduction_data): Initialize array.
>
> Ok.

Committed.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 8/11] device-specific lowering
  2015-10-26 15:21   ` Jakub Jelinek
  2015-10-26 16:23     ` Nathan Sidwell
@ 2015-10-28  1:06     ` Nathan Sidwell
  1 sibling, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-28  1:06 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 704 bytes --]

On 10/26/15 08:13, Jakub Jelinek wrote:
> On Wed, Oct 21, 2015 at 03:49:08PM -0400, Nathan Sidwell wrote:
>> This patch is the device-specific half of the previous patch.  It processes
>> the partition head & tail markers and loop abstraction functions inserted
>> during omp lowering.


This is the patch I've committed.  Because I committed it before patch 7 (to avoid 
breaking the build), I've included the internal-fn changes from patch 7.
I also noticed that in converting default_goacc_fork_join to use targetm, I'd 
inverted the sense of the return value.  That suggested to me I'd got the 
original sense wrong, so I've updated the target hook documentation to reflect 
the new reality.
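
For reference, a rough sketch of how a partitioned loop is laid out in terms
of the IFN_GOACC_LOOP ABI documented in the attached internal-fn.h
(illustrative pseudo-code for an up-counting loop, not the exact gimple the
expander emits; CHUNKS/STEP/OFFSET/BOUND stand for the IFN_GOACC_LOOP_* codes):

   chunk_max = GOACC_LOOP (CHUNKS, dir, range, step, chunk_size, mask);
   stride    = GOACC_LOOP (STEP,   dir, range, step, chunk_size, mask);
   for (chunk = 0; chunk < chunk_max; chunk++)
     {
       offset = GOACC_LOOP (OFFSET, dir, range, step, chunk_size, mask, chunk);
       bound  = GOACC_LOOP (BOUND,  dir, range, step, chunk_size, mask, offset);
       for (ix = offset; ix < bound; ix += stride)
         /* original loop body, using base + ix  */;
     }

oacc_xform_loop then replaces each GOACC_LOOP call with the arithmetic for the
striding or chunking scheme selected for the target.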

nathan



[-- Attachment #2: 08-trunk-dev-lower-1027.patch --]
[-- Type: text/x-patch, Size: 36663 bytes --]

2015-10-27  Nathan Sidwell  <nathan@codesourcery.com>

	* internal-fn.def (IFN_GOACC_DIM_SIZE, IFN_GOACC_DIM_POS,
	IFN_GOACC_LOOP): New.
	* internal-fn.h (enum ifn_unique_kind): Add IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN, IFN_UNIQUE_OACC_HEAD_MARK,
	IFN_UNIQUE_OACC_TAIL_MARK.
	(enum ifn_goacc_loop_kind): New.
	* internal-fn.c (expand_UNIQUE): Add IFN_UNIQUE_OACC_FORK,
	IFN_UNIQUE_OACC_JOIN cases.
	(expand_GOACC_DIM_SIZE, expand_GOACC_DIM_POS): New.
	(expand_GOACC_LOOP): New.
	* target-insns.def (oacc_dim_pos, oacc_dim_size): New.
	* omp-low.c: Include gimple-pretty-print.h.
	(struct oacc_loop): New.
	(enum oacc_loop_flags): New.
	(oacc_thread_numbers): New.
	(oacc_xform_loop): New.
	(new_oacc_loop_raw, new_oacc_loop_outer, new_oacc_loop,
	new_oacc_loop_routine, finish_oacc_loop, free_oacc_loop): New,
	(dump_oacc_loop_part, dump_oacc_loop, debug_oacc_loop): New,
	(oacc_loop_discover_walk, oacc_loop_sibling_nrevers,
	oacc_loop_discovery): New.
	(oacc_loop_xform_head_tail, oacc_loop_xform_loop,
	oacc_loop_process): New.
	(oacc_loop_fixed_partitions, oacc_loop_partition): New.
	(execute_oacc_device_lower): Discover & process loops.  Process
	internal fns.
	* target.def (goacc.fork_join): Change sense of hook, clarify
	documentation.
	* doc/tm.texi: Regenerated.

Index: gcc/doc/tm.texi
===================================================================
--- gcc/doc/tm.texi	(revision 229465)
+++ gcc/doc/tm.texi	(working copy)
@@ -5778,11 +5778,13 @@ provide dimensions larger than 1.
 @end deftypefn
 
 @deftypefn {Target Hook} bool TARGET_GOACC_FORK_JOIN (gcall *@var{call}, const int *@var{dims}, bool @var{is_fork})
-This hook should convert IFN_GOACC_FORK and IFN_GOACC_JOIN function
-calls to target-specific gimple.  It is executed during the
-oacc_device_lower pass.  It should return true, if the functions
-should be deleted.  The default hook returns true, if there are no
-RTL expanders for them.
+This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN
+function calls to target-specific gimple, or indicate whether they
+should be retained.  It is executed during the oacc_device_lower pass.
+It should return true, if the call should be retained.  It should
+return false, if it is to be deleted (either because target-specific
+gimple has been inserted before it, or there is no need for it).
+The default hook returns false, if there are no RTL expanders for them.
 @end deftypefn
 
 @node Anchored Addresses
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	(revision 229465)
+++ gcc/internal-fn.c	(working copy)
@@ -1976,12 +1976,84 @@ expand_UNIQUE (gcall *stmt)
       if (targetm.have_unique ())
 	pattern = targetm.gen_unique ();
       break;
+
+    case IFN_UNIQUE_OACC_FORK:
+    case IFN_UNIQUE_OACC_JOIN:
+      if (targetm.have_oacc_fork () && targetm.have_oacc_join ())
+	{
+	  tree lhs = gimple_call_lhs (stmt);
+	  rtx target = const0_rtx;
+
+	  if (lhs)
+	    target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+
+	  rtx data_dep = expand_normal (gimple_call_arg (stmt, 1));
+	  rtx axis = expand_normal (gimple_call_arg (stmt, 2));
+
+	  if (kind == IFN_UNIQUE_OACC_FORK)
+	    pattern = targetm.gen_oacc_fork (target, data_dep, axis);
+	  else
+	    pattern = targetm.gen_oacc_join (target, data_dep, axis);
+	}
+      else
+	gcc_unreachable ();
+      break;
     }
 
   if (pattern)
     emit_insn (pattern);
 }
 
+/* The size of an OpenACC compute dimension.  */
+
+static void
+expand_GOACC_DIM_SIZE (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  if (targetm.have_oacc_dim_size ())
+    {
+      rtx dim = expand_expr (gimple_call_arg (stmt, 0), NULL_RTX,
+			     VOIDmode, EXPAND_NORMAL);
+      emit_insn (targetm.gen_oacc_dim_size (target, dim));
+    }
+  else
+    emit_move_insn (target, GEN_INT (1));
+}
+
+/* The position of an OpenACC execution engine along one compute axis.  */
+
+static void
+expand_GOACC_DIM_POS (gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+
+  if (!lhs)
+    return;
+
+  rtx target = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  if (targetm.have_oacc_dim_pos ())
+    {
+      rtx dim = expand_expr (gimple_call_arg (stmt, 0), NULL_RTX,
+			     VOIDmode, EXPAND_NORMAL);
+      emit_insn (targetm.gen_oacc_dim_pos (target, dim));
+    }
+  else
+    emit_move_insn (target, const0_rtx);
+}
+
+/* This is expanded by oacc_device_lower pass.  */
+
+static void
+expand_GOACC_LOOP (gcall *stmt ATTRIBUTE_UNUSED)
+{
+  gcc_unreachable ();
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	(revision 229465)
+++ gcc/internal-fn.def	(working copy)
@@ -72,3 +72,14 @@ DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | E
    between uses.  See internal-fn.h for usage.  */
 DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW, NULL)
 
+/* DIM_SIZE and DIM_POS return the size of a particular compute
+   dimension and the executing thread's position within that
+   dimension.  DIM_POS is pure (and not const) so that it isn't
+   thought to clobber memory and can be gcse'd within a single
+   parallel region, but not across FORK/JOIN boundaries.  They take a
+   single INTEGER_CST argument.  */
+DEF_INTERNAL_FN (GOACC_DIM_SIZE, ECF_CONST | ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_DIM_POS, ECF_PURE | ECF_NOTHROW | ECF_LEAF, ".")
+
+/* OpenACC looping abstraction.  See internal-fn.h for usage.  */
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_PURE | ECF_NOTHROW, NULL)
Index: gcc/internal-fn.h
===================================================================
--- gcc/internal-fn.h	(revision 229465)
+++ gcc/internal-fn.h	(working copy)
@@ -22,7 +22,48 @@ along with GCC; see the file COPYING3.
 
 /* INTEGER_CST values for IFN_UNIQUE function arg-0.  */
 enum ifn_unique_kind {
-  IFN_UNIQUE_UNSPEC   /* Undifferentiated UNIQUE.  */
+  IFN_UNIQUE_UNSPEC,  /* Undifferentiated UNIQUE.  */
+
+  /* FORK and JOIN mark the points at which OpenACC partitioned
+     execution is entered or exited.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, indicating the axis.  */
+  IFN_UNIQUE_OACC_FORK,
+  IFN_UNIQUE_OACC_JOIN,
+
+  /* HEAD_MARK and TAIL_MARK are used to demark the sequence entering
+     or leaving partitioned execution.
+     return: data dependency value
+     arg-1: data dependency var
+     arg-2: INTEGER_CST argument, remaining markers in this sequence
+     arg-3...: varargs on primary header  */
+  IFN_UNIQUE_OACC_HEAD_MARK,
+  IFN_UNIQUE_OACC_TAIL_MARK
+};
+
+/* INTEGER_CST values for IFN_GOACC_LOOP arg-0.  Allows the precise
+   stepping of the compute geometry over the loop iterations to be
+   deferred until it is known which compiler is generating the code.
+   The action is encoded in a constant first argument.
+
+     CHUNK_MAX = LOOP (CODE_CHUNKS, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     STEP = LOOP (CODE_STEP, DIR, RANGE, STEP, CHUNK_SIZE, MASK)
+     OFFSET = LOOP (CODE_OFFSET, DIR, RANGE, STEP, CHUNK_SIZE, MASK, CHUNK_NO)
+     BOUND = LOOP (CODE_BOUND, DIR, RANGE, STEP, CHUNK_SIZE, MASK, OFFSET)
+
+     DIR - +1 for up loop, -1 for down loop
+     RANGE - Range of loop (END - BASE)
+     STEP - iteration step size
+     CHUNKING - size of chunking, (constant zero for no chunking)
+     CHUNK_NO - chunk number
+     MASK - partitioning mask.  */
+
+enum ifn_goacc_loop_kind {
+  IFN_GOACC_LOOP_CHUNKS,  /* Number of chunks.  */
+  IFN_GOACC_LOOP_STEP,    /* Size of each thread's step.  */
+  IFN_GOACC_LOOP_OFFSET,  /* Initial iteration value.  */
+  IFN_GOACC_LOOP_BOUND    /* Limit of iteration value.  */
 };
 
 /* Initialize internal function tables.  */
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229465)
+++ gcc/omp-low.c	(working copy)
@@ -81,6 +81,7 @@ along with GCC; see the file COPYING3.
 #include "context.h"
 #include "lto-section-names.h"
 #include "gomp-constants.h"
+#include "gimple-pretty-print.h"
 
 /* Lowering of OMP parallel and workshare constructs proceeds in two
    phases.  The first phase scans the function looking for OMP statements
@@ -233,6 +234,49 @@ struct omp_for_data
   struct omp_for_data_loop *loops;
 };
 
+/* Describe the OpenACC looping structure of a function.  The entire
+   function is held in a 'NULL' loop.  */
+
+struct oacc_loop
+{
+  oacc_loop *parent; /* Containing loop.  */
+
+  oacc_loop *child; /* First inner loop.  */
+
+  oacc_loop *sibling; /* Next loop within same parent.  */
+
+  location_t loc; /* Location of the loop start.  */
+
+  gcall *marker; /* Initial head marker.  */
+  
+  gcall *heads[GOMP_DIM_MAX];  /* Head marker functions. */
+  gcall *tails[GOMP_DIM_MAX];  /* Tail marker functions. */
+
+  tree routine;  /* Pseudo-loop enclosing a routine.  */
+
+  unsigned mask;   /* Partitioning mask.  */
+  unsigned flags;   /* Partitioning flags.  */
+  tree chunk_size;   /* Chunk size.  */
+  gcall *head_end; /* Final marker of head sequence.  */
+};
+
+/*  Flags for an OpenACC loop.  */
+
+enum oacc_loop_flags {
+  OLF_SEQ	= 1u << 0,  /* Explicitly sequential  */
+  OLF_AUTO	= 1u << 1,	/* Compiler chooses axes.  */
+  OLF_INDEPENDENT = 1u << 2,	/* Iterations are known independent.  */
+  OLF_GANG_STATIC = 1u << 3,	/* Gang partitioning is static (has op). */
+
+  /* Explicitly specified loop axes.  */
+  OLF_DIM_BASE = 4,
+  OLF_DIM_GANG   = 1u << (OLF_DIM_BASE + GOMP_DIM_GANG),
+  OLF_DIM_WORKER = 1u << (OLF_DIM_BASE + GOMP_DIM_WORKER),
+  OLF_DIM_VECTOR = 1u << (OLF_DIM_BASE + GOMP_DIM_VECTOR),
+
+  OLF_MAX = OLF_DIM_BASE + GOMP_DIM_MAX
+};
+
 
 static splay_tree all_contexts;
 static int taskreg_nesting_level;
@@ -17584,6 +17628,241 @@ omp_finish_file (void)
     }
 }
 
+/* Find the number of threads (POS = false), or thread number (POS =
+   true) for an OpenACC region partitioned as MASK.  Setup code
+   required for the calculation is added to SEQ.  */
+
+static tree
+oacc_thread_numbers (bool pos, int mask, gimple_seq *seq)
+{
+  tree res = pos ? NULL_TREE : build_int_cst (unsigned_type_node, 1);
+  unsigned ix;
+
+  /* Start at gang level, and examine relevant dimension indices.  */
+  for (ix = GOMP_DIM_GANG; ix != GOMP_DIM_MAX; ix++)
+    if (GOMP_DIM_MASK (ix) & mask)
+      {
+	tree arg = build_int_cst (unsigned_type_node, ix);
+
+	if (res)
+	  {
+	    /* We had an outer index, so scale that by the size of
+	       this dimension.  */
+	    tree n = create_tmp_var (integer_type_node);
+	    gimple *call
+	      = gimple_build_call_internal (IFN_GOACC_DIM_SIZE, 1, arg);
+	    
+	    gimple_call_set_lhs (call, n);
+	    gimple_seq_add_stmt (seq, call);
+	    res = fold_build2 (MULT_EXPR, integer_type_node, res, n);
+	  }
+	if (pos)
+	  {
+	    /* Determine index in this dimension.  */
+	    tree id = create_tmp_var (integer_type_node);
+	    gimple *call = gimple_build_call_internal
+	      (IFN_GOACC_DIM_POS, 1, arg);
+
+	    gimple_call_set_lhs (call, id);
+	    gimple_seq_add_stmt (seq, call);
+	    if (res)
+	      res = fold_build2 (PLUS_EXPR, integer_type_node, res, id);
+	    else
+	      res = id;
+	  }
+      }
+
+  if (res == NULL_TREE)
+    res = integer_zero_node;
+
+  return res;
+}
+
+/* Transform IFN_GOACC_LOOP calls to actual code.  See
+   expand_oacc_for for where these are generated.  At the vector
+   level, we stride loops, such that each member of a warp will
+   operate on adjacent iterations.  At the worker and gang level,
+   each gang/warp executes a set of contiguous iterations.  Chunking
+   can override this such that each iteration engine executes a
+   contiguous chunk, and then moves on to stride to the next chunk.   */
+
+static void
+oacc_xform_loop (gcall *call)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (call);
+  enum ifn_goacc_loop_kind code
+    = (enum ifn_goacc_loop_kind) TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+  tree dir = gimple_call_arg (call, 1);
+  tree range = gimple_call_arg (call, 2);
+  tree step = gimple_call_arg (call, 3);
+  tree chunk_size = NULL_TREE;
+  unsigned mask = (unsigned) TREE_INT_CST_LOW (gimple_call_arg (call, 5));
+  tree lhs = gimple_call_lhs (call);
+  tree type = TREE_TYPE (lhs);
+  tree diff_type = TREE_TYPE (range);
+  tree r = NULL_TREE;
+  gimple_seq seq = NULL;
+  bool chunking = false, striding = true;
+  unsigned outer_mask = mask & (~mask + 1); // Outermost partitioning
+  unsigned inner_mask = mask & ~outer_mask; // Inner partitioning (if any)
+
+#ifdef ACCEL_COMPILER
+  chunk_size = gimple_call_arg (call, 4);
+  if (integer_minus_onep (chunk_size)  /* Force static allocation.  */
+      || integer_zerop (chunk_size))   /* Default (also static).  */
+    {
+      /* If we're at the gang level, we want each to execute a
+	 contiguous run of iterations.  Otherwise we want each element
+	 to stride.  */
+      striding = !(outer_mask & GOMP_DIM_MASK (GOMP_DIM_GANG));
+      chunking = false;
+    }
+  else
+    {
+      /* Chunk of size 1 is striding.  */
+      striding = integer_onep (chunk_size);
+      chunking = !striding;
+    }
+#endif
+
+  /* striding=true, chunking=true
+       -> invalid.
+     striding=true, chunking=false
+       -> chunks=1
+     striding=false,chunking=true
+       -> chunks=ceil (range/(chunksize*threads*step))
+     striding=false,chunking=false
+       -> chunk_size=ceil(range/(threads*step)),chunks=1  */
+  push_gimplify_context (true);
+
+  switch (code)
+    {
+    default: gcc_unreachable ();
+
+    case IFN_GOACC_LOOP_CHUNKS:
+      if (!chunking)
+	r = build_int_cst (type, 1);
+      else
+	{
+	  /* chunk_max
+	     = (range - dir) / (chunks * step * num_threads) + dir  */
+	  tree per = oacc_thread_numbers (false, mask, &seq);
+	  per = fold_convert (type, per);
+	  chunk_size = fold_convert (type, chunk_size);
+	  per = fold_build2 (MULT_EXPR, type, per, chunk_size);
+	  per = fold_build2 (MULT_EXPR, type, per, step);
+	  r = build2 (MINUS_EXPR, type, range, dir);
+	  r = build2 (PLUS_EXPR, type, r, per);
+	  r = build2 (TRUNC_DIV_EXPR, type, r, per);
+	}
+      break;
+
+    case IFN_GOACC_LOOP_STEP:
+      {
+	/* If striding, step by the entire compute volume, otherwise
+	   step by the inner volume.  */
+	unsigned volume = striding ? mask : inner_mask;
+
+	r = oacc_thread_numbers (false, volume, &seq);
+	r = build2 (MULT_EXPR, type, fold_convert (type, r), step);
+      }
+      break;
+
+    case IFN_GOACC_LOOP_OFFSET:
+      if (striding)
+	{
+	  r = oacc_thread_numbers (true, mask, &seq);
+	  r = fold_convert (diff_type, r);
+	}
+      else
+	{
+	  tree inner_size = oacc_thread_numbers (false, inner_mask, &seq);
+	  tree outer_size = oacc_thread_numbers (false, outer_mask, &seq);
+	  tree volume = fold_build2 (MULT_EXPR, TREE_TYPE (inner_size),
+				     inner_size, outer_size);
+
+	  volume = fold_convert (diff_type, volume);
+	  if (chunking)
+	    chunk_size = fold_convert (diff_type, chunk_size);
+	  else
+	    {
+	      tree per = fold_build2 (MULT_EXPR, diff_type, volume, step);
+
+	      chunk_size = build2 (MINUS_EXPR, diff_type, range, dir);
+	      chunk_size = build2 (PLUS_EXPR, diff_type, chunk_size, per);
+	      chunk_size = build2 (TRUNC_DIV_EXPR, diff_type, chunk_size, per);
+	    }
+
+	  tree span = build2 (MULT_EXPR, diff_type, chunk_size,
+			      fold_convert (diff_type, inner_size));
+	  r = oacc_thread_numbers (true, outer_mask, &seq);
+	  r = fold_convert (diff_type, r);
+	  r = build2 (MULT_EXPR, diff_type, r, span);
+
+	  tree inner = oacc_thread_numbers (true, inner_mask, &seq);
+	  inner = fold_convert (diff_type, inner);
+	  r = fold_build2 (PLUS_EXPR, diff_type, r, inner);
+
+	  if (chunking)
+	    {
+	      tree chunk = fold_convert (diff_type, gimple_call_arg (call, 6));
+	      tree per
+		= fold_build2 (MULT_EXPR, diff_type, volume, chunk_size);
+	      per = build2 (MULT_EXPR, diff_type, per, chunk);
+
+	      r = build2 (PLUS_EXPR, diff_type, r, per);
+	    }
+	}
+      r = fold_build2 (MULT_EXPR, diff_type, r, step);
+      if (type != diff_type)
+	r = fold_convert (type, r);
+      break;
+
+    case IFN_GOACC_LOOP_BOUND:
+      if (striding)
+	r = range;
+      else
+	{
+	  tree inner_size = oacc_thread_numbers (false, inner_mask, &seq);
+	  tree outer_size = oacc_thread_numbers (false, outer_mask, &seq);
+	  tree volume = fold_build2 (MULT_EXPR, TREE_TYPE (inner_size),
+				     inner_size, outer_size);
+
+	  volume = fold_convert (diff_type, volume);
+	  if (chunking)
+	    chunk_size = fold_convert (diff_type, chunk_size);
+	  else
+	    {
+	      tree per = fold_build2 (MULT_EXPR, diff_type, volume, step);
+
+	      chunk_size = build2 (MINUS_EXPR, diff_type, range, dir);
+	      chunk_size = build2 (PLUS_EXPR, diff_type, chunk_size, per);
+	      chunk_size = build2 (TRUNC_DIV_EXPR, diff_type, chunk_size, per);
+	    }
+
+	  tree span = build2 (MULT_EXPR, diff_type, chunk_size,
+			      fold_convert (diff_type, inner_size));
+
+	  r = fold_build2 (MULT_EXPR, diff_type, span, step);
+
+	  tree offset = gimple_call_arg (call, 6);
+	  r = build2 (PLUS_EXPR, diff_type, r,
+		      fold_convert (diff_type, offset));
+	  r = build2 (integer_onep (dir) ? MIN_EXPR : MAX_EXPR,
+		      diff_type, r, range);
+	}
+      if (diff_type != type)
+	r = fold_convert (type, r);
+      break;
+    }
+
+  gimplify_assign (lhs, r, &seq);
+
+  pop_gimplify_context (NULL);
+
+  gsi_replace_with_seq (&gsi, seq, true);
+}
+
 /* Validate and update the dimensions for offloaded FN.  ATTRS is the
    raw attribute.  DIMS is an array of dimensions, which is returned.
    Returns the function level dimensionality --  the level at which an
@@ -17642,6 +17921,553 @@ oacc_validate_dims (tree fn, tree attrs,
   return fn_level;
 }
 
+/* Create an empty OpenACC loop structure at LOC.  */
+
+static oacc_loop *
+new_oacc_loop_raw (oacc_loop *parent, location_t loc)
+{
+  oacc_loop *loop = XCNEW (oacc_loop);
+
+  loop->parent = parent;
+  loop->child = loop->sibling = NULL;
+
+  if (parent)
+    {
+      loop->sibling = parent->child;
+      parent->child = loop;
+    }
+
+  loop->loc = loc;
+  loop->marker = NULL;
+  memset (loop->heads, 0, sizeof (loop->heads));
+  memset (loop->tails, 0, sizeof (loop->tails));
+  loop->routine = NULL_TREE;
+
+  loop->mask = loop->flags = 0;
+  loop->chunk_size = 0;
+  loop->head_end = NULL;
+
+  return loop;
+}
+
+/* Create an outermost, dummy OpenACC loop for offloaded function
+   DECL.  */
+
+static oacc_loop *
+new_oacc_loop_outer (tree decl)
+{
+  return new_oacc_loop_raw (NULL, DECL_SOURCE_LOCATION (decl));
+}
+
+/* Start a new OpenACC loop  structure beginning at head marker HEAD.
+   Link into PARENT loop.  Return the new loop.  */
+
+static oacc_loop *
+new_oacc_loop (oacc_loop *parent, gcall *marker)
+{
+  oacc_loop *loop = new_oacc_loop_raw (parent, gimple_location (marker));
+
+  loop->marker = marker;
+  
+  /* TODO: This is where device_type flattening would occur for the loop
+     flags.   */
+
+  loop->flags = TREE_INT_CST_LOW (gimple_call_arg (marker, 3));
+
+  tree chunk_size = integer_zero_node;
+  if (loop->flags & OLF_GANG_STATIC)
+    chunk_size = gimple_call_arg (marker, 4);
+  loop->chunk_size = chunk_size;
+
+  return loop;
+}
+
+/* Create a dummy loop encompassing a call to an OpenACC routine.
+   Extract the routine's partitioning requirements.  */
+
+static void
+new_oacc_loop_routine (oacc_loop *parent, gcall *call, tree decl, tree attrs)
+{
+  oacc_loop *loop = new_oacc_loop_raw (parent, gimple_location (call));
+  int dims[GOMP_DIM_MAX];
+  int level = oacc_validate_dims (decl, attrs, dims);
+
+  gcc_assert (level >= 0);
+
+  loop->marker = call;
+  loop->routine = decl;
+  loop->mask = ((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1)
+		^ (GOMP_DIM_MASK (level) - 1));
+}
+
+/* Finish off the current OpenACC loop ending at tail marker TAIL.
+   Return the parent loop.  */
+
+static oacc_loop *
+finish_oacc_loop (oacc_loop *loop)
+{
+  return loop->parent;
+}
+
+/* Free all OpenACC loop structures within LOOP (inclusive).  */
+
+static void
+free_oacc_loop (oacc_loop *loop)
+{
+  if (loop->sibling)
+    free_oacc_loop (loop->sibling);
+  if (loop->child)
+    free_oacc_loop (loop->child);
+
+  free (loop);
+}
+
+/* Dump out the OpenACC loop head or tail beginning at FROM.  */
+
+static void
+dump_oacc_loop_part (FILE *file, gcall *from, int depth,
+		     const char *title, int level)
+{
+  enum ifn_unique_kind kind
+    = (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (from, 0));
+
+  fprintf (file, "%*s%s-%d:\n", depth * 2, "", title, level);
+  for (gimple_stmt_iterator gsi = gsi_for_stmt (from);;)
+    {
+      gimple *stmt = gsi_stmt (gsi);
+
+      if (is_gimple_call (stmt)
+	  && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_fn (stmt) == IFN_UNIQUE)
+	{
+	  enum ifn_unique_kind k
+	    = ((enum ifn_unique_kind) TREE_INT_CST_LOW
+	       (gimple_call_arg (stmt, 0)));
+
+	  if (k == kind && stmt != from)
+	    break;
+	}
+      print_gimple_stmt (file, stmt, depth * 2 + 2, 0);
+
+      gsi_next (&gsi);
+      while (gsi_end_p (gsi))
+	gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
+    }
+}
+
+/* Dump OpenACC loops LOOP, its siblings and its children.  */
+
+static void
+dump_oacc_loop (FILE *file, oacc_loop *loop, int depth)
+{
+  int ix;
+  
+  fprintf (file, "%*sLoop %x(%x) %s:%u\n", depth * 2, "",
+	   loop->flags, loop->mask,
+	   LOCATION_FILE (loop->loc), LOCATION_LINE (loop->loc));
+
+  if (loop->marker)
+    print_gimple_stmt (file, loop->marker, depth * 2, 0);
+
+  if (loop->routine)
+    fprintf (file, "%*sRoutine %s:%u:%s\n",
+	     depth * 2, "", DECL_SOURCE_FILE (loop->routine),
+	     DECL_SOURCE_LINE (loop->routine),
+	     IDENTIFIER_POINTER (DECL_NAME (loop->routine)));
+
+  for (ix = GOMP_DIM_GANG; ix != GOMP_DIM_MAX; ix++)
+    if (loop->heads[ix])
+      dump_oacc_loop_part (file, loop->heads[ix], depth, "Head", ix);
+  for (ix = GOMP_DIM_MAX; ix--;)
+    if (loop->tails[ix])
+      dump_oacc_loop_part (file, loop->tails[ix], depth, "Tail", ix);
+
+  if (loop->child)
+    dump_oacc_loop (file, loop->child, depth + 1);
+  if (loop->sibling)
+    dump_oacc_loop (file, loop->sibling, depth);
+}
+
+void debug_oacc_loop (oacc_loop *);
+
+/* Dump loops to stderr.  */
+
+DEBUG_FUNCTION void
+debug_oacc_loop (oacc_loop *loop)
+{
+  dump_oacc_loop (stderr, loop, 0);
+}
+
+/* DFS walk of basic blocks BB onwards, creating OpenACC loop
+   structures as we go.  By construction these loops are properly
+   nested.  */
+
+static void
+oacc_loop_discover_walk (oacc_loop *loop, basic_block bb)
+{
+  int marker = 0;
+  int remaining = 0;
+
+  if (bb->flags & BB_VISITED)
+    return;
+
+ follow:
+  bb->flags |= BB_VISITED;
+
+  /* Scan for loop markers.  */
+  for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+       gsi_next (&gsi))
+    {
+      gimple *stmt = gsi_stmt (gsi);
+
+      if (!is_gimple_call (stmt))
+	continue;
+
+      gcall *call = as_a <gcall *> (stmt);
+      
+      /* If this is a routine, make a dummy loop for it.  */
+      if (tree decl = gimple_call_fndecl (call))
+	if (tree attrs = get_oacc_fn_attrib (decl))
+	  {
+	    gcc_assert (!marker);
+	    new_oacc_loop_routine (loop, call, decl, attrs);
+	  }
+
+      if (!gimple_call_internal_p (call))
+	continue;
+
+      if (gimple_call_internal_fn (call) != IFN_UNIQUE)
+	continue;
+
+      enum ifn_unique_kind kind
+	= (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (call, 0));
+      if (kind == IFN_UNIQUE_OACC_HEAD_MARK
+	  || kind == IFN_UNIQUE_OACC_TAIL_MARK)
+	{
+	  if (gimple_call_num_args (call) == 2)
+	    {
+	      gcc_assert (marker && !remaining);
+	      marker = 0;
+	      if (kind == IFN_UNIQUE_OACC_TAIL_MARK)
+		loop = finish_oacc_loop (loop);
+	      else
+		loop->head_end = call;
+	    }
+	  else
+	    {
+	      int count = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+
+	      if (!marker)
+		{
+		  if (kind == IFN_UNIQUE_OACC_HEAD_MARK)
+		    loop = new_oacc_loop (loop, call);
+		  remaining = count;
+		}
+	      gcc_assert (count == remaining);
+	      if (remaining)
+		{
+		  remaining--;
+		  if (kind == IFN_UNIQUE_OACC_HEAD_MARK)
+		    loop->heads[marker] = call;
+		  else
+		    loop->tails[remaining] = call;
+		}
+	      marker++;
+	    }
+	}
+    }
+  if (remaining || marker)
+    {
+      bb = single_succ (bb);
+      gcc_assert (single_pred_p (bb) && !(bb->flags & BB_VISITED));
+      goto follow;
+    }
+
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    oacc_loop_discover_walk (loop, e->dest);
+}
+
+/* LOOP is the first sibling.  Reverse the order in place and return
+   the new first sibling.  Recurse to child loops.  */
+
+static oacc_loop *
+oacc_loop_sibling_nreverse (oacc_loop *loop)
+{
+  oacc_loop *last = NULL;
+  do
+    {
+      if (loop->child)
+	loop->child = oacc_loop_sibling_nreverse  (loop->child);
+
+      oacc_loop *next = loop->sibling;
+      loop->sibling = last;
+      last = loop;
+      loop = next;
+    }
+  while (loop);
+
+  return last;
+}
+
+/* Discover the OpenACC loops marked up by HEAD and TAIL markers for
+   the current function.  */
+
+static oacc_loop *
+oacc_loop_discovery ()
+{
+  basic_block bb;
+  
+  oacc_loop *top = new_oacc_loop_outer (current_function_decl);
+  oacc_loop_discover_walk (top, ENTRY_BLOCK_PTR_FOR_FN (cfun));
+
+  /* The siblings were constructed in reverse order, reverse them so
+     that diagnostics come out in an unsurprising order.  */
+  top = oacc_loop_sibling_nreverse (top);
+
+  /* Reset the visited flags.  */
+  FOR_ALL_BB_FN (bb, cfun)
+    bb->flags &= ~BB_VISITED;
+
+  return top;
+}
+
+/* Transform the abstract internal function markers starting at FROM
+   to be for partitioning level LEVEL.  Stop when we meet another HEAD
+   or TAIL  marker.  */
+
+static void
+oacc_loop_xform_head_tail (gcall *from, int level)
+{
+  enum ifn_unique_kind kind
+    = (enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (from, 0));
+  tree replacement = build_int_cst (unsigned_type_node, level);
+
+  for (gimple_stmt_iterator gsi = gsi_for_stmt (from);;)
+    {
+      gimple *stmt = gsi_stmt (gsi);
+      
+      if (is_gimple_call (stmt)
+	  && gimple_call_internal_p (stmt)
+	  && gimple_call_internal_fn (stmt) == IFN_UNIQUE)
+	{
+	  enum ifn_unique_kind k
+	    = ((enum ifn_unique_kind)
+	       TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));
+
+	  if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN)
+	    *gimple_call_arg_ptr (stmt, 2) = replacement;
+	  else if (k == kind && stmt != from)
+	    break;
+	}
+      gsi_next (&gsi);
+      while (gsi_end_p (gsi))
+	gsi = gsi_start_bb (single_succ (gsi_bb (gsi)));
+    }
+}
+
+/* Transform the IFN_GOACC_LOOP internal functions by providing the
+   determined partitioning mask and chunking argument.  */
+
+static void
+oacc_loop_xform_loop (gcall *end_marker, tree mask_arg, tree chunk_arg)
+{
+  gimple_stmt_iterator gsi = gsi_for_stmt (end_marker);
+  
+  for (;;)
+    {
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+
+	  if (!is_gimple_call (stmt))
+	    continue;
+
+	  gcall *call = as_a <gcall *> (stmt);
+      
+	  if (!gimple_call_internal_p (call))
+	    continue;
+
+	  if (gimple_call_internal_fn (call) != IFN_GOACC_LOOP)
+	    continue;
+
+	  *gimple_call_arg_ptr (call, 5) = mask_arg;
+	  *gimple_call_arg_ptr (call, 4) = chunk_arg;
+	  if (TREE_INT_CST_LOW (gimple_call_arg (call, 0))
+	      == IFN_GOACC_LOOP_BOUND)
+	    return;
+	}
+
+      /* If we didn't see LOOP_BOUND, it should be in the single
+	 successor block.  */
+      basic_block bb = single_succ (gsi_bb (gsi));
+      gsi = gsi_start_bb (bb);
+    }
+}
+
+/* Process the discovered OpenACC loops, setting the correct
+   partitioning level etc.  */
+
+static void
+oacc_loop_process (oacc_loop *loop)
+{
+  if (loop->child)
+    oacc_loop_process (loop->child);
+
+  if (loop->mask && !loop->routine)
+    {
+      int ix;
+      unsigned mask = loop->mask;
+      unsigned dim = GOMP_DIM_GANG;
+      tree mask_arg = build_int_cst (unsigned_type_node, mask);
+      tree chunk_arg = loop->chunk_size;
+
+      oacc_loop_xform_loop (loop->head_end, mask_arg, chunk_arg);
+
+      for (ix = 0; ix != GOMP_DIM_MAX && loop->heads[ix]; ix++)
+	{
+	  gcc_assert (mask);
+
+	  while (!(GOMP_DIM_MASK (dim) & mask))
+	    dim++;
+
+	  oacc_loop_xform_head_tail (loop->heads[ix], dim);
+	  oacc_loop_xform_head_tail (loop->tails[ix], dim);
+
+	  mask ^= GOMP_DIM_MASK (dim);
+	}
+    }
+
+  if (loop->sibling)
+    oacc_loop_process (loop->sibling);
+}
+
+/* Walk the OpenACC loop hierarchy checking and assigning the
+   programmer-specified partitionings.  OUTER_MASK is the partitioning
+   this loop is contained within.  Return the partitioning mask used within
+   this loop nest.  */
+
+static unsigned
+oacc_loop_fixed_partitions (oacc_loop *loop, unsigned outer_mask)
+{
+  unsigned this_mask = loop->mask;
+  bool has_auto = false;
+  bool noisy = true;
+
+#ifdef ACCEL_COMPILER
+  /* When device_type is supported, we want the device compiler to be
+     noisy, if the loop parameters are device_type-specific.  */
+  noisy = false;
+#endif
+
+  if (!loop->routine)
+    {
+      bool auto_par = (loop->flags & OLF_AUTO) != 0;
+      bool seq_par = (loop->flags & OLF_SEQ) != 0;
+
+      this_mask = ((loop->flags >> OLF_DIM_BASE)
+		   & (GOMP_DIM_MASK (GOMP_DIM_MAX) - 1));
+
+      if ((this_mask != 0) + auto_par + seq_par > 1)
+	{
+	  if (noisy)
+	    error_at (loop->loc,
+		      seq_par
+		      ? "%<seq%> overrides other OpenACC loop specifiers"
+		      : "%<auto%> conflicts with other OpenACC loop specifiers");
+	  auto_par = false;
+	  loop->flags &= ~OLF_AUTO;
+	  if (seq_par)
+	    {
+	      loop->flags &=
+		~((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE);
+	      this_mask = 0;
+	    }
+	}
+      if (auto_par && (loop->flags & OLF_INDEPENDENT))
+	has_auto = true;
+    }
+
+  if (this_mask & outer_mask)
+    {
+      const oacc_loop *outer;
+      for (outer = loop->parent; outer; outer = outer->parent)
+	if (outer->mask & this_mask)
+	  break;
+
+      if (noisy)
+	{
+	  if (outer)
+	    {
+	      error_at (loop->loc,
+			"%s uses same OpenACC parallelism as containing loop",
+			loop->routine ? "routine call" : "inner loop");
+	      inform (outer->loc, "containing loop here");
+	    }
+	  else
+	    error_at (loop->loc,
+		      "%s uses OpenACC parallelism disallowed by containing routine",
+		      loop->routine ? "routine call" : "loop");
+      
+	  if (loop->routine)
+	    inform (DECL_SOURCE_LOCATION (loop->routine),
+		    "routine %qD declared here", loop->routine);
+	}
+      this_mask &= ~outer_mask;
+    }
+  else
+    {
+      unsigned outermost = this_mask & -this_mask;
+
+      if (outermost && outermost <= outer_mask)
+	{
+	  if (noisy)
+	    {
+	      error_at (loop->loc,
+			"incorrectly nested OpenACC loop parallelism");
+
+	      const oacc_loop *outer;
+	      for (outer = loop->parent;
+		   outer->flags && outer->flags < outermost;
+		   outer = outer->parent)
+		continue;
+	      inform (outer->loc, "containing loop here");
+	    }
+
+	  this_mask &= ~outermost;
+	}
+    }
+
+  loop->mask = this_mask;
+
+  if (loop->child
+      && oacc_loop_fixed_partitions (loop->child, outer_mask | this_mask))
+    has_auto = true;
+
+  if (loop->sibling
+      && oacc_loop_fixed_partitions (loop->sibling, outer_mask))
+    has_auto = true;
+
+  return has_auto;
+}
+
+/* Walk the OpenACC loop hierarchy to check and assign partitioning
+   axes.  */
+
+static void
+oacc_loop_partition (oacc_loop *loop, int fn_level)
+{
+  unsigned outer_mask = 0;
+
+  if (fn_level >= 0)
+    outer_mask = GOMP_DIM_MASK (fn_level) - 1;
+
+  oacc_loop_fixed_partitions (loop, outer_mask);
+}
+
 /* Default fork/join early expander.  Delete the function calls if
    there is no RTL expander.  */
 
@@ -17669,8 +18495,110 @@ execute_oacc_device_lower ()
     /* Not an offloaded function.  */
     return 0;
 
-  oacc_validate_dims (current_function_decl, attrs, dims);
-  
+  int fn_level = oacc_validate_dims (current_function_decl, attrs, dims);
+
+  /* Discover, partition and process the loops.  */
+  oacc_loop *loops = oacc_loop_discovery ();
+  oacc_loop_partition (loops, fn_level);
+  oacc_loop_process (loops);
+  if (dump_file)
+    {
+      fprintf (dump_file, "OpenACC loops\n");
+      dump_oacc_loop (dump_file, loops, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  /* Now lower internal loop functions to target-specific code
+     sequences.  */
+  basic_block bb;
+  FOR_ALL_BB_FN (bb, cfun)
+    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);)
+      {
+	gimple *stmt = gsi_stmt (gsi);
+	if (!is_gimple_call (stmt))
+	  {
+	    gsi_next (&gsi);
+	    continue;
+	  }
+
+	gcall *call = as_a <gcall *> (stmt);
+	if (!gimple_call_internal_p (call))
+	  {
+	    gsi_next (&gsi);
+	    continue;
+	  }
+
+	/* Rewind to allow rescan.  */
+	gsi_prev (&gsi);
+	bool rescan = false, remove = false;
+	enum  internal_fn ifn_code = gimple_call_internal_fn (call);
+
+	switch (ifn_code)
+	  {
+	  default: break;
+
+	  case IFN_GOACC_LOOP:
+	    oacc_xform_loop (call);
+	    rescan = true;
+	    break;
+
+	  case IFN_UNIQUE:
+	    {
+	      enum ifn_unique_kind kind
+		= ((enum ifn_unique_kind)
+		   TREE_INT_CST_LOW (gimple_call_arg (call, 0)));
+
+	      switch (kind)
+		{
+		default:
+		  gcc_unreachable ();
+
+		case IFN_UNIQUE_OACC_FORK:
+		case IFN_UNIQUE_OACC_JOIN:
+		  if (integer_minus_onep (gimple_call_arg (call, 2)))
+		    remove = true;
+		  else if (!targetm.goacc.fork_join
+			   (call, dims, kind == IFN_UNIQUE_OACC_FORK))
+		    remove = true;
+		  break;
+
+		case IFN_UNIQUE_OACC_HEAD_MARK:
+		case IFN_UNIQUE_OACC_TAIL_MARK:
+		  remove = true;
+		  break;
+		}
+	      break;
+	    }
+	  }
+
+	if (gsi_end_p (gsi))
+	  /* We rewound past the beginning of the BB.  */
+	  gsi = gsi_start_bb (bb);
+	else
+	  /* Undo the rewind.  */
+	  gsi_next (&gsi);
+
+	if (remove)
+	  {
+	    if (gimple_vdef (call))
+	      replace_uses_by (gimple_vdef (call), gimple_vuse (call));
+	    if (gimple_call_lhs (call))
+	      {
+		/* Propagate the data dependency var.  */
+		gimple *ass = gimple_build_assign (gimple_call_lhs (call),
+						   gimple_call_arg (call, 1));
+		gsi_replace (&gsi, ass,  false);
+	      }
+	    else
+	      gsi_remove (&gsi, true);
+	  }
+	else if (!rescan)
+	  /* If not rescanning, advance over the call.  */
+	  gsi_next (&gsi);
+      }
+
+  free_oacc_loop (loops);
+
   return 0;
 }
 
Index: gcc/target-insns.def
===================================================================
--- gcc/target-insns.def	(revision 229465)
+++ gcc/target-insns.def	(working copy)
@@ -64,6 +64,8 @@ DEF_TARGET_INSN (memory_barrier, (void))
 DEF_TARGET_INSN (movstr, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (nonlocal_goto, (rtx x0, rtx x1, rtx x2, rtx x3))
 DEF_TARGET_INSN (nonlocal_goto_receiver, (void))
+DEF_TARGET_INSN (oacc_dim_pos, (rtx x0, rtx x1))
+DEF_TARGET_INSN (oacc_dim_size, (rtx x0, rtx x1))
 DEF_TARGET_INSN (oacc_fork, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (oacc_join, (rtx x0, rtx x1, rtx x2))
 DEF_TARGET_INSN (prefetch, (rtx x0, rtx x1, rtx x2))
Index: gcc/target.def
===================================================================
--- gcc/target.def	(revision 229465)
+++ gcc/target.def	(working copy)
@@ -1660,11 +1660,13 @@ default_goacc_validate_dims)
 
 DEFHOOK
 (fork_join,
-"This hook should convert IFN_GOACC_FORK and IFN_GOACC_JOIN function\n\
-calls to target-specific gimple.  It is executed during the\n\
-oacc_device_lower pass.  It should return true, if the functions\n\
-should be deleted.  The default hook returns true, if there are no\n\
-RTL expanders for them.",
+"This hook can be used to convert IFN_GOACC_FORK and IFN_GOACC_JOIN\n\
+function calls to target-specific gimple, or indicate whether they\n\
+should be retained.  It is executed during the oacc_device_lower pass.\n\
+It should return true if the call should be retained, and false if it\n\
+is to be deleted (either because target-specific gimple has been\n\
+inserted before it, or because there is no need for it).  The default\n\
+hook returns false if there are no RTL expanders for them.",
 bool, (gcall *call, const int *dims, bool is_fork),
 default_goacc_fork_join)
 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-27 14:03           ` Nathan Sidwell
@ 2015-10-28  5:45             ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-28  5:45 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: GCC Patches, Bernd Schmidt, Jason Merrill, Joseph S. Myers,
	Richard Guenther

[-- Attachment #1: Type: text/plain, Size: 408 bytes --]

On 10/27/15 07:02, Nathan Sidwell wrote:

> yeah, I noticed diff got confused.  (I'm not sure the above suggestion will
> resolve it, but we can give it a go.)

This is what I've committed.  This breaks the libgomp reduction tests on nvidia, 
because there's now a discrepancy between loop iteration assignment and live
threads of execution.  It'll get resolved in applying the ptx backend patch.


nathan
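
For concreteness, here is a rough model of the per-thread iteration
assignment that the committed GOACC_LOOP lowering produces in the simplest
case: a forward loop over [0, range), a single partitioned (gang) dimension,
no chunking.  The function and variable names are invented; this is a sketch
inferred from oacc_xform_loop, not code from either patch.

/* 'tid' is this thread's position in the partitioned dimension and
   'threads' that dimension's size.  */

static void
goacc_loop_assignment_model (int tid, int threads, int range, int step)
{
  int per = threads * step;
  int chunk_size = (range + per - 1) / per;  /* ceil (range / (threads * step)) */
  int span = chunk_size * step;

  int offset = tid * span;                   /* GOACC_LOOP_OFFSET */
  int bound = offset + span;                 /* GOACC_LOOP_BOUND */
  if (bound > range)
    bound = range;

  for (; offset < bound; offset += step)     /* step from GOACC_LOOP_STEP */
    {
      int v = offset;                        /* V = B + offset, with B == 0 */
      (void) v;                              /* the loop body would use v */
    }
}

Each thread thus owns one contiguous slice of roughly a 1/threads share of
the iterations.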


[-- Attachment #2: 07-trunk-loop-mark-1027.patch --]
[-- Type: text/x-patch, Size: 38558 bytes --]

2015-10-27  Nathan Sidwell  <nathan@codesourcery.com>

	* omp-low.c (struct omp_context): Remove gwv_below, gwv_this
	fields.
	(is_oacc_parallel, is_oacc_kernels): New.
	(enclosing_target_ctx): May return NULL.
	(ctx_in_oacc_kernels_region): New.
	(check_oacc_kernel_gwv): New.
	(oacc_loop_or_target_p): Delete.
	(scan_omp_for): Don't calculate gwv mask.  Check parallel clause
	operands.  Strip reductions from kernels.
	(scan_omp_target): Don't calculate gwv mask.
	(lower_oacc_head_mark, lower_oacc_loop_marker,
	lower_oacc_head_tail): New.
	(struct oacc_collapse): New.
	(expand_oacc_collapse_init, expand_oacc_collapse_vars): New.
	(expand_omp_for_static_nochunk, expand_omp_for_static_chunk):
	Remove OpenACC handling.
	(expand_oacc_for): New.
	(expand_omp_for): Call expand_oacc_for.
	(lower_omp_for): Call lower_oacc_head_tail.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 229466)
+++ gcc/omp-low.c	(working copy)
@@ -200,14 +200,6 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
-
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     levels below this one.  */
-  int gwv_below;
-  /* For OpenACC loops, a mask of gang, worker and vector used at
-     this level and above.  For parallel and kernels clauses, a mask
-     indicating which of num_gangs/num_workers/num_vectors was used.  */
-  int gwv_this;
 };
 
 /* A structure holding the elements of:
@@ -299,6 +291,28 @@ static gphi *find_phi_with_arg_on_edge (
       *handled_ops_p = false; \
       break;
 
+/* Return true if CTX corresponds to an oacc parallel region.  */
+
+static bool
+is_oacc_parallel (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_PARALLEL));
+}
+
+/* Return true if CTX corresponds to an oacc kernels region.  */
+
+static bool
+is_oacc_kernels (omp_context *ctx)
+{
+  enum gimple_code outer_type = gimple_code (ctx->stmt);
+  return ((outer_type == GIMPLE_OMP_TARGET)
+	  && (gimple_omp_target_kind (ctx->stmt)
+	      == GF_OMP_TARGET_KIND_OACC_KERNELS));
+}
+
 /* Helper function to get the name of the array containing the partial
    reductions for OpenACC reductions.  */
 static const char *
@@ -2933,28 +2947,95 @@ finish_taskreg_scan (omp_context *ctx)
     }
 }
 
+/* Find the enclosing offload context.  */
 
 static omp_context *
 enclosing_target_ctx (omp_context *ctx)
 {
-  while (ctx != NULL
-	 && gimple_code (ctx->stmt) != GIMPLE_OMP_TARGET)
-    ctx = ctx->outer;
-  gcc_assert (ctx != NULL);
+  for (; ctx; ctx = ctx->outer)
+    if (gimple_code (ctx->stmt) == GIMPLE_OMP_TARGET)
+      break;
+
   return ctx;
 }
 
+/* Return true if CTX is part of an oacc kernels region.  */
+
 static bool
-oacc_loop_or_target_p (gimple *stmt)
+ctx_in_oacc_kernels_region (omp_context *ctx)
 {
-  enum gimple_code outer_type = gimple_code (stmt);
-  return ((outer_type == GIMPLE_OMP_TARGET
-	   && ((gimple_omp_target_kind (stmt)
-		== GF_OMP_TARGET_KIND_OACC_PARALLEL)
-	       || (gimple_omp_target_kind (stmt)
-		   == GF_OMP_TARGET_KIND_OACC_KERNELS)))
-	  || (outer_type == GIMPLE_OMP_FOR
-	      && gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_OACC_LOOP));
+  for (; ctx != NULL; ctx = ctx->outer)
+    {
+      gimple *stmt = ctx->stmt;
+      if (gimple_code (stmt) == GIMPLE_OMP_TARGET
+	  && gimple_omp_target_kind (stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS)
+	return true;
+    }
+
+  return false;
+}
+
+/* Check the parallelism clauses inside a kernels region.
+   Until kernels handling moves to use the same loop indirection
+   scheme as parallel, we need to do this checking early.  */
+
+static unsigned
+check_oacc_kernel_gwv (gomp_for *stmt, omp_context *ctx)
+{
+  bool checking = true;
+  unsigned outer_mask = 0;
+  unsigned this_mask = 0;
+  bool has_seq = false, has_auto = false;
+
+  if (ctx->outer)
+    outer_mask = check_oacc_kernel_gwv (NULL, ctx->outer);
+  if (!stmt)
+    {
+      checking = false;
+      if (gimple_code (ctx->stmt) != GIMPLE_OMP_FOR)
+	return outer_mask;
+      stmt = as_a <gomp_for *> (ctx->stmt);
+    }
+
+  for (tree c = gimple_omp_for_clauses (stmt); c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_GANG);
+	  break;
+	case OMP_CLAUSE_WORKER:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_WORKER);
+	  break;
+	case OMP_CLAUSE_VECTOR:
+	  this_mask |= GOMP_DIM_MASK (GOMP_DIM_VECTOR);
+	  break;
+	case OMP_CLAUSE_SEQ:
+	  has_seq = true;
+	  break;
+	case OMP_CLAUSE_AUTO:
+	  has_auto = true;
+	  break;
+	default:
+	  break;
+	}
+    }
+
+  if (checking)
+    {
+      if (has_seq && (this_mask || has_auto))
+	error_at (gimple_location (stmt), "%<seq%> overrides other"
+		  " OpenACC loop specifiers");
+      else if (has_auto && this_mask)
+	error_at (gimple_location (stmt), "%<auto%> conflicts with other"
+		  " OpenACC loop specifiers");
+
+      if (this_mask & outer_mask)
+	error_at (gimple_location (stmt), "inner loop uses same"
+		  " OpenACC parallelism as containing loop");
+    }
+
+  return outer_mask | this_mask;
 }
 
 /* Scan a GIMPLE_OMP_FOR.  */
@@ -2962,52 +3043,62 @@ oacc_loop_or_target_p (gimple *stmt)
 static void
 scan_omp_for (gomp_for *stmt, omp_context *outer_ctx)
 {
-  enum gimple_code outer_type = GIMPLE_ERROR_MARK;
   omp_context *ctx;
   size_t i;
   tree clauses = gimple_omp_for_clauses (stmt);
 
-  if (outer_ctx)
-    outer_type = gimple_code (outer_ctx->stmt);
-
   ctx = new_omp_context (stmt, outer_ctx);
 
   if (is_gimple_omp_oacc (stmt))
     {
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	ctx->gwv_this = outer_ctx->gwv_this;
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  int val;
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_GANG)
-	    val = MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WORKER)
-	    val = MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR)
-	    val = MASK_VECTOR;
-	  else
-	    continue;
-	  ctx->gwv_this |= val;
-	  if (!outer_ctx)
-	    {
-	      /* Skip; not nested inside a region.  */
-	      continue;
-	    }
-	  if (!oacc_loop_or_target_p (outer_ctx->stmt))
+      omp_context *tgt = enclosing_target_ctx (outer_ctx);
+
+      if (!tgt || is_oacc_parallel (tgt))
+	for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+	  {
+	    char const *check = NULL;
+
+	    switch (OMP_CLAUSE_CODE (c))
+	      {
+	      case OMP_CLAUSE_GANG:
+		check = "gang";
+		break;
+
+	      case OMP_CLAUSE_WORKER:
+		check = "worker";
+		break;
+
+	      case OMP_CLAUSE_VECTOR:
+		check = "vector";
+		break;
+
+	      default:
+		break;
+	      }
+
+	    if (check && OMP_CLAUSE_OPERAND (c, 0))
+	      error_at (gimple_location (stmt),
+			"argument not permitted on %qs clause in"
+			" OpenACC %<parallel%>", check);
+	  }
+
+      if (tgt && is_oacc_kernels (tgt))
+	{
+	  /* Strip out reductions, as they are not handled yet.  */
+	  tree *prev_ptr = &clauses;
+
+	  while (tree probe = *prev_ptr)
 	    {
-	      /* Skip; not nested inside an OpenACC region.  */
-	      continue;
-	    }
-	  if (outer_type == GIMPLE_OMP_FOR)
-	    outer_ctx->gwv_below |= val;
-	  if (OMP_CLAUSE_OPERAND (c, 0) != NULL_TREE)
-	    {
-	      omp_context *enclosing = enclosing_target_ctx (outer_ctx);
-	      if (gimple_omp_target_kind (enclosing->stmt)
-		  == GF_OMP_TARGET_KIND_OACC_PARALLEL)
-		error_at (gimple_location (stmt),
-			  "no arguments allowed to gang, worker and vector clauses inside parallel");
+	      tree *next_ptr = &OMP_CLAUSE_CHAIN (probe);
+	      
+	      if (OMP_CLAUSE_CODE (probe) == OMP_CLAUSE_REDUCTION)
+		*prev_ptr = *next_ptr;
+	      else
+		prev_ptr = next_ptr;
 	    }
+
+	  gimple_omp_for_set_clauses (stmt, clauses);
+	  check_oacc_kernel_gwv (stmt, ctx);
 	}
     }
 
@@ -3022,19 +3113,6 @@ scan_omp_for (gomp_for *stmt, omp_contex
       scan_omp_op (gimple_omp_for_incr_ptr (stmt, i), ctx);
     }
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
-
-  if (is_gimple_omp_oacc (stmt))
-    {
-      if (ctx->gwv_this & ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector may occur only once in a loop nest");
-      else if (ctx->gwv_below != 0
-	       && ctx->gwv_this > ctx->gwv_below)
-	error_at (gimple_location (stmt),
-		  "gang, worker and vector must occur in this order in a loop nest");
-      if (outer_ctx && outer_type == GIMPLE_OMP_FOR)
-	outer_ctx->gwv_below |= ctx->gwv_below;
-    }
 }
 
 /* Scan an OpenMP sections directive.  */
@@ -3105,19 +3183,6 @@ scan_omp_target (gomp_target *stmt, omp_
       gimple_omp_target_set_child_fn (stmt, ctx->cb.dst_fn);
     }
 
-  if (is_gimple_omp_oacc (stmt))
-    {
-      for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
-	{
-	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_GANGS)
-	    ctx->gwv_this |= MASK_GANG;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_NUM_WORKERS)
-	    ctx->gwv_this |= MASK_WORKER;
-	  else if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_VECTOR_LENGTH)
-	    ctx->gwv_this |= MASK_VECTOR;
-	}
-    }
-
   scan_sharing_clauses (clauses, ctx);
   scan_omp (gimple_omp_body_ptr (stmt), ctx);
 
@@ -5850,6 +5915,176 @@ lower_send_shared_vars (gimple_seq *ilis
     }
 }
 
+/* Emit an OpenACC head marker call, encapsulating the partitioning and
+   other information that must be processed by the target compiler.
+   Return the maximum number of dimensions the associated loop might
+   be partitioned over.  */
+
+static unsigned
+lower_oacc_head_mark (location_t loc, tree ddvar, tree clauses,
+		      gimple_seq *seq, omp_context *ctx)
+{
+  unsigned levels = 0;
+  unsigned tag = 0;
+  tree gang_static = NULL_TREE;
+  auto_vec<tree, 5> args;
+
+  args.quick_push (build_int_cst
+		   (integer_type_node, IFN_UNIQUE_OACC_HEAD_MARK));
+  args.quick_push (ddvar);
+  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    {
+      switch (OMP_CLAUSE_CODE (c))
+	{
+	case OMP_CLAUSE_GANG:
+	  tag |= OLF_DIM_GANG;
+	  gang_static = OMP_CLAUSE_GANG_STATIC_EXPR (c);
+	  /* static:* is represented by -1, and we can ignore it, as
+	     scheduling is always static.  */
+	  if (gang_static && integer_minus_onep (gang_static))
+	    gang_static = NULL_TREE;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_WORKER:
+	  tag |= OLF_DIM_WORKER;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_VECTOR:
+	  tag |= OLF_DIM_VECTOR;
+	  levels++;
+	  break;
+
+	case OMP_CLAUSE_SEQ:
+	  tag |= OLF_SEQ;
+	  break;
+
+	case OMP_CLAUSE_AUTO:
+	  tag |= OLF_AUTO;
+	  break;
+
+	case OMP_CLAUSE_INDEPENDENT:
+	  tag |= OLF_INDEPENDENT;
+	  break;
+
+	default:
+	  continue;
+	}
+    }
+
+  if (gang_static)
+    {
+      if (DECL_P (gang_static))
+	gang_static = build_outer_var_ref (gang_static, ctx);
+      tag |= OLF_GANG_STATIC;
+    }
+
+  /* In a parallel region, loops are implicitly INDEPENDENT.  */
+  omp_context *tgt = enclosing_target_ctx (ctx);
+  if (!tgt || is_oacc_parallel (tgt))
+    tag |= OLF_INDEPENDENT;
+
+  /* A loop lacking SEQ, GANG, WORKER and/or VECTOR is implicitly AUTO.  */
+  if (!(tag & (((GOMP_DIM_MASK (GOMP_DIM_MAX) - 1) << OLF_DIM_BASE)
+	       | OLF_SEQ)))
+      tag |= OLF_AUTO;
+
+  /* Ensure at least one level.  */
+  if (!levels)
+    levels++;
+
+  args.quick_push (build_int_cst (integer_type_node, levels));
+  args.quick_push (build_int_cst (integer_type_node, tag));
+  if (gang_static)
+    args.quick_push (gang_static);
+
+  gcall *call = gimple_build_call_internal_vec (IFN_UNIQUE, args);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+
+  return levels;
+}
+
+/* Emit an OpenACC loop head or tail marker to SEQ.  LEVEL is the
+   partitioning level of the enclosed region.  */ 
+
+static void
+lower_oacc_loop_marker (location_t loc, tree ddvar, bool head,
+			tree tofollow, gimple_seq *seq)
+{
+  int marker_kind = (head ? IFN_UNIQUE_OACC_HEAD_MARK
+		     : IFN_UNIQUE_OACC_TAIL_MARK);
+  tree marker = build_int_cst (integer_type_node, marker_kind);
+  int nargs = 2 + (tofollow != NULL_TREE);
+  gcall *call = gimple_build_call_internal (IFN_UNIQUE, nargs,
+					    marker, ddvar, tofollow);
+  gimple_set_location (call, loc);
+  gimple_set_lhs (call, ddvar);
+  gimple_seq_add_stmt (seq, call);
+}
+
+/* Generate the before and after OpenACC loop sequences.  CLAUSES are
+   the loop clauses, from which we extract reductions.  Initialize
+   HEAD and TAIL.  */
+
+static void
+lower_oacc_head_tail (location_t loc, tree clauses,
+		      gimple_seq *head, gimple_seq *tail, omp_context *ctx)
+{
+  bool inner = false;
+  tree ddvar = create_tmp_var (integer_type_node, ".data_dep");
+  gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node));
+
+  unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx);
+  if (!count)
+    lower_oacc_loop_marker (loc, ddvar, false, integer_zero_node, tail);
+  
+  tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK);
+  tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
+
+  for (unsigned done = 1; count; count--, done++)
+    {
+      gimple_seq fork_seq = NULL;
+      gimple_seq join_seq = NULL;
+
+      tree place = build_int_cst (integer_type_node, -1);
+      gcall *fork = gimple_build_call_internal (IFN_UNIQUE, 3,
+						fork_kind, ddvar, place);
+      gimple_set_location (fork, loc);
+      gimple_set_lhs (fork, ddvar);
+
+      gcall *join = gimple_build_call_internal (IFN_UNIQUE, 3,
+						join_kind, ddvar, place);
+      gimple_set_location (join, loc);
+      gimple_set_lhs (join, ddvar);
+
+      /* Mark the beginning of this level sequence.  */
+      if (inner)
+	lower_oacc_loop_marker (loc, ddvar, true,
+				build_int_cst (integer_type_node, count),
+				&fork_seq);
+      lower_oacc_loop_marker (loc, ddvar, false,
+			      build_int_cst (integer_type_node, done),
+			      &join_seq);
+
+      gimple_seq_add_stmt (&fork_seq, fork);
+      gimple_seq_add_stmt (&join_seq, join);
+
+      /* Append this level to head. */
+      gimple_seq_add_seq (head, fork_seq);
+      /* Prepend it to tail.  */
+      gimple_seq_add_seq (&join_seq, *tail);
+      *tail = join_seq;
+
+      inner = true;
+    }
+
+  /* Mark the end of the sequence.  */
+  lower_oacc_loop_marker (loc, ddvar, true, NULL_TREE, head);
+  lower_oacc_loop_marker (loc, ddvar, false, NULL_TREE, tail);
+}
 
 /* A convenience function to build an empty GIMPLE_COND with just the
    condition.  */
@@ -6762,6 +6997,149 @@ expand_omp_taskreg (struct omp_region *r
     update_ssa (TODO_update_ssa_only_virtuals);
 }
 
+/* Information about members of an OpenACC collapsed loop nest.  */
+
+struct oacc_collapse
+{
+  tree base;  /* Base value. */
+  tree iters; /* Number of steps.  */
+  tree step;  /* step size.  */
+};
+
+/* Helper for expand_oacc_for.  Determine collapsed loop information.
+   Fill in COUNTS array.  Emit any initialization code before GSI.
+   Return the calculated outer loop bound of BOUND_TYPE.  */
+
+static tree
+expand_oacc_collapse_init (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   oacc_collapse *counts, tree bound_type)
+{
+  tree total = build_int_cst (bound_type, 1);
+  int ix;
+  
+  gcc_assert (integer_onep (fd->loop.step));
+  gcc_assert (integer_zerop (fd->loop.n1));
+
+  for (ix = 0; ix != fd->collapse; ix++)
+    {
+      const omp_for_data_loop *loop = &fd->loops[ix];
+
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = iter_type;
+      tree plus_type = iter_type;
+
+      gcc_assert (loop->cond_code == fd->loop.cond_code);
+      
+      if (POINTER_TYPE_P (iter_type))
+	plus_type = sizetype;
+      if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+	diff_type = signed_type_for (diff_type);
+
+      tree b = loop->n1;
+      tree e = loop->n2;
+      tree s = loop->step;
+      bool up = loop->cond_code == LT_EXPR;
+      tree dir = build_int_cst (diff_type, up ? +1 : -1);
+      bool negating;
+      tree expr;
+
+      b = force_gimple_operand_gsi (gsi, b, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+      e = force_gimple_operand_gsi (gsi, e, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Convert the step, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+      s = fold_convert (diff_type, s);
+      if (negating)
+	s = fold_build1 (NEGATE_EXPR, diff_type, s);
+      s = force_gimple_operand_gsi (gsi, s, true, NULL_TREE,
+				    true, GSI_SAME_STMT);
+
+      /* Determine the range, avoiding possible unsigned->signed overflow. */
+      negating = !up && TYPE_UNSIGNED (iter_type);
+      expr = fold_build2 (MINUS_EXPR, plus_type,
+			  fold_convert (plus_type, negating ? b : e),
+			  fold_convert (plus_type, negating ? e : b));
+      expr = fold_convert (diff_type, expr);
+      if (negating)
+	expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+      tree range = force_gimple_operand_gsi
+	(gsi, expr, true, NULL_TREE, true, GSI_SAME_STMT);
+
+      /* Determine number of iterations.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+
+      tree iters = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					     true, GSI_SAME_STMT);
+
+      counts[ix].base = b;
+      counts[ix].iters = iters;
+      counts[ix].step = s;
+
+      total = fold_build2 (MULT_EXPR, bound_type, total,
+			   fold_convert (bound_type, iters));
+    }
+
+  return total;
+}
+
+/* Emit initializers for collapsed loop members.  IVAR is the outer
+   loop iteration variable, from which collapsed loop iteration values
+   are calculated.  COUNTS array has been initialized by
+   expand_oacc_collapse_init.  */
+
+static void
+expand_oacc_collapse_vars (const struct omp_for_data *fd,
+			   gimple_stmt_iterator *gsi,
+			   const oacc_collapse *counts, tree ivar)
+{
+  tree ivar_type = TREE_TYPE (ivar);
+
+  /*  The most rapidly changing iteration variable is the innermost
+      one.  */
+  for (int ix = fd->collapse; ix--;)
+    {
+      const omp_for_data_loop *loop = &fd->loops[ix];
+      const oacc_collapse *collapse = &counts[ix];
+      tree iter_type = TREE_TYPE (loop->v);
+      tree diff_type = TREE_TYPE (collapse->step);
+      tree plus_type = iter_type;
+      enum tree_code plus_code = PLUS_EXPR;
+      tree expr;
+
+      if (POINTER_TYPE_P (iter_type))
+	{
+	  plus_code = POINTER_PLUS_EXPR;
+	  plus_type = sizetype;
+	}
+
+      expr = fold_build2 (TRUNC_MOD_EXPR, ivar_type, ivar,
+			  fold_convert (ivar_type, collapse->iters));
+      expr = fold_build2 (MULT_EXPR, diff_type, fold_convert (diff_type, expr),
+			  collapse->step);
+      expr = fold_build2 (plus_code, iter_type, collapse->base,
+			  fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      gassign *ass = gimple_build_assign (loop->v, expr);
+      gsi_insert_before (gsi, ass, GSI_SAME_STMT);
+
+      if (ix)
+	{
+	  expr = fold_build2 (TRUNC_DIV_EXPR, ivar_type, ivar,
+			      fold_convert (ivar_type, collapse->iters));
+	  ivar = force_gimple_operand_gsi (gsi, expr, true, NULL_TREE,
+					   true, GSI_SAME_STMT);
+	}
+    }
+}
+
 
 /* Helper function for expand_omp_{for_*,simd}.  If this is the outermost
    of the combined collapse > 1 loop constructs, generate code like:
@@ -8408,10 +8786,6 @@ expand_omp_for_static_nochunk (struct om
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8504,10 +8878,6 @@ expand_omp_for_static_nochunk (struct om
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -8734,10 +9104,7 @@ expand_omp_for_static_nochunk (struct om
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -8875,10 +9242,6 @@ expand_omp_for_static_chunk (struct omp_
   tree *counts = NULL;
   tree n1, n2, step;
 
-  gcc_checking_assert ((gimple_omp_for_kind (fd->for_stmt)
-			!= GF_OMP_FOR_KIND_OACC_LOOP)
-		       || !inner_stmt);
-
   itype = type = TREE_TYPE (fd->loop.v);
   if (POINTER_TYPE_P (type))
     itype = signed_type_for (type);
@@ -8975,10 +9338,6 @@ expand_omp_for_static_chunk (struct omp_
       nthreads = builtin_decl_explicit (BUILT_IN_OMP_GET_NUM_TEAMS);
       threadid = builtin_decl_explicit (BUILT_IN_OMP_GET_TEAM_NUM);
       break;
-    case GF_OMP_FOR_KIND_OACC_LOOP:
-      nthreads = builtin_decl_explicit (BUILT_IN_GOACC_GET_NUM_THREADS);
-      threadid = builtin_decl_explicit (BUILT_IN_GOACC_GET_THREAD_NUM);
-      break;
     default:
       gcc_unreachable ();
     }
@@ -9238,10 +9597,7 @@ expand_omp_for_static_chunk (struct omp_
   if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	gcc_checking_assert (t == NULL_TREE);
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -10291,6 +10647,410 @@ expand_omp_taskloop_for_inner (struct om
     }
 }
 
+/* A subroutine of expand_omp_for.  Generate code for an OpenACC
+   partitioned loop.  The lowering here is abstracted, in that the
+   loop parameters are passed through internal functions, which are
+   further lowered by oacc_device_lower, once we get to the target
+   compiler.  The loop is of the form:
+
+   for (V = B; V LTGT E; V += S) {BODY}
+
+   where LTGT is < or >.  We may have a specified chunking size, CHUNKING
+   (constant 0 for no chunking) and we will have a GWV partitioning
+   mask, specifying dimensions over which the loop is to be
+   partitioned (see note below).  We generate code that looks like:
+
+   <entry_bb> [incoming FALL->body, BRANCH->exit]
+     typedef signedintify (typeof (V)) T;  // underlying signed integral type
+     T range = E - B;
+     T chunk_no = 0;
+     T DIR = LTGT == '<' ? +1 : -1;
+     T chunk_max = GOACC_LOOP_CHUNKS (dir, range, S, CHUNK_SIZE, GWV);
+     T step = GOACC_LOOP_STEP (dir, range, S, CHUNK_SIZE, GWV);
+
+   <head_bb> [created by splitting end of entry_bb]
+     T offset = GOACC_LOOP_OFFSET (dir, range, S, CHUNK_SIZE, GWV, chunk_no);
+     T bound = GOACC_LOOP_BOUND (dir, range, S, CHUNK_SIZE, GWV, offset);
+     if (!(offset LTGT bound)) goto bottom_bb;
+
+   <body_bb> [incoming]
+     V = B + offset;
+     {BODY}
+
+   <cont_bb> [incoming, may == body_bb FALL->exit_bb, BRANCH->body_bb]
+     offset += step;
+     if (offset LTGT bound) goto body_bb; [*]
+
+   <bottom_bb> [created by splitting start of exit_bb] insert BRANCH->head_bb
+     chunk_no++;
+     if (chunk_no < chunk_max) goto head_bb;
+
+   <exit_bb> [incoming]
+     V = B + ((range -/+ 1) / S +/- 1) * S [*]
+
+   [*] Needed if V live at end of loop
+
+   Note: CHUNKING & GWV mask are specified explicitly here.  This is a
+   transition, and will be specified by a more general mechanism shortly.
+ */
+
+static void
+expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
+{
+  tree v = fd->loop.v;
+  enum tree_code cond_code = fd->loop.cond_code;
+  enum tree_code plus_code = PLUS_EXPR;
+
+  tree chunk_size = integer_minus_one_node;
+  tree gwv = integer_zero_node;
+  tree iter_type = TREE_TYPE (v);
+  tree diff_type = iter_type;
+  tree plus_type = iter_type;
+  struct oacc_collapse *counts = NULL;
+
+  gcc_checking_assert (gimple_omp_for_kind (fd->for_stmt)
+		       == GF_OMP_FOR_KIND_OACC_LOOP);
+  gcc_assert (!gimple_omp_for_combined_into_p (fd->for_stmt));
+  gcc_assert (cond_code == LT_EXPR || cond_code == GT_EXPR);
+
+  if (POINTER_TYPE_P (iter_type))
+    {
+      plus_code = POINTER_PLUS_EXPR;
+      plus_type = sizetype;
+    }
+  if (POINTER_TYPE_P (diff_type) || TYPE_UNSIGNED (diff_type))
+    diff_type = signed_type_for (diff_type);
+
+  basic_block entry_bb = region->entry; /* BB ending in OMP_FOR */
+  basic_block exit_bb = region->exit; /* BB ending in OMP_RETURN */
+  basic_block cont_bb = region->cont; /* BB ending in OMP_CONTINUE  */
+  basic_block bottom_bb = NULL;
+
+  /* entry_bb has two successors; the branch edge is to the exit
+     block, fallthrough edge to body.  */
+  gcc_assert (EDGE_COUNT (entry_bb->succs) == 2
+	      && BRANCH_EDGE (entry_bb)->dest == exit_bb);
+
+  /* If cont_bb non-NULL, it has 2 successors.  The branch successor is
+     body_bb, or to a block whose only successor is the body_bb.  Its
+     fallthrough successor is the final block (same as the branch
+     successor of the entry_bb).  */
+  if (cont_bb)
+    {
+      basic_block body_bb = FALLTHRU_EDGE (entry_bb)->dest;
+      basic_block bed = BRANCH_EDGE (cont_bb)->dest;
+
+      gcc_assert (FALLTHRU_EDGE (cont_bb)->dest == exit_bb);
+      gcc_assert (bed == body_bb || single_succ_edge (bed)->dest == body_bb);
+    }
+  else
+    gcc_assert (!gimple_in_ssa_p (cfun));
+
+  /* The exit block only has entry_bb and cont_bb as predecessors.  */
+  gcc_assert (EDGE_COUNT (exit_bb->preds) == 1 + (cont_bb != NULL));
+
+  tree chunk_no;
+  tree chunk_max = NULL_TREE;
+  tree bound, offset;
+  tree step = create_tmp_var (diff_type, ".step");
+  bool up = cond_code == LT_EXPR;
+  tree dir = build_int_cst (diff_type, up ? +1 : -1);
+  bool chunking = !gimple_in_ssa_p (cfun);
+  bool negating;
+
+  /* SSA instances.  */
+  tree offset_incr = NULL_TREE;
+  tree offset_init = NULL_TREE;
+
+  gimple_stmt_iterator gsi;
+  gassign *ass;
+  gcall *call;
+  gimple *stmt;
+  tree expr;
+  location_t loc;
+  edge split, be, fte;
+
+  /* Split the end of entry_bb to create head_bb.  */
+  split = split_block (entry_bb, last_stmt (entry_bb));
+  basic_block head_bb = split->dest;
+  entry_bb = split->src;
+
+  /* Chunk setup goes at end of entry_bb, replacing the omp_for.  */
+  gsi = gsi_last_bb (entry_bb);
+  gomp_for *for_stmt = as_a <gomp_for *> (gsi_stmt (gsi));
+  loc = gimple_location (for_stmt);
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      offset_init = gimple_omp_for_index (for_stmt, 0);
+      gcc_assert (integer_zerop (fd->loop.n1));
+      /* The SSA parallelizer does gang parallelism.  */
+      gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG));
+    }
+
+  if (fd->collapse > 1)
+    {
+      counts = XALLOCAVEC (struct oacc_collapse, fd->collapse);
+      tree total = expand_oacc_collapse_init (fd, &gsi, counts,
+					      TREE_TYPE (fd->loop.n2));
+
+      if (SSA_VAR_P (fd->loop.n2))
+	{
+	  total = force_gimple_operand_gsi (&gsi, total, false, NULL_TREE,
+					    true, GSI_SAME_STMT);
+	  ass = gimple_build_assign (fd->loop.n2, total);
+	  gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+	}
+      
+    }
+
+  tree b = fd->loop.n1;
+  tree e = fd->loop.n2;
+  tree s = fd->loop.step;
+
+  b = force_gimple_operand_gsi (&gsi, b, true, NULL_TREE, true, GSI_SAME_STMT);
+  e = force_gimple_operand_gsi (&gsi, e, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  /* Convert the step, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (TREE_TYPE (s));
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, TREE_TYPE (s), s);
+  s = fold_convert (diff_type, s);
+  if (negating)
+    s = fold_build1 (NEGATE_EXPR, diff_type, s);
+  s = force_gimple_operand_gsi (&gsi, s, true, NULL_TREE, true, GSI_SAME_STMT);
+
+  if (!chunking)
+    chunk_size = integer_zero_node;
+  expr = fold_convert (diff_type, chunk_size);
+  chunk_size = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+  /* Determine the range, avoiding possible unsigned->signed overflow. */
+  negating = !up && TYPE_UNSIGNED (iter_type);
+  expr = fold_build2 (MINUS_EXPR, plus_type,
+		      fold_convert (plus_type, negating ? b : e),
+		      fold_convert (plus_type, negating ? e : b));
+  expr = fold_convert (diff_type, expr);
+  if (negating)
+    expr = fold_build1 (NEGATE_EXPR, diff_type, expr);
+  tree range = force_gimple_operand_gsi (&gsi, expr, true,
+					 NULL_TREE, true, GSI_SAME_STMT);
+
+  chunk_no = build_int_cst (diff_type, 0);
+  if (chunking)
+    {
+      gcc_assert (!gimple_in_ssa_p (cfun));
+
+      expr = chunk_no;
+      chunk_max = create_tmp_var (diff_type, ".chunk_max");
+      chunk_no = create_tmp_var (diff_type, ".chunk_no");
+
+      ass = gimple_build_assign (chunk_no, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+
+      call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+					 build_int_cst (integer_type_node,
+							IFN_GOACC_LOOP_CHUNKS),
+					 dir, range, s, chunk_size, gwv);
+      gimple_call_set_lhs (call, chunk_max);
+      gimple_set_location (call, loc);
+      gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+    }
+  else
+    chunk_size = chunk_no;
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 6,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_STEP),
+				     dir, range, s, chunk_size, gwv);
+  gimple_call_set_lhs (call, step);
+  gimple_set_location (call, loc);
+  gsi_insert_before (&gsi, call, GSI_SAME_STMT);
+
+  /* Remove the GIMPLE_OMP_FOR.  */
+  gsi_remove (&gsi, true);
+
+  /* Fixup edges from head_bb */
+  be = BRANCH_EDGE (head_bb);
+  fte = FALLTHRU_EDGE (head_bb);
+  be->flags |= EDGE_FALSE_VALUE;
+  fte->flags ^= EDGE_FALLTHRU | EDGE_TRUE_VALUE;
+
+  basic_block body_bb = fte->dest;
+
+  if (gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+
+      offset = gimple_omp_continue_control_use (cont_stmt);
+      offset_incr = gimple_omp_continue_control_def (cont_stmt);
+    }
+  else
+    {
+      offset = create_tmp_var (diff_type, ".offset");
+      offset_init = offset_incr = offset;
+    }
+  bound = create_tmp_var (TREE_TYPE (offset), ".bound");
+
+  /* Loop offset & bound go into head_bb.  */
+  gsi = gsi_start_bb (head_bb);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_OFFSET),
+				     dir, range, s,
+				     chunk_size, gwv, chunk_no);
+  gimple_call_set_lhs (call, offset_init);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  call = gimple_build_call_internal (IFN_GOACC_LOOP, 7,
+				     build_int_cst (integer_type_node,
+						    IFN_GOACC_LOOP_BOUND),
+				     dir, range, s,
+				     chunk_size, gwv, offset_init);
+  gimple_call_set_lhs (call, bound);
+  gimple_set_location (call, loc);
+  gsi_insert_after (&gsi, call, GSI_CONTINUE_LINKING);
+
+  expr = build2 (cond_code, boolean_type_node, offset_init, bound);
+  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+		    GSI_CONTINUE_LINKING);
+
+  /* V assignment goes into body_bb.  */
+  if (!gimple_in_ssa_p (cfun))
+    {
+      gsi = gsi_start_bb (body_bb);
+
+      expr = build2 (plus_code, iter_type, b,
+		     fold_convert (plus_type, offset));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      if (fd->collapse > 1)
+	expand_oacc_collapse_vars (fd, &gsi, counts, v);
+    }
+
+  /* Loop increment goes into cont_bb.  If this is not a loop, we
+     will have spawned threads as if it was, and each one will
+     execute one iteration.  The specification is not explicit about
+     whether such constructs are ill-formed or not, and they can
+     occur, especially when noreturn routines are involved.  */
+  if (cont_bb)
+    {
+      gsi = gsi_last_bb (cont_bb);
+      gomp_continue *cont_stmt = as_a <gomp_continue *> (gsi_stmt (gsi));
+      loc = gimple_location (cont_stmt);
+
+      /* Increment offset.  */
+      if (gimple_in_ssa_p (cfun))
+	expr = build2 (plus_code, iter_type, offset,
+		      fold_convert (plus_type, step));
+      else
+	expr = build2 (PLUS_EXPR, diff_type, offset, step);
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (offset_incr, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+      expr = build2 (cond_code, boolean_type_node, offset_incr, bound);
+      gsi_insert_before (&gsi, gimple_build_cond_empty (expr), GSI_SAME_STMT);
+
+      /*  Remove the GIMPLE_OMP_CONTINUE.  */
+      gsi_remove (&gsi, true);
+
+      /* Fixup edges from cont_bb */
+      be = BRANCH_EDGE (cont_bb);
+      fte = FALLTHRU_EDGE (cont_bb);
+      be->flags |= EDGE_TRUE_VALUE;
+      fte->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+
+      if (chunking)
+	{
+	  /* Split the beginning of exit_bb to make bottom_bb.  We
+	     need to insert a nop at the start, because splitting is
+  	     after a stmt, not before.  */
+	  gsi = gsi_start_bb (exit_bb);
+	  stmt = gimple_build_nop ();
+	  gsi_insert_before (&gsi, stmt, GSI_SAME_STMT);
+	  split = split_block (exit_bb, stmt);
+	  bottom_bb = split->src;
+	  exit_bb = split->dest;
+	  gsi = gsi_last_bb (bottom_bb);
+
+	  /* Chunk increment and test goes into bottom_bb.  */
+	  expr = build2 (PLUS_EXPR, diff_type, chunk_no,
+			 build_int_cst (diff_type, 1));
+	  ass = gimple_build_assign (chunk_no, expr);
+	  gsi_insert_after (&gsi, ass, GSI_CONTINUE_LINKING);
+
+	  /* Chunk test at end of bottom_bb.  */
+	  expr = build2 (LT_EXPR, boolean_type_node, chunk_no, chunk_max);
+	  gsi_insert_after (&gsi, gimple_build_cond_empty (expr),
+			    GSI_CONTINUE_LINKING);
+
+	  /* Fixup edges from bottom_bb. */
+	  split->flags ^= EDGE_FALLTHRU | EDGE_FALSE_VALUE;
+	  make_edge (bottom_bb, head_bb, EDGE_TRUE_VALUE);
+	}
+    }
+
+  gsi = gsi_last_bb (exit_bb);
+  gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN);
+  loc = gimple_location (gsi_stmt (gsi));
+
+  if (!gimple_in_ssa_p (cfun))
+    {
+      /* Insert the final value of V, in case it is live.  This is the
+	 value for the only thread that survives past the join.  */
+      expr = fold_build2 (MINUS_EXPR, diff_type, range, dir);
+      expr = fold_build2 (PLUS_EXPR, diff_type, expr, s);
+      expr = fold_build2 (TRUNC_DIV_EXPR, diff_type, expr, s);
+      expr = fold_build2 (MULT_EXPR, diff_type, expr, s);
+      expr = build2 (plus_code, iter_type, b, fold_convert (plus_type, expr));
+      expr = force_gimple_operand_gsi (&gsi, expr, false, NULL_TREE,
+				       true, GSI_SAME_STMT);
+      ass = gimple_build_assign (v, expr);
+      gsi_insert_before (&gsi, ass, GSI_SAME_STMT);
+    }
+
+  /* Remove the OMP_RETURN. */
+  gsi_remove (&gsi, true);
+
+  if (cont_bb)
+    {
+      /* We now have one or two nested loops.  Update the loop
+	 structures.  */
+      struct loop *parent = entry_bb->loop_father;
+      struct loop *body = body_bb->loop_father;
+      
+      if (chunking)
+	{
+	  struct loop *chunk_loop = alloc_loop ();
+	  chunk_loop->header = head_bb;
+	  chunk_loop->latch = bottom_bb;
+	  add_loop (chunk_loop, parent);
+	  parent = chunk_loop;
+	}
+      else if (parent != body)
+	{
+	  gcc_assert (body->header == body_bb);
+	  gcc_assert (body->latch == cont_bb
+		      || single_pred (body->latch) == cont_bb);
+	  parent = NULL;
+	}
+
+      if (parent)
+	{
+	  struct loop *body_loop = alloc_loop ();
+	  body_loop->header = body_bb;
+	  body_loop->latch = cont_bb;
+	  add_loop (body_loop, parent);
+	}
+    }
+}
+
 /* Expand the OMP loop defined by REGION.  */
 
 static void
@@ -10326,6 +11086,11 @@ expand_omp_for (struct omp_region *regio
     expand_omp_simd (region, &fd);
   else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
     expand_cilk_for (region, &fd);
+  else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gcc_assert (!inner_stmt);
+      expand_oacc_for (region, &fd);
+    }
   else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_TASKLOOP)
     {
       if (gimple_omp_for_combined_into_p (fd.for_stmt))
@@ -13527,6 +14292,7 @@ lower_omp_for (gimple_stmt_iterator *gsi
   gomp_for *stmt = as_a <gomp_for *> (gsi_stmt (*gsi_p));
   gbind *new_stmt;
   gimple_seq omp_for_body, body, dlist;
+  gimple_seq oacc_head = NULL, oacc_tail = NULL;
   size_t i;
 
   push_gimplify_context ();
@@ -13635,6 +14401,16 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Once lowered, extract the bounds and clauses.  */
   extract_omp_for_data (stmt, &fd, NULL);
 
+  if (is_gimple_omp_oacc (ctx->stmt)
+      && !ctx_in_oacc_kernels_region (ctx))
+    lower_oacc_head_tail (gimple_location (stmt),
+			  gimple_omp_for_clauses (stmt),
+			  &oacc_head, &oacc_tail, ctx);
+
+  /* Add OpenACC partitioning markers just before the loop.  */
+  if (oacc_head)
+    gimple_seq_add_seq (&body, oacc_head);
+  
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
   if (gimple_omp_for_kind (stmt) == GF_OMP_FOR_KIND_FOR)
@@ -13668,6 +14444,11 @@ lower_omp_for (gimple_stmt_iterator *gsi
   /* Region exit marker goes at the end of the loop body.  */
   gimple_seq_add_stmt (&body, gimple_build_omp_return (fd.have_nowait));
   maybe_add_implicit_barrier_cancel (ctx, &body);
+
+  /* Add OpenACC joining and reduction markers just after the loop.  */
+  if (oacc_tail)
+    gimple_seq_add_seq (&body, oacc_tail);
+
   pop_gimplify_context (new_stmt);
 
   gimple_bind_append_vars (new_stmt, ctx->block_vars);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 2/11] PTX backend changes
  2015-10-22 14:52           ` Nathan Sidwell
@ 2015-10-28 14:28             ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-28 14:28 UTC (permalink / raw)
  To: Jakub Jelinek, Bernd Schmidt; +Cc: GCC Patches, Jason Merrill, Joseph S. Myers

[-- Attachment #1: Type: text/plain, Size: 125 bytes --]

This is the patch I've just committed.

It includes the new target hook overriding, which was originally in patch 3.
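
As a rough, self-contained sketch of the mask encoding that the
fork/join markers in the attached patch carry (illustration only; it
assumes the usual gomp-constants.h numbering of gang=0, worker=1,
vector=2, with GOMP_DIM_MASK being a plain shift, and is not part of
the patch):

/* Hypothetical standalone model of the fork/join marker operand.
   The enum values mirror what gomp-constants.h is assumed to define;
   only worker and vector partitioning need PTX-level markers.  */
#include <stdio.h>

enum { DIM_GANG = 0, DIM_WORKER = 1, DIM_VECTOR = 2, DIM_MAX = 3 };
#define DIM_MASK(X) (1u << (X))

static unsigned
fork_join_operand (unsigned mask, int is_call)
{
  /* Only worker and vector levels get markers; calls set an extra
     bit above the last dimension.  */
  mask &= DIM_MASK (DIM_WORKER) | DIM_MASK (DIM_VECTOR);
  return mask | ((unsigned) is_call << DIM_MAX);
}

int
main (void)
{
  unsigned loop = fork_join_operand (DIM_MASK (DIM_WORKER)
				     | DIM_MASK (DIM_VECTOR), 0);
  unsigned call = fork_join_operand (DIM_MASK (DIM_WORKER)
				     | DIM_MASK (DIM_VECTOR), 1);

  printf ("loop marker operand: 0x%x\n", loop);	/* 0x6 */
  printf ("call marker operand: 0x%x\n", call);	/* 0xe */
  return 0;
}

The bit above the last dimension is what lets the reorg pass tell a
routine call's markers apart from a partitioned loop's when it
rebuilds the parallel structure.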

nathan

[-- Attachment #2: 02-trunk-nvptx-1027.patch --]
[-- Type: text/x-patch, Size: 48428 bytes --]

2015-10-28  Nathan Sidwell  <nathan@codesourcery.com>

	* config/nvptx/nvptx.h (struct machine_function): Add
	axis_predicate.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_NTID, UNSPEC_TID): Delete.
	(UNSPEC_DIM_SIZE, UNSPEC_SHARED_DATA, UNSPEC_BIT_CONV,
	UNSPEC_SHUFFLE, UNSPEC_BR_UNIFIED): New.
	(UNSPECV_BARSYNC, UNSPECV_DIM_POS, UNSPECV_FORK, UNSPECV_FORKED,
	UNSPECV_JOINING, UNSPECV_JOIN): New.
	(BITS, BITD): New mode iterators.
	(br_true_uni, br_false_uni): New.
	(*oacc_ntid_insn, oacc_ntid, *oacc_tid_insn, oacc_tid): Delete.
	(oacc_dim_size, oacc_dim_pos): New.
	(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New.
	(oacc_fork, oacc_join): New.
	(nvptx_shuffle<mode>, unpack<mode>si2, packsi<mode>2): New.
	(worker_load<mode>, worker_store<mode>): New.
	(nvptx_barsync): New.
	* config/nvptx/nvptx.c: Include gimple.h & dumpfile.h.
	(SHUFFLE_UP, SHUFFLE_DOWN, SHUFFLE_BFLY, SHUFFLE_IDX): Define.
	(worker_bcast_size, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker broadcast buffer.
	(nvptx_emit_forking, nvptx_emit_joining): New.
	(nvptx_init_axis_predicate): New.
	(nvptx_declare_function_name): Init axis predicates.
	(nvptx_expand_call): Add fork/join markers around routine call.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack, nvptx_gen_shuffle): New.
	(nvptx_gen_vcast): New.
	(struct wcast_data_t): New.
	(enum propagate_mask): New.
	(nvptx_gen_wcast): New.
	(nvptx_print_operand): Add 'S' case.
	(struct parallel): New.
	(parallel::parallel, parallel::~parallel): New.
	(bb_insn_map_t, insn_bb_t, insn_bb_vec_t): New typedefs.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_pars,
	nvptx_find_par, nvptx_discover_pars): New.
	(nvptx_propagate): New.
	(vprop_gen, nvptx_vpropagate): New.
	(wprop_gen, nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars, nvptx_neuter_pars): New.
	(nvptx_reorg): Split blocks, generate parallel structure, apply
	neutering.
	(nvptx_cannot_copy_insn_p): New.
	(nvptx_file_end): Emit worker broadcast decl.
	(nvptx_goacc_fork_join): New.
	(TARGET_CANNOT_COPY_INSN_P): Override.
	(TARGET_GOACC_FORK_JOIN): Override.

Index: gcc/config/nvptx/nvptx-protos.h
===================================================================
--- gcc/config/nvptx/nvptx-protos.h	(revision 229472)
+++ gcc/config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (unsigned);
+extern void nvptx_expand_oacc_join (unsigned);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: gcc/config/nvptx/nvptx.md
===================================================================
--- gcc/config/nvptx/nvptx.md	(revision 229472)
+++ gcc/config/nvptx/nvptx.md	(working copy)
@@ -49,14 +49,27 @@
 
    UNSPEC_ALLOCA
 
-   UNSPEC_NTID
-   UNSPEC_TID
+   UNSPEC_DIM_SIZE
+
+   UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_SHUFFLE
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
+   UNSPECV_BARSYNC
+   UNSPECV_DIM_POS
+
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -246,6 +259,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -817,6 +832,23 @@
   ""
   "%J0\\tbra\\t%l1;")
 
+;; unified conditional branch
+(define_insn "br_true_uni"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_false_uni"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
+
 (define_expand "cbranch<mode>4"
   [(set (pc)
 	(if_then_else (match_operator 0 "nvptx_comparison_operator"
@@ -1308,36 +1340,134 @@
   DONE;
 })
 
-(define_insn "*oacc_ntid_insn"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "n")] UNSPEC_NTID))]
+(define_insn "oacc_dim_size"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")]
+		   UNSPEC_DIM_SIZE))]
   ""
-  "%.\\tmov.u32 %0, %%ntid%d1;")
+{
+  static const char *const asms[] =
+{ /* Must match oacc_loop_levels ordering.  */
+  "%.\\tmov.u32\\t%0, %%nctaid.x;",	/* gang */
+  "%.\\tmov.u32\\t%0, %%ntid.y;",	/* worker */
+  "%.\\tmov.u32\\t%0, %%ntid.x;",	/* vector */
+};
+  return asms[INTVAL (operands[1])];
+})
 
-(define_expand "oacc_ntid"
+(define_insn "oacc_dim_pos"
   [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")] UNSPEC_NTID))]
+	(unspec_volatile:SI [(match_operand:SI 1 "const_int_operand" "")]
+			    UNSPECV_DIM_POS))]
   ""
 {
-  if (INTVAL (operands[1]) < 0 || INTVAL (operands[1]) > 2)
-    FAIL;
+  static const char *const asms[] =
+{ /* Must match oacc_loop_levels ordering.  */
+  "%.\\tmov.u32\\t%0, %%ctaid.x;",	/* gang */
+  "%.\\tmov.u32\\t%0, %%tid.y;",	/* worker */
+  "%.\\tmov.u32\\t%0, %%tid.x;",	/* vector */
+};
+  return asms[INTVAL (operands[1])];
 })
 
-(define_insn "*oacc_tid_insn"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "n")] UNSPEC_TID))]
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
   ""
-  "%.\\tmov.u32 %0, %%tid%d1;")
+  "// fork %0;"
+)
 
-(define_expand "oacc_tid"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec:SI [(match_operand:SI 1 "const_int_operand" "")] UNSPEC_TID))]
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(set (match_operand:SI 0 "nvptx_nonmemory_operand" "")
+        (match_operand:SI 1 "nvptx_general_operand" ""))
+   (unspec_volatile:SI [(match_operand:SI 2 "const_int_operand" "")]
+		        UNSPECV_FORKED)]
   ""
 {
-  if (INTVAL (operands[1]) < 0 || INTVAL (operands[1]) > 2)
-    FAIL;
+  if (operands[0] != const0_rtx)
+    emit_move_insn (operands[0], operands[1]);
+  nvptx_expand_oacc_fork (INTVAL (operands[2]));
+  DONE;
 })
 
+(define_expand "oacc_join"
+  [(set (match_operand:SI 0 "nvptx_nonmemory_operand" "")
+        (match_operand:SI 1 "nvptx_general_operand" ""))
+   (unspec_volatile:SI [(match_operand:SI 2 "const_int_operand" "")]
+		        UNSPECV_JOIN)]
+  ""
+{
+  if (operands[0] != const0_rtx)
+    emit_move_insn (operands[0], operands[1]);
+  nvptx_expand_oacc_join (INTVAL (operands[2]));
+  DONE;
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_shuffle<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "=R")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "R")
+		 (match_operand:SI 2 "nvptx_nonmemory_operand" "Ri")
+		 (match_operand:SI 3 "const_int_operand" "n")]
+		  UNSPEC_SHUFFLE))]
+  ""
+  "%.\\tshfl%S3.b32\\t%0, %1, %2, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "=R")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "R")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "=R")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64\\t{%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "=R")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "R")
+		      (match_operand:SI 2 "nvptx_register_operand" "R")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64\\t%0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0, %1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1, %0;")
+
 ;; Atomic insns.
 
 (define_expand "atomic_compare_and_swap<mode>"
@@ -1423,3 +1553,9 @@
 	(match_dup 1))]
   "0"
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
+
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
+  ""
+  "\\tbar.sync\\t%0;")
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 229472)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -51,14 +51,21 @@
 #include "langhooks.h"
 #include "dbxout.h"
 #include "cfgrtl.h"
+#include "gimple.h"
 #include "stor-layout.h"
 #include "builtins.h"
 #include "omp-low.h"
 #include "gomp-constants.h"
+#include "dumpfile.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
 
+#define SHUFFLE_UP 0
+#define SHUFFLE_DOWN 1
+#define SHUFFLE_BFLY 2
+#define SHUFFLE_IDX 3
+
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
 static std::stringstream func_decls;
@@ -81,6 +88,16 @@ struct tree_hasher : ggc_cache_ptr_hash<
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_size;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -108,6 +125,9 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
+  worker_bcast_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -194,6 +214,47 @@ nvptx_split_reg_p (machine_mode mode)
   return false;
 }
 
+/* Emit forking instructions for MASK.  */
+
+static void
+nvptx_emit_forking (unsigned mask, bool is_call)
+{
+  mask &= (GOMP_DIM_MASK (GOMP_DIM_WORKER)
+	   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  if (mask)
+    {
+      rtx op = GEN_INT (mask | (is_call << GOMP_DIM_MAX));
+      
+      /* Emit fork at all levels.  This helps form SESE regions, as
+	 it creates a block with a single successor before entering a
+	 partitioned region.  That is a good candidate for the end of
+	 an SESE region.  */
+      if (!is_call)
+	emit_insn (gen_nvptx_fork (op));
+      emit_insn (gen_nvptx_forked (op));
+    }
+}
+
+/* Emit joining instructions for MASK.  */
+
+static void
+nvptx_emit_joining (unsigned mask, bool is_call)
+{
+  mask &= (GOMP_DIM_MASK (GOMP_DIM_WORKER)
+	   | GOMP_DIM_MASK (GOMP_DIM_VECTOR));
+  if (mask)
+    {
+      rtx op = GEN_INT (mask | (is_call << GOMP_DIM_MAX));
+
+      /* Emit joining for all non-call pars to ensure there's a single
+	 predecessor for the block the join insn ends up in.  This is
+	 needed for skipping entire loops.  */
+      if (!is_call)
+	emit_insn (gen_nvptx_joining (op));
+      emit_insn (gen_nvptx_join (op));
+    }
+}
+
 #define PASS_IN_REG_P(MODE, TYPE)				\
   ((GET_MODE_CLASS (MODE) == MODE_INT				\
     || GET_MODE_CLASS (MODE) == MODE_FLOAT			\
@@ -500,6 +561,19 @@ nvptx_record_needed_fndecl (tree decl)
     *slot = decl;
 }
 
+/* Emit code to initialize the REGNO predicate register to indicate
+   whether we are not lane zero on the NAME axis.  */
+
+static void
+nvptx_init_axis_predicate (FILE *file, int regno, const char *name)
+{
+  fprintf (file, "\t{\n");
+  fprintf (file, "\t\t.reg.u32\t%%%s;\n", name);
+  fprintf (file, "\t\tmov.u32\t%%%s, %%tid.%s;\n", name, name);
+  fprintf (file, "\t\tsetp.ne.u32\t%%r%d, %%%s, 0;\n", regno, name);
+  fprintf (file, "\t}\n");
+}
+
 /* Implement ASM_DECLARE_FUNCTION_NAME.  Writes the start of a ptx
    function, including local var decls and copies from the arguments to
    local regs.  */
@@ -623,6 +697,14 @@ nvptx_declare_function_name (FILE *file,
   if (stdarg_p (fntype))
     fprintf (file, "\tld.param.u%d %%argp, [%%in_argp];\n",
 	     GET_MODE_BITSIZE (Pmode));
+
+  /* Emit axis predicates. */
+  if (cfun->machine->axis_predicate[0])
+    nvptx_init_axis_predicate (file,
+			       REGNO (cfun->machine->axis_predicate[0]), "y");
+  if (cfun->machine->axis_predicate[1])
+    nvptx_init_axis_predicate (file,
+			       REGNO (cfun->machine->axis_predicate[1]), "x");
 }
 
 /* Output a return instruction.  Also copy the return value to its outgoing
@@ -779,6 +861,7 @@ nvptx_expand_call (rtx retval, rtx addre
   bool external_decl = false;
   rtx varargs = NULL_RTX;
   tree decl_type = NULL_TREE;
+  unsigned parallel = 0;
 
   for (t = cfun->machine->call_args; t; t = XEXP (t, 1))
     nargs++;
@@ -799,6 +882,22 @@ nvptx_expand_call (rtx retval, rtx addre
 	    cfun->machine->has_call_with_sc = true;
 	  if (DECL_EXTERNAL (decl))
 	    external_decl = true;
+	  tree attr = get_oacc_fn_attrib (decl);
+	  if (attr)
+	    {
+	      tree dims = TREE_VALUE (attr);
+
+	      parallel = GOMP_DIM_MASK (GOMP_DIM_MAX) - 1;
+	      for (int ix = 0; ix != GOMP_DIM_MAX; ix++)
+		{
+		  if (TREE_PURPOSE (dims)
+		      && !integer_zerop (TREE_PURPOSE (dims)))
+		    break;
+		  /* Not on this axis.  */
+		  parallel ^= GOMP_DIM_MASK (ix);
+		  dims = TREE_CHAIN (dims);
+		}
+	    }
 	}
     }
 
@@ -860,7 +959,11 @@ nvptx_expand_call (rtx retval, rtx addre
 	  write_func_decl_from_insn (func_decls, retval, pat, callee);
 	}
     }
+
+  nvptx_emit_forking (parallel, true);
   emit_call_insn (pat);
+  nvptx_emit_joining (parallel, true);
+
   if (tmp_retval != retval)
     emit_move_insn (retval, tmp_retval);
 }
@@ -1069,6 +1172,214 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+/* Expand the oacc fork & join primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (unsigned mode)
+{
+  nvptx_emit_forking (GOMP_DIM_MASK (mode), false);
+}
+
+void
+nvptx_expand_oacc_join (unsigned mode)
+{
+  nvptx_emit_joining (GOMP_DIM_MASK (mode), false);
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to perform a shuffle of KIND
+   on register SRC, using lane operand IDX, and store the result in
+   register DST.  */
+
+static rtx
+nvptx_gen_shuffle (rtx dst, rtx src, rtx idx, unsigned kind)
+{
+  rtx res;
+
+  switch (GET_MODE (dst))
+    {
+    case SImode:
+      res = gen_nvptx_shufflesi (dst, src, idx, GEN_INT (kind));
+      break;
+    case SFmode:
+      res = gen_nvptx_shufflesf (dst, src, idx, GEN_INT (kind));
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, src));
+	emit_insn (nvptx_gen_shuffle (tmp0, tmp0, idx, kind));
+	emit_insn (nvptx_gen_shuffle (tmp1, tmp1, idx, kind));
+	emit_insn (nvptx_gen_pack (dst, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, src, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_shuffle (tmp, tmp, idx, kind));
+	emit_insn (gen_rtx_SET (dst, gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    default:
+      gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  return nvptx_gen_shuffle (reg, reg, const0_rtx, SHUFFLE_IDX);
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;  /* Register holding base addr of buffer.  */
+  rtx ptr;  /* Iteration var,  if needed.  */
+  unsigned offset; /* Offset into worker buffer.  */
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (reg, gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm == PM_read)
+	  res = gen_rtx_SET (addr, reg);
+	else if (pm == PM_write)
+	  res = gen_rtx_SET (reg, addr);
+	else
+	  gcc_unreachable ();
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (reg)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1660,6 +1971,7 @@ nvptx_print_operand_address (FILE *file,
    c -- print an opcode suffix for a comparison operator, including a type code
    d -- print a CONST_INT as a vector dimension (x, y, or z)
    f -- print a full reg even for something that must always be split
+   S -- print a shuffle kind specified by CONST_INT
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.  */
@@ -1723,6 +2035,15 @@ nvptx_print_operand (FILE *file, rtx x,
       fprintf (file, "%s", nvptx_ptx_type_from_mode (op_mode, false));
       break;
 
+    case 'S':
+      {
+	unsigned kind = UINTVAL (x);
+	static const char *const kinds[] = 
+	  {"up", "down", "bfly", "idx"};
+	fprintf (file, ".%s", kinds[kind]);
+      }
+      break;
+
     case 'T':
       fprintf (file, "%d", GET_MODE_BITSIZE (GET_MODE (x)));
       break;
@@ -1973,10 +2294,747 @@ nvptx_reorg_subreg (void)
     }
 }
 
+/* Loop structure of the function. The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mask of the parallel.  */
+  unsigned mask;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after
+     the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
+
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mask_)
+  :parent (parent_), next (0), inner (0), mask (mask_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks such that the forked and join unspecs are at
+   the start of their basic blocks.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Populate MAP with
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding partitions.  */
+
+static void
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by parallel locator  */
+      block->flags &= ~BB_VISITED;
+
+      FOR_BB_INSNS (block, insn)
+	{
+	  if (!INSN_P (insn))
+	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
+
+	  if (seen_insn)
+	    /* We've found an instruction that must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mask %d head=%d, tail=%d\n",
+	   depth, par->mask,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+/* If BLOCK contains a fork/join marker, process it to create or
+   terminate a loop structure.  Add this block to the current loop,
+   and then walk successor blocks.   */
+
+static parallel *
+nvptx_find_par (bb_insn_map_t *map, parallel *par, basic_block block)
+{
+  if (block->flags & BB_VISITED)
+    return par;
+  block->flags |= BB_VISITED;
+
+  if (rtx_insn **endp = map->get (block))
+    {
+      rtx_insn *end = *endp;
+
+      /* This is a block head or tail, or return instruction.  */
+      switch (recog_memoized (end))
+	{
+	case CODE_FOR_return:
+	  /* Return instructions are in their own block, and we
+	     don't need to do anything more.  */
+	  return par;
+
+	case CODE_FOR_nvptx_forked:
+	  /* Loop head, create a new inner loop and add it into
+	     our parent's child list.  */
+	  {
+	    unsigned mask = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+	    gcc_assert (mask);
+	    par = new parallel (par, mask);
+	    par->forked_block = block;
+	    par->forked_insn = end;
+	    if (!(mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+		&& (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+	      par->fork_insn
+		= nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	  }
+	  break;
+
+	case CODE_FOR_nvptx_join:
+	  /* A loop tail.  Finish the current loop and return to
+	     parent.  */
+	  {
+	    unsigned mask = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+	    gcc_assert (par->mask == mask);
+	    par->join_block = block;
+	    par->join_insn = end;
+	    if (!(mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+		&& (mask & GOMP_DIM_MASK (GOMP_DIM_WORKER)))
+	      par->joining_insn
+		= nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+	    par = par->parent;
+	  }
+	  break;
+
+	default:
+	  gcc_unreachable ();
+	}
+    }
+
+  if (par)
+    /* Add this block onto the current loop's list of blocks.  */
+    par->blocks.safe_push (block);
+  else
+    /* This must be the entry block.  Create a NULL parallel.  */
+    par = new parallel (0, 0);
+
+  /* Walk successor blocks.  */
+  edge e;
+  edge_iterator ei;
+
+  FOR_EACH_EDGE (e, ei, block->succs)
+    nvptx_find_par (map, par, e->dest);
+
+  return par;
+}
+
+/* DFS walk the CFG looking for fork & join markers.  Construct
+   loop structures as we go.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  basic_block block;
+
+  /* Mark exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+
+  /* And entry block as not.  */
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  block->flags &= ~BB_VISITED;
+
+  parallel *par = nvptx_find_par (map, 0, block);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN. Propagation is inserted just after INSN. RW indicates whether
+   we are reading and/or writing state.  This
+   separation is needed for worker-level propagation where we
+   essentially do a spill & fill.  FN is the underlying worker
+   function to generate the propagation instructions for a single
+   register.  DATA is user data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+typedef rtx (*propagator_fn) (rtx, propagate_mask, unsigned, void *);
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 propagator_fn fn, void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (idx, GEN_INT (fs)));
+	  /* Allow worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (pred, gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_uni (pred, label));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
+}
+
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
+
+static void
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
+{
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.    */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill live state that is live at start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill). Sequence is inserted just after
+   INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_size < data.offset)
+	worker_bcast_size = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  We use different
+   markers for before and after synchronizations.  */
+
+static rtx
+nvptx_wsync (bool after)
+{
+  return gen_nvptx_barsync (GEN_INT (after));
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find first insn of from block */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find last insn of to block */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default:
+	  break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	  return;
+	}
+
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = GOMP_DIM_WORKER; mode <= GOMP_DIM_VECTOR; mode++)
+    if (GOMP_DIM_MASK (mode) & skip_mask)
+      {
+	rtx_code_label *label = gen_label_rtx ();
+	rtx pred = cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER];
+
+	if (!pred)
+	  {
+	    pred = gen_reg_rtx (BImode);
+	    cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
+	  }
+	
+	rtx br;
+	if (mode == GOMP_DIM_VECTOR)
+	  br = gen_br_true (pred, label);
+	else
+	  br = gen_br_true_uni (pred, label);
+	emit_insn_before (br, head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (GOMP_DIM_MASK (GOMP_DIM_VECTOR) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only. */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_size < GET_MODE_SIZE (SImode))
+	    worker_bcast_size = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  /* Barrier so other workers can see the write.  */
+	  emit_insn_before (nvptx_wsync (false), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  /* This barrier is needed to avoid worker zero clobbering
+	     the broadcast buffer before all the other workers have
+	     had a chance to read this instance of it.  */
+	  emit_insn_before (nvptx_wsync (true), tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
+
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = par->mask;
+
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+
+  if (par->mask & GOMP_DIM_MASK (GOMP_DIM_MAX))
+    /* No propagation needed for a call.  */;
+  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+    {
+      nvptx_wpropagate (false, par->forked_block, par->forked_insn);
+      nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+      /* Insert begin and end synchronizations.  */
+      emit_insn_after (nvptx_wsync (false), par->forked_insn);
+      emit_insn_before (nvptx_wsync (true), par->joining_insn);
+    }
+  else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))
+    nvptx_vpropagate (par->forked_block, par->forked_insn);
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES are the partitioning of the execution and OUTER is
+   the partitioning of the parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (par->mask
+		 & (GOMP_DIM_MASK (GOMP_DIM_WORKER)
+		    | GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+  unsigned  skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = GOMP_DIM_WORKER; mode <= GOMP_DIM_VECTOR; mode++)
+    {
+      if ((outer | me) & GOMP_DIM_MASK (mode))
+	{} /* Mode is partitioned: no neutering.  */
+      else if (!(modes & GOMP_DIM_MASK (mode)))
+	{} /* Mode is not used: nothing to do.  */  
+      else if (par->inner_mask & GOMP_DIM_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not partitioned
+	   at all: neuter individual blocks.  */
+	neuter_mask |= GOMP_DIM_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & GOMP_DIM_MASK (mode))
+	/* Parent isn't a parallel, or already contains this
+	   partitioning: skip this parallel at this level.  */
+	skip_mask |= GOMP_DIM_MASK (mode);
+      else
+	{} /* Parent will skip this parallel itself.  */
+    }
+
+  if (neuter_mask)
+    {
+      int ix;
+      int len = par->blocks.length ();
+
+      for (ix = 0; ix != len; ix++)
+	{
+	  basic_block block = par->blocks[ix];
+
+	  nvptx_single (neuter_mask, block, block);
+	}
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
 /* PTX-specific reorganization
+   - Scan and release reduction buffers
+   - Split blocks at fork and join instructions
    - Compute live registers
    - Mark now-unused registers, so function begin doesn't declare
    unused registers.
+   - Insert state propagation when entering partitioned mode
+   - Insert neutering instructions when in single mode
    - Replace subregs with suitable sequences.
 */
 
@@ -1989,19 +3047,60 @@ nvptx_reorg (void)
 
   thread_prologue_and_epilogue_insns ();
 
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+
+  nvptx_split_blocks (&bb_insn_map);
+
   /* Compute live regs */
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  df_live_set_all_dirty ();
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
+  /* Determine launch dimensions of the function.  If it is not an
+     offloaded function  (i.e. this is a regular compiler), the
+     function has no neutering.  */
+  tree attr = get_oacc_fn_attrib (current_function_decl);
+  if (attr)
+    {
+      /* If we determined this mask before RTL expansion, we could
+	 elide emission of some levels of forks and joins.  */
+      unsigned mask = 0;
+      tree dims = TREE_VALUE (attr);
+      unsigned ix;
+
+      for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
+	{
+	  int size = TREE_INT_CST_LOW (TREE_VALUE (dims));
+	  tree allowed = TREE_PURPOSE (dims);
+
+	  if (size != 1 && !(allowed && integer_zerop (allowed)))
+	    mask |= GOMP_DIM_MASK (ix);
+	}
+      /* If there is worker neutering, there must be vector
+	 neutering.  Otherwise the hardware will fail.  */
+      gcc_assert (!(mask & GOMP_DIM_MASK (GOMP_DIM_WORKER))
+		  || (mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR)));
+
+      /* Discover & process partitioned regions.  */
+      parallel *pars = nvptx_discover_pars (&bb_insn_map);
+      nvptx_process_pars (pars);
+      nvptx_neuter_pars (pars, mask, 0);
+      delete pars;
+    }
+
   /* Replace subregs.  */
   nvptx_reorg_subreg ();
 
@@ -2052,6 +3151,26 @@ nvptx_vector_alignment (const_tree type)
 
   return MIN (align, BIGGEST_ALIGNMENT);
 }
+
+/* Indicate that INSN cannot be duplicated.   */
+
+static bool
+nvptx_cannot_copy_insn_p (rtx_insn *insn)
+{
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_shufflesi:
+    case CODE_FOR_nvptx_shufflesf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
+}
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
 
@@ -2129,6 +3248,19 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_size)
+    {
+      /* Define the broadcast buffer.  */
+
+      worker_bcast_size = (worker_bcast_size + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared .align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_size);
+    }
 }
 \f
 /* Validate compute dimensions of an OpenACC offload or routine, fill
@@ -2141,12 +3273,32 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
 {
   bool changed = false;
 
-  /* TODO: Leave dimensions unaltered.  Partitioned execution needs
+  /* TODO: Leave dimensions unaltered.  Reductions need
      porting before filtering dimensions makes sense.  */
 
   return changed;
 }
-\f
+
+/* Determine whether fork & joins are needed.  */
+
+static bool
+nvptx_goacc_fork_join (gcall *call, const int dims[],
+		       bool ARG_UNUSED (is_fork))
+{
+  tree arg = gimple_call_arg (call, 2);
+  unsigned axis = TREE_INT_CST_LOW (arg);
+
+  /* We only care about worker and vector partitioning.  */
+  if (axis < GOMP_DIM_WORKER)
+    return false;
+
+  /* If the size is 1, there's no partitioning.  */
+  if (dims[axis] == 1)
+    return false;
+
+  return true;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -2233,9 +3385,15 @@ nvptx_goacc_validate_dims (tree ARG_UNUS
 #undef TARGET_VECTOR_ALIGNMENT
 #define TARGET_VECTOR_ALIGNMENT nvptx_vector_alignment
 
+#undef TARGET_CANNOT_COPY_INSN_P
+#define TARGET_CANNOT_COPY_INSN_P nvptx_cannot_copy_insn_p
+
 #undef TARGET_GOACC_VALIDATE_DIMS
 #define TARGET_GOACC_VALIDATE_DIMS nvptx_goacc_validate_dims
 
+#undef TARGET_GOACC_FORK_JOIN
+#define TARGET_GOACC_FORK_JOIN nvptx_goacc_fork_join
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: gcc/config/nvptx/nvptx.h
===================================================================
--- gcc/config/nvptx/nvptx.h	(revision 229472)
+++ gcc/config/nvptx/nvptx.h	(working copy)
@@ -230,6 +230,7 @@ struct GTY(()) machine_function
   HOST_WIDE_INT outgoing_stdarg_size;
   int ret_reg_mode; /* machine_mode not defined yet. */
   int punning_buffer_size;
+  rtx axis_predicate[2];
 };
 #endif
 \f

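As a minimal, single-threaded C model of the two propagation styles
above (purely illustrative: the lane/worker counts and the buffer are
made up here, and the backend of course emits shfl and shared-memory
PTX rather than anything like this):

/* Sequential sketch of vector- and worker-level broadcast.  All
   names and sizes are hypothetical; the "barriers" are just program
   order in this model.  */
#include <stdio.h>

#define LANES 4
#define WORKERS 3

/* Vector level: every lane takes lane 0's value, which is the effect
   of the shfl.idx-based vcast.  */
static void
vector_broadcast (int lane_val[LANES])
{
  for (int i = 1; i < LANES; i++)
    lane_val[i] = lane_val[0];
}

/* Worker level: worker 0 spills to a shared buffer (the PM_read
   side), a barrier makes the store visible, then every worker fills
   from the buffer (the PM_write side).  */
static int shared_buf;

static void
worker_broadcast (int worker_val[WORKERS])
{
  shared_buf = worker_val[0];		/* spill */
  /* first bar.sync would go here */
  for (int i = 0; i < WORKERS; i++)
    worker_val[i] = shared_buf;		/* fill */
  /* second bar.sync: keeps worker 0 from overwriting the buffer
     before the others have read this instance of it.  */
}

int
main (void)
{
  int lanes[LANES] = { 7, 0, 0, 0 };
  int workers[WORKERS] = { 42, 0, 0 };

  vector_broadcast (lanes);
  worker_broadcast (workers);

  printf ("lanes: %d %d %d %d\n", lanes[0], lanes[1], lanes[2], lanes[3]);
  printf ("workers: %d %d %d\n", workers[0], workers[1], workers[2]);
  return 0;
}

The second barrier in the worker case corresponds to the comment in
nvptx_single about worker zero not clobbering the broadcast buffer
before the other workers have read it.
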
^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 11/11] execution tests
  2015-10-21 20:17     ` Nathan Sidwell
@ 2015-10-28 14:30       ` Nathan Sidwell
  0 siblings, 0 replies; 120+ messages in thread
From: Nathan Sidwell @ 2015-10-28 14:30 UTC (permalink / raw)
  To: Ilya Verbin
  Cc: GCC Patches, Jakub Jelinek, Bernd Schmidt, Jason Merrill,
	Joseph S. Myers, Richard Guenther, Cesar Philippidis

On 10/21/15 13:16, Nathan Sidwell wrote:
> On 10/21/15 16:14, Ilya Verbin wrote:
>
>>> <11-trunk-tests.patch>
>>
>> Does the testcase with offload IR appear here accidentally?
>
> D'oh!  yup, fixed.

Now all applied.  Thanks for everybody's help.

nathan

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [OpenACC 7/11] execution model
  2015-10-21 19:47 ` [OpenACC 7/11] execution model Nathan Sidwell
  2015-10-22  9:32   ` Jakub Jelinek
@ 2020-11-24 10:34   ` Thomas Schwinge
  1 sibling, 0 replies; 120+ messages in thread
From: Thomas Schwinge @ 2020-11-24 10:34 UTC (permalink / raw)
  To: gcc-patches; +Cc: Nathan Sidwell, Jakub Jelinek, Frederik Harwath, Tom de Vries

[-- Attachment #1: Type: text/plain, Size: 1618 bytes --]

Hi!

On 2015-10-21T15:42:26-0400, Nathan Sidwell <nathan@acm.org> wrote:
> This patch is the early lowering part of OpenACC loops.  Rather than piggy-back
> onto expand_omp_for_static_nochunk & expand_omp_for_static_chunk, we have a new
> function 'expand_oacc_for', which does the OpenACC equivalent
> expension.  [...]

Aye!

(These changes got committed in r229472.)

> +static void
> +expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
> +{
> +  [...]
> +  bool chunking = !gimple_in_ssa_p (cfun);;
> +  [...]
> +  if (gimple_in_ssa_p (cfun))
> +    {
> +      offset_init = gimple_omp_for_index (for_stmt, 0);
> +      gcc_assert (integer_zerop (fd->loop.n1));
> +      /* The SSA parallelizer does gang parallelism.  */
> +      gwv = build_int_cst (integer_type_node, GOMP_DIM_MASK (GOMP_DIM_GANG));
> +    }
> +  [etc.]

It's, uhm, a bit "non-obvious" ;-) what's going on there, namely that (citing
from my patch/commit) "some of the 'gimple_in_ssa_p (cfun)' conditionals
are for SSA specifics, and some are for 'parloops' OpenACC
'kernels'-parallelized specifics".  To clarify that, I've pushed "More
explicit checking of which OMP constructs we're expecting, part II" to
master branch in commit 8c3aa359ce33732273bbd61c5f9a2c607779b32e, and
backported to releases/gcc-10 branch in commit
6e8837438148d6ed3e512099b2d12d06836c2a45, see attached.


Regards
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstraße 201, 80634 München / Germany
Commercial Register Munich HRB 106955, Managing Directors: Thomas Heurung, Alexander Walter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-More-explicit-checking-of-which-OMP-constructs-we-re.patch --]
[-- Type: text/x-diff, Size: 1643 bytes --]

From 8c3aa359ce33732273bbd61c5f9a2c607779b32e Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 20 Nov 2020 10:41:46 +0100
Subject: [PATCH] More explicit checking of which OMP constructs we're
 expecting, part II

In particular, more precisely highlight what applies generally vs. the special
handling for the current 'parloops'-based OpenACC 'kernels' implementation.

	gcc/
	* omp-expand.c (expand_oacc_for): More explicit checking of which
	OMP constructs we're expecting.
---
 gcc/omp-expand.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index c0e94e5e323..928644b099c 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -7413,6 +7413,21 @@ expand_omp_taskloop_for_inner (struct omp_region *region,
 static void
 expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
 {
+  bool is_oacc_kernels_parallelized
+    = (lookup_attribute ("oacc kernels parallelized",
+			 DECL_ATTRIBUTES (current_function_decl)) != NULL);
+  {
+    bool is_oacc_kernels
+      = (lookup_attribute ("oacc kernels",
+			   DECL_ATTRIBUTES (current_function_decl)) != NULL);
+    if (is_oacc_kernels_parallelized)
+      gcc_checking_assert (is_oacc_kernels);
+  }
+  gcc_assert (gimple_in_ssa_p (cfun) == is_oacc_kernels_parallelized);
+  /* In the following, some of the 'gimple_in_ssa_p (cfun)' conditionals are
+     for SSA specifics, and some are for 'parloops' OpenACC
+     'kernels'-parallelized specifics.  */
+
   tree v = fd->loop.v;
   enum tree_code cond_code = fd->loop.cond_code;
   enum tree_code plus_code = PLUS_EXPR;
-- 
2.17.1


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0001-More-explicit-checking-of-which-OMP-constructs-w.g10.patch --]
[-- Type: text/x-diff, Size: 1713 bytes --]

From 6e8837438148d6ed3e512099b2d12d06836c2a45 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 20 Nov 2020 10:41:46 +0100
Subject: [PATCH] More explicit checking of which OMP constructs we're
 expecting, part II

In particular, more precisely highlight what applies generally vs. the special
handling for the current 'parloops'-based OpenACC 'kernels' implementation.

	gcc/
	* omp-expand.c (expand_oacc_for): More explicit checking of which
	OMP constructs we're expecting.

(cherry picked from commit 8c3aa359ce33732273bbd61c5f9a2c607779b32e)
---
 gcc/omp-expand.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/gcc/omp-expand.c b/gcc/omp-expand.c
index 735e263c8f8..27adbaf15cb 100644
--- a/gcc/omp-expand.c
+++ b/gcc/omp-expand.c
@@ -5998,6 +5998,21 @@ expand_omp_taskloop_for_inner (struct omp_region *region,
 static void
 expand_oacc_for (struct omp_region *region, struct omp_for_data *fd)
 {
+  bool is_oacc_kernels_parallelized
+    = (lookup_attribute ("oacc kernels parallelized",
+			 DECL_ATTRIBUTES (current_function_decl)) != NULL);
+  {
+    bool is_oacc_kernels
+      = (lookup_attribute ("oacc kernels",
+			   DECL_ATTRIBUTES (current_function_decl)) != NULL);
+    if (is_oacc_kernels_parallelized)
+      gcc_checking_assert (is_oacc_kernels);
+  }
+  gcc_assert (gimple_in_ssa_p (cfun) == is_oacc_kernels_parallelized);
+  /* In the following, some of the 'gimple_in_ssa_p (cfun)' conditionals are
+     for SSA specifics, and some are for 'parloops' OpenACC
+     'kernels'-parallelized specifics.  */
+
   tree v = fd->loop.v;
   enum tree_code cond_code = fd->loop.cond_code;
   enum tree_code plus_code = PLUS_EXPR;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2020-11-24 10:34 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-21 19:00 [OpenACC 0/11] execution model Nathan Sidwell
2015-10-21 19:00 ` [OpenACC 1/11] UNIQUE internal function Nathan Sidwell
2015-10-22  7:49   ` Richard Biener
2015-10-22  7:55     ` Richard Biener
2015-10-22  8:04       ` Jakub Jelinek
2015-10-22  8:07         ` Richard Biener
2015-10-22 11:42           ` Julian Brown
2015-10-22 13:12             ` Nathan Sidwell
2015-10-22 13:20               ` Jakub Jelinek
2015-10-22 13:27                 ` Nathan Sidwell
2015-10-22 14:31                   ` Richard Biener
2015-10-22 14:47                     ` Nathan Sidwell
2015-10-22  8:05   ` Jakub Jelinek
2015-10-22  8:12     ` Richard Biener
2015-10-22 13:08       ` Nathan Sidwell
2015-10-22 14:04       ` Nathan Sidwell
2015-10-22 14:28         ` Richard Biener
2015-10-22 14:31           ` Nathan Sidwell
2015-10-22 18:08           ` Nathan Sidwell
2015-10-23  8:46             ` Jakub Jelinek
2015-10-23 13:03               ` Nathan Sidwell
2015-10-23 13:03                 ` Richard Biener
2015-10-23 13:16                   ` Nathan Sidwell
2015-10-23 13:16                     ` Jakub Jelinek
2015-10-23 14:46                       ` Nathan Sidwell
2015-10-23 13:12                 ` Jakub Jelinek
2015-10-23 13:38                   ` Nathan Sidwell
2015-10-25 14:29                   ` Nathan Sidwell
2015-10-26 22:35                     ` Nathan Sidwell
2015-10-27  8:18                       ` Jakub Jelinek
2015-10-27 13:47                         ` Richard Biener
2015-10-27 14:06                           ` Nathan Sidwell
2015-10-27 14:07                             ` Jakub Jelinek
2015-10-27 20:18                             ` Nathan Sidwell
2015-10-27 14:15                         ` Nathan Sidwell
2015-10-23  9:40             ` Richard Biener
2015-10-22 17:39       ` Nathan Sidwell
2015-10-22 20:25     ` Nathan Sidwell
2015-10-23  8:05       ` Jakub Jelinek
2015-10-21 19:11 ` [OpenACC 2/11] PTX backend changes Nathan Sidwell
2015-10-22  8:16   ` Jakub Jelinek
2015-10-22  9:58     ` Bernd Schmidt
2015-10-22 13:02       ` Nathan Sidwell
2015-10-22 13:23         ` Nathan Sidwell
2015-10-22 14:05   ` Bernd Schmidt
2015-10-22 14:26     ` Nathan Sidwell
2015-10-22 14:30       ` Bernd Schmidt
2015-10-22 14:36         ` Jakub Jelinek
2015-10-22 14:52           ` Nathan Sidwell
2015-10-28 14:28             ` Nathan Sidwell
2015-10-22 14:42         ` Nathan Sidwell
2015-10-21 19:16 ` [OpenACC 3/11] new target hook Nathan Sidwell
2015-10-22  8:23   ` Jakub Jelinek
2015-10-22 13:17     ` Nathan Sidwell
2015-10-27 22:15     ` Nathan Sidwell
2015-10-21 19:19 ` [OpenACC 5/11] C++ FE changes Nathan Sidwell
2015-10-22  8:58   ` Jakub Jelinek
2015-10-23 20:26     ` Cesar Philippidis
2015-10-24  2:39       ` Cesar Philippidis
2015-10-24 21:15         ` Cesar Philippidis
2015-10-26 10:30           ` Jakub Jelinek
2015-10-26 22:44             ` Cesar Philippidis
2015-10-27  8:03               ` Jakub Jelinek
2015-10-27 20:21                 ` Nathan Sidwell
2015-10-21 19:19 ` [OpenACC 4/11] C " Nathan Sidwell
2015-10-22  8:25   ` Jakub Jelinek
2015-10-23 20:20     ` Cesar Philippidis
2015-10-23 20:40       ` Jakub Jelinek
2015-10-23 21:31         ` Jakub Jelinek
2015-10-23 21:32         ` Cesar Philippidis
2015-10-24  2:37           ` Cesar Philippidis
2015-10-24 13:08             ` Jakub Jelinek
2015-10-24 21:11               ` Cesar Philippidis
2015-10-26  9:47                 ` Jakub Jelinek
2015-10-26 10:09                   ` Jakub Jelinek
2015-10-26 22:32                   ` Cesar Philippidis
2015-10-27 20:23                     ` Nathan Sidwell
2015-10-23 21:25       ` Nathan Sidwell
2015-10-25 14:18         ` Nathan Sidwell
2015-10-21 19:32 ` [OpenACC 6/11] Reduction initialization Nathan Sidwell
2015-10-22  9:11   ` Jakub Jelinek
2015-10-27 22:27     ` Nathan Sidwell
2015-10-21 19:47 ` [OpenACC 7/11] execution model Nathan Sidwell
2015-10-22  9:32   ` Jakub Jelinek
2015-10-22 12:51     ` Nathan Sidwell
2015-10-22 13:01       ` Jakub Jelinek
2015-10-22 13:08         ` Nathan Sidwell
2015-10-25 15:03     ` Nathan Sidwell
2015-10-26 23:39       ` Nathan Sidwell
2015-10-27  8:33         ` Jakub Jelinek
2015-10-27 14:03           ` Nathan Sidwell
2015-10-28  5:45             ` Nathan Sidwell
2020-11-24 10:34   ` Thomas Schwinge
2015-10-21 19:50 ` [OpenACC 8/11] device-specific lowering Nathan Sidwell
2015-10-22  9:32   ` Jakub Jelinek
2015-10-22 12:59     ` Nathan Sidwell
2015-10-26 15:21   ` Jakub Jelinek
2015-10-26 16:23     ` Nathan Sidwell
2015-10-26 16:56       ` Jakub Jelinek
2015-10-26 18:10         ` Nathan Sidwell
2015-10-28  1:06     ` Nathan Sidwell
2015-10-21 19:51 ` [OpenACC 9/11] oacc_device_lower pass gate Nathan Sidwell
2015-10-22  9:33   ` Jakub Jelinek
2015-10-27 20:31     ` Nathan Sidwell
2015-10-21 19:52 ` [OpenACC 10/11] remove plugin restriction Nathan Sidwell
2015-10-22  9:38   ` Jakub Jelinek
2015-10-21 19:59 ` [OpenACC 11/11] execution tests Nathan Sidwell
2015-10-21 20:15   ` Ilya Verbin
2015-10-21 20:17     ` Nathan Sidwell
2015-10-28 14:30       ` Nathan Sidwell
2015-10-22  9:54   ` Jakub Jelinek
2015-10-22 14:02     ` Nathan Sidwell
2015-10-22 14:07       ` Jakub Jelinek
2015-10-22 14:23         ` Nathan Sidwell
2015-10-22 14:47           ` Cesar Philippidis
2015-10-22 14:58             ` Nathan Sidwell
2015-10-22 15:03             ` Jakub Jelinek
2015-10-22 15:08               ` Cesar Philippidis
2015-10-23 20:32               ` Cesar Philippidis
2015-10-24  2:56                 ` Cesar Philippidis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).