public inbox for gcc-patches@gcc.gnu.org
* [gomp] Move openacc vector& worker single handling to RTL
@ 2015-07-03 22:52 Nathan Sidwell
  2015-07-03 23:12 ` Jakub Jelinek
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-03 22:52 UTC (permalink / raw)
  To: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 4831 bytes --]

This patch moves the handling of vector-single and worker-single modes, and the 
transitions to/from partitioned mode, out of omp-low and into mach-dep-reorg. 
That allows the regular middle-end optimizers to behave normally -- with two 
exceptions, see below.

There are no libgomp regressions, and a number of progressions -- mainly private 
variables now 'just work'.

The approach taken is to have expand_omp_for_static_(no)chunk emit OpenACC 
builtins at the start and end of the loop -- the points where execution should 
transition into a partitioned mode and back to single mode.  I've actually used 
a single builtin with a constant argument saying whether it is the head or tail 
of the loop.  You could consider these to be like 'fork' and 'join' primitives, 
if that helps.
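Schematically, the loop then looks like this in GIMPLE (pseudocode; the actual 
builtin is GOACC_loop, with 0/1 selecting head or tail and a second argument 
naming the axis):

```
  GOACC_loop (0, axis);   /* head: transition into partitioned execution */
  ... partitioned loop body ...
  GOACC_loop (1, axis);   /* tail: transition back to single mode */
```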

We cope with multi-mode loops (over, say, the worker & vector dimensions) by 
emitting two loop heads and tails in nested sequence, i.e. 'head-worker, 
head-vector <loop> tail-vector, tail-worker'.  Thus at a transition we only 
have to consider one particular axis.

These builtins are made known to the duplication and merging optimizations as 
not to be duplicated or merged (see builtin_unique_p).  For instance, the jump 
threading optimizer already has to check that operations on the potentially 
threaded path are suitable for duplication, and this is an additional test 
there.  The tail-merging optimizer similarly has to determine that tails are 
identical, which is never true for this particular builtin.  The intent is that 
the loops are then maintained as single-entry-single-exit all the way through 
to RTL expansion.

Where and when these builtins are expanded to target-specific code is not 
fixed.  In the case of PTX they go all the way to RTL expansion.

At RTL expansion the builtins are expanded to volatile unspecs.  We insert 
'pre' markers too, as some code needs to know the last instruction before the 
transition.  These are uncopyable, and AFAICT RTL doesn't do tail merging (at 
least I've not encountered it), so again the SESE nature of the loop is 
preserved all the way to mach-dep-reorg.

That's where the fun starts.  We scan the CFG looking for the loop markers. 
First we break basic blocks so the head and tail markers are the first insns of 
their block.  That avoids needing a mode transition mid-block.  We then rescan 
the graph, discovering loops and adding each block to the loop in which it 
resides.  The entire function is modeled as a NULL loop.

Once that is done we walk the loop structure and insert state propagation code 
at the loop head points.  For vector propagation that'll be a sequence of PTX 
shuffle instructions.  For worker propagation it is a bit more complicated.  At 
the pre-head marker, we insert a spill of state to .shared memory (executed by 
the single active worker) and at the head marker we insert a fill (executed by 
all workers).  We also insert a sync barrier before the fill.  More on where 
that memory comes from later.
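As a rough sketch of the PTX this produces around a worker-level loop head 
(register names are illustrative; the buffer is the shared __worker_bcast 
array from the patch):

```
	// at the pre-head marker, the single active worker spills live state:
	st.shared.u32	[__worker_bcast], %r_live;
	// all workers synchronize before reading:
	bar.sync	0;
	// at the head marker, every worker fills from the buffer:
	ld.shared.u32	%r_live, [__worker_bcast];
```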

Finally we walk the loop structure again, inserting block- or loop-neutering 
code.  Where possible we try to skip entire blocks[*], but the basic approach 
is the same.  We insert a branch-around at the start of the initial block and, 
if needed, insert propagation code at the end of the final block (which might 
be the same block).  The vector-propagation case is again a simple shuffle, but 
the worker case is a spill/sync/fill sequence, with the spill done by the 
single active worker.  The subsequent unified branch is marked with an unspec 
operand, rather than relying on detecting the data flow.

Note, the branch-around is inserted using hidden branches that appear to the 
rest of the compiler as volatile unspecs referring to a later label.  I don't 
think the expense of creating new blocks is necessary or worthwhile -- this is 
control flow the compiler doesn't need to know about (if it did, I'd argue 
we're inserting it too early).
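For the vector (warp) axis, a neutered block might come out roughly like this 
(predicate and register names are illustrative):

```
	// lanes other than 0 branch around the single-mode code;
	// in RTL this is a 'hidden' branch, i.e. a volatile unspec:
	@ %p_neuter	bra	$skip;
	... code executed by lane 0 only ...
$skip:
	// propagate live state from lane 0 to all lanes:
	shfl.idx.b32	%r_live, %r_live, 0, 31;
```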

The worker spill/fill storage is a file-scope array variable, sized during 
compilation and emitted directly at the end of the compilation process.  Again, 
this is not registered with the rest of the compiler -- (a) I wasn't sure how 
to, and (b) I considered it an internal bit of the backend.  It is shared by 
all functions in this TU.  Unfortunately PTX doesn't appear to support COMMON, 
so making it shared across all TUs appears difficult -- one can always use LTO 
optimization anyway.

IMHO this is a step towards moving target-dependent handling into the target 
compiler and out of the more generic host-side compiler.

The changelog is separated into 3 parts:
- a) general infrastructure,
- b) additions,
- c) deletions.

comments?

nathan

[*] a possible optimization is to do superblock discovery, and skip those in a 
similar manner to loop skipping.

[-- Attachment #2: rtl-02072015-2.diff --]
[-- Type: text/plain, Size: 93510 bytes --]

2015-07-02  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* builtins.h (builtin_unique_p): Declare.
	* builtins.c (builtin_unique_p): New function.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for builtin_unique_p.
	* gimple.c (gimple_call_same_target_p): Add check for
	builtin_unique_p.

	Additions:
	* builtin-types.def (BT_FN_VOID_INT_INT): New.
	* builtins.c (expand_oacc_levels): New.
	(expand_oacc_loop): New.
	(expand_builtin): Adjust.
	* omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New.
	(expand_omp_for_static_nochunk): Add oacc loop head & tail calls.
	(expand_omp_for_static_chunk): Likewise.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add
	BUILT_IN_GOACC_LOOP.
	* omp-builtins.def (BUILT_IN_GOACC_LEVELS, BUILT_IN_GOACC_LOOP): New.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_loop): New.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_levels, nvptx_loop): New insns.
	(oacc_loop): New expand.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(nvptx_loop_head, nvptx_loop_tail, nvptx_loop_prehead,
	nvptx_loop_pretail, LOOP_MODE_CHANGE_P): New.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_loop): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(struct reorg_unspec, struct reorg_loop): New structs.
	(unspec_map_t): New map.
	(loop_t, work_loop_t): New types.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops,
	nvptx_discover_loops): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_loop): New.
	(nvptx_process_loops): New.
	(nvptx_neuter_loops): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops.
	(nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns.
	(nvptx_file_end): Output worker broadcast array definition.

	fortran/
	* types.def (BT_FN_VOID_INT_INT): New function type.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Index: gimple.c
===================================================================
--- gimple.c	(revision 225154)
+++ gimple.c	(working copy)
@@ -68,7 +68,7 @@ along with GCC; see the file COPYING3.
 #include "lto-streamer.h"
 #include "cgraph.h"
 #include "gimple-ssa.h"
-
+#include "builtins.h"
 
 /* All the tuples have their operand vector (if present) at the very bottom
    of the structure.  Therefore, the offset required to find the
@@ -1382,10 +1382,22 @@ gimple_call_same_target_p (const_gimple
   if (gimple_call_internal_p (c1))
     return (gimple_call_internal_p (c2)
 	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      /* If it is a unique builtin call, all calls are distinct.  */
+      if (DECL_BUILT_IN (decl) && builtin_unique_p (decl))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225154)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -5248,25 +5244,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: builtin-types.def
===================================================================
--- builtin-types.def	(revision 225154)
+++ builtin-types.def	(working copy)
@@ -236,6 +236,7 @@ DEF_FUNCTION_TYPE_1 (BT_FN_CONST_PTR_BND
 
 DEF_POINTER_TYPE (BT_PTR_FN_VOID_PTR, BT_FN_VOID_PTR)
 
+DEF_FUNCTION_TYPE_2 (BT_FN_VOID_INT_INT, BT_VOID, BT_INT, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_PTR_INT, BT_VOID, BT_PTR, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_STRING_STRING_CONST_STRING,
 		     BT_STRING, BT_STRING, BT_CONST_STRING)
Index: builtins.c
===================================================================
--- builtins.c	(revision 225154)
+++ builtins.c	(working copy)
@@ -274,6 +274,19 @@ is_builtin_fn (tree decl)
   return TREE_CODE (decl) == FUNCTION_DECL && DECL_BUILT_IN (decl);
 }
 
+bool
+builtin_unique_p (tree decl)
+{
+  if (DECL_BUILT_IN_CLASS (decl) != BUILT_IN_NORMAL)
+    return false;
+
+  /* The OpenACC Loop markers must not be cloned or deleted.  */
+  if (DECL_FUNCTION_CODE (decl) == BUILT_IN_GOACC_LOOP)
+    return true;
+
+  return false;
+}
+
 /* Return true if NODE should be considered for inline expansion regardless
    of the optimization level.  This means whenever a function is invoked with
    its "internal" name, which normally contains the prefix "__builtin".  */
@@ -5947,20 +5960,6 @@ expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6013,64 +6012,65 @@ expand_oacc_id (enum built_in_function f
   return result;
 }
 
-static rtx
-expand_oacc_ganglocal_ptr (rtx target ATTRIBUTE_UNUSED)
+static void
+expand_oacc_levels (tree exp)
 {
-#ifdef HAVE_ganglocal_ptr
-  enum insn_code icode;
-  icode = CODE_FOR_ganglocal_ptr;
-  rtx tmp = target;
-  if (!REG_P (tmp) || GET_MODE (tmp) != Pmode)
-    tmp = gen_reg_rtx (Pmode);
-  rtx insn = GEN_FCN (icode) (tmp);
-  if (insn != NULL_RTX)
+  rtx arg = expand_normal (CALL_EXPR_ARG (exp, 0));
+  unsigned limit = OACC_LOOP_MASK (OACC_HWM);
+
+  if (GET_CODE (arg) != CONST_INT || UINTVAL (arg) >= limit)
     {
-      emit_insn (insn);
-      return tmp;
+      error ("argument to %D must be constant in range 0 to %d",
+	     get_callee_fndecl (exp), limit - 1);
+      return;
     }
+  
+#ifdef HAVE_oacc_levels
+  emit_insn (gen_oacc_levels (arg));
 #endif
-  return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
+static void
+expand_oacc_loop (tree exp)
 {
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
+  rtx arg0 = expand_normal (CALL_EXPR_ARG (exp, 0));
+  if (GET_CODE (arg0) != CONST_INT || UINTVAL (arg0) >= 2)
+    {
+      error ("first argument to %D must be constant in range 0 to 1",
+	     get_callee_fndecl (exp));
+      return;
+    }
+  rtx arg1 = expand_normal (CALL_EXPR_ARG (exp, 1));
 
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
+  if (GET_CODE (arg1) != CONST_INT || UINTVAL (arg1) >= OACC_HWM)
     {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
+      error ("second argument to %D must be constant in range 0 to %d",
+	     get_callee_fndecl (exp), OACC_HWM - 1);
+      return;
     }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  
+#ifdef HAVE_oacc_loop
+  emit_insn (gen_oacc_loop (arg0, arg1));
+#endif
+}
 
+static rtx
+expand_oacc_ganglocal_ptr (rtx target ATTRIBUTE_UNUSED)
+{
+#ifdef HAVE_ganglocal_ptr
+  enum insn_code icode;
+  icode = CODE_FOR_ganglocal_ptr;
   rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
+  if (!REG_P (tmp) || GET_MODE (tmp) != Pmode)
+    tmp = gen_reg_rtx (Pmode);
+  rtx insn = GEN_FCN (icode) (tmp);
   if (insn != NULL_RTX)
     {
       emit_insn (insn);
       return tmp;
     }
-  return const0_rtx;
+#endif
+  return NULL_RTX;
 }
 
 /* Expand an expression EXP that calls a built-in function,
@@ -7219,20 +7219,20 @@ expand_builtin (tree exp, rtx target, rt
     case BUILT_IN_GOACC_NID:
       return expand_oacc_id (fcode, exp, target);
 
+    case BUILT_IN_GOACC_LEVELS:
+      expand_oacc_levels (exp);
+      return const0_rtx;
+
+    case BUILT_IN_GOACC_LOOP:
+      expand_oacc_loop (exp);
+      return const0_rtx;
+
     case BUILT_IN_GOACC_GET_GANGLOCAL_PTR:
       target = expand_oacc_ganglocal_ptr (target);
       if (target)
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225154)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,10 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_LEVELS, "GOACC_levels",
+		   BT_FN_VOID_INT, ATTR_NOTHROW_LEAF_LIST)
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_LOOP, "GOACC_loop",
+		   BT_FN_VOID_INT_INT, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: builtins.h
===================================================================
--- builtins.h	(revision 225154)
+++ builtins.h	(working copy)
@@ -50,6 +50,7 @@ extern struct target_builtins *this_targ
 extern bool force_folding_builtin_constant_p;
 
 extern bool is_builtin_fn (tree);
+extern bool builtin_unique_p (tree);
 extern bool get_object_alignment_1 (tree, unsigned int *,
 				    unsigned HOST_WIDE_INT *);
 extern unsigned int get_object_alignment (tree);
Index: fortran/types.def
===================================================================
--- fortran/types.def	(revision 225154)
+++ fortran/types.def	(working copy)
@@ -115,6 +115,7 @@ DEF_FUNCTION_TYPE_2 (BT_FN_VOID_VPTR_INT
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_VPTR_INT, BT_BOOL, BT_VOLATILE_PTR, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_SIZE_CONST_VPTR, BT_BOOL, BT_SIZE,
 		     BT_CONST_VOLATILE_PTR)
+DEF_FUNCTION_TYPE_2 (BT_FN_VOID_INT_INT, BT_VOID, BT_INT, BT_INT)
 DEF_FUNCTION_TYPE_2 (BT_FN_BOOL_INT_BOOL, BT_BOOL, BT_INT, BT_BOOL)
 DEF_FUNCTION_TYPE_2 (BT_FN_VOID_UINT_UINT, BT_VOID, BT_UINT, BT_UINT)
 
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225154)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,23 @@
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_LEVELS
+   UNSPECV_LOOP
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +258,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +820,7 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +829,34 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1360,72 @@
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_levels"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_LEVELS)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// levels %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+  "// loop %0, %1;"
+)
+
+(define_expand "oacc_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+{
+  nvptx_expand_oacc_loop (operands[0], operands[1]);
 })
 
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1531,8 @@
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225154)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,15 @@
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
+
+#define nvptx_loop_head		0
+#define nvptx_loop_tail		1
+#define LOOP_MODE_CHANGE_P(X) ((X) < 2)
+#define nvptx_loop_prehead 	2
+#define nvptx_loop_pretail 	3
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +107,16 @@ static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +144,8 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1075,7 @@ nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 \f
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1089,203 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc_loop primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_loop (rtx kind, rtx mode)
+{
+  /* Emit pre-tail for all loops and emit pre-head for worker level.  */
+  if (UINTVAL (kind) || UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_loop (GEN_INT (UINTVAL (kind) + 2), mode));
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default:debug_rtx (reg);gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1880,7 @@ condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1942,10 @@ nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2102,7 @@ get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2184,780 @@ nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* An unspec of interest and the BB in which it resides.  */
+struct reorg_unspec
+{
+  rtx_insn *insn;
+  basic_block block;
+};
 
-  while (!warp_reg_worklist.is_empty ())
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct reorg_loop
+{
+  /* Parent loop.  */
+  reorg_loop *parent;
+  
+  /* Next sibling loop.  */
+  reorg_loop *next;
+
+  /* First child loop.  */
+  reorg_loop *inner;
+
+  /* Partitioning mode of the loop.  */
+  unsigned mode;
+
+  /* Partitioning used within inner loops.  */
+  unsigned inner_mask;
+
+  /* Location of loop head and tail.  The head is the first block in
+     the partitioned loop and the tail is the first block out of the
+     partitioned loop.  */
+  basic_block head_block;
+  basic_block tail_block;
+
+  rtx_insn *head_insn;
+  rtx_insn *tail_insn;
+
+  rtx_insn *pre_head_insn;
+  rtx_insn *pre_tail_insn;
+
+  /* Basic blocks in this loop, but not in child loops.  The HEAD and
+     PRETAIL blocks are in the loop.  The PREHEAD and TAIL blocks
+     are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  reorg_loop (reorg_loop *parent, unsigned mode);
+  ~reorg_loop ();
+};
+
+typedef auto_vec<reorg_unspec> unspec_vec_t;
+
+/* Constructor links the new loop into its parent's chain of
+   children.  */
+
+reorg_loop::reorg_loop (reorg_loop *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  head_block = tail_block = 0;
+  head_insn = tail_insn = 0;
+  pre_head_insn = pre_tail_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+reorg_loop::~reorg_loop ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to unspecs.  */
+typedef hash_map<basic_block, rtx_insn *> unspec_map_t;
+
+/* Split basic blocks so that each loop head & tail unspec is at the
+   start of its basic block.  Thus afterwards each block will have a
+   single partitioning mode.  We also do the same for return insns,
+   as they are executed by every thread.  Return the partitioning
+   execution mode of the function as a whole.  Populate MAP with head
+   and tail blocks.  We also clear the BB visited flag, which is
+   used when finding loops.  */
+
+static unsigned
+nvptx_split_blocks (unspec_map_t *map)
+{
+  auto_vec<reorg_unspec> worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned levels = ~0U; /* Assume the worst WRT required neutering.  */
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear the visited flag, for use by the loop locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
-	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+	  if (!INSN_P (insn))
 	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_levels:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, this should all
+		   agree.  */
+		gcc_assert (levels == ~0U || l == levels);
+		levels = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		rtx kind = XVECEXP (PATTERN (insn), 0, 0);
+		if (!LOOP_MODE_CHANGE_P (UINTVAL (kind)))
+		  {
+		    seen_insn = true;
+		    continue;
+		  }
+	      }
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
 
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  if (seen_insn)
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	      /* We've found an instruction that must be at the start of
+		 a block, but isn't.  Add it to the worklist.  */
+	      reorg_unspec uns;
+	      uns.insn = insn;
+	      uns.block = block;
+	      worklist.safe_push (uns);
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  reorg_unspec *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->block)
+	{
+	  block = elt->block;
+	  remap = block;
+	}
+      
+      /* Split the block before INSN.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->insn));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->insn;
+    }
+
+  return levels;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, unsigned expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == CODE_FOR_nvptx_loop
+	      && (UINTVAL (XVECEXP (PATTERN (pre_insn), 0, 0))
+		  == expected));
+  return pre_insn;
+}
+
+typedef std::pair<basic_block, reorg_loop *> loop_t;
+typedef auto_vec<loop_t> work_loop_t;
+
+/*  Dump this loop and all its inner loops.  */
+
+static void
+nvptx_dump_loops (reorg_loop *loop, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, loop->mode,
+	   loop->head_block ? loop->head_block->index : -1,
+	   loop->tail_block ? loop->tail_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (loop->inner)
+    nvptx_dump_loops (loop->inner, depth + 1);
+
+  if (loop->next)
+    nvptx_dump_loops (loop->next, depth);
+}
+
+/* Walk the CFG looking for loop head & tail markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static reorg_loop *
+nvptx_discover_loops (unspec_map_t *map)
+{
+  reorg_loop *outer_loop = new reorg_loop (0, OACC_null);
+  work_loop_t worklist;
+  basic_block block;
+
+  /* Mark the entry and exit blocks as visited.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (loop_t (block, outer_loop));
+
+  while (worklist.length ())
+    {
+      loop_t loop = worklist.pop ();
+      reorg_loop *l = loop.second;
+
+      block = loop.first;
+
+      /* Have we met this block?  */
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		unsigned kind = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 1));
+		
+		switch (kind)
+		  {
+		  case nvptx_loop_head:
+		    /* Loop head, create a new inner loop and add it
+		       into our parent's child list.  */
+		    l = new reorg_loop (l, mode);
+		    l->head_block = block;
+		    l->head_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_head_insn
+			= nvptx_discover_pre (block, nvptx_loop_prehead);
+		    break;
+
+		  case nvptx_loop_tail:
+		    /* A loop tail.  Finish the current loop and
+		       return to parent.  */
+		    gcc_assert (l->mode == mode);
+		    l->tail_block = block;
+		    l->tail_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_tail_insn
+			= nvptx_discover_pre (block, nvptx_loop_pretail);
+		    l = l->parent;
+		    break;
+		    
+		  default:
+		    gcc_unreachable ();
+		  }
+	      }
+	      break;
+
+	    default: gcc_unreachable ();
 	    }
 	}
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+
+      loop.second = l;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	{
+	  loop.first = e->dest;
+	  
+	  worklist.safe_push (loop);
+	}
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_loops (outer_loop, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_loop;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize the pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill the state that is live at the start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill).  The sequence is inserted just
+   after INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default: break;
+	case CODE_FOR_nvptx_loop:
+	case CODE_FOR_oacc_levels:
+	  return;
+	}
 
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & (1 << OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if ((1 << mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if ((1 << OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode: do spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* LOOP is a loop that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at loop head
+   and ending at loop pre-tail.  */
+
+static void
+nvptx_skip_loop (unsigned mask, reorg_loop *loop)
+{
+  basic_block tail = loop->tail_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, loop->head_block, pre_tail);
+}
+
+/* Process the loop LOOP and all its contained loops.  We do
+   everything but the neutering.  Return mask of partition modes used
+   within this loop.  */
+
+static unsigned
+nvptx_process_loops (reorg_loop *loop)
+{
+  unsigned inner_mask = 1 << loop->mode;
+  
+  /* Do the inner loops first.  */
+  if (loop->inner)
+    {
+      loop->inner_mask = nvptx_process_loops (loop->inner);
+      inner_mask |= loop->inner_mask;
+    }
+  
+  switch (loop->mode)
+    {
+    case OACC_null:
+      /* Dummy loop.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (loop->head_block, loop->head_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, loop->head_block, loop->head_insn);
+	nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, loop->head_insn);
+	nvptx_wsync (true, loop->pre_tail_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (loop->next)
+    inner_mask |= nvptx_process_loops (loop->next);
+  return inner_mask;
+}
+
+/* Neuter the loop described by LOOP.  We recurse in depth-first
+   order.  LEVELS is the partitioning of the execution and OUTER is
+   the partitioning of the loops we are contained in.  */
+
+static void
+nvptx_neuter_loops (reorg_loop *loop, unsigned levels, unsigned outer)
+{
+  unsigned me = (1 << loop->mode) & ((1 << OACC_worker) | (1 << OACC_vector));
+  unsigned skip_mask = 0, neuter_mask = 0;
+  
+  if (loop->inner)
+    nvptx_neuter_loops (loop->inner, levels, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & (1 << mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(levels & (1 << mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (loop->inner_mask & (1 << mode)
+	       || !loop->head_insn)
+	/* Partitioning inside this loop, or we're not a loop: neuter
+	   individual blocks.  */
+	neuter_mask |= 1 << mode;
+      else if (!loop->parent || !loop->parent->head_insn
+	       || loop->parent->inner_mask & (1 << mode))
+	/* Parent isn't a loop or contains this partitioning: skip
+	   loop at this level.  */
+	skip_mask |= 1 << mode;
+      else
+	{ /* Parent will skip this loop itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+    nvptx_skip_loop (skip_mask, loop);
+  
+  if (loop->next)
+    nvptx_neuter_loops (loop->next, levels, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fixup subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2966,34 @@ nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  unspec_map_t unspec_map;
+  unsigned levels = nvptx_split_blocks (&unspec_map);
+
+  /* Compute live regs */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  reorg_loop *loops = nvptx_discover_loops (&unspec_map);
+
+  nvptx_process_loops (loops);
+  nvptx_neuter_loops (loops, levels, 0);
 
+  delete loops;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3042,21 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.   */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_loop:
+      return true;
+    default:
+      return false;
+    }
 }
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3096,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225154)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@ struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225154)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,7 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_loop (rtx, rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225154)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  tree decl = gimple_call_fndecl (as_a <gcall *> (stmt));
+
+	  if (decl && DECL_BUILT_IN (decl)
+	      && builtin_unique_p (decl))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225154)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
@@ -1779,6 +1778,7 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_SECTIONS_END_CANCEL:
 	case BUILT_IN_GOMP_SINGLE_COPY_START:
 	case BUILT_IN_GOMP_SINGLE_COPY_END:
+	case BUILT_IN_GOACC_LOOP:
 	  return true;
 
 	default:
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225154)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@ struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@ static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3742,15 +3734,6 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3761,6 +3744,56 @@ maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_loop_head (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call
+      (builtin_decl_explicit (BUILT_IN_GOACC_LEVELS), 1,
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  tree loop_decl = builtin_decl_explicit (BUILT_IN_GOACC_LOOP);
+  tree arg0 = build_int_cst (unsigned_type_node, 0);  /* Head marker.  */
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call (loop_decl, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_loop_tail (gimple_seq *seq, unsigned mask)
+{
+  tree loop_decl = builtin_decl_explicit (BUILT_IN_GOACC_LOOP);
+  tree arg0 = build_int_cst (unsigned_type_node, 1);  /* Tail marker.  */
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call (loop_decl, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -7190,21 +7223,6 @@ expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -7213,6 +7231,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7240,6 +7259,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_LOOP_TAIL
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -7281,10 +7305,6 @@ expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7306,6 +7326,15 @@ expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7364,6 +7393,7 @@ expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7547,18 +7577,19 @@ expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7633,6 +7664,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7643,6 +7675,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7661,6 +7694,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_LOOP_TAIL
 */
 
 static void
@@ -7694,10 +7728,6 @@ expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7709,7 +7739,7 @@ expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
 
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7731,6 +7761,14 @@ expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7989,18 +8027,20 @@ expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9571,20 +9611,6 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -10360,7 +10386,6 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10504,669 +10529,6 @@ build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -11185,8 +10547,6 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -13220,10 +12580,7 @@ lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -13510,7 +12867,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -14227,16 +13583,7 @@ make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225154)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@ enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 

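As a footnote for readers skimming the interface change: the OACC_LOOP_MASK
macro moved into omp-low.h above encodes each parallelism level as one bit, so
a loop's partitioning is a small bitmask, and the head/tail markers are emitted
by walking the levels outer->inner and inner->outer respectively.  A minimal
standalone sketch of that encoding (mirroring the enum and macro outside GCC;
the helper name is hypothetical):

```cpp
#include <cassert>
#include <vector>

/* Mirror of the enum and macro the patch moves into omp-low.h.  */
enum oacc_loop_levels { OACC_gang, OACC_worker, OACC_vector, OACC_HWM };
#define OACC_LOOP_MASK(X) (1 << (X))

/* Order in which gen_oacc_loop_head would emit markers for MASK:
   outer->inner (gang, then worker, then vector).  The tail emits the
   reverse order, so head/tail marker pairs nest properly.  */
std::vector<int> head_marker_order (unsigned mask)
{
  std::vector<int> order;
  for (int level = OACC_gang; level != OACC_HWM; level++)
    if (mask & OACC_LOOP_MASK (level))
      order.push_back (level);
  return order;
}
```

For a worker+vector loop this yields worker-head, vector-head ... vector-tail,
worker-tail, so each transition only has to consider one axis at a time.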
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-03 22:52 [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
@ 2015-07-03 23:12 ` Jakub Jelinek
  2015-07-04 20:41   ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Jelinek @ 2015-07-03 23:12 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches

On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:
> IMHO this is a step towards putting target-dependent handling in the target
> compiler and out of the more generic host-side compiler.
> 
> The changelog is separated into 3 parts
> - a) general infrastructure
> - b) additions
> - c) deletions.
> 
> comments?

Thanks for working on it.

If the builtins are not meant to be used by users directly (I assume they
aren't) nor have a 1-1 correspondence to a library routine, it is much
better to emit them as internal calls (see internal-fn.{c,def}) instead of
BUILT_IN_NORMAL functions.

	Jakub

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-03 23:12 ` Jakub Jelinek
@ 2015-07-04 20:41   ` Nathan Sidwell
  2015-07-06 19:35     ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-04 20:41 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On 07/03/15 19:11, Jakub Jelinek wrote:
> On Fri, Jul 03, 2015 at 06:51:57PM -0400, Nathan Sidwell wrote:
>> IMHO this is a step towards putting target-dependent handling in the target
>> compiler and out of the more generic host-side compiler.
>>
>> The changelog is separated into 3 parts
>> - a) general infrastructure
>> - b) additions
>> - c) deletions.
>>
>> comments?
>
> Thanks for working on it.
>
> If the builtins are not meant to be used by users directly (I assume they
> aren't) nor have a 1-1 correspondence to a library routine, it is much
> better to emit them as internal calls (see internal-fn.{c,def}) instead of
> BUILT_IN_NORMAL functions.

thanks, Cesar pointed me at the internal builtins too -- I'll take a look.

nathan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-04 20:41   ` Nathan Sidwell
@ 2015-07-06 19:35     ` Nathan Sidwell
  2015-07-07  9:54       ` Jakub Jelinek
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-06 19:35 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 759 bytes --]

On 07/04/15 16:41, Nathan Sidwell wrote:
> On 07/03/15 19:11, Jakub Jelinek wrote:

>> If the builtins are not meant to be used by users directly (I assume they
>> aren't) nor have a 1-1 correspondence to a library routine, it is much
>> better to emit them as internal calls (see internal-fn.{c,def}) instead of
>> BUILT_IN_NORMAL functions.
>

This patch uses internal builtins.  I had to make one additional change to 
tree-ssa-tail-merge.c's same_succ_def::equal hash compare function: the new 
internal fn I introduced should compare EQ but not otherwise compare EQUAL, and 
that was blowing up the hash function, which relied on EQUAL only.  I don't know 
why I didn't hit this problem in the previous patch with the regular builtin.
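To make the tail-merge constraint concrete, here is a minimal standalone model 
(names and types hypothetical, not GCC's actual hash_table API): a comparator 
may legitimately reject elements whose hashes collide -- which is what EQ-only 
comparison for unique calls does -- but it must never report equality for 
elements with different hashes.

```cpp
#include <cassert>
#include <cstddef>

/* Toy model of a call statement as tree-ssa-tail-merge.c sees it.
   'unique_p' stands in for gimple_call_internal_unique_p: such calls
   may only compare equal to themselves (EQ), never structurally
   (EQUAL).  */
struct toy_stmt
{
  int callee;     /* stands in for the called function */
  bool unique_p;
};

/* Hash over structural properties only.  */
size_t toy_hash (const toy_stmt &s)
{
  return (size_t) s.callee;
}

/* Comparator: EQ for unique calls, EQUAL otherwise.  The table
   invariant that must hold is only equal(a,b) => hash(a) == hash(b);
   refusing to merge hash-colliding elements is always legal.  */
bool toy_equal (const toy_stmt &a, const toy_stmt &b)
{
  if (a.unique_p || b.unique_p)
    return &a == &b;
  return a.callee == b.callee;
}
```

With this shape, two distinct unique calls hash identically yet never compare 
equal, so tail merging keeps their containing blocks apart.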

comments?

nathan


[-- Attachment #2: rtl-06072015-2.diff --]
[-- Type: text/plain, Size: 91625 bytes --]

2015-07-06  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_LEVELS, GOACC_LOOP): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_LOOP.
	(expand_GOACC_LEVELS, expand_GOACC_LOOP): New.
	* omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New.
	(expand_omp_for_static_nochunk): Add oacc loop head & tail calls.
	(expand_omp_for_static_chunk): Likewise.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add
	BUILT_IN_GOACC_LOOP.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_loop): New.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_levels, nvptx_loop): New insns.
	(oacc_loop): New expand.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(nvptx_loop_head, nvptx_loop_tail, nvptx_loop_prehead,
	nvptx_loop_pretail, LOOP_MODE_CHANGE_P: New.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_loop): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(struct reorg_unspec, struct reorg_loop): New structs.
	(unspec_map_t): New map.
	(loop_t, work_loop_t): New types.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops,
	nvptx_discover_loops): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_loop): New.
	(nvptx_process_loops): New.
	(nvptx_neuter_loops): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops.
	(nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	arp_equal_pseudos.

Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@ struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@ static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,56 @@ maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_loop_head (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call_internal
+      (IFN_GOACC_LEVELS, 1, 
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  tree arg0 = build_int_cst (unsigned_type_node, 0);
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_LOOP, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_loop_tail (gimple_seq *seq, unsigned mask)
+{
+  tree arg0 = build_int_cst (unsigned_type_node, 1);
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_LOOP, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6810,6 @@ expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6800,6 +6818,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -6827,6 +6846,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_LOOP_TAIL
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -6868,10 +6892,6 @@ expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6913,15 @@ expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -6951,6 +6980,7 @@ expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+	
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7134,18 +7164,19 @@ expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+    
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7220,6 +7251,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7230,6 +7262,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7248,6 +7281,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_LOOP_TAIL
 */
 
 static void
@@ -7281,10 +7315,6 @@ expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7296,7 +7326,7 @@ expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
-
+  
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7318,6 +7348,14 @@ expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,18 +7614,20 @@ expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9158,20 +9198,6 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9973,6 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10116,6 @@ build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10134,6 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11702,7 @@ lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11988,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12704,7 @@ make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@ enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@ bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+      
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we cannot thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,5 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_LEVELS, ECF_NOTHROW | ECF_LEAF, "..")
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_NOTHROW | ECF_LEAF, "..")
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,7 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_loop (rtx, rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,23 @@
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_LEVELS
+   UNSPECV_LOOP
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +261,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +823,7 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +832,34 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1363,72 @@
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_levels"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_LEVELS)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// levels %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+  "// loop %0, %1;"
+)
+
+(define_expand "oacc_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+{
+  nvptx_expand_oacc_loop (operands[0], operands[1]);
 })
 
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64-bit object into two 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack two 32-bit ints into a 64-bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1534,8 @@
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,15 @@
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
+
+#define nvptx_loop_head		0
+#define nvptx_loop_tail		1
+#define LOOP_MODE_CHANGE_P(X) ((X) < 2)
+#define nvptx_loop_prehead 	2
+#define nvptx_loop_pretail 	3
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +107,16 @@ static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.  It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +144,8 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1075,7 @@ nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 \f
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1089,203 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc_loop primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_loop (rtx kind, rtx mode)
+{
+  /* Emit pre-tail for all loops and emit pre-head for worker level.  */
+  if (UINTVAL (kind) || UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_loop (GEN_INT (UINTVAL (kind) + 2), mode));
+}
+
+/* Generate instruction(s) to unpack a 64-bit object into two 32-bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack two 32-bit objects into a 64-bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default: debug_rtx (reg); gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1880,7 @@ condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1942,10 @@ nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+      
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2102,7 @@ get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2184,782 @@ nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* An unspec of interest and the BB in which it resides.  */
+struct reorg_unspec
+{
+  rtx_insn *insn;
+  basic_block block;
+};
 
-  while (!warp_reg_worklist.is_empty ())
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct reorg_loop
+{
+  /* Parent loop.  */
+  reorg_loop *parent;
+  
+  /* Next sibling loop.  */
+  reorg_loop *next;
+
+  /* First child loop.  */
+  reorg_loop *inner;
+
+  /* Partitioning mode of the loop.  */
+  unsigned mode;
+
+  /* Partitioning used within inner loops. */
+  unsigned inner_mask;
+
+  /* Location of loop head and tail.  The head is the first block in
+     the partitioned loop and the tail is the first block out of the
+     partitioned loop.  */
+  basic_block head_block;
+  basic_block tail_block;
+
+  rtx_insn *head_insn;
+  rtx_insn *tail_insn;
+
+  rtx_insn *pre_head_insn;
+  rtx_insn *pre_tail_insn;
+
+  /* Basic blocks in this loop, but not in child loops.  The HEAD and
+     PRETAIL blocks are in the loop.  The PREHEAD and TAIL blocks
+     are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  reorg_loop (reorg_loop *parent, unsigned mode);
+  ~reorg_loop ();
+};
+
+typedef auto_vec<reorg_unspec> unspec_vec_t;
+
+/* Constructor links the new loop into its parent's chain of
+   children.  */
+
+reorg_loop::reorg_loop (reorg_loop *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  head_block = tail_block = 0;
+  head_insn = tail_insn = 0;
+  pre_head_insn = pre_tail_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+reorg_loop::~reorg_loop ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to unspecs.  */
+typedef hash_map<basic_block, rtx_insn *> unspec_map_t;
+
+/* Split basic blocks so that each loop head & tail unspec is at the
+   start of its basic block.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Return the partitioning
+   execution mode of the function as a whole.  Populate MAP with head
+   and tail blocks.  We also clear the BB visited flag, which is
+   used when finding loops.  */
+
+static unsigned
+nvptx_split_blocks (unspec_map_t *map)
+{
+  auto_vec<reorg_unspec> worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned levels = ~0U; // Assume the worst WRT required neutering
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by the loop locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
-	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+	  if (!INSN_P (insn))
 	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_levels:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, this should all
+		   agree.  */
+		gcc_assert (levels == ~0U || l == levels);
+		levels = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		rtx kind = XVECEXP (PATTERN (insn), 0, 0);
+		if (!LOOP_MODE_CHANGE_P (UINTVAL (kind)))
+		  {
+		    seen_insn = true;
+		    continue;
+		  }
+	      }
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
 
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  if (seen_insn)
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	      /* We've found an instruction that  must be at the start of
+		 a block, but isn't.  Add it to the worklist.  */
+	      reorg_unspec uns;
+	      uns.insn = insn;
+	      uns.block = block;
+	      worklist.safe_push (uns);
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  reorg_unspec *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->block)
+	{
+	  block = elt->block;
+	  remap = block;
+	}
+      
+      /* Split the block before the insn.  The insn is now in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->insn));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->insn;
+    }
+
+  return levels;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, unsigned expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == CODE_FOR_nvptx_loop
+	      && (UINTVAL (XVECEXP (PATTERN (pre_insn), 0, 0))
+		  == expected));
+  return pre_insn;
+}
+
+typedef std::pair<basic_block, reorg_loop *> loop_t;
+typedef auto_vec<loop_t> work_loop_t;
+
+/*  Dump this loop and all its inner loops.  */
+
+static void
+nvptx_dump_loops (reorg_loop *loop, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, loop->mode,
+	   loop->head_block ? loop->head_block->index : -1,
+	   loop->tail_block ? loop->tail_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (loop->inner)
+    nvptx_dump_loops (loop->inner, depth + 1);
+
+  if (loop->next)
+    nvptx_dump_loops (loop->next, depth);
+}
+
+/* Walk the CFG looking for loop head & tail markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static reorg_loop *
+nvptx_discover_loops (unspec_map_t *map)
+{
+  reorg_loop *outer_loop = new reorg_loop (0, OACC_null);
+  work_loop_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (loop_t (block, outer_loop));
+
+  while (worklist.length ())
+    {
+      loop_t loop = worklist.pop ();
+      reorg_loop *l = loop.second;
+
+      block = loop.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		unsigned kind = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 1));
+		
+		switch (kind)
+		  {
+		  case nvptx_loop_head:
+		    /* Loop head, create a new inner loop and add it
+		       into our parent's child list.  */
+		    l = new reorg_loop (l, mode);
+		    l->head_block = block;
+		    l->head_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_head_insn
+			= nvptx_discover_pre (block, nvptx_loop_prehead);
+		    break;
+
+		  case nvptx_loop_tail:
+		    /* A loop tail.  Finish the current loop and
+		       return to parent.  */
+		    gcc_assert (l->mode == mode);
+		    l->tail_block = block;
+		    l->tail_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_tail_insn
+			= nvptx_discover_pre (block, nvptx_loop_pretail);
+		    l = l->parent;
+		    break;
+		    
+		  default:
+		    gcc_unreachable ();
+		  }
+	      }
+	      break;
+
+	    default: gcc_unreachable ();
 	    }
 	}
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+
+      loop.second = l;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	{
+	  loop.first = e->dest;
+	  
+	  worklist.safe_push (loop);
+	}
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_loops (outer_loop, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_loop;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function that generates
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize the pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill the state that is live at the start of BLOCK.  PRE_P
+   indicates whether this is just before partitioned mode starts (do
+   spill), or just after it starts (do fill).  The sequence is
+   inserted just after INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default:break;
+	case CODE_FOR_nvptx_loop:
+	case CODE_FOR_oacc_levels:
+	  return;
+	}
 
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, so do a spill & fill.  By
+	     construction we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* LOOP is a loop that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at loop head
+   and ending at loop pre-tail.  */
+
+static void
+nvptx_skip_loop (unsigned mask, reorg_loop *loop)
+{
+  basic_block tail = loop->tail_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, loop->head_block, pre_tail);
+}
+
+/* Process the loop LOOP and all its contained loops.  We do
+   everything but the neutering.  Return mask of partition modes used
+   within this loop.  */
+
+static unsigned
+nvptx_process_loops (reorg_loop *loop)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (loop->mode);
+  
+  /* Do the inner loops first.  */
+  if (loop->inner)
+    {
+      loop->inner_mask = nvptx_process_loops (loop->inner);
+      inner_mask |= loop->inner_mask;
+    }
+  
+  switch (loop->mode)
+    {
+    case OACC_null:
+      /* Dummy loop.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (loop->head_block, loop->head_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, loop->head_block, loop->head_insn);
+	nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, loop->head_insn);
+	nvptx_wsync (true, loop->pre_tail_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (loop->next)
+    inner_mask |= nvptx_process_loops (loop->next);
+  return inner_mask;
+}
+
+/* Neuter the loop described by LOOP.  We recurse in depth-first
+   order.  LEVELS is the partitioning of the execution and OUTER is
+   the partitioning of the loops we are contained in.  */
+
+static void
+nvptx_neuter_loops (reorg_loop *loop, unsigned levels, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (loop->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned skip_mask = 0, neuter_mask = 0;
+  
+  if (loop->inner)
+    nvptx_neuter_loops (loop->inner, levels, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(levels & OACC_LOOP_MASK (mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (loop->inner_mask & OACC_LOOP_MASK (mode)
+	       || !loop->head_insn)
+	/* Partitioning inside this loop, or we're not a loop: neuter
+	   individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!loop->parent || !loop->parent->head_insn
+	       || loop->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a loop or contains this partitioning: skip
+	   loop at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this loop itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+    nvptx_skip_loop (skip_mask, loop);
+  
+  if (loop->next)
+    nvptx_neuter_loops (loop->next, levels, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fixup subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2968,34 @@ nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  unspec_map_t unspec_map;
+  unsigned levels = nvptx_split_blocks (&unspec_map);
+
+  /* Compute live registers.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  reorg_loop *loops = nvptx_discover_loops (&unspec_map);
+
+  nvptx_process_loops (loops);
+  nvptx_neuter_loops (loops, levels, 0);
 
+  delete loops;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3044,21 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.   */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_loop:
+      return true;
+    default:
+      return false;
+    }
 }
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3098,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@ struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,19 @@ init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_LOOP: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2003,28 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+
+static void
+expand_GOACC_LEVELS (gcall *stmt)
+{
+  rtx mask = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_levels
+  emit_insn (gen_oacc_levels (mask));
+#endif
+}
+
+static void
+expand_GOACC_LOOP (gcall *stmt)
+{
+  rtx kind = expand_normal (gimple_call_arg (stmt, 0));
+  rtx level = expand_normal (gimple_call_arg (stmt, 1));
+  
+#ifdef HAVE_oacc_loop
+  emit_insn (gen_oacc_loop (kind, level));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-06 19:35     ` Nathan Sidwell
@ 2015-07-07  9:54       ` Jakub Jelinek
  2015-07-07 14:13         ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Jelinek @ 2015-07-07  9:54 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches

On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> On 07/04/15 16:41, Nathan Sidwell wrote:
> >On 07/03/15 19:11, Jakub Jelinek wrote:
> 
> >>If the builtins are not meant to be used by users directly (I assume they
> >>aren't) nor have a 1-1 correspondence to a library routine, it is much
> >>better to emit them as internal calls (see internal-fn.{c,def}) instead of
> >>BUILT_IN_NORMAL functions.
> >
> 
> This patch uses internal builtins, I had to make one additional change to
> tree-ssa-tail-merge.c's same_succ_def::equal hash compare function.  The new
> internal fn I introduced should compare EQ but not otherwise compare EQUAL,
> and that was blowing up the has function, which relied on EQUAL only.  I
> don't know why I didn't hit this problem in the previous patch with the
> regular builtin.

How does this interact with
#pragma acc routine {gang,worker,vector,seq} ?
Or is that something to be added later on?

	Jakub


* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-07  9:54       ` Jakub Jelinek
@ 2015-07-07 14:13         ` Nathan Sidwell
  2015-07-07 14:22           ` Jakub Jelinek
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-07 14:13 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On 07/07/15 05:54, Jakub Jelinek wrote:
> On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:

> How does this interact with
> #pragma acc routine {gang,worker,vector,seq} ?
> Or is that something to be added later on?

That is to be added later on.  I suspect such routines will trivially work, as 
they'll be marked up with the loop head/tail functions and levels builtin (the 
latter might need a bit of reworking).  What will need additional work at that 
point is the callers of routines -- they're typically called from a foo-single 
mode, but need to get all threads into the called function.  I'm thinking each 
call site will look like a mini-loop[*] surrounded by a hesd/tail marker.  (all 
that can be done in the device-side compiler once real call sites are known.)

nathan

[*] of course it won't be a loop.  Perhaps fork/join are less confusing names 
after all.  WDYT?


* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-07 14:13         ` Nathan Sidwell
@ 2015-07-07 14:22           ` Jakub Jelinek
  2015-07-07 14:43             ` Nathan Sidwell
  2015-07-08 14:48             ` Nathan Sidwell
  0 siblings, 2 replies; 31+ messages in thread
From: Jakub Jelinek @ 2015-07-07 14:22 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches

On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:
> On 07/07/15 05:54, Jakub Jelinek wrote:
> >On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> 
> >How does this interact with
> >#pragma acc routine {gang,worker,vector,seq} ?
> >Or is that something to be added later on?
> 
> That is to be added later on.  I suspect such routines will trivially work,
> as they'll be marked up with the loop head/tail functions and levels builtin
> (the latter might need a bit of reworking).  What will need additional work
> at that point is the callers of routines -- they're typically called from a
> foo-single mode, but need to get all threads into the called function.  I'm
> thinking each call site will look like a mini-loop[*] surrounded by a
> head/tail marker.  (All that can be done in the device-side compiler once
> real call sites are known.)

Wouldn't function attributes be better for that case, and just use the internal
functions for the case when the mode is being changed in the middle of
function?

I agree that fork/join might be less confusing.

BTW, where do you plan to lower the internal functions for non-PTX?
Doing it in RTL mach reorg is too late for those, we shouldn't be writing it
for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
is the same.

	Jakub


* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-07 14:22           ` Jakub Jelinek
@ 2015-07-07 14:43             ` Nathan Sidwell
  2015-07-08 14:48             ` Nathan Sidwell
  1 sibling, 0 replies; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-07 14:43 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

On 07/07/15 10:22, Jakub Jelinek wrote:
> On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:

> Wouldn't function attributes be better for that case, and just use the internal
> functions for the case when the mode is being changed in the middle of
> function?

It may be.  I've been thinking how the top-level offloaded function (kernel) 
should be marked to specify gangs/worker/vector dimensions to allow a less 
device-specific launch mechanism.  I suspect that and routines will have similar 
solutions.

> I agree that fork/join might be less confusing.
>
> BTW, where do you plan to lower the internal functions for non-PTX?
> Doing it in RTL mach reorg is too late for those, we shouldn't be writing it
> for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
> is the same.

I suspect other devices can add a new device-specific lowering pass somewhere 
soon after the LTO readback.   I think we're going to need that pass for some 
other pieces of PTX.

FWIW on a device that has a PTX-like architecture, I think this specific piece 
should be done as late as possible.  Perhaps pieces of the PTX mach-dep-reorg 
can be abstracted for general use?

nathan


* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-07 14:22           ` Jakub Jelinek
  2015-07-07 14:43             ` Nathan Sidwell
@ 2015-07-08 14:48             ` Nathan Sidwell
  2015-07-08 14:58               ` Jakub Jelinek
  1 sibling, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-08 14:48 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 639 bytes --]

On 07/07/15 10:22, Jakub Jelinek wrote:

> I agree that fork/join might be less confusing.

This version is the great renaming.  I've added fork & join internal fns.  In 
the PTX backend I've added 4 new unspecs:

fork -- the final single mode insn
forked -- the first partitioned mode insn
joining -- the last partitioned mode insn
join -- the first single mode insn

Not all partitionings need all four markers.  I've renamed the loop data 
structures to 'parallel' and similar, because that's actually what they are 
representing -- parallel regions.  The fact those regions contain loops is 
irrelevant to the task at hand.



nathan


[-- Attachment #2: rtl-08072015-1.diff --]
[-- Type: text/plain, Size: 91901 bytes --]

2015-07-08  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_MODES, GOACC_FORK, GOACC_JOIN): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_FORK, IFN_GOACC_JOIN.
	(expand_GOACC_MODES, expand_GOACC_FORK, expand_GOACC_JOIN): New.
	* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
	(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
	(expand_omp_for_static_chunk): Likewise.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_MODES, UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING,
	UNSPECV_JOIN, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_modes, nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join):
	New insns.
	(oacc_fork, oacc_join): New expanders.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(struct parallel): New structs.
	(parallel::parallel, parallel::~parallel): Ctor & dtor.
	(bb_insn_map_t): New map.
	(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
	(nvptx_split_blocks, nvptx_discover_pre): New.
	(bb_par_t, bb_par_vec_t): New tuple & vector of.
	(nvptx_dump_pars, nvptx_discover_pars): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars): New.
	(nvptx_neuter_pars): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
	(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand):  Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@ struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@ static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,54 @@ maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_fork (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call_internal
+      (IFN_GOACC_MODES, 1, 
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_FORK, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_join (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_JOIN, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6808,6 @@ expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6800,6 +6816,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_FORK
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -6827,6 +6844,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_JOIN
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -6868,10 +6890,6 @@ expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6911,15 @@ expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -6951,6 +6978,7 @@ expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+	
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7134,18 +7162,19 @@ expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+    
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7220,6 +7249,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_FORK
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7230,6 +7260,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7248,6 +7279,7 @@ find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_JOIN
 */
 
 static void
@@ -7281,10 +7313,6 @@ expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7296,7 +7324,7 @@ expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
-
+  
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7318,6 +7346,14 @@ expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,18 +7612,20 @@ expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9158,20 +9196,6 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9971,6 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10114,6 @@ build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10132,6 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11700,7 @@ lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11986,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12702,7 @@ make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@ enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,6 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_MODES, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,20 @@ init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_FORK: return true;
+    case IFN_GOACC_JOIN: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2004,36 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+static void
+expand_GOACC_MODES (gcall *stmt)
+{
+  rtx mask = expand_normal (gimple_call_arg (stmt, 0));
+
+#ifdef HAVE_oacc_modes
+  emit_insn (gen_oacc_modes (mask));
+#endif
+}
+
+static void
+expand_GOACC_FORK (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+
+#ifdef HAVE_oacc_fork
+  emit_insn (gen_oacc_fork (mode));
+#endif
+}
+
+static void
+expand_GOACC_JOIN (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+
+#ifdef HAVE_oacc_join
+  emit_insn (gen_oacc_join (mode));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@ bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique internal function call, we
+	 cannot thread through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,9 @@
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +101,16 @@ static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.  It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +138,8 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1069,7 @@ nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 \f
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1083,210 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc fork & join primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (rtx mode)
+{
+  /* Emit fork for worker level.  */
+  if (UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_fork (mode));
+}
+
+void
+nvptx_expand_oacc_join (rtx mode)
+{
+  /* Emit joining for all pars.  */
+  emit_insn (gen_nvptx_joining (mode));
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    case HImode:
+    case QImode:
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of a spill/fill, and broadcast-loop setup/teardown flags.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1868,6 @@ nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1881,7 @@ condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1943,10 @@ nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+      
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2103,7 @@ get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2117,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2132,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2185,757 @@ nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* Loop structure of the function.  The entire function is described
+   as a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mode of the parallel.  */
+  unsigned mode;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of the parallel's forked and join blocks.  The forked
+     block is the first block in the parallel and the join block is
+     the first block after the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
 
-  while (!warp_reg_worklist.is_empty ())
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks so that each forked and join unspec is at the
+   start of its basic block.  Thus afterwards each block will have a
+   single partitioning mode.  We also do the same for return insns,
+   as they are executed by every thread.  Return the partitioning
+   mode of the function as a whole.  Populate MAP with head and tail
+   blocks.  We also clear the BB visited flag, which is used when
+   finding partitions.  */
+
+static unsigned
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned modes = ~0U; // Assume the worst WRT required neutering
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by the parallel locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
+	  if (!INSN_P (insn))
 	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
-	    continue;
-
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  switch (recog_memoized (insn))
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_modes:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, the arguments should
+		   all agree.  */
+		gcc_assert (modes == ~0U || l == modes);
+		modes = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+
+	  if (seen_insn)
+	    /* We've found an instruction that must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+
+  return modes;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/*  Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, par->mode,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+typedef std::pair<basic_block, parallel *> bb_par_t;
+typedef auto_vec<bb_par_t> bb_par_vec_t;
+
+/* Walk the CFG looking for fork & join markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  parallel *outer_par = new parallel (0, OACC_null);
+  bb_par_vec_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (bb_par_t (block, outer_par));
+
+  while (worklist.length ())
+    {
+      bb_par_t bb_par = worklist.pop ();
+      parallel *l = bb_par.second;
+
+      block = bb_par.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	      /* Loop head, create a new inner loop and add it into
+		 our parent's child list.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		
+		l = new parallel (l, mode);
+		l->forked_block = block;
+		l->forked_insn = end;
+		if (mode == OACC_worker)
+		  l->fork_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	      }
+	      break;
+
+	    case CODE_FOR_nvptx_join:
+	      /* A loop tail.  Finish the current loop and return to
+		 parent.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+		gcc_assert (l->mode == mode);
+		l->join_block = block;
+		l->join_insn = end;
+		if (mode == OACC_worker)
+		  l->joining_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+		l = l->parent;
+	      }
+	      break;
+
+	    default:
+	      gcc_unreachable ();
 	    }
 	}
+
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	worklist.safe_push (bb_par_t (e->dest, l));
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (outer_par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill state that is live at the start of BLOCK.  PRE_P
+   indicates whether this is just before partitioned mode (do spill),
+   or just after it starts (do fill).  The sequence is inserted just
+   after INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block
+   and TO is the outgoing block.  These may be the same block.
+   Insert at the start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an
+   entire loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the from block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the to block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default: break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	case CODE_FOR_oacc_modes:
+	  return;
+	}
+
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
 
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (par->mode);
+  
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+  
+  switch (par->mode)
+    {
+    case OACC_null:
+      /* Dummy parallel.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (par->forked_block, par->forked_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, par->forked_block,
+			  par->forked_insn);
+	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, par->forked_insn);
+	nvptx_wsync (true, par->joining_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES are the partitioning of the execution and OUTER is
+   the partitioning of the parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (par->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned  skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & OACC_LOOP_MASK (mode)))
+	{ /* Mode  is not used: nothing to do.  */ }
+      else if (par->inner_mask & OACC_LOOP_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not partitioned
+	   at all: neuter individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a parallel, or already contains this
+	   partitioning: skip this parallel at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fixup subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2944,34 @@ nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+  unsigned modes = nvptx_split_blocks (&bb_insn_map);
+
+  /* Compute live regs.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  parallel *pars = nvptx_discover_pars (&bb_insn_map);
+
+  nvptx_process_pars (pars);
+  nvptx_neuter_pars (pars, modes, 0);
 
+  delete pars;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3020,24 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.   */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
 }
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3077,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@ struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (rtx);
+extern void nvptx_expand_oacc_join (rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,26 @@
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_MODES
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +264,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +826,7 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +835,34 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1366,99 @@
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_modes"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_MODES)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// modes %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
+  ""
+  "// fork %0;"
+)
+
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+{
+  nvptx_expand_oacc_fork (operands[0]);
 })
 
+(define_expand "oacc_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+{
+  nvptx_expand_oacc_join (operands[0]);
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1564,8 @@
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-08 14:48             ` Nathan Sidwell
@ 2015-07-08 14:58               ` Jakub Jelinek
  2015-07-08 21:46                 ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Jakub Jelinek @ 2015-07-08 14:58 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches

On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:
> +/* Generate loop head markers in outer->inner order.  */
> +
> +static void
> +gen_oacc_fork (gimple_seq *seq, unsigned mask)
> +{
> +  {
> +    // TODDO: Determine this information from the parallel region itself

TODO ?

> +    // and emit it once in the offload function.  Currently the target
> +    // geometry definition is being extracted early.  For now inform
> +    // the backend we're using all axes of parallelism, which is a
> +    // safe default.
> +    gcall *call = gimple_build_call_internal
> +      (IFN_GOACC_MODES, 1, 
> +       build_int_cst (unsigned_type_node,
> +		      OACC_LOOP_MASK (OACC_gang)
> +		      | OACC_LOOP_MASK (OACC_vector)
> +		      | OACC_LOOP_MASK (OACC_worker)));

The formatting is too ugly.  I'd say you just want

    tree arg = build_int_cst (unsigned_type_node,
			      OACC_LOOP_MASK (OACC_gang)
			      | OACC_LOOP_MASK (OACC_vector)
			      | OACC_LOOP_MASK (OACC_worker));
    gcall *call = gimple_build_call_internal (IFN_GOACC_MODES, 1, arg);

> +  for (level = OACC_gang; level != OACC_HWM; level++)
> +    if (mask & OACC_LOOP_MASK (level))
> +      {
> +	tree arg = build_int_cst (unsigned_type_node, level);
> +	gcall *call = gimple_build_call_internal
> +	  (IFN_GOACC_FORK, 1, arg);

Why the line-break?  That should fit into 80 columns just fine.

> +	gimple_seq_add_stmt (seq, call);
> +      }
> +}
> +
> +/* Generate loop tail markers in inner->outer order.  */
> +
> +static void
> +gen_oacc_join (gimple_seq *seq, unsigned mask)
> +{
> +  unsigned level;
> +
> +  for (level = OACC_HWM; level-- != OACC_gang; )
> +    if (mask & OACC_LOOP_MASK (level))
> +      {
> +	tree arg = build_int_cst (unsigned_type_node, level);
> +	gcall *call = gimple_build_call_internal
> +	  (IFN_GOACC_JOIN, 1, arg);
> +	gimple_seq_add_stmt (seq, call);
> +      }
> +}
>  
>  /* Find the mapping for DECL in CTX or the immediately enclosing
>     context that has a mapping for DECL.
> @@ -6777,21 +6808,6 @@ expand_omp_for_generic (struct omp_regio
>      }
>  }
>  
> -
> -/* True if a barrier is needed after a loop partitioned over
> -   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
> -   that a (conceptual) barrier is needed after worker and vector-partitioned
> -   loops, but not after gang-partitioned loops.  Currently we are relying on
> -   warp reconvergence to synchronise threads within a warp after vector loops,
> -   so an explicit barrier is not helpful after those.  */
> -
> -static bool
> -oacc_loop_needs_threadbarrier_p (int gwv_bits)
> -{
> -  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
> -    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
> -}
> -
>  /* A subroutine of expand_omp_for.  Generate code for a parallel
>     loop with static schedule and no specified chunk size.  Given
>     parameters:
> @@ -6800,6 +6816,7 @@ oacc_loop_needs_threadbarrier_p (int gwv
>  
>     where COND is "<" or ">", we generate pseudocode
>  
> +  OACC_FORK
>  	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>  	if (cond is <)
>  	  adj = STEP - 1;
> @@ -6827,6 +6844,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
>  	V += STEP;
>  	if (V cond e) goto L1;
>      L2:
> + OACC_JOIN
> +
> + It'd be better to place the OACC_LOOP markers just inside the outer
> + conditional, so they can be entirely eliminated if the loop is
> + unreachable.

Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
confusing.  The expand_omp_for_static_nochunk routine is used for
#pragma omp for schedule(static), #pragma omp distribute etc. which
certainly don't want to emit such markers in there.  So perhaps mention
somewhere that you wrap all the above sequence in between
OACC_FORK/OACC_JOIN markers.

> @@ -7220,6 +7249,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  
>     where COND is "<" or ">", we generate pseudocode
>  
> +OACC_FORK
>  	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
>  	if (cond is <)
>  	  adj = STEP - 1;
> @@ -7230,6 +7260,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  	else
>  	  n = (adj + N2 - N1) / STEP;
>  	trip = 0;
> +
>  	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
>  					      here so that V is defined
>  					      if the loop is not entered
> @@ -7248,6 +7279,7 @@ find_phi_with_arg_on_edge (tree arg, edg
>  	trip += 1;
>  	goto L0;
>      L4:
> +OACC_JOIN
>  */

Likewise.
>  
>  static void
> @@ -7281,10 +7313,6 @@ expand_omp_for_static_chunk (struct omp_
>    gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
>    fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
>  
> -  /* Broadcast variables to OpenACC threads.  */
> -  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
> -  region->entry = entry_bb;
> -
>    gcc_assert (broken_loop
>  	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
>    seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
> @@ -7296,7 +7324,7 @@ expand_omp_for_static_chunk (struct omp_
>        trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
>      }
>    exit_bb = region->exit;
> -
> +  

Please avoid such whitespace changes.

In any case, as it is a gomp-4_0-branch patch, I'll defer full review to the
branch maintainers.

	Jakub

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-08 14:58               ` Jakub Jelinek
@ 2015-07-08 21:46                 ` Nathan Sidwell
  2015-07-10  0:25                   ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-08 21:46 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

On 07/08/15 10:58, Jakub Jelinek wrote:
> On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:
>> +/* Generate loop head markers in outer->inner order.  */
>> +
>> +static void
>> +gen_oacc_fork (gimple_seq *seq, unsigned mask)
>> +{
>> +  {
>> +    // TODDO: Determine this information from the parallel region itself
>
> TODO ?

I want to clean this up with the offloading launch API.  As it happens, I did 
manage to have the PTX backend DTRT if it doesn't encounter this internal fn. 
I'm dropping it from this patch (it'd undoubtedly need moving around anyway).

>> +	gcall *call = gimple_build_call_internal
>> +	  (IFN_GOACC_FORK, 1, arg);
>
> Why the line-break?  That should fit into 80 columns just fine.

Oh, it does now that I've changed the name of the internal function.

>> + It'd be better to place the OACC_LOOP markers just inside the outer
>> + conditional, so they can be entirely eliminated if the loop is
>> + unreachable.
>
> Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
> confusing.  The expand_omp_for_static_nochunk routine is used for
> #pragma omp for schedule(static), #pragma omp distribute etc. which
> certainly don't want to emit such markers in there.  So perhaps mention
> somewhere that you wrap all the above sequence in between
> OACC_FORK/OACC_JOIN markers.

Done. (at both sites)

> Please avoid such whitespace changes.

Fixed (& searched for others).

> In any case, as it is a gomp-4_0-branch patch, I'll defer full review to the
> branch maintainers.

Thanks for your review!

nathan

[-- Attachment #2: rtl-08072015-2.diff --]
[-- Type: text/plain, Size: 89029 bytes --]

2015-07-08  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_FORK, GOACC_JOIN): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_FORK, IFN_GOACC_JOIN.
	(expand_GOACC_FORK, expand_GOACC_JOIN): New.
	* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
	(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
	(expand_omp_for_static_chunk): Likewise.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING, UNSPECV_JOIN,
	UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New insns.
	(oacc_fork, oacc_join): New expands.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand): Change 'U' specifier to look at operand
	itself.
	(struct parallel): New structs.
	(parallel::parallel, parallel::~parallel): Ctor & dtor.
	(bb_insn_map_t): New map.
	(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
	(nvptx_split_blocks, nvptx_discover_pre): New.
	(bb_par_t, bb_par_vec_t): New tuple & vector of.
	(nvptx_dump_pars, nvptx_discover_pars): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars): New.
	(nvptx_neuter_pars): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
	(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARP_BCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcastdi): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand): Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	arp_equal_pseudos.

Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@ bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+      
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@ struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@ static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,37 @@ maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_fork (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal (IFN_GOACC_FORK, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_join (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal (IFN_GOACC_JOIN, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6791,6 @@ expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6827,6 +6826,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+
+ For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair.
+ Currently we wrap the whole sequence, but it'd be better to place the
+ markers just inside the outer conditional, so they can be entirely
+ eliminated if the loop is unreachable.
 */
 
 static void
@@ -6868,10 +6872,6 @@ expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6893,15 @@ expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7134,17 +7143,17 @@ expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -7248,6 +7257,11 @@ find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+
+ For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair.
+ Currently we wrap the whole sequence, but it'd be better to place the
+ markers just inside the outer conditional, so they can be entirely
+ eliminated if the loop is unreachable.
 */
 
 static void
@@ -7281,10 +7295,6 @@ expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7318,6 +7328,14 @@ expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,17 +7594,18 @@ expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -9158,20 +9177,6 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9952,6 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10095,6 @@ build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10113,6 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11681,7 @@ lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11967,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12683,7 @@ make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@ enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
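The `OACC_LOOP_MASK` macro added above turns each `oacc_loop_levels` value into a single bit, so the gang/worker/vector axes a loop is partitioned over can be combined into one `gwv` bitmask (as the `gwv_this` fields in the removed omp-low.c code did). A minimal sketch of that composition; `gwv_mask` is a hypothetical helper for illustration, not part of the patch:

```c
#include <assert.h>

/* Mirror of the patch's oacc_loop_levels enum and OACC_LOOP_MASK macro:
   each parallelism level occupies one bit of the mask.  */
enum oacc_loop_levels { OACC_gang, OACC_worker, OACC_vector, OACC_HWM };
#define OACC_LOOP_MASK(X) (1 << (X))

/* Hypothetical helper: build the combined gwv bitmask for a loop
   partitioned over the given axes.  */
static int
gwv_mask (int gang, int worker, int vector)
{
  int mask = 0;
  if (gang)
    mask |= OACC_LOOP_MASK (OACC_gang);
  if (worker)
    mask |= OACC_LOOP_MASK (OACC_worker);
  if (vector)
    mask |= OACC_LOOP_MASK (OACC_vector);
  return mask;
}
```

Tests such as `(mask & OACC_LOOP_MASK (OACC_worker)) != 0` then ask whether one particular axis applies, which is what the neutering/broadcast code keys on.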
 
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
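The tail-merge hunk above adds a pointer-identity fast path: when both iterators have reached the very same statement, the per-field checks (code, call target) are skipped, since identity trivially implies equality. A minimal sketch of the shape of that change, assuming a simplified stand-in statement type (`struct stmt` and `stmts_equal` are hypothetical, not GCC's types):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a statement: a code plus an optional
   call target.  */
struct stmt
{
  int code;
  const char *target;
};

/* Compare two statements for tail-merging purposes, with the same
   identity fast path as the patched same_succ_def::equal.  */
static int
stmts_equal (const struct stmt *s1, const struct stmt *s2)
{
  if (s1 != s2)		/* Only compare fields for distinct objects.  */
    {
      if (s1->code != s2->code)
	return 0;
      if (s1->target && s2->target
	  && strcmp (s1->target, s2->target) != 0)
	return 0;
    }
  return 1;
}
```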
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,5 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,25 @@
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +263,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +825,7 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +834,34 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1365,92 @@
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// fork %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+{
+  nvptx_expand_oacc_fork (operands[0]);
 })
 
+(define_expand "oacc_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+{
+  nvptx_expand_oacc_join (operands[0]);
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64-bit object into two 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack two 32-bit ints into a 64-bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1556,8 @@
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
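As the nvptx.md comment above notes, only 32-bit shuffles exist, so wider (DImode/DFmode) broadcasts are done by unpacking into two 32-bit halves, shuffling each, and packing the result back (the `unpack<mode>si2`/`packsi<mode>2` patterns, used by `nvptx_gen_vcast` in the nvptx.c hunk below). A minimal host-side sketch of that split/recombine, where `shfl_idx0` is a hypothetical stand-in for the `shfl.idx.b32` warp shuffle:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit value into two 32-bit halves, as unpack<mode>si2
   does with "mov.b64 {%0,%1}, %2".  */
static void
unpack64 (uint64_t v, uint32_t *lo, uint32_t *hi)
{
  *lo = (uint32_t) v;
  *hi = (uint32_t) (v >> 32);
}

/* Recombine the halves, as packsi<mode>2 does with
   "mov.b64 %0, {%1,%2}".  */
static uint64_t
pack64 (uint32_t lo, uint32_t hi)
{
  return ((uint64_t) hi << 32) | lo;
}

/* Hypothetical stand-in for the 32-bit warp shuffle; on the device
   every lane would receive lane 0's value.  */
static uint32_t
shfl_idx0 (uint32_t v)
{
  return v;
}

/* Broadcast a 64-bit value: unpack, shuffle each half, pack.  */
static uint64_t
vcast64 (uint64_t v)
{
  uint32_t lo, hi;
  unpack64 (v, &lo, &hi);
  return pack64 (shfl_idx0 (lo), shfl_idx0 (hi));
}
```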
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,9 @@
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +101,16 @@ static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +138,8 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1066,6 +1082,210 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc fork & join primitives into PTX-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (rtx mode)
+{
+  /* Emit fork for worker level.  */
+  if (UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_fork (mode));
+}
+
+void
+nvptx_expand_oacc_join (rtx mode)
+{
+  /* Emit joining for all pars.  */
+  emit_insn (gen_nvptx_joining (mode));
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1880,7 @@ condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,7 +1942,7 @@ nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
 
@@ -1900,7 +2102,7 @@ get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2184,742 @@ nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* Loop structure of the function.  The entire function is described
+   as a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mode of the parallel.  */
+  unsigned mode;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked block is the
+     first block in the parallel and the join block is the first
+     block after the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
+
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns.  */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks so that each forked and join unspec is at the
+   start of its basic block.  Thus afterwards each block will have a
+   single partitioning mode.  We also do the same for return insns,
+   as they are executed by every thread.  Populate MAP with the head
+   and tail blocks so found.  We also clear the BB visited flag,
+   which is used when finding partitions.  */
+
+static void
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
 
-  while (!warp_reg_worklist.is_empty ())
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
     {
-      int regno = warp_reg_worklist.pop ();
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by the parallel locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
-	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+	  if (!INSN_P (insn))
 	    continue;
-
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  switch (recog_memoized (insn))
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+
+	  if (seen_insn)
+	    /* We've found an instruction that must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/*  Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, par->mode,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+typedef std::pair<basic_block, parallel *> bb_par_t;
+typedef auto_vec<bb_par_t> bb_par_vec_t;
+
+/* Walk the CFG looking for fork & join markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  parallel *outer_par = new parallel (0, OACC_null);
+  bb_par_vec_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (bb_par_t (block, outer_par));
+
+  while (worklist.length ())
+    {
+      bb_par_t bb_par = worklist.pop ();
+      parallel *l = bb_par.second;
+
+      block = bb_par.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	      /* Loop head, create a new inner loop and add it into
+		 our parent's child list.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		
+		l = new parallel (l, mode);
+		l->forked_block = block;
+		l->forked_insn = end;
+		if (mode == OACC_worker)
+		  l->fork_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	      }
+	      break;
+
+	    case CODE_FOR_nvptx_join:
+	      /* A loop tail.  Finish the current loop and return to
+		 parent.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+		gcc_assert (l->mode == mode);
+		l->join_block = block;
+		l->join_insn = end;
+		if (mode == OACC_worker)
+		  l->joining_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+		l = l->parent;
+	      }
+	      break;
+
+	    default:
+	      gcc_unreachable ();
 	    }
 	}
+
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	worklist.safe_push (bb_par_t (e->dest, l));
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (outer_par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
+}
+
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.   */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
 
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize the pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill the state that is live at the start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill).  The sequence is inserted just
+   after INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an
+   entire loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default:
+	  break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	  return;
+	}
+
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
+
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (par->mode);
+  
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+  
+  switch (par->mode)
+    {
+    case OACC_null:
+      /* Dummy parallel.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (par->forked_block, par->forked_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, par->forked_block,
+			  par->forked_insn);
+	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, par->forked_insn);
+	nvptx_wsync (true, par->joining_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES is the mask of partitioning modes used by the
+   execution, and OUTER is the partitioning of the parallels we are
+   contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (par->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned  skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & OACC_LOOP_MASK (mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (par->inner_mask & OACC_LOOP_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not a partitioned
+	   parallel at all: neuter individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a partitioned parallel, or already contains
+	   this partitioning: skip the parallel at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fix up subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2928,36 @@ nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+
+  nvptx_split_blocks (&bb_insn_map);
+
+  /* Compute live registers.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  parallel *pars = nvptx_discover_pars (&bb_insn_map);
+
+  nvptx_process_pars (pars);
+  nvptx_neuter_pars (pars, (OACC_LOOP_MASK (OACC_vector)
+			    | OACC_LOOP_MASK (OACC_worker)), 0);
 
+  delete pars;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3006,24 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.   */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
 }
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3063,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@ struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (rtx);
+extern void nvptx_expand_oacc_join (rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,20 @@ init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_FORK: return true;
+    case IFN_GOACC_JOIN: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2004,26 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+static void
+expand_GOACC_FORK (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_fork
+  emit_insn (gen_oacc_fork (mode));
+#endif
+}
+
+static void
+expand_GOACC_JOIN (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_join
+  emit_insn (gen_oacc_join (mode));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-08 21:46                 ` Nathan Sidwell
@ 2015-07-10  0:25                   ` Nathan Sidwell
  2015-07-10  9:04                     ` Thomas Schwinge
                                       ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-10  0:25 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 227 bytes --]

This is the patch I committed.  Bernd pointed out that I didn't need to be so 
coy about the branches in the middle of blocks at that point of the compilation 
anyway.  So we remove a couple of unneeded insn patterns.

nathan


[-- Attachment #2: rtl-09072015-2.diff --]
[-- Type: text/plain, Size: 88726 bytes --]

2015-07-09  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_FORK, GOACC_JOIN): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_FORK, IFN_GOACC_JOIN.
	(expand_GOACC_FORK, expand_GOACC_JOIN): New.
	* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
	(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
	(expand_omp_for_static_chunk): Likewise.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING, UNSPECV_JOIN): New.
	(BITS, BITD): New mode iterators.
	(br_true_uni, br_false_uni): New unified branches.
	(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New insns.
	(oacc_fork, oacc_join): New expanders.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(struct parallel): New structs.
	(parallel::parallel, parallel::~parallel): Ctor & dtor.
	(bb_insn_map_t): New map.
	(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
	(nvptx_split_blocks, nvptx_discover_pre): New.
	(bb_par_t, bb_par_vec_t): New tuple & vector of.
	(nvptx_dump_pars, nvptx_discover_pars): New.
	(nvptx_propagate): New.
	(vprop_gen, nvptx_vpropagate): New.
	(wprop_gen, nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars): New.
	(nvptx_neuter_pars): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
	(nvptx_cannot_copy_insn_p): Check for broadcast, sync, fork & join insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.c (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand): Remove 'U' specifier.
	(nvptx_reorg_subreg): Remove unidirectionality checking.
	(nvptx_cannot_copy_insn_p): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,20 @@ init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_FORK: return true;
+    case IFN_GOACC_JOIN: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2004,26 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+static void
+expand_GOACC_FORK (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_fork
+  emit_insn (gen_oacc_fork (mode));
+#endif
+}
+
+static void
+expand_GOACC_JOIN (gcall *stmt)
+{
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_join
+  emit_insn (gen_oacc_join (mode));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype:
 
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@ struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_fork (rtx);
+extern void nvptx_expand_oacc_join (rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,24 @@
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_FORK
+   UNSPECV_FORKED
+   UNSPECV_JOINING
+   UNSPECV_JOIN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +262,8 @@
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +824,7 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +833,24 @@
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_true_uni"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_false_uni"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1354,92 @@
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "nvptx_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORK)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// fork %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_forked"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+  "// forked %0;"
+)
+
+(define_insn "nvptx_joining"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOINING)]
+  ""
+  "// joining %0;"
+)
+
+(define_insn "nvptx_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+  "// join %0;"
+)
+
+(define_expand "oacc_fork"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_FORKED)]
+  ""
+{
+  nvptx_expand_oacc_fork (operands[0]);
 })
 
+(define_expand "oacc_join"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_JOIN)]
+  ""
+{
+  nvptx_expand_oacc_join (operands[0]);
+})
+
+;; only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; extract parts of a 64 bit object into 2 32-bit ints
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; pack 2 32-bit ints into a 64 bit object
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1545,8 @@
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,9 @@
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +101,16 @@ static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.   It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +138,8 @@ nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1066,6 +1082,210 @@ nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc fork & join primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_fork (rtx mode)
+{
+  /* Emit fork for worker level.  */
+  if (UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_fork (mode));
+}
+
+void
+nvptx_expand_oacc_join (rtx mode)
+{
+  /* Emit joining for all pars.  */
+  emit_insn (gen_nvptx_joining (mode));
+}
+
+/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default:debug_rtx (reg);gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1676,9 +1879,7 @@ condition_unidirectional_p (rtx cond)
    f -- print a full reg even for something that must always be split
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
-   u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   u -- print a type opcode suffix without promotions.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1739,11 +1940,6 @@ nvptx_print_operand (FILE *file, rtx x,
       fprintf (file, "@!");
       goto common;
 
-    case 'U':
-      if (condition_unidirectional_p (x))
-	fprintf (file, ".uni");
-      break;
-
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2096,7 @@ get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2110,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2125,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2178,745 @@ nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
 
-  while (!warp_reg_worklist.is_empty ())
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+  
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mode of the parallel.  */
+  unsigned mode;
+
+  /* Partitioning used within inner parallels. */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked block is the
+     first block in the parallel and the join block is the first
+     block after the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
+
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns.  */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks such that each forked and join unspec is at
+   the start of its basic block.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Populate MAP with
+   the head and tail blocks that result.  We also clear the BB
+   visited flag, which is used later when finding the partition
+   structure.  */
+
+static void
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by the parallel locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
+	  if (!INSN_P (insn))
 	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
-	    continue;
-
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  switch (recog_memoized (insn))
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_nvptx_forked:
+	    case CODE_FOR_nvptx_join:
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+
+	  if (seen_insn)
+	    /* We've found an instruction that must be at the start of
+	       a block, but isn't.  Add it to the worklist.  */
+	    worklist.safe_push (insn_bb_t (insn, block));
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+	{
+	  block = elt->second;
+	  remap = block;
+	}
+      
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/*  Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, par->mode,
+	   par->forked_block ? par->forked_block->index : -1,
+	   par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+typedef std::pair<basic_block, parallel *> bb_par_t;
+typedef auto_vec<bb_par_t> bb_par_vec_t;
+
+/* Walk the CFG looking for fork & join markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  parallel *outer_par = new parallel (0, OACC_null);
+  bb_par_vec_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (bb_par_t (block, outer_par));
+
+  while (worklist.length ())
+    {
+      bb_par_t bb_par = worklist.pop ();
+      parallel *l = bb_par.second;
+
+      block = bb_par.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_forked:
+	      /* Loop head, create a new inner loop and add it into
+		 our parent's child list.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		
+		l = new parallel (l, mode);
+		l->forked_block = block;
+		l->forked_insn = end;
+		if (mode == OACC_worker)
+		  l->fork_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+	      }
+	      break;
+
+	    case CODE_FOR_nvptx_join:
+	      /* A loop tail.  Finish the current loop and return to
+		 parent.  */
+	      {
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+		gcc_assert (l->mode == mode);
+		l->join_block = block;
+		l->join_insn = end;
+		if (mode == OACC_worker)
+		  l->joining_insn
+		    = nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+		l = l->parent;
+	      }
+	      break;
+
+	    default:
+	      gcc_unreachable ();
 	    }
 	}
+
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	worklist.safe_push (bb_par_t (e->dest, l));
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (outer_par, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop.  */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow the worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_uni (pred, label));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.  */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill state that is live at the start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill).  The sequence is inserted just after
+   INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block.  Insert at
+   start of FROM:
+   
+     if (tid.<axis>) goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find the first insn of the FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find the last insn of the TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default: break;
+	case CODE_FOR_nvptx_fork:
+	case CODE_FOR_nvptx_forked:
+	case CODE_FOR_nvptx_joining:
+	case CODE_FOR_nvptx_join:
+	  return;
+	}
 
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	rtx br;
+	if (mode == OACC_vector)
+	  br = gen_br_true (pred, label);
+	else
+	  br = gen_br_true_uni (pred, label);
+	emit_insn_before (br, head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
+
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (par->mode);
+  
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+  
+  switch (par->mode)
+    {
+    case OACC_null:
+      /* Dummy parallel.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (par->forked_block, par->forked_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, par->forked_block,
+			  par->forked_insn);
+	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, par->forked_insn);
+	nvptx_wsync (true, par->joining_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES are the partitioning of the execution and OUTER is
+   the partitioning of the parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (par->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned skip_mask = 0, neuter_mask = 0;
+  
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & OACC_LOOP_MASK (mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (par->inner_mask & OACC_LOOP_MASK (mode)
+	       || !par->forked_insn)
+	/* Partitioned in inner parallels, or we're not partitioned
+	   at all: neuter individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+	       || par->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a parallel, or already contains this
+	   partitioning: skip the parallel at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+      nvptx_skip_par (skip_mask, par);
+  
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fixup subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2925,36 @@ nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+
+  nvptx_split_blocks (&bb_insn_map);
+
+  /* Compute live regs.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  parallel *pars = nvptx_discover_pars (&bb_insn_map);
+
+  nvptx_process_pars (pars);
+  nvptx_neuter_pars (pars, (OACC_LOOP_MASK (OACC_vector)
+			    | OACC_LOOP_MASK (OACC_worker)), 0);
 
+  delete pars;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3003,24 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.   */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
 }
 \f
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3060,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 \f
 #undef TARGET_OPTION_OVERRIDE
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@ bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+      
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@ struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@ static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,37 @@ maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_fork (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal (IFN_GOACC_FORK, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_join (gimple_seq *seq, unsigned mask)
+{
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal (IFN_GOACC_JOIN, 1, arg);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6791,6 @@ expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6827,6 +6826,11 @@ oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+
+ For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair.
+ Currently we wrap the whole sequence, but it'd be better to place the
+ markers just inside the outer conditional, so they can be entirely
+ eliminated if the loop is unreachable.
 */
 
 static void
@@ -6868,10 +6872,6 @@ expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6893,15 @@ expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7134,17 +7143,17 @@ expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -7248,6 +7257,11 @@ find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+
+ For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair.
+ Currently we wrap the whole sequence, but it'd be better to place the
+ markers just inside the outer conditional, so they can be entirely
+ eliminated if the loop is unreachable.
 */
 
 static void
@@ -7281,10 +7295,6 @@ expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7318,6 +7328,14 @@ expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_fork (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,17 +7594,18 @@ expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_join (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
   gsi_remove (&gsi, true);
 
@@ -9158,20 +9177,6 @@ expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9952,6 @@ find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10095,6 @@ build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10113,6 @@ execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11681,7 @@ lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11967,6 @@ lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12683,7 @@ make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@ along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@ enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,5 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".")
+DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".")

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-10  0:25                   ` Nathan Sidwell
@ 2015-07-10  9:04                     ` Thomas Schwinge
  2015-07-11 19:25                       ` [gomp4] Revert "Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage" (was: fix df verify failure) Thomas Schwinge
  2015-07-13 11:26                       ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
  2015-07-11 21:18                     ` [gomp4] Resolve bootstrap failure in expand_GOACC_FORK, expand_GOACC_JOIN (was: Move openacc vector& worker single handling to RTL) Thomas Schwinge
                                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-10  9:04 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 14660 bytes --]

Hi!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> This is the patch I committed.

:-) Whee!

From testing this, two things:

1. Can you please have a look at the following ICE?  I suppose you can
reproduce this in your non-checking build by just unconditionally
enabling that df_verify call?  Committed to gomp-4_0-branch in r225656:

commit 1aff96b721921f621642c0fab95359453bc01beb
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jul 10 09:01:55 2015 +0000

    Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage
    
    ... introduced in r225647.
    
        checking whether the GNU Fortran compiler is working... no
        configure: error: GNU Fortran is not working; please report a bug in http://gcc.gnu.org/bugzilla, attaching /home/thomas/tmp/source/gcc/openacc/openacc-gomp-4_0-branch-work_/build-gcc-accel-nvptx/nvptx-none/libgfortran/config.log
        make[1]: *** [configure-target-libgfortran] Error 1
    
        configure:4192: [...]/build-gcc-accel-nvptx/./gcc/xgcc -B[...]/build-gcc-accel-nvptx/./gcc/ -nostdinc -B[...]/build-gcc-accel-nvptx/nvptx-none/newlib/ -isystem [...]/build-gcc-accel-nvptx/nvptx-none/newlib/targ-include -isystem [...]/source-gcc/newlib/libc/include -B/nvptx-none/bin/ -B/nvptx-none/lib/ -isystem /nvptx-none/include -isystem /nvptx-none/sys-include --sysroot=[...]/install/nvptx-none   -c -g  conftest.c >&5
        conftest.c: In function 'main':
        conftest.c:16:1: internal compiler error: in df_live_verify_transfer_functions, at df-problems.c:1849
         }
         ^
        0x6d3d8e df_live_verify_transfer_functions()
                [...]/source-gcc/gcc/df-problems.c:1848
        0x6cb83a df_analyze_1
                [...]/source-gcc/gcc/df-core.c:1241
        0xd909a0 nvptx_reorg
                [...]/source-gcc/gcc/config/nvptx/nvptx.c:2946
        0xa50829 execute
                [...]/source-gcc/gcc/reorg.c:4034
        Please submit a full bug report,
        with preprocessed source if appropriate.
        Please include the complete backtrace with any bug report.
        See <http://gcc.gnu.org/bugs.html> for instructions.
        configure:4192: $? = 1
        configure: failed program was:
        | /* confdefs.h */
        | #define PACKAGE_NAME "GNU Fortran Runtime Library"
        | #define PACKAGE_TARNAME "libgfortran"
        | #define PACKAGE_VERSION "0.3"
        | #define PACKAGE_STRING "GNU Fortran Runtime Library 0.3"
        | #define PACKAGE_BUGREPORT ""
        | #define PACKAGE_URL "http://www.gnu.org/software/libgfortran/"
        | /* end confdefs.h.  */
        |
        | int
        | main ()
        | {
        |
        |   ;
        |   return 0;
        | }
    
    Reproduce:
    
        $ echo 'static void foo(void) {}' | build-gcc-accel-nvptx/gcc/xgcc -Bbuild-gcc-accel-nvptx/gcc/ -S -x c -
        <stdin>: In function 'foo':
        <stdin>:1:1: internal compiler error: in df_live_verify_transfer_functions, at df-problems.c:1849
        0x6d3d8e df_live_verify_transfer_functions()
                [...]/source-gcc/gcc/df-problems.c:1848
        0x6cb83a df_analyze_1
                [...]/source-gcc/gcc/df-core.c:1241
        0xd909a0 nvptx_reorg
                [...]/source-gcc/gcc/config/nvptx/nvptx.c:2946
        0xa50829 execute
                [...]/source-gcc/gcc/reorg.c:4034
    
    Workaround:
    
    	gcc/
    	* df-core.c (df_analyze_1): Disable df_verify call.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225656 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp |    4 ++++
 gcc/df-core.c      |    2 ++
 2 files changed, 6 insertions(+)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index c71e396..535900c 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,7 @@
+2015-07-10  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* df-core.c (df_analyze_1): Disable df_verify call.
+
 2015-07-09  Nathan Sidwell  <nathan@codesourcery.com>
 
 	Infrastructure:
diff --git gcc/df-core.c gcc/df-core.c
index 67040a1..52cca8e 100644
--- gcc/df-core.c
+++ gcc/df-core.c
@@ -1235,10 +1235,12 @@ df_analyze_1 (void)
   if (dump_file)
     fprintf (dump_file, "df_analyze called\n");
 
+#if /* TODO */ 0
 #ifndef ENABLE_DF_CHECKING
   if (df->changeable_flags & DF_VERIFY_SCHEDULED)
 #endif
     df_verify ();
+#endif
 
   /* Skip over the DF_SCAN problem. */
   for (i = 1; i < df->num_problems_defined; i++)


2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
remaining from the test cases that Julian had added in
<http://news.gmane.org/find-root.php?message_id=%3C20150617151515.087aa93e%40octopus%3E>.

Unfortunately, there's also one regression, but I'm seeing it only on
Nvidia K20 hardware, not on my laptop (but it may well be
hardware-dependent: according to a web search, CUDA error 716 translates
to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
do you have an idea where it's coming from?

Committed to gomp-4_0-branch in r225657:

commit bdecfaf444a5811e5ea2a942e7b98b160d737b7b
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jul 10 09:02:02 2015 +0000

    libgomp: XFAILs update
    
    ... after r225647.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c: Add
    	XFAIL.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c:
    	Remove XFAIL.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c:
    	Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225657 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                             |   25 ++++++++++++++++++++
 .../libgomp.oacc-c-c++-common/parallel-loop-1.c    |    1 +
 .../private-vars-local-worker-3.c                  |    2 --
 .../private-vars-local-worker-4.c                  |    2 --
 .../private-vars-local-worker-5.c                  |    2 --
 .../private-vars-loop-gang-2.c                     |    2 --
 .../private-vars-loop-gang-4.c                     |    3 ---
 .../private-vars-loop-gang-5.c                     |    2 --
 .../private-vars-loop-gang-6.c                     |    2 --
 .../private-vars-loop-worker-5.c                   |    3 ---
 .../private-vars-loop-worker-6.c                   |    2 --
 .../private-vars-loop-worker-7.c                   |    2 --
 12 files changed, 26 insertions(+), 22 deletions(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 1949d78..6d1c547 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,28 @@
+2015-07-10  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c: Add
+	XFAIL.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c:
+	Remove XFAIL.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c:
+	Likewise.
+
 2015-07-08  James Norris  <jnorris@codesourcery.com>
 
 	* oacc-parallel.c (GOACC_parallel GOACC_data_start): Handle Fortran
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c
index a1f974d..23a9b23 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c
@@ -1,4 +1,5 @@
 /* { dg-do run } */
+/* { dg-xfail-run-if "cuStreamSynchronize error: unknown result code:   716" { openacc_nvidia_accel_selected } } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c
index 6129523..1e67322 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-3.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared in a local scope, broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c
index 4cec00e..120001b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-4.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared in a local scope, broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c
index efc2206..f849f0c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-local-worker-5.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared in a local scope, broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c
index 9debf83..3898c0e 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-2.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of gang-private variables declared on loop directive, with broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c
index f0f0477..45714010 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-4.c
@@ -1,6 +1,3 @@
-/* { dg-xfail-if "TODO: ICE" { *-*-* } } */
-/* { dg-excess-errors "TODO" } */
-
 #include <assert.h>
 
 /* Test of gang-private addressable variable declared on loop directive, with
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
index b955303..b070773 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of gang-private array variable declared on loop directive, with
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c
index 0c17eaa..ec74292 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-6.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of gang-private aggregate variable declared on loop directive, with
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c
index 741795d..a28105c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c
@@ -1,6 +1,3 @@
-/* { dg-xfail-if "TODO: ICE" { *-*-* } } */
-/* { dg-excess-errors "TODO" } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared on a loop directive, broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c
index feba09e..5dde621 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared on a loop directive, broadcasting
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c
index 5469c5d..e4d4ccf 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c
@@ -1,5 +1,3 @@
-/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of worker-private variables declared on loop directive, broadcasting


Grüße,
 Thomas



* [gomp] fix df verify failure
@ 2015-07-10 22:04 Nathan Sidwell
  0 siblings, 0 replies; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-10 22:04 UTC (permalink / raw)
  To: GCC Patches, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 160 bytes --]

I've committed this patch to fix a df verify crash Thomas pointed me at. 
Thomas, I think this means you can revert the workaround you just committed?

nathan

[-- Attachment #2: vfy.diff --]
[-- Type: text/plain, Size: 923 bytes --]

2015-07-10  Nathan Sidwell  <nathan@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_reorg): Move df problem setting, set
	dirty flags.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225647)
+++ config/nvptx/nvptx.c	(working copy)
@@ -2923,16 +2923,16 @@ nvptx_reorg (void)
 
   thread_prologue_and_epilogue_insns ();
 
-  df_clear_flags (DF_LR_RUN_DCE);
-  df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
-  df_live_add_problem ();
-  
   /* Split blocks and record interesting unspecs.  */
   bb_insn_map_t bb_insn_map;
 
-    nvptx_split_blocks (&bb_insn_map);
+  nvptx_split_blocks (&bb_insn_map);
 
   /* Compute live regs */
+  df_clear_flags (DF_LR_RUN_DCE);
+  df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  df_live_set_all_dirty ();
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 


* [gomp4] Revert "Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage" (was: fix df verify failure)
  2015-07-10  9:04                     ` Thomas Schwinge
@ 2015-07-11 19:25                       ` Thomas Schwinge
  2015-07-13 11:26                       ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
  1 sibling, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-11 19:25 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1810 bytes --]

Hi!

On Fri, 10 Jul 2015 18:04:36 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> I've committed this patch to fix a df verify crash Thomas pointed me at. 

Thanks!

> Thomas, I think this means you can revert the workaround  you just committed?

Right.  Committed in r225714:

commit 687f194e535317024ca67c32b26bb277b6f266ae
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Sat Jul 11 19:21:48 2015 +0000

    Revert "Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage"
    
    This reverts r225656; problem got addressed in r225695.
    
    	gcc/
    	* df-core.c (df_analyze_1): Don't disable df_verify call.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225714 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp | 4 ++++
 gcc/df-core.c      | 2 --
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index baff20c..1f57a9d 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,7 @@
+2015-07-11  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* df-core.c (df_analyze_1): Don't disable df_verify call.
+
 2015-07-10  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* config/nvptx/nvptx.c (nvptx_reorg): Move df problem setting, set
diff --git gcc/df-core.c gcc/df-core.c
index 52cca8e..67040a1 100644
--- gcc/df-core.c
+++ gcc/df-core.c
@@ -1235,12 +1235,10 @@ df_analyze_1 (void)
   if (dump_file)
     fprintf (dump_file, "df_analyze called\n");
 
-#if /* TODO */ 0
 #ifndef ENABLE_DF_CHECKING
   if (df->changeable_flags & DF_VERIFY_SCHEDULED)
 #endif
     df_verify ();
-#endif
 
   /* Skip over the DF_SCAN problem. */
   for (i = 1; i < df->num_problems_defined; i++)


Regards,
 Thomas



* [gomp4] Resolve bootstrap failure in expand_GOACC_FORK, expand_GOACC_JOIN (was: Move openacc vector& worker single handling to RTL)
  2015-07-10  0:25                   ` Nathan Sidwell
  2015-07-10  9:04                     ` Thomas Schwinge
@ 2015-07-11 21:18                     ` Thomas Schwinge
  2015-07-14  8:26                     ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
                                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-11 21:18 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3088 bytes --]

Hi!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> This is the patch I committed.

> --- internal-fn.c	(revision 225323)
> +++ internal-fn.c	(working copy)

> +static void
> +expand_GOACC_FORK (gcall *stmt)
> +{
> +  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
> +  
> +#ifdef HAVE_oacc_fork
> +  emit_insn (gen_oacc_fork (mode));
> +#endif
> +}
> +
> +static void
> +expand_GOACC_JOIN (gcall *stmt)
> +{
> +  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
> +  
> +#ifdef HAVE_oacc_join
> +  emit_insn (gen_oacc_join (mode));
> +#endif
> +}

Committed in r225715:

commit f9d00ca614a8dc28f21ab4a16d7cdbbe16668ca3
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Sat Jul 11 21:17:46 2015 +0000

    Resolve bootstrap failure in expand_GOACC_FORK, expand_GOACC_JOIN
    
        [...]/source-gcc/gcc/internal-fn.c: In function 'void expand_GOACC_FORK(gcall*)':
        [...]/source-gcc/gcc/internal-fn.c:1970:7: error: unused variable 'mode' [-Werror=unused-variable]
           rtx mode = expand_normal (gimple_call_arg (stmt, 0));
               ^
        [...]/source-gcc/gcc/internal-fn.c: In function 'void expand_GOACC_JOIN(gcall*)':
        [...]/source-gcc/gcc/internal-fn.c:1980:7: error: unused variable 'mode' [-Werror=unused-variable]
           rtx mode = expand_normal (gimple_call_arg (stmt, 0));
               ^
    
    	gcc/
    	* internal-fn.c (expand_GOACC_FORK, expand_GOACC_JOIN)
    	[!HAVE_oacc_fork]: Keep quiet.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225715 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp |    3 +++
 gcc/internal-fn.c  |   12 ++++++------
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 1f57a9d..ea3ea6b 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,5 +1,8 @@
 2015-07-11  Thomas Schwinge  <thomas@codesourcery.com>
 
+	* internal-fn.c (expand_GOACC_FORK, expand_GOACC_JOIN)
+	[!HAVE_oacc_fork]: Keep quiet.
+
 	* df-core.c (df_analyze_1): Don't disable df_verify call.
 
 2015-07-10  Nathan Sidwell  <nathan@codesourcery.com>
diff --git gcc/internal-fn.c gcc/internal-fn.c
index e1c4c9a..b507208 100644
--- gcc/internal-fn.c
+++ gcc/internal-fn.c
@@ -2005,21 +2005,21 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *stmt ATTRIBUTE_UNUSED)
 }
 
 static void
-expand_GOACC_FORK (gcall *stmt)
+expand_GOACC_FORK (gcall *stmt ATTRIBUTE_UNUSED)
 {
-  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
-  
 #ifdef HAVE_oacc_fork
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
   emit_insn (gen_oacc_fork (mode));
 #endif
 }
 
 static void
-expand_GOACC_JOIN (gcall *stmt)
+expand_GOACC_JOIN (gcall *stmt ATTRIBUTE_UNUSED)
 {
-  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
-  
 #ifdef HAVE_oacc_join
+  rtx mode = expand_normal (gimple_call_arg (stmt, 0));
+  
   emit_insn (gen_oacc_join (mode));
 #endif
 }


Regards,
 Thomas



* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-10  9:04                     ` Thomas Schwinge
  2015-07-11 19:25                       ` [gomp4] Revert "Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage" (was: fix df verify failure) Thomas Schwinge
@ 2015-07-13 11:26                       ` Thomas Schwinge
  2015-07-13 13:23                         ` Nathan Sidwell
       [not found]                         ` <55A7D5DD.2070600@mentor.com>
  1 sibling, 2 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-13 11:26 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3264 bytes --]

Hi!

On Fri, 10 Jul 2015 11:04:14 +0200, I wrote:
> On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> > This is the patch I committed.

> 2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
> remaining from the test cases that Julian had added in
> <http://news.gmane.org/find-root.php?message_id=%3C20150617151515.087aa93e%40octopus%3E>.
> 
> Unfortunately, there's also one regression, but I'm seeing it only on
> Nvidia K20 hardware, not on my laptop (but it may well be
> hardware-dependent: according to a web search, CUDA error 716 translates
> to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
> do you have an idea where it's coming from?

Are you looking into this, or should somebody else?


Also, this one:

> --- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
> @@ -1,5 +1,3 @@
> -/* { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
> -
>  #include <assert.h>
>  
>  /* Test of gang-private array variable declared on loop directive, with

... in fact still FAILs for acc_device_nvidia (maybe I've just been lucky
when I first tested your patch/commit?), so that's another thing to look
into; committed in r225733:

commit 79234191653398a5897ca9be0f28af417e1ad212
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Mon Jul 13 11:23:13 2015 +0000

    libgomp: XFAIL libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c for acc_device_nvidia
    
        private-vars-loop-gang-5.exe: [...]/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:29: main: Assertion `arr[i] == i + (i % 8) * 2' failed.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
    	Add XFAIL.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225733 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                                               | 5 +++++
 .../testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c   | 3 +++
 2 files changed, 8 insertions(+)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 6ee00be..fd7887a 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2015-07-13  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
+	Add XFAIL.
+
 2015-07-12  Tom de Vries  <tom@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-nest.c: New test.
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
index b070773..a710849 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
@@ -1,3 +1,6 @@
+/* main: Assertion `arr[i] == i + (i % 8) * 2' failed.
+   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
+
 #include <assert.h>
 
 /* Test of gang-private array variable declared on loop directive, with


Regards,
 Thomas



* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-13 11:26                       ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
@ 2015-07-13 13:23                         ` Nathan Sidwell
       [not found]                         ` <55A7D5DD.2070600@mentor.com>
  1 sibling, 0 replies; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-13 13:23 UTC (permalink / raw)
  To: Thomas Schwinge, GCC Patches; +Cc: Jakub Jelinek

On 07/13/15 07:26, Thomas Schwinge wrote:
> Hi!
>
> On Fri, 10 Jul 2015 11:04:14 +0200, I wrote:
>> On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
>>> This is the patch I committed.
>
>> 2. Don't be shy to remove a bunch of XFAILs, in fact all :-) of those
>> remaining from the test cases that Julian had added in
>> <http://news.gmane.org/find-root.php?message_id=%3C20150617151515.087aa93e%40octopus%3E>.
>>
>> Unfortunately, there's also one regressions, but I'm seeing it only on
>> Nvidia K20 hardware, not on my laptop (but it may well be
>> hardware-dependent: according to a web search, CUDA error 716 translates
>> to CUDA_ERROR_MISALIGNED_ADDRESS).  Are you reproducing that one, and/or
>> do you have an idea where it's coming from?
>
> Are you looking into this, or should somebody else?

I'm not looking at any regressions because I wasn't aware of any.

nathan


* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-10  0:25                   ` Nathan Sidwell
  2015-07-10  9:04                     ` Thomas Schwinge
  2015-07-11 21:18                     ` [gomp4] Resolve bootstrap failure in expand_GOACC_FORK, expand_GOACC_JOIN (was: Move openacc vector& worker single handling to RTL) Thomas Schwinge
@ 2015-07-14  8:26                     ` Thomas Schwinge
  2015-07-15  2:41                       ` Nathan Sidwell
  2015-07-18 20:31                     ` Thomas Schwinge
  2015-12-01  9:07                     ` Thomas Schwinge
  4 siblings, 1 reply; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-14  8:26 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3988 bytes --]

Hi!

It's me, again.  ;-)

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> This is the patch I committed.  [...]

> --- config/nvptx/nvptx.c	(revision 225323)
> +++ config/nvptx/nvptx.c	(working copy)

> +/* Direction of the spill/fill and looping setup/teardown indicator.  */
> +
> +enum propagate_mask
> +  {
> +    PM_read = 1 << 0,
> +    PM_write = 1 << 1,
> +    PM_loop_begin = 1 << 2,
> +    PM_loop_end = 1 << 3,
> +
> +    PM_read_write = PM_read | PM_write
> +  };
> +
> +/* Generate instruction(s) to spill or fill register REG to/from the
> +   worker broadcast array.  PM indicates what is to be done, REP
> +   how many loop iterations will be executed (0 for not a loop).  */
> +   
> +static rtx
> +nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
> +{
> +  rtx  res;
> +  machine_mode mode = GET_MODE (reg);
> +
> +  switch (mode)
> +    {
> +    case BImode:
> +      {
> +	rtx tmp = gen_reg_rtx (SImode);
> +	
> +	start_sequence ();
> +	if (pm & PM_read)
> +	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
> +	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
> +	if (pm & PM_write)
> +	  emit_insn (gen_rtx_SET (BImode, reg,
> +				  gen_rtx_NE (BImode, tmp, const0_rtx)));
> +	res = get_insns ();
> +	end_sequence ();
> +      }
> +      break;
> +
> +    default:
> +      {
> +	rtx addr = data->ptr;
> +
> +	if (!addr)
> +	  {
> +	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
> +
> +	    if (align > worker_bcast_align)
> +	      worker_bcast_align = align;
> +	    data->offset = (data->offset + align - 1) & ~(align - 1);
> +	    addr = data->base;
> +	    if (data->offset)
> +	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
> +	  }
> +	
> +	addr = gen_rtx_MEM (mode, addr);
> +	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
> +	if (pm & PM_read)
> +	  res = gen_rtx_SET (mode, addr, reg);
> +	if (pm & PM_write)
> +	  res = gen_rtx_SET (mode, reg, addr);
> +
> +	if (data->ptr)
> +	  {
> +	    /* We're using a ptr, increment it.  */
> +	    start_sequence ();
> +	    
> +	    emit_insn (res);
> +	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
> +				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
> +	    res = get_insns ();
> +	    end_sequence ();
> +	  }
> +	else
> +	  rep = 1;
> +	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
> +      }
> +      break;
> +    }
> +  return res;
> +}

OK to commit the following, or should other PM_* combinations be handled
here, such as (PM_read | PM_write)?  (But I don't think so.)

commit a1909fecb28267aa76df538ad9e01e4d228f5f9a
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Tue Jul 14 09:59:48 2015 +0200

    nvptx: Avoid -Wuninitialized diagnostic
    
        [...]/source-gcc/gcc/config/nvptx/nvptx.c: In function 'rtx_def* nvptx_gen_wcast(rtx, propagate_mask, unsigned int, wcast_data_t*)':
        [...]/source-gcc/gcc/config/nvptx/nvptx.c:1258:8: warning: 'res' may be used uninitialized in this function [-Wuninitialized]
    
    	gcc/
    	* config/nvptx/nvptx.c (nvptx_gen_wcast): Mark unreachable code
    	path.
---
 gcc/config/nvptx/nvptx.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index 0e1e764..dfe5d34 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -1253,10 +1253,12 @@ nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
 	
 	addr = gen_rtx_MEM (mode, addr);
 	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
-	if (pm & PM_read)
+	if (pm == PM_read)
 	  res = gen_rtx_SET (addr, reg);
-	if (pm & PM_write)
+	else if (pm == PM_write)
 	  res = gen_rtx_SET (reg, addr);
+	else
+	  gcc_unreachable ();
 
 	if (data->ptr)
 	  {


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-14  8:26                     ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
@ 2015-07-15  2:41                       ` Nathan Sidwell
  0 siblings, 0 replies; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-15  2:41 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: GCC Patches, Jakub Jelinek

On 07/14/15 04:25, Thomas Schwinge wrote:

>   	addr = gen_rtx_MEM (mode, addr);
>   	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
> -	if (pm & PM_read)
> +	if (pm == PM_read)
>   	  res = gen_rtx_SET (addr, reg);
> -	if (pm & PM_write)
> +	else if (pm == PM_write)
>   	  res = gen_rtx_SET (reg, addr);
> +	else
> +	  gcc_unreachable ();

OK.  Or maybe assert (pm == PM_write) inside the else?  Your call.

nathan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [gomp] Fix PTX worker spill/fill
@ 2015-07-16 17:18 Nathan Sidwell
       [not found] ` <55A7D4BA.2000309@mentor.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-16 17:18 UTC (permalink / raw)
  To: GCC Patches; +Cc: jnorris

[-- Attachment #1: Type: text/plain, Size: 226 bytes --]

I've committed this patch to fix a bug in the worker spill/fill code.  We ended 
up not incrementing the pointer, resulting in the stack frame being filled with 
the same value.

Thanks to Jim for finding the failure.

nathan

[-- Attachment #2: spill.patch --]
[-- Type: text/x-patch, Size: 635 bytes --]

2015-07-16  Nathan Sidwell  <nathan@codesourcery.com>

	* config/nvptx/nvptx.c (nvptx_gen_wcast): Fix typo accessing reg's
	mode for pointer increment.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225831)
+++ config/nvptx/nvptx.c	(working copy)
@@ -1257,7 +1257,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask
 	    
 	    emit_insn (res);
 	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
-				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (reg)))));
 	    res = get_insns ();
 	    end_sequence ();
 	  }

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Fix PTX worker spill/fill
       [not found]                         ` <55A7D5DD.2070600@mentor.com>
@ 2015-07-17  9:00                           ` Thomas Schwinge
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-17  9:00 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: jnorris, GCC Patches

[-- Attachment #1: Type: text/plain, Size: 2960 bytes --]

Hi!

On Thu, 16 Jul 2015 12:23:52 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> I've committed this patch to fix a bug in the worker spill/fill code.  We ended 
> up not incrementing the pointer, resulting in the stack frame being filled with 
> the same value.
> 
> Thanks to Jim for finding the failure.

> --- config/nvptx/nvptx.c	(revision 225831)
> +++ config/nvptx/nvptx.c	(working copy)
> @@ -1257,7 +1257,7 @@ nvptx_gen_wcast (rtx reg, propagate_mask
>  	    
>  	    emit_insn (res);
>  	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
> -				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
> +				   GEN_INT (GET_MODE_SIZE (GET_MODE (reg)))));
>  	    res = get_insns ();
>  	    end_sequence ();
>  	  }

Nice; this is actually the change to resolve the FAIL for
libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c I had reported in
<http://news.gmane.org/find-root.php?message_id=%3C877fq41epd.fsf%40kepler.schwinge.homeip.net%3E>.
(The testsuite/libgomp.oacc-c-c++-common/parallel-loop-1.c regression
reported earlier in that thread remains to be addressed.)  Committed to
gomp-4_0-branch in r225922:

commit 7961bf7049729aebadf639a52174be14010da499
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jul 17 08:30:10 2015 +0000

    libgomp: Remove XFAIL libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c for acc_device_nvidia
    
    Problem got addressed in r225896.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
    	Remove XFAIL.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225922 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                                            |    5 +++++
 .../libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c          |    3 ---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index b2e4b2c..0293ad5 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,8 @@
+2015-07-17  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
+	Remove XFAIL.
+
 2015-07-15  Nathan Sidwell  <nathan@codesourcery.com>
 
 	* plugin/plugin-nvptx.c (nvptx_exec): Show grid dimensions in
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
index a710849..b070773 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c
@@ -1,6 +1,3 @@
-/* main: Assertion `arr[i] == i + (i % 8) * 2' failed.
-   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } { "*" } { "" } } */
-
 #include <assert.h>
 
 /* Test of gang-private array variable declared on loop directive, with


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Fix PTX worker spill/fill
       [not found] ` <55A7D4BA.2000309@mentor.com>
@ 2015-07-17  9:29   ` Thomas Schwinge
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-17  9:29 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: jnorris, Cesar Philippidis

[-- Attachment #1: Type: text/plain, Size: 3058 bytes --]

Hi!

On Thu, 16 Jul 2015 12:23:52 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> I've committed this patch to fix a bug in the worker spill/fill code.  We ended 
> up not incrementing the pointer, resulting in the stack frame being filled with 
> the same value.
> 
> Thanks to Jim for finding the failure.

Cesar had prepared a reduced test case, a slightly altered variant of
which I've now committed to gomp-4_0-branch in r225924:

commit ee7fb343a0d0dbd17ac8dc7d24048d8647e41232
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Fri Jul 17 09:11:10 2015 +0000

    OpenACC: Add test case for worker state propagation handling the stack frame
    
    ... for problem that got addressed in r225896.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/worker-partn-8.c: New file.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@225924 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                             |  5 ++
 .../libgomp.oacc-c-c++-common/worker-partn-8.c     | 53 ++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 0293ad5..ec943f5 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,4 +1,9 @@
 2015-07-17  Thomas Schwinge  <thomas@codesourcery.com>
+	    Cesar Philippidis  <cesar@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c++-common/worker-partn-8.c: New file.
+
+2015-07-17  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c-c++-common/private-vars-loop-gang-5.c:
 	Remove XFAIL.
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-8.c libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-8.c
new file mode 100644
index 0000000..e787947
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/worker-partn-8.c
@@ -0,0 +1,53 @@
+/* { dg-additional-options "-O0" } */
+
+/* With -O0, variables are on the stack, not in registers.  Check that worker
+   state propagation handles the stack frame.  */
+
+int
+main (int argc, char *argv[])
+{
+  int w0 = 0;
+  int w1 = 0;
+  int w2 = 0;
+  int w3 = 0;
+  int w4 = 0;
+  int w5 = 0;
+  int w6 = 0;
+  int w7 = 0;
+
+  int i;
+
+#pragma acc parallel num_gangs (1) num_workers (8) copy (w0, w1, w2, w3, w4, w5, w6, w7)
+  {
+    int internal = 100;
+
+#pragma acc loop worker
+    for (i = 0; i < 8; i++)
+      {
+	switch (i)
+	  {
+	  case 0: w0 = internal; break;
+	  case 1: w1 = internal; break;
+	  case 2: w2 = internal; break;
+	  case 3: w3 = internal; break;
+	  case 4: w4 = internal; break;
+	  case 5: w5 = internal; break;
+	  case 6: w6 = internal; break;
+	  case 7: w7 = internal; break;
+	  default: break;
+	  }
+      }
+  }
+
+  if (w0 != 100
+      || w1 != 100
+      || w2 != 100
+      || w3 != 100
+      || w4 != 100
+      || w5 != 100
+      || w6 != 100
+      || w7 != 100)
+    __builtin_abort ();
+
+  return 0;
+}


Grüße,
 Thomas

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-10  0:25                   ` Nathan Sidwell
                                       ` (2 preceding siblings ...)
  2015-07-14  8:26                     ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
@ 2015-07-18 20:31                     ` Thomas Schwinge
  2015-07-20 13:19                       ` Nathan Sidwell
  2015-07-21 20:57                       ` [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
  2015-12-01  9:07                     ` Thomas Schwinge
  4 siblings, 2 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-18 20:31 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Jakub Jelinek


[-- Attachment #1.1: Type: text/plain, Size: 5437 bytes --]

Hi Nathan!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> This is the patch I committed.  [...]

Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
and C++ libgomp testing with -O0 (for Fortran, we iterate through various
kinds of optimization levels anyway).  (There are no regressions of
OpenMP testing.)  

For OpenACC nvptx offloading, there must still be something wrong; here's
a count of the (non-deterministic!) regressions of ten runs of the
libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
probably makes sense to look into that one first.

For avoidance of doubt, there are no such regressions if I un-apply your
patch to »[m]ove openacc vector& worker single handling to RTL«.

libgomp.oacc-c:

    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    5: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    2: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-vector-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    2: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    2: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    8: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    1: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/worker-partn-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c/../libgomp.oacc-c-c++-common/worker-partn-6.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test

libgomp.oacc-c++:

    5: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    5: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    5: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    6: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    3: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    2: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    7: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    4: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-6.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    5: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
    1: [-PASS:-]{+FAIL:+} libgomp.oacc-c++/../libgomp.oacc-c-c++-common/worker-partn-6.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test


Grüße,
 Thomas



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: 0001-O0-libgomp-C-C-testing.patch --]
[-- Type: text/x-diff, Size: 2081 bytes --]

From a527ce3bcb60a4dbd8feb579dd90688b33760d78 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Fri, 17 Jul 2015 15:24:19 +0200
Subject: [PATCH] -O0 libgomp C, C++ testing

---
 libgomp/testsuite/libgomp.c++/c++.exp      | 1 +
 libgomp/testsuite/libgomp.c/c.exp          | 1 +
 libgomp/testsuite/libgomp.oacc-c++/c++.exp | 1 +
 libgomp/testsuite/libgomp.oacc-c/c.exp     | 1 +
 4 files changed, 4 insertions(+)

diff --git a/libgomp/testsuite/libgomp.c++/c++.exp b/libgomp/testsuite/libgomp.c++/c++.exp
index d6d525a..6bdb83d 100644
--- a/libgomp/testsuite/libgomp.c++/c++.exp
+++ b/libgomp/testsuite/libgomp.c++/c++.exp
@@ -16,6 +16,7 @@ if [info exists lang_include_flags] then {
 if ![info exists DEFAULT_CFLAGS] then {
     set DEFAULT_CFLAGS "-O2"
 }
+set DEFAULT_CFLAGS "-O0"
 
 # Initialize dg.
 dg-init
diff --git a/libgomp/testsuite/libgomp.c/c.exp b/libgomp/testsuite/libgomp.c/c.exp
index 25f347b..f89377f 100644
--- a/libgomp/testsuite/libgomp.c/c.exp
+++ b/libgomp/testsuite/libgomp.c/c.exp
@@ -16,6 +16,7 @@ load_gcc_lib gcc-dg.exp
 if ![info exists DEFAULT_CFLAGS] then {
     set DEFAULT_CFLAGS "-O2"
 }
+set DEFAULT_CFLAGS "-O0"
 
 # Initialize dg.
 dg-init
diff --git a/libgomp/testsuite/libgomp.oacc-c++/c++.exp b/libgomp/testsuite/libgomp.oacc-c++/c++.exp
index 7309f78..4dba472 100644
--- a/libgomp/testsuite/libgomp.oacc-c++/c++.exp
+++ b/libgomp/testsuite/libgomp.oacc-c++/c++.exp
@@ -18,6 +18,7 @@ if [info exists lang_include_flags] then {
 if ![info exists DEFAULT_CFLAGS] then {
     set DEFAULT_CFLAGS "-O2"
 }
+set DEFAULT_CFLAGS "-O0"
 
 # Initialize dg.
 dg-init
diff --git a/libgomp/testsuite/libgomp.oacc-c/c.exp b/libgomp/testsuite/libgomp.oacc-c/c.exp
index 60be15d..80b4635 100644
--- a/libgomp/testsuite/libgomp.oacc-c/c.exp
+++ b/libgomp/testsuite/libgomp.oacc-c/c.exp
@@ -18,6 +18,7 @@ load_gcc_lib gcc-dg.exp
 if ![info exists DEFAULT_CFLAGS] then {
     set DEFAULT_CFLAGS "-O2"
 }
+set DEFAULT_CFLAGS "-O0"
 
 # Initialize dg.
 dg-init
-- 
2.1.4


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-18 20:31                     ` Thomas Schwinge
@ 2015-07-20 13:19                       ` Nathan Sidwell
  2015-07-20 15:56                         ` Nathan Sidwell
  2015-07-21 20:57                       ` [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
  1 sibling, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-20 13:19 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: GCC Patches, Jakub Jelinek

On 07/18/15 11:37, Thomas Schwinge wrote:
> Hi Nathan!

> For OpenACC nvptx offloading, there must still be something wrong; here's
> a count of the (non-deterministic!) regressions of ten runs of the
> libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
> probably makes sense to look into that one first.

I'll take a look. :(

nathan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-20 13:19                       ` Nathan Sidwell
@ 2015-07-20 15:56                         ` Nathan Sidwell
  2015-07-22 17:05                           ` Nathan Sidwell
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-20 15:56 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: GCC Patches, Jakub Jelinek

On 07/20/15 09:01, Nathan Sidwell wrote:
> On 07/18/15 11:37, Thomas Schwinge wrote:
>> Hi Nathan!
>
>> For OpenACC nvptx offloading, there must still be something wrong; here's
>> a count of the (non-deterministic!) regressions of ten runs of the
>> libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
>> probably makes sense to look into that one first.
>
> I'll take a look. :(

Having difficulty reproducing it (preprocessed source compiled at -O0 works for 
me).  Do you have an exact recipe?


nathan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-18 20:31                     ` Thomas Schwinge
  2015-07-20 13:19                       ` Nathan Sidwell
@ 2015-07-21 20:57                       ` Nathan Sidwell
  2015-07-22  8:32                         ` Thomas Schwinge
  1 sibling, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-21 20:57 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: GCC Patches, Jakub Jelinek

On 07/18/15 11:37, Thomas Schwinge wrote:
> Hi Nathan!
>
> On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
>> This is the patch I committed.  [...]
>
> Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
> used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
> and C++ libgomp testing with -O0 (for Fortran, we iterate through various
> kinds of optimization levels anyway).  (There are no regressions of
> OpenMP testing.)
>
> For OpenACC nvptx offloading, there must still be something wrong; here's
> a count of the (non-deterministic!) regressions of ten runs of the
> libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
> probably makes sense to look into that one first.
>
> For avoidance of doubt, there are no such regressions if I un-apply your
> patch to »[m]ove openacc vector& worker single handling to RTL«.

I cannot reproduce the failures.  Applying your patch I see the following new fails:

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/lib-5.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/present-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 output pattern test, is , should match present clause: !acc_is_present
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test

Which differs from your list.  Attempting to reproduce outside the test suite 
results in working executables.

nathan

-- 
Nathan Sidwell

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-21 20:57                       ` [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
@ 2015-07-22  8:32                         ` Thomas Schwinge
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-22  8:32 UTC (permalink / raw)
  To: Nathan Sidwell; +Cc: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 3419 bytes --]

Hi Nathan!

On Tue, 21 Jul 2015 16:05:05 -0400, Nathan Sidwell <nathan@codesourcery.com> wrote:
> On 07/18/15 11:37, Thomas Schwinge wrote:
> > On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> >> This is the patch I committed.  [...]
> >
> > Prompted by your recent "-O0 patch" to »[f]ix PTX worker spill/fill«, I
> > used the attached patch 0001-O0-libgomp-C-C-testing.patch to run all C
> > and C++ libgomp testing with -O0 (for Fortran, we iterate through various
> > kinds of optimization levels anyway).  (There are no regressions of
> > OpenMP testing.)
> >
> > For OpenACC nvptx offloading, there must still be something wrong; here's
> > a count of the (non-deterministic!) regressions of ten runs of the
> > libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
> > probably makes sense to look into that one first.
> >
> > For avoidance of doubt, there are no such regressions if I un-apply your
> > patch to »[m]ove openacc vector& worker single handling to RTL«.
> 
> I cannot reproduce the failures.  Applying your patch I see the following new fails:
> 
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/lib-5.c -DACC_DEVICE_TYPE_host_nonshm=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-local-worker-3.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/private-vars-loop-worker-7.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/present-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 output pattern test, is , should match present clause: !acc_is_present
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-local-worker-2.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-vector-1.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-4.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> FAIL: libgomp.oacc-c++/../libgomp.oacc-c-c++-common/private-vars-loop-worker-5.c -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0 execution test
> 
> Which differs from your list.

Well, then instead look into one of these (the private-vars-* ones)?  :-)
(Still hoping they're all caused by the same problem.)

> Attempting to reproduce outside the test suite 
> results in working executables.

Have you tried running it multiple times?  As I said, it's
non-deterministic.

Taking from libgomp.log the compile command line of
private-vars-loop-worker-5.c for »-DACC_DEVICE_TYPE_nvidia=1«, removing
the constructor.o stuff, replacing »-L« by »{-L,-Wl\,-rpath\,}«, and
adding »-O0« at the end, I then see the following:

    $ while :; do ./private-vars-loop-worker-5.exe 2> /dev/null && echo -n .; done
    ...Aborted (core dumped)
    .........Aborted (core dumped)
    ........Aborted (core dumped)
    ....Aborted (core dumped)
    .Aborted (core dumped)
    ...........Aborted (core dumped)
    ........Aborted (core dumped)
    Aborted (core dumped)
    .Aborted (core dumped)
    ...Aborted (core dumped)
    [...]


Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-20 15:56                         ` Nathan Sidwell
@ 2015-07-22 17:05                           ` Nathan Sidwell
  2015-07-23  8:52                             ` [gomp4] libgomp: Some torture testing for C and C++ OpenACC test cases (was: [gomp] Move openacc vector& worker single handling to RTL) Thomas Schwinge
  0 siblings, 1 reply; 31+ messages in thread
From: Nathan Sidwell @ 2015-07-22 17:05 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: GCC Patches, Jakub Jelinek

[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]

On 07/20/15 11:08, Nathan Sidwell wrote:
> On 07/20/15 09:01, Nathan Sidwell wrote:
>> On 07/18/15 11:37, Thomas Schwinge wrote:
>>> Hi Nathan!
>>
>>> For OpenACC nvptx offloading, there must still be something wrong; here's
>>> a count of the (non-deterministic!) regressions of ten runs of the
>>> libgomp testsuite.  As private-vars-loop-worker-5.c fails most often, it
>>> probably makes sense to look into that one first.
>>
>> I'll take a look. :(
>
> Having difficulty reproducing it (preprocessed source compiled at -O0 works for
> me).  Do you have an exact recipe?

Thomas helped me reproduce them -- they are very intermittent.  Anyway, they are fixed 
with the attached patch, which I've committed to the gomp branch.

The bug was a race condition in the worker-level 'follow along' algorithm. 
Worker zero could overwrite the flag for some subsequent block before all the 
other workers had read the previous value of the flag.  This wasn't 
optimization-level specific, but it appears unoptimized code creates better 
conditions to cause the behaviour.

This appears to fix all the -O0 regressions you observed Thomas.

nathan

[-- Attachment #2: gomp4-barrier.patch --]
[-- Type: text/x-patch, Size: 3145 bytes --]

2015-07-22  Nathan Sidwell  <nathan@acm.org>

	* config/nvptx/nvptx.c (nvptx_option_override): Initialize worker
	buffer alignment here.
	(nvptx_wsync): Generate pattern, not emit instruction.
	(nvptx_single): Insert barrier after read.
	(nvptx_process_pars): Adjust nvptx_wsync use.
	(nvptx_file_end): No need to apply default alignment here.

Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 226044)
+++ config/nvptx/nvptx.c	(working copy)
@@ -124,6 +124,7 @@ nvptx_option_override (void)
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
 
   worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
+  worker_bcast_align = GET_MODE_SIZE (SImode);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -2627,12 +2628,13 @@ nvptx_wpropagate (bool pre_p, basic_bloc
     }
 }
 
-/* Emit a worker-level synchronization barrier.  */
+/* Emit a worker-level synchronization barrier.  We use different
+   markers for before and after synchronizations.  */
 
-static void
-nvptx_wsync (bool tail_p, rtx_insn *insn)
+static rtx
+nvptx_wsync (bool after)
 {
-  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+  return gen_nvptx_barsync (GEN_INT (after));
 }
 
 /* Single neutering according to MASK.  FROM is the incoming block and
@@ -2750,7 +2752,7 @@ nvptx_single (unsigned mask, basic_block
 	}
       else
 	{
-	  /* Includes worker mode, do spill & fill.  by construction
+	  /* Includes worker mode, do spill & fill.  By construction
 	     we should never have worker mode only. */
 	  wcast_data_t data;
 
@@ -2763,10 +2765,14 @@ nvptx_single (unsigned mask, basic_block
 	  data.offset = 0;
 	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
 			    before);
-	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  /* Barrier so other workers can see the write.  */
+	  emit_insn_before (nvptx_wsync (false), tail);
 	  data.offset = 0;
-	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
-			    tail);
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), tail);
+	  /* This barrier is needed to avoid worker zero clobbering
+	     the broadcast buffer before all the other workers have
+	     had a chance to read this instance of it.  */
+	  emit_insn_before (nvptx_wsync (true), tail);
 	}
 
       extract_insn (tail);
@@ -2824,8 +2830,8 @@ nvptx_process_pars (parallel *par)
 			  par->forked_insn);
 	nvptx_wpropagate (true, par->forked_block, par->fork_insn);
 	/* Insert begin and end synchronizations.  */
-	nvptx_wsync (false, par->forked_insn);
-	nvptx_wsync (true, par->joining_insn);
+	emit_insn_after (nvptx_wsync (false), par->forked_insn);
+	emit_insn_before (nvptx_wsync (true), par->joining_insn);
       }
       break;
 
@@ -3046,8 +3052,6 @@ nvptx_file_end (void)
     {
       /* Define the broadcast buffer.  */
 
-      if (worker_bcast_align < GET_MODE_SIZE (SImode))
-	worker_bcast_align = GET_MODE_SIZE (SImode);
       worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
 	& ~(worker_bcast_align - 1);
       

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [gomp4] libgomp: Some torture testing for C and C++ OpenACC test cases (was: [gomp] Move openacc vector& worker single handling to RTL)
  2015-07-22 17:05                           ` Nathan Sidwell
@ 2015-07-23  8:52                             ` Thomas Schwinge
  0 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-07-23  8:52 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches; +Cc: Jakub Jelinek


Hi!

On Wed, 22 Jul 2015 12:47:32 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> On 07/20/15 11:08, Nathan Sidwell wrote:
> > On 07/20/15 09:01, Nathan Sidwell wrote:
> >> On 07/18/15 11:37, Thomas Schwinge wrote:
> >>> For OpenACC nvptx offloading, there must still be something wrong; here's
> >>> a count of the (non-deterministic!) regressions of ten runs of the
> >>> libgomp testsuite.

> Thomas helped me reproduce them -- they are very intermittent.  Anyway, fixed 
> with the attached patch I've committed to gomp branch.

\o/

> This appears to fix all the -O0 regressions you observed Thomas.

Thanks, confirmed!


To get better test coverage for device-specific code that is only ever
used in offloading configurations, it's a good idea to also run a
(limited) set of torture testing for some libgomp C and C++ test cases
(as is already done for all Fortran testing): those that deal with the
specifics of gang/worker/vector single/redundant/partitioned modes.
They're selected based on their file names -- not a perfect property for
detecting such test cases, but it should be sufficient.  To avoid
testing time exploding too much, any torture testing is limited to -O0
and -O2 only, under the assumption that the biggest difference in the
overall structure of the generated code is between -O0 and
-O[something].

Committed to gomp-4_0-branch in r226091:

commit b1bd5f92c3f536ebab9b36510636c7ab845123f8
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Thu Jul 23 08:50:15 2015 +0000

    libgomp: Some torture testing for C and C++ OpenACC test cases
    
    	libgomp/
    	* testsuite/libgomp.oacc-c++/c++.exp: Run ttests with
    	gcc-dg-runtest.
    	* testsuite/libgomp.oacc-c/c.exp: Likewise.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@226091 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.gomp                     |  6 ++++++
 libgomp/testsuite/libgomp.oacc-c++/c++.exp | 26 ++++++++++++++++++++++++++
 libgomp/testsuite/libgomp.oacc-c/c.exp     | 25 +++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git libgomp/ChangeLog.gomp libgomp/ChangeLog.gomp
index 33e7b3b..b5ace3f 100644
--- libgomp/ChangeLog.gomp
+++ libgomp/ChangeLog.gomp
@@ -1,3 +1,9 @@
+2015-07-23  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c++/c++.exp: Run ttests with
+	gcc-dg-runtest.
+	* testsuite/libgomp.oacc-c/c.exp: Likewise.
+
 2015-07-22  Thomas Schwinge  <thomas@codesourcery.com>
 
 	* testsuite/libgomp.oacc-c-c++-common/lib-1.c: Remove explicit
diff --git libgomp/testsuite/libgomp.oacc-c++/c++.exp libgomp/testsuite/libgomp.oacc-c++/c++.exp
index 7309f78..3dbc917 100644
--- libgomp/testsuite/libgomp.oacc-c++/c++.exp
+++ libgomp/testsuite/libgomp.oacc-c++/c++.exp
@@ -1,5 +1,12 @@
 # This whole file adapted from libgomp.c++/c++.exp.
 
+# To avoid testing time exploding too much, limit any torture testing to -O0
+# and -O2 only, under the assumption that between -O0 and -O[something] there
+# is the biggest difference in the overall structure of the generated code.
+set TORTURE_OPTIONS [list \
+    { -O0 } \
+    { -O2 } ]
+
 load_lib libgomp-dg.exp
 load_gcc_lib gcc-dg.exp
 
@@ -61,6 +68,22 @@ if { $lang_test_file_found } {
     set tests [lsort [concat \
 			  [find $srcdir/$subdir *.C] \
 			  [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *.c]]]
+    # To get better test coverage for device-specific code that is only ever
+    # used in offloading configurations, we'd like more thorough (torture)
+    # testing for test cases that are dealing with the specifics of
+    # gang/worker/vector single/redundant/partitioned modes.  They're selected
+    # based on their file names -- not a perfect property to detect such test
+    # cases, but should be sufficient.
+    set ttests [lsort -unique [concat \
+				   [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *gang*.c] \
+				   [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *worker*.c] \
+				   [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *vec*.c]]]
+    # tests := tests - ttests.
+    foreach t $ttests {
+	set i [lsearch -exact $tests $t]
+	set tests [lreplace $tests $i $i]
+    }
+
 
     if { $blddir != "" } {
         set ld_library_path "$always_ld_library_path:${blddir}/${lang_library_path}"
@@ -116,6 +139,7 @@ if { $lang_test_file_found } {
 	set tagopt "$tagopt -DACC_MEM_SHARED=$acc_mem_shared"
 
 	dg-runtest $tests "$tagopt" "$libstdcxx_includes $DEFAULT_CFLAGS"
+	gcc-dg-runtest $ttests "$tagopt" "$libstdcxx_includes"
     }
 }
 
@@ -124,5 +148,7 @@ if { [info exists HAVE_SET_GXX_UNDER_TEST] } {
     unset GXX_UNDER_TEST
 }
 
+unset TORTURE_OPTIONS
+
 # All done.
 dg-finish
diff --git libgomp/testsuite/libgomp.oacc-c/c.exp libgomp/testsuite/libgomp.oacc-c/c.exp
index 60be15d..988dfc6 100644
--- libgomp/testsuite/libgomp.oacc-c/c.exp
+++ libgomp/testsuite/libgomp.oacc-c/c.exp
@@ -11,6 +11,13 @@ if [info exists lang_include_flags] then {
     unset lang_include_flags
 }
 
+# To avoid testing time exploding too much, limit any torture testing to -O0
+# and -O2 only, under the assumption that between -O0 and -O[something] there
+# is the biggest difference in the overall structure of the generated code.
+set TORTURE_OPTIONS [list \
+    { -O0 } \
+    { -O2 } ]
+
 load_lib libgomp-dg.exp
 load_gcc_lib gcc-dg.exp
 
@@ -31,6 +38,21 @@ lappend libgomp_compile_options "compiler=$GCC_UNDER_TEST"
 set tests [lsort [concat \
 		      [find $srcdir/$subdir *.c] \
 		      [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *.c]]]
+# To get better test coverage for device-specific code that is only ever
+# used in offloading configurations, we'd like more thorough (torture)
+# testing for test cases that are dealing with the specifics of
+# gang/worker/vector single/redundant/partitioned modes.  They're selected
+# based on their file names -- not a perfect property to detect such test
+# cases, but should be sufficient.
+set ttests [lsort -unique [concat \
+			       [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *gang*.c] \
+			       [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *worker*.c] \
+			       [find $srcdir/$subdir/../libgomp.oacc-c-c++-common *vec*.c]]]
+# tests := tests - ttests.
+foreach t $ttests {
+    set i [lsearch -exact $tests $t]
+    set tests [lreplace $tests $i $i]
+}
 
 set ld_library_path $always_ld_library_path
 append ld_library_path [gcc-set-multilib-library-path $GCC_UNDER_TEST]
@@ -75,7 +97,10 @@ foreach offload_target_openacc $offload_targets_s_openacc {
     set tagopt "$tagopt -DACC_MEM_SHARED=$acc_mem_shared"
 
     dg-runtest $tests "$tagopt" $DEFAULT_CFLAGS
+    gcc-dg-runtest $ttests "$tagopt" ""
 }
 
+unset TORTURE_OPTIONS
+
 # All done.
 dg-finish


Regards,
 Thomas



* Re: [gomp] Move openacc vector& worker single handling to RTL
  2015-07-10  0:25                   ` Nathan Sidwell
                                       ` (3 preceding siblings ...)
  2015-07-18 20:31                     ` Thomas Schwinge
@ 2015-12-01  9:07                     ` Thomas Schwinge
  4 siblings, 0 replies; 31+ messages in thread
From: Thomas Schwinge @ 2015-12-01  9:07 UTC (permalink / raw)
  To: Nathan Sidwell, GCC Patches


Hi!

On Thu, 09 Jul 2015 20:25:22 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> This is the patch I committed.  [...]

> 2015-07-09  Nathan Sidwell  <nathan@codesourcery.com>

> 	* omp-low.c (omp_region): [...]
> 	(enclosing_target_region, required_predication_mask,
> 	generate_vector_broadcast, generate_oacc_broadcast,
> 	make_predication_test, predicate_bb, find_predicatable_bbs,
> 	predicate_omp_regions): Delete.
> 	[...]

This removed all usage of bb_region_map.  Now cleaned up in
gomp-4_0-branch r231102:

commit ff7e1eb4e855aa16d14ae047172269bc7192a069
Author: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Dec 1 09:04:33 2015 +0000

    gcc/omp-low.c: Remove bb_region_map
    
    	gcc/
    	* omp-low.c (bb_region_map): Remove.  Adjust all users.
    
    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/branches/gomp-4_0-branch@231102 138bc75d-0d04-0410-961f-82ee72b054a4
---
 gcc/ChangeLog.gomp |  4 ++++
 gcc/omp-low.c      | 42 +++++++++++++++++-------------------------
 2 files changed, 21 insertions(+), 25 deletions(-)

diff --git gcc/ChangeLog.gomp gcc/ChangeLog.gomp
index 0e4f371..4842164 100644
--- gcc/ChangeLog.gomp
+++ gcc/ChangeLog.gomp
@@ -1,3 +1,7 @@
+2015-12-01  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* omp-low.c (bb_region_map): Remove.  Adjust all users.
+
 2015-11-30  Cesar Philippidis  <cesar@codesourcery.com>
 
 	* tree-nested.c (convert_nonlocal_omp_clauses): Handle optional
diff --git gcc/omp-low.c gcc/omp-low.c
index 1b52f6b..a1e7a14 100644
--- gcc/omp-low.c
+++ gcc/omp-low.c
@@ -13356,9 +13356,6 @@ expand_omp (struct omp_region *region)
     }
 }
 
-/* Map each basic block to an omp_region.  */
-static hash_map<basic_block, omp_region *> *bb_region_map;
-
 static void
 find_omp_for_region_data (struct omp_region *region, gomp_for *stmt)
 {
@@ -13394,8 +13391,6 @@ build_omp_regions_1 (basic_block bb, struct omp_region *parent,
   gimple *stmt;
   basic_block son;
 
-  bb_region_map->put (bb, parent);
-
   gsi = gsi_last_bb (bb);
   if (!gsi_end_p (gsi) && is_gimple_omp (gsi_stmt (gsi)))
     {
@@ -13536,31 +13531,28 @@ build_omp_regions (void)
 static unsigned int
 execute_expand_omp (void)
 {
-  bb_region_map = new hash_map<basic_block, omp_region *>;
-
   build_omp_regions ();
 
-  if (root_omp_region)
+  if (!root_omp_region)
+    return 0;
+
+  if (dump_file)
     {
-      if (dump_file)
-	{
-	  fprintf (dump_file, "\nOMP region tree\n\n");
-	  dump_omp_region (dump_file, root_omp_region, 0);
-	  fprintf (dump_file, "\n");
-	}
-
-      remove_exit_barriers (root_omp_region);
-
-      expand_omp (root_omp_region);
-
-      if (flag_checking && !loops_state_satisfies_p (LOOPS_NEED_FIXUP))
-	verify_loop_structure ();
-      cleanup_tree_cfg ();
-
-      free_omp_regions ();
+      fprintf (dump_file, "\nOMP region tree\n\n");
+      dump_omp_region (dump_file, root_omp_region, 0);
+      fprintf (dump_file, "\n");
     }
 
-  delete bb_region_map;
+  remove_exit_barriers (root_omp_region);
+
+  expand_omp (root_omp_region);
+
+  if (flag_checking && !loops_state_satisfies_p (LOOPS_NEED_FIXUP))
+    verify_loop_structure ();
+  cleanup_tree_cfg ();
+
+  free_omp_regions ();
+
   return 0;
 }
 


Regards
 Thomas



end of thread, other threads:[~2015-12-01  9:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-03 22:52 [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
2015-07-03 23:12 ` Jakub Jelinek
2015-07-04 20:41   ` Nathan Sidwell
2015-07-06 19:35     ` Nathan Sidwell
2015-07-07  9:54       ` Jakub Jelinek
2015-07-07 14:13         ` Nathan Sidwell
2015-07-07 14:22           ` Jakub Jelinek
2015-07-07 14:43             ` Nathan Sidwell
2015-07-08 14:48             ` Nathan Sidwell
2015-07-08 14:58               ` Jakub Jelinek
2015-07-08 21:46                 ` Nathan Sidwell
2015-07-10  0:25                   ` Nathan Sidwell
2015-07-10  9:04                     ` Thomas Schwinge
2015-07-11 19:25                       ` [gomp4] Revert "Work around nvptx offloading compiler --enable-checking=yes,df,fold,rtl breakage" (was: fix df verify failure) Thomas Schwinge
2015-07-13 11:26                       ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
2015-07-13 13:23                         ` Nathan Sidwell
     [not found]                         ` <55A7D5DD.2070600@mentor.com>
2015-07-17  9:00                           ` [gomp] Fix PTX worker spill/fill Thomas Schwinge
2015-07-11 21:18                     ` [gomp4] Resolve bootstrap failure in expand_GOACC_FORK, expand_GOACC_JOIN (was: Move openacc vector& worker single handling to RTL) Thomas Schwinge
2015-07-14  8:26                     ` [gomp] Move openacc vector& worker single handling to RTL Thomas Schwinge
2015-07-15  2:41                       ` Nathan Sidwell
2015-07-18 20:31                     ` Thomas Schwinge
2015-07-20 13:19                       ` Nathan Sidwell
2015-07-20 15:56                         ` Nathan Sidwell
2015-07-22 17:05                           ` Nathan Sidwell
2015-07-23  8:52                             ` [gomp4] libgomp: Some torture testing for C and C++ OpenACC test cases (was: [gomp] Move openacc vector& worker single handling to RTL) Thomas Schwinge
2015-07-21 20:57                       ` [gomp] Move openacc vector& worker single handling to RTL Nathan Sidwell
2015-07-22  8:32                         ` Thomas Schwinge
2015-12-01  9:07                     ` Thomas Schwinge
2015-07-10 22:04 [gomp] fix df verify failure Nathan Sidwell
2015-07-16 17:18 [gomp] Fix PTX worker spill/fill Nathan Sidwell
     [not found] ` <55A7D4BA.2000309@mentor.com>
2015-07-17  9:29   ` Thomas Schwinge
