* [gomp4] Preserve NVPTX "reconvergence" points
@ 2015-05-28 14:20 Julian Brown
  2015-05-28 14:59 ` Jakub Jelinek
  2015-05-28 15:02 ` Richard Biener
  0 siblings, 2 replies; 23+ messages in thread
From: Julian Brown @ 2015-05-28 14:20 UTC (permalink / raw)
  To: gcc-patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge

[-- Attachment #1: Type: text/plain, Size: 6991 bytes --]

For NVPTX, it is vitally important that the divergence of threads
within a warp can be controlled: in particular we must be able to
generate code that we know "reconverges" at a particular point.
Unfortunately GCC's middle-end optimisers can cause this property to
be violated, which causes problems for the OpenACC execution model
we're planning to use for NVPTX.

As a brief example: code running in vector-single mode runs on a
single thread of a warp, and must broadcast condition results to the
other threads of the warp so that they can "follow along" and be
ready for vector-partitioned execution when necessary.

  #pragma acc parallel
  {
    #pragma acc loop gang
    for (i = 0; i < N; i++)
      {
        /* This is vector-single mode.  */
        n = ...;
        switch (n)
          {
          case 1:
            #pragma acc loop vector
            for (...)
              {
                /* This is vector-partitioned mode.  */
              }
            ...
          }
      }
  }

Here, the calculation "n = ..." takes place on a single thread (of
each partitioned gang of the outer loop), but the switch statement
(terminating the BB) must be executed by all threads in the warp.
The vector-single statements will be translated with a branch that
the "idle" threads take around them:

  if (threadIdx.x == 0)
    {
      n_0 = ...;
    }
  n_x = broadcast (n_0)
  switch (n_x)
  ...

Here "broadcast" is an operation that transfers values from some
other thread of a warp (i.e., the zeroth) to the current thread
(implemented as a "shfl" instruction for NVPTX).

I observed an example similar to this in which the broadcast and
switch instructions were cloned (in the .dom1 dump), along the lines
of:

  if (threadIdx.x == 0)
    {
      n_0 = ...;
      n_x = broadcast (n_0)
      switch (n_x)
      ...
    }
  else
    {
      n_x = broadcast (n_0)
      switch (n_x)
      ...
    }

This doesn't work, because the "broadcast" operation has to be run
with non-diverged warps for correct operation, and here there is
divergence due to the "if (threadIdx.x == 0)" condition.
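To make the constraint concrete, here is a minimal CUDA-style sketch
of the two shapes above, using the warp-shuffle intrinsic (__shfl, as
spelled before CUDA 9) that implements the "broadcast" operation on
NVPTX. The function names and the single-warp simplification are
illustrative only, not part of the patch:

  /* Correct shape: the shuffle executes with the warp converged, so
     every lane receives lane 0's value of n.  */
  __device__ int broadcast_ok (const int *src)
  {
    int n = 0;
    if (threadIdx.x == 0)
      n = *src;          /* Vector-single work on a single lane.  */
    n = __shfl (n, 0);   /* All lanes participate: well-defined.  */
    return n;
  }

  /* Broken shape, corresponding to the cloned CFG above: each copy of
     the shuffle executes under a divergent branch, so lane 0 and the
     remaining lanes never participate in the same __shfl, and reading
     lane 0's value is undefined.  */
  __device__ int broadcast_bad (const int *src)
  {
    int n = 0;
    if (threadIdx.x == 0)
      {
        n = *src;
        n = __shfl (n, 0);  /* Only lane 0 reaches this copy.  */
      }
    else
      n = __shfl (n, 0);    /* Lane 0 never reaches this copy.  */
    return n;
  }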
So, the way I have tried to handle this is by attempting to inhibit
optimisation along edges which have a reconvergence point as their
destination.  The essential idea is to make such edges "abnormal",
although the existing EDGE_ABNORMAL flag is not used, because that
already has implicit meaning built into it, and the new edge type may
need to be handled differently in some areas.  One example is that at
present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL
set on their EDGE_TRUE or EDGE_FALSE outgoing edges.

The attached patch introduces a new edge flag
(EDGE_TO_RECONVERGENCE), for the GIMPLE CFG only.  In principle
there's nothing to stop the flag being propagated to the RTL CFG
also, in which case it'd probably be set at the same time as
EDGE_ABNORMAL, mirroring the semantics of e.g. EDGE_EH,
EDGE_ABNORMAL_CALL and EDGE_SIBCALL.  Then, passes which inspect the
RTL CFG can continue to check only the ABNORMAL flag.  But so far (in
rather limited testing!) that has not been observed to be necessary.
(We can control RTL CFG manipulation indirectly by using the
CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast"
instruction.)

For the GIMPLE CFG (i.e. in passes operating on GIMPLE form),
EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e.,
inhibiting certain optimisations), and so has been added to relevant
conditionals largely mechanically.  Places where it is treated
specially are:

* tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on
  outgoing edges of a block concluding with a GIMPLE_COND statement,
  but we allow EDGE_TO_RECONVERGENCE there.

* tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND
  edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass
  tries to split the edge later).

There are probably other optimisations that will be tripped up by the
new flag along the same lines as the VRP tweak above, which we will
no doubt discover in due course.

Together with the patch
https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html, this shows
no regressions for the libgomp tests.

OK for gomp4 branch?

Thanks,

Julian

ChangeLog

    gcc/
    * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag.
    (bb_has_abnorm_or_reconv_pred): New function.
    (has_abnormal_or_eh_outgoing_edge_p): Consider
    EDGE_TO_RECONVERGENCE also.
    * cfg-flags.def (TO_RECONVERGENCE): Add flag.
    * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges
    leading to a reconvergence point.
    * cfgbuild.c (purge_dead_tablejump_edges): Consider
    EDGE_TO_RECONVERGENCE.
    * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise.
    * cfgexpand.c (expand_gimple_tailcall, construct_exit_block)
    (pass_expand::execute): Likewise.
    * cfghooks.c (can_copy_bbs_p): Likewise.
    * cfgloop.c (bb_loop_header_p): Likewise.
    * cfgloopmanip.c (scale_loop_profile): Likewise.
    * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise.
    * graph.c (draw_cfg_node_succ_edges): Likewise.
    * graphite-scop-detection.c (canonicalize_loop_closed_ssa):
    Likewise.
    * predict.c (tree_bb_level_predictions): Likewise.
    * profile.c (instrument_edges, branch_prob, find_spanning_tree):
    Likewise.
    * tree-cfg.c (replace_uses_by, gimple_split_edge)
    (gimple_redirect_edge_and_branch, split_critical_edges): Likewise.
    * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block)
    (pass_merge_phi::execute): Likewise.
    * tree-chkp.c (chkp_fix_cfg): Likewise.
    * tree-if-conv.c (if_convertible_bb_p): Likewise.
    * tree-inline.c (update_ssa_across_abnormal_edges): Likewise.
    * tree-into-ssa.c (rewrite_update_phi_arguments)
    (rewrite_update_dom_walker::before_dom_children)
    (create_new_def_for): Likewise.
    * tree-outof-ssa.c (eliminate_phi): Likewise.
    * tree-phinodes.c (add_phi_arg): Likewise.
    * tree-ssa-coalesce.c (coalesce_cost_edge, create_outofssa_var_map)
    (coalesce_partitions): Likewise.
    * tree-ssa-dom.c (cprop_into_successor_phis)
    (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs):
    Likewise.
    * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise.
    * tree-ssa-loop-prefetch.c (emit_mfence_after_loop)
    (may_use_storent_in_loop_p): Likewise.
    * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise.
    * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block):
    Likewise.
    * tree-ssa-propagate.c (simulate_block, replace_phi_args_in):
    Likewise.
    * tree-ssa-sink.c (sink_code_in_bb): Likewise.
    * tree-ssa-threadedge.c (thread_across_edge): Likewise.
    * tree-ssa-threadupdate.c (thread_single_edge): Likewise.
    * tree-ssa-uninit.c (compute_control_dep_chain): Likewise.
    * tree-ssa.c (verify_phi_args): Likewise.
    * tree-vect-loop.c (vect_analyze_loop_form): Likewise.
    * value-prof.c (gimple_ic): Likewise.
    * tree-vrp.c (infer_value_range, process_assert_insertions_for):
    Likewise.
(find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges. [-- Attachment #2: to-reconvergence-4.diff --] [-- Type: text/x-patch, Size: 31591 bytes --] commit 472bd543b30356f7a4c59efc961f9f61b11ca197 Author: Julian Brown <julian@codesourcery.com> Date: Wed May 20 11:35:45 2015 -0700 Introduce EDGE_TO_RECONVERGENCE, and tweak some uses of EDGE_ABNORMAL. diff --git a/gcc/basic-block.h b/gcc/basic-block.h index f28fa57..7fe25f0 100644 --- a/gcc/basic-block.h +++ b/gcc/basic-block.h @@ -70,7 +70,8 @@ enum cfg_edge_flags { Test the edge flags on EDGE_COMPLEX to detect all forms of "strange" control flow transfers. */ #define EDGE_COMPLEX \ - (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE) + (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_EH | EDGE_PRESERVE \ + | EDGE_TO_RECONVERGENCE) struct GTY(()) rtl_bb_info { /* The first insn of the block is embedded into bb->il.x. */ @@ -559,6 +560,20 @@ bb_has_abnormal_pred (basic_block bb) return false; } +static inline bool +bb_has_abnorm_or_reconv_pred (basic_block bb) +{ + edge e; + edge_iterator ei; + + FOR_EACH_EDGE (e, ei, bb->preds) + { + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) + return true; + } + return false; +} + /* Return the fallthru edge in EDGES if it exists, NULL otherwise. */ static inline edge find_fallthru_edge (vec<edge, va_gc> *edges) @@ -629,9 +644,10 @@ has_abnormal_or_eh_outgoing_edge_p (basic_block bb) edge_iterator ei; FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return true; return false; } + #endif /* GCC_BASIC_BLOCK_H */ diff --git a/gcc/cfg-flags.def b/gcc/cfg-flags.def index eedcd69..fd51e2f 100644 --- a/gcc/cfg-flags.def +++ b/gcc/cfg-flags.def @@ -177,6 +177,10 @@ DEF_EDGE_FLAG(TM_UNINSTRUMENTED, 15) /* Abort (over) edge out of a GIMPLE_TRANSACTION statement. */ DEF_EDGE_FLAG(TM_ABORT, 16) +/* An "immutable" edge to an OpenACC (currently, NVPTX) reconvergence point. + This flag is only used for the GIMPLE CFG. */ +DEF_EDGE_FLAG(TO_RECONVERGENCE, 17) + #endif /* diff --git a/gcc/cfgbuild.c b/gcc/cfgbuild.c index 7cbed50..7185f07 100644 --- a/gcc/cfgbuild.c +++ b/gcc/cfgbuild.c @@ -449,7 +449,7 @@ purge_dead_tablejump_edges (basic_block bb, rtx_jump_table_data *table) if (FULL_STATE (e->dest) & BLOCK_USED_BY_TABLEJUMP) SET_STATE (e->dest, FULL_STATE (e->dest) & ~(size_t) BLOCK_USED_BY_TABLEJUMP); - else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH))) + else if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE))) { remove_edge (e); continue; diff --git a/gcc/cfgcleanup.c b/gcc/cfgcleanup.c index 797d14a..e73062a 100644 --- a/gcc/cfgcleanup.c +++ b/gcc/cfgcleanup.c @@ -2031,7 +2031,7 @@ try_crossjump_to_edge (int mode, edge e1, edge e2, /* Avoid deleting preserve label when redirecting ABNORMAL edges. 
*/ if (block_has_preserve_label (e1->dest) - && (e1->flags & EDGE_ABNORMAL)) + && (e1->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) return false; /* Here we know that the insns in the end of SRC1 which are common with SRC2 @@ -2389,7 +2389,7 @@ try_head_merge_bb (basic_block bb) return false; } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return false; /* Normally, all destination blocks must only be reachable from this diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 5905ddb..688158b 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -3569,7 +3569,7 @@ expand_gimple_tailcall (basic_block bb, gcall *stmt, bool *can_fallthru) for (ei = ei_start (bb->succs); (e = ei_safe_edge (ei)); ) { - if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH))) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE))) { if (e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) { @@ -5674,7 +5674,7 @@ construct_exit_block (void) while (ix < EDGE_COUNT (EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)) { e = EDGE_PRED (EXIT_BLOCK_PTR_FOR_FN (cfun), ix); - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) redirect_edge_succ (e, exit_block); else ix++; @@ -6222,7 +6222,7 @@ pass_expand::execute (function *fun) representation. It is safe to remove them here as find_many_sub_basic_blocks will rediscover them. In the future we should get this fixed properly. */ - if ((e->flags & EDGE_ABNORMAL) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && !(e->flags & EDGE_SIBCALL)) remove_edge (e); else diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c index fc23edb..64483f4 100644 --- a/gcc/cfghooks.c +++ b/gcc/cfghooks.c @@ -1304,7 +1304,7 @@ can_copy_bbs_p (basic_block *bbs, unsigned n) /* In case we should redirect abnormal edge during duplication, fail. */ edge_iterator ei; FOR_EACH_EDGE (e, ei, bbs[i]->succs) - if ((e->flags & EDGE_ABNORMAL) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && (e->dest->flags & BB_DUPLICATED)) { ret = false; diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index 5767494..d4a1c93 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -387,7 +387,7 @@ bb_loop_header_p (basic_block header) /* If we have an abnormal predecessor, do not consider the loop (not worth the problems). */ - if (bb_has_abnormal_pred (header)) + if (bb_has_abnorm_or_reconv_pred (header)) return false; /* Look for back edges where a predecessor is dominated diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c index 45cc85d..8bbf4cb 100644 --- a/gcc/cfgloopmanip.c +++ b/gcc/cfgloopmanip.c @@ -554,7 +554,8 @@ scale_loop_profile (struct loop *loop, int scale, gcov_type iteration_bound) gcov_type count_delta; FOR_EACH_EDGE (other_e, ei, e->src->succs) - if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE)) + if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE + | EDGE_TO_RECONVERGENCE)) && e != other_e) break; diff --git a/gcc/gimple-iterator.c b/gcc/gimple-iterator.c index df29123..bcb6649 100644 --- a/gcc/gimple-iterator.c +++ b/gcc/gimple-iterator.c @@ -792,7 +792,7 @@ gimple_find_edge_insert_loc (edge e, gimple_stmt_iterator *gsi, the last statement does not end a basic block, insert there. Except for the entry block. 
*/ src = e->src; - if ((e->flags & EDGE_ABNORMAL) == 0 + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) == 0 && single_succ_p (src) && src != ENTRY_BLOCK_PTR_FOR_FN (cfun)) { diff --git a/gcc/graph.c b/gcc/graph.c index 5fb0d78..e5d3646 100644 --- a/gcc/graph.c +++ b/gcc/graph.c @@ -142,7 +142,7 @@ draw_cfg_node_succ_edges (pretty_printer *pp, int funcdef_no, basic_block bb) weight = 100; } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) color = "red"; pp_printf (pp, diff --git a/gcc/graphite-scop-detection.c b/gcc/graphite-scop-detection.c index 02e9e50..8f91f6e 100644 --- a/gcc/graphite-scop-detection.c +++ b/gcc/graphite-scop-detection.c @@ -1339,7 +1339,7 @@ canonicalize_loop_closed_ssa (loop_p loop) edge e = single_exit (loop); basic_block bb; - if (!e || e->flags & EDGE_ABNORMAL) + if (!e || e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return; bb = e->dest; diff --git a/gcc/omp-low.c b/gcc/omp-low.c index f82247b..e28f5b4 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -10447,6 +10447,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) generate_vector_broadcast (broadcast_cond, cond_var, gsi_asgn); edge e = split_block (bb, asgn); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR, @@ -10465,6 +10466,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) generate_vector_broadcast (new_var, var, gsi_asgn); edge e = split_block (bb, asgn); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; gimple_switch_set_index (sstmt, new_var); @@ -10477,6 +10479,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) && gimple_code (stmt) != GIMPLE_OMP_CONTINUE) { edge e = single_succ_edge (bb); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_RETURN) { @@ -10490,6 +10493,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) if (!split_stmt) return; edge e = split_block (bb, split_stmt); + e->flags = EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) { @@ -10508,6 +10512,7 @@ predicate_bb (basic_block bb, struct omp_region *parent) else if (single_succ_p (bb)) { edge e = single_succ_edge (bb); + e->flags |= EDGE_TO_RECONVERGENCE; skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_GOTO) gsi_prev (&gsi); @@ -10540,7 +10545,8 @@ predicate_bb (basic_block bb, struct omp_region *parent) gsi_insert_after (&tmp_gsi, cond_stmt, GSI_CONTINUE_LINKING); e2->flags = EDGE_TRUE_VALUE; - make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE); + make_edge (cond_bb, skip_dest_bb, + EDGE_FALSE_VALUE | EDGE_TO_RECONVERGENCE); } } diff --git a/gcc/predict.c b/gcc/predict.c index 67d5d20..9098bce 100644 --- a/gcc/predict.c +++ b/gcc/predict.c @@ -2188,7 +2188,8 @@ tree_bb_level_predictions (void) edge_iterator ei; FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds) - if (!(e->flags & (EDGE_ABNORMAL | EDGE_FAKE | EDGE_EH))) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_FAKE | EDGE_EH + | EDGE_TO_RECONVERGENCE))) { has_return_edges = true; break; diff --git a/gcc/profile.c b/gcc/profile.c index a178a1b..cd96b9f 100644 --- a/gcc/profile.c +++ b/gcc/profile.c @@ -170,7 +170,8 @@ instrument_edges (struct edge_list *el) if (!inf->ignore && !inf->on_tree) { - gcc_assert (!(e->flags & EDGE_ABNORMAL)); + gcc_assert (!(e->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))); if (dump_file) fprintf (dump_file, "Edge %d to %d instrumented%s\n", e->src->index, 
e->dest->index, @@ -1107,7 +1108,8 @@ branch_prob (void) edge ne = single_succ_edge (new_bb); ne->goto_locus = e->goto_locus; } - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) need_exit_edge = 1; if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) @@ -1115,7 +1117,8 @@ branch_prob (void) } FOR_EACH_EDGE (e, ei, bb->preds) { - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun)) need_entry_edge = 1; if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)) @@ -1180,7 +1183,8 @@ branch_prob (void) e->count = 0; /* Mark edges we've replaced by fake edges above as ignored. */ - if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL)) + if ((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL + | EDGE_TO_RECONVERGENCE)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun) && e->dest != EXIT_BLOCK_PTR_FOR_FN (cfun)) { @@ -1430,7 +1434,8 @@ find_spanning_tree (struct edge_list *el) for (i = 0; i < num_edges; i++) { edge e = INDEX_EDGE (el, i); - if (((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_FAKE)) + if (((e->flags & (EDGE_ABNORMAL | EDGE_ABNORMAL_CALL | EDGE_FAKE + | EDGE_TO_RECONVERGENCE)) || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) && !EDGE_INFO (e)->ignore && (find_group (e->src) != find_group (e->dest))) diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index 99b27c7..fa12678 100644 --- a/gcc/tree-cfg.c +++ b/gcc/tree-cfg.c @@ -1865,7 +1865,7 @@ replace_uses_by (tree name, tree val) { e = gimple_phi_arg_edge (as_a <gphi *> (stmt), PHI_ARG_INDEX_FROM_USE (use)); - if (e->flags & EDGE_ABNORMAL + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE) && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (val)) { /* This can only occur for virtual operands, since @@ -2776,7 +2776,7 @@ gimple_split_edge (edge edge_in) edge new_edge, e; /* Abnormal edges cannot be split. 
*/ - gcc_assert (!(edge_in->flags & EDGE_ABNORMAL)); + gcc_assert (!(edge_in->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))); dest = edge_in->dest; @@ -5578,7 +5578,7 @@ gimple_redirect_edge_and_branch (edge e, basic_block dest) edge ret; gimple stmt; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return NULL; if (e->dest == dest) @@ -5724,7 +5724,7 @@ gimple_redirect_edge_and_branch (edge e, basic_block dest) static bool gimple_can_remove_branch_p (const_edge e) { - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return false; return true; @@ -8288,7 +8288,8 @@ split_critical_edges (void) { FOR_EACH_EDGE (e, ei, bb->succs) { - if (EDGE_CRITICAL_P (e) && !(e->flags & EDGE_ABNORMAL)) + if (EDGE_CRITICAL_P (e) && !(e->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))) split_edge (e); /* PRE inserts statements to edges and expects that since split_critical_edges was done beforehand, committing edge @@ -8301,7 +8302,7 @@ split_critical_edges (void) || !gimple_seq_empty_p (phi_nodes (e->dest)) || e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) && e->src != ENTRY_BLOCK_PTR_FOR_FN (cfun) - && !(e->flags & EDGE_ABNORMAL)) + && !(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { gimple_stmt_iterator gsi; diff --git a/gcc/tree-cfgcleanup.c b/gcc/tree-cfgcleanup.c index 26258aa..e361833 100644 --- a/gcc/tree-cfgcleanup.c +++ b/gcc/tree-cfgcleanup.c @@ -307,7 +307,8 @@ tree_forwarder_block_p (basic_block bb, bool phi_wanted) /* Nor should this be an infinite loop. */ || single_succ (bb) == bb /* BB may not have an abnormal outgoing edge. */ - || (single_succ_edge (bb)->flags & EDGE_ABNORMAL)) + || (single_succ_edge (bb)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))) return false; gcc_checking_assert (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)); @@ -451,8 +452,8 @@ remove_forwarder_block (basic_block bb) So if there is an abnormal edge to BB, proceed only if there is no abnormal edge to DEST and there are no phi nodes in DEST. */ - if (bb_has_abnormal_pred (bb) - && (bb_has_abnormal_pred (dest) + if (bb_has_abnorm_or_reconv_pred (bb) + && (bb_has_abnorm_or_reconv_pred (dest) || !gimple_seq_empty_p (phi_nodes (dest)))) return false; @@ -483,7 +484,7 @@ remove_forwarder_block (basic_block bb) { bitmap_set_bit (cfgcleanup_altered_bbs, e->src->index); - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { /* If there is an abnormal edge, redirect it anyway, and move the labels to the new block to make it legal. */ @@ -1019,7 +1020,7 @@ pass_merge_phi::execute (function *fun) if (gimple_seq_empty_p (phi_nodes (dest)) /* We don't want to deal with a basic block with abnormal edges. */ - || bb_has_abnormal_pred (bb)) + || bb_has_abnorm_or_reconv_pred (bb)) continue; if (!dominated_by_p (CDI_DOMINATORS, dest, bb)) diff --git a/gcc/tree-chkp.c b/gcc/tree-chkp.c index 288470b..a916525 100644 --- a/gcc/tree-chkp.c +++ b/gcc/tree-chkp.c @@ -3994,7 +3994,7 @@ chkp_fix_cfg () /* We cannot split abnormal edge. Therefore we store its params, make it regular and then rebuild abnormal edge after split. 
*/ - if (fall->flags & EDGE_ABNORMAL) + if (fall->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { flags = fall->flags & ~EDGE_FALLTHRU; dest = fall->dest; diff --git a/gcc/tree-if-conv.c b/gcc/tree-if-conv.c index 49ff458..7849934 100644 --- a/gcc/tree-if-conv.c +++ b/gcc/tree-if-conv.c @@ -1070,7 +1070,8 @@ if_convertible_bb_p (struct loop *loop, basic_block bb, basic_block exit_bb) /* Be less adventurous and handle only normal edges. */ FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP)) + if (e->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_IRREDUCIBLE_LOOP + | EDGE_TO_RECONVERGENCE)) { if (dump_file && (dump_flags & TDF_DETAILS)) fprintf (dump_file, "Difficult to handle edges\n"); diff --git a/gcc/tree-inline.c b/gcc/tree-inline.c index 71d75d9..72aa414 100644 --- a/gcc/tree-inline.c +++ b/gcc/tree-inline.c @@ -2182,8 +2182,10 @@ update_ssa_across_abnormal_edges (basic_block bb, basic_block ret_bb, re = find_edge (ret_bb, e->dest); gcc_checking_assert (re); - gcc_assert ((re->flags & (EDGE_EH | EDGE_ABNORMAL)) - == (e->flags & (EDGE_EH | EDGE_ABNORMAL))); + gcc_assert ((re->flags & (EDGE_EH | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) + == (e->flags & (EDGE_EH | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE))); SET_USE (PHI_ARG_DEF_PTR_FROM_EDGE (phi, e), USE_FROM_PTR (PHI_ARG_DEF_PTR_FROM_EDGE (phi, re))); diff --git a/gcc/tree-into-ssa.c b/gcc/tree-into-ssa.c index e8b55c1..7e9cb57 100644 --- a/gcc/tree-into-ssa.c +++ b/gcc/tree-into-ssa.c @@ -2101,7 +2101,7 @@ rewrite_update_phi_arguments (basic_block bb) } - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) SSA_NAME_OCCURS_IN_ABNORMAL_PHI (USE_FROM_PTR (arg_p)) = 1; } } @@ -2138,7 +2138,7 @@ rewrite_update_dom_walker::before_dom_children (basic_block bb) /* Mark the LHS if any of the arguments flows through an abnormal edge. */ - is_abnormal_phi = bb_has_abnormal_pred (bb); + is_abnormal_phi = bb_has_abnorm_or_reconv_pred (bb); /* If any of the PHI nodes is a replacement for a name in OLD_SSA_NAMES or it's one of the names in NEW_SSA_NAMES, then @@ -2899,7 +2899,8 @@ create_new_def_for (tree old_name, gimple stmt, def_operand_p def) basic_block bb = gimple_bb (stmt); /* If needed, mark NEW_NAME as occurring in an abnormal PHI node. */ - SSA_NAME_OCCURS_IN_ABNORMAL_PHI (new_name) = bb_has_abnormal_pred (bb); + SSA_NAME_OCCURS_IN_ABNORMAL_PHI (new_name) + = bb_has_abnorm_or_reconv_pred (bb); } add_new_name_mapping (new_name, old_name); diff --git a/gcc/tree-outof-ssa.c b/gcc/tree-outof-ssa.c index e23bc0b..fe24cf8 100644 --- a/gcc/tree-outof-ssa.c +++ b/gcc/tree-outof-ssa.c @@ -766,7 +766,7 @@ eliminate_phi (edge e, elim_graph g) gcc_assert (g->copy_locus.length () == 0); /* Abnormal edges already have everything coalesced. */ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return; g->e = e; diff --git a/gcc/tree-phinodes.c b/gcc/tree-phinodes.c index d657907..05c383d 100644 --- a/gcc/tree-phinodes.c +++ b/gcc/tree-phinodes.c @@ -392,7 +392,7 @@ add_phi_arg (gphi *phi, tree def, edge e, source_location locus) /* Copy propagation needs to know what object occur in abnormal PHI nodes. This is a convenient place to record such information. 
*/ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { SSA_NAME_OCCURS_IN_ABNORMAL_PHI (def) = 1; SSA_NAME_OCCURS_IN_ABNORMAL_PHI (PHI_RESULT (phi)) = 1; diff --git a/gcc/tree-ssa-coalesce.c b/gcc/tree-ssa-coalesce.c index eeac5a4..a72c86c 100644 --- a/gcc/tree-ssa-coalesce.c +++ b/gcc/tree-ssa-coalesce.c @@ -170,7 +170,7 @@ coalesce_cost_edge (edge e) /* Inserting copy on critical edge costs more than inserting it elsewhere. */ if (EDGE_CRITICAL_P (e)) mult = 2; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) return MUST_COALESCE_COST; if (e->flags & EDGE_EH) { @@ -975,11 +975,11 @@ create_outofssa_var_map (coalesce_list_p cl, bitmap used_in_copy) register_ssa_partition (map, arg); if (gimple_can_coalesce_p (arg, res) - || (e->flags & EDGE_ABNORMAL)) + || (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { saw_copy = true; bitmap_set_bit (used_in_copy, SSA_NAME_VERSION (arg)); - if ((e->flags & EDGE_ABNORMAL) == 0) + if ((e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) == 0) { int cost = coalesce_cost_edge (e); if (cost == 1 && has_single_use (arg)) @@ -1206,7 +1206,7 @@ coalesce_partitions (var_map map, ssa_conflicts_p graph, coalesce_list_p cl, FOR_EACH_BB_FN (bb, cfun) { FOR_EACH_EDGE (e, ei, bb->preds) - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { gphi_iterator gsi; for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); diff --git a/gcc/tree-ssa-dom.c b/gcc/tree-ssa-dom.c index 14f3e9e..9a27acf 100644 --- a/gcc/tree-ssa-dom.c +++ b/gcc/tree-ssa-dom.c @@ -1874,7 +1874,7 @@ cprop_into_successor_phis (basic_block bb) /* If this is an abnormal edge, then we do not want to copy propagate into the PHI alternative associated with this edge. */ - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) continue; gsi = gsi_start_phis (e->dest); @@ -1983,7 +1983,8 @@ dom_opt_dom_walker::after_dom_children (basic_block bb) may be able to statically determine which of the outgoing edges will be traversed when the incoming edge from BB is traversed. */ if (single_succ_p (bb) - && (single_succ_edge (bb)->flags & EDGE_ABNORMAL) == 0 + && (single_succ_edge (bb)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0 && potentially_threadable_block (single_succ (bb))) { thread_across_edge (single_succ_edge (bb)); @@ -1991,8 +1992,10 @@ dom_opt_dom_walker::after_dom_children (basic_block bb) else if ((last = last_stmt (bb)) && gimple_code (last) == GIMPLE_COND && EDGE_COUNT (bb->succs) == 2 - && (EDGE_SUCC (bb, 0)->flags & EDGE_ABNORMAL) == 0 - && (EDGE_SUCC (bb, 1)->flags & EDGE_ABNORMAL) == 0) + && (EDGE_SUCC (bb, 0)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0 + && (EDGE_SUCC (bb, 1)->flags & (EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE)) == 0) { edge true_edge, false_edge; @@ -2957,7 +2960,7 @@ propagate_rhs_into_lhs (gimple stmt, tree lhs, tree rhs, bitmap interesting_name /* And fixup the flags on the single remaining edge. 
*/ te->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE); - te->flags &= ~EDGE_ABNORMAL; + te->flags &= ~(EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE); te->flags |= EDGE_FALLTHRU; if (te->probability > REG_BR_PROB_BASE) te->probability = REG_BR_PROB_BASE; diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c index 11fc699..cb61979 100644 --- a/gcc/tree-ssa-loop-im.c +++ b/gcc/tree-ssa-loop-im.c @@ -2313,7 +2313,7 @@ loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED, edge ex; FOR_EACH_VEC_ELT (exits, i, ex) - if (ex->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (ex->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) return false; return true; diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c index a948d68..f817b9e 100644 --- a/gcc/tree-ssa-loop-prefetch.c +++ b/gcc/tree-ssa-loop-prefetch.c @@ -1283,7 +1283,7 @@ emit_mfence_after_loop (struct loop *loop) if (!single_pred_p (exit->dest) /* If possible, we prefer not to insert the fence on other paths in cfg. */ - && !(exit->flags & EDGE_ABNORMAL)) + && !(exit->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) split_loop_exit_edge (exit); bsi = gsi_after_labels (exit->dest); @@ -1313,7 +1313,7 @@ may_use_storent_in_loop_p (struct loop *loop) edge exit; FOR_EACH_VEC_ELT (exits, i, exit) - if ((exit->flags & EDGE_ABNORMAL) + if ((exit->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) && exit->dest == EXIT_BLOCK_PTR_FOR_FN (cfun)) ret = false; diff --git a/gcc/tree-ssa-phiopt.c b/gcc/tree-ssa-phiopt.c index 7c846c2..91ae8dc 100644 --- a/gcc/tree-ssa-phiopt.c +++ b/gcc/tree-ssa-phiopt.c @@ -239,8 +239,8 @@ tree_ssa_phiopt_worker (bool do_store_elim, bool do_hoist_loads) bb2 = e2->dest; /* We cannot do the optimization on abnormal edges. */ - if ((e1->flags & EDGE_ABNORMAL) != 0 - || (e2->flags & EDGE_ABNORMAL) != 0) + if ((e1->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) != 0 + || (e2->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) != 0) continue; /* If either bb1's succ or bb2 or bb2's succ is non NULL. */ diff --git a/gcc/tree-ssa-pre.c b/gcc/tree-ssa-pre.c index d857d84..2f69389 100644 --- a/gcc/tree-ssa-pre.c +++ b/gcc/tree-ssa-pre.c @@ -2410,7 +2410,7 @@ compute_antic (void) FOR_EACH_EDGE (e, ei, block->preds) { e->flags &= ~EDGE_DFS_BACK; - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { bitmap_set_bit (has_abnormal_preds, block->index); break; @@ -3035,7 +3035,7 @@ insert_into_preds_of_block (basic_block block, unsigned int exprnum, { builtexpr = create_expression_by_pieces (bprime, eprime, &stmts, type); - gcc_assert (!(pred->flags & EDGE_ABNORMAL)); + gcc_assert (!(pred->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))); gsi_insert_seq_on_edge (pred, stmts); if (!builtexpr) { diff --git a/gcc/tree-ssa-propagate.c b/gcc/tree-ssa-propagate.c index e23da70..26b8ec8 100644 --- a/gcc/tree-ssa-propagate.c +++ b/gcc/tree-ssa-propagate.c @@ -526,7 +526,7 @@ simulate_block (basic_block block) normal_edge = NULL; FOR_EACH_EDGE (e, ei, block->succs) { - if (e->flags & (EDGE_ABNORMAL | EDGE_EH)) + if (e->flags & (EDGE_ABNORMAL | EDGE_EH | EDGE_TO_RECONVERGENCE)) add_control_edge (e); else { @@ -1072,7 +1072,7 @@ replace_phi_args_in (gphi *phi, ssa_prop_get_value_fn get_value) through an abnormal edge, update the replacement accordingly. 
*/ if (TREE_CODE (val) == SSA_NAME - && e->flags & EDGE_ABNORMAL + && e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE) && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (val)) { /* This can only occur for virtual operands, since diff --git a/gcc/tree-ssa-sink.c b/gcc/tree-ssa-sink.c index 1ed8a0e..c3d091c 100644 --- a/gcc/tree-ssa-sink.c +++ b/gcc/tree-ssa-sink.c @@ -501,7 +501,7 @@ sink_code_in_bb (basic_block bb) /* We can't move things across abnormal edges, so don't try. */ FOR_EACH_EDGE (e, ei, bb->succs) - if (e->flags & EDGE_ABNORMAL) + if (e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) goto earlyout; for (gsi = gsi_last_bb (bb); !gsi_end_p (gsi);) diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index acbbb67..5b6626a 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1417,7 +1417,7 @@ thread_across_edge (gcond *dummy_cond, /* If E->dest has abnormal outgoing edges, then there's no guarantee we can safely redirect any of the edges. Just punt those cases. */ FOR_EACH_EDGE (taken_edge, ei, e->dest->succs) - if (taken_edge->flags & EDGE_ABNORMAL) + if (taken_edge->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { const_and_copies->pop_to_marker (); BITMAP_FREE (visited); diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 0d61c18..49b010d 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -1661,7 +1661,8 @@ thread_single_edge (edge e) remove_ctrl_stmt_and_useless_edges (bb, eto->dest); /* And fixup the flags on the single remaining edge. */ - eto->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL); + eto->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE | EDGE_ABNORMAL + | EDGE_TO_RECONVERGENCE); eto->flags |= EDGE_FALLTHRU; return bb; diff --git a/gcc/tree-ssa-uninit.c b/gcc/tree-ssa-uninit.c index 19a3e82..a23d83a 100644 --- a/gcc/tree-ssa-uninit.c +++ b/gcc/tree-ssa-uninit.c @@ -459,7 +459,7 @@ compute_control_dep_chain (basic_block bb, basic_block dep_bb, { basic_block cd_bb; int post_dom_check = 0; - if (e->flags & (EDGE_FAKE | EDGE_ABNORMAL)) + if (e->flags & (EDGE_FAKE | EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) continue; cd_bb = e->dest; diff --git a/gcc/tree-ssa.c b/gcc/tree-ssa.c index 10d3314..86fd30c 100644 --- a/gcc/tree-ssa.c +++ b/gcc/tree-ssa.c @@ -883,7 +883,9 @@ verify_phi_args (gphi *phi, basic_block bb, basic_block *definition_block) { err = verify_ssa_name (op, virtual_operand_p (gimple_phi_result (phi))); err |= verify_use (e->src, definition_block[SSA_NAME_VERSION (op)], - op_p, phi, e->flags & EDGE_ABNORMAL, NULL); + op_p, phi, + e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE), + NULL); } if (TREE_CODE (op) == ADDR_EXPR) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 49bf518..dcca3fe 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1277,7 +1277,7 @@ vect_analyze_loop_form (struct loop *loop) if (!single_pred_p (single_exit (loop)->dest)) { edge e = single_exit (loop); - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { split_loop_exit_edge (e); if (dump_enabled_p ()) diff --git a/gcc/tree-vrp.c b/gcc/tree-vrp.c index 6744a91..97a3ccf 100644 --- a/gcc/tree-vrp.c +++ b/gcc/tree-vrp.c @@ -4926,7 +4926,7 @@ infer_value_range (gimple stmt, tree op, enum tree_code *comp_code_p, tree *val_ edge e; FOR_EACH_EDGE (e, ei, gimple_bb (stmt)->succs) - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) break; if (e == NULL) return false; @@ -5949,7 +5949,9 @@ 
find_conditional_asserts (basic_block bb, gcond *last) insert. */ FOR_EACH_EDGE (e, ei, bb->succs) { - if (e->dest == bb) + /* Skip over EDGE_TO_RECONVERGENCE edges because they cannot be split + later in the pass. */ + if (e->dest == bb || (e->flags & EDGE_TO_RECONVERGENCE)) continue; /* Register the necessary assertions for each operand in the @@ -6412,7 +6414,7 @@ process_assert_insertions_for (tree name, assert_locus_t loc) STMT is not control flow, there may only be one non-abnormal edge out of BB. */ FOR_EACH_EDGE (e, ei, loc->bb->succs) - if (!(e->flags & EDGE_ABNORMAL)) + if (!(e->flags & (EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE))) { gsi_insert_on_edge (e, assert_stmt); return true; diff --git a/gcc/value-prof.c b/gcc/value-prof.c index b16bce8..7a70c49 100644 --- a/gcc/value-prof.c +++ b/gcc/value-prof.c @@ -1565,7 +1565,7 @@ gimple_ic (gcall *icall_stmt, struct cgraph_node *direct_call, } FOR_EACH_EDGE (e_eh, ei, icall_bb->succs) - if (e_eh->flags & (EDGE_EH | EDGE_ABNORMAL)) + if (e_eh->flags & (EDGE_EH | EDGE_ABNORMAL | EDGE_TO_RECONVERGENCE)) { e = make_edge (dcall_bb, e_eh->dest, e_eh->flags); for (gphi_iterator psi = gsi_start_phis (e_eh->dest); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 14:20 [gomp4] Preserve NVPTX "reconvergence" points Julian Brown
@ 2015-05-28 14:59 ` Jakub Jelinek
  2015-05-28 15:14   ` Thomas Schwinge
  2015-05-28 15:02 ` Richard Biener
  1 sibling, 1 reply; 23+ messages in thread
From: Jakub Jelinek @ 2015-05-28 14:59 UTC (permalink / raw)
  To: Julian Brown; +Cc: gcc-patches, Bernd Schmidt, Thomas Schwinge

On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
> For NVPTX, it is vitally important that the divergence of threads
> within a warp can be controlled: in particular we must be able to
> generate code that we know "reconverges" at a particular point.
> Unfortunately GCC's middle-end optimisers can cause this property to
> be violated, which causes problems for the OpenACC execution model
> we're planning to use for NVPTX.
>
> As a brief example: code running in vector-single mode runs on a
> single thread of a warp, and must broadcast condition results to the
> other threads of the warp so that they can "follow along" and be
> ready for vector-partitioned execution when necessary.

I think the lowering of this already at ompexp time is premature;
I think much better would be to have a function attribute (or cgraph
flag) that would be set for functions you want to compile this way
(plus a targetm flag saying that the target wants to support it that
way), plus a flag in the loop structure for the acc loop vector loops
(perhaps the current OpenMP simd loop flags are good enough for that),
and lower it somewhere around the vectorization pass or so.

Or, what exactly do you emit for the fallback code, or for other
GPGPUs or XeonPhi?  To me, e.g. for XeonPhi or HSA, this sounds like
you want to implement the acc loop gang as a work-sharing loop among
threads (like #pragma omp for) and #pragma acc loop vector as a loop
that should be vectorized if at all possible (like #pragma omp simd).
I really think it is important that OpenACC support in GCC is not so
strongly tied to one specific GPGPU, and similarly OpenMP should be
usable for all offloading targets GCC supports.

That way, it is possible to auto-vectorize the code too, the decision
on how to expand the code of an offloaded function is made separately
for each offloading target, there is room for optimizations on a much
simpler CFG, etc.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
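To make the suggested mapping concrete, here is an illustrative
source-level correspondence between the OpenACC levels and the OpenMP
constructs named above; this is a hand-written analogy, not the output
of any implemented transformation:

  /* OpenACC form: gang-partitioned outer loop, vector-partitioned
     inner loop.  */
  void
  scale_acc (int n, int m, float a[n][m], const float b[n][m], float s)
  {
  #pragma acc parallel loop gang
    for (int i = 0; i < n; i++)
  #pragma acc loop vector
      for (int j = 0; j < m; j++)
        a[i][j] += s * b[i][j];
  }

  /* Rough OpenMP analogue for a XeonPhi-style target or the host
     fallback: gang becomes a work-sharing loop among threads, vector
     becomes a loop the vectorizer should handle.  */
  void
  scale_omp (int n, int m, float a[n][m], const float b[n][m], float s)
  {
  #pragma omp parallel for
    for (int i = 0; i < n; i++)
  #pragma omp simd
      for (int j = 0; j < m; j++)
        a[i][j] += s * b[i][j];
  }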
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 14:59 ` Jakub Jelinek
@ 2015-05-28 15:14   ` Thomas Schwinge
  2015-05-28 15:28     ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Thomas Schwinge @ 2015-05-28 15:14 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: gcc-patches, Bernd Schmidt, Nathan Sidwell, Julian Brown

[-- Attachment #1: Type: text/plain, Size: 2239 bytes --]

Hi!

On Thu, 28 May 2015 16:20:11 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, May 28, 2015 at 03:06:35PM +0100, Julian Brown wrote:
> > [...]
> I think the lowering of this already at ompexp time is premature

Yes, we're aware of this "wart".  :-|

> I think much better would be to have a function attribute (or cgraph
> flag) that would be set for functions you want to compile this way
> (plus a targetm flag saying that the target wants to support it that
> way), plus a flag in the loop structure for the acc loop vector loops
> (perhaps the current OpenMP simd loop flags are good enough for that),
> and lower it somewhere around the vectorization pass or so.

Moving the loop lowering/expansion later is along the same lines as
we've been thinking.  Figuring out how the OpenMP simd implementation
works is another thing I wanted to look into.

> Or, what exactly do you emit for the fallback code, or for other
> GPGPUs or XeonPhi?  To me, e.g. for XeonPhi or HSA, this sounds like
> you want to implement the acc loop gang as a work-sharing loop among
> threads (like #pragma omp for) and #pragma acc loop vector as a loop
> that should be vectorized if at all possible (like #pragma omp simd).
> I really think it is important that OpenACC support in GCC is not so
> strongly tied to one specific GPGPU

Not disagreeing, but: we have to start somewhere.  GPU offloading and
all its peculiarities are still unknown territory in GCC; we're still
learning, and shall try to converge the emerging different
implementations in the future.  Doing the completely generic (agnostic
of specific offloading device) implementation right now is a
challenging task, hence the work on a "nvptx-specific prototype"
first, to put it this way.  That said, we of course very much welcome
your continued review of our work, and your suggestions!

> and similarly OpenMP should be usable for all offloading targets GCC
> supports.
>
> That way, it is possible to auto-vectorize the code too, the decision
> on how to expand the code of an offloaded function is made separately
> for each offloading target, there is room for optimizations on a much
> simpler CFG, etc.

Grüße,
 Thomas

[-- Attachment #2: Type: application/pgp-signature, Size: 472 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 15:14   ` Thomas Schwinge
@ 2015-05-28 15:28     ` Jakub Jelinek
  2015-06-19 10:44       ` Bernd Schmidt
  0 siblings, 1 reply; 23+ messages in thread
From: Jakub Jelinek @ 2015-05-28 15:28 UTC (permalink / raw)
  To: Thomas Schwinge; +Cc: gcc-patches, Bernd Schmidt, Nathan Sidwell, Julian Brown

On Thu, May 28, 2015 at 04:49:43PM +0200, Thomas Schwinge wrote:
> > I think much better would be to have a function attribute (or cgraph
> > flag) that would be set for functions you want to compile this way
> > (plus a targetm flag saying that the target wants to support it that
> > way), plus a flag in the loop structure for the acc loop vector loops
> > (perhaps the current OpenMP simd loop flags are good enough for that),
> > and lower it somewhere around the vectorization pass or so.
>
> Moving the loop lowering/expansion later is along the same lines as
> we've been thinking.  Figuring out how the OpenMP simd implementation
> works is another thing I wanted to look into.

The OpenMP simd expansion is actually quite a simple thing.
Basically, the simd loop is in ompexp expanded as a normal loop with
some flags in the loop structure (which are pretty much optimization
hints).  There is a flag saying that the user would really like the
loop to be vectorized, and another field that says (from what the user
told the compiler) what vectorization factor is safe to use regardless
of the compiler's analysis.  There are some complications with
privatization clauses, so some variables are in GIMPLE represented as
arrays with maximum vf elements, indexed by an internal function (the
simd lane), which the vectorizer then either turns back into scalars
(if the loop isn't vectorized), or vectorizes, keeping addressables in
arrays with the actual vf elements.

I admit I don't know too much about OpenACC, but I'd think doing
something similar (i.e. some loop structure hint or request that a
particular loop is vectorized, and perhaps something about lexical
forward/backward dependencies in the loop) could work.  Then for
XeonPhi or host fallback, you'd just use the normal vectorizer.  And
for PTX, at about the same point, you could instead of vectorization
lower the code to a single working thread, except for the simd-marked
loops, which would be lowered to run on all threads in the warp.

> Not disagreeing, but: we have to start somewhere.  GPU offloading and
> all its peculiarities are still unknown territory in GCC; we're still
> learning, and shall try to converge the emerging different
> implementations in the future.  Doing the completely generic (agnostic
> of specific offloading device) implementation right now is a
> challenging task, hence the work on a "nvptx-specific prototype"
> first, to put it this way.

I understand it is more work; I'd just like to ask that when designing
stuff for the OpenACC offloading you (plural) try to take the other
offloading devices and host fallback into account.  E.g. the XeonPhi
is not hard to understand: it is pretty much just a many-core x86_64
chip, where offloading is just a matter of running something on the
other device, and the emulation mode emulates that quite faithfully by
running it in a different process.  This stuff is already about what
happens in offloaded code, so considerations for it are similar to
those for host code (especially hosts that can vectorize).
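For reference, a sketch of the lane-array representation described
earlier in this message.  The GIMPLE fragment is conceptual
pseudo-output, not an actual dump; .GOMP_SIMD_LANE is the internal
function used for the lane index:

  /* Source form: "t" is private to each simd lane.  */
  void
  f (int n, float *a, const float *b, const float *c)
  {
  #pragma omp simd
    for (int i = 0; i < n; i++)
      {
        float t = b[i] * c[i];
        a[i] = t + t;
      }
  }

  /* Conceptually, after omp-lowering "t" becomes an array of up to
     max-vf elements, indexed via the simd-lane internal function:

       _lane = .GOMP_SIMD_LANE (simduid.0);
       t_arr[_lane] = b[i] * c[i];
       a[i] = t_arr[_lane] + t_arr[_lane];

     If the loop is vectorized with factor vf, t_arr shrinks to vf
     elements; if the loop is not vectorized, t_arr collapses back to
     a scalar.  */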
As far as OpenMP / PTX goes, I'll try to find time for it again soon
(busy with OpenMP 4.1 work so far), but e.g. the above stuff (having a
single thread in a warp do most of the non-vectorized work, and only
using the other threads of the warp for vectorization) is definitely
something OpenMP will benefit from too.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-05-28 15:28     ` Jakub Jelinek
@ 2015-06-19 10:44       ` Bernd Schmidt
  2015-06-19 12:32         ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Schmidt @ 2015-06-19 10:44 UTC (permalink / raw)
  To: Jakub Jelinek, Thomas Schwinge; +Cc: gcc-patches, Nathan Sidwell, Julian Brown

[-- Attachment #1: Type: text/plain, Size: 2492 bytes --]

On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> I understand it is more work; I'd just like to ask that when designing
> stuff for the OpenACC offloading you (plural) try to take the other
> offloading devices and host fallback into account.

The problem is that many of the transformations we need to do are
really GPU specific, and with the current structure of omplow/ompexp
they are being done in the host compiler.  The offloading scheme we
decided on does not give us the means to write out multiple versions
of an offloaded function where each target gets a different one.  For
that reason I think we should postpone these lowering decisions until
we're in the accel compiler, where they could be controlled by target
hooks, and over the last two weeks I've been doing some experiments to
see how that could be achieved.

The basic idea is to delay expanding the inner regions of an OpenACC
target region during ompexp, write out offload LTO (almost)
immediately afterwards, and then have another ompexp phase which runs
on the accel compiler to take the offloaded function to its final
form.

The first attempt really did write LTO immediately after, before
moving to SSA form.  It seems that this could be made to work, but the
pass manager and LTO code rather expect that what is being read in is
in SSA form already.  Also, some offloaded code is produced by OpenACC
kernels expansion much later in the compilation, so with this approach
we have an inconsistency where functions we get back from LTO are at
very different levels of lowering.

The next attempt was to run the into-ssa passes after ompexpand, and
only then write things out.  I've changed the gimple representation of
some OMP statements (primarily gimple_omp_for) so that they are
relatively normal statements with operands that can be transformed
into SSA form.

As far as what's easier to work with - I believe some of the
transformations we have to do could benefit from being in SSA, but on
the other hand the OpenACC predication code has given me some trouble.
I've still not completely convinced myself that the update_ssa call
I've added will actually do the right thing after we've mucked up the
CFG.

I'm appending a proof-of-concept patch.  This is intended to show the
general outline of what I have in mind, rather than to pass the
testsuite.  It's good enough to compile some of the OpenACC testcases
(let's say worker-single-3 if you need one).  Let me know what you
think.
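To illustrate the operand-based GIMPLE_OMP_FOR representation the
attached patch introduces, here is a minimal sketch of how a later
(accel-compiler) ompexp phase might rebuild the increment statement
from the new accessors.  The function and its context are
hypothetical; only the accessors are from the patch:

  /* Hypothetical helper: rebuild dimension 0's increment of FOR_STMT
     as the ordinary assignment "index = index <incr-code> step".  */
  static gimple
  rebuild_loop_increment (gomp_for *for_stmt)
  {
    tree index = gimple_omp_for_index (for_stmt, 0);
    tree step = gimple_omp_for_incr (for_stmt, 0);
    /* PLUS_EXPR, POINTER_PLUS_EXPR or MINUS_EXPR, per the new
       gimple_omp_for_incr_code accessor.  */
    enum tree_code code = gimple_omp_for_incr_code (for_stmt, 0);
    return gimple_build_assign (index, code, index, step);
  }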
Bernd [-- Attachment #2: offload-early.diff --] [-- Type: text/x-patch, Size: 60689 bytes --] Index: gcc/cgraphunit.c =================================================================== --- gcc/cgraphunit.c (revision 224547) +++ gcc/cgraphunit.c (working copy) @@ -2171,6 +2171,23 @@ ipa_passes (void) execute_ipa_pass_list (passes->all_small_ipa_passes); if (seen_error ()) return; + + if (g->have_offload) + { + extern void write_offload_lto (); + section_name_prefix = OFFLOAD_SECTION_NAME_PREFIX; + write_offload_lto (); + } + } + bool do_local_opts = !in_lto_p; +#ifdef ACCEL_COMPILER + do_local_opts = true; +#endif + if (do_local_opts) + { + execute_ipa_pass_list (passes->all_local_opt_passes); + if (seen_error ()) + return; } /* This extra symtab_remove_unreachable_nodes pass tends to catch some @@ -2182,7 +2199,7 @@ ipa_passes (void) if (symtab->state < IPA_SSA) symtab->state = IPA_SSA; - if (!in_lto_p) + if (do_local_opts) { /* Generate coverage variables and constructors. */ coverage_finish (); @@ -2285,6 +2302,14 @@ symbol_table::compile (void) if (seen_error ()) return; +#ifdef ACCEL_COMPILER + { + cgraph_node *node; + FOR_EACH_DEFINED_FUNCTION (node) + node->get_untransformed_body (); + } +#endif + #ifdef ENABLE_CHECKING symtab_node::verify_symtab_nodes (); #endif Index: gcc/config/nvptx/nvptx.c =================================================================== --- gcc/config/nvptx/nvptx.c (revision 224547) +++ gcc/config/nvptx/nvptx.c (working copy) @@ -1171,18 +1171,42 @@ nvptx_section_from_addr_space (addr_spac } } -/* Determine whether DECL goes into .const or .global. */ +/* Determine the address space DECL lives in. */ -const char * -nvptx_section_for_decl (const_tree decl) +static addr_space_t +nvptx_addr_space_for_decl (const_tree decl) { + if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL) + return ADDR_SPACE_GENERIC; + + if (lookup_attribute ("oacc ganglocal", DECL_ATTRIBUTES (decl)) != NULL_TREE) + return ADDR_SPACE_SHARED; + bool is_const = (CONSTANT_CLASS_P (decl) || TREE_CODE (decl) == CONST_DECL || TREE_READONLY (decl)); if (is_const) - return ".const"; + return ADDR_SPACE_CONST; - return ".global"; + return ADDR_SPACE_GLOBAL; +} + +/* Return a ptx string representing the address space for a variable DECL. */ + +const char * +nvptx_section_for_decl (const_tree decl) +{ + switch (nvptx_addr_space_for_decl (decl)) + { + case ADDR_SPACE_CONST: + return ".const"; + case ADDR_SPACE_SHARED: + return ".shared"; + case ADDR_SPACE_GLOBAL: + return ".global"; + default: + gcc_unreachable (); + } } /* Look for a SYMBOL_REF in ADDR and return the address space to be used @@ -1196,17 +1220,7 @@ nvptx_addr_space_from_address (rtx addr) if (GET_CODE (addr) != SYMBOL_REF) return ADDR_SPACE_GENERIC; - tree decl = SYMBOL_REF_DECL (addr); - if (decl == NULL_TREE || TREE_CODE (decl) == FUNCTION_DECL) - return ADDR_SPACE_GENERIC; - - bool is_const = (CONSTANT_CLASS_P (decl) - || TREE_CODE (decl) == CONST_DECL - || TREE_READONLY (decl)); - if (is_const) - return ADDR_SPACE_CONST; - - return ADDR_SPACE_GLOBAL; + return nvptx_addr_space_for_decl (SYMBOL_REF_DECL (addr)); } \f /* Machinery to output constant initializers. 
*/ Index: gcc/gimple-pretty-print.c =================================================================== --- gcc/gimple-pretty-print.c (revision 224547) +++ gcc/gimple-pretty-print.c (working copy) @@ -1175,11 +1175,12 @@ dump_gimple_omp_for (pretty_printer *buf dump_gimple_fmt (buffer, spc, flags, " >,"); for (i = 0; i < gimple_omp_for_collapse (gs); i++) dump_gimple_fmt (buffer, spc, flags, - "%+%T, %T, %T, %s, %T,%n", + "%+%T, %T, %T, %s, %s, %T,%n", gimple_omp_for_index (gs, i), gimple_omp_for_initial (gs, i), gimple_omp_for_final (gs, i), get_tree_code_name (gimple_omp_for_cond (gs, i)), + get_tree_code_name (gimple_omp_for_incr_code (gs, i)), gimple_omp_for_incr (gs, i)); dump_gimple_fmt (buffer, spc, flags, "PRE_BODY <%S>%->", gimple_omp_for_pre_body (gs)); @@ -1259,6 +1260,20 @@ dump_gimple_omp_for (pretty_printer *buf dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc, flags, false); pp_string (buffer, " = "); + dump_generic_node (buffer, gimple_omp_for_index (gs, i), spc, + flags, false); + switch (gimple_omp_for_incr_code (gs, i)) + { + case POINTER_PLUS_EXPR: + case PLUS_EXPR: + pp_plus (buffer); + break; + case MINUS_EXPR: + pp_minus (buffer); + break; + default: + gcc_unreachable (); + } dump_generic_node (buffer, gimple_omp_for_incr (gs, i), spc, flags, false); pp_right_paren (buffer); Index: gcc/gimple-streamer-in.c =================================================================== --- gcc/gimple-streamer-in.c (revision 224547) +++ gcc/gimple-streamer-in.c (working copy) @@ -176,6 +176,7 @@ input_gimple_stmt (struct lto_input_bloc } /* Fallthru */ + case GIMPLE_OMP_ENTRY_END: case GIMPLE_ASSIGN: case GIMPLE_CALL: case GIMPLE_RETURN: @@ -225,6 +226,7 @@ input_gimple_stmt (struct lto_input_bloc case GIMPLE_NOP: case GIMPLE_PREDICT: + case GIMPLE_OMP_RETURN: break; case GIMPLE_TRANSACTION: @@ -232,6 +234,42 @@ input_gimple_stmt (struct lto_input_bloc stream_read_tree (ib, data_in)); break; + case GIMPLE_OMP_FOR: + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + gimple_omp_for_set_clauses (for_stmt, stream_read_tree (ib, data_in)); + size_t collapse = streamer_read_hwi (ib); + for_stmt->collapse = collapse; + for_stmt->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); + for (size_t i = 0; i < collapse; i++) + { + gimple_omp_for_set_cond (stmt, i, streamer_read_enum (ib, tree_code, + MAX_TREE_CODES)); + gimple_omp_for_set_incr_code (stmt, i, streamer_read_enum (ib, tree_code, + MAX_TREE_CODES)); + gimple_omp_for_set_index (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_initial (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_final (stmt, i, stream_read_tree (ib, data_in)); + gimple_omp_for_set_incr (stmt, i, stream_read_tree (ib, data_in)); + } + } + break; + + case GIMPLE_OMP_CONTINUE: + { + gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt); + gimple_omp_continue_set_control_def (cont_stmt, stream_read_tree (ib, data_in)); + gimple_omp_continue_set_control_use (cont_stmt, stream_read_tree (ib, data_in)); + } + break; + + case GIMPLE_OMP_TARGET: + { + gomp_target *tgt_stmt = as_a <gomp_target *> (stmt); + gimple_omp_target_set_clauses (tgt_stmt, stream_read_tree (ib, data_in)); + } + break; + default: internal_error ("bytecode stream: unknown GIMPLE statement tag %s", lto_tag_name (tag)); @@ -239,9 +277,9 @@ input_gimple_stmt (struct lto_input_bloc /* Update the properties of symbols, SSA names and labels associated with STMT. 
*/ - if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL) + if (code == GIMPLE_ASSIGN || code == GIMPLE_CALL || code == GIMPLE_OMP_CONTINUE) { - tree lhs = gimple_get_lhs (stmt); + tree lhs = gimple_op (stmt, 0); if (lhs && TREE_CODE (lhs) == SSA_NAME) SSA_NAME_DEF_STMT (lhs) = stmt; } @@ -257,7 +295,16 @@ input_gimple_stmt (struct lto_input_bloc SSA_NAME_DEF_STMT (op) = stmt; } } - + else if (code == GIMPLE_OMP_FOR) + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + for (unsigned i = 0; i < gimple_omp_for_collapse (for_stmt); i++) + { + tree op = gimple_omp_for_index (for_stmt, i); + if (TREE_CODE (op) == SSA_NAME) + SSA_NAME_DEF_STMT (op) = stmt; + } + } /* Reset alias information. */ if (code == GIMPLE_CALL) gimple_call_reset_alias_info (as_a <gcall *> (stmt)); Index: gcc/gimple-streamer-out.c =================================================================== --- gcc/gimple-streamer-out.c (revision 224547) +++ gcc/gimple-streamer-out.c (working copy) @@ -147,6 +147,7 @@ output_gimple_stmt (struct output_block } /* Fallthru */ + case GIMPLE_OMP_ENTRY_END: case GIMPLE_ASSIGN: case GIMPLE_CALL: case GIMPLE_RETURN: @@ -201,6 +202,7 @@ output_gimple_stmt (struct output_block case GIMPLE_NOP: case GIMPLE_PREDICT: + case GIMPLE_OMP_RETURN: break; case GIMPLE_TRANSACTION: @@ -211,6 +213,45 @@ output_gimple_stmt (struct output_block } break; + case GIMPLE_OMP_FOR: + { + gomp_for *for_stmt = as_a <gomp_for *> (stmt); + stream_write_tree (ob, gimple_omp_for_clauses (for_stmt), true); + size_t collapse_count = gimple_omp_for_collapse (for_stmt); + streamer_write_hwi (ob, collapse_count); + for (size_t i = 0; i < collapse_count; i++) + { + streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES, + gimple_omp_for_cond (for_stmt, i)); + streamer_write_enum (ob->main_stream, tree_code, MAX_TREE_CODES, + gimple_omp_for_incr_code (for_stmt, i)); + stream_write_tree (ob, gimple_omp_for_index (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_initial (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_final (for_stmt, i), true); + stream_write_tree (ob, gimple_omp_for_incr (for_stmt, i), true); + } + /* No need to write out the pre-body, it's empty by the time we + get here. */ + } + break; + + case GIMPLE_OMP_CONTINUE: + { + gomp_continue *cont_stmt = as_a <gomp_continue *> (stmt); + stream_write_tree (ob, gimple_omp_continue_control_def (cont_stmt), + true); + stream_write_tree (ob, gimple_omp_continue_control_use (cont_stmt), + true); + } + break; + + case GIMPLE_OMP_TARGET: + { + gomp_target *tgt_stmt = as_a <gomp_target *> (stmt); + stream_write_tree (ob, gimple_omp_target_clauses (tgt_stmt), true); + } + break; + default: gcc_unreachable (); } Index: gcc/gimple.c =================================================================== --- gcc/gimple.c (revision 224547) +++ gcc/gimple.c (working copy) @@ -855,9 +855,11 @@ gimple_build_debug_source_bind_stat (tre /* Build a GIMPLE_OMP_ENTRY_END statement. 
*/ gimple -gimple_build_omp_entry_end (void) +gimple_build_omp_entry_end (tree var) { - return gimple_alloc (GIMPLE_OMP_ENTRY_END, 0); + gimple t = gimple_alloc (GIMPLE_OMP_ENTRY_END, 1); + gimple_set_op (t, 0, var); + return t; } @@ -890,13 +892,14 @@ gomp_for * gimple_build_omp_for (gimple_seq body, int kind, tree clauses, size_t collapse, gimple_seq pre_body) { - gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, 0)); + int nops = collapse * 4; + gomp_for *p = as_a <gomp_for *> (gimple_alloc (GIMPLE_OMP_FOR, nops)); if (body) gimple_omp_set_body (p, body); gimple_omp_for_set_clauses (p, clauses); gimple_omp_for_set_kind (p, kind); p->collapse = collapse; - p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); + p->iter = ggc_cleared_vec_alloc<gimple_omp_for_iter> (collapse); if (pre_body) gimple_omp_for_set_pre_body (p, pre_body); @@ -1011,7 +1014,7 @@ gomp_continue * gimple_build_omp_continue (tree control_def, tree control_use) { gomp_continue *p - = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 0)); + = as_a <gomp_continue *> (gimple_alloc (GIMPLE_OMP_CONTINUE, 2)); gimple_omp_continue_set_control_def (p, control_def); gimple_omp_continue_set_control_use (p, control_use); return p; Index: gcc/gimple.def =================================================================== --- gcc/gimple.def (revision 224547) +++ gcc/gimple.def (working copy) @@ -225,11 +225,11 @@ DEFGSCODE(GIMPLE_OMP_ATOMIC_STORE, "gimp /* GIMPLE_OMP_CONTINUE marks the location of the loop or sections iteration in partially lowered OpenMP code. */ -DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_OMP_CONTINUE) +DEFGSCODE(GIMPLE_OMP_CONTINUE, "gimple_omp_continue", GSS_WITH_OPS) /* GIMPLE_OMP_ENTRY_END marks the end of the unpredicated entry block into an offloaded region. */ -DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_BASE) +DEFGSCODE(GIMPLE_OMP_ENTRY_END, "gimple_omp_entry_end", GSS_WITH_OPS) /* GIMPLE_OMP_CRITICAL <NAME, BODY> represents Index: gcc/gimple.h =================================================================== --- gcc/gimple.h (revision 224547) +++ gcc/gimple.h (working copy) @@ -301,7 +301,7 @@ struct GTY((tag("GSS_CALL"))) /* OMP statements. */ struct GTY((tag("GSS_OMP"))) - gimple_statement_omp : public gimple_statement_base + gimple_statement_omp : public gimple_statement_with_ops_base { /* [ WORD 1-6 ] : base class */ @@ -520,20 +520,8 @@ struct GTY((tag("GSS_OMP_CRITICAL"))) struct GTY(()) gimple_omp_for_iter { - /* Condition code. */ - enum tree_code cond; - - /* Index variable. */ - tree index; - - /* Initial value. */ - tree initial; - - /* Final value. */ - tree final; - - /* Increment. */ - tree incr; + /* Condition code and increment code. */ + enum tree_code cond, incr; }; /* GIMPLE_OMP_FOR */ @@ -556,6 +544,12 @@ struct GTY((tag("GSS_OMP_FOR"))) /* [ WORD 11 ] Pre-body evaluated before the loop body begins. */ gimple_seq pre_body; + + /* [ WORD 12 ] + Operand vector. NOTE! This must always be the last field + of this structure. In particular, this means that this + structure cannot be embedded inside another one. */ + tree GTY((length ("%h.num_ops"))) op[1]; }; @@ -581,10 +575,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT /* [ WORD 11 ] Size of the gang-local memory to allocate. */ tree ganglocal_size; - - /* [ WORD 12 ] - A pointer to the array to be used for broadcasting across threads. 
*/ - tree broadcast_array; }; /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */ @@ -655,16 +645,10 @@ struct GTY((tag("GSS_OMP_SECTIONS"))) Note: This does not inherit from gimple_statement_omp, because we do not need the body field. */ -struct GTY((tag("GSS_OMP_CONTINUE"))) - gomp_continue : public gimple_statement_base +struct GTY((tag("GSS_WITH_OPS"))) + gomp_continue : public gimple_statement_with_ops { - /* [ WORD 1-6 ] : base class */ - - /* [ WORD 7 ] */ - tree control_def; - - /* [ WORD 8 ] */ - tree control_use; + /* no additional fields; this uses the layout for GSS_WITH_OPS. */ }; /* GIMPLE_OMP_SINGLE, GIMPLE_OMP_TEAMS */ @@ -1356,7 +1340,7 @@ gimple gimple_build_omp_taskgroup (gimpl gomp_continue *gimple_build_omp_continue (tree, tree); gimple gimple_build_omp_ordered (gimple_seq); gimple gimple_build_omp_return (bool); -gimple gimple_build_omp_entry_end (); +gimple gimple_build_omp_entry_end (tree); gomp_sections *gimple_build_omp_sections (gimple_seq, tree); gimple gimple_build_omp_sections_switch (void); gomp_single *gimple_build_omp_single (gimple_seq, tree); @@ -1853,7 +1837,10 @@ gimple_init_singleton (gimple g) static inline bool gimple_has_ops (const_gimple g) { - return gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN; + return ((gimple_code (g) >= GIMPLE_COND && gimple_code (g) <= GIMPLE_RETURN) + || gimple_code (g) == GIMPLE_OMP_FOR + || gimple_code (g) == GIMPLE_OMP_ENTRY_END + || gimple_code (g) == GIMPLE_OMP_CONTINUE); } template <> @@ -4559,6 +4546,27 @@ gimple_omp_for_set_cond (gimple gs, size omp_for_stmt->iter[i].cond = cond; } +/* Return the increment code associated with the OMP_FOR statement GS. */ + +static inline enum tree_code +gimple_omp_for_incr_code (const_gimple gs, size_t i) +{ + const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); + gcc_gimple_checking_assert (i < omp_for_stmt->collapse); + return omp_for_stmt->iter[i].incr; +} + + +/* Set INCR to be the increment code for the OMP_FOR statement GS. */ + +static inline void +gimple_omp_for_set_incr_code (gimple gs, size_t i, enum tree_code incr) +{ + gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); + gcc_gimple_checking_assert (i < omp_for_stmt->collapse); + omp_for_stmt->iter[i].incr = incr; +} + /* Return the index variable for the OMP_FOR statement GS. 
*/ @@ -4567,7 +4575,7 @@ gimple_omp_for_index (const_gimple gs, s { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].index; + return gimple_op (gs, i); } @@ -4578,7 +4586,7 @@ gimple_omp_for_index_ptr (gimple gs, siz { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].index; + return gimple_op_ptr (gs, i); } @@ -4588,8 +4596,9 @@ static inline void gimple_omp_for_set_index (gimple gs, size_t i, tree index) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].index = index; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i, index); } @@ -4599,8 +4608,9 @@ static inline tree gimple_omp_for_initial (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c); } @@ -4610,8 +4620,9 @@ static inline tree * gimple_omp_for_initial_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c); } @@ -4621,8 +4632,9 @@ static inline void gimple_omp_for_set_initial (gimple gs, size_t i, tree initial) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].initial = initial; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c, initial); } @@ -4632,8 +4644,9 @@ static inline tree gimple_omp_for_final (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c * 2); } @@ -4643,8 +4656,9 @@ static inline tree * gimple_omp_for_final_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c * 2); } @@ -4654,8 +4668,9 @@ static inline void gimple_omp_for_set_final (gimple gs, size_t i, tree final) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].final = final; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c * 2, final); } @@ -4665,8 +4680,9 @@ static inline tree gimple_omp_for_incr (const_gimple gs, size_t i) { const gomp_for *omp_for_stmt = as_a <const gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return omp_for_stmt->iter[i].incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op (gs, i + c * 3); } @@ -4676,8 +4692,9 @@ static inline tree * gimple_omp_for_incr_ptr (gimple gs, size_t i) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - 
gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - return &omp_for_stmt->iter[i].incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + return gimple_op_ptr (gs, i + c * 3); } @@ -4687,8 +4704,9 @@ static inline void gimple_omp_for_set_incr (gimple gs, size_t i, tree incr) { gomp_for *omp_for_stmt = as_a <gomp_for *> (gs); - gcc_gimple_checking_assert (i < omp_for_stmt->collapse); - omp_for_stmt->iter[i].incr = incr; + size_t c = omp_for_stmt->collapse; + gcc_gimple_checking_assert (i < c); + gimple_set_op (gs, i + c * 3, incr); } @@ -5248,25 +5266,6 @@ gimple_omp_target_set_ganglocal_size (go } -/* Return the pointer to the broadcast array associated with OMP_TARGET GS. */ - -static inline tree -gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt) -{ - return omp_target_stmt->broadcast_array; -} - - -/* Set PTR to be the broadcast array associated with OMP_TARGET - GS. */ - -static inline void -gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr) -{ - omp_target_stmt->broadcast_array = ptr; -} - - /* Return the clauses associated with OMP_TEAMS GS. */ static inline tree @@ -5446,7 +5445,7 @@ gimple_omp_atomic_load_rhs_ptr (gomp_ato static inline tree gimple_omp_continue_control_def (const gomp_continue *cont_stmt) { - return cont_stmt->control_def; + return gimple_op (cont_stmt, 0); } /* The same as above, but return the address. */ @@ -5454,7 +5453,7 @@ gimple_omp_continue_control_def (const g static inline tree * gimple_omp_continue_control_def_ptr (gomp_continue *cont_stmt) { - return &cont_stmt->control_def; + return gimple_op_ptr (cont_stmt, 0); } /* Set the definition of the control variable in a GIMPLE_OMP_CONTINUE. */ @@ -5462,7 +5461,7 @@ gimple_omp_continue_control_def_ptr (gom static inline void gimple_omp_continue_set_control_def (gomp_continue *cont_stmt, tree def) { - cont_stmt->control_def = def; + gimple_set_op (cont_stmt, 0, def); } @@ -5471,7 +5470,7 @@ gimple_omp_continue_set_control_def (gom static inline tree gimple_omp_continue_control_use (const gomp_continue *cont_stmt) { - return cont_stmt->control_use; + return gimple_op (cont_stmt, 1); } @@ -5480,7 +5479,7 @@ gimple_omp_continue_control_use (const g static inline tree * gimple_omp_continue_control_use_ptr (gomp_continue *cont_stmt) { - return &cont_stmt->control_use; + return gimple_op_ptr (cont_stmt, 1); } @@ -5489,7 +5488,7 @@ gimple_omp_continue_control_use_ptr (gom static inline void gimple_omp_continue_set_control_use (gomp_continue *cont_stmt, tree use) { - cont_stmt->control_use = use; + gimple_set_op (cont_stmt, 1, use); } /* Return a pointer to the body for the GIMPLE_TRANSACTION statement Index: gcc/gimplify.c =================================================================== --- gcc/gimplify.c (revision 224547) +++ gcc/gimplify.c (working copy) @@ -7582,12 +7582,15 @@ gimplify_omp_for (tree *expr_p, gimple_s for (i = 0; i < TREE_VEC_LENGTH (OMP_FOR_INIT (for_stmt)); i++) { t = TREE_VEC_ELT (OMP_FOR_INIT (for_stmt), i); - gimple_omp_for_set_index (gfor, i, TREE_OPERAND (t, 0)); + tree idxvar = TREE_OPERAND (t, 0); + gimple_omp_for_set_index (gfor, i, idxvar); gimple_omp_for_set_initial (gfor, i, TREE_OPERAND (t, 1)); t = TREE_VEC_ELT (OMP_FOR_COND (for_stmt), i); gimple_omp_for_set_cond (gfor, i, TREE_CODE (t)); gimple_omp_for_set_final (gfor, i, TREE_OPERAND (t, 1)); t = TREE_VEC_ELT (OMP_FOR_INCR (for_stmt), i); + t = TREE_OPERAND (t, 1); + gimple_omp_for_set_incr_code (gfor, i, TREE_CODE (t)); gimple_omp_for_set_incr 
(gfor, i, TREE_OPERAND (t, 1)); } Index: gcc/gsstruct.def =================================================================== --- gcc/gsstruct.def (revision 224547) +++ gcc/gsstruct.def (working copy) @@ -42,12 +42,11 @@ DEFGSSTRUCT(GSS_EH_ELSE, geh_else, false DEFGSSTRUCT(GSS_WCE, gimple_statement_wce, false) DEFGSSTRUCT(GSS_OMP, gimple_statement_omp, false) DEFGSSTRUCT(GSS_OMP_CRITICAL, gomp_critical, false) -DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, false) +DEFGSSTRUCT(GSS_OMP_FOR, gomp_for, true) DEFGSSTRUCT(GSS_OMP_PARALLEL_LAYOUT, gimple_statement_omp_parallel_layout, false) DEFGSSTRUCT(GSS_OMP_TASK, gomp_task, false) DEFGSSTRUCT(GSS_OMP_SECTIONS, gomp_sections, false) DEFGSSTRUCT(GSS_OMP_SINGLE_LAYOUT, gimple_statement_omp_single_layout, false) -DEFGSSTRUCT(GSS_OMP_CONTINUE, gomp_continue, false) DEFGSSTRUCT(GSS_OMP_ATOMIC_LOAD, gomp_atomic_load, false) DEFGSSTRUCT(GSS_OMP_ATOMIC_STORE_LAYOUT, gomp_atomic_store, false) DEFGSSTRUCT(GSS_TRANSACTION, gtransaction, false) Index: gcc/ipa-inline-analysis.c =================================================================== --- gcc/ipa-inline-analysis.c (revision 224547) +++ gcc/ipa-inline-analysis.c (working copy) @@ -4122,10 +4122,12 @@ inline_generate_summary (void) { struct cgraph_node *node; +#ifndef ACCEL_COMPILER /* When not optimizing, do not bother to analyze. Inlining is still done because edge redirection needs to happen there. */ if (!optimize && !flag_generate_lto && !flag_generate_offload && !flag_wpa) return; +#endif if (!inline_summaries) inline_summaries = (inline_summary_t*) inline_summary_t::create_ggc (symtab); Index: gcc/lto/lto.c =================================================================== --- gcc/lto/lto.c (revision 224547) +++ gcc/lto/lto.c (working copy) @@ -3115,8 +3115,10 @@ read_cgraph_and_symbols (unsigned nfiles /* Read the IPA summary data. */ if (flag_ltrans) ipa_read_optimization_summaries (); +#ifndef ACCEL_COMPILER else ipa_read_summaries (); +#endif for (i = 0; all_file_decl_data[i]; i++) { Index: gcc/lto-streamer-out.c =================================================================== --- gcc/lto-streamer-out.c (revision 224547) +++ gcc/lto-streamer-out.c (working copy) @@ -1800,27 +1800,32 @@ output_ssa_names (struct output_block *o { unsigned int i, len; - len = vec_safe_length (SSANAMES (fn)); - streamer_write_uhwi (ob, len); - - for (i = 1; i < len; i++) + if (cfun->gimple_df) { - tree ptr = (*SSANAMES (fn))[i]; + len = vec_safe_length (SSANAMES (fn)); + streamer_write_uhwi (ob, len); - if (ptr == NULL_TREE - || SSA_NAME_IN_FREE_LIST (ptr) - || virtual_operand_p (ptr)) - continue; + for (i = 1; i < len; i++) + { + tree ptr = (*SSANAMES (fn))[i]; - streamer_write_uhwi (ob, i); - streamer_write_char_stream (ob->main_stream, - SSA_NAME_IS_DEFAULT_DEF (ptr)); - if (SSA_NAME_VAR (ptr)) - stream_write_tree (ob, SSA_NAME_VAR (ptr), true); - else - /* ??? This drops SSA_NAME_IDENTIFIER on the floor. */ - stream_write_tree (ob, TREE_TYPE (ptr), true); + if (ptr == NULL_TREE + || SSA_NAME_IN_FREE_LIST (ptr) + || virtual_operand_p (ptr)) + continue; + + streamer_write_uhwi (ob, i); + streamer_write_char_stream (ob->main_stream, + SSA_NAME_IS_DEFAULT_DEF (ptr)); + if (SSA_NAME_VAR (ptr)) + stream_write_tree (ob, SSA_NAME_VAR (ptr), true); + else + /* ??? This drops SSA_NAME_IDENTIFIER on the floor. 
*/ + stream_write_tree (ob, TREE_TYPE (ptr), true); + } } + else + streamer_write_zero (ob); streamer_write_zero (ob); } Index: gcc/omp-low.c =================================================================== --- gcc/omp-low.c (revision 224547) +++ gcc/omp-low.c (working copy) @@ -110,7 +110,7 @@ along with GCC; see the file COPYING3. #include "gomp-constants.h" #include "gimple-pretty-print.h" #include "set" - +#include "output.h" /* Lowering of OMP parallel and workshare constructs proceeds in two phases. The first phase scans the function looking for OMP statements @@ -597,17 +597,17 @@ extract_omp_for_data (gomp_for *for_stmt } t = gimple_omp_for_incr (for_stmt, i); - gcc_assert (TREE_OPERAND (t, 0) == var); - switch (TREE_CODE (t)) + enum tree_code incr_code = gimple_omp_for_incr_code (for_stmt, i); + switch (incr_code) { case PLUS_EXPR: - loop->step = TREE_OPERAND (t, 1); + loop->step = t; break; case POINTER_PLUS_EXPR: - loop->step = fold_convert (ssizetype, TREE_OPERAND (t, 1)); + loop->step = fold_convert (ssizetype, t); break; case MINUS_EXPR: - loop->step = TREE_OPERAND (t, 1); + loop->step = t; loop->step = fold_build1_loc (loc, NEGATE_EXPR, TREE_TYPE (loop->step), loop->step); @@ -9721,12 +9721,21 @@ loop_get_oacc_kernels_region_entry (stru } } +static bool +was_offloaded_p (tree fn) +{ +#ifdef ACCEL_COMPILER + return true; +#endif + struct cgraph_node *node = cgraph_node::get (fn); + return node->offloadable; +} + /* Expand the GIMPLE_OMP_TARGET starting at REGION. */ static void expand_omp_target (struct omp_region *region) { - basic_block entry_bb, exit_bb, new_bb; struct function *child_cfun; tree child_fn, block, t; gimple_stmt_iterator gsi; @@ -9736,12 +9745,33 @@ expand_omp_target (struct omp_region *re bool offloaded, data_region; bool do_emit_library_call = true; bool do_splitoff = true; + bool already_offloaded = was_offloaded_p (current_function_decl); entry_stmt = as_a <gomp_target *> (last_stmt (region->entry)); + location_t entry_loc = gimple_location (entry_stmt); - new_bb = region->entry; + basic_block new_bb = region->entry; + basic_block entry_bb = region->entry; + basic_block exit_bb = region->exit; + basic_block entry_succ_bb = single_succ (entry_bb); - offloaded = is_gimple_omp_offloaded (entry_stmt); + if (already_offloaded) + { + gsi = gsi_for_stmt (entry_stmt); + gsi_remove (&gsi, true); + + gsi = gsi_last_bb (exit_bb); + gcc_assert (!gsi_end_p (gsi) + && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN); + gsi_remove (&gsi, true); + + gsi = gsi_last_bb (entry_succ_bb); + if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END) + gsi_remove (&gsi, true); + return; + } + + offloaded = !already_offloaded && is_gimple_omp_offloaded (entry_stmt); switch (gimple_omp_target_kind (entry_stmt)) { case GF_OMP_TARGET_KIND_REGION: @@ -9773,9 +9803,6 @@ expand_omp_target (struct omp_region *re if (child_cfun != NULL) gcc_checking_assert (!child_cfun->cfg); - entry_bb = region->entry; - exit_bb = region->exit; - if (gimple_omp_target_kind (entry_stmt) == GF_OMP_TARGET_KIND_OACC_KERNELS) { if (!gimple_in_ssa_p (cfun)) @@ -9814,13 +9841,7 @@ expand_omp_target (struct omp_region *re } } - basic_block entry_succ_bb = single_succ (entry_bb); - if (offloaded && !gimple_in_ssa_p (cfun)) - { - gsi = gsi_last_bb (entry_succ_bb); - if (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_ENTRY_END) - gsi_remove (&gsi, true); - } + tree data_arg = gimple_omp_target_data_arg (entry_stmt); if (offloaded && do_splitoff) @@ -9840,7 +9861,6 @@ expand_omp_target (struct omp_region *re a 
function call that has been inlined, the original PARM_DECL .OMP_DATA_I may have been converted into a different local variable. In which case, we need to keep the assignment. */ - tree data_arg = gimple_omp_target_data_arg (entry_stmt); if (data_arg) { gimple_stmt_iterator gsi; @@ -9923,8 +9943,12 @@ expand_omp_target (struct omp_region *re stmt = gsi_stmt (gsi); gcc_assert (stmt && gimple_code (stmt) == gimple_code (entry_stmt)); + gsi_prev (&gsi); + stmt = gsi_stmt (gsi); e = split_block (entry_bb, stmt); +#if 0 gsi_remove (&gsi, true); +#endif entry_bb = e->dest; single_succ_edge (entry_bb)->flags = EDGE_FALLTHRU; @@ -9932,11 +9956,16 @@ expand_omp_target (struct omp_region *re if (exit_bb) { gsi = gsi_last_bb (exit_bb); + gimple ompret = gsi_stmt (gsi); gcc_assert (!gsi_end_p (gsi) - && gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_RETURN); + && gimple_code (ompret) == GIMPLE_OMP_RETURN); stmt = gimple_build_return (NULL); gsi_insert_after (&gsi, stmt, GSI_SAME_STMT); +#if 0 gsi_remove (&gsi, true); +#endif + edge e1 = split_block (exit_bb, ompret); + exit_bb = e1->dest; /* A vuse in single_succ (exit_bb) may use a vdef from the region which is about to be split off. Mark the vdef for renaming. */ @@ -9955,6 +9984,9 @@ expand_omp_target (struct omp_region *re else block = gimple_block (entry_stmt); + /* Make sure we don't try to copy these. */ + gimple_omp_target_set_child_fn (entry_stmt, NULL); + gimple_omp_target_set_data_arg (entry_stmt, NULL); new_bb = move_sese_region_to_fn (child_cfun, entry_bb, exit_bb, block); if (exit_bb) single_succ_edge (new_bb)->flags = EDGE_FALLTHRU; @@ -9979,6 +10011,8 @@ expand_omp_target (struct omp_region *re /* Inform the callgraph about the new function. */ DECL_STRUCT_FUNCTION (child_fn)->curr_properties = cfun->curr_properties; + DECL_STRUCT_FUNCTION (child_fn)->curr_properties &= ~PROP_gimple_eomp; + cgraph_node::add_new_function (child_fn, true); cgraph_node::get (child_fn)->parallelized_function = 1; @@ -10088,7 +10122,7 @@ expand_omp_target (struct omp_region *re clause_loc = OMP_CLAUSE_LOCATION (c); } else - clause_loc = gimple_location (entry_stmt); + clause_loc = entry_loc; /* Ensure 'device' is of the correct type. */ device = fold_convert_loc (clause_loc, integer_type_node, device); @@ -10147,7 +10181,7 @@ expand_omp_target (struct omp_region *re gsi = gsi_last_bb (new_bb); t = gimple_omp_target_data_arg (entry_stmt); - if (t == NULL) + if (data_arg == NULL) { t1 = size_zero_node; t2 = build_zero_cst (ptr_type_node); @@ -10156,11 +10190,11 @@ expand_omp_target (struct omp_region *re } else { - t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (t, 1)))); + t1 = TYPE_MAX_VALUE (TYPE_DOMAIN (TREE_TYPE (TREE_VEC_ELT (data_arg, 1)))); t1 = size_binop (PLUS_EXPR, t1, size_int (1)); - t2 = build_fold_addr_expr (TREE_VEC_ELT (t, 0)); - t3 = build_fold_addr_expr (TREE_VEC_ELT (t, 1)); - t4 = build_fold_addr_expr (TREE_VEC_ELT (t, 2)); + t2 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 0)); + t3 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 1)); + t4 = build_fold_addr_expr (TREE_VEC_ELT (data_arg, 2)); } gimple g; @@ -10209,8 +10243,7 @@ expand_omp_target (struct omp_region *re /* Default values for num_gangs, num_workers, and vector_length. 
*/ t_num_gangs = t_num_workers = t_vector_length - = fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, integer_one_node); + = fold_convert_loc (entry_loc, integer_type_node, integer_one_node); /* ..., but if present, use the value specified by the respective clause, making sure that are of the correct type. */ c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS); @@ -10241,8 +10274,7 @@ expand_omp_target (struct omp_region *re int t_wait_idx; /* Default values for t_async. */ - t_async = fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + t_async = fold_convert_loc (entry_loc, integer_type_node, build_int_cst (integer_type_node, GOMP_ASYNC_SYNC)); /* ..., but if present, use the value specified by the respective @@ -10257,8 +10289,7 @@ expand_omp_target (struct omp_region *re /* Save the index, and... */ t_wait_idx = args.length (); /* ... push a default value. */ - args.quick_push (fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + args.quick_push (fold_convert_loc (entry_loc, integer_type_node, integer_zero_node)); c = find_omp_clause (clauses, OMP_CLAUSE_WAIT); if (c) @@ -10279,8 +10310,7 @@ expand_omp_target (struct omp_region *re /* Now that we know the number, replace the default value. */ args.ordered_remove (t_wait_idx); args.quick_insert (t_wait_idx, - fold_convert_loc (gimple_location (entry_stmt), - integer_type_node, + fold_convert_loc (entry_loc, integer_type_node, build_int_cst (integer_type_node, n))); } } @@ -10290,7 +10320,7 @@ expand_omp_target (struct omp_region *re } g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args); - gimple_set_location (g, gimple_location (entry_stmt)); + gimple_set_location (g, entry_loc); gsi_insert_before (&gsi, g, GSI_SAME_STMT); if (!offloaded) { @@ -10310,6 +10340,23 @@ expand_omp_target (struct omp_region *re update_ssa (TODO_update_ssa_only_virtuals); } +static bool +expand_region_inner_p (omp_region *region) +{ + if (!region->inner) + return false; + + if (region->type != GIMPLE_OMP_TARGET) + return true; + if (was_offloaded_p (current_function_decl)) + return true; + + gomp_target *entry_stmt = as_a <gomp_target *> (last_stmt (region->entry)); + bool offloaded = is_gimple_omp_offloaded (entry_stmt); + + return !offloaded || !is_gimple_omp_oacc (entry_stmt); +} + /* Expand the parallel region tree rooted at REGION. Expansion proceeds in depth-first order. Innermost regions are expanded first. This way, parallel regions that require a new function to @@ -10340,8 +10387,7 @@ expand_omp (struct omp_region *region) if (region->type == GIMPLE_OMP_FOR && gimple_omp_for_combined_p (last_stmt (region->entry))) inner_stmt = last_stmt (region->inner->entry); - - if (region->inner) + if (expand_region_inner_p (region)) expand_omp (region->inner); saved_location = input_location; @@ -10439,7 +10485,9 @@ find_omp_target_region_data (struct omp_ region->gwv_this |= MASK_WORKER; if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH)) region->gwv_this |= MASK_VECTOR; - region->broadcast_array = gimple_omp_target_broadcast_array (stmt); + basic_block entry_succ = single_succ (region->entry); + gimple ee_stmt = last_stmt (entry_succ); + region->broadcast_array = gimple_op (ee_stmt, 0); } /* Helper for build_omp_regions. 
Scan the dominator tree starting at @@ -10666,6 +10714,7 @@ generate_vector_broadcast (tree dest_var conv1 = gimple_build_assign (casted_var, NOP_EXPR, var); gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING); + retval = conv1; } tree decl = builtin_decl_explicit (fn); @@ -10709,19 +10758,21 @@ generate_oacc_broadcast (omp_region *reg omp_region *parent = enclosing_target_region (region); tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE); - tree ptr = create_tmp_var (build_pointer_type (elttype)); - gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR, + tree ptrtype = build_pointer_type (elttype); + tree ptr1 = make_ssa_name (ptrtype); + tree ptr2 = make_ssa_name (ptrtype); + gassign *cast1 = gimple_build_assign (ptr1, NOP_EXPR, parent->broadcast_array); gsi_insert_after (&where, cast1, GSI_NEW_STMT); - gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var); + gassign *st = gimple_build_assign (build_simple_mem_ref (ptr1), var); gsi_insert_after (&where, st, GSI_NEW_STMT); gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR, + gassign *cast2 = gimple_build_assign (ptr2, NOP_EXPR, parent->broadcast_array); gsi_insert_after (&where, cast2, GSI_NEW_STMT); - gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr)); + gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr2)); gsi_insert_after (&where, ld, GSI_NEW_STMT); gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); @@ -10735,7 +10786,8 @@ generate_oacc_broadcast (omp_region *reg the bits MASK_VECTOR and/or MASK_WORKER. */ static void -make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask) +make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask, + bool set_dominator) { basic_block cond_bb = true_edge->src; @@ -10747,7 +10799,7 @@ make_predication_test (edge true_edge, b if (mask & MASK_VECTOR) { gimple call = gimple_build_call (decl, 1, integer_zero_node); - vvar = create_tmp_var (unsigned_type_node); + vvar = make_ssa_name (unsigned_type_node); comp_var = vvar; gimple_call_set_lhs (call, vvar); gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); @@ -10755,14 +10807,14 @@ make_predication_test (edge true_edge, b if (mask & MASK_WORKER) { gimple call = gimple_build_call (decl, 1, integer_one_node); - wvar = create_tmp_var (unsigned_type_node); + wvar = make_ssa_name (unsigned_type_node); comp_var = wvar; gimple_call_set_lhs (call, wvar); gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); } if (wvar && vvar) { - comp_var = create_tmp_var (unsigned_type_node); + comp_var = make_ssa_name (unsigned_type_node); gassign *ior = gimple_build_assign (comp_var, BIT_IOR_EXPR, wvar, vvar); gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT); } @@ -10782,6 +10834,9 @@ make_predication_test (edge true_edge, b basic_block false_abnorm_bb = split_edge (e); edge abnorm_edge = single_succ_edge (false_abnorm_bb); abnorm_edge->flags |= EDGE_ABNORMAL; + + if (set_dominator) + set_immediate_dominator (CDI_DOMINATORS, skip_dest_bb, cond_bb); } /* Apply OpenACC predication to basic block BB which is in @@ -10791,6 +10846,8 @@ make_predication_test (edge true_edge, b static void predicate_bb (basic_block bb, struct omp_region *parent, int mask) { + bool set_dominator = true; + /* We handle worker-single vector-partitioned loops by jumping around them if not in the controlling worker. Don't insert unnecessary (and incorrect) predication. 
*/ @@ -10816,8 +10873,8 @@ predicate_bb (basic_block bb, struct omp if (gimple_code (stmt) == GIMPLE_COND) { - tree cond_var = create_tmp_var (boolean_type_node); - tree broadcast_cond = create_tmp_var (boolean_type_node); + tree cond_var = make_ssa_name (boolean_type_node); + tree broadcast_cond = make_ssa_name (boolean_type_node); gassign *asgn = gimple_build_assign (cond_var, gimple_cond_code (stmt), gimple_cond_lhs (stmt), @@ -10830,30 +10887,36 @@ predicate_bb (basic_block bb, struct omp mask); edge e = split_block (bb, splitpoint); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); e->flags = EDGE_ABNORMAL; skip_dest_bb = e->dest; gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR, broadcast_cond, boolean_true_node); + update_stmt (stmt); } else if (gimple_code (stmt) == GIMPLE_SWITCH) { gswitch *sstmt = as_a <gswitch *> (stmt); tree var = gimple_switch_index (sstmt); - tree new_var = create_tmp_var (TREE_TYPE (var)); + tree new_var = make_ssa_name (TREE_TYPE (var)); +#if 0 gassign *asgn = gimple_build_assign (new_var, var); gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - +#endif + gsi_prev (&gsi); gimple splitpoint = generate_oacc_broadcast (parent, new_var, var, - gsi_asgn, mask); + gsi, mask); edge e = split_block (bb, splitpoint); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); e->flags = EDGE_ABNORMAL; skip_dest_bb = e->dest; gimple_switch_set_index (sstmt, new_var); + update_stmt (stmt); } else if (is_gimple_omp (stmt)) { @@ -10876,6 +10939,7 @@ predicate_bb (basic_block bb, struct omp gimple_stmt_iterator head_gsi = gsi_start_bb (bb); gsi_prev (&head_gsi); edge e0 = split_block (bb, gsi_stmt (head_gsi)); + set_immediate_dominator (CDI_DOMINATORS, e0->dest, e0->src); int mask2 = mask; if (code == GIMPLE_OMP_FOR) mask2 &= ~MASK_VECTOR; @@ -10885,7 +10949,7 @@ predicate_bb (basic_block bb, struct omp so we just need to make one branch around the entire loop. */ inner->entry = e0->dest; - make_predication_test (e0, skip_dest_bb, mask2); + make_predication_test (e0, skip_dest_bb, mask2, true); return; } basic_block for_block = e0->dest; @@ -10896,9 +10960,9 @@ predicate_bb (basic_block bb, struct omp edge e2 = split_block (for_block, split_stmt); basic_block bb2 = e2->dest; - make_predication_test (e0, bb2, mask); + make_predication_test (e0, bb2, mask, true); make_predication_test (single_pred_edge (bb3), skip_dest_bb, - mask2); + mask2, true); inner->entry = bb3; return; } @@ -10917,6 +10981,7 @@ predicate_bb (basic_block bb, struct omp if (!split_stmt) return; edge e = split_block (bb, split_stmt); + set_immediate_dominator (CDI_DOMINATORS, e->dest, e->src); skip_dest_bb = e->dest; if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) { @@ -10945,6 +11010,8 @@ predicate_bb (basic_block bb, struct omp gsi_prev (&gsi); if (gsi_stmt (gsi) == 0) return; + if (get_immediate_dominator (CDI_DOMINATORS, skip_dest_bb) != bb) + set_dominator = false; } if (skip_dest_bb != NULL) @@ -10952,24 +11019,31 @@ predicate_bb (basic_block bb, struct omp gimple_stmt_iterator head_gsi = gsi_start_bb (bb); gsi_prev (&head_gsi); edge e2 = split_block (bb, gsi_stmt (head_gsi)); - make_predication_test (e2, skip_dest_bb, mask); + set_immediate_dominator (CDI_DOMINATORS, e2->dest, e2->src); + make_predication_test (e2, skip_dest_bb, mask, set_dominator); } } /* Walk the dominator tree starting at BB to collect basic blocks in WORKLIST which need OpenACC vector predication applied to them. 
*/ -static void +static bool find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist) { + bool ret = false; struct omp_region *parent = *bb_region_map->get (bb); if (required_predication_mask (parent) != 0) - worklist.safe_push (bb); + { + worklist.safe_push (bb); + ret = true; + } + basic_block son; for (son = first_dom_son (CDI_DOMINATORS, bb); son; son = next_dom_son (CDI_DOMINATORS, son)) - find_predicatable_bbs (son, worklist); + ret |= find_predicatable_bbs (son, worklist); + return ret; } /* Apply OpenACC vector predication to all basic blocks. HEAD_BB is the @@ -10979,7 +11053,9 @@ static void predicate_omp_regions (basic_block head_bb) { vec<basic_block> worklist = vNULL; - find_predicatable_bbs (head_bb, worklist); + if (!find_predicatable_bbs (head_bb, worklist)) + return; + int i; basic_block bb; FOR_EACH_VEC_ELT (worklist, i, bb) @@ -10988,6 +11064,11 @@ predicate_omp_regions (basic_block head_ int mask = required_predication_mask (region); predicate_bb (bb, region, mask); } + free_dominance_info (CDI_DOMINATORS); + calculate_dominance_info (CDI_DOMINATORS); + mark_virtual_operands_for_renaming (cfun); + update_ssa (TODO_update_ssa); + verify_ssa (true, true); } /* USE and GET sets for variable broadcasting. */ @@ -11176,7 +11257,8 @@ oacc_broadcast (basic_block entry_bb, ba /* Currently, subroutines aren't supported. */ gcc_assert (!lookup_attribute ("oacc function", - DECL_ATTRIBUTES (current_function_decl))); + DECL_ATTRIBUTES (current_function_decl)) + || was_offloaded_p (current_function_decl)); /* Populate live_in. */ oacc_populate_live_in (entry_bb, region); @@ -11236,7 +11318,7 @@ oacc_broadcast (basic_block entry_bb, ba gsi_prev (&gsi); edge e2 = split_block (entry_bb, gsi_stmt (gsi)); e2->flags |= EDGE_ABNORMAL; - make_predication_test (e2, dest_bb, mask); + make_predication_test (e2, dest_bb, mask, true); /* Update entry_bb. */ entry_bb = dest_bb; @@ -11249,7 +11331,7 @@ oacc_broadcast (basic_block entry_bb, ba /* Main entry point for expanding OMP-GIMPLE into runtime calls. 
*/ static unsigned int -execute_expand_omp (void) +execute_expand_omp (bool first) { bb_region_map = new hash_map<basic_block, omp_region *>; @@ -11264,7 +11346,8 @@ execute_expand_omp (void) fprintf (dump_file, "\n"); } - predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); + if (!first) + predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); remove_exit_barriers (root_omp_region); @@ -11317,9 +11400,10 @@ public: if (!gate) return 0; - return execute_expand_omp (); + return execute_expand_omp (true); } + opt_pass * clone () { return new pass_expand_omp (m_ctxt); } }; // class pass_expand_omp } // anon namespace @@ -11400,9 +11484,9 @@ public: } virtual unsigned int execute (function *) { - unsigned res = execute_expand_omp (); + unsigned res = execute_expand_omp (false); release_dangling_ssa_names (); - return res; + return res | TODO_update_ssa; } opt_pass * clone () { return new pass_expand_omp_ssa (m_ctxt); } @@ -12562,7 +12646,7 @@ lower_omp_for (gimple_stmt_iterator *gsi if (!is_gimple_min_invariant (*rhs_p)) *rhs_p = get_formal_tmp_var (*rhs_p, &body); - rhs_p = &TREE_OPERAND (gimple_omp_for_incr (stmt, i), 1); + rhs_p = gimple_omp_for_incr_ptr (stmt, i); if (!is_gimple_min_invariant (*rhs_p)) *rhs_p = get_formal_tmp_var (*rhs_p, &body); } @@ -13547,7 +13631,7 @@ lower_omp_target (gimple_stmt_iterator * if (offloaded) { - gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end ()); + gimple_seq_add_stmt (&new_body, gimple_build_omp_entry_end (ctx->worker_sync_elt)); if (has_reduction) { gimple_seq_add_seq (&irlist, tgt_body); @@ -13583,7 +13667,6 @@ lower_omp_target (gimple_stmt_iterator * gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT); gimple_omp_target_set_ganglocal_size (stmt, sz); - gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt); pop_gimplify_context (NULL); } Index: gcc/pass_manager.h =================================================================== --- gcc/pass_manager.h (revision 224547) +++ gcc/pass_manager.h (working copy) @@ -28,6 +28,7 @@ struct register_pass_info; #define GCC_PASS_LISTS \ DEF_PASS_LIST (all_lowering_passes) \ DEF_PASS_LIST (all_small_ipa_passes) \ + DEF_PASS_LIST (all_local_opt_passes) \ DEF_PASS_LIST (all_regular_ipa_passes) \ DEF_PASS_LIST (all_late_ipa_passes) \ DEF_PASS_LIST (all_passes) @@ -82,6 +83,7 @@ public: /* The root of the compilation pass tree, once constructed. */ opt_pass *all_passes; opt_pass *all_small_ipa_passes; + opt_pass *all_local_opt_passes; opt_pass *all_lowering_passes; opt_pass *all_regular_ipa_passes; opt_pass *all_late_ipa_passes; Index: gcc/passes.c =================================================================== --- gcc/passes.c (revision 224547) +++ gcc/passes.c (working copy) @@ -454,8 +454,12 @@ public: /* opt_pass methods: */ virtual bool gate (function *) { - /* Don't bother doing anything if the program has errors. 
*/ - return (!seen_error () && !in_lto_p); + if (seen_error ()) + return false; +#ifdef ACCEL_COMPILER + return true; +#endif + return !in_lto_p; } }; // class pass_local_optimization_passes @@ -952,6 +956,7 @@ pass_manager::dump_passes () const dump_pass_list (all_lowering_passes, 1); dump_pass_list (all_small_ipa_passes, 1); + dump_pass_list (all_local_opt_passes, 1); dump_pass_list (all_regular_ipa_passes, 1); dump_pass_list (all_late_ipa_passes, 1); dump_pass_list (all_passes, 1); @@ -1463,6 +1468,8 @@ pass_manager::register_pass (struct regi if (!success || all_instances) success |= position_pass (pass_info, &all_small_ipa_passes); if (!success || all_instances) + success |= position_pass (pass_info, &all_local_opt_passes); + if (!success || all_instances) success |= position_pass (pass_info, &all_regular_ipa_passes); if (!success || all_instances) success |= position_pass (pass_info, &all_late_ipa_passes); @@ -1515,9 +1522,10 @@ pass_manager::register_pass (struct regi If we are optimizing, compile is then invoked: compile () - ipa_passes () -> all_small_ipa_passes + ipa_passes () -> all_small_ipa_passes, + all_local_opt_passes -> Analysis of all_regular_ipa_passes - * possible LTO streaming at copmilation time * + * possible LTO streaming at compilation time * -> Execution of all_regular_ipa_passes * possible LTO streaming at link time * -> all_late_ipa_passes @@ -1541,8 +1549,8 @@ pass_manager::operator delete (void *ptr } pass_manager::pass_manager (context *ctxt) -: all_passes (NULL), all_small_ipa_passes (NULL), all_lowering_passes (NULL), - all_regular_ipa_passes (NULL), +: all_passes (NULL), all_small_ipa_passes (NULL), all_local_opt_passes (NULL), + all_lowering_passes (NULL), all_regular_ipa_passes (NULL), all_late_ipa_passes (NULL), passes_by_id (NULL), passes_by_id_size (0), m_ctxt (ctxt) { @@ -1592,6 +1600,7 @@ pass_manager::pass_manager (context *ctx /* Register the passes with the tree dump code. */ register_dump_files (all_lowering_passes); register_dump_files (all_small_ipa_passes); + register_dump_files (all_local_opt_passes); register_dump_files (all_regular_ipa_passes); register_dump_files (all_late_ipa_passes); register_dump_files (all_passes); @@ -2463,24 +2472,15 @@ ipa_write_summaries_1 (lto_symtab_encode lto_delete_out_decl_state (state); } -/* Write out summaries for all the nodes in the callgraph. */ - -void -ipa_write_summaries (void) +static lto_symtab_encoder_t +build_symtab_encoder (void) { - lto_symtab_encoder_t encoder; + lto_symtab_encoder_t encoder = lto_symtab_encoder_new (false); int i, order_pos; varpool_node *vnode; struct cgraph_node *node; struct cgraph_node **order; - if ((!flag_generate_lto && !flag_generate_offload) || seen_error ()) - return; - - select_what_to_stream (); - - encoder = lto_symtab_encoder_new (false); - /* Create the callgraph set in the same order used in cgraph_expand_all_functions. This mostly facilitates debugging, since it causes the gimple file to be processed in the same order @@ -2515,10 +2515,50 @@ ipa_write_summaries (void) FOR_EACH_DEFINED_VARIABLE (vnode) if (vnode->need_lto_streaming) lto_set_symtab_encoder_in_partition (encoder, vnode); + free (order); + return encoder; +} +/* Write out summaries for all the nodes in the callgraph. 
*/ + +void +ipa_write_summaries (void) +{ + if ((!flag_generate_lto && !flag_generate_offload) || seen_error ()) + return; + + select_what_to_stream (); + lto_symtab_encoder_t encoder = build_symtab_encoder (); ipa_write_summaries_1 (compute_ltrans_boundary (encoder)); +} - free (order); +void +write_offload_lto (void) +{ + if (!flag_generate_offload || seen_error ()) + return; + + lto_stream_offload_p = true; + + select_what_to_stream (); + lto_symtab_encoder_t encoder = build_symtab_encoder (); + encoder = compute_ltrans_boundary (encoder); + + struct lto_out_decl_state *state = lto_new_out_decl_state (); + state->symtab_node_encoder = encoder; + + lto_output_init_mode_table (); + lto_push_out_decl_state (state); + + gcc_assert (!flag_wpa); + + write_lto (); + + gcc_assert (lto_get_out_decl_state () == state); + lto_pop_out_decl_state (); + lto_delete_out_decl_state (state); + + lto_stream_offload_p = false; } /* Same as execute_pass_list but assume that subpasses of IPA passes Index: gcc/passes.def =================================================================== --- gcc/passes.def (revision 224547) +++ gcc/passes.def (working copy) @@ -60,6 +60,10 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_early_warn_uninitialized); NEXT_PASS (pass_nothrow); POP_INSERT_PASSES () + TERMINATE_PASS_LIST () + + /* Local optimization passes. */ + INSERT_PASSES_AFTER (all_local_opt_passes) NEXT_PASS (pass_chkp_instrumentation_passes); PUSH_INSERT_PASSES_WITHIN (pass_chkp_instrumentation_passes) @@ -70,6 +74,7 @@ along with GCC; see the file COPYING3. NEXT_PASS (pass_local_optimization_passes); PUSH_INSERT_PASSES_WITHIN (pass_local_optimization_passes) + NEXT_PASS (pass_expand_omp_ssa); NEXT_PASS (pass_fixup_cfg); NEXT_PASS (pass_rebuild_cgraph_edges); NEXT_PASS (pass_inline_parameters); Index: gcc/ssa-iterators.h =================================================================== --- gcc/ssa-iterators.h (revision 224547) +++ gcc/ssa-iterators.h (working copy) @@ -609,17 +609,21 @@ op_iter_init (ssa_op_iter *ptr, gimple s { switch (gimple_code (stmt)) { - case GIMPLE_ASSIGN: - case GIMPLE_CALL: - ptr->numops = 1; - break; - case GIMPLE_ASM: - ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt)); - break; - default: - ptr->numops = 0; - flags &= ~(SSA_OP_DEF | SSA_OP_VDEF); - break; + case GIMPLE_ASSIGN: + case GIMPLE_CALL: + case GIMPLE_OMP_CONTINUE: + ptr->numops = 1; + break; + case GIMPLE_ASM: + ptr->numops = gimple_asm_noutputs (as_a <gasm *> (stmt)); + break; + case GIMPLE_OMP_FOR: + ptr->numops = gimple_omp_for_collapse (stmt); + break; + default: + ptr->numops = 0; + flags &= ~(SSA_OP_DEF | SSA_OP_VDEF); + break; } } ptr->uses = (flags & (SSA_OP_USE|SSA_OP_VUSE)) ? gimple_use_ops (stmt) : NULL; Index: gcc/tree-cfg.c =================================================================== --- gcc/tree-cfg.c (revision 224547) +++ gcc/tree-cfg.c (working copy) @@ -6649,6 +6649,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p case GIMPLE_OMP_RETURN: case GIMPLE_OMP_CONTINUE: + case GIMPLE_OMP_ENTRY_END: break; default: if (is_gimple_omp (stmt)) @@ -6659,7 +6660,7 @@ move_stmt_r (gimple_stmt_iterator *gsi_p function. 
*/ bool save_remap_decls_p = p->remap_decls_p; p->remap_decls_p = false; - *handled_ops_p = true; + // *handled_ops_p = true; walk_gimple_seq_mod (gimple_omp_body_ptr (stmt), move_stmt_r, move_stmt_op, wi); Index: gcc/tree-into-ssa.c =================================================================== --- gcc/tree-into-ssa.c (revision 224547) +++ gcc/tree-into-ssa.c (working copy) @@ -2442,6 +2442,7 @@ pass_build_ssa::execute (function *fun) SET_SSA_NAME_VAR_OR_IDENTIFIER (name, DECL_NAME (decl)); } + verify_ssa (false, true); return 0; } Index: gcc/tree-nested.c =================================================================== --- gcc/tree-nested.c (revision 224547) +++ gcc/tree-nested.c (working copy) @@ -673,14 +673,8 @@ walk_gimple_omp_for (gomp_for *for_stmt, wi.is_lhs = false; walk_tree (gimple_omp_for_final_ptr (for_stmt, i), callback_op, &wi, NULL); - - t = gimple_omp_for_incr (for_stmt, i); - gcc_assert (BINARY_CLASS_P (t)); - wi.val_only = false; - walk_tree (&TREE_OPERAND (t, 0), callback_op, &wi, NULL); - wi.val_only = true; - wi.is_lhs = false; - walk_tree (&TREE_OPERAND (t, 1), callback_op, &wi, NULL); + walk_tree (gimple_omp_for_incr_ptr (for_stmt, i), callback_op, + &wi, NULL); } seq = gsi_seq (wi.gsi); Index: gcc/tree-ssa-operands.c =================================================================== --- gcc/tree-ssa-operands.c (revision 224547) +++ gcc/tree-ssa-operands.c (working copy) @@ -942,11 +942,18 @@ parse_ssa_operands (struct function *fn, append_vuse (gimple_vop (fn)); goto do_default; + case GIMPLE_OMP_FOR: + start = gimple_omp_for_collapse (stmt); + for (i = 0; i < start; i++) + get_expr_operands (fn, stmt, gimple_op_ptr (stmt, i), opf_def); + goto do_default; + case GIMPLE_CALL: /* Add call-clobbered operands, if needed. */ maybe_add_call_vops (fn, as_a <gcall *> (stmt)); /* FALLTHRU */ + case GIMPLE_OMP_CONTINUE: case GIMPLE_ASSIGN: get_expr_operands (fn, stmt, gimple_op_ptr (stmt, 0), opf_def); start = 1; ^ permalink raw reply [flat|nested] 23+ messages in thread
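To make the operand-layout change in the gcc/gimple.h hunks concrete: after this patch, a GIMPLE_OMP_FOR with collapse count C stores its per-iteration trees in the generic operand vector rather than in the on-the-side iter array (which now keeps only the two tree codes). A minimal sketch of how the patch's accessors map onto the operand vector; walk_omp_for_operands is a hypothetical helper for illustration only:

    /* Operand layout of a gomp_for with collapse count C:
         op[0  .. C-1 ]  index variables
         op[C  .. 2C-1]  initial values
         op[2C .. 3C-1]  final values
         op[3C .. 4C-1]  increments
       The condition and increment tree codes remain in iter[i].  */

    static void
    walk_omp_for_operands (gomp_for *gs)
    {
      size_t c = gimple_omp_for_collapse (gs);
      for (size_t i = 0; i < c; i++)
        {
          tree index   = gimple_omp_for_index (gs, i);    /* gimple_op (gs, i)         */
          tree initial = gimple_omp_for_initial (gs, i);  /* gimple_op (gs, i + c)     */
          tree final   = gimple_omp_for_final (gs, i);    /* gimple_op (gs, i + c * 2) */
          tree incr    = gimple_omp_for_incr (gs, i);     /* gimple_op (gs, i + c * 3) */
          /* ... process the trees ...  */
        }
    }

This is also why gimple_build_omp_for now allocates collapse * 4 operands, and why the LTO streamers above read and write exactly four trees plus two tree codes per collapse level.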
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 10:44 ` Bernd Schmidt
@ 2015-06-19 12:32 ` Jakub Jelinek
  2015-06-19 13:07   ` Bernd Schmidt
  2015-06-22 14:00   ` Julian Brown
  0 siblings, 2 replies; 23+ messages in thread
From: Jakub Jelinek @ 2015-06-19 12:32 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote:
> On 05/28/2015 05:08 PM, Jakub Jelinek wrote:
> 
> >I understand it is more work, I'd just like to ask that when designing stuff
> >for the OpenACC offloading you (plural) try to take the other offloading
> >devices and host fallback into account.
> 
> The problem is that many of the transformations we need to do are really GPU
> specific, and with the current structure of omplow/ompexp they are being
> done in the host compiler. The offloading scheme we decided on does not give
> us the means to write out multiple versions of an offloaded function where
> each target gets a different one. For that reason I think we should postpone
> these lowering decisions until we're in the accel compiler, where they could
> be controlled by target hooks, and over the last two weeks I've been doing
> some experiments to see how that could be achieved.

Emitting PTX specific code from current ompexp is highly undesirable of
course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
around for too long either, they were never meant to be used in low gimple,
and even all the early optimization passes could screw them up badly, they
are also very much OpenMP or OpenACC specific, rather than representing
language neutral behavior, so there is a problem that you'd need M x N
different expansions of those constructs, which is not really maintainable
(M being the number of supported offloading standards, right now 2, and N
the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

I wonder why struct loop flags and other info together with function
attributes and/or cgraph flags and other info aren't sufficient for the
OpenACC needs.
Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?

Why can't the execution model (normal, vector-single and worker-single)
be simply attributes on functions or cgraph node flags and the kind of
#acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
already is?

I mean, you need to implement the PTX broadcasting etc. for the 3 different
modes: one where each thread executes everything; another where only the
first thread in a warp executes everything, and the other threads only call
functions with the same mode or execute specially marked loops; and another
where only a single thread in the CTA executes everything, and the other
threads only call functions with the same mode or execute specially marked
loops.  Because if you have #acc routine (something) ... that is just an
attribute of a function, not really some construct in the body of it.

The vector level parallelism is something where on the host/host_noshm/XeonPhi
(dunno about HSA) you want vectorization to happen, and that is already
implemented in the vectorizer pass, implementing it again elsewhere is
highly undesirable.  For PTX the implementation is of course different,
and the vectorizer is likely not the right pass to handle them, but why
can't the same struct loop flags be used by the pass that handles the
conditionalization of execution for 2 of the 3 above modes?
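For reference, this is roughly what the simd path already records on struct
loop during OpenMP lowering (a minimal sketch assuming the 2015-era fields;
simduid stands for the decl omp-low creates for the construct):

    struct loop *loop = ...;       /* loop created for the simd construct */
    loop->safelen = INT_MAX;       /* no loop-carried dependences to worry about */
    loop->force_vectorize = true;  /* vectorize even against the cost model */
    loop->simduid = simduid;       /* ties the loop to its "omp simd array" accesses */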
Then there is the worker level parallelism, but I'd hope it can be handled
similarly, and supposedly the pass that handles vector-single and
worker-single lowering for PTX could do the same for non-PTX targets - if
the OpenACC execution model is that all the (e.g. pthread based) threads
are started immediately, and in worker-single mode the work is skipped on
all threads other than the first, then those targets need to behave
similarly to PTX, just probably using library calls rather than PTX
builtins to query the thread number.

	Jakub

^ permalink raw reply	[flat|nested] 23+ messages in thread
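A hypothetical sketch of the worker-single lowering just described, shared
between target kinds; GOACC_thread_num is an illustrative stand-in, not an
existing builtin or library entry point:

    /* Worker-single region; same shape on PTX and on a pthreads host,
       only the thread-number query differs.  */
    if (GOACC_thread_num () == 0)   /* PTX: a %tid read; host: a library call */
      {
        /* statements executed by the first thread only */
      }
    /* reconvergence point: all threads continue here */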
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 12:32 ` Jakub Jelinek
@ 2015-06-19 13:07   ` Bernd Schmidt
  2015-06-19 14:10     ` Jakub Jelinek
  0 siblings, 1 reply; 23+ messages in thread
From: Bernd Schmidt @ 2015-06-19 13:07 UTC (permalink / raw)
  To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On 06/19/2015 02:25 PM, Jakub Jelinek wrote:
> Emitting PTX specific code from current ompexp is highly undesirable of
> course, but I must say I'm not a big fan of keeping the GOMP_* gimple trees
> around for too long either, they were never meant to be used in low gimple,
> and even all the early optimization passes could screw them up badly,

The idea is not to keep them around for very long, but I think there's no
reason why they couldn't survive a while longer. Between ompexpand and the
end of build_ssa_passes, we have (ignoring things like chkp and ubsan which
can just be turned off for offloaded functions if necessary):

      NEXT_PASS (pass_ipa_free_lang_data);
      NEXT_PASS (pass_ipa_function_and_variable_visibility);
      NEXT_PASS (pass_fixup_cfg);
      NEXT_PASS (pass_init_datastructures);
      NEXT_PASS (pass_build_ssa);
      NEXT_PASS (pass_early_warn_uninitialized);
      NEXT_PASS (pass_nothrow);

Nothing in there strikes me as particularly problematic if we can make
things like GIMPLE_OMP_FOR survive into-ssa - which I think I did in my
patch. Besides, the OpenACC kernels path generates them in SSA form anyway
during parloops, so one could make the argument that this is a step towards
better consistency.

> they are also very much OpenMP or OpenACC specific, rather than representing
> language neutral behavior, so there is a problem that you'd need M x N
> different expansions of those constructs, which is not really maintainable
> (M being the number of supported offloading standards, right now 2, and N
> the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).

Well, that's a problem we have anyway, independently of how we implement all
these devices and standards. I don't see how that's relevant to the
discussion.

> I wonder why struct loop flags and other info together with function
> attributes and/or cgraph flags and other info aren't sufficient for the
> OpenACC needs.
> Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?
> 
> Why can't the execution model (normal, vector-single and worker-single)
> be simply attributes on functions or cgraph node flags and the kind of
> #acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
> already is?

We haven't looked at Cilk+ or anything like that. You suggest using
attributes and flags, but at what point do you intend to actually lower the
IR to represent what's going on?

> The vector level parallelism is something where on the host/host_noshm/XeonPhi
> (dunno about HSA) you want vectorization to happen, and that is already
> implemented in the vectorizer pass, implementing it again elsewhere is
> highly undesirable.  For PTX the implementation is of course different,
> and the vectorizer is likely not the right pass to handle them, but why
> can't the same struct loop flags be used by the pass that handles the
> conditionalization of execution for 2 of the 3 above modes?

Agreed on wanting the vectorizer to handle things for "normal" machines,
that is one of the motivations for pushing the lowering past the offload LTO
writeout stage.
The problem with OpenACC on GPUs is that the predication really changes the CFG and the data flow - I fear unpredictable effects if we let any optimizers run before lowering OpenACC to the point where we actually represent what's going on in the function. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points
  2015-06-19 13:07 ` Bernd Schmidt
@ 2015-06-19 14:10   ` Jakub Jelinek
  2015-06-22 14:04     ` Bernd Schmidt
  2015-06-24 13:37     ` Bernd Schmidt
  0 siblings, 2 replies; 23+ messages in thread
From: Jakub Jelinek @ 2015-06-19 14:10 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown

On Fri, Jun 19, 2015 at 03:03:38PM +0200, Bernd Schmidt wrote:
> >they are also very much OpenMP or OpenACC specific, rather than representing
> >language neutral behavior, so there is a problem that you'd need M x N
> >different expansions of those constructs, which is not really maintainable
> >(M being the number of supported offloading standards, right now 2, and N
> >the number of different offloading devices (host, XeonPhi, PTX, HSA, ...)).
> 
> Well, that's a problem we have anyway, independently of how we implement all
> these devices and standards. I don't see how that's relevant to the
> discussion.

It is relevant, because if you lower early (omplower/ompexp) into some IL
form common to all the offloading standards, then it is M + N.

> >I wonder why struct loop flags and other info together with function
> >attributes and/or cgraph flags and other info aren't sufficient for the
> >OpenACC needs.
> >Have you or Thomas looked at what we're doing for OpenMP simd / Cilk+ simd?
> 
> >Why can't the execution model (normal, vector-single and worker-single)
> >be simply attributes on functions or cgraph node flags and the kind of
> >#acc loop simply be flags on struct loop, like OpenMP simd / Cilk+ simd
> >already is?
> 
> We haven't looked at Cilk+ or anything like that. You suggest using
> attributes and flags, but at what point do you intend to actually lower the
> IR to represent what's going on?

I think around where the vectorizer is, perhaps before the loop
optimization pass queue (or after it, some investigation is needed).

> >The vector level parallelism is something where on the host/host_noshm/XeonPhi
> >(dunno about HSA) you want vectorization to happen, and that is already
> >implemented in the vectorizer pass, implementing it again elsewhere is
> >highly undesirable.  For PTX the implementation is of course different,
> >and the vectorizer is likely not the right pass to handle them, but why
> >can't the same struct loop flags be used by the pass that handles the
> >conditionalization of execution for 2 of the 3 above modes?
> 
> Agreed on wanting the vectorizer to handle things for "normal" machines,
> that is one of the motivations for pushing the lowering past the offload LTO
> writeout stage. The problem with OpenACC on GPUs is that the predication
> really changes the CFG and the data flow - I fear unpredictable effects if
> we let any optimizers run before lowering OpenACC to the point where we
> actually represent what's going on in the function.

I actually believe having some optimization passes in between the ompexp
and the lowering of the IR into the form PTX wants is highly desirable;
the form with the worker-single or vector-single mode lowered will contain
too complex a CFG for many optimizations to be really effective, especially
if it uses abnormal edges.  E.g. inlining supposedly would have a harder
job etc.  What exact unpredictable effects do you fear?  If the loop
remains in the IL (isn't optimized away as unreachable or isn't removed,
e.g. as a non-loop - say if it contains a noreturn call), the flags on
struct loop should still be there.
For the loop clauses (reduction always, and private/lastprivate if addressable etc.) for OpenMP simd / Cilk+ simd we use special arrays indexed by internal functions, which then during vectorization are shrunk (but in theory could be expanded too) to the right vectorization factor if vectorized, of course accesses within the loop vectorized using SIMD, and if not vectorized, shrunk to 1 element. So the PTX IL lowering pass could use the same arrays ("omp simd array" attribute) to transform the decls into thread local vars as opposed to vars shared by the whole CTA. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
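To make the "omp simd array" mechanism concrete, here is a rough sketch of the lowered form -- not the exact GIMPLE GCC emits, and the internal-function operands are simplified. A loop such as

  #pragma omp simd lastprivate(x)
  for (i = 0; i < n; i++)
    x = a[i];

is lowered along the lines of

  int D.x[MAX_VF];   /* decl carries the "omp simd array" attribute */
  for (i = 0; i < n; i++)
    {
      _lane = .GOMP_SIMD_LANE (simduid);   /* internal function */
      D.x[_lane] = a[i];
    }
  x = D.x[.GOMP_SIMD_LAST_LANE (simduid, ...)];

If the loop is vectorized, D.x is shrunk to the vectorization factor and _lane becomes the SIMD lane; if not, it is shrunk to a single element and the accesses collapse back to a scalar.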
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 14:10 ` Jakub Jelinek @ 2015-06-22 14:04 ` Bernd Schmidt 2015-06-22 14:25 ` Jakub Jelinek 2015-06-24 13:37 ` Bernd Schmidt 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-22 14:04 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > I actually believe having some optimization passes in between the ompexp > and the lowering of the IR into the form PTX wants is highly desirable, > the form with the worker-single or vector-single mode lowered will contain > too complex CFG for many optimizations to be really effective, especially > if it uses abnormal edges. E.g. inlining supposedly would have harder job > etc. What exact unpredictable effects do you fear? Mostly the ones I can't predict. But let's take one example, LICM: let's say you pull some assignment out of a loop, then you find yourself in one of two possible situations: either it's become not actually available inside the loop (because the data and control flow is not described correctly and the compiler doesn't know what's going on), or, to avoid that, you introduce additional broadcasting operations when entering the loop, which might be quite expensive. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
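A made-up sketch of the LICM hazard described here, reusing the broadcast pseudo-operation from the start of the thread (illustrative only):

  /* Before LICM: t0 is computed under the single-thread predicate and
     broadcast to the rest of the warp on each iteration.  */
  for (i = 0; i < n; i++)
    {
      if (threadIdx.x == 0)
        t0 = invariant_value;   /* loop-invariant */
      t = broadcast (t0);
      a[i] += t;
    }

  /* After hoisting the assignment: either the other threads of the warp
     read a t0 they never computed, or the compiler has to insert a
     preheader broadcast it did not account for.  */
  if (threadIdx.x == 0)
    t0 = invariant_value;
  t = broadcast (t0);           /* extra operation, needed for correctness */
  for (i = 0; i < n; i++)
    a[i] += t;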
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:04 ` Bernd Schmidt @ 2015-06-22 14:25 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 14:25 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On Mon, Jun 22, 2015 at 03:59:57PM +0200, Bernd Schmidt wrote: > On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > >I actually believe having some optimization passes in between the ompexp > >and the lowering of the IR into the form PTX wants is highly desirable, > >the form with the worker-single or vector-single mode lowered will contain > >too complex CFG for many optimizations to be really effective, especially > >if it uses abnormal edges. E.g. inlining supposedly would have harder job > >etc. What exact unpredictable effects do you fear? > > Mostly the ones I can't predict. But let's take one example, LICM: let's say > you pull some assignment out of a loop, then you find yourself in one of two > possible situations: either it's become not actually available inside the > loop (because the data and control flow is not described correctly and the > compiler doesn't know what's going on), or, to avoid that, you introduce Why do you think that would happen? E.g. for non-addressable gimple types you'd most likely just have a PHI for it on the loop. > additional broadcasting operations when entering the loop, which might be > quite expensive. If the PHI has cheap initialization, it is not a problem to emit that initialization in the loop instead of a broadcast (kind of like RA rematerialization). And by actually adding such an optimization, you help even code that has a computation in vector-single code and uses it in a vector acc loop. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
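A minimal sketch of the rematerialization idea suggested above: if the value feeding the broadcast has a cheap, thread-independent initialization, then instead of

  if (threadIdx.x == 0)
    c0 = 16;
  c = broadcast (c0);

each thread can simply recompute it locally,

  c = 16;

much as the register allocator rematerializes cheap constants rather than reloading them from a spill slot.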
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 14:10 ` Jakub Jelinek 2015-06-22 14:04 ` Bernd Schmidt @ 2015-06-24 13:37 ` Bernd Schmidt 2015-06-24 14:08 ` Jakub Jelinek 1 sibling, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-24 13:37 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > If the loop remains in the IL (isn't optimized away as unreachable or > isn't removed, e.g. as a non-loop - say if it contains a noreturn call), > the flags on struct loop should be still there. For the loop clauses > (reduction always, and private/lastprivate if addressable etc.) for > OpenMP simd / Cilk+ simd we use special arrays indexed by internal > functions, which then during vectorization are shrunk (but in theory could > be expanded too) to the right vectorization factor if vectorized, of course > accesses within the loop vectorized using SIMD, and if not vectorized, > shrunk to 1 element. I'd appreciate it if you could describe that mechanism in more detail. As far as I can tell it is very poorly commented and documented in the code. I mean, it doesn't even follow the minimal coding standards of describing function inputs:

/* Helper function of lower_rec_input_clauses, used for #pragma omp simd
   privatization.  */

static bool
lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf,
                              tree &idx, tree &lane, tree &ivar, tree &lvar)

Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-24 13:37 ` Bernd Schmidt @ 2015-06-24 14:08 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-24 14:08 UTC (permalink / raw) To: Bernd Schmidt; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell, Julian Brown On Wed, Jun 24, 2015 at 03:11:04PM +0200, Bernd Schmidt wrote: > On 06/19/2015 03:45 PM, Jakub Jelinek wrote: > > >If the loop remains in the IL (isn't optimized away as unreachable or > >isn't removed, e.g. as a non-loop - say if it contains a noreturn call), > >the flags on struct loop should be still there. For the loop clauses > >(reduction always, and private/lastprivate if addressable etc.) for > >OpenMP simd / Cilk+ simd we use special arrays indexed by internal > >functions, which then during vectorization are shrunk (but in theory could > >be expanded too) to the right vectorization factor if vectorized, of course > >accesses within the loop vectorized using SIMD, and if not vectorized, > >shrunk to 1 element. > > I'd appreciate if you could describe that mechanism in more detail. As far > as I can tell it is very poorly commented and documented in the code. I > mean, it doesn't even follow the minimal coding standards of describing > function inputs: > > /* Helper function of lower_rec_input_clauses, used for #pragma omp simd > privatization. */ > > static bool > lower_rec_simd_input_clauses (tree new_var, omp_context *ctx, int &max_vf, > tree &idx, tree &lane, tree &ivar, tree &lvar) Here is the theory behind it: https://gcc.gnu.org/ml/gcc-patches/2013-04/msg01661.html In the end it is using internal functions instead of uglified builtins. I'd suggest you look at some of the libgomp.c/simd*.c tests, say with -O2 -mavx2 -fdump-tree-{omplower,ssa,ifcvt,vect,optimized} to see how it is lowered and expanded. I assume #pragma omp simd roughly corresponds to #pragma acc loop vector, maxvf for PTX vectorization is supposedly 32 (warp size). For SIMD vectorization, if the vectorization fails, the arrays are shrunk to 1 element, otherwise they are shrunk to the vectorization factor, and, if they aren't really addressable, they are later optimized using FRE and other memory optimizations so that they don't touch memory unless really needed. For the PTX-style vectorization (parallelization between threads in a warp), I'd say you would always shrink to 1 element again, but such variables would be local to each of the threads in the warp (or another possibility is shared arrays of size 32 indexed by %tid.x & 31), while addressable variables without such magic type would be shared among all threads; non-addressable variables (SSA_NAMEs) depending on where they are used. You'd need to transform reductions (which are right now represented as another loop, from 0 to an internal function, so easily recognizable) into the PTX reductions. Also, lastprivate is now an access to the array using the last-lane internal function; I don't know what that corresponds to in PTX (perhaps also a reduction, where all but the thread executing the last iteration OR in, say, 0, and the remaining thread ORs in the lastprivate value). Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
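A sketch of the reduction representation mentioned here, in simplified form (the exact internal-function operands are elided):

  /* #pragma omp simd reduction(+:sum)  */
  int D.sum[MAX_VF] = { 0, ... };              /* "omp simd array" */
  for (i = 0; i < n; i++)
    D.sum[.GOMP_SIMD_LANE (simduid)] += a[i];
  /* The reduction epilogue is itself a loop bounded by an internal
     function, which is what makes it easily recognizable:  */
  for (k = 0; k < .GOMP_SIMD_VF (simduid); k++)
    sum += D.sum[k];

After vectorization the bound becomes the real vectorization factor; if the loop is not vectorized it becomes 1. A PTX lowering could instead map D.sum onto per-thread storage and replace the epilogue with a warp-level reduction.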
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-19 12:32 ` Jakub Jelinek 2015-06-19 13:07 ` Bernd Schmidt @ 2015-06-22 14:00 ` Julian Brown 2015-06-22 14:36 ` Jakub Jelinek 1 sibling, 1 reply; 23+ messages in thread From: Julian Brown @ 2015-06-22 14:00 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Fri, 19 Jun 2015 14:25:57 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Fri, Jun 19, 2015 at 11:53:14AM +0200, Bernd Schmidt wrote: > > On 05/28/2015 05:08 PM, Jakub Jelinek wrote: > > > > >I understand it is more work, I'd just like to ask that when > > >designing stuff for the OpenACC offloading you (plural) try to > > >take the other offloading devices and host fallback into account. > > > > The problem is that many of the transformations we need to do are > > really GPU specific, and with the current structure of > > omplow/ompexp they are being done in the host compiler. The > > offloading scheme we decided on does not give us the means to write > > out multiple versions of an offloaded function where each target > > gets a different one. For that reason I think we should postpone > > these lowering decisions until we're in the accel compiler, where > > they could be controlled by target hooks, and over the last two > > weeks I've been doing some experiments to see how that could be > > achieved. > I wonder why struct loop flags and other info together with function > attributes and/or cgraph flags and other info aren't sufficient for > the OpenACC needs. > Have you or Thomas looked what we're doing for OpenMP simd / Cilk+ > simd? > > Why can't the execution model (normal, vector-single and > worker-single) be simply attributes on functions or cgraph node flags > and the kind of #acc loop simply be flags on struct loop, like > already OpenMP simd / Cilk+ simd is? One problem is that (at least on the GPU hardware we've considered so far) we're somewhat constrained in how much control we have over how the underlying hardware executes code: it's possible to draw up a scheme where OpenACC source-level control-flow semantics are reflected directly in the PTX assembly output (e.g. to say "all threads in a CTA/warp will be coherent after such-and-such a loop"), and lowering OpenACC directives quite early seems to make that relatively tractable. (Even if the resulting code is relatively un-optimisable due to the abnormal edges inserted to make sure that the CFG doesn't become "ill-formed".) If arbitrary optimisations are done between OMP-lowering time and somewhere around vectorisation (say), it's less clear if that correspondence can be maintained. Say if the code executed by half the threads in a warp becomes physically separated from the code executed by the other half of the threads in a warp due to some loop optimisation, we can no longer easily determine where that warp will reconverge, and certain other operations (relying on coherent warps -- e.g. CTA synchronisation) become impossible. A similar issue exists for warps within a CTA. So, essentially -- I don't know how "late" loop lowering would interact with: (a) Maintaining a CFG that will work with PTX. (b) Predication for worker-single and/or vector-single modes (actually all currently-proposed schemes have problems with proper representation of data-dependencies for variables and compiler-generated temporaries between predicated regions.) Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
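A contrived sketch of the divergence worry, with tid standing in for threadIdx.x: in a vector-partitioned loop such as

  for (i = tid; i < n; i += 32)
    {
      if (i & 1)
        a[i] = f (i);
      else
        b[i] = g (i);
    }
  /* unique reconvergence point; a safe place for bar.sync */

the condition (i & 1) is invariant per thread (i steps by the warp size), so a generic optimizer is entitled to unswitch the loop:

  if (tid & 1)
    for (i = tid; i < n; i += 32)
      a[i] = f (i);
  else
    for (i = tid; i < n; i += 32)
      b[i] = g (i);

after which the two halves of the warp execute physically separate loops and there is no longer a single point at which the warp is known to reconverge.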
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:00 ` Julian Brown @ 2015-06-22 14:36 ` Jakub Jelinek 2015-06-22 15:18 ` Julian Brown ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 14:36 UTC (permalink / raw) To: Julian Brown; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > One problem is that (at least on the GPU hardware we've considered so > far) we're somewhat constrained in how much control we have over how the > underlying hardware executes code: it's possible to draw up a scheme > where OpenACC source-level control-flow semantics are reflected directly > in the PTX assembly output (e.g. to say "all threads in a CTA/warp will > be coherent after such-and-such a loop"), and lowering OpenACC > directives quite early seems to make that relatively tractable. (Even > if the resulting code is relatively un-optimisable due to the abnormal > edges inserted to make sure that the CFG doesn't become "ill-formed".) > > If arbitrary optimisations are done between OMP-lowering time and > somewhere around vectorisation (say), it's less clear if that > correspondence can be maintained. Say if the code executed by half the > threads in a warp becomes physically separated from the code executed > by the other half of the threads in a warp due to some loop > optimisation, we can no longer easily determine where that warp will > reconverge, and certain other operations (relying on coherent warps -- > e.g. CTA synchronisation) become impossible. A similar issue exists for > warps within a CTA. > > So, essentially -- I don't know how "late" loop lowering would interact > with: > > (a) Maintaining a CFG that will work with PTX. > > (b) Predication for worker-single and/or vector-single modes > (actually all currently-proposed schemes have problems with proper > representation of data-dependencies for variables and > compiler-generated temporaries between predicated regions.) I don't understand why lowering the way you suggest helps here at all. In the proposed scheme, you essentially have whole function in e.g. worker-single or vector-single mode, which you need to be able to handle properly in any case, because users can write such routines themselves. And then you can have a loop in such a function that has some special attribute, a hint that it is desirable to vectorize it (for PTX the PTX way) or use vector-single mode for it in a worker-single function. So, the special pass then of course needs to handle all the needed broadcasting and reduction required to change the mode from e.g. worker-single to vector-single, but the convergence points still would be either on the boundary of such loops to be vectorized or parallelized, or wherever else they appear in normal vector-single or worker-single functions (around the calls to certainly calls?). Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
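For orientation, the per-loop metadata referred to here lives in struct loop in gcc/cfgloop.h; the fields below are paraphrased rather than quoted exactly, for roughly this era of GCC:

  struct loop
  {
    ...
    int safelen;           /* no dependence distance smaller than this
                              (e.g. from the simd safelen clause) */
    bool force_vectorize;  /* the user asserted the loop can and should
                              be vectorized (OpenMP simd / Cilk+ simd) */
    tree simduid;          /* ties "omp simd array" accesses to the loop */
    ...
  };

The suggestion is that the different kinds of "#pragma acc loop" could be carried the same way: flags on struct loop plus attributes or cgraph flags on the containing function.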
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek @ 2015-06-22 15:18 ` Julian Brown 2015-06-22 15:33 ` Bernd Schmidt 2015-06-22 17:54 ` Julian Brown 2 siblings, 0 replies; 23+ messages in thread From: Julian Brown @ 2015-06-22 15:18 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > > One problem is that (at least on the GPU hardware we've considered > > so far) we're somewhat constrained in how much control we have over > > how the underlying hardware executes code: it's possible to draw up > > a scheme where OpenACC source-level control-flow semantics are > > reflected directly in the PTX assembly output (e.g. to say "all > > threads in a CTA/warp will be coherent after such-and-such a > > loop"), and lowering OpenACC directives quite early seems to make > > that relatively tractable. (Even if the resulting code is > > relatively un-optimisable due to the abnormal edges inserted to > > make sure that the CFG doesn't become "ill-formed".) > > > > If arbitrary optimisations are done between OMP-lowering time and > > somewhere around vectorisation (say), it's less clear if that > > correspondence can be maintained. Say if the code executed by half > > the threads in a warp becomes physically separated from the code > > executed by the other half of the threads in a warp due to some loop > > optimisation, we can no longer easily determine where that warp will > > reconverge, and certain other operations (relying on coherent warps > > -- e.g. CTA synchronisation) become impossible. A similar issue > > exists for warps within a CTA. > > > > So, essentially -- I don't know how "late" loop lowering would > > interact with: > > > > (a) Maintaining a CFG that will work with PTX. > > > > (b) Predication for worker-single and/or vector-single modes > > (actually all currently-proposed schemes have problems with proper > > representation of data-dependencies for variables and > > compiler-generated temporaries between predicated regions.) > > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be > able to handle properly in any case, because users can write such > routines themselves. And then you can have a loop in such a function > that has some special attribute, a hint that it is desirable to > vectorize it (for PTX the PTX way) or use vector-single mode for it > in a worker-single function. So, the special pass then of course > needs to handle all the needed broadcasting and reduction required to > change the mode from e.g. worker-single to vector-single, but the > convergence points still would be either on the boundary of such > loops to be vectorized or parallelized, or wherever else they appear > in normal vector-single or worker-single functions (around the calls > to certainly calls?). I think most of my concerns are centred around loops (with the markings you suggest) that might be split into parts: if that cannot happen for loops that are annotated as you describe, maybe things will work out OK. (Apologies for my ignorance here, this isn't a part of the compiler that I know anything about.) Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek 2015-06-22 15:18 ` Julian Brown @ 2015-06-22 15:33 ` Bernd Schmidt 2015-06-22 16:13 ` Nathan Sidwell 2015-06-22 17:54 ` Julian Brown 2 siblings, 1 reply; 23+ messages in thread From: Bernd Schmidt @ 2015-06-22 15:33 UTC (permalink / raw) To: Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches, Nathan Sidwell On 06/22/2015 04:24 PM, Jakub Jelinek wrote: > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be able to > handle properly in any case, because users can write such routines > themselves. And then you can have a loop in such a function that > has some special attribute, a hint that it is desirable to vectorize it > (for PTX the PTX way) or use vector-single mode for it in a worker-single > function. You can have a hint that it is desirable, but not a hint that it is correct (because passes in between may invalidate that). The OpenACC directives guarantee to the compiler that the program can be transformed into a parallel form. If we lose them early we must then rely on our analysis which may not be strong enough to prove that the loop can be parallelized. If we make these transformations early enough, while we still have the OpenACC directives, we can guarantee that we do exactly what the programmer specified. Bernd ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 15:33 ` Bernd Schmidt @ 2015-06-22 16:13 ` Nathan Sidwell 2015-06-22 16:27 ` Jakub Jelinek 0 siblings, 1 reply; 23+ messages in thread From: Nathan Sidwell @ 2015-06-22 16:13 UTC (permalink / raw) To: Bernd Schmidt, Jakub Jelinek, Julian Brown; +Cc: Thomas Schwinge, gcc-patches On 06/22/15 11:18, Bernd Schmidt wrote: > You can have a hint that it is desirable, but not a hint that it is correct > (because passes in between may invalidate that). The OpenACC directives > guarantee to the compiler that the program can be transformed into a parallel > form. If we lose them early we must then rely on our analysis which may not be > strong enough to prove that the loop can be parallelized. If we make these > transformations early enough, while we still have the OpenACC directives, we can > guarantee that we do exactly what the programmer specified. How does this differ from openmp's needs to preserve parallelism on a parallel loop? Is it more than the reconvergence issue? nathan -- Nathan Sidwell ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 16:13 ` Nathan Sidwell @ 2015-06-22 16:27 ` Jakub Jelinek 2015-06-22 16:35 ` Nathan Sidwell 0 siblings, 1 reply; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 16:27 UTC (permalink / raw) To: Nathan Sidwell; +Cc: Bernd Schmidt, Julian Brown, Thomas Schwinge, gcc-patches On Mon, Jun 22, 2015 at 12:08:36PM -0400, Nathan Sidwell wrote: > On 06/22/15 11:18, Bernd Schmidt wrote: > > >You can have a hint that it is desirable, but not a hint that it is correct > >(because passes in between may invalidate that). The OpenACC directives > >guarantee to the compiler that the program can be transformed into a parallel > >form. If we lose them early we must then rely on our analysis which may not be > >strong enough to prove that the loop can be parallelized. If we make these > >transformations early enough, while we still have the OpenACC directives, we can > >guarantee that we do exactly what the programmer specified. > > How does this differ from openmp's needs to preserve parallelism on a > parallel loop? Is it more than the reconvergence issue? OpenMP has a significantly different execution model: a parallel block in OpenMP is run by a certain number of threads (the initial thread (the one encountering that region) and then, depending on clauses and library decisions, perhaps others), with a barrier at the end of the region, and afterwards only the initial thread continues again. So, an OpenMP parallel is implemented as a library call, taking the outlined function from the parallel's body as one of its arguments, and the body is executed by the initial thread and perhaps others. OpenMP worksharing loop is just coordination between the threads in the team, which thread takes which subset of the loop's iterations, and optionally followed by a barrier. OpenMP simd loop is a loop that has certain properties guaranteed by the user and can be vectorized. In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then idles on some of them until there is work for them. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
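Concretely, the outlining described here: ompexp turns

  #pragma omp parallel
  {
    body;
  }

into roughly

  static void fn0 (void *data)
  {
    /* body, with shared variables accessed through *data */
  }
  ...
  GOMP_parallel (fn0, &shared, /* num_threads */ 0, /* flags */ 0);
  /* implicit barrier inside; only the initial thread continues here */

where GOMP_parallel is the libgomp entry point. This is a simplified sketch: how shared variables are packed into the data record, and all clause handling, are omitted.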
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 16:27 ` Jakub Jelinek @ 2015-06-22 16:35 ` Nathan Sidwell 0 siblings, 0 replies; 23+ messages in thread From: Nathan Sidwell @ 2015-06-22 16:35 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Julian Brown, Thomas Schwinge, gcc-patches On 06/22/15 12:20, Jakub Jelinek wrote: > OpenMP worksharing loop is just coordination between the threads in the > team, which thread takes which subset of the loop's iterations, and > optionally followed by a barrier. OpenMP simd loop is a loop that has > certain properties guaranteed by the user and can be vectorized. > In contrast to this, OpenACC spawns all the threads/CTAs upfront, and then > idles on some of them until there is work for them. Correct. I expressed my question poorly. What I mean is that in OpenMP, a loop that is parallelizable (by user decree, I guess [*]) should not be transformed such that it is not parallelizable. This seems to me to be a common requirement of both languages. How one gets parallel threads of execution to the body of the loop is a different question. nathan [*] For ones where the compiler needs to detect parallelizability, it's preferable that it doesn't do something earlier to force serializability. -- Nathan Sidwell ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 14:36 ` Jakub Jelinek @ 2015-06-22 17:54 ` Julian Brown 2015-06-22 18:48 ` Jakub Jelinek 2 siblings, 1 reply; 23+ messages in thread From: Julian Brown @ 2015-06-22 17:54 UTC (permalink / raw) To: Jakub Jelinek; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, 22 Jun 2015 16:24:56 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > On Mon, Jun 22, 2015 at 02:55:49PM +0100, Julian Brown wrote: > > One problem is that (at least on the GPU hardware we've considered > > so far) we're somewhat constrained in how much control we have over > > how the underlying hardware executes code: it's possible to draw up > > a scheme where OpenACC source-level control-flow semantics are > > reflected directly in the PTX assembly output (e.g. to say "all > > threads in a CTA/warp will be coherent after such-and-such a > > loop"), and lowering OpenACC directives quite early seems to make > > that relatively tractable. (Even if the resulting code is > > relatively un-optimisable due to the abnormal edges inserted to > > make sure that the CFG doesn't become "ill-formed".) > > > > If arbitrary optimisations are done between OMP-lowering time and > > somewhere around vectorisation (say), it's less clear if that > > correspondence can be maintained. Say if the code executed by half > > the threads in a warp becomes physically separated from the code > > executed by the other half of the threads in a warp due to some loop > > optimisation, we can no longer easily determine where that warp will > > reconverge, and certain other operations (relying on coherent warps > > -- e.g. CTA synchronisation) become impossible. A similar issue > > exists for warps within a CTA. > > > > So, essentially -- I don't know how "late" loop lowering would > > interact with: > > > > (a) Maintaining a CFG that will work with PTX. > > > > (b) Predication for worker-single and/or vector-single modes > > (actually all currently-proposed schemes have problems with proper > > representation of data-dependencies for variables and > > compiler-generated temporaries between predicated regions.) > > I don't understand why lowering the way you suggest helps here at all. > In the proposed scheme, you essentially have whole function > in e.g. worker-single or vector-single mode, which you need to be > able to handle properly in any case, because users can write such > routines themselves. In vector-single or worker-single mode, divergence of threads within a warp or a CTA is controlled by broadcasting the controlling expression of conditional branches to the set of "inactive" threads, so each of those follows along with the active thread. So you only get potentially-problematic thread divergence when workers or vectors are operating in partitioned mode. So, for instance, a made-up example:

#pragma acc parallel
{
  #pragma acc loop gang
  for (i = 0; i < N; i++)
    {
      #pragma acc loop worker
      for (j = 0; j < M; j++)
        {
          if (j < M / 2)
            /* stmt 1 */
          else
            /* stmt 2 */
        }

      /* reconvergence point: thread barrier */

      [...]
    }
}

Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single mode. With "early lowering", the reconvergence point can be inserted at the end of the loop, and abnormal edges (etc.) can be used to ensure that the CFG does not get changed in such a way that there is no longer a unique point at which the loop threads reconverge.
With "late lowering", it's no longer obvious to me if that can still be done. Julian ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-06-22 17:54 ` Julian Brown @ 2015-06-22 18:48 ` Jakub Jelinek 0 siblings, 0 replies; 23+ messages in thread From: Jakub Jelinek @ 2015-06-22 18:48 UTC (permalink / raw) To: Julian Brown; +Cc: Bernd Schmidt, Thomas Schwinge, gcc-patches, Nathan Sidwell On Mon, Jun 22, 2015 at 06:48:10PM +0100, Julian Brown wrote: > In vector-single or worker-single mode, divergence of threads within a > warp or a CTA is controlled by broadcasting the controlling expression > of conditional branches to the set of "inactive" threads, so each of > those follows along with the active thread. So you only get > potentially-problematic thread divergence when workers or vectors are > operating in partitioned mode. > > So, for instance, a made-up example: > > #pragma acc parallel > { > #pragma acc loop gang > for (i = 0; i < N; i++)) > { > #pragma acc loop worker > for (j = 0; j < M; j++) > { > if (j < M / 2) > /* stmt 1 */ > else > /* stmt 2 */ > } > > /* reconvergence point: thread barrier */ > > [...] > } > } > > Here "stmt 1" and "stmt 2" execute in worker-partitioned, vector-single > mode. With "early lowering", the reconvergence point can be > inserted at the end of the loop, and abnormal edges (etc.) can be used > to ensure that the CFG does not get changed in such a way that there is > no longer a unique point at which the loop threads reconverge. > > With "late lowering", it's no longer obvious to me if that can still be > done. Why? The loop still has an exit edge (if there is no break/return/throw out of the loop which I bet is not allowed), so you just insert the reconvergence point at the exit edge from the loop. For the "late lowering", I said it is up for benchmarking/investigation where it would be best placed, it doesn't have to be after the loop passes, there are plenty of optimization passes even before those. But once you turn many of the SSA_NAMEs in a function into (ab) ssa vars, many optimizations just give up. And, if you really want to avoid certain loop optimizations, you have always the possibility to e.g. wrap certain statement in the loop in internal function (e.g. the loop condition) or something similar to make the passes more careful about those loops and make it easier to lower it later. Jakub ^ permalink raw reply [flat|nested] 23+ messages in thread
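A sketch of that last suggestion; the internal-function name is invented here purely for illustration, nothing like it exists in GCC. At the GIMPLE level the loop condition could be wrapped so that most passes treat it as opaque:

  <bb loop_cond>:
  _c = i_1 < n_2;
  _t = .ACC_LOOP_COND (_c);   /* hypothetical internal function */
  if (_t != 0)
    goto <bb loop_body>;
  else
    goto <bb loop_exit>;

Passes that cannot see through the call would leave the loop's shape alone, and a late OpenACC lowering pass would strip the wrapper when it emits the final form.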
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-05-28 14:20 [gomp4] Preserve NVPTX "reconvergence" points Julian Brown 2015-05-28 14:59 ` Jakub Jelinek @ 2015-05-28 15:02 ` Richard Biener 2015-06-03 11:47 ` Julian Brown 1 sibling, 1 reply; 23+ messages in thread From: Richard Biener @ 2015-05-28 15:02 UTC (permalink / raw) To: Julian Brown; +Cc: GCC Patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge On Thu, May 28, 2015 at 4:06 PM, Julian Brown <julian@codesourcery.com> wrote: > For NVPTX, it is vitally important that the divergence of threads > within a warp can be controlled: in particular we must be able to > generate code that we know "reconverges" at a particular point. > Unfortunately GCC's middle-end optimisers can cause this property to > be violated, which causes problems for the OpenACC execution model > we're planning to use for NVPTX. > > As a brief example: code running in vector-single mode runs on a > single thread of a warp, and must broadcast condition results to other > threads of the warp so that they can "follow along" and be ready for > vector-partitioned execution when necessary. > > #pragma acc parallel > { > #pragma acc loop gang > for (i = 0; i < N; i++) > { > /* This is vector-single mode. */ > n = ...; > switch (n) > { > case 1: > #pragma acc loop vector > for (...) > { > /* This is vector-partitioned mode. */ > } > ... > } > } > } > > Here, the calculation "n = ..." takes place on a single thread (of > each partitioned gang of the outer loop), but the switch statement > (terminating the BB) must be executed by all threads in the warp. The > vector-single statements will be translated using a branch around for > the "idle" threads: > > if (threadIdx.x == 0) > { > n_0 = ...; > } > n_x = broadcast (n_0) > switch (n_x) > ... > > Where "broadcast" is an operation that transfers values from some > other thread of a warp (i.e., the zeroth) to the current thread > (implemented as a "shfl" instruction for NVPTX). > > I observed a similar example to this cloning the broadcast and switch > instructions (in the .dom1 dump), along the lines of: > > if (threadIdx.x == 0) > { > n_0 = ...; > n_x = broadcast (n_0) > switch (n_x) > ... > } > else > { > n_x = broadcast (n_0) > switch (n_x) > ... > } > > This doesn't work because the "broadcast" operation has to be run with > non-diverged warps for correct operation, and here there is divergence > due to the "if (threadIdx.x == 0)" condition. > > So, the way I have tried to handle this is by attempting to inhibit > optimisation along edges which have a reconvergence point as their > destination. The essential idea is to make such edges "abnormal", > although the existing EDGE_ABNORMAL flag is not used because that has > implicit meaning built into it already, and the new edge type may need > to be handled differently in some areas. One example is that at > present, blocks concluding with GIMPLE_COND cannot have EDGE_ABNORMAL > set on their EDGE_TRUE or EDGE_FALSE outgoing edges. > > The attached patch introduces a new edge flag (EDGE_TO_RECONVERGENCE), > for the GIMPLE CFG only. In principle there's nothing to stop the flag > being propagated to the RTL CFG also, in which case it'd probably be > set at the same time as EDGE_ABNORMAL, mirroring the semantics of e.g. > EDGE_EH, EDGE_ABNORMAL_CALL and EDGE_SIBCALL. Then, passes which > inspect the RTL CFG can continue to only check the ABNORMAL flag. But > so far (in rather limited testing!), that has not been observed to be > necessary. 
(We can control RTL CFG manipulation indirectly by using the > CANNOT_COPY_INSN_P target hook, sensitive e.g. to the "broadcast" > instruction.) > > For the GIMPLE CFG (i.e. in passes operating on GIMPLE form), > EDGE_TO_RECONVERGENCE behaves mostly the same as EDGE_ABNORMAL (i.e., > inhibiting certain optimisations), and so has been added to relevant > conditionals largely mechanically. Places where it is treated specially > are: > > * tree-cfg.c:gimple_verify_flow_info does not permit EDGE_ABNORMAL on > outgoing edges of a block concluding with a GIMPLE_COND statement. > But, we allow EDGE_TO_RECONVERGENCE there. > > * tree-vrp.c:find_conditional_asserts skips over outgoing GIMPLE_COND > edges with EDGE_TO_RECONVERGENCE set (avoiding an ICE when the pass > tries to split the edge later). > > There are probably other optimisations that will be tripped up by the > new flag along the same lines as the VRP tweak above, which we will no > doubt discover in due course. > > Together with the patch, > > https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02612.html > > This shows no regressions for the libgomp tests. > > OK for gomp4 branch? Hmm, I don't think adding a new edge flag is good nor necessary. It seems to me that instead the broadcast operation should have abnormal control flow and thus basic-blocks should be split either before or after it (so either incoming or outgoing edge(s) should be abnormal). I suppose splitting before the broadcast would be best (thus handle it similar to setjmp ()). Richard. > Thanks, > > Julian > > ChangeLog > > gcc/ > * basic-block.h (EDGE_COMPLEX): Add EDGE_TO_RECONVERGENCE flag. > (bb_hash_abnorm_or_reconv_pred): New function. > (hash_abnormal_or_eh_outgoing_edge_p): Consider > EDGE_TO_RECONVERGENCE also. > * cfg-flags.def (TO_RECONVERGENCE): Add flag. > * omp-low.c (predicate_bb): Set EDGE_TO_RECONVERGENCE on edges > leading to a reconvergence point. > * cfgbuild.c (purge_dead_tablejump_edges): Consider > EDGE_TO_RECONVERGENCE. > * cfgcleanup.c (try_crossjump_to_edge, try_head_merge_bb): Likewise. > * cfgexpand.c (expand_gimple_tailcall, construct_exit_block) > (pass_expand::execute): Likewise. > * cfghooks.c (can_copy_bbs_p): Likewise. > * cfgloop.c (bb_loop_header_p): Likewise. > * cfgloopmanip.c (scale_loop_profile): Likewise. > * gimple-iterator.c (gimple_find_edge_insert_loc): Likewise. > * graph.c (draw_cfg_node_succ_edges): Likewise. > * graphite-scope-detection.c (canonicalize_loop_closed_ssa): > Likewise. > * predict.c (tree_bb_level_predictions): Likewise. > * profile.c (instrument_edges, branch_prop, find_spanning_tree): > Likewise. > * tree-cfg.c (replace_uses_by, gimple_split_edge) > (gimple_redirect_edge_and_branch, split_critical_edges): Likewise. > * tree-cfgcleanup.c (tree_forwarder_block_p, remove_forwarder_block) > (pass_merge_phi::execute): Likewise. > * tree-chkp.c (chkp_fix_cfg): Likewise. > * tree-if-conv.c (if_convertible_bb_p): Likewise. > * tree-inline.c (update_ssa_across_abnormal_edges): Likewise. > * tree-into-ssa.c (rewrite_update_phi_arguments) > (rewrite_update_dom_walker::before_dom_children) > (create_new_def_for): Likewise. > * tree-outof-ssa.c (eliminate_phi): Likewise. > * tree-phinodes.c (add_phi_arg): Likewise. > * tree-ssa-coalesce (coalesce_cost_edge, create_outofssa_var_map) > (coalesce_partitions): Likewise. > * tree-ssa-dom.c (cprop_into_successor_phis) > (dom_opt_dom_walker::after_dom_children, propagate_rhs_into_lhs): > Likewise. > * tree-ssa-loop-im.c (loop_suitable_for_sm): Likewise. 
> * tree-ssa-loop-prefetch.c (emit_mfence_after_loop) > (may_use_storent_in_loop_p): Likewise. > * tree-ssa-phiopt.c (tree_ssa_phiopt_worker): Likewise. > * tree-ssa-pre.c (compute_antic, insert_into_preds_of_block): > Likewise. > * tree-ssa-propagate.c (simulate_block, replace_phi_args_in): > Likewise. > * tree-ssa-sink.c (sink_code_in_bb): Likewise. > * tree-ssa-threadedge.c (thread_across_edge): Likewise. > * tree-ssa-threadupdate.c (thread_single_edge): Likewise. > * tree-ssa-uninit.c (compute_control_dep_chain): Likewise. > * tree-ssa.c (verify_phi_args): Likewise. > * tree-vect-loop.c (vect_analyze_loop_form): Likewise. > * value-prof.c (gimple_ic): Likewise. > * tree-vrp.c (infer_value_range, process_assert_insertions_for): > Likewise. > (find_conditional_asserts): Skip over EDGE_TO_RECONVERGENCE edges. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [gomp4] Preserve NVPTX "reconvergence" points 2015-05-28 15:02 ` Richard Biener @ 2015-06-03 11:47 ` Julian Brown 0 siblings, 0 replies; 23+ messages in thread From: Julian Brown @ 2015-06-03 11:47 UTC (permalink / raw) To: Richard Biener; +Cc: GCC Patches, Bernd Schmidt, Jakub Jelinek, Thomas Schwinge [-- Attachment #1: Type: text/plain, Size: 2666 bytes --] On Thu, 28 May 2015 16:37:04 +0200 Richard Biener <richard.guenther@gmail.com> wrote: > On Thu, May 28, 2015 at 4:06 PM, Julian Brown > <julian@codesourcery.com> wrote: > > For NVPTX, it is vitally important that the divergence of threads > > within a warp can be controlled: in particular we must be able to > > generate code that we know "reconverges" at a particular point. > > Unfortunately GCC's middle-end optimisers can cause this property to > > be violated, which causes problems for the OpenACC execution model > > we're planning to use for NVPTX. > > Hmm, I don't think adding a new edge flag is good nor necessary. It > seems to me that instead the broadcast operation should have abnormal > control flow and thus basic-blocks should be split either before or > after it (so either incoming or outgoing edge(s) should be > abnormal). I suppose splitting before the broadcast would be best > (thus handle it similar to setjmp ()). Here's a version of the patch that uses abnormal edges with semantics unchanged, splitting the "false"/non-execution edge using a dummy block to avoid the prohibited case of both EDGE_TRUE/EDGE_FALSE and EDGE_ABNORMAL on the outgoing edges of a GIMPLE_COND. So for a fragment like this:

if (threadIdx.x == 0) /* cond_bb */
  {
    /* work */
    p0 = ...; /* assign */
  }
pN = broadcast(p0);
if (pN) goto T; else goto F;

Incoming edges to a broadcast operation have EDGE_ABNORMAL set:

+--------+
|cond_bb |--------,
+--------+        |
    | (true edge) | (false edge)
    v             v
+--------+    +-------+
| (work) |    | dummy |
+--------+    +-------+
| assign |        |
+--------+        |
  ABNORM|         |ABNORM
        v         |
+--------+<-------'
| bcast  |
+--------+
| cond   |
+--------+
   / \
  T   F

The abnormal edges actually serve two purposes, I think: as well as ensuring the broadcast operation takes place when a warp is non-diverged/coherent, they ensure that p0 is not seen as uninitialised along the "false" path from cond_bb, possibly leading to the broadcast operation being optimised away as partially redundant. This feels somewhat fragile though! We'll have to continue to think about warp divergence in subsequent patches. The patch passes libgomp testing (with Bernd's recent worker-single patch also). OK for gomp4 branch (together with the previously-mentioned inline thread builtin patch)? Thanks, Julian ChangeLog gcc/ * omp-low.c (make_predication_test): Split false block out of cond_bb, making latter edge abnormal. (predicate_bb): Set EDGE_ABNORMAL on edges before broadcast operations.
[-- Attachment #2: to-reconvergence-5.diff --] [-- Type: text/x-patch, Size: 1751 bytes --]

commit 38056ae4a29f93ce54715dfad843a233f3b0fd2a
Author: Julian Brown <julian@codesourcery.com>
Date:   Mon Jun 1 11:12:41 2015 -0700

    Use abnormal edges before broadcast ops

diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 7048f9f..310eb72 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -10555,7 +10555,16 @@ make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
   gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
 
   true_edge->flags = EDGE_TRUE_VALUE;
-  make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+
+  /* Force an abnormal edge before a broadcast operation that might be present
+     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
+     respect to the predication done by this function) -- the opposite
+     (execution) edge that reaches the broadcast operation must be made
+     abnormal also, e.g. in this function's caller.  */
+  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
+  basic_block false_abnorm_bb = split_edge (e);
+  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
+  abnorm_edge->flags |= EDGE_ABNORMAL;
 }
 
 /* Apply OpenACC predication to basic block BB which is in
@@ -10605,6 +10614,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                                   mask);
 
           edge e = split_block (bb, splitpoint);
+          e->flags = EDGE_ABNORMAL;
           skip_dest_bb = e->dest;
 
           gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
@@ -10624,6 +10634,7 @@ predicate_bb (basic_block bb, struct omp_region *parent, int mask)
                                              gsi_asgn, mask);
 
           edge e = split_block (bb, splitpoint);
+          e->flags = EDGE_ABNORMAL;
           skip_dest_bb = e->dest;
 
           gimple_switch_set_index (sstmt, new_var);

^ permalink raw reply [flat|nested] 23+ messages in thread