public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH, PR43864] Gimple level duplicate block cleanup.
@ 2011-06-08  9:49 Tom de Vries
  2011-06-08  9:55 ` [PATCH, PR43864] Gimple level duplicate block cleanup - test cases Tom de Vries
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Tom de Vries @ 2011-06-08  9:49 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 4276 bytes --]

Hi Richard,

I have a patch for PR43864. The patch adds a gimple level duplicate block
cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
table (%, lower is better).

                     none            pic
                thumb1  thumb2  thumb1 thumb2
spec2000          99.9    99.9    99.8   99.8

PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
optimizations proposed in PR20070 would fix this PR.

The problem in this PR is that when compiling with -O2, the example below should
only have one call to free. The original problem is formulated in terms of -Os,
but currently we generate one call to free with -Os, although still not the
smallest code possible. I'll show here the -O2 case, since that's similar to the
original PR.

#include <stdio.h>
void foo (char*, FILE*);
char* hprofStartupp(char *outputFileName, char *ctx)
{
    char fileName[1000];
    FILE *fp;
    sprintf(fileName, outputFileName);
    if (access(fileName, 1) == 0) {
        free(ctx);
        return 0;
    }

    fp = fopen(fileName, 0);
    if (fp == 0) {
        free(ctx);
        return 0;
    }

    foo(outputFileName, fp);

    return ctx;
}

AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
- Merging 2 blocks which are identical expect for input registers, by using a
  conditional move to choose between the different input registers.
- Merging 2 blocks which have different local registers, by ignoring those
  differences

Blocks .L6 and.L7 have no difference in local registers, but they have a
difference in input registers: r3 and r1. Replacing the move to r5 by a
conditional move would probably be benificial in terms of size, but it's not
clear what condition the conditional move should be using. Calculating such a
condition would add in size and increase the execution path.

gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
...
	push	{r4, r5, lr}
	mov	r4, r0
	sub	sp, sp, #1004
	mov	r5, r1
	mov	r0, sp
	mov	r1, r4
	bl	sprintf
	mov	r0, sp
	movs	r1, #1
	bl	access
	mov	r3, r0
	cbz	r0, .L6
	movs	r1, #0
	mov	r0, sp
	bl	fopen
	mov	r1, r0
	cbz	r0, .L7
	mov	r0, r4
	bl	foo
.L3:
	mov	r0, r5
	add	sp, sp, #1004
	pop	{r4, r5, pc}
.L6:
	mov	r0, r5
	mov	r5, r3
	bl	free
	b	.L3
.L7:
	mov	r0, r5
	mov	r5, r1
	bl	free
	b	.L3
...

The proposed patch solved the problem by dealing with the 2 blocks at a level
when they are still identical: at gimple level. It detect that the 2 blocks are
identical, and removes one of them.

The following table shows the impact of the patch on the example in terms of
size for -march=armv7-a:

          without     with    delta
Os      :     108      104       -4
O2      :     120      104      -16
Os thumb:      68       64       -4
O2 thumb:      76       64      -12

The gain in size for -O2 is that of removing the entire block, plus the
replacement of 2 moves by a constant set, which also decreases the execution
path. The patch ensures optimal code for both -O2 and -Os.


By keeping track of equivalent definitions in the 2 blocks, we can ignore those
differences in comparison. Without this feature, we would only match blocks with
resultless operations, due the the ssa-nature of gimples.
For example, with this feature, we reduce the following function to its minimum
at gimple level, rather than at rtl level.

int f(int c, int b, int d)
{
  int r, e;

  if (c)
    r = b + d;
  else
    {
      e = b + d;
      r = e;
    }

  return r;
}

;; Function f (f)

f (int c, int b, int d)
{
  int e;

<bb 2>:
  e_6 = b_3(D) + d_4(D);
  return e_6;

}

I'll send the patch with the testcases in a separate email.

OK for trunk?

Thanks,
- Tom

2011-06-08  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-cfgcleanup.c (int_int_splay_lookup, int_int_splay_insert)
	(int_int_splay_node_contained_in, int_int_splay_contained_in)
	(equiv_lookup, equiv_insert, equiv_contained_in, equiv_init)
	(equiv_delete, gimple_base_equal_p, pt_solution_equal_p, gimple_equal_p)
	(bb_gimple_equal_p, update_debug_stmts, cleanup_duplicate_preds_1)
	(same_or_local_phi_alternatives, cleanup_duplicate_preds): New function.
	(cleanup_tree_cfg_bb): Use cleanup_duplicate_preds.

[-- Attachment #2: pr43864.5.patch --]
[-- Type: text/x-patch, Size: 16000 bytes --]

Index: gcc/tree-cfgcleanup.c
===================================================================
--- gcc/tree-cfgcleanup.c	(revision 173703)
+++ gcc/tree-cfgcleanup.c	(working copy)
@@ -641,6 +641,552 @@ cleanup_omp_return (basic_block bb)
   return true;
 }
 
+/* Returns true if S contains (I1, I2).  */
+
+static bool
+int_int_splay_lookup (splay_tree s, unsigned int i1, unsigned int i2)
+{
+  splay_tree_node node;
+
+  if (s == NULL)
+    return false;
+
+  node = splay_tree_lookup (s, i1);
+  return node && node->value == i2;
+}
+
+/* Attempts to insert (I1, I2) into *S.  Returns true if successful.
+   Allocates *S if necessary.  */
+
+static bool
+int_int_splay_insert (splay_tree *s, unsigned int i1 , unsigned int i2)
+{
+  if (*s != NULL)
+    {
+      /* Check for existing element, which would otherwise be silently
+	 overwritten by splay_tree_insert.  */
+      if (splay_tree_lookup (*s, i1))
+	return false;
+    }
+  else
+    *s = splay_tree_new (splay_tree_compare_ints, 0, 0);
+
+  splay_tree_insert (*s, i1, i2);
+  return true;
+}
+
+/* Returns 0 if (NODE->value, NODE->key) is an element of S.  Otherwise,
+   returns 1.  */
+
+static int
+int_int_splay_node_contained_in (splay_tree_node node, void *s)
+{
+  splay_tree_node snode = splay_tree_lookup ((splay_tree)s, node->key);
+  return (!snode || node->value != snode->value) ? 1 : 0;
+}
+
+/* Returns true if all elements of S1 are also in S2.  */
+
+static bool
+int_int_splay_contained_in (splay_tree s1, splay_tree s2)
+{
+  if (s1 == NULL)
+    return true;
+  if (s2 == NULL)
+    return false;
+  return splay_tree_foreach (s1, int_int_splay_node_contained_in, s2) == 0;
+}
+
+typedef splay_tree equiv_t;
+
+/* Returns true if EQUIV contains (SSA_NAME_VERSION (VAL1),
+                                   SSA_NAME_VERSION (VAL2)).  */
+
+static bool
+equiv_lookup (equiv_t equiv, tree val1, tree val2)
+{
+  if (val1 == NULL_TREE || val2 == NULL_TREE
+      || TREE_CODE (val1) != SSA_NAME || TREE_CODE (val2) != SSA_NAME)
+    return false;
+
+  return int_int_splay_lookup (equiv, SSA_NAME_VERSION (val1),
+			       SSA_NAME_VERSION (val2));
+}
+
+/* Attempts to insert (SSA_NAME_VERSION (VAL1), SSA_NAME_VERSION (VAL2)) into
+   EQUIV, provided they are defined BB1 and BB2.  Returns true if successful.
+   Allocates *EQUIV if necessary.  */
+
+static bool
+equiv_insert (equiv_t *equiv, tree val1, tree val2,
+	      basic_block bb1, basic_block bb2)
+{
+  if (val1 == NULL_TREE || val2 == NULL_TREE
+      || TREE_CODE (val1) != SSA_NAME || TREE_CODE (val2) != SSA_NAME
+      || gimple_bb (SSA_NAME_DEF_STMT (val1)) != bb1
+      || gimple_bb (SSA_NAME_DEF_STMT (val2)) != bb2)
+    return false;
+
+  return int_int_splay_insert (equiv, SSA_NAME_VERSION (val1),
+			       SSA_NAME_VERSION (val2));
+}
+
+/* Returns true if all elements of S1 are also in S2.  */
+
+static bool
+equiv_contained_in (equiv_t s1, equiv_t s2)
+{
+  return int_int_splay_contained_in (s1, s2);
+}
+
+/* Init equiv_t *S.  */
+
+static void
+equiv_init (equiv_t *s)
+{
+  *s = NULL;
+}
+
+/* Delete equiv_t *S and reinit.  */
+
+static void
+equiv_delete (equiv_t *s)
+{
+  if (!*s)
+    return;
+
+  splay_tree_delete (*s);
+  *s = NULL;
+}
+
+/* Check whether S1 and S2 are equal, considering the fields in
+   gimple_statement_base.  Ignores fields uid, location, bb, and block.  */
+
+static bool
+gimple_base_equal_p (gimple s1, gimple s2)
+{
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  if (gimple_no_warning_p (s1) != gimple_no_warning_p (s2))
+    return false;
+
+  /* For pass-local flags visited and plf we would like to be more aggressive.
+     But that means we must have a way to find out whether the flags are
+     currently in use or not.  */
+  if (gimple_visited_p (s1) != gimple_visited_p (s2))
+    return false;
+
+  if (is_gimple_assign (s1)
+      && (gimple_assign_nontemporal_move_p (s1)
+          != gimple_assign_nontemporal_move_p (s2)))
+    return false;
+
+  if (gimple_plf (s1, GF_PLF_1) != gimple_plf (s2, GF_PLF_1))
+    return false;
+
+  if (gimple_plf (s1, GF_PLF_2) != gimple_plf (s2, GF_PLF_2))
+    return false;
+
+  /* The modified field is set when allocating, but only reset for the first
+     time once ssa_operands_active.  So before ssa_operands_active, the field
+     signals that the ssa operands have not been scanned, and after
+     ssa_operands_active it signals that the ssa operands might be invalid.
+     We check here only for the latter case.  */
+  if (ssa_operands_active ()
+      && (gimple_modified_p (s1) || gimple_modified_p (s2)))
+    return false;
+
+  if (gimple_has_volatile_ops (s1) != gimple_has_volatile_ops (s2))
+    return false;
+
+  if (s1->gsbase.subcode != s2->gsbase.subcode)
+    return false;
+
+  if (gimple_num_ops (s1) != gimple_num_ops (s2))
+    return false;
+
+  return true;
+}
+
+/* Return true if p1 and p2 can be considered equal.  */
+
+static bool
+pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
+{
+  if (pt_solution_empty_p (p1) != pt_solution_empty_p (p2))
+    return false;
+  if (pt_solution_empty_p (p1))
+    return true;
+
+  /* TODO: make this less conservative.  */
+  return (p1->anything && p2->anything);
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  EQUIV contains pairs
+   of local defs that can be considered equivalent at entry, and if equal
+   contains at exit the defs and vdefs of S1 and S2.  */
+
+static bool
+gimple_equal_p (equiv_t *equiv, gimple s1, gimple s2)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively.  */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  if (!gimple_base_equal_p (s1, s2))
+    return false;
+
+  gss = gimple_statement_structure (s1);
+  switch (gss)
+    {
+    case GSS_CALL:
+      if (!pt_solution_equal_p (gimple_call_use_set (s1),
+				gimple_call_use_set (s2))
+	  || !pt_solution_equal_p (gimple_call_clobber_set (s1),
+				   gimple_call_clobber_set (s2))
+	  || !gimple_call_same_target_p (s1, s2))
+	return false;
+      /* Falthru.  */
+
+    case GSS_WITH_MEM_OPS_BASE:
+    case GSS_WITH_MEM_OPS:
+      {
+	tree vdef1 = gimple_vdef (s1), vdef2 = gimple_vdef (s2);
+	tree vuse1 = gimple_vuse (s1), vuse2 = gimple_vuse (s2);
+	if (vuse1 != vuse2 && !equiv_lookup (*equiv, vuse1, vuse2))
+	  return false;
+	if (vdef1 != vdef2 && !equiv_insert (equiv, vdef1, vdef2, bb1, bb2))
+	  return false;
+      }
+      /* Falthru.  */
+
+    case GSS_WITH_OPS:
+      /* Ignore gimple_def_ops and gimple_use_ops.  They are duplicates of
+	 gimple_vdef, gimple_vuse and gimple_ops, and checked elsewhere.  */
+      /* Falthru.  */
+
+    case GSS_BASE:
+      break;
+
+    default:
+      return false;
+    }
+
+  /* Find lhs.  */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+
+  /* Handle ops.  */
+  for (i = 0; i < gimple_num_ops (s1); ++i)
+    {
+      tree t1 = gimple_op (s1, i);
+      tree t2 = gimple_op (s2, i);
+      if (t1 == NULL_TREE && t2 == NULL_TREE)
+        continue;
+      if (t1 == NULL_TREE || t2 == NULL_TREE)
+        return false;
+      /* Skip lhs.  */
+      if (lhs1 == t1 && i == 0)
+	continue;
+      if (!operand_equal_p (t1, t2, 0) && !equiv_lookup (*equiv, t1, t2))
+	return false;
+    }
+
+  /* Handle lhs.  */
+  if (lhs1 != lhs2 && !equiv_insert (equiv, lhs1, lhs2, bb1, bb2))
+    return false;
+
+  return true;
+}
+
+/* Return true if BB1 and BB2 contain the same non-debug gimple statements, and
+   if the def pairs in PHI_EQUIV are found to be equivalent defs in BB1 and
+   BB2.  */
+
+static bool
+bb_gimple_equal_p (equiv_t phi_equiv, basic_block bb1, basic_block bb2)
+{
+  gimple_stmt_iterator gsi1 = gsi_start_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_start_nondebug_bb (bb2);
+  bool end1, end2;
+  equiv_t equiv;
+  bool equal = true;
+
+  end1 = gsi_end_p (gsi1);
+  end2 = gsi_end_p (gsi2);
+
+  /* Don't handle empty blocks, these are handled elsewhere in the cleanup.  */
+  if (end1 || end2)
+    return false;
+
+  /* TODO: handle blocks with phi-nodes.  We'll have find corresponding
+     phi-nodes in bb1 and bb2, with the same alternatives for the same
+     preds.  */
+  if (phi_nodes (bb1) != NULL || phi_nodes (bb2) != NULL)
+    return false;
+
+  equiv_init (&equiv);
+  while (true)
+    {
+      if (end1 && end2)
+        break;
+      if (end1 || end2
+	  || !gimple_equal_p (&equiv, gsi_stmt (gsi1), gsi_stmt (gsi2)))
+	{
+	  equal = false;
+	  break;
+	}
+
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+
+  /* equiv now contains all bb1,bb2 def pairs which are equivalent.
+     Check if the phi alternatives are indeed equivalent.  */
+  equal = equal && equiv_contained_in (phi_equiv, equiv);
+
+  equiv_delete (&equiv);
+
+  return equal;
+}
+
+/* Resets all debug statements in BBUSE that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (basic_block bbuse)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef;
+  gimple stmt, def_stmt;
+  gimple_stmt_iterator gsi;
+  tree name;
+
+  for (gsi = gsi_start_bb (bbuse); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      if (!is_gimple_debug (stmt))
+	continue;
+      gcc_assert (gimple_debug_bind_p (stmt));
+
+      gcc_assert (dom_info_available_p (CDI_DOMINATORS));
+
+      FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+	{
+	  name = USE_FROM_PTR (use_p);
+	  gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+	  def_stmt = SSA_NAME_DEF_STMT (name);
+	  gcc_assert (def_stmt != NULL);
+
+	  bbdef = gimple_bb (def_stmt);
+	  if (bbdef == NULL || bbuse == bbdef
+	      || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+	    continue;
+
+	  gimple_debug_bind_reset_value (stmt);
+	  update_stmt (stmt);
+	}
+    }
+}
+
+/* E1 and E2 have a common dest.  Detect if E1->src and E2->src are duplicates,
+   and if so, redirect the predecessor edges of E1->src to E2->src and remove
+   E1->src.  Returns true if any changes were made.  */
+
+static bool
+cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
+{
+  edge pred_edge;
+  basic_block bb1, bb2, pred;
+  basic_block bb_dom = NULL, bb2_dom = NULL;
+  unsigned int i;
+  basic_block bb = e1->dest;
+  gcc_assert (bb == e2->dest);
+
+  if (e1->flags != e2->flags)
+    return false;
+
+  bb1 = e1->src;
+  bb2 = e2->src;
+
+  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
+     have the same successors.  */
+  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
+    return false;
+
+  if (!bb_gimple_equal_p (phi_equiv, bb1, bb2))
+    return false;
+
+  if (dump_file)
+    fprintf (dump_file, "cleanup_duplicate_preds: "
+             "cleaning up <bb %d>, duplicate of <bb %d>\n", bb1->index,
+             bb2->index);
+
+  /* Calculate the changes to be made to the dominator info.  */
+  if (dom_info_available_p (CDI_DOMINATORS))
+    {
+      /* Calculate bb2_dom.  */
+      bb2_dom = nearest_common_dominator (CDI_DOMINATORS, bb2, bb1);
+      if (bb2_dom == bb1 || bb2_dom == bb2)
+        bb2_dom = get_immediate_dominator (CDI_DOMINATORS, bb2_dom);
+
+      /* Calculate bb_dom.  */
+      bb_dom = get_immediate_dominator (CDI_DOMINATORS, bb);
+      if (bb == bb2)
+        bb_dom = bb2_dom;
+      else if (bb_dom == bb1 || bb_dom == bb2)
+        bb_dom = bb2;
+      else
+        {
+          /* Count the predecessors of bb (other than bb1 or bb2), not dominated
+             by bb.  If there are none, merging bb1 and bb2 will mean that bb2
+             dominates bb.  */
+          int not_dominated = 0;
+          for (i = 0; i < EDGE_COUNT (bb->preds); ++i)
+            {
+              pred_edge = EDGE_PRED (bb, i);
+              pred = pred_edge->src;
+              if (pred == bb1 || pred == bb2)
+                continue;
+              if (dominated_by_p (CDI_DOMINATORS, pred, bb))
+                continue;
+              not_dominated++;
+            }
+          if (not_dominated == 0)
+            bb_dom = bb2;
+        }
+    }
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0 ; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred = pred_edge->src;
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+      /* The set of successors of pred have changed.  */
+      bitmap_set_bit (cfgcleanup_altered_bbs, pred->index);
+    }
+
+  /* The set of predecessors has changed for both bb and bb2.  */
+  bitmap_set_bit (cfgcleanup_altered_bbs, bb->index);
+  bitmap_set_bit (cfgcleanup_altered_bbs, bb2->index);
+
+  /* bb1 has no incoming edges anymore, and has become unreachable.  */
+  delete_basic_block (bb1);
+  bitmap_clear_bit (cfgcleanup_altered_bbs, bb1->index);
+
+  /* Update dominator info.  */
+  if (dom_info_available_p (CDI_DOMINATORS))
+    {
+      /* Note: update order is relevant.  */
+      set_immediate_dominator (CDI_DOMINATORS, bb2, bb2_dom);
+      if (bb != bb2)
+	set_immediate_dominator (CDI_DOMINATORS, bb, bb_dom);
+      verify_dominators (CDI_DOMINATORS);
+    }
+
+  /* Reset invalidated debug statements.  */
+  update_debug_stmts (bb2);
+
+  return true;
+}
+
+/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
+   E2 are either:
+   - equal, or
+   - defined locally in E1->src and E2->src.
+   In the latter case, register the alternatives in *PHI_EQUIV.  */
+
+static bool
+same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
+{
+  int n1 = e1->dest_idx;
+  int n2 = e2->dest_idx;
+  gimple_stmt_iterator gsi;
+  basic_block dest = e1->dest;
+  gcc_assert (dest == e2->dest);
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      gcc_assert (val1 != NULL_TREE);
+      gcc_assert (val2 != NULL_TREE);
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+	continue;
+
+      if (!equiv_insert (phi_equiv, val1, val2, e1->src, e2->src))
+	return false;
+    }
+
+  return true;
+}
+
+/* Detect duplicate predecessor blocks of BB and clean them up.  Return true if
+   any changes were made.  */
+
+static bool
+cleanup_duplicate_preds (basic_block bb)
+{
+  edge e1, e2, e1_swapped, e2_swapped;
+  unsigned int i, j, n;
+  equiv_t phi_equiv;
+  bool changed;
+
+  if (optimize < 2)
+    return false;
+
+  n = EDGE_COUNT (bb->preds);
+
+  for (i = 0; i < n; ++i)
+    {
+      e1 = EDGE_PRED (bb, i);
+      if (e1->flags & EDGE_COMPLEX)
+	continue;
+      for (j = i + 1; j < n; ++j)
+        {
+          e2 = EDGE_PRED (bb, j);
+	  if (e2->flags & EDGE_COMPLEX)
+	    continue;
+
+	  /* Block e1->src might be deleted.  If bb and e1->src are the same
+	     block, delete e2->src instead, by swapping e1 and e2.  */
+	  e1_swapped = (bb == e1->src) ? e2: e1;
+	  e2_swapped = (bb == e1->src) ? e1: e2;
+
+	  /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+	     the same value.  */
+	  equiv_init (&phi_equiv);
+	  if (same_or_local_phi_alternatives (&phi_equiv, e1_swapped, e2_swapped))
+	    /* Collapse e1->src and e2->src if they are duplicates.  */
+	    changed = cleanup_duplicate_preds_1 (phi_equiv, e1_swapped, e2_swapped);
+	  else
+	    changed = false;
+
+	  equiv_delete (&phi_equiv);
+
+	  if (changed)
+	    return true;
+        }
+    }
+
+  return false;
+}
+
 /* Tries to cleanup cfg in basic block BB.  Returns true if anything
    changes.  */
 
@@ -668,6 +1210,9 @@ cleanup_tree_cfg_bb (basic_block bb)
       return true;
     }
 
+  if (cleanup_duplicate_preds (bb))
+    return true;
+
   return retval;
 }
 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.
  2011-06-08  9:49 [PATCH, PR43864] Gimple level duplicate block cleanup Tom de Vries
@ 2011-06-08  9:55 ` Tom de Vries
  2011-07-18  2:54   ` Tom de Vries
  2011-06-08 10:09 ` [PATCH, PR43864] Gimple level duplicate block cleanup Richard Guenther
  2011-06-10 18:43 ` Jeff Law
  2 siblings, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-06-08  9:55 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 273 bytes --]

On 06/08/2011 11:42 AM, Tom de Vries wrote:

> I'll send the patch with the testcases in a separate email.

OK for trunk?

Thanks,
- Tom

2011-06-08  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* gcc.dg/pr43864.c: New test.
	* gcc.dg/pr43864-2.c: New test.

[-- Attachment #2: pr43864.test.patch --]
[-- Type: text/x-patch, Size: 1501 bytes --]

Index: gcc/testsuite/gcc.dg/pr43864-2.c
===================================================================
--- /dev/null (new file)
+++ gcc/testsuite/gcc.dg/pr43864-2.c (revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int f(int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+    r = b + d;
+  else
+    {
+      e = b + d;
+      r = e;
+    }
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "\\\+" 1 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "PHI" 0 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/pr43864.c
===================================================================
--- /dev/null (new file)
+++ gcc/testsuite/gcc.dg/pr43864.c (revision 0)
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+#include <stdio.h>
+#include <stdlib.h>
+
+void foo (char*, FILE*);
+
+char* hprofStartupp (char *outputFileName, char *ctx)
+{
+  char fileName[1000];
+  FILE *fp;
+  sprintf (fileName, outputFileName);
+  if (access (fileName, 1) == 0)
+    {
+      free (ctx);
+      return 0;
+    }
+
+  fp = fopen (fileName, 0);
+  if (fp == 0)
+    {
+      free (ctx);
+      return 0;
+    }
+
+  foo (outputFileName, fp);
+
+  return ctx;
+}
+
+/* { dg-final { scan-tree-dump-times "free" 1 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-08  9:49 [PATCH, PR43864] Gimple level duplicate block cleanup Tom de Vries
  2011-06-08  9:55 ` [PATCH, PR43864] Gimple level duplicate block cleanup - test cases Tom de Vries
@ 2011-06-08 10:09 ` Richard Guenther
  2011-06-08 10:40   ` Steven Bosscher
  2011-06-10 17:16   ` Tom de Vries
  2011-06-10 18:43 ` Jeff Law
  2 siblings, 2 replies; 18+ messages in thread
From: Richard Guenther @ 2011-06-08 10:09 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Steven Bosscher, gcc-patches

On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
> Hi Richard,
>
> I have a patch for PR43864. The patch adds a gimple level duplicate block
> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
> table (%, lower is better).
>
>                     none            pic
>                thumb1  thumb2  thumb1 thumb2
> spec2000          99.9    99.9    99.8   99.8
>
> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
> optimizations proposed in PR20070 would fix this PR.
>
> The problem in this PR is that when compiling with -O2, the example below should
> only have one call to free. The original problem is formulated in terms of -Os,
> but currently we generate one call to free with -Os, although still not the
> smallest code possible. I'll show here the -O2 case, since that's similar to the
> original PR.
>
> #include <stdio.h>
> void foo (char*, FILE*);
> char* hprofStartupp(char *outputFileName, char *ctx)
> {
>    char fileName[1000];
>    FILE *fp;
>    sprintf(fileName, outputFileName);
>    if (access(fileName, 1) == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    fp = fopen(fileName, 0);
>    if (fp == 0) {
>        free(ctx);
>        return 0;
>    }
>
>    foo(outputFileName, fp);
>
>    return ctx;
> }
>
> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
> - Merging 2 blocks which are identical expect for input registers, by using a
>  conditional move to choose between the different input registers.
> - Merging 2 blocks which have different local registers, by ignoring those
>  differences
>
> Blocks .L6 and.L7 have no difference in local registers, but they have a
> difference in input registers: r3 and r1. Replacing the move to r5 by a
> conditional move would probably be benificial in terms of size, but it's not
> clear what condition the conditional move should be using. Calculating such a
> condition would add in size and increase the execution path.
>
> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
> ...
>        push    {r4, r5, lr}
>        mov     r4, r0
>        sub     sp, sp, #1004
>        mov     r5, r1
>        mov     r0, sp
>        mov     r1, r4
>        bl      sprintf
>        mov     r0, sp
>        movs    r1, #1
>        bl      access
>        mov     r3, r0
>        cbz     r0, .L6
>        movs    r1, #0
>        mov     r0, sp
>        bl      fopen
>        mov     r1, r0
>        cbz     r0, .L7
>        mov     r0, r4
>        bl      foo
> .L3:
>        mov     r0, r5
>        add     sp, sp, #1004
>        pop     {r4, r5, pc}
> .L6:
>        mov     r0, r5
>        mov     r5, r3
>        bl      free
>        b       .L3
> .L7:
>        mov     r0, r5
>        mov     r5, r1
>        bl      free
>        b       .L3
> ...
>
> The proposed patch solved the problem by dealing with the 2 blocks at a level
> when they are still identical: at gimple level. It detect that the 2 blocks are
> identical, and removes one of them.
>
> The following table shows the impact of the patch on the example in terms of
> size for -march=armv7-a:
>
>          without     with    delta
> Os      :     108      104       -4
> O2      :     120      104      -16
> Os thumb:      68       64       -4
> O2 thumb:      76       64      -12
>
> The gain in size for -O2 is that of removing the entire block, plus the
> replacement of 2 moves by a constant set, which also decreases the execution
> path. The patch ensures optimal code for both -O2 and -Os.
>
>
> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
> differences in comparison. Without this feature, we would only match blocks with
> resultless operations, due the the ssa-nature of gimples.
> For example, with this feature, we reduce the following function to its minimum
> at gimple level, rather than at rtl level.
>
> int f(int c, int b, int d)
> {
>  int r, e;
>
>  if (c)
>    r = b + d;
>  else
>    {
>      e = b + d;
>      r = e;
>    }
>
>  return r;
> }
>
> ;; Function f (f)
>
> f (int c, int b, int d)
> {
>  int e;
>
> <bb 2>:
>  e_6 = b_3(D) + d_4(D);
>  return e_6;
>
> }
>
> I'll send the patch with the testcases in a separate email.
>
> OK for trunk?

I don't like that you hook this into cleanup_tree_cfg - that is called
_way_ too often.

This also duplicates the literal matching done on the RTL level - instead
I think this optimization would be more related to value-numbering
(either that of SCCVN/FRE/PRE or that of DOM which also does
jump-threading).

Richard.

> Thanks,
> - Tom
>
> 2011-06-08  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43864
>        * tree-cfgcleanup.c (int_int_splay_lookup, int_int_splay_insert)
>        (int_int_splay_node_contained_in, int_int_splay_contained_in)
>        (equiv_lookup, equiv_insert, equiv_contained_in, equiv_init)
>        (equiv_delete, gimple_base_equal_p, pt_solution_equal_p, gimple_equal_p)
>        (bb_gimple_equal_p, update_debug_stmts, cleanup_duplicate_preds_1)
>        (same_or_local_phi_alternatives, cleanup_duplicate_preds): New function.
>        (cleanup_tree_cfg_bb): Use cleanup_duplicate_preds.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-08 10:09 ` [PATCH, PR43864] Gimple level duplicate block cleanup Richard Guenther
@ 2011-06-08 10:40   ` Steven Bosscher
  2011-06-10 17:16   ` Tom de Vries
  1 sibling, 0 replies; 18+ messages in thread
From: Steven Bosscher @ 2011-06-08 10:40 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Tom de Vries, gcc-patches

On Wed, Jun 8, 2011 at 11:55 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>> table (%, lower is better).
>>
>>                     none            pic
>>                thumb1  thumb2  thumb1 thumb2
>> spec2000          99.9    99.9    99.8   99.8
>>
>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>> optimizations proposed in PR20070 would fix this PR.
>>
>> The problem in this PR is that when compiling with -O2, the example below should
>> only have one call to free. The original problem is formulated in terms of -Os,
>> but currently we generate one call to free with -Os, although still not the
>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>> original PR.
>>
>> #include <stdio.h>
>> void foo (char*, FILE*);
>> char* hprofStartupp(char *outputFileName, char *ctx)
>> {
>>    char fileName[1000];
>>    FILE *fp;
>>    sprintf(fileName, outputFileName);
>>    if (access(fileName, 1) == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    fp = fopen(fileName, 0);
>>    if (fp == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    foo(outputFileName, fp);
>>
>>    return ctx;
>> }
>>
>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>> - Merging 2 blocks which are identical expect for input registers, by using a
>>  conditional move to choose between the different input registers.
>> - Merging 2 blocks which have different local registers, by ignoring those
>>  differences
>>
>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>> conditional move would probably be benificial in terms of size, but it's not
>> clear what condition the conditional move should be using. Calculating such a
>> condition would add in size and increase the execution path.
>>
>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>> ...
>>        push    {r4, r5, lr}
>>        mov     r4, r0
>>        sub     sp, sp, #1004
>>        mov     r5, r1
>>        mov     r0, sp
>>        mov     r1, r4
>>        bl      sprintf
>>        mov     r0, sp
>>        movs    r1, #1
>>        bl      access
>>        mov     r3, r0
>>        cbz     r0, .L6
>>        movs    r1, #0
>>        mov     r0, sp
>>        bl      fopen
>>        mov     r1, r0
>>        cbz     r0, .L7
>>        mov     r0, r4
>>        bl      foo
>> .L3:
>>        mov     r0, r5
>>        add     sp, sp, #1004
>>        pop     {r4, r5, pc}
>> .L6:
>>        mov     r0, r5
>>        mov     r5, r3
>>        bl      free
>>        b       .L3
>> .L7:
>>        mov     r0, r5
>>        mov     r5, r1
>>        bl      free
>>        b       .L3
>> ...
>>
>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>> when they are still identical: at gimple level. It detect that the 2 blocks are
>> identical, and removes one of them.
>>
>> The following table shows the impact of the patch on the example in terms of
>> size for -march=armv7-a:
>>
>>          without     with    delta
>> Os      :     108      104       -4
>> O2      :     120      104      -16
>> Os thumb:      68       64       -4
>> O2 thumb:      76       64      -12
>>
>> The gain in size for -O2 is that of removing the entire block, plus the
>> replacement of 2 moves by a constant set, which also decreases the execution
>> path. The patch ensures optimal code for both -O2 and -Os.
>>
>>
>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>> differences in comparison. Without this feature, we would only match blocks with
>> resultless operations, due the the ssa-nature of gimples.
>> For example, with this feature, we reduce the following function to its minimum
>> at gimple level, rather than at rtl level.
>>
>> int f(int c, int b, int d)
>> {
>>  int r, e;
>>
>>  if (c)
>>    r = b + d;
>>  else
>>    {
>>      e = b + d;
>>      r = e;
>>    }
>>
>>  return r;
>> }
>>
>> ;; Function f (f)
>>
>> f (int c, int b, int d)
>> {
>>  int e;
>>
>> <bb 2>:
>>  e_6 = b_3(D) + d_4(D);
>>  return e_6;
>>
>> }
>>
>> I'll send the patch with the testcases in a separate email.
>>
>> OK for trunk?
>
> I don't like that you hook this into cleanup_tree_cfg - that is called
> _way_ too often.
>
> This also duplicates the literal matching done on the RTL level - instead
> I think this optimization would be more related to value-numbering
> (either that of SCCVN/FRE/PRE or that of DOM which also does
> jump-threading).

And while at it: Put it in a separate file ;-)

Ciao!
Steven

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-08 10:09 ` [PATCH, PR43864] Gimple level duplicate block cleanup Richard Guenther
  2011-06-08 10:40   ` Steven Bosscher
@ 2011-06-10 17:16   ` Tom de Vries
  2011-06-14 15:12     ` Richard Guenther
  1 sibling, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-06-10 17:16 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 8277 bytes --]

Hi Richard,

thanks for the review.

On 06/08/2011 11:55 AM, Richard Guenther wrote:
> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>> table (%, lower is better).
>>
>>                     none            pic
>>                thumb1  thumb2  thumb1 thumb2
>> spec2000         99.9    99.9    99.8   99.8
>>
>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>> optimizations proposed in PR20070 would fix this PR.
>>
>> The problem in this PR is that when compiling with -O2, the example below should
>> only have one call to free. The original problem is formulated in terms of -Os,
>> but currently we generate one call to free with -Os, although still not the
>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>> original PR.
>>

Example A. (naming it for reference below)

>> #include <stdio.h>
>> void foo (char*, FILE*);
>> char* hprofStartupp(char *outputFileName, char *ctx)
>> {
>>    char fileName[1000];
>>    FILE *fp;
>>    sprintf(fileName, outputFileName);
>>    if (access(fileName, 1) == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    fp = fopen(fileName, 0);
>>    if (fp == 0) {
>>        free(ctx);
>>        return 0;
>>    }
>>
>>    foo(outputFileName, fp);
>>
>>    return ctx;
>> }
>>
>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>> - Merging 2 blocks which are identical expect for input registers, by using a
>>  conditional move to choose between the different input registers.
>> - Merging 2 blocks which have different local registers, by ignoring those
>>  differences
>>
>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>> conditional move would probably be benificial in terms of size, but it's not
>> clear what condition the conditional move should be using. Calculating such a
>> condition would add in size and increase the execution path.
>>
>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>> ...
>>        push    {r4, r5, lr}
>>        mov     r4, r0
>>        sub     sp, sp, #1004
>>        mov     r5, r1
>>        mov     r0, sp
>>        mov     r1, r4
>>        bl      sprintf
>>        mov     r0, sp
>>        movs    r1, #1
>>        bl      access
>>        mov     r3, r0
>>        cbz     r0, .L6
>>        movs    r1, #0
>>        mov     r0, sp
>>        bl      fopen
>>        mov     r1, r0
>>        cbz     r0, .L7
>>        mov     r0, r4
>>        bl      foo
>> .L3:
>>        mov     r0, r5
>>        add     sp, sp, #1004
>>        pop     {r4, r5, pc}
>> .L6:
>>        mov     r0, r5
>>        mov     r5, r3
>>        bl      free
>>        b       .L3
>> .L7:
>>        mov     r0, r5
>>        mov     r5, r1
>>        bl      free
>>        b       .L3
>> ...
>>
>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>> when they are still identical: at gimple level. It detect that the 2 blocks are
>> identical, and removes one of them.
>>
>> The following table shows the impact of the patch on the example in terms of
>> size for -march=armv7-a:
>>
>>          without     with    delta
>> Os      :     108      104       -4
>> O2      :     120      104      -16
>> Os thumb:      68       64       -4
>> O2 thumb:      76       64      -12
>>
>> The gain in size for -O2 is that of removing the entire block, plus the
>> replacement of 2 moves by a constant set, which also decreases the execution
>> path. The patch ensures optimal code for both -O2 and -Os.
>>
>>
>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>> differences in comparison. Without this feature, we would only match blocks with
>> resultless operations, due the the ssa-nature of gimples.
>> For example, with this feature, we reduce the following function to its minimum
>> at gimple level, rather than at rtl level.
>>

Example B. (naming it for reference below)

>> int f(int c, int b, int d)
>> {
>>  int r, e;
>>
>>  if (c)
>>    r = b + d;
>>  else
>>    {
>>      e = b + d;
>>      r = e;
>>    }
>>
>>  return r;
>> }
>>
>> ;; Function f (f)
>>
>> f (int c, int b, int d)
>> {
>>  int e;
>>
>> <bb 2>:
>>  e_6 = b_3(D) + d_4(D);
>>  return e_6;
>>
>> }
>>
>> I'll send the patch with the testcases in a separate email.
>>
>> OK for trunk?
> 
> I don't like that you hook this into cleanup_tree_cfg - that is called
> _way_ too often.
> 

Here is a reworked patch that addresses several concerns, particularly the
compile time overhead.

Changes:
- The optimization is now in a separate file.
- The optimization is now a pass rather than a cleanup. That allowed me to
  remove the test for pass-local flags.
  New is the pass driver tail_merge_optimize, based on
  tree-cfgcleanup.c:cleanup_tree_cfg_1.
- The pass is run once, on SSA. Before, the patch would
  fix example A only before SSA and example B only on SSA.
  In order to fix example A on SSA, I added these changes:
  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
  - insert vop phi in bb2, and use that one (update_vuses)
  - complete pt_solutions_equal_p.

Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.

I placed the pass at the earliest point where it fixes example B: After copy
propagation and dead code elimination, specifically, after the first invocation
of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?

> This also duplicates the literal matching done on the RTL level - instead
> I think this optimization would be more related to value-numbering
> (either that of SCCVN/FRE/PRE or that of DOM which also does
> jump-threading).

The pass currently does just duplicate block elimination, not cross-jumping.
If we would like to extend this to cross-jumping, I think we need to do the
reverse of value numbering: walk backwards over the bb, and keep track of the
way values are used rather than defined. This will allows us to make a cut
halfway a basic block.
In general, we cannot do cut halfway a basic block in the current implementation
(of value numbering and forward matching), since we assume equivalence of the
incoming vops at bb entry. This assumption is in general only valid if we indeed
replace the entire block by another entire block.
I imagine that a cross-jumping heuristic would be based on the length of the
match and the amount of non-vop phis it would introduce. Then value numbering
would be something orthogonal to this optimization, which would reduce amount of
phis needed for a cross-jump.
I think it would make sense to use SCCVN value numbering at the point that we
have this backward matching.

I'm not sure whether it's a good idea to try to replace the current forward
local value numbering with SCCVN value numbering, since we currently declare
vops equal, which are, in the global sense, not equal. And once we go to
backward matching, we'll need something to keep track of the uses, and we can
reuse the current infrastructure for that, but not the SCCVN value numbering.

Does that make any sense?

Ok for trunk, once ARM testing finishes ok?

Thanks,
- Tom

2011-06-10  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(int_int_splay_lookup, int_int_splay_insert)
	(int_int_splay_node_contained_in, int_int_splay_contained_in)
	(equiv_lookup, equiv_insert, equiv_contained_in, equiv_init)
	(equiv_delete, gimple_base_equal_p, pt_solution_equal_p, gimple_equal_p)
	(bb_gimple_equal_p, update_debug_stmts, update_vuses)
	(cleanup_duplicate_preds_1, same_or_local_phi_alternatives)
	(cleanup_duplicate_preds, tail_merge_optimize, gate_tail_merge): New
	function.
	(pass_tail_merge): New gimple pass.
	* tree-pass.h (pass_tail_merge): Declare new pass.
	* passes.c (init_optimization_passes): Use new pass.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.

[-- Attachment #2: pr43864.9.patch --]
[-- Type: text/x-patch, Size: 23755 bytes --]

Index: gcc/tree-ssa-tail-merge.c
===================================================================
--- /dev/null (new file)
+++ gcc/tree-ssa-tail-merge.c (revision 0)
@@ -0,0 +1,732 @@
+/* Tail merging for gimple.
+   Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Tom de Vries (tom@codesourcery.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tm_p.h"
+#include "basic-block.h"
+#include "output.h"
+#include "flags.h"
+#include "function.h"
+#include "tree-flow.h"
+#include "timevar.h"
+#include "tree-pass.h"
+#include "splay-tree.h"
+#include "bitmap.h"
+#include "tree-ssa-alias.h"
+
+static bitmap altered_bbs;
+
+/* Returns true if S contains (I1, I2).  */
+
+static bool
+int_int_splay_lookup (splay_tree s, unsigned int i1, unsigned int i2)
+{
+  splay_tree_node node;
+
+  if (s == NULL)
+    return false;
+
+  node = splay_tree_lookup (s, i1);
+  return node && node->value == i2;
+}
+
+/* Attempts to insert (I1, I2) into *S.  Returns true if successful.
+   Allocates *S if necessary.  */
+
+static bool
+int_int_splay_insert (splay_tree *s, unsigned int i1 , unsigned int i2)
+{
+  if (*s != NULL)
+    {
+      /* Check for existing element, which would otherwise be silently
+         overwritten by splay_tree_insert.  */
+      if (splay_tree_lookup (*s, i1))
+        return false;
+    }
+  else
+    *s = splay_tree_new (splay_tree_compare_ints, 0, 0);
+
+  splay_tree_insert (*s, i1, i2);
+  return true;
+}
+
+/* Returns 0 if (NODE->value, NODE->key) is an element of S.  Otherwise,
+   returns 1.  */
+
+static int
+int_int_splay_node_contained_in (splay_tree_node node, void *s)
+{
+  splay_tree_node snode = splay_tree_lookup ((splay_tree)s, node->key);
+  return (!snode || node->value != snode->value) ? 1 : 0;
+}
+
+/* Returns true if all elements of S1 are also in S2.  */
+
+static bool
+int_int_splay_contained_in (splay_tree s1, splay_tree s2)
+{
+  if (s1 == NULL)
+    return true;
+  if (s2 == NULL)
+    return false;
+  return splay_tree_foreach (s1, int_int_splay_node_contained_in, s2) == 0;
+}
+
+typedef splay_tree equiv_t;
+
+/* Returns true if EQUIV contains (SSA_NAME_VERSION (VAL1),
+                                   SSA_NAME_VERSION (VAL2)).  */
+
+static bool
+equiv_lookup (equiv_t equiv, tree val1, tree val2)
+{
+  if (val1 == NULL_TREE || val2 == NULL_TREE
+      || TREE_CODE (val1) != SSA_NAME || TREE_CODE (val2) != SSA_NAME)
+    return false;
+
+  return int_int_splay_lookup (equiv, SSA_NAME_VERSION (val1),
+                               SSA_NAME_VERSION (val2));
+}
+
+/* Attempts to insert (SSA_NAME_VERSION (VAL1), SSA_NAME_VERSION (VAL2)) into
+   EQUIV, provided they are defined BB1 and BB2.  Returns true if successful.
+   Allocates *EQUIV if necessary.  */
+
+static bool
+equiv_insert (equiv_t *equiv, tree val1, tree val2,
+              basic_block bb1, basic_block bb2)
+{
+  if (val1 == NULL_TREE || val2 == NULL_TREE
+      || TREE_CODE (val1) != SSA_NAME || TREE_CODE (val2) != SSA_NAME
+      || gimple_bb (SSA_NAME_DEF_STMT (val1)) != bb1
+      || gimple_bb (SSA_NAME_DEF_STMT (val2)) != bb2)
+    return false;
+
+  return int_int_splay_insert (equiv, SSA_NAME_VERSION (val1),
+                               SSA_NAME_VERSION (val2));
+}
+
+/* Returns true if all elements of S1 are also in S2.  */
+
+static bool
+equiv_contained_in (equiv_t s1, equiv_t s2)
+{
+  return int_int_splay_contained_in (s1, s2);
+}
+
+/* Init equiv_t *S.  */
+
+static void
+equiv_init (equiv_t *s)
+{
+  *s = NULL;
+}
+
+/* Delete equiv_t *S and reinit.  */
+
+static void
+equiv_delete (equiv_t *s)
+{
+  if (!*s)
+    return;
+
+  splay_tree_delete (*s);
+  *s = NULL;
+}
+
+/* Check whether S1 and S2 are equal, considering the fields in
+   gimple_statement_base.  Ignores fields uid, location, bb, and block, and the
+   pass-local flags visited and plf.  */
+
+static bool
+gimple_base_equal_p (gimple s1, gimple s2)
+{
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  if (gimple_no_warning_p (s1) != gimple_no_warning_p (s2))
+    return false;
+
+  if (is_gimple_assign (s1)
+      && (gimple_assign_nontemporal_move_p (s1)
+          != gimple_assign_nontemporal_move_p (s2)))
+    return false;
+
+  if (gimple_modified_p (s1) || gimple_modified_p (s2))
+    return false;
+
+  if (gimple_has_volatile_ops (s1) != gimple_has_volatile_ops (s2))
+    return false;
+
+  if (s1->gsbase.subcode != s2->gsbase.subcode)
+    return false;
+
+  if (gimple_num_ops (s1) != gimple_num_ops (s2))
+    return false;
+
+  return true;
+}
+
+/* Return true if p1 and p2 can be considered equal.  */
+
+static bool
+pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
+{
+  if (p1->anything != p2->anything
+      || p1->nonlocal != p2->nonlocal
+      || p1->escaped != p2->escaped
+      || p1->ipa_escaped != p2->ipa_escaped
+      || p1->null != p2->null
+      || p1->vars_contains_global != p2->vars_contains_global
+      || p1->vars_contains_restrict != p2->vars_contains_restrict)
+    return false;
+
+  if ((p1->vars == NULL) != (p2->vars == NULL))
+    return false;
+
+  if (p1->vars == NULL)
+    return true;
+
+  return bitmap_equal_p (p1->vars, p2->vars);
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  At entry, EQUIV
+   contains pairs of local defs that can be considered equivalent.  If S1 and S2
+   are equal, at exit EQUAL contains the defs and vdefs of S1 and S2.  If found,
+   return vop state at bb entry in PHI_USE1 for BB1 and PHI_USE2 for BB2.  */
+
+static bool
+gimple_equal_p (equiv_t *equiv, gimple s1, gimple s2,
+                tree *phi_vuse1, tree *phi_vuse2)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively.  */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  if (!gimple_base_equal_p (s1, s2))
+    return false;
+
+  gss = gimple_statement_structure (s1);
+  switch (gss)
+    {
+    case GSS_CALL:
+      if (!pt_solution_equal_p (gimple_call_use_set (s1),
+                                gimple_call_use_set (s2))
+          || !pt_solution_equal_p (gimple_call_clobber_set (s1),
+                                   gimple_call_clobber_set (s2))
+          || !gimple_call_same_target_p (s1, s2))
+        return false;
+      /* Falthru.  */
+
+    case GSS_WITH_MEM_OPS_BASE:
+    case GSS_WITH_MEM_OPS:
+      {
+        tree vdef1 = gimple_vdef (s1), vdef2 = gimple_vdef (s2);
+        tree vuse1 = gimple_vuse (s1), vuse2 = gimple_vuse (s2);
+        if (vuse1 == NULL_TREE || vuse2 == NULL_TREE)
+          {
+            if (vuse1 != vuse2)
+              return false;
+          }
+        else if (*phi_vuse1 == NULL_TREE)
+          {
+            *phi_vuse1 = vuse1;
+            *phi_vuse2 = vuse2;
+          }
+        else if (vuse1 != vuse2 && !(vuse1 == *phi_vuse1 && vuse2 == *phi_vuse2)
+                 &&!equiv_lookup (*equiv, vuse1, vuse2))
+          return false;
+
+        if (vdef1 == NULL_TREE || vdef2 == NULL_TREE)
+          {
+            if (vdef1 != vdef2)
+              return false;
+          }
+        else if (!equiv_insert (equiv, vdef1, vdef2, bb1, bb2))
+          return false;
+      }
+      /* Falthru.  */
+
+    case GSS_WITH_OPS:
+      /* Ignore gimple_def_ops and gimple_use_ops.  They are duplicates of
+         gimple_vdef, gimple_vuse and gimple_ops, which are checked
+         elsewhere.  */
+      /* Falthru.  */
+
+    case GSS_BASE:
+      break;
+
+    default:
+      return false;
+    }
+
+  /* Find lhs.  */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+
+  /* Handle ops.  */
+  for (i = 0; i < gimple_num_ops (s1); ++i)
+    {
+      tree t1 = gimple_op (s1, i);
+      tree t2 = gimple_op (s2, i);
+      if (t1 == NULL_TREE && t2 == NULL_TREE)
+        continue;
+      if (t1 == NULL_TREE || t2 == NULL_TREE)
+        return false;
+      /* Skip lhs.  */
+      if (lhs1 == t1 && i == 0)
+        continue;
+      if (!operand_equal_p (t1, t2, 0) && !equiv_lookup (*equiv, t1, t2))
+        return false;
+    }
+
+  /* Handle lhs.  */
+  if (lhs1 != lhs2 && !equiv_insert (equiv, lhs1, lhs2, bb1, bb2))
+    return false;
+
+  return true;
+}
+
+/* Return true if BB1 and BB2 contain the same non-debug gimple statements, and
+   if the def pairs in PHI_EQUIV are found to be equivalent defs in BB1 and
+   BB2.  Return vop state at bb entry in PHI_USE1 for BB1 and PHI_USE2 for
+   BB2.  */
+
+static bool
+bb_gimple_equal_p (equiv_t phi_equiv, basic_block bb1, basic_block bb2,
+                   tree *phi_vuse1, tree *phi_vuse2)
+{
+  gimple_stmt_iterator gsi1 = gsi_start_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_start_nondebug_bb (bb2);
+  bool end1, end2;
+  equiv_t equiv;
+  bool equal = true;
+
+  end1 = gsi_end_p (gsi1);
+  end2 = gsi_end_p (gsi2);
+
+  /* Don't handle empty blocks, these are handled elsewhere in the cleanup.  */
+  if (end1 || end2)
+    return false;
+
+  /* TODO: handle blocks with phi-nodes.  We'll have find corresponding
+     phi-nodes in bb1 and bb2, with the same alternatives for the same
+     preds.  */
+  if (phi_nodes (bb1) != NULL || phi_nodes (bb2) != NULL)
+    return false;
+
+  equiv_init (&equiv);
+  while (true)
+    {
+      if (end1 && end2)
+        break;
+      if (end1 || end2
+          || !gimple_equal_p (&equiv, gsi_stmt (gsi1), gsi_stmt (gsi2),
+                              phi_vuse1, phi_vuse2))
+        {
+          equal = false;
+          break;
+        }
+
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+
+  /* equiv now contains all bb1,bb2 def pairs which are equivalent.
+     Check if the phi alternatives are indeed equivalent.  */
+  equal = equal && equiv_contained_in (phi_equiv, equiv);
+
+  equiv_delete (&equiv);
+
+  return equal;
+}
+
+/* Resets all debug statements in BBUSE that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (basic_block bbuse)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef;
+  gimple stmt, def_stmt;
+  gimple_stmt_iterator gsi;
+  tree name;
+
+  for (gsi = gsi_start_bb (bbuse); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      if (!is_gimple_debug (stmt))
+        continue;
+      gcc_assert (gimple_debug_bind_p (stmt));
+
+      FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+        {
+          name = USE_FROM_PTR (use_p);
+          gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+          def_stmt = SSA_NAME_DEF_STMT (name);
+          gcc_assert (def_stmt != NULL);
+
+          bbdef = gimple_bb (def_stmt);
+          if (bbdef == NULL || bbuse == bbdef
+              || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+            continue;
+
+          gimple_debug_bind_reset_value (stmt);
+          update_stmt (stmt);
+        }
+    }
+}
+
+/* Create a vop phi in BB2, with VUSE1 arguments for all the REDIRECTED_EDGES,
+   and VUSE2 for the other edges.  Then use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+              VEC (edge,heap) *redirected_edges)
+{
+  gimple stmt, phi = NULL;
+  tree lhs, vuse, vdef;
+  unsigned int i;
+  gimple def_stmt1, def_stmt2;
+  gimple_stmt_iterator gsi;
+  source_location locus1, locus2;
+
+  if (vuse1 == NULL_TREE && vuse2 == NULL_TREE)
+    return;
+  gcc_assert (vuse1 != NULL_TREE && vuse2 != NULL_TREE);
+
+  def_stmt1 = SSA_NAME_DEF_STMT (vuse1);
+  locus1 = gimple_location (def_stmt1);
+  def_stmt2 = SSA_NAME_DEF_STMT (vuse2);
+  locus2 = gimple_location (def_stmt2);
+
+  /* This is not triggered yet, since we bail out if bb2 has more than
+     1 predecessor.  */
+  gcc_assert (gimple_bb (def_stmt2) != bb2);
+
+  /* No need to create a phi with 2 equal arguments.  */
+  if (vuse1 == vuse2)
+    return;
+
+  /* Create a phi, first with default argument vuse2 for all preds.  */
+  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+  phi = create_phi_node (lhs, bb2);
+  SSA_NAME_DEF_STMT (lhs) = phi;
+  for (i = 0; i < EDGE_COUNT (bb2->preds); ++i)
+    add_phi_arg (phi, vuse2, EDGE_PRED (bb2, i), locus2);
+
+  /* Now overwrite the arguments associated with the redirected edges with
+     vuse1.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+    {
+      edge e = VEC_index (edge, redirected_edges, i);
+      gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));
+      SET_PHI_ARG_DEF (phi, e->dest_idx, vuse1);
+      gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
+    }
+
+  /* Replace uses of vuse2 with uses of the phi.  */
+  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      vuse = gimple_vuse (stmt);
+      vdef = gimple_vdef (stmt);
+      if (vuse != NULL_TREE)
+        {
+          gcc_assert (vuse == vuse2);
+          gimple_set_vuse (stmt,  lhs);
+          update_stmt (stmt);
+        }
+      if (vdef != NULL_TREE)
+        break;
+    }
+}
+
+/* E1 and E2 have a common dest.  Detect if E1->src and E2->src are duplicates,
+   and if so, redirect the predecessor edges of E1->src to E2->src and remove
+   E1->src.  Returns true if any changes were made.  */
+
+static bool
+cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
+{
+  edge pred_edge;
+  basic_block bb1, bb2, pred;
+  basic_block bb_dom = NULL, bb2_dom = NULL;
+  unsigned int i;
+  basic_block bb = e1->dest;
+  tree phi_vuse1 = NULL_TREE, phi_vuse2 = NULL_TREE;
+  VEC (edge,heap) *redirected_edges;
+  gcc_assert (bb == e2->dest);
+
+  if (e1->flags != e2->flags)
+    return false;
+
+  bb1 = e1->src;
+  bb2 = e2->src;
+
+  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
+     have the same successors.  */
+  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
+    return false;
+
+  if (!bb_gimple_equal_p (phi_equiv, bb1, bb2, &phi_vuse1, &phi_vuse2))
+    return false;
+
+  if (dump_file)
+    fprintf (dump_file, "cleanup_duplicate_preds: "
+             "cleaning up <bb %d>, duplicate of <bb %d>\n", bb1->index,
+             bb2->index);
+
+  /* Calculate the changes to be made to the dominator info.
+     Calculate bb2_dom.  */
+  bb2_dom = nearest_common_dominator (CDI_DOMINATORS, bb2, bb1);
+  if (bb2_dom == bb1 || bb2_dom == bb2)
+    bb2_dom = get_immediate_dominator (CDI_DOMINATORS, bb2_dom);
+
+  /* Calculate bb_dom.  */
+  bb_dom = get_immediate_dominator (CDI_DOMINATORS, bb);
+  if (bb == bb2)
+    bb_dom = bb2_dom;
+  else if (bb_dom == bb1 || bb_dom == bb2)
+    bb_dom = bb2;
+  else
+    {
+      /* Count the predecessors of bb (other than bb1 or bb2), not dominated
+         by bb.  If there are none, merging bb1 and bb2 will mean that bb2
+         dominates bb.  */
+      int not_dominated = 0;
+      for (i = 0; i < EDGE_COUNT (bb->preds); ++i)
+        {
+          pred_edge = EDGE_PRED (bb, i);
+          pred = pred_edge->src;
+          if (pred == bb1 || pred == bb2)
+            continue;
+          if (dominated_by_p (CDI_DOMINATORS, pred, bb))
+            continue;
+          not_dominated++;
+        }
+      if (not_dominated == 0)
+        bb_dom = bb2;
+    }
+
+  redirected_edges = VEC_alloc (edge, heap, 10);
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0 ; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred = pred_edge->src;
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+      VEC_safe_push (edge, heap, redirected_edges, pred_edge);
+    }
+
+  /* The set of predecessors has changed for both bb and bb2.  */
+  bitmap_set_bit (altered_bbs, bb->index);
+  bitmap_set_bit (altered_bbs, bb2->index);
+
+  /* bb1 has no incoming edges anymore, and has become unreachable.  */
+  delete_basic_block (bb1);
+  bitmap_clear_bit (altered_bbs, bb1->index);
+
+  /* Update dominator info.  Note: update order is relevant.  */
+  set_immediate_dominator (CDI_DOMINATORS, bb2, bb2_dom);
+  if (bb != bb2)
+    set_immediate_dominator (CDI_DOMINATORS, bb, bb_dom);
+
+  /* Reset invalidated debug statements.  */
+  update_debug_stmts (bb2);
+
+  /* Insert vop phi, and update vuses.  */
+  update_vuses (phi_vuse1, phi_vuse2, bb2, redirected_edges);
+
+  VEC_free (edge, heap, redirected_edges);
+
+  return true;
+}
+
+/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
+   E2 are either:
+   - equal, or
+   - defined locally in E1->src and E2->src.
+   In the latter case, register the alternatives in *PHI_EQUIV.  */
+
+static bool
+same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
+{
+  int n1 = e1->dest_idx;
+  int n2 = e2->dest_idx;
+  gimple_stmt_iterator gsi;
+  basic_block dest = e1->dest;
+  gcc_assert (dest == e2->dest);
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      gcc_assert (val1 != NULL_TREE);
+      gcc_assert (val2 != NULL_TREE);
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+        continue;
+
+      if (equiv_insert (phi_equiv, val1, val2, e1->src, e2->src))
+        continue;
+
+      /* TODO: handle case that val1 and val2 are vops which are not locally
+         defined.  */
+      return false;
+    }
+
+  return true;
+}
+
+/* Detect duplicate predecessor blocks of BB and clean them up.  Return true if
+   any changes were made.  */
+
+static bool
+cleanup_duplicate_preds (basic_block bb)
+{
+  edge e1, e2, e1_swapped, e2_swapped;
+  unsigned int i, j, n;
+  equiv_t phi_equiv;
+  bool changed;
+
+  n = EDGE_COUNT (bb->preds);
+
+  for (i = 0; i < n; ++i)
+    {
+      e1 = EDGE_PRED (bb, i);
+      if (e1->flags & EDGE_COMPLEX)
+        continue;
+      for (j = i + 1; j < n; ++j)
+        {
+          e2 = EDGE_PRED (bb, j);
+          if (e2->flags & EDGE_COMPLEX)
+            continue;
+
+          /* Block e1->src might be deleted.  If bb and e1->src are the same
+             block, delete e2->src instead, by swapping e1 and e2.  */
+          e1_swapped = (bb == e1->src) ? e2: e1;
+          e2_swapped = (bb == e1->src) ? e1: e2;
+
+          /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+             the same value.  */
+          equiv_init (&phi_equiv);
+          if (same_or_local_phi_alternatives (&phi_equiv, e1_swapped,
+                                              e2_swapped))
+            /* Collapse e1->src and e2->src if they are duplicates.  */
+            changed = cleanup_duplicate_preds_1 (phi_equiv, e1_swapped,
+                                                 e2_swapped);
+          else
+            changed = false;
+
+          equiv_delete (&phi_equiv);
+
+          if (changed)
+            return true;
+        }
+    }
+
+  return false;
+}
+
+/* Runs tail merge optimization.  */
+
+static unsigned int
+tail_merge_optimize (void)
+{
+  basic_block bb;
+  unsigned i, n;
+
+  calculate_dominance_info (CDI_DOMINATORS);
+
+  /* Initialize worklist.  */
+  n = last_basic_block - NUM_FIXED_BLOCKS;
+  altered_bbs = BITMAP_ALLOC (NULL);
+  bitmap_set_range (altered_bbs, NUM_FIXED_BLOCKS, n);
+
+  /* Now process the altered blocks, as long as any are available.  */
+  while (!bitmap_empty_p (altered_bbs))
+    {
+      i = bitmap_first_set_bit (altered_bbs);
+      bitmap_clear_bit (altered_bbs, i);
+      if (i < NUM_FIXED_BLOCKS)
+        continue;
+
+      bb = BASIC_BLOCK (i);
+      if (!bb)
+        continue;
+
+      cleanup_duplicate_preds (bb);
+    }
+
+  BITMAP_FREE (altered_bbs);
+
+#ifdef ENABLE_CHECKING
+  verify_dominators (CDI_DOMINATORS);
+#endif
+
+  return 0;
+}
+
+/* Returns true if tail merge pass should be run.  */
+
+static bool
+gate_tail_merge (void)
+{
+  return optimize >= 2;
+}
+
+struct gimple_opt_pass pass_tail_merge =
+{
+ {
+  GIMPLE_PASS,
+  "tailmerge",                          /* name */
+  gate_tail_merge,                      /* gate */
+  tail_merge_optimize,                  /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  TV_TREE_CFG,                          /* tv_id */
+  PROP_ssa | PROP_cfg,                  /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_verify_ssa | TODO_verify_stmts
+  | TODO_cleanup_cfg | TODO_dump_func   /* todo_flags_finish */
+ }
+};
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h (revision 173734)
+++ gcc/tree-pass.h (working copy)
@@ -446,6 +446,7 @@ extern struct gimple_opt_pass pass_trace
 extern struct gimple_opt_pass pass_warn_unused_result;
 extern struct gimple_opt_pass pass_split_functions;
 extern struct gimple_opt_pass pass_feedback_split_functions;
+extern struct gimple_opt_pass pass_tail_merge;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in (revision 173734)
+++ gcc/Makefile.in (working copy)
@@ -1441,6 +1441,7 @@ OBJS-common = \
 	tree-ssa-sccvn.o \
 	tree-ssa-sink.o \
 	tree-ssa-structalias.o \
+	tree-ssa-tail-merge.o \
 	tree-ssa-ter.o \
 	tree-ssa-threadedge.o \
 	tree-ssa-threadupdate.o \
@@ -2395,6 +2396,13 @@ stor-layout.o : stor-layout.c $(CONFIG_H
    $(TREE_H) $(PARAMS_H) $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) output.h $(RTL_H) \
    $(GGC_H) $(TM_P_H) $(TARGET_H) langhooks.h $(REGS_H) gt-stor-layout.h \
    $(DIAGNOSTIC_CORE_H) $(CGRAPH_H) $(TREE_INLINE_H) $(TREE_DUMP_H) $(GIMPLE_H)
+tree-ssa-tail-merge.o: tree-ssa-tail-merge.c \
+   $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(BITMAP_H) \
+   $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
+   $(TREE_H) $(TREE_FLOW_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(FUNCTION_H) \
+   $(TREE_PASS_H) $(TIMEVAR_H) $(SPLAY_TREE_H) \
+   $(CGRAPH_H)
 tree-ssa-structalias.o: tree-ssa-structalias.c \
    $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(GGC_H) $(OBSTACK_H) $(BITMAP_H) \
    $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
Index: gcc/passes.c
===================================================================
--- gcc/passes.c (revision 173734)
+++ gcc/passes.c (working copy)
@@ -767,6 +767,7 @@ init_optimization_passes (void)
 	  NEXT_PASS (pass_copy_prop);
 	  NEXT_PASS (pass_merge_phi);
 	  NEXT_PASS (pass_cd_dce);
+	  NEXT_PASS (pass_tail_merge);
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-08  9:49 [PATCH, PR43864] Gimple level duplicate block cleanup Tom de Vries
  2011-06-08  9:55 ` [PATCH, PR43864] Gimple level duplicate block cleanup - test cases Tom de Vries
  2011-06-08 10:09 ` [PATCH, PR43864] Gimple level duplicate block cleanup Richard Guenther
@ 2011-06-10 18:43 ` Jeff Law
  2 siblings, 0 replies; 18+ messages in thread
From: Jeff Law @ 2011-06-10 18:43 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Guenther, Steven Bosscher, gcc-patches

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/08/11 03:42, Tom de Vries wrote:
> Hi Richard,
> 
> I have a patch for PR43864. The patch adds a gimple level duplicate block
> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
> table (%, lower is better).
> 
>                      none            pic
>                 thumb1  thumb2  thumb1 thumb2
> spec2000          99.9    99.9    99.8   99.8
> 
> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
> optimizations proposed in PR20070 would fix this PR.
> 
> The problem in this PR is that when compiling with -O2, the example below should
> only have one call to free. The original problem is formulated in terms of -Os,
> but currently we generate one call to free with -Os, although still not the
> smallest code possible. I'll show here the -O2 case, since that's similar to the
> original PR.
[ ... ]

FWIW, I've seen at least one paper which claims that extending value
numbering redundancy elimination to handle blocks is effective at
eliminating duplicates.    Redundancy elimination for blocks turns into
CFG manipulations.

We want to do this early in the pipeline so that register numbering
doesn't get in the way.

I'm going to let you and Richi iterate, but wanted to chime in with my
general support for detecting and eliminating blocks early in the
pipeline (gimple).

jeff
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJN8mTsAAoJEBRtltQi2kC7UxMIAKqEprg9LUJS4QnUeXRKDx7h
+tMBS1hPEZcmfxV9jo95oXb+1yiCIrI2CUou9PDO8E4i/Dv3x6kCF5vyfLBGx0ZI
dS80HjTdvr2vUzxFQnChESvDYepyg2JE6pnlnOe4cwnuyFHWhPQG7L3e3lDaZkTa
224afwDNNUMmCGDE4emUpgV/evQ4dpiiY/dqAU1fu6ev8wQxfDpyGeZOTD+qC1XM
DfpZumzHJpJfo2w/LMbSiBNci0HYxENjUheNixzHDLMSUO8AdStje6NBIJ96I3s7
p94WuwuP08B1wU5J0F4B1TiyAJxDOBbtApRwhpAOs9NImEDydmOXDLjIaPivXBY=
=2J0r
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-10 17:16   ` Tom de Vries
@ 2011-06-14 15:12     ` Richard Guenther
  2011-07-12 12:21       ` Tom de Vries
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Guenther @ 2011-06-14 15:12 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Steven Bosscher, gcc-patches

On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
> Hi Richard,
>
> thanks for the review.
>
> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>> Hi Richard,
>>>
>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>> table (%, lower is better).
>>>
>>>                     none            pic
>>>                thumb1  thumb2  thumb1 thumb2
>>> spec2000         99.9    99.9    99.8   99.8
>>>
>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>> optimizations proposed in PR20070 would fix this PR.
>>>
>>> The problem in this PR is that when compiling with -O2, the example below should
>>> only have one call to free. The original problem is formulated in terms of -Os,
>>> but currently we generate one call to free with -Os, although still not the
>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>> original PR.
>>>
>
> Example A. (naming it for reference below)
>
>>> #include <stdio.h>
>>> void foo (char*, FILE*);
>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>> {
>>>    char fileName[1000];
>>>    FILE *fp;
>>>    sprintf(fileName, outputFileName);
>>>    if (access(fileName, 1) == 0) {
>>>        free(ctx);
>>>        return 0;
>>>    }
>>>
>>>    fp = fopen(fileName, 0);
>>>    if (fp == 0) {
>>>        free(ctx);
>>>        return 0;
>>>    }
>>>
>>>    foo(outputFileName, fp);
>>>
>>>    return ctx;
>>> }
>>>
>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>  conditional move to choose between the different input registers.
>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>  differences
>>>
>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>> conditional move would probably be benificial in terms of size, but it's not
>>> clear what condition the conditional move should be using. Calculating such a
>>> condition would add in size and increase the execution path.
>>>
>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>> ...
>>>        push    {r4, r5, lr}
>>>        mov     r4, r0
>>>        sub     sp, sp, #1004
>>>        mov     r5, r1
>>>        mov     r0, sp
>>>        mov     r1, r4
>>>        bl      sprintf
>>>        mov     r0, sp
>>>        movs    r1, #1
>>>        bl      access
>>>        mov     r3, r0
>>>        cbz     r0, .L6
>>>        movs    r1, #0
>>>        mov     r0, sp
>>>        bl      fopen
>>>        mov     r1, r0
>>>        cbz     r0, .L7
>>>        mov     r0, r4
>>>        bl      foo
>>> .L3:
>>>        mov     r0, r5
>>>        add     sp, sp, #1004
>>>        pop     {r4, r5, pc}
>>> .L6:
>>>        mov     r0, r5
>>>        mov     r5, r3
>>>        bl      free
>>>        b       .L3
>>> .L7:
>>>        mov     r0, r5
>>>        mov     r5, r1
>>>        bl      free
>>>        b       .L3
>>> ...
>>>
>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>> identical, and removes one of them.
>>>
>>> The following table shows the impact of the patch on the example in terms of
>>> size for -march=armv7-a:
>>>
>>>          without     with    delta
>>> Os      :     108      104       -4
>>> O2      :     120      104      -16
>>> Os thumb:      68       64       -4
>>> O2 thumb:      76       64      -12
>>>
>>> The gain in size for -O2 is that of removing the entire block, plus the
>>> replacement of 2 moves by a constant set, which also decreases the execution
>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>
>>>
>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>> differences in comparison. Without this feature, we would only match blocks with
>>> resultless operations, due the the ssa-nature of gimples.
>>> For example, with this feature, we reduce the following function to its minimum
>>> at gimple level, rather than at rtl level.
>>>
>
> Example B. (naming it for reference below)
>
>>> int f(int c, int b, int d)
>>> {
>>>  int r, e;
>>>
>>>  if (c)
>>>    r = b + d;
>>>  else
>>>    {
>>>      e = b + d;
>>>      r = e;
>>>    }
>>>
>>>  return r;
>>> }
>>>
>>> ;; Function f (f)
>>>
>>> f (int c, int b, int d)
>>> {
>>>  int e;
>>>
>>> <bb 2>:
>>>  e_6 = b_3(D) + d_4(D);
>>>  return e_6;
>>>
>>> }
>>>
>>> I'll send the patch with the testcases in a separate email.
>>>
>>> OK for trunk?
>>
>> I don't like that you hook this into cleanup_tree_cfg - that is called
>> _way_ too often.
>>
>
> Here is a reworked patch that addresses several concerns, particularly the
> compile time overhead.
>
> Changes:
> - The optimization is now in a separate file.
> - The optimization is now a pass rather than a cleanup. That allowed me to
>  remove the test for pass-local flags.
>  New is the pass driver tail_merge_optimize, based on
>  tree-cfgcleanup.c:cleanup_tree_cfg_1.
> - The pass is run once, on SSA. Before, the patch would
>  fix example A only before SSA and example B only on SSA.
>  In order to fix example A on SSA, I added these changes:
>  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>  - insert vop phi in bb2, and use that one (update_vuses)
>  - complete pt_solutions_equal_p.
>
> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>
> I placed the pass at the earliest point where it fixes example B: After copy
> propagation and dead code elimination, specifically, after the first invocation
> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?

It's probably reasonable to run it after IPA inlining has taken place which
means insert it somewhen after the second pass_fre (I'd suggest after
pass_merge_phi).

But my general comment still applies - I don't like the structural
comparison code at all and you should really use the value-numbering
machineries we have or even better, merge this pass with FRE itself
(or DOM if that suits you more).  For FRE you'd want to hook into
tree-ssa-pre.c:eliminate().

>> This also duplicates the literal matching done on the RTL level - instead
>> I think this optimization would be more related to value-numbering
>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>> jump-threading).
>
> The pass currently does just duplicate block elimination, not cross-jumping.
> If we would like to extend this to cross-jumping, I think we need to do the
> reverse of value numbering: walk backwards over the bb, and keep track of the
> way values are used rather than defined. This will allows us to make a cut
> halfway a basic block.

I don't understand - I propose to do literal matching but using value-numbering
for tracking equivalences to avoid literal matching for stuff we know is
equivalent.  In fact I think it will be mostly calls and stores where we
need to do literal matching, but never intermediate computations on
registers.

But maybe I miss something here.

> In general, we cannot do cut halfway a basic block in the current implementation
> (of value numbering and forward matching), since we assume equivalence of the
> incoming vops at bb entry. This assumption is in general only valid if we indeed
> replace the entire block by another entire block.

Why are VOPs of concern at all?

> I imagine that a cross-jumping heuristic would be based on the length of the
> match and the amount of non-vop phis it would introduce. Then value numbering
> would be something orthogonal to this optimization, which would reduce amount of
> phis needed for a cross-jump.
> I think it would make sense to use SCCVN value numbering at the point that we
> have this backward matching.
>
> I'm not sure whether it's a good idea to try to replace the current forward
> local value numbering with SCCVN value numbering, since we currently declare
> vops equal, which are, in the global sense, not equal. And once we go to
> backward matching, we'll need something to keep track of the uses, and we can
> reuse the current infrastructure for that, but not the SCCVN value numbering.
>
> Does that make any sense?

Ok, let me think about this a bit.

For now about the patch in general.  The functions need renaming to
something more sensible now that this isn't cfg-cleanup anymore.

I miss a general overview of the pass - it's hard to reverse engineer
its working for me.  Like (working backwards), you are detecting
duplicate predecessors - that obviously doesn't work for duplicates
without any successors, like those ending in noreturn calls.

+  n = EDGE_COUNT (bb->preds);
+
+  for (i = 0; i < n; ++i)
+    {
+      e1 = EDGE_PRED (bb, i);
+      if (e1->flags & EDGE_COMPLEX)
+        continue;
+      for (j = i + 1; j < n; ++j)
+        {

that's quadratic in the number of predecessors.

+          /* Block e1->src might be deleted.  If bb and e1->src are the same
+             block, delete e2->src instead, by swapping e1 and e2.  */
+          e1_swapped = (bb == e1->src) ? e2: e1;
+          e2_swapped = (bb == e1->src) ? e1: e2;

is that because you incrementally merge preds two at a time?  As you
are deleting blocks don't you need to adjust the quadratic walking?
Thus, with say four equivalent preds won't your code crash anyway?

I think the code needs to delay the CFG manipulation to the end
of this function.

+/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
+   E2 are either:
+   - equal, or
+   - defined locally in E1->src and E2->src.
+   In the latter case, register the alternatives in *PHI_EQUIV.  */
+
+static bool
+same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
+{
+  int n1 = e1->dest_idx;
+  int n2 = e2->dest_idx;
+  gimple_stmt_iterator gsi;
+  basic_block dest = e1->dest;
+  gcc_assert (dest == e2->dest);

too many asserts in general - I'd say for this case pass in the destination
block as argument.

+      gcc_assert (val1 != NULL_TREE);
+      gcc_assert (val2 != NULL_TREE);

superfluous.

+static bool
+cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
...
+  VEC (edge,heap) *redirected_edges;
+  gcc_assert (bb == e2->dest);

same.

+  if (e1->flags != e2->flags)
+    return false;

that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
by swapping edges in the preds.

+  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
+     have the same successors.  */
+  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
+    return false;

hm, ok - that would need fixing, too.  Same or mergeable successors
of course, which makes me wonder if doing this whole transformation
incrementally and locally is a good idea ;)   Also

+  /* Calculate the changes to be made to the dominator info.
+     Calculate bb2_dom.  */
...

wouldn't be necessary I suppose (just throw away dom info after the
pass).

That is, I'd globally record BB equivalences (thus, "value-number"
BBs) and apply the CFG manipulations at a single point.

Btw, I miss where you insert PHI nodes for all uses that flow in
from the preds preds - you do that for VOPs but not for real
operands?

+  /* Replace uses of vuse2 with uses of the phi.  */
+  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
+    {

why not walk immediate uses of the old PHI and SET_USE to
the new one instead (for those uses in the duplicate BB of course)?

+    case GSS_CALL:
+      if (!pt_solution_equal_p (gimple_call_use_set (s1),
+                                gimple_call_use_set (s2))

I don't understand why you are concerned about equality of
points-to information.  Why not simply ior it (pt_solution_ior_into - note
they are shared so you need to unshare them first).

+/* Return true if p1 and p2 can be considered equal.  */
+
+static bool
+pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)

would go into tree-ssa-structalias.c instead.

+static bool
+gimple_base_equal_p (gimple s1, gimple s2)
+{
...
+  if (gimple_modified_p (s1) || gimple_modified_p (s2))
+    return false;

that shouldn't be of concern.

+  if (s1->gsbase.subcode != s2->gsbase.subcode)
+    return false;

for assigns that are of class GIMPLE_SINGLE_RHS we do not
update subcode during transformations so it can differ for now
equal statements.

I'm not sure if a splay tree for the SSA name version equivalency
map is the best representation - I would have used a simple
array of num_ssa_names size and assign value-numbers
(the lesser version for example).

Thus equiv_insert would do

  value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
  values[SSA_NAME_VERSION (val1)] = value;
  values[SSA_NAME_VERSION (val2)] = value;

if the names are not defined in bb1 resp. bb2 we would have to insert
a PHI node in the merged block - that would be a cost thingy for
doing this value-numbering in a more global way.

You don't seem to be concerned about the SSA names points-to
information, but it surely has the same issues as that of the calls
(so either they should be equal or they should be conservatively
merged).  But as-is you don't allow any values to flow into the
merged blocks that are not equal for both edges, no?

+  TV_TREE_CFG,                          /* tv_id */

add a new timevar.  We wan to be able to turn the pass off,
so also add a new option (I can see it might make debugging harder
in some cases).

Can you assess the effect of the patch on GCC itself (for example
when building cc1?)? What's the size benefit and the compile-time
overhead?

Thanks,
Richard.


> Thanks,
> - Tom
>
> 2011-06-10  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43864
>        * tree-ssa-tail-merge.c: New file.
>        (int_int_splay_lookup, int_int_splay_insert)
>        (int_int_splay_node_contained_in, int_int_splay_contained_in)
>        (equiv_lookup, equiv_insert, equiv_contained_in, equiv_init)
>        (equiv_delete, gimple_base_equal_p, pt_solution_equal_p, gimple_equal_p)
>        (bb_gimple_equal_p, update_debug_stmts, update_vuses)
>        (cleanup_duplicate_preds_1, same_or_local_phi_alternatives)
>        (cleanup_duplicate_preds, tail_merge_optimize, gate_tail_merge): New
>        function.
>        (pass_tail_merge): New gimple pass.
>        * tree-pass.h (pass_tail_merge): Declare new pass.
>        * passes.c (init_optimization_passes): Use new pass.
>        * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
>        (tree-ssa-tail-merge.o): New rule.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-06-14 15:12     ` Richard Guenther
@ 2011-07-12 12:21       ` Tom de Vries
  2011-07-12 14:37         ` Richard Guenther
  0 siblings, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-07-12 12:21 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 20015 bytes --]

Hi Richard,

here's a new version of the pass. I attempted to address as much as possible
your comments. The pass was bootstrapped and reg-tested on x86_64.

On 06/14/2011 05:01 PM, Richard Guenther wrote:
> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> thanks for the review.
>>
>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>>> Hi Richard,
>>>>
>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>>> table (%, lower is better).
>>>>
>>>>                     none            pic
>>>>                thumb1  thumb2  thumb1 thumb2
>>>> spec2000         99.9    99.9    99.8   99.8
>>>>
>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>>> optimizations proposed in PR20070 would fix this PR.
>>>>
>>>> The problem in this PR is that when compiling with -O2, the example below should
>>>> only have one call to free. The original problem is formulated in terms of -Os,
>>>> but currently we generate one call to free with -Os, although still not the
>>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>>> original PR.
>>>>
>>
>> Example A. (naming it for reference below)
>>
>>>> #include <stdio.h>
>>>> void foo (char*, FILE*);
>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>> {
>>>>    char fileName[1000];
>>>>    FILE *fp;
>>>>    sprintf(fileName, outputFileName);
>>>>    if (access(fileName, 1) == 0) {
>>>>        free(ctx);
>>>>        return 0;
>>>>    }
>>>>
>>>>    fp = fopen(fileName, 0);
>>>>    if (fp == 0) {
>>>>        free(ctx);
>>>>        return 0;
>>>>    }
>>>>
>>>>    foo(outputFileName, fp);
>>>>
>>>>    return ctx;
>>>> }
>>>>
>>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>>  conditional move to choose between the different input registers.
>>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>>  differences
>>>>
>>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>> conditional move would probably be benificial in terms of size, but it's not
>>>> clear what condition the conditional move should be using. Calculating such a
>>>> condition would add in size and increase the execution path.
>>>>
>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>> ...
>>>>        push    {r4, r5, lr}
>>>>        mov     r4, r0
>>>>        sub     sp, sp, #1004
>>>>        mov     r5, r1
>>>>        mov     r0, sp
>>>>        mov     r1, r4
>>>>        bl      sprintf
>>>>        mov     r0, sp
>>>>        movs    r1, #1
>>>>        bl      access
>>>>        mov     r3, r0
>>>>        cbz     r0, .L6
>>>>        movs    r1, #0
>>>>        mov     r0, sp
>>>>        bl      fopen
>>>>        mov     r1, r0
>>>>        cbz     r0, .L7
>>>>        mov     r0, r4
>>>>        bl      foo
>>>> .L3:
>>>>        mov     r0, r5
>>>>        add     sp, sp, #1004
>>>>        pop     {r4, r5, pc}
>>>> .L6:
>>>>        mov     r0, r5
>>>>        mov     r5, r3
>>>>        bl      free
>>>>        b       .L3
>>>> .L7:
>>>>        mov     r0, r5
>>>>        mov     r5, r1
>>>>        bl      free
>>>>        b       .L3
>>>> ...
>>>>
>>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>>> identical, and removes one of them.
>>>>
>>>> The following table shows the impact of the patch on the example in terms of
>>>> size for -march=armv7-a:
>>>>
>>>>          without     with    delta
>>>> Os      :     108      104       -4
>>>> O2      :     120      104      -16
>>>> Os thumb:      68       64       -4
>>>> O2 thumb:      76       64      -12
>>>>
>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>> replacement of 2 moves by a constant set, which also decreases the execution
>>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>>
>>>>
>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>>> differences in comparison. Without this feature, we would only match blocks with
>>>> resultless operations, due the the ssa-nature of gimples.
>>>> For example, with this feature, we reduce the following function to its minimum
>>>> at gimple level, rather than at rtl level.
>>>>
>>
>> Example B. (naming it for reference below)
>>
>>>> int f(int c, int b, int d)
>>>> {
>>>>  int r, e;
>>>>
>>>>  if (c)
>>>>    r = b + d;
>>>>  else
>>>>    {
>>>>      e = b + d;
>>>>      r = e;
>>>>    }
>>>>
>>>>  return r;
>>>> }
>>>>
>>>> ;; Function f (f)
>>>>
>>>> f (int c, int b, int d)
>>>> {
>>>>  int e;
>>>>
>>>> <bb 2>:
>>>>  e_6 = b_3(D) + d_4(D);
>>>>  return e_6;
>>>>
>>>> }
>>>>
>>>> I'll send the patch with the testcases in a separate email.
>>>>
>>>> OK for trunk?
>>>
>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>> _way_ too often.
>>>
>>
>> Here is a reworked patch that addresses several concerns, particularly the
>> compile time overhead.
>>
>> Changes:
>> - The optimization is now in a separate file.
>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>  remove the test for pass-local flags.
>>  New is the pass driver tail_merge_optimize, based on
>>  tree-cfgcleanup.c:cleanup_tree_cfg_1.
>> - The pass is run once, on SSA. Before, the patch would
>>  fix example A only before SSA and example B only on SSA.
>>  In order to fix example A on SSA, I added these changes:
>>  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>  - insert vop phi in bb2, and use that one (update_vuses)
>>  - complete pt_solutions_equal_p.
>>
>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>>
>> I placed the pass at the earliest point where it fixes example B: After copy
>> propagation and dead code elimination, specifically, after the first invocation
>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?
> 
> It's probably reasonable to run it after IPA inlining has taken place which
> means insert it somewhen after the second pass_fre (I'd suggest after
> pass_merge_phi).
> 

I placed it there, but I ran into some interaction with
pass_late_warn_uninitialized.  Addition of the pass makes test
gcc.dg/uninit-pred-2_c.c fail.

FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
                               (test for bogus messages, line 43)
FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
                               (test for warnings, line 45)

   int foo_2 (int n, int m, int r)
   {
     int flag = 0;
     int v;

     if (n)
       {
         v = r;
	 flag = 1;
       }

     if (m) g++;
     else bar ();

     if (flag)
       blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" }
     else
       blah (v); { dg-warning "uninitialized" "real uninitialized var warning" }

     return 0;
   }

The pass replaces the second call to blah with the first one, and eliminates
the if.  After that, the uninitialized warning is issued for the line number
of the first call to blah, while at source level the warning only makes sense
for the second call to blah.

Shall I try putting the pass after pass_late_warn_uninitialized?

> But my general comment still applies - I don't like the structural
> comparison code at all and you should really use the value-numbering
> machineries we have

I now use sccvn.

> or even better, merge this pass with FRE itself
> (or DOM if that suits you more).  For FRE you'd want to hook into
> tree-ssa-pre.c:eliminate().
> 

If we need to do the transformation after pass_late_warn_uninitialized, it needs
to stay on its own, I suppose.

>>> This also duplicates the literal matching done on the RTL level - instead
>>> I think this optimization would be more related to value-numbering
>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>> jump-threading).
>>
>> The pass currently does just duplicate block elimination, not cross-jumping.
>> If we would like to extend this to cross-jumping, I think we need to do the
>> reverse of value numbering: walk backwards over the bb, and keep track of the
>> way values are used rather than defined. This will allows us to make a cut
>> halfway a basic block.
> 
> I don't understand - I propose to do literal matching but using value-numbering
> for tracking equivalences to avoid literal matching for stuff we know is
> equivalent.  In fact I think it will be mostly calls and stores where we
> need to do literal matching, but never intermediate computations on
> registers.
> 

I tried to implement that scheme now.

> But maybe I miss something here.
>
>> In general, we cannot do cut halfway a basic block in the current implementation
>> (of value numbering and forward matching), since we assume equivalence of the
>> incoming vops at bb entry. This assumption is in general only valid if we indeed
>> replace the entire block by another entire block.
> 
> Why are VOPs of concern at all?
> 

In the previous version, I inserted the phis for the vops manually.
In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
with vops, so it's not relevant anymore.

>> I imagine that a cross-jumping heuristic would be based on the length of the
>> match and the amount of non-vop phis it would introduce. Then value numbering
>> would be something orthogonal to this optimization, which would reduce amount of
>> phis needed for a cross-jump.
>> I think it would make sense to use SCCVN value numbering at the point that we
>> have this backward matching.
>>
>> I'm not sure whether it's a good idea to try to replace the current forward
>> local value numbering with SCCVN value numbering, since we currently declare
>> vops equal, which are, in the global sense, not equal. And once we go to
>> backward matching, we'll need something to keep track of the uses, and we can
>> reuse the current infrastructure for that, but not the SCCVN value numbering.
>>
>> Does that make any sense?
> 
> Ok, let me think about this a bit.
> 

I tried to to be more clear on this in the header comment of the pass.

> For now about the patch in general.  The functions need renaming to
> something more sensible now that this isn't cfg-cleanup anymore.
> 
> I miss a general overview of the pass - it's hard to reverse engineer
> its working for me.

I added a header comment.

> Like (working backwards), you are detecting
> duplicate predecessors
> - that obviously doesn't work for duplicates
> without any successors, like those ending in noreturn calls.
> 

Merging of blocks without successors works now.

> +  n = EDGE_COUNT (bb->preds);
> +
> +  for (i = 0; i < n; ++i)
> +    {
> +      e1 = EDGE_PRED (bb, i);
> +      if (e1->flags & EDGE_COMPLEX)
> +        continue;
> +      for (j = i + 1; j < n; ++j)
> +        {
> 
> that's quadratic in the number of predecessors.
> 

The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs
per worklist iteration.

> +          /* Block e1->src might be deleted.  If bb and e1->src are the same
> +             block, delete e2->src instead, by swapping e1 and e2.  */
> +          e1_swapped = (bb == e1->src) ? e2: e1;
> +          e2_swapped = (bb == e1->src) ? e1: e2;
> 
> is that because you incrementally merge preds two at a time?  As you
> are deleting blocks don't you need to adjust the quadratic walking?
> Thus, with say four equivalent preds won't your code crash anyway?
> 

I think it was to make calculation of dominator info easier, but I use now
functions from dominance.c for that, so this piece of code is gone.

> I think the code needs to delay the CFG manipulation to the end
> of this function.
> 

I now delay the cfg manipulation till after each analysis phase.

> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
> +   E2 are either:
> +   - equal, or
> +   - defined locally in E1->src and E2->src.
> +   In the latter case, register the alternatives in *PHI_EQUIV.  */
> +
> +static bool
> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
> +{
> +  int n1 = e1->dest_idx;
> +  int n2 = e2->dest_idx;
> +  gimple_stmt_iterator gsi;
> +  basic_block dest = e1->dest;
> +  gcc_assert (dest == e2->dest);
> 
> too many asserts in general - I'd say for this case pass in the destination
> block as argument.
> 
> +      gcc_assert (val1 != NULL_TREE);
> +      gcc_assert (val2 != NULL_TREE);
> 
> superfluous.
> 
> +static bool
> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
> ...
> +  VEC (edge,heap) *redirected_edges;
> +  gcc_assert (bb == e2->dest);
> 
> same.
> 
> +  if (e1->flags != e2->flags)
> +    return false;
> 
> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
> by swapping edges in the preds.
> 

That's handled now.

> +  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
> +     have the same successors.  */
> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
> +    return false;
> 
> hm, ok - that would need fixing, too.  Same or mergeable successors
> of course, which makes me wonder if doing this whole transformation
> incrementally and locally is a good idea ;)   Also
> 

Also handled now.

> +  /* Calculate the changes to be made to the dominator info.
> +     Calculate bb2_dom.  */
> ...
> 
> wouldn't be necessary I suppose (just throw away dom info after the
> pass).
> 
> That is, I'd globally record BB equivalences (thus, "value-number"
> BBs) and apply the CFG manipulations at a single point.
> 

I delay the cfg manipulation till after each analysis phase. Delaying the cfg
manipulation till the end of the pass instead might make the analysis code more
convoluted.

> Btw, I miss where you insert PHI nodes for all uses that flow in
> from the preds preds - you do that for VOPs but not for real
> operands?
> 

Indeed, inserting phis for non-vops is a todo.

> +  /* Replace uses of vuse2 with uses of the phi.  */
> +  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> 
> why not walk immediate uses of the old PHI and SET_USE to
> the new one instead (for those uses in the duplicate BB of course)?
> 

And I no longer insert VOP phis, but let a TODO handle that, so this code is gone.

> +    case GSS_CALL:
> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
> +                                gimple_call_use_set (s2))
> 
> I don't understand why you are concerned about equality of
> points-to information.  Why not simply ior it (pt_solution_ior_into - note
> they are shared so you need to unshare them first).
> 

I let a todo handle the alias info now.

> +/* Return true if p1 and p2 can be considered equal.  */
> +
> +static bool
> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
> 
> would go into tree-ssa-structalias.c instead.
> 
> +static bool
> +gimple_base_equal_p (gimple s1, gimple s2)
> +{
> ...
> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
> +    return false;
> 
> that shouldn't be of concern.
> 
> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
> +    return false;
> 
> for assigns that are of class GIMPLE_SINGLE_RHS we do not
> update subcode during transformations so it can differ for now
> equal statements.
> 

handled properly now.

> I'm not sure if a splay tree for the SSA name version equivalency
> map is the best representation - I would have used a simple
> array of num_ssa_names size and assign value-numbers
> (the lesser version for example).
> 
> Thus equiv_insert would do
> 
>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>   values[SSA_NAME_VERSION (val1)] = value;
>   values[SSA_NAME_VERSION (val2)] = value;
> 
> if the names are not defined in bb1 resp. bb2 we would have to insert
> a PHI node in the merged block - that would be a cost thingy for
> doing this value-numbering in a more global way.
> 

local value numbering code has been removed.

> You don't seem to be concerned about the SSA names points-to
> information, but it surely has the same issues as that of the calls
> (so either they should be equal or they should be conservatively
> merged).  But as-is you don't allow any values to flow into the
> merged blocks that are not equal for both edges, no?
> 

Correct, that's still a todo.

> +  TV_TREE_CFG,                          /* tv_id */
> 
> add a new timevar.  We wan to be able to turn the pass off,
> so also add a new option (I can see it might make debugging harder
> in some cases).
> 

I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.

> Can you assess the effect of the patch on GCC itself (for example
> when building cc1?)? What's the size benefit and the compile-time
> overhead?
> 

effect on building cc1:

               real        user        sys
without: 19m50.158s  19m 2.090s  0m20.860s
with:    19m59.456s  19m17.170s  0m20.350s
                     ----------
                       +15.080s
                         +1.31%

$ size without/cc1 with/cc1
    text   data      bss       dec      hex     filename
17515986  41320  1364352  18921658  120b8ba  without/cc1
17399226  41320  1364352  18804898  11ef0a2     with/cc1
--------
 -116760
  -0.67%

OK for trunk, provided build & reg-testing on ARM is ok?

Thanks,
- Tom

2011-07-12  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(bb_dominated_by_p): New function.
	(scc_vn_ok): New var.
	(init_gvn, delete_gvn, gvn_val, gvn_uses_equal): New function.
	(bb_size): New var.
	(init_bb_size, delete_bb_size): New function.
	(struct same_succ): Define.
	(same_succ_t, const_same_succ_t): New typedef.
	(same_succ_print, same_succ_print_traverse, same_succ_hash)
	(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
	(same_succ_reset): New function.
	(same_succ_htab, bb_to_same_succ, same_succ_edge_flags)
	(bitmap deleted_bbs, deleted_bb_preds): New vars.
	(debug_same_succ): New function.
	(worklist): New var.
	(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
	(init_worklist, delete_worklist, delete_basic_block_same_succ)
	(update_worklist): New function.
	(struct bb_cluster): Define.
	(bb_cluster_t, const_bb_cluster_t): New typedef.
	(print_cluster, debug_cluster, same_predecessors)
	(add_bb_to_cluster, new_cluster, delete_cluster): New function.
	(merge_cluster, all_clusters): New var.
	(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
	(merge_clusters, set_cluster): New function.
	(gimple_subcode_equal_p, gimple_base_equal_p, gimple_equal_p)
	(bb_gimple_equal_p): New function.
	(find_duplicate, same_phi_alternatives_1, same_phi_alternatives)
	(bb_has_non_vop_phi, find_clusters_1, find_clusters): New function.
	(replace_block_by, apply_clusters): New	function.
	(update_debug_stmt, update_debug_stmt): New function.
	(tail_merge_optimize, gate_tail_merge): New function.
	(pass_tail_merge): New gimple pass.
	* tree-pass.h (pass_tail_merge): Declare new pass.
	* passes.c (init_optimization_passes): Use new pass.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.
	* opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
	OPT_LEVELS_2_PLUS.
	* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
	* common.opt (ftree-tail-merge): New switches.
	* params.def (PARAM_TAIL_MERGE_MAX_COMPARISONS): New parameter.

[-- Attachment #2: pr43864.27.patch --]
[-- Type: text/x-patch, Size: 45542 bytes --]

Index: gcc/tree-ssa-tail-merge.c
===================================================================
--- gcc/tree-ssa-tail-merge.c	(revision 0)
+++ gcc/tree-ssa-tail-merge.c	(revision 0)
@@ -0,0 +1,1530 @@
+/* Tail merging for gimple.
+   Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Tom de Vries (tom@codesourcery.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* Pass overview.
+
+
+   MOTIVATIONAL EXAMPLE
+
+   gimple representation of gcc/testsuite/gcc.dg/pr43864.c at
+
+   hprofStartupp (charD.1 * outputFileNameD.2600, charD.1 * ctxD.2601)
+   {
+     struct FILED.1638 * fpD.2605;
+     charD.1 fileNameD.2604[1000];
+     intD.0 D.3915;
+     const charD.1 * restrict outputFileName.0D.3914;
+
+     # BLOCK 2 freq:10000
+     # PRED: ENTRY [100.0%]  (fallthru,exec)
+     # PT = nonlocal { D.3926 } (restr)
+     outputFileName.0D.3914_3
+       = (const charD.1 * restrict) outputFileNameD.2600_2(D);
+     # .MEMD.3923_13 = VDEF <.MEMD.3923_12(D)>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     sprintfD.759 (&fileNameD.2604, outputFileName.0D.3914_3);
+     # .MEMD.3923_14 = VDEF <.MEMD.3923_13>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     D.3915_4 = accessD.2606 (&fileNameD.2604, 1);
+     if (D.3915_4 == 0)
+       goto <bb 3>;
+     else
+       goto <bb 4>;
+     # SUCC: 3 [10.0%]  (true,exec) 4 [90.0%]  (false,exec)
+
+     # BLOCK 3 freq:1000
+     # PRED: 2 [10.0%]  (true,exec)
+     # .MEMD.3923_15 = VDEF <.MEMD.3923_14>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 4 freq:9000
+     # PRED: 2 [90.0%]  (false,exec)
+     # .MEMD.3923_16 = VDEF <.MEMD.3923_14>
+     # PT = nonlocal escaped
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fpD.2605_8 = fopenD.1805 (&fileNameD.2604[0], 0B);
+     if (fpD.2605_8 == 0B)
+       goto <bb 5>;
+     else
+       goto <bb 6>;
+     # SUCC: 5 [1.9%]  (true,exec) 6 [98.1%]  (false,exec)
+
+     # BLOCK 5 freq:173
+     # PRED: 4 [1.9%]  (true,exec)
+     # .MEMD.3923_17 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 6 freq:8827
+     # PRED: 4 [98.1%]  (false,exec)
+     # .MEMD.3923_18 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fooD.2599 (outputFileNameD.2600_2(D), fpD.2605_8);
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 7 freq:10000
+     # PRED: 3 [100.0%]  (fallthru,exec) 5 [100.0%]  (fallthru,exec)
+             6 [100.0%]  (fallthru,exec)
+     # PT = nonlocal null
+
+     # ctxD.2601_1 = PHI <0B(3), 0B(5), ctxD.2601_5(D)(6)>
+     # .MEMD.3923_11 = PHI <.MEMD.3923_15(3), .MEMD.3923_17(5),
+                            .MEMD.3923_18(6)>
+     # VUSE <.MEMD.3923_11>
+     return ctxD.2601_1;
+     # SUCC: EXIT [100.0%]
+   }
+
+   bb 3 and bb 5 can be merged.  The blocks have different predecessors, but the
+   same successors, and the same operations.
+
+
+   CONTEXT
+
+   A technique called tail merging (or cross jumping) can fix the example
+   above.  For a block, we look for common code at the end (the tail) of the
+   predecessor blocks, and insert jumps from one block to the other.
+   The example is a special case for tail merging, in that 2 whole blocks
+   can be merged, rather than just the end parts of it.
+   We currently only focus on whole block merging, so in that sense
+   calling this pass tail merge is a bit of a misnomer.
+
+   We distinguish 2 kinds of situations in which blocks can be merged:
+   - same operations, same predecessors.  The successor edges coming from one
+     block are redirected to come from the other block.
+   - same operations, same successors.  The predecessor edges entering one block
+     are redirected to enter the other block.  Note that this operation might
+     involve introducing phi operations.
+
+   For efficient implementation, we would like to value numbers the blocks, and
+   have a comparison operator that tells us whether the blocks are equal.
+   Besides being runtime efficient, block value numbering should also abstract
+   from irrelevant differences in order of operations, much like normal value
+   numbering abstracts from irrelevant order of operations.
+
+   For the first situation (same_operations, same predecessors), normal value
+   numbering fits well.  We can calculate a block value number based on the
+   value numbers of the defs and vdefs.
+
+   For the second situation (same operations, same successors), this approach
+   doesn't work so well.  We can illustrate this using the example.  The calls
+   to free use different vdefs: MEMD.3923_16 and MEMD.3923_14, and these will
+   remain different in value numbering, since they represent different memory
+   states.  So the resulting vdefs of the frees will be different in value
+   numbering, so the block value numbers will be different.
+
+   The reason why we call the blocks equal is not because they define the same
+   values, but because uses in the blocks use (possibly different) defs in the
+   same way.  To be able to detect this efficiently, we need to do some kind of
+   reverse value numbering, meaning number the uses rather than the defs, and
+   calculate a block value number based on the value number of the uses.
+   Ideally, a block comparison operator will also indicate which phis are needed
+   to merge the blocks.
+
+   For the moment, we don't do block value numbering, but we do insn-by-insn
+   matching, using scc value numbers to match operations with results, and
+   structural comparison otherwise, while ignoring vop mismatches.
+
+
+   IMPLEMENTATION
+
+   1. The pass first determines all groups of blocks with the same successor
+      blocks.
+   2. Within each group, it tries to determine clusters of equal basic blocks.
+   3. The clusters are applied.
+   4. The same successor groups are updated.
+   5. This process is repeated from 2 onwards, until no more changes.
+
+
+   LIMITATIONS/TODO
+
+   - block only
+   - handles only 'same operations, same successors'.
+     It handles same predecessors as a special subcase though.
+   - does not implement the reverse value numbering and block value numbering.
+   - improve memory allocation: use garbage collected memory, obstacks,
+     allocpools where appropriate.
+   - no insertion of phis.  We only introduce vop-phis, and not explicitly, but
+     by TODO_update_ssa_only_virtuals.
+
+
+   SWITCHES
+
+   - ftree-tail-merge.  On at -O2.  We might have make the pass less aggressive
+     for -O2, and only keep maximum at -Os.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tm_p.h"
+#include "basic-block.h"
+#include "output.h"
+#include "flags.h"
+#include "function.h"
+#include "tree-flow.h"
+#include "timevar.h"
+#include "tree-pass.h"
+#include "bitmap.h"
+#include "tree-ssa-alias.h"
+#include "params.h"
+#include "tree-pretty-print.h"
+#include "hashtab.h"
+#include "gimple-pretty-print.h"
+#include "tree-ssa-sccvn.h"
+#include "tree-dump.h"
+
+/* Returns true if BB1 is dominated by BB2.  Robust against
+   arguments being NULL, where NULL means entry bb.  */
+
+static bool
+bb_dominated_by_p (basic_block bb1, basic_block bb2)
+{
+  if (!bb1)
+    return false;
+
+  if (!bb2)
+    return true;
+
+  return dominated_by_p (CDI_DOMINATORS, bb1, bb2);
+}
+
+/* Indicates whether we can use scc_vn info.  */
+
+static bool scc_vn_ok;
+
+/* Initializes scc_vn info.  */
+
+static void
+init_gvn (void)
+{
+  scc_vn_ok = run_scc_vn (VN_NOWALK);
+}
+
+/* Deletes scc_vn info.  */
+
+static void
+delete_gvn (void)
+{
+  if (!scc_vn_ok)
+    return;
+
+  free_scc_vn ();
+}
+
+/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
+   Otherwise, return X.  */
+
+static tree
+gvn_val (tree x)
+{
+  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
+	  ? VN_INFO ((x))->valnum : x);
+}
+
+/* VAL1 and VAL2 are either:
+   - uses in BB1 and BB2, or
+   - phi alternatives for BB1 and BB2.
+   SAME_PREDS indicates whether BB1 and BB2 have the same predecessors.
+   Return true if the uses have the same gvn value, and if the corresponding
+   defs can be used in both BB1 and BB2.  */
+
+static bool
+gvn_uses_equal (tree val1, tree val2, basic_block bb1,
+		basic_block bb2, bool same_preds)
+{
+  gimple def1, def2;
+  basic_block def1_bb, def2_bb;
+
+  if (val1 == NULL_TREE || val2 == NULL_TREE)
+    return false;
+
+  if (gvn_val (val1) != gvn_val (val2))
+    return false;
+
+  /* If BB1 and BB2 have the same predecessors, the same values are defined at
+     entry of BB1 and BB2.  Otherwise, we need to check.  */
+
+  if (TREE_CODE (val1) == SSA_NAME)
+    {
+      if (!same_preds)
+	{
+	  def1 = SSA_NAME_DEF_STMT (val1);
+	  def1_bb = gimple_bb (def1);
+	  if (!bb_dominated_by_p (bb2, def1_bb))
+	    return false;
+	}
+    }
+  else if (!CONSTANT_CLASS_P (val1))
+    return false;
+
+  if (TREE_CODE (val2) == SSA_NAME)
+    {
+      if (!same_preds)
+	{
+	  def2 = SSA_NAME_DEF_STMT (val2);
+	  def2_bb = gimple_bb (def2);
+	  if (bb_dominated_by_p (bb1, def2_bb))
+	    return false;
+	}
+    }
+  else if (!CONSTANT_CLASS_P (val2))
+    return false;
+
+  return true;
+}
+
+/* Size of each bb.  */
+
+static int *bb_size;
+
+/* Init bb_size administration.  */
+
+static void
+init_bb_size (void)
+{
+  int i;
+  int size;
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+
+  bb_size = XNEWVEC (int, last_basic_block);
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      bb = BASIC_BLOCK (i);
+      size = 0;
+      bb_size[i] = size;
+      if (bb == NULL)
+	continue;
+      for (gsi = gsi_start_nondebug_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+	size++;
+      bb_size[i] = size;
+    }
+}
+
+/* Delete bb_size administration.  */
+
+static void
+delete_bb_size (void)
+{
+  XDELETEVEC (bb_size);
+  bb_size = NULL;
+}
+
+struct same_succ
+{
+  /* The bbs that have the same successor bbs.  */
+  bitmap bbs;
+  /* The successor bbs.  */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
+     bb.  */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs.  */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is in the worklist.  */
+  bool in_worklist;
+};
+typedef struct same_succ *same_succ_t;
+typedef const struct same_succ *const_same_succ_t;
+
+/* Prints E to FILE.  */
+
+static void
+same_succ_print (FILE *file, const same_succ_t e)
+{
+  unsigned int i;
+  bitmap_print (file, e->bbs, "bbs:", "\n");
+  bitmap_print (file, e->succs, "succs:", "\n");
+  bitmap_print (file, e->inverse, "inverse:", "\n");
+  fprintf (file, "flags:");
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    fprintf (file, " %x", VEC_index (int, e->succ_flags, i));
+  fprintf (file, "\n");
+}
+
+/* Prints same_succ VE to VFILE.  */
+
+static int
+same_succ_print_traverse (void **ve, void *vfile)
+{
+  const same_succ_t e = *((const same_succ_t *)ve);
+  FILE *file = ((FILE*)vfile);
+  same_succ_print (file, e);
+  return 1;
+}
+
+/* Calculates hash value for same_succ VE.  */
+
+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t)ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  int size = bb_size [first];
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  basic_block bb = BASIC_BLOCK (first);
+
+  hashval = iterative_hash_hashval_t (size, hashval);
+  for (gsi = gsi_start_nondebug_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (!is_gimple_call (stmt))
+	continue;
+      if (gimple_call_internal_p (stmt))
+	hashval = iterative_hash_hashval_t
+	  ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+	hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
+    }
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    {
+      flags = VEC_index (int, e->succ_flags, i);
+      flags = flags & ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+      hashval = iterative_hash_hashval_t (flags, hashval);
+    }
+  return hashval;
+}
+
+/* Returns true if E1 and E2 have 2 successors, and if the successor flags
+   are inverse for the EDGE_TRUE_VALUE and EDGE_FALSE_VALUE flags, and equal for
+   the other edge flags.  */
+
+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;
+
+  f1a = VEC_index (int, e1->succ_flags, 0);
+  f1b = VEC_index (int, e1->succ_flags, 1);
+  f2a = VEC_index (int, e2->succ_flags, 0);
+  f2b = VEC_index (int, e2->succ_flags, 1);
+
+  if (f1a == f2a && f1b == f2b)
+    return false;
+
+  return (f1a & mask) == (f2a & mask) && (f1b & mask) == (f2b & mask);
+}
+
+/* Compares SAME_SUCCs VE1 and VE2.  */
+
+static int
+same_succ_equal (const void *ve1, const void *ve2)
+{
+  const_same_succ_t e1 = (const_same_succ_t)ve1;
+  const_same_succ_t e2 = (const_same_succ_t)ve2;
+  unsigned int i, first1, first2;
+  gimple_stmt_iterator gsi1, gsi2;
+  gimple s1, s2;
+
+  if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e1->bbs, EXIT_BLOCK)
+      || bitmap_bit_p (e2->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e2->bbs, EXIT_BLOCK))
+    return 0;
+
+  if (VEC_length (int, e1->succ_flags) != VEC_length (int, e2->succ_flags))
+    return 0;
+
+  if (!bitmap_equal_p (e1->succs, e2->succs))
+    return 0;
+
+  if (!inverse_flags (e1, e2))
+    {
+      for (i = 0; i < VEC_length (int, e1->succ_flags); ++i)
+	if (VEC_index (int, e1->succ_flags, i)
+	    != VEC_index (int, e1->succ_flags, i))
+	  return 0;
+    }
+
+  first1 = bitmap_first_set_bit (e1->bbs);
+  first2 = bitmap_first_set_bit (e2->bbs);
+
+  if (bb_size [first1]
+      != bb_size [first2])
+    return 0;
+
+  gsi1 = gsi_start_nondebug_bb (BASIC_BLOCK (first1));
+  gsi2 = gsi_start_nondebug_bb (BASIC_BLOCK (first1));
+  while (!(gsi_end_p (gsi1) || gsi_end_p (gsi2)))
+    {
+      s1 = gsi_stmt (gsi1);
+      s2 = gsi_stmt (gsi2);
+      if (gimple_code (s1) != gimple_code (s2))
+	return 0;
+      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	return 0;
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+    }
+
+  return 1;
+}
+
+/* Alloc and init a new SAME_SUCC.  */
+
+static same_succ_t
+same_succ_alloc (void)
+{
+  same_succ_t same = XNEW (struct same_succ);
+
+  same->bbs = BITMAP_ALLOC (NULL);
+  same->succs = BITMAP_ALLOC (NULL);
+  same->inverse = BITMAP_ALLOC (NULL);
+  same->succ_flags = VEC_alloc (int, heap, 10);
+  same->in_worklist = false;
+
+  return same;
+}
+
+/* Delete same_succ VE.  */
+
+static void
+same_succ_delete (void *ve)
+{
+  same_succ_t e = (same_succ_t)ve;
+
+  bitmap_clear (e->bbs);
+  bitmap_clear (e->succs);
+  bitmap_clear (e->inverse);
+  VEC_free (int, heap, e->succ_flags);
+
+  XDELETE (ve);
+}
+
+/* Reset same_succ SAME.  */
+
+static void
+same_succ_reset (same_succ_t same)
+{
+  bitmap_clear (same->bbs);
+  bitmap_clear (same->succs);
+  bitmap_clear (same->inverse);
+  VEC_truncate (int, same->succ_flags, 0);
+}
+
+/* Hash table with all same_succ entries.  */
+
+static htab_t same_succ_htab;
+
+/* Array that indicates the same_succ for each bb.  */
+
+static same_succ_t *bb_to_same_succ;
+
+/* Array that is used to store the edge flags for a successor.  */
+
+static int *same_succ_edge_flags;
+
+/* Bitmap that is used to mark bbs that are recently deleted.  */
+
+static bitmap deleted_bbs;
+
+/* Bitmap that is used to mark predecessors of bbs that are
+   deleted.  */
+
+static bitmap deleted_bb_preds;
+
+DEF_VEC_P (same_succ_t);
+DEF_VEC_ALLOC_P (same_succ_t, heap);
+
+/* Prints same_succ_htab to stderr.  */
+
+extern void debug_same_succ (void);
+DEBUG_FUNCTION void
+debug_same_succ ( void)
+{
+  htab_traverse (same_succ_htab, same_succ_print_traverse, stderr);
+}
+
+/* Vector of bbs to process.  */
+
+static VEC (same_succ_t, heap) *worklist;
+
+/* Prints worklist to FILE.  */
+
+static void
+print_worklist (FILE *file)
+{
+  unsigned int i;
+  for (i = 0; i < VEC_length (same_succ_t, worklist); ++i)
+    same_succ_print (file, VEC_index (same_succ_t, worklist, i));
+}
+
+/* Adds SAME to worklist.  */
+
+static void
+add_to_worklist (same_succ_t same)
+{
+  if (same->in_worklist)
+    return;
+
+  if (bitmap_count_bits (same->bbs) < 2)
+    return;
+
+  same->in_worklist = true;
+  VEC_safe_push (same_succ_t, heap, worklist, same);
+}
+
+/* Add BB to same_succ_htab.  */
+
+static void
+find_same_succ_bb (basic_block bb, same_succ_t *same_p)
+{
+  unsigned int j;
+  bitmap_iterator bj;
+  same_succ_t same = *same_p;
+  same_succ_t *slot;
+
+  if (bb == NULL)
+    return;
+  bitmap_set_bit (same->bbs, bb->index);
+  for (j = 0; j < EDGE_COUNT (bb->succs); ++j)
+    {
+      edge e = EDGE_SUCC (bb, j);
+      int index = e->dest->index;
+      bitmap_set_bit (same->succs, index);
+      same_succ_edge_flags[index] = e->flags;
+    }
+  EXECUTE_IF_SET_IN_BITMAP (same->succs, 0, j, bj)
+    VEC_safe_push (int, heap, same->succ_flags, same_succ_edge_flags[j]);
+
+  slot = (same_succ_t *) htab_find_slot (same_succ_htab, same, INSERT);
+  if (*slot == NULL)
+    {
+      *slot = same;
+      bb_to_same_succ[bb->index] = same;
+      add_to_worklist (same);
+      *same_p = NULL;
+    }
+  else
+    {
+      bitmap_set_bit ((*slot)->bbs, bb->index);
+      bb_to_same_succ[bb->index] = *slot;
+      add_to_worklist (*slot);
+      if (inverse_flags (same, *slot))
+	bitmap_set_bit ((*slot)->inverse, bb->index);
+      same_succ_reset (same);
+    }
+}
+
+/* Find bbs with same successors.  */
+
+static void
+find_same_succ (void)
+{
+  int i;
+  same_succ_t same = same_succ_alloc ();
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      find_same_succ_bb (BASIC_BLOCK (i), &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+
+  same_succ_delete (same);
+}
+
+/* Initializes worklist administration.  */
+
+static void
+init_worklist (void)
+{
+  init_bb_size ();
+  same_succ_htab
+    = htab_create (1024, same_succ_hash, same_succ_equal, same_succ_delete);
+  bb_to_same_succ = XCNEWVEC (same_succ_t, last_basic_block);
+  same_succ_edge_flags = XCNEWVEC (int, last_basic_block);
+  deleted_bbs = BITMAP_ALLOC (NULL);
+  deleted_bb_preds = BITMAP_ALLOC (NULL);
+  worklist = VEC_alloc (same_succ_t, heap, last_basic_block);
+  find_same_succ ();
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "initial worklist:\n");
+      print_worklist (dump_file);
+    }
+}
+
+/* Deletes worklist administration.  */
+
+static void
+delete_worklist (void)
+{
+  delete_bb_size ();
+  htab_delete (same_succ_htab);
+  same_succ_htab = NULL;
+  XDELETEVEC (bb_to_same_succ);
+  bb_to_same_succ = NULL;
+  XDELETEVEC (same_succ_edge_flags);
+  same_succ_edge_flags = NULL;
+  BITMAP_FREE (deleted_bbs);
+  BITMAP_FREE (deleted_bb_preds);
+  VEC_free (same_succ_t, heap, worklist);
+}
+
+/* Mark BB as deleted, and mark its predecessors.  */
+
+static void
+delete_basic_block_same_succ (basic_block bb)
+{
+  int pred_i, i = bb->index;
+  unsigned int j;
+  edge e;
+
+  bitmap_set_bit (deleted_bbs, i);
+
+  for (j = 0; j < EDGE_COUNT (bb->preds);  ++j)
+    {
+      e  = EDGE_PRED (bb, j);
+      pred_i = e->src->index;
+      bitmap_set_bit (deleted_bb_preds, pred_i);
+    }
+}
+
+/* For deleted_bb_preds, find bbs with same successors.  */
+
+static void
+update_worklist (void)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+  basic_block bb;
+  same_succ_t same;
+
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bbs, 0, i, bi)
+    {
+      same = bb_to_same_succ[i];
+      bb_to_same_succ[i] = NULL;
+      bitmap_clear_bit (same->bbs, i);
+    }
+
+  same = same_succ_alloc ();
+  bitmap_and_compl_into (deleted_bb_preds, deleted_bbs);
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bb_preds, 0, i, bi)
+    {
+      bb = BASIC_BLOCK (i);
+      gcc_assert (bb != NULL);
+      bb_to_same_succ[i] = NULL;
+      find_same_succ_bb (bb, &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+
+  same_succ_delete (same);
+
+  bitmap_clear (deleted_bbs);
+  bitmap_clear (deleted_bb_preds);
+}
+
+struct bb_cluster
+{
+  /* The bbs in the cluster.  */
+  bitmap bbs;
+  /* The preds of the bbs in the cluster.  */
+  bitmap preds;
+  /* index in all_clusters vector.  */
+  int index;
+};
+typedef struct bb_cluster *bb_cluster_t;
+typedef const struct bb_cluster *const_bb_cluster_t;
+
+/* Prints cluster C to FILE.  */
+
+static void
+print_cluster (FILE *file, bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  bitmap_print (file, c->bbs, "bbs:", "\n");
+  bitmap_print (file, c->preds, "preds:", "\n");
+}
+
+/* Prints cluster C to stderr.  */
+
+extern void debug_cluster (bb_cluster_t);
+DEBUG_FUNCTION void
+debug_cluster (bb_cluster_t c)
+{
+  print_cluster (stderr, c);
+}
+
+/* Returns true if bb1 and bb2 have the same predecessors.  */
+
+static bool
+same_predecessors (basic_block bb1, basic_block bb2)
+{
+  unsigned int i, j;
+  edge ei, ej;
+  unsigned int n1 = EDGE_COUNT (bb1->preds), n2 = EDGE_COUNT (bb2->preds);
+  unsigned int nr_matches = 0;
+
+  if (n1 != n2)
+    return false;
+
+  for (i = 0; i < n1; ++i)
+    {
+      ei = EDGE_PRED (bb1, i);
+      for (j = 0; j < n2; ++j)
+	{
+	  ej = EDGE_PRED (bb2, j);
+	  if (ei->src != ej->src)
+	    continue;
+	  nr_matches++;
+	  break;
+	}
+    }
+
+  return nr_matches == n1;
+}
+
+/* Add BB to cluster C.  Sets BB in C->bbs, and preds of BB in C->preds.  */
+
+static void
+add_bb_to_cluster (bb_cluster_t c, basic_block bb)
+{
+  int index = bb->index;
+  unsigned int i;
+  bitmap_set_bit (c->bbs, index);
+
+  for (i = 0; i < EDGE_COUNT (bb->preds); ++i)
+    bitmap_set_bit (c->preds, EDGE_PRED (bb, i)->src->index);
+}
+
+/* Allocate and init new cluster.  */
+
+static bb_cluster_t
+new_cluster (void)
+{
+  bb_cluster_t c;
+  c = XCNEW (struct bb_cluster);
+  c->bbs = BITMAP_ALLOC (NULL);
+  c->preds = BITMAP_ALLOC (NULL);
+  return c;
+}
+
+/* Delete clusters.  */
+
+static void
+delete_cluster (bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  BITMAP_FREE (c->bbs);
+  BITMAP_FREE (c->preds);
+  XDELETE (c);
+}
+
+/* Array that indicates which bbs have clusters that can be merged.  */
+
+static bb_cluster_t *merge_cluster;
+
+DEF_VEC_P (bb_cluster_t);
+DEF_VEC_ALLOC_P (bb_cluster_t, heap);
+
+/* Array that contains all clusters.  */
+
+static VEC (bb_cluster_t, heap) *all_clusters;
+
+/* Allocate all cluster vectors.  */
+
+static void
+alloc_cluster_vectors (void)
+{
+  merge_cluster = XCNEWVEC (bb_cluster_t, last_basic_block);
+  all_clusters = VEC_alloc (bb_cluster_t, heap, last_basic_block);
+}
+
+/* Reset all cluster vectors.  */
+
+static void
+reset_cluster_vectors (void)
+{
+  unsigned int i;
+  unsigned size = last_basic_block * sizeof (bb_cluster_t);
+  memset (merge_cluster, 0, size);
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_truncate (bb_cluster_t, all_clusters, 0);
+}
+
+/* Delete all cluster vectors.  */
+
+static void
+delete_cluster_vectors (void)
+{
+  unsigned int i;
+  XDELETEVEC (merge_cluster);
+  merge_cluster = NULL;
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_free (bb_cluster_t, heap, all_clusters);
+}
+
+/* Merge cluster C2 into C1.  */
+
+static void
+merge_clusters (bb_cluster_t c1, bb_cluster_t c2)
+{
+  bitmap_ior_into (c1->bbs, c2->bbs);
+  bitmap_ior_into (c1->preds, c2->preds);
+}
+
+/* Register equivalence of BB1 and BB2 (members of cluster C).  Store c in
+   all_clusters, or merge c with existing cluster.  */
+
+static void
+set_cluster (bb_cluster_t c, basic_block bb1, basic_block bb2)
+{
+  int i1 = bb1->index;
+  int i2 = bb2->index;
+  int old_index, other_index;
+  bb_cluster_t old;
+
+  if (merge_cluster[i1] == NULL && merge_cluster[i2] == NULL)
+    {
+      merge_cluster[i1] = c;
+      merge_cluster[i2] = c;
+      c->index = VEC_length (bb_cluster_t, all_clusters);
+      VEC_safe_push (bb_cluster_t, heap, all_clusters, c);
+    }
+  else if (merge_cluster[i1] == NULL || merge_cluster[i2] == NULL)
+    {
+      old_index = merge_cluster[i1] == NULL ? i2 : i1;
+      other_index = merge_cluster[i1] == NULL ? i1 : i2;
+      old = merge_cluster[old_index];
+      merge_clusters (old, c);
+      merge_cluster[other_index] = old;
+      delete_cluster (c);
+    }
+  else if (merge_cluster[i1] != merge_cluster[i2])
+    {
+      unsigned int j;
+      bitmap_iterator bj;
+      delete_cluster (c);
+      old = merge_cluster[i2];
+      merge_clusters (merge_cluster[i1], old);
+      EXECUTE_IF_SET_IN_BITMAP (old->bbs, 0, j, bj)
+	merge_cluster[j] = merge_cluster[i1];
+      VEC_replace (bb_cluster_t, all_clusters, old->index, NULL);
+      delete_cluster (old);
+    }
+  else
+    gcc_unreachable ();
+}
+
+/* Returns true if
+   - the gimple subcodes of S1 and S2 match, or
+   - the gimple subcodes do not matter given the gimple code, or
+   - the gimple subcodes are an inverse comparison and INV_COND
+     is true.  */
+
+static bool
+gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
+{
+  tree var, var_type;
+  bool honor_nans;
+
+  if (is_gimple_assign (s1)
+      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
+    return true;
+
+  if (gimple_code (s1) == GIMPLE_COND && inv_cond)
+    {
+      var = gimple_cond_lhs (s1);
+      var_type = TREE_TYPE (var);
+      honor_nans = HONOR_NANS (TYPE_MODE (var_type));
+
+      if (gimple_expr_code (s1)
+	  == invert_tree_comparison (gimple_expr_code (s2), honor_nans))
+	return true;
+    }
+
+  return s1->gsbase.subcode == s2->gsbase.subcode;
+}
+
+/* Check whether S1 and S2 are equal, considering the fields in
+   gimple_statement_base.  Ignores fields uid, location, bb, and block, and the
+   pass-local flags visited and plf.  */
+
+static bool
+gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)
+{
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  if (gimple_no_warning_p (s1) != gimple_no_warning_p (s2))
+    return false;
+
+  if (is_gimple_assign (s1)
+      && (gimple_assign_nontemporal_move_p (s1)
+          != gimple_assign_nontemporal_move_p (s2)))
+    return false;
+
+  gcc_assert (!gimple_modified_p (s1) && !gimple_modified_p (s2));
+
+  if (gimple_has_volatile_ops (s1) != gimple_has_volatile_ops (s2))
+    return false;
+
+  if (!gimple_subcode_equal_p (s1, s2, inv_cond))
+    return false;
+
+  if (gimple_num_ops (s1) != gimple_num_ops (s2))
+    return false;
+
+  return true;
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  SAME_PREDS indicates
+   whether gimple_bb (s1) and gimple_bb (s2) have the same predecessors.  */
+
+static bool
+gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively.  */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  /* Handle lhs.  */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
+    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+	    && TREE_CODE (lhs2) == SSA_NAME
+	    && gvn_val (lhs1) == gvn_val (lhs2));
+  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
+    return false;
+
+  if (!gimple_base_equal_p (s1, s2, inv_cond))
+    return false;
+
+  gss = gimple_statement_structure (s1);
+  switch (gss)
+    {
+    case GSS_CALL:
+      /* Ignore gimple_call_use_set and gimple_call_clobber_set, and let
+	 TODO_rebuild_alias deal with this.  */
+      if (!gimple_call_same_target_p (s1, s2))
+        return false;
+      /* Falthru.  */
+
+    case GSS_WITH_MEM_OPS_BASE:
+    case GSS_WITH_MEM_OPS:
+      /* Ignore gimple_vdef and gimpe_vuse mismatches, and let
+	 TODO_update_ssa_only_virtuals deal with this.  */
+      /* Falthru.  */
+
+    case GSS_WITH_OPS:
+      /* Ignore gimple_def_ops and gimple_use_ops.  They are duplicates of
+         gimple_vdef, gimple_vuse and gimple_ops, which are checked
+         elsewhere.  */
+      /* Falthru.  */
+
+    case GSS_BASE:
+      break;
+
+    default:
+      return false;
+    }
+
+  /* Handle ops.  */
+  for (i = 0; i < gimple_num_ops (s1); ++i)
+    {
+      tree t1 = gimple_op (s1, i);
+      tree t2 = gimple_op (s2, i);
+
+      if (t1 == NULL_TREE && t2 == NULL_TREE)
+        continue;
+      if (t1 == NULL_TREE || t2 == NULL_TREE)
+        return false;
+      /* Skip lhs.  */
+      if (lhs1 == t1 && i == 0)
+        continue;
+
+      if (operand_equal_p (t1, t2, 0))
+	continue;
+      if (gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+	continue;
+
+      return false;
+    }
+
+  return true;
+}
+
+/* Return true if BB1 and BB2 contain the same non-debug gimple statements.
+   SAME_PREDS indicates whether BB1 and BB2 have the same predecessors.  */
+
+static bool
+bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
+		   bool inv_cond)
+
+{
+  gimple_stmt_iterator gsi1 = gsi_last_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_last_nondebug_bb (bb2);
+  bool end1 = gsi_end_p (gsi1);
+  bool end2 = gsi_end_p (gsi2);
+
+  while (!end1 && !end2)
+    {
+      if (!gimple_equal_p (gsi_stmt (gsi1), gsi_stmt (gsi2),
+			   same_preds, inv_cond))
+	return false;
+
+      gsi_prev_nondebug (&gsi1);
+      gsi_prev_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+
+  return end1 && end2;
+}
+
+/* Return true if BB1 and BB2 (members of cluster C) are duplicates.  SAME_PREDS
+   indicates whether BB1 and BB2 have the same predecessors.  */
+
+static bool
+find_duplicate (bb_cluster_t c, basic_block bb1,
+		basic_block bb2, bool same_preds, bool inv_cond)
+{
+  if (!bb_gimple_equal_p (bb1, bb2, same_preds, inv_cond))
+    return false;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "find_duplicates: <bb %d> duplicate of <bb %d>\n",
+	       bb1->index, bb2->index);
+      print_cluster (dump_file, c);
+    }
+
+  set_cluster (c, bb1, bb2);
+  return true;
+}
+
+/* Returns whether for all phis in DEST the phi alternatives for E1 and
+   E2 are equal.  SAME_PREDS indicates whether BB1 and BB2 have the same
+   predecessors.  */
+
+static bool
+same_phi_alternatives_1 (basic_block dest, edge e1, edge e2, bool same_preds)
+{
+  int n1 = e1->dest_idx, n2 = e2->dest_idx;
+  basic_block bb1 = e1->src, bb2 = e2->src;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree lhs = gimple_phi_result (phi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      if (VOID_TYPE_P (TREE_TYPE (lhs)))
+	continue;
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+        continue;
+      if (gvn_uses_equal (val1, val2, bb1, bb2, same_preds))
+	continue;
+
+      return false;
+    }
+
+  return true;
+}
+
+/* Returns whether for all successors of BB1 and BB2 (members of SAME_SUCC), the
+   phi alternatives for BB1 and BB2 are equal.  SAME_PREDS indicates whether BB1
+   and BB2 have the same predecessors.  */
+
+static bool
+same_phi_alternatives (same_succ_t same_succ, basic_block bb1, basic_block bb2,
+		       bool same_preds)
+{
+  unsigned int s;
+  bitmap_iterator bs;
+  edge e1, e2;
+  basic_block succ;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->succs, 0, s, bs)
+    {
+      succ = BASIC_BLOCK (s);
+      e1 = find_edge (bb1, succ);
+      e2 = find_edge (bb2, succ);
+      if (e1->flags & EDGE_COMPLEX
+	  || e2->flags & EDGE_COMPLEX)
+	return false;
+
+      /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+	 the same value.  */
+      if (!same_phi_alternatives_1 (succ, e1, e2, same_preds))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return true if BB has non-vop phis.  */
+
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));
+}
+
+/* Within SAME_SUCC->bbs, find clusters of bbs which can be merged.  */
+
+static void
+find_clusters_1 (same_succ_t same_succ)
+{
+  bb_cluster_t c;
+  basic_block bb1, bb2;
+  unsigned int i, j;
+  bitmap_iterator bi, bj;
+  bool same_preds, inv_cond;
+  int nr_comparisons;
+  int max_comparisons = PARAM_VALUE (PARAM_TAIL_MERGE_MAX_COMPARISONS);
+
+  if (same_succ == NULL)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, 0, i, bi)
+    {
+      bb1 = BASIC_BLOCK (i);
+
+      /* TODO: handle blocks with phi-nodes.  We'll have find corresponding
+	 phi-nodes in bb1 and bb2, with the same alternatives for the same
+	 preds.  */
+      if (bb_has_non_vop_phi (bb1))
+	continue;
+
+      nr_comparisons = 0;
+      EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, i + 1, j, bj)
+	{
+	  bb2 = BASIC_BLOCK (j);
+
+	  if (bb_has_non_vop_phi (bb2))
+	    continue;
+
+	  if (merge_cluster[bb1->index] != NULL
+	      && merge_cluster[bb1->index] == merge_cluster[bb2->index])
+	    continue;
+
+	  /* Limit quadratic behaviour.  */
+	  nr_comparisons++;
+	  if (nr_comparisons > max_comparisons)
+	    break;
+
+	  c = new_cluster ();
+	  add_bb_to_cluster (c, bb1);
+	  add_bb_to_cluster (c, bb2);
+	  same_preds = same_predecessors (bb1, bb2);
+	  inv_cond = (bitmap_bit_p (same_succ->inverse, bb1->index)
+		      != bitmap_bit_p (same_succ->inverse, bb2->index));
+	  if (!(same_phi_alternatives (same_succ, bb1, bb2,
+				       same_preds)
+		&& find_duplicate (c, bb1, bb2, same_preds,
+				   inv_cond)))
+	    delete_cluster (c);
+        }
+    }
+}
+
+/* Find clusters of bbs which can be merged.  */
+
+static void
+find_clusters (void)
+{
+  same_succ_t same;
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      same = VEC_pop (same_succ_t, worklist);
+      same->in_worklist = false;
+      if (dump_file)
+	{
+	  fprintf (dump_file, "processing worklist entry\n");
+	  same_succ_print (dump_file, same);
+	}
+      find_clusters_1 (same);
+    }
+}
+
+/* Redirect all edges from BB1 to BB2, remove BB1, and insert C->phis into
+   BB2.  */
+
+static void
+replace_block_by (basic_block bb1, basic_block bb2)
+{
+  edge pred_edge;
+  unsigned int i;
+
+  delete_basic_block_same_succ (bb1);
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0 ; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+    }
+
+  /* bb1 has no incoming edges anymore, and has become unreachable.  */
+  delete_basic_block (bb1);
+
+  /* Update dominator info.  */
+  set_immediate_dominator (CDI_DOMINATORS, bb2,
+			   recompute_dominator (CDI_DOMINATORS, bb2));
+}
+
+/* For each cluster in all_clusters, merge all cluster->bbs.  Returns
+   number of bbs removed.  */
+
+static int
+apply_clusters (void)
+{
+  basic_block bb1, bb2;
+  bb_cluster_t c;
+  unsigned int i, j;
+  bitmap_iterator bj;
+  int nr_bbs_removed = 0;
+
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    {
+      c = VEC_index (bb_cluster_t, all_clusters, i);
+      if (c == NULL)
+	continue;
+
+      bb2 = BASIC_BLOCK (bitmap_first_set_bit (c->bbs));
+      gcc_assert (bb2 != NULL);
+
+      EXECUTE_IF_SET_IN_BITMAP (c->bbs, 0, j, bj)
+	{
+	  bb1 = BASIC_BLOCK (j);
+	  gcc_assert (bb1 != NULL);
+	  if (bb1 == bb2)
+	    continue;
+
+	  replace_block_by (bb1, bb2);
+	  nr_bbs_removed++;
+	}
+    }
+
+  return nr_bbs_removed;
+}
+
+/* Resets debug statement STMT if it has uses that are not dominated by their
+   defs.  */
+
+static void
+update_debug_stmt (gimple stmt)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef, bbuse;
+  gimple def_stmt;
+  tree name;
+
+  if (!gimple_debug_bind_p (stmt))
+    return;
+
+  bbuse = gimple_bb (stmt);
+  FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+    {
+      name = USE_FROM_PTR (use_p);
+      gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+      def_stmt = SSA_NAME_DEF_STMT (name);
+      gcc_assert (def_stmt != NULL);
+
+      bbdef = gimple_bb (def_stmt);
+      if (bbdef == NULL || bbuse == bbdef
+	  || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+	continue;
+
+      gimple_debug_bind_reset_value (stmt);
+      update_stmt (stmt);
+    }
+}
+
+/* Resets all debug statements that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (void)
+{
+  int i;
+  basic_block bb;
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);
+
+      if (bb == NULL)
+	continue;
+
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  stmt = gsi_stmt (gsi);
+	  if (!is_gimple_debug (stmt))
+	    continue;
+	  update_debug_stmt (stmt);
+	}
+    }
+}
+
+/* Runs tail merge optimization.  */
+
+static unsigned int
+tail_merge_optimize (void)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+
+  init_worklist ();
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      if (!loop_entered)
+	{
+	  loop_entered = true;
+	  calculate_dominance_info (CDI_DOMINATORS);
+	  init_gvn ();
+	  alloc_cluster_vectors ();
+	}
+      else
+	reset_cluster_vectors ();
+
+      iteration_nr++;
+      if (dump_file)
+	fprintf (dump_file, "worklist iteration #%d\n", iteration_nr);
+
+      find_clusters ();
+      gcc_assert (VEC_empty (same_succ_t, worklist));
+      if (VEC_empty (bb_cluster_t, all_clusters))
+	break;
+
+      nr_bbs_removed = apply_clusters ();
+      nr_bbs_removed_total += nr_bbs_removed;
+      if (nr_bbs_removed == 0)
+	break;
+
+      update_worklist ();
+    }
+
+  if (nr_bbs_removed_total > 0)
+    {
+      update_debug_stmts ();
+
+      /* Mark vops for updating.  Without this, TODO_update_ssa_only_virtuals
+	 won't do anything.  */
+      mark_sym_for_renaming (gimple_vop (cfun));
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Before TODOs.\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+    }
+
+  delete_worklist ();
+  if (loop_entered)
+    {
+      free_dominance_info (CDI_DOMINATORS);
+      delete_gvn ();
+      delete_cluster_vectors ();
+    }
+
+  return 0;
+}
+
+/* Returns true if tail merge pass should be run.  */
+
+static bool
+gate_tail_merge (void)
+{
+  return flag_tree_tail_merge;
+}
+
+struct gimple_opt_pass pass_tail_merge =
+{
+ {
+  GIMPLE_PASS,
+  "tailmerge",                          /* name */
+  gate_tail_merge,                      /* gate */
+  tail_merge_optimize,                  /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  TV_TREE_TAIL_MERGE,                   /* tv_id */
+  PROP_ssa | PROP_cfg,                  /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_verify_ssa | TODO_verify_stmts
+  | TODO_verify_flow | TODO_update_ssa_only_virtuals
+  | TODO_rebuild_alias
+  | TODO_cleanup_cfg | TODO_dump_func   /* todo_flags_finish */
+ }
+};
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 175801)
+++ gcc/tree-pass.h	(working copy)
@@ -447,6 +447,7 @@ extern struct gimple_opt_pass pass_trace
 extern struct gimple_opt_pass pass_warn_unused_result;
 extern struct gimple_opt_pass pass_split_functions;
 extern struct gimple_opt_pass pass_feedback_split_functions;
+extern struct gimple_opt_pass pass_tail_merge;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 175801)
+++ gcc/opts.c	(working copy)
@@ -484,6 +484,7 @@ static const struct default_options defa
     { OPT_LEVELS_2_PLUS, OPT_falign_jumps, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_labels, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 175801)
+++ gcc/timevar.def	(working copy)
@@ -127,6 +127,7 @@ DEFTIMEVAR (TV_TREE_GIMPLIFY	     , "tre
 DEFTIMEVAR (TV_TREE_EH		     , "tree eh")
 DEFTIMEVAR (TV_TREE_CFG		     , "tree CFG construction")
 DEFTIMEVAR (TV_TREE_CLEANUP_CFG	     , "tree CFG cleanup")
+DEFTIMEVAR (TV_TREE_TAIL_MERGE       , "tree tail merge")
 DEFTIMEVAR (TV_TREE_VRP              , "tree VRP")
 DEFTIMEVAR (TV_TREE_COPY_PROP        , "tree copy propagation")
 DEFTIMEVAR (TV_FIND_REFERENCED_VARS  , "tree find ref. vars")
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 175801)
+++ gcc/common.opt	(working copy)
@@ -1937,6 +1937,10 @@ ftree-dominator-opts
 Common Report Var(flag_tree_dom) Optimization
 Enable dominator optimizations
 
+ftree-tail-merge
+Common Report Var(flag_tree_tail_merge) Optimization
+Enable tail merging on trees
+
 ftree-dse
 Common Report Var(flag_tree_dse) Optimization
 Enable dead store elimination
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 175801)
+++ gcc/Makefile.in	(working copy)
@@ -1466,6 +1466,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-sink.o \
 	tree-ssa-structalias.o \
+	tree-ssa-tail-merge.o \
 	tree-ssa-ter.o \
 	tree-ssa-threadedge.o \
 	tree-ssa-threadupdate.o \
@@ -2427,6 +2428,13 @@ stor-layout.o : stor-layout.c $(CONFIG_H
    $(TREE_H) $(PARAMS_H) $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) output.h $(RTL_H) \
    $(GGC_H) $(TM_P_H) $(TARGET_H) langhooks.h $(REGS_H) gt-stor-layout.h \
    $(DIAGNOSTIC_CORE_H) $(CGRAPH_H) $(TREE_INLINE_H) $(TREE_DUMP_H) $(GIMPLE_H)
+tree-ssa-tail-merge.o: tree-ssa-tail-merge.c \
+   $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(BITMAP_H) \
+   $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
+   $(TREE_H) $(TREE_FLOW_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(FUNCTION_H) \
+   $(TREE_PASS_H) $(TIMEVAR_H) tree-ssa-sccvn.h \
+   $(CGRAPH_H) gimple-pretty-print.h tree-pretty-print.h $(PARAMS_H)
 tree-ssa-structalias.o: tree-ssa-structalias.c \
    $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(GGC_H) $(OBSTACK_H) $(BITMAP_H) \
    $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 175801)
+++ gcc/passes.c	(working copy)
@@ -1292,6 +1292,7 @@ init_optimization_passes (void)
       NEXT_PASS (pass_fre);
       NEXT_PASS (pass_copy_prop);
       NEXT_PASS (pass_merge_phi);
+      NEXT_PASS (pass_tail_merge);
       NEXT_PASS (pass_vrp);
       NEXT_PASS (pass_dce);
       NEXT_PASS (pass_cselim);
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 175801)
+++ gcc/params.def	(working copy)
@@ -892,6 +892,11 @@ DEFPARAM (PARAM_MAX_STORES_TO_SINK,
           "Maximum number of conditional store pairs that can be sunk",
           2, 0, 0)
 
+DEFPARAM (PARAM_TAIL_MERGE_MAX_COMPARISONS,
+          "tail-merge-max-comparisons",
+          "Maximum amount of similar bbs to compare bb with",
+          10, 0, 0)
+
 
 /*
 Local variables:

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-07-12 12:21       ` Tom de Vries
@ 2011-07-12 14:37         ` Richard Guenther
  2011-07-18  0:41           ` Tom de Vries
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Guenther @ 2011-07-12 14:37 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Steven Bosscher, gcc-patches

On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries <vries@codesourcery.com> wrote:
> Hi Richard,
>
> here's a new version of the pass. I attempted to address as much as possible
> your comments. The pass was bootstrapped and reg-tested on x86_64.
>
> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
>>> Hi Richard,
>>>
>>> thanks for the review.
>>>
>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>>>> table (%, lower is better).
>>>>>
>>>>>                     none            pic
>>>>>                thumb1  thumb2  thumb1 thumb2
>>>>> spec2000         99.9    99.9    99.8   99.8
>>>>>
>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>>>> optimizations proposed in PR20070 would fix this PR.
>>>>>
>>>>> The problem in this PR is that when compiling with -O2, the example below should
>>>>> only have one call to free. The original problem is formulated in terms of -Os,
>>>>> but currently we generate one call to free with -Os, although still not the
>>>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>>>> original PR.
>>>>>
>>>
>>> Example A. (naming it for reference below)
>>>
>>>>> #include <stdio.h>
>>>>> void foo (char*, FILE*);
>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>> {
>>>>>    char fileName[1000];
>>>>>    FILE *fp;
>>>>>    sprintf(fileName, outputFileName);
>>>>>    if (access(fileName, 1) == 0) {
>>>>>        free(ctx);
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>>    fp = fopen(fileName, 0);
>>>>>    if (fp == 0) {
>>>>>        free(ctx);
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>>    foo(outputFileName, fp);
>>>>>
>>>>>    return ctx;
>>>>> }
>>>>>
>>>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>>>  conditional move to choose between the different input registers.
>>>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>>>  differences
>>>>>
>>>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>>> conditional move would probably be benificial in terms of size, but it's not
>>>>> clear what condition the conditional move should be using. Calculating such a
>>>>> condition would add in size and increase the execution path.
>>>>>
>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>> ...
>>>>>        push    {r4, r5, lr}
>>>>>        mov     r4, r0
>>>>>        sub     sp, sp, #1004
>>>>>        mov     r5, r1
>>>>>        mov     r0, sp
>>>>>        mov     r1, r4
>>>>>        bl      sprintf
>>>>>        mov     r0, sp
>>>>>        movs    r1, #1
>>>>>        bl      access
>>>>>        mov     r3, r0
>>>>>        cbz     r0, .L6
>>>>>        movs    r1, #0
>>>>>        mov     r0, sp
>>>>>        bl      fopen
>>>>>        mov     r1, r0
>>>>>        cbz     r0, .L7
>>>>>        mov     r0, r4
>>>>>        bl      foo
>>>>> .L3:
>>>>>        mov     r0, r5
>>>>>        add     sp, sp, #1004
>>>>>        pop     {r4, r5, pc}
>>>>> .L6:
>>>>>        mov     r0, r5
>>>>>        mov     r5, r3
>>>>>        bl      free
>>>>>        b       .L3
>>>>> .L7:
>>>>>        mov     r0, r5
>>>>>        mov     r5, r1
>>>>>        bl      free
>>>>>        b       .L3
>>>>> ...
>>>>>
>>>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>>>> identical, and removes one of them.
>>>>>
>>>>> The following table shows the impact of the patch on the example in terms of
>>>>> size for -march=armv7-a:
>>>>>
>>>>>          without     with    delta
>>>>> Os      :     108      104       -4
>>>>> O2      :     120      104      -16
>>>>> Os thumb:      68       64       -4
>>>>> O2 thumb:      76       64      -12
>>>>>
>>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>>> replacement of 2 moves by a constant set, which also decreases the execution
>>>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>>>
>>>>>
>>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>>>> differences in comparison. Without this feature, we would only match blocks with
>>>>> resultless operations, due the the ssa-nature of gimples.
>>>>> For example, with this feature, we reduce the following function to its minimum
>>>>> at gimple level, rather than at rtl level.
>>>>>
>>>
>>> Example B. (naming it for reference below)
>>>
>>>>> int f(int c, int b, int d)
>>>>> {
>>>>>  int r, e;
>>>>>
>>>>>  if (c)
>>>>>    r = b + d;
>>>>>  else
>>>>>    {
>>>>>      e = b + d;
>>>>>      r = e;
>>>>>    }
>>>>>
>>>>>  return r;
>>>>> }
>>>>>
>>>>> ;; Function f (f)
>>>>>
>>>>> f (int c, int b, int d)
>>>>> {
>>>>>  int e;
>>>>>
>>>>> <bb 2>:
>>>>>  e_6 = b_3(D) + d_4(D);
>>>>>  return e_6;
>>>>>
>>>>> }
>>>>>
>>>>> I'll send the patch with the testcases in a separate email.
>>>>>
>>>>> OK for trunk?
>>>>
>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>> _way_ too often.
>>>>
>>>
>>> Here is a reworked patch that addresses several concerns, particularly the
>>> compile time overhead.
>>>
>>> Changes:
>>> - The optimization is now in a separate file.
>>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>>  remove the test for pass-local flags.
>>>  New is the pass driver tail_merge_optimize, based on
>>>  tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>> - The pass is run once, on SSA. Before, the patch would
>>>  fix example A only before SSA and example B only on SSA.
>>>  In order to fix example A on SSA, I added these changes:
>>>  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>>  - insert vop phi in bb2, and use that one (update_vuses)
>>>  - complete pt_solutions_equal_p.
>>>
>>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>>>
>>> I placed the pass at the earliest point where it fixes example B: After copy
>>> propagation and dead code elimination, specifically, after the first invocation
>>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?
>>
>> It's probably reasonable to run it after IPA inlining has taken place which
>> means insert it somewhen after the second pass_fre (I'd suggest after
>> pass_merge_phi).
>>
>
> I placed it there, but I ran into some interaction with
> pass_late_warn_uninitialized.  Addition of the pass makes test
> gcc.dg/uninit-pred-2_c.c fail.
>
> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>                               (test for bogus messages, line 43)
> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>                               (test for warnings, line 45)
>
>   int foo_2 (int n, int m, int r)
>   {
>     int flag = 0;
>     int v;
>
>     if (n)
>       {
>         v = r;
>         flag = 1;
>       }
>
>     if (m) g++;
>     else bar ();
>
>     if (flag)
>       blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" }
>     else
>       blah (v); { dg-warning "uninitialized" "real uninitialized var warning" }
>
>     return 0;
>   }
>
> The pass replaces the second call to blah with the first one, and eliminates
> the if.  After that, the uninitialized warning is issued for the line number
> of the first call to blah, while at source level the warning only makes sense
> for the second call to blah.
>
> Shall I try putting the pass after pass_late_warn_uninitialized?

No, simply pass -fno-tree-tail-merge in the testcase.

>> But my general comment still applies - I don't like the structural
>> comparison code at all and you should really use the value-numbering
>> machineries we have
>
> I now use sccvn.

Good.

>> or even better, merge this pass with FRE itself
>> (or DOM if that suits you more).  For FRE you'd want to hook into
>> tree-ssa-pre.c:eliminate().
>>
>
> If we need to do the transformation after pass_late_warn_uninitialized, it needs
> to stay on its own, I suppose.

I suppose part of the high cost of the pass is running SCCVN, so it
makes sense to share that with the existing FRE run.  Any reason
you use VN_NOWALK?

>>>> This also duplicates the literal matching done on the RTL level - instead
>>>> I think this optimization would be more related to value-numbering
>>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>>> jump-threading).
>>>
>>> The pass currently does just duplicate block elimination, not cross-jumping.
>>> If we would like to extend this to cross-jumping, I think we need to do the
>>> reverse of value numbering: walk backwards over the bb, and keep track of the
>>> way values are used rather than defined. This will allows us to make a cut
>>> halfway a basic block.
>>
>> I don't understand - I propose to do literal matching but using value-numbering
>> for tracking equivalences to avoid literal matching for stuff we know is
>> equivalent.  In fact I think it will be mostly calls and stores where we
>> need to do literal matching, but never intermediate computations on
>> registers.
>>
>
> I tried to implement that scheme now.
>
>> But maybe I miss something here.
>>
>>> In general, we cannot do cut halfway a basic block in the current implementation
>>> (of value numbering and forward matching), since we assume equivalence of the
>>> incoming vops at bb entry. This assumption is in general only valid if we indeed
>>> replace the entire block by another entire block.
>>
>> Why are VOPs of concern at all?
>>
>
> In the previous version, I inserted the phis for the vops manually.
> In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
> with vops, so it's not relevant anymore.
>
>>> I imagine that a cross-jumping heuristic would be based on the length of the
>>> match and the amount of non-vop phis it would introduce. Then value numbering
>>> would be something orthogonal to this optimization, which would reduce amount of
>>> phis needed for a cross-jump.
>>> I think it would make sense to use SCCVN value numbering at the point that we
>>> have this backward matching.
>>>
>>> I'm not sure whether it's a good idea to try to replace the current forward
>>> local value numbering with SCCVN value numbering, since we currently declare
>>> vops equal, which are, in the global sense, not equal. And once we go to
>>> backward matching, we'll need something to keep track of the uses, and we can
>>> reuse the current infrastructure for that, but not the SCCVN value numbering.
>>>
>>> Does that make any sense?
>>
>> Ok, let me think about this a bit.
>>
>
> I tried to to be more clear on this in the header comment of the pass.
>
>> For now about the patch in general.  The functions need renaming to
>> something more sensible now that this isn't cfg-cleanup anymore.
>>
>> I miss a general overview of the pass - it's hard to reverse engineer
>> its working for me.
>
> I added a header comment.
>
>> Like (working backwards), you are detecting
>> duplicate predecessors
>> - that obviously doesn't work for duplicates
>> without any successors, like those ending in noreturn calls.
>>
>
> Merging of blocks without successors works now.
>
>> +  n = EDGE_COUNT (bb->preds);
>> +
>> +  for (i = 0; i < n; ++i)
>> +    {
>> +      e1 = EDGE_PRED (bb, i);
>> +      if (e1->flags & EDGE_COMPLEX)
>> +        continue;
>> +      for (j = i + 1; j < n; ++j)
>> +        {
>>
>> that's quadratic in the number of predecessors.
>>
>
> The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
> Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs
> per worklist iteration.
>
>> +          /* Block e1->src might be deleted.  If bb and e1->src are the same
>> +             block, delete e2->src instead, by swapping e1 and e2.  */
>> +          e1_swapped = (bb == e1->src) ? e2: e1;
>> +          e2_swapped = (bb == e1->src) ? e1: e2;
>>
>> is that because you incrementally merge preds two at a time?  As you
>> are deleting blocks don't you need to adjust the quadratic walking?
>> Thus, with say four equivalent preds won't your code crash anyway?
>>
>
> I think it was to make calculation of dominator info easier, but I use now
> functions from dominance.c for that, so this piece of code is gone.
>
>> I think the code needs to delay the CFG manipulation to the end
>> of this function.
>>
>
> I now delay the cfg manipulation till after each analysis phase.
>
>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
>> +   E2 are either:
>> +   - equal, or
>> +   - defined locally in E1->src and E2->src.
>> +   In the latter case, register the alternatives in *PHI_EQUIV.  */
>> +
>> +static bool
>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>> +{
>> +  int n1 = e1->dest_idx;
>> +  int n2 = e2->dest_idx;
>> +  gimple_stmt_iterator gsi;
>> +  basic_block dest = e1->dest;
>> +  gcc_assert (dest == e2->dest);
>>
>> too many asserts in general - I'd say for this case pass in the destination
>> block as argument.
>>
>> +      gcc_assert (val1 != NULL_TREE);
>> +      gcc_assert (val2 != NULL_TREE);
>>
>> superfluous.
>>
>> +static bool
>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>> ...
>> +  VEC (edge,heap) *redirected_edges;
>> +  gcc_assert (bb == e2->dest);
>>
>> same.
>>
>> +  if (e1->flags != e2->flags)
>> +    return false;
>>
>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
>> by swapping edges in the preds.
>>
>
> That's handled now.
>
>> +  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
>> +     have the same successors.  */
>> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>> +    return false;
>>
>> hm, ok - that would need fixing, too.  Same or mergeable successors
>> of course, which makes me wonder if doing this whole transformation
>> incrementally and locally is a good idea ;)   Also
>>
>
> Also handled now.
>
>> +  /* Calculate the changes to be made to the dominator info.
>> +     Calculate bb2_dom.  */
>> ...
>>
>> wouldn't be necessary I suppose (just throw away dom info after the
>> pass).
>>
>> That is, I'd globally record BB equivalences (thus, "value-number"
>> BBs) and apply the CFG manipulations at a single point.
>>
>
> I delay the cfg manipulation till after each analysis phase. Delaying the cfg
> manipulation till the end of the pass instead might make the analysis code more
> convoluted.
>
>> Btw, I miss where you insert PHI nodes for all uses that flow in
>> from the preds preds - you do that for VOPs but not for real
>> operands?
>>
>
> Indeed, inserting phis for non-vops is a todo.
>
>> +  /* Replace uses of vuse2 with uses of the phi.  */
>> +  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>>
>> why not walk immediate uses of the old PHI and SET_USE to
>> the new one instead (for those uses in the duplicate BB of course)?
>>
>
> And I no longer insert VOP phis, but let a TODO handle that, so this code is gone.

Ok.  Somewhat costly in comparison though.

>> +    case GSS_CALL:
>> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
>> +                                gimple_call_use_set (s2))
>>
>> I don't understand why you are concerned about equality of
>> points-to information.  Why not simply ior it (pt_solution_ior_into - note
>> they are shared so you need to unshare them first).
>>
>
> I let a todo handle the alias info now.

Hmm, that's not going to work if it's needed for correctness.

>> +/* Return true if p1 and p2 can be considered equal.  */
>> +
>> +static bool
>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>
>> would go into tree-ssa-structalias.c instead.
>>
>> +static bool
>> +gimple_base_equal_p (gimple s1, gimple s2)
>> +{
>> ...
>> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
>> +    return false;
>>
>> that shouldn't be of concern.
>>
>> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
>> +    return false;
>>
>> for assigns that are of class GIMPLE_SINGLE_RHS we do not
>> update subcode during transformations so it can differ for now
>> equal statements.
>>
>
> handled properly now.
>
>> I'm not sure if a splay tree for the SSA name version equivalency
>> map is the best representation - I would have used a simple
>> array of num_ssa_names size and assign value-numbers
>> (the lesser version for example).
>>
>> Thus equiv_insert would do
>>
>>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>   values[SSA_NAME_VERSION (val1)] = value;
>>   values[SSA_NAME_VERSION (val2)] = value;
>>
>> if the names are not defined in bb1 resp. bb2 we would have to insert
>> a PHI node in the merged block - that would be a cost thingy for
>> doing this value-numbering in a more global way.
>>
>
> local value numbering code has been removed.
>
>> You don't seem to be concerned about the SSA names points-to
>> information, but it surely has the same issues as that of the calls
>> (so either they should be equal or they should be conservatively
>> merged).  But as-is you don't allow any values to flow into the
>> merged blocks that are not equal for both edges, no?
>>
>
> Correct, that's still a todo.
>
>> +  TV_TREE_CFG,                          /* tv_id */
>>
>> add a new timevar.  We wan to be able to turn the pass off,
>> so also add a new option (I can see it might make debugging harder
>> in some cases).
>>
>
> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>
>> Can you assess the effect of the patch on GCC itself (for example
>> when building cc1?)? What's the size benefit and the compile-time
>> overhead?
>>
>
> effect on building cc1:
>
>               real        user        sys
> without: 19m50.158s  19m 2.090s  0m20.860s
> with:    19m59.456s  19m17.170s  0m20.350s
>                     ----------
>                       +15.080s
>                         +1.31%

That's quite a lot of time.

> $ size without/cc1 with/cc1
>    text   data      bss       dec      hex     filename
> 17515986  41320  1364352  18921658  120b8ba  without/cc1
> 17399226  41320  1364352  18804898  11ef0a2     with/cc1
> --------
>  -116760
>  -0.67%
>
> OK for trunk, provided build & reg-testing on ARM is ok?

I miss additions to the testsuite.


+static bool
+bb_dominated_by_p (basic_block bb1, basic_block bb2)

please use

+  if (TREE_CODE (val1) == SSA_NAME)
+    {
+      if (!same_preds
           && !SSA_NAME_IS_DEFAULT_DEF (val1)
           && !dominated_by_p (bb2, gimple_bb (SSA_NAME_DEF_STMT (val1)))
              return false;

instead.  All stmts should have a BB apart from def stmts of default defs
(which are gimple_nops).

+/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
+   Otherwise, return X.  */
+
+static tree
+gvn_val (tree x)
+{
+  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
+         ? VN_INFO ((x))->valnum : x);
+}

I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead
which seems to perform the same.  Do you ever call the above
when scc_vn_ok is false or x is NULL?

+static bool
+gvn_uses_equal (tree val1, tree val2, basic_block bb1,
+               basic_block bb2, bool same_preds)
+{
+  gimple def1, def2;
+  basic_block def1_bb, def2_bb;
+
+  if (val1 == NULL_TREE || val2 == NULL_TREE)
+    return false;

does this ever happen?

+  if (gvn_val (val1) != gvn_val (val2))
+    return false;

I suppose a shortcut

   if (val1 == val2)
     return true;

is possible?

+static int *bb_size;
+
+/* Init bb_size administration.  */
+
+static void
+init_bb_size (void)
+{

if you need more per-BB info you can hook it into bb->aux.  What's the
size used for (I guess I'll see below ...)?

+      for (gsi = gsi_start_nondebug_bb (bb);
+          !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+       size++;

is pretty rough.  I guess for a quick false result for comparing BBs
(which means you could initialize the info lazily?)

+struct same_succ
+{
+  /* The bbs that have the same successor bbs.  */
+  bitmap bbs;
+  /* The successor bbs.  */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
+     bb.  */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs.  */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is in the worklist.  */
+  bool in_worklist;
+};

looks somewhat odd at first sight - maybe a overall comment what this
is used for is missing.  Well, let's see.

+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t)ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  int size = bb_size [first];
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  basic_block bb = BASIC_BLOCK (first);
+
+  hashval = iterative_hash_hashval_t (size, hashval);
+  for (gsi = gsi_start_nondebug_bb (bb);
+          !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (!is_gimple_call (stmt))
+       continue;
+      if (gimple_call_internal_p (stmt))
+       hashval = iterative_hash_hashval_t
+         ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+       hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);

you could also keep a cache of the BB hash as you keep a cache
of the size (if this function is called multiple times per BB).
The hash looks relatively weak - for all asignments it will hash
in GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code.
The call handling OTOH looks overly complicated to me ;)

The hash will be dependent on stmt ordering even if that doesn't matter,
like

  i = i + 1;
  j = j - 1;

vs. the swapped variant.  Similar the successor edges are not sorted,
so true/false edges may be in different order.

Not sure yet if your comparison function would make those BBs
unequal anyway.

+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;
...

I wonder why you keep a VEC of successor edges in same_succ_t
instead of  using the embedded successor edge vector in the basic_block
structure?

+      bb_to_same_succ[bb->index] = *slot;

looks like a candidate for per-BB info in bb->aux, too.

+static void
+find_same_succ (void)
+{
+  int i;
+  same_succ_t same = same_succ_alloc ();
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      find_same_succ_bb (BASIC_BLOCK (i), &same);
+      if (same == NULL)
+       same = same_succ_alloc ();
+    }

I suppose you want FOR_EACH_BB (excluding entry/exit block) or
FOR_ALL_BB (including them).  The above also can
have BASIC_BLOCK(i) == NULL.  Similar in other places.

+  for (i = 0; i < n1; ++i)
+    {
+      ei = EDGE_PRED (bb1, i);
+      for (j = 0; j < n2; ++j)
+       {
+         ej = EDGE_PRED (bb2, j);
+         if (ei->src != ej->src)
+           continue;
+         nr_matches++;
+         break;
+       }
+    }

  FOR_EACH_EDGE (ei, iterator, bb1->preds)
     if (!find_edge (ei->src, bb2))
       return false;

is easier to parse.

+static bool
+gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
+{
+  tree var, var_type;
+  bool honor_nans;
+
+  if (is_gimple_assign (s1)
+      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
+    return true;

the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code
(TREE_CODE of gimple_assign_rhs1 actually).

+static bool
+gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)

I wonder if you still need this given ..

+static bool
+gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively.  */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  /* Handle lhs.  */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
+    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+           && TREE_CODE (lhs2) == SSA_NAME
+           && gvn_val (lhs1) == gvn_val (lhs2));
+  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
+    return false;

all lhs equivalency is defered to GVN (which means all GIMPLE_ASSIGN
and GIMPLE_CALL stmts with a lhs).

That leaves the case of calls without a lhs.  I'd rather structure this
function like

  if (gimple_code (s1) != gimple_code (s2))
    return false;
  swithc (gimple_code (s1))
    {
    case GIMPLE_CALL:
       ... compare arguments ...
       if equal ok, if not and we have a lhs use GVN.

    case GIMPLE_ASSIGN:
       ... compare GVN of the LHS ...

     case GIMPLE_COND:
        ... compare operands ...

     default:
        return false;
    }


+static bool
+bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
+                  bool inv_cond)
+
+{

you don't do an early out by comparing the pre-computed sizes.  Mind
you can have hashtable collisions where they still differ (did you
check hashtable stats on it?  how is the collision rate?)

+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));

return is_gimple_reg (gimple_phi_result (phi));

+static void
+update_debug_stmts (void)
+{
+  int i;
+  basic_block bb;
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);

FOR_EACH_BB

it must be possible to avoid scanning basic-blocks that are not affected
by the transform, no?  In fact the only affected basic-blocks should be
those that were merged with another block?

+      /* Mark vops for updating.  Without this, TODO_update_ssa_only_virtuals
+        won't do anything.  */
+      mark_sym_for_renaming (gimple_vop (cfun));

it won't insert any PHIs, that's correct.  Still somewhat ugly, a manual
update of PHI nodes should be possible.

+      if (dump_file)
+       {
+         fprintf (dump_file, "Before TODOs.\n");

with TDF_DETAILS only please.

+      free_dominance_info (CDI_DOMINATORS);

if you keep dominance info up-to-date there is no need to free it.

+  TODO_verify_ssa | TODO_verify_stmts
+  | TODO_verify_flow | TODO_update_ssa_only_virtuals
+  | TODO_rebuild_alias

please no TODO_rebuild_alias, simply remove it - alias info in merged
paths should be compatible enough if there is value-equivalence between
SSA names.  At least you can't rely on TODO_rebuild_alias for
correctness - it is skipped if IPA PTA was run for example.

+  | TODO_cleanup_cfg

is that needed?  If so return it from your execute function if you changed
anything only.  But I doubt your transformation introduces cleanup
opportunities?

New options and params need documentation in doc/invoke.texi.

Thanks,
Richard.

> Thanks,
> - Tom
>
> 2011-07-12  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43864
>        * tree-ssa-tail-merge.c: New file.
>        (bb_dominated_by_p): New function.
>        (scc_vn_ok): New var.
>        (init_gvn, delete_gvn, gvn_val, gvn_uses_equal): New function.
>        (bb_size): New var.
>        (init_bb_size, delete_bb_size): New function.
>        (struct same_succ): Define.
>        (same_succ_t, const_same_succ_t): New typedef.
>        (same_succ_print, same_succ_print_traverse, same_succ_hash)
>        (inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
>        (same_succ_reset): New function.
>        (same_succ_htab, bb_to_same_succ, same_succ_edge_flags)
>        (bitmap deleted_bbs, deleted_bb_preds): New vars.
>        (debug_same_succ): New function.
>        (worklist): New var.
>        (print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
>        (init_worklist, delete_worklist, delete_basic_block_same_succ)
>        (update_worklist): New function.
>        (struct bb_cluster): Define.
>        (bb_cluster_t, const_bb_cluster_t): New typedef.
>        (print_cluster, debug_cluster, same_predecessors)
>        (add_bb_to_cluster, new_cluster, delete_cluster): New function.
>        (merge_cluster, all_clusters): New var.
>        (alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
>        (merge_clusters, set_cluster): New function.
>        (gimple_subcode_equal_p, gimple_base_equal_p, gimple_equal_p)
>        (bb_gimple_equal_p): New function.
>        (find_duplicate, same_phi_alternatives_1, same_phi_alternatives)
>        (bb_has_non_vop_phi, find_clusters_1, find_clusters): New function.
>        (replace_block_by, apply_clusters): New function.
>        (update_debug_stmt, update_debug_stmt): New function.
>        (tail_merge_optimize, gate_tail_merge): New function.
>        (pass_tail_merge): New gimple pass.
>        * tree-pass.h (pass_tail_merge): Declare new pass.
>        * passes.c (init_optimization_passes): Use new pass.
>        * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
>        (tree-ssa-tail-merge.o): New rule.
>        * opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
>        OPT_LEVELS_2_PLUS.
>        * timevar.def (TV_TREE_TAIL_MERGE): New timevar.
>        * common.opt (ftree-tail-merge): New switches.
>        * params.def (PARAM_TAIL_MERGE_MAX_COMPARISONS): New parameter.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-07-12 14:37         ` Richard Guenther
@ 2011-07-18  0:41           ` Tom de Vries
  2011-07-22 15:54             ` Richard Guenther
  0 siblings, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-07-18  0:41 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 34349 bytes --]

On 07/12/2011 04:07 PM, Richard Guenther wrote:
> On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> here's a new version of the pass. I attempted to address as much as possible
>> your comments. The pass was bootstrapped and reg-tested on x86_64.
>>
>> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
>>>> Hi Richard,
>>>>
>>>> thanks for the review.
>>>>
>>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>>>>> table (%, lower is better).
>>>>>>
>>>>>>                     none            pic
>>>>>>                thumb1  thumb2  thumb1 thumb2
>>>>>> spec2000         99.9    99.9    99.8   99.8
>>>>>>
>>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>>>>> optimizations proposed in PR20070 would fix this PR.
>>>>>>
>>>>>> The problem in this PR is that when compiling with -O2, the example below should
>>>>>> only have one call to free. The original problem is formulated in terms of -Os,
>>>>>> but currently we generate one call to free with -Os, although still not the
>>>>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>>>>> original PR.
>>>>>>
>>>>
>>>> Example A. (naming it for reference below)
>>>>
>>>>>> #include <stdio.h>
>>>>>> void foo (char*, FILE*);
>>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>>> {
>>>>>>    char fileName[1000];
>>>>>>    FILE *fp;
>>>>>>    sprintf(fileName, outputFileName);
>>>>>>    if (access(fileName, 1) == 0) {
>>>>>>        free(ctx);
>>>>>>        return 0;
>>>>>>    }
>>>>>>
>>>>>>    fp = fopen(fileName, 0);
>>>>>>    if (fp == 0) {
>>>>>>        free(ctx);
>>>>>>        return 0;
>>>>>>    }
>>>>>>
>>>>>>    foo(outputFileName, fp);
>>>>>>
>>>>>>    return ctx;
>>>>>> }
>>>>>>
>>>>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>>>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>>>>  conditional move to choose between the different input registers.
>>>>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>>>>  differences
>>>>>>
>>>>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>>>> conditional move would probably be benificial in terms of size, but it's not
>>>>>> clear what condition the conditional move should be using. Calculating such a
>>>>>> condition would add in size and increase the execution path.
>>>>>>
>>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>>> ...
>>>>>>        push    {r4, r5, lr}
>>>>>>        mov     r4, r0
>>>>>>        sub     sp, sp, #1004
>>>>>>        mov     r5, r1
>>>>>>        mov     r0, sp
>>>>>>        mov     r1, r4
>>>>>>        bl      sprintf
>>>>>>        mov     r0, sp
>>>>>>        movs    r1, #1
>>>>>>        bl      access
>>>>>>        mov     r3, r0
>>>>>>        cbz     r0, .L6
>>>>>>        movs    r1, #0
>>>>>>        mov     r0, sp
>>>>>>        bl      fopen
>>>>>>        mov     r1, r0
>>>>>>        cbz     r0, .L7
>>>>>>        mov     r0, r4
>>>>>>        bl      foo
>>>>>> .L3:
>>>>>>        mov     r0, r5
>>>>>>        add     sp, sp, #1004
>>>>>>        pop     {r4, r5, pc}
>>>>>> .L6:
>>>>>>        mov     r0, r5
>>>>>>        mov     r5, r3
>>>>>>        bl      free
>>>>>>        b       .L3
>>>>>> .L7:
>>>>>>        mov     r0, r5
>>>>>>        mov     r5, r1
>>>>>>        bl      free
>>>>>>        b       .L3
>>>>>> ...
>>>>>>
>>>>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>>>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>>>>> identical, and removes one of them.
>>>>>>
>>>>>> The following table shows the impact of the patch on the example in terms of
>>>>>> size for -march=armv7-a:
>>>>>>
>>>>>>          without     with    delta
>>>>>> Os      :     108      104       -4
>>>>>> O2      :     120      104      -16
>>>>>> Os thumb:      68       64       -4
>>>>>> O2 thumb:      76       64      -12
>>>>>>
>>>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>>>> replacement of 2 moves by a constant set, which also decreases the execution
>>>>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>>>>
>>>>>>
>>>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>>>>> differences in comparison. Without this feature, we would only match blocks with
>>>>>> resultless operations, due the the ssa-nature of gimples.
>>>>>> For example, with this feature, we reduce the following function to its minimum
>>>>>> at gimple level, rather than at rtl level.
>>>>>>
>>>>
>>>> Example B. (naming it for reference below)
>>>>
>>>>>> int f(int c, int b, int d)
>>>>>> {
>>>>>>  int r, e;
>>>>>>
>>>>>>  if (c)
>>>>>>    r = b + d;
>>>>>>  else
>>>>>>    {
>>>>>>      e = b + d;
>>>>>>      r = e;
>>>>>>    }
>>>>>>
>>>>>>  return r;
>>>>>> }
>>>>>>
>>>>>> ;; Function f (f)
>>>>>>
>>>>>> f (int c, int b, int d)
>>>>>> {
>>>>>>  int e;
>>>>>>
>>>>>> <bb 2>:
>>>>>>  e_6 = b_3(D) + d_4(D);
>>>>>>  return e_6;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> I'll send the patch with the testcases in a separate email.
>>>>>>
>>>>>> OK for trunk?
>>>>>
>>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>>> _way_ too often.
>>>>>
>>>>
>>>> Here is a reworked patch that addresses several concerns, particularly the
>>>> compile time overhead.
>>>>
>>>> Changes:
>>>> - The optimization is now in a separate file.
>>>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>>>  remove the test for pass-local flags.
>>>>  New is the pass driver tail_merge_optimize, based on
>>>>  tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>>> - The pass is run once, on SSA. Before, the patch would
>>>>  fix example A only before SSA and example B only on SSA.
>>>>  In order to fix example A on SSA, I added these changes:
>>>>  - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>>>  - insert vop phi in bb2, and use that one (update_vuses)
>>>>  - complete pt_solutions_equal_p.
>>>>
>>>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>>>>
>>>> I placed the pass at the earliest point where it fixes example B: After copy
>>>> propagation and dead code elimination, specifically, after the first invocation
>>>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?
>>>
>>> It's probably reasonable to run it after IPA inlining has taken place which
>>> means insert it somewhen after the second pass_fre (I'd suggest after
>>> pass_merge_phi).
>>>
>>
>> I placed it there, but I ran into some interaction with
>> pass_late_warn_uninitialized.  Addition of the pass makes test
>> gcc.dg/uninit-pred-2_c.c fail.
>>
>> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>>                               (test for bogus messages, line 43)
>> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>>                               (test for warnings, line 45)
>>
>>   int foo_2 (int n, int m, int r)
>>   {
>>     int flag = 0;
>>     int v;
>>
>>     if (n)
>>       {
>>         v = r;
>>         flag = 1;
>>       }
>>
>>     if (m) g++;
>>     else bar ();
>>
>>     if (flag)
>>       blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" }
>>     else
>>       blah (v); { dg-warning "uninitialized" "real uninitialized var warning" }
>>
>>     return 0;
>>   }
>>
>> The pass replaces the second call to blah with the first one, and eliminates
>> the if.  After that, the uninitialized warning is issued for the line number
>> of the first call to blah, while at source level the warning only makes sense
>> for the second call to blah.
>>
>> Shall I try putting the pass after pass_late_warn_uninitialized?
> 
> No, simply pass -fno-tree-tail-merge in the testcase.
> 
>>> But my general comment still applies - I don't like the structural
>>> comparison code at all and you should really use the value-numbering
>>> machineries we have
>>
>> I now use sccvn.
> 
> Good.
> 
>>> or even better, merge this pass with FRE itself
>>> (or DOM if that suits you more).  For FRE you'd want to hook into
>>> tree-ssa-pre.c:eliminate().
>>>
>>
>> If we need to do the transformation after pass_late_warn_uninitialized, it needs
>> to stay on its own, I suppose.
> 
> I suppose part of the high cost of the pass is running SCCVN, so it
> makes sense to share that with the existing FRE run.

Done.

> Any reason
> you use VN_NOWALK?
>

No, that was just a first-try value.

>>>>> This also duplicates the literal matching done on the RTL level - instead
>>>>> I think this optimization would be more related to value-numbering
>>>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>>>> jump-threading).
>>>>
>>>> The pass currently does just duplicate block elimination, not cross-jumping.
>>>> If we would like to extend this to cross-jumping, I think we need to do the
>>>> reverse of value numbering: walk backwards over the bb, and keep track of the
>>>> way values are used rather than defined. This will allows us to make a cut
>>>> halfway a basic block.
>>>
>>> I don't understand - I propose to do literal matching but using value-numbering
>>> for tracking equivalences to avoid literal matching for stuff we know is
>>> equivalent.  In fact I think it will be mostly calls and stores where we
>>> need to do literal matching, but never intermediate computations on
>>> registers.
>>>
>>
>> I tried to implement that scheme now.
>>
>>> But maybe I miss something here.
>>>
>>>> In general, we cannot do cut halfway a basic block in the current implementation
>>>> (of value numbering and forward matching), since we assume equivalence of the
>>>> incoming vops at bb entry. This assumption is in general only valid if we indeed
>>>> replace the entire block by another entire block.
>>>
>>> Why are VOPs of concern at all?
>>>
>>
>> In the previous version, I inserted the phis for the vops manually.
>> In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
>> with vops, so it's not relevant anymore.
>>
>>>> I imagine that a cross-jumping heuristic would be based on the length of the
>>>> match and the amount of non-vop phis it would introduce. Then value numbering
>>>> would be something orthogonal to this optimization, which would reduce amount of
>>>> phis needed for a cross-jump.
>>>> I think it would make sense to use SCCVN value numbering at the point that we
>>>> have this backward matching.
>>>>
>>>> I'm not sure whether it's a good idea to try to replace the current forward
>>>> local value numbering with SCCVN value numbering, since we currently declare
>>>> vops equal, which are, in the global sense, not equal. And once we go to
>>>> backward matching, we'll need something to keep track of the uses, and we can
>>>> reuse the current infrastructure for that, but not the SCCVN value numbering.
>>>>
>>>> Does that make any sense?
>>>
>>> Ok, let me think about this a bit.
>>>
>>
>> I tried to to be more clear on this in the header comment of the pass.
>>
>>> For now about the patch in general.  The functions need renaming to
>>> something more sensible now that this isn't cfg-cleanup anymore.
>>>
>>> I miss a general overview of the pass - it's hard to reverse engineer
>>> its working for me.
>>
>> I added a header comment.
>>
>>> Like (working backwards), you are detecting
>>> duplicate predecessors
>>> - that obviously doesn't work for duplicates
>>> without any successors, like those ending in noreturn calls.
>>>
>>
>> Merging of blocks without successors works now.
>>
>>> +  n = EDGE_COUNT (bb->preds);
>>> +
>>> +  for (i = 0; i < n; ++i)
>>> +    {
>>> +      e1 = EDGE_PRED (bb, i);
>>> +      if (e1->flags & EDGE_COMPLEX)
>>> +        continue;
>>> +      for (j = i + 1; j < n; ++j)
>>> +        {
>>>
>>> that's quadratic in the number of predecessors.
>>>
>>
>> The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
>> Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs
>> per worklist iteration.
>>
>>> +          /* Block e1->src might be deleted.  If bb and e1->src are the same
>>> +             block, delete e2->src instead, by swapping e1 and e2.  */
>>> +          e1_swapped = (bb == e1->src) ? e2: e1;
>>> +          e2_swapped = (bb == e1->src) ? e1: e2;
>>>
>>> is that because you incrementally merge preds two at a time?  As you
>>> are deleting blocks don't you need to adjust the quadratic walking?
>>> Thus, with say four equivalent preds won't your code crash anyway?
>>>
>>
>> I think it was to make calculation of dominator info easier, but I use now
>> functions from dominance.c for that, so this piece of code is gone.
>>
>>> I think the code needs to delay the CFG manipulation to the end
>>> of this function.
>>>
>>
>> I now delay the cfg manipulation till after each analysis phase.
>>
>>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
>>> +   E2 are either:
>>> +   - equal, or
>>> +   - defined locally in E1->src and E2->src.
>>> +   In the latter case, register the alternatives in *PHI_EQUIV.  */
>>> +
>>> +static bool
>>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>>> +{
>>> +  int n1 = e1->dest_idx;
>>> +  int n2 = e2->dest_idx;
>>> +  gimple_stmt_iterator gsi;
>>> +  basic_block dest = e1->dest;
>>> +  gcc_assert (dest == e2->dest);
>>>
>>> too many asserts in general - I'd say for this case pass in the destination
>>> block as argument.
>>>
>>> +      gcc_assert (val1 != NULL_TREE);
>>> +      gcc_assert (val2 != NULL_TREE);
>>>
>>> superfluous.
>>>
>>> +static bool
>>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>>> ...
>>> +  VEC (edge,heap) *redirected_edges;
>>> +  gcc_assert (bb == e2->dest);
>>>
>>> same.
>>>
>>> +  if (e1->flags != e2->flags)
>>> +    return false;
>>>
>>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
>>> by swapping edges in the preds.
>>>
>>
>> That's handled now.
>>
>>> +  /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
>>> +     have the same successors.  */
>>> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>>> +    return false;
>>>
>>> hm, ok - that would need fixing, too.  Same or mergeable successors
>>> of course, which makes me wonder if doing this whole transformation
>>> incrementally and locally is a good idea ;)   Also
>>>
>>
>> Also handled now.
>>
>>> +  /* Calculate the changes to be made to the dominator info.
>>> +     Calculate bb2_dom.  */
>>> ...
>>>
>>> wouldn't be necessary I suppose (just throw away dom info after the
>>> pass).
>>>
>>> That is, I'd globally record BB equivalences (thus, "value-number"
>>> BBs) and apply the CFG manipulations at a single point.
>>>
>>
>> I delay the cfg manipulation till after each analysis phase. Delaying the cfg
>> manipulation till the end of the pass instead might make the analysis code more
>> convoluted.
>>
>>> Btw, I miss where you insert PHI nodes for all uses that flow in
>>> from the preds preds - you do that for VOPs but not for real
>>> operands?
>>>
>>
>> Indeed, inserting phis for non-vops is a todo.
>>
>>> +  /* Replace uses of vuse2 with uses of the phi.  */
>>> +  for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>>> +    {
>>>
>>> why not walk immediate uses of the old PHI and SET_USE to
>>> the new one instead (for those uses in the duplicate BB of course)?
>>>
>>
>> And I no longer insert VOP phis, but let a TODO handle that, so this code is gone.
> 
> Ok.  Somewhat costly in comparison though.
> 

I tried to add that back, guarded by update_vops.  Handled in update_vuses,
vop_phi, insn_vops, vop_at_entry, replace_block_by.

>>> +    case GSS_CALL:
>>> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
>>> +                                gimple_call_use_set (s2))
>>>
>>> I don't understand why you are concerned about equality of
>>> points-to information.  Why not simply ior it (pt_solution_ior_into - note
>>> they are shared so you need to unshare them first).
>>>
>>
>> I let a todo handle the alias info now.
> 
> Hmm, that's not going to work if it's needed for correctness.
> 

Should be handed my merge_calls now.

>>> +/* Return true if p1 and p2 can be considered equal.  */
>>> +
>>> +static bool
>>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>>
>>> would go into tree-ssa-structalias.c instead.
>>>
>>> +static bool
>>> +gimple_base_equal_p (gimple s1, gimple s2)
>>> +{
>>> ...
>>> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
>>> +    return false;
>>>
>>> that shouldn't be of concern.
>>>
>>> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
>>> +    return false;
>>>
>>> for assigns that are of class GIMPLE_SINGLE_RHS we do not
>>> update subcode during transformations so it can differ for now
>>> equal statements.
>>>
>>
>> handled properly now.
>>
>>> I'm not sure if a splay tree for the SSA name version equivalency
>>> map is the best representation - I would have used a simple
>>> array of num_ssa_names size and assign value-numbers
>>> (the lesser version for example).
>>>
>>> Thus equiv_insert would do
>>>
>>>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>>   values[SSA_NAME_VERSION (val1)] = value;
>>>   values[SSA_NAME_VERSION (val2)] = value;
>>>
>>> if the names are not defined in bb1 resp. bb2 we would have to insert
>>> a PHI node in the merged block - that would be a cost thingy for
>>> doing this value-numbering in a more global way.
>>>
>>
>> local value numbering code has been removed.
>>
>>> You don't seem to be concerned about the SSA names points-to
>>> information, but it surely has the same issues as that of the calls
>>> (so either they should be equal or they should be conservatively
>>> merged).  But as-is you don't allow any values to flow into the
>>> merged blocks that are not equal for both edges, no?
>>>
>>
>> Correct, that's still a todo.
>>
>>> +  TV_TREE_CFG,                          /* tv_id */
>>>
>>> add a new timevar.  We wan to be able to turn the pass off,
>>> so also add a new option (I can see it might make debugging harder
>>> in some cases).
>>>
>>
>> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>>
>>> Can you assess the effect of the patch on GCC itself (for example
>>> when building cc1?)? What's the size benefit and the compile-time
>>> overhead?
>>>
>>
>> effect on building cc1:
>>
>>               real        user        sys
>> without: 19m50.158s  19m 2.090s  0m20.860s
>> with:    19m59.456s  19m17.170s  0m20.350s
>>                     ----------
>>                       +15.080s
>>                         +1.31%
> 
> That's quite a lot of time.
>

Measurement for this version:

               real        user        sys
without  19m59.995s  19m 9.970s  0m21.050s
with     19m56.160s  19m14.830s  0m21.530s
                     ----------
                         +4.86s
                         +0.42%

    text   data      bss       dec      hex     filename
17547657  41736  1364384  18953777  1213631  without/cc1
17211049  41736  1364384  18617169  11c1351     with/cc1
--------
 -336608
  -1.92%


>> $ size without/cc1 with/cc1
>>    text   data      bss       dec      hex     filename
>> 17515986  41320  1364352  18921658  120b8ba  without/cc1
>> 17399226  41320  1364352  18804898  11ef0a2     with/cc1
>> --------
>>  -116760
>>  -0.67%
>>
>> OK for trunk, provided build & reg-testing on ARM is ok?
> 
> I miss additions to the testsuite.
> 

I will send an updated patch on thread
http://gcc.gnu.org/ml/gcc-patches/2011-06/msg00625.html.

> 
> +static bool
> +bb_dominated_by_p (basic_block bb1, basic_block bb2)
> 
> please use
> 
> +  if (TREE_CODE (val1) == SSA_NAME)
> +    {
> +      if (!same_preds
>            && !SSA_NAME_IS_DEFAULT_DEF (val1)
>            && !dominated_by_p (bb2, gimple_bb (SSA_NAME_DEF_STMT (val1)))
>               return false;
> 
> instead.  All stmts should have a BB apart from def stmts of default defs
> (which are gimple_nops).
> 

Done.

> +/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
> +   Otherwise, return X.  */
> +
> +static tree
> +gvn_val (tree x)
> +{
> +  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
> +         ? VN_INFO ((x))->valnum : x);
> +}
> 
> I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead
> which seems to perform the same.

Done.

> Do you ever call the above
> when scc_vn_ok is false or x is NULL?

Not in this version. Earlier, I also ran the pass if sccvn bailed out, but pre
and fre only run if sccvn succeeded.

> 
> +static bool
> +gvn_uses_equal (tree val1, tree val2, basic_block bb1,
> +               basic_block bb2, bool same_preds)
> +{
> +  gimple def1, def2;
> +  basic_block def1_bb, def2_bb;
> +
> +  if (val1 == NULL_TREE || val2 == NULL_TREE)
> +    return false;
> 
> does this ever happen?

Not in the current version.  Removed.

> 
> +  if (gvn_val (val1) != gvn_val (val2))
> +    return false;
> 
> I suppose a shortcut
> 
>    if (val1 == val2)
>      return true;
> 
> is possible?
> 

Indeed. Added.

> +static int *bb_size;
> +
> +/* Init bb_size administration.  */
> +
> +static void
> +init_bb_size (void)
> +{
> 
> if you need more per-BB info you can hook it into bb->aux.  What's the
> size used for (I guess I'll see below ...)?
> 
> +      for (gsi = gsi_start_nondebug_bb (bb);
> +          !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +       size++;
> 
> is pretty rough.  I guess for a quick false result for comparing BBs
> (which means you could initialize the info lazily?)

Done.

> 
> +struct same_succ
> +{
> +  /* The bbs that have the same successor bbs.  */
> +  bitmap bbs;
> +  /* The successor bbs.  */
> +  bitmap succs;
> +  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
> +     bb.  */
> +  bitmap inverse;
> +  /* The edge flags for each of the successor bbs.  */
> +  VEC (int, heap) *succ_flags;
> +  /* Indicates whether the struct is in the worklist.  */
> +  bool in_worklist;
> +};
> 
> looks somewhat odd at first sight - maybe a overall comment what this
> is used for is missing.  Well, let's see.
> 

Tried to add an overall comment.

> +static hashval_t
> +same_succ_hash (const void *ve)
> +{
> +  const_same_succ_t e = (const_same_succ_t)ve;
> +  hashval_t hashval = bitmap_hash (e->succs);
> +  int flags;
> +  unsigned int i;
> +  unsigned int first = bitmap_first_set_bit (e->bbs);
> +  int size = bb_size [first];
> +  gimple_stmt_iterator gsi;
> +  gimple stmt;
> +  basic_block bb = BASIC_BLOCK (first);
> +
> +  hashval = iterative_hash_hashval_t (size, hashval);
> +  for (gsi = gsi_start_nondebug_bb (bb);
> +          !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> +    {
> +      stmt = gsi_stmt (gsi);
> +      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
> +      if (!is_gimple_call (stmt))
> +       continue;
> +      if (gimple_call_internal_p (stmt))
> +       hashval = iterative_hash_hashval_t
> +         ((hashval_t) gimple_call_internal_fn (stmt), hashval);
> +      else
> +       hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
> 
> you could also keep a cache of the BB hash as you keep a cache
> of the size (if this function is called multiple times per BB).

Right, I forgot it's a closed hash table. Added the cache of the bb hash.

> The hash looks relatively weak - for all asignments it will hash
> in GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code.

Done.

> The call handling OTOH looks overly complicated to me ;)
> 

That's an attempt to handle compiling insn-recog.c efficiently. All bbs without
successors are grouped together, and even after selecting on same function name,
there are still thousands of bbs in that group. I added now the args as well.

> The hash will be dependent on stmt ordering even if that doesn't matter,
> like
> 
>   i = i + 1;
>   j = j - 1;
> 
> vs. the swapped variant. 

Right, that's a todo, added that in the header comment.

> Similar the successor edges are not sorted,
> so true/false edges may be in different order.
> 

I keep a cache of the successor edge flags, in order of bbs.

> Not sure yet if your comparison function would make those BBs
> unequal anyway.
> 
> +static bool
> +inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
> +{
> +  int f1a, f1b, f2a, f2b;
> +  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
> +
> +  if (VEC_length (int, e1->succ_flags) != 2)
> +    return false;
> ...
> 
> I wonder why you keep a VEC of successor edges in same_succ_t
> instead of  using the embedded successor edge vector in the basic_block
> structure?
> 

To keep the information in order of bbs, for quick comparison.

> +      bb_to_same_succ[bb->index] = *slot;
> 
> looks like a candidate for per-BB info in bb->aux, too.
> 

Done.

> +static void
> +find_same_succ (void)
> +{
> +  int i;
> +  same_succ_t same = same_succ_alloc ();
> +
> +  for (i = 0; i < last_basic_block; ++i)
> +    {
> +      find_same_succ_bb (BASIC_BLOCK (i), &same);
> +      if (same == NULL)
> +       same = same_succ_alloc ();
> +    }
> 
> I suppose you want FOR_EACH_BB (excluding entry/exit block) or
> FOR_ALL_BB (including them).  The above also can
> have BASIC_BLOCK(i) == NULL.  Similar in other places.
> 

Done.

> +  for (i = 0; i < n1; ++i)
> +    {
> +      ei = EDGE_PRED (bb1, i);
> +      for (j = 0; j < n2; ++j)
> +       {
> +         ej = EDGE_PRED (bb2, j);
> +         if (ei->src != ej->src)
> +           continue;
> +         nr_matches++;
> +         break;
> +       }
> +    }
> 
>   FOR_EACH_EDGE (ei, iterator, bb1->preds)
>      if (!find_edge (ei->src, bb2))
>        return false;
> 
> is easier to parse.
> 

Done.

> +static bool
> +gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
> +{
> +  tree var, var_type;
> +  bool honor_nans;
> +
> +  if (is_gimple_assign (s1)
> +      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
> +    return true;
> 
> the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code
> (TREE_CODE of gimple_assign_rhs1 actually).
> 
> +static bool
> +gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)
> 
> I wonder if you still need this given ..
> 
> +static bool
> +gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
> +{
> +  unsigned int i;
> +  enum gimple_statement_structure_enum gss;
> +  tree lhs1, lhs2;
> +  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
> +
> +  /* Handle omp gimples conservatively.  */
> +  if (is_gimple_omp (s1) || is_gimple_omp (s2))
> +    return false;
> +
> +  /* Handle lhs.  */
> +  lhs1 = gimple_get_lhs (s1);
> +  lhs2 = gimple_get_lhs (s2);
> +  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
> +    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
> +           && TREE_CODE (lhs2) == SSA_NAME
> +           && gvn_val (lhs1) == gvn_val (lhs2));
> +  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
> +    return false;
> 
> all lhs equivalency is defered to GVN (which means all GIMPLE_ASSIGN
> and GIMPLE_CALL stmts with a lhs).
> 
> That leaves the case of calls without a lhs.  I'd rather structure this
> function like
> 
>   if (gimple_code (s1) != gimple_code (s2))
>     return false;
>   swithc (gimple_code (s1))
>     {
>     case GIMPLE_CALL:
>        ... compare arguments ...
>        if equal ok, if not and we have a lhs use GVN.
> 
>     case GIMPLE_ASSIGN:
>        ... compare GVN of the LHS ...
> 
>      case GIMPLE_COND:
>         ... compare operands ...
> 
>      default:
>         return false;
>     }
> 
> 

Done.

> +static bool
> +bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
> +                  bool inv_cond)
> +
> +{
> 
> you don't do an early out by comparing the pre-computed sizes.

This function is only called for bb with the same size. The hash table equal
fuction does have the size comparison.

> Mind
> you can have hashtable collisions where they still differ (did you
> check hashtable stats on it?  how is the collision rate?)
> 

I managed to lower the collision rate by specifying n_basic_blocks as hash table
size.

While compiling insn-recog.c, the highest collision rate for a function is 2.69.

> +static bool
> +bb_has_non_vop_phi (basic_block bb)
> +{
> +  gimple_seq phis = phi_nodes (bb);
> +  gimple phi;
> +
> +  if (phis == NULL)
> +    return false;
> +
> +  if (!gimple_seq_singleton_p (phis))
> +    return true;
> +
> +  phi = gimple_seq_first_stmt (phis);
> +  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));
> 
> return is_gimple_reg (gimple_phi_result (phi));
> 

Done.

> +static void
> +update_debug_stmts (void)
> +{
> +  int i;
> +  basic_block bb;
> +
> +  for (i = 0; i < last_basic_block; ++i)
> +    {
> +      gimple stmt;
> +      gimple_stmt_iterator gsi;
> +
> +      bb = BASIC_BLOCK (i);
> 
> FOR_EACH_BB
> 
> it must be possible to avoid scanning basic-blocks that are not affected
> by the transform, no?  In fact the only affected basic-blocks should be
> those that were merged with another block?

Done. I also check for MAY_HAVE_DEBUG_STMTS now.

> 
> +      /* Mark vops for updating.  Without this, TODO_update_ssa_only_virtuals
> +        won't do anything.  */
> +      mark_sym_for_renaming (gimple_vop (cfun));
> 
> it won't insert any PHIs, that's correct.  Still somewhat ugly, a manual
> update of PHI nodes should be possible.
> 

Added.  I'm trying to be lazy about it though:

+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+		      || !symbol_marked_for_renaming (gimple_vop (cfun)));

If we're going to insert those phis anyway given the current todo, we don't bother.


> +      if (dump_file)
> +       {
> +         fprintf (dump_file, "Before TODOs.\n");
> 
> with TDF_DETAILS only please.
> 

Done.

> +      free_dominance_info (CDI_DOMINATORS);
> 
> if you keep dominance info up-to-date there is no need to free it.
> 

Indeed. And by not freeing it, the info was checked, and I hit validation errors
in that info, so the updating code had problems.  I reverted back to calculating
when needed and freeing when changed.

> +  TODO_verify_ssa | TODO_verify_stmts
> +  | TODO_verify_flow | TODO_update_ssa_only_virtuals
> +  | TODO_rebuild_alias
> 
> please no TODO_rebuild_alias, simply remove it - alias info in merged
> paths should be compatible enough if there is value-equivalence between
> SSA names.  At least you can't rely on TODO_rebuild_alias for
> correctness - it is skipped if IPA PTA was run for example.
> 

Done.

> +  | TODO_cleanup_cfg
> 
> is that needed?  If so return it from your execute function if you changed
> anything only.  But I doubt your transformation introduces cleanup
> opportunities?
> 

If all the predeccessor blocks of a block are merged, the block and it's
remaining predecessor block might be merged, so that is a cleanup opportunity.
Removed for the moment.

> New options and params need documentation in doc/invoke.texi.
> 

Added.

> Thanks,
> Richard.
> 

Bootstrapped and reg-tested on x86_64.  Ok for trunk (after ARM testing)?

Thanks,
- Tom

2011-07-17  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(struct same_succ): Define.
	(same_succ_t, const_same_succ_t): New typedef.
	(struct bb_cluster): Define.
	(bb_cluster_t, const_bb_cluster_t): New typedef.
	(struct aux_bb_info): Define.
	(BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define.
	(gvn_uses_equal): New function.
	(same_succ_print, same_succ_print_traverse, same_succ_hash)
	(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
	(same_succ_reset): New function.
	(same_succ_htab, same_succ_edge_flags)
	(deleted_bbs, deleted_bb_preds): New var.
	(debug_same_succ): New function.
	(worklist): New var.
	(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
	(init_worklist, delete_worklist, delete_basic_block_same_succ)
	(same_succ_flush_bbs, update_worklist): New function.
	(print_cluster, debug_cluster, same_predecessors)
	(add_bb_to_cluster, new_cluster, delete_cluster): New function.
	(all_clusters): New var.
	(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
	(merge_clusters, set_cluster): New function.
	(gimple_equal_p, find_duplicate, same_phi_alternatives_1)
	(same_phi_alternatives, bb_has_non_vop_phi, find_clusters_1)
	(find_clusters): New function.
	(merge_calls, update_vuses, vop_phi, insn_vops, vop_at_entry)
	(replace_block_by): New function.
	(update_bbs): New var.
	(apply_clusters): New function.
	(update_debug_stmt, update_debug_stmts): New function.
	(tail_merge_optimize): New function.
	tree-flow.h (tail_merge_optimize): Declare.
	* tree-ssa-pre.c (execute_pre): Use tail_merge_optimize.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.
	* opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
	OPT_LEVELS_2_PLUS.
	* tree-ssa-sccvn.c (vn_valueize): Move to ...
	* tree-ssa-sccvn.h (vn_valueize): Here.
	* tree-ssa-alias.h (pt_solution_ior_into_shared): Declare.
	* tree-ssa-structalias.c (find_what_var_points_to): Factor out and
	use ...
	(pt_solution_share): New function.
	(pt_solution_unshare, pt_solution_ior_into_shared): New function.
	(delete_points_to_sets): Nullify shared_bitmap_table after deletion.
	* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
	* common.opt (ftree-tail-merge): New switch.
	* params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter.
	* doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge.
	(-ftree-tail-merge, max-tail-merge-comparisons): New item.

[-- Attachment #2: pr43864.31.patch --]
[-- Type: text/x-patch, Size: 57892 bytes --]

Index: gcc/tree-ssa-tail-merge.c
===================================================================
--- gcc/tree-ssa-tail-merge.c	(revision 0)
+++ gcc/tree-ssa-tail-merge.c	(revision 0)
@@ -0,0 +1,1676 @@
+/* Tail merging for gimple.
+   Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Tom de Vries (tom@codesourcery.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* Pass overview.
+
+
+   MOTIVATIONAL EXAMPLE
+
+   gimple representation of gcc/testsuite/gcc.dg/pr43864.c at
+
+   hprofStartupp (charD.1 * outputFileNameD.2600, charD.1 * ctxD.2601)
+   {
+     struct FILED.1638 * fpD.2605;
+     charD.1 fileNameD.2604[1000];
+     intD.0 D.3915;
+     const charD.1 * restrict outputFileName.0D.3914;
+
+     # BLOCK 2 freq:10000
+     # PRED: ENTRY [100.0%]  (fallthru,exec)
+     # PT = nonlocal { D.3926 } (restr)
+     outputFileName.0D.3914_3
+       = (const charD.1 * restrict) outputFileNameD.2600_2(D);
+     # .MEMD.3923_13 = VDEF <.MEMD.3923_12(D)>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     sprintfD.759 (&fileNameD.2604, outputFileName.0D.3914_3);
+     # .MEMD.3923_14 = VDEF <.MEMD.3923_13>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     D.3915_4 = accessD.2606 (&fileNameD.2604, 1);
+     if (D.3915_4 == 0)
+       goto <bb 3>;
+     else
+       goto <bb 4>;
+     # SUCC: 3 [10.0%]  (true,exec) 4 [90.0%]  (false,exec)
+
+     # BLOCK 3 freq:1000
+     # PRED: 2 [10.0%]  (true,exec)
+     # .MEMD.3923_15 = VDEF <.MEMD.3923_14>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 4 freq:9000
+     # PRED: 2 [90.0%]  (false,exec)
+     # .MEMD.3923_16 = VDEF <.MEMD.3923_14>
+     # PT = nonlocal escaped
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fpD.2605_8 = fopenD.1805 (&fileNameD.2604[0], 0B);
+     if (fpD.2605_8 == 0B)
+       goto <bb 5>;
+     else
+       goto <bb 6>;
+     # SUCC: 5 [1.9%]  (true,exec) 6 [98.1%]  (false,exec)
+
+     # BLOCK 5 freq:173
+     # PRED: 4 [1.9%]  (true,exec)
+     # .MEMD.3923_17 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 6 freq:8827
+     # PRED: 4 [98.1%]  (false,exec)
+     # .MEMD.3923_18 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fooD.2599 (outputFileNameD.2600_2(D), fpD.2605_8);
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 7 freq:10000
+     # PRED: 3 [100.0%]  (fallthru,exec) 5 [100.0%]  (fallthru,exec)
+             6 [100.0%]  (fallthru,exec)
+     # PT = nonlocal null
+
+     # ctxD.2601_1 = PHI <0B(3), 0B(5), ctxD.2601_5(D)(6)>
+     # .MEMD.3923_11 = PHI <.MEMD.3923_15(3), .MEMD.3923_17(5),
+                            .MEMD.3923_18(6)>
+     # VUSE <.MEMD.3923_11>
+     return ctxD.2601_1;
+     # SUCC: EXIT [100.0%]
+   }
+
+   bb 3 and bb 5 can be merged.  The blocks have different predecessors, but the
+   same successors, and the same operations.
+
+
+   CONTEXT
+
+   A technique called tail merging (or cross jumping) can fix the example
+   above.  For a block, we look for common code at the end (the tail) of the
+   predecessor blocks, and insert jumps from one block to the other.
+   The example is a special case for tail merging, in that 2 whole blocks
+   can be merged, rather than just the end parts of it.
+   We currently only focus on whole block merging, so in that sense
+   calling this pass tail merge is a bit of a misnomer.
+
+   We distinguish 2 kinds of situations in which blocks can be merged:
+   - same operations, same predecessors.  The successor edges coming from one
+     block are redirected to come from the other block.
+   - same operations, same successors.  The predecessor edges entering one block
+     are redirected to enter the other block.  Note that this operation might
+     involve introducing phi operations.
+
+   For efficient implementation, we would like to value numbers the blocks, and
+   have a comparison operator that tells us whether the blocks are equal.
+   Besides being runtime efficient, block value numbering should also abstract
+   from irrelevant differences in order of operations, much like normal value
+   numbering abstracts from irrelevant order of operations.
+
+   For the first situation (same_operations, same predecessors), normal value
+   numbering fits well.  We can calculate a block value number based on the
+   value numbers of the defs and vdefs.
+
+   For the second situation (same operations, same successors), this approach
+   doesn't work so well.  We can illustrate this using the example.  The calls
+   to free use different vdefs: MEMD.3923_16 and MEMD.3923_14, and these will
+   remain different in value numbering, since they represent different memory
+   states.  So the resulting vdefs of the frees will be different in value
+   numbering, so the block value numbers will be different.
+
+   The reason why we call the blocks equal is not because they define the same
+   values, but because uses in the blocks use (possibly different) defs in the
+   same way.  To be able to detect this efficiently, we need to do some kind of
+   reverse value numbering, meaning number the uses rather than the defs, and
+   calculate a block value number based on the value number of the uses.
+   Ideally, a block comparison operator will also indicate which phis are needed
+   to merge the blocks.
+
+   For the moment, we don't do block value numbering, but we do insn-by-insn
+   matching, using scc value numbers to match operations with results, and
+   structural comparison otherwise, while ignoring vop mismatches.
+
+
+   IMPLEMENTATION
+
+   1. The pass first determines all groups of blocks with the same successor
+      blocks.
+   2. Within each group, it tries to determine clusters of equal basic blocks.
+   3. The clusters are applied.
+   4. The same successor groups are updated.
+   5. This process is repeated from 2 onwards, until no more changes.
+
+
+   LIMITATIONS/TODO
+
+   - block only
+   - handles only 'same operations, same successors'.
+     It handles same predecessors as a special subcase though.
+   - does not implement the reverse value numbering and block value numbering.
+   - does not abstract from statement order.  In order to do this, we need to
+     abstract from statement order in the hash function, and bb comparison
+     functions.
+   - improve memory allocation: use garbage collected memory, obstacks,
+     allocpools where appropriate.
+   - no insertion of gimple_reg phis,  We only introduce vop-phis.
+   - handle blocks with gimple_reg phi_nodes.
+
+
+   SWITCHES
+
+   - ftree-tail-merge.  On at -O2.  We may have to enable it only at -Os.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tm_p.h"
+#include "basic-block.h"
+#include "output.h"
+#include "flags.h"
+#include "function.h"
+#include "tree-flow.h"
+#include "timevar.h"
+#include "bitmap.h"
+#include "tree-ssa-alias.h"
+#include "params.h"
+#include "tree-pretty-print.h"
+#include "hashtab.h"
+#include "gimple-pretty-print.h"
+#include "tree-ssa-sccvn.h"
+#include "tree-dump.h"
+
+/* Describes a group of bbs with the same successors.  The successor bbs are
+   cached in succs, and the successor edge flags are cached in succ_flags.
+   If a bb has the EDGE_TRUE/VALSE_VALUE flags swapped compared to succ_flags,
+   it's marked in inverse.
+   Additionally, the hash value for the struct is cached in hashval, and
+   in_worklist indicates whether it's currently part of worklist.  */
+
+struct same_succ
+{
+  /* The bbs that have the same successor bbs.  */
+  bitmap bbs;
+  /* The successor bbs.  */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
+     bb.  */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs.  */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is currently in the worklist.  */
+  bool in_worklist;
+  /* The hash value of the struct.  */
+  hashval_t hashval;
+};
+typedef struct same_succ *same_succ_t;
+typedef const struct same_succ *const_same_succ_t;
+
+/* A group of bbs where 1 bb from bbs can replace the other bbs.  */
+
+struct bb_cluster
+{
+  /* The bbs in the cluster.  */
+  bitmap bbs;
+  /* The preds of the bbs in the cluster.  */
+  bitmap preds;
+  /* index in all_clusters vector.  */
+  int index;
+};
+typedef struct bb_cluster *bb_cluster_t;
+typedef const struct bb_cluster *const_bb_cluster_t;
+
+/* Per bb-info.  */
+
+struct aux_bb_info
+{
+  /* The number of non-debug statements in the bb.  */
+  int size;
+  /* The same_succ that this bb is a member of.  */
+  same_succ_t same_succ;
+  /* The cluster that this bb is a member of.  */
+  bb_cluster_t cluster;
+  /* The vop state at the exit of a bb.  This is shortlived data, used to
+     communicate data between update_block_by and update_vuses.  */
+  tree vop_at_exit;
+};
+
+/* Macros to access the fields of struct aux_bb_info.  */
+
+#define BB_SIZE(bb) (((struct aux_bb_info *)bb->aux)->size)
+#define BB_SAME_SUCC(bb) (((struct aux_bb_info *)bb->aux)->same_succ)
+#define BB_CLUSTER(bb) (((struct aux_bb_info *)bb->aux)->cluster)
+#define BB_VOP_AT_EXIT(bb) (((struct aux_bb_info *)bb->aux)->vop_at_exit)
+
+/* VAL1 and VAL2 are either:
+   - uses in BB1 and BB2, or
+   - phi alternatives for BB1 and BB2.
+   SAME_PREDS indicates whether BB1 and BB2 have the same predecessors.
+   Return true if the uses have the same gvn value, and if the corresponding
+   defs can be used in both BB1 and BB2.  */
+
+static bool
+gvn_uses_equal (tree val1, tree val2, basic_block bb1,
+		basic_block bb2, bool same_preds)
+{
+  gcc_checking_assert (val1 != NULL_TREE && val2 != NULL_TREE);
+
+  if (val1 == val2)
+    return true;
+
+  if (vn_valueize (val1) != vn_valueize (val2))
+    return false;
+
+  /* If BB1 and BB2 have the same predecessors, the same values are defined at
+     entry of BB1 and BB2.  Otherwise, we need to check.  */
+
+  if (TREE_CODE (val1) == SSA_NAME)
+    {
+      if (!same_preds
+	  && !SSA_NAME_IS_DEFAULT_DEF (val1)
+	  && !dominated_by_p (CDI_DOMINATORS, bb2,
+			      gimple_bb (SSA_NAME_DEF_STMT (val1))))
+	return false;
+    }
+  else if (!CONSTANT_CLASS_P (val1))
+    return false;
+
+  if (TREE_CODE (val2) == SSA_NAME)
+    {
+      if (!same_preds
+	  && !SSA_NAME_IS_DEFAULT_DEF (val2)
+	  && !dominated_by_p (CDI_DOMINATORS, bb1,
+			      gimple_bb (SSA_NAME_DEF_STMT (val2))))
+	return false;
+    }
+  else if (!CONSTANT_CLASS_P (val2))
+    return false;
+
+  return true;
+}
+
+/* Prints E to FILE.  */
+
+static void
+same_succ_print (FILE *file, const same_succ_t e)
+{
+  unsigned int i;
+  bitmap_print (file, e->bbs, "bbs:", "\n");
+  bitmap_print (file, e->succs, "succs:", "\n");
+  bitmap_print (file, e->inverse, "inverse:", "\n");
+  fprintf (file, "flags:");
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    fprintf (file, " %x", VEC_index (int, e->succ_flags, i));
+  fprintf (file, "\n");
+}
+
+/* Prints same_succ VE to VFILE.  */
+
+static int
+same_succ_print_traverse (void **ve, void *vfile)
+{
+  const same_succ_t e = *((const same_succ_t *)ve);
+  FILE *file = ((FILE*)vfile);
+  same_succ_print (file, e);
+  return 1;
+}
+
+/* Calculates hash value for same_succ VE.  */
+
+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t)ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  basic_block bb = BASIC_BLOCK (first);
+  int size = 0;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  tree arg;
+
+  for (gsi = gsi_start_nondebug_bb (bb);
+	   !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      size++;
+      stmt = gsi_stmt (gsi);
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (is_gimple_assign (stmt))
+	hashval = iterative_hash_hashval_t (gimple_assign_rhs_code (stmt),
+					    hashval);
+      if (!is_gimple_call (stmt))
+	continue;
+      if (gimple_call_internal_p (stmt))
+	hashval = iterative_hash_hashval_t
+	  ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+	hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
+      for (i = 0; i < gimple_call_num_args (stmt); i++)
+	{
+	  arg = gimple_call_arg (stmt, i);
+	  arg = vn_valueize (arg);
+	  hashval = iterative_hash_expr (arg, hashval);
+	}
+    }
+  hashval = iterative_hash_hashval_t (size, hashval);
+  BB_SIZE (bb) = size;
+
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    {
+      flags = VEC_index (int, e->succ_flags, i);
+      flags = flags & ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+      hashval = iterative_hash_hashval_t (flags, hashval);
+    }
+  return hashval;
+}
+
+/* Returns true if E1 and E2 have 2 successors, and if the successor flags
+   are inverse for the EDGE_TRUE_VALUE and EDGE_FALSE_VALUE flags, and equal for
+   the other edge flags.  */
+
+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;
+
+  f1a = VEC_index (int, e1->succ_flags, 0);
+  f1b = VEC_index (int, e1->succ_flags, 1);
+  f2a = VEC_index (int, e2->succ_flags, 0);
+  f2b = VEC_index (int, e2->succ_flags, 1);
+
+  if (f1a == f2a && f1b == f2b)
+    return false;
+
+  return (f1a & mask) == (f2a & mask) && (f1b & mask) == (f2b & mask);
+}
+
+/* Compares SAME_SUCCs VE1 and VE2.  */
+
+static int
+same_succ_equal (const void *ve1, const void *ve2)
+{
+  const_same_succ_t e1 = (const_same_succ_t)ve1;
+  const_same_succ_t e2 = (const_same_succ_t)ve2;
+  unsigned int i, first1, first2;
+  gimple_stmt_iterator gsi1, gsi2;
+  gimple s1, s2;
+  basic_block bb1, bb2;
+
+  if (e1->hashval != e2->hashval)
+    return 0;
+
+  if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e1->bbs, EXIT_BLOCK)
+      || bitmap_bit_p (e2->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e2->bbs, EXIT_BLOCK))
+    return 0;
+
+  if (VEC_length (int, e1->succ_flags) != VEC_length (int, e2->succ_flags))
+    return 0;
+
+  if (!bitmap_equal_p (e1->succs, e2->succs))
+    return 0;
+
+  if (!inverse_flags (e1, e2))
+    {
+      for (i = 0; i < VEC_length (int, e1->succ_flags); ++i)
+	if (VEC_index (int, e1->succ_flags, i)
+	    != VEC_index (int, e1->succ_flags, i))
+	  return 0;
+    }
+
+  first1 = bitmap_first_set_bit (e1->bbs);
+  first2 = bitmap_first_set_bit (e2->bbs);
+
+  bb1 = BASIC_BLOCK (first1);
+  bb2 = BASIC_BLOCK (first2);
+
+  if (BB_SIZE (bb1) != BB_SIZE (bb2))
+    return 0;
+
+  gsi1 = gsi_start_nondebug_bb (bb1);
+  gsi2 = gsi_start_nondebug_bb (bb2);
+  while (!(gsi_end_p (gsi1) || gsi_end_p (gsi2)))
+    {
+      s1 = gsi_stmt (gsi1);
+      s2 = gsi_stmt (gsi2);
+      if (gimple_code (s1) != gimple_code (s2))
+	return 0;
+      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	return 0;
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+    }
+
+  return 1;
+}
+
+/* Alloc and init a new SAME_SUCC.  */
+
+static same_succ_t
+same_succ_alloc (void)
+{
+  same_succ_t same = XNEW (struct same_succ);
+
+  same->bbs = BITMAP_ALLOC (NULL);
+  same->succs = BITMAP_ALLOC (NULL);
+  same->inverse = BITMAP_ALLOC (NULL);
+  same->succ_flags = VEC_alloc (int, heap, 10);
+  same->in_worklist = false;
+
+  return same;
+}
+
+/* Delete same_succ VE.  */
+
+static void
+same_succ_delete (void *ve)
+{
+  same_succ_t e = (same_succ_t)ve;
+
+  BITMAP_FREE (e->bbs);
+  BITMAP_FREE (e->succs);
+  BITMAP_FREE (e->inverse);
+  VEC_free (int, heap, e->succ_flags);
+
+  XDELETE (ve);
+}
+
+/* Reset same_succ SAME.  */
+
+static void
+same_succ_reset (same_succ_t same)
+{
+  bitmap_clear (same->bbs);
+  bitmap_clear (same->succs);
+  bitmap_clear (same->inverse);
+  VEC_truncate (int, same->succ_flags, 0);
+}
+
+/* Hash table with all same_succ entries.  */
+
+static htab_t same_succ_htab;
+
+/* Array that is used to store the edge flags for a successor.  */
+
+static int *same_succ_edge_flags;
+
+/* Bitmap that is used to mark bbs that are recently deleted.  */
+
+static bitmap deleted_bbs;
+
+/* Bitmap that is used to mark predecessors of bbs that are
+   deleted.  */
+
+static bitmap deleted_bb_preds;
+
+/* Prints same_succ_htab to stderr.  */
+
+extern void debug_same_succ (void);
+DEBUG_FUNCTION void
+debug_same_succ ( void)
+{
+  htab_traverse (same_succ_htab, same_succ_print_traverse, stderr);
+}
+
+DEF_VEC_P (same_succ_t);
+DEF_VEC_ALLOC_P (same_succ_t, heap);
+
+/* Vector of bbs to process.  */
+
+static VEC (same_succ_t, heap) *worklist;
+
+/* Prints worklist to FILE.  */
+
+static void
+print_worklist (FILE *file)
+{
+  unsigned int i;
+  for (i = 0; i < VEC_length (same_succ_t, worklist); ++i)
+    same_succ_print (file, VEC_index (same_succ_t, worklist, i));
+}
+
+/* Adds SAME to worklist.  */
+
+static void
+add_to_worklist (same_succ_t same)
+{
+  if (same->in_worklist)
+    return;
+
+  if (bitmap_count_bits (same->bbs) < 2)
+    return;
+
+  same->in_worklist = true;
+  VEC_safe_push (same_succ_t, heap, worklist, same);
+}
+
+/* Add BB to same_succ_htab.  */
+
+static void
+find_same_succ_bb (basic_block bb, same_succ_t *same_p)
+{
+  unsigned int j;
+  bitmap_iterator bj;
+  same_succ_t same = *same_p;
+  same_succ_t *slot;
+  edge_iterator ei;
+  edge e;
+
+  if (bb == NULL)
+    return;
+  bitmap_set_bit (same->bbs, bb->index);
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      int index = e->dest->index;
+      bitmap_set_bit (same->succs, index);
+      same_succ_edge_flags[index] = e->flags;
+    }
+  EXECUTE_IF_SET_IN_BITMAP (same->succs, 0, j, bj)
+    VEC_safe_push (int, heap, same->succ_flags, same_succ_edge_flags[j]);
+
+  same->hashval = same_succ_hash (same);
+
+  slot = (same_succ_t *) htab_find_slot_with_hash (same_succ_htab, same,
+						   same->hashval, INSERT);
+  if (*slot == NULL)
+    {
+      *slot = same;
+      BB_SAME_SUCC (bb) = same;
+      add_to_worklist (same);
+      *same_p = NULL;
+    }
+  else
+    {
+      bitmap_set_bit ((*slot)->bbs, bb->index);
+      BB_SAME_SUCC (bb) = *slot;
+      add_to_worklist (*slot);
+      if (inverse_flags (same, *slot))
+	bitmap_set_bit ((*slot)->inverse, bb->index);
+      same_succ_reset (same);
+    }
+}
+
+/* Find bbs with same successors.  */
+
+static void
+find_same_succ (void)
+{
+  same_succ_t same = same_succ_alloc ();
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    {
+      find_same_succ_bb (bb, &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+
+  same_succ_delete (same);
+}
+
+/* Initializes worklist administration.  */
+
+static void
+init_worklist (void)
+{
+  alloc_aux_for_blocks (sizeof (struct aux_bb_info));
+  same_succ_htab
+    = htab_create (n_basic_blocks, same_succ_hash, same_succ_equal,
+		   same_succ_delete);
+  same_succ_edge_flags = XCNEWVEC (int, last_basic_block);
+  deleted_bbs = BITMAP_ALLOC (NULL);
+  deleted_bb_preds = BITMAP_ALLOC (NULL);
+  worklist = VEC_alloc (same_succ_t, heap, n_basic_blocks);
+  find_same_succ ();
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "initial worklist:\n");
+      print_worklist (dump_file);
+    }
+}
+
+/* Deletes worklist administration.  */
+
+static void
+delete_worklist (void)
+{
+  free_aux_for_blocks ();
+  htab_delete (same_succ_htab);
+  same_succ_htab = NULL;
+  XDELETEVEC (same_succ_edge_flags);
+  same_succ_edge_flags = NULL;
+  BITMAP_FREE (deleted_bbs);
+  BITMAP_FREE (deleted_bb_preds);
+  VEC_free (same_succ_t, heap, worklist);
+}
+
+/* Mark BB as deleted, and mark its predecessors.  */
+
+static void
+delete_basic_block_same_succ (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  bitmap_set_bit (deleted_bbs, bb->index);
+
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    bitmap_set_bit (deleted_bb_preds, e->src->index);
+}
+
+/* Removes all bbs in BBS from their corresponding same_succ.  */
+
+static void
+same_succ_flush_bbs (bitmap bbs)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+
+  EXECUTE_IF_SET_IN_BITMAP (bbs, 0, i, bi)
+    {
+      basic_block bb = BASIC_BLOCK (i);
+      same_succ_t same = BB_SAME_SUCC (bb);
+      BB_SAME_SUCC (bb) = NULL;
+      if (bitmap_single_bit_set_p (same->bbs))
+	htab_remove_elt_with_hash (same_succ_htab, same, same->hashval);
+      else
+	bitmap_clear_bit (same->bbs, i);
+    }
+}
+
+/* For deleted_bb_preds, find bbs with same successors.  */
+
+static void
+update_worklist (void)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+  basic_block bb;
+  same_succ_t same;
+
+  bitmap_and_compl_into (deleted_bb_preds, deleted_bbs);
+  bitmap_clear_bit (deleted_bb_preds, ENTRY_BLOCK);
+  same_succ_flush_bbs (deleted_bbs);
+  same_succ_flush_bbs (deleted_bb_preds);
+
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bbs, 0, i, bi)
+    delete_basic_block (BASIC_BLOCK (i));
+
+  same = same_succ_alloc ();
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bb_preds, 0, i, bi)
+    {
+      bb = BASIC_BLOCK (i);
+      gcc_assert (bb != NULL);
+      find_same_succ_bb (bb, &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+  same_succ_delete (same);
+
+  bitmap_clear (deleted_bbs);
+  bitmap_clear (deleted_bb_preds);
+}
+
+/* Prints cluster C to FILE.  */
+
+static void
+print_cluster (FILE *file, bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  bitmap_print (file, c->bbs, "bbs:", "\n");
+  bitmap_print (file, c->preds, "preds:", "\n");
+}
+
+/* Prints cluster C to stderr.  */
+
+extern void debug_cluster (bb_cluster_t);
+DEBUG_FUNCTION void
+debug_cluster (bb_cluster_t c)
+{
+  print_cluster (stderr, c);
+}
+
+/* Returns true if bb1 and bb2 have the same predecessors.  */
+
+static bool
+same_predecessors (basic_block bb1, basic_block bb2)
+{
+  edge e;
+  edge_iterator ei;
+  unsigned int n1 = EDGE_COUNT (bb1->preds), n2 = EDGE_COUNT (bb2->preds);
+
+  if (n1 != n2)
+    return false;
+
+  FOR_EACH_EDGE (e, ei, bb1->preds)
+    if (!find_edge (e->src, bb2))
+      return false;
+
+  return true;
+}
+
+/* Add BB to cluster C.  Sets BB in C->bbs, and preds of BB in C->preds.  */
+
+static void
+add_bb_to_cluster (bb_cluster_t c, basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  bitmap_set_bit (c->bbs, bb->index);
+
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    bitmap_set_bit (c->preds, e->src->index);
+}
+
+/* Allocate and init new cluster.  */
+
+static bb_cluster_t
+new_cluster (void)
+{
+  bb_cluster_t c;
+  c = XCNEW (struct bb_cluster);
+  c->bbs = BITMAP_ALLOC (NULL);
+  c->preds = BITMAP_ALLOC (NULL);
+  return c;
+}
+
+/* Delete clusters.  */
+
+static void
+delete_cluster (bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  BITMAP_FREE (c->bbs);
+  BITMAP_FREE (c->preds);
+  XDELETE (c);
+}
+
+DEF_VEC_P (bb_cluster_t);
+DEF_VEC_ALLOC_P (bb_cluster_t, heap);
+
+/* Array that contains all clusters.  */
+
+static VEC (bb_cluster_t, heap) *all_clusters;
+
+/* Allocate all cluster vectors.  */
+
+static void
+alloc_cluster_vectors (void)
+{
+  all_clusters = VEC_alloc (bb_cluster_t, heap, n_basic_blocks);
+}
+
+/* Reset all cluster vectors.  */
+
+static void
+reset_cluster_vectors (void)
+{
+  unsigned int i;
+  basic_block bb;
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_truncate (bb_cluster_t, all_clusters, 0);
+  FOR_EACH_BB (bb)
+    BB_CLUSTER (bb) = NULL;
+}
+
+/* Delete all cluster vectors.  */
+
+static void
+delete_cluster_vectors (void)
+{
+  unsigned int i;
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_free (bb_cluster_t, heap, all_clusters);
+}
+
+/* Merge cluster C2 into C1.  */
+
+static void
+merge_clusters (bb_cluster_t c1, bb_cluster_t c2)
+{
+  bitmap_ior_into (c1->bbs, c2->bbs);
+  bitmap_ior_into (c1->preds, c2->preds);
+}
+
+/* Register equivalence of BB1 and BB2 (members of cluster C).  Store c in
+   all_clusters, or merge c with existing cluster.  */
+
+static void
+set_cluster (basic_block bb1, basic_block bb2)
+{
+  basic_block merge_bb, other_bb;
+  bb_cluster_t merge, old, c;
+
+  if (BB_CLUSTER (bb1) == NULL && BB_CLUSTER (bb2) == NULL)
+    {
+      c = new_cluster ();
+      add_bb_to_cluster (c, bb1);
+      add_bb_to_cluster (c, bb2);
+      BB_CLUSTER (bb1) = c;
+      BB_CLUSTER (bb2) = c;
+      c->index = VEC_length (bb_cluster_t, all_clusters);
+      VEC_safe_push (bb_cluster_t, heap, all_clusters, c);
+    }
+  else if (BB_CLUSTER (bb1) == NULL || BB_CLUSTER (bb2) == NULL)
+    {
+      merge_bb = BB_CLUSTER (bb1) == NULL ? bb2 : bb1;
+      other_bb = BB_CLUSTER (bb1) == NULL ? bb1 : bb2;
+      merge = BB_CLUSTER (merge_bb);
+      add_bb_to_cluster (merge, other_bb);
+      BB_CLUSTER (other_bb) = merge;
+    }
+  else if (BB_CLUSTER (bb1) != BB_CLUSTER (bb2))
+    {
+      unsigned int i;
+      bitmap_iterator bi;
+
+      old = BB_CLUSTER (bb2);
+      merge = BB_CLUSTER (bb1);
+      merge_clusters (merge, old);
+      EXECUTE_IF_SET_IN_BITMAP (old->bbs, 0, i, bi)
+	BB_CLUSTER (BASIC_BLOCK (i)) = merge;
+      VEC_replace (bb_cluster_t, all_clusters, old->index, NULL);
+      delete_cluster (old);
+    }
+  else
+    gcc_unreachable ();
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  SAME_PREDS indicates
+   whether gimple_bb (s1) and gimple_bb (s2) (members of SAME_SUCC) have the
+   same predecessors.  */
+
+static bool
+gimple_equal_p (same_succ_t same_succ, gimple s1, gimple s2, bool same_preds)
+{
+  unsigned int i;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+  tree t1, t2;
+  bool equal, inv_cond;
+  enum tree_code code1, code2;
+
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  switch (gimple_code (s1))
+    {
+    case GIMPLE_CALL:
+      if (gimple_call_num_args (s1) != gimple_call_num_args (s2))
+	return false;
+      if (!gimple_call_same_target_p (s1, s2))
+        return false;
+
+      equal = true;
+      for (i = 0; i < gimple_call_num_args (s1); ++i)
+	{
+	  t1 = gimple_call_arg (s1, i);
+	  t2 = gimple_call_arg (s2, i);
+	  if (operand_equal_p (t1, t2, 0))
+	    continue;
+	  if (gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+	    continue;
+	  equal = false;
+	  break;
+	}
+      if (equal)
+	return true;
+
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (lhs1 != NULL_TREE && lhs2 != NULL_TREE && same_preds
+	      && TREE_CODE (lhs1) == SSA_NAME && TREE_CODE (lhs2) == SSA_NAME
+	      && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_ASSIGN:
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+	      && TREE_CODE (lhs2) == SSA_NAME
+	      && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_COND:
+      t1 = gimple_cond_lhs (s1);
+      t2 = gimple_cond_lhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+	  && !gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+	return false;
+
+      t1 = gimple_cond_rhs (s1);
+      t2 = gimple_cond_rhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+	  && !gvn_uses_equal (t1, t2, bb1, bb2, same_preds))
+	return false;
+
+      code1 = gimple_expr_code (s1);
+      code2 = gimple_expr_code (s2);
+      inv_cond = (bitmap_bit_p (same_succ->inverse, bb1->index)
+		  != bitmap_bit_p (same_succ->inverse, bb2->index));
+      if (inv_cond)
+	{
+	  bool honor_nans
+	    = HONOR_NANS (TYPE_MODE (TREE_TYPE (gimple_cond_lhs (s1))));
+	  code2 = invert_tree_comparison (code2, honor_nans);
+	}
+      return code1 == code2;
+
+    default:
+      return false;
+    }
+}
+
+/* Determines whether BB1 and BB2 (members of same_succ) are duplicates.  If so,
+   clusters them.  SAME_PREDS indicates whether BB1 and BB2 have the same
+   predecessors.  */
+
+static void
+find_duplicate (same_succ_t same_succ, basic_block bb1, basic_block bb2,
+		bool same_preds)
+{
+  gimple_stmt_iterator gsi1 = gsi_last_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_last_nondebug_bb (bb2);
+  bool end1 = gsi_end_p (gsi1);
+  bool end2 = gsi_end_p (gsi2);
+
+  while (!end1 && !end2)
+    {
+      if (!gimple_equal_p (same_succ, gsi_stmt (gsi1), gsi_stmt (gsi2),
+			   same_preds))
+	return;
+
+      gsi_prev_nondebug (&gsi1);
+      gsi_prev_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+
+  if (!(end1 && end2))
+    return;
+
+  if (dump_file)
+    fprintf (dump_file, "find_duplicates: <bb %d> duplicate of <bb %d>\n",
+	     bb1->index, bb2->index);
+
+  set_cluster (bb1, bb2);
+}
+
+/* Returns whether for all phis in DEST the phi alternatives for E1 and
+   E2 are equal.  SAME_PREDS indicates whether BB1 and BB2 have the same
+   predecessors.  */
+
+static bool
+same_phi_alternatives_1 (basic_block dest, edge e1, edge e2, bool same_preds)
+{
+  int n1 = e1->dest_idx, n2 = e2->dest_idx;
+  basic_block bb1 = e1->src, bb2 = e2->src;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree lhs = gimple_phi_result (phi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      if (!is_gimple_reg (lhs))
+	continue;
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+        continue;
+      if (gvn_uses_equal (val1, val2, bb1, bb2, same_preds))
+	continue;
+
+      return false;
+    }
+
+  return true;
+}
+
+/* Returns whether for all successors of BB1 and BB2 (members of SAME_SUCC), the
+   phi alternatives for BB1 and BB2 are equal.  SAME_PREDS indicates whether BB1
+   and BB2 have the same predecessors.  */
+
+static bool
+same_phi_alternatives (same_succ_t same_succ, basic_block bb1, basic_block bb2,
+		       bool same_preds)
+{
+  unsigned int s;
+  bitmap_iterator bs;
+  edge e1, e2;
+  basic_block succ;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->succs, 0, s, bs)
+    {
+      succ = BASIC_BLOCK (s);
+      e1 = find_edge (bb1, succ);
+      e2 = find_edge (bb2, succ);
+      if (e1->flags & EDGE_COMPLEX
+	  || e2->flags & EDGE_COMPLEX)
+	return false;
+
+      /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+	 the same value.  */
+      if (!same_phi_alternatives_1 (succ, e1, e2, same_preds))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return true if BB has non-vop phis.  */
+
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return is_gimple_reg (gimple_phi_result (phi));
+}
+
+/* Within SAME_SUCC->bbs, find clusters of bbs which can be merged.  */
+
+static void
+find_clusters_1 (same_succ_t same_succ)
+{
+  basic_block bb1, bb2;
+  unsigned int i, j;
+  bitmap_iterator bi, bj;
+  bool same_preds;
+  int nr_comparisons;
+  int max_comparisons = PARAM_VALUE (PARAM_MAX_TAIL_MERGE_COMPARISONS);
+
+  if (same_succ == NULL)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, 0, i, bi)
+    {
+      bb1 = BASIC_BLOCK (i);
+
+      /* TODO: handle blocks with phi-nodes.  We'll have find corresponding
+	 phi-nodes in bb1 and bb2, with the same alternatives for the same
+	 preds.  */
+      if (bb_has_non_vop_phi (bb1))
+	continue;
+
+      nr_comparisons = 0;
+      EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, i + 1, j, bj)
+	{
+	  bb2 = BASIC_BLOCK (j);
+
+	  if (bb_has_non_vop_phi (bb2))
+	    continue;
+
+	  if (BB_CLUSTER (bb1) != NULL
+	      && BB_CLUSTER (bb1) == BB_CLUSTER (bb2))
+	    continue;
+
+	  /* Limit quadratic behaviour.  */
+	  nr_comparisons++;
+	  if (nr_comparisons > max_comparisons)
+	    break;
+
+	  same_preds = same_predecessors (bb1, bb2);
+
+	  if (!(same_phi_alternatives (same_succ, bb1, bb2, same_preds)))
+	    continue;
+	  find_duplicate (same_succ, bb1, bb2, same_preds);
+        }
+    }
+}
+
+/* Find clusters of bbs which can be merged.  */
+
+static void
+find_clusters (void)
+{
+  same_succ_t same;
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      same = VEC_pop (same_succ_t, worklist);
+      same->in_worklist = false;
+      if (dump_file)
+	{
+	  fprintf (dump_file, "processing worklist entry\n");
+	  same_succ_print (dump_file, same);
+	}
+      find_clusters_1 (same);
+    }
+}
+
+/* Merge the alias info of the calls in BB1 into the calls in BB2.  */
+
+static void
+merge_calls (basic_block bb1, basic_block bb2)
+{
+  gimple_stmt_iterator gsi1 = gsi_start_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_start_nondebug_bb (bb2);
+  bool end1, end2;
+  gimple s1, s2;
+
+  end1 = gsi_end_p (gsi1);
+  end2 = gsi_end_p (gsi2);
+
+  while (true)
+    {
+      if (end1 && end2)
+	return;
+      gcc_assert (!end1 && !end2);
+      s1 = gsi_stmt (gsi1);
+      s2 = gsi_stmt (gsi2);
+
+      if (is_gimple_call (s1) && is_gimple_call (s2))
+	{
+	  pt_solution_ior_into_shared (gimple_call_use_set (s2),
+				       gimple_call_use_set (s1));
+	  pt_solution_ior_into_shared (gimple_call_clobber_set (s2),
+				       gimple_call_clobber_set (s1));
+	}
+      else
+	gcc_assert (!is_gimple_call (s1) && !is_gimple_call (s2));
+
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+      end1 = gsi_end_p (gsi1);
+      end2 = gsi_end_p (gsi2);
+    }
+}
+
+/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
+   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
+   phis is created, use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+              VEC (edge,heap) *redirected_edges)
+{
+  gimple stmt, phi = NULL;
+  tree lhs, arg, current_arg;
+  unsigned int i;
+  gimple def_stmt2;
+  source_location locus1, locus2;
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  edge_iterator ei;
+  edge e;
+
+  if (vuse2 == NULL_TREE)
+    return;
+
+  def_stmt2 = SSA_NAME_DEF_STMT (vuse2);
+
+  /* Update existing phi.  */
+  if (gimple_bb (def_stmt2) == bb2)
+    {
+      phi = def_stmt2;
+
+      for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+	{
+	  e = VEC_index (edge, redirected_edges, i);
+	  if (vuse1)
+	    arg = vuse1;
+	  else
+	    arg = BB_VOP_AT_EXIT (e->src);
+	  current_arg = PHI_ARG_DEF_FROM_EDGE (phi, e);
+	  if (current_arg == NULL)
+	    {
+	      locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+	      add_phi_arg (phi, arg, e, locus1);
+	    }
+	  else
+	    gcc_assert (arg == current_arg);
+	}
+      return;
+    }
+
+  /* No need to create a phi with 2 equal arguments.  */
+  if (vuse1 == vuse2)
+    return;
+
+  locus2 = gimple_location (def_stmt2);
+
+  /* Create a phi, first with default argument vuse2 for all preds.  */
+  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+  VN_INFO_GET (lhs);
+  phi = create_phi_node (lhs, bb2);
+  SSA_NAME_DEF_STMT (lhs) = phi;
+  FOR_EACH_EDGE (e, ei, bb2->preds)
+    add_phi_arg (phi, vuse2, e, locus2);
+
+  /* Now overwrite the arguments associated with the redirected edges with
+     vuse1.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+    {
+      e = VEC_index (edge, redirected_edges, i);
+      gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));
+      if (vuse1)
+	arg = vuse1;
+      else
+	arg = BB_VOP_AT_EXIT (e->src);
+      SET_PHI_ARG_DEF (phi, e->dest_idx, arg);
+      locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+      gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
+    }
+
+  /* Replace uses of vuse2 in bb2 with phi.  */
+  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
+    {
+      if (gimple_code (stmt) == GIMPLE_PHI)
+	{
+	  edge e;
+	  if (stmt == phi)
+	    continue;
+	  e = find_edge (bb2, gimple_bb (stmt));
+	  if (e == NULL)
+	    continue;
+	  use_p = PHI_ARG_DEF_PTR_FROM_EDGE (stmt, e);
+	  SET_USE (use_p, lhs);
+	  update_stmt (stmt);
+	}
+      else if (gimple_bb (stmt) == bb2)
+	{
+	  FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	    SET_USE (use_p, lhs);
+	  update_stmt (stmt);
+	}
+    }
+}
+
+/* Returns the vop phi of BB, if any.  */
+
+static gimple
+vop_phi (basic_block bb)
+{
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      if (is_gimple_reg (gimple_phi_result (stmt)))
+	continue;
+      return stmt;
+    }
+  return NULL;
+}
+
+/* Scans the vdefs and vuses of the insn of BB, and returns the vop at entry in
+   VOP_AT_ENTRY, and the vop at exit in VOP_AT_EXIT.  */
+
+static void
+insn_vops (basic_block bb, tree *vop_at_entry, tree *vop_at_exit)
+{
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  tree vuse, vdef;
+  tree last_vdef = NULL_TREE;
+
+  if (*vop_at_entry != NULL_TREE && *vop_at_exit != NULL_TREE)
+    return;
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      vuse = gimple_vuse (stmt);
+      vdef = gimple_vdef (stmt);
+      if (vuse != NULL_TREE && *vop_at_entry == NULL_TREE)
+	{
+	  *vop_at_entry = vuse;
+	  if (*vop_at_exit != NULL_TREE)
+	    return;
+	}
+      if (vdef != NULL_TREE)
+	last_vdef = vdef;
+    }
+
+  *vop_at_exit = last_vdef != NULL_TREE ? last_vdef : *vop_at_entry;
+}
+
+/* Returns the vop at entry of BB1 in VOP_AT_ENTRY1, and the one of bb2 in
+   VOP_AT_ENTRY2, where BB1 and BB2 have the same successors.  */
+
+static void
+vop_at_entry (basic_block bb1, basic_block bb2, tree *vop_at_entry1,
+	      tree *vop_at_entry2)
+{
+  gimple succ_phi, bb1_phi, bb2_phi;
+  basic_block succ;
+  tree vop_at_exit1 = NULL_TREE, vop_at_exit2 = NULL_TREE;
+  bool same_at_exit;
+
+  bb1_phi = vop_phi (bb1);
+  bb2_phi = vop_phi (bb2);
+
+  *vop_at_entry1 = bb1_phi != NULL ? gimple_phi_result (bb1_phi) : NULL_TREE;
+  *vop_at_entry2 = bb2_phi != NULL ? gimple_phi_result (bb2_phi) : NULL_TREE;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  if (EDGE_COUNT (bb1->succs) != 0)
+    {
+      succ = EDGE_SUCC (bb1, 0)->dest;
+      succ_phi = vop_phi (succ);
+      if (succ_phi != NULL)
+	{
+	  vop_at_exit1
+	    = PHI_ARG_DEF_FROM_EDGE (succ_phi, find_edge (bb1, succ));
+	  vop_at_exit2
+	    = PHI_ARG_DEF_FROM_EDGE (succ_phi, find_edge (bb2, succ));
+	}
+    }
+
+  same_at_exit = vop_at_exit1 == vop_at_exit2;
+
+  if (*vop_at_entry1 == NULL_TREE && vop_at_exit1 != NULL_TREE
+      && gimple_bb (SSA_NAME_DEF_STMT (vop_at_exit1)) != bb1)
+    *vop_at_entry1 = vop_at_exit1;
+
+  if (*vop_at_entry2 == NULL_TREE && vop_at_exit2 != NULL_TREE
+      && gimple_bb (SSA_NAME_DEF_STMT (vop_at_exit2)) != bb2)
+    *vop_at_entry2 = vop_at_exit2;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  insn_vops (bb1, vop_at_entry1, &vop_at_exit1);
+  insn_vops (bb2, vop_at_entry2, &vop_at_exit2);
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  if (same_at_exit && vop_at_exit1 != NULL_TREE
+      && *vop_at_entry2 == NULL_TREE
+      && dominated_by_p (CDI_DOMINATORS, bb2, bb1))
+    *vop_at_entry2 = vop_at_exit1;
+
+  if (same_at_exit && vop_at_exit2 != NULL_TREE
+      && *vop_at_entry1 == NULL_TREE
+      && dominated_by_p (CDI_DOMINATORS, bb1, bb2))
+    *vop_at_entry1 = vop_at_exit2;
+
+  if (*vop_at_entry1 != NULL_TREE && *vop_at_entry2 != NULL_TREE)
+    return;
+
+  gcc_assert (*vop_at_entry1 == NULL_TREE && *vop_at_entry2 == NULL_TREE);
+}
+
+/* Redirect all edges from BB1 to BB2, marks BB1 for removal, and if
+   UPDATE_VOPS, inserts vop phis.  */
+
+static void
+replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
+{
+  edge pred_edge;
+  unsigned int i;
+  tree phi_vuse1, phi_vuse2, arg;
+  VEC (edge,heap) *redirected_edges = NULL;
+  edge e;
+  edge_iterator ei;
+
+  if (update_vops)
+    {
+      vop_at_entry (bb1, bb2, &phi_vuse1, &phi_vuse2);
+
+      if (phi_vuse1 != NULL_TREE
+	  && gimple_bb (SSA_NAME_DEF_STMT (phi_vuse1)) == bb1)
+	{
+	  FOR_EACH_EDGE (e, ei, bb1->preds)
+	    {
+	      arg = PHI_ARG_DEF_FROM_EDGE (SSA_NAME_DEF_STMT (phi_vuse1), e);
+	      BB_VOP_AT_EXIT (e->src) = arg;
+	    }
+	  phi_vuse1 = NULL;
+	}
+      redirected_edges = VEC_alloc (edge, heap, 10);
+    }
+
+  delete_basic_block_same_succ (bb1);
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0 ; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+      if (update_vops)
+	VEC_safe_push (edge, heap, redirected_edges, pred_edge);
+    }
+
+  if (update_vops)
+    {
+      update_vuses (phi_vuse1, phi_vuse2, bb2, redirected_edges);
+      VEC_free (edge, heap, redirected_edges);
+    }
+
+  merge_calls (bb1, bb2);
+}
+
+/* Bbs for which update_debug_stmt need to be called.  */
+
+static bitmap update_bbs;
+
+/* For each cluster in all_clusters, merge all cluster->bbs.  Returns
+   number of bbs removed.  Insert vop phis if UPDATE_VOPS.  */
+
+static int
+apply_clusters (bool update_vops)
+{
+  basic_block bb1, bb2;
+  bb_cluster_t c;
+  unsigned int i, j;
+  bitmap_iterator bj;
+  int nr_bbs_removed = 0;
+
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    {
+      c = VEC_index (bb_cluster_t, all_clusters, i);
+      if (c == NULL)
+	continue;
+
+      bb2 = BASIC_BLOCK (bitmap_first_set_bit (c->bbs));
+      gcc_assert (bb2 != NULL);
+
+      bitmap_set_bit (update_bbs, bb2->index);
+      EXECUTE_IF_SET_IN_BITMAP (c->bbs, 0, j, bj)
+	{
+	  bb1 = BASIC_BLOCK (j);
+	  gcc_assert (bb1 != NULL);
+	  if (bb1 == bb2)
+	    continue;
+
+	  bitmap_clear_bit (update_bbs, bb1->index);
+	  replace_block_by (bb1, bb2, update_vops);
+	  nr_bbs_removed++;
+	}
+    }
+
+  return nr_bbs_removed;
+}
+
+/* Resets debug statement STMT if it has uses that are not dominated by their
+   defs.  */
+
+static void
+update_debug_stmt (gimple stmt)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef, bbuse;
+  gimple def_stmt;
+  tree name;
+
+  if (!gimple_debug_bind_p (stmt))
+    return;
+
+  bbuse = gimple_bb (stmt);
+  FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+    {
+      name = USE_FROM_PTR (use_p);
+      gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+      def_stmt = SSA_NAME_DEF_STMT (name);
+      gcc_assert (def_stmt != NULL);
+
+      bbdef = gimple_bb (def_stmt);
+      if (bbdef == NULL || bbuse == bbdef
+	  || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+	continue;
+
+      gimple_debug_bind_reset_value (stmt);
+      update_stmt (stmt);
+    }
+}
+
+/* Resets all debug statements that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (void)
+{
+  basic_block bb;
+  bitmap_iterator bi;
+  unsigned int i;
+
+  if (!MAY_HAVE_DEBUG_STMTS)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (update_bbs, 0, i, bi)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  stmt = gsi_stmt (gsi);
+	  if (!is_gimple_debug (stmt))
+	    continue;
+	  update_debug_stmt (stmt);
+	}
+    }
+}
+
+/* Runs tail merge optimization.  */
+
+unsigned int
+tail_merge_optimize (unsigned int todo)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+		      || !symbol_marked_for_renaming (gimple_vop (cfun)));
+
+  if (!flag_tree_tail_merge)
+    return 0;
+
+  timevar_push (TV_TREE_TAIL_MERGE);
+
+  init_worklist ();
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      if (!loop_entered)
+	{
+	  loop_entered = true;
+	  alloc_cluster_vectors ();
+	  update_bbs = BITMAP_ALLOC (NULL);
+	}
+      else
+	reset_cluster_vectors ();
+
+      iteration_nr++;
+      if (dump_file)
+	fprintf (dump_file, "worklist iteration #%d\n", iteration_nr);
+
+      calculate_dominance_info (CDI_DOMINATORS);
+      find_clusters ();
+      gcc_assert (VEC_empty (same_succ_t, worklist));
+      if (VEC_empty (bb_cluster_t, all_clusters))
+	break;
+
+      nr_bbs_removed = apply_clusters (update_vops);
+      nr_bbs_removed_total += nr_bbs_removed;
+      if (nr_bbs_removed == 0)
+	break;
+
+      free_dominance_info (CDI_DOMINATORS);
+      update_worklist ();
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "htab collision / search: %f\n",
+	     htab_collisions (same_succ_htab));
+
+  if (nr_bbs_removed_total > 0)
+    {
+      calculate_dominance_info (CDI_DOMINATORS);
+      update_debug_stmts ();
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Before TODOs.\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      todo |= (TODO_verify_ssa | TODO_verify_stmts | TODO_verify_flow
+	       | TODO_dump_func);
+    }
+
+  delete_worklist ();
+  if (loop_entered)
+    {
+      delete_cluster_vectors ();
+      BITMAP_FREE (update_bbs);
+    }
+
+  timevar_pop (TV_TREE_TAIL_MERGE);
+
+  return todo;
+}
Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 175801)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -2872,19 +2872,6 @@ simplify_unary_expression (gimple stmt)
   return NULL_TREE;
 }
 
-/* Valueize NAME if it is an SSA name, otherwise just return it.  */
-
-static inline tree
-vn_valueize (tree name)
-{
-  if (TREE_CODE (name) == SSA_NAME)
-    {
-      tree tem = SSA_VAL (name);
-      return tem == VN_TOP ? name : tem;
-    }
-  return name;
-}
-
 /* Try to simplify RHS using equivalences and constant folding.  */
 
 static tree
Index: gcc/tree-ssa-sccvn.h
===================================================================
--- gcc/tree-ssa-sccvn.h	(revision 175801)
+++ gcc/tree-ssa-sccvn.h	(working copy)
@@ -209,4 +209,18 @@ unsigned int get_constant_value_id (tree
 unsigned int get_or_alloc_constant_value_id (tree);
 bool value_id_constant_p (unsigned int);
 tree fully_constant_vn_reference_p (vn_reference_t);
+
+/* Valueize NAME if it is an SSA name, otherwise just return it.  */
+
+static inline tree
+vn_valueize (tree name)
+{
+  if (TREE_CODE (name) == SSA_NAME)
+    {
+      tree tem = VN_INFO (name)->valnum;
+      return tem == VN_TOP ? name : tem;
+    }
+  return name;
+}
+
 #endif /* TREE_SSA_SCCVN_H  */
Index: gcc/tree-ssa-alias.h
===================================================================
--- gcc/tree-ssa-alias.h	(revision 175801)
+++ gcc/tree-ssa-alias.h	(working copy)
@@ -134,6 +134,8 @@ extern bool pt_solutions_same_restrict_b
 extern void pt_solution_reset (struct pt_solution *);
 extern void pt_solution_set (struct pt_solution *, bitmap, bool, bool);
 extern void pt_solution_set_var (struct pt_solution *, tree);
+extern void pt_solution_ior_into_shared (struct pt_solution *,
+					 struct pt_solution *);
 
 extern void dump_pta_stats (FILE *);
 
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 175801)
+++ gcc/opts.c	(working copy)
@@ -484,6 +484,7 @@ static const struct default_options defa
     { OPT_LEVELS_2_PLUS, OPT_falign_jumps, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_labels, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 175801)
+++ gcc/timevar.def	(working copy)
@@ -127,6 +127,7 @@ DEFTIMEVAR (TV_TREE_GIMPLIFY	     , "tre
 DEFTIMEVAR (TV_TREE_EH		     , "tree eh")
 DEFTIMEVAR (TV_TREE_CFG		     , "tree CFG construction")
 DEFTIMEVAR (TV_TREE_CLEANUP_CFG	     , "tree CFG cleanup")
+DEFTIMEVAR (TV_TREE_TAIL_MERGE       , "tree tail merge")
 DEFTIMEVAR (TV_TREE_VRP              , "tree VRP")
 DEFTIMEVAR (TV_TREE_COPY_PROP        , "tree copy propagation")
 DEFTIMEVAR (TV_FIND_REFERENCED_VARS  , "tree find ref. vars")
Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c	(revision 175801)
+++ gcc/tree-ssa-pre.c	(working copy)
@@ -4935,7 +4935,6 @@ execute_pre (bool do_fre)
   statistics_counter_event (cfun, "Constified", pre_stats.constified);
 
   clear_expression_ids ();
-  free_scc_vn ();
   if (!do_fre)
     {
       remove_dead_inserted_code ();
@@ -4945,6 +4944,9 @@ execute_pre (bool do_fre)
   scev_finalize ();
   fini_pre (do_fre);
 
+  todo |= tail_merge_optimize (todo);
+  free_scc_vn ();
+
   return todo;
 }
 
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 175801)
+++ gcc/common.opt	(working copy)
@@ -1937,6 +1937,10 @@ ftree-dominator-opts
 Common Report Var(flag_tree_dom) Optimization
 Enable dominator optimizations
 
+ftree-tail-merge
+Common Report Var(flag_tree_tail_merge) Optimization
+Enable tail merging on trees
+
 ftree-dse
 Common Report Var(flag_tree_dse) Optimization
 Enable dead store elimination
Index: gcc/tree-flow.h
===================================================================
--- gcc/tree-flow.h	(revision 175801)
+++ gcc/tree-flow.h	(working copy)
@@ -806,6 +806,9 @@ bool multiplier_allowed_in_address_p (HO
 unsigned multiply_by_cost (HOST_WIDE_INT, enum machine_mode, bool);
 bool may_be_nonaddressable_p (tree expr);
 
+/* In tree-ssa-tail-merge.c.  */
+extern unsigned int tail_merge_optimize (unsigned int);
+
 /* In tree-ssa-threadupdate.c.  */
 extern bool thread_through_all_blocks (bool);
 extern void register_jump_thread (edge, edge, edge);
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 175801)
+++ gcc/Makefile.in	(working copy)
@@ -1466,6 +1466,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-sink.o \
 	tree-ssa-structalias.o \
+	tree-ssa-tail-merge.o \
 	tree-ssa-ter.o \
 	tree-ssa-threadedge.o \
 	tree-ssa-threadupdate.o \
@@ -2427,6 +2428,13 @@ stor-layout.o : stor-layout.c $(CONFIG_H
    $(TREE_H) $(PARAMS_H) $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) output.h $(RTL_H) \
    $(GGC_H) $(TM_P_H) $(TARGET_H) langhooks.h $(REGS_H) gt-stor-layout.h \
    $(DIAGNOSTIC_CORE_H) $(CGRAPH_H) $(TREE_INLINE_H) $(TREE_DUMP_H) $(GIMPLE_H)
+tree-ssa-tail-merge.o: tree-ssa-tail-merge.c \
+   $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(BITMAP_H) \
+   $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
+   $(TREE_H) $(TREE_FLOW_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(FUNCTION_H) \
+   $(TIMEVAR_H) tree-ssa-sccvn.h \
+   $(CGRAPH_H) gimple-pretty-print.h tree-pretty-print.h $(PARAMS_H)
 tree-ssa-structalias.o: tree-ssa-structalias.c \
    $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(GGC_H) $(OBSTACK_H) $(BITMAP_H) \
    $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 175801)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -5688,6 +5688,48 @@ shared_bitmap_add (bitmap pt_vars)
   *slot = (void *) sbi;
 }
 
+/* Unshares the points-to bitmap of PT.  */
+
+static void
+pt_solution_unshare (struct pt_solution *pt)
+{
+  bitmap copy;
+
+  if (pt == NULL || pt->vars == NULL || shared_bitmap_table == NULL)
+    return;
+
+  copy = BITMAP_GGC_ALLOC ();
+  bitmap_copy (pt->vars, copy);
+  pt->vars = copy;
+}
+
+/* Shares the points-to bitmap of PT.  */
+
+static void
+pt_solution_share (struct pt_solution *pt)
+{
+  bitmap shared;
+
+  if (pt == NULL || pt->vars == NULL || shared_bitmap_table == NULL)
+    return;
+
+  shared = shared_bitmap_lookup (pt->vars);
+
+  if (!shared)
+    {
+      /* Share unshared bitmap.  */
+      shared_bitmap_add (pt->vars);
+      return;
+    }
+
+  /* Already using shared bitmap.  */
+  if (shared == pt->vars)
+    return;
+
+  /* Use shared bitmap.  */
+  bitmap_clear (pt->vars);
+  pt->vars = shared;
+}
 
 /* Set bits in INTO corresponding to the variable uids in solution set FROM.  */
 
@@ -5734,7 +5776,6 @@ find_what_var_points_to (varinfo_t orig_
   unsigned int i;
   bitmap_iterator bi;
   bitmap finished_solution;
-  bitmap result;
   varinfo_t vi;
 
   memset (pt, 0, sizeof (struct pt_solution));
@@ -5788,17 +5829,8 @@ find_what_var_points_to (varinfo_t orig_
   stats.points_to_sets_created++;
 
   set_uids_in_ptset (finished_solution, vi->solution, pt);
-  result = shared_bitmap_lookup (finished_solution);
-  if (!result)
-    {
-      shared_bitmap_add (finished_solution);
-      pt->vars = finished_solution;
-    }
-  else
-    {
-      pt->vars = result;
-      bitmap_clear (finished_solution);
-    }
+  pt->vars = finished_solution;
+  pt_solution_share (pt);
 }
 
 /* Given a pointer variable P, fill in its points-to set.  */
@@ -5921,6 +5953,25 @@ pt_solution_ior_into (struct pt_solution
   bitmap_ior_into (dest->vars, src->vars);
 }
 
+/* Like pt_solution_ior_into, but may be used if the points-to bitmap
+   of *DEST might be shared.  */
+
+void
+pt_solution_ior_into_shared (struct pt_solution *dest, struct pt_solution *src)
+{
+  if (!src->vars)
+    return;
+  if (!dest->vars)
+    {
+      dest->vars = src->vars;
+      return;
+    }
+
+  pt_solution_unshare (dest);
+  pt_solution_ior_into (dest, src);
+  pt_solution_share (dest);
+}
+
 /* Return true if the points-to solution *PT is empty.  */
 
 bool
@@ -6600,6 +6651,7 @@ delete_points_to_sets (void)
   unsigned int i;
 
   htab_delete (shared_bitmap_table);
+  shared_bitmap_table = NULL;
   if (dump_file && (dump_flags & TDF_STATS))
     fprintf (dump_file, "Points to sets created:%d\n",
 	     stats.points_to_sets_created);
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 175801)
+++ gcc/params.def	(working copy)
@@ -892,6 +892,11 @@ DEFPARAM (PARAM_MAX_STORES_TO_SINK,
           "Maximum number of conditional store pairs that can be sunk",
           2, 0, 0)
 
+DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
+          "max-tail-merge-comparisons",
+          "Maximum amount of similar bbs to compare a bb with",
+          10, 0, 0)
+
 
 /*
 Local variables:
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 175801)
+++ gcc/doc/invoke.texi	(working copy)
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol
 -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-pta -ftree-reassoc @gol
--ftree-sink -ftree-sra -ftree-switch-conversion @gol
+-ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
@@ -6091,7 +6091,7 @@ also turns on the following optimization
 -fsched-interblock  -fsched-spec @gol
 -fschedule-insns  -fschedule-insns2 @gol
 -fstrict-aliasing -fstrict-overflow @gol
--ftree-switch-conversion @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-pre @gol
 -ftree-vrp}
 
@@ -6974,6 +6974,11 @@ Perform conversion of simple initializat
 initializations from a scalar array.  This flag is enabled by default
 at @option{-O2} and higher.
 
+@item -ftree-tail-merge
+Merges identical blocks with same successors.  This flag is enabled by default
+at @option{-O2} and higher.  The run time of this pass can be limited using
+@option{max-tail-merge-comparisons} parameter.
+
 @item -ftree-dce
 @opindex ftree-dce
 Perform dead code elimination (DCE) on trees.  This flag is enabled by
@@ -8541,6 +8546,10 @@ This is used to avoid quadratic behavior
 The value of 0 will avoid limiting the search, but may slow down compilation
 of huge functions.  The default value is 30.
 
+@item max-tail-merge-comparisons
+The maximum amount of similar bbs to compare a bb with.  This is used to
+avoid quadratic behaviour in tree tail merging.  The default value is 10.
+
 @item max-unrolled-insns
 The maximum number of instructions that a loop should have if that loop
 is unrolled, and if the loop is unrolled, it determines how many times

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.
  2011-06-08  9:55 ` [PATCH, PR43864] Gimple level duplicate block cleanup - test cases Tom de Vries
@ 2011-07-18  2:54   ` Tom de Vries
  2011-08-19 18:38     ` Tom de Vries
  0 siblings, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-07-18  2:54 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 455 bytes --]

Updated version.

On 06/08/2011 11:45 AM, Tom de Vries wrote:
> On 06/08/2011 11:42 AM, Tom de Vries wrote:
> 
>> I'll send the patch with the testcases in a separate email.
> 

OK for trunk?

Thanks,
- Tom

2011-07-17  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
	* gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
	* gcc.dg/pr43864.c: New test.
	* gcc.dg/pr43864-2.c: Same.

[-- Attachment #2: pr43864.31.test.patch --]
[-- Type: text/x-patch, Size: 2481 bytes --]

Index: gcc/testsuite/gcc.dg/fold-compare-2.c
===================================================================
--- gcc/testsuite/gcc.dg/fold-compare-2.c	(revision 175801)
+++ gcc/testsuite/gcc.dg/fold-compare-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp" } */
+/* { dg-options "-O2 -fno-tree-tail-merge -fdump-tree-vrp" } */
 
 extern void abort (void);
 
Index: gcc/testsuite/gcc.dg/uninit-pred-2_c.c
===================================================================
--- gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(revision 175801)
+++ gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Wuninitialized -O2" } */
+/* { dg-options "-Wuninitialized -O2 -fno-tree-tail-merge" } */
 
 int g;
 void bar (void);
Index: gcc/testsuite/gcc.dg/pr43864.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+extern void foo (char*, int);
+extern void mysprintf (char *, char *);
+extern void myfree (void *);
+extern int access (char *, int);
+extern int fopen (char *, int);
+
+char *
+hprofStartupp (char *outputFileName, char *ctx)
+{
+  char fileName[1000];
+  int fp;
+  mysprintf (fileName, outputFileName);
+  if (access (fileName, 1) == 0)
+    {
+      myfree (ctx);
+      return 0;
+    }
+
+  fp = fopen (fileName, 0);
+  if (fp == 0)
+    {
+      myfree (ctx);
+      return 0;
+    }
+
+  foo (outputFileName, fp);
+
+  return ctx;
+}
+
+/* { dg-final { scan-tree-dump-times "myfree" 1 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: gcc/testsuite/gcc.dg/pr43864-2.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+int
+f (int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+    r = b + d;
+  else
+    {
+      e = b + d;
+      r = e;
+    }
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "\\\+" 1 "optimized"} } */
+/* { dg-final { scan-tree-dump-times "PHI" 0 "optimized"} } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-07-18  0:41           ` Tom de Vries
@ 2011-07-22 15:54             ` Richard Guenther
  2011-08-19 18:33               ` Tom de Vries
  2011-08-24  9:00               ` Tom de Vries
  0 siblings, 2 replies; 18+ messages in thread
From: Richard Guenther @ 2011-07-22 15:54 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Steven Bosscher, gcc-patches

On Sun, Jul 17, 2011 at 8:33 PM, Tom de Vries <vries@codesourcery.com> wrote:

> Bootstrapped and reg-tested on x86_64.  Ok for trunk (after ARM testing)?

+static int
+same_succ_equal (const void *ve1, const void *ve2)
+{
...
+  if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e1->bbs, EXIT_BLOCK)
+      || bitmap_bit_p (e2->bbs, ENTRY_BLOCK)
+      || bitmap_bit_p (e2->bbs, EXIT_BLOCK))
+    return 0;

that's odd - what are these checks for?

+  if (dump_file)
+    {
+      fprintf (dump_file, "initial worklist:\n");

with dump_flags & TDF_DETAILS

I'm now at merge_calls and wondering about alias info again.  We are
probably safe for the per-pointer information because we are not
operating flow-sensitive for memory and for merging require value-equivalence
for SSA names.  For calls the same should be true - we are not
flow- or context-sensitive, and even if we were context-sentitive we
require equivalent arguments (for memory arguments we should be safe
because of the non-flow-sensitivity).

So, did you actually run into problems?  If not then I suggest to remove
merge_calls completely (and the related changes that it requires).

+/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
+   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
+   phis is created, use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+              VEC (edge,heap) *redirected_edges)

...

+  if (vuse2 == NULL_TREE)
+    return;

hm, that's the case when there is no VUSE that is dominated by BB2
(or is in BB2).  Ok, might happen.

+             locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
+             add_phi_arg (phi, arg, e, locus1);

I don't think virtual operand PHIs should have locations, just use
UNKNOWN_LOCATION here.

+  locus2 = gimple_location (def_stmt2);

Likewise.

+  /* Create a phi, first with default argument vuse2 for all preds.  */
+  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+  VN_INFO_GET (lhs);
+  phi = create_phi_node (lhs, bb2);
+  SSA_NAME_DEF_STMT (lhs) = phi;
+  FOR_EACH_EDGE (e, ei, bb2->preds)
+    add_phi_arg (phi, vuse2, e, locus2);
+
+  /* Now overwrite the arguments associated with the redirected edges with
+     vuse1.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+    {
+      e = VEC_index (edge, redirected_edges, i);
+      gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));

No need for this assert.

+      if (vuse1)
+       arg = vuse1;
+      else
+       arg = BB_VOP_AT_EXIT (e->src);
+      SET_PHI_ARG_DEF (phi, e->dest_idx, arg);
+      locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));

See above.

+      gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
+    }


Can you maybe merge this with the update-existing-phi-case?  They
look all too similar.

+  /* Replace uses of vuse2 in bb2 with phi.  */
+  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
+    {
+      if (gimple_code (stmt) == GIMPLE_PHI)

Does FOR_EACH_IMM_USE_ON_STMT really not work for PHIs?
Other code doesn't seem to care.

+       {
+         edge e;
+         if (stmt == phi)
+           continue;
+         e = find_edge (bb2, gimple_bb (stmt));
+         if (e == NULL)
+           continue;
+         use_p = PHI_ARG_DEF_PTR_FROM_EDGE (stmt, e);
+         SET_USE (use_p, lhs);
+         update_stmt (stmt);
+       }
+      else if (gimple_bb (stmt) == bb2)

That check looks odd.  A use can very well appear in a forwarder block.

+       {
+         FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+           SET_USE (use_p, lhs);
+         update_stmt (stmt);
+       }
+    }

+/* Scans the vdefs and vuses of the insn of BB, and returns the vop at entry in
+   VOP_AT_ENTRY, and the vop at exit in VOP_AT_EXIT.  */
+
+static void
+insn_vops (basic_block bb, tree *vop_at_entry, tree *vop_at_exit)

it's easier to start from the bb end and walk until you see the
first vdef or vuse.  Then you have *vop_at_exit.  From there
just walk the SSA_NAME_DEF_STMTs of the vuse until you
hit one whose definition is not in BB - and you have *vop_at_entry.
That way you can avoid scanning most of the stmts.

The function also has an odd name ;)  It should be something like
vops_at_bb_entry_and_exit.

+static void
+vop_at_entry (basic_block bb1, basic_block bb2, tree *vop_at_entry1,
+             tree *vop_at_entry2)

so you don't need the vop at exit at all?  The function is a bit unclear
to me given it does so much stuff other than just computing the BBs
entry VOPs ...

+static void
+replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
+{

can you add some comments before the different phases of update?
I _think_ I understand what it does, but ...

+/* Runs tail merge optimization.  */
+
+unsigned int
+tail_merge_optimize (unsigned int todo)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+                     || !symbol_marked_for_renaming (gimple_vop (cfun)));

you need to simplify this to

  bool update_vops = !symbol_marked_for_renaming (gimple_vop (cfun));

+      if (nr_bbs_removed == 0)
+       break;
+
+      free_dominance_info (CDI_DOMINATORS);

we might want to limit the number of iterations we perform - especially
as you are re-computing dominators on each iteration.  What's the
maximum number of iterations you see during a GCC bootstrap?

+  if (dump_file)
+    fprintf (dump_file, "htab collision / search: %f\n",
+            htab_collisions (same_succ_htab));

in general without dump_flags & TDF_DETAILS passes should print
at most things when they did a transformation (some even don't do that).
Please double-check all your dump-prints.

+      todo |= (TODO_verify_ssa | TODO_verify_stmts | TODO_verify_flow
+              | TODO_dump_func);

should all be already set.

@ -4945,6 +4944,9 @@ execute_pre (bool do_fre)
   scev_finalize ();
   fini_pre (do_fre);

+  todo |= tail_merge_optimize (todo);
+  free_scc_vn ();

Please only run tail_merge_optimize once.  As we are running through
this code three times at -O2.  I suggest to try it in the !do_fre case
only which we only run once (as PRE).  If that doesn't work out
nicely we need to find other means (like introduce a
pass_fre_and_tail_merge which passes down another flag and replace
one FRE pass with that new combined pass).

Index: gcc/tree-flow.h
===================================================================
--- gcc/tree-flow.h     (revision 175801)
+++ gcc/tree-flow.h     (working copy)
@@ -806,6 +806,9 @@ bool multiplier_allowed_in_address_p (HO
 unsigned multiply_by_cost (HOST_WIDE_INT, enum machine_mode, bool);
 bool may_be_nonaddressable_p (tree expr);

+/* In tree-ssa-tail-merge.c.  */
+extern unsigned int tail_merge_optimize (unsigned int);

Eh, tree-flow.h kitchen-sink ;)  Please put it into tree-pass.h instead.

That said - I'm reasonably happy with the pass now, but it's rather large
(this review took 40min again ...) so I appreciate a second look from
somebody else.

Btw, can you expand a bit on the amount of testcases?

Thanks,
Richard.


> Thanks,
> - Tom
>
> 2011-07-17  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43864
>        * tree-ssa-tail-merge.c: New file.
>        (struct same_succ): Define.
>        (same_succ_t, const_same_succ_t): New typedef.
>        (struct bb_cluster): Define.
>        (bb_cluster_t, const_bb_cluster_t): New typedef.
>        (struct aux_bb_info): Define.
>        (BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define.
>        (gvn_uses_equal): New function.
>        (same_succ_print, same_succ_print_traverse, same_succ_hash)
>        (inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
>        (same_succ_reset): New function.
>        (same_succ_htab, same_succ_edge_flags)
>        (deleted_bbs, deleted_bb_preds): New var.
>        (debug_same_succ): New function.
>        (worklist): New var.
>        (print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
>        (init_worklist, delete_worklist, delete_basic_block_same_succ)
>        (same_succ_flush_bbs, update_worklist): New function.
>        (print_cluster, debug_cluster, same_predecessors)
>        (add_bb_to_cluster, new_cluster, delete_cluster): New function.
>        (all_clusters): New var.
>        (alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
>        (merge_clusters, set_cluster): New function.
>        (gimple_equal_p, find_duplicate, same_phi_alternatives_1)
>        (same_phi_alternatives, bb_has_non_vop_phi, find_clusters_1)
>        (find_clusters): New function.
>        (merge_calls, update_vuses, vop_phi, insn_vops, vop_at_entry)
>        (replace_block_by): New function.
>        (update_bbs): New var.
>        (apply_clusters): New function.
>        (update_debug_stmt, update_debug_stmts): New function.
>        (tail_merge_optimize): New function.
>        tree-flow.h (tail_merge_optimize): Declare.
>        * tree-ssa-pre.c (execute_pre): Use tail_merge_optimize.
>        * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
>        (tree-ssa-tail-merge.o): New rule.
>        * opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
>        OPT_LEVELS_2_PLUS.
>        * tree-ssa-sccvn.c (vn_valueize): Move to ...
>        * tree-ssa-sccvn.h (vn_valueize): Here.
>        * tree-ssa-alias.h (pt_solution_ior_into_shared): Declare.
>        * tree-ssa-structalias.c (find_what_var_points_to): Factor out and
>        use ...
>        (pt_solution_share): New function.
>        (pt_solution_unshare, pt_solution_ior_into_shared): New function.
>        (delete_points_to_sets): Nullify shared_bitmap_table after deletion.
>        * timevar.def (TV_TREE_TAIL_MERGE): New timevar.
>        * common.opt (ftree-tail-merge): New switch.
>        * params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter.
>        * doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge.
>        (-ftree-tail-merge, max-tail-merge-comparisons): New item.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-07-22 15:54             ` Richard Guenther
@ 2011-08-19 18:33               ` Tom de Vries
  2011-08-24  9:00               ` Tom de Vries
  1 sibling, 0 replies; 18+ messages in thread
From: Tom de Vries @ 2011-08-19 18:33 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 12221 bytes --]

Hi Richard,

sorry for the delayed response, I was on vacation for a bit.

On 07/22/2011 05:36 PM, Richard Guenther wrote:
> On Sun, Jul 17, 2011 at 8:33 PM, Tom de Vries <vries@codesourcery.com> wrote:
> 
>> Bootstrapped and reg-tested on x86_64.  Ok for trunk (after ARM testing)?
> 
> +static int
> +same_succ_equal (const void *ve1, const void *ve2)
> +{
> ...
> +  if (bitmap_bit_p (e1->bbs, ENTRY_BLOCK)
> +      || bitmap_bit_p (e1->bbs, EXIT_BLOCK)
> +      || bitmap_bit_p (e2->bbs, ENTRY_BLOCK)
> +      || bitmap_bit_p (e2->bbs, EXIT_BLOCK))
> +    return 0;
> 
> that's odd - what are these checks for?
>

Turned out to be dead code, now removed.

> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "initial worklist:\n");
> 
> with dump_flags & TDF_DETAILS
> 
> I'm now at merge_calls and wondering about alias info again.  We are
> probably safe for the per-pointer information because we are not
> operating flow-sensitive for memory and for merging require value-equivalence
> for SSA names.  For calls the same should be true - we are not
> flow- or context-sensitive, and even if we were context-sentitive we
> require equivalent arguments (for memory arguments we should be safe
> because of the non-flow-sensitivity).
> 
> So, did you actually run into problems?  If not then I suggest to remove
> merge_calls completely (and the related changes that it requires).
> 

No, I did not run into actual problems. So, merge_calls and dependencies are now
removed.

> +/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
> +   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
> +   phis is created, use the phi instead of VUSE2 in BB2.  */
> +
> +static void
> +update_vuses (tree vuse1, tree vuse2, basic_block bb2,
> +              VEC (edge,heap) *redirected_edges)
> 
> ...
> 
> +  if (vuse2 == NULL_TREE)
> +    return;
> 
> hm, that's the case when there is no VUSE that is dominated by BB2
> (or is in BB2).  Ok, might happen.
> 
> +             locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
> +             add_phi_arg (phi, arg, e, locus1);
> 
> I don't think virtual operand PHIs should have locations, just use
> UNKNOWN_LOCATION here.
> 

Done.

> +  locus2 = gimple_location (def_stmt2);
> 
> Likewise.
> 

Done.

> +  /* Create a phi, first with default argument vuse2 for all preds.  */
> +  lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
> +  VN_INFO_GET (lhs);
> +  phi = create_phi_node (lhs, bb2);
> +  SSA_NAME_DEF_STMT (lhs) = phi;
> +  FOR_EACH_EDGE (e, ei, bb2->preds)
> +    add_phi_arg (phi, vuse2, e, locus2);
> +
> +  /* Now overwrite the arguments associated with the redirected edges with
> +     vuse1.  */
> +  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
> +    {
> +      e = VEC_index (edge, redirected_edges, i);
> +      gcc_assert (PHI_ARG_DEF_FROM_EDGE (phi, e));
> 
> No need for this assert.
> 

Removed.

> +      if (vuse1)
> +       arg = vuse1;
> +      else
> +       arg = BB_VOP_AT_EXIT (e->src);
> +      SET_PHI_ARG_DEF (phi, e->dest_idx, arg);
> +      locus1 = gimple_location (SSA_NAME_DEF_STMT (arg));
> 
> See above.

Done.

> 
> +      gimple_phi_arg_set_location (phi, e->dest_idx, locus1);
> +    }
> 
> 
> Can you maybe merge this with the update-existing-phi-case?  They
> look all too similar.
> 

Indeed. Merged now.

> +  /* Replace uses of vuse2 in bb2 with phi.  */
> +  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
> +    {
> +      if (gimple_code (stmt) == GIMPLE_PHI)
> 
> Does FOR_EACH_IMM_USE_ON_STMT really not work for PHIs?
> Other code doesn't seem to care.

Now rewritten to use FOR_EACH_IMM_USE_ON_STMT also for PHIs.

> +       {
> +         edge e;
> +         if (stmt == phi)
> +           continue;
> +         e = find_edge (bb2, gimple_bb (stmt));
> +         if (e == NULL)
> +           continue;
> +         use_p = PHI_ARG_DEF_PTR_FROM_EDGE (stmt, e);
> +         SET_USE (use_p, lhs);
> +         update_stmt (stmt);
> +       }
> +      else if (gimple_bb (stmt) == bb2)
> 
> That check looks odd.  A use can very well appear in a forwarder block.
> 

AFAIU, it can't. The pass does not allow forwarder blocks between bb2 and it's
successor(s).

> +       {
> +         FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
> +           SET_USE (use_p, lhs);
> +         update_stmt (stmt);
> +       }
> +    }
> 
> +/* Scans the vdefs and vuses of the insn of BB, and returns the vop at entry in
> +   VOP_AT_ENTRY, and the vop at exit in VOP_AT_EXIT.  */
> +
> +static void
> +insn_vops (basic_block bb, tree *vop_at_entry, tree *vop_at_exit)
> 
> it's easier to start from the bb end and walk until you see the
> first vdef or vuse.  Then you have *vop_at_exit.  From there
> just walk the SSA_NAME_DEF_STMTs of the vuse until you
> hit one whose definition is not in BB - and you have *vop_at_entry.
> That way you can avoid scanning most of the stmts.
> 
> The function also has an odd name ;)  It should be something like
> vops_at_bb_entry_and_exit.
> 
> +static void
> +vop_at_entry (basic_block bb1, basic_block bb2, tree *vop_at_entry1,
> +             tree *vop_at_entry2)
> 
> so you don't need the vop at exit at all?  The function is a bit unclear
> to me given it does so much stuff other than just computing the BBs
> entry VOPs ...
> 

Right, I overdid that a bit. I now have a simpler function vop_at_entry, which
only tries to calculate vop_at_entry.

> +static void
> +replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
> +{
> 
> can you add some comments before the different phases of update?
> I _think_ I understand what it does, but ...
> 

Ok, added some comments there, I hope that clarifies things.

> +/* Runs tail merge optimization.  */
> +
> +unsigned int
> +tail_merge_optimize (unsigned int todo)
> +{
> +  int nr_bbs_removed_total = 0;
> +  int nr_bbs_removed;
> +  bool loop_entered = false;
> +  int iteration_nr = 0;
> +  bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
> +                     || !symbol_marked_for_renaming (gimple_vop (cfun)));
> 
> you need to simplify this to
> 
>   bool update_vops = !symbol_marked_for_renaming (gimple_vop (cfun));
> 

Done.

> +      if (nr_bbs_removed == 0)
> +       break;
> +
> +      free_dominance_info (CDI_DOMINATORS);
> 
> we might want to limit the number of iterations we perform - especially
> as you are re-computing dominators on each iteration.  What's the
> maximum number of iterations you see during a GCC bootstrap?
> 

The maximum number is 16, which occurred 4 times, 2 times for
insn_default_latency_bdver1 and 2 times for internal_dfa_insn_code_bdver1,

histogram of iteration_nr:
0  139971
1   70248
2    2416
3     142
4      32
5       2
6       0
...
15      0
16      4

> +  if (dump_file)
> +    fprintf (dump_file, "htab collision / search: %f\n",
> +            htab_collisions (same_succ_htab));
> 
> in general without dump_flags & TDF_DETAILS passes should print
> at most things when they did a transformation (some even don't do that).
> Please double-check all your dump-prints.
> 

Done.

> +      todo |= (TODO_verify_ssa | TODO_verify_stmts | TODO_verify_flow
> +              | TODO_dump_func);
> 
> should all be already set.

Only TODO_verify_flow is always set by pre (and I think that's unnecessary if
nothing changed).

> 
> @ -4945,6 +4944,9 @@ execute_pre (bool do_fre)
>    scev_finalize ();
>    fini_pre (do_fre);
> 
> +  todo |= tail_merge_optimize (todo);
> +  free_scc_vn ();
> 
> Please only run tail_merge_optimize once.  As we are running through
> this code three times at -O2.  I suggest to try it in the !do_fre case
> only which we only run once (as PRE).  

Done.

> If that doesn't work out
> nicely we need to find other means (like introduce a
> pass_fre_and_tail_merge which passes down another flag and replace
> one FRE pass with that new combined pass).
> 
> Index: gcc/tree-flow.h
> ===================================================================
> --- gcc/tree-flow.h     (revision 175801)
> +++ gcc/tree-flow.h     (working copy)
> @@ -806,6 +806,9 @@ bool multiplier_allowed_in_address_p (HO
>  unsigned multiply_by_cost (HOST_WIDE_INT, enum machine_mode, bool);
>  bool may_be_nonaddressable_p (tree expr);
> 
> +/* In tree-ssa-tail-merge.c.  */
> +extern unsigned int tail_merge_optimize (unsigned int);
> 
> Eh, tree-flow.h kitchen-sink ;)  Please put it into tree-pass.h instead.
> 

Done.

> That said - I'm reasonably happy with the pass now,

Great :)

> but it's rather large
> (this review took 40min again ...) so I appreciate a second look from
> somebody else.

Thanks a lot for the review. I'll ask Ian Taylor in a separate email for review.

> 
> Btw, can you expand a bit on the amount of testcases?
> 

Increased amount of test-cases from 2 to 4. Will send them in the test-case thread.

> Thanks,
> Richard.

Runtime and impact on cc1:

               real        user        sys
without  21m21.308s  20m25.640s  0m27.540s
with     21m12.092s  20m22.540s  0m28.590s
                     ----------
                          -3.1s
                         -0.25%

$ size without/cc1 with/cc1
    text   data      bss       dec      hex     filename
17836921  41856  1351264  19230041  1256d59  without/cc1
17531665  41856  1351264  18924785  120c4f1     with/cc1
--------
 -305256
   -1.71%

I did some additional changes:
1. I added disregarding the order of local calculations (see local_def and
   uses).
2. I separated out the checking of availability of dependences
   (see deps_ok_for_redirect). This was previously checked while matching.
3. I keep a rep_bb for clusters. This is a bug fix, previous we just
   used the first bb of the clusters, and sometimes this resulted in
   uses not being dominated by their defs. This bug surfaced once I added 1.

Bootstrapped & reg-tested on x86_64.

Thanks again,
- Tom

2011-08-19  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* tree-ssa-tail-merge.c: New file.
	(struct same_succ): Define.
	(same_succ_t, const_same_succ_t): New typedef.
	(struct bb_cluster): Define.
	(bb_cluster_t, const_bb_cluster_t): New typedef.
	(struct aux_bb_info): Define.
	(BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define.
	(gvn_uses_equal): New function.
	(same_succ_print, same_succ_print_traverse, update_dep_bb)
	(stmt_update_dep_bb, local_def, same_succ_hash)
	(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
	(same_succ_reset): New function.
	(same_succ_htab, same_succ_edge_flags)
	(deleted_bbs, deleted_bb_preds): New var.
	(debug_same_succ): New function.
	(worklist): New var.
	(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
	(init_worklist, delete_worklist, delete_basic_block_same_succ)
	(same_succ_flush_bbs, purge_bbs, update_worklist): New function.
	(print_cluster, debug_cluster, update_rep_bb)
	(add_bb_to_cluster, new_cluster, delete_cluster): New function.
	(all_clusters): New var.
	(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
	(merge_clusters, set_cluster): New function.
	(gimple_equal_p, gsi_advance_bw_nondebug_nonlocal, find_duplicate)
	(same_phi_alternatives_1, same_phi_alternatives, bb_has_non_vop_phi)
	(deps_ok_for_redirect_from_bb_to_bb, deps_ok_for_redirect)
	(find_clusters_1, find_clusters): New function.
	(update_vuses, vop_phi, vop_at_entry, replace_block_by): New function.
	(update_bbs): New var.
	(apply_clusters): New function.
	(update_debug_stmt, update_debug_stmts): New function.
	(tail_merge_optimize): New function.
	tree-pass.h (tail_merge_optimize): Declare.
	* tree-ssa-pre.c (execute_pre): Use tail_merge_optimize.
	* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
	(tree-ssa-tail-merge.o): New rule.
	* opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
	OPT_LEVELS_2_PLUS.
	* tree-ssa-sccvn.c (vn_valueize): Move to ...
	* tree-ssa-sccvn.h (vn_valueize): Here.
	* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
	* common.opt (ftree-tail-merge): New switch.
	* params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter.
	* doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge.
	(-ftree-tail-merge, max-tail-merge-comparisons): New item.

[-- Attachment #2: pr43864.37.patch --]
[-- Type: text/x-patch, Size: 54869 bytes --]

Index: gcc/tree-ssa-tail-merge.c
===================================================================
--- gcc/tree-ssa-tail-merge.c	(revision 0)
+++ gcc/tree-ssa-tail-merge.c	(revision 0)
@@ -0,0 +1,1701 @@
+/* Tail merging for gimple.
+   Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Tom de Vries (tom@codesourcery.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 3, or (at your option)
+any later version.
+
+GCC is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+/* Pass overview.
+
+
+   MOTIVATIONAL EXAMPLE
+
+   gimple representation of gcc/testsuite/gcc.dg/pr43864.c at
+
+   hprofStartupp (charD.1 * outputFileNameD.2600, charD.1 * ctxD.2601)
+   {
+     struct FILED.1638 * fpD.2605;
+     charD.1 fileNameD.2604[1000];
+     intD.0 D.3915;
+     const charD.1 * restrict outputFileName.0D.3914;
+
+     # BLOCK 2 freq:10000
+     # PRED: ENTRY [100.0%]  (fallthru,exec)
+     # PT = nonlocal { D.3926 } (restr)
+     outputFileName.0D.3914_3
+       = (const charD.1 * restrict) outputFileNameD.2600_2(D);
+     # .MEMD.3923_13 = VDEF <.MEMD.3923_12(D)>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     sprintfD.759 (&fileNameD.2604, outputFileName.0D.3914_3);
+     # .MEMD.3923_14 = VDEF <.MEMD.3923_13>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     D.3915_4 = accessD.2606 (&fileNameD.2604, 1);
+     if (D.3915_4 == 0)
+       goto <bb 3>;
+     else
+       goto <bb 4>;
+     # SUCC: 3 [10.0%]  (true,exec) 4 [90.0%]  (false,exec)
+
+     # BLOCK 3 freq:1000
+     # PRED: 2 [10.0%]  (true,exec)
+     # .MEMD.3923_15 = VDEF <.MEMD.3923_14>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 4 freq:9000
+     # PRED: 2 [90.0%]  (false,exec)
+     # .MEMD.3923_16 = VDEF <.MEMD.3923_14>
+     # PT = nonlocal escaped
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fpD.2605_8 = fopenD.1805 (&fileNameD.2604[0], 0B);
+     if (fpD.2605_8 == 0B)
+       goto <bb 5>;
+     else
+       goto <bb 6>;
+     # SUCC: 5 [1.9%]  (true,exec) 6 [98.1%]  (false,exec)
+
+     # BLOCK 5 freq:173
+     # PRED: 4 [1.9%]  (true,exec)
+     # .MEMD.3923_17 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     freeD.898 (ctxD.2601_5(D));
+     goto <bb 7>;
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 6 freq:8827
+     # PRED: 4 [98.1%]  (false,exec)
+     # .MEMD.3923_18 = VDEF <.MEMD.3923_16>
+     # USE = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     # CLB = nonlocal null { fileNameD.2604 D.3926 } (restr)
+     fooD.2599 (outputFileNameD.2600_2(D), fpD.2605_8);
+     # SUCC: 7 [100.0%]  (fallthru,exec)
+
+     # BLOCK 7 freq:10000
+     # PRED: 3 [100.0%]  (fallthru,exec) 5 [100.0%]  (fallthru,exec)
+             6 [100.0%]  (fallthru,exec)
+     # PT = nonlocal null
+
+     # ctxD.2601_1 = PHI <0B(3), 0B(5), ctxD.2601_5(D)(6)>
+     # .MEMD.3923_11 = PHI <.MEMD.3923_15(3), .MEMD.3923_17(5),
+                            .MEMD.3923_18(6)>
+     # VUSE <.MEMD.3923_11>
+     return ctxD.2601_1;
+     # SUCC: EXIT [100.0%]
+   }
+
+   bb 3 and bb 5 can be merged.  The blocks have different predecessors, but the
+   same successors, and the same operations.
+
+
+   CONTEXT
+
+   A technique called tail merging (or cross jumping) can fix the example
+   above.  For a block, we look for common code at the end (the tail) of the
+   predecessor blocks, and insert jumps from one block to the other.
+   The example is a special case for tail merging, in that 2 whole blocks
+   can be merged, rather than just the end parts of it.
+   We currently only focus on whole block merging, so in that sense
+   calling this pass tail merge is a bit of a misnomer.
+
+   We distinguish 2 kinds of situations in which blocks can be merged:
+   - same operations, same predecessors.  The successor edges coming from one
+     block are redirected to come from the other block.
+   - same operations, same successors.  The predecessor edges entering one block
+     are redirected to enter the other block.  Note that this operation might
+     involve introducing phi operations.
+
+   For efficient implementation, we would like to value numbers the blocks, and
+   have a comparison operator that tells us whether the blocks are equal.
+   Besides being runtime efficient, block value numbering should also abstract
+   from irrelevant differences in order of operations, much like normal value
+   numbering abstracts from irrelevant order of operations.
+
+   For the first situation (same_operations, same predecessors), normal value
+   numbering fits well.  We can calculate a block value number based on the
+   value numbers of the defs and vdefs.
+
+   For the second situation (same operations, same successors), this approach
+   doesn't work so well.  We can illustrate this using the example.  The calls
+   to free use different vdefs: MEMD.3923_16 and MEMD.3923_14, and these will
+   remain different in value numbering, since they represent different memory
+   states.  So the resulting vdefs of the frees will be different in value
+   numbering, so the block value numbers will be different.
+
+   The reason why we call the blocks equal is not because they define the same
+   values, but because uses in the blocks use (possibly different) defs in the
+   same way.  To be able to detect this efficiently, we need to do some kind of
+   reverse value numbering, meaning number the uses rather than the defs, and
+   calculate a block value number based on the value number of the uses.
+   Ideally, a block comparison operator will also indicate which phis are needed
+   to merge the blocks.
+
+   For the moment, we don't do block value numbering, but we do insn-by-insn
+   matching, using scc value numbers to match operations with results, and
+   structural comparison otherwise, while ignoring vop mismatches.
+
+
+   IMPLEMENTATION
+
+   1. The pass first determines all groups of blocks with the same successor
+      blocks.
+   2. Within each group, it tries to determine clusters of equal basic blocks.
+   3. The clusters are applied.
+   4. The same successor groups are updated.
+   5. This process is repeated from 2 onwards, until no more changes.
+
+
+   LIMITATIONS/TODO
+
+   - block only
+   - handles only 'same operations, same successors'.
+     It handles same predecessors as a special subcase though.
+   - does not implement the reverse value numbering and block value numbering.
+   - improve memory allocation: use garbage collected memory, obstacks,
+     allocpools where appropriate.
+   - no insertion of gimple_reg phis,  We only introduce vop-phis.
+   - handle blocks with gimple_reg phi_nodes.
+
+
+   SWITCHES
+
+   - ftree-tail-merge.  On at -O2.  We may have to enable it only at -Os.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tm_p.h"
+#include "basic-block.h"
+#include "output.h"
+#include "flags.h"
+#include "function.h"
+#include "tree-flow.h"
+#include "timevar.h"
+#include "bitmap.h"
+#include "tree-ssa-alias.h"
+#include "params.h"
+#include "tree-pretty-print.h"
+#include "hashtab.h"
+#include "gimple-pretty-print.h"
+#include "tree-ssa-sccvn.h"
+#include "tree-dump.h"
+
+/* Describes a group of bbs with the same successors.  The successor bbs are
+   cached in succs, and the successor edge flags are cached in succ_flags.
+   If a bb has the EDGE_TRUE/VALSE_VALUE flags swapped compared to succ_flags,
+   it's marked in inverse.
+   Additionally, the hash value for the struct is cached in hashval, and
+   in_worklist indicates whether it's currently part of worklist.  */
+
+struct same_succ
+{
+  /* The bbs that have the same successor bbs.  */
+  bitmap bbs;
+  /* The successor bbs.  */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
+     bb.  */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs.  */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is currently in the worklist.  */
+  bool in_worklist;
+  /* The hash value of the struct.  */
+  hashval_t hashval;
+};
+typedef struct same_succ *same_succ_t;
+typedef const struct same_succ *const_same_succ_t;
+
+/* A group of bbs where 1 bb from bbs can replace the other bbs.  */
+
+struct bb_cluster
+{
+  /* The bbs in the cluster.  */
+  bitmap bbs;
+  /* The preds of the bbs in the cluster.  */
+  bitmap preds;
+  /* Index in all_clusters vector.  */
+  int index;
+  /* The bb to replace the cluster with.  */
+  basic_block rep_bb;
+};
+typedef struct bb_cluster *bb_cluster_t;
+typedef const struct bb_cluster *const_bb_cluster_t;
+
+/* Per bb-info.  */
+
+struct aux_bb_info
+{
+  /* The number of non-debug statements in the bb.  */
+  int size;
+  /* The same_succ that this bb is a member of.  */
+  same_succ_t same_succ;
+  /* The cluster that this bb is a member of.  */
+  bb_cluster_t cluster;
+  /* The vop state at the exit of a bb.  This is shortlived data, used to
+     communicate data between update_block_by and update_vuses.  */
+  tree vop_at_exit;
+  /* The bb that either contains or is dominated by the dependencies of the
+     bb.  */
+  basic_block dep_bb;
+};
+
+/* Macros to access the fields of struct aux_bb_info.  */
+
+#define BB_SIZE(bb) (((struct aux_bb_info *)bb->aux)->size)
+#define BB_SAME_SUCC(bb) (((struct aux_bb_info *)bb->aux)->same_succ)
+#define BB_CLUSTER(bb) (((struct aux_bb_info *)bb->aux)->cluster)
+#define BB_VOP_AT_EXIT(bb) (((struct aux_bb_info *)bb->aux)->vop_at_exit)
+#define BB_DEP_BB(bb) (((struct aux_bb_info *)bb->aux)->dep_bb)
+
+/* VAL1 and VAL2 are either:
+   - uses in BB1 and BB2, or
+   - phi alternatives for BB1 and BB2.
+   Return true if the uses have the same gvn value.  */
+
+static bool
+gvn_uses_equal (tree val1, tree val2)
+{
+  gcc_checking_assert (val1 != NULL_TREE && val2 != NULL_TREE);
+
+  if (val1 == val2)
+    return true;
+
+  if (vn_valueize (val1) != vn_valueize (val2))
+    return false;
+
+  return ((TREE_CODE (val1) == SSA_NAME || CONSTANT_CLASS_P (val1))
+	  && (TREE_CODE (val2) == SSA_NAME || CONSTANT_CLASS_P (val2)));
+}
+
+/* Prints E to FILE.  */
+
+static void
+same_succ_print (FILE *file, const same_succ_t e)
+{
+  unsigned int i;
+  bitmap_print (file, e->bbs, "bbs:", "\n");
+  bitmap_print (file, e->succs, "succs:", "\n");
+  bitmap_print (file, e->inverse, "inverse:", "\n");
+  fprintf (file, "flags:");
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    fprintf (file, " %x", VEC_index (int, e->succ_flags, i));
+  fprintf (file, "\n");
+}
+
+/* Prints same_succ VE to VFILE.  */
+
+static int
+same_succ_print_traverse (void **ve, void *vfile)
+{
+  const same_succ_t e = *((const same_succ_t *)ve);
+  FILE *file = ((FILE*)vfile);
+  same_succ_print (file, e);
+  return 1;
+}
+
+/* Update BB_DEP_BB (USE_BB), given a use of VAL in USE_BB.  */
+
+static void
+update_dep_bb (basic_block use_bb, tree val)
+{
+  basic_block dep_bb;
+
+  /* Not a dep.  */
+  if (TREE_CODE (val) != SSA_NAME)
+    return;
+
+  /* Skip use of global def.  */
+  if (SSA_NAME_IS_DEFAULT_DEF (val))
+    return;
+
+  /* Skip use of local def.  */
+  dep_bb = gimple_bb (SSA_NAME_DEF_STMT (val));
+  if (dep_bb == use_bb)
+    return;
+
+  if (BB_DEP_BB (use_bb) == NULL
+      || dominated_by_p (CDI_DOMINATORS, dep_bb, BB_DEP_BB (use_bb)))
+    BB_DEP_BB (use_bb) = dep_bb;
+}
+
+/* Update BB_DEP_BB, given the dependencies in STMT.  */
+
+static void
+stmt_update_dep_bb (gimple stmt)
+{
+  ssa_op_iter iter;
+  use_operand_p use;
+
+  FOR_EACH_SSA_USE_OPERAND (use, stmt, iter, SSA_OP_USE)
+    update_dep_bb (gimple_bb (stmt), USE_FROM_PTR (use));
+}
+
+/* Returns whether VAL is used in the same bb as in which it is defined, or
+   in the phi of a successor bb.  */
+
+static bool
+local_def (tree val)
+{
+  gimple stmt, def_stmt;
+  basic_block bb, def_bb;
+  imm_use_iterator iter;
+  bool res;
+
+  if (TREE_CODE (val) != SSA_NAME)
+    return false;
+  def_stmt = SSA_NAME_DEF_STMT (val);
+  def_bb = gimple_bb (def_stmt);
+
+  res = true;
+  FOR_EACH_IMM_USE_STMT (stmt, iter, val)
+    {
+      bb = gimple_bb (stmt);
+      if (bb == def_bb)
+	continue;
+      if (gimple_code (stmt) == GIMPLE_PHI
+	  && find_edge (def_bb, bb))
+	continue;
+      res = false;
+      BREAK_FROM_IMM_USE_STMT (iter);
+    }
+  return res;
+}
+
+/* Calculates hash value for same_succ VE.  */
+
+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t)ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  basic_block bb = BASIC_BLOCK (first);
+  int size = 0;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  tree arg;
+  unsigned int s;
+  bitmap_iterator bs;
+
+  for (gsi = gsi_start_nondebug_bb (bb);
+       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      stmt_update_dep_bb (stmt);
+      if (is_gimple_assign (stmt) && local_def (gimple_get_lhs (stmt))
+	  && !gimple_has_side_effects (stmt))
+	continue;
+      size++;
+
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (is_gimple_assign (stmt))
+	hashval = iterative_hash_hashval_t (gimple_assign_rhs_code (stmt),
+					    hashval);
+      if (!is_gimple_call (stmt))
+	continue;
+      if (gimple_call_internal_p (stmt))
+	hashval = iterative_hash_hashval_t
+	  ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+	hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
+      for (i = 0; i < gimple_call_num_args (stmt); i++)
+	{
+	  arg = gimple_call_arg (stmt, i);
+	  arg = vn_valueize (arg);
+	  hashval = iterative_hash_expr (arg, hashval);
+	}
+    }
+
+  hashval = iterative_hash_hashval_t (size, hashval);
+  BB_SIZE (bb) = size;
+
+  for (i = 0; i < VEC_length (int, e->succ_flags); ++i)
+    {
+      flags = VEC_index (int, e->succ_flags, i);
+      flags = flags & ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+      hashval = iterative_hash_hashval_t (flags, hashval);
+    }
+
+  EXECUTE_IF_SET_IN_BITMAP (e->succs, 0, s, bs)
+    {
+      int n = find_edge (bb, BASIC_BLOCK (s))->dest_idx;
+      for (gsi = gsi_start_phis (BASIC_BLOCK (s)); !gsi_end_p (gsi);
+	   gsi_next (&gsi))
+	{
+	  gimple phi = gsi_stmt (gsi);
+	  tree lhs = gimple_phi_result (phi);
+	  tree val = gimple_phi_arg_def (phi, n);
+
+	  if (!is_gimple_reg (lhs))
+	    continue;
+	  update_dep_bb (bb, val);
+	}
+    }
+
+  return hashval;
+}
+
+/* Returns true if E1 and E2 have 2 successors, and if the successor flags
+   are inverse for the EDGE_TRUE_VALUE and EDGE_FALSE_VALUE flags, and equal for
+   the other edge flags.  */
+
+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;
+
+  f1a = VEC_index (int, e1->succ_flags, 0);
+  f1b = VEC_index (int, e1->succ_flags, 1);
+  f2a = VEC_index (int, e2->succ_flags, 0);
+  f2b = VEC_index (int, e2->succ_flags, 1);
+
+  if (f1a == f2a && f1b == f2b)
+    return false;
+
+  return (f1a & mask) == (f2a & mask) && (f1b & mask) == (f2b & mask);
+}
+
+/* Compares SAME_SUCCs VE1 and VE2.  */
+
+static int
+same_succ_equal (const void *ve1, const void *ve2)
+{
+  const_same_succ_t e1 = (const_same_succ_t)ve1;
+  const_same_succ_t e2 = (const_same_succ_t)ve2;
+  unsigned int i, first1, first2;
+  gimple_stmt_iterator gsi1, gsi2;
+  gimple s1, s2;
+  basic_block bb1, bb2;
+
+  if (e1->hashval != e2->hashval)
+    return 0;
+
+  if (VEC_length (int, e1->succ_flags) != VEC_length (int, e2->succ_flags))
+    return 0;
+
+  if (!bitmap_equal_p (e1->succs, e2->succs))
+    return 0;
+
+  if (!inverse_flags (e1, e2))
+    {
+      for (i = 0; i < VEC_length (int, e1->succ_flags); ++i)
+	if (VEC_index (int, e1->succ_flags, i)
+	    != VEC_index (int, e1->succ_flags, i))
+	  return 0;
+    }
+
+  first1 = bitmap_first_set_bit (e1->bbs);
+  first2 = bitmap_first_set_bit (e2->bbs);
+
+  bb1 = BASIC_BLOCK (first1);
+  bb2 = BASIC_BLOCK (first2);
+
+  if (BB_SIZE (bb1) != BB_SIZE (bb2))
+    return 0;
+
+  gsi1 = gsi_start_nondebug_bb (bb1);
+  gsi2 = gsi_start_nondebug_bb (bb2);
+  while (!(gsi_end_p (gsi1) || gsi_end_p (gsi2)))
+    {
+      s1 = gsi_stmt (gsi1);
+      s2 = gsi_stmt (gsi2);
+      if (gimple_code (s1) != gimple_code (s2))
+	return 0;
+      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	return 0;
+      gsi_next_nondebug (&gsi1);
+      gsi_next_nondebug (&gsi2);
+    }
+
+  return 1;
+}
+
+/* Alloc and init a new SAME_SUCC.  */
+
+static same_succ_t
+same_succ_alloc (void)
+{
+  same_succ_t same = XNEW (struct same_succ);
+
+  same->bbs = BITMAP_ALLOC (NULL);
+  same->succs = BITMAP_ALLOC (NULL);
+  same->inverse = BITMAP_ALLOC (NULL);
+  same->succ_flags = VEC_alloc (int, heap, 10);
+  same->in_worklist = false;
+
+  return same;
+}
+
+/* Delete same_succ VE.  */
+
+static void
+same_succ_delete (void *ve)
+{
+  same_succ_t e = (same_succ_t)ve;
+
+  BITMAP_FREE (e->bbs);
+  BITMAP_FREE (e->succs);
+  BITMAP_FREE (e->inverse);
+  VEC_free (int, heap, e->succ_flags);
+
+  XDELETE (ve);
+}
+
+/* Reset same_succ SAME.  */
+
+static void
+same_succ_reset (same_succ_t same)
+{
+  bitmap_clear (same->bbs);
+  bitmap_clear (same->succs);
+  bitmap_clear (same->inverse);
+  VEC_truncate (int, same->succ_flags, 0);
+}
+
+/* Hash table with all same_succ entries.  */
+
+static htab_t same_succ_htab;
+
+/* Array that is used to store the edge flags for a successor.  */
+
+static int *same_succ_edge_flags;
+
+/* Bitmap that is used to mark bbs that are recently deleted.  */
+
+static bitmap deleted_bbs;
+
+/* Bitmap that is used to mark predecessors of bbs that are
+   deleted.  */
+
+static bitmap deleted_bb_preds;
+
+/* Prints same_succ_htab to stderr.  */
+
+extern void debug_same_succ (void);
+DEBUG_FUNCTION void
+debug_same_succ ( void)
+{
+  htab_traverse (same_succ_htab, same_succ_print_traverse, stderr);
+}
+
+DEF_VEC_P (same_succ_t);
+DEF_VEC_ALLOC_P (same_succ_t, heap);
+
+/* Vector of bbs to process.  */
+
+static VEC (same_succ_t, heap) *worklist;
+
+/* Prints worklist to FILE.  */
+
+static void
+print_worklist (FILE *file)
+{
+  unsigned int i;
+  for (i = 0; i < VEC_length (same_succ_t, worklist); ++i)
+    same_succ_print (file, VEC_index (same_succ_t, worklist, i));
+}
+
+/* Adds SAME to worklist.  */
+
+static void
+add_to_worklist (same_succ_t same)
+{
+  if (same->in_worklist)
+    return;
+
+  if (bitmap_count_bits (same->bbs) < 2)
+    return;
+
+  same->in_worklist = true;
+  VEC_safe_push (same_succ_t, heap, worklist, same);
+}
+
+/* Add BB to same_succ_htab.  */
+
+static void
+find_same_succ_bb (basic_block bb, same_succ_t *same_p)
+{
+  unsigned int j;
+  bitmap_iterator bj;
+  same_succ_t same = *same_p;
+  same_succ_t *slot;
+  edge_iterator ei;
+  edge e;
+
+  if (bb == NULL)
+    return;
+  bitmap_set_bit (same->bbs, bb->index);
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      int index = e->dest->index;
+      bitmap_set_bit (same->succs, index);
+      same_succ_edge_flags[index] = e->flags;
+    }
+  EXECUTE_IF_SET_IN_BITMAP (same->succs, 0, j, bj)
+    VEC_safe_push (int, heap, same->succ_flags, same_succ_edge_flags[j]);
+
+  same->hashval = same_succ_hash (same);
+
+  slot = (same_succ_t *) htab_find_slot_with_hash (same_succ_htab, same,
+						   same->hashval, INSERT);
+  if (*slot == NULL)
+    {
+      *slot = same;
+      BB_SAME_SUCC (bb) = same;
+      add_to_worklist (same);
+      *same_p = NULL;
+    }
+  else
+    {
+      bitmap_set_bit ((*slot)->bbs, bb->index);
+      BB_SAME_SUCC (bb) = *slot;
+      add_to_worklist (*slot);
+      if (inverse_flags (same, *slot))
+	bitmap_set_bit ((*slot)->inverse, bb->index);
+      same_succ_reset (same);
+    }
+}
+
+/* Find bbs with same successors.  */
+
+static void
+find_same_succ (void)
+{
+  same_succ_t same = same_succ_alloc ();
+  basic_block bb;
+
+  FOR_EACH_BB (bb)
+    {
+      find_same_succ_bb (bb, &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+
+  same_succ_delete (same);
+}
+
+/* Initializes worklist administration.  */
+
+static void
+init_worklist (void)
+{
+  alloc_aux_for_blocks (sizeof (struct aux_bb_info));
+  same_succ_htab
+    = htab_create (n_basic_blocks, same_succ_hash, same_succ_equal,
+		   same_succ_delete);
+  same_succ_edge_flags = XCNEWVEC (int, last_basic_block);
+  deleted_bbs = BITMAP_ALLOC (NULL);
+  deleted_bb_preds = BITMAP_ALLOC (NULL);
+  worklist = VEC_alloc (same_succ_t, heap, n_basic_blocks);
+  find_same_succ ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "initial worklist:\n");
+      print_worklist (dump_file);
+    }
+}
+
+/* Deletes worklist administration.  */
+
+static void
+delete_worklist (void)
+{
+  free_aux_for_blocks ();
+  htab_delete (same_succ_htab);
+  same_succ_htab = NULL;
+  XDELETEVEC (same_succ_edge_flags);
+  same_succ_edge_flags = NULL;
+  BITMAP_FREE (deleted_bbs);
+  BITMAP_FREE (deleted_bb_preds);
+  VEC_free (same_succ_t, heap, worklist);
+}
+
+/* Mark BB as deleted, and mark its predecessors.  */
+
+static void
+delete_basic_block_same_succ (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  bitmap_set_bit (deleted_bbs, bb->index);
+
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    bitmap_set_bit (deleted_bb_preds, e->src->index);
+}
+
+/* Removes all bbs in BBS from their corresponding same_succ.  */
+
+static void
+same_succ_flush_bbs (bitmap bbs)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+
+  EXECUTE_IF_SET_IN_BITMAP (bbs, 0, i, bi)
+    {
+      basic_block bb = BASIC_BLOCK (i);
+      same_succ_t same = BB_SAME_SUCC (bb);
+      BB_SAME_SUCC (bb) = NULL;
+      if (bitmap_single_bit_set_p (same->bbs))
+	htab_remove_elt_with_hash (same_succ_htab, same, same->hashval);
+      else
+	bitmap_clear_bit (same->bbs, i);
+    }
+}
+
+/* Delete all deleted_bbs.  */
+
+static void
+purge_bbs (void)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+
+  same_succ_flush_bbs (deleted_bbs);
+
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bbs, 0, i, bi)
+    delete_basic_block (BASIC_BLOCK (i));
+
+  bitmap_and_compl_into (deleted_bb_preds, deleted_bbs);
+  bitmap_clear (deleted_bbs);
+}
+
+/* For deleted_bb_preds, find bbs with same successors.  */
+
+static void
+update_worklist (void)
+{
+  unsigned int i;
+  bitmap_iterator bi;
+  basic_block bb;
+  same_succ_t same;
+
+  bitmap_clear_bit (deleted_bb_preds, ENTRY_BLOCK);
+  same_succ_flush_bbs (deleted_bb_preds);
+
+  same = same_succ_alloc ();
+  EXECUTE_IF_SET_IN_BITMAP (deleted_bb_preds, 0, i, bi)
+    {
+      bb = BASIC_BLOCK (i);
+      gcc_assert (bb != NULL);
+      find_same_succ_bb (bb, &same);
+      if (same == NULL)
+	same = same_succ_alloc ();
+    }
+  same_succ_delete (same);
+  bitmap_clear (deleted_bb_preds);
+}
+
+/* Prints cluster C to FILE.  */
+
+static void
+print_cluster (FILE *file, bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  bitmap_print (file, c->bbs, "bbs:", "\n");
+  bitmap_print (file, c->preds, "preds:", "\n");
+}
+
+/* Prints cluster C to stderr.  */
+
+extern void debug_cluster (bb_cluster_t);
+DEBUG_FUNCTION void
+debug_cluster (bb_cluster_t c)
+{
+  print_cluster (stderr, c);
+}
+
+/* Update C->rep_bb, given that BB is added to the cluster.  */
+
+static void
+update_rep_bb (bb_cluster_t c, basic_block bb)
+{
+  /* Initial.  */
+  if (c->rep_bb == NULL)
+    {
+      c->rep_bb = bb;
+      return;
+    }
+
+  /* Current needs no deps, keep it.  */
+  if (BB_DEP_BB (c->rep_bb) == NULL)
+    return;
+
+  /* Bb needs no deps, change rep_bb.  */
+  if (BB_DEP_BB (bb) == NULL)
+    {
+      c->rep_bb = bb;
+      return;
+    }
+
+  /* Bb needs last deps earlier than current, change rep_bb.  A potential
+     problem with this, is that the first deps might also be earlier, which
+     would mean we prefer longer lifetimes for the deps.  To be able to check
+     for this, we would have to trace BB_FIRST_DEP_BB as well, besides
+     BB_DEP_BB, which is really BB_LAST_DEP_BB.
+     The benefit of choosing the bb with last deps earlier, is that it can
+     potentially be used as replacement for more bbs.  */
+  if (dominated_by_p (CDI_DOMINATORS, BB_DEP_BB (c->rep_bb), BB_DEP_BB (bb)))
+    c->rep_bb = bb;
+}
+
+/* Add BB to cluster C.  Sets BB in C->bbs, and preds of BB in C->preds.  */
+
+static void
+add_bb_to_cluster (bb_cluster_t c, basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  bitmap_set_bit (c->bbs, bb->index);
+
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    bitmap_set_bit (c->preds, e->src->index);
+
+  update_rep_bb (c, bb);
+}
+
+/* Allocate and init new cluster.  */
+
+static bb_cluster_t
+new_cluster (void)
+{
+  bb_cluster_t c;
+  c = XCNEW (struct bb_cluster);
+  c->bbs = BITMAP_ALLOC (NULL);
+  c->preds = BITMAP_ALLOC (NULL);
+  c->rep_bb = NULL;
+  return c;
+}
+
+/* Delete clusters.  */
+
+static void
+delete_cluster (bb_cluster_t c)
+{
+  if (c == NULL)
+    return;
+  BITMAP_FREE (c->bbs);
+  BITMAP_FREE (c->preds);
+  XDELETE (c);
+}
+
+DEF_VEC_P (bb_cluster_t);
+DEF_VEC_ALLOC_P (bb_cluster_t, heap);
+
+/* Array that contains all clusters.  */
+
+static VEC (bb_cluster_t, heap) *all_clusters;
+
+/* Allocate all cluster vectors.  */
+
+static void
+alloc_cluster_vectors (void)
+{
+  all_clusters = VEC_alloc (bb_cluster_t, heap, n_basic_blocks);
+}
+
+/* Reset all cluster vectors.  */
+
+static void
+reset_cluster_vectors (void)
+{
+  unsigned int i;
+  basic_block bb;
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_truncate (bb_cluster_t, all_clusters, 0);
+  FOR_EACH_BB (bb)
+    BB_CLUSTER (bb) = NULL;
+}
+
+/* Delete all cluster vectors.  */
+
+static void
+delete_cluster_vectors (void)
+{
+  unsigned int i;
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    delete_cluster (VEC_index (bb_cluster_t, all_clusters, i));
+  VEC_free (bb_cluster_t, heap, all_clusters);
+}
+
+/* Merge cluster C2 into C1.  */
+
+static void
+merge_clusters (bb_cluster_t c1, bb_cluster_t c2)
+{
+  bitmap_ior_into (c1->bbs, c2->bbs);
+  bitmap_ior_into (c1->preds, c2->preds);
+}
+
+/* Register equivalence of BB1 and BB2 (members of cluster C).  Store c in
+   all_clusters, or merge c with existing cluster.  */
+
+static void
+set_cluster (basic_block bb1, basic_block bb2)
+{
+  basic_block merge_bb, other_bb;
+  bb_cluster_t merge, old, c;
+
+  if (BB_CLUSTER (bb1) == NULL && BB_CLUSTER (bb2) == NULL)
+    {
+      c = new_cluster ();
+      add_bb_to_cluster (c, bb1);
+      add_bb_to_cluster (c, bb2);
+      BB_CLUSTER (bb1) = c;
+      BB_CLUSTER (bb2) = c;
+      c->index = VEC_length (bb_cluster_t, all_clusters);
+      VEC_safe_push (bb_cluster_t, heap, all_clusters, c);
+    }
+  else if (BB_CLUSTER (bb1) == NULL || BB_CLUSTER (bb2) == NULL)
+    {
+      merge_bb = BB_CLUSTER (bb1) == NULL ? bb2 : bb1;
+      other_bb = BB_CLUSTER (bb1) == NULL ? bb1 : bb2;
+      merge = BB_CLUSTER (merge_bb);
+      add_bb_to_cluster (merge, other_bb);
+      BB_CLUSTER (other_bb) = merge;
+    }
+  else if (BB_CLUSTER (bb1) != BB_CLUSTER (bb2))
+    {
+      unsigned int i;
+      bitmap_iterator bi;
+
+      old = BB_CLUSTER (bb2);
+      merge = BB_CLUSTER (bb1);
+      merge_clusters (merge, old);
+      EXECUTE_IF_SET_IN_BITMAP (old->bbs, 0, i, bi)
+	BB_CLUSTER (BASIC_BLOCK (i)) = merge;
+      VEC_replace (bb_cluster_t, all_clusters, old->index, NULL);
+      update_rep_bb (merge, old->rep_bb);
+      delete_cluster (old);
+    }
+  else
+    gcc_unreachable ();
+}
+
+/* Return true if gimple statements S1 and S2 are equal.  Gimple_bb (s1) and
+   gimple_bb (s2) are members of SAME_SUCC.  */
+
+static bool
+gimple_equal_p (same_succ_t same_succ, gimple s1, gimple s2)
+{
+  unsigned int i;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+  tree t1, t2;
+  bool equal, inv_cond;
+  enum tree_code code1, code2;
+
+  if (gimple_code (s1) != gimple_code (s2))
+    return false;
+
+  switch (gimple_code (s1))
+    {
+    case GIMPLE_CALL:
+      if (gimple_call_num_args (s1) != gimple_call_num_args (s2))
+	return false;
+      if (!gimple_call_same_target_p (s1, s2))
+        return false;
+
+      equal = true;
+      for (i = 0; i < gimple_call_num_args (s1); ++i)
+	{
+	  t1 = gimple_call_arg (s1, i);
+	  t2 = gimple_call_arg (s2, i);
+	  if (operand_equal_p (t1, t2, 0))
+	    continue;
+	  if (gvn_uses_equal (t1, t2))
+	    continue;
+	  equal = false;
+	  break;
+	}
+      if (equal)
+	return true;
+
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (lhs1 != NULL_TREE && lhs2 != NULL_TREE
+	      && TREE_CODE (lhs1) == SSA_NAME && TREE_CODE (lhs2) == SSA_NAME
+	      && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_ASSIGN:
+      lhs1 = gimple_get_lhs (s1);
+      lhs2 = gimple_get_lhs (s2);
+      return (TREE_CODE (lhs1) == SSA_NAME
+	      && TREE_CODE (lhs2) == SSA_NAME
+	      && vn_valueize (lhs1) == vn_valueize (lhs2));
+
+    case GIMPLE_COND:
+      t1 = gimple_cond_lhs (s1);
+      t2 = gimple_cond_lhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+	  && !gvn_uses_equal (t1, t2))
+	return false;
+
+      t1 = gimple_cond_rhs (s1);
+      t2 = gimple_cond_rhs (s2);
+      if (!operand_equal_p (t1, t2, 0)
+	  && !gvn_uses_equal (t1, t2))
+	return false;
+
+      code1 = gimple_expr_code (s1);
+      code2 = gimple_expr_code (s2);
+      inv_cond = (bitmap_bit_p (same_succ->inverse, bb1->index)
+		  != bitmap_bit_p (same_succ->inverse, bb2->index));
+      if (inv_cond)
+	{
+	  bool honor_nans
+	    = HONOR_NANS (TYPE_MODE (TREE_TYPE (gimple_cond_lhs (s1))));
+	  code2 = invert_tree_comparison (code2, honor_nans);
+	}
+      return code1 == code2;
+
+    default:
+      return false;
+    }
+}
+
+/* Let GSI skip backwards over local defs.  */
+
+static void
+gsi_advance_bw_nondebug_nonlocal (gimple_stmt_iterator *gsi)
+{
+  gimple stmt;
+
+  while (true)
+    {
+      if (gsi_end_p (*gsi))
+	return;
+      stmt = gsi_stmt (*gsi);
+      if (!(is_gimple_assign (stmt) && local_def (gimple_get_lhs (stmt))
+	    && !gimple_has_side_effects (stmt)))
+	return;
+      gsi_prev_nondebug (gsi);
+    }
+}
+
+/* Determines whether BB1 and BB2 (members of same_succ) are duplicates.  If so,
+   clusters them.  */
+
+static void
+find_duplicate (same_succ_t same_succ, basic_block bb1, basic_block bb2)
+{
+  gimple_stmt_iterator gsi1 = gsi_last_nondebug_bb (bb1);
+  gimple_stmt_iterator gsi2 = gsi_last_nondebug_bb (bb2);
+
+  gsi_advance_bw_nondebug_nonlocal (&gsi1);
+  gsi_advance_bw_nondebug_nonlocal (&gsi2);
+
+  while (!gsi_end_p (gsi1) && !gsi_end_p (gsi2))
+    {
+      if (!gimple_equal_p (same_succ, gsi_stmt (gsi1), gsi_stmt (gsi2)))
+	return;
+
+      gsi_prev_nondebug (&gsi1);
+      gsi_prev_nondebug (&gsi2);
+      gsi_advance_bw_nondebug_nonlocal (&gsi1);
+      gsi_advance_bw_nondebug_nonlocal (&gsi2);
+    }
+
+  if (!(gsi_end_p (gsi1) && gsi_end_p (gsi2)))
+    return;
+
+  if (dump_file)
+    fprintf (dump_file, "find_duplicates: <bb %d> duplicate of <bb %d>\n",
+	     bb1->index, bb2->index);
+
+  set_cluster (bb1, bb2);
+}
+
+/* Returns whether for all phis in DEST the phi alternatives for E1 and
+   E2 are equal.  */
+
+static bool
+same_phi_alternatives_1 (basic_block dest, edge e1, edge e2)
+{
+  int n1 = e1->dest_idx, n2 = e2->dest_idx;
+  gimple_stmt_iterator gsi;
+
+  for (gsi = gsi_start_phis (dest); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      gimple phi = gsi_stmt (gsi);
+      tree lhs = gimple_phi_result (phi);
+      tree val1 = gimple_phi_arg_def (phi, n1);
+      tree val2 = gimple_phi_arg_def (phi, n2);
+
+      if (!is_gimple_reg (lhs))
+	continue;
+
+      if (operand_equal_for_phi_arg_p (val1, val2))
+        continue;
+      if (gvn_uses_equal (val1, val2))
+	continue;
+
+      return false;
+    }
+
+  return true;
+}
+
+/* Returns whether for all successors of BB1 and BB2 (members of SAME_SUCC), the
+   phi alternatives for BB1 and BB2 are equal.  */
+
+static bool
+same_phi_alternatives (same_succ_t same_succ, basic_block bb1, basic_block bb2)
+{
+  unsigned int s;
+  bitmap_iterator bs;
+  edge e1, e2;
+  basic_block succ;
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->succs, 0, s, bs)
+    {
+      succ = BASIC_BLOCK (s);
+      e1 = find_edge (bb1, succ);
+      e2 = find_edge (bb2, succ);
+      if (e1->flags & EDGE_COMPLEX
+	  || e2->flags & EDGE_COMPLEX)
+	return false;
+
+      /* For all phis in bb, the phi alternatives for e1 and e2 need to have
+	 the same value.  */
+      if (!same_phi_alternatives_1 (succ, e1, e2))
+	return false;
+    }
+
+  return true;
+}
+
+/* Return true if BB has non-vop phis.  */
+
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return is_gimple_reg (gimple_phi_result (phi));
+}
+
+/* Returns true if redirecting the incoming edges of FROM to TO maintains the
+   invariant that uses in FROM are dominates by their defs.  */
+
+static bool
+deps_ok_for_redirect_from_bb_to_bb (basic_block from, basic_block to)
+{
+  basic_block cd, dep_bb = BB_DEP_BB (to);
+  edge_iterator ei;
+  edge e;
+  bitmap from_preds = BITMAP_ALLOC (NULL);
+
+  if (dep_bb == NULL)
+    return true;
+
+  FOR_EACH_EDGE (e, ei, from->preds)
+    bitmap_set_bit (from_preds, e->src->index);
+  cd = nearest_common_dominator_for_set (CDI_DOMINATORS, from_preds);
+  BITMAP_FREE (from_preds);
+
+  return dominated_by_p (CDI_DOMINATORS, dep_bb, cd);
+}
+
+/* Returns true if replacing BB1 (or its replacement bb) by BB2 (or its
+   replacement bb) and vice versa maintains the invariant that uses in the
+   replacement are dominates by their defs.  */
+
+static bool
+deps_ok_for_redirect (basic_block bb1, basic_block bb2)
+{
+  if (BB_CLUSTER (bb1) != NULL)
+    bb1 = BB_CLUSTER (bb1)->rep_bb;
+
+  if (BB_CLUSTER (bb2) != NULL)
+    bb2 = BB_CLUSTER (bb2)->rep_bb;
+
+  return (deps_ok_for_redirect_from_bb_to_bb (bb1, bb2)
+	  && deps_ok_for_redirect_from_bb_to_bb (bb2, bb1));
+}
+
+/* Within SAME_SUCC->bbs, find clusters of bbs which can be merged.  */
+
+static void
+find_clusters_1 (same_succ_t same_succ)
+{
+  basic_block bb1, bb2;
+  unsigned int i, j;
+  bitmap_iterator bi, bj;
+  int nr_comparisons;
+  int max_comparisons = PARAM_VALUE (PARAM_MAX_TAIL_MERGE_COMPARISONS);
+
+  EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, 0, i, bi)
+    {
+      bb1 = BASIC_BLOCK (i);
+
+      /* TODO: handle blocks with phi-nodes.  We'll have to find corresponding
+	 phi-nodes in bb1 and bb2, with the same alternatives for the same
+	 preds.  */
+      if (bb_has_non_vop_phi (bb1))
+	continue;
+
+      nr_comparisons = 0;
+      EXECUTE_IF_SET_IN_BITMAP (same_succ->bbs, i + 1, j, bj)
+	{
+	  bb2 = BASIC_BLOCK (j);
+
+	  if (bb_has_non_vop_phi (bb2))
+	    continue;
+
+	  if (BB_CLUSTER (bb1) != NULL && BB_CLUSTER (bb1) == BB_CLUSTER (bb2))
+	    continue;
+
+	  /* Limit quadratic behaviour.  */
+	  nr_comparisons++;
+	  if (nr_comparisons > max_comparisons)
+	    break;
+
+	  /* This is a conservative dependency check.  We could test more
+	     precise for allowed replacement direction.  */
+	  if (!deps_ok_for_redirect (bb1, bb2))
+	    continue;
+
+	  if (!(same_phi_alternatives (same_succ, bb1, bb2)))
+	    continue;
+
+	  find_duplicate (same_succ, bb1, bb2);
+        }
+    }
+}
+
+/* Find clusters of bbs which can be merged.  */
+
+static void
+find_clusters (void)
+{
+  same_succ_t same;
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      same = VEC_pop (same_succ_t, worklist);
+      same->in_worklist = false;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "processing worklist entry\n");
+	  same_succ_print (dump_file, same);
+	}
+      find_clusters_1 (same);
+    }
+}
+
+/* Create or update a vop phi in BB2.  Use VUSE1 arguments for all the
+   REDIRECTED_EDGES, or if VUSE1 is NULL_TREE, use BB_VOP_AT_EXIT.  If a new
+   phis is created, use the phi instead of VUSE2 in BB2.  */
+
+static void
+update_vuses (tree vuse1, tree vuse2, basic_block bb2,
+              VEC (edge,heap) *redirected_edges)
+{
+  gimple stmt, phi = NULL;
+  tree lhs = NULL_TREE, arg;
+  unsigned int i;
+  gimple def_stmt2;
+  imm_use_iterator iter;
+  use_operand_p use_p;
+  edge_iterator ei;
+  edge e;
+
+  def_stmt2 = SSA_NAME_DEF_STMT (vuse2);
+
+  if (gimple_bb (def_stmt2) == bb2)
+    /* Update existing phi.  */
+    phi = def_stmt2;
+  else
+    {
+      /* No need to create a phi with 2 equal arguments.  */
+      if (vuse1 == vuse2)
+	return;
+
+      /* Create a phi.  */
+      lhs = make_ssa_name (SSA_NAME_VAR (vuse2), NULL);
+      VN_INFO_GET (lhs);
+      phi = create_phi_node (lhs, bb2);
+      SSA_NAME_DEF_STMT (lhs) = phi;
+
+      /* Set default argument vuse2 for all preds.  */
+      FOR_EACH_EDGE (e, ei, bb2->preds)
+	add_phi_arg (phi, vuse2, e, UNKNOWN_LOCATION);
+    }
+
+  /* Update phi.  */
+  for (i = 0; i < EDGE_COUNT (redirected_edges); ++i)
+    {
+      e = VEC_index (edge, redirected_edges, i);
+      if (vuse1 != NULL_TREE)
+	arg = vuse1;
+      else
+	arg = BB_VOP_AT_EXIT (e->src);
+      add_phi_arg (phi, arg, e, UNKNOWN_LOCATION);
+    }
+
+  /* Return if we updated an existing phi.  */
+  if (gimple_bb (def_stmt2) == bb2)
+    return;
+
+  /* Replace relevant uses of vuse2 with the newly created phi.  */
+  FOR_EACH_IMM_USE_STMT (stmt, iter, vuse2)
+    {
+      if (stmt == phi)
+	continue;
+      if (gimple_code (stmt) != GIMPLE_PHI)
+	if (gimple_bb (stmt) != bb2)
+	  continue;
+
+      FOR_EACH_IMM_USE_ON_STMT (use_p, iter)
+	{
+	  if (gimple_code (stmt) == GIMPLE_PHI)
+	    {
+	      unsigned int pred_index = PHI_ARG_INDEX_FROM_USE (use_p);
+	      basic_block pred = EDGE_PRED (gimple_bb (stmt), pred_index)->src;
+	      if (pred !=  bb2)
+		continue;
+	    }
+	  SET_USE (use_p, lhs);
+	  update_stmt (stmt);
+	}
+    }
+}
+
+/* Returns the vop phi of BB, if any.  */
+
+static gimple
+vop_phi (basic_block bb)
+{
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      if (is_gimple_reg (gimple_phi_result (stmt)))
+	continue;
+      return stmt;
+    }
+  return NULL;
+}
+
+/* Returns the vop state at the entry of BB, if found in BB or a successor
+   bb.  */
+
+static tree
+vop_at_entry (basic_block bb)
+{
+  gimple bb_phi, succ_phi;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  tree vuse, vdef;
+  basic_block succ;
+
+  bb_phi = vop_phi (bb);
+  if (bb_phi != NULL)
+    return gimple_phi_result (bb_phi);
+
+  for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      vuse = gimple_vuse (stmt);
+      vdef = gimple_vdef (stmt);
+      if (vuse != NULL_TREE)
+	return vuse;
+      if (vdef != NULL_TREE)
+	return NULL_TREE;
+    }
+
+  if (EDGE_COUNT (bb->succs) == 0)
+    return NULL_TREE;
+
+  succ = EDGE_SUCC (bb, 0)->dest;
+  succ_phi = vop_phi (succ);
+  return (succ_phi != NULL
+	  ? PHI_ARG_DEF_FROM_EDGE (succ_phi, find_edge (bb, succ))
+	  : NULL_TREE);
+}
+
+/* Redirect all edges from BB1 to BB2, marks BB1 for removal, and if
+   UPDATE_VOPS, inserts vop phis.  */
+
+static void
+replace_block_by (basic_block bb1, basic_block bb2, bool update_vops)
+{
+  edge pred_edge;
+  unsigned int i;
+  tree phi_vuse1 = NULL_TREE, phi_vuse2 = NULL_TREE, arg;
+  VEC (edge,heap) *redirected_edges = NULL;
+  edge e;
+  edge_iterator ei;
+
+  if (update_vops)
+    {
+      /* Find the vops at entry of bb1 and bb2.  */
+      phi_vuse1 = vop_at_entry (bb1);
+      phi_vuse2 = vop_at_entry (bb2);
+
+      /* If one of the 2 not found, it means there's no need to update.  */
+      update_vops = phi_vuse1 != NULL_TREE && phi_vuse2 != NULL_TREE;
+    }
+
+  if (update_vops && gimple_bb (SSA_NAME_DEF_STMT (phi_vuse1)) == bb1)
+    {
+      /* If the vop at entry of bb1 is a phi, save the phi alternatives in
+	 BB_VOP_AT_EXIT, before we lose that information by redirecting the
+	 edges.  */
+      FOR_EACH_EDGE (e, ei, bb1->preds)
+	{
+	  arg = PHI_ARG_DEF_FROM_EDGE (SSA_NAME_DEF_STMT (phi_vuse1), e);
+	  BB_VOP_AT_EXIT (e->src) = arg;
+	}
+      phi_vuse1 = NULL;
+    }
+
+  /* Mark the basic block for later deletion.  */
+  delete_basic_block_same_succ (bb1);
+
+  if (update_vops)
+    redirected_edges = VEC_alloc (edge, heap, 10);
+
+  /* Redirect the incoming edges of bb1 to bb2.  */
+  for (i = EDGE_COUNT (bb1->preds); i > 0 ; --i)
+    {
+      pred_edge = EDGE_PRED (bb1, i - 1);
+      pred_edge = redirect_edge_and_branch (pred_edge, bb2);
+      gcc_assert (pred_edge != NULL);
+      if (update_vops)
+	VEC_safe_push (edge, heap, redirected_edges, pred_edge);
+    }
+
+  /* Update the vops.  */
+  if (update_vops)
+    {
+      update_vuses (phi_vuse1, phi_vuse2, bb2, redirected_edges);
+      VEC_free (edge, heap, redirected_edges);
+    }
+}
+
+/* Bbs for which update_debug_stmt need to be called.  */
+
+static bitmap update_bbs;
+
+/* For each cluster in all_clusters, merge all cluster->bbs.  Returns
+   number of bbs removed.  Insert vop phis if UPDATE_VOPS.  */
+
+static int
+apply_clusters (bool update_vops)
+{
+  basic_block bb1, bb2;
+  bb_cluster_t c;
+  unsigned int i, j;
+  bitmap_iterator bj;
+  int nr_bbs_removed = 0;
+
+  for (i = 0; i < VEC_length (bb_cluster_t, all_clusters); ++i)
+    {
+      c = VEC_index (bb_cluster_t, all_clusters, i);
+      if (c == NULL)
+	continue;
+
+      bb2 = c->rep_bb;
+      bitmap_set_bit (update_bbs, bb2->index);
+
+      bitmap_clear_bit (c->bbs, bb2->index);
+      EXECUTE_IF_SET_IN_BITMAP (c->bbs, 0, j, bj)
+	{
+	  bb1 = BASIC_BLOCK (j);
+	  bitmap_clear_bit (update_bbs, bb1->index);
+
+	  replace_block_by (bb1, bb2, update_vops);
+	  nr_bbs_removed++;
+	}
+    }
+
+  return nr_bbs_removed;
+}
+
+/* Resets debug statement STMT if it has uses that are not dominated by their
+   defs.  */
+
+static void
+update_debug_stmt (gimple stmt)
+{
+  use_operand_p use_p;
+  ssa_op_iter oi;
+  basic_block bbdef, bbuse;
+  gimple def_stmt;
+  tree name;
+
+  if (!gimple_debug_bind_p (stmt))
+    return;
+
+  bbuse = gimple_bb (stmt);
+  FOR_EACH_PHI_OR_STMT_USE (use_p, stmt, oi, SSA_OP_USE)
+    {
+      name = USE_FROM_PTR (use_p);
+      gcc_assert (TREE_CODE (name) == SSA_NAME);
+
+      def_stmt = SSA_NAME_DEF_STMT (name);
+      gcc_assert (def_stmt != NULL);
+
+      bbdef = gimple_bb (def_stmt);
+      if (bbdef == NULL || bbuse == bbdef
+	  || dominated_by_p (CDI_DOMINATORS, bbuse, bbdef))
+	continue;
+
+      gimple_debug_bind_reset_value (stmt);
+      update_stmt (stmt);
+    }
+}
+
+/* Resets all debug statements that have uses that are not
+   dominated by their defs.  */
+
+static void
+update_debug_stmts (void)
+{
+  basic_block bb;
+  bitmap_iterator bi;
+  unsigned int i;
+
+  if (!MAY_HAVE_DEBUG_STMTS)
+    return;
+
+  EXECUTE_IF_SET_IN_BITMAP (update_bbs, 0, i, bi)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  stmt = gsi_stmt (gsi);
+	  if (!is_gimple_debug (stmt))
+	    continue;
+	  update_debug_stmt (stmt);
+	}
+    }
+}
+
+/* Runs tail merge optimization.  */
+
+unsigned int
+tail_merge_optimize (unsigned int todo)
+{
+  int nr_bbs_removed_total = 0;
+  int nr_bbs_removed;
+  bool loop_entered = false;
+  int iteration_nr = 0;
+  bool update_vops = !symbol_marked_for_renaming (gimple_vop (cfun));
+
+  if (!flag_tree_tail_merge)
+    return 0;
+
+  timevar_push (TV_TREE_TAIL_MERGE);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  init_worklist ();
+
+  while (!VEC_empty (same_succ_t, worklist))
+    {
+      if (!loop_entered)
+	{
+	  loop_entered = true;
+	  alloc_cluster_vectors ();
+	  update_bbs = BITMAP_ALLOC (NULL);
+	}
+      else
+	reset_cluster_vectors ();
+
+      iteration_nr++;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "worklist iteration #%d\n", iteration_nr);
+
+      find_clusters ();
+      gcc_assert (VEC_empty (same_succ_t, worklist));
+      if (VEC_empty (bb_cluster_t, all_clusters))
+	break;
+
+      nr_bbs_removed = apply_clusters (update_vops);
+      nr_bbs_removed_total += nr_bbs_removed;
+      if (nr_bbs_removed == 0)
+	break;
+
+      free_dominance_info (CDI_DOMINATORS);
+      purge_bbs ();
+
+      calculate_dominance_info (CDI_DOMINATORS);
+      update_worklist ();
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "htab collision / search: %f\n",
+	     htab_collisions (same_succ_htab));
+
+  if (nr_bbs_removed_total > 0)
+    {
+      update_debug_stmts ();
+
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "Before TODOs.\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      todo |= (TODO_verify_ssa | TODO_verify_stmts | TODO_verify_flow
+	       | TODO_dump_func);
+    }
+
+  delete_worklist ();
+  if (loop_entered)
+    {
+      delete_cluster_vectors ();
+      BITMAP_FREE (update_bbs);
+    }
+
+  timevar_pop (TV_TREE_TAIL_MERGE);
+
+  return todo;
+}
Index: gcc/tree-pass.h
===================================================================
--- gcc/tree-pass.h	(revision 176554)
+++ gcc/tree-pass.h	(working copy)
@@ -401,6 +401,7 @@ extern struct gimple_opt_pass pass_call_
 extern struct gimple_opt_pass pass_merge_phi;
 extern struct gimple_opt_pass pass_split_crit_edges;
 extern struct gimple_opt_pass pass_pre;
+extern unsigned int tail_merge_optimize (unsigned int);
 extern struct gimple_opt_pass pass_profile;
 extern struct gimple_opt_pass pass_strip_predict_hints;
 extern struct gimple_opt_pass pass_lower_complex_O0;
Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 176554)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -2896,19 +2896,6 @@ simplify_unary_expression (gimple stmt)
   return NULL_TREE;
 }
 
-/* Valueize NAME if it is an SSA name, otherwise just return it.  */
-
-static inline tree
-vn_valueize (tree name)
-{
-  if (TREE_CODE (name) == SSA_NAME)
-    {
-      tree tem = SSA_VAL (name);
-      return tem == VN_TOP ? name : tem;
-    }
-  return name;
-}
-
 /* Try to simplify RHS using equivalences and constant folding.  */
 
 static tree
Index: gcc/tree-ssa-sccvn.h
===================================================================
--- gcc/tree-ssa-sccvn.h	(revision 176554)
+++ gcc/tree-ssa-sccvn.h	(working copy)
@@ -209,4 +209,18 @@ unsigned int get_constant_value_id (tree
 unsigned int get_or_alloc_constant_value_id (tree);
 bool value_id_constant_p (unsigned int);
 tree fully_constant_vn_reference_p (vn_reference_t);
+
+/* Valueize NAME if it is an SSA name, otherwise just return it.  */
+
+static inline tree
+vn_valueize (tree name)
+{
+  if (TREE_CODE (name) == SSA_NAME)
+    {
+      tree tem = VN_INFO (name)->valnum;
+      return tem == VN_TOP ? name : tem;
+    }
+  return name;
+}
+
 #endif /* TREE_SSA_SCCVN_H  */
Index: gcc/opts.c
===================================================================
--- gcc/opts.c	(revision 176554)
+++ gcc/opts.c	(working copy)
@@ -484,6 +484,7 @@ static const struct default_options defa
     { OPT_LEVELS_2_PLUS, OPT_falign_jumps, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_labels, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_falign_functions, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_ftree_tail_merge, NULL, 1 },
 
     /* -O3 optimizations.  */
     { OPT_LEVELS_3_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
Index: gcc/timevar.def
===================================================================
--- gcc/timevar.def	(revision 176554)
+++ gcc/timevar.def	(working copy)
@@ -127,6 +127,7 @@ DEFTIMEVAR (TV_TREE_GIMPLIFY	     , "tre
 DEFTIMEVAR (TV_TREE_EH		     , "tree eh")
 DEFTIMEVAR (TV_TREE_CFG		     , "tree CFG construction")
 DEFTIMEVAR (TV_TREE_CLEANUP_CFG	     , "tree CFG cleanup")
+DEFTIMEVAR (TV_TREE_TAIL_MERGE       , "tree tail merge")
 DEFTIMEVAR (TV_TREE_VRP              , "tree VRP")
 DEFTIMEVAR (TV_TREE_COPY_PROP        , "tree copy propagation")
 DEFTIMEVAR (TV_FIND_REFERENCED_VARS  , "tree find ref. vars")
Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c	(revision 176554)
+++ gcc/tree-ssa-pre.c	(working copy)
@@ -4926,7 +4926,6 @@ execute_pre (bool do_fre)
   statistics_counter_event (cfun, "Constified", pre_stats.constified);
 
   clear_expression_ids ();
-  free_scc_vn ();
   if (!do_fre)
     {
       remove_dead_inserted_code ();
@@ -4936,6 +4935,17 @@ execute_pre (bool do_fre)
   scev_finalize ();
   fini_pre (do_fre);
 
+  if (!do_fre)
+    /* TODO: tail_merge_optimize may merge all predecessors of a block, in which
+       case we can merge the block with the remaining predecessor of the block.
+       It should either:
+       - call merge_blocks after each tail merge iteration
+       - call merge_blocks after all tail merge iterations
+       - mark TODO_cleanup_cfg when necessary
+       - share the cfg cleanup with fini_pre.  */
+    todo |= tail_merge_optimize (todo);
+  free_scc_vn ();
+
   return todo;
 }
 
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 176554)
+++ gcc/common.opt	(working copy)
@@ -1937,6 +1937,10 @@ ftree-dominator-opts
 Common Report Var(flag_tree_dom) Optimization
 Enable dominator optimizations
 
+ftree-tail-merge
+Common Report Var(flag_tree_tail_merge) Optimization
+Enable tail merging on trees
+
 ftree-dse
 Common Report Var(flag_tree_dse) Optimization
 Enable dead store elimination
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 176554)
+++ gcc/Makefile.in	(working copy)
@@ -1465,6 +1465,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-sink.o \
 	tree-ssa-structalias.o \
+	tree-ssa-tail-merge.o \
 	tree-ssa-ter.o \
 	tree-ssa-threadedge.o \
 	tree-ssa-threadupdate.o \
@@ -2373,6 +2374,13 @@ stor-layout.o : stor-layout.c $(CONFIG_H
    $(TREE_H) $(PARAMS_H) $(FLAGS_H) $(FUNCTION_H) $(EXPR_H) output.h $(RTL_H) \
    $(GGC_H) $(TM_P_H) $(TARGET_H) langhooks.h $(REGS_H) gt-stor-layout.h \
    $(DIAGNOSTIC_CORE_H) $(CGRAPH_H) $(TREE_INLINE_H) $(TREE_DUMP_H) $(GIMPLE_H)
+tree-ssa-tail-merge.o: tree-ssa-tail-merge.c \
+   $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(BITMAP_H) \
+   $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
+   $(TREE_H) $(TREE_FLOW_H) $(TREE_INLINE_H) \
+   $(GIMPLE_H) $(FUNCTION_H) \
+   $(TIMEVAR_H) tree-ssa-sccvn.h \
+   $(CGRAPH_H) gimple-pretty-print.h tree-pretty-print.h $(PARAMS_H)
 tree-ssa-structalias.o: tree-ssa-structalias.c \
    $(SYSTEM_H) $(CONFIG_H) coretypes.h $(TM_H) $(GGC_H) $(OBSTACK_H) $(BITMAP_H) \
    $(FLAGS_H) $(TM_P_H) $(BASIC_BLOCK_H) output.h \
Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 176554)
+++ gcc/params.def	(working copy)
@@ -908,6 +908,11 @@ DEFPARAM (PARAM_CASE_VALUES_THRESHOLD,
 	  "if 0, use the default for the machine",
           0, 0, 0)
 
+DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
+          "max-tail-merge-comparisons",
+          "Maximum amount of similar bbs to compare a bb with",
+          10, 0, 0)
+
 
 /*
 Local variables:
Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 176554)
+++ gcc/doc/invoke.texi	(working copy)
@@ -404,7 +404,7 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-phiprop -ftree-loop-distribution -ftree-loop-distribute-patterns @gol
 -ftree-loop-ivcanon -ftree-loop-linear -ftree-loop-optimize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-pta -ftree-reassoc @gol
--ftree-sink -ftree-sra -ftree-switch-conversion @gol
+-ftree-sink -ftree-sra -ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-ter -ftree-vect-loop-version -ftree-vectorize -ftree-vrp @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
@@ -6096,7 +6096,7 @@ also turns on the following optimization
 -fsched-interblock  -fsched-spec @gol
 -fschedule-insns  -fschedule-insns2 @gol
 -fstrict-aliasing -fstrict-overflow @gol
--ftree-switch-conversion @gol
+-ftree-switch-conversion -ftree-tail-merge @gol
 -ftree-pre @gol
 -ftree-vrp}
 
@@ -6979,6 +6979,11 @@ Perform conversion of simple initializat
 initializations from a scalar array.  This flag is enabled by default
 at @option{-O2} and higher.
 
+@item -ftree-tail-merge
+Merges identical blocks with same successors.  This flag is enabled by default
+at @option{-O2} and higher.  The run time of this pass can be limited using
+@option{max-tail-merge-comparisons} parameter.
+
 @item -ftree-dce
 @opindex ftree-dce
 Perform dead code elimination (DCE) on trees.  This flag is enabled by
@@ -8546,6 +8551,10 @@ This is used to avoid quadratic behavior
 The value of 0 will avoid limiting the search, but may slow down compilation
 of huge functions.  The default value is 30.
 
+@item max-tail-merge-comparisons
+The maximum amount of similar bbs to compare a bb with.  This is used to
+avoid quadratic behaviour in tree tail merging.  The default value is 10.
+
 @item max-unrolled-insns
 The maximum number of instructions that a loop should have if that loop
 is unrolled, and if the loop is unrolled, it determines how many times

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.
  2011-07-18  2:54   ` Tom de Vries
@ 2011-08-19 18:38     ` Tom de Vries
  2011-08-25 10:09       ` Richard Guenther
  0 siblings, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-08-19 18:38 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Steven Bosscher, gcc-patches

[-- Attachment #1: Type: text/plain, Size: 594 bytes --]

On 07/17/2011 08:33 PM, Tom de Vries wrote:
> Updated version.
> 
> On 06/08/2011 11:45 AM, Tom de Vries wrote:
>> On 06/08/2011 11:42 AM, Tom de Vries wrote:
>>
>>> I'll send the patch with the testcases in a separate email.
>>
> 

2 extra testcases added.

OK for trunk?

Thanks,
- Tom

2011-08-19  Tom de Vries  <tom@codesourcery.com>

	PR middle-end/43864
	* gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
	* gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
	* gcc.dg/pr43864.c: New test.
	* gcc.dg/pr43864-2.c: Same.
	* gcc.dg/pr43864-3.c: Same.
	* gcc.dg/pr43864-4.c: Same.

[-- Attachment #2: pr43864.37.test.patch --]
[-- Type: text/x-patch, Size: 3792 bytes --]

Index: gcc/testsuite/gcc.dg/pr43864-4.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864-4.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-4.c	(revision 0)
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+/* Different stmt order.  */
+
+int f(int c, int b, int d)
+{
+  int r, r2, e;
+
+  if (c)
+    {
+      r = b + d;
+      r2 = d - b;
+    }
+  else
+    {
+      r2 = d - b;
+      e = d + b;
+      r = e;
+    }
+
+  return r - r2;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { scan-tree-dump-times " - " 2 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/fold-compare-2.c
===================================================================
--- gcc/testsuite/gcc.dg/fold-compare-2.c	(revision 176554)
+++ gcc/testsuite/gcc.dg/fold-compare-2.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fdump-tree-vrp" } */
+/* { dg-options "-O2 -fno-tree-tail-merge -fdump-tree-vrp" } */
 
 extern void abort (void);
 
Index: gcc/testsuite/gcc.dg/uninit-pred-2_c.c
===================================================================
--- gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(revision 176554)
+++ gcc/testsuite/gcc.dg/uninit-pred-2_c.c	(working copy)
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-Wuninitialized -O2" } */
+/* { dg-options "-Wuninitialized -O2 -fno-tree-tail-merge" } */
 
 int g;
 void bar (void);
Index: gcc/testsuite/gcc.dg/pr43864.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864.c	(revision 0)
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+extern void foo (char*, int);
+extern void mysprintf (char *, char *);
+extern void myfree (void *);
+extern int access (char *, int);
+extern int fopen (char *, int);
+
+char *
+hprofStartupp (char *outputFileName, char *ctx)
+{
+  char fileName[1000];
+  int fp;
+  mysprintf (fileName, outputFileName);
+  if (access (fileName, 1) == 0)
+    {
+      myfree (ctx);
+      return 0;
+    }
+
+  fp = fopen (fileName, 0);
+  if (fp == 0)
+    {
+      myfree (ctx);
+      return 0;
+    }
+
+  foo (outputFileName, fp);
+
+  return ctx;
+}
+
+/* { dg-final { scan-tree-dump-times "myfree \\(" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/pr43864-2.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-2.c	(revision 0)
@@ -0,0 +1,22 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+int
+f (int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+    r = b + d;
+  else
+    {
+      e = b + d;
+      r = e;
+    }
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */
Index: gcc/testsuite/gcc.dg/pr43864-3.c
===================================================================
--- gcc/testsuite/gcc.dg/pr43864-3.c	(revision 0)
+++ gcc/testsuite/gcc.dg/pr43864-3.c	(revision 0)
@@ -0,0 +1,23 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-pre" } */
+
+/* Commutative case.  */
+
+int f(int c, int b, int d)
+{
+  int r, e;
+
+  if (c)
+    r = b + d;
+  else
+    {
+      e = d + b;
+      r = e;
+    }
+
+  return r;
+}
+
+/* { dg-final { scan-tree-dump-times "if " 0 "pre"} } */
+/* { dg-final { scan-tree-dump-times "_.*\\\+.*_" 1 "pre"} } */
+/* { dg-final { cleanup-tree-dump "pre" } } */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-07-22 15:54             ` Richard Guenther
  2011-08-19 18:33               ` Tom de Vries
@ 2011-08-24  9:00               ` Tom de Vries
  2011-08-25  1:07                 ` Ian Lance Taylor
  1 sibling, 1 reply; 18+ messages in thread
From: Tom de Vries @ 2011-08-24  9:00 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Richard Guenther, Steven Bosscher, gcc-patches

On 07/22/2011 05:36 PM, Richard Guenther wrote:
> That said - I'm reasonably happy with the pass now, but it's rather large
> (this review took 40min again ...) so I appreciate a second look from
> somebody else.
> 

Ian,

Do you have a moment to give a second look to a gimple CFG optimization?  The
optimization removes duplicate basic blocks and reduces code size by 1-2%.

The latest patch is posted at
http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.

Thanks,
- Tom

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-08-24  9:00               ` Tom de Vries
@ 2011-08-25  1:07                 ` Ian Lance Taylor
  2011-08-25  9:30                   ` Richard Guenther
  0 siblings, 1 reply; 18+ messages in thread
From: Ian Lance Taylor @ 2011-08-25  1:07 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Richard Guenther, Steven Bosscher, gcc-patches

Tom de Vries <vries@codesourcery.com> writes:

> Do you have a moment to give a second look to a gimple CFG optimization?  The
> optimization removes duplicate basic blocks and reduces code size by 1-2%.
>
> The latest patch is posted at
> http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.


I'm not really the best person to look at this patch, since it applies
to areas of the compiler with which I am less familiar..  However, since
you ask, I did read through the patch, and it looks OK to me.  Since
Richi OK'ed it, this patch is OK with the following changes.


> +typedef struct same_succ *same_succ_t;
> +typedef const struct same_succ *const_same_succ_t;

Don't name new types ending with "_t".  POSIX reserves names ending with
"_t" when <sys/types.h> is #included.  Name these something else.

> +typedef struct bb_cluster *bb_cluster_t;
> +typedef const struct bb_cluster *const_bb_cluster_t;

Same here.


> +@item -ftree-tail-merge
> +Merges identical blocks with same successors.  This flag is enabled by default
> +at @option{-O2} and higher.  The run time of this pass can be limited using
> +@option{max-tail-merge-comparisons} parameter.

I think this text can be improved to be more meaningful to compiler
users.  I suggest something like:

  Look for identical code sequences.  When found, replace one with a
  jump to the other.  This optimization is known as tail merging or
  cross jumping.  This flag is enabled [now same as above]


Thanks.

Ian

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup.
  2011-08-25  1:07                 ` Ian Lance Taylor
@ 2011-08-25  9:30                   ` Richard Guenther
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Guenther @ 2011-08-25  9:30 UTC (permalink / raw)
  To: Ian Lance Taylor; +Cc: Tom de Vries, Steven Bosscher, gcc-patches

On Wed, Aug 24, 2011 at 9:00 PM, Ian Lance Taylor <iant@google.com> wrote:
> Tom de Vries <vries@codesourcery.com> writes:
>
>> Do you have a moment to give a second look to a gimple CFG optimization?  The
>> optimization removes duplicate basic blocks and reduces code size by 1-2%.
>>
>> The latest patch is posted at
>> http://gcc.gnu.org/ml/gcc-patches/2011-08/msg01602.html.
>
>
> I'm not really the best person to look at this patch, since it applies
> to areas of the compiler with which I am less familiar..  However, since
> you ask, I did read through the patch, and it looks OK to me.  Since
> Richi OK'ed it, this patch is OK with the following changes.
>
>
>> +typedef struct same_succ *same_succ_t;
>> +typedef const struct same_succ *const_same_succ_t;
>
> Don't name new types ending with "_t".  POSIX reserves names ending with
> "_t" when <sys/types.h> is #included.  Name these something else.
>
>> +typedef struct bb_cluster *bb_cluster_t;
>> +typedef const struct bb_cluster *const_bb_cluster_t;
>
> Same here.
>
>
>> +@item -ftree-tail-merge
>> +Merges identical blocks with same successors.  This flag is enabled by default
>> +at @option{-O2} and higher.  The run time of this pass can be limited using
>> +@option{max-tail-merge-comparisons} parameter.
>
> I think this text can be improved to be more meaningful to compiler
> users.  I suggest something like:
>
>  Look for identical code sequences.  When found, replace one with a
>  jump to the other.  This optimization is known as tail merging or
>  cross jumping.  This flag is enabled [now same as above]

Can you also add a --param for the maximum number of iterations
you perform (16 sounds quite high for GCC bootstrap), I'd default it
to 2 which seems to catch 99% of all cases.

If you already committed the patch just do it as a followup please.

Thanks,
Richard.

>
> Thanks.
>
> Ian
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH, PR43864] Gimple level duplicate block cleanup - test cases.
  2011-08-19 18:38     ` Tom de Vries
@ 2011-08-25 10:09       ` Richard Guenther
  0 siblings, 0 replies; 18+ messages in thread
From: Richard Guenther @ 2011-08-25 10:09 UTC (permalink / raw)
  To: Tom de Vries; +Cc: Steven Bosscher, gcc-patches

On Fri, Aug 19, 2011 at 6:28 PM, Tom de Vries <vries@codesourcery.com> wrote:
> On 07/17/2011 08:33 PM, Tom de Vries wrote:
>> Updated version.
>>
>> On 06/08/2011 11:45 AM, Tom de Vries wrote:
>>> On 06/08/2011 11:42 AM, Tom de Vries wrote:
>>>
>>>> I'll send the patch with the testcases in a separate email.
>>>
>>
>
> 2 extra testcases added.
>
> OK for trunk?

Ok.

THanks,
Richard.

> Thanks,
> - Tom
>
> 2011-08-19  Tom de Vries  <tom@codesourcery.com>
>
>        PR middle-end/43864
>        * gcc.dg/fold-compare-2.c (dg-options): Add -fno-tree-tail-merge.
>        * gcc/testsuite/gcc.dg/uninit-pred-2_c.c: Same.
>        * gcc.dg/pr43864.c: New test.
>        * gcc.dg/pr43864-2.c: Same.
>        * gcc.dg/pr43864-3.c: Same.
>        * gcc.dg/pr43864-4.c: Same.
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-08-25  7:35 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-08  9:49 [PATCH, PR43864] Gimple level duplicate block cleanup Tom de Vries
2011-06-08  9:55 ` [PATCH, PR43864] Gimple level duplicate block cleanup - test cases Tom de Vries
2011-07-18  2:54   ` Tom de Vries
2011-08-19 18:38     ` Tom de Vries
2011-08-25 10:09       ` Richard Guenther
2011-06-08 10:09 ` [PATCH, PR43864] Gimple level duplicate block cleanup Richard Guenther
2011-06-08 10:40   ` Steven Bosscher
2011-06-10 17:16   ` Tom de Vries
2011-06-14 15:12     ` Richard Guenther
2011-07-12 12:21       ` Tom de Vries
2011-07-12 14:37         ` Richard Guenther
2011-07-18  0:41           ` Tom de Vries
2011-07-22 15:54             ` Richard Guenther
2011-08-19 18:33               ` Tom de Vries
2011-08-24  9:00               ` Tom de Vries
2011-08-25  1:07                 ` Ian Lance Taylor
2011-08-25  9:30                   ` Richard Guenther
2011-06-10 18:43 ` Jeff Law

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).