extend fwprop optimization

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* extend fwprop optimization
@ 2013-02-25 23:32 Wei Mi
  2013-02-26  0:08 ` Steven Bosscher
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-02-25 23:32 UTC (permalink / raw)
  To: GCC Patches; +Cc: David Li

[-- Attachment #1: Type: text/plain, Size: 5461 bytes --]

Hello,

I have a patch trying to extend the fwprop to propagate complex
expressions. I post it for discussion. Existing fwprop can propagate
the src of simple def insns (like: ra = const, ra = rb or ra =
subreg(rb)) to uses, but it cannot propagate the def insn like: ra =
rb + rc. Here is the motivation example below. Existing fwprop cannot
handle it because the def insn is not const/reg/subreg case. combine
phase also cannot handle the case because combine phase is based on
LINK_LOG and cannot handle single-def multiple down uses cases.

The motivational case:
http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html

The extended fwprop iterates each def and tries to propagate the def
to multiple down uses, even if the def is a complex expression. The
propagation will create a series of change candidates, and we will
consider their costs in a group (Existing fwprop consider def-use pair
one by one). If all the uses for the def could be replaced, then
may_confirm_whole_group is true, which indicates the def insn could be
removed after all the changes are applied. The benefit of each change
is the subtract of the cost before the change and that after the
change. We also take insn splitting and peephole into consideration,
.i.e, the cost of the change is the cost after insn splitting and
peephole which may be applied to the insn changed. This is useful for
the motivational case, for which, the transformation from "a << (b &
63)" to "a << b" is done by insn splitting, so we need to consider the
cost after insn splitting.  total_benefit is the summation of the
benefits of all the changes. total_positive_benefit is the summation
of all the positive benefits. extra_benefit is the benefit to remove
the def insn if may_confirm_whole_group is true. If total_benefit +
extra_benefit >= total_positive_benefit, we choose to apply all the
changes and remove the def insn. If not, we choose to only apply the
positive benefit changes one by one.

Testing result:
a small number of regression failures caused by testcases limitation.
bootstrapped ok.

base: gcc r195411 -O2
test:   gcc r195411 + fwprop extension  -O2
dynamic insn number is got using "perf stat".

spec2000 O2 C/C++ benchmarks result.
CPU2000 INT   perf improvement (%)  dynamic insn number reduced (%)
164.gzip                        -0.27                               0.18
175.vpr                         0                                      0.22
176.gcc                         0.71                                0.06
181.mcf                         0.91                                -0.02
186.crafty                      0                                     0.23
197.parser                     -0.46                               0.32
252.eon                    *** 8.78                                 xxxx
253.perlbmk                  2.47                                 1.92
254.gap                         0                                      0.30
255.vortex                     2.17                                 0.22
256.bzip2                      -0.33                                1.37
300.twolf                       0.11                                  0.11

CPU2000 FP
177.mesa                      0.28                                 0.60
179.art                           0.64                                1.24
183.equake                   -0.38                                0.01
188.ammp                     0                                      0.09

spec2006 O2 C/C++ benchmarks result.
CPU2006 INT   perf improvement (%)  dynamic insn number reduced (%)
400.perlbench               2.47                                -0.06
401.bzip2                      1.28                                0.73
403.gcc                         0                                     0.10
429.mcf                         0                                     -0.06
445.gobmk                    0.68                                0.33
456.hmmer                   0.23                                 -0.01
458.sjeng                      -1.14                                0
462.libquantum       ***  7.52                                13.01
464.h264ref                   xxxx                                xxxx
471.omnetpp                -0.61                                0.06
473.astar                       0                                     0.45
483.xalancbmk              1.30                                0.02

CPU2006 FP
433.milc                         0                                     0.01
444.namd                      -0.25                               0
447.dealII                       xxxx                                xxxx
450.soplex                     0.84                                0.30
453.povray                     0                                     0.09
470.lbm                          -0.35                               0
482.sphinx3                    0.18                                0

*** Although eon and libquantum are improved a lot, the performance
improvement are not caused by fwpropext.  eon performance diff is
caused by code layout change. libquantum performance diff is because
after fwprop extension, a bad pre optimization is disabled in the
hottest loop.
*** I got endless running on 464.h264ref and compilation error on
447.deaIII for both with/without my changes. They are probably because
the spec configuration or options. So I just skipped those two tests
for simplicity.

Thanks,
Wei.

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 42684 bytes --]

Index: gcc/jump.c
===================================================================
--- gcc/jump.c	(revision 196270)
+++ gcc/jump.c	(working copy)
@@ -1868,7 +1868,9 @@ true_regnum (const_rtx x)
   if (REG_P (x))
     {
       if (REGNO (x) >= FIRST_PSEUDO_REGISTER
-	  && (lra_in_progress || reg_renumber[REGNO (x)] >= 0))
+	  && (lra_in_progress 
+	      || (reg_renumber 
+		  && reg_renumber[REGNO (x)] >= 0)))
 	return reg_renumber[REGNO (x)];
       return REGNO (x);
     }
Index: gcc/config/i386/i386.md
===================================================================
--- gcc/config/i386/i386.md	(revision 196270)
+++ gcc/config/i386/i386.md	(working copy)
@@ -16884,8 +16884,13 @@
 	   (const_int 0)]))]
   "! TARGET_PARTIAL_REG_STALL
    && ix86_match_ccmode (insn, CCNOmode)
-   && true_regnum (operands[2]) != AX_REG
-   && peep2_reg_dead_p (1, operands[2])"
+   && ((REG_P (operands[2])
+	&& true_regnum (operands[2]) != AX_REG
+	&& peep2_reg_dead_p (1, operands[2]))
+       || (GET_CODE (operands[2]) == SUBREG
+	   && true_regnum (SUBREG_REG(operands[2])) != AX_REG
+	   && peep2_reg_dead_p (1, SUBREG_REG(operands[2]))))
+"
   [(parallel
      [(set (match_dup 0)
 	   (match_op_dup 1 [(and:QI (match_dup 2) (match_dup 3))
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 196270)
+++ gcc/config/i386/i386.c	(working copy)
@@ -15902,7 +15902,8 @@ ix86_expand_clear (rtx dest)
   rtx tmp;
 
   /* We play register width games, which are only valid after reload.  */
-  gcc_assert (reload_completed);
+  if (strncmp (current_pass->name, "fwprop", 6))
+    gcc_assert (reload_completed);
 
   /* Avoid HImode and its attendant prefix byte.  */
   if (GET_MODE_SIZE (GET_MODE (dest)) < 4)
Index: gcc/recog.c
===================================================================
--- gcc/recog.c	(revision 196270)
+++ gcc/recog.c	(working copy)
@@ -181,6 +181,14 @@ typedef struct change_t
   rtx *loc;
   rtx old;
   bool unshare;
+  int benefit;
+  bool verified;
+  /* Some change is associated with last change.
+     If last change is committed, then current
+     change will also be committed. A such case
+     is the change to remove a CLOBBER in 
+     verify_change */
+  bool associated_with_last;
 } change_t;
 
 static change_t *changes;
@@ -235,6 +243,9 @@ validate_change_1 (rtx object, rtx *loc,
   changes[num_changes].loc = loc;
   changes[num_changes].old = old;
   changes[num_changes].unshare = unshare;
+  changes[num_changes].benefit = 0;
+  changes[num_changes].verified = false;
+  changes[num_changes].associated_with_last = false;
 
   if (object && !MEM_P (object))
     {
@@ -495,6 +506,276 @@ confirm_change_group (void)
   num_changes = 0;
 }
 
+int
+get_changes_num (void)
+{
+  return num_changes;
+}
+
+void
+set_change_verified (int idx, bool val)
+{
+  changes[idx].verified = val;
+}
+
+void
+set_change_benefit (int idx, int val)
+{
+  changes[idx].benefit = val;
+}
+
+void
+set_change_associated_with_last (int idx, bool val)
+{
+  changes[idx].associated_with_last = val;
+}
+
+int
+estimate_seq_cost (rtx first, bool speed)
+{
+  int cost = 0;
+
+  while (first)
+    {
+      rtx set = single_set (first);
+      rtx pat = PATTERN (first);
+      if (set)
+	cost += set_src_cost (SET_SRC(set), speed);
+      else if (GET_CODE (pat) == PARALLEL)
+	{
+	  /* select the minimal set cost as parallel cost */
+	  int i;
+	  int mincost = MAX_COST;
+	  for (i = 0; i < XVECLEN (pat, 0); i++)
+	    {
+	      enum rtx_code code;
+	      set = XVECEXP (pat, 0, i);
+	      code = GET_CODE (set);
+	      if (code == SET)
+		{
+		  int icost = set_src_cost (SET_SRC(set), speed);
+		  if (icost < mincost)
+		    mincost = icost;
+		}
+	      else if (code == CLOBBER)
+		continue;
+	      else
+		{
+		  mincost = MAX_COST;
+		  break;
+		}
+	    }
+	    cost += mincost;
+	}
+      else
+	{
+	  fprintf (stderr, "split or peephole result is not a set\n");
+	  print_rtl_single (stderr, first);
+	  gcc_assert (0);
+	}
+      first = NEXT_INSN (first);
+    }
+  return cost;
+}
+
+int
+estimate_split_and_peephole (rtx insn)
+{
+  int match_len;
+  int old_cost;
+  rtx result;
+  rtx pat = PATTERN (insn);
+  rtx set = single_set(insn);
+  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+
+  if (set)
+    old_cost = set_src_cost (SET_SRC(set), speed);
+  else
+    {
+      fprintf (stderr, "insn is not a set\n");
+      print_rtl_single (stderr, insn);
+      gcc_assert (0);
+    }
+
+  result = split_insns (pat, insn);
+
+  if (result)
+    return old_cost - estimate_seq_cost(result, speed);
+
+  initialize_before_estimate_peephole (insn);
+  result = peephole2_insns (pat, insn, &match_len);
+
+  if (result)
+    return old_cost - estimate_seq_cost(result, speed);
+
+  return 0; 
+}
+
+/* We cannot confirm all the changes group, then we evaluate
+   the change one by one. For fwprop_addr, the cost evaluation
+   is caculated using targetm.address_cost() and has been done 
+   in propagate_rtx_1, so we use chk_benefit to control not to
+   check benefit again. */
+
+bool
+confirm_change_one_by_one (bool chk_benefit)
+{
+  int i;
+  rtx last_object = NULL;
+  bool last_change_committed = false;
+
+  for (i = num_changes - 1; i >= 0; i--)
+    {
+      rtx object = changes[i].object;
+
+      /* If change is not verified successfully, or benefit <= 0 
+         and current change is not associated with last committed
+         change, then we will backout the change */
+      if (!changes[i].verified 
+	  || (chk_benefit
+	      && changes[i].benefit <= 0 
+	      && !(last_change_committed 
+		   && changes[i].associated_with_last)))
+	{
+	  *changes[i].loc = changes[i].old;
+	  if (changes[i].object && !MEM_P (changes[i].object))
+	    INSN_CODE (changes[i].object) = changes[i].old_code;
+	  last_change_committed = false;
+	  continue;
+	}
+
+      if (changes[i].unshare)
+	*changes[i].loc = copy_rtx (*changes[i].loc);
+
+      /* Avoid unnecessary rescanning when multiple changes to same instruction
+         are made.  */
+      if (object)
+	{
+	  if (object != last_object && last_object && INSN_P (last_object))
+	    df_insn_rescan (last_object);
+	  last_object = object;
+	}
+
+      if (dump_file)
+	fprintf(dump_file, "\n   *** change[%d] -- committed ***\n", i);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (object));
+	  print_inline_rtx (dump_file, changes[i].old, 2);
+	  fprintf (dump_file, "\n with ");
+	  print_inline_rtx (dump_file, *changes[i].loc, 2);
+	  fprintf (dump_file, "\n resulting: ");
+	  print_inline_rtx (dump_file, object, 2);
+	}
+
+      last_change_committed = true;
+    }
+
+  if (last_object && INSN_P (last_object))
+    df_insn_rescan (last_object);
+
+  num_changes = 0;
+  if (last_object)
+    return true;
+  else
+    return false;
+}
+
+bool
+confirm_change_group_by_cost (bool may_confirm_whole_group, 
+			      int extra_benefit,
+			      bool chk_benefit)
+{
+  int i;
+  int total_benefit = 0, total_positive_benefit = 0;
+  bool no_positive_benefit = true;
+
+  if (num_changes == 0)
+    {
+      if (dump_file)
+        fprintf(dump_file, "No changes being tried\n");
+      return false;
+    }
+
+  if (!chk_benefit)
+    return confirm_change_one_by_one(false);
+
+  if (dump_file)
+    fprintf(dump_file, "  extra benefit = %d\n", extra_benefit);
+
+  for (i = 0; i < num_changes; i++)
+    {
+      int split_or_peephole_cost;
+
+      if (!changes[i].verified)
+	{
+	  may_confirm_whole_group = false;
+	  if (dump_file)
+	    fprintf(dump_file, "  change[%d]: benefit = %d, verified - fail\n", 
+		    i, changes[i].benefit);
+	  continue;
+	}
+
+      split_or_peephole_cost = estimate_split_and_peephole (changes[i].object);
+      changes[i].benefit += split_or_peephole_cost;
+
+      total_benefit += changes[i].benefit;
+      if (changes[i].benefit > 0)
+	{
+	  total_positive_benefit += changes[i].benefit;
+	  no_positive_benefit = false;
+	}
+
+      if (dump_file)
+	fprintf(dump_file, "  change[%d]: benefit = %d, verified - ok\n", 
+		i, changes[i].benefit);
+    }
+
+  /* Compare the benefit between applying the whole change group
+     and only applying changes whose benefit > 0. When applying 
+     the whole change group, we usually have the extra_benefit is about
+     deleting the def insn when all its uses are replaced and
+     it becomes a dead insn */
+  if (may_confirm_whole_group
+      && total_benefit + extra_benefit < total_positive_benefit)
+    may_confirm_whole_group = false;
+
+  if (may_confirm_whole_group)
+    {
+      if (dump_file)
+        fprintf(dump_file, "!!! All the changes committed\n");
+
+      if (dump_file)
+	{
+	  for (i = 0; i < num_changes; i++)
+	    {
+	      fprintf (dump_file, "\nIn insn %d, replacing\n ",
+		       INSN_UID (changes[i].object));
+	      print_inline_rtx (dump_file, changes[i].old, 2);
+	      fprintf (dump_file, "\n with ");
+	      print_inline_rtx (dump_file, *changes[i].loc, 2);
+	      fprintf (dump_file, "\n resulting: ");
+	      print_inline_rtx (dump_file, changes[i].object, 2);
+	    }
+	}
+
+      confirm_change_group ();
+      return true;
+    }
+  else if (no_positive_benefit)
+    {
+      cancel_changes (0);
+      if (dump_file)
+        fprintf(dump_file, "No changes committed\n");
+      return false;
+    }
+  else
+    {
+      return confirm_change_one_by_one (true);
+    }
+}
+
 /* Apply a group of changes previously issued with `validate_change'.
    If all changes are valid, call confirm_change_group and return 1,
    otherwise, call cancel_changes and return 0.  */
@@ -2997,6 +3278,27 @@ int peep2_current_count;
    DF_LIVE_OUT for the block.  */
 #define PEEP2_EOB	pc_rtx
 
+void
+initialize_before_estimate_peephole (rtx insn)
+{
+  bitmap live;
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  peep2_current = 0;
+  peep2_current_count = 0;
+  peep2_insn_data[0].insn = insn;
+  peep2_insn_data[0].live_before = BITMAP_ALLOC (&reg_obstack);
+  peep2_insn_data[1].insn = PEEP2_EOB;
+  peep2_insn_data[1].live_before = BITMAP_ALLOC (&reg_obstack);
+
+  live = BITMAP_ALLOC (&reg_obstack);
+  bitmap_copy (live, DF_LR_IN (bb));
+  df_simulate_initialize_forwards (bb, live);
+  simulate_backwards_to_point(bb, live, insn);
+  COPY_REG_SET (peep2_insn_data[1].live_before, live);
+  df_simulate_one_insn_backwards (bb, insn, live);
+  COPY_REG_SET (peep2_insn_data[0].live_before, live);
+}
+
 /* Wrap N to fit into the peep2_insn_data buffer.  */
 
 static int
@@ -3050,6 +3352,14 @@ peep2_reg_dead_p (int ofs, rtx reg)
   gcc_assert (peep2_insn_data[ofs].insn != NULL_RTX);
 
   regno = REGNO (reg);
+
+  /* we may call peephole2_insns in fwprop phase to estimate how 
+     peephole will affect the cost of the insn transformed by fwprop.
+     fwprop is done before ira phase, so we need to consider pesudo
+     register here as well. */
+  if (!strncmp(current_pass->name, "fwprop", 6))
+    return !REGNO_REG_SET_P (peep2_insn_data[ofs].live_before, regno);
+
   n = hard_regno_nregs[regno][GET_MODE (reg)];
   while (--n >= 0)
     if (REGNO_REG_SET_P (peep2_insn_data[ofs].live_before, regno + n))
@@ -3078,6 +3388,13 @@ peep2_find_free_register (int from, int
   df_ref *def_rec;
   int i;
 
+  /* we may call peephole2_insns in fwprop phase to estimate how 
+     peephole will affect the cost of the insn transformed by fwprop.
+     fwprop is done before ira phase. In that case, we simply return
+     a new pseudo register. */ 
+  if (!strncmp (current_pass->name, "fwprop", 6))
+    return gen_reg_rtx (mode);
+
   gcc_assert (from < MAX_INSNS_PER_PEEP2 + 1);
   gcc_assert (to < MAX_INSNS_PER_PEEP2 + 1);
 
Index: gcc/recog.h
===================================================================
--- gcc/recog.h	(revision 196270)
+++ gcc/recog.h	(working copy)
@@ -82,6 +82,17 @@ extern int insn_invalid_p (rtx, bool);
 extern int verify_changes (int);
 extern void confirm_change_group (void);
 extern int apply_change_group (void);
+extern int get_changes_num (void);
+extern void set_change_verified (int idx, bool val);
+extern void set_change_benefit (int idx, int val);
+extern void set_change_associated_with_last (int idx, bool val);
+extern int estimate_seq_cost (rtx first, bool speed);
+extern int estimate_split_and_peephole (rtx insn);
+extern void initialize_before_estimate_peephole (rtx insn);
+extern bool confirm_change_one_by_one (bool chk_benefit);
+extern bool confirm_change_group_by_cost (bool may_confirm_whole_group,
+					  int extra_benefit,
+					  bool chk_benefit);
 extern int num_validated_changes (void);
 extern void cancel_changes (int);
 extern int constrain_operands (int);
Index: gcc/rtl.h
===================================================================
--- gcc/rtl.h	(revision 196270)
+++ gcc/rtl.h	(working copy)
@@ -440,6 +440,9 @@ struct GTY((variable_size)) rtvec_def {
 /* Predicate yielding nonzero iff X is a label insn.  */
 #define LABEL_P(X) (GET_CODE (X) == CODE_LABEL)
 
+/* Predicate yielding true iff X is a symbol ref */
+#define SYMBOL_REF_P(X) (GET_CODE (X) == SYMBOL_REF)
+
 /* Predicate yielding nonzero iff X is a jump insn.  */
 #define JUMP_P(X) (GET_CODE (X) == JUMP_INSN)
 
Index: gcc/fwprop.c
===================================================================
--- gcc/fwprop.c	(revision 196270)
+++ gcc/fwprop.c	(working copy)
@@ -274,10 +274,12 @@ build_single_def_use_links (void)
   /* We use the multiple definitions problem to compute our restricted
      use-def chains.  */
   df_set_flags (DF_EQ_NOTES);
+  df_set_flags (DF_LR_RUN_DCE);
   df_md_add_problem ();
   df_note_add_problem ();
-  df_analyze ();
+  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
   df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
+  df_analyze ();
 
   use_def_ref.create (DF_USES_TABLE_SIZE ());
   use_def_ref.safe_grow_cleared (DF_USES_TABLE_SIZE ());
@@ -412,36 +414,6 @@ should_replace_address (rtx old_rtx, rtx
   return (gain > 0);
 }
 
-
-/* Flags for the last parameter of propagate_rtx_1.  */
-
-enum {
-  /* If PR_CAN_APPEAR is true, propagate_rtx_1 always returns true;
-     if it is false, propagate_rtx_1 returns false if, for at least
-     one occurrence OLD, it failed to collapse the result to a constant.
-     For example, (mult:M (reg:M A) (minus:M (reg:M B) (reg:M A))) may
-     collapse to zero if replacing (reg:M B) with (reg:M A).
-
-     PR_CAN_APPEAR is disregarded inside MEMs: in that case,
-     propagate_rtx_1 just tries to make cheaper and valid memory
-     addresses.  */
-  PR_CAN_APPEAR = 1,
-
-  /* If PR_HANDLE_MEM is not set, propagate_rtx_1 won't attempt any replacement
-     outside memory addresses.  This is needed because propagate_rtx_1 does
-     not do any analysis on memory; thus it is very conservative and in general
-     it will fail if non-read-only MEMs are found in the source expression.
-
-     PR_HANDLE_MEM is set when the source of the propagation was not
-     another MEM.  Then, it is safe not to treat non-read-only MEMs as
-     ``opaque'' objects.  */
-  PR_HANDLE_MEM = 2,
-
-  /* Set when costs should be optimized for speed.  */
-  PR_OPTIMIZE_FOR_SPEED = 4
-};
-
-
 /* Replace all occurrences of OLD in *PX with NEW and try to simplify the
    resulting expression.  Replace *PX with a new RTL expression if an
    occurrence of OLD was found.
@@ -450,32 +422,21 @@ enum {
    matching code here.  (The sole exception is the handling of LO_SUM, but
    that is because there is no simplify_gen_* function for LO_SUM).  */
 
-static bool
-propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
+static bool 
+propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
 {
   rtx x = *px, tem = NULL_RTX, op0, op1, op2;
   enum rtx_code code = GET_CODE (x);
   enum machine_mode mode = GET_MODE (x);
   enum machine_mode op_mode;
-  bool can_appear = (flags & PR_CAN_APPEAR) != 0;
   bool valid_ops = true;
 
-  if (!(flags & PR_HANDLE_MEM) && MEM_P (x) && !MEM_READONLY_P (x))
-    {
-      /* If unsafe, change MEMs to CLOBBERs or SCRATCHes (to preserve whether
-	 they have side effects or not).  */
-      *px = (side_effects_p (x)
-	     ? gen_rtx_CLOBBER (GET_MODE (x), const0_rtx)
-	     : gen_rtx_SCRATCH (GET_MODE (x)));
-      return false;
-    }
-
   /* If X is OLD_RTX, return NEW_RTX.  But not if replacing only within an
      address, and we are *not* inside one.  */
   if (x == old_rtx)
     {
       *px = new_rtx;
-      return can_appear;
+      return true;
     }
 
   /* If this is an expression, try recursive substitution.  */
@@ -484,7 +445,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_UNARY:
       op0 = XEXP (x, 0);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0))
 	return true;
       tem = simplify_gen_unary (code, mode, op0, op_mode);
@@ -494,8 +455,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_COMM_ARITH:
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_binary (code, mode, op0, op1);
@@ -506,8 +467,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
       op_mode = GET_MODE (op0) != VOIDmode ? GET_MODE (op0) : GET_MODE (op1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_relational (code, mode, op_mode, op0, op1);
@@ -519,9 +480,9 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op1 = XEXP (x, 1);
       op2 = XEXP (x, 2);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1) && op2 == XEXP (x, 2))
 	return true;
       if (op_mode == VOIDmode)
@@ -534,7 +495,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       if (code == SUBREG)
 	{
           op0 = XEXP (x, 0);
-	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0))
 	    return true;
 	  tem = simplify_gen_subreg (mode, op0, GET_MODE (SUBREG_REG (x)),
@@ -554,7 +515,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
 	  op0 = new_op0 = targetm.delegitimize_address (op0);
 	  valid_ops &= propagate_rtx_1 (&new_op0, old_rtx, new_rtx,
-					flags | PR_CAN_APPEAR);
+					speed);
 
 	  /* Dismiss transformation that we do not want to carry on.  */
 	  if (!valid_ops
@@ -569,7 +530,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (!(REG_P (old_rtx) && REG_P (new_rtx))
 	      && !should_replace_address (op0, new_op0, GET_MODE (x),
 					  MEM_ADDR_SPACE (x),
-	      			 	  flags & PR_OPTIMIZE_FOR_SPEED))
+	      			 	  speed))
 	    return true;
 
 	  tem = replace_equiv_address_nv (x, new_op0);
@@ -583,8 +544,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  /* The only simplification we do attempts to remove references to op0
 	     or make it constant -- in both cases, op0's invalidity will not
 	     make the result invalid.  */
-	  propagate_rtx_1 (&op0, old_rtx, new_rtx, flags | PR_CAN_APPEAR);
-	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+	  propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	    return true;
 
@@ -605,7 +566,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (rtx_equal_p (x, old_rtx))
 	    {
               *px = new_rtx;
-              return can_appear;
+              return true;
 	    }
 	}
       break;
@@ -620,10 +581,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
   *px = tem;
 
-  /* The replacement we made so far is valid, if all of the recursive
-     replacements were valid, or we could simplify everything to
-     a constant.  */
-  return valid_ops || can_appear || CONSTANT_P (tem);
+  return valid_ops;
 }
 
 
@@ -634,7 +592,7 @@ static int
 varying_mem_p (rtx *body, void *data ATTRIBUTE_UNUSED)
 {
   rtx x = *body;
-  return MEM_P (x) && !MEM_READONLY_P (x);
+  return (MEM_P (x) && !MEM_READONLY_P (x)) || CALL_P (x);
 }
 
 
@@ -651,29 +609,13 @@ propagate_rtx (rtx x, enum machine_mode
 	       bool speed)
 {
   rtx tem;
-  bool collapsed;
-  int flags;
 
   if (REG_P (new_rtx) && REGNO (new_rtx) < FIRST_PSEUDO_REGISTER)
     return NULL_RTX;
 
-  flags = 0;
-  if (REG_P (new_rtx)
-      || CONSTANT_P (new_rtx)
-      || (GET_CODE (new_rtx) == SUBREG
-	  && REG_P (SUBREG_REG (new_rtx))
-	  && (GET_MODE_SIZE (mode)
-	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
-    flags |= PR_CAN_APPEAR;
-  if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
-    flags |= PR_HANDLE_MEM;
-
-  if (speed)
-    flags |= PR_OPTIMIZE_FOR_SPEED;
-
   tem = x;
-  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), flags);
-  if (tem == x || !collapsed)
+  propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), speed);
+  if (tem == x)
     return NULL_RTX;
 
   /* gen_lowpart_common will not be able to process VOIDmode entities other
@@ -851,98 +793,6 @@ all_uses_available_at (rtx def_insn, rtx
   return true;
 }
 
-\f
-static df_ref *active_defs;
-#ifdef ENABLE_CHECKING
-static sparseset active_defs_check;
-#endif
-
-/* Fill the ACTIVE_DEFS array with the use->def link for the registers
-   mentioned in USE_REC.  Register the valid entries in ACTIVE_DEFS_CHECK
-   too, for checking purposes.  */
-
-static void
-register_active_defs (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      df_ref def = get_def_for_use (use);
-      int regno = DF_REF_REGNO (use);
-
-#ifdef ENABLE_CHECKING
-      sparseset_set_bit (active_defs_check, regno);
-#endif
-      active_defs[regno] = def;
-    }
-}
-
-
-/* Build the use->def links that we use to update the dataflow info
-   for new uses.  Note that building the links is very cheap and if
-   it were done earlier, they could be used to rule out invalid
-   propagations (in addition to what is done in all_uses_available_at).
-   I'm not doing this yet, though.  */
-
-static void
-update_df_init (rtx def_insn, rtx insn)
-{
-#ifdef ENABLE_CHECKING
-  sparseset_clear (active_defs_check);
-#endif
-  register_active_defs (DF_INSN_USES (def_insn));
-  register_active_defs (DF_INSN_USES (insn));
-  register_active_defs (DF_INSN_EQ_USES (insn));
-}
-
-
-/* Update the USE_DEF_REF array for the given use, using the active definitions
-   in the ACTIVE_DEFS array to match pseudos to their def. */
-
-static inline void
-update_uses (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      int regno = DF_REF_REGNO (use);
-
-      /* Set up the use-def chain.  */
-      if (DF_REF_ID (use) >= (int) use_def_ref.length ())
-        use_def_ref.safe_grow_cleared (DF_REF_ID (use) + 1);
-
-#ifdef ENABLE_CHECKING
-      gcc_assert (sparseset_bit_p (active_defs_check, regno));
-#endif
-      use_def_ref[DF_REF_ID (use)] = active_defs[regno];
-    }
-}
-
-
-/* Update the USE_DEF_REF array for the uses in INSN.  Only update note
-   uses if NOTES_ONLY is true.  */
-
-static void
-update_df (rtx insn, rtx note)
-{
-  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
-
-  if (note)
-    {
-      df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
-      df_notes_rescan (insn);
-    }
-  else
-    {
-      df_uses_create (&PATTERN (insn), insn, 0);
-      df_insn_rescan (insn);
-      update_uses (DF_INSN_INFO_USES (insn_info));
-    }
-
-  update_uses (DF_INSN_INFO_EQ_USES (insn_info));
-}
-
-
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
    substituting the whole SET_SRC, so we can set a REG_EQUAL note if the
@@ -950,81 +800,53 @@ update_df (rtx insn, rtx note)
    performed.  */
 
 static bool
-try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, rtx def_insn, bool set_reg_equal)
+try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx)
 {
   rtx insn = DF_REF_INSN (use);
   rtx set = single_set (insn);
-  rtx note = NULL_RTX;
+  // rtx note = NULL_RTX;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
-  int old_cost = 0;
-  bool ok;
+  int old_cost = 0, benefit = 0;
+  int old_changes_num, new_changes_num;
 
-  update_df_init (def_insn, insn);
+  /* see when the insn is not a set */
+  if (!set)
+    return false;
 
   /* forward_propagate_subreg may be operating on an instruction with
      multiple sets.  If so, assume the cost of the new instruction is
      not greater than the old one.  */
   if (set)
-    old_cost = set_src_cost (SET_SRC (set), speed);
-  if (dump_file)
-    {
-      fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (insn));
-      print_inline_rtx (dump_file, *loc, 2);
-      fprintf (dump_file, "\n with ");
-      print_inline_rtx (dump_file, new_rtx, 2);
-      fprintf (dump_file, "\n");
-    }
+    old_cost = (set_src_cost (SET_SRC (set), speed)
+		+ set_src_cost (SET_DEST (set), speed) + 1);
 
+  old_changes_num = get_changes_num ();
   validate_unshare_change (insn, loc, new_rtx, true);
-  if (!verify_changes (0))
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not recognized\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	   && set
-	   && set_src_cost (SET_SRC (set), speed) > old_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not profitable\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changed insn %d\n", INSN_UID (insn));
-      ok = true;
-    }
-
-  if (ok)
-    {
-      confirm_change_group ();
-      num_changes++;
-    }
-  else
-    {
-      cancel_changes (0);
-
-      /* Can also record a simplified value in a REG_EQUAL note,
-	 making a new one if one does not already exist.  */
-      if (set_reg_equal)
-	{
-	  if (dump_file)
-	    fprintf (dump_file, " Setting REG_EQUAL note\n");
-
-	  note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (new_rtx));
-	}
-    }
 
-  if ((ok || note) && !CONSTANT_P (new_rtx))
-    update_df (insn, note);
+  if (verify_changes (old_changes_num))
+  {
+    /* verify_changes may calls validate_change and add new changes,
+       in which case either all the changes added this time committed 
+       together, or canceled. So set_change_associated_with_last 
+       is used to associate all the changes added this time into the 
+       last one. */
+    int i;
+    int new_cost = set_src_cost (SET_SRC (set), speed)
+		   + set_src_cost (SET_DEST (set), speed) + 1;
+    benefit = old_cost - new_cost;
+    new_changes_num = get_changes_num();
+
+    set_change_verified (new_changes_num - 1, true);
+    set_change_benefit (new_changes_num - 1, benefit);
+    for (i = new_changes_num - 2; i >= old_changes_num; i--)
+      {
+	set_change_verified (i, true);
+	set_change_benefit (i, 0);
+	set_change_associated_with_last (i, true);
+      }
+  }
 
-  return ok;
+  return true;
 }
 
 /* For the given single_set INSN, containing SRC known to be a
@@ -1107,8 +929,7 @@ forward_propagate_subreg (df_ref use, rt
 	  && GET_MODE (SUBREG_REG (src)) == use_mode
 	  && subreg_lowpart_p (src)
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src));
     }
 
   /* If this is a SUBREG of a ZERO_EXTEND or SIGN_EXTEND, and the SUBREG
@@ -1139,87 +960,113 @@ forward_propagate_subreg (df_ref use, rt
 	  && (targetm.mode_rep_extended (use_mode, GET_MODE (src))
 	      != (int) GET_CODE (src))
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0));
     }
 
   return false;
 }
 
-/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.  */
+static void
+mems_modified_p (rtx dest, const_rtx setter ATTRIBUTE_UNUSED, void *data)
+{
+  bool *modified = (bool *)data;
+
+  /* If DEST is not a MEM, then it will not conflict with the load.  Note
+     that function calls are assumed to clobber memory, but are handled
+     elsewhere.  */
+  if (MEM_P (dest))
+    {
+      *modified = true;
+      return;
+    }
+}
+
+/* Check whether any memory modification insn from from insn
+   to to insn. */
 
 static bool
-forward_propagate_asm (df_ref use, rtx def_insn, rtx def_set, rtx reg)
+mem_may_be_modified (rtx from, rtx to)
 {
-  rtx use_insn = DF_REF_INSN (use), src, use_pat, asm_operands, new_rtx, *loc;
-  int speed_p, i;
-  df_ref *use_vec;
+  bool modified = false;
+  rtx insn;
 
-  gcc_assert ((DF_REF_FLAGS (use) & DF_REF_IN_NOTE) == 0);
+  /* For now, we only check the simple case where from and to
+     are in the same bb. */
+  basic_block bb = BLOCK_FOR_INSN (from);
+  if (bb != BLOCK_FOR_INSN (to))
+    return true;
 
-  src = SET_SRC (def_set);
-  use_pat = PATTERN (use_insn);
+  for (insn = from; insn != to; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
 
-  /* In __asm don't replace if src might need more registers than
-     reg, as that could increase register pressure on the __asm.  */
-  use_vec = DF_INSN_USES (def_insn);
-  if (use_vec[0] && use_vec[1])
-    return false;
+      note_stores (PATTERN (insn), mems_modified_p, &modified);
+      if (modified)
+	break;
+      if ((modified = CALL_P (insn)))
+	break;
+    }
+  gcc_assert (insn);
+  return modified;
+}
+
+int
+reg_mentioned_num (const_rtx reg, const_rtx in)
+{
+  const char *fmt;
+  int i, num = 0;
+  enum rtx_code code;
+
+  if (in == 0)
+    return 0;
 
-  update_df_init (def_insn, use_insn);
-  speed_p = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
-  asm_operands = NULL_RTX;
-  switch (GET_CODE (use_pat))
+  if (reg == in)
+    return 1;
+
+  code = GET_CODE (in);
+
+  switch (code)
     {
-    case ASM_OPERANDS:
-      asm_operands = use_pat;
-      break;
-    case SET:
-      if (MEM_P (SET_DEST (use_pat)))
-	{
-	  loc = &SET_DEST (use_pat);
-	  new_rtx = propagate_rtx (*loc, GET_MODE (*loc), reg, src, speed_p);
-	  if (new_rtx)
-	    validate_unshare_change (use_insn, loc, new_rtx, true);
-	}
-      asm_operands = SET_SRC (use_pat);
-      break;
-    case PARALLEL:
-      for (i = 0; i < XVECLEN (use_pat, 0); i++)
-	if (GET_CODE (XVECEXP (use_pat, 0, i)) == SET)
-	  {
-	    if (MEM_P (SET_DEST (XVECEXP (use_pat, 0, i))))
-	      {
-		loc = &SET_DEST (XVECEXP (use_pat, 0, i));
-		new_rtx = propagate_rtx (*loc, GET_MODE (*loc), reg,
-					 src, speed_p);
-		if (new_rtx)
-		  validate_unshare_change (use_insn, loc, new_rtx, true);
-	      }
-	    asm_operands = SET_SRC (XVECEXP (use_pat, 0, i));
-	  }
-	else if (GET_CODE (XVECEXP (use_pat, 0, i)) == ASM_OPERANDS)
-	  asm_operands = XVECEXP (use_pat, 0, i);
-      break;
+      /* Compare registers by number.  */
+    case REG:
+      return REG_P (reg) && REGNO (in) == REGNO (reg);
+
+      /* These codes have no constituent expressions
+	 and are unique.  */
+    case SCRATCH:
+    case CC0:
+    case PC:
+
+      /* Skip expr list. */
+    case EXPR_LIST:
+      return 0;
+
+    CASE_CONST_ANY:
+      /* These are kept unique for a given value.  */
+      return 0;
+
     default:
-      gcc_unreachable ();
+      break;
     }
 
-  gcc_assert (asm_operands && GET_CODE (asm_operands) == ASM_OPERANDS);
-  for (i = 0; i < ASM_OPERANDS_INPUT_LENGTH (asm_operands); i++)
-    {
-      loc = &ASM_OPERANDS_INPUT (asm_operands, i);
-      new_rtx = propagate_rtx (*loc, GET_MODE (*loc), reg, src, speed_p);
-      if (new_rtx)
-	validate_unshare_change (use_insn, loc, new_rtx, true);
-    }
+  if (GET_CODE (reg) == code && rtx_equal_p (reg, in))
+    return 1;
 
-  if (num_changes_pending () == 0 || !apply_change_group ())
-    return false;
+  fmt = GET_RTX_FORMAT (code);
 
-  update_df (use_insn, NULL);
-  num_changes++;
-  return true;
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'E')
+	{
+	  int j;
+	  for (j = XVECLEN (in, i) - 1; j >= 0; j--)
+	    num += reg_mentioned_num (reg, XVECEXP (in, i, j));
+	}
+      else if (fmt[i] == 'e')
+	num += reg_mentioned_num (reg, XEXP (in, i));
+    }
+  return num;
 }
 
 /* Try to replace USE with SRC (defined in DEF_INSN) and simplify the
@@ -1231,14 +1078,9 @@ forward_propagate_and_simplify (df_ref u
   rtx use_insn = DF_REF_INSN (use);
   rtx use_set = single_set (use_insn);
   rtx src, reg, new_rtx, *loc;
-  bool set_reg_equal;
   enum machine_mode mode;
-  int asm_use = -1;
-
-  if (INSN_CODE (use_insn) < 0)
-    asm_use = asm_noperands (PATTERN (use_insn));
 
-  if (!use_set && asm_use < 0 && !DEBUG_INSN_P (use_insn))
+  if (!use_set && !DEBUG_INSN_P (use_insn))
     return false;
 
   /* Do not propagate into PC, CC0, etc.  */
@@ -1288,21 +1130,19 @@ forward_propagate_and_simplify (df_ref u
       return false;
     }
 
-  if (asm_use >= 0)
-    return forward_propagate_asm (use, def_insn, def_set, reg);
+  /* if only new_rtx contains varying mem, we cannot do propagation
+     safely because we do nothing mem related def-use analysis */
+  if (for_each_rtx (&src, varying_mem_p, NULL)
+      && (mem_may_be_modified (def_insn, use_insn)
+	  || volatile_refs_p (src)))
+    return false;
 
   /* Else try simplifying.  */
 
   if (DF_REF_TYPE (use) == DF_REF_REG_MEM_STORE)
-    {
-      loc = &SET_DEST (use_set);
-      set_reg_equal = false;
-    }
+    loc = &SET_DEST (use_set);
   else if (!use_set)
-    {
-      loc = &INSN_VAR_LOCATION_LOC (use_insn);
-      set_reg_equal = false;
-    }
+    loc = &INSN_VAR_LOCATION_LOC (use_insn);
   else
     {
       rtx note = find_reg_note (use_insn, REG_EQUAL, NULL_RTX);
@@ -1310,22 +1150,6 @@ forward_propagate_and_simplify (df_ref u
 	loc = &XEXP (note, 0);
       else
 	loc = &SET_SRC (use_set);
-
-      /* Do not replace an existing REG_EQUAL note if the insn is not
-	 recognized.  Either we're already replacing in the note, or we'll
-	 separately try plugging the definition in the note and simplifying.
-	 And only install a REQ_EQUAL note when the destination is a REG
-	 that isn't mentioned in USE_SET, as the note would be invalid
-	 otherwise.  We also don't want to install a note if we are merely
-	 propagating a pseudo since verifying that this pseudo isn't dead
-	 is a pain; moreover such a note won't help anything.  */
-      set_reg_equal = (note == NULL_RTX
-		       && REG_P (SET_DEST (use_set))
-		       && !REG_P (src)
-		       && !(GET_CODE (src) == SUBREG
-			    && REG_P (SUBREG_REG (src)))
-		       && !reg_mentioned_p (SET_DEST (use_set),
-					    SET_SRC (use_set)));
     }
 
   if (GET_MODE (*loc) == VOIDmode)
@@ -1339,7 +1163,7 @@ forward_propagate_and_simplify (df_ref u
   if (!new_rtx)
     return false;
 
-  return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
+  return try_fwprop_subst (use, loc, new_rtx);
 }
 
 
@@ -1402,7 +1226,6 @@ forward_propagate_into (df_ref use)
   return false;
 }
 
-\f
 static void
 fwprop_init (void)
 {
@@ -1417,11 +1240,6 @@ fwprop_init (void)
 
   build_single_def_use_links ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
-
-  active_defs = XNEWVEC (df_ref, max_reg_num ());
-#ifdef ENABLE_CHECKING
-  active_defs_check = sparseset_alloc (max_reg_num ());
-#endif
 }
 
 static void
@@ -1430,19 +1248,10 @@ fwprop_done (void)
   loop_optimizer_finalize ();
 
   use_def_ref.release ();
-  free (active_defs);
-#ifdef ENABLE_CHECKING
-  sparseset_free (active_defs_check);
-#endif
 
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());
-
-  if (dump_file)
-    fprintf (dump_file,
-	     "\nNumber of successful forward propagations: %d\n\n",
-	     num_changes);
 }
 
 
@@ -1454,31 +1263,130 @@ gate_fwprop (void)
   return optimize > 0 && flag_forward_propagate;
 }
 
+static bool
+iterate_def_uses (df_ref def, bool fwprop_addr)
+{
+  int use_num = 0;
+  int def_insn_cost = 0;
+  rtx def_insn, use_insn;
+  struct df_link *uses;
+  int reg_replaced_num = 0;
+  bool all_uses_replaced;
+  bool speed;
+
+  def_insn = DF_REF_INSN (def);
+  speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (def_insn));
+
+  if (def_insn)
+  {
+    rtx set = single_set (def_insn);
+    if (set)
+      def_insn_cost = set_src_cost (SET_SRC (set), speed)
+		      + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      return false;
+  }
+
+  if (dump_file)
+    {
+      fprintf(dump_file, "\n------------------------\n");
+      fprintf(dump_file, "Def %d:\n", INSN_UID (def_insn));
+    }
+
+  for (uses = DF_REF_CHAIN (def), use_num = 0;
+     uses; uses = uses->next)
+  {
+    int old_reg_num, new_reg_num;
+
+    df_ref use = uses->ref;
+    if (DF_REF_IS_ARTIFICIAL (use))
+	continue;
+
+    use_insn = DF_REF_INSN (use);
+    if (!NONDEBUG_INSN_P (use_insn))
+	continue;
+
+    if (dump_file)
+      fprintf(dump_file, "\tUse %d\n", INSN_UID (use_insn));
+
+    if (fwprop_addr)
+      {
+	if (DF_REF_TYPE (use) != DF_REF_REG_USE
+	    && DF_REF_BB (use)->loop_father != NULL
+	    /* The outer most loop is not really a loop.  */
+	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
+	  forward_propagate_into (use);
+      }
+    else
+      {
+	if (DF_REF_TYPE (use) == DF_REF_REG_USE
+	    || DF_REF_BB (use)->loop_father == NULL
+	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
+	  {
+	    old_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+
+	    forward_propagate_into (use);
+
+	    new_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+	    reg_replaced_num += old_reg_num - new_reg_num;
+	  }
+      }
+    use_num++;
+  }
+
+  if (!use_num)
+    return false;
+
+  if (fwprop_addr)
+    return confirm_change_group_by_cost (false,
+					 0,
+					 false);
+  else
+    {
+      all_uses_replaced = (use_num == reg_replaced_num);
+      return confirm_change_group_by_cost (all_uses_replaced,
+					   def_insn_cost,
+					   true);
+    }
+}
+
 static unsigned int
 fwprop (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
-  fwprop_init ();
+  if (flag_fwprop_func && strcmp(current_function_name(), flag_fwprop_func))
+    return 0;
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop ==============\n");
 
-     Do not forward propagate addresses into loops until after unrolling.
-     CSE did so because it was able to fix its own mess, but we are not.  */
+  fwprop_init ();
 
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	    || DF_REF_BB (use)->loop_father == NULL
-	    /* The outer most loop is not really a loop.  */
-	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+            continue;
+
+          for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+              result = iterate_def_uses (*def_vec, false);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
+
   fwprop_done ();
   if (need_cleanup)
     cleanup_cfg (0);
@@ -1510,22 +1418,37 @@ struct rtl_opt_pass pass_rtl_fwprop =
 static unsigned int
 fwprop_addr (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (flag_fwprop_func && strcmp(current_function_name(), flag_fwprop_func))
+    return 0;
+
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop_addr ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.  */
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) != DF_REF_REG_USE
-	    && DF_REF_BB (use)->loop_father != NULL
-	    /* The outer most loop is not really a loop.  */
-	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+        {
+          if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+            continue;
+
+          for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+              result = iterate_def_uses (*def_vec, true);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
   fwprop_done ();
Index: gcc/common.opt
===================================================================
--- gcc/common.opt	(revision 196270)
+++ gcc/common.opt	(working copy)
@@ -1138,6 +1138,10 @@ fforward-propagate
 Common Report Var(flag_forward_propagate) Optimization
 Perform a forward propagation pass on RTL
 
+ffwprop=
+Common Driver JoinedOrMissing RejectNegative Var(flag_fwprop_func)
+-ffwprop=<func>		Only turn on forward propagation for the specified func.
+
 ffp-contract=
 Common Joined RejectNegative Enum(fp_contract_mode) Var(flag_fp_contract_mode) Init(FP_CONTRACT_FAST)
 -ffp-contract=[off|on|fast] Perform floating-point expression contraction.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-25 23:32 extend fwprop optimization Wei Mi
@ 2013-02-26  0:08 ` Steven Bosscher
  2013-02-26  1:12   ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Bosscher @ 2013-02-26  0:08 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li

On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote:
> We also take insn splitting and peephole into consideration,
> .i.e, the cost of the change is the cost after insn splitting and
> peephole which may be applied to the insn changed. This is useful for
> the motivational case,

It also goes against everything ever done before in RTL land.

If you need early splitting like this, then these insns should
probably be splitted earlier. And trying to apply peephole2's at this
stage is, ehm, very creative... I'm surprised it works at all,
peephole2 is supposed to work on strict RTL (i.e. post-reload, all
constraints matched, hard regiters only, etc.). NB, you can't use
peephole2 unconditionally, not all targets have them. See
HAVE_peephole2.

Can you explain step-by-step what is going on, that you need this?

> -      break;
> +      /* Compare registers by number.  */
> +    case REG:
> +      return REG_P (reg) && REGNO (in) == REGNO (reg);

This will not work for hard registers.

FWIW, en passant you've made fwprop quadratic in the number of insns
in a basic block, in initialize_before_estimate_peephole potentially
calling simulate_backwards_to_point repeatedly on every insn in a
basic block.

Also: no ChangeLog, not following code style conventions, no comments,
entire blocks of recently added code disappearing (e.g. the "Do not
replace an existing REG_EQUAL note" stuff)...

Don't mean to be too harsh, but in this form I'm not going to look at it.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-26  0:08 ` Steven Bosscher
@ 2013-02-26  1:12   ` Wei Mi
  2013-02-26 11:00     ` Steven Bosscher
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-02-26  1:12 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li

Hi,

On Mon, Feb 25, 2013 at 4:08 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Feb 26, 2013 at 12:32 AM, Wei Mi wrote:
>> We also take insn splitting and peephole into consideration,
>> .i.e, the cost of the change is the cost after insn splitting and
>> peephole which may be applied to the insn changed. This is useful for
>> the motivational case,
>
> It also goes against everything ever done before in RTL land.
>
> If you need early splitting like this, then these insns should
> probably be splitted earlier. And trying to apply peephole2's at this
> stage is, ehm, very creative... I'm surprised it works at all,
> peephole2 is supposed to work on strict RTL (i.e. post-reload, all
> constraints matched, hard regiters only, etc.). NB, you can't use
> peephole2 unconditionally, not all targets have them. See
> HAVE_peephole2.
>
> Can you explain step-by-step what is going on, that you need this?
>

I am not trying to actually do the transformation of split and
peephole. Just want to know how does the insn look like after the
split and peephole, then we can decide whether to do fwprop based on
more precise cost caculation.

For the motivational case
IR before fwprop:
(insn 18 17 19 2 (parallel [
            (set (reg:SI 75 [ D.2322 ])
                (and:SI (reg:SI 88 [ D.2325 ])
                    (const_int 63 [0x3f])))
            (clobber (reg:CC 17 flags))
        ]) 1.c:25 402 {*andsi_1}
     (expr_list:REG_DEAD (reg:SI 88 [ D.2325 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 22 21 23 2 (parallel [
            (set (reg:DI 91 [ D.2324 ])
                (ashift:DI (reg:DI 71 [ D.2324 ])
                    (subreg:QI (reg:SI 75 [ D.2322 ]) 0)))
            (clobber (reg:CC 17 flags))
        ]) 1.c:25 522 {*ashldi3_1}
     (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 23 22 24 2 (parallel [
            (set (reg:DI 92 [ D.2324 ])
                (lshiftrt:DI (reg:DI 91 [ D.2324 ])
                    (subreg:QI (reg:SI 75 [ D.2322 ]) 0)))
            (clobber (reg:CC 17 flags))
        ]) 1.c:25 556 {*lshrdi3_1}
     (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ])
        (expr_list:REG_DEAD (reg:SI 75 [ D.2322 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))

After propagation, we get
(insn 22 21 23 2 (parallel [
                (set (reg:DI 91 [ D.2324 ])
                    (ashift:DI (reg:DI 71 [ D.2324 ])
                        (subreg:QI (and:SI (reg:SI 88 [ D.2325 ])
                                (const_int 63 [0x3f])) 0)))
                (clobber (reg:CC 17 flags))
            ]) 1.c:25 518 {*ashldi3_mask}
         (expr_list:REG_DEAD (reg:DI 71 [ D.2324 ])
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil))))
(insn 23 22 24 2 (parallel [
                (set (reg:DI 92 [ D.2324 ])
                    (lshiftrt:DI (reg:DI 91 [ D.2324 ])
                        (subreg:QI (and:SI (reg:SI 88 [ D.2325 ])
                                (const_int 63 [0x3f])) 0)))
                (clobber (reg:CC 17 flags))
            ]) 1.c:25 539 {*lshrdi3_mask}
         (expr_list:REG_DEAD (reg:DI 91 [ D.2324 ])
            (expr_list:REG_DEAD (reg:SI 75 [ D.2322 ])
                (expr_list:REG_UNUSED (reg:CC 17 flags)
                    (nil)))))

But it is not a good transformation unless we know insn split will
change a << (b & 63) to a << b; Here we want to see what the rtl looks
like after insn splitting in fwprop cost estimation (We call
split_insns in estimate_split_and_peephole(), but not to do insn
splitting actually in this phase). After checking the result of
split_insns, we decide it is beneficial to do the propagation here
because insn splitting will optimize the fwprop results.

After insn splitting.
(insn 39 21 40 2 (parallel [
            (set (reg:DI 92 [ D.2324 ])
                (ashift:DI (reg:DI 71 [ D.2324 ])
                    (subreg:QI (reg:SI 88 [ D.2325 ]) 0)))
            (clobber (reg:CC 17 flags))
        ]) 1.c:25 -1
     (nil))
(insn 40 39 24 2 (parallel [
            (set (reg:DI 92 [ D.2324 ])
                (lshiftrt:DI (reg:DI 92 [ D.2324 ])
                    (subreg:QI (reg:SI 88 [ D.2325 ]) 0)))
            (clobber (reg:CC 17 flags))
        ]) 1.c:25 -1
     (nil))

Similarly I think if we can know what the insn looks like after
peephole, it will be better for fwprop. But I don't have a testcase to
prove it is better to consider peephole. Maybe I should do some
testing how much benefit we will get if consider peephole impact in
fwprop.

>> -      break;
>> +      /* Compare registers by number.  */
>> +    case REG:
>> +      return REG_P (reg) && REGNO (in) == REGNO (reg);
>
> This will not work for hard registers.
>
> FWIW, en passant you've made fwprop quadratic in the number of insns
> in a basic block, in initialize_before_estimate_peephole potentially
> calling simulate_backwards_to_point repeatedly on every insn in a
> basic block.

Yes. I may need to evaluate the benefit of considering peephole impact
in fwprop and its compilation time cost.

>
> Also: no ChangeLog, not following code style conventions, no comments,
> entire blocks of recently added code disappearing (e.g. the "Do not
> replace an existing REG_EQUAL note" stuff)...
>
> Don't mean to be too harsh, but in this form I'm not going to look at it.

Sorry, I sent it out for an early discussion, so I neglect ChangeLog,
code style...  I will fix them and add more comments. About the
REG_EQUAL and forward_propagate_asm deleted, my code hasn't dealed
with them in a good way, so I just deleted them in the patch for easy
discussion. But they will be added back soon. I was looking into the
tricky bug about REG_EQUAL you fixed not a long ago these two days to
understand how to deal with REG_EQUAL.

>
> Ciao!
> Steven

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-26  1:12   ` Wei Mi
@ 2013-02-26 11:00     ` Steven Bosscher
  2013-02-27 18:37       ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Bosscher @ 2013-02-26 11:00 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li, Uros Bizjak

On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote:
> But it is not a good transformation unless we know insn split will
> change a << (b & 63) to a << b; Here we want to see what the rtl looks
> like after insn splitting in fwprop cost estimation (We call
> split_insns in estimate_split_and_peephole(), but not to do insn
> splitting actually in this phase).

So you're splitting to find out that the shift is truncated to 5 or 6
bits. That looks like what you really want is to have
SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for
i386:

/* Define if shifts truncate the shift count which implies one can
   omit a sign-extension or zero-extension of a shift count.

   On i386, shifts do truncate the count.  But bit test instructions
   take the modulo of the bit offset operand.  */

/* #define SHIFT_COUNT_TRUNCATED */

Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and
take an rtx_code (or a pattern) to let the target decide whether a
truncation is applicable or not.

This is a target thing, so perhaps Uros has some ideas about this.

I'm guessing cse.c would then handle your code transformation already,
or can be made to do so without a lot of extra work, e.g. teach
fold_rtx about such (shift (...) (and (...))) transformations that are
really truncations.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-26 11:00     ` Steven Bosscher
@ 2013-02-27 18:37       ` Wei Mi
  2013-02-27 21:22         ` Steven Bosscher
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-02-27 18:37 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

On Tue, Feb 26, 2013 at 2:59 AM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Feb 26, 2013 at 2:12 AM, Wei Mi wrote:
>> But it is not a good transformation unless we know insn split will
>> change a << (b & 63) to a << b; Here we want to see what the rtl looks
>> like after insn splitting in fwprop cost estimation (We call
>> split_insns in estimate_split_and_peephole(), but not to do insn
>> splitting actually in this phase).
>
> So you're splitting to find out that the shift is truncated to 5 or 6
> bits. That looks like what you really want is to have
> SHIFT_COUNT_TRUNCATED working for your target. It isn't defined for
> i386:
>
> /* Define if shifts truncate the shift count which implies one can
>    omit a sign-extension or zero-extension of a shift count.
>
>    On i386, shifts do truncate the count.  But bit test instructions
>    take the modulo of the bit offset operand.  */
>
> /* #define SHIFT_COUNT_TRUNCATED */
>
> Perhaps SHIFT_COUNT_TRUNCATED should be turned into a target hook, and
> take an rtx_code (or a pattern) to let the target decide whether a
> truncation is applicable or not.
>
> This is a target thing, so perhaps Uros has some ideas about this.
>
> I'm guessing cse.c would then handle your code transformation already,
> or can be made to do so without a lot of extra work, e.g. teach
> fold_rtx about such (shift (...) (and (...))) transformations that are
> really truncations.

Thanks for pointing out fold_rtx. I took a look at it and cse
yesterday, and I agreed with you fold_rtx could be extended to handle
the motivational case. But I still think fwprop extension could be
meaningful generally.

1. fold_rtx doesn't handling all the propagation-simplification tasks.
It only handles some typical cases. I think cse doesn't want to be
very cumbersome to include all the fwprop's functionality. fwprop
extension tries to generally handle the propagation-simplification
problem. I think cse contains fold_rtx partially because existing
fwprop and combine are not ideal. If fwprop could handle the general
case, cse could simply try to find common subexpression.

2. fold_rtx does the simplification only based on the current insn,
while fwprop extension tries to consider the def-uses group in a
whole. When all the uses could be propagated, we have the choices: a)
do all the propagations then delete the def insn, even if some
propagations may not be beneficial. b) only select beneficial
propagations and leave the def insn there. fwprop extension has a cost
model to choose which way to go.

What do you think?

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-27 18:37       ` Wei Mi
@ 2013-02-27 21:22         ` Steven Bosscher
  2013-02-27 21:56           ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Bosscher @ 2013-02-27 21:22 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li, Uros Bizjak

On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
> What do you think?

I think you'll not be able to teach fold_rtx to perform the
transformation you want it to do without having SHIFT_COUNT_TRUNCATED
set for i386. I already tried it the other day, but GCC won't do the
truncation without knowing the insn is really a shift insn and
shift_truncation_mask returns something useful.

Ciao!
Steven


Index: cse.c
===================================================================
--- cse.c       (revision 196182)
+++ cse.c       (working copy)
@@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)

        switch (GET_CODE (folded_arg))
          {
+         case SUBREG:
+           /* If the SUBREG_REG comes in from an AND, and this is not a
+              paradoxical subreg, then try to fold the SUBREG.  */
+           if (REG_P (SUBREG_REG (folded_arg))
+               && ! paradoxical_subreg_p (folded_arg))
+             {
+               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
+               if (y != 0)
+                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
+                                         XEXP(y, 0), XEXP(y, 1));
+               if (y != 0)
+                 folded_arg = y;
+             }
+           /* ... fall through ...  */
          case MEM:
          case REG:
-         case SUBREG:
            const_arg = equiv_constant (folded_arg);
            break;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-27 21:22         ` Steven Bosscher
@ 2013-02-27 21:56           ` Wei Mi
  2013-03-11  5:52             ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-02-27 21:56 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

Yes, I agree with you. fold_rtx also needs to be extended because now
it only handles the case similar as follows for shift insn:
  a = b op const1
  c = a >> const2
for our motivational case, the second operand of the first insn is a
reg instead of a const. We also need to add the truncation support for
our case in simplify_binary_operation.

I will send out a more official patch about fwprop extension soon.
Then it may be easier to talk about its rationality.

Thanks,
Wei.

On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
>> What do you think?
>
> I think you'll not be able to teach fold_rtx to perform the
> transformation you want it to do without having SHIFT_COUNT_TRUNCATED
> set for i386. I already tried it the other day, but GCC won't do the
> truncation without knowing the insn is really a shift insn and
> shift_truncation_mask returns something useful.
>
> Ciao!
> Steven
>
>
> Index: cse.c
> ===================================================================
> --- cse.c       (revision 196182)
> +++ cse.c       (working copy)
> @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)
>
>         switch (GET_CODE (folded_arg))
>           {
> +         case SUBREG:
> +           /* If the SUBREG_REG comes in from an AND, and this is not a
> +              paradoxical subreg, then try to fold the SUBREG.  */
> +           if (REG_P (SUBREG_REG (folded_arg))
> +               && ! paradoxical_subreg_p (folded_arg))
> +             {
> +               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
> +               if (y != 0)
> +                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
> +                                         XEXP(y, 0), XEXP(y, 1));
> +               if (y != 0)
> +                 folded_arg = y;
> +             }
> +           /* ... fall through ...  */
>           case MEM:
>           case REG:
> -         case SUBREG:
>             const_arg = equiv_constant (folded_arg);
>             break;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-02-27 21:56           ` Wei Mi
@ 2013-03-11  5:52             ` Wei Mi
  2013-03-11 18:10               ` Jeff Law
  2013-03-11 19:52               ` Steven Bosscher
  0 siblings, 2 replies; 29+ messages in thread
From: Wei Mi @ 2013-03-11  5:52 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 3863 bytes --]

Hi,

This is the fwprop extension patch which is put in order. Regression
test and bootstrap pass. Please help to review its rationality. The
following is a brief description what I have done in the patch.

In order to make fwprop more effective in rtl optimization, we extend
it to handle general expressions instead of the three cases listed in
the head comment in fwprop.c. The major changes include a) We need to
check propagation correctness for src exprs of def which contain mem
references. Previous fwprop for the three cases above doesn't have the
problem. b) We need a better cost model because the benefit is usually
not so apparent as the three cases above.

For a general fwprop problem, there are two possible sources where
benefit comes from. The frist is the new use insn after propagation
and simplification may have lower cost than itself before propagation,
or propagation may create a new insn, that could be splitted or
peephole optimized later and get a lower cost. The second is that if
all the uses are replaced with the src of the def insn, the def insn
could be deleted.

So instead of check each def-use pair independently, we use DU chain
to track all the uses for a def. For each def-use pair, we attempt the
propagation, record the change candidate in changes[] array, but we
wait to confirm the changes until all the pairs with the same def are
iterated. The changes confirmation is done in the func
confirm_change_group_by_cost. We only do this for fwprop. For
fwprop_addr, the benefit of each change is ensured by
propagation_rtx_1 using should_replace_address, so we just confirm all
the changes without checking benefit again.

Thanks,
Wei.

On Wed, Feb 27, 2013 at 1:56 PM, Wei Mi <wmi@google.com> wrote:
> Yes, I agree with you. fold_rtx also needs to be extended because now
> it only handles the case similar as follows for shift insn:
>   a = b op const1
>   c = a >> const2
> for our motivational case, the second operand of the first insn is a
> reg instead of a const. We also need to add the truncation support for
> our case in simplify_binary_operation.
>
> I will send out a more official patch about fwprop extension soon.
> Then it may be easier to talk about its rationality.
>
> Thanks,
> Wei.
>
> On Wed, Feb 27, 2013 at 1:21 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Wed, Feb 27, 2013 at 7:37 PM, Wei Mi wrote:
>>> What do you think?
>>
>> I think you'll not be able to teach fold_rtx to perform the
>> transformation you want it to do without having SHIFT_COUNT_TRUNCATED
>> set for i386. I already tried it the other day, but GCC won't do the
>> truncation without knowing the insn is really a shift insn and
>> shift_truncation_mask returns something useful.
>>
>> Ciao!
>> Steven
>>
>>
>> Index: cse.c
>> ===================================================================
>> --- cse.c       (revision 196182)
>> +++ cse.c       (working copy)
>> @@ -3179,9 +3179,22 @@ fold_rtx (rtx x, rtx insn)
>>
>>         switch (GET_CODE (folded_arg))
>>           {
>> +         case SUBREG:
>> +           /* If the SUBREG_REG comes in from an AND, and this is not a
>> +              paradoxical subreg, then try to fold the SUBREG.  */
>> +           if (REG_P (SUBREG_REG (folded_arg))
>> +               && ! paradoxical_subreg_p (folded_arg))
>> +             {
>> +               rtx y = lookup_as_function (SUBREG_REG (folded_arg), AND);
>> +               if (y != 0)
>> +                 y = simplify_gen_binary(AND, GET_MODE (folded_arg),
>> +                                         XEXP(y, 0), XEXP(y, 1));
>> +               if (y != 0)
>> +                 folded_arg = y;
>> +             }
>> +           /* ... fall through ...  */
>>           case MEM:
>>           case REG:
>> -         case SUBREG:
>>             const_arg = equiv_constant (folded_arg);
>>             break;

[-- Attachment #2: ChangeLog --]
[-- Type: application/octet-stream, Size: 3756 bytes --]

2013-03-08  Wei Mi  <wmi@google.com>

	* fwprop.c (build_single_def_use_links): Add DU chain problem in df.
	(propagate_rtx_1): Remove PR_HANDLE_MEM.
	(varying_mem_p): Add call a kind of varying mem.
	(propagate_rtx): Remove PR_CAN_APPEAR and PR_HANDLE_MEM.
	(register_active_defs): Deleted.
	(update_df_init): Deleted.
	(update_uses): Deleted.
	(update_df): Deleted.
	(try_fwprop_subst): Extract the confirmation part to a separate func.
	(forward_propagate_subreg): Change the args of try_fwprop_subst.
	(mems_modified_p): New. Check whether dest is a mem.
	(mem_may_be_modified): New. Check if mem modified in an insn range.
	(reg_mentioned_num): New. How many times a reg appear.
	(def_return_reg): New. Whether the set define a return reg.
	(forward_propagate_asm): Make asm propagations being applied
	separately.
	(forward_propagate_and_simplify): Check propagation correctness
	if new_rtx contains varying mem.
	(fwprop_init): Remove active_defs and active_defs_check.
	(fwprop_done): Likewise.
	(iterate_def_uses): New. Iterate all the uses connecting to a def.
	(fwprop): Iterate all the defs instead of all the uses.
	(fwprop_addr): Likewise.
	* recog.c (validate_change_1): Add fields for change_t.
	(confirm_change_group): Add a param.
	(set_change_verified): Add a change_t interface.
	(set_change_benefit): Likewise.
	(set_change_equal_note): Likewise.
	(set_change_associated_with_last): Likewise.
	(update_df): New. Update def/use references after insn changes.
	(estimate_seq_cost): New. Estimate the cost of an insn seq.
	(estimate_split_and_peephole): New. Estimate the cost of
	split and peephole result.
	(confirm_change_one_by_one): New. Confirm each change separately.
	(confirm_change_group_by_cost): New. Confirm changes based on
	a cost model.
	(apply_change_group): Add a param.
	(cancel_changes): Add REG_EQUAL note according to equal_note field.
	(validate_replace_rtx_subexp): Add a param for apply_change_group.
	(validate_replace_rtx): Likewise.
	(validate_replace_rtx_part): Likewise.
	(validate_replace_rtx_part_nosimplify): Likewise.
	(validate_simplify_insn): Likewise.
	(initialize_before_estimate_peephole): New. Preparation work
	for calling peephole2_insns.
	(peep2_reg_dead_p): Special handling when it is called in
	fwprop phase.
	(peep2_find_free_register): Likewise.
	* recog.h: Add some func prototypes.
	* config/i386/i386.c (ix86_expand_clear): Special handling when it
	is called in fwprop phase.
	* config/i386/i386.md: Likewise.
	* cprop.c (try_replace_reg): Add a param for apply_change_group.
	* cse.c (fold_rtx): Likewise.
	(try_back_substitute_reg): Likewise.
	(canonicalize_insn): Likewise.
	(cse_insn): Likewise.
	(cse_change_cc_mode_insn): Likewise.
	* postreload.c (reload_cse_simplify): Likewise.
	(reload_cse_simplify_operands): Likewise.
	(reload_combine_recognize_pattern): Likewise.
	* lower-subreg.c (resolve_simple_move): Likewise.
	(decompose_multiword_subregs): Likewise.
	* combine-stack-adj.c (try_apply_stack_adjustment): Likewise.
	* regmove.c (try_auto_increment): Likewise.
	(optimize_reg_copy_3): Likewise.
	* loop-unroll.c (expand_var_during_unrolling): Likewise.
	* regcprop.c (apply_debug_insn_changes): Likewise.
	(copyprop_hardreg_forward_1): Likewise.
	* ree.c (combine_reaching_defs): Likewise.
	* ifcvt.c (cond_exec_process_if_block): Likewise.
	(dead_or_predicable): Likewise.
	* loop-invariant.c (replace_uses): Likewise.
	(move_invariant_reg): Likewise.
	* lra-eliminations.c (eliminate_regs_in_insn): Likewise.
	* reload1.c (reload_as_needed): Likewise.
	* compare-elim.c (try_eliminate_compare): Likewise.
	* jump.c (redirect_jump): Likewise.
	(redirect_jump_2): Likewise.
	(invert_jump): Likewise.
	(true_regnum): Special handling when it is called in fwprop phase.


[-- Attachment #3: patch --]
[-- Type: application/octet-stream, Size: 66095 bytes --]

Index: fwprop.c
===================================================================
--- fwprop.c	(revision 196270)
+++ fwprop.c	(working copy)
@@ -39,6 +39,7 @@ along with GCC; see the file COPYING3.
 #include "domwalk.h"
 #include "emit-rtl.h"
 
+#include "tree.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -112,6 +113,35 @@ along with GCC; see the file COPYING3.
    I just punt and record only singleton use-def chains, which is
    all that is needed by fwprop.  */
 
+/* In order to make fwprop more effective in rtl optimization, we
+   extend it to handle general expressions instead of only three cases
+   above. The major changes include a) We need to check propagation
+   correctness for src exprs of def which contain mem reference.
+   Previous fwprop for the three cases above doesn't have the problem.
+   b) We need a better cost model because the benefit is usually
+   not so apparent as the three cases above.
+
+   For a general fwprop problem, there are two possible sources where
+   benefit comes from. The frist is the new use insn after propagation
+   and simplification may have lower cost than itself before propagation,
+   or propagation may create a new insn, that could be splitted or peephole
+   optimized later and get a lower cost. The second is that if all the
+   uses are replaced with the src of the def insn, the def insn could
+   be deleted.
+
+   So instead of check each def-use pair independently, we use DU chain to
+   track all the uses for a def. For each def-use pair, we attempt the
+   propagation, record the change candidate in changes[] array, but we
+   wait to confirm the changes until all the pairs with the same def are
+   iterated. The changes confirmation is done in the func
+   confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr,
+   the benefit of each change is ensured by propagation_rtx_1 using
+   should_replace_address, so we just confirm all the changes without
+   checking benefit again.
+
+   Other changes:
+   We think the maintainance for use_def_ref vector is not necessary, so
+   we remove update_df/update_uses/update_df_init/register_active_defs.  */
 
 static int num_changes;
 
@@ -274,10 +304,14 @@ build_single_def_use_links (void)
   /* We use the multiple definitions problem to compute our restricted
      use-def chains.  */
   df_set_flags (DF_EQ_NOTES);
+  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
+     estimation in estimate_split_and_peephole.  */
+  df_set_flags (DF_LR_RUN_DCE);
   df_md_add_problem ();
   df_note_add_problem ();
-  df_analyze ();
+  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
   df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
+  df_analyze ();
 
   use_def_ref.create (DF_USES_TABLE_SIZE ());
   use_def_ref.safe_grow_cleared (DF_USES_TABLE_SIZE ());
@@ -412,36 +446,6 @@ should_replace_address (rtx old_rtx, rtx
   return (gain > 0);
 }
 
-
-/* Flags for the last parameter of propagate_rtx_1.  */
-
-enum {
-  /* If PR_CAN_APPEAR is true, propagate_rtx_1 always returns true;
-     if it is false, propagate_rtx_1 returns false if, for at least
-     one occurrence OLD, it failed to collapse the result to a constant.
-     For example, (mult:M (reg:M A) (minus:M (reg:M B) (reg:M A))) may
-     collapse to zero if replacing (reg:M B) with (reg:M A).
-
-     PR_CAN_APPEAR is disregarded inside MEMs: in that case,
-     propagate_rtx_1 just tries to make cheaper and valid memory
-     addresses.  */
-  PR_CAN_APPEAR = 1,
-
-  /* If PR_HANDLE_MEM is not set, propagate_rtx_1 won't attempt any replacement
-     outside memory addresses.  This is needed because propagate_rtx_1 does
-     not do any analysis on memory; thus it is very conservative and in general
-     it will fail if non-read-only MEMs are found in the source expression.
-
-     PR_HANDLE_MEM is set when the source of the propagation was not
-     another MEM.  Then, it is safe not to treat non-read-only MEMs as
-     ``opaque'' objects.  */
-  PR_HANDLE_MEM = 2,
-
-  /* Set when costs should be optimized for speed.  */
-  PR_OPTIMIZE_FOR_SPEED = 4
-};
-
-
 /* Replace all occurrences of OLD in *PX with NEW and try to simplify the
    resulting expression.  Replace *PX with a new RTL expression if an
    occurrence of OLD was found.
@@ -451,31 +455,20 @@ enum {
    that is because there is no simplify_gen_* function for LO_SUM).  */
 
 static bool
-propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
+propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
 {
   rtx x = *px, tem = NULL_RTX, op0, op1, op2;
   enum rtx_code code = GET_CODE (x);
   enum machine_mode mode = GET_MODE (x);
   enum machine_mode op_mode;
-  bool can_appear = (flags & PR_CAN_APPEAR) != 0;
   bool valid_ops = true;
 
-  if (!(flags & PR_HANDLE_MEM) && MEM_P (x) && !MEM_READONLY_P (x))
-    {
-      /* If unsafe, change MEMs to CLOBBERs or SCRATCHes (to preserve whether
-	 they have side effects or not).  */
-      *px = (side_effects_p (x)
-	     ? gen_rtx_CLOBBER (GET_MODE (x), const0_rtx)
-	     : gen_rtx_SCRATCH (GET_MODE (x)));
-      return false;
-    }
-
   /* If X is OLD_RTX, return NEW_RTX.  But not if replacing only within an
      address, and we are *not* inside one.  */
   if (x == old_rtx)
     {
       *px = new_rtx;
-      return can_appear;
+      return true;
     }
 
   /* If this is an expression, try recursive substitution.  */
@@ -484,7 +477,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_UNARY:
       op0 = XEXP (x, 0);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0))
 	return true;
       tem = simplify_gen_unary (code, mode, op0, op_mode);
@@ -494,8 +487,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_COMM_ARITH:
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_binary (code, mode, op0, op1);
@@ -506,8 +499,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
       op_mode = GET_MODE (op0) != VOIDmode ? GET_MODE (op0) : GET_MODE (op1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_relational (code, mode, op_mode, op0, op1);
@@ -519,9 +512,9 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op1 = XEXP (x, 1);
       op2 = XEXP (x, 2);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1) && op2 == XEXP (x, 2))
 	return true;
       if (op_mode == VOIDmode)
@@ -534,7 +527,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       if (code == SUBREG)
 	{
           op0 = XEXP (x, 0);
-	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0))
 	    return true;
 	  tem = simplify_gen_subreg (mode, op0, GET_MODE (SUBREG_REG (x)),
@@ -554,7 +547,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
 	  op0 = new_op0 = targetm.delegitimize_address (op0);
 	  valid_ops &= propagate_rtx_1 (&new_op0, old_rtx, new_rtx,
-					flags | PR_CAN_APPEAR);
+					speed);
 
 	  /* Dismiss transformation that we do not want to carry on.  */
 	  if (!valid_ops
@@ -569,7 +562,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (!(REG_P (old_rtx) && REG_P (new_rtx))
 	      && !should_replace_address (op0, new_op0, GET_MODE (x),
 					  MEM_ADDR_SPACE (x),
-	      			 	  flags & PR_OPTIMIZE_FOR_SPEED))
+	      			 	  speed))
 	    return true;
 
 	  tem = replace_equiv_address_nv (x, new_op0);
@@ -583,8 +576,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  /* The only simplification we do attempts to remove references to op0
 	     or make it constant -- in both cases, op0's invalidity will not
 	     make the result invalid.  */
-	  propagate_rtx_1 (&op0, old_rtx, new_rtx, flags | PR_CAN_APPEAR);
-	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+	  propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	    return true;
 
@@ -605,7 +598,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (rtx_equal_p (x, old_rtx))
 	    {
               *px = new_rtx;
-              return can_appear;
+              return true;
 	    }
 	}
       break;
@@ -620,10 +613,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
   *px = tem;
 
-  /* The replacement we made so far is valid, if all of the recursive
-     replacements were valid, or we could simplify everything to
-     a constant.  */
-  return valid_ops || can_appear || CONSTANT_P (tem);
+  return valid_ops;
 }
 
 
@@ -634,7 +624,7 @@ static int
 varying_mem_p (rtx *body, void *data ATTRIBUTE_UNUSED)
 {
   rtx x = *body;
-  return MEM_P (x) && !MEM_READONLY_P (x);
+  return (MEM_P (x) && !MEM_READONLY_P (x)) || CALL_P (x);
 }
 
 
@@ -652,27 +642,12 @@ propagate_rtx (rtx x, enum machine_mode
 {
   rtx tem;
   bool collapsed;
-  int flags;
 
   if (REG_P (new_rtx) && REGNO (new_rtx) < FIRST_PSEUDO_REGISTER)
     return NULL_RTX;
 
-  flags = 0;
-  if (REG_P (new_rtx)
-      || CONSTANT_P (new_rtx)
-      || (GET_CODE (new_rtx) == SUBREG
-	  && REG_P (SUBREG_REG (new_rtx))
-	  && (GET_MODE_SIZE (mode)
-	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
-    flags |= PR_CAN_APPEAR;
-  if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
-    flags |= PR_HANDLE_MEM;
-
-  if (speed)
-    flags |= PR_OPTIMIZE_FOR_SPEED;
-
   tem = x;
-  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), flags);
+  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), speed);
   if (tem == x || !collapsed)
     return NULL_RTX;
 
@@ -851,180 +826,71 @@ all_uses_available_at (rtx def_insn, rtx
   return true;
 }
 
-\f
-static df_ref *active_defs;
-#ifdef ENABLE_CHECKING
-static sparseset active_defs_check;
-#endif
-
-/* Fill the ACTIVE_DEFS array with the use->def link for the registers
-   mentioned in USE_REC.  Register the valid entries in ACTIVE_DEFS_CHECK
-   too, for checking purposes.  */
-
-static void
-register_active_defs (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      df_ref def = get_def_for_use (use);
-      int regno = DF_REF_REGNO (use);
-
-#ifdef ENABLE_CHECKING
-      sparseset_set_bit (active_defs_check, regno);
-#endif
-      active_defs[regno] = def;
-    }
-}
-
-
-/* Build the use->def links that we use to update the dataflow info
-   for new uses.  Note that building the links is very cheap and if
-   it were done earlier, they could be used to rule out invalid
-   propagations (in addition to what is done in all_uses_available_at).
-   I'm not doing this yet, though.  */
-
-static void
-update_df_init (rtx def_insn, rtx insn)
-{
-#ifdef ENABLE_CHECKING
-  sparseset_clear (active_defs_check);
-#endif
-  register_active_defs (DF_INSN_USES (def_insn));
-  register_active_defs (DF_INSN_USES (insn));
-  register_active_defs (DF_INSN_EQ_USES (insn));
-}
-
-
-/* Update the USE_DEF_REF array for the given use, using the active definitions
-   in the ACTIVE_DEFS array to match pseudos to their def. */
-
-static inline void
-update_uses (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      int regno = DF_REF_REGNO (use);
-
-      /* Set up the use-def chain.  */
-      if (DF_REF_ID (use) >= (int) use_def_ref.length ())
-        use_def_ref.safe_grow_cleared (DF_REF_ID (use) + 1);
-
-#ifdef ENABLE_CHECKING
-      gcc_assert (sparseset_bit_p (active_defs_check, regno));
-#endif
-      use_def_ref[DF_REF_ID (use)] = active_defs[regno];
-    }
-}
-
-
-/* Update the USE_DEF_REF array for the uses in INSN.  Only update note
-   uses if NOTES_ONLY is true.  */
-
-static void
-update_df (rtx insn, rtx note)
-{
-  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
-
-  if (note)
-    {
-      df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
-      df_notes_rescan (insn);
-    }
-  else
-    {
-      df_uses_create (&PATTERN (insn), insn, 0);
-      df_insn_rescan (insn);
-      update_uses (DF_INSN_INFO_USES (insn_info));
-    }
-
-  update_uses (DF_INSN_INFO_EQ_USES (insn_info));
-}
-
-
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
    substituting the whole SET_SRC, so we can set a REG_EQUAL note if the
-   new insn is not recognized.  Return whether the substitution was
-   performed.  */
+   new insn is not recognized. We record possible change in changes array,
+   and record their verifying result and calculated benefit.  */
 
 static bool
-try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, rtx def_insn, bool set_reg_equal)
+try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, bool set_reg_equal)
 {
   rtx insn = DF_REF_INSN (use);
   rtx set = single_set (insn);
-  rtx note = NULL_RTX;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
-  int old_cost = 0;
-  bool ok;
+  int old_cost = 0, benefit = 0;
+  int old_changes_num, new_changes_num;
 
-  update_df_init (def_insn, insn);
+  /* see when the insn is not a set  */
+  if (!set)
+    return false;
 
   /* forward_propagate_subreg may be operating on an instruction with
      multiple sets.  If so, assume the cost of the new instruction is
      not greater than the old one.  */
   if (set)
-    old_cost = set_src_cost (SET_SRC (set), speed);
-  if (dump_file)
-    {
-      fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (insn));
-      print_inline_rtx (dump_file, *loc, 2);
-      fprintf (dump_file, "\n with ");
-      print_inline_rtx (dump_file, new_rtx, 2);
-      fprintf (dump_file, "\n");
-    }
+    old_cost = (set_src_cost (SET_SRC (set), speed)
+		+ set_src_cost (SET_DEST (set), speed) + 1);
 
+  old_changes_num = num_changes_pending ();
   validate_unshare_change (insn, loc, new_rtx, true);
-  if (!verify_changes (0))
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not recognized\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	   && set
-	   && set_src_cost (SET_SRC (set), speed) > old_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not profitable\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changed insn %d\n", INSN_UID (insn));
-      ok = true;
-    }
-
-  if (ok)
-    {
-      confirm_change_group ();
-      num_changes++;
-    }
-  else
-    {
-      cancel_changes (0);
-
-      /* Can also record a simplified value in a REG_EQUAL note,
-	 making a new one if one does not already exist.  */
-      if (set_reg_equal)
-	{
-	  if (dump_file)
-	    fprintf (dump_file, " Setting REG_EQUAL note\n");
 
-	  note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (new_rtx));
-	}
-    }
-
-  if ((ok || note) && !CONSTANT_P (new_rtx))
-    update_df (insn, note);
+  /* verify_changes may calls validate_change and add new changes.
+     The new changes are either adding or removing CLOBBER to match
+     insn pattern. These changes should either be committed or canceled
+     in a group, so we use set_change_associated_with_last to indicate
+     whether current change is committed depends on the last change.  */
+  if (verify_changes (old_changes_num))
+  {
+    int i;
+    int new_cost = set_src_cost (SET_SRC (set), speed)
+		   + set_src_cost (SET_DEST (set), speed) + 1;
+    /* validate_unshare_change will tentatively change *loc to new_rtx.
+       We compare the cost before and after validate_unshare_change
+       and get the potential benefit of the change.  */
+    benefit = old_cost - new_cost;
+
+    /* For the change group with adding or removing CLOBBER, we attach
+       the real change benefit to the last change, that is because
+       in func confirm_change_group_by_cost, we need to iterate change
+       in reverse order to make sure cancelling change works correctly.
+       We set other change's benefit to 0, so the overall benefit for
+       the change group is the same. Meanwhile set all the changes in
+       the group to being verified successfully.  */
+    new_changes_num = num_changes_pending ();
+    set_change_verified (new_changes_num - 1, true);
+    set_change_benefit (new_changes_num - 1, benefit);
+    for (i = new_changes_num - 2; i >= old_changes_num; i--)
+      {
+	set_change_verified (i, true);
+	set_change_benefit (i, 0);
+	set_change_associated_with_last (i, true);
+      }
+    set_change_equal_note (old_changes_num, set_reg_equal);
+    return true;
+  }
 
-  return ok;
+  return false;
 }
 
 /* For the given single_set INSN, containing SRC known to be a
@@ -1107,8 +973,7 @@ forward_propagate_subreg (df_ref use, rt
 	  && GET_MODE (SUBREG_REG (src)) == use_mode
 	  && subreg_lowpart_p (src)
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src), false);
     }
 
   /* If this is a SUBREG of a ZERO_EXTEND or SIGN_EXTEND, and the SUBREG
@@ -1139,20 +1004,133 @@ forward_propagate_subreg (df_ref use, rt
 	  && (targetm.mode_rep_extended (use_mode, GET_MODE (src))
 	      != (int) GET_CODE (src))
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0), false);
     }
 
   return false;
 }
 
-/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.  */
+static void
+mems_modified_p (rtx dest, const_rtx setter ATTRIBUTE_UNUSED, void *data)
+{
+  bool *modified = (bool *)data;
+
+  /* If DEST is not a MEM, then it will not conflict with the load.  Note
+     that function calls are assumed to clobber memory, but are handled
+     elsewhere.  */
+  if (MEM_P (dest))
+    {
+      *modified = true;
+      return;
+    }
+}
+
+/* Check whether any memory modification insn from from insn
+   to to insn.  */
+
+static bool
+mem_may_be_modified (rtx from, rtx to)
+{
+  bool modified = false;
+  rtx insn;
+
+  /* For now, we only check the simple case where from and to
+     are in the same bb.  */
+  basic_block bb = BLOCK_FOR_INSN (from);
+  if (bb != BLOCK_FOR_INSN (to))
+    return true;
+
+  for (insn = from; insn != to; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      note_stores (PATTERN (insn), mems_modified_p, &modified);
+      if (modified)
+	break;
+
+      modified = CALL_P (insn);
+      if (modified)
+	break;
+
+      modified = volatile_insn_p (PATTERN (insn));
+      if (modified)
+	break;
+    }
+  gcc_assert (insn);
+  return modified;
+}
+
+/* Calculate how many times reg appears in rtx "in".  */
+
+int
+reg_mentioned_num (const_rtx reg, const_rtx in)
+{
+  const char *fmt;
+  int i, num = 0;
+  enum rtx_code code;
+
+  if (in == 0)
+    return 0;
+
+  if (reg == in)
+    return 1;
+
+  code = GET_CODE (in);
+
+  switch (code)
+    {
+      /* Compare registers by number.  */
+    case REG:
+      return REG_P (reg) && REGNO (in) == REGNO (reg);
+
+      /* These codes have no constituent expressions
+	 and are unique.  */
+    case SCRATCH:
+    case CC0:
+    case PC:
+
+      /* Skip expr list.  */
+    case EXPR_LIST:
+      return 0;
+
+    CASE_CONST_ANY:
+      /* These are kept unique for a given value.  */
+      return 0;
+
+    default:
+      break;
+    }
+
+  if (GET_CODE (reg) == code && rtx_equal_p (reg, in))
+    return 1;
+
+  fmt = GET_RTX_FORMAT (code);
+
+  for (i = GET_RTX_LENGTH (code) - 1; i >= 0; i--)
+    {
+      if (fmt[i] == 'E')
+	{
+	  int j;
+	  for (j = XVECLEN (in, i) - 1; j >= 0; j--)
+	    num += reg_mentioned_num (reg, XVECEXP (in, i, j));
+	}
+      else if (fmt[i] == 'e')
+	num += reg_mentioned_num (reg, XEXP (in, i));
+    }
+  return num;
+}
+
+/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.
+   All the changes added here will be applied immediately without
+   affecting any existing changes. After this func, the changes
+   num is the same as before the func.  */
 
 static bool
 forward_propagate_asm (df_ref use, rtx def_insn, rtx def_set, rtx reg)
 {
   rtx use_insn = DF_REF_INSN (use), src, use_pat, asm_operands, new_rtx, *loc;
-  int speed_p, i;
+  int speed_p, i, old_change_num, new_change_num;
   df_ref *use_vec;
 
   gcc_assert ((DF_REF_FLAGS (use) & DF_REF_IN_NOTE) == 0);
@@ -1166,7 +1144,7 @@ forward_propagate_asm (df_ref use, rtx d
   if (use_vec[0] && use_vec[1])
     return false;
 
-  update_df_init (def_insn, use_insn);
+  old_change_num = num_changes_pending ();
   speed_p = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
   asm_operands = NULL_RTX;
   switch (GET_CODE (use_pat))
@@ -1214,14 +1192,44 @@ forward_propagate_asm (df_ref use, rtx d
 	validate_unshare_change (use_insn, loc, new_rtx, true);
     }
 
-  if (num_changes_pending () == 0 || !apply_change_group ())
+  new_change_num = num_changes_pending ();
+  if ((new_change_num - old_change_num) == 0
+      || !apply_change_group (old_change_num))
     return false;
 
-  update_df (use_insn, NULL);
-  num_changes++;
+  df_uses_create (&PATTERN (use_insn), use_insn, 0);
+  df_insn_rescan (use_insn);
+
   return true;
 }
 
+/* Find whether the set define a return reg.  */
+
+static bool
+def_return_reg (rtx set)
+{
+  edge eg;
+  edge_iterator ei;
+  rtx dest = SET_DEST (set);
+
+  if (!REG_P (dest))
+    return false;
+
+  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
+    if (eg->flags & EDGE_FALLTHRU)
+      {
+	basic_block src_bb = eg->src;
+	rtx last_insn, ret_reg;
+	if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1
+	    && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
+	    && GET_CODE (PATTERN (last_insn)) == USE
+	    && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
+	    && REGNO (ret_reg) == REGNO (dest))
+	  return true;
+      }
+  return false;
+}
+
 /* Try to replace USE with SRC (defined in DEF_INSN) and simplify the
    result.  */
 
@@ -1230,6 +1238,7 @@ forward_propagate_and_simplify (df_ref u
 {
   rtx use_insn = DF_REF_INSN (use);
   rtx use_set = single_set (use_insn);
+  /* rtx src, reg, new_rtx, *loc, use_set_dest, use_set_src; */
   rtx src, reg, new_rtx, *loc;
   bool set_reg_equal;
   enum machine_mode mode;
@@ -1279,18 +1288,46 @@ forward_propagate_and_simplify (df_ref u
       rtx x = avoid_constant_pool_reference (src);
       if (x != src && use_set)
 	{
-          rtx note = find_reg_note (use_insn, REG_EQUAL, NULL_RTX);
+	  rtx note = find_reg_note (use_insn, REG_EQUAL, NULL_RTX);
 	  rtx old_rtx = note ? XEXP (note, 0) : SET_SRC (use_set);
 	  rtx new_rtx = simplify_replace_rtx (old_rtx, src, x);
 	  if (old_rtx != new_rtx)
-            set_unique_reg_note (use_insn, REG_EQUAL, copy_rtx (new_rtx));
+	    set_unique_reg_note (use_insn, REG_EQUAL, copy_rtx (new_rtx));
 	}
       return false;
     }
 
+  /* If only new_rtx contains varying mem or has other side effect, and
+     mem maybe modified between def and use, we cannot do propagation
+     safely. mem_may_be_modified is a simple check without inquiring
+     cfg and alias result.  */
+  if (for_each_rtx (&src, varying_mem_p, NULL)
+      && mem_may_be_modified (def_insn, use_insn))
+    return false;
+
+  if (volatile_refs_p (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
+  /* If the dest of the use insn is a return reg, we don't try fwprop,
+     because mode-switching tries to find return reg copy insn and create
+     pre exit basicblock, and fwprop for return copy insn may make it
+     confused.  */
+  if (def_return_reg (use_set))
+    return false;
+
+  /* We have (hard reg = reg) type insns for func param passing or
+     return value setting. We don't want to propagate in such case
+     because it may restrict cse/gcse. Check hash_rtx and
+     hash_scan_set.  */
+  use_set_dest = SET_DEST (use_set);
+  use_set_src = SET_SRC (use_set);
+  if (REG_P (use_set_dest) && REG_P (use_set_src)
+      && (REGNO (use_set_dest) < FIRST_PSEUDO_REGISTER))
+    return false;
+
   /* Else try simplifying.  */
 
   if (DF_REF_TYPE (use) == DF_REF_REG_MEM_STORE)
@@ -1339,7 +1376,7 @@ forward_propagate_and_simplify (df_ref u
   if (!new_rtx)
     return false;
 
-  return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
+  return try_fwprop_subst (use, loc, new_rtx, set_reg_equal);
 }
 
 
@@ -1402,7 +1439,6 @@ forward_propagate_into (df_ref use)
   return false;
 }
 
-\f
 static void
 fwprop_init (void)
 {
@@ -1417,11 +1453,6 @@ fwprop_init (void)
 
   build_single_def_use_links ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
-
-  active_defs = XNEWVEC (df_ref, max_reg_num ());
-#ifdef ENABLE_CHECKING
-  active_defs_check = sparseset_alloc (max_reg_num ());
-#endif
 }
 
 static void
@@ -1430,55 +1461,150 @@ fwprop_done (void)
   loop_optimizer_finalize ();
 
   use_def_ref.release ();
-  free (active_defs);
-#ifdef ENABLE_CHECKING
-  sparseset_free (active_defs_check);
-#endif
 
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());
-
-  if (dump_file)
-    fprintf (dump_file,
-	     "\nNumber of successful forward propagations: %d\n\n",
-	     num_changes);
 }
 
-
-/* Main entry point.  */
-
 static bool
 gate_fwprop (void)
 {
   return optimize > 0 && flag_forward_propagate;
 }
 
+/* Main func for forward propagation. Iterate all the uses connecting to
+   the same def. For each def-use pair, try forward propagate the src of
+   the def into the use. After all the def-use pairs are iterated, confirm
+   the changes based on the cost of the whole group.  */
+
+static bool
+iterate_def_uses (df_ref def, bool fwprop_addr)
+{
+  int use_num = 0;
+  int def_insn_cost = 0;
+  rtx def_insn, use_insn;
+  struct df_link *uses;
+  int reg_replaced_num = 0;
+  bool all_uses_replaced;
+  bool speed;
+
+  def_insn = DF_REF_INSN (def);
+  speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (def_insn));
+
+  if (def_insn)
+  {
+    rtx set = single_set (def_insn);
+    if (set)
+      def_insn_cost = set_src_cost (SET_SRC (set), speed)
+		      + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      return false;
+  }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n------------------------\n");
+      fprintf (dump_file, "Def %d:\n", INSN_UID (def_insn));
+    }
+
+  for (uses = DF_REF_CHAIN (def), use_num = 0;
+     uses; uses = uses->next)
+  {
+    int old_reg_num, new_reg_num;
+
+    df_ref use = uses->ref;
+    if (DF_REF_IS_ARTIFICIAL (use))
+	continue;
+
+    use_insn = DF_REF_INSN (use);
+    if (!NONDEBUG_INSN_P (use_insn))
+	continue;
+
+    if (dump_file)
+      fprintf (dump_file, "\tUse %d\n", INSN_UID (use_insn));
+
+    if (fwprop_addr)
+      {
+	if (DF_REF_TYPE (use) != DF_REF_REG_USE
+	    && DF_REF_BB (use)->loop_father != NULL
+	    /* The outer most loop is not really a loop.  */
+	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
+	  forward_propagate_into (use);
+      }
+    else
+      {
+	if (DF_REF_TYPE (use) == DF_REF_REG_USE
+	    || DF_REF_BB (use)->loop_father == NULL
+	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
+	  {
+	    old_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+
+	    forward_propagate_into (use);
+
+	    new_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+	    reg_replaced_num += old_reg_num - new_reg_num;
+	  }
+      }
+    use_num++;
+  }
+
+  if (!use_num)
+    return false;
+
+  if (fwprop_addr)
+     return confirm_change_group_by_cost (false,
+					  0,
+					  false);
+  else
+    {
+      all_uses_replaced = (use_num == reg_replaced_num);
+      return confirm_change_group_by_cost (all_uses_replaced,
+					   def_insn_cost,
+					   true);
+    }
+}
+
+/* Try forward propagate src of the def to the normal uses.  */
+
 static unsigned int
 fwprop (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
-  fwprop_init ();
+  if (dump_file) {
+    fprintf (dump_file, "\n============== fwprop ==============\n");
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.
+    extern void dump_cfg (FILE *file);
+    dump_cfg (dump_file);
+  }
 
-     Do not forward propagate addresses into loops until after unrolling.
-     CSE did so because it was able to fix its own mess, but we are not.  */
+  fwprop_init ();
 
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	    || DF_REF_BB (use)->loop_father == NULL
-	    /* The outer most loop is not really a loop.  */
-	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
+
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, false);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
+
   fwprop_done ();
   if (need_cleanup)
     cleanup_cfg (0);
@@ -1507,25 +1633,39 @@ struct rtl_opt_pass pass_rtl_fwprop =
  }
 };
 
+/* Try forward propagate src of the def to the uses in memory addresses.  */
+
 static unsigned int
 fwprop_addr (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop_addr ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.  */
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) != DF_REF_REG_USE
-	    && DF_REF_BB (use)->loop_father != NULL
-	    /* The outer most loop is not really a loop.  */
-	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
+
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, true);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
   fwprop_done ();
@@ -1535,6 +1675,26 @@ fwprop_addr (void)
   return 0;
 }
 
+void
+dump_cfg (FILE *file)
+{
+  basic_block bb;
+  FOR_EACH_BB (bb)
+    {
+      edge e;
+      edge_iterator ei;
+
+      fprintf (file, "BB%d: \n", bb->index);
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  gcc_assert (e->src == bb);
+	  basic_block succ = e->dest;
+	  fprintf (file, "\tBB%d ", succ->index);
+	}
+      fprintf (file, "\n");
+    }
+}
+
 struct rtl_opt_pass pass_rtl_fwprop_addr =
 {
  {
Index: cprop.c
===================================================================
--- cprop.c	(revision 196270)
+++ cprop.c	(working copy)
@@ -739,7 +739,7 @@ try_replace_reg (rtx from, rtx to, rtx i
   to = copy_rtx (to);
 
   validate_replace_src_group (from, to, insn);
-  if (num_changes_pending () && apply_change_group ())
+  if (num_changes_pending () && apply_change_group (0))
     success = 1;
 
   /* Try to simplify SET_SRC if we have substituted a constant.  */
Index: cse.c
===================================================================
--- cse.c	(revision 196270)
+++ cse.c	(working copy)
@@ -3267,7 +3267,7 @@ fold_rtx (rtx x, rtx insn)
 	  tem = folded_arg0, folded_arg0 = folded_arg1, folded_arg1 = tem;
 	}
 
-      apply_change_group ();
+      apply_change_group (0);
     }
 
   /* If X is an arithmetic operation, see if we can simplify it.  */
@@ -4194,7 +4194,7 @@ try_back_substitute_reg (rtx set, rtx in
 	      validate_change (prev, &SET_DEST (PATTERN (prev)), dest, 1);
 	      validate_change (insn, &SET_DEST (set), src, 1);
 	      validate_change (insn, &SET_SRC (set), dest, 1);
-	      apply_change_group ();
+	      apply_change_group (0);
 
 	      /* If INSN has a REG_EQUAL note, and this note mentions
 		 REG0, then we must delete it, because the value in
@@ -4312,7 +4312,7 @@ canonicalize_insn (rtx insn, struct set
   if (GET_CODE (x) == SET && GET_CODE (SET_SRC (x)) == CALL)
     {
       canon_reg (SET_SRC (x), insn);
-      apply_change_group ();
+      apply_change_group (0);
       fold_rtx (SET_SRC (x), insn);
     }
   else if (GET_CODE (x) == CLOBBER)
@@ -4343,7 +4343,7 @@ canonicalize_insn (rtx insn, struct set
   else if (GET_CODE (x) == CALL)
     {
       canon_reg (x, insn);
-      apply_change_group ();
+      apply_change_group (0);
       fold_rtx (x, insn);
     }
   else if (DEBUG_INSN_P (insn))
@@ -4356,7 +4356,7 @@ canonicalize_insn (rtx insn, struct set
 	  if (GET_CODE (y) == SET && GET_CODE (SET_SRC (y)) == CALL)
 	    {
 	      canon_reg (SET_SRC (y), insn);
-	      apply_change_group ();
+	      apply_change_group (0);
 	      fold_rtx (SET_SRC (y), insn);
 	    }
 	  else if (GET_CODE (y) == CLOBBER)
@@ -4371,7 +4371,7 @@ canonicalize_insn (rtx insn, struct set
 	  else if (GET_CODE (y) == CALL)
 	    {
 	      canon_reg (y, insn);
-	      apply_change_group ();
+	      apply_change_group (0);
 	      fold_rtx (y, insn);
 	    }
 	}
@@ -4392,7 +4392,7 @@ canonicalize_insn (rtx insn, struct set
       else
 	{
 	  canon_reg (XEXP (tem, 0), insn);
-	  apply_change_group ();
+	  apply_change_group (0);
 	  XEXP (tem, 0) = fold_rtx (XEXP (tem, 0), insn);
 	  df_notes_rescan (insn);
 	}
@@ -4441,7 +4441,7 @@ canonicalize_insn (rtx insn, struct set
 
      The result of apply_change_group can be ignored; see canon_reg.  */
 
-  apply_change_group ();
+  apply_change_group (0);
 }
 \f
 /* Main function of CSE.
@@ -5148,7 +5148,7 @@ cse_insn (rtx insn)
 					   dest_reg, 1);
 		  validate_unshare_change (insn, &SET_SRC (sets[i].rtl),
 					   GEN_INT (val), 1);
-		  if (apply_change_group ())
+		  if (apply_change_group (0))
 		    {
 		      rtx note = find_reg_note (insn, REG_EQUAL, NULL_RTX);
 		      if (note)
@@ -5217,7 +5217,7 @@ cse_insn (rtx insn)
 		 canon_reg.  */
 
 	      validate_change (insn, &SET_SRC (sets[i].rtl), new_rtx, 1);
-	      apply_change_group ();
+	      apply_change_group (0);
 
 	      break;
 	    }
@@ -7083,7 +7083,7 @@ cse_change_cc_mode_insn (rtx insn, rtx n
      something wrong with the cc_modes_compatible back end function.
      CC modes only can be considered compatible if the insn - with the mode
      replaced by any of the compatible modes - can still be recognized.  */
-  success = apply_change_group ();
+  success = apply_change_group (0);
   gcc_assert (success);
 }
 
Index: postreload.c
===================================================================
--- postreload.c	(revision 196270)
+++ postreload.c	(working copy)
@@ -118,7 +118,7 @@ reload_cse_simplify (rtx insn, rtx testr
 	}
 
       if (count > 0)
-	apply_change_group ();
+	apply_change_group (0);
       else
 	reload_cse_simplify_operands (insn, testreg);
     }
@@ -176,7 +176,7 @@ reload_cse_simplify (rtx insn, rtx testr
 	  count += reload_cse_simplify_set (XVECEXP (body, 0, i), insn);
 
       if (count > 0)
-	apply_change_group ();
+	apply_change_group (0);
       else
 	reload_cse_simplify_operands (insn, testreg);
     }
@@ -476,7 +476,7 @@ reload_cse_simplify_operands (rtx insn,
 	      validate_change (insn, recog_data.operand_loc[1-i],
 			       gen_rtx_REG (word_mode, REGNO (SET_DEST (set))),
 			       1);
-	      if (! apply_change_group ())
+	      if (! apply_change_group (0))
 		return 0;
 	      return reload_cse_simplify_operands (insn, testreg);
 	    }
@@ -670,7 +670,7 @@ reload_cse_simplify_operands (rtx insn,
 		       gen_rtx_REG (mode, op_alt_regno[op][j]), 1);
     }
 
-  return apply_change_group ();
+  return apply_change_group (0);
 }
 \f
 /* If reload couldn't use reg+reg+offset addressing, try to use reg+reg
@@ -1205,7 +1205,7 @@ reload_combine_recognize_pattern (rtx in
 					replacement.  */
 				     reg_sum, 1);
 
-	  if (apply_change_group ())
+	  if (apply_change_group (0))
 	    {
 	      struct reg_use *lowest_ruid = NULL;
 
Index: lower-subreg.c
===================================================================
--- lower-subreg.c	(revision 196270)
+++ lower-subreg.c	(working copy)
@@ -952,7 +952,7 @@ resolve_simple_move (rtx set, rtx insn)
 	for_each_rtx (&XEXP (src, 0), resolve_subreg_use, NULL_RTX);
       if (MEM_P (dest))
 	for_each_rtx (&XEXP (dest, 0), resolve_subreg_use, NULL_RTX);
-      acg = apply_change_group ();
+      acg = apply_change_group (0);
       gcc_assert (acg);
     }
 
@@ -1601,7 +1601,7 @@ decompose_multiword_subregs (bool decomp
 			  validate_unshare_change (insn, pl, *px, 1);
 			}
 
-		      i = apply_change_group ();
+		      i = apply_change_group (0);
 		      gcc_assert (i);
 		    }
 		}
Index: combine-stack-adj.c
===================================================================
--- combine-stack-adj.c	(revision 196270)
+++ combine-stack-adj.c	(working copy)
@@ -224,7 +224,7 @@ try_apply_stack_adjustment (rtx insn, st
       validate_change (ml->insn, ml->ref, new_val, 1);
     }
 
-  if (apply_change_group ())
+  if (apply_change_group (0))
     {
       /* Succeeded.  Update our knowledge of the stack references.  */
       for (ml = reflist; ml ; ml = ml->next)
Index: regmove.c
===================================================================
--- regmove.c	(revision 196270)
+++ regmove.c	(working copy)
@@ -185,7 +185,7 @@ try_auto_increment (rtx insn, rtx inc_in
 			       gen_rtx_fmt_e (inc_code,
 					      GET_MODE (XEXP (use, 0)), reg),
 			       1);
-	      if (apply_change_group ())
+	      if (apply_change_group (0))
 		{
 		  /* If there is a REG_DEAD note on this insn, we must
 		     change this not to REG_UNUSED meaning that the register
@@ -572,7 +572,7 @@ optimize_reg_copy_3 (rtx insn, rtx dest,
   validate_replace_rtx_group (src, src_reg, insn);
 
   /* Now see if all the changes are valid.  */
-  if (! apply_change_group ())
+  if (! apply_change_group (0))
     {
       /* One or more changes were no good.  Back out everything.  */
       PUT_MODE (src_reg, old_mode);
Index: loop-unroll.c
===================================================================
--- loop-unroll.c	(revision 196270)
+++ loop-unroll.c	(working copy)
@@ -2169,7 +2169,7 @@ expand_var_during_unrolling (struct var_
     new_reg = get_expansion (ve);
 
   validate_replace_rtx_group (SET_DEST (set), new_reg, insn);
-  if (apply_change_group ())
+  if (apply_change_group (0))
     if (really_new_expansion)
       {
         ve->var_expansions.safe_push (new_reg);
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 196270)
+++ config/i386/i386.c	(working copy)
@@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
 {
   rtx tmp;
 
-  /* We play register width games, which are only valid after reload.  */
-  gcc_assert (reload_completed);
+  /* We play register width games, which are only valid after reload.
+     An exception: fwprop call peephole to estimate the change benefit,
+     and peephole will call this func. That is before reload complete.
+     It will not bring any problem because the peephole2_insns call is
+     only used for cost estimation in fwprop, and its change will be
+     abandoned immediately after the cost estimation.  */
+  if (strncmp (current_pass->name, "fwprop", 6))
+    gcc_assert (reload_completed);
 
   /* Avoid HImode and its attendant prefix byte.  */
   if (GET_MODE_SIZE (GET_MODE (dest)) < 4)
Index: regcprop.c
===================================================================
--- regcprop.c	(revision 196270)
+++ regcprop.c	(working copy)
@@ -697,12 +697,12 @@ apply_debug_insn_changes (struct value_d
     {
       if (last_insn != change->insn)
 	{
-	  apply_change_group ();
+	  apply_change_group (0);
 	  last_insn = change->insn;
 	}
       validate_change (change->insn, change->loc, change->new_rtx, 1);
     }
-  apply_change_group ();
+  apply_change_group (0);
 }
 
 /* Called via for_each_rtx, for all used registers in a real
@@ -954,7 +954,7 @@ copyprop_hardreg_forward_1 (basic_block
 
       if (any_replacements)
 	{
-	  if (! apply_change_group ())
+	  if (! apply_change_group (0))
 	    {
 	      for (i = 0; i < n_ops; i++)
 		if (replaced[i])
Index: ree.c
===================================================================
--- ree.c	(revision 196270)
+++ ree.c	(working copy)
@@ -718,7 +718,7 @@ combine_reaching_defs (ext_cand *cand, c
 	 cannot be merged, we entirely give up.  In the future, we should allow
 	 extensions to be partially eliminated along those paths where the
 	 definitions could be merged.  */
-      if (apply_change_group ())
+      if (apply_change_group (0))
         {
           if (dump_file)
             fprintf (dump_file, "All merges were successful.\n");
Index: ifcvt.c
===================================================================
--- ifcvt.c	(revision 196270)
+++ ifcvt.c	(working copy)
@@ -669,7 +669,7 @@ cond_exec_process_if_block (ce_if_block_
 
   /* If we cannot apply the changes, fail.  Do not go through the normal fail
      processing, since apply_change_group will call cancel_changes.  */
-  if (! apply_change_group ())
+  if (! apply_change_group (0))
     {
 #ifdef IFCVT_MODIFY_CANCEL
       /* Cancel any machine dependent changes.  */
@@ -4245,7 +4245,7 @@ dead_or_predicable (basic_block test_bb,
     }
 
   if (verify_changes (n_validated_changes))
-    confirm_change_group ();
+    confirm_change_group (0);
   else
     goto cancel;
 
Index: loop-invariant.c
===================================================================
--- loop-invariant.c	(revision 196270)
+++ loop-invariant.c	(working copy)
@@ -1410,7 +1410,7 @@ replace_uses (struct invariant *inv, rtx
 
       /* If we aren't part of a larger group, apply the changes now.  */
       if (!in_group)
-	return apply_change_group ();
+	return apply_change_group (0);
     }
 
   return 1;
@@ -1470,7 +1470,7 @@ move_invariant_reg (struct loop *loop, u
       replace_uses (inv, reg, true);
 
       /* And validate all the changes.  */
-      if (!apply_change_group ())
+      if (!apply_change_group (0))
 	goto fail;
 
       emit_insn_after (gen_move_insn (dest, reg), inv->insn);
Index: lra-eliminations.c
===================================================================
--- lra-eliminations.c	(revision 196270)
+++ lra-eliminations.c	(working copy)
@@ -835,7 +835,7 @@ eliminate_regs_in_insn (rtx insn, bool r
 		  validate_change (insn, &SET_SRC (old_set), src, 1);
 		  validate_change (insn, &SET_DEST (old_set),
 				   ep->from_rtx, 1);
-		  if (! apply_change_group ())
+		  if (! apply_change_group (0))
 		    {
 		      SET_SRC (old_set) = src;
 		      SET_DEST (old_set) = ep->from_rtx;
Index: recog.c
===================================================================
--- recog.c	(revision 196270)
+++ recog.c	(working copy)
@@ -181,6 +181,19 @@ typedef struct change_t
   rtx *loc;
   rtx old;
   bool unshare;
+  /* How much benefit to apply the change.  */
+  int benefit;
+  bool verified;
+  /* Record whether we need to create a equal note
+     if the change is canceled.  */
+  bool equal_note;
+  /* Some changes are committed or cancelled in
+     a group. We use associated_with_last flag to
+     make current change to be consistent with the
+     last change in the group. Adding or removing
+     CLOBBER in verify_change will create such kind
+     of change group.  */
+  bool associated_with_last;
 } change_t;
 
 static change_t *changes;
@@ -235,6 +248,10 @@ validate_change_1 (rtx object, rtx *loc,
   changes[num_changes].loc = loc;
   changes[num_changes].old = old;
   changes[num_changes].unshare = unshare;
+  changes[num_changes].benefit = 0;
+  changes[num_changes].verified = false;
+  changes[num_changes].equal_note = false;
+  changes[num_changes].associated_with_last = false;
 
   if (object && !MEM_P (object))
     {
@@ -252,7 +269,7 @@ validate_change_1 (rtx object, rtx *loc,
   if (in_group)
     return 1;
   else
-    return apply_change_group ();
+    return apply_change_group (0);
 }
 
 /* Wrapper for validate_change_1 without the UNSHARE argument defaulting
@@ -463,17 +480,18 @@ verify_changes (int num)
   return (i == num_changes);
 }
 
-/* A group of changes has previously been issued with validate_change
-   and verified with verify_changes.  Call df_insn_rescan for each of
-   the insn changed and clear num_changes.  */
+/* A group of changes from num to num_changes - 1 has previously been
+   issued with validate_change and verified with verify_changes.
+   Call df_insn_rescan for each of the insn changed and reset num_changes
+   to num.  */
 
 void
-confirm_change_group (void)
+confirm_change_group (int num)
 {
   int i;
   rtx last_object = NULL;
 
-  for (i = 0; i < num_changes; i++)
+  for (i = num; i < num_changes; i++)
     {
       rtx object = changes[i].object;
 
@@ -492,24 +510,364 @@ confirm_change_group (void)
 
   if (last_object && INSN_P (last_object))
     df_insn_rescan (last_object);
+  num_changes = num;
+}
+
+/* Interfaces to operate change fields.  */
+
+void
+set_change_verified (int idx, bool val)
+{
+  changes[idx].verified = val;
+}
+
+void
+set_change_benefit (int idx, int val)
+{
+  changes[idx].benefit = val;
+}
+
+void
+set_change_equal_note (int idx, bool val)
+{
+  changes[idx].equal_note = val;
+}
+
+void
+set_change_associated_with_last (int idx, bool val)
+{
+  changes[idx].associated_with_last = val;
+}
+
+/* Estimate the cost of an insn sequence. The sequence is usually
+   the result of split_insns or peephole2_insns. The cost of the
+   sequence is the summation of the cost of each insn in the sequence.  */
+
+int
+estimate_seq_cost (rtx first, bool speed)
+{
+  int cost = 0;
+
+  while (first)
+    {
+      rtx set = single_set (first);
+      rtx pat = PATTERN (first);
+      if (set)
+	cost += set_src_cost (SET_SRC (set), speed);
+      else if (GET_CODE (pat) == PARALLEL)
+	{
+	  /* select the minimal set cost as parallel cost.  */
+	  int i;
+	  int mincost = MAX_COST;
+	  for (i = 0; i < XVECLEN (pat, 0); i++)
+	    {
+	      enum rtx_code code;
+	      set = XVECEXP (pat, 0, i);
+	      code = GET_CODE (set);
+	      if (code == SET)
+		{
+		  int icost = set_src_cost (SET_SRC (set), speed);
+		  if (icost < mincost)
+		    mincost = icost;
+		}
+	      else if (code == CLOBBER)
+		continue;
+	      else
+		{
+		  mincost = MAX_COST;
+		  break;
+		}
+	    }
+	    cost += mincost;
+	}
+      else
+	{
+	  fprintf (stderr, "split or peephole result is not a set\n");
+	  print_rtl_single (stderr, first);
+	  gcc_assert (0);
+	}
+      first = NEXT_INSN (first);
+    }
+  return cost;
+}
+
+/* Foresee what an insn looks like after split and peephole. Estimate
+   the cost of insn split and peephole result, and calculate the cost
+   difference before and after split and peephole. The difference is
+   returned and used to adjust the change benefit. Note that insn split
+   and peephole result will not be committed here.  */
+
+int
+estimate_split_and_peephole (rtx insn)
+{
+  int match_len;
+  int old_cost;
+  rtx result;
+  rtx pat = PATTERN (insn);
+  rtx set = single_set (insn);
+  bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
+
+  if (set)
+    old_cost = set_src_cost (SET_SRC (set), speed);
+  else
+    {
+      fprintf (stderr, "insn is not a set\n");
+      print_rtl_single (stderr, insn);
+      gcc_assert (0);
+    }
+
+  result = split_insns (pat, insn);
+
+  if (result)
+    return old_cost - estimate_seq_cost (result, speed);
+
+  initialize_before_estimate_peephole (insn);
+  result = peephole2_insns (pat, insn, &match_len);
+
+  if (result)
+    return old_cost - estimate_seq_cost (result, speed);
+
+  return 0;
+}
+
+static void
+update_df (int from, int to, bool is_note)
+{
+  int i;
+  rtx insn;
+
+  if (is_note)
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+          if (changes[i].equal_note)
+	    {
+	      rtx note = find_reg_note (insn, REG_EQUAL, NULL_RTX);
+	      if (note)
+		{
+		  df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
+		  df_notes_rescan (insn);
+		}
+	    }
+	}
+    }
+  else
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+	  df_uses_create (&PATTERN (insn), insn, 0);
+	  df_insn_rescan (insn);
+	}
+    }
+}
+
+/* When we cannot committed all the changes group, we evaluate the change
+   one by one. We choose to commit those changes whose benefits are greater
+   than 0. For fwprop_addr, the cost evaluation is caculated using
+   targetm.address_cost() and has been done in propagate_rtx_1, so we set 
+   chk_benefit false to skip benefit checking and simply commit the change
+   for fwprop_addr.  */
+
+bool
+confirm_change_one_by_one (bool chk_benefit)
+{
+  int i, last_i = 0;
+  rtx last_object = NULL;
+  bool last_change_committed = false;
+
+  for (i = num_changes - 1; i >= 0; i--)
+    {
+      rtx object = changes[i].object;
+
+      /* If change is not verified successfully, or benefit <= 0
+	 and current change is not associated with last committed
+	 change, then we will backout the change.  */
+      if (!changes[i].verified
+	  || (chk_benefit
+	      && changes[i].benefit <= 0
+	      && !(last_change_committed
+		   && changes[i].associated_with_last)))
+	{
+	  rtx new_rtx = *changes[i].loc;
+	  *changes[i].loc = changes[i].old;
+	  if (changes[i].object && !MEM_P (changes[i].object))
+	    INSN_CODE (changes[i].object) = changes[i].old_code;
+	  last_change_committed = false;
+
+	  if (changes[i].equal_note)
+	    {
+	      set_unique_reg_note (changes[i].object,
+				   REG_EQUAL, copy_rtx (new_rtx));
+	      update_df (i, i, true);
+	    }
+	  continue;
+	}
+
+      if (changes[i].unshare)
+	*changes[i].loc = copy_rtx (*changes[i].loc);
+
+      /* Avoid unnecessary rescanning when multiple changes to same instruction
+	 are made.  */
+      if (object)
+	{
+	  if (object != last_object && last_object && INSN_P (last_object))
+	    update_df (last_i, last_i, false);
+	  last_object = object;
+	  last_i = i;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "\n   *** change[%d] -- committed ***\n", i);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (object));
+	  print_inline_rtx (dump_file, changes[i].old, 2);
+	  fprintf (dump_file, "\n with ");
+	  print_inline_rtx (dump_file, *changes[i].loc, 2);
+	  fprintf (dump_file, "\n resulting: ");
+	  print_inline_rtx (dump_file, object, 2);
+	}
+
+      last_change_committed = true;
+    }
+
+  if (last_object && INSN_P (last_object))
+    update_df (last_i, last_i, false);
+
   num_changes = 0;
+  if (last_object)
+    return true;
+  else
+    return false;
+}
+
+/* Confirm a group of change based on the cost. may_confirm_whole_group
+   is initialized to true if for fwprop all the uses are replaced and
+   the def insn could be deleted. For fwprop, extra_benefit is the benefit
+   to delete the def insn. chk_benefit is set when fwprop_addr is true.  */
+
+bool
+confirm_change_group_by_cost (bool may_confirm_whole_group,
+			      int extra_benefit,
+			      bool chk_benefit)
+{
+  int i, to;
+  int total_benefit = 0, total_positive_benefit = 0;
+  bool no_positive_benefit = true;
+
+  if (num_changes == 0)
+    {
+      if (dump_file)
+	fprintf (dump_file, "No changes being tried\n");
+      return false;
+    }
+
+  if (!chk_benefit)
+    return confirm_change_one_by_one (false);
+
+  if (dump_file)
+    fprintf (dump_file, "  extra benefit = %d\n", extra_benefit);
+
+  /* Iterate all the changes, adjust the change benefit if the change result
+     could be splitted or peephole optimized. Calculate the total benefits
+     and total positive benefits in the iteration.  */
+  for (i = 0; i < num_changes; i++)
+    {
+      int split_or_peephole_cost;
+
+      /* If any change fail in the verification, we cannot confirm all
+	 the changes in a group.  */
+      if (!changes[i].verified)
+	{
+	  may_confirm_whole_group = false;
+	  if (dump_file)
+	    fprintf (dump_file, "  change[%d]: benefit = %d, verified - fail\n",
+		    i, changes[i].benefit);
+	  continue;
+	}
+
+      /* Adjust benefit using the split and peephole results.  */
+      split_or_peephole_cost = estimate_split_and_peephole (changes[i].object);
+      changes[i].benefit += split_or_peephole_cost;
+
+      total_benefit += changes[i].benefit;
+      if (changes[i].benefit > 0)
+	{
+	  total_positive_benefit += changes[i].benefit;
+	  no_positive_benefit = false;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "  change[%d]: benefit = %d, verified - ok\n",
+		i, changes[i].benefit);
+    }
+
+  /* Compare the benefit and choose between applying the whole change
+     group and only applying the changes with positive benefit.  */
+  if (may_confirm_whole_group
+      && (total_benefit + extra_benefit < total_positive_benefit))
+    may_confirm_whole_group = false;
+
+  if (may_confirm_whole_group)
+    {
+      /* Commit all the changes in a group.  */
+      if (dump_file)
+	fprintf (dump_file, "!!! All the changes committed\n");
+
+      if (dump_file)
+	{
+	  for (i = 0; i < num_changes; i++)
+	    {
+	      fprintf (dump_file, "\nIn insn %d, replacing\n ",
+		       INSN_UID (changes[i].object));
+	      print_inline_rtx (dump_file, changes[i].old, 2);
+	      fprintf (dump_file, "\n with ");
+	      print_inline_rtx (dump_file, *changes[i].loc, 2);
+	      fprintf (dump_file, "\n resulting: ");
+	      print_inline_rtx (dump_file, changes[i].object, 2);
+	    }
+	}
+
+      to = num_changes - 1;
+      confirm_change_group (0);
+      update_df (0, to, false);
+      return true;
+    }
+  else if (no_positive_benefit)
+    {
+      /* Cancel all the changes.  */
+      to = num_changes - 1;
+      cancel_changes (0);
+      update_df (0, to, true);
+      if (dump_file)
+	fprintf (dump_file, "No changes committed\n");
+      return false;
+    }
+  else
+    /* Cannot commit all the changes. Try to commit those changes
+       with positive benefit.  */
+    return confirm_change_one_by_one (true);
 }
 
 /* Apply a group of changes previously issued with `validate_change'.
    If all changes are valid, call confirm_change_group and return 1,
-   otherwise, call cancel_changes and return 0.  */
+   otherwise, call cancel_changes and return 0. The change group index
+   starts from num to the num_changes - 1.  */
 
 int
-apply_change_group (void)
+apply_change_group (int num)
 {
-  if (verify_changes (0))
+  if (verify_changes (num))
     {
-      confirm_change_group ();
+      confirm_change_group (num);
       return 1;
     }
   else
     {
-      cancel_changes (0);
+      cancel_changes (num);
       return 0;
     }
 }
@@ -534,9 +892,13 @@ cancel_changes (int num)
      they were made.  */
   for (i = num_changes - 1; i >= num; i--)
     {
+      rtx new_rtx = *changes[i].loc;
       *changes[i].loc = changes[i].old;
       if (changes[i].object && !MEM_P (changes[i].object))
 	INSN_CODE (changes[i].object) = changes[i].old_code;
+      if (changes[i].equal_note)
+	set_unique_reg_note (changes[i].object,
+			     REG_EQUAL, copy_rtx (new_rtx));
     }
   num_changes = num;
 }
@@ -777,7 +1139,7 @@ int
 validate_replace_rtx_subexp (rtx from, rtx to, rtx insn, rtx *loc)
 {
   validate_replace_rtx_1 (loc, from, to, insn, true);
-  return apply_change_group ();
+  return apply_change_group (0);
 }
 
 /* Try replacing every occurrence of FROM in INSN with TO.  After all
@@ -787,7 +1149,7 @@ int
 validate_replace_rtx (rtx from, rtx to, rtx insn)
 {
   validate_replace_rtx_1 (&PATTERN (insn), from, to, insn, true);
-  return apply_change_group ();
+  return apply_change_group (0);
 }
 
 /* Try replacing every occurrence of FROM in WHERE with TO.  Assume that WHERE
@@ -800,7 +1162,7 @@ int
 validate_replace_rtx_part (rtx from, rtx to, rtx *where, rtx insn)
 {
   validate_replace_rtx_1 (where, from, to, insn, true);
-  return apply_change_group ();
+  return apply_change_group (0);
 }
 
 /* Same as above, but do not simplify rtx afterwards.  */
@@ -809,7 +1171,7 @@ validate_replace_rtx_part_nosimplify (rt
                                       rtx insn)
 {
   validate_replace_rtx_1 (where, from, to, insn, false);
-  return apply_change_group ();
+  return apply_change_group (0);
 
 }
 
@@ -895,7 +1257,7 @@ validate_simplify_insn (rtx insn)
 	      validate_change (insn, &SET_DEST (s), newpat, 1);
 	  }
       }
-  return ((num_changes_pending () > 0) && (apply_change_group () > 0));
+  return ((num_changes_pending () > 0) && (apply_change_group (0) > 0));
 }
 \f
 #ifdef HAVE_cc0
@@ -996,7 +1358,7 @@ general_operand (rtx op, enum machine_mo
 	     integer modes need the same number of hard registers, the
 	     size of floating point mode can be less than the integer
 	     mode.  */
-	  && ! lra_in_progress 
+	  && ! lra_in_progress
 	  && GET_MODE_SIZE (GET_MODE (op)) > GET_MODE_SIZE (GET_MODE (sub)))
 	return 0;
 
@@ -1077,7 +1439,7 @@ register_operand (rtx op, enum machine_m
 	     integer modes need the same number of hard registers, the
 	     size of floating point mode can be less than the integer
 	     mode.  */
-	  && ! lra_in_progress 
+	  && ! lra_in_progress
 	  && GET_MODE_SIZE (GET_MODE (op)) > GET_MODE_SIZE (GET_MODE (sub)))
 	return 0;
 
@@ -1718,7 +2080,7 @@ asm_operand_ok (rtx op, const char *cons
 
 	case 'E':
 	case 'F':
-	  if (CONST_DOUBLE_AS_FLOAT_P (op) 
+	  if (CONST_DOUBLE_AS_FLOAT_P (op)
 	      || (GET_CODE (op) == CONST_VECTOR
 		  && GET_MODE_CLASS (GET_MODE (op)) == MODE_VECTOR_FLOAT))
 	    result = 1;
@@ -2816,7 +3178,7 @@ reg_fits_class_p (const_rtx operand, reg
   /* Regno must not be a pseudo register.  Offset may be negative.  */
   return (HARD_REGISTER_NUM_P (regno)
 	  && HARD_REGISTER_NUM_P (regno + offset)
-	  && in_hard_reg_set_p (reg_class_contents[(int) cl], mode, 
+	  && in_hard_reg_set_p (reg_class_contents[(int) cl], mode,
 				regno + offset));
 }
 \f
@@ -2997,6 +3359,32 @@ int peep2_current_count;
    DF_LIVE_OUT for the block.  */
 #define PEEP2_EOB	pc_rtx
 
+/* Initialize peep2_insn_data array and their live_before field.
+   Only two elements in the peep2_insn_data, one is the input insn,
+   the other is the element marking the end. The live_before field
+   is caculated by df backwards simulation starting from the live
+   set of bb.  */
+void
+initialize_before_estimate_peephole (rtx insn)
+{
+  bitmap live;
+  basic_block bb = BLOCK_FOR_INSN (insn);
+  peep2_current = 0;
+  peep2_current_count = 0;
+  peep2_insn_data[0].insn = insn;
+  peep2_insn_data[0].live_before = BITMAP_ALLOC (&reg_obstack);
+  peep2_insn_data[1].insn = PEEP2_EOB;
+  peep2_insn_data[1].live_before = BITMAP_ALLOC (&reg_obstack);
+
+  live = BITMAP_ALLOC (&reg_obstack);
+  bitmap_copy (live, DF_LR_IN (bb));
+  df_simulate_initialize_forwards (bb, live);
+  simulate_backwards_to_point (bb, live, insn);
+  COPY_REG_SET (peep2_insn_data[1].live_before, live);
+  df_simulate_one_insn_backwards (bb, insn, live);
+  COPY_REG_SET (peep2_insn_data[0].live_before, live);
+}
+
 /* Wrap N to fit into the peep2_insn_data buffer.  */
 
 static int
@@ -3050,6 +3438,14 @@ peep2_reg_dead_p (int ofs, rtx reg)
   gcc_assert (peep2_insn_data[ofs].insn != NULL_RTX);
 
   regno = REGNO (reg);
+
+  /* we may call peephole2_insns in fwprop phase to estimate how
+     peephole will affect the cost of the insn transformed by fwprop.
+     fwprop is done before ira phase, so we need to consider pesudo
+     register here as well.  */
+  if (!strncmp (current_pass->name, "fwprop", 6))
+    return !REGNO_REG_SET_P (peep2_insn_data[ofs].live_before, regno);
+
   n = hard_regno_nregs[regno][GET_MODE (reg)];
   while (--n >= 0)
     if (REGNO_REG_SET_P (peep2_insn_data[ofs].live_before, regno + n))
@@ -3078,6 +3474,13 @@ peep2_find_free_register (int from, int
   df_ref *def_rec;
   int i;
 
+  /* we may call peephole2_insns in fwprop phase to estimate how
+     peephole will affect the cost of the insn transformed by fwprop.
+     fwprop is done before ira phase. In that case, we simply return
+     a new pseudo register.  */
+  if (!strncmp (current_pass->name, "fwprop", 6))
+    return gen_reg_rtx (mode);
+
   gcc_assert (from < MAX_INSNS_PER_PEEP2 + 1);
   gcc_assert (to < MAX_INSNS_PER_PEEP2 + 1);
 
Index: recog.h
===================================================================
--- recog.h	(revision 196270)
+++ recog.h	(working copy)
@@ -80,8 +80,19 @@ extern bool validate_unshare_change (rtx
 extern bool canonicalize_change_group (rtx insn, rtx x);
 extern int insn_invalid_p (rtx, bool);
 extern int verify_changes (int);
-extern void confirm_change_group (void);
-extern int apply_change_group (void);
+extern void confirm_change_group (int num);
+extern int apply_change_group (int num);
+extern void set_change_verified (int idx, bool val);
+extern void set_change_benefit (int idx, int val);
+extern void set_change_equal_note (int idx, bool val);
+extern void set_change_associated_with_last (int idx, bool val);
+extern int estimate_seq_cost (rtx first, bool speed);
+extern int estimate_split_and_peephole (rtx insn);
+extern void initialize_before_estimate_peephole (rtx insn);
+extern bool confirm_change_one_by_one (bool chk_benefit);
+extern bool confirm_change_group_by_cost (bool may_confirm_whole_group,
+					  int extra_benefit,
+					  bool chk_benefit);
 extern int num_validated_changes (void);
 extern void cancel_changes (int);
 extern int constrain_operands (int);
Index: reload1.c
===================================================================
--- reload1.c	(revision 196270)
+++ reload1.c	(working copy)
@@ -3308,7 +3308,7 @@ eliminate_regs_in_insn (rtx insn, int re
 		    validate_change (insn, &SET_SRC (old_set), src, 1);
 		    validate_change (insn, &SET_DEST (old_set),
 				     ep->to_rtx, 1);
-		    if (! apply_change_group ())
+		    if (! apply_change_group (0))
 		      {
 			SET_SRC (old_set) = src;
 			SET_DEST (old_set) = ep->to_rtx;
@@ -4768,7 +4768,7 @@ reload_as_needed (int live_known)
 			      if (!n)
 				cancel_changes (0);
 			      else
-				confirm_change_group ();
+				confirm_change_group (0);
 			    }
 			  break;
 			}
Index: compare-elim.c
===================================================================
--- compare-elim.c	(revision 196270)
+++ compare-elim.c	(working copy)
@@ -593,7 +593,7 @@ try_eliminate_compare (struct comparison
   /* Succeed if the new instruction is valid.  Note that we may have started
      a change group within maybe_select_cc_mode, therefore we must continue. */
   validate_change (insn, &XVECEXP (PATTERN (insn), 0, 1), x, true);
-  if (!apply_change_group ())
+  if (!apply_change_group (0))
     return false;
  
   /* Success.  Delete the compare insn...  */
Index: jump.c
===================================================================
--- jump.c	(revision 196270)
+++ jump.c	(working copy)
@@ -1527,7 +1527,7 @@ redirect_jump (rtx jump, rtx nlabel, int
   if (nlabel == olabel)
     return 1;
 
-  if (! redirect_jump_1 (jump, nlabel) || ! apply_change_group ())
+  if (! redirect_jump_1 (jump, nlabel) || ! apply_change_group (0))
     return 0;
 
   redirect_jump_2 (jump, olabel, nlabel, delete_unused, 0);
@@ -1563,7 +1563,7 @@ redirect_jump_2 (rtx jump, rtx olabel, r
       else
 	{
 	  redirect_exp_1 (&XEXP (note, 0), olabel, nlabel, jump);
-	  confirm_change_group ();
+	  confirm_change_group (0);
 	}
     }
 
@@ -1649,7 +1649,7 @@ invert_jump (rtx jump, rtx nlabel, int d
 {
   rtx olabel = JUMP_LABEL (jump);
 
-  if (invert_jump_1 (jump, nlabel) && apply_change_group ())
+  if (invert_jump_1 (jump, nlabel) && apply_change_group (0))
     {
       redirect_jump_2 (jump, olabel, nlabel, delete_unused, 1);
       return 1;
@@ -1868,7 +1868,9 @@ true_regnum (const_rtx x)
   if (REG_P (x))
     {
       if (REGNO (x) >= FIRST_PSEUDO_REGISTER
-	  && (lra_in_progress || reg_renumber[REGNO (x)] >= 0))
+	  && (lra_in_progress 
+	      || (reg_renumber 
+		  && reg_renumber[REGNO (x)] >= 0)))
 	return reg_renumber[REGNO (x)];
       return REGNO (x);
     }
@@ -1888,6 +1890,8 @@ true_regnum (const_rtx x)
 	  if (info.representable_p)
 	    return base + info.offset;
 	}
+      else
+	return base;
     }
   return -1;
 }

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-11  5:52             ` Wei Mi
@ 2013-03-11 18:10               ` Jeff Law
  2013-03-11 18:17                 ` Steven Bosscher
  2013-03-11 19:52               ` Steven Bosscher
  1 sibling, 1 reply; 29+ messages in thread
From: Jeff Law @ 2013-03-11 18:10 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

On 03/10/2013 11:52 PM, Wei Mi wrote:
> Hi,
>
> This is the fwprop extension patch which is put in order. Regression
> test and bootstrap pass. Please help to review its rationality. The
> following is a brief description what I have done in the patch.
>
> In order to make fwprop more effective in rtl optimization, we extend
> it to handle general expressions instead of the three cases listed in
> the head comment in fwprop.c. The major changes include a) We need to
> check propagation correctness for src exprs of def which contain mem
> references. Previous fwprop for the three cases above doesn't have the
> problem. b) We need a better cost model because the benefit is usually
> not so apparent as the three cases above.
>
> For a general fwprop problem, there are two possible sources where
> benefit comes from. The frist is the new use insn after propagation
> and simplification may have lower cost than itself before propagation,
> or propagation may create a new insn, that could be splitted or
> peephole optimized later and get a lower cost. The second is that if
> all the uses are replaced with the src of the def insn, the def insn
> could be deleted.
>
> So instead of check each def-use pair independently, we use DU chain
> to track all the uses for a def. For each def-use pair, we attempt the
> propagation, record the change candidate in changes[] array, but we
> wait to confirm the changes until all the pairs with the same def are
> iterated. The changes confirmation is done in the func
> confirm_change_group_by_cost. We only do this for fwprop. For
> fwprop_addr, the benefit of each change is ensured by
> propagation_rtx_1 using should_replace_address, so we just confirm all
> the changes without checking benefit again.
Can you please attach this to the 4.9 pending patches tracker bug. 
We're really focused on trying to get 4.8 out the door and this doesn't 
seem like suitable material for GCC 4.8.

I haven't looked at the details of the patch at all yet and doubt I 
would prior to GCC 4.8 going out the door.

Thanks,
jeff

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-11 18:10               ` Jeff Law
@ 2013-03-11 18:17                 ` Steven Bosscher
  0 siblings, 0 replies; 29+ messages in thread
From: Steven Bosscher @ 2013-03-11 18:17 UTC (permalink / raw)
  To: Jeff Law; +Cc: Wei Mi, GCC Patches, David Li, Uros Bizjak

On Mon, Mar 11, 2013 at 7:10 PM, Jeff Law <law@redhat.com> wrote:
> On 03/10/2013 11:52 PM, Wei Mi wrote:
>>
>> Hi,
>>
>> This is the fwprop extension patch which is put in order. Regression
>> test and bootstrap pass. Please help to review its rationality. The
>> following is a brief description what I have done in the patch.
>>
>> In order to make fwprop more effective in rtl optimization, we extend
>> it to handle general expressions instead of the three cases listed in
>> the head comment in fwprop.c. The major changes include a) We need to
>> check propagation correctness for src exprs of def which contain mem
>> references. Previous fwprop for the three cases above doesn't have the
>> problem. b) We need a better cost model because the benefit is usually
>> not so apparent as the three cases above.
>>
>> For a general fwprop problem, there are two possible sources where
>> benefit comes from. The frist is the new use insn after propagation
>> and simplification may have lower cost than itself before propagation,
>> or propagation may create a new insn, that could be splitted or
>> peephole optimized later and get a lower cost. The second is that if
>> all the uses are replaced with the src of the def insn, the def insn
>> could be deleted.
>>
>> So instead of check each def-use pair independently, we use DU chain
>> to track all the uses for a def. For each def-use pair, we attempt the
>> propagation, record the change candidate in changes[] array, but we
>> wait to confirm the changes until all the pairs with the same def are
>> iterated. The changes confirmation is done in the func
>> confirm_change_group_by_cost. We only do this for fwprop. For
>> fwprop_addr, the benefit of each change is ensured by
>> propagation_rtx_1 using should_replace_address, so we just confirm all
>> the changes without checking benefit again.
>
> Can you please attach this to the 4.9 pending patches tracker bug. We're
> really focused on trying to get 4.8 out the door and this doesn't seem like
> suitable material for GCC 4.8.
>
> I haven't looked at the details of the patch at all yet and doubt I would
> prior to GCC 4.8 going out the door.
>
> Thanks,
> jeff
>

Jeff,

The world has more people than you, and with different interests. This
patch was posted here for comments on the idea, and while I'm sure
your feedback would be very valuable, it is no more required for
discussing this patch than it is for releasing GCC 4.8.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-11  5:52             ` Wei Mi
  2013-03-11 18:10               ` Jeff Law
@ 2013-03-11 19:52               ` Steven Bosscher
  2013-03-12  7:18                 ` Wei Mi
  1 sibling, 1 reply; 29+ messages in thread
From: Steven Bosscher @ 2013-03-11 19:52 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li, Uros Bizjak

On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
> This is the fwprop extension patch which is put in order. Regression
> test and bootstrap pass. Please help to review its rationality. The
> following is a brief description what I have done in the patch.
>
> In order to make fwprop more effective in rtl optimization, we extend
> it to handle general expressions instead of the three cases listed in
> the head comment in fwprop.c. The major changes include a) We need to
> check propagation correctness for src exprs of def which contain mem
> references. Previous fwprop for the three cases above doesn't have the
> problem. b) We need a better cost model because the benefit is usually
> not so apparent as the three cases above.
>
> For a general fwprop problem, there are two possible sources where
> benefit comes from. The frist is the new use insn after propagation
> and simplification may have lower cost than itself before propagation,
> or propagation may create a new insn, that could be splitted or
> peephole optimized later and get a lower cost. The second is that if
> all the uses are replaced with the src of the def insn, the def insn
> could be deleted.
>
> So instead of check each def-use pair independently, we use DU chain
> to track all the uses for a def. For each def-use pair, we attempt the
> propagation, record the change candidate in changes[] array, but we
> wait to confirm the changes until all the pairs with the same def are
> iterated. The changes confirmation is done in the func
> confirm_change_group_by_cost. We only do this for fwprop. For
> fwprop_addr, the benefit of each change is ensured by
> propagation_rtx_1 using should_replace_address, so we just confirm all
> the changes without checking benefit again.

Hello Wei Mi,

So IIUC, in essence you are doing:

main:
  FOR_EACH_BB:
    FOR_BB_INSNS, non-debug insns only:
      for each df_ref DEF operand on insn:
        iterate_def_uses

iterate_def_uses:
  for each UD chain from DEF to USE(i):
    forward_propagate_into
  confirm changes by total benefit

I still like the idea, but there are also still a few "design issues"
to resolve.

Some of the same comments as before apply: Do you really, really,
reallyreally have to go so low-level as to insn splitting, peephole
optimizations, and even register allocation, to get the cost right?
That will almost certainly not be acceptable, and I for one would
oppose such a change. It's IMHO a violation of proper engineering when
your medium-to-high level code transformations have to do that. If you
have strong reasons for your approach, it'd be helpful if you can
explain them so that we can together look for a less intrusive
solution (e.g. splitting earlier, adjusting the cost model, etc.).

So things like:
> +  /* we may call peephole2_insns in fwprop phase to estimate how
> +     peephole will affect the cost of the insn transformed by fwprop.
> +     fwprop is done before ira phase. In that case, we simply return
> +     a new pseudo register.  */
> +  if (!strncmp (current_pass->name, "fwprop", 6))
> +    return gen_reg_rtx (mode);

and

> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c        (revision 196270)
> +++ config/i386/i386.c        (working copy)
> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
>  {
>    rtx tmp;
>
> -  /* We play register width games, which are only valid after reload.  */
> -  gcc_assert (reload_completed);
> +  /* We play register width games, which are only valid after reload.
> +     An exception: fwprop call peephole to estimate the change benefit,
> +     and peephole will call this func. That is before reload complete.
> +     It will not bring any problem because the peephole2_insns call is
> +     only used for cost estimation in fwprop, and its change will be
> +     abandoned immediately after the cost estimation.  */
> +  if (strncmp (current_pass->name, "fwprop", 6))
> +    gcc_assert (reload_completed);

are IMHO not OK.

Note that your patch is a bit difficult to read at some points because
you have included a bunch of non-changes (whitespaces fixes --
necessary cleanups but not relevant for your patch), see e.g. the
changed lines that contain "lra_in_progress". Also the changes like:
>  static bool
> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)

which are quite distracting, making it harder to see what has *really* changed.

You should probably just a helper function apply_change_group_num()
and avoid all the apply_change_group use fixups.


In fwprop.c:
> +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
> +     estimation in estimate_split_and_peephole.  */
> +  df_set_flags (DF_LR_RUN_DCE);
>    df_md_add_problem ();
>    df_note_add_problem ();
> -  df_analyze ();
> +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
>    df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
> +  df_analyze ();

you add DU and UD chains, and implicitly the RD problem, but you also
already have the MD problem. I think my reaching-defs patches for GCC
4.8 make the MD problem less necessary, but you certainly don't need
MD + RD + UD + DU.

You've noticed so yourself:
> +   We think the maintainance for use_def_ref vector is not necessary, so
> +   we remove update_df/update_uses/update_df_init/register_active_defs.  */

and it looks like you're simply avoiding the problem by queuing up
changes and commit them all at the end. I don't believe that will
work, you'll break the UD and DU chains and may end up with dangling
pointers to free'd or removed df_refs.


> +  /* see when the insn is not a set  */
> +  if (!set)
> +    return false;

fwprop.c was speciflcally developed to also handle multiple-set
instructions, like the bits of cse.c that it tried to replace. Your
patch should not change this.


> +static bool
> +mem_may_be_modified (rtx from, rtx to)

This has "potentially slow" written all over it :-) (You're punting on
any MEM for now, but someone at some point will find a reason to use
alias analysis, blowing up really bad test cases like PR39326...)


> +int
> +reg_mentioned_num (const_rtx reg, const_rtx in

Should use DF caches instead of deep-diving the pattern. Or if DF
cache updates are deferred, use for_each_rtx on the pattern.


> +/* Find whether the set define a return reg.  */
> +
> +static bool
> +def_return_reg (rtx set)
> +{
> +  edge eg;
> +  edge_iterator ei;
> +  rtx dest = SET_DEST (set);
> +
> +  if (!REG_P (dest))
> +    return false;

The return USE will also be a hard reg:
 +  if (!REG_P (dest) || ! HARD_REGISTER_P (dest))
 +    return false;


> +  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
> +    if (eg->flags & EDGE_FALLTHRU)
> +      {
> +     basic_block src_bb = eg->src;
> +     rtx last_insn, ret_reg;
> +     if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1

single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is
only one pred to begin with?

> +         && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
> +         && GET_CODE (PATTERN (last_insn)) == USE
> +         && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
> +         && REGNO (ret_reg) == REGNO (dest))
> +       return true;
> +      }
> +  return false;
> +}

Actually this whole change makes me nervous. I don't think you should
propagate into any USE at all, for return value or otherwise.


> +  if (def_insn)
> +  {
> +    rtx set = single_set (def_insn);
> +    if (set)
> +      def_insn_cost = set_src_cost (SET_SRC (set), speed)
> +                   + set_src_cost (SET_DEST (set), speed) + 1;
> +    else
> +      return false;
> +  }

As before: You'll have to deal with non-single_set insns also.


> +void
> +dump_cfg (FILE *file)
> +{

You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg.


> +  if (fwprop_addr)
> +     return confirm_change_group_by_cost (false,
> +                                       0,
> +                                       false);
> +  else
> +    {
> +      all_uses_replaced = (use_num == reg_replaced_num);
> +      return confirm_change_group_by_cost (all_uses_replaced,
> +                                        def_insn_cost,
> +                                        true);
> +    }


What happens if you propagate into an insn that uses the same register
twice? Will the DU chains still be valid (I don't think that's
guaranteed)?

Is the extra_benefit flag always applicable if all USEs of a DEF have
been propagated out? What if the DEF is in an insn that is inherently
necessary?

Have you measured what effect this pass has on combine?

Ciao!
Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-11 19:52               ` Steven Bosscher
@ 2013-03-12  7:18                 ` Wei Mi
  2013-03-16 22:49                   ` Steven Bosscher
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-03-12  7:18 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

Thanks for the helpful comments! I have some replies inlined.

Regards,
Wei.

On Mon, Mar 11, 2013 at 12:52 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 6:52 AM, Wei Mi wrote:
>> This is the fwprop extension patch which is put in order. Regression
>> test and bootstrap pass. Please help to review its rationality. The
>> following is a brief description what I have done in the patch.
>>
>> In order to make fwprop more effective in rtl optimization, we extend
>> it to handle general expressions instead of the three cases listed in
>> the head comment in fwprop.c. The major changes include a) We need to
>> check propagation correctness for src exprs of def which contain mem
>> references. Previous fwprop for the three cases above doesn't have the
>> problem. b) We need a better cost model because the benefit is usually
>> not so apparent as the three cases above.
>>
>> For a general fwprop problem, there are two possible sources where
>> benefit comes from. The frist is the new use insn after propagation
>> and simplification may have lower cost than itself before propagation,
>> or propagation may create a new insn, that could be splitted or
>> peephole optimized later and get a lower cost. The second is that if
>> all the uses are replaced with the src of the def insn, the def insn
>> could be deleted.
>>
>> So instead of check each def-use pair independently, we use DU chain
>> to track all the uses for a def. For each def-use pair, we attempt the
>> propagation, record the change candidate in changes[] array, but we
>> wait to confirm the changes until all the pairs with the same def are
>> iterated. The changes confirmation is done in the func
>> confirm_change_group_by_cost. We only do this for fwprop. For
>> fwprop_addr, the benefit of each change is ensured by
>> propagation_rtx_1 using should_replace_address, so we just confirm all
>> the changes without checking benefit again.
>
> Hello Wei Mi,
>
> So IIUC, in essence you are doing:
>
> main:
>   FOR_EACH_BB:
>     FOR_BB_INSNS, non-debug insns only:
>       for each df_ref DEF operand on insn:
>         iterate_def_uses
>
> iterate_def_uses:
>   for each UD chain from DEF to USE(i):
>     forward_propagate_into
>   confirm changes by total benefit
>
> I still like the idea, but there are also still a few "design issues"
> to resolve.
>
> Some of the same comments as before apply: Do you really, really,
> reallyreally have to go so low-level as to insn splitting, peephole
> optimizations, and even register allocation, to get the cost right?
> That will almost certainly not be acceptable, and I for one would
> oppose such a change. It's IMHO a violation of proper engineering when
> your medium-to-high level code transformations have to do that. If you
> have strong reasons for your approach, it'd be helpful if you can
> explain them so that we can together look for a less intrusive
> solution (e.g. splitting earlier, adjusting the cost model, etc.).
>

For the motivational case, I need insn splitting to get the cost
right. insn splitting is not very intrusive. All I need is to call
split_insns func. I think split_insns is just a pattern matching func
just like recog(), which is called at many places. Peephole is not
necessary (I add it in order to find as many oppotunities as possible,
but from my trace analysis, it doesn't help much). To call
peephole2_insn() is indeed intrusive, because peephole assumes reg
allocation is completed, I have to insert the ugly workaround below.
peephole also requires setting DF_LR_RUN_DCE flag and some
initialization of peep2_insn_data array.

So how about keep split_insns and remove peephole in the cost estimation func?

> So things like:
>> +  /* we may call peephole2_insns in fwprop phase to estimate how
>> +     peephole will affect the cost of the insn transformed by fwprop.
>> +     fwprop is done before ira phase. In that case, we simply return
>> +     a new pseudo register.  */
>> +  if (!strncmp (current_pass->name, "fwprop", 6))
>> +    return gen_reg_rtx (mode);
>
> and
>
>> Index: config/i386/i386.c
>> ===================================================================
>> --- config/i386/i386.c        (revision 196270)
>> +++ config/i386/i386.c        (working copy)
>> @@ -15901,8 +15901,14 @@ ix86_expand_clear (rtx dest)
>>  {
>>    rtx tmp;
>>
>> -  /* We play register width games, which are only valid after reload.  */
>> -  gcc_assert (reload_completed);
>> +  /* We play register width games, which are only valid after reload.
>> +     An exception: fwprop call peephole to estimate the change benefit,
>> +     and peephole will call this func. That is before reload complete.
>> +     It will not bring any problem because the peephole2_insns call is
>> +     only used for cost estimation in fwprop, and its change will be
>> +     abandoned immediately after the cost estimation.  */
>> +  if (strncmp (current_pass->name, "fwprop", 6))
>> +    gcc_assert (reload_completed);
>
> are IMHO not OK.
>

They are intrusive and inserted for peephole.

> Note that your patch is a bit difficult to read at some points because
> you have included a bunch of non-changes (whitespaces fixes --
> necessary cleanups but not relevant for your patch), see e.g. the
> changed lines that contain "lra_in_progress". Also the changes like:
>>  static bool
>> -propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
>> +propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
>
> which are quite distracting, making it harder to see what has *really* changed.
>

In the old fwprop, the flags is of enum type {PR_CAN_APPEAR,
PR_HANDLE_MEM, PR_OPTIMIZE_FOR_SPEED}.   PR_CAN_APPEAR is used to
restrict fwprop to only handle the three typical cases in the comment
at the head of fwprop.c. PR_HANDLE_MEM is used for mem addr.
PR_OPTIMIZE_FOR_SPEED indicates optimization is for speed or for size.
New fwprop will only use PR_OPTIMIZE_FOR_SPEED, and the other two are
useless. So I change the param from "int flags" to "bool speed".

> You should probably just a helper function apply_change_group_num()
> and avoid all the apply_change_group use fixups.
>
>

Yes, they are distracting. I can use apply_change_group_num for easy
code review, but I think to commit, extending apply_change_group and
avoid creating a very similar func is more welcomed?


> In fwprop.c:
>> +  /* DF_LR_RUN_DCE is used in peephole2_insns, which is called for cost
>> +     estimation in estimate_split_and_peephole.  */
>> +  df_set_flags (DF_LR_RUN_DCE);
>>    df_md_add_problem ();
>>    df_note_add_problem ();
>> -  df_analyze ();
>> +  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
>>    df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
>> +  df_analyze ();
>
> you add DU and UD chains, and implicitly the RD problem, but you also
> already have the MD problem. I think my reaching-defs patches for GCC
> 4.8 make the MD problem less necessary, but you certainly don't need
> MD + RD + UD + DU.
>

MD problem is used to set use_def_ref and check whether a use has
multiple def in the old fwprop. I reuse that part in new fwprop. With
UD chain, we may remove MD problem and use_def_ref vector, and use UD
chain to do the check. I will try it.

> You've noticed so yourself:
>> +   We think the maintainance for use_def_ref vector is not necessary, so
>> +   we remove update_df/update_uses/update_df_init/register_active_defs.  */
>
> and it looks like you're simply avoiding the problem by queuing up
> changes and commit them all at the end. I don't believe that will
> work, you'll break the UD and DU chains and may end up with dangling
> pointers to free'd or removed df_refs.
>

I only remove the code related with use_def_ref vector. The code to
update df_refs is kept but moved to update_df in recog.c.

>
>> +  /* see when the insn is not a set  */
>> +  if (!set)
>> +    return false;
>
> fwprop.c was speciflcally developed to also handle multiple-set
> instructions, like the bits of cse.c that it tried to replace. Your
> patch should not change this.
>

ok, I will remove the limitation.

>
>> +static bool
>> +mem_may_be_modified (rtx from, rtx to)
>
> This has "potentially slow" written all over it :-) (You're punting on
> any MEM for now, but someone at some point will find a reason to use
> alias analysis, blowing up really bad test cases like PR39326...)
>

If someone wants to add alias analysis here, he must find out some
good reasons :-). That is what I expect.

>
>> +int
>> +reg_mentioned_num (const_rtx reg, const_rtx in
>
> Should use DF caches instead of deep-diving the pattern. Or if DF
> cache updates are deferred, use for_each_rtx on the pattern.
>

I will fix it.

>
>> +/* Find whether the set define a return reg.  */
>> +
>> +static bool
>> +def_return_reg (rtx set)
>> +{
>> +  edge eg;
>> +  edge_iterator ei;
>> +  rtx dest = SET_DEST (set);
>> +
>> +  if (!REG_P (dest))
>> +    return false;
>
> The return USE will also be a hard reg:
>  +  if (!REG_P (dest) || ! HARD_REGISTER_P (dest))
>  +    return false;
>
>
>> +  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
>> +    if (eg->flags & EDGE_FALLTHRU)
>> +      {
>> +     basic_block src_bb = eg->src;
>> +     rtx last_insn, ret_reg;
>> +     if (EDGE_COUNT (EXIT_BLOCK_PTR->preds) == 1
>
> single_pred_p(), but why do FOR_EACH_EDGE and then chec that there is
> only one pred to begin with?

My mistake. I copy the chunk of code from mode-switching.
FOR_EACH_EDGE is unneeded.

>
>> +         && NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
>> +         && GET_CODE (PATTERN (last_insn)) == USE
>> +         && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
>> +         && REGNO (ret_reg) == REGNO (dest))
>> +       return true;
>> +      }
>> +  return false;
>> +}
>
> Actually this whole change makes me nervous. I don't think you should
> propagate into any USE at all, for return value or otherwise.

Yes, reasonable. I will add the restriction.

>
>> +  if (def_insn)
>> +  {
>> +    rtx set = single_set (def_insn);
>> +    if (set)
>> +      def_insn_cost = set_src_cost (SET_SRC (set), speed)
>> +                   + set_src_cost (SET_DEST (set), speed) + 1;
>> +    else
>> +      return false;
>> +  }
>
> As before: You'll have to deal with non-single_set insns also.
>

I will remove the limitation.

>
>> +void
>> +dump_cfg (FILE *file)
>> +{
>
> You'll find -fdump-rtl-fwprop-graph useful, as well as brief_dump_cfg.
>

Oh, thanks.

>
>> +  if (fwprop_addr)
>> +     return confirm_change_group_by_cost (false,
>> +                                       0,
>> +                                       false);
>> +  else
>> +    {
>> +      all_uses_replaced = (use_num == reg_replaced_num);
>> +      return confirm_change_group_by_cost (all_uses_replaced,
>> +                                        def_insn_cost,
>> +                                        true);
>> +    }
>
>
> What happens if you propagate into an insn that uses the same register
> twice? Will the DU chains still be valid (I don't think that's
> guaranteed)?

I think the DU chains still be valid. If propagate into the insn that
uses the same register twice, the two uses will be replaced when the
first use is seen (propagate_rtx_1 will propagate all the occurrances
of the same reg in the use insn).  When the second use is seen, the
df_use and use insn in its insn_info are still available.
forward_propagate_into will early return after check reg_mentioned_p
(DF_REF_REG (use), parent) and find out no reg is used  any more.

A testcase in gcc regression testsuite shows the case:
gcc/testsuite/gcc.target/i386/387-10.c

>
> Is the extra_benefit flag always applicable if all USEs of a DEF have
> been propagated out? What if the DEF is in an insn that is inherently
> necessary?
>

If the DEF is an insn that is inherently necessary, we may mistakenly
assume it could be deleted in cost estimation. But it will not affect
correctness because fwprop depends on delete_trivially_dead_insns in
fwprop_done or DCE after fwprop to delete dead DEF.

> Have you measured what effect this pass has on combine?

I didn't do the measure. I only see a case that new fwprop does the
same optimization as what combine does, which old fwprop doesn't
optimize. New fwprop bring the optimization to an earlier stage. The
testcase is gcc/testsuite/gcc.target/i386/387-10.c.
Qualitatively, fwprop optimize the single-def multiple down uses case
which combine cannot handle. But I didn't do measurement about their
interactive potential.

>
> Ciao!
> Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-12  7:18                 ` Wei Mi
@ 2013-03-16 22:49                   ` Steven Bosscher
  2013-03-17  7:15                     ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Steven Bosscher @ 2013-03-16 22:49 UTC (permalink / raw)
  To: Wei Mi; +Cc: GCC Patches, David Li, Uros Bizjak

On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
> For the motivational case, I need insn splitting to get the cost
> right. insn splitting is not very intrusive. All I need is to call
> split_insns func.

It may not look very intrusive, but there's a lot happening in the
back ground. You're creating a lot of new RTL, and then just throw it
away again. You fake the compiler into thinking you're much deeper in
the pipeline than you really are. You're assuming there are no
side-effects other than that some insn gets split, but there are back
ends where splitters may have side-effects.

Even though I've asked twice now, you still have not explained this
motivational case, except to say that there is one. *What* are you
trying to do, *what* is not happening without the splits, and *what*
happens if you split. Only if you explain that in a lot more detail
than "I have a motivational case" then we can look into what is a
proper solution.

The problem with some of the splitters is that they exist to break up
RTL from 'expand' to initially keep some pattern together to allow the
code transformation passes to handle the pattern as one instruction.
This made sense when RTL was the only intermediate representation and
splitting too early would inhibit some optimizations. But I would
expect most (if not all) such cases to be less relevant because of the
GIMPLE middle-end work. The only splitters you can trigger are the
pre-reload splitters (all the reload_completed conditions obviously
can't trigger if you're splitting from fwprop). Perhaps those
splitters can/should run earlier, or be made obsolete by expanding
directly to the post-splitting insns.

Unfortunately, it's not possible to tell for your case, because you
haven't explained it yet...

> So how about keep split_insns and remove peephole in the cost estimation func?

I'd strongly oppose this. I do not believe this is necessary, and I
think it's conceptually wrong.

>> What happens if you propagate into an insn that uses the same register
>> twice? Will the DU chains still be valid (I don't think that's
>> guaranteed)?
>
> I think the DU chains still be valid. If propagate into the insn that
> uses the same register twice, the two uses will be replaced when the
> first use is seen (propagate_rtx_1 will propagate all the occurrances
> of the same reg in the use insn).  When the second use is seen, the
> df_use and use insn in its insn_info are still available.
> forward_propagate_into will early return after check reg_mentioned_p
> (DF_REF_REG (use), parent) and find out no reg is used  any more.

With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
still valid.

In any case, returning to the RD problem for DU/UD chains is probably
a good idea, now that RD is not such a hog anymore. In effect fwprop.c
would return to what it looked like before the patch of r149010.

As a way forward on all of this, I'd suggest the following steps, each
with a separate patch:
1. replace the MD problem with RD again, and build full DU/UD chains.
2. post all the recog changes separately, with minimum impact on the
parts of the compiler you don't really change. (For apply_change_group
you could even choose to overload it, or use a NUM argument with a
default value -- not sure if default argument values are OK for GCC
tho'.)
3. implement propagation into multiple USEs, but without the splitting
and peepholing.
4. see about fixing the back end to either split earlier or expand to
the desired patterns directly.

Ciao!
Steven

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-16 22:49                   ` Steven Bosscher
@ 2013-03-17  7:15                     ` Wei Mi
  2013-03-17  7:23                       ` Andrew Pinski
                                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Wei Mi @ 2013-03-17  7:15 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 5813 bytes --]

Hi,

On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>> For the motivational case, I need insn splitting to get the cost
>> right. insn splitting is not very intrusive. All I need is to call
>> split_insns func.
>
> It may not look very intrusive, but there's a lot happening in the
> back ground. You're creating a lot of new RTL, and then just throw it
> away again. You fake the compiler into thinking you're much deeper in
> the pipeline than you really are. You're assuming there are no
> side-effects other than that some insn gets split, but there are back
> ends where splitters may have side-effects.

Ok, then I will remove the split_insns call.

>
> Even though I've asked twice now, you still have not explained this
> motivational case, except to say that there is one. *What* are you
> trying to do, *what* is not happening without the splits, and *what*
> happens if you split. Only if you explain that in a lot more detail
> than "I have a motivational case" then we can look into what is a
> proper solution.

:-). Sorry, I didn't say it clearly. The motivational case is the one
mentioned in the following posts (split_insns changes a << (b & 63) to
a << b).
http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html

If I remove the split_insns call and related cost estimation
adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
here looks like a reverse process of cse, the total cost after fwprop
change is increased.

Def insn 18:
        Use insn 23
        Use insn 22

If we include the split_insns cost estimation adjustment.
  extra benefit by removing def insn 18 = 5
  change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
not change after fwprop + insn splitting.
  change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
Total benefit is 5, fwprop will go on.

If we remove the split_insns cost estimation adjustment.
  extra benefit by removing def insn 18 = 5
  change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
insn 23 will increase after fwprop.
  change[1]: benefit = -4, verified - ok   // The insn 23 is the same
with insn 22
Total benefit is -3, fwprop will punt.

How about adding the (a << (b&63) ==> a << b) transformation in
simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
kind of architecture specific expr simplification? Then fwprop could
do the propagation as I expect.

>
> The problem with some of the splitters is that they exist to break up
> RTL from 'expand' to initially keep some pattern together to allow the
> code transformation passes to handle the pattern as one instruction.
> This made sense when RTL was the only intermediate representation and
> splitting too early would inhibit some optimizations. But I would
> expect most (if not all) such cases to be less relevant because of the
> GIMPLE middle-end work. The only splitters you can trigger are the
> pre-reload splitters (all the reload_completed conditions obviously
> can't trigger if you're splitting from fwprop). Perhaps those
> splitters can/should run earlier, or be made obsolete by expanding
> directly to the post-splitting insns.
>
> Unfortunately, it's not possible to tell for your case, because you
> haven't explained it yet...
>
>
>> So how about keep split_insns and remove peephole in the cost estimation func?
>
> I'd strongly oppose this. I do not believe this is necessary, and I
> think it's conceptually wrong.
>
>
>>> What happens if you propagate into an insn that uses the same register
>>> twice? Will the DU chains still be valid (I don't think that's
>>> guaranteed)?
>>
>> I think the DU chains still be valid. If propagate into the insn that
>> uses the same register twice, the two uses will be replaced when the
>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>> of the same reg in the use insn).  When the second use is seen, the
>> df_use and use insn in its insn_info are still available.
>> forward_propagate_into will early return after check reg_mentioned_p
>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>
> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
> still valid.

I think DF_REF_LOC of USE may be invalid if dangling rtx will be
recycled by garbage collection very soon (I don't know when GC will
happen). Although DF_REF_LOC of USE maybe invalid, the early return in
forward_propagate_into ensure it will not cause any correctness
problem.

>
> In any case, returning to the RD problem for DU/UD chains is probably
> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
> would return to what it looked like before the patch of r149010.

I remove MD problem and use DU/UD instead.

>
> As a way forward on all of this, I'd suggest the following steps, each
> with a separate patch:

Thanks for the suggestion!

> 1. replace the MD problem with RD again, and build full DU/UD chains.

I include patch.1 attached.

> 2. post all the recog changes separately, with minimum impact on the
> parts of the compiler you don't really change. (For apply_change_group
> you could even choose to overload it, or use a NUM argument with a
> default value -- not sure if default argument values are OK for GCC
> tho'.)

patch.2 attached.

> 3. implement propagation into multiple USEs, but without the splitting
> and peepholing.

patch.3 attached.

> 4. see about fixing the back end to either split earlier or expand to
> the desired patterns directly.

I havn't included this part. If you agree with the proposal to add the
transformation (a << (b&63) ==> a << b) in
simplify_binary_operation_1, I will send out another patch about it.

Thanks,
Wei.

[-- Attachment #2: ChangeLog.1 --]
[-- Type: application/octet-stream, Size: 534 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* fwprop.c (get_def_for_use): Change use_def_ref to
	inquiring UD chain.
	(fwprop_df_init): Rename build_single_def_use_links to
	fwprop_df_init. 
	(process_defs): Deleted.
	(process_uses): Likewise.
	(single_def_use_enter_block): Likewise.
	(single_def_use_leave_block): Likewise.
	(all_uses_available_at): Likewise.
	(register_active_defs): Likewise.
	(update_df_init): Likewise.
	(update_uses): Likewise.
	(update_df): Likewise.
	(fwprop_init): Remove active_defs.
	(fwprop_done): Likewise.


[-- Attachment #3: ChangeLog.2 --]
[-- Type: application/octet-stream, Size: 662 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* recog.c (validate_change_1): Add fields for change_t.
	(confirm_change_group): Add a default param.
	(set_change_verified): Add a change_t interface.
	(set_change_benefit): Likewise.
	(set_change_equal_note): Likewise.
	(set_change_associated_with_last): Likewise.
	(update_df): New. Update def/use references after insn changes.
	(confirm_change_one_by_one): New. Confirm each change separately.
	(confirm_change_group_by_cost): New. Confirm changes based on a
	simple cost model.
	(apply_change_group): Add a param.
	(cancel_changes): Add REG_EQUAL note according to equal_note field.
	* recog.h: Add some prototypes.


[-- Attachment #4: ChangeLog.3 --]
[-- Type: application/octet-stream, Size: 1008 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* fwprop.c (propagate_rtx_1): Remove PR_HANDLE_MEM.
	(varying_mem_p): Add call a kind of varying mem.
	(propagate_rtx): Remove PR_CAN_APPEAR and PR_HANDLE_MEM.
	(try_fwprop_subst): Extract the confirmation part to a separate
	func.
	(forward_propagate_subreg): Change the args of try_fwprop_subst.
	(mems_modified_p): New. Check whether dest is a mem.
	(mem_may_be_modified): New. Check if mem modified in an insn range.
	(rtx_search_arg): New struct.
	(reg_occur_p): New. Check if reg has ever occured in an expr.
	(reg_mentioned_num): New. How many times a reg appear.
	(forward_propagate_asm):  Make asm propagations being applied
	separately.
	(def_return_reg): New. Whether the set define a return reg.
	(forward_propagate_and_simplify): Add more check before propagation.
	(fwprop_done): Delete outdated trace.
	(iterate_def_uses): New. Iterate all the uses connecting to a def.
	(fwprop): Iterate all the defs instead of all the uses.
	(fwprop_addr): Likewise.


[-- Attachment #5: patch.1 --]
[-- Type: application/octet-stream, Size: 9883 bytes --]

--- v0/fwprop.c	2013-03-16 21:46:21.437939338 -0700
+++ v1/fwprop.c	2013-03-17 00:04:35.450324217 -0700
@@ -115,195 +115,33 @@ along with GCC; see the file COPYING3.
 
 static int num_changes;
 
-static vec<df_ref> use_def_ref;
-static vec<df_ref> reg_defs;
-static vec<df_ref> reg_defs_stack;
-
-/* The MD bitmaps are trimmed to include only live registers to cut
-   memory usage on testcases like insn-recog.c.  Track live registers
-   in the basic block and do not perform forward propagation if the
-   destination is a dead pseudo occurring in a note.  */
-static bitmap local_md;
-static bitmap local_lr;
-
 /* Return the only def in USE's use-def chain, or NULL if there is
    more than one def in the chain.  */
 
 static inline df_ref
 get_def_for_use (df_ref use)
 {
-  return use_def_ref[DF_REF_ID (use)];
-}
-
-
-/* Update the reg_defs vector with non-partial definitions in DEF_REC.
-   TOP_FLAG says which artificials uses should be used, when DEF_REC
-   is an artificial def vector.  LOCAL_MD is modified as after a
-   df_md_simulate_* function; we do more or less the same processing
-   done there, so we do not use those functions.  */
-
-#define DF_MD_GEN_FLAGS \
-	(DF_REF_PARTIAL | DF_REF_CONDITIONAL | DF_REF_MAY_CLOBBER)
-
-static void
-process_defs (df_ref *def_rec, int top_flag)
-{
-  df_ref def;
-  while ((def = *def_rec++) != NULL)
-    {
-      df_ref curr_def = reg_defs[DF_REF_REGNO (def)];
-      unsigned int dregno;
+  if (!DF_REF_CHAIN (use))
+    return NULL;
 
-      if ((DF_REF_FLAGS (def) & DF_REF_AT_TOP) != top_flag)
-	continue;
+  /* More than one reaching def.  */
+  if (DF_REF_CHAIN (use)->next)
+    return NULL;
 
-      dregno = DF_REF_REGNO (def);
-      if (curr_def)
-	reg_defs_stack.safe_push (curr_def);
-      else
-	{
-	  /* Do not store anything if "transitioning" from NULL to NULL.  But
-             otherwise, push a special entry on the stack to tell the
-	     leave_block callback that the entry in reg_defs was NULL.  */
-	  if (DF_REF_FLAGS (def) & DF_MD_GEN_FLAGS)
-	    ;
-	  else
-	    reg_defs_stack.safe_push (def);
-	}
-
-      if (DF_REF_FLAGS (def) & DF_MD_GEN_FLAGS)
-	{
-	  bitmap_set_bit (local_md, dregno);
-	  reg_defs[dregno] = NULL;
-	}
-      else
-	{
-	  bitmap_clear_bit (local_md, dregno);
-	  reg_defs[dregno] = def;
-	}
-    }
-}
-
-
-/* Fill the use_def_ref vector with values for the uses in USE_REC,
-   taking reaching definitions info from LOCAL_MD and REG_DEFS.
-   TOP_FLAG says which artificials uses should be used, when USE_REC
-   is an artificial use vector.  */
-
-static void
-process_uses (df_ref *use_rec, int top_flag)
-{
-  df_ref use;
-  while ((use = *use_rec++) != NULL)
-    if ((DF_REF_FLAGS (use) & DF_REF_AT_TOP) == top_flag)
-      {
-        unsigned int uregno = DF_REF_REGNO (use);
-        if (reg_defs[uregno]
-	    && !bitmap_bit_p (local_md, uregno)
-	    && bitmap_bit_p (local_lr, uregno))
-	  use_def_ref[DF_REF_ID (use)] = reg_defs[uregno];
-      }
-}
-
-
-static void
-single_def_use_enter_block (struct dom_walk_data *walk_data ATTRIBUTE_UNUSED,
-			    basic_block bb)
-{
-  int bb_index = bb->index;
-  struct df_md_bb_info *md_bb_info = df_md_get_bb_info (bb_index);
-  struct df_lr_bb_info *lr_bb_info = df_lr_get_bb_info (bb_index);
-  rtx insn;
-
-  bitmap_copy (local_md, &md_bb_info->in);
-  bitmap_copy (local_lr, &lr_bb_info->in);
-
-  /* Push a marker for the leave_block callback.  */
-  reg_defs_stack.safe_push (NULL);
-
-  process_uses (df_get_artificial_uses (bb_index), DF_REF_AT_TOP);
-  process_defs (df_get_artificial_defs (bb_index), DF_REF_AT_TOP);
-
-  /* We don't call df_simulate_initialize_forwards, as it may overestimate
-     the live registers if there are unused artificial defs.  We prefer
-     liveness to be underestimated.  */
-
-  FOR_BB_INSNS (bb, insn)
-    if (INSN_P (insn))
-      {
-        unsigned int uid = INSN_UID (insn);
-        process_uses (DF_INSN_UID_USES (uid), 0);
-        process_uses (DF_INSN_UID_EQ_USES (uid), 0);
-        process_defs (DF_INSN_UID_DEFS (uid), 0);
-	df_simulate_one_insn_forwards (bb, insn, local_lr);
-      }
-
-  process_uses (df_get_artificial_uses (bb_index), 0);
-  process_defs (df_get_artificial_defs (bb_index), 0);
-}
-
-/* Pop the definitions created in this basic block when leaving its
-   dominated parts.  */
-
-static void
-single_def_use_leave_block (struct dom_walk_data *walk_data ATTRIBUTE_UNUSED,
-			    basic_block bb ATTRIBUTE_UNUSED)
-{
-  df_ref saved_def;
-  while ((saved_def = reg_defs_stack.pop ()) != NULL)
-    {
-      unsigned int dregno = DF_REF_REGNO (saved_def);
-
-      /* See also process_defs.  */
-      if (saved_def == reg_defs[dregno])
-	reg_defs[dregno] = NULL;
-      else
-	reg_defs[dregno] = saved_def;
-    }
+  return DF_REF_CHAIN (use)->ref;
 }
 
-
 /* Build a vector holding the reaching definitions of uses reached by a
    single dominating definition.  */
 
 static void
-build_single_def_use_links (void)
+fwprop_df_init (void)
 {
-  struct dom_walk_data walk_data;
-
-  /* We use the multiple definitions problem to compute our restricted
-     use-def chains.  */
   df_set_flags (DF_EQ_NOTES);
-  df_md_add_problem ();
   df_note_add_problem ();
-  df_analyze ();
+  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
   df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
-
-  use_def_ref.create (DF_USES_TABLE_SIZE ());
-  use_def_ref.safe_grow_cleared (DF_USES_TABLE_SIZE ());
-
-  reg_defs.create (max_reg_num ());
-  reg_defs.safe_grow_cleared (max_reg_num ());
-
-  reg_defs_stack.create (n_basic_blocks * 10);
-  local_md = BITMAP_ALLOC (NULL);
-  local_lr = BITMAP_ALLOC (NULL);
-
-  /* Walk the dominator tree looking for single reaching definitions
-     dominating the uses.  This is similar to how SSA form is built.  */
-  walk_data.dom_direction = CDI_DOMINATORS;
-  walk_data.initialize_block_local_data = NULL;
-  walk_data.before_dom_children = single_def_use_enter_block;
-  walk_data.after_dom_children = single_def_use_leave_block;
-
-  init_walk_dominator_tree (&walk_data);
-  walk_dominator_tree (&walk_data, ENTRY_BLOCK_PTR);
-  fini_walk_dominator_tree (&walk_data);
-
-  BITMAP_FREE (local_lr);
-  BITMAP_FREE (local_md);
-  reg_defs.release ();
-  reg_defs_stack.release ();
+  df_analyze ();
 }
 
 \f
@@ -852,96 +690,6 @@ all_uses_available_at (rtx def_insn, rtx
 }
 
 \f
-static df_ref *active_defs;
-#ifdef ENABLE_CHECKING
-static sparseset active_defs_check;
-#endif
-
-/* Fill the ACTIVE_DEFS array with the use->def link for the registers
-   mentioned in USE_REC.  Register the valid entries in ACTIVE_DEFS_CHECK
-   too, for checking purposes.  */
-
-static void
-register_active_defs (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      df_ref def = get_def_for_use (use);
-      int regno = DF_REF_REGNO (use);
-
-#ifdef ENABLE_CHECKING
-      sparseset_set_bit (active_defs_check, regno);
-#endif
-      active_defs[regno] = def;
-    }
-}
-
-
-/* Build the use->def links that we use to update the dataflow info
-   for new uses.  Note that building the links is very cheap and if
-   it were done earlier, they could be used to rule out invalid
-   propagations (in addition to what is done in all_uses_available_at).
-   I'm not doing this yet, though.  */
-
-static void
-update_df_init (rtx def_insn, rtx insn)
-{
-#ifdef ENABLE_CHECKING
-  sparseset_clear (active_defs_check);
-#endif
-  register_active_defs (DF_INSN_USES (def_insn));
-  register_active_defs (DF_INSN_USES (insn));
-  register_active_defs (DF_INSN_EQ_USES (insn));
-}
-
-
-/* Update the USE_DEF_REF array for the given use, using the active definitions
-   in the ACTIVE_DEFS array to match pseudos to their def. */
-
-static inline void
-update_uses (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      int regno = DF_REF_REGNO (use);
-
-      /* Set up the use-def chain.  */
-      if (DF_REF_ID (use) >= (int) use_def_ref.length ())
-        use_def_ref.safe_grow_cleared (DF_REF_ID (use) + 1);
-
-#ifdef ENABLE_CHECKING
-      gcc_assert (sparseset_bit_p (active_defs_check, regno));
-#endif
-      use_def_ref[DF_REF_ID (use)] = active_defs[regno];
-    }
-}
-
-
-/* Update the USE_DEF_REF array for the uses in INSN.  Only update note
-   uses if NOTES_ONLY is true.  */
-
-static void
-update_df (rtx insn, rtx note)
-{
-  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
-
-  if (note)
-    {
-      df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
-      df_notes_rescan (insn);
-    }
-  else
-    {
-      df_uses_create (&PATTERN (insn), insn, 0);
-      df_insn_rescan (insn);
-      update_uses (DF_INSN_INFO_USES (insn_info));
-    }
-
-  update_uses (DF_INSN_INFO_EQ_USES (insn_info));
-}
-
 
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
@@ -1412,16 +1160,11 @@ fwprop_init (void)
   /* We do not always want to propagate into loops, so we have to find
      loops and be careful about them.  Avoid CFG modifications so that
      we don't have to update dominance information afterwards for
-     build_single_def_use_links.  */
+     fwprop_df_init.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
 
-  build_single_def_use_links ();
+  fwprop_df_init ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
-
-  active_defs = XNEWVEC (df_ref, max_reg_num ());
-#ifdef ENABLE_CHECKING
-  active_defs_check = sparseset_alloc (max_reg_num ());
-#endif
 }
 
 static void
@@ -1429,12 +1172,6 @@ fwprop_done (void)
 {
   loop_optimizer_finalize ();
 
-  use_def_ref.release ();
-  free (active_defs);
-#ifdef ENABLE_CHECKING
-  sparseset_free (active_defs_check);
-#endif
-
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());

[-- Attachment #6: patch.2 --]
[-- Type: application/octet-stream, Size: 10687 bytes --]

--- v0/recog.c	2013-03-16 21:46:26.827976689 -0700
+++ v2/recog.c	2013-03-16 23:20:25.840377883 -0700
@@ -181,6 +181,19 @@ typedef struct change_t
   rtx *loc;
   rtx old;
   bool unshare;
+  /* How much benefit to apply the change.  */
+  int benefit;
+  bool verified;
+  /* Record whether we need to create a equal note
+     if the change is canceled.  */
+  bool equal_note;
+  /* Some changes are committed or cancelled in
+     a group. We use associated_with_last flag to
+     make current change to be consistent with the
+     last change in the group. Adding or removing
+     CLOBBER in verify_change will create such kind
+     of change group.  */
+  bool associated_with_last;
 } change_t;
 
 static change_t *changes;
@@ -235,6 +248,10 @@ validate_change_1 (rtx object, rtx *loc,
   changes[num_changes].loc = loc;
   changes[num_changes].old = old;
   changes[num_changes].unshare = unshare;
+  changes[num_changes].benefit = 0;
+  changes[num_changes].verified = false;
+  changes[num_changes].equal_note = false;
+  changes[num_changes].associated_with_last = false;
 
   if (object && !MEM_P (object))
     {
@@ -463,17 +480,18 @@ verify_changes (int num)
   return (i == num_changes);
 }
 
-/* A group of changes has previously been issued with validate_change
-   and verified with verify_changes.  Call df_insn_rescan for each of
-   the insn changed and clear num_changes.  */
+/* A group of changes from num to num_changes - 1 has previously been
+   issued with validate_change and verified with verify_changes.
+   Call df_insn_rescan for each of the insn changed and reset num_changes
+   to num.  */
 
 void
-confirm_change_group (void)
+confirm_change_group (int num)
 {
   int i;
   rtx last_object = NULL;
 
-  for (i = 0; i < num_changes; i++)
+  for (i = num; i < num_changes; i++)
     {
       rtx object = changes[i].object;
 
@@ -492,24 +510,267 @@ confirm_change_group (void)
 
   if (last_object && INSN_P (last_object))
     df_insn_rescan (last_object);
+  num_changes = num;
+}
+
+/* Interfaces to operate change fields.  */
+
+void
+set_change_verified (int idx, bool val)
+{
+  changes[idx].verified = val;
+}
+
+void
+set_change_benefit (int idx, int val)
+{
+  changes[idx].benefit = val;
+}
+
+void
+set_change_equal_note (int idx, bool val)
+{
+  changes[idx].equal_note = val;
+}
+
+void
+set_change_associated_with_last (int idx, bool val)
+{
+  changes[idx].associated_with_last = val;
+}
+
+static void
+update_df (int from, int to, bool is_note)
+{
+  int i;
+  rtx insn;
+
+  if (is_note)
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+          if (changes[i].equal_note)
+	    {
+	      rtx note = find_reg_note (insn, REG_EQUAL, NULL_RTX);
+	      if (note)
+		{
+		  df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
+		  df_notes_rescan (insn);
+		}
+	    }
+	}
+    }
+  else
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+	  df_uses_create (&PATTERN (insn), insn, 0);
+	  df_insn_rescan (insn);
+	}
+    }
+}
+
+/* When we cannot committed all the changes group, we evaluate the change
+   one by one. We choose to commit those changes whose benefits are greater
+   than 0. For fwprop_addr, the cost evaluation is caculated using
+   targetm.address_cost() and has been done in propagate_rtx_1, so we set 
+   chk_benefit false to skip benefit checking and simply commit the change
+   for fwprop_addr.  */
+
+bool
+confirm_change_one_by_one (bool chk_benefit)
+{
+  int i, last_i = 0;
+  rtx last_object = NULL;
+  bool last_change_committed = false;
+
+  for (i = num_changes - 1; i >= 0; i--)
+    {
+      rtx object = changes[i].object;
+
+      /* If change is not verified successfully, or benefit <= 0
+	 and current change is not associated with last committed
+	 change, then we will backout the change.  */
+      if (!changes[i].verified
+	  || (chk_benefit
+	      && changes[i].benefit <= 0
+	      && !(last_change_committed
+		   && changes[i].associated_with_last)))
+	{
+	  rtx new_rtx = *changes[i].loc;
+	  *changes[i].loc = changes[i].old;
+	  if (changes[i].object && !MEM_P (changes[i].object))
+	    INSN_CODE (changes[i].object) = changes[i].old_code;
+	  last_change_committed = false;
+
+	  if (changes[i].equal_note)
+	    {
+	      set_unique_reg_note (changes[i].object,
+				   REG_EQUAL, copy_rtx (new_rtx));
+	      update_df (i, i, true);
+	    }
+	  continue;
+	}
+
+      if (changes[i].unshare)
+	*changes[i].loc = copy_rtx (*changes[i].loc);
+
+      /* Avoid unnecessary rescanning when multiple changes to same instruction
+	 are made.  */
+      if (object)
+	{
+	  if (object != last_object && last_object && INSN_P (last_object))
+	    update_df (last_i, last_i, false);
+	  last_object = object;
+	  last_i = i;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "\n   *** change[%d] -- committed ***\n", i);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (object));
+	  print_inline_rtx (dump_file, changes[i].old, 2);
+	  fprintf (dump_file, "\n with ");
+	  print_inline_rtx (dump_file, *changes[i].loc, 2);
+	  fprintf (dump_file, "\n resulting: ");
+	  print_inline_rtx (dump_file, object, 2);
+	}
+
+      last_change_committed = true;
+    }
+
+  if (last_object && INSN_P (last_object))
+    update_df (last_i, last_i, false);
+
   num_changes = 0;
+  if (last_object)
+    return true;
+  else
+    return false;
+}
+
+/* Confirm a group of change based on the cost. may_confirm_whole_group
+   is initialized to true if for fwprop all the uses are replaced and
+   the def insn could be deleted. For fwprop, extra_benefit is the benefit
+   to delete the def insn. chk_benefit is set when fwprop_addr is true.  */
+
+bool
+confirm_change_group_by_cost (bool may_confirm_whole_group,
+			      int extra_benefit,
+			      bool chk_benefit)
+{
+  int i, to;
+  int total_benefit = 0, total_positive_benefit = 0;
+  bool no_positive_benefit = true;
+
+  if (num_changes == 0)
+    {
+      if (dump_file)
+	fprintf (dump_file, "No changes being tried\n");
+      return false;
+    }
+
+  if (!chk_benefit)
+    return confirm_change_one_by_one (false);
+
+  if (dump_file)
+    fprintf (dump_file, "  extra benefit = %d\n", extra_benefit);
+
+  /* Iterate all the changes, adjust the change benefit if the change result
+     could be splitted or peephole optimized. Calculate the total benefits
+     and total positive benefits in the iteration.  */
+  for (i = 0; i < num_changes; i++)
+    {
+      /* If any change fail in the verification, we cannot confirm all
+	 the changes in a group.  */
+      if (!changes[i].verified)
+	{
+	  may_confirm_whole_group = false;
+	  if (dump_file)
+	    fprintf (dump_file, "  change[%d]: benefit = %d, verified - fail\n",
+		    i, changes[i].benefit);
+	  continue;
+	}
+
+      total_benefit += changes[i].benefit;
+      if (changes[i].benefit > 0)
+	{
+	  total_positive_benefit += changes[i].benefit;
+	  no_positive_benefit = false;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "  change[%d]: benefit = %d, verified - ok\n",
+		i, changes[i].benefit);
+    }
+
+  /* Compare the benefit and choose between applying the whole change
+     group and only applying the changes with positive benefit.  */
+  if (may_confirm_whole_group
+      && (total_benefit + extra_benefit < total_positive_benefit))
+    may_confirm_whole_group = false;
+
+  if (may_confirm_whole_group)
+    {
+      /* Commit all the changes in a group.  */
+      if (dump_file)
+	fprintf (dump_file, "!!! All the changes committed\n");
+
+      if (dump_file)
+	{
+	  for (i = 0; i < num_changes; i++)
+	    {
+	      fprintf (dump_file, "\nIn insn %d, replacing\n ",
+		       INSN_UID (changes[i].object));
+	      print_inline_rtx (dump_file, changes[i].old, 2);
+	      fprintf (dump_file, "\n with ");
+	      print_inline_rtx (dump_file, *changes[i].loc, 2);
+	      fprintf (dump_file, "\n resulting: ");
+	      print_inline_rtx (dump_file, changes[i].object, 2);
+	    }
+	}
+
+      to = num_changes - 1;
+      confirm_change_group ();
+      update_df (0, to, false);
+      return true;
+    }
+  else if (no_positive_benefit)
+    {
+      /* Cancel all the changes.  */
+      to = num_changes - 1;
+      cancel_changes (0);
+      update_df (0, to, true);
+      if (dump_file)
+	fprintf (dump_file, "No changes committed\n");
+      return false;
+    }
+  else
+    /* Cannot commit all the changes. Try to commit those changes
+       with positive benefit.  */
+    return confirm_change_one_by_one (true);
 }
 
 /* Apply a group of changes previously issued with `validate_change'.
    If all changes are valid, call confirm_change_group and return 1,
-   otherwise, call cancel_changes and return 0.  */
+   otherwise, call cancel_changes and return 0. The change group index
+   starts from num to the num_changes - 1.  */
 
 int
-apply_change_group (void)
+apply_change_group (int num)
 {
-  if (verify_changes (0))
+  if (verify_changes (num))
     {
-      confirm_change_group ();
+      confirm_change_group (num);
       return 1;
     }
   else
     {
-      cancel_changes (0);
+      cancel_changes (num);
       return 0;
     }
 }
@@ -534,9 +795,13 @@ cancel_changes (int num)
      they were made.  */
   for (i = num_changes - 1; i >= num; i--)
     {
+      rtx new_rtx = *changes[i].loc;
       *changes[i].loc = changes[i].old;
       if (changes[i].object && !MEM_P (changes[i].object))
 	INSN_CODE (changes[i].object) = changes[i].old_code;
+      if (changes[i].equal_note)
+	set_unique_reg_note (changes[i].object,
+			     REG_EQUAL, copy_rtx (new_rtx));
     }
   num_changes = num;
 }
--- v0/recog.h	2013-03-16 21:46:26.827976689 -0700
+++ v2/recog.h	2013-03-16 23:33:50.106502219 -0700
@@ -80,8 +80,16 @@ extern bool validate_unshare_change (rtx
 extern bool canonicalize_change_group (rtx insn, rtx x);
 extern int insn_invalid_p (rtx, bool);
 extern int verify_changes (int);
-extern void confirm_change_group (void);
-extern int apply_change_group (void);
+extern void confirm_change_group (int num = 0);
+extern int apply_change_group (int num = 0);
+extern void set_change_verified (int idx, bool val);
+extern void set_change_benefit (int idx, int val);
+extern void set_change_equal_note (int idx, bool val);
+extern void set_change_associated_with_last (int idx, bool val);
+extern bool confirm_change_one_by_one (bool chk_benefit);
+extern bool confirm_change_group_by_cost (bool may_confirm_whole_group,
+					  int extra_benefit,
+					  bool chk_benefit);
 extern int num_validated_changes (void);
 extern void cancel_changes (int);
 extern int constrain_operands (int);

[-- Attachment #7: patch.3 --]
[-- Type: application/octet-stream, Size: 28022 bytes --]

--- v1/fwprop.c	2013-03-17 00:04:35.450324217 -0700
+++ v3/fwprop.c	2013-03-17 00:04:45.120396071 -0700
@@ -39,6 +39,7 @@ along with GCC; see the file COPYING3.
 #include "domwalk.h"
 #include "emit-rtl.h"
 
+#include "tree.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -112,6 +113,34 @@ along with GCC; see the file COPYING3.
    I just punt and record only singleton use-def chains, which is
    all that is needed by fwprop.  */
 
+/* In order to make fwprop more effective in rtl optimization, we
+   extend it to handle general expressions instead of only three cases
+   above. The major changes include a) We need to check propagation
+   correctness for src exprs of def which contain mem references.
+   Previous fwprop for the three cases above doesn't have the problem.
+   b) We need a better cost model because the benefit is usually
+   not so apparent as the three cases above.
+
+   For a general fwprop problem, there are two possible sources where
+   benefit comes from. The frist is the new use insn after propagation
+   and simplification may have lower cost than itself before propagation,
+   The second is that if all the uses are replaced with the src of the 
+   def insn, the def insn could be deleted.
+
+   So instead of check each def-use pair independently, we use DU chain to
+   track all the uses for a def. For each def-use pair, we attempt the
+   propagation, record the change candidate in changes[] array, but we
+   wait to confirm the changes until all the pairs with the same def are
+   iterated. The changes confirmation is done in the func
+   confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr,
+   the benefit of each change is ensured by propagation_rtx_1 using
+   should_replace_address, so we just confirm all the changes without
+   checking benefit again.
+
+   Other changes:
+   We think the maintainance for use_def_ref vector is not necessary, so
+   we remove the related code in update_df/update_uses/update_df_init/
+   register_active_defs.  */
 
 static int num_changes;
 
@@ -250,36 +279,6 @@ should_replace_address (rtx old_rtx, rtx
   return (gain > 0);
 }
 
-
-/* Flags for the last parameter of propagate_rtx_1.  */
-
-enum {
-  /* If PR_CAN_APPEAR is true, propagate_rtx_1 always returns true;
-     if it is false, propagate_rtx_1 returns false if, for at least
-     one occurrence OLD, it failed to collapse the result to a constant.
-     For example, (mult:M (reg:M A) (minus:M (reg:M B) (reg:M A))) may
-     collapse to zero if replacing (reg:M B) with (reg:M A).
-
-     PR_CAN_APPEAR is disregarded inside MEMs: in that case,
-     propagate_rtx_1 just tries to make cheaper and valid memory
-     addresses.  */
-  PR_CAN_APPEAR = 1,
-
-  /* If PR_HANDLE_MEM is not set, propagate_rtx_1 won't attempt any replacement
-     outside memory addresses.  This is needed because propagate_rtx_1 does
-     not do any analysis on memory; thus it is very conservative and in general
-     it will fail if non-read-only MEMs are found in the source expression.
-
-     PR_HANDLE_MEM is set when the source of the propagation was not
-     another MEM.  Then, it is safe not to treat non-read-only MEMs as
-     ``opaque'' objects.  */
-  PR_HANDLE_MEM = 2,
-
-  /* Set when costs should be optimized for speed.  */
-  PR_OPTIMIZE_FOR_SPEED = 4
-};
-
-
 /* Replace all occurrences of OLD in *PX with NEW and try to simplify the
    resulting expression.  Replace *PX with a new RTL expression if an
    occurrence of OLD was found.
@@ -289,31 +288,20 @@ enum {
    that is because there is no simplify_gen_* function for LO_SUM).  */
 
 static bool
-propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
+propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
 {
   rtx x = *px, tem = NULL_RTX, op0, op1, op2;
   enum rtx_code code = GET_CODE (x);
   enum machine_mode mode = GET_MODE (x);
   enum machine_mode op_mode;
-  bool can_appear = (flags & PR_CAN_APPEAR) != 0;
   bool valid_ops = true;
 
-  if (!(flags & PR_HANDLE_MEM) && MEM_P (x) && !MEM_READONLY_P (x))
-    {
-      /* If unsafe, change MEMs to CLOBBERs or SCRATCHes (to preserve whether
-	 they have side effects or not).  */
-      *px = (side_effects_p (x)
-	     ? gen_rtx_CLOBBER (GET_MODE (x), const0_rtx)
-	     : gen_rtx_SCRATCH (GET_MODE (x)));
-      return false;
-    }
-
   /* If X is OLD_RTX, return NEW_RTX.  But not if replacing only within an
      address, and we are *not* inside one.  */
   if (x == old_rtx)
     {
       *px = new_rtx;
-      return can_appear;
+      return true;
     }
 
   /* If this is an expression, try recursive substitution.  */
@@ -322,7 +310,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_UNARY:
       op0 = XEXP (x, 0);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0))
 	return true;
       tem = simplify_gen_unary (code, mode, op0, op_mode);
@@ -332,8 +320,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_COMM_ARITH:
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_binary (code, mode, op0, op1);
@@ -344,8 +332,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
       op_mode = GET_MODE (op0) != VOIDmode ? GET_MODE (op0) : GET_MODE (op1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_relational (code, mode, op_mode, op0, op1);
@@ -357,9 +345,9 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op1 = XEXP (x, 1);
       op2 = XEXP (x, 2);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1) && op2 == XEXP (x, 2))
 	return true;
       if (op_mode == VOIDmode)
@@ -372,7 +360,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       if (code == SUBREG)
 	{
           op0 = XEXP (x, 0);
-	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0))
 	    return true;
 	  tem = simplify_gen_subreg (mode, op0, GET_MODE (SUBREG_REG (x)),
@@ -392,7 +380,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
 	  op0 = new_op0 = targetm.delegitimize_address (op0);
 	  valid_ops &= propagate_rtx_1 (&new_op0, old_rtx, new_rtx,
-					flags | PR_CAN_APPEAR);
+					speed);
 
 	  /* Dismiss transformation that we do not want to carry on.  */
 	  if (!valid_ops
@@ -407,7 +395,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (!(REG_P (old_rtx) && REG_P (new_rtx))
 	      && !should_replace_address (op0, new_op0, GET_MODE (x),
 					  MEM_ADDR_SPACE (x),
-	      			 	  flags & PR_OPTIMIZE_FOR_SPEED))
+	      			 	  speed))
 	    return true;
 
 	  tem = replace_equiv_address_nv (x, new_op0);
@@ -421,8 +409,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  /* The only simplification we do attempts to remove references to op0
 	     or make it constant -- in both cases, op0's invalidity will not
 	     make the result invalid.  */
-	  propagate_rtx_1 (&op0, old_rtx, new_rtx, flags | PR_CAN_APPEAR);
-	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+	  propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	    return true;
 
@@ -443,7 +431,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (rtx_equal_p (x, old_rtx))
 	    {
               *px = new_rtx;
-              return can_appear;
+              return true;
 	    }
 	}
       break;
@@ -458,10 +446,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
   *px = tem;
 
-  /* The replacement we made so far is valid, if all of the recursive
-     replacements were valid, or we could simplify everything to
-     a constant.  */
-  return valid_ops || can_appear || CONSTANT_P (tem);
+  return valid_ops;
 }
 
 
@@ -472,7 +457,7 @@ static int
 varying_mem_p (rtx *body, void *data ATTRIBUTE_UNUSED)
 {
   rtx x = *body;
-  return MEM_P (x) && !MEM_READONLY_P (x);
+  return (MEM_P (x) && !MEM_READONLY_P (x)) || CALL_P (x);
 }
 
 
@@ -490,27 +475,12 @@ propagate_rtx (rtx x, enum machine_mode
 {
   rtx tem;
   bool collapsed;
-  int flags;
 
   if (REG_P (new_rtx) && REGNO (new_rtx) < FIRST_PSEUDO_REGISTER)
     return NULL_RTX;
 
-  flags = 0;
-  if (REG_P (new_rtx)
-      || CONSTANT_P (new_rtx)
-      || (GET_CODE (new_rtx) == SUBREG
-	  && REG_P (SUBREG_REG (new_rtx))
-	  && (GET_MODE_SIZE (mode)
-	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
-    flags |= PR_CAN_APPEAR;
-  if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
-    flags |= PR_HANDLE_MEM;
-
-  if (speed)
-    flags |= PR_OPTIMIZE_FOR_SPEED;
-
   tem = x;
-  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), flags);
+  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), speed);
   if (tem == x || !collapsed)
     return NULL_RTX;
 
@@ -689,90 +659,75 @@ all_uses_available_at (rtx def_insn, rtx
   return true;
 }
 
-\f
-
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
    substituting the whole SET_SRC, so we can set a REG_EQUAL note if the
-   new insn is not recognized.  Return whether the substitution was
-   performed.  */
+   new insn is not recognized. We record possible change in changes array,
+   and record their verifying result and calculated benefit.  */
 
 static bool
-try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, rtx def_insn, bool set_reg_equal)
+try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, bool set_reg_equal)
 {
   rtx insn = DF_REF_INSN (use);
   rtx set = single_set (insn);
-  rtx note = NULL_RTX;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
-  int old_cost = 0;
-  bool ok;
-
-  update_df_init (def_insn, insn);
+  int old_cost = 0, benefit = 0;
+  int old_changes_num, new_changes_num;
 
   /* forward_propagate_subreg may be operating on an instruction with
-     multiple sets.  If so, assume the cost of the new instruction is
-     not greater than the old one.  */
+     multiple sets.  Assume the old cost is 1 and the new cost is 0,
+     if only verify_changes passes, subreg propagation will always be
+     confirmed.  */
   if (set)
-    old_cost = set_src_cost (SET_SRC (set), speed);
-  if (dump_file)
-    {
-      fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (insn));
-      print_inline_rtx (dump_file, *loc, 2);
-      fprintf (dump_file, "\n with ");
-      print_inline_rtx (dump_file, new_rtx, 2);
-      fprintf (dump_file, "\n");
-    }
-
-  validate_unshare_change (insn, loc, new_rtx, true);
-  if (!verify_changes (0))
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not recognized\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	   && set
-	   && set_src_cost (SET_SRC (set), speed) > old_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not profitable\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
+    old_cost = (set_src_cost (SET_SRC (set), speed)
+		+ set_src_cost (SET_DEST (set), speed) + 1);
   else
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changed insn %d\n", INSN_UID (insn));
-      ok = true;
-    }
+    old_cost = 1;
 
-  if (ok)
-    {
-      confirm_change_group ();
-      num_changes++;
-    }
-  else
-    {
-      cancel_changes (0);
-
-      /* Can also record a simplified value in a REG_EQUAL note,
-	 making a new one if one does not already exist.  */
-      if (set_reg_equal)
-	{
-	  if (dump_file)
-	    fprintf (dump_file, " Setting REG_EQUAL note\n");
-
-	  note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (new_rtx));
-	}
-    }
+  old_changes_num = num_changes_pending ();
+  validate_unshare_change (insn, loc, new_rtx, true);
 
-  if ((ok || note) && !CONSTANT_P (new_rtx))
-    update_df (insn, note);
+  /* verify_changes may calls validate_change and add new changes.
+     The new changes are either adding or removing CLOBBER to match
+     insn pattern. These changes should either be committed or canceled
+     in a group, so we use set_change_associated_with_last to indicate
+     whether current change is committed depends on the last change.  */
+  if (verify_changes (old_changes_num))
+  {
+    int i;
+    int new_cost;
+
+    if (set)
+      new_cost = set_src_cost (SET_SRC (set), speed)
+		 + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      new_cost = 0;
+    /* validate_unshare_change will tentatively change *loc to new_rtx.
+       We compare the cost before and after validate_unshare_change
+       and get the potential benefit of the change.  */
+    benefit = old_cost - new_cost;
+
+    /* For the change group with adding or removing CLOBBER, we attach
+       the real change benefit to the last change, that is because
+       in func confirm_change_group_by_cost, we need to iterate change
+       in reverse order to make sure cancelling change works correctly.
+       We set other change's benefit to 0, so the overall benefit for
+       the change group is the same. Meanwhile set all the changes in
+       the group to being verified successfully.  */
+    new_changes_num = num_changes_pending ();
+    set_change_verified (new_changes_num - 1, true);
+    set_change_benefit (new_changes_num - 1, benefit);
+    for (i = new_changes_num - 2; i >= old_changes_num; i--)
+      {
+	set_change_verified (i, true);
+	set_change_benefit (i, 0);
+	set_change_associated_with_last (i, true);
+      }
+    set_change_equal_note (old_changes_num, set_reg_equal);
+    return true;
+  }
 
-  return ok;
+  return false;
 }
 
 /* For the given single_set INSN, containing SRC known to be a
@@ -855,8 +810,7 @@ forward_propagate_subreg (df_ref use, rt
 	  && GET_MODE (SUBREG_REG (src)) == use_mode
 	  && subreg_lowpart_p (src)
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src), false);
     }
 
   /* If this is a SUBREG of a ZERO_EXTEND or SIGN_EXTEND, and the SUBREG
@@ -887,20 +841,136 @@ forward_propagate_subreg (df_ref use, rt
 	  && (targetm.mode_rep_extended (use_mode, GET_MODE (src))
 	      != (int) GET_CODE (src))
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0), false);
     }
 
   return false;
 }
 
-/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.  */
+static void
+mems_modified_p (rtx dest, const_rtx setter ATTRIBUTE_UNUSED, void *data)
+{
+  bool *modified = (bool *)data;
+
+  /* If DEST is not a MEM, then it will not conflict with the load.  Note
+     that function calls are assumed to clobber memory, but are handled
+     elsewhere.  */
+  if (MEM_P (dest))
+    {
+      *modified = true;
+      return;
+    }
+}
+
+/* Check whether any memory modification insn from from insn
+   to to insn.  */
+
+static bool
+mem_may_be_modified (rtx from, rtx to)
+{
+  bool modified = false;
+  rtx insn;
+
+  /* For now, we only check the simple case where from and to
+     are in the same bb.  */
+  basic_block bb = BLOCK_FOR_INSN (from);
+  if (bb != BLOCK_FOR_INSN (to))
+    return true;
+
+  for (insn = from; insn != to; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      note_stores (PATTERN (insn), mems_modified_p, &modified);
+      if (modified)
+	break;
+
+      modified = CALL_P (insn);
+      if (modified)
+	break;
+
+      modified = volatile_insn_p (PATTERN (insn));
+      if (modified)
+	break;
+    }
+  gcc_assert (insn);
+  return modified;
+}
+
+struct rtx_search_arg
+{
+  /* What we are searching for.  */
+  rtx x;
+  /* The occurrence counter.  */
+  int n;
+};
+
+typedef struct rtx_search_arg *rtx_search_arg_p;
+
+int
+reg_occur_p (rtx *in, void *arg)
+{
+  enum rtx_code code;
+  rtx_search_arg_p p = (rtx_search_arg_p) arg;
+  rtx reg = p->x;
+
+  if (in == 0 || *in == 0)
+    return -1;
+
+  code = GET_CODE (*in);
+
+  switch (code)
+    {
+      /* Compare registers by number.  */
+    case REG:
+      if (REG_P (reg) && REGNO (*in) == REGNO (reg))
+	p->n++;
+      /* These codes have no constituent expressions
+	 and are unique.  */
+    case SCRATCH:
+    case CC0:
+    case PC:
+      /* Skip expr list.  */
+    case EXPR_LIST:
+    CASE_CONST_ANY:
+      /* These are kept unique for a given value.  */
+      return -1;
+
+    default:
+      break;
+    }
+
+  return 0;
+}
+
+/* Calculate how many times reg appears in rtx "in".  */
+
+static int
+reg_mentioned_num (rtx reg, rtx in)
+{
+  struct rtx_search_arg data;
+  enum rtx_code code = GET_CODE (reg);
+  gcc_assert (code == REG || code == SUBREG);
+  if (code == SUBREG)
+    reg = SUBREG_REG (reg);
+
+  data.x = reg;
+  data.n = 0;
+  for_each_rtx (&in, &reg_occur_p, (void *)&data);
+  return data.n;
+}
+
+/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.
+   All the changes added here will be applied immediately without
+   affecting any existing changes. After this func, the changes
+   num is the same as before the func.  */
 
 static bool
 forward_propagate_asm (df_ref use, rtx def_insn, rtx def_set, rtx reg)
 {
   rtx use_insn = DF_REF_INSN (use), src, use_pat, asm_operands, new_rtx, *loc;
-  int speed_p, i;
+  int speed_p, i, old_change_num, new_change_num;
   df_ref *use_vec;
 
   gcc_assert ((DF_REF_FLAGS (use) & DF_REF_IN_NOTE) == 0);
@@ -914,7 +984,7 @@ forward_propagate_asm (df_ref use, rtx d
   if (use_vec[0] && use_vec[1])
     return false;
 
-  update_df_init (def_insn, use_insn);
+  old_change_num = num_changes_pending ();
   speed_p = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
   asm_operands = NULL_RTX;
   switch (GET_CODE (use_pat))
@@ -962,14 +1032,43 @@ forward_propagate_asm (df_ref use, rtx d
 	validate_unshare_change (use_insn, loc, new_rtx, true);
     }
 
-  if (num_changes_pending () == 0 || !apply_change_group ())
+  new_change_num = num_changes_pending ();
+  if ((new_change_num - old_change_num) == 0
+      || !apply_change_group (old_change_num))
     return false;
 
-  update_df (use_insn, NULL);
-  num_changes++;
+  df_uses_create (&PATTERN (use_insn), use_insn, 0);
+  df_insn_rescan (use_insn);
+
   return true;
 }
 
+/* Find whether the set define a return reg.  */
+
+static bool
+def_return_reg (rtx set)
+{
+  edge eg;
+  edge_iterator ei;
+  rtx dest = SET_DEST (set);
+
+  if (!REG_P (dest))
+    return false;
+
+  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
+    if (eg->flags & EDGE_FALLTHRU)
+      {
+	basic_block src_bb = eg->src;
+	rtx last_insn, ret_reg;
+	if (NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
+	    && GET_CODE (PATTERN (last_insn)) == USE
+	    && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
+	    && REGNO (ret_reg) == REGNO (dest))
+	  return true;
+      }
+  return false;
+}
+
 /* Try to replace USE with SRC (defined in DEF_INSN) and simplify the
    result.  */
 
@@ -978,7 +1077,7 @@ forward_propagate_and_simplify (df_ref u
 {
   rtx use_insn = DF_REF_INSN (use);
   rtx use_set = single_set (use_insn);
-  rtx src, reg, new_rtx, *loc;
+  rtx src, reg, new_rtx, *loc, use_set_dest, use_set_src;
   bool set_reg_equal;
   enum machine_mode mode;
   int asm_use = -1;
@@ -1036,9 +1135,37 @@ forward_propagate_and_simplify (df_ref u
       return false;
     }
 
+  /* If only new_rtx contains varying mem or has other side effect, and
+     mem maybe modified between def and use, we cannot do propagation
+     safely. mem_may_be_modified is a simple check without inquiring
+     cfg and alias result.  */
+  if (for_each_rtx (&src, varying_mem_p, NULL)
+      && mem_may_be_modified (def_insn, use_insn))
+    return false;
+
+  if (volatile_refs_p (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
+  /* If the dest of the use insn is a return reg, we don't try fwprop,
+     because mode-switching tries to find return reg copy insn and create
+     pre exit basicblock, and fwprop for return copy insn may make it
+     confused.  */
+  if (def_return_reg (use_set))
+    return false;
+
+  /* We have (hard reg = reg) type insns for func param passing or
+     return value setting. We don't want to propagate in such case
+     because it may restrict cse/gcse. Check hash_rtx and
+     hash_scan_set.  */
+  use_set_dest = SET_DEST (use_set);
+  use_set_src = SET_SRC (use_set);
+  if (REG_P (use_set_dest) && REG_P (use_set_src)
+      && (REGNO (use_set_dest) < FIRST_PSEUDO_REGISTER))
+    return false;
+
   /* Else try simplifying.  */
 
   if (DF_REF_TYPE (use) == DF_REF_REG_MEM_STORE)
@@ -1087,7 +1214,7 @@ forward_propagate_and_simplify (df_ref u
   if (!new_rtx)
     return false;
 
-  return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
+  return try_fwprop_subst (use, loc, new_rtx, set_reg_equal);
 }
 
 
@@ -1150,7 +1277,6 @@ forward_propagate_into (df_ref use)
   return false;
 }
 
-\f
 static void
 fwprop_init (void)
 {
@@ -1175,47 +1301,142 @@ fwprop_done (void)
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());
-
-  if (dump_file)
-    fprintf (dump_file,
-	     "\nNumber of successful forward propagations: %d\n\n",
-	     num_changes);
 }
 
-
-/* Main entry point.  */
-
 static bool
 gate_fwprop (void)
 {
   return optimize > 0 && flag_forward_propagate;
 }
 
+/* Main func for forward propagation. Iterate all the uses connecting to
+   the same def. For each def-use pair, try forward propagate the src of
+   the def into the use. After all the def-use pairs are iterated, confirm
+   the changes based on the cost of the whole group.  */
+
+static bool
+iterate_def_uses (df_ref def, bool fwprop_addr)
+{
+  int use_num = 0;
+  int def_insn_cost = 0;
+  rtx def_insn, use_insn;
+  struct df_link *uses;
+  int reg_replaced_num = 0;
+  bool all_uses_replaced;
+  bool speed;
+
+  def_insn = DF_REF_INSN (def);
+  speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (def_insn));
+
+  if (def_insn)
+  {
+    rtx set = single_set (def_insn);
+    if (set)
+      def_insn_cost = set_src_cost (SET_SRC (set), speed)
+		      + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      return false;
+  }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n------------------------\n");
+      fprintf (dump_file, "Def %d:\n", INSN_UID (def_insn));
+    }
+
+  for (uses = DF_REF_CHAIN (def), use_num = 0;
+       uses; uses = uses->next)
+  {
+    int old_reg_num, new_reg_num;
+
+    df_ref use = uses->ref;
+    if (DF_REF_IS_ARTIFICIAL (use))
+	continue;
+
+    use_insn = DF_REF_INSN (use);
+    if (!NONDEBUG_INSN_P (use_insn))
+	continue;
+
+    if (dump_file)
+      fprintf (dump_file, "\tUse %d\n", INSN_UID (use_insn));
+
+    if (fwprop_addr)
+      {
+	if (DF_REF_TYPE (use) != DF_REF_REG_USE
+	    && DF_REF_BB (use)->loop_father != NULL
+	    /* The outer most loop is not really a loop.  */
+	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
+	  forward_propagate_into (use);
+      }
+    else
+      {
+	if (DF_REF_TYPE (use) == DF_REF_REG_USE
+	    || DF_REF_BB (use)->loop_father == NULL
+	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
+	  {
+	    old_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+
+	    forward_propagate_into (use);
+
+	    new_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+	    reg_replaced_num += old_reg_num - new_reg_num;
+	  }
+      }
+    use_num++;
+  }
+
+  if (!use_num)
+    return false;
+
+  if (fwprop_addr)
+     return confirm_change_group_by_cost (false,
+					  0,
+					  false);
+  else
+    {
+      all_uses_replaced = (use_num == reg_replaced_num);
+      return confirm_change_group_by_cost (all_uses_replaced,
+					   def_insn_cost,
+					   true);
+    }
+}
+
+/* Try forward propagate src of the def to the normal uses.  */
+
 static unsigned int
 fwprop (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.
+  FOR_EACH_BB (bb)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
 
-     Do not forward propagate addresses into loops until after unrolling.
-     CSE did so because it was able to fix its own mess, but we are not.  */
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, false);
+	      need_cleanup |= result;
 
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
-    {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	    || DF_REF_BB (use)->loop_father == NULL
-	    /* The outer most loop is not really a loop.  */
-	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
-	  need_cleanup |= forward_propagate_into (use);
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
+
   fwprop_done ();
   if (need_cleanup)
     cleanup_cfg (0);
@@ -1244,25 +1465,39 @@ struct rtl_opt_pass pass_rtl_fwprop =
  }
 };
 
+/* Try forward propagate src of the def to the uses in memory addresses.  */
+
 static unsigned int
 fwprop_addr (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop_addr ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.  */
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) != DF_REF_REG_USE
-	    && DF_REF_BB (use)->loop_father != NULL
-	    /* The outer most loop is not really a loop.  */
-	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
+
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, true);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
   fwprop_done ();

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-17  7:15                     ` Wei Mi
@ 2013-03-17  7:23                       ` Andrew Pinski
  2013-03-24  4:18                       ` Wei Mi
  2013-04-02  7:11                       ` Wei Mi
  2 siblings, 0 replies; 29+ messages in thread
From: Andrew Pinski @ 2013-03-17  7:23 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html



>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it

Sounds like you need to look into using SHIFT_COUNT_TRUNCATED and
TARGET_SHIFT_TRUNCATION_MASK which you could use in
simplify_binary_operation_1 without it being target specific as it
uses the target hooks to find out that :).

Thanks,
Andrew

.
>
> Thanks,
> Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-17  7:15                     ` Wei Mi
  2013-03-17  7:23                       ` Andrew Pinski
@ 2013-03-24  4:18                       ` Wei Mi
  2013-03-24 12:33                         ` Oleg Endo
  2013-03-25  9:36                         ` Richard Biener
  2013-04-02  7:11                       ` Wei Mi
  2 siblings, 2 replies; 29+ messages in thread
From: Wei Mi @ 2013-03-24  4:18 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 6706 bytes --]

This is the patch to add the shift truncation in
simplify_binary_operation_1. I add a new hook
TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
whether we can do shift truncation. I didn't use
TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
uses the macro SHIFT_COUNT_TRUNCATED. If I change
SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
TARGET_SHIFT_TRUNCATION_MASK, I need to give
TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
trivial to get at many places in existing code.

patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Thanks,
Wei.

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it.
>
> Thanks,
> Wei.

[-- Attachment #2: ChangeLog.4 --]
[-- Type: application/octet-stream, Size: 376 bytes --]

2013-03-23  Wei Mi  <wmi@google.com>

	* simplify-rtx.c (simplify_binary_operation_1): Add simplification
	for shift truncation.
	* targhooks.c (default_shift_count_truncated): New.
	* target.def: New shift_count_truncated hook.
	* config/i386/i386.c (ix86_shift_count_truncated): New.
	* doc/tm.texi.in: Add comment for hift_count_truncated hook.
	* doc/tm.texi: Generated.


[-- Attachment #3: patch.4 --]
[-- Type: application/octet-stream, Size: 6076 bytes --]

Index: simplify-rtx.c
===================================================================
--- simplify-rtx.c	(revision 196270)
+++ simplify-rtx.c	(working copy)
@@ -3252,11 +3252,36 @@ simplify_binary_operation_1 (enum rtx_co
 	  && ! side_effects_p (op1))
 	return op0;
     canonicalize_shift:
-      if (SHIFT_COUNT_TRUNCATED && CONST_INT_P (op1))
+      /* Use TARGET_SHIFT_COUNT_TRUNCATED to check whether truncation
+	 for shift is ok for the target.  */
+      if (targetm.shift_count_truncated ((int *)&code))
 	{
-	  val = INTVAL (op1) & (GET_MODE_BITSIZE (mode) - 1);
-	  if (val != INTVAL (op1))
-	    return simplify_gen_binary (code, mode, op0, GEN_INT (val));
+	  /* Suppose mode is DImode,
+	     X << const    ->   X << (const & 63),
+	     X << (Y & 63) ->   X << Y.  */
+	  if (CONST_INT_P (op1))
+	    {
+	      val = INTVAL (op1) & (GET_MODE_BITSIZE (mode) - 1);
+	      if (val != INTVAL (op1))
+		return simplify_gen_binary (code, mode, op0, GEN_INT (val));
+	    }
+	  else if (GET_CODE (op1) == SUBREG
+		   && SUBREG_BYTE (op1) == 0
+		   && (mode == SImode || mode == DImode)
+		   && GET_CODE (XEXP (op1, 0)) == AND
+		   && REG_P (XEXP (XEXP (op1, 0), 0))
+		   && CONST_INT_P (XEXP (XEXP (op1, 0), 1)))
+	    {
+	      unsigned HOST_WIDE_INT val;
+	      unsigned HOST_WIDE_INT modsize = GET_MODE_BITSIZE (mode) - 1;
+	      val = INTVAL (XEXP (XEXP (op1, 0), 1)) & modsize;
+	      if (val == modsize)
+		{
+		  rtx temp = gen_rtx_SUBREG (GET_MODE (op1),
+					    XEXP (XEXP (op1, 0), 0), 0);
+		  return simplify_gen_binary (code, mode, op0, temp);
+		}
+	    }
 	}
       break;
 
Index: targhooks.c
===================================================================
--- targhooks.c	(revision 196270)
+++ targhooks.c	(working copy)
@@ -222,6 +222,17 @@ default_unwind_word_mode (void)
   return word_mode;
 }
 
+/* The default implementation of TARGET_SHIFT_COUNT_TRUNCATED.  */
+bool
+default_shift_count_truncated (int *code ATTRIBUTE_UNUSED)
+{
+#ifdef SHIFT_COUNT_TRUNCATED
+  return true;
+#else
+  return false;
+#endif
+}
+
 /* The default implementation of TARGET_SHIFT_TRUNCATION_MASK.  */
 
 unsigned HOST_WIDE_INT
Index: target.def
===================================================================
--- target.def	(revision 196270)
+++ target.def	(working copy)
@@ -1584,6 +1584,14 @@ DEFHOOK
  const char *, (const char *name),
  default_strip_name_encoding)
 
+/* If the op code are known to always truncate the shift count,
+   return true, otherwise return false.  */
+DEFHOOK
+(shift_count_truncated,
+ "",
+ bool, (int *code),
+ default_shift_count_truncated)
+
 /* If shift optabs for MODE are known to always truncate the shift count,
    return the mask that they apply.  Return 0 otherwise.  */
 DEFHOOK
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 196270)
+++ config/i386/i386.c	(working copy)
@@ -42176,6 +42176,22 @@ ix86_memmodel_check (unsigned HOST_WIDE_
   return val;
 }
 
+static bool
+ix86_shift_count_truncated (int *incode)
+{
+  enum rtx_code code = (enum rtx_code) *incode;
+  if (code == ASHIFT
+      || code == ASHIFTRT
+      || code == LSHIFTRT
+      || code == SS_ASHIFT
+      || code == US_ASHIFT
+      || code == ROTATE
+      || code == ROTATERT)
+    return true;
+  else
+    return false;
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -42543,6 +42559,9 @@ ix86_memmodel_check (unsigned HOST_WIDE_
 #undef TARGET_SPILL_CLASS
 #define TARGET_SPILL_CLASS ix86_spill_class
 
+#undef TARGET_SHIFT_COUNT_TRUNCATED
+#define TARGET_SHIFT_COUNT_TRUNCATED ix86_shift_count_truncated
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 \f
 #include "gt-i386.h"
Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 196270)
+++ doc/tm.texi	(working copy)
@@ -10393,6 +10393,17 @@ the implied truncation of the shift inst
 You need not define this macro if it would always have the value of zero.
 @end defmac
 
+@deftypefn {Target Hook} bool TARGET_SHIFT_COUNT_TRUNCATED (int *@var{code})
+This function is used to replace SHIFT_COUNT_TRUNCATED. For 80386 and 680x0
+truncation only applies to shift operations and not the bit-field operations.
+This function is used to look into the operation code and decide if truncation
+could be applied. This is to enable truncation optimization in simplify_rtx.
+It is better to keep such kind of optimizations in simplify_rtx because if the
+optimization can only be done after the transformations of fwprop or combine,
+inquiring split result in fwprop and combine cost estimation phase will be
+intrusive.
+@end deftypefn
+
 @anchor{TARGET_SHIFT_TRUNCATION_MASK}
 @deftypefn {Target Hook} {unsigned HOST_WIDE_INT} TARGET_SHIFT_TRUNCATION_MASK (enum machine_mode @var{mode})
 This function describes how the standard shift patterns for @var{mode}
Index: doc/tm.texi.in
===================================================================
--- doc/tm.texi.in	(revision 196270)
+++ doc/tm.texi.in	(working copy)
@@ -10243,6 +10243,17 @@ the implied truncation of the shift inst
 You need not define this macro if it would always have the value of zero.
 @end defmac
 
+@hook TARGET_SHIFT_COUNT_TRUNCATED
+This function is used to replace SHIFT_COUNT_TRUNCATED. For 80386 and 680x0
+truncation only applies to shift operations and not the bit-field operations.
+This function is used to look into the operation code and decide if truncation
+could be applied. This is to enable truncation optimization in simplify_rtx.
+It is better to keep such kind of optimizations in simplify_rtx because if the
+optimization can only be done after the transformations of fwprop or combine,
+inquiring split result in fwprop and combine cost estimation phase will be
+intrusive.
+@end deftypefn
+
 @anchor{TARGET_SHIFT_TRUNCATION_MASK}
 @hook TARGET_SHIFT_TRUNCATION_MASK
 This function describes how the standard shift patterns for @var{mode}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-24  4:18                       ` Wei Mi
@ 2013-03-24 12:33                         ` Oleg Endo
  2013-03-25  9:36                         ` Richard Biener
  1 sibling, 0 replies; 29+ messages in thread
From: Oleg Endo @ 2013-03-24 12:33 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

Hi,

On Sat, 2013-03-23 at 21:18 -0700, Wei Mi wrote:
> This is the patch to add the shift truncation in
> simplify_binary_operation_1. I add a new hook
> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
> whether we can do shift truncation. I didn't use
> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
> uses the macro SHIFT_COUNT_TRUNCATED. If I change
> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
> TARGET_SHIFT_TRUNCATION_MASK, I need to give
> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
> trivial to get at many places in existing code.
> 

During 4.8 development there was a similar issue with the
TARGET_CANONICALIZE_COMPARISON hook.  As a temporary solution the
rtx_code has been passed as int.  I think the story started here:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00379.html

The conclusion regarding rtx_code ...
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00646.html

Maybe this should be addressed first, since now there are at least two
cases where it's in the way.

Cheers,
Oleg



> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.
> 
> Thanks,
> Wei.
> 
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> > Hi,
> >
> > On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> >> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
> >>> For the motivational case, I need insn splitting to get the cost
> >>> right. insn splitting is not very intrusive. All I need is to call
> >>> split_insns func.
> >>
> >> It may not look very intrusive, but there's a lot happening in the
> >> back ground. You're creating a lot of new RTL, and then just throw it
> >> away again. You fake the compiler into thinking you're much deeper in
> >> the pipeline than you really are. You're assuming there are no
> >> side-effects other than that some insn gets split, but there are back
> >> ends where splitters may have side-effects.
> >
> > Ok, then I will remove the split_insns call.
> >
> >>
> >> Even though I've asked twice now, you still have not explained this
> >> motivational case, except to say that there is one. *What* are you
> >> trying to do, *what* is not happening without the splits, and *what*
> >> happens if you split. Only if you explain that in a lot more detail
> >> than "I have a motivational case" then we can look into what is a
> >> proper solution.
> >
> > :-). Sorry, I didn't say it clearly. The motivational case is the one
> > mentioned in the following posts (split_insns changes a << (b & 63) to
> > a << b).
> > http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> > http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
> >
> > If I remove the split_insns call and related cost estimation
> > adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> > here looks like a reverse process of cse, the total cost after fwprop
> > change is increased.
> >
> > Def insn 18:
> >         Use insn 23
> >         Use insn 22
> >
> > If we include the split_insns cost estimation adjustment.
> >   extra benefit by removing def insn 18 = 5
> >   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> > not change after fwprop + insn splitting.
> >   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> > Total benefit is 5, fwprop will go on.
> >
> > If we remove the split_insns cost estimation adjustment.
> >   extra benefit by removing def insn 18 = 5
> >   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> > insn 23 will increase after fwprop.
> >   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> > with insn 22
> > Total benefit is -3, fwprop will punt.
> >
> > How about adding the (a << (b&63) ==> a << b) transformation in
> > simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> > kind of architecture specific expr simplification? Then fwprop could
> > do the propagation as I expect.
> >
> >>
> >> The problem with some of the splitters is that they exist to break up
> >> RTL from 'expand' to initially keep some pattern together to allow the
> >> code transformation passes to handle the pattern as one instruction.
> >> This made sense when RTL was the only intermediate representation and
> >> splitting too early would inhibit some optimizations. But I would
> >> expect most (if not all) such cases to be less relevant because of the
> >> GIMPLE middle-end work. The only splitters you can trigger are the
> >> pre-reload splitters (all the reload_completed conditions obviously
> >> can't trigger if you're splitting from fwprop). Perhaps those
> >> splitters can/should run earlier, or be made obsolete by expanding
> >> directly to the post-splitting insns.
> >>
> >> Unfortunately, it's not possible to tell for your case, because you
> >> haven't explained it yet...
> >>
> >>
> >>> So how about keep split_insns and remove peephole in the cost estimation func?
> >>
> >> I'd strongly oppose this. I do not believe this is necessary, and I
> >> think it's conceptually wrong.
> >>
> >>
> >>>> What happens if you propagate into an insn that uses the same register
> >>>> twice? Will the DU chains still be valid (I don't think that's
> >>>> guaranteed)?
> >>>
> >>> I think the DU chains still be valid. If propagate into the insn that
> >>> uses the same register twice, the two uses will be replaced when the
> >>> first use is seen (propagate_rtx_1 will propagate all the occurrances
> >>> of the same reg in the use insn).  When the second use is seen, the
> >>> df_use and use insn in its insn_info are still available.
> >>> forward_propagate_into will early return after check reg_mentioned_p
> >>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
> >>
> >> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
> >> still valid.
> >
> > I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> > recycled by garbage collection very soon (I don't know when GC will
> > happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> > forward_propagate_into ensure it will not cause any correctness
> > problem.
> >
> >>
> >> In any case, returning to the RD problem for DU/UD chains is probably
> >> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
> >> would return to what it looked like before the patch of r149010.
> >
> > I remove MD problem and use DU/UD instead.
> >
> >>
> >> As a way forward on all of this, I'd suggest the following steps, each
> >> with a separate patch:
> >
> > Thanks for the suggestion!
> >
> >> 1. replace the MD problem with RD again, and build full DU/UD chains.
> >
> > I include patch.1 attached.
> >
> >> 2. post all the recog changes separately, with minimum impact on the
> >> parts of the compiler you don't really change. (For apply_change_group
> >> you could even choose to overload it, or use a NUM argument with a
> >> default value -- not sure if default argument values are OK for GCC
> >> tho'.)
> >
> > patch.2 attached.
> >
> >> 3. implement propagation into multiple USEs, but without the splitting
> >> and peepholing.
> >
> > patch.3 attached.
> >
> >> 4. see about fixing the back end to either split earlier or expand to
> >> the desired patterns directly.
> >
> > I havn't included this part. If you agree with the proposal to add the
> > transformation (a << (b&63) ==> a << b) in
> > simplify_binary_operation_1, I will send out another patch about it.
> >
> > Thanks,
> > Wei.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-24  4:18                       ` Wei Mi
  2013-03-24 12:33                         ` Oleg Endo
@ 2013-03-25  9:36                         ` Richard Biener
  2013-03-25 17:29                           ` Wei Mi
  1 sibling, 1 reply; 29+ messages in thread
From: Richard Biener @ 2013-03-25  9:36 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote:
> This is the patch to add the shift truncation in
> simplify_binary_operation_1. I add a new hook
> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
> whether we can do shift truncation. I didn't use
> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
> uses the macro SHIFT_COUNT_TRUNCATED. If I change
> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
> TARGET_SHIFT_TRUNCATION_MASK, I need to give
> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
> trivial to get at many places in existing code.
>
> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.

Doing this might prove dangerous in case some pass may later decide
to use an instruction that behaves in different ways.  Consider

   tem = 1<< (n & 255);   // count truncated
   x = y & tem;  // bittest instruction bit nr _not_ truncated

so if tem is expanded to use a shift instruction which truncates the shift
count the explicit and is dropped.  If later combine comes around and
combines the bit-test to use the bittest instruction which does not
implicitely truncate the cound you have generated wrong-code.

So we need to make sure any explicit truncation originally in place
is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
not exist at all, but instead there would be two patterns for shifts
with implicit truncation - one involving the truncation (canonicalized to
bitwise and) and one not involving the truncation.

Richard.

> Thanks,
> Wei.
>
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
>> Hi,
>>
>> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>>> For the motivational case, I need insn splitting to get the cost
>>>> right. insn splitting is not very intrusive. All I need is to call
>>>> split_insns func.
>>>
>>> It may not look very intrusive, but there's a lot happening in the
>>> back ground. You're creating a lot of new RTL, and then just throw it
>>> away again. You fake the compiler into thinking you're much deeper in
>>> the pipeline than you really are. You're assuming there are no
>>> side-effects other than that some insn gets split, but there are back
>>> ends where splitters may have side-effects.
>>
>> Ok, then I will remove the split_insns call.
>>
>>>
>>> Even though I've asked twice now, you still have not explained this
>>> motivational case, except to say that there is one. *What* are you
>>> trying to do, *what* is not happening without the splits, and *what*
>>> happens if you split. Only if you explain that in a lot more detail
>>> than "I have a motivational case" then we can look into what is a
>>> proper solution.
>>
>> :-). Sorry, I didn't say it clearly. The motivational case is the one
>> mentioned in the following posts (split_insns changes a << (b & 63) to
>> a << b).
>> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
>> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>>
>> If I remove the split_insns call and related cost estimation
>> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
>> here looks like a reverse process of cse, the total cost after fwprop
>> change is increased.
>>
>> Def insn 18:
>>         Use insn 23
>>         Use insn 22
>>
>> If we include the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
>> not change after fwprop + insn splitting.
>>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
>> Total benefit is 5, fwprop will go on.
>>
>> If we remove the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
>> insn 23 will increase after fwprop.
>>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
>> with insn 22
>> Total benefit is -3, fwprop will punt.
>>
>> How about adding the (a << (b&63) ==> a << b) transformation in
>> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
>> kind of architecture specific expr simplification? Then fwprop could
>> do the propagation as I expect.
>>
>>>
>>> The problem with some of the splitters is that they exist to break up
>>> RTL from 'expand' to initially keep some pattern together to allow the
>>> code transformation passes to handle the pattern as one instruction.
>>> This made sense when RTL was the only intermediate representation and
>>> splitting too early would inhibit some optimizations. But I would
>>> expect most (if not all) such cases to be less relevant because of the
>>> GIMPLE middle-end work. The only splitters you can trigger are the
>>> pre-reload splitters (all the reload_completed conditions obviously
>>> can't trigger if you're splitting from fwprop). Perhaps those
>>> splitters can/should run earlier, or be made obsolete by expanding
>>> directly to the post-splitting insns.
>>>
>>> Unfortunately, it's not possible to tell for your case, because you
>>> haven't explained it yet...
>>>
>>>
>>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>>
>>> I'd strongly oppose this. I do not believe this is necessary, and I
>>> think it's conceptually wrong.
>>>
>>>
>>>>> What happens if you propagate into an insn that uses the same register
>>>>> twice? Will the DU chains still be valid (I don't think that's
>>>>> guaranteed)?
>>>>
>>>> I think the DU chains still be valid. If propagate into the insn that
>>>> uses the same register twice, the two uses will be replaced when the
>>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>>> of the same reg in the use insn).  When the second use is seen, the
>>>> df_use and use insn in its insn_info are still available.
>>>> forward_propagate_into will early return after check reg_mentioned_p
>>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>>
>>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>>> still valid.
>>
>> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
>> recycled by garbage collection very soon (I don't know when GC will
>> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
>> forward_propagate_into ensure it will not cause any correctness
>> problem.
>>
>>>
>>> In any case, returning to the RD problem for DU/UD chains is probably
>>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>>> would return to what it looked like before the patch of r149010.
>>
>> I remove MD problem and use DU/UD instead.
>>
>>>
>>> As a way forward on all of this, I'd suggest the following steps, each
>>> with a separate patch:
>>
>> Thanks for the suggestion!
>>
>>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>>
>> I include patch.1 attached.
>>
>>> 2. post all the recog changes separately, with minimum impact on the
>>> parts of the compiler you don't really change. (For apply_change_group
>>> you could even choose to overload it, or use a NUM argument with a
>>> default value -- not sure if default argument values are OK for GCC
>>> tho'.)
>>
>> patch.2 attached.
>>
>>> 3. implement propagation into multiple USEs, but without the splitting
>>> and peepholing.
>>
>> patch.3 attached.
>>
>>> 4. see about fixing the back end to either split earlier or expand to
>>> the desired patterns directly.
>>
>> I havn't included this part. If you agree with the proposal to add the
>> transformation (a << (b&63) ==> a << b) in
>> simplify_binary_operation_1, I will send out another patch about it.
>>
>> Thanks,
>> Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-25  9:36                         ` Richard Biener
@ 2013-03-25 17:29                           ` Wei Mi
  2013-03-25 17:33                             ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-03-25 17:29 UTC (permalink / raw)
  To: Richard Biener; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

On Mon, Mar 25, 2013 at 2:35 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Sun, Mar 24, 2013 at 5:18 AM, Wei Mi <wmi@google.com> wrote:
>> This is the patch to add the shift truncation in
>> simplify_binary_operation_1. I add a new hook
>> TARGET_SHIFT_COUNT_TRUNCATED which uses enum rtx_code to decide
>> whether we can do shift truncation. I didn't use
>> TARGET_SHIFT_TRUNCATION_MASK in simplify_binary_operation_1 because it
>> uses the macro SHIFT_COUNT_TRUNCATED. If I change
>> SHIFT_COUNT_TRUNCATED to targetm.shift_count_truncated in
>> TARGET_SHIFT_TRUNCATION_MASK, I need to give
>> TARGET_SHIFT_TRUNCATION_MASK a enum rtx_code param, which wasn't
>> trivial to get at many places in existing code.
>>
>> patch.1 ~ patch.4 pass regression and bootstrap on x86_64-unknown-linux-gnu.
>
> Doing this might prove dangerous in case some pass may later decide
> to use an instruction that behaves in different ways.  Consider
>
>    tem = 1<< (n & 255);   // count truncated
>    x = y & tem;  // bittest instruction bit nr _not_ truncated
>
> so if tem is expanded to use a shift instruction which truncates the shift
> count the explicit and is dropped.  If later combine comes around and
> combines the bit-test to use the bittest instruction which does not
> implicitely truncate the cound you have generated wrong-code.
>

So it means the existing truncation pattern defined in insn split is
also incorrect because the truncated shift may be combined into a bit
test pattern?

// The following define_insn_and_split will do shift truncation.
(define_insn_and_split "*<shift_insn><mode>3_mask"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
        (any_shiftrt:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0")
          (subreg:QI
            (and:SI
              (match_operand:SI 2 "nonimmediate_operand" "c")
              (match_operand:SI 3 "const_int_operand" "n")) 0)))
   (clobber (reg:CC FLAGS_REG))]
  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)
   && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
                   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
              (clobber (reg:CC FLAGS_REG))])]
{
  if (can_create_pseudo_p ())
    operands [2] = force_reg (SImode, operands[2]);

  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
}
  [(set_attr "type" "ishift")
   (set_attr "mode" "<MODE>")])

> So we need to make sure any explicit truncation originally in place
> is kept in the RTL - which means SHIFT_COUNT_TRUNCATED should
> not exist at all, but instead there would be two patterns for shifts
> with implicit truncation - one involving the truncation (canonicalized to
> bitwise and) and one not involving the truncation.
>
> Richard.
>

I am trying to figure out a way not to lose the opportunity when shift
truncation is not combined in a bit test pattern. Can we keep the
explicit truncation in RTL, but generate truncation code in assembly?
Then only shift truncation which not combined in a bit test
pattershift truncationn will happen.

(define_insn "*<shift_insn_and><mode>"
  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
        (any_shiftrt:SWI48
          (match_operand:SWI48 1 "nonimmediate_operand" "0")
          (subreg:QI
            (and:SI
              (match_operand:SI 2 "nonimmediate_operand" "c")
              (match_operand:SI 3 "const_int_operand" "n")) 0)))
   (clobber (reg:CC FLAGS_REG))]
  "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
{
   if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1)
      return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
   else
      "shift\t{%2, %0|%0, %2}";
}

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-25 17:29                           ` Wei Mi
@ 2013-03-25 17:33                             ` Wei Mi
  2013-03-26  9:14                               ` Richard Biener
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-03-25 17:33 UTC (permalink / raw)
  To: Richard Biener; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

> I am trying to figure out a way not to lose the opportunity when shift
> truncation is not combined in a bit test pattern. Can we keep the
> explicit truncation in RTL, but generate truncation code in assembly?
> Then only shift truncation which not combined in a bit test
> pattershift truncationn will happen.
>
> (define_insn "*<shift_insn_and><mode>"
>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>         (any_shiftrt:SWI48
>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>           (subreg:QI
>             (and:SI
>               (match_operand:SI 2 "nonimmediate_operand" "c")
>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>    (clobber (reg:CC FLAGS_REG))]
>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
> {
>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>    else
>       "shift\t{%2, %0|%0, %2}";
> }

Sorry, rectify a mistake:

{
   if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
      == GET_MODE_BITSIZE (<MODE>mode)-1)
      return "shift\t{%2, %0|%0, %2}";
   else
      return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
}

Thanks,
Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-25 17:33                             ` Wei Mi
@ 2013-03-26  9:14                               ` Richard Biener
  2013-03-26 18:23                                 ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread
From: Richard Biener @ 2013-03-26  9:14 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li, Uros Bizjak

On Mon, Mar 25, 2013 at 6:33 PM, Wei Mi <wmi@google.com> wrote:
>> I am trying to figure out a way not to lose the opportunity when shift
>> truncation is not combined in a bit test pattern. Can we keep the
>> explicit truncation in RTL, but generate truncation code in assembly?
>> Then only shift truncation which not combined in a bit test
>> pattershift truncationn will happen.
>>
>> (define_insn "*<shift_insn_and><mode>"
>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>         (any_shiftrt:SWI48
>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>           (subreg:QI
>>             (and:SI
>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>    (clobber (reg:CC FLAGS_REG))]
>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>> {
>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>    else
>>       "shift\t{%2, %0|%0, %2}";
>> }
>
> Sorry, rectify a mistake:
>
> {
>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>       return "shift\t{%2, %0|%0, %2}";
>    else
>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
> }

I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
differences.  So there is no inconsistency.  The i386 backend seems to
try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
(well, it's false, so it technically doesn't exist for i386) and recognizes
the shift with truncate with the *<shift_insn><mode>3_mask splitter.
But I'm not sure why it bothers to do it with a splitter instead of just
with a define_insn?  Because the split code,

  [(parallel [(set (match_dup 0)
                   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
              (clobber (reg:CC FLAGS_REG))])]

is wrong and could be combined into a bit-test instruction.  No?

That is, why not have define_insn variants for shift instructions with
explicit truncation?

Richard.


> Thanks,
> Wei.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-26  9:14                               ` Richard Biener
@ 2013-03-26 18:23                                 ` Uros Bizjak
  2013-03-28  4:34                                   ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Uros Bizjak @ 2013-03-26 18:23 UTC (permalink / raw)
  To: Richard Biener
  Cc: Wei Mi, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
>>> I am trying to figure out a way not to lose the opportunity when shift
>>> truncation is not combined in a bit test pattern. Can we keep the
>>> explicit truncation in RTL, but generate truncation code in assembly?
>>> Then only shift truncation which not combined in a bit test
>>> pattershift truncationn will happen.
>>>
>>> (define_insn "*<shift_insn_and><mode>"
>>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>>         (any_shiftrt:SWI48
>>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>>           (subreg:QI
>>>             (and:SI
>>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>>    (clobber (reg:CC FLAGS_REG))]
>>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>>> {
>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>>    else
>>>       "shift\t{%2, %0|%0, %2}";
>>> }
>>
>> Sorry, rectify a mistake:
>>
>> {
>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>       return "shift\t{%2, %0|%0, %2}";
>>    else
>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>> }
>
> I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
> is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
> differences.  So there is no inconsistency.  The i386 backend seems to
> try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
> (well, it's false, so it technically doesn't exist for i386) and recognizes
> the shift with truncate with the *<shift_insn><mode>3_mask splitter.
> But I'm not sure why it bothers to do it with a splitter instead of just
> with a define_insn?  Because the split code,
>
>   [(parallel [(set (match_dup 0)
>                    (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
>               (clobber (reg:CC FLAGS_REG))])]
>
> is wrong and could be combined into a bit-test instruction.  No?
>
> That is, why not have define_insn variants for shift instructions with
> explicit truncation?

You are right, the split is harmful in this case.

It looks to me, that explicit truncation can be added to split
patterns in the most elegant way using proposed "define_subst"
infrastructure.

Uros.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-26 18:23                                 ` Uros Bizjak
@ 2013-03-28  4:34                                   ` Wei Mi
  2013-03-28 15:49                                     ` Uros Bizjak
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Mi @ 2013-03-28  4:34 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Richard Biener, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

[-- Attachment #1: Type: text/plain, Size: 3026 bytes --]

I am not familiar how to use define_subst, so I write a patch that
changes define_insn_and_split to define_insn. bootstrapped and
regression tested on x86_64-unknown-linux-gnu.

A question is: after that change, Is there anyway I can make
targetm.rtx_costs() aware about the truncation, .i.e the cost is only
a "shift" instead of "shift + and".

Thanks,
Wei.

On Tue, Mar 26, 2013 at 11:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 10:14 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>>>> I am trying to figure out a way not to lose the opportunity when shift
>>>> truncation is not combined in a bit test pattern. Can we keep the
>>>> explicit truncation in RTL, but generate truncation code in assembly?
>>>> Then only shift truncation which not combined in a bit test
>>>> pattershift truncationn will happen.
>>>>
>>>> (define_insn "*<shift_insn_and><mode>"
>>>>   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
>>>>         (any_shiftrt:SWI48
>>>>           (match_operand:SWI48 1 "nonimmediate_operand" "0")
>>>>           (subreg:QI
>>>>             (and:SI
>>>>               (match_operand:SI 2 "nonimmediate_operand" "c")
>>>>               (match_operand:SI 3 "const_int_operand" "n")) 0)))
>>>>    (clobber (reg:CC FLAGS_REG))]
>>>>   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
>>>> {
>>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>>>    else
>>>>       "shift\t{%2, %0|%0, %2}";
>>>> }
>>>
>>> Sorry, rectify a mistake:
>>>
>>> {
>>>    if ((INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>>>       == GET_MODE_BITSIZE (<MODE>mode)-1)
>>>       return "shift\t{%2, %0|%0, %2}";
>>>    else
>>>       return "and\t{%3, %2|%2, %3}\n\r shift\t{%b2, %0|%0, %b2}";
>>> }
>>
>> I'm not sure the existing patterns are wrong because SHIFT_COUNT_TRUNCATED
>> is false for x86 AFAIK, exactly because of the bit-test vs. shift instruction
>> differences.  So there is no inconsistency.  The i386 backend seems to
>> try to follow my suggestion as if SHIFT_COUNT_TRUNCATED didn't exist
>> (well, it's false, so it technically doesn't exist for i386) and recognizes
>> the shift with truncate with the *<shift_insn><mode>3_mask splitter.
>> But I'm not sure why it bothers to do it with a splitter instead of just
>> with a define_insn?  Because the split code,
>>
>>   [(parallel [(set (match_dup 0)
>>                    (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
>>               (clobber (reg:CC FLAGS_REG))])]
>>
>> is wrong and could be combined into a bit-test instruction.  No?
>>
>> That is, why not have define_insn variants for shift instructions with
>> explicit truncation?
>
> You are right, the split is harmful in this case.
>
> It looks to me, that explicit truncation can be added to split
> patterns in the most elegant way using proposed "define_subst"
> infrastructure.
>
> Uros.

[-- Attachment #2: changelog --]
[-- Type: application/octet-stream, Size: 133 bytes --]

2013-03-27  Wei Mi  <wmi@google.com>

	* config/i386/i386.md: Do shift truncation in define_insn
	instead of define_insn_and_split.


[-- Attachment #3: patch --]
[-- Type: application/octet-stream, Size: 3056 bytes --]

Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 196270)
+++ config/i386/i386.md	(working copy)
@@ -9136,7 +9136,7 @@
 })
 
 ;; Avoid useless masking of count operand.
-(define_insn_and_split "*ashl<mode>3_mask"
+(define_insn "*ashl<mode>3_mask"
   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
 	(ashift:SWI48
 	  (match_operand:SWI48 1 "nonimmediate_operand" "0")
@@ -9148,16 +9148,8 @@
   "ix86_binary_operator_ok (ASHIFT, <MODE>mode, operands)
    && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
       == GET_MODE_BITSIZE (<MODE>mode)-1"
-  "#"
-  "&& 1"
-  [(parallel [(set (match_dup 0)
-		   (ashift:SWI48 (match_dup 1) (match_dup 2)))
-	      (clobber (reg:CC FLAGS_REG))])]
 {
-  if (can_create_pseudo_p ())
-    operands [2] = force_reg (SImode, operands[2]);
-
-  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
+  return "sal{<imodesuffix>}\t{%b2, %0|%0, %b2}";
 }
   [(set_attr "type" "ishift")
    (set_attr "mode" "<MODE>")])
@@ -9646,7 +9638,7 @@
   "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;")
 
 ;; Avoid useless masking of count operand.
-(define_insn_and_split "*<shift_insn><mode>3_mask"
+(define_insn "*<shift_insn><mode>3_mask"
   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
 	(any_shiftrt:SWI48
 	  (match_operand:SWI48 1 "nonimmediate_operand" "0")
@@ -9658,16 +9650,8 @@
   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)
    && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
       == GET_MODE_BITSIZE (<MODE>mode)-1"
-  "#"
-  "&& 1"
-  [(parallel [(set (match_dup 0)
-		   (any_shiftrt:SWI48 (match_dup 1) (match_dup 2)))
-	      (clobber (reg:CC FLAGS_REG))])]
 {
-  if (can_create_pseudo_p ())
-    operands [2] = force_reg (SImode, operands[2]);
-
-  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
+  return "<shift>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
 }
   [(set_attr "type" "ishift")
    (set_attr "mode" "<MODE>")])
@@ -10109,7 +10093,7 @@
   "ix86_expand_binary_operator (<CODE>, <MODE>mode, operands); DONE;")
 
 ;; Avoid useless masking of count operand.
-(define_insn_and_split "*<rotate_insn><mode>3_mask"
+(define_insn "*<rotate_insn><mode>3_mask"
   [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm")
 	(any_rotate:SWI48
 	  (match_operand:SWI48 1 "nonimmediate_operand" "0")
@@ -10121,16 +10105,8 @@
   "ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)
    && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
       == GET_MODE_BITSIZE (<MODE>mode)-1"
-  "#"
-  "&& 1"
-  [(parallel [(set (match_dup 0)
-		   (any_rotate:SWI48 (match_dup 1) (match_dup 2)))
-	      (clobber (reg:CC FLAGS_REG))])]
 {
-  if (can_create_pseudo_p ())
-    operands [2] = force_reg (SImode, operands[2]);
-
-  operands[2] = simplify_gen_subreg (QImode, operands[2], SImode, 0);
+  return "<rotate>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
 }
   [(set_attr "type" "rotate")
    (set_attr "mode" "<MODE>")])

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-28  4:34                                   ` Wei Mi
@ 2013-03-28 15:49                                     ` Uros Bizjak
  2013-04-03 20:54                                       ` Jakub Jelinek
  0 siblings, 1 reply; 29+ messages in thread
From: Uros Bizjak @ 2013-03-28 15:49 UTC (permalink / raw)
  To: Wei Mi
  Cc: Richard Biener, Steven Bosscher, GCC Patches, David Li, Kirill Yukhin

On Thu, Mar 28, 2013 at 5:34 AM, Wei Mi <wmi@google.com> wrote:
> I am not familiar how to use define_subst, so I write a patch that
> changes define_insn_and_split to define_insn. bootstrapped and
> regression tested on x86_64-unknown-linux-gnu.
>
> A question is: after that change, Is there anyway I can make
> targetm.rtx_costs() aware about the truncation, .i.e the cost is only
> a "shift" instead of "shift + and".

Please also change all operand 2 predicates to "register_operand".

2013-03-27  Wei Mi  <wmi@google.com>

	* config/i386/i386.md: Do shift truncation in define_insn
	instead of define_insn_and_split.

Please write ChangeLog as:

	* config/i386/i386.md (*ashl<mode>3_mask): Rewrite as define_insn.
	Truncate operand 2 using %b asm operand modifier.
	(*<shift_insn><mode>3_mask): Ditto.
	(*<rotate_insn><mode>3_mask): Ditto.

OK for mainline and all release branches with these changes.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-17  7:15                     ` Wei Mi
  2013-03-17  7:23                       ` Andrew Pinski
  2013-03-24  4:18                       ` Wei Mi
@ 2013-04-02  7:11                       ` Wei Mi
  2013-04-02  7:37                         ` Wei Mi
  2013-04-02  7:53                         ` Uros Bizjak
  2 siblings, 2 replies; 29+ messages in thread
From: Wei Mi @ 2013-04-02  7:11 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 6574 bytes --]

I attached the patch.4 based on r197308. r197308 changes shift-and
type truncation from define_insn_and_split to define_insn.  patch.4
changes ix86_rtx_costs for shift-and type rtx to get the correct cost
for the result after the shift-and truncation.

With the patch.1 ~ patch.4, fwprop extension could handle the
motivational case 1.c attached by removing all the redundent "x & 63"
operations.

patch.1~patch.4 regression and bootstrap ok on
x86_64-unknown-linux-gnu. Is it ok for trunk?

Thanks,
Wei.

On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
> Hi,
>
> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>> For the motivational case, I need insn splitting to get the cost
>>> right. insn splitting is not very intrusive. All I need is to call
>>> split_insns func.
>>
>> It may not look very intrusive, but there's a lot happening in the
>> back ground. You're creating a lot of new RTL, and then just throw it
>> away again. You fake the compiler into thinking you're much deeper in
>> the pipeline than you really are. You're assuming there are no
>> side-effects other than that some insn gets split, but there are back
>> ends where splitters may have side-effects.
>
> Ok, then I will remove the split_insns call.
>
>>
>> Even though I've asked twice now, you still have not explained this
>> motivational case, except to say that there is one. *What* are you
>> trying to do, *what* is not happening without the splits, and *what*
>> happens if you split. Only if you explain that in a lot more detail
>> than "I have a motivational case" then we can look into what is a
>> proper solution.
>
> :-). Sorry, I didn't say it clearly. The motivational case is the one
> mentioned in the following posts (split_insns changes a << (b & 63) to
> a << b).
> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>
> If I remove the split_insns call and related cost estimation
> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
> here looks like a reverse process of cse, the total cost after fwprop
> change is increased.
>
> Def insn 18:
>         Use insn 23
>         Use insn 22
>
> If we include the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
> not change after fwprop + insn splitting.
>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
> Total benefit is 5, fwprop will go on.
>
> If we remove the split_insns cost estimation adjustment.
>   extra benefit by removing def insn 18 = 5
>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
> insn 23 will increase after fwprop.
>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
> with insn 22
> Total benefit is -3, fwprop will punt.
>
> How about adding the (a << (b&63) ==> a << b) transformation in
> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
> kind of architecture specific expr simplification? Then fwprop could
> do the propagation as I expect.
>
>>
>> The problem with some of the splitters is that they exist to break up
>> RTL from 'expand' to initially keep some pattern together to allow the
>> code transformation passes to handle the pattern as one instruction.
>> This made sense when RTL was the only intermediate representation and
>> splitting too early would inhibit some optimizations. But I would
>> expect most (if not all) such cases to be less relevant because of the
>> GIMPLE middle-end work. The only splitters you can trigger are the
>> pre-reload splitters (all the reload_completed conditions obviously
>> can't trigger if you're splitting from fwprop). Perhaps those
>> splitters can/should run earlier, or be made obsolete by expanding
>> directly to the post-splitting insns.
>>
>> Unfortunately, it's not possible to tell for your case, because you
>> haven't explained it yet...
>>
>>
>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>
>> I'd strongly oppose this. I do not believe this is necessary, and I
>> think it's conceptually wrong.
>>
>>
>>>> What happens if you propagate into an insn that uses the same register
>>>> twice? Will the DU chains still be valid (I don't think that's
>>>> guaranteed)?
>>>
>>> I think the DU chains still be valid. If propagate into the insn that
>>> uses the same register twice, the two uses will be replaced when the
>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>> of the same reg in the use insn).  When the second use is seen, the
>>> df_use and use insn in its insn_info are still available.
>>> forward_propagate_into will early return after check reg_mentioned_p
>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>
>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>> still valid.
>
> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
> recycled by garbage collection very soon (I don't know when GC will
> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
> forward_propagate_into ensure it will not cause any correctness
> problem.
>
>>
>> In any case, returning to the RD problem for DU/UD chains is probably
>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>> would return to what it looked like before the patch of r149010.
>
> I remove MD problem and use DU/UD instead.
>
>>
>> As a way forward on all of this, I'd suggest the following steps, each
>> with a separate patch:
>
> Thanks for the suggestion!
>
>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>
> I include patch.1 attached.
>
>> 2. post all the recog changes separately, with minimum impact on the
>> parts of the compiler you don't really change. (For apply_change_group
>> you could even choose to overload it, or use a NUM argument with a
>> default value -- not sure if default argument values are OK for GCC
>> tho'.)
>
> patch.2 attached.
>
>> 3. implement propagation into multiple USEs, but without the splitting
>> and peepholing.
>
> patch.3 attached.
>
>> 4. see about fixing the back end to either split earlier or expand to
>> the desired patterns directly.
>
> I havn't included this part. If you agree with the proposal to add the
> transformation (a << (b&63) ==> a << b) in
> simplify_binary_operation_1, I will send out another patch about it.
>
> Thanks,
> Wei.

[-- Attachment #2: ChangeLog.1 --]
[-- Type: application/octet-stream, Size: 534 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* fwprop.c (get_def_for_use): Change use_def_ref to
	inquiring UD chain.
	(fwprop_df_init): Rename build_single_def_use_links to
	fwprop_df_init. 
	(process_defs): Deleted.
	(process_uses): Likewise.
	(single_def_use_enter_block): Likewise.
	(single_def_use_leave_block): Likewise.
	(all_uses_available_at): Likewise.
	(register_active_defs): Likewise.
	(update_df_init): Likewise.
	(update_uses): Likewise.
	(update_df): Likewise.
	(fwprop_init): Remove active_defs.
	(fwprop_done): Likewise.


[-- Attachment #3: ChangeLog.2 --]
[-- Type: application/octet-stream, Size: 662 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* recog.c (validate_change_1): Add fields for change_t.
	(confirm_change_group): Add a default param.
	(set_change_verified): Add a change_t interface.
	(set_change_benefit): Likewise.
	(set_change_equal_note): Likewise.
	(set_change_associated_with_last): Likewise.
	(update_df): New. Update def/use references after insn changes.
	(confirm_change_one_by_one): New. Confirm each change separately.
	(confirm_change_group_by_cost): New. Confirm changes based on a
	simple cost model.
	(apply_change_group): Add a param.
	(cancel_changes): Add REG_EQUAL note according to equal_note field.
	* recog.h: Add some prototypes.


[-- Attachment #4: ChangeLog.3 --]
[-- Type: application/octet-stream, Size: 1008 bytes --]

2013-03-16  Wei Mi  <wmi@google.com>

	* fwprop.c (propagate_rtx_1): Remove PR_HANDLE_MEM.
	(varying_mem_p): Add call a kind of varying mem.
	(propagate_rtx): Remove PR_CAN_APPEAR and PR_HANDLE_MEM.
	(try_fwprop_subst): Extract the confirmation part to a separate
	func.
	(forward_propagate_subreg): Change the args of try_fwprop_subst.
	(mems_modified_p): New. Check whether dest is a mem.
	(mem_may_be_modified): New. Check if mem modified in an insn range.
	(rtx_search_arg): New struct.
	(reg_occur_p): New. Check if reg has ever occured in an expr.
	(reg_mentioned_num): New. How many times a reg appear.
	(forward_propagate_asm):  Make asm propagations being applied
	separately.
	(def_return_reg): New. Whether the set define a return reg.
	(forward_propagate_and_simplify): Add more check before propagation.
	(fwprop_done): Delete outdated trace.
	(iterate_def_uses): New. Iterate all the uses connecting to a def.
	(fwprop): Iterate all the defs instead of all the uses.
	(fwprop_addr): Likewise.


[-- Attachment #5: ChangeLog.4 --]
[-- Type: application/octet-stream, Size: 192 bytes --]

2013-04-01  Wei Mi  <wmi@google.com>

	* config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for
	ashl<mode>3_mask, *<shift_insn><mode>3_mask and
	*<rotate_insn><mode>3_mask in i386.md. 

[-- Attachment #6: patch.1 --]
[-- Type: application/octet-stream, Size: 9883 bytes --]

--- v0/fwprop.c	2013-03-16 21:46:21.437939338 -0700
+++ v1/fwprop.c	2013-03-17 00:04:35.450324217 -0700
@@ -115,195 +115,33 @@ along with GCC; see the file COPYING3.
 
 static int num_changes;
 
-static vec<df_ref> use_def_ref;
-static vec<df_ref> reg_defs;
-static vec<df_ref> reg_defs_stack;
-
-/* The MD bitmaps are trimmed to include only live registers to cut
-   memory usage on testcases like insn-recog.c.  Track live registers
-   in the basic block and do not perform forward propagation if the
-   destination is a dead pseudo occurring in a note.  */
-static bitmap local_md;
-static bitmap local_lr;
-
 /* Return the only def in USE's use-def chain, or NULL if there is
    more than one def in the chain.  */
 
 static inline df_ref
 get_def_for_use (df_ref use)
 {
-  return use_def_ref[DF_REF_ID (use)];
-}
-
-
-/* Update the reg_defs vector with non-partial definitions in DEF_REC.
-   TOP_FLAG says which artificials uses should be used, when DEF_REC
-   is an artificial def vector.  LOCAL_MD is modified as after a
-   df_md_simulate_* function; we do more or less the same processing
-   done there, so we do not use those functions.  */
-
-#define DF_MD_GEN_FLAGS \
-	(DF_REF_PARTIAL | DF_REF_CONDITIONAL | DF_REF_MAY_CLOBBER)
-
-static void
-process_defs (df_ref *def_rec, int top_flag)
-{
-  df_ref def;
-  while ((def = *def_rec++) != NULL)
-    {
-      df_ref curr_def = reg_defs[DF_REF_REGNO (def)];
-      unsigned int dregno;
+  if (!DF_REF_CHAIN (use))
+    return NULL;
 
-      if ((DF_REF_FLAGS (def) & DF_REF_AT_TOP) != top_flag)
-	continue;
+  /* More than one reaching def.  */
+  if (DF_REF_CHAIN (use)->next)
+    return NULL;
 
-      dregno = DF_REF_REGNO (def);
-      if (curr_def)
-	reg_defs_stack.safe_push (curr_def);
-      else
-	{
-	  /* Do not store anything if "transitioning" from NULL to NULL.  But
-             otherwise, push a special entry on the stack to tell the
-	     leave_block callback that the entry in reg_defs was NULL.  */
-	  if (DF_REF_FLAGS (def) & DF_MD_GEN_FLAGS)
-	    ;
-	  else
-	    reg_defs_stack.safe_push (def);
-	}
-
-      if (DF_REF_FLAGS (def) & DF_MD_GEN_FLAGS)
-	{
-	  bitmap_set_bit (local_md, dregno);
-	  reg_defs[dregno] = NULL;
-	}
-      else
-	{
-	  bitmap_clear_bit (local_md, dregno);
-	  reg_defs[dregno] = def;
-	}
-    }
-}
-
-
-/* Fill the use_def_ref vector with values for the uses in USE_REC,
-   taking reaching definitions info from LOCAL_MD and REG_DEFS.
-   TOP_FLAG says which artificials uses should be used, when USE_REC
-   is an artificial use vector.  */
-
-static void
-process_uses (df_ref *use_rec, int top_flag)
-{
-  df_ref use;
-  while ((use = *use_rec++) != NULL)
-    if ((DF_REF_FLAGS (use) & DF_REF_AT_TOP) == top_flag)
-      {
-        unsigned int uregno = DF_REF_REGNO (use);
-        if (reg_defs[uregno]
-	    && !bitmap_bit_p (local_md, uregno)
-	    && bitmap_bit_p (local_lr, uregno))
-	  use_def_ref[DF_REF_ID (use)] = reg_defs[uregno];
-      }
-}
-
-
-static void
-single_def_use_enter_block (struct dom_walk_data *walk_data ATTRIBUTE_UNUSED,
-			    basic_block bb)
-{
-  int bb_index = bb->index;
-  struct df_md_bb_info *md_bb_info = df_md_get_bb_info (bb_index);
-  struct df_lr_bb_info *lr_bb_info = df_lr_get_bb_info (bb_index);
-  rtx insn;
-
-  bitmap_copy (local_md, &md_bb_info->in);
-  bitmap_copy (local_lr, &lr_bb_info->in);
-
-  /* Push a marker for the leave_block callback.  */
-  reg_defs_stack.safe_push (NULL);
-
-  process_uses (df_get_artificial_uses (bb_index), DF_REF_AT_TOP);
-  process_defs (df_get_artificial_defs (bb_index), DF_REF_AT_TOP);
-
-  /* We don't call df_simulate_initialize_forwards, as it may overestimate
-     the live registers if there are unused artificial defs.  We prefer
-     liveness to be underestimated.  */
-
-  FOR_BB_INSNS (bb, insn)
-    if (INSN_P (insn))
-      {
-        unsigned int uid = INSN_UID (insn);
-        process_uses (DF_INSN_UID_USES (uid), 0);
-        process_uses (DF_INSN_UID_EQ_USES (uid), 0);
-        process_defs (DF_INSN_UID_DEFS (uid), 0);
-	df_simulate_one_insn_forwards (bb, insn, local_lr);
-      }
-
-  process_uses (df_get_artificial_uses (bb_index), 0);
-  process_defs (df_get_artificial_defs (bb_index), 0);
-}
-
-/* Pop the definitions created in this basic block when leaving its
-   dominated parts.  */
-
-static void
-single_def_use_leave_block (struct dom_walk_data *walk_data ATTRIBUTE_UNUSED,
-			    basic_block bb ATTRIBUTE_UNUSED)
-{
-  df_ref saved_def;
-  while ((saved_def = reg_defs_stack.pop ()) != NULL)
-    {
-      unsigned int dregno = DF_REF_REGNO (saved_def);
-
-      /* See also process_defs.  */
-      if (saved_def == reg_defs[dregno])
-	reg_defs[dregno] = NULL;
-      else
-	reg_defs[dregno] = saved_def;
-    }
+  return DF_REF_CHAIN (use)->ref;
 }
 
-
 /* Build a vector holding the reaching definitions of uses reached by a
    single dominating definition.  */
 
 static void
-build_single_def_use_links (void)
+fwprop_df_init (void)
 {
-  struct dom_walk_data walk_data;
-
-  /* We use the multiple definitions problem to compute our restricted
-     use-def chains.  */
   df_set_flags (DF_EQ_NOTES);
-  df_md_add_problem ();
   df_note_add_problem ();
-  df_analyze ();
+  df_chain_add_problem (DF_UD_CHAIN | DF_DU_CHAIN);
   df_maybe_reorganize_use_refs (DF_REF_ORDER_BY_INSN_WITH_NOTES);
-
-  use_def_ref.create (DF_USES_TABLE_SIZE ());
-  use_def_ref.safe_grow_cleared (DF_USES_TABLE_SIZE ());
-
-  reg_defs.create (max_reg_num ());
-  reg_defs.safe_grow_cleared (max_reg_num ());
-
-  reg_defs_stack.create (n_basic_blocks * 10);
-  local_md = BITMAP_ALLOC (NULL);
-  local_lr = BITMAP_ALLOC (NULL);
-
-  /* Walk the dominator tree looking for single reaching definitions
-     dominating the uses.  This is similar to how SSA form is built.  */
-  walk_data.dom_direction = CDI_DOMINATORS;
-  walk_data.initialize_block_local_data = NULL;
-  walk_data.before_dom_children = single_def_use_enter_block;
-  walk_data.after_dom_children = single_def_use_leave_block;
-
-  init_walk_dominator_tree (&walk_data);
-  walk_dominator_tree (&walk_data, ENTRY_BLOCK_PTR);
-  fini_walk_dominator_tree (&walk_data);
-
-  BITMAP_FREE (local_lr);
-  BITMAP_FREE (local_md);
-  reg_defs.release ();
-  reg_defs_stack.release ();
+  df_analyze ();
 }
 
 \f
@@ -852,96 +690,6 @@ all_uses_available_at (rtx def_insn, rtx
 }
 
 \f
-static df_ref *active_defs;
-#ifdef ENABLE_CHECKING
-static sparseset active_defs_check;
-#endif
-
-/* Fill the ACTIVE_DEFS array with the use->def link for the registers
-   mentioned in USE_REC.  Register the valid entries in ACTIVE_DEFS_CHECK
-   too, for checking purposes.  */
-
-static void
-register_active_defs (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      df_ref def = get_def_for_use (use);
-      int regno = DF_REF_REGNO (use);
-
-#ifdef ENABLE_CHECKING
-      sparseset_set_bit (active_defs_check, regno);
-#endif
-      active_defs[regno] = def;
-    }
-}
-
-
-/* Build the use->def links that we use to update the dataflow info
-   for new uses.  Note that building the links is very cheap and if
-   it were done earlier, they could be used to rule out invalid
-   propagations (in addition to what is done in all_uses_available_at).
-   I'm not doing this yet, though.  */
-
-static void
-update_df_init (rtx def_insn, rtx insn)
-{
-#ifdef ENABLE_CHECKING
-  sparseset_clear (active_defs_check);
-#endif
-  register_active_defs (DF_INSN_USES (def_insn));
-  register_active_defs (DF_INSN_USES (insn));
-  register_active_defs (DF_INSN_EQ_USES (insn));
-}
-
-
-/* Update the USE_DEF_REF array for the given use, using the active definitions
-   in the ACTIVE_DEFS array to match pseudos to their def. */
-
-static inline void
-update_uses (df_ref *use_rec)
-{
-  while (*use_rec)
-    {
-      df_ref use = *use_rec++;
-      int regno = DF_REF_REGNO (use);
-
-      /* Set up the use-def chain.  */
-      if (DF_REF_ID (use) >= (int) use_def_ref.length ())
-        use_def_ref.safe_grow_cleared (DF_REF_ID (use) + 1);
-
-#ifdef ENABLE_CHECKING
-      gcc_assert (sparseset_bit_p (active_defs_check, regno));
-#endif
-      use_def_ref[DF_REF_ID (use)] = active_defs[regno];
-    }
-}
-
-
-/* Update the USE_DEF_REF array for the uses in INSN.  Only update note
-   uses if NOTES_ONLY is true.  */
-
-static void
-update_df (rtx insn, rtx note)
-{
-  struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
-
-  if (note)
-    {
-      df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
-      df_notes_rescan (insn);
-    }
-  else
-    {
-      df_uses_create (&PATTERN (insn), insn, 0);
-      df_insn_rescan (insn);
-      update_uses (DF_INSN_INFO_USES (insn_info));
-    }
-
-  update_uses (DF_INSN_INFO_EQ_USES (insn_info));
-}
-
 
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
@@ -1412,16 +1160,11 @@ fwprop_init (void)
   /* We do not always want to propagate into loops, so we have to find
      loops and be careful about them.  Avoid CFG modifications so that
      we don't have to update dominance information afterwards for
-     build_single_def_use_links.  */
+     fwprop_df_init.  */
   loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
 
-  build_single_def_use_links ();
+  fwprop_df_init ();
   df_set_flags (DF_DEFER_INSN_RESCAN);
-
-  active_defs = XNEWVEC (df_ref, max_reg_num ());
-#ifdef ENABLE_CHECKING
-  active_defs_check = sparseset_alloc (max_reg_num ());
-#endif
 }
 
 static void
@@ -1429,12 +1172,6 @@ fwprop_done (void)
 {
   loop_optimizer_finalize ();
 
-  use_def_ref.release ();
-  free (active_defs);
-#ifdef ENABLE_CHECKING
-  sparseset_free (active_defs_check);
-#endif
-
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());

[-- Attachment #7: patch.2 --]
[-- Type: application/octet-stream, Size: 10687 bytes --]

--- v0/recog.c	2013-03-16 21:46:26.827976689 -0700
+++ v2/recog.c	2013-03-16 23:20:25.840377883 -0700
@@ -181,6 +181,19 @@ typedef struct change_t
   rtx *loc;
   rtx old;
   bool unshare;
+  /* How much benefit to apply the change.  */
+  int benefit;
+  bool verified;
+  /* Record whether we need to create a equal note
+     if the change is canceled.  */
+  bool equal_note;
+  /* Some changes are committed or cancelled in
+     a group. We use associated_with_last flag to
+     make current change to be consistent with the
+     last change in the group. Adding or removing
+     CLOBBER in verify_change will create such kind
+     of change group.  */
+  bool associated_with_last;
 } change_t;
 
 static change_t *changes;
@@ -235,6 +248,10 @@ validate_change_1 (rtx object, rtx *loc,
   changes[num_changes].loc = loc;
   changes[num_changes].old = old;
   changes[num_changes].unshare = unshare;
+  changes[num_changes].benefit = 0;
+  changes[num_changes].verified = false;
+  changes[num_changes].equal_note = false;
+  changes[num_changes].associated_with_last = false;
 
   if (object && !MEM_P (object))
     {
@@ -463,17 +480,18 @@ verify_changes (int num)
   return (i == num_changes);
 }
 
-/* A group of changes has previously been issued with validate_change
-   and verified with verify_changes.  Call df_insn_rescan for each of
-   the insn changed and clear num_changes.  */
+/* A group of changes from num to num_changes - 1 has previously been
+   issued with validate_change and verified with verify_changes.
+   Call df_insn_rescan for each of the insn changed and reset num_changes
+   to num.  */
 
 void
-confirm_change_group (void)
+confirm_change_group (int num)
 {
   int i;
   rtx last_object = NULL;
 
-  for (i = 0; i < num_changes; i++)
+  for (i = num; i < num_changes; i++)
     {
       rtx object = changes[i].object;
 
@@ -492,24 +510,267 @@ confirm_change_group (void)
 
   if (last_object && INSN_P (last_object))
     df_insn_rescan (last_object);
+  num_changes = num;
+}
+
+/* Interfaces to operate change fields.  */
+
+void
+set_change_verified (int idx, bool val)
+{
+  changes[idx].verified = val;
+}
+
+void
+set_change_benefit (int idx, int val)
+{
+  changes[idx].benefit = val;
+}
+
+void
+set_change_equal_note (int idx, bool val)
+{
+  changes[idx].equal_note = val;
+}
+
+void
+set_change_associated_with_last (int idx, bool val)
+{
+  changes[idx].associated_with_last = val;
+}
+
+static void
+update_df (int from, int to, bool is_note)
+{
+  int i;
+  rtx insn;
+
+  if (is_note)
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+          if (changes[i].equal_note)
+	    {
+	      rtx note = find_reg_note (insn, REG_EQUAL, NULL_RTX);
+	      if (note)
+		{
+		  df_uses_create (&XEXP (note, 0), insn, DF_REF_IN_NOTE);
+		  df_notes_rescan (insn);
+		}
+	    }
+	}
+    }
+  else
+    {
+      for (i = from; i <= to; i++)
+	{
+	  insn = changes[i].object;
+	  df_uses_create (&PATTERN (insn), insn, 0);
+	  df_insn_rescan (insn);
+	}
+    }
+}
+
+/* When we cannot committed all the changes group, we evaluate the change
+   one by one. We choose to commit those changes whose benefits are greater
+   than 0. For fwprop_addr, the cost evaluation is caculated using
+   targetm.address_cost() and has been done in propagate_rtx_1, so we set 
+   chk_benefit false to skip benefit checking and simply commit the change
+   for fwprop_addr.  */
+
+bool
+confirm_change_one_by_one (bool chk_benefit)
+{
+  int i, last_i = 0;
+  rtx last_object = NULL;
+  bool last_change_committed = false;
+
+  for (i = num_changes - 1; i >= 0; i--)
+    {
+      rtx object = changes[i].object;
+
+      /* If change is not verified successfully, or benefit <= 0
+	 and current change is not associated with last committed
+	 change, then we will backout the change.  */
+      if (!changes[i].verified
+	  || (chk_benefit
+	      && changes[i].benefit <= 0
+	      && !(last_change_committed
+		   && changes[i].associated_with_last)))
+	{
+	  rtx new_rtx = *changes[i].loc;
+	  *changes[i].loc = changes[i].old;
+	  if (changes[i].object && !MEM_P (changes[i].object))
+	    INSN_CODE (changes[i].object) = changes[i].old_code;
+	  last_change_committed = false;
+
+	  if (changes[i].equal_note)
+	    {
+	      set_unique_reg_note (changes[i].object,
+				   REG_EQUAL, copy_rtx (new_rtx));
+	      update_df (i, i, true);
+	    }
+	  continue;
+	}
+
+      if (changes[i].unshare)
+	*changes[i].loc = copy_rtx (*changes[i].loc);
+
+      /* Avoid unnecessary rescanning when multiple changes to same instruction
+	 are made.  */
+      if (object)
+	{
+	  if (object != last_object && last_object && INSN_P (last_object))
+	    update_df (last_i, last_i, false);
+	  last_object = object;
+	  last_i = i;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "\n   *** change[%d] -- committed ***\n", i);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (object));
+	  print_inline_rtx (dump_file, changes[i].old, 2);
+	  fprintf (dump_file, "\n with ");
+	  print_inline_rtx (dump_file, *changes[i].loc, 2);
+	  fprintf (dump_file, "\n resulting: ");
+	  print_inline_rtx (dump_file, object, 2);
+	}
+
+      last_change_committed = true;
+    }
+
+  if (last_object && INSN_P (last_object))
+    update_df (last_i, last_i, false);
+
   num_changes = 0;
+  if (last_object)
+    return true;
+  else
+    return false;
+}
+
+/* Confirm a group of change based on the cost. may_confirm_whole_group
+   is initialized to true if for fwprop all the uses are replaced and
+   the def insn could be deleted. For fwprop, extra_benefit is the benefit
+   to delete the def insn. chk_benefit is set when fwprop_addr is true.  */
+
+bool
+confirm_change_group_by_cost (bool may_confirm_whole_group,
+			      int extra_benefit,
+			      bool chk_benefit)
+{
+  int i, to;
+  int total_benefit = 0, total_positive_benefit = 0;
+  bool no_positive_benefit = true;
+
+  if (num_changes == 0)
+    {
+      if (dump_file)
+	fprintf (dump_file, "No changes being tried\n");
+      return false;
+    }
+
+  if (!chk_benefit)
+    return confirm_change_one_by_one (false);
+
+  if (dump_file)
+    fprintf (dump_file, "  extra benefit = %d\n", extra_benefit);
+
+  /* Iterate all the changes, adjust the change benefit if the change result
+     could be splitted or peephole optimized. Calculate the total benefits
+     and total positive benefits in the iteration.  */
+  for (i = 0; i < num_changes; i++)
+    {
+      /* If any change fail in the verification, we cannot confirm all
+	 the changes in a group.  */
+      if (!changes[i].verified)
+	{
+	  may_confirm_whole_group = false;
+	  if (dump_file)
+	    fprintf (dump_file, "  change[%d]: benefit = %d, verified - fail\n",
+		    i, changes[i].benefit);
+	  continue;
+	}
+
+      total_benefit += changes[i].benefit;
+      if (changes[i].benefit > 0)
+	{
+	  total_positive_benefit += changes[i].benefit;
+	  no_positive_benefit = false;
+	}
+
+      if (dump_file)
+	fprintf (dump_file, "  change[%d]: benefit = %d, verified - ok\n",
+		i, changes[i].benefit);
+    }
+
+  /* Compare the benefit and choose between applying the whole change
+     group and only applying the changes with positive benefit.  */
+  if (may_confirm_whole_group
+      && (total_benefit + extra_benefit < total_positive_benefit))
+    may_confirm_whole_group = false;
+
+  if (may_confirm_whole_group)
+    {
+      /* Commit all the changes in a group.  */
+      if (dump_file)
+	fprintf (dump_file, "!!! All the changes committed\n");
+
+      if (dump_file)
+	{
+	  for (i = 0; i < num_changes; i++)
+	    {
+	      fprintf (dump_file, "\nIn insn %d, replacing\n ",
+		       INSN_UID (changes[i].object));
+	      print_inline_rtx (dump_file, changes[i].old, 2);
+	      fprintf (dump_file, "\n with ");
+	      print_inline_rtx (dump_file, *changes[i].loc, 2);
+	      fprintf (dump_file, "\n resulting: ");
+	      print_inline_rtx (dump_file, changes[i].object, 2);
+	    }
+	}
+
+      to = num_changes - 1;
+      confirm_change_group ();
+      update_df (0, to, false);
+      return true;
+    }
+  else if (no_positive_benefit)
+    {
+      /* Cancel all the changes.  */
+      to = num_changes - 1;
+      cancel_changes (0);
+      update_df (0, to, true);
+      if (dump_file)
+	fprintf (dump_file, "No changes committed\n");
+      return false;
+    }
+  else
+    /* Cannot commit all the changes. Try to commit those changes
+       with positive benefit.  */
+    return confirm_change_one_by_one (true);
 }
 
 /* Apply a group of changes previously issued with `validate_change'.
    If all changes are valid, call confirm_change_group and return 1,
-   otherwise, call cancel_changes and return 0.  */
+   otherwise, call cancel_changes and return 0. The change group index
+   starts from num to the num_changes - 1.  */
 
 int
-apply_change_group (void)
+apply_change_group (int num)
 {
-  if (verify_changes (0))
+  if (verify_changes (num))
     {
-      confirm_change_group ();
+      confirm_change_group (num);
       return 1;
     }
   else
     {
-      cancel_changes (0);
+      cancel_changes (num);
       return 0;
     }
 }
@@ -534,9 +795,13 @@ cancel_changes (int num)
      they were made.  */
   for (i = num_changes - 1; i >= num; i--)
     {
+      rtx new_rtx = *changes[i].loc;
       *changes[i].loc = changes[i].old;
       if (changes[i].object && !MEM_P (changes[i].object))
 	INSN_CODE (changes[i].object) = changes[i].old_code;
+      if (changes[i].equal_note)
+	set_unique_reg_note (changes[i].object,
+			     REG_EQUAL, copy_rtx (new_rtx));
     }
   num_changes = num;
 }
--- v0/recog.h	2013-03-16 21:46:26.827976689 -0700
+++ v2/recog.h	2013-03-16 23:33:50.106502219 -0700
@@ -80,8 +80,16 @@ extern bool validate_unshare_change (rtx
 extern bool canonicalize_change_group (rtx insn, rtx x);
 extern int insn_invalid_p (rtx, bool);
 extern int verify_changes (int);
-extern void confirm_change_group (void);
-extern int apply_change_group (void);
+extern void confirm_change_group (int num = 0);
+extern int apply_change_group (int num = 0);
+extern void set_change_verified (int idx, bool val);
+extern void set_change_benefit (int idx, int val);
+extern void set_change_equal_note (int idx, bool val);
+extern void set_change_associated_with_last (int idx, bool val);
+extern bool confirm_change_one_by_one (bool chk_benefit);
+extern bool confirm_change_group_by_cost (bool may_confirm_whole_group,
+					  int extra_benefit,
+					  bool chk_benefit);
 extern int num_validated_changes (void);
 extern void cancel_changes (int);
 extern int constrain_operands (int);

[-- Attachment #8: patch.3 --]
[-- Type: application/octet-stream, Size: 28022 bytes --]

--- v1/fwprop.c	2013-03-17 00:04:35.450324217 -0700
+++ v3/fwprop.c	2013-03-17 00:04:45.120396071 -0700
@@ -39,6 +39,7 @@ along with GCC; see the file COPYING3.
 #include "domwalk.h"
 #include "emit-rtl.h"
 
+#include "tree.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -112,6 +113,34 @@ along with GCC; see the file COPYING3.
    I just punt and record only singleton use-def chains, which is
    all that is needed by fwprop.  */
 
+/* In order to make fwprop more effective in rtl optimization, we
+   extend it to handle general expressions instead of only three cases
+   above. The major changes include a) We need to check propagation
+   correctness for src exprs of def which contain mem references.
+   Previous fwprop for the three cases above doesn't have the problem.
+   b) We need a better cost model because the benefit is usually
+   not so apparent as the three cases above.
+
+   For a general fwprop problem, there are two possible sources where
+   benefit comes from. The frist is the new use insn after propagation
+   and simplification may have lower cost than itself before propagation,
+   The second is that if all the uses are replaced with the src of the 
+   def insn, the def insn could be deleted.
+
+   So instead of check each def-use pair independently, we use DU chain to
+   track all the uses for a def. For each def-use pair, we attempt the
+   propagation, record the change candidate in changes[] array, but we
+   wait to confirm the changes until all the pairs with the same def are
+   iterated. The changes confirmation is done in the func
+   confirm_change_group_by_cost. We only do this for fwprop. For fwprop_addr,
+   the benefit of each change is ensured by propagation_rtx_1 using
+   should_replace_address, so we just confirm all the changes without
+   checking benefit again.
+
+   Other changes:
+   We think the maintainance for use_def_ref vector is not necessary, so
+   we remove the related code in update_df/update_uses/update_df_init/
+   register_active_defs.  */
 
 static int num_changes;
 
@@ -250,36 +279,6 @@ should_replace_address (rtx old_rtx, rtx
   return (gain > 0);
 }
 
-
-/* Flags for the last parameter of propagate_rtx_1.  */
-
-enum {
-  /* If PR_CAN_APPEAR is true, propagate_rtx_1 always returns true;
-     if it is false, propagate_rtx_1 returns false if, for at least
-     one occurrence OLD, it failed to collapse the result to a constant.
-     For example, (mult:M (reg:M A) (minus:M (reg:M B) (reg:M A))) may
-     collapse to zero if replacing (reg:M B) with (reg:M A).
-
-     PR_CAN_APPEAR is disregarded inside MEMs: in that case,
-     propagate_rtx_1 just tries to make cheaper and valid memory
-     addresses.  */
-  PR_CAN_APPEAR = 1,
-
-  /* If PR_HANDLE_MEM is not set, propagate_rtx_1 won't attempt any replacement
-     outside memory addresses.  This is needed because propagate_rtx_1 does
-     not do any analysis on memory; thus it is very conservative and in general
-     it will fail if non-read-only MEMs are found in the source expression.
-
-     PR_HANDLE_MEM is set when the source of the propagation was not
-     another MEM.  Then, it is safe not to treat non-read-only MEMs as
-     ``opaque'' objects.  */
-  PR_HANDLE_MEM = 2,
-
-  /* Set when costs should be optimized for speed.  */
-  PR_OPTIMIZE_FOR_SPEED = 4
-};
-
-
 /* Replace all occurrences of OLD in *PX with NEW and try to simplify the
    resulting expression.  Replace *PX with a new RTL expression if an
    occurrence of OLD was found.
@@ -289,31 +288,20 @@ enum {
    that is because there is no simplify_gen_* function for LO_SUM).  */
 
 static bool
-propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, int flags)
+propagate_rtx_1 (rtx *px, rtx old_rtx, rtx new_rtx, bool speed)
 {
   rtx x = *px, tem = NULL_RTX, op0, op1, op2;
   enum rtx_code code = GET_CODE (x);
   enum machine_mode mode = GET_MODE (x);
   enum machine_mode op_mode;
-  bool can_appear = (flags & PR_CAN_APPEAR) != 0;
   bool valid_ops = true;
 
-  if (!(flags & PR_HANDLE_MEM) && MEM_P (x) && !MEM_READONLY_P (x))
-    {
-      /* If unsafe, change MEMs to CLOBBERs or SCRATCHes (to preserve whether
-	 they have side effects or not).  */
-      *px = (side_effects_p (x)
-	     ? gen_rtx_CLOBBER (GET_MODE (x), const0_rtx)
-	     : gen_rtx_SCRATCH (GET_MODE (x)));
-      return false;
-    }
-
   /* If X is OLD_RTX, return NEW_RTX.  But not if replacing only within an
      address, and we are *not* inside one.  */
   if (x == old_rtx)
     {
       *px = new_rtx;
-      return can_appear;
+      return true;
     }
 
   /* If this is an expression, try recursive substitution.  */
@@ -322,7 +310,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_UNARY:
       op0 = XEXP (x, 0);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0))
 	return true;
       tem = simplify_gen_unary (code, mode, op0, op_mode);
@@ -332,8 +320,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
     case RTX_COMM_ARITH:
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_binary (code, mode, op0, op1);
@@ -344,8 +332,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op0 = XEXP (x, 0);
       op1 = XEXP (x, 1);
       op_mode = GET_MODE (op0) != VOIDmode ? GET_MODE (op0) : GET_MODE (op1);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	return true;
       tem = simplify_gen_relational (code, mode, op_mode, op0, op1);
@@ -357,9 +345,9 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       op1 = XEXP (x, 1);
       op2 = XEXP (x, 2);
       op_mode = GET_MODE (op0);
-      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
-      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, flags);
+      valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
+      valid_ops &= propagate_rtx_1 (&op2, old_rtx, new_rtx, speed);
       if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1) && op2 == XEXP (x, 2))
 	return true;
       if (op_mode == VOIDmode)
@@ -372,7 +360,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
       if (code == SUBREG)
 	{
           op0 = XEXP (x, 0);
-	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, flags);
+	  valid_ops &= propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0))
 	    return true;
 	  tem = simplify_gen_subreg (mode, op0, GET_MODE (SUBREG_REG (x)),
@@ -392,7 +380,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
 	  op0 = new_op0 = targetm.delegitimize_address (op0);
 	  valid_ops &= propagate_rtx_1 (&new_op0, old_rtx, new_rtx,
-					flags | PR_CAN_APPEAR);
+					speed);
 
 	  /* Dismiss transformation that we do not want to carry on.  */
 	  if (!valid_ops
@@ -407,7 +395,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (!(REG_P (old_rtx) && REG_P (new_rtx))
 	      && !should_replace_address (op0, new_op0, GET_MODE (x),
 					  MEM_ADDR_SPACE (x),
-	      			 	  flags & PR_OPTIMIZE_FOR_SPEED))
+	      			 	  speed))
 	    return true;
 
 	  tem = replace_equiv_address_nv (x, new_op0);
@@ -421,8 +409,8 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  /* The only simplification we do attempts to remove references to op0
 	     or make it constant -- in both cases, op0's invalidity will not
 	     make the result invalid.  */
-	  propagate_rtx_1 (&op0, old_rtx, new_rtx, flags | PR_CAN_APPEAR);
-	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, flags);
+	  propagate_rtx_1 (&op0, old_rtx, new_rtx, speed);
+	  valid_ops &= propagate_rtx_1 (&op1, old_rtx, new_rtx, speed);
           if (op0 == XEXP (x, 0) && op1 == XEXP (x, 1))
 	    return true;
 
@@ -443,7 +431,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 	  if (rtx_equal_p (x, old_rtx))
 	    {
               *px = new_rtx;
-              return can_appear;
+              return true;
 	    }
 	}
       break;
@@ -458,10 +446,7 @@ propagate_rtx_1 (rtx *px, rtx old_rtx, r
 
   *px = tem;
 
-  /* The replacement we made so far is valid, if all of the recursive
-     replacements were valid, or we could simplify everything to
-     a constant.  */
-  return valid_ops || can_appear || CONSTANT_P (tem);
+  return valid_ops;
 }
 
 
@@ -472,7 +457,7 @@ static int
 varying_mem_p (rtx *body, void *data ATTRIBUTE_UNUSED)
 {
   rtx x = *body;
-  return MEM_P (x) && !MEM_READONLY_P (x);
+  return (MEM_P (x) && !MEM_READONLY_P (x)) || CALL_P (x);
 }
 
 
@@ -490,27 +475,12 @@ propagate_rtx (rtx x, enum machine_mode
 {
   rtx tem;
   bool collapsed;
-  int flags;
 
   if (REG_P (new_rtx) && REGNO (new_rtx) < FIRST_PSEUDO_REGISTER)
     return NULL_RTX;
 
-  flags = 0;
-  if (REG_P (new_rtx)
-      || CONSTANT_P (new_rtx)
-      || (GET_CODE (new_rtx) == SUBREG
-	  && REG_P (SUBREG_REG (new_rtx))
-	  && (GET_MODE_SIZE (mode)
-	      <= GET_MODE_SIZE (GET_MODE (SUBREG_REG (new_rtx))))))
-    flags |= PR_CAN_APPEAR;
-  if (!for_each_rtx (&new_rtx, varying_mem_p, NULL))
-    flags |= PR_HANDLE_MEM;
-
-  if (speed)
-    flags |= PR_OPTIMIZE_FOR_SPEED;
-
   tem = x;
-  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), flags);
+  collapsed = propagate_rtx_1 (&tem, old_rtx, copy_rtx (new_rtx), speed);
   if (tem == x || !collapsed)
     return NULL_RTX;
 
@@ -689,90 +659,75 @@ all_uses_available_at (rtx def_insn, rtx
   return true;
 }
 
-\f
-
 /* Try substituting NEW into LOC, which originated from forward propagation
    of USE's value from DEF_INSN.  SET_REG_EQUAL says whether we are
    substituting the whole SET_SRC, so we can set a REG_EQUAL note if the
-   new insn is not recognized.  Return whether the substitution was
-   performed.  */
+   new insn is not recognized. We record possible change in changes array,
+   and record their verifying result and calculated benefit.  */
 
 static bool
-try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, rtx def_insn, bool set_reg_equal)
+try_fwprop_subst (df_ref use, rtx *loc, rtx new_rtx, bool set_reg_equal)
 {
   rtx insn = DF_REF_INSN (use);
   rtx set = single_set (insn);
-  rtx note = NULL_RTX;
   bool speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn));
-  int old_cost = 0;
-  bool ok;
-
-  update_df_init (def_insn, insn);
+  int old_cost = 0, benefit = 0;
+  int old_changes_num, new_changes_num;
 
   /* forward_propagate_subreg may be operating on an instruction with
-     multiple sets.  If so, assume the cost of the new instruction is
-     not greater than the old one.  */
+     multiple sets.  Assume the old cost is 1 and the new cost is 0,
+     if only verify_changes passes, subreg propagation will always be
+     confirmed.  */
   if (set)
-    old_cost = set_src_cost (SET_SRC (set), speed);
-  if (dump_file)
-    {
-      fprintf (dump_file, "\nIn insn %d, replacing\n ", INSN_UID (insn));
-      print_inline_rtx (dump_file, *loc, 2);
-      fprintf (dump_file, "\n with ");
-      print_inline_rtx (dump_file, new_rtx, 2);
-      fprintf (dump_file, "\n");
-    }
-
-  validate_unshare_change (insn, loc, new_rtx, true);
-  if (!verify_changes (0))
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not recognized\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
-  else if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	   && set
-	   && set_src_cost (SET_SRC (set), speed) > old_cost)
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changes to insn %d not profitable\n",
-		 INSN_UID (insn));
-      ok = false;
-    }
-
+    old_cost = (set_src_cost (SET_SRC (set), speed)
+		+ set_src_cost (SET_DEST (set), speed) + 1);
   else
-    {
-      if (dump_file)
-	fprintf (dump_file, "Changed insn %d\n", INSN_UID (insn));
-      ok = true;
-    }
+    old_cost = 1;
 
-  if (ok)
-    {
-      confirm_change_group ();
-      num_changes++;
-    }
-  else
-    {
-      cancel_changes (0);
-
-      /* Can also record a simplified value in a REG_EQUAL note,
-	 making a new one if one does not already exist.  */
-      if (set_reg_equal)
-	{
-	  if (dump_file)
-	    fprintf (dump_file, " Setting REG_EQUAL note\n");
-
-	  note = set_unique_reg_note (insn, REG_EQUAL, copy_rtx (new_rtx));
-	}
-    }
+  old_changes_num = num_changes_pending ();
+  validate_unshare_change (insn, loc, new_rtx, true);
 
-  if ((ok || note) && !CONSTANT_P (new_rtx))
-    update_df (insn, note);
+  /* verify_changes may calls validate_change and add new changes.
+     The new changes are either adding or removing CLOBBER to match
+     insn pattern. These changes should either be committed or canceled
+     in a group, so we use set_change_associated_with_last to indicate
+     whether current change is committed depends on the last change.  */
+  if (verify_changes (old_changes_num))
+  {
+    int i;
+    int new_cost;
+
+    if (set)
+      new_cost = set_src_cost (SET_SRC (set), speed)
+		 + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      new_cost = 0;
+    /* validate_unshare_change will tentatively change *loc to new_rtx.
+       We compare the cost before and after validate_unshare_change
+       and get the potential benefit of the change.  */
+    benefit = old_cost - new_cost;
+
+    /* For the change group with adding or removing CLOBBER, we attach
+       the real change benefit to the last change, that is because
+       in func confirm_change_group_by_cost, we need to iterate change
+       in reverse order to make sure cancelling change works correctly.
+       We set other change's benefit to 0, so the overall benefit for
+       the change group is the same. Meanwhile set all the changes in
+       the group to being verified successfully.  */
+    new_changes_num = num_changes_pending ();
+    set_change_verified (new_changes_num - 1, true);
+    set_change_benefit (new_changes_num - 1, benefit);
+    for (i = new_changes_num - 2; i >= old_changes_num; i--)
+      {
+	set_change_verified (i, true);
+	set_change_benefit (i, 0);
+	set_change_associated_with_last (i, true);
+      }
+    set_change_equal_note (old_changes_num, set_reg_equal);
+    return true;
+  }
 
-  return ok;
+  return false;
 }
 
 /* For the given single_set INSN, containing SRC known to be a
@@ -855,8 +810,7 @@ forward_propagate_subreg (df_ref use, rt
 	  && GET_MODE (SUBREG_REG (src)) == use_mode
 	  && subreg_lowpart_p (src)
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), SUBREG_REG (src), false);
     }
 
   /* If this is a SUBREG of a ZERO_EXTEND or SIGN_EXTEND, and the SUBREG
@@ -887,20 +841,136 @@ forward_propagate_subreg (df_ref use, rt
 	  && (targetm.mode_rep_extended (use_mode, GET_MODE (src))
 	      != (int) GET_CODE (src))
 	  && all_uses_available_at (def_insn, use_insn))
-	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0),
-				 def_insn, false);
+	return try_fwprop_subst (use, DF_REF_LOC (use), XEXP (src, 0), false);
     }
 
   return false;
 }
 
-/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.  */
+static void
+mems_modified_p (rtx dest, const_rtx setter ATTRIBUTE_UNUSED, void *data)
+{
+  bool *modified = (bool *)data;
+
+  /* If DEST is not a MEM, then it will not conflict with the load.  Note
+     that function calls are assumed to clobber memory, but are handled
+     elsewhere.  */
+  if (MEM_P (dest))
+    {
+      *modified = true;
+      return;
+    }
+}
+
+/* Check whether any memory modification insn from from insn
+   to to insn.  */
+
+static bool
+mem_may_be_modified (rtx from, rtx to)
+{
+  bool modified = false;
+  rtx insn;
+
+  /* For now, we only check the simple case where from and to
+     are in the same bb.  */
+  basic_block bb = BLOCK_FOR_INSN (from);
+  if (bb != BLOCK_FOR_INSN (to))
+    return true;
+
+  for (insn = from; insn != to; insn = NEXT_INSN (insn))
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      note_stores (PATTERN (insn), mems_modified_p, &modified);
+      if (modified)
+	break;
+
+      modified = CALL_P (insn);
+      if (modified)
+	break;
+
+      modified = volatile_insn_p (PATTERN (insn));
+      if (modified)
+	break;
+    }
+  gcc_assert (insn);
+  return modified;
+}
+
+struct rtx_search_arg
+{
+  /* What we are searching for.  */
+  rtx x;
+  /* The occurrence counter.  */
+  int n;
+};
+
+typedef struct rtx_search_arg *rtx_search_arg_p;
+
+int
+reg_occur_p (rtx *in, void *arg)
+{
+  enum rtx_code code;
+  rtx_search_arg_p p = (rtx_search_arg_p) arg;
+  rtx reg = p->x;
+
+  if (in == 0 || *in == 0)
+    return -1;
+
+  code = GET_CODE (*in);
+
+  switch (code)
+    {
+      /* Compare registers by number.  */
+    case REG:
+      if (REG_P (reg) && REGNO (*in) == REGNO (reg))
+	p->n++;
+      /* These codes have no constituent expressions
+	 and are unique.  */
+    case SCRATCH:
+    case CC0:
+    case PC:
+      /* Skip expr list.  */
+    case EXPR_LIST:
+    CASE_CONST_ANY:
+      /* These are kept unique for a given value.  */
+      return -1;
+
+    default:
+      break;
+    }
+
+  return 0;
+}
+
+/* Calculate how many times reg appears in rtx "in".  */
+
+static int
+reg_mentioned_num (rtx reg, rtx in)
+{
+  struct rtx_search_arg data;
+  enum rtx_code code = GET_CODE (reg);
+  gcc_assert (code == REG || code == SUBREG);
+  if (code == SUBREG)
+    reg = SUBREG_REG (reg);
+
+  data.x = reg;
+  data.n = 0;
+  for_each_rtx (&in, &reg_occur_p, (void *)&data);
+  return data.n;
+}
+
+/* Try to replace USE with SRC (defined in DEF_INSN) in __asm.
+   All the changes added here will be applied immediately without
+   affecting any existing changes. After this func, the changes
+   num is the same as before the func.  */
 
 static bool
 forward_propagate_asm (df_ref use, rtx def_insn, rtx def_set, rtx reg)
 {
   rtx use_insn = DF_REF_INSN (use), src, use_pat, asm_operands, new_rtx, *loc;
-  int speed_p, i;
+  int speed_p, i, old_change_num, new_change_num;
   df_ref *use_vec;
 
   gcc_assert ((DF_REF_FLAGS (use) & DF_REF_IN_NOTE) == 0);
@@ -914,7 +984,7 @@ forward_propagate_asm (df_ref use, rtx d
   if (use_vec[0] && use_vec[1])
     return false;
 
-  update_df_init (def_insn, use_insn);
+  old_change_num = num_changes_pending ();
   speed_p = optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn));
   asm_operands = NULL_RTX;
   switch (GET_CODE (use_pat))
@@ -962,14 +1032,43 @@ forward_propagate_asm (df_ref use, rtx d
 	validate_unshare_change (use_insn, loc, new_rtx, true);
     }
 
-  if (num_changes_pending () == 0 || !apply_change_group ())
+  new_change_num = num_changes_pending ();
+  if ((new_change_num - old_change_num) == 0
+      || !apply_change_group (old_change_num))
     return false;
 
-  update_df (use_insn, NULL);
-  num_changes++;
+  df_uses_create (&PATTERN (use_insn), use_insn, 0);
+  df_insn_rescan (use_insn);
+
   return true;
 }
 
+/* Find whether the set define a return reg.  */
+
+static bool
+def_return_reg (rtx set)
+{
+  edge eg;
+  edge_iterator ei;
+  rtx dest = SET_DEST (set);
+
+  if (!REG_P (dest))
+    return false;
+
+  FOR_EACH_EDGE (eg, ei, EXIT_BLOCK_PTR->preds)
+    if (eg->flags & EDGE_FALLTHRU)
+      {
+	basic_block src_bb = eg->src;
+	rtx last_insn, ret_reg;
+	if (NONJUMP_INSN_P ((last_insn = BB_END (src_bb)))
+	    && GET_CODE (PATTERN (last_insn)) == USE
+	    && GET_CODE ((ret_reg = XEXP (PATTERN (last_insn), 0))) == REG
+	    && REGNO (ret_reg) == REGNO (dest))
+	  return true;
+      }
+  return false;
+}
+
 /* Try to replace USE with SRC (defined in DEF_INSN) and simplify the
    result.  */
 
@@ -978,7 +1077,7 @@ forward_propagate_and_simplify (df_ref u
 {
   rtx use_insn = DF_REF_INSN (use);
   rtx use_set = single_set (use_insn);
-  rtx src, reg, new_rtx, *loc;
+  rtx src, reg, new_rtx, *loc, use_set_dest, use_set_src;
   bool set_reg_equal;
   enum machine_mode mode;
   int asm_use = -1;
@@ -1036,9 +1135,37 @@ forward_propagate_and_simplify (df_ref u
       return false;
     }
 
+  /* If only new_rtx contains varying mem or has other side effect, and
+     mem maybe modified between def and use, we cannot do propagation
+     safely. mem_may_be_modified is a simple check without inquiring
+     cfg and alias result.  */
+  if (for_each_rtx (&src, varying_mem_p, NULL)
+      && mem_may_be_modified (def_insn, use_insn))
+    return false;
+
+  if (volatile_refs_p (src))
+    return false;
+
   if (asm_use >= 0)
     return forward_propagate_asm (use, def_insn, def_set, reg);
 
+  /* If the dest of the use insn is a return reg, we don't try fwprop,
+     because mode-switching tries to find return reg copy insn and create
+     pre exit basicblock, and fwprop for return copy insn may make it
+     confused.  */
+  if (def_return_reg (use_set))
+    return false;
+
+  /* We have (hard reg = reg) type insns for func param passing or
+     return value setting. We don't want to propagate in such case
+     because it may restrict cse/gcse. Check hash_rtx and
+     hash_scan_set.  */
+  use_set_dest = SET_DEST (use_set);
+  use_set_src = SET_SRC (use_set);
+  if (REG_P (use_set_dest) && REG_P (use_set_src)
+      && (REGNO (use_set_dest) < FIRST_PSEUDO_REGISTER))
+    return false;
+
   /* Else try simplifying.  */
 
   if (DF_REF_TYPE (use) == DF_REF_REG_MEM_STORE)
@@ -1087,7 +1214,7 @@ forward_propagate_and_simplify (df_ref u
   if (!new_rtx)
     return false;
 
-  return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
+  return try_fwprop_subst (use, loc, new_rtx, set_reg_equal);
 }
 
 
@@ -1150,7 +1277,6 @@ forward_propagate_into (df_ref use)
   return false;
 }
 
-\f
 static void
 fwprop_init (void)
 {
@@ -1175,47 +1301,142 @@ fwprop_done (void)
   free_dominance_info (CDI_DOMINATORS);
   cleanup_cfg (0);
   delete_trivially_dead_insns (get_insns (), max_reg_num ());
-
-  if (dump_file)
-    fprintf (dump_file,
-	     "\nNumber of successful forward propagations: %d\n\n",
-	     num_changes);
 }
 
-
-/* Main entry point.  */
-
 static bool
 gate_fwprop (void)
 {
   return optimize > 0 && flag_forward_propagate;
 }
 
+/* Main func for forward propagation. Iterate all the uses connecting to
+   the same def. For each def-use pair, try forward propagate the src of
+   the def into the use. After all the def-use pairs are iterated, confirm
+   the changes based on the cost of the whole group.  */
+
+static bool
+iterate_def_uses (df_ref def, bool fwprop_addr)
+{
+  int use_num = 0;
+  int def_insn_cost = 0;
+  rtx def_insn, use_insn;
+  struct df_link *uses;
+  int reg_replaced_num = 0;
+  bool all_uses_replaced;
+  bool speed;
+
+  def_insn = DF_REF_INSN (def);
+  speed = optimize_bb_for_speed_p (BLOCK_FOR_INSN (def_insn));
+
+  if (def_insn)
+  {
+    rtx set = single_set (def_insn);
+    if (set)
+      def_insn_cost = set_src_cost (SET_SRC (set), speed)
+		      + set_src_cost (SET_DEST (set), speed) + 1;
+    else
+      return false;
+  }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "\n------------------------\n");
+      fprintf (dump_file, "Def %d:\n", INSN_UID (def_insn));
+    }
+
+  for (uses = DF_REF_CHAIN (def), use_num = 0;
+       uses; uses = uses->next)
+  {
+    int old_reg_num, new_reg_num;
+
+    df_ref use = uses->ref;
+    if (DF_REF_IS_ARTIFICIAL (use))
+	continue;
+
+    use_insn = DF_REF_INSN (use);
+    if (!NONDEBUG_INSN_P (use_insn))
+	continue;
+
+    if (dump_file)
+      fprintf (dump_file, "\tUse %d\n", INSN_UID (use_insn));
+
+    if (fwprop_addr)
+      {
+	if (DF_REF_TYPE (use) != DF_REF_REG_USE
+	    && DF_REF_BB (use)->loop_father != NULL
+	    /* The outer most loop is not really a loop.  */
+	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
+	  forward_propagate_into (use);
+      }
+    else
+      {
+	if (DF_REF_TYPE (use) == DF_REF_REG_USE
+	    || DF_REF_BB (use)->loop_father == NULL
+	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
+	  {
+	    old_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+
+	    forward_propagate_into (use);
+
+	    new_reg_num = reg_mentioned_num (DF_REF_REG (use), use_insn);
+	    reg_replaced_num += old_reg_num - new_reg_num;
+	  }
+      }
+    use_num++;
+  }
+
+  if (!use_num)
+    return false;
+
+  if (fwprop_addr)
+     return confirm_change_group_by_cost (false,
+					  0,
+					  false);
+  else
+    {
+      all_uses_replaced = (use_num == reg_replaced_num);
+      return confirm_change_group_by_cost (all_uses_replaced,
+					   def_insn_cost,
+					   true);
+    }
+}
+
+/* Try forward propagate src of the def to the normal uses.  */
+
 static unsigned int
 fwprop (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.
+  FOR_EACH_BB (bb)
+    {
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
 
-     Do not forward propagate addresses into loops until after unrolling.
-     CSE did so because it was able to fix its own mess, but we are not.  */
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, false);
+	      need_cleanup |= result;
 
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
-    {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) == DF_REF_REG_USE
-	    || DF_REF_BB (use)->loop_father == NULL
-	    /* The outer most loop is not really a loop.  */
-	    || loop_outer (DF_REF_BB (use)->loop_father) == NULL)
-	  need_cleanup |= forward_propagate_into (use);
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
+
   fwprop_done ();
   if (need_cleanup)
     cleanup_cfg (0);
@@ -1244,25 +1465,39 @@ struct rtl_opt_pass pass_rtl_fwprop =
  }
 };
 
+/* Try forward propagate src of the def to the uses in memory addresses.  */
+
 static unsigned int
 fwprop_addr (void)
 {
-  unsigned i;
+  basic_block bb;
+  rtx insn;
+  df_ref *def_vec;
   bool need_cleanup = false;
 
+  if (dump_file)
+    fprintf (dump_file, "\n============== fwprop_addr ==============\n");
+
   fwprop_init ();
 
-  /* Go through all the uses.  df_uses_create will create new ones at the
-     end, and we'll go through them as well.  */
-  for (i = 0; i < DF_USES_TABLE_SIZE (); i++)
+  FOR_EACH_BB (bb)
     {
-      df_ref use = DF_USES_GET (i);
-      if (use)
-	if (DF_REF_TYPE (use) != DF_REF_REG_USE
-	    && DF_REF_BB (use)->loop_father != NULL
-	    /* The outer most loop is not really a loop.  */
-	    && loop_outer (DF_REF_BB (use)->loop_father) != NULL)
-	  need_cleanup |= forward_propagate_into (use);
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (!NONDEBUG_INSN_P (insn)
+	      || CALL_P (insn))
+	    continue;
+
+	  for (def_vec = DF_INSN_DEFS (insn); *def_vec; def_vec++)
+	    {
+	      bool result;
+	      result = iterate_def_uses (*def_vec, true);
+	      need_cleanup |= result;
+
+	      if (result)
+		num_changes += 1;
+	    }
+	}
     }
 
   fwprop_done ();

[-- Attachment #9: patch.4 --]
[-- Type: application/octet-stream, Size: 524 bytes --]

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 196270)
+++ config/i386/i386.c	(working copy)
@@ -34170,6 +34181,12 @@ ix86_rtx_costs (rtx x, int code_i, int o
 	{
 	  if (CONST_INT_P (XEXP (x, 1)))
 	    *total = cost->shift_const;
+	  else if (GET_CODE (XEXP (x, 1)) == SUBREG
+		   && GET_CODE (XEXP (XEXP (x, 1), 0)) == AND)
+	    {
+	      *total = cost->shift_var;
+	      return true;
+	    }
 	  else
 	    *total = cost->shift_var;
 	}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-04-02  7:11                       ` Wei Mi
@ 2013-04-02  7:37                         ` Wei Mi
  2013-04-02  7:53                         ` Uros Bizjak
  1 sibling, 0 replies; 29+ messages in thread
From: Wei Mi @ 2013-04-02  7:37 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: GCC Patches, David Li, Uros Bizjak

[-- Attachment #1: Type: text/plain, Size: 6824 bytes --]

1.c attached.

On Mon, Apr 1, 2013 at 10:43 PM, Wei Mi <wmi@google.com> wrote:
> I attached the patch.4 based on r197308. r197308 changes shift-and
> type truncation from define_insn_and_split to define_insn.  patch.4
> changes ix86_rtx_costs for shift-and type rtx to get the correct cost
> for the result after the shift-and truncation.
>
> With the patch.1 ~ patch.4, fwprop extension could handle the
> motivational case 1.c attached by removing all the redundent "x & 63"
> operations.
>
> patch.1~patch.4 regression and bootstrap ok on
> x86_64-unknown-linux-gnu. Is it ok for trunk?
>
> Thanks,
> Wei.
>
> On Sun, Mar 17, 2013 at 12:15 AM, Wei Mi <wmi@google.com> wrote:
>> Hi,
>>
>> On Sat, Mar 16, 2013 at 3:48 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
>>> On Tue, Mar 12, 2013 at 8:18 AM, Wei Mi wrote:
>>>> For the motivational case, I need insn splitting to get the cost
>>>> right. insn splitting is not very intrusive. All I need is to call
>>>> split_insns func.
>>>
>>> It may not look very intrusive, but there's a lot happening in the
>>> back ground. You're creating a lot of new RTL, and then just throw it
>>> away again. You fake the compiler into thinking you're much deeper in
>>> the pipeline than you really are. You're assuming there are no
>>> side-effects other than that some insn gets split, but there are back
>>> ends where splitters may have side-effects.
>>
>> Ok, then I will remove the split_insns call.
>>
>>>
>>> Even though I've asked twice now, you still have not explained this
>>> motivational case, except to say that there is one. *What* are you
>>> trying to do, *what* is not happening without the splits, and *what*
>>> happens if you split. Only if you explain that in a lot more detail
>>> than "I have a motivational case" then we can look into what is a
>>> proper solution.
>>
>> :-). Sorry, I didn't say it clearly. The motivational case is the one
>> mentioned in the following posts (split_insns changes a << (b & 63) to
>> a << b).
>> http://gcc.gnu.org/ml/gcc/2013-01/msg00181.html
>> http://gcc.gnu.org/ml/gcc-patches/2013-02/msg01144.html
>>
>> If I remove the split_insns call and related cost estimation
>> adjustment, the fwprop 18-->22 and 18-->23 will punt, because fwprop
>> here looks like a reverse process of cse, the total cost after fwprop
>> change is increased.
>>
>> Def insn 18:
>>         Use insn 23
>>         Use insn 22
>>
>> If we include the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = 0, verified - ok  // The cost of insn 22 will
>> not change after fwprop + insn splitting.
>>   change[1]: benefit = 0, verified - ok  // The insn 23 is the same with insn 22
>> Total benefit is 5, fwprop will go on.
>>
>> If we remove the split_insns cost estimation adjustment.
>>   extra benefit by removing def insn 18 = 5
>>   change[0]: benefit = -4, verified - ok   // The costs of insn 22 and
>> insn 23 will increase after fwprop.
>>   change[1]: benefit = -4, verified - ok   // The insn 23 is the same
>> with insn 22
>> Total benefit is -3, fwprop will punt.
>>
>> How about adding the (a << (b&63) ==> a << b) transformation in
>> simplify_binary_operation_1, becuase (a << (b&63) ==> a << b) is a
>> kind of architecture specific expr simplification? Then fwprop could
>> do the propagation as I expect.
>>
>>>
>>> The problem with some of the splitters is that they exist to break up
>>> RTL from 'expand' to initially keep some pattern together to allow the
>>> code transformation passes to handle the pattern as one instruction.
>>> This made sense when RTL was the only intermediate representation and
>>> splitting too early would inhibit some optimizations. But I would
>>> expect most (if not all) such cases to be less relevant because of the
>>> GIMPLE middle-end work. The only splitters you can trigger are the
>>> pre-reload splitters (all the reload_completed conditions obviously
>>> can't trigger if you're splitting from fwprop). Perhaps those
>>> splitters can/should run earlier, or be made obsolete by expanding
>>> directly to the post-splitting insns.
>>>
>>> Unfortunately, it's not possible to tell for your case, because you
>>> haven't explained it yet...
>>>
>>>
>>>> So how about keep split_insns and remove peephole in the cost estimation func?
>>>
>>> I'd strongly oppose this. I do not believe this is necessary, and I
>>> think it's conceptually wrong.
>>>
>>>
>>>>> What happens if you propagate into an insn that uses the same register
>>>>> twice? Will the DU chains still be valid (I don't think that's
>>>>> guaranteed)?
>>>>
>>>> I think the DU chains still be valid. If propagate into the insn that
>>>> uses the same register twice, the two uses will be replaced when the
>>>> first use is seen (propagate_rtx_1 will propagate all the occurrances
>>>> of the same reg in the use insn).  When the second use is seen, the
>>>> df_use and use insn in its insn_info are still available.
>>>> forward_propagate_into will early return after check reg_mentioned_p
>>>> (DF_REF_REG (use), parent) and find out no reg is used  any more.
>>>
>>> With reg_mentioned_p you cannot verify that the DF_REF_LOC of USE is
>>> still valid.
>>
>> I think DF_REF_LOC of USE may be invalid if dangling rtx will be
>> recycled by garbage collection very soon (I don't know when GC will
>> happen). Although DF_REF_LOC of USE maybe invalid, the early return in
>> forward_propagate_into ensure it will not cause any correctness
>> problem.
>>
>>>
>>> In any case, returning to the RD problem for DU/UD chains is probably
>>> a good idea, now that RD is not such a hog anymore. In effect fwprop.c
>>> would return to what it looked like before the patch of r149010.
>>
>> I remove MD problem and use DU/UD instead.
>>
>>>
>>> As a way forward on all of this, I'd suggest the following steps, each
>>> with a separate patch:
>>
>> Thanks for the suggestion!
>>
>>> 1. replace the MD problem with RD again, and build full DU/UD chains.
>>
>> I include patch.1 attached.
>>
>>> 2. post all the recog changes separately, with minimum impact on the
>>> parts of the compiler you don't really change. (For apply_change_group
>>> you could even choose to overload it, or use a NUM argument with a
>>> default value -- not sure if default argument values are OK for GCC
>>> tho'.)
>>
>> patch.2 attached.
>>
>>> 3. implement propagation into multiple USEs, but without the splitting
>>> and peepholing.
>>
>> patch.3 attached.
>>
>>> 4. see about fixing the back end to either split earlier or expand to
>>> the desired patterns directly.
>>
>> I havn't included this part. If you agree with the proposal to add the
>> transformation (a << (b&63) ==> a << b) in
>> simplify_binary_operation_1, I will send out another patch about it.
>>
>> Thanks,
>> Wei.

[-- Attachment #2: 1.c --]
[-- Type: text/x-csrc, Size: 702 bytes --]

typedef unsigned long long uint64;
typedef unsigned int uint32;

class Decoder {
 public:
 Decoder() : k_minus_1_(0), buf_(0), bits_left_(0) {}
 ~Decoder() {}

 uint32 ExtractBits(uint64 end, uint64 start);
 inline uint32 GetBits(int bits) {
   uint32 val = ExtractBits(bits, 0);
   buf_ >>= bits;
   bits_left_ -= bits;
   return val;
 }

 uint32 Get(uint32 bits);

 uint32 k_minus_1_;
 uint64 buf_;
 unsigned long bits_left_;
};

uint32 Decoder::ExtractBits(uint64 end, uint64 start) {
 return (buf_ << (-end & 63)) >> ((start - end) & 63);
}

uint32 Decoder::Get(uint32 bits) {
 bits += k_minus_1_;
 uint32 msbit = (bits > (k_minus_1_ + 1));
 return GetBits(bits - msbit) | (msbit << (bits - 1));
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-04-02  7:11                       ` Wei Mi
  2013-04-02  7:37                         ` Wei Mi
@ 2013-04-02  7:53                         ` Uros Bizjak
  1 sibling, 0 replies; 29+ messages in thread
From: Uros Bizjak @ 2013-04-02  7:53 UTC (permalink / raw)
  To: Wei Mi; +Cc: Steven Bosscher, GCC Patches, David Li

On Tue, Apr 2, 2013 at 7:43 AM, Wei Mi <wmi@google.com> wrote:
> I attached the patch.4 based on r197308. r197308 changes shift-and
> type truncation from define_insn_and_split to define_insn.  patch.4
> changes ix86_rtx_costs for shift-and type rtx to get the correct cost
> for the result after the shift-and truncation.
>
> With the patch.1 ~ patch.4, fwprop extension could handle the
> motivational case 1.c attached by removing all the redundent "x & 63"
> operations.
>
> patch.1~patch.4 regression and bootstrap ok on
> x86_64-unknown-linux-gnu. Is it ok for trunk?

> 2013-04-01  Wei Mi  <wmi@google.com>
>
>     * config/i386/i386.c (ix86_rtx_costs): Set proper rtx cost for
>     ashl<mode>3_mask, *<shift_insn><mode>3_mask and
    *<rotate_insn><mode>3_mask in i386.md.

Patch 4 is OK for mainline and also for release branches that were
changed by your previous i386.md patch.

Thanks,
Uros.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-03-28 15:49                                     ` Uros Bizjak
@ 2013-04-03 20:54                                       ` Jakub Jelinek
  2013-04-04  5:13                                         ` Wei Mi
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Jelinek @ 2013-04-03 20:54 UTC (permalink / raw)
  To: Uros Bizjak
  Cc: Wei Mi, Richard Biener, Steven Bosscher, GCC Patches, David Li,
	Kirill Yukhin

On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote:
> 2013-03-27  Wei Mi  <wmi@google.com>
> 
> 	* config/i386/i386.md: Do shift truncation in define_insn
> 	instead of define_insn_and_split.
> 
> Please write ChangeLog as:
> 
> 	* config/i386/i386.md (*ashl<mode>3_mask): Rewrite as define_insn.
> 	Truncate operand 2 using %b asm operand modifier.
> 	(*<shift_insn><mode>3_mask): Ditto.
> 	(*<rotate_insn><mode>3_mask): Ditto.
> 
> OK for mainline and all release branches with these changes.

This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6
branch.  Fixed thusly, committed as obvious after bootstrapping/regtesting
on those targets.

2013-04-03  Jakub Jelinek  <jakub@redhat.com>

	* config/i386/i386.md (*<shiftrt_insn><mode>3_mask): Use
	<shiftrt> instead of <shift>.

--- gcc/config/i386/i386.md.jj	2013-04-03 16:11:07.000000000 +0200
+++ gcc/config/i386/i386.md	2013-04-03 17:42:15.034672014 +0200
@@ -9827,7 +9827,7 @@ (define_insn "*<shiftrt_insn><mode>3_mas
    && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
       == GET_MODE_BITSIZE (<MODE>mode)-1"
 {
-  return "<shift>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
+  return "<shiftrt>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
 }
   [(set_attr "type" "ishift")
    (set_attr "mode" "<MODE>")])


	Jakub

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: extend fwprop optimization
  2013-04-03 20:54                                       ` Jakub Jelinek
@ 2013-04-04  5:13                                         ` Wei Mi
  0 siblings, 0 replies; 29+ messages in thread
From: Wei Mi @ 2013-04-04  5:13 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Uros Bizjak, Richard Biener, Steven Bosscher, GCC Patches,
	David Li, Kirill Yukhin

Thanks for helping fixing it. I will take care to verify regression
and bootstrap before checkin to release branches next time.

Regards,
Wei.

On Wed, Apr 3, 2013 at 11:08 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Mar 28, 2013 at 04:49:47PM +0100, Uros Bizjak wrote:
>> 2013-03-27  Wei Mi  <wmi@google.com>
>>
>>       * config/i386/i386.md: Do shift truncation in define_insn
>>       instead of define_insn_and_split.
>>
>> Please write ChangeLog as:
>>
>>       * config/i386/i386.md (*ashl<mode>3_mask): Rewrite as define_insn.
>>       Truncate operand 2 using %b asm operand modifier.
>>       (*<shift_insn><mode>3_mask): Ditto.
>>       (*<rotate_insn><mode>3_mask): Ditto.
>>
>> OK for mainline and all release branches with these changes.
>
> This broke bootstrap on x86_64-linux as well as i686-linux on the 4.6
> branch.  Fixed thusly, committed as obvious after bootstrapping/regtesting
> on those targets.
>
> 2013-04-03  Jakub Jelinek  <jakub@redhat.com>
>
>         * config/i386/i386.md (*<shiftrt_insn><mode>3_mask): Use
>         <shiftrt> instead of <shift>.
>
> --- gcc/config/i386/i386.md.jj  2013-04-03 16:11:07.000000000 +0200
> +++ gcc/config/i386/i386.md     2013-04-03 17:42:15.034672014 +0200
> @@ -9827,7 +9827,7 @@ (define_insn "*<shiftrt_insn><mode>3_mas
>     && (INTVAL (operands[3]) & (GET_MODE_BITSIZE (<MODE>mode)-1))
>        == GET_MODE_BITSIZE (<MODE>mode)-1"
>  {
> -  return "<shift>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
> +  return "<shiftrt>{<imodesuffix>}\t{%b2, %0|%0, %b2}";
>  }
>    [(set_attr "type" "ishift")
>     (set_attr "mode" "<MODE>")])
>
>
>         Jakub

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2013-04-03 20:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-25 23:32 extend fwprop optimization Wei Mi
2013-02-26  0:08 ` Steven Bosscher
2013-02-26  1:12   ` Wei Mi
2013-02-26 11:00     ` Steven Bosscher
2013-02-27 18:37       ` Wei Mi
2013-02-27 21:22         ` Steven Bosscher
2013-02-27 21:56           ` Wei Mi
2013-03-11  5:52             ` Wei Mi
2013-03-11 18:10               ` Jeff Law
2013-03-11 18:17                 ` Steven Bosscher
2013-03-11 19:52               ` Steven Bosscher
2013-03-12  7:18                 ` Wei Mi
2013-03-16 22:49                   ` Steven Bosscher
2013-03-17  7:15                     ` Wei Mi
2013-03-17  7:23                       ` Andrew Pinski
2013-03-24  4:18                       ` Wei Mi
2013-03-24 12:33                         ` Oleg Endo
2013-03-25  9:36                         ` Richard Biener
2013-03-25 17:29                           ` Wei Mi
2013-03-25 17:33                             ` Wei Mi
2013-03-26  9:14                               ` Richard Biener
2013-03-26 18:23                                 ` Uros Bizjak
2013-03-28  4:34                                   ` Wei Mi
2013-03-28 15:49                                     ` Uros Bizjak
2013-04-03 20:54                                       ` Jakub Jelinek
2013-04-04  5:13                                         ` Wei Mi
2013-04-02  7:11                       ` Wei Mi
2013-04-02  7:37                         ` Wei Mi
2013-04-02  7:53                         ` Uros Bizjak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).