public inbox for gcc-patches@gcc.gnu.org
* [PATCH 2/9] [doloop] Correct extracting loop exit condition
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
@ 2011-07-21 16:31 ` zhroma
  2011-07-22 12:22   ` Richard Sandiford
  2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC
  To: gcc-patches; +Cc: dm

This patch fixes a compiler segfault found while regtesting trunk with SMS on
the IA64 platform.  The segfault happens on test gcc.dg/pr45259.c with
-fmodulo-sched enabled.  The following jump instruction is passed as the
argument to the doloop_condition_get function:
(jump_insn 86 85 88 7 (set (pc)
        (reg/f:DI 403)) 339 {indirect_jump}
     (expr_list:REG_DEAD (reg/f:DI 403)
        (nil)))
The patch checks the form of the comparison instruction before extracting the
loop exit condition.
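
The guard can be illustrated outside GCC with a minimal sketch on a toy tree
type.  Everything named toy_* below is hypothetical and only mimics the shape
of the RTL involved; it is not GCC's RTL API.  An indirect jump has a bare
register as the source of its SET, so the condition must not be read from it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for RTL codes; these are NOT GCC's real definitions.  */
enum toy_code { TOY_SET, TOY_IF_THEN_ELSE, TOY_NE, TOY_REG, TOY_CONST0, TOY_PC };

struct toy_rtx
{
  enum toy_code code;
  const struct toy_rtx *op0;	/* SET: dest; IF_THEN_ELSE: condition; NE: lhs.  */
  const struct toy_rtx *op1;	/* SET: source; NE: rhs.  */
};

/* Return the (reg != 0) condition of a conditional jump pattern CMP,
   or NULL if CMP does not have the expected form.  */
static const struct toy_rtx *
toy_condition_get (const struct toy_rtx *cmp)
{
  const struct toy_rtx *src, *cond;

  assert (cmp != NULL);

  /* Check the form first: a SET whose source is an IF_THEN_ELSE.
     This mirrors the check the patch adds.  */
  if (cmp->code != TOY_SET)
    return NULL;
  src = cmp->op1;
  if (src == NULL || src->code != TOY_IF_THEN_ELSE)
    return NULL;

  /* Only now is it safe to look at the condition operand.  */
  cond = src->op0;
  if (cond == NULL || cond->code != TOY_NE
      || cond->op1 == NULL || cond->op1->code != TOY_CONST0)
    return NULL;

  return cond;
}
```

With this sketch, a pattern shaped like (set (pc) (if_then_else (ne reg 0) ...))
yields its condition, while one shaped like (set (pc) (reg)) is rejected
instead of being dereferenced.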

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* loop-doloop.c (doloop_condition_get): Correctly check
	the form of comparison instruction.
---
 gcc/loop-doloop.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
index f8429c4..dfc4a16 100644
--- a/gcc/loop-doloop.c
+++ b/gcc/loop-doloop.c
@@ -153,6 +153,8 @@ doloop_condition_get (rtx doloop_pat)
       else
         inc = PATTERN (prev_insn);
       /* We expect the condition to be of the form (reg != 0)  */
+      if (GET_CODE (cmp) != SET || GET_CODE (SET_SRC (cmp)) != IF_THEN_ELSE)
+	return 0;
       cond = XEXP (SET_SRC (cmp), 0);
       if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
         return 0;
-- 
Roman Zhuykov
zhroma@ispras.ru

* [PATCH 0/9] [RFC] Expand SMS functionality
@ 2011-07-21 16:31 zhroma
  2011-07-21 16:31 ` [PATCH 2/9] [doloop] Correct extracting loop exit condition zhroma
                   ` (9 more replies)
  0 siblings, 10 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC
  To: gcc-patches; +Cc: dm

All the work described in the following emails was done while trying to improve
SMS functionality.  The main idea is to remove the requirement for a doloop_end
instruction pattern.  This allows SMS to work on more platforms, for example
x86-64 and ARM.

All the work was done on top of the following patch, which is a combination of
these 7 patches by Revital Eres and Alexandre Oliva:
http://gcc.gnu.org/ml/gcc-patches/2011-05/msg01340.html
http://gcc.gnu.org/ml/gcc-patches/2011-05/msg01341.html
http://gcc.gnu.org/ml/gcc-patches/2011-05/msg01342.html
http://gcc.gnu.org/ml/gcc-patches/2011-05/msg01344.html
http://gcc.gnu.org/ml/gcc-patches/2011-05/msg00246.html
http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01309.html
http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01294.html
None of these patches is in trunk yet, but they are directly related to SMS and
will probably be in trunk soon.

The combination of all these patches and all my patches, with SMS enabled by
default, successfully passes bootstrap and regtest on x86_64 and IA64, and
regtest with an ARM cross compiler.

The resulting compiler shows the following SPEC CPU 2000 results on a Core 2 Duo
E8400 (x86_64).  In this table, column A shows SMS vs. trunk and column B shows
"SMS with -fmodulo-sched-allow-regmoves and -funsafe-loop-optimizations" vs.
"trunk with -funsafe-loop-optimizations".

  A,%   B,%  Test
 0.18  0.00 164.gzip
 0.44 -0.62 175.vpr
 2.52  2.55 176.gcc
-0.53  0.69 181.mcf
 0.16 -0.92 186.crafty
-0.12  0.30 197.parser
 0.00  1.72 252.eon
-1.72 -0.10 253.perlbmk
 1.24  0.08 254.gap
-0.19  0.32 255.vortex
-0.17  0.29 256.bzip2
 0.28  0.75 300.twolf
 0.17  0.42 CINT2000
-0.12 -4.20 168.wupwise
-0.10 -5.02 171.swim
-0.05  6.02 172.mgrid
 0.32 -0.70 173.applu
 0.55 -1.20 177.mesa
 0.00 -0.25 178.galgel
-1.82 -0.31 179.art
 0.17 -0.31 183.equake
 0.00  2.94 187.facerec
 0.15 -1.95 188.ammp
 0.03  0.03 189.lucas
 0.28  0.62 191.fma3d
 0.00 -2.14 200.sixtrack
 0.00  0.70 301.apsi
-0.04 -0.45 CFP2000
 0.06 -0.05 TOTAL

I looked at the worst result, which is -5% on test 171.swim.  There is a loop
with 12 instructions where SMS creates 3 additional register-move operations,
and then the register allocator adds 3 more register moves due to hardware
limits.  So the new loop kernel contains 18 instructions and executes more
slowly.
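
The cost of such register moves can be sketched with a simplified model (an
assumption for illustration, not the exact computation in modulo-sched.c): a
value defined at one cycle and used more than II cycles later needs roughly
one extra copy per full II it stays live.  The toy_* helpers are hypothetical;
the model ignores the same-row column adjustment and cross-iteration distances.

```c
#include <assert.h>

/* Simplified model: number of register copies needed when a value is
   defined at cycle DEF_TIME and used at cycle USE_TIME in a schedule
   with initiation interval II.  Not GCC's exact formula.  */
static int
toy_copies_needed (int def_time, int use_time, int ii)
{
  assert (ii > 0 && use_time >= def_time);
  return (use_time - def_time) / ii;
}

/* Kernel size after scheduling: the original instructions plus the moves
   added by SMS plus the moves added later by the register allocator.  */
static int
toy_kernel_size (int original_insns, int sms_moves, int ra_moves)
{
  return original_insns + sms_moves + ra_moves;
}
```

For the 171.swim loop described above, toy_kernel_size (12, 3, 3) gives the
18-instruction kernel.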

On an ARM Cortex-A9 test board, SPEC INT 2000 shows the following results.
Column X is the difference at -O2 between the new and old SMS approaches.
Column Y shows the same, but with -fmodulo-sched-allow-regmoves and
-funsafe-loop-optimizations enabled.

 X,%   Y,%  Test
-0.19 -0.37 gzip   
-0.01  0.16 vpr    
-3.47  3.22 gcc    
-1.28 -0.70 mcf    
 2.31  0.63 crafty 
-0.41  0.11 parser 
-0.05 -0.57 eon    
-1.64  1.31 perlbmk
-2.71 -3.09 gap    
-0.80 -0.15 vortex 
-0.07 -0.01 bzip2  
-0.81 -0.14 twolf  
-0.75  0.04 CINT2000

Unfortunately, SMS shows good results on only some of the tests.
Still, I think this improvement should make its way into trunk.

---
 gcc/ddg.c          |   33 ++-
 gcc/modulo-sched.c |  744 +++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 593 insertions(+), 184 deletions(-)

diff --git a/gcc/ddg.c b/gcc/ddg.c
index d06bdbb..2bb2cc1 100644
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -197,11 +197,6 @@ create_ddg_dep_from_intra_loop_link (ddg_ptr g, ddg_node_ptr src_node,
         }
     }
 
-  /* If a true dep edge enters the branch create an anti edge in the
-     opposite direction to prevent the creation of reg-moves.  */
-  if ((DEP_TYPE (link) == REG_DEP_TRUE) && JUMP_P (dest_node->insn))
-    create_ddg_dep_no_link (g, dest_node, src_node, ANTI_DEP, REG_DEP, 1);
-
    latency = dep_cost (link);
    e = create_ddg_edge (src_node, dest_node, t, dt, latency, distance);
    add_edge_to_ddg (g, e);
@@ -306,8 +301,11 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 
 	  gcc_assert (first_def_node);
 
+         /* Always create the edge if the use node is a branch in
+            order to prevent the creation of reg-moves.  */
           if (DF_REF_ID (last_def) != DF_REF_ID (first_def)
-              || !flag_modulo_sched_allow_regmoves)
+              || !flag_modulo_sched_allow_regmoves
+              || (flag_modulo_sched_allow_regmoves && JUMP_P (use_node->insn)))
             create_ddg_dep_no_link (g, use_node, first_def_node, ANTI_DEP,
                                     REG_DEP, 1);
 
@@ -484,7 +482,12 @@ build_intra_loop_deps (ddg_ptr g)
 
       FOR_EACH_DEP (dest_node->insn, SD_LIST_BACK, sd_it, dep)
 	{
-	  ddg_node_ptr src_node = get_node_of_insn (g, DEP_PRO (dep));
+	  ddg_node_ptr src_node;
+
+	  if (DEBUG_INSN_P (DEP_PRO (dep)) && !DEBUG_INSN_P (dest_node->insn))
+	    continue;
+
+	  src_node = get_node_of_insn (g, DEP_PRO (dep));
 
 	  if (!src_node)
 	    continue;
@@ -1091,12 +1094,18 @@ find_nodes_on_paths (sbitmap result, ddg_ptr g, sbitmap from, sbitmap to)
 	  ddg_edge_ptr e;
 	  ddg_node_ptr u_node = &g->nodes[u];
 
+	  /* Ignore DEBUG_INSNs when calculating the SCCs to avoid their
+	     influence on the scheduling order and rec_mii.  */
+	  if (DEBUG_INSN_P (u_node->insn))
+	    continue;
+
 	  for (e = u_node->out; e != (ddg_edge_ptr) 0; e = e->next_out)
 	    {
 	      ddg_node_ptr v_node = e->dest;
 	      int v = v_node->cuid;
 
-	      if (!TEST_BIT (reachable_from, v))
+	      /* Ignore DEBUG_INSN when calculating the SCCs.  */
+	      if (!TEST_BIT (reachable_from, v) && !DEBUG_INSN_P (v_node->insn))
 		{
 		  SET_BIT (reachable_from, v);
 		  SET_BIT (tmp, v);
@@ -1120,12 +1129,18 @@ find_nodes_on_paths (sbitmap result, ddg_ptr g, sbitmap from, sbitmap to)
 	  ddg_edge_ptr e;
 	  ddg_node_ptr u_node = &g->nodes[u];
 
+	  /* Ignore DEBUG_INSNs when calculating the SCCs to avoid their
+	     influence on the scheduling order and rec_mii.  */
+	  if (DEBUG_INSN_P (u_node->insn))
+	    continue;
+
 	  for (e = u_node->in; e != (ddg_edge_ptr) 0; e = e->next_in)
 	    {
 	      ddg_node_ptr v_node = e->src;
 	      int v = v_node->cuid;
 
-	      if (!TEST_BIT (reach_to, v))
+	      /* Ignore DEBUG_INSN when calculating the SCCs.  */
+	      if (!TEST_BIT (reach_to, v) && !DEBUG_INSN_P (v_node->insn))
 		{
 		  SET_BIT (reach_to, v);
 		  SET_BIT (tmp, v);
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 668aa22..24d99af 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -84,13 +84,14 @@ along with GCC; see the file COPYING3.  If not see
       II cycles (i.e. use register copies to prevent a def from overwriting
       itself before reaching the use).
 
-    SMS works with countable loops whose loop count can be easily
-    adjusted.  This is because we peel a constant number of iterations
-    into a prologue and epilogue for which we want to avoid emitting
-    the control part, and a kernel which is to iterate that constant
-    number of iterations less than the original loop.  So the control
-    part should be a set of insns clearly identified and having its
-    own iv, not otherwise used in the loop (at-least for now), which
+    SMS works with countable loops (1) whose control part can be easily
+    decoupled from the rest of the loop and (2) whose loop count can
+    be easily adjusted.  This is because we peel a constant number of
+    iterations into a prologue and epilogue for which we want to avoid
+    emitting the control part, and a kernel which is to iterate that
+    constant number of iterations less than the original loop.  So the
+    control part should be a set of insns clearly identified and having
+    its own iv, not otherwise used in the loop (at-least for now), which
     initializes a register before the loop to the number of iterations.
     Currently SMS relies on the do-loop pattern to recognize such loops,
     where (1) the control part comprises of all insns defining and/or
@@ -202,33 +203,58 @@ static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
 static void duplicate_insns_of_cycles (partial_schedule_ptr,
 				       int, int, int, rtx);
-static int calculate_stage_count (partial_schedule_ptr ps);
+static int calculate_stage_count (partial_schedule_ptr, int);
+static void calculate_must_precede_follow (ddg_node_ptr, int, int,
+					   int, int, sbitmap, sbitmap, sbitmap);
+static int get_sched_window (partial_schedule_ptr, ddg_node_ptr,
+			     sbitmap, int, int *, int *, int *);
+static bool try_scheduling_node_in_cycle (partial_schedule_ptr, ddg_node_ptr,
+					  int, int, sbitmap, int *, sbitmap,
+					  sbitmap);
+static bool remove_node_from_ps (partial_schedule_ptr, ps_insn_ptr);
+static int record_inc_dec_insn_info (rtx, rtx, rtx, rtx, rtx, void *);
+
 #define SCHED_ASAP(x) (((node_sched_params_ptr)(x)->aux.info)->asap)
 #define SCHED_TIME(x) (((node_sched_params_ptr)(x)->aux.info)->time)
-#define SCHED_FIRST_REG_MOVE(x) \
-	(((node_sched_params_ptr)(x)->aux.info)->first_reg_move)
-#define SCHED_NREG_MOVES(x) \
-	(((node_sched_params_ptr)(x)->aux.info)->nreg_moves)
 #define SCHED_ROW(x) (((node_sched_params_ptr)(x)->aux.info)->row)
 #define SCHED_STAGE(x) (((node_sched_params_ptr)(x)->aux.info)->stage)
 #define SCHED_COLUMN(x) (((node_sched_params_ptr)(x)->aux.info)->column)
 
-/* The scheduling parameters held for each node.  */
-typedef struct node_sched_params
+/* Information about register-move generated for a definition.  */
+struct regmove_info
 {
-  int asap;	/* A lower-bound on the absolute scheduling cycle.  */
-  int time;	/* The absolute scheduling cycle (time >= asap).  */
+  /* The definition for which the register-move is generated for.  */
+  rtx def;
 
   /* The following field (first_reg_move) is a pointer to the first
-     register-move instruction added to handle the modulo-variable-expansion
-     of the register defined by this node.  This register-move copies the
-     original register defined by the node.  */
+     register-move instruction added to handle the
+     modulo-variable-expansion of the register defined by this node.
+     This register-move copies the original register defined by the node.
+  */
   rtx first_reg_move;
 
   /* The number of register-move instructions added, immediately preceding
      first_reg_move.  */
   int nreg_moves;
 
+  /* Auxiliary info used in the calculation of the register-moves.  */
+  void *aux;
+};
+
+typedef struct regmove_info *regmove_info_ptr;
+DEF_VEC_P (regmove_info_ptr);
+DEF_VEC_ALLOC_P (regmove_info_ptr, heap);
+
+/* The scheduling parameters held for each node.  */
+typedef struct node_sched_params
+{
+  int asap;	/* A lower-bound on the absolute scheduling cycle.  */
+  int time;	/* The absolute scheduling cycle (time >= asap).  */
+
+  /* Information about register-moves needed for
+     definitions in the instruction.  */
+  VEC (regmove_info_ptr, heap) *insn_regmove_info;
+
   int row;    /* Holds time % ii.  */
   int stage;  /* Holds time / ii.  */
 
@@ -406,12 +432,58 @@ set_node_sched_params (ddg_ptr g)
      appropriate sched_params structure.  */
   for (i = 0; i < g->num_nodes; i++)
     {
+      rtx insn = g->nodes[i].insn;
+      rtx note = find_reg_note (insn, REG_INC, NULL_RTX);
+      rtx set = single_set (insn);
+
       /* Watch out for aliasing problems?  */
       node_sched_params[i].asap = g->nodes[i].aux.count;
+      node_sched_params[i].insn_regmove_info = NULL;
+
+      /* Record the definition(s) in the instruction.  These will be
+	 later used to calculate the register-moves needed for each
+	 definition. */
+      if (set && REG_P (SET_DEST (set)))
+	{
+	  regmove_info_ptr elt =
+	    (regmove_info_ptr) xcalloc (1, sizeof (struct regmove_info));
+
+	  elt->def = SET_DEST (set);
+	  VEC_safe_push (regmove_info_ptr, heap,
+			 node_sched_params[i].insn_regmove_info,
+			 elt);
+	}
+
+      if (note)
+	for_each_inc_dec (&insn, record_inc_dec_insn_info,
+			  &node_sched_params[i]);
+
       g->nodes[i].aux.info = &node_sched_params[i];
     }
 }
 
+/* Free the sched_params information allocated for each node.  */
+static void
+free_node_sched_params (ddg_ptr g)
+{
+  int i;
+  regmove_info_ptr def;
+
+  for (i = 0; i < g->num_nodes; i++)
+    {
+      int j;
+      VEC (regmove_info_ptr, heap) *rinfo =
+	node_sched_params[i].insn_regmove_info;
+
+      for (j = 0; VEC_iterate (regmove_info_ptr, rinfo, j, def); j++)
+	free (def);
+
+      VEC_free (regmove_info_ptr, heap, rinfo);
+    }
+
+  free (node_sched_params);
+}
+
 static void
 print_node_sched_params (FILE *file, int num_nodes, ddg_ptr g)
 {
@@ -421,20 +493,32 @@ print_node_sched_params (FILE *file, int num_nodes, ddg_ptr g)
     return;
   for (i = 0; i < num_nodes; i++)
     {
+      int k;
       node_sched_params_ptr nsp = &node_sched_params[i];
-      rtx reg_move = nsp->first_reg_move;
-      int j;
+      regmove_info_ptr def;
+      VEC (regmove_info_ptr, heap) *rinfo =
+	nsp->insn_regmove_info;
 
       fprintf (file, "Node = %d; INSN = %d\n", i,
 	       (INSN_UID (g->nodes[i].insn)));
       fprintf (file, " asap = %d:\n", nsp->asap);
       fprintf (file, " time = %d:\n", nsp->time);
-      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
-      for (j = 0; j < nsp->nreg_moves; j++)
+
+      /* Iterate over the definitions in the instruction printing the
+         reg-moves needed definition for each definition. */
+      for (k = 0; VEC_iterate (regmove_info_ptr, rinfo, k, def); k++)
 	{
-	  fprintf (file, " reg_move = ");
-	  print_rtl_single (file, reg_move);
-	  reg_move = PREV_INSN (reg_move);
+	  rtx reg_move = def->first_reg_move;
+	  int j;
+	  fprintf (file, "def:\n");
+	  print_rtl_single (file, def->def);
+	  fprintf (file, " nreg_moves = %d\n", def->nreg_moves);
+	  for (j = 0; j < def->nreg_moves; j++)
+	    {
+	      fprintf (file, " reg_move = ");
+	      print_rtl_single (file, reg_move);
+	      reg_move = PREV_INSN (reg_move);
+	    }
 	}
     }
 }
@@ -455,17 +539,20 @@ generate_reg_moves (partial_schedule_ptr ps, bool rescan)
 {
   ddg_ptr g = ps->g;
   int ii = ps->ii;
-  int i;
+  int i, j;
   struct undo_replace_buff_elem *reg_move_replaces = NULL;
 
   for (i = 0; i < g->num_nodes; i++)
     {
       ddg_node_ptr u = &g->nodes[i];
       ddg_edge_ptr e;
-      int nreg_moves = 0, i_reg_move;
-      sbitmap *uses_of_defs;
+      int i_reg_move;
       rtx last_reg_move;
       rtx prev_reg, old_reg;
+      bool need_reg_moves_p = false;
+      VEC (regmove_info_ptr, heap) *rinfo =
+	node_sched_params[i].insn_regmove_info;
+      regmove_info_ptr def;
 
       /* Compute the number of reg_moves needed for u, by looking at life
 	 ranges started at u (excluding self-loops).  */
@@ -483,18 +570,41 @@ generate_reg_moves (partial_schedule_ptr ps, bool rescan)
 		&& SCHED_COLUMN (e->dest) < SCHED_COLUMN (e->src))
 	      nreg_moves4e--;
 
-	    nreg_moves = MAX (nreg_moves, nreg_moves4e);
+	    /* Iterate over the definitions in the instruction and record
+	       the information about reg-moves needed for each one.  */
+	    for (j = 0; VEC_iterate (regmove_info_ptr, rinfo, j, def); j++)
+	      {
+		if (rtx_referenced_p (def->def, e->dest->insn))
+		  {
+		    rtx set = single_set (e->dest->insn);
+
+		    /* Check that the TRUE_DEP edge belongs to the current
+		       definition.  */
+		    if (set && REG_P (SET_DEST (set))
+			&& (SET_DEST (set) == def->def))
+		      continue;
+
+		    def->nreg_moves = MAX (def->nreg_moves, nreg_moves4e);
+		    if (def->nreg_moves != 0)
+		      need_reg_moves_p = true;
+		  }
+	      }
 	  }
 
-      if (nreg_moves == 0)
+      if (!need_reg_moves_p)
 	continue;
 
-      /* Every use of the register defined by node may require a different
-	 copy of this register, depending on the time the use is scheduled.
-	 Set a bitmap vector, telling which nodes use each copy of this
-	 register.  */
-      uses_of_defs = sbitmap_vector_alloc (nreg_moves, g->num_nodes);
-      sbitmap_vector_zero (uses_of_defs, nreg_moves);
+      for (j = 0; VEC_iterate (regmove_info_ptr, rinfo, j, def); j++)
+	{
+	  def->aux =
+	    (sbitmap *) sbitmap_vector_alloc (def->nreg_moves, g->num_nodes);
+
+	  /* Every use of the register defined by node may require a different
+	     copy of this register, depending on the time the use is scheduled.
+	     Set a bitmap vector, telling which nodes use each copy of this
+	     register.  */
+	  sbitmap_vector_zero ((sbitmap *)def->aux, def->nreg_moves);
+	}
       for (e = u->out; e; e = e->next_out)
 	if (e->type == TRUE_DEP && e->dest != e->src)
 	  {
@@ -508,55 +618,79 @@ generate_reg_moves (partial_schedule_ptr ps, bool rescan)
 	      dest_copy--;
 
 	    if (dest_copy)
-	      SET_BIT (uses_of_defs[dest_copy - 1], e->dest->cuid);
+	      {
+		/* Iterate over the definitions in the instruction and record
+		   the information about reg-moves needed for each one.  */
+		for (j = 0; VEC_iterate (regmove_info_ptr, rinfo, j, def); j++)
+		  {
+		    sbitmap *uses_of_def = (sbitmap *)def->aux;
+
+		    if (rtx_referenced_p (def->def, e->dest->insn))
+		      {
+			rtx set = single_set (e->dest->insn);
+
+			/* Check that the TRUE_DEP edge belongs to the current
+			   definition.  */
+			if (set && REG_P (SET_DEST (set))
+			    && (SET_DEST (set) == def->def))
+			  continue;
+
+			SET_BIT (uses_of_def[dest_copy - 1], e->dest->cuid);
+		      }
+		  }
+	      }
 	  }
 
       /* Now generate the reg_moves, attaching relevant uses to them.  */
-      SCHED_NREG_MOVES (u) = nreg_moves;
-      old_reg = prev_reg = copy_rtx (SET_DEST (single_set (u->insn)));
-      /* Insert the reg-moves right before the notes which precede
-         the insn they relates to.  */
-      last_reg_move = u->first_note;
-
-      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
+      for (j = 0; VEC_iterate (regmove_info_ptr, rinfo, j, def); j++)
 	{
-	  unsigned int i_use = 0;
-	  rtx new_reg = gen_reg_rtx (GET_MODE (prev_reg));
-	  rtx reg_move = gen_move_insn (new_reg, prev_reg);
-	  sbitmap_iterator sbi;
-
-	  add_insn_before (reg_move, last_reg_move, NULL);
-	  last_reg_move = reg_move;
+	  sbitmap *uses_of_def = (sbitmap *)def->aux;
+	  old_reg = prev_reg = copy_rtx (def->def);
 
-	  if (!SCHED_FIRST_REG_MOVE (u))
-	    SCHED_FIRST_REG_MOVE (u) = reg_move;
+	  /* Insert the reg-moves right before the notes which precede
+	     the insn they relates to.  */
+	  last_reg_move = u->first_note;
 
-	  EXECUTE_IF_SET_IN_SBITMAP (uses_of_defs[i_reg_move], 0, i_use, sbi)
+	  for (i_reg_move = 0; i_reg_move < def->nreg_moves; i_reg_move++)
 	    {
-	      struct undo_replace_buff_elem *rep;
+	      unsigned int i_use = 0;
+	      rtx new_reg = gen_reg_rtx (GET_MODE (prev_reg));
+	      rtx reg_move = gen_move_insn (new_reg, prev_reg);
+	      sbitmap_iterator sbi;
 
-	      rep = (struct undo_replace_buff_elem *)
-		    xcalloc (1, sizeof (struct undo_replace_buff_elem));
-	      rep->insn = g->nodes[i_use].insn;
-	      rep->orig_reg = old_reg;
-	      rep->new_reg = new_reg;
+	      add_insn_before (reg_move, last_reg_move, NULL);
+	      last_reg_move = reg_move;
+
+	      if (!def->first_reg_move)
+		def->first_reg_move = reg_move;
 
-	      if (! reg_move_replaces)
-		reg_move_replaces = rep;
-	      else
+	      EXECUTE_IF_SET_IN_SBITMAP (uses_of_def[i_reg_move], 0, i_use, sbi)
 		{
-		  rep->next = reg_move_replaces;
-		  reg_move_replaces = rep;
+		  struct undo_replace_buff_elem *rep;
+
+		  rep = (struct undo_replace_buff_elem *)
+		    xcalloc (1, sizeof (struct undo_replace_buff_elem));
+		  rep->insn = g->nodes[i_use].insn;
+		  rep->orig_reg = old_reg;
+		  rep->new_reg = new_reg;
+
+		  if (! reg_move_replaces)
+		    reg_move_replaces = rep;
+		  else
+		    {
+		      rep->next = reg_move_replaces;
+		      reg_move_replaces = rep;
+		    }
+
+		  replace_rtx (g->nodes[i_use].insn, old_reg, new_reg);
+		  if (rescan)
+		    df_insn_rescan (g->nodes[i_use].insn);
 		}
 
-	      replace_rtx (g->nodes[i_use].insn, old_reg, new_reg);
-	      if (rescan)
-		df_insn_rescan (g->nodes[i_use].insn);
+	      prev_reg = new_reg;
 	    }
-
-	  prev_reg = new_reg;
+	  sbitmap_vector_free (uses_of_def);
 	}
-      sbitmap_vector_free (uses_of_defs);
     }
   return reg_move_replaces;
 }
@@ -575,6 +709,33 @@ free_undo_replace_buff (struct undo_replace_buff_elem *reg_move_replaces)
     }
 }
 
+/* Update the sched_params for node U using the II,
+   the CYCLE of U and MIN_CYCLE.  */
+static void
+update_node_sched_params (ddg_node_ptr u, int ii, int cycle, int min_cycle)
+{
+  int sc_until_cycle_zero;
+  int stage;
+
+  SCHED_TIME (u) = cycle;
+  SCHED_ROW (u) = SMODULO (cycle, ii);
+
+  /* The calculation of stage count is done adding the number
+     of stages before cycle zero and after cycle zero.  */
+  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
+
+  if (SCHED_TIME (u) < 0)
+    {
+      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
+    }
+  else
+    {
+      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
+    }
+}
+
 /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
    SCHED_ROW and SCHED_STAGE.  Instruction scheduled on cycle AMOUNT
    will move to cycle zero.  */
@@ -591,15 +752,14 @@ reset_sched_times (partial_schedule_ptr ps, int amount)
 	ddg_node_ptr u = crr_insn->node;
 	int normalized_time = SCHED_TIME (u) - amount;
 	int new_min_cycle = PS_MIN_CYCLE (ps) - amount;
-        int sc_until_cycle_zero, stage;
 
         if (dump_file)
           {
             /* Print the scheduling times after the rotation.  */
             fprintf (dump_file, "crr_insn->node=%d (insn id %d), "
                      "crr_insn->cycle=%d, min_cycle=%d", crr_insn->node->cuid,
-                     INSN_UID (crr_insn->node->insn), SCHED_TIME (u),
-                     normalized_time);
+                     INSN_UID (crr_insn->node->insn), normalized_time,
+                     new_min_cycle);
             if (JUMP_P (crr_insn->node->insn))
               fprintf (dump_file, " (branch)");
             fprintf (dump_file, "\n");
@@ -607,23 +767,9 @@ reset_sched_times (partial_schedule_ptr ps, int amount)
 	
 	gcc_assert (SCHED_TIME (u) >= ps->min_cycle);
 	gcc_assert (SCHED_TIME (u) <= ps->max_cycle);
-	SCHED_TIME (u) = normalized_time;
-	SCHED_ROW (u) = SMODULO (normalized_time, ii);
-      
-        /* The calculation of stage count is done adding the number
-           of stages before cycle zero and after cycle zero.  */
-	sc_until_cycle_zero = CALC_STAGE_COUNT (-1, new_min_cycle, ii);
-	
-	if (SCHED_TIME (u) < 0)
-	  {
-	    stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
-	    SCHED_STAGE (u) = sc_until_cycle_zero - stage;
-	  }
-	else
-	  {
-	    stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
-	    SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
-	  }
+
+	crr_insn->cycle = normalized_time;
+	update_node_sched_params (u, ii, normalized_time, new_min_cycle);
       }
 }
  
@@ -660,6 +806,206 @@ permute_partial_schedule (partial_schedule_ptr ps, rtx last)
 			    PREV_INSN (last));
 }
 
+/* Set bitmaps TMP_FOLLOW and TMP_PRECEDE to MUST_FOLLOW and MUST_PRECEDE
+   respectively only if cycle C falls in the scheduling window boundaries
+   marked by START and END cycles.  STEP is the direction of the window.
+   */
+static inline void
+set_must_precede_follow (sbitmap *tmp_follow, sbitmap must_follow,
+			 sbitmap *tmp_precede, sbitmap must_precede, int c,
+			 int start, int end, int step)
+{
+  *tmp_precede = NULL;
+  *tmp_follow = NULL;
+
+  if (c == start)
+    {
+      if (step == 1)
+	*tmp_precede = must_precede;
+      else			/* step == -1.  */
+	*tmp_follow = must_follow;
+    }
+  if (c == end - step)
+    {
+      if (step == 1)
+	*tmp_follow = must_follow;
+      else			/* step == -1.  */
+	*tmp_precede = must_precede;
+    }
+
+}
+
+/* Return True if the branch can be moved to row ii-1 while
+   normalizing the partial schedule PS to start from cycle zero and thus
+   optimize the SC.  Otherwise return False.  */
+static bool
+optimize_sc (partial_schedule_ptr ps, ddg_ptr g)
+{
+  int amount = PS_MIN_CYCLE (ps);
+  sbitmap sched_nodes = sbitmap_alloc (g->num_nodes);
+  int start, end, step;
+  int ii = ps->ii;
+  bool ok = false;
+  int stage_count, stage_count_curr;
+
+  /* Compare the SC after normalization and SC after bringing the branch
+     to row ii-1.  If they are equal just bail out.  */
+  stage_count = calculate_stage_count (ps, amount);
+  stage_count_curr =
+    calculate_stage_count (ps, SCHED_TIME (g->closing_branch) + 1);
+
+  if (stage_count == stage_count_curr)
+    {
+      if (dump_file)
+	fprintf (dump_file, "SMS SC already optimized.\n");
+
+      ok = false;
+      goto clear;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "SMS Trying to optimize branch location\n");
+      fprintf (dump_file, "SMS partial schedule before trial:\n");
+      print_partial_schedule (ps, dump_file);
+    }
+
+  /* First, normailize the partial schedualing.  */
+  reset_sched_times (ps, amount);
+  rotate_partial_schedule (ps, amount);
+  if (dump_file)
+    {
+      fprintf (dump_file,
+	       "SMS partial schedule after normalization (ii, %d, SC %d):\n",
+	       ii, stage_count);
+      print_partial_schedule (ps, dump_file);
+    }
+
+  if (SMODULO (SCHED_TIME (g->closing_branch), ii) == ii - 1)
+    {
+      ok = true;
+      goto clear;
+    }
+
+  sbitmap_ones (sched_nodes);
+
+  /* Calculate the new placement of the branch.  It should be in row
+     ii-1 and fall into it's scheduling window.  */
+  if (get_sched_window (ps, g->closing_branch, sched_nodes, ii, &start,
+			&step, &end) == 0)
+    {
+      bool success;
+      ps_insn_ptr next_ps_i;
+      int branch_cycle = SCHED_TIME (g->closing_branch);
+      int row = SMODULO (branch_cycle, ps->ii);
+      int num_splits = 0;
+      sbitmap must_precede, must_follow, tmp_precede, tmp_follow;
+      int c;
+
+      if (dump_file)
+	fprintf (dump_file, "\nTrying to schedule node %d "
+		 "INSN = %d  in (%d .. %d) step %d\n",
+		 g->closing_branch->cuid,
+		 (INSN_UID (g->closing_branch->insn)), start, end, step);
+
+      gcc_assert ((step > 0 && start < end) || (step < 0 && start > end));
+      if (step == 1)
+	{
+	  c = start + ii - SMODULO (start, ii) - 1;
+	  gcc_assert (c >= start);
+	  if (c >= end)
+	    {
+	      ok = false;
+	      if (dump_file)
+		fprintf (dump_file,
+			 "SMS failed to schedule branch at cycle: %d\n", c);
+	      goto clear;
+	    }
+	}
+      else
+	{
+	  c = start - SMODULO (start, ii) - 1;
+	  gcc_assert (c <= start);
+
+	  if (c <= end)
+	    {
+	      if (dump_file)
+		fprintf (dump_file,
+			 "SMS failed to schedule branch at cycle: %d\n", c);
+	      ok = false;
+	      goto clear;
+	    }
+	}
+
+      must_precede = sbitmap_alloc (g->num_nodes);
+      must_follow = sbitmap_alloc (g->num_nodes);
+
+      /* Try to schedule the branch is it's new cycle.  */
+      calculate_must_precede_follow (g->closing_branch, start, end,
+				     step, ii, sched_nodes,
+				     must_precede, must_follow);
+
+      set_must_precede_follow (&tmp_follow, must_follow, &tmp_precede,
+			       must_precede, c, start, end, step);
+
+      /* Find the element in the partial schedule related to the closing
+         branch so we can remove it from it's current cycle.  */
+      for (next_ps_i = ps->rows[row];
+	   next_ps_i; next_ps_i = next_ps_i->next_in_row)
+	if (next_ps_i->node->cuid == g->closing_branch->cuid)
+	  break;
+
+      gcc_assert (next_ps_i);
+      gcc_assert (remove_node_from_ps (ps, next_ps_i));
+      success =
+	try_scheduling_node_in_cycle (ps, g->closing_branch,
+				      g->closing_branch->cuid, c,
+				      sched_nodes, &num_splits,
+				      tmp_precede, tmp_follow);
+      gcc_assert (num_splits == 0);
+      if (!success)
+	{
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "SMS failed to schedule branch at cycle: %d, "
+		     "bringing it back to cycle %d\n", c, branch_cycle);
+
+	  /* The branch was failed to be placed in row ii - 1.
+	     Put it back in it's original place in the partial
+	     schedualing.  */
+	  set_must_precede_follow (&tmp_follow, must_follow, &tmp_precede,
+				   must_precede, branch_cycle, start, end,
+				   step);
+	  success =
+	    try_scheduling_node_in_cycle (ps, g->closing_branch,
+					  g->closing_branch->cuid,
+					  branch_cycle, sched_nodes,
+					  &num_splits, tmp_precede,
+					  tmp_follow);
+	  gcc_assert (success && (num_splits == 0));
+	  ok = false;
+	}
+      else
+	{
+	  /* The branch is placed in row ii - 1.  */
+	  if (dump_file)
+	    fprintf (dump_file,
+		     "SMS success in moving branch to cycle %d\n", c);
+
+	  update_node_sched_params (g->closing_branch, ii, c,
+				    PS_MIN_CYCLE (ps));
+	  ok = true;
+	}
+
+      free (must_precede);
+      free (must_follow);
+    }
+
+clear:
+  free (sched_nodes);
+  return ok;
+}
+
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 			   int to_stage, int for_prolog, rtx count_reg)
@@ -671,7 +1017,7 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
       {
 	ddg_node_ptr u_node = ps_ij->node;
-	int j, i_reg_moves;
+	int i_reg_moves;
 	rtx reg_move = NULL_RTX;
 
         /* Do not duplicate any insn which refers to count_reg as it
@@ -686,43 +1032,68 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 
 	if (for_prolog)
 	  {
-	    /* SCHED_STAGE (u_node) >= from_stage == 0.  Generate increasing
-	       number of reg_moves starting with the second occurrence of
-	       u_node, which is generated if its SCHED_STAGE <= to_stage.  */
-	    i_reg_moves = to_stage - SCHED_STAGE (u_node) + 1;
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u_node));
-
-	    /* The reg_moves start from the *first* reg_move backwards.  */
-	    if (i_reg_moves)
+	    int i;
+	    VEC (regmove_info_ptr, heap) *rinfo =
+	      node_sched_params[u_node->cuid].insn_regmove_info;
+	    regmove_info_ptr def;
+
+	    for (i = 0; VEC_iterate (regmove_info_ptr, rinfo, i, def); i++)
 	      {
-		reg_move = SCHED_FIRST_REG_MOVE (u_node);
-		for (j = 1; j < i_reg_moves; j++)
-		  reg_move = PREV_INSN (reg_move);
+		int j;
+
+		/* SCHED_STAGE (u_node) >= from_stage == 0.  Generate increasing
+		   number of reg_moves starting with the second occurrence of
+		   u_node, which is generated if its SCHED_STAGE <= to_stage.  */
+		i_reg_moves = to_stage - SCHED_STAGE (u_node) + 1;
+		i_reg_moves = MAX (i_reg_moves, 0);
+		i_reg_moves = MIN (i_reg_moves, def->nreg_moves);
+
+		/* The reg_moves start from the *first* reg_move backwards.  */
+		if (i_reg_moves)
+		  {
+		    reg_move = def->first_reg_move;
+		    for (j = 1; j < i_reg_moves; j++)
+		      reg_move = PREV_INSN (reg_move);
+		  }
+
+		for (j = 0; j < i_reg_moves;
+		     j++, reg_move = NEXT_INSN (reg_move))
+		  emit_insn (copy_rtx (PATTERN (reg_move)));
 	      }
 	  }
 	else /* It's for the epilog.  */
 	  {
-	    /* SCHED_STAGE (u_node) <= to_stage.  Generate all reg_moves,
-	       starting to decrease one stage after u_node no longer occurs;
-	       that is, generate all reg_moves until
-	       SCHED_STAGE (u_node) == from_stage - 1.  */
-	    i_reg_moves = SCHED_NREG_MOVES (u_node)
-	    	       - (from_stage - SCHED_STAGE (u_node) - 1);
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u_node));
-
-	    /* The reg_moves start from the *last* reg_move forwards.  */
-	    if (i_reg_moves)
+	    int i;
+	    VEC (regmove_info_ptr, heap) *rinfo =
+	      node_sched_params[u_node->cuid].insn_regmove_info;
+	    regmove_info_ptr def;
+
+	    for (i = 0; VEC_iterate (regmove_info_ptr, rinfo, i, def); i++)
 	      {
-		reg_move = SCHED_FIRST_REG_MOVE (u_node);
-		for (j = 1; j < SCHED_NREG_MOVES (u_node); j++)
-		  reg_move = PREV_INSN (reg_move);
+		int j;
+
+		/* SCHED_STAGE (u_node) <= to_stage.  Generate all reg_moves,
+		   starting to decrease one stage after u_node no longer occurs;
+		   that is, generate all reg_moves until
+		   SCHED_STAGE (u_node) == from_stage - 1.  */
+		i_reg_moves = def->nreg_moves
+		  - (from_stage - SCHED_STAGE (u_node) - 1);
+		i_reg_moves = MAX (i_reg_moves, 0);
+		i_reg_moves = MIN (i_reg_moves, def->nreg_moves);
+
+		/* The reg_moves start from the *last* reg_move forwards.  */
+		if (i_reg_moves)
+		  {
+		    reg_move = def->first_reg_move;
+		    for (j = 1; j < def->nreg_moves; j++)
+		      reg_move = PREV_INSN (reg_move);
+		  }
+
+		for (j = 0; j < i_reg_moves;
+		     j++, reg_move = NEXT_INSN (reg_move))
+		  emit_insn (copy_rtx (PATTERN (reg_move)));
 	      }
 	  }
-
-	for (j = 0; j < i_reg_moves; j++, reg_move = NEXT_INSN (reg_move))
-	  emit_insn (copy_rtx (PATTERN (reg_move)));
 	if (SCHED_STAGE (u_node) >= from_stage
 	    && SCHED_STAGE (u_node) <= to_stage)
 	  duplicate_insn_chain (u_node->first_note, u_node->insn);
@@ -911,6 +1282,25 @@ setup_sched_infos (void)
 /* Used to calculate the upper bound of ii.  */
 #define MAXII_FACTOR 2
 
+/* Callback for for_each_inc_dec.  Records in ARG the register DEST
+   which is defined by the auto operation.  */
+static int
+record_inc_dec_insn_info (rtx mem ATTRIBUTE_UNUSED,
+			  rtx op ATTRIBUTE_UNUSED,
+			  rtx dest, rtx src ATTRIBUTE_UNUSED,
+			  rtx srcoff ATTRIBUTE_UNUSED, void *arg)
+{
+  node_sched_params_ptr params = (node_sched_params_ptr) arg;
+  regmove_info_ptr insn_regmove_info =
+    (regmove_info_ptr) xcalloc (1, sizeof (struct regmove_info));
+
+  insn_regmove_info->def = copy_rtx (dest);
+  VEC_safe_push (regmove_info_ptr, heap, params->insn_regmove_info,
+		 insn_regmove_info);
+
+  return -1;
+}
+
 /* Main entry point, perform SMS scheduling on the loops of the function
    that consist of single basic blocks.  */
 static void
@@ -927,6 +1317,7 @@ sms_schedule (void)
   basic_block condition_bb = NULL;
   edge latch_edge;
   gcov_type trip_count = 0;
+  int temp;
 
   loop_optimizer_init (LOOPS_HAVE_PREHEADERS
 		       | LOOPS_HAVE_RECORDED_EXITS);
@@ -936,21 +1327,18 @@ sms_schedule (void)
       return;  /* There are no loops to schedule.  */
     }
 
+  temp = reload_completed;
+  reload_completed = 1;
   /* Initialize issue_rate.  */
   if (targetm.sched.issue_rate)
-    {
-      int temp = reload_completed;
-
-      reload_completed = 1;
-      issue_rate = targetm.sched.issue_rate ();
-      reload_completed = temp;
-    }
+    issue_rate = targetm.sched.issue_rate ();
   else
     issue_rate = 1;
 
   /* Initialize the scheduler.  */
   setup_sched_infos ();
   haifa_sched_init ();
+  reload_completed = temp;
 
   /* Allocate memory to hold the DDG array one entry for each loop.
      We use loop->num as index into this array.  */
@@ -1042,12 +1430,9 @@ sms_schedule (void)
 	continue;
       }
 
-      /* Don't handle BBs with calls or barriers or auto-increment insns 
-	 (to avoid creating invalid reg-moves for the auto-increment insns),
-	 or !single_set with the exception of instructions that include
-	 count_reg---these instructions are part of the control part
-	 that do-loop recognizes.
-         ??? Should handle auto-increment insns.
+      /* Don't handle BBs with calls or barriers, or !single_set insns
+	  with the exception of instructions that include count_reg---these
+	  instructions are part of the control part that do-loop recognizes
          ??? Should handle insns defining subregs.  */
      for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
       {
@@ -1058,7 +1443,6 @@ sms_schedule (void)
             || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
                 && !reg_mentioned_p (count_reg, insn))
-            || (FIND_REG_INC_NOTE (insn, NULL_RTX) != 0)
             || (INSN_P (insn) && (set = single_set (insn))
                 && GET_CODE (SET_DEST (set)) == SUBREG))
         break;
@@ -1072,8 +1456,6 @@ sms_schedule (void)
 		fprintf (dump_file, "SMS loop-with-call\n");
 	      else if (BARRIER_P (insn))
 		fprintf (dump_file, "SMS loop-with-barrier\n");
-              else if (FIND_REG_INC_NOTE (insn, NULL_RTX) != 0)
-                fprintf (dump_file, "SMS reg inc\n");
               else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
                 && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
                 fprintf (dump_file, "SMS loop-with-not-single-set\n");
@@ -1115,6 +1497,7 @@ sms_schedule (void)
       int mii, rec_mii;
       unsigned stage_count = 0;
       HOST_WIDEST_INT loop_count = 0;
+      bool opt_sc_p = false;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1197,12 +1580,30 @@ sms_schedule (void)
 
       ps = sms_schedule_by_order (g, mii, maxii, node_order);
 
-       if (ps)
-       {
-         stage_count = calculate_stage_count (ps);
-         gcc_assert(stage_count >= 1);
-         PS_STAGE_COUNT(ps) = stage_count;
-       }
+      if (ps)
+	{
+	  /* Try to achieve optimized SC by normalizing the partial
+	     schedule (having the cycles start from cycle zero). The branch
+	     location must be placed in row ii-1 in the final scheduling.
+	     If that's not the case after the normalization then try to
+	     move the branch to that row if possible.  */
+	  opt_sc_p = optimize_sc (ps, g);
+	  if (opt_sc_p)
+	    stage_count = calculate_stage_count (ps, 0);
+	  else
+	    {
+	      /* Bring the branch to cycle -1.  */
+	      int amount = SCHED_TIME (g->closing_branch) + 1;
+
+	      if (dump_file)
+		fprintf (dump_file, "SMS schedule branch at cycle -1\n");
+
+	      stage_count = calculate_stage_count (ps, amount);
+	    }
+
+	  gcc_assert (stage_count >= 1);
+	  PS_STAGE_COUNT (ps) = stage_count;
+	}
 
       /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	 1 means that there is no interleaving between iterations thus
@@ -1224,12 +1625,16 @@ sms_schedule (void)
       else
 	{
 	  struct undo_replace_buff_elem *reg_move_replaces;
-          int amount = SCHED_TIME (g->closing_branch) + 1;
+
+          if (!opt_sc_p)
+            {
+	      /* Rotate the partial schedule to have the branch in row ii-1.  */
+              int amount = SCHED_TIME (g->closing_branch) + 1;
+
+              reset_sched_times (ps, amount);
+              rotate_partial_schedule (ps, amount);
+            }
 	  
-	  /* Set the stage boundaries.	The closing_branch was scheduled
-	     and should appear in the last (ii-1) row.  */
-	  reset_sched_times (ps, amount);
-	  rotate_partial_schedule (ps, amount);
 	  set_columns_for_ps (ps);
 
 	  canon_loop (loop);
@@ -1280,7 +1685,7 @@ sms_schedule (void)
 	}
 
       free_partial_schedule (ps);
-      free (node_sched_params);
+      free_node_sched_params (g);
       free (node_order);
       free_ddg (g);
     }
@@ -1381,13 +1786,11 @@ sms_schedule (void)
    scheduling window is empty and zero otherwise.  */
 
 static int
-get_sched_window (partial_schedule_ptr ps, int *nodes_order, int i,
+get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node,
 		  sbitmap sched_nodes, int ii, int *start_p, int *step_p, int *end_p)
 {
   int start, step, end;
   ddg_edge_ptr e;
-  int u = nodes_order [i];
-  ddg_node_ptr u_node = &ps->g->nodes[u];
   sbitmap psp = sbitmap_alloc (ps->g->num_nodes);
   sbitmap pss = sbitmap_alloc (ps->g->num_nodes);
   sbitmap u_node_preds = NODE_PREDECESSORS (u_node);
@@ -1799,7 +2202,7 @@ sms_schedule_by_order (ddg_ptr g, int mii, int maxii, int *nodes_order)
 
 	  /* Try to get non-empty scheduling window.  */
 	 success = 0;
-         if (get_sched_window (ps, nodes_order, i, sched_nodes, ii, &start,
+         if (get_sched_window (ps, u_node, sched_nodes, ii, &start,
                                 &step, &end) == 0)
             {
               if (dump_file)
@@ -1819,21 +2222,9 @@ sms_schedule_by_order (ddg_ptr g, int mii, int maxii, int *nodes_order)
                   sbitmap tmp_precede = NULL;
                   sbitmap tmp_follow = NULL;
 
-                  if (c == start)
-                    {
-                      if (step == 1)
-                        tmp_precede = must_precede;
-                      else      /* step == -1.  */
-                        tmp_follow = must_follow;
-                    }
-                  if (c == end - step)
-                    {
-                      if (step == 1)
-                        tmp_follow = must_follow;
-                      else      /* step == -1.  */
-                        tmp_precede = must_precede;
-                    }
-
+                  set_must_precede_follow (&tmp_follow, must_follow,
+		                           &tmp_precede, must_precede,
+                                           c, start, end, step);
                   success =
                     try_scheduling_node_in_cycle (ps, u_node, u, c,
                                                   sched_nodes,
@@ -2550,8 +2941,13 @@ print_partial_schedule (partial_schedule_ptr ps, FILE *dump)
       fprintf (dump, "\n[ROW %d ]: ", i);
       while (ps_i)
 	{
-	  fprintf (dump, "%d, ",
-		   INSN_UID (ps_i->node->insn));
+          if (JUMP_P (ps_i->node->insn))
+            fprintf (dump, "%d (branch), ",
+                     INSN_UID (ps_i->node->insn));
+          else
+            fprintf (dump, "%d, ",
+                     INSN_UID (ps_i->node->insn));
+
 	  ps_i = ps_i->next_in_row;
 	}
     }
@@ -2893,12 +3289,10 @@ ps_add_node_check_conflicts (partial_schedule_ptr ps, ddg_node_ptr n,
 }
 
 /* Calculate the stage count of the partial schedule PS.  The calculation
-   takes into account the rotation to bring the closing branch to row
-   ii-1.  */
+   takes into account the rotation amount passed in ROTATION_AMOUNT.  */
 int
-calculate_stage_count (partial_schedule_ptr ps)
+calculate_stage_count (partial_schedule_ptr ps, int rotation_amount)
 {
-  int rotation_amount = (SCHED_TIME (ps->g->closing_branch)) + 1;
   int new_min_cycle = PS_MIN_CYCLE (ps) - rotation_amount;
   int new_max_cycle = PS_MAX_CYCLE (ps) - rotation_amount;
   int stage_count = CALC_STAGE_COUNT (-1, new_min_cycle, ps->ii);
--
Roman Zhuykov
zhroma@ispras.ru


* [PATCH 5/9] [SMS] Support new loop pattern
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
  2011-07-21 16:31 ` [PATCH 2/9] [doloop] Correct extracting loop exit condition zhroma
@ 2011-07-21 16:31 ` zhroma
  2011-07-24 11:06   ` Revital1 Eres
                     ` (2 more replies)
  2011-07-21 16:31 ` [PATCH 1/9] [obvious] Minor cleanup zhroma
                   ` (7 subsequent siblings)
  9 siblings, 3 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm, Ayal Zaks

This patch should be applied only after the pending patches by Revital.  It
significantly enhances the existing SMS implementation by adding support for
scheduling loops without a doloop pattern.  Such a loop should meet the
following requirements.
The first three are the same as for loops with a doloop pattern:
- the loop body contains only one basic block;
- the basic block contains no calls or barriers;
- there are no !single_set instructions (with the exception of control part
  instructions).
The next three describe the control part of the newly supported loops:
- the last jump instruction should look like:  pc=(regF!=0)?label:pc, where
  regF is the flag register;
- the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
  is a constant or a register which is not changed inside the loop;
- only one instruction modifies regC inside the loop (others can use regC, but
  not write it), and it should simply adjust it by a constant: regC=regC+step,
  where step is a constant.

This loop pattern matches a lot of loops on ARM and x86_64 platforms.  The
patch provides backward compatibility, which means that doloop loops are
processed as usual.  When a doloop is successfully scheduled by SMS, the number
of iterations of the loop kernel is decreased by the number of stages in the
schedule minus one, while the remaining iterations expand into the prologue and
epilogue.  In the newly supported loops this approach can't be used, because
some instructions can use the count register (regC).  Instead, the final
register value X in the compare instruction regF=COMPARE(regC,X) is changed to
another value Y according to the stage in which this instruction is scheduled
(Y = X - stage * step).  If X is an immediate value and Y can also be an
immediate, we just change the instruction.  In other cases, a new register regY
is used to store the new final value.  The instructions initializing regY are
added to the prologue and the compare instruction is changed to
regF=COMPARE(regC,regY).
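
The two adjustments described above can be written down as a small sketch.
This is only an illustration of the arithmetic, not code from the patch; the
function names and the use of plain long values are assumptions made for the
example.

```c
#include <assert.h>

/* Doloop case: the kernel runs the original iteration count minus
   (stage_count - 1); the remaining iterations are covered by the
   prologue and epilogue.  Hypothetical helper, not part of the patch.  */
static long
doloop_kernel_iterations (long niter, int stage_count)
{
  assert (stage_count >= 1 && niter >= stage_count - 1);
  return niter - (stage_count - 1);
}

/* New loop pattern: the final value used by the compare once it is
   scheduled in STAGE, that is Y = X - stage * step.  Hypothetical
   helper, not part of the patch.  */
static long
adjusted_final_value (long x, int stage, long step)
{
  return x - (long) stage * step;
}
```

For instance, with X = 100, step = 4 and the compare scheduled in stage 2, the
sketch gives Y = 92.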

Testing of this approach revealed two bugs, which did not appear while SMS was
used only for doloop loops.  Both bugs are due to the nature of the flag
register.  On x86_64 it is clobbered by most arithmetic instructions.
The following situation happens when SMS is enabled without register renaming
(-fno-modulo-sched-allow-regmoves).  When the data dependency graph is built,
there is a step where we generate anti-dependencies from the last use of a
register to the first write of this register in the next iteration.  At this
point we should also create such dependencies to all instructions which clobber
the register, to prevent these clobbers from being placed before the last use
in the new schedule.

Here is a model example:

loop {
set1 regR
use1 regR
clobber regR
set2 regR
use2 regR
}

If we create only the use2->set1 anti-dependency (and no use2->clobber), the
following wrong schedule is possible:

prologue {
set1 regR
use1 regR
clobber regR
}
kernel {
set2 regR
clobber regR (instruction from next iteration in terms of old loop kernel)
use2 regR
set1 regR (also from next iteration)
use1 regR (also from next iteration)
}
epilogue {
set2 regR
use2 regR
}

This problem was fixed by collecting a vector of all clobbering instructions
together with the first write and adding anti-dependencies to each of them.
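
A minimal sketch of that collection step, on a toy model rather than on GCC's
df structures: each insn is reduced to how it touches one register, and the
function gathers every clobber plus the first normal write.  The enum, the
function name and the array representation are assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of how one insn touches a single register.  */
enum ref_kind { REF_SET, REF_USE, REF_CLOBBER };

/* Collect into OUT the indices of every clobber plus the first normal
   write in BODY (N insns in program order).  Returns the number of
   indices stored; OUT must have room for N entries.  */
static size_t
collect_anti_dep_targets (const enum ref_kind *body, size_t n, size_t *out)
{
  size_t count = 0;
  int first_write = 1;

  assert (body && out);
  for (size_t i = 0; i < n; i++)
    {
      int is_clobber = body[i] == REF_CLOBBER;

      /* Uses are never targets of the anti-dependency.  */
      if (body[i] == REF_USE)
        continue;
      if (is_clobber || first_write)
        {
          out[count++] = i;
          if (!is_clobber)
            first_write = 0;
        }
    }
  return count;
}
```

For the model loop above (set1, use1, clobber, set2, use2) the sketch selects
set1 and the clobber, so the last use gets an anti-dependency to both.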

The other bug happens only with -fmodulo-sched-allow-regmoves.  Here we
eliminate some anti-dependence edges in the data dependency graph in order to
resolve them later by adding register moves (renaming instructions).  But
in a situation like the example above the compiler gives an ICE because it
can't create a register move when regR is the hardware flag register.  So we
have to know which register(s) cause an anti-dependency in order to decide
whether we can ignore it.  I couldn't find any easy way to gather this
information, so I created my own structures to store it and implemented my own
hooks for the sched_analyze function.  This leads to a more complex
interconnection between ddg.c and modulo-sched.c.
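
The decision can be sketched on bit masks instead of GCC regsets.  This is an
illustration only: the regmask type, the 64-register limit and the function
name are assumptions of the example, and the two extra masks stand in for
"is a hard register" and "has a single def in the loop".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Registers are modeled as bits 0..63 of a mask (an assumption of this
   sketch).  */
typedef uint64_t regmask;

/* Return true if the anti-dependency caused by the registers in
   SRC_USES & DEST_SETS may be dropped and repaired later by a register
   move.  HARD_REGS marks registers that cannot be renamed (for example
   the flag register); SINGLE_DEF_REGS marks registers with only one
   definition in the loop body.  With no common register the loop never
   rejects anything and the function returns true.  */
static bool
can_ignore_anti_dep (regmask src_uses, regmask dest_sets,
                     regmask hard_regs, regmask single_def_regs)
{
  regmask common = src_uses & dest_sets;

  for (unsigned regno = 0; regno < 64; regno++)
    {
      regmask bit = (regmask) 1 << regno;

      if (!(common & bit))
        continue;
      if ((hard_regs & bit) || !(single_def_regs & bit))
        return false;
    }
  return true;
}
```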

One more thing to point out is the number of loop iterations.  When the number
of iterations of a loop is not known at compile time, SMS has to create two
loop versions (original and scheduled), and execute the scheduled one only when
the real number of iterations is bigger than the number of stages.  In the
doloop case the number of iterations simply equals the count register value
before the loop.  So SMS either finds its constant initialization or makes two
loop versions.  In the newly supported loops the number of iterations is more
complex.  It can't even be calculated as
(final_reg_value-start_reg_value)/step because of examples like this:

for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;

This loop has 22 iterations.  So I decided to use the get_simple_loop_desc
function, which returns a structure with loop characteristics; some of its
fields help to find the iteration count:

rtx niter_expr - The number of iterations of the loop;
bool const_iter - True if the loop iterates the constant number of times;
unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
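
The 22-iteration figure for the example loop can be checked by brute force.
The sketch below assumes a 32-bit unsigned int (hence uint32_t) and compares
the real count with the naive division; both function names are made up for
the example.

```c
#include <assert.h>
#include <stdint.h>

/* Count the iterations of
     for (x = start; x != final; x += step)
   using 32-bit wrap-around arithmetic.  Only meant for cases known to
   terminate; STEP must be nonzero.  */
static uint64_t
count_iterations (uint32_t start, uint32_t final, uint32_t step)
{
  uint64_t n = 0;

  assert (step != 0);
  for (uint32_t x = start; x != final; x += step)
    n++;
  return n;
}

/* The naive formula (final - start) / step, for comparison.  */
static uint32_t
naive_iterations (uint32_t start, uint32_t final, uint32_t step)
{
  return (final - start) / step;
}
```

With start 0, final 0x6F80919A and step 0xEDCBA987 the first function returns
22, while the naive formula returns 0 because the step is larger than the
final value and the loop only terminates through wrap-around.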

But these fields can be used only after checking some other fields of the
returned structure:

bool simple_p - True if we are able to say anything about number of iterations
of the loop;
rtx assumptions - Assumptions under that the rest of the information is valid;
rtx noloop_assumptions - Assumptions under which the loop ends before reaching
the latch;
rtx infinite - Condition under which the loop is infinite.

I decided to allow SMS scheduling only when simple_p is true and the other
three fields are NULL_RTX, or when simple_p is true and
flag_unsafe_loop_optimizations is set.  One more exception is the infinite
condition; the next separate patch is an attempt to handle it.
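
As a rough sketch, the rule reads like the predicate below.  The struct is a
hypothetical stand-in for the fields listed above, with a NULL pointer playing
the role of NULL_RTX; it is not the real GCC structure, and the special
handling of the infinite condition from the follow-up patch is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the fields of the loop description named
   above; a NULL pointer plays the role of NULL_RTX.  */
struct loop_desc_model
{
  bool simple_p;
  const void *assumptions;
  const void *noloop_assumptions;
  const void *infinite;
};

/* Return true if the iteration count information may be used for SMS:
   simple_p must hold, and either all three condition fields are empty
   or unsafe loop optimizations were requested.  */
static bool
sms_niter_usable_p (const struct loop_desc_model *desc, bool unsafe_loop_opts)
{
  if (!desc->simple_p)
    return false;
  if (unsafe_loop_opts)
    return true;
  return desc->assumptions == NULL
         && desc->noloop_assumptions == NULL
         && desc->infinite == NULL;
}
```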

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* ddg.c: New VEC.
	(create_ddg_dep_from_intra_loop_link): Use information about register
	uses and sets to determine correctly whether anti-dependency
	can be ignored.
	(add_cross_iteration_register_deps): Store information about
	all clobbers and first write to a register.  Use collected
	information to create anti-dependencies from last use.
	(build_intra_loop_deps): Add call to sms_sched_analyze_init.
	* modulo-sched.c: Include pointer-set.h.
	(old_sched_deps_info): New structure.
	(regset_pair): New type.
	(insn_map): Declare.
	(curr_insn): Ditto.
	(regset_pair_init, destroy_regset_pair, sms_start_insn,
	sms_finish_insn, sms_note_reg_set, sms_note_reg_clobber,
	sms_note_reg_use, extract_from_insn_map,
	sms_sched_analyze_init, sms_create_ddg_finish): New functions.
	(sms_sched_deps_info): Add new callbacks.
	(nondoloop_register_get): New function.
	(const_iteration_count): Rename to ...
	(search_const_init): ...this.  Add new parameter (is_const).  Always
	return register initialization rtx and set is_const to true
	only when it is constant.
	(duplicate_insns_of_cycles): Add new parameter (doloop_p).  Do not
	duplicate instructions with count_reg only when doloop_p is set.
	Update all callers.
	(generate_prolog_epilog): Add new parameters.  Correctly generate loop
	prologue for new loop pattern.
	(sms_schedule): Support new loop pattern.
	* sched-int.h (sms_sched_analyze_init, extract_from_insn_map): Export.
---
 gcc/ddg.c          |  131 +++++++---
 gcc/modulo-sched.c |  763 ++++++++++++++++++++++++++++++++++++++++++++++------
 gcc/sched-int.h    |    2 +
 3 files changed, 780 insertions(+), 116 deletions(-)

diff --git a/gcc/ddg.c b/gcc/ddg.c
index 5d0a401..b7c338d 100644
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -49,6 +49,10 @@ along with GCC; see the file COPYING3.  If not see
 /* A flag indicating that a ddg edge belongs to an SCC or not.  */
 enum edge_flag {NOT_IN_SCC = 0, IN_SCC};
 
+/* A vector of dependencies needed while processing clobbers.  */
+DEF_VEC_P(df_ref);
+DEF_VEC_ALLOC_P(df_ref,heap);
+
 /* Forward declarations.  */
 static void add_backarc_to_ddg (ddg_ptr, ddg_edge_ptr);
 static void add_backarc_to_scc (ddg_scc_ptr, ddg_edge_ptr);
@@ -178,23 +182,45 @@ create_ddg_dep_from_intra_loop_link (ddg_ptr g, ddg_node_ptr src_node,
      whose register has multiple defs in the loop.  */
   if (flag_modulo_sched_allow_regmoves && (t == ANTI_DEP && dt == REG_DEP))
     {
-      rtx set;
+      bool can_delete_dep = true;
+      unsigned regno;
+      reg_set_iterator rsi;
+      regset src_uses, dest_sets, regs;
+
+      /* Register sets from modulo scheduler structures.  */
+      src_uses = extract_from_insn_map (src_node->insn, 0);
+      dest_sets = extract_from_insn_map (dest_node->insn, 1);
+
+      if (!src_uses || !dest_sets)
+	return;
 
-      set = single_set (dest_node->insn);
-      /* TODO: Handle registers that REG_P is not true for them, i.e.
-         subregs and special registers.  */
-      if (set && REG_P (SET_DEST (set)))
+      /* Build regset intersection.  */
+      regs = ALLOC_REG_SET (&reg_obstack);
+      COPY_REG_SET (regs, src_uses);
+      AND_REG_SET (regs, dest_sets);
+
+      EXECUTE_IF_SET_IN_REG_SET (regs, 0, regno, rsi)
         {
-          int regno = REGNO (SET_DEST (set));
           df_ref first_def;
           struct df_rd_bb_info *bb_info = DF_RD_BB_INFO (g->bb);
 
           first_def = df_bb_regno_first_def_find (g->bb, regno);
           gcc_assert (first_def);
 
-          if (bitmap_bit_p (&bb_info->gen, DF_REF_ID (first_def)))
-            return;
+	  /* CC-flags and other hard registers can't be renamed.
+	     Check whether loop kernel has only one def.  */
+          if (HARD_REGISTER_NUM_P (regno)
+	      || !bitmap_bit_p (&bb_info->gen, DF_REF_ID (first_def)))
+	    {
+	      can_delete_dep = false;
+	      break;
+	    }
         }
+
+      FREE_REG_SET (regs);
+
+      if (can_delete_dep)
+	return;
     }
 
    latency = dep_cost (link);
@@ -247,16 +273,20 @@ create_ddg_dep_no_link (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to,
 static void
 add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 {
-  int regno = DF_REF_REGNO (last_def);
+  unsigned int regno = DF_REF_REGNO (last_def);
   struct df_link *r_use;
   int has_use_in_bb_p = false;
-  rtx def_insn = DF_REF_INSN (last_def);
+  rtx insn, def_insn = DF_REF_INSN (last_def);
   ddg_node_ptr last_def_node = get_node_of_insn (g, def_insn);
   ddg_node_ptr use_node;
+  df_ref *def_rec;
+  unsigned int uid;
+  static VEC(df_ref,heap) *all_defs;
 #ifdef ENABLE_CHECKING
   struct df_rd_bb_info *bb_info = DF_RD_BB_INFO (g->bb);
 #endif
   df_ref first_def = df_bb_regno_first_def_find (g->bb, regno);
+  bool first_write = true;
 
   gcc_assert (last_def_node);
   gcc_assert (first_def);
@@ -267,6 +297,31 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 			       DF_REF_ID (first_def)));
 #endif
 
+  all_defs = VEC_alloc (df_ref, heap, 0);
+
+  /* Find all defs which are clobbers and the first normal write def.  */
+  FOR_BB_INSNS (g->bb, insn)
+    {
+      if (!INSN_P (insn))
+        continue;
+      uid = INSN_UID (insn);
+      for (def_rec = DF_INSN_UID_DEFS (uid); *def_rec; def_rec++)
+        {
+          df_ref def = *def_rec;
+          if (DF_REF_REGNO (def) == regno)
+            {
+	      bool is_clobber = DF_REF_FLAGS (def) & (DF_REF_MUST_CLOBBER
+						      | DF_REF_MAY_CLOBBER);
+	      if (is_clobber || first_write)
+	        {
+		  VEC_safe_push (df_ref, heap, all_defs, def);
+		  if (!is_clobber)
+		    first_write = false;
+		}
+	    }
+        }
+    }
+
   /* Create inter-loop true dependences and anti dependences.  */
   for (r_use = DF_REF_CHAIN (last_def); r_use != NULL; r_use = r_use->next)
     {
@@ -290,25 +345,32 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 	}
       else if (!DEBUG_INSN_P (use_insn))
 	{
+	  unsigned int i;
+	  df_ref curr_def;
 	  /* Add anti deps from last_def's uses in the current iteration
-	     to the first def in the next iteration.  We do not add ANTI
-	     dep when there is an intra-loop TRUE dep in the opposite
-	     direction, but use regmoves to fix such disregarded ANTI
-	     deps when broken.	If the first_def reaches the USE then
-	     there is such a dep.  */
-	  ddg_node_ptr first_def_node = get_node_of_insn (g,
-							  DF_REF_INSN (first_def));
-
-	  gcc_assert (first_def_node);
-
-         /* Always create the edge if the use node is a branch in
-            order to prevent the creation of reg-moves.  */
-          if (DF_REF_ID (last_def) != DF_REF_ID (first_def)
-              || !flag_modulo_sched_allow_regmoves
-              || (flag_modulo_sched_allow_regmoves && JUMP_P (use_node->insn)))
-            create_ddg_dep_no_link (g, use_node, first_def_node, ANTI_DEP,
-                                    REG_DEP, 1);
-
+	     to the first def and all clobbers in the next iteration.
+	     We do not add ANTI dep when there is an intra-loop TRUE dep
+	     in the opposite direction, but use regmoves to fix such
+	     disregarded ANTI deps when broken.	If the curr_def reaches
+	     the USE then there is such a dep.  */
+	  FOR_EACH_VEC_ELT (df_ref, all_defs, i, curr_def)
+	    {
+	      if (DF_REF_ID (last_def) != DF_REF_ID (curr_def)
+		  /* Some hard regs (for ex. CC-flags) can't be renamed.  */
+                  || HARD_REGISTER_P (DF_REF_REG (last_def))
+		  || !flag_modulo_sched_allow_regmoves
+		  /* Always create the edge if the use node is a branch in
+		     order to prevent the creation of reg-moves.  */
+		  || (flag_modulo_sched_allow_regmoves
+		      && JUMP_P (use_node->insn)))
+		{
+	          ddg_node_ptr curr_def_node = get_node_of_insn (g,
+						DF_REF_INSN (curr_def));
+		  gcc_assert (curr_def_node);
+		  create_ddg_dep_no_link (g, use_node, curr_def_node,
+					  ANTI_DEP, REG_DEP, 1);
+	        }
+	    }
 	}
     }
   /* Create an inter-loop output dependence between LAST_DEF (which is the
@@ -318,18 +380,15 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
      defs starting with a true dependence to a use which can be in the
      next iteration; followed by an anti dependence of that use to the
      first def (i.e. if there is a use between the two defs.)  */
-  if (!has_use_in_bb_p)
+  if (!has_use_in_bb_p && DF_REF_ID (last_def) != DF_REF_ID (first_def))
     {
-      ddg_node_ptr dest_node;
-
-      if (DF_REF_ID (last_def) == DF_REF_ID (first_def))
-	return;
-
-      dest_node = get_node_of_insn (g, DF_REF_INSN (first_def));
+      ddg_node_ptr dest_node = get_node_of_insn (g, DF_REF_INSN (first_def));
       gcc_assert (dest_node);
       create_ddg_dep_no_link (g, last_def_node, dest_node,
 			      OUTPUT_DEP, REG_DEP, 1);
     }
+
+  VEC_free (df_ref, heap, all_defs);
 }
 /* Build inter-loop dependencies, by looking at DF analysis backwards.  */
 static void
@@ -466,6 +525,8 @@ build_intra_loop_deps (ddg_ptr g)
 
   /* Build the dependence information, using the sched_analyze function.  */
   init_deps_global ();
+  /* Set sms hooks and initialize additional structures.  */
+  sms_sched_analyze_init ();
   init_deps (&tmp_deps, false);
 
   /* Do the intra-block data dependence analysis for the given block.  */
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 948209e..35d2ee4 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "dbgcnt.h"
 #include "df.h"
+#include "pointer-set.h"
 
 #ifdef INSN_SCHEDULING
 
@@ -200,9 +201,10 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    rtx, bool, bool, rtx, HOST_WIDEST_INT,
+                                    bool, HOST_WIDEST_INT, rtx *);
 static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
+				       int, int, int, rtx, bool);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -264,7 +266,162 @@ typedef struct node_sched_params
 } *node_sched_params_ptr;
 
 \f
-/* The following three functions are copied from the current scheduler
+/* The following data structures and callbacks store information about
+   register usage in each insn.  */
+static struct sched_deps_info_def old_sched_deps_info;
+
+/* Two regsets to store which registers the insn reads and writes.  */
+typedef struct regset_pair_def
+{
+  regset uses;
+  regset sets;
+} regset_pair;
+
+/* Allocate memory for regset_pair structure.  */
+static regset_pair*
+regset_pair_init (void)
+{
+  regset_pair *trs = (regset_pair *)xcalloc (1, sizeof (regset_pair));
+  trs->uses = ALLOC_REG_SET (&reg_obstack);
+  trs->sets = ALLOC_REG_SET (&reg_obstack);
+  return trs;
+}
+
+/* Pointer map used to find the regset_pair for an insn.  */
+static struct pointer_map_t *insn_map;
+
+/* Find (or create) regset_pair for INSN in pointer_map.  */
+static regset_pair*
+regset_pair_get (rtx insn)
+{
+  void **slot = pointer_map_contains (insn_map, insn);
+  if (!slot)
+    {
+      slot = pointer_map_insert (insn_map, insn);
+      *slot = regset_pair_init ();
+    }
+  return (regset_pair*)*slot;
+}
+
+/* Callback for pointer_map_traverse to free memory used by regset_pair.  */
+static bool
+destroy_regset_pair (const void *key ATTRIBUTE_UNUSED, void **slot,
+		   void *data ATTRIBUTE_UNUSED)
+{
+  regset_pair *trs = (regset_pair*)*slot;
+  FREE_REG_SET (trs->uses);
+  FREE_REG_SET (trs->sets);
+  free (trs);
+  return true;
+}
+
+/* SMS sched_analyze hooks.  Each of them calls the original hook first.  */
+static rtx curr_insn = NULL_RTX;
+
+static void
+sms_start_insn (rtx insn)
+{
+  old_sched_deps_info.start_insn (insn);
+
+  gcc_assert (insn && !curr_insn);
+  curr_insn = insn;
+  if (dump_file)
+    {
+      fprintf (dump_file, "sms analyze: start insn:\n");
+      print_rtl_single (dump_file, curr_insn);
+    }
+}
+
+static void
+sms_finish_insn (void)
+{
+  old_sched_deps_info.finish_insn ();
+
+  gcc_assert (curr_insn);
+  curr_insn = NULL_RTX;
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: finished insn\n");
+}
+
+static void
+sms_note_reg_set (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_set (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->sets, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg set %d\n", regno);
+}
+
+static void
+sms_note_reg_clobber (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_clobber (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->sets, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg clobber %d\n", regno);
+}
+
+static void
+sms_note_reg_use (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_use (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->uses, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg use %d\n", regno);
+}
+
+/* Extract the saved data about register usage.  Bool SETS is true,
+   when we need the set of written regs.  */
+regset
+extract_from_insn_map (rtx insn, bool sets)
+{
+  void **slot = pointer_map_contains (insn_map, insn);
+  regset_pair *trs;
+  if (!slot)
+    return NULL;
+  trs = (regset_pair*)*slot;
+  return sets ? trs->sets : trs->uses;
+}
+
+/* Setup SMS hooks.  Initialize pointer_map.  */
+void
+sms_sched_analyze_init (void)
+{
+  memcpy (&old_sched_deps_info, sched_deps_info,
+	  sizeof (struct sched_deps_info_def));
+  sched_deps_info->start_insn = sms_start_insn;
+  sched_deps_info->finish_insn = sms_finish_insn;
+  sched_deps_info->note_reg_set = sms_note_reg_set;
+  sched_deps_info->note_reg_clobber = sms_note_reg_clobber;
+  sched_deps_info->note_reg_use = sms_note_reg_use;
+  insn_map = pointer_map_create ();
+}
+
+/* Free pointer_map with freeing internal structures first.  */
+static void
+sms_create_ddg_finish (void)
+{
+  pointer_map_traverse (insn_map, destroy_regset_pair, NULL);
+  pointer_map_destroy (insn_map);
+}
+
+\f
+/* The following two functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
    They are used when initializing the sched_info structure.  */
 static const char *
@@ -287,9 +444,20 @@ static struct common_sched_info_def sms_common_sched_info;
 static struct sched_deps_info_def sms_sched_deps_info =
   {
     compute_jump_reg_dependencies,
-    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-    NULL,
-    0, 0, 0
+    sms_start_insn,
+    sms_finish_insn,
+    NULL, /* start_lhs */
+    NULL, /* finish_lhs */
+    NULL, /* start_rhs */
+    NULL, /* finish_rhs */
+    sms_note_reg_set,
+    sms_note_reg_clobber,
+    sms_note_reg_use,
+    NULL, /* note_mem_dep */
+    NULL, /* note_dep */
+    0, /* use_cselib */
+    0, /* use_deps_list */
+    0 /* generate_spec_deps */
   };
 
 static struct haifa_sched_info sms_sched_info =
@@ -311,6 +479,7 @@ static struct haifa_sched_info sms_sched_info =
   0
 };
 
+\f
 /* Given HEAD and TAIL which are the first and last insns in a loop;
    return the register which controls the loop.  Return zero if it has
    more than one occurrence in the loop besides the control part or the
@@ -364,37 +533,164 @@ doloop_register_get (rtx head ATTRIBUTE_UNUSED, rtx tail ATTRIBUTE_UNUSED)
 #endif
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
+/* Same as previous for loop with always-the-same-step counter.  */
+static rtx
+nondoloop_register_get (rtx head, rtx tail, int cmp_side,
+			rtx *addsub_output, rtx *cmp_output)
+{
+  rtx insn, reg, flagreg, addsub, cmp, end;
+
+  /* Check jump instruction form */
+  insn = single_set (tail);
+  if (insn == NULL_RTX
+      || SET_DEST (insn) != pc_rtx
+      || GET_CODE (SET_SRC (insn)) != IF_THEN_ELSE
+      || GET_CODE (XEXP (SET_SRC (insn), 1)) != LABEL_REF
+      || XEXP (SET_SRC (insn), 2) != pc_rtx)
+    return NULL_RTX;
+
+  /* Check loop exit condition */
+  insn = XEXP (SET_SRC (insn), 0);
+  if (GET_CODE (insn) != NE || XEXP (insn, 1) != const0_rtx)
+    return NULL_RTX;
+
+  /* Flags register */
+  flagreg = XEXP (insn, 0);
+
+  /* Searching comparison instruction */
+  cmp = PREV_INSN (tail);
+  while (cmp != PREV_INSN (head))
+    {
+      if (INSN_P (cmp) && reg_set_p (flagreg, cmp))
+        break;
+      cmp = PREV_INSN (cmp);
+    }
+  if (cmp == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Check comparison */
+  insn = single_set (cmp);
+  if (insn == NULL_RTX
+      || ! rtx_equal_p (flagreg, SET_DEST (insn))
+      || GET_CODE (SET_SRC (insn)) != COMPARE)
+    return NULL_RTX;
+
+  /* Loop register */
+  gcc_assert (0 <= cmp_side && cmp_side <= 1);
+  reg = XEXP (SET_SRC (insn), cmp_side);
+  if (! REG_P (reg))
+    return NULL_RTX;
+
+  /* End value */
+  end = XEXP (SET_SRC (insn), 1 - cmp_side);
+  if (! REG_P (end) && ! CONST_INT_P (end))
+    return NULL_RTX;
+
+  /* Searching register add/sub instruction */
+  addsub = PREV_INSN (cmp);
+  while (addsub != PREV_INSN (head))
+    {
+      if (INSN_P (addsub) && reg_set_p (reg, addsub))
+        break;
+      addsub = PREV_INSN (addsub);
+    }
+  if (addsub == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Checking register change instruction */
+  insn = single_set (addsub);
+  if (insn == NULL_RTX || ! rtx_equal_p (reg, SET_DEST (insn)))
+    return NULL_RTX;
+  insn = SET_SRC (insn);
+  if ((GET_CODE (insn) != PLUS && GET_CODE (insn) != MINUS)
+      || ! rtx_equal_p (reg, XEXP (insn, 0))
+      || ! (CONST_INT_P (XEXP (insn, 1))))
+    return NULL_RTX;
+
+  /* No other REG and END (if reg) modifications allowed */
+  for (insn = head; insn != tail; insn = NEXT_INSN (insn))
+    {
+      if (REG_P(end) && reg_set_p (end, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS end register found ");
+            print_rtl_single (dump_file, end);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+	  return NULL_RTX;
+	}
+      if (insn != addsub && reg_set_p (reg, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS count_reg found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+          return NULL_RTX;
+        }
+    }
+
+  *addsub_output = addsub;
+  *cmp_output = cmp;
+  return reg;
+}
+
+/* Check if REG is set to a constant in the PRE_HEADER block.
+   If possible to find, return the rtx that sets REG.
+   If REG is set to a constant (probably not directly),
+   set IS_CONST to true and VALUE to that constant value.  */
 static rtx
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       HOST_WIDEST_INT * count)
+search_const_init (basic_block pre_header, rtx reg, bool *is_const,
+		   HOST_WIDEST_INT *value)
 {
   rtx insn;
   rtx head, tail;
 
-  if (! pre_header)
-    return NULL_RTX;
+  if (!pre_header)
+    {
+      *is_const = false;
+      return NULL_RTX;
+    }
 
   get_ebb_head_tail (pre_header, pre_header, &head, &tail);
 
   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
     if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
+	rtx_equal_p (reg, SET_DEST (single_set (insn))))
       {
-	rtx pat = single_set (insn);
+	rtx src, pat = single_set (insn);
+	src = SET_SRC (pat);
 
-	if (CONST_INT_P (SET_SRC (pat)))
+	if (CONST_INT_P (src))
 	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
+	    *is_const = true;
+	    *value = INTVAL (src);
 	  }
+	else if (REG_P (src))
+	  { /* Check if previous insn sets SRC = constant.  */
+	    pat = single_set (PREV_INSN (insn));
+	    if (pat != NULL_RTX && rtx_equal_p (src, SET_DEST (pat))
+		&& CONST_INT_P (SET_SRC (pat)))
+	      {
+		*is_const = true;
+		*value = INTVAL (SET_SRC (pat));
+	      }
+	    else
+		*is_const = false;
+	  }
+	else
+	  *is_const = false;
 
-	return NULL_RTX;
+	return insn;
       }
+    else if (reg_set_p (reg, insn))
+      break;
 
+  *is_const = false;
   return NULL_RTX;
 }
 
@@ -1008,7 +1304,8 @@ clear:
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, int for_prolog, rtx count_reg,
+			   bool doloop_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1020,14 +1317,14 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int i_reg_moves;
 	rtx reg_move = NULL_RTX;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
+        /* In doloop case do not duplicate any insn which refers
+	   to count_reg as it belongs to the control part.
            The closing branch is scheduled as well and thus should
            be ignored.
            TODO: This should be done by analyzing the control part of
            the loop.  */
-        if (reg_mentioned_p (count_reg, u_node->insn)
-            || JUMP_P (ps_ij->node->insn))
+        if ((doloop_p && reg_mentioned_p (count_reg, u_node->insn))
+            || JUMP_P (u_node->insn))
           continue;
 
 	if (for_prolog)
@@ -1104,7 +1401,10 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        rtx count_reg, bool doloop_p, bool count_init_isconst,
+			rtx fin_reg, HOST_WIDEST_INT fin_nonconst_adjust,
+			bool create_reg, HOST_WIDEST_INT reg_val,
+			rtx *created_reg)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
@@ -1113,12 +1413,12 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
+  if (doloop_p && !count_init_isconst)
     {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
+      /* In doloop we generate instructions at the beginning of the prolog to
+         adjust the initial value of doloop counter by STAGE_COUNT.
+	 If loop count is constant, this adjustment is done outside this
+         function, simply correcting the source of initialization insn.  */
       rtx sub_reg = NULL_RTX;
 
       sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS,
@@ -1129,8 +1429,40 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
         emit_move_insn (count_reg, sub_reg);
     }
 
+  if (!doloop_p)
+    {
+      /* In non-doloop we generate instructions at the beginning of
+         the prolog to adjust the final value (with this value loop count
+	 register is compared to check whether the loop should stop).  */
+      if (fin_nonconst_adjust != 0)
+	{
+	  /* If the final value is in a register - create another register
+	     to store a shifted value.  */
+	  rtx new_reg, reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (fin_reg));
+	  new_reg = expand_simple_binop (GET_MODE (fin_reg), MINUS, fin_reg,
+					 GEN_INT (fin_nonconst_adjust),
+					 reg, 0, OPTAB_DIRECT);
+	  gcc_assert (REG_P (new_reg));
+	  if (REGNO (new_reg) != REGNO (reg))
+	    emit_move_insn (reg, new_reg);
+	  *created_reg = new_reg;
+	}
+      else if (create_reg)
+	{
+	  /* If old final value is an immediate, and the new one can't be
+	     an immediate, we create a register to store it.  If both values
+	     are immediate the adjustment is done outside this function,
+	     just correcting the constant value in compare instruction.  */
+	  rtx reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (count_reg));
+	  emit_move_insn (reg, GEN_INT (reg_val));
+	  *created_reg = reg;
+	}
+    }
+
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg, doloop_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1142,7 +1474,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg, doloop_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1422,13 +1754,30 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
+      /* Is this a doloop?  */
+      if ((count_reg = doloop_register_get (head, tail)))
+        {
+	  if (dump_file)
+	    fprintf (dump_file, "SMS doloop\n");
+        }
+      else if ((count_reg = nondoloop_register_get (head, tail, 0,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop\n");
+	}
+      else if ((count_reg = nondoloop_register_get (head, tail, 1,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop with transposed cmp\n");
+	}
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS incompatible loop\n");
+	  continue;
+	}
 
       /* Don't handle BBs with calls or barriers, or !single_set insns
 	  with the exception of instructions that include count_reg---these
@@ -1477,7 +1826,7 @@ sms_schedule (void)
 	    fprintf (dump_file, "SMS create_ddg failed\n");
 	  continue;
         }
-
+      sms_create_ddg_finish ();
       g_arr[loop->num] = g;
       if (dump_file)
         fprintf (dump_file, "...OK\n");
@@ -1489,15 +1838,27 @@ sms_schedule (void)
     fprintf (dump_file, "=========================\n\n");
   }
 
+  df_clear_flags (DF_LR_RUN_DCE);
+
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (li, loop, 0)
     {
+      bool doloop_p, count_fin_isconst, count_init_isconst;
+      bool was_immediate = false;
+      bool prolog_create_reg = false;
+      int prolog_fin_nonconst_adjust = 0;
+      bool nonsimple_loop = false;
       rtx head, tail;
-      rtx count_reg, count_init;
-      int mii, rec_mii;
-      unsigned stage_count = 0;
-      HOST_WIDEST_INT loop_count = 0;
       bool opt_sc_p = false;
+      rtx count_reg, count_fin_reg, new_comp_reg = NULL_RTX;
+      rtx count_init_insn, count_fin_init_insn;
+      rtx add, cmp;
+      int mii, rec_mii, cmp_side = -1;
+      int stage_count = 0;
+      HOST_WIDEST_INT count_init_val = 0, count_fin_val = 0;
+      HOST_WIDEST_INT count_step = 0, loop_count = -1;
+      HOST_WIDEST_INT count_fin_newval = 0;
+      struct niter_desc *desc = NULL;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1535,32 +1896,159 @@ sms_schedule (void)
 	               (HOST_WIDEST_INT) profile_info->sum_max);
 	      fprintf (dump_file, "\n");
 	    }
-	  fprintf (dump_file, "SMS doloop\n");
 	  fprintf (dump_file, "SMS built-ddg %d\n", g->num_nodes);
           fprintf (dump_file, "SMS num-loads %d\n", g->num_loads);
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
 
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL_RTX;
-      if ((count_reg = doloop_register_get (head, tail)))
+      /* Extract count register and determine loop type.  */
+      add = NULL_RTX;
+      cmp = NULL_RTX;
+      if ((count_reg = doloop_register_get (head, tail))
+	  || (count_reg = nondoloop_register_get (head, tail, 0, &add, &cmp))
+	  || (count_reg = nondoloop_register_get (head, tail, 1, &add, &cmp)))
 	{
-	  basic_block pre_header;
+	  basic_block pre_header = loop_preheader_edge (loop)->src;
+
+	  doloop_p = (cmp == NULL_RTX);
+	  if (doloop_p)
+	    {
+	      /* Doloop finish parameters are always the same.  */
+	      count_step = -1;
+	      count_fin_isconst = true;
+	      count_fin_val = 0;
+	      count_fin_reg = NULL_RTX;
+	      count_fin_init_insn = NULL_RTX;
+	    }
+	  else
+	    {
+	      /* In other loops we need to determine counter step
+	         and finish parameters.  */
+	      rtx step, end;
+
+	      gcc_assert (single_set (add) && single_set (cmp));
+
+	      /* Extract the step.  */
+	      step = XEXP (SET_SRC (single_set (add)), 1);
+	      gcc_assert (CONST_INT_P (step));
+
+	      if (GET_CODE (SET_SRC (single_set (add))) == MINUS)
+	        count_step = - INTVAL (step);
+	      else if (GET_CODE (SET_SRC (single_set (add))) == PLUS)
+	        count_step = INTVAL (step);
+	      else
+		gcc_unreachable ();
 
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
+	      gcc_assert(count_step != 0);
+
+	      /* Check what operand of compare insn is a counter register.  */
+	      if (count_reg == XEXP (SET_SRC (single_set (cmp)), 0))
+		cmp_side = 0;
+	      else if (count_reg == XEXP (SET_SRC (single_set (cmp)), 1))
+		cmp_side = 1;
+	      else
+		gcc_unreachable ();
+
+	      /* Extract finish border for counter reg.  */
+	      end = XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side);
+
+	      if (CONST_INT_P (end))
+		{
+		  /* Constant finish border.  loop until (reg != const).  */
+		  count_fin_isconst = true;
+		  count_fin_val = INTVAL (end);
+		  count_fin_reg = NULL_RTX;
+		  count_fin_init_insn = NULL_RTX;
+		}
+	      else if (REG_P (end))
+		{
+		  /* Register is a border.  Loop until (reg != fin_reg).  */
+		  count_fin_reg = end;
+		  count_fin_isconst = false;
+		  /* Try to find constant initialization of fin_reg
+		     in preheader.  */
+		  count_fin_init_insn = search_const_init (pre_header,
+							   count_fin_reg,
+							   &count_fin_isconst,
+							   &count_fin_val);
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  /* Try to find a constant initialization of count_reg in preheader.  */
+	  count_init_insn = search_const_init (pre_header,
+					       count_reg,
+					       &count_init_isconst,
+					       &count_init_val);
+	}
+      else /* Loop is incompatible now, but it was OK while analyzing!  */
+	gcc_assert (count_reg);
+
+
+      desc = get_simple_loop_desc (loop);
+      gcc_assert (desc);
+      /* nonsimple_loop means it's impossible to analyze the loop
+         or there are some assumptions to make the analyzis results right
+         or there is a condition of non-infinite number of iterations.
+        We want doloops to be scheduled even if analyzis shows they are
+	 nonsimple (backward compatibility).  */
+      nonsimple_loop = !desc->simple_p;
+      /* We allow scheduling loop with some assumptions or infinite condition
+	 only when unsafe_loop_optimizations flag is enabled.  */
+      if (flag_unsafe_loop_optimizations)
+	 {
+	   desc->infinite = NULL_RTX;
+	   desc->assumptions = NULL_RTX;
+	   desc->noloop_assumptions = NULL_RTX;
+	 }
+      nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
+			|| (desc->noloop_assumptions != NULL_RTX)
+			|| (desc->infinite != NULL_RTX);
+      /* Only doloops can be nonsimple_loops for SMS.  */
+      if (nonsimple_loop && !doloop_p)
+	{
+	  free_ddg (g);
+	  continue;
+	}
+      /* Manually set some description fields in non-simple doloop.  */
+      if (nonsimple_loop)
+	{
+	  gcc_assert(doloop_p);
+	  desc->const_iter = false;
+	  desc->infinite = NULL_RTX;
 	}
-      gcc_assert (count_reg);
 
-      if (dump_file && count_init)
+      if (desc->const_iter)
+	{
+	  gcc_assert (!desc->infinite);
+	  loop_count = desc->niter;
+	  if (dump_file)
+	    fprintf (dump_file, "SMS const loop iterations = "
+		     HOST_WIDEST_INT_PRINT_DEC "\n", loop_count);
+	}
+      if (count_init_isconst && count_fin_isconst)
         {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC,
-		     loop_count);
-          fprintf (dump_file, "\n");
+	  gcc_assert (doloop_p || desc->const_iter);
+	  if (doloop_p)
+	    {
+	      if (nonsimple_loop)
+		{
+	          loop_count = count_init_val;
+		  desc->const_iter = true;
+		}
+              gcc_assert (desc->const_iter && loop_count == count_init_val);
+	    }
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "SMS const-%s ",
+		       doloop_p ? "doloop" : "loop");
+	      fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC " to "
+		       HOST_WIDEST_INT_PRINT_DEC " step "
+		       HOST_WIDEST_INT_PRINT_DEC,
+		       count_init_val, count_fin_val, count_step);
+	      fprintf (dump_file, "\n");
+	    }
         }
 
       node_order = XNEWVEC (int, g->num_nodes);
@@ -1608,8 +2096,8 @@ sms_schedule (void)
       /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	 1 means that there is no interleaving between iterations thus
 	 we let the scheduling passes do the job in this case.  */
-      if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
-	  || (count_init && (loop_count <= stage_count))
+      if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
+	  || (desc->const_iter && (loop_count <= stage_count))
 	  || (flag_branch_probabilities && (trip_count <= stage_count)))
 	{
 	  if (dump_file)
@@ -1625,8 +2113,10 @@ sms_schedule (void)
       else
 	{
 	  struct undo_replace_buff_elem *reg_move_replaces;
+	  int row, cmp_stage = -1;
+	  ps_insn_ptr crr_insn;
 
-          if (!opt_sc_p)
+	  if (!opt_sc_p)
             {
 	      /* Rotate the partial schedule to have the branch in row ii-1.  */
               int amount = SCHED_TIME (g->closing_branch) + 1;
@@ -1647,23 +2137,24 @@ sms_schedule (void)
 	      print_partial_schedule (ps, dump_file);
 	    }
  
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
-	  				     GEN_INT(stage_count));
-	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
-			       * REG_BR_PROB_BASE) / 100;
-
-	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
-			    true);
-	     }
+	  if (!desc->const_iter)
+	    {
+	      /* Loop versioning if the number of iterations is unknown.  */
+	      unsigned prob;
+	      rtx vers_cond;
+	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					  count_reg : desc->niter_expr,
+					  GEN_INT (stage_count));
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "\nLoop versioning condition:\n");
+		  print_rtl_single (dump_file, vers_cond);
+		}
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	      prob = (PROB_SMS_ENOUGH_ITERATIONS * REG_BR_PROB_BASE) / 100;
+	      loop_version (loop, vers_cond, &condition_bb, prob,
+			    prob, REG_BR_PROB_BASE - prob, true);
+	    }
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1678,8 +2169,116 @@ sms_schedule (void)
 	  reg_move_replaces = generate_reg_moves (ps, true);
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, g);
-	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+
+	  if (doloop_p && count_init_isconst)
+	    {
+	      /* Change counter reg initialization constant. In more complex
+	         cases this adjustment is done with adding some insns
+		 to loop prologue in generate_prolog_epilog function.  */
+	      gcc_assert (single_set (count_init_insn) != NULL_RTX);
+	      SET_SRC (single_set (count_init_insn))
+		    = GEN_INT (count_init_val - stage_count + 1);
+	      df_insn_rescan (count_init_insn);
+	    }
+
+	  if (!doloop_p)
+	    {
+	      /* Calculation of the compare insn stage in schedule.  */
+	      for (row = 0; row < ps->ii; row++)
+		for (crr_insn = ps->rows[row];
+		     crr_insn;
+		     crr_insn = crr_insn->next_in_row)
+		  {
+		    gcc_assert (0 <= SCHED_STAGE (crr_insn->node));
+		    gcc_assert (SCHED_STAGE (crr_insn->node) < stage_count);
+		    if (rtx_equal_p (crr_insn->node->insn, cmp))
+		      {
+			gcc_assert (cmp_stage == -1);
+		        cmp_stage = SCHED_STAGE (crr_insn->node);
+		      }
+		  }
+              if (dump_file)
+		fprintf (dump_file, "cmp_stage=%d\n", cmp_stage);
+	      gcc_assert (cmp_stage >= 0);
+	    }
+
+	  /* When compare insn stage is non-zero we are to shift the final
+	     counter reg value (which counter is compared to exit loop).
+	     Final value can be an immediate or can be a register, which
+	     constant initialization we find in preheader.  */
+	  was_immediate = false;
+	  if (!doloop_p && count_fin_isconst && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      /* New finish value.  */
+	      count_fin_newval = count_fin_val - count_step * cmp_stage;
+	      was_immediate = CONST_INT_P (XEXP (SET_SRC (single_set (cmp)),
+							  1 - cmp_side));
+	      if (was_immediate)
+		{
+		  /* Check whether new value also can be an immediate.
+		     For example, on ARM not all values can be encoded as
+		     an immediate, so we have to load it to a register once
+		     before the loop starts.  */
+		  rtx to = GEN_INT (count_fin_newval);
+		  prolog_create_reg = rtx_cost (to, GET_CODE (to), false)
+				    > rtx_cost (GEN_INT(1), CONST_INT, false);
+	        }
+	      else
+		{
+		  /* A value is already in a register and we easily change
+		     initialization instruction in preheader.  */
+		  gcc_assert (count_fin_init_insn);
+		  SET_SRC (single_set (count_fin_init_insn))
+			= GEN_INT (count_fin_newval);
+		  df_insn_rescan (count_fin_init_insn);
+		}
+	    }
+
+	  /* The adjustment of finish register value.
+	     Zero means no adjustment needed or adjustment is done
+	     without additional insn in prologue.  */
+	  if (!doloop_p && !count_fin_isconst)
+	    prolog_fin_nonconst_adjust = count_step * cmp_stage;
+
+	  /* Ready to generate prolog and epilog.  */
+	  generate_prolog_epilog (ps, loop, count_reg, doloop_p,
+			          count_init_isconst, count_fin_reg,
+				  prolog_fin_nonconst_adjust,
+				  prolog_create_reg, count_fin_newval,
+				  &new_comp_reg);
+
+	  /* And only after generating prolog and epilog it is possible
+	     to modify the compare instruction (to prevent copying wrong insn
+	     form to first and last stages).  */
+	  if (!doloop_p && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      if (was_immediate && !prolog_create_reg)
+		{
+		/* Easy case - just modify a constant.  */
+		  gcc_assert (new_comp_reg == NULL_RTX);
+		  XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			= GEN_INT (count_fin_newval);
+		}
+	      else
+		{
+		  if (count_fin_isconst && !was_immediate)
+		    /* Value is in a register and we already changed
+		       initialization instruction in preheader.  */
+		    gcc_assert (new_comp_reg == NULL_RTX);
+		  else
+		    {
+		      /* Otherwise use the register created by
+		         generate_prolog_epilog and initialized in prologue.  */
+		      gcc_assert (new_comp_reg != NULL_RTX);
+		      XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			      = new_comp_reg;
+		    }
+		}
+	      df_insn_rescan (cmp);
+	    }
+	  else gcc_assert (new_comp_reg == NULL_RTX);
 
 	  free_undo_replace_buff (reg_move_replaces);
 	}
@@ -1690,7 +2289,9 @@ sms_schedule (void)
       free_ddg (g);
     }
 
+  df_set_flags (DF_LR_RUN_DCE);
   free (g_arr);
+  iv_analysis_done ();
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index 2eee49d..dce9363 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -1229,6 +1229,8 @@ extern void sched_deps_finish (void);
 extern void haifa_note_reg_set (int);
 extern void haifa_note_reg_clobber (int);
 extern void haifa_note_reg_use (int);
+void sms_sched_analyze_init (void);
+regset extract_from_insn_map (rtx, bool);
 
 extern void maybe_extend_reg_info_p (void);
 
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 6/9] [SMS] Support potentially infinite loop
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (3 preceding siblings ...)
  2011-07-21 16:31 ` [PATCH 3/9] [SMS] Eliminate redundant edges zhroma
@ 2011-07-21 16:31 ` zhroma
  2011-07-21 16:37 ` [PATCH 7/9] New assertion zhroma
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm, Ayal Zaks

This patch should be applied only after the previous patch.  It allows SMS to
schedule loops that have a non-NULL condition under which the loop is infinite.
The infinite condition is an expression list, and every expression in it must
be false before niter_expr can be used.  So, if each expression is a simple
comparison, we can check that they are all false in the loop versioning
condition.  The optimized loop version then runs only when all infinite
conditions are false and the number of iterations is big enough.  The problem
is adding support for such complex conditions, with several conjunctions, to
the loop_version function.  To extract the expression list we have to check
whether we are on RTL, because these functions work on both RTL and GIMPLE.
I understand that this patch is not ready for trunk "as is", but it was created
to make the whole situation clearer.
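
To illustrate the intended run-time decision, here is a minimal standalone
sketch in plain C.  It is not GCC code: the names infinite_check_fn,
step_is_zero, bound_is_skipped and use_pipelined_version are hypothetical,
and the iteration count formula assumes a loop of the form
"for (i = init; i != fin; i += step)" with small operand values.  The
optimized version is chosen only when every infinite check is false and the
iteration count exceeds the stage count.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one comparison from the infinite-condition
   list: returns true when the loop "i = init; i != fin; i += step"
   would never terminate for that reason.  */
typedef bool (*infinite_check_fn) (long init, long fin, long step);

/* A counter that never moves never reaches the bound.  */
static bool
step_is_zero (long init, long fin, long step)
{
  (void) init;
  (void) fin;
  return step == 0;
}

/* The counter steps over the bound without ever being equal to it.  */
static bool
bound_is_skipped (long init, long fin, long step)
{
  return step != 0 && (fin - init) % step != 0;
}

/* Conjunction used as the versioning guard: all infinite checks must be
   false, and the iteration count must exceed STAGE_COUNT.  Returns true
   when the pipelined (optimized) loop version may run.  */
static bool
use_pipelined_version (const infinite_check_fn *checks, size_t n_checks,
		       long init, long fin, long step, long stage_count)
{
  size_t i;

  for (i = 0; i < n_checks; i++)
    if (checks[i] (init, fin, step))
      return false;

  /* Guard the division even if the caller passed no zero-step check.  */
  if (step == 0)
    return false;

  /* A negative quotient means the counter moves away from the bound;
     the comparison below then fails and the original loop is kept.  */
  return (fin - init) / step > stage_count;
}
```

In the patch itself each check becomes a separate condition block in front of
the loop, so the checks are evaluated one after another, much like the early
returns in this sketch.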

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* cfgloopmanip.c (loop_version): Support multiple conditions with
	logical conjunction on RTL level.
	* common.opt (fmodulo-sched-insert-infinite-checks): New flag.
	* modulo-sched.c (sms_schedule): Support potentially infinite loops.
---
 gcc/cfgloopmanip.c |   60 +++++++++++++++++++++++++++++++++++++++++++++++++--
 gcc/common.opt     |    4 +++
 gcc/modulo-sched.c |   58 ++++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 112 insertions(+), 10 deletions(-)

diff --git a/gcc/cfgloopmanip.c b/gcc/cfgloopmanip.c
index 1824421..d20da19 100644
--- a/gcc/cfgloopmanip.c
+++ b/gcc/cfgloopmanip.c
@@ -1556,9 +1556,63 @@ loop_version (struct loop *loop,
      Note down new head as second_head.  */
   second_head = entry->dest;
 
-  /* Split loop entry edge and insert new block with cond expr.  */
-  cond_bb =  lv_adjust_loop_entry_edge (first_head, second_head,
-					entry, cond_expr, then_prob);
+  /* On rtl level support multiple conditions (with logical conjunction
+     between them) organized as an expr_list.  */
+  if (current_ir_type () != IR_GIMPLE
+      && GET_CODE ((rtx)cond_expr) == EXPR_LIST)
+    {
+      edge curr_edge;
+      basic_block bb;
+      int num_cond = 0;
+      rtx curr_cond, backward, other_cond;
+
+      /* Reverse condition list.  */
+      backward = NULL_RTX;
+      other_cond = (rtx)cond_expr;
+      while (other_cond != NULL_RTX)
+	{
+	  backward = gen_rtx_EXPR_LIST (VOIDmode,
+					XEXP (other_cond, 0), backward);
+	  other_cond = XEXP (other_cond, 1);
+	  num_cond++;
+	}
+      gcc_assert (num_cond > 1);
+
+      /* Create new block to prevent many preheaders.  */
+      second_head = split_edge (entry);
+      entry = find_edge (entry->src, second_head);
+
+      /* Starting multi split using the last condition
+         which is first in a reversed list.  */
+      cond_bb = lv_adjust_loop_entry_edge (first_head, second_head, entry,
+					   XEXP (backward, 0), then_prob);
+      other_cond = XEXP (backward, 1);
+      bb = cond_bb;
+      /* Find edge entering the created bb.  */
+      curr_edge = find_edge (entry->src, bb);
+	while (other_cond != NULL_RTX)
+	  {
+	    curr_cond = XEXP (other_cond, 0);
+	    other_cond = XEXP (other_cond, 1);
+	    /* Redirect the edge to make it possible to use
+	       lv_adjust_loop_entry_edge.  */
+	    curr_edge = redirect_edge_and_branch (curr_edge, second_head);
+	    gcc_assert (curr_edge && curr_edge->src == entry->src);
+
+	    /* Split using the next condition in the reversed list.
+	       Set 100% probability in all conditions except last.  */
+	    bb = lv_adjust_loop_entry_edge (bb, second_head, curr_edge,
+					    curr_cond, REG_BR_PROB_BASE);
+	    /* The next edge to process - to new condition bb.  */
+	    curr_edge = find_edge (curr_edge->src, bb);
+	    gcc_assert (curr_edge && curr_edge->src == entry->src);
+	  }
+    }
+  else
+    /* Simply split loop entry edge and insert new block with cond expr.  */
+    cond_bb = lv_adjust_loop_entry_edge (first_head, second_head,
+					 entry, cond_expr, then_prob);
+
   if (condition_bb)
     *condition_bb = cond_bb;
 
diff --git a/gcc/common.opt b/gcc/common.opt
index f127936..c40725c 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1437,6 +1437,10 @@ fmodulo-sched-allow-regmoves
 Common Report Var(flag_modulo_sched_allow_regmoves)
 Perform SMS based modulo scheduling with register moves allowed
 
+fmodulo-sched-insert-infinite-checks
+Common Report Var(flag_modulo_sched_insert_infinite_checks)
+Insert expensive checks for infinite amount of loop iterations while SMS
+
 fmove-loop-invariants
 Common Report Var(flag_move_loop_invariants) Init(1) Optimization
 Move loop invariant computations out of loops
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 35d2ee4..2aeea47 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -1994,8 +1994,12 @@ sms_schedule (void)
         We want doloops to be scheduled even if analyzis shows they are
 	 nonsimple (backward compatibility).  */
       nonsimple_loop = !desc->simple_p;
-      /* We allow scheduling loop with some assumptions or infinite condition
-	 only when unsafe_loop_optimizations flag is enabled.  */
+      /* Infinite number of iterations condition can be checked at runtime
+         to execute the right version of a loop.  But these checks can
+	 slow down the program when the loop is inside an outer one.
+	 So, we add these checks only when an option is enabled, and allow
+	 scheduling loop without adding checks when unsafe_loop_optimizations
+	 flag is enabled.  */
       if (flag_unsafe_loop_optimizations)
 	 {
 	   desc->infinite = NULL_RTX;
@@ -2003,8 +2007,20 @@ sms_schedule (void)
 	   desc->noloop_assumptions = NULL_RTX;
 	 }
       nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
-			|| (desc->noloop_assumptions != NULL_RTX)
-			|| (desc->infinite != NULL_RTX);
+			|| (desc->noloop_assumptions != NULL_RTX);
+      if (!flag_modulo_sched_insert_infinite_checks)
+        nonsimple_loop = nonsimple_loop || (desc->infinite != NULL_RTX);
+      /* Check the form of the infinite condition.  */
+      if (!nonsimple_loop && desc->infinite)
+	{
+	  rtx r = desc->infinite;
+	  while (r && COMPARISON_P (XEXP (r, 0)))
+	    {
+	      gcc_assert (GET_CODE (r) == EXPR_LIST);
+	      r = XEXP (r, 1);
+	    }
+	  nonsimple_loop = (r != NULL);
+	}
       /* Only doloops can be nonsimple_loops for SMS.  */
       if (nonsimple_loop && !doloop_p)
 	{
@@ -2142,9 +2158,37 @@ sms_schedule (void)
 	      /* Loop versioning if the number of iterations is unknown.  */
 	      unsigned prob;
 	      rtx vers_cond;
-	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
-					  count_reg : desc->niter_expr,
-					  GEN_INT (stage_count));
+
+	      if (desc->infinite)
+		{
+		  /* We have to check the number of iterations is non-infinite
+		     before comparing it with the number of stages.  So, each
+		     condition in a desc->infinite expr list (with logical OR)
+		     is reversed and added to a new expr list (with logical AND).  */
+		  rtx r, temp;
+		  vers_cond = copy_rtx (desc->infinite);
+		  gcc_assert (desc->niter_expr);
+		  r = vers_cond;
+		  temp = reversed_condition (XEXP(r, 0));
+		  gcc_assert(temp);
+		  XEXP (r, 0) = temp;
+		  while (XEXP (r, 1))
+		    {
+		      r = XEXP (r, 1);
+		      temp = reversed_condition (XEXP(r, 0));
+		      gcc_assert(temp);
+		      XEXP (r, 0) = temp;
+		    }
+		  XEXP (r, 1) = gen_rtx_EXPR_LIST (VOIDmode,
+					gen_rtx_fmt_ee (GT, VOIDmode,
+						        desc->niter_expr,
+							GEN_INT (stage_count)),
+					NULL_RTX);
+		}
+	      else /* Condition = (number of iters > number of stages).  */
+		vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					    count_reg : desc->niter_expr,
+					    GEN_INT (stage_count));
 	      if (dump_file)
 		{
 		  fprintf (dump_file, "\nLoop versioning condition:\n");
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/9] [obvious] Minor cleanup
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
  2011-07-21 16:31 ` [PATCH 2/9] [doloop] Correct extracting loop exit condition zhroma
  2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
@ 2011-07-21 16:31 ` zhroma
  2011-07-21 16:31 ` [PATCH 3/9] [SMS] Eliminate redundant edges zhroma
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

This obvious patch just removes the unused tree_ssa_loop_version function
declaration from tree-flow.h.  Will be committed in 24 hours if there are no
objections.

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* tree-flow.h (tree_ssa_loop_version): Remove unused
	declaration.
---
 gcc/tree-flow.h |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/gcc/tree-flow.h b/gcc/tree-flow.h
index a864e16..8cbd756 100644
--- a/gcc/tree-flow.h
+++ b/gcc/tree-flow.h
@@ -699,8 +699,6 @@ bool gimple_duplicate_loop_to_header_edge (struct loop *, edge,
 struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *, edge);
 void rename_variables_in_loop (struct loop *);
 void rename_variables_in_bb (basic_block bb);
-struct loop *tree_ssa_loop_version (struct loop *, tree,
-				    basic_block *);
 tree expand_simple_operations (tree);
 void substitute_in_loop_info (struct loop *, tree, tree);
 edge single_dom_exit (struct loop *);
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 3/9] [SMS] Eliminate redundant edges
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (2 preceding siblings ...)
  2011-07-21 16:31 ` [PATCH 1/9] [obvious] Minor cleanup zhroma
@ 2011-07-21 16:31 ` zhroma
  2011-07-24 10:36   ` Revital1 Eres
  2011-07-21 16:31 ` [PATCH 6/9] [SMS] Support potentially infinite loop zhroma
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 30+ messages in thread
From: zhroma @ 2011-07-21 16:31 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm, Ayal Zaks

While building the data dependency graph for a loop, a ddg edge for a pair
of instructions with an inter-loop dependency should be created only if
there is no edge for an intra-loop dependency between these instructions.
Creating both edges sometimes causes the function generate_reg_moves to
create a redundant register renaming, so that some instruction receives
a wrong register value from the previous iteration.  Overall, this gives
a miscompilation.

The add_inter_loop_mem_dep(from,to) function is called only when no
intra-loop "from->to" edge exists, but this function sometimes creates
backward "to->from" edges.  This patch prevents these backward edges from
being redundant: such an edge is created only when there is no
intra-loop one.
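
The rule described above can be illustrated with a small toy model.  This
is only a sketch and not the ddg.c code: the struct and all toy_* names are
hypothetical, and the graph is reduced to one successor bitmask per node.
The backward edge is added only when no "to->from" edge is recorded yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_NODES 32

/* Toy stand-in for a ddg (hypothetical): bit j of succ[i] is set when
   an edge i->j has been recorded.  */
struct toy_ddg
{
  uint32_t succ[TOY_MAX_NODES];
  int n_edges;
};

static bool
toy_has_edge (const struct toy_ddg *g, int from, int to)
{
  assert (from >= 0 && from < TOY_MAX_NODES);
  assert (to >= 0 && to < TOY_MAX_NODES);
  return (g->succ[from] >> to) & 1u;
}

static void
toy_add_edge (struct toy_ddg *g, int from, int to)
{
  assert (from >= 0 && from < TOY_MAX_NODES);
  assert (to >= 0 && to < TOY_MAX_NODES);
  g->succ[from] |= (uint32_t) 1 << to;
  g->n_edges++;
}

/* Add the forward from->to edge, and the backward to->from edge only
   if no to->from edge exists yet.  Returns the number of edges added.  */
static int
toy_add_inter_loop_dep (struct toy_ddg *g, int from, int to)
{
  int before = g->n_edges;

  toy_add_edge (g, from, to);
  if (!toy_has_edge (g, to, from))
    toy_add_edge (g, to, from);
  return g->n_edges - before;
}
```

With an existing 1->0 edge, toy_add_inter_loop_dep (&g, 0, 1) adds only the
forward edge; on an empty graph it adds both.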

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* ddg.c (add_inter_loop_mem_dep): Add new parameter (from_index).
	Use it to check whether backward edge is redundant.
	(build_intra_loop_deps): Update call to add_inter_loop_mem_dep.
---
 gcc/ddg.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/gcc/ddg.c b/gcc/ddg.c
index 2bb2cc1..5d0a401 100644
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -418,7 +418,7 @@ add_intra_loop_mem_dep (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to)
 /* Given two nodes, analyze their RTL insns and add inter-loop mem deps
    to ddg G.  */
 static void
-add_inter_loop_mem_dep (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to)
+add_inter_loop_mem_dep (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to, int from_index)
 {
   if (!insns_may_alias_p (from->insn, to->insn))
     /* Do not create edge if memory references have disjoint alias sets.  */
@@ -442,10 +442,13 @@ add_inter_loop_mem_dep (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to)
       else if (from->cuid != to->cuid)
 	{
 	  create_ddg_dep_no_link (g, from, to, ANTI_DEP, MEM_DEP, 1);
-	  if (DEBUG_INSN_P (from->insn) || DEBUG_INSN_P (to->insn))
-	    create_ddg_dep_no_link (g, to, from, ANTI_DEP, MEM_DEP, 1);
-	  else
-	    create_ddg_dep_no_link (g, to, from, TRUE_DEP, MEM_DEP, 1);
+	  if (! TEST_BIT (to->successors, from_index))
+	    {
+	      if (DEBUG_INSN_P (from->insn) || DEBUG_INSN_P (to->insn))
+		create_ddg_dep_no_link (g, to, from, ANTI_DEP, MEM_DEP, 1);
+	      else
+		create_ddg_dep_no_link (g, to, from, TRUE_DEP, MEM_DEP, 1);
+	    }
 	}
     }
 
@@ -511,7 +514,7 @@ build_intra_loop_deps (ddg_ptr g)
 		  /* Don't bother calculating inter-loop dep if an intra-loop dep
 		     already exists.  */
 	      	  if (! TEST_BIT (dest_node->successors, j))
-		    add_inter_loop_mem_dep (g, dest_node, j_node);
+		    add_inter_loop_mem_dep (g, dest_node, j_node, i);
 		  /* If -fmodulo-sched-allow-regmoves
 		     is set certain anti-dep edges are not created.
 		     It might be that these anti-dep edges are on the
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 7/9] New assertion
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (4 preceding siblings ...)
  2011-07-21 16:31 ` [PATCH 6/9] [SMS] Support potentially infinite loop zhroma
@ 2011-07-21 16:37 ` zhroma
  2011-07-21 16:59 ` [PATCH 8/9] Extend simple_rhs_p zhroma
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

This patch adds an assertion to the function rtl_lv_add_condition_to_bb.
It helped me find mistakes more easily while writing the code that creates
the complex loop versioning condition in the previous patch.

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* cfgrtl.c (rtl_lv_add_condition_to_bb): New assertion.
---
 gcc/cfgrtl.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/gcc/cfgrtl.c b/gcc/cfgrtl.c
index 7eb4362..068da4a 100644
--- a/gcc/cfgrtl.c
+++ b/gcc/cfgrtl.c
@@ -3103,6 +3103,7 @@ rtl_lv_add_condition_to_bb (basic_block first_head ,
   start_sequence ();
   op0 = force_operand (op0, NULL_RTX);
   op1 = force_operand (op1, NULL_RTX);
+  gcc_assert (op0 && op1);
   do_compare_rtx_and_jump (op0, op1, comp, 0,
 			   mode, NULL_RTX, NULL_RTX, label, -1);
   jump = get_last_insn ();
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 8/9] Extend simple_rhs_p
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (5 preceding siblings ...)
  2011-07-21 16:37 ` [PATCH 7/9] New assertion zhroma
@ 2011-07-21 16:59 ` zhroma
  2011-07-21 17:04 ` [PATCH 4/9] Move the SMS pass earlier zhroma
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 16:59 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

This patch makes the simple_rhs_p function accept multiply-and-add
expressions.  When debugging my SMS patches, the following
situation was found.  Imagine a loop with a constant step > 1, where "n"
and "fin" are unknown at compile time.

for (c = n * step + fin; c != fin; c -= step) ...;

c is a pseudo-register and is initialized by one multiply-and-add rtl
instruction before the loop.  In this case, the get_simple_loop_desc
function gives a non-null infinite condition which looks like:
(c - fin)%step == 0
It's obviously always true when you substitute the initialization
expression.  But no substitution is done, because the expression is not
simple_rhs_p.  So, this patch allows the get_simple_loop_desc analysis to be
more precise.  How can this influence other optimizations?
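
The arithmetic behind this can be checked with a small standalone sketch.
The toy_* helpers are hypothetical and are not part of loop-iv.c; the
sketch assumes step > 0, n >= 0 and values small enough that nothing
overflows (with wrapping arithmetic the divisibility argument would need
more care).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: the countdown
     for (c = init; c != fin; c -= step)
   reaches fin exactly when (init - fin) is a multiple of step.
   Assumes step > 0, init >= fin and no overflow.  */
static bool
toy_loop_is_finite (long init, long fin, long step)
{
  assert (step > 0 && init >= fin);
  return (init - fin) % step == 0;
}

/* Run the loop from the example with c initialized as n * step + fin
   and count its iterations.  Assumes n >= 0 and step > 0, so the loop
   is known to terminate.  */
static long
toy_count_iterations (long n, long fin, long step)
{
  long count = 0;

  assert (n >= 0 && step > 0);
  for (long c = n * step + fin; c != fin; c -= step)
    count++;
  return count;
}
```

Substituting init = n * step + fin gives (n * step) % step, which is 0
under these assumptions, so the loop runs exactly n times.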

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* loop-iv.c (simple_rhs_p): Support multiply-and-add operations.
---
 gcc/loop-iv.c |   10 ++++++++--
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/gcc/loop-iv.c b/gcc/loop-iv.c
index 83d2501..dbb7728 100644
--- a/gcc/loop-iv.c
+++ b/gcc/loop-iv.c
@@ -1340,9 +1340,15 @@ simple_rhs_p (rtx rhs)
     case AND:
       op0 = XEXP (rhs, 0);
       op1 = XEXP (rhs, 1);
-      /* Allow reg OP const and reg OP reg.  */
+      /* Allow op0 to be reg or expression like "reg mult const".
+       * Allow op1 to be reg or const.  */
       if (!(REG_P (op0) && !HARD_REGISTER_P (op0))
-	  && !function_invariant_p (op0))
+	  && !function_invariant_p (op0)
+	  && !(((GET_CODE (op0) == ASHIFT)
+		|| (GET_CODE (op0) == ASHIFTRT)
+		|| (GET_CODE (op0) == LSHIFTRT)
+		|| (GET_CODE (op0) == MULT)
+	       ) && simple_rhs_p (op0)))
 	return false;
       if (!(REG_P (op1) && !HARD_REGISTER_P (op1))
 	  && !function_invariant_p (op1))
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 4/9] Move the SMS pass earlier
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (6 preceding siblings ...)
  2011-07-21 16:59 ` [PATCH 8/9] Extend simple_rhs_p zhroma
@ 2011-07-21 17:04 ` zhroma
  2011-07-21 17:09 ` [PATCH 9/9] [ARM] Remove artificial doloop_end pattern zhroma
  2011-09-30 15:37 ` [PATCH 0/9] [RFC] Expand SMS functionality Roman Zhuykov
  9 siblings, 0 replies; 30+ messages in thread
From: zhroma @ 2011-07-21 17:04 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

This patch moves pass_sms before pass_partition_blocks.  There
is no doloop_end pattern on x86_64, which is why the current SMS
implementation can't schedule any loops there.  But regtesting trunk with SMS
on x86_64 shows an ICE on gcc.dg/tree-prof/bb-reorg.c.  The problem is
an unconditional edge created by pass_partition_blocks.  The edge is
not eliminated while entering cfg_layout mode, and this leads to an
assertion failure, because there should be no unconditional jumps in
cfg_layout mode.  Moving the pass seems reasonable because it also
avoids an additional entry to and exit from cfg_layout mode.  I
can't say anything about how correct such a pass movement is from other
points of view.

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* modulo-sched.c (rest_of_handle_sms): Do not enter or exit
	cfg_layout mode.
	* passes.c (init_optimization_passes): Move pass_sms before
	pass_partition_blocks.
---
 gcc/modulo-sched.c |    9 ---------
 gcc/passes.c       |    2 +-
 2 files changed, 1 insertions(+), 10 deletions(-)

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 24d99af..948209e 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -3352,21 +3352,12 @@ static unsigned int
 rest_of_handle_sms (void)
 {
 #ifdef INSN_SCHEDULING
-  basic_block bb;
-
   /* Collect loop information to be used in SMS.  */
-  cfg_layout_initialize (0);
   sms_schedule ();
 
   /* Update the life information, because we add pseudos.  */
   max_regno = max_reg_num ();
-
-  /* Finalize layout changes.  */
-  FOR_EACH_BB (bb)
-    if (bb->next_bb != EXIT_BLOCK_PTR)
-      bb->aux = bb->next_bb;
   free_dominance_info (CDI_DOMINATORS);
-  cfg_layout_finalize ();
 #endif /* INSN_SCHEDULING */
   return 0;
 }
diff --git a/gcc/passes.c b/gcc/passes.c
index 88b7147..5594571 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -1456,6 +1456,7 @@ init_optimization_passes (void)
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
       NEXT_PASS (pass_if_after_combine);
+      NEXT_PASS (pass_sms);
       NEXT_PASS (pass_partition_blocks);
       NEXT_PASS (pass_regmove);
       NEXT_PASS (pass_outof_cfg_layout_mode);
@@ -1465,7 +1466,6 @@ init_optimization_passes (void)
       NEXT_PASS (pass_stack_ptr_mod);
       NEXT_PASS (pass_mode_switching);
       NEXT_PASS (pass_match_asm_constraints);
-      NEXT_PASS (pass_sms);
       NEXT_PASS (pass_sched);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_postreload);
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 9/9] [ARM] Remove artificial doloop_end pattern
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (7 preceding siblings ...)
  2011-07-21 17:04 ` [PATCH 4/9] Move the SMS pass earlier zhroma
@ 2011-07-21 17:09 ` zhroma
  2012-01-04 16:47   ` Richard Earnshaw
  2011-09-30 15:37 ` [PATCH 0/9] [RFC] Expand SMS functionality Roman Zhuykov
  9 siblings, 1 reply; 30+ messages in thread
From: zhroma @ 2011-07-21 17:09 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

This patch eliminates the fake doloop_end pattern for the ARM platform.  The
problem with such a pattern is that it slows down the loop when SMS doesn't
create a good schedule.  So I suppose the fake pattern is no longer needed now
that the new loop forms are supported.

2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
	* config/arm/thumb2.md (doloop_end): Delete.
---
 gcc/config/arm/thumb2.md |   51 ----------------------------------------------
 1 files changed, 0 insertions(+), 51 deletions(-)

diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
index 9a11012..492e765 100644
--- a/gcc/config/arm/thumb2.md
+++ b/gcc/config/arm/thumb2.md
@@ -1101,54 +1101,3 @@
   operands[2] = GEN_INT (32 - INTVAL (operands[2]));
   ")
 
-;; Define the subtract-one-and-jump insns so loop.c
-;; knows what to generate.
-(define_expand "doloop_end"
-  [(use (match_operand 0 "" ""))      ; loop pseudo
-   (use (match_operand 1 "" ""))      ; iterations; zero if unknown
-   (use (match_operand 2 "" ""))      ; max iterations
-   (use (match_operand 3 "" ""))      ; loop level
-   (use (match_operand 4 "" ""))]     ; label
-  "TARGET_32BIT"
-  "
- {
-   /* Currently SMS relies on the do-loop pattern to recognize loops
-      where (1) the control part consists of all insns defining and/or
-      using a certain 'count' register and (2) the loop count can be
-      adjusted by modifying this register prior to the loop.
-      ??? The possible introduction of a new block to initialize the
-      new IV can potentially affect branch optimizations.  */
-   if (optimize > 0 && flag_modulo_sched)
-   {
-     rtx s0;
-     rtx bcomp;
-     rtx loc_ref;
-     rtx cc_reg;
-     rtx insn;
-     rtx cmp;
-
-     /* Only use this on innermost loops.  */
-     if (INTVAL (operands[3]) > 1)
-       FAIL;
-
-     if (GET_MODE (operands[0]) != SImode)
-       FAIL;
-
-     s0 = operands [0];
-     if (TARGET_THUMB2)
-       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-     else
-       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
-
-     cmp = XVECEXP (PATTERN (insn), 0, 0);
-     cc_reg = SET_DEST (cmp);
-     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
-     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [4]);
-     emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
-                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
-                                                        loc_ref, pc_rtx)));
-     DONE;
-   }else
-      FAIL;
- }")
-
-- 
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/9] [doloop] Correct extracting loop exit condition
  2011-07-21 16:31 ` [PATCH 2/9] [doloop] Correct extracting loop exit condition zhroma
@ 2011-07-22 12:22   ` Richard Sandiford
  2011-09-30 15:43     ` Roman Zhuykov
  0 siblings, 1 reply; 30+ messages in thread
From: Richard Sandiford @ 2011-07-22 12:22 UTC (permalink / raw)
  To: zhroma; +Cc: gcc-patches, dm

zhroma@ispras.ru writes:
> This patch fixes the compiler segfault found while regtesting trunk with SMS on
> IA64 platform.  Segfault happens on test gcc.dg/pr45259.c with -fmodulo-sched
> enabled.  The following jump instruction is given as argument for
> doloop_condition_get function:
> (jump_insn 86 85 88 7 (set (pc)
>         (reg/f:DI 403)) 339 {indirect_jump}
>      (expr_list:REG_DEAD (reg/f:DI 403)
>         (nil)))
> The patch adds checking for the form of comparison instruction before
> extracting loop exit condition.
>
> 2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
> 	* loop-doloop.c (doloop_condition_get): Correctly check
> 	the form of comparison instruction.
> ---
>  gcc/loop-doloop.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
> index f8429c4..dfc4a16 100644
> --- a/gcc/loop-doloop.c
> +++ b/gcc/loop-doloop.c
> @@ -153,6 +153,8 @@ doloop_condition_get (rtx doloop_pat)
>        else
>          inc = PATTERN (prev_insn);
>        /* We expect the condition to be of the form (reg != 0)  */
> +      if (GET_CODE (cmp) != SET || GET_CODE (SET_SRC (cmp)) != IF_THEN_ELSE)
> +	return 0;
>        cond = XEXP (SET_SRC (cmp), 0);
>        if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
>          return 0;

I think it'd be better to integrate:

      /* We expect the condition to be of the form (reg != 0)  */
      cond = XEXP (SET_SRC (cmp), 0);
      if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
        return 0;

into:

  /* We expect a GE or NE comparison with 0 or 1.  */
  if ((GET_CODE (condition) != GE
       && GET_CODE (condition) != NE)
      || (XEXP (condition, 1) != const0_rtx
          && XEXP (condition, 1) != const1_rtx))
    return 0;

The next "if" already uses "GET_CODE (pattern) != PARALLEL" as a check
for the second and third cases.  E.g. something like:

  if (GET_CODE (pattern) == PARALLEL)
    {
      /* We expect a GE or NE comparison with 0 or 1.  */
      if ((GET_CODE (condition) != GE
           && GET_CODE (condition) != NE)
          || (XEXP (condition, 1) != const0_rtx
              && XEXP (condition, 1) != const1_rtx))
        return 0;
    }
  else
    {
      /* In the second and third cases, we expect the condition to
         be of the form (reg != 0)  */
      if (GET_CODE (condition) != NE || XEXP (condition, 1) != const0_rtx)
        return 0;
    }

That's pre-approved (independently of the other patches) if it works.

Richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/9] [SMS] Eliminate redundant edges
  2011-07-21 16:31 ` [PATCH 3/9] [SMS] Eliminate redundant edges zhroma
@ 2011-07-24 10:36   ` Revital1 Eres
  0 siblings, 0 replies; 30+ messages in thread
From: Revital1 Eres @ 2011-07-24 10:36 UTC (permalink / raw)
  To: zhroma; +Cc: Ayal Zaks, dm, gcc-patches

Hi Roman,

> While building a data dependency graph for loop a ddg edge for some pair
> of instructions with inter-loop dependency should be created only if
> there is no edge for intra-loop dependency between these instructions.
> Creating both of edges leads sometimes to the fact that function

It would be much appreciated if you could provide an example for the
problematic scenario which leads to the miscompilation.

Thanks,
Revital



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] [SMS] Support new loop pattern
  2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
@ 2011-07-24 11:06   ` Revital1 Eres
  2011-07-26  9:02   ` Richard Sandiford
  2011-09-30 15:54   ` Roman Zhuykov
  2 siblings, 0 replies; 30+ messages in thread
From: Revital1 Eres @ 2011-07-24 11:06 UTC (permalink / raw)
  To: zhroma; +Cc: Ayal Zaks, dm, gcc-patches

Hello Roman,

> This patch should be applied only after pending patches by Revital. This patch
> significantly enhances the existing implementation of the SMS.  Patch adds
> support of scheduling loops without doloop pattern.  The loop should meet the
> following requirements.

Thanks for the patch!
I will try to go through it in more detail soon.
I'm now testing whether the recent bootstrap
failure on an ARM machine (PR49789) is resolved with your patch.
I also plan to see its effect on PowerPC and SPU, which currently
support the doloop pattern.

Thanks,
Revital

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] [SMS] Support new loop pattern
  2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
  2011-07-24 11:06   ` Revital1 Eres
@ 2011-07-26  9:02   ` Richard Sandiford
  2011-07-27 17:36     ` Roman Zhuykov
  2011-09-30 15:54   ` Roman Zhuykov
  2 siblings, 1 reply; 30+ messages in thread
From: Richard Sandiford @ 2011-07-26  9:02 UTC (permalink / raw)
  To: zhroma; +Cc: gcc-patches, dm, Ayal Zaks

zhroma@ispras.ru writes:
> The next three describe the control part of new supported loops.
> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>   flag register;
> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>   is a constant, or maybe a register, which is not changed inside a loop;
> - only one instruction modifies regC inside a loop (other can use regC, but not
>   write), and it should simply adjust it by a constant: regC=regC+step, where
>   step is a constant.

Note that on ARM, the comparison and loop counter addition can happen
as a single parallel:

(insn 29 27 30 3 (parallel [
            (set (reg:CC_NOOV 24 cc)
                (compare:CC_NOOV (plus:SI (reg:SI 142 [ ivtmp.9 ])
                        (const_int -1 [0xffffffffffffffff]))
                    (const_int 0 [0])))
            (set (reg:SI 142 [ ivtmp.9 ])
                (plus:SI (reg:SI 142 [ ivtmp.9 ])
                    (const_int -1 [0xffffffffffffffff])))
        ]) /tmp/bar5.c:9 6 {addsi3_compare0}
     (nil))

I think we'd need to handle that too before getting rid of the
ARM doloop_end pattern.

Richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] [SMS] Support new loop pattern
  2011-07-26  9:02   ` Richard Sandiford
@ 2011-07-27 17:36     ` Roman Zhuykov
  0 siblings, 0 replies; 30+ messages in thread
From: Roman Zhuykov @ 2011-07-27 17:36 UTC (permalink / raw)
  To: zhroma, gcc-patches, dm, Ayal Zaks, richard.sandiford

2011/7/26 Richard Sandiford <richard.sandiford@linaro.org>:
> Note that on ARM, the comparison and loop counter addition can happen
> as a single parallel:

Certainly, I have noticed such "subs" ARM instructions.  IMHO, this pattern
seems to appear rarely in real loops.  For loops without a doloop_end pattern
we have to make the following instruction transformation, as I have noted
already:

"The final register value X in compare instruction regF=COMPARE(regC,X) is
changed to another value Y according to the stage in which this instruction
is scheduled: (Y = X - stage * step)"

With a subs instruction we are unable to do this, because we can't change the
number it compares with.  There seem to be three ways of solving this.

The first way is to check that the counter register is not used by
non-control-flow instructions before running SMS on such loops.  The same
condition is checked in doloop_condition_get.

The second way is to allow SMS to process a loop with a subs instruction, but
once the schedule is computed, to apply it only if X == Y (otherwise the new
schedule would lead to a miscompilation).

The third way is to create a pair of sub and cmp instructions instead of subs
when needed.

> I think we'd need to handle that too before getting rid of the
> ARM doloop_end pattern.

I think all three ways are complicated enough, so I decided to begin by
implementing SMS without support for such loops.
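
As a rough illustration of the adjustment Y = X - stage * step quoted above,
and of the X == Y test from the second way, here is a minimal sketch.  The
toy_* names are hypothetical and do not exist in modulo-sched.c; the sketch
assumes a subs-style instruction always compares against 0 and that no
overflow occurs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: the value Y to compare against when the compare
   instruction is scheduled in the given stage.  */
static long
toy_adjusted_bound (long x, long stage, long step)
{
  assert (stage >= 0);
  return x - stage * step;
}

/* Hypothetical helper: with a combined subtract-and-compare the bound
   is fixed (assumed to be 0 here), so a schedule is usable only when
   the adjustment leaves the bound unchanged, i.e. X == Y.  */
static bool
toy_subs_schedule_ok (long stage, long step)
{
  const long x = 0;

  return toy_adjusted_bound (x, stage, step) == x;
}
```

Under these assumptions a combined instruction is acceptable only when it
lands in stage 0 (or step is 0), which is why the second way may reject a
schedule that has already been computed.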

--
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] [RFC] Expand SMS functionality
  2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
                   ` (8 preceding siblings ...)
  2011-07-21 17:09 ` [PATCH 9/9] [ARM] Remove artificial doloop_end pattern zhroma
@ 2011-09-30 15:37 ` Roman Zhuykov
  2011-10-17 14:34   ` Richard Sandiford
  9 siblings, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2011-09-30 15:37 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm

Ping.
The following RTL patches need reviews:
[PATCH 4/9] Move the SMS pass earlier
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01811.html
[PATCH 7/9] New assertion in rtl_lv_add_condition_to_bb
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01808.html
[PATCH 8/9] Extend simple_rhs_p
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01810.html


2011/7/21  <zhroma@ispras.ru>:
> All the work described in next emails was done while trying to improve SMS
> functionality.  The main idea is to remove the requirement of doloop_end instruction
> pattern.  This allows SMS to work on more platforms, for example x86-64 and
> ARM.
> --
> Roman Zhuykov
> zhroma@ispras.ru
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/9] [doloop] Correct extracting loop exit condition
  2011-07-22 12:22   ` Richard Sandiford
@ 2011-09-30 15:43     ` Roman Zhuykov
  2012-02-10 12:00       ` Andrey Belevantsev
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2011-09-30 15:43 UTC (permalink / raw)
  To: zhroma, gcc-patches, dm, richard.sandiford

[-- Attachment #1: Type: text/plain, Size: 2847 bytes --]

2011/7/22 Richard Sandiford <richard.sandiford@linaro.org>:
> zhroma@ispras.ru writes:
>> This patch fixes the compiler segfault found while regtesting trunk with SMS on
>> IA64 platform.  Segfault happens on test gcc.dg/pr45259.c with -fmodulo-sched
>> enabled.  The following jump instruction is given as argument for
>> doloop_condition_get function:
>> (jump_insn 86 85 88 7 (set (pc)
>>         (reg/f:DI 403)) 339 {indirect_jump}
>>      (expr_list:REG_DEAD (reg/f:DI 403)
>>         (nil)))
>> The patch adds checking for the form of comparison instruction before
>> extracting loop exit condition.
>>
>> 2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
>>       * loop-doloop.c (doloop_condition_get): Correctly check
>>       the form of comparison instruction.
>> ---
>>  gcc/loop-doloop.c |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
>> index f8429c4..dfc4a16 100644
>> --- a/gcc/loop-doloop.c
>> +++ b/gcc/loop-doloop.c
>> @@ -153,6 +153,8 @@ doloop_condition_get (rtx doloop_pat)
>>        else
>>          inc = PATTERN (prev_insn);
>>        /* We expect the condition to be of the form (reg != 0)  */
>> +      if (GET_CODE (cmp) != SET || GET_CODE (SET_SRC (cmp)) != IF_THEN_ELSE)
>> +     return 0;
>>        cond = XEXP (SET_SRC (cmp), 0);
>>        if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
>>          return 0;
>
> I think it'd be better to integrate:
>
>      /* We expect the condition to be of the form (reg != 0)  */
>      cond = XEXP (SET_SRC (cmp), 0);
>      if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
>        return 0;
>
> into:
>
>  /* We expect a GE or NE comparison with 0 or 1.  */
>  if ((GET_CODE (condition) != GE
>       && GET_CODE (condition) != NE)
>      || (XEXP (condition, 1) != const0_rtx
>          && XEXP (condition, 1) != const1_rtx))
>    return 0;
>
> The next "if" already uses "GET_CODE (pattern) != PARALLEL" as a check
> for the second and third cases.  E.g. something like:
>
>  if (GET_CODE (pattern) == PARALLEL)
>    {
>      /* We expect a GE or NE comparison with 0 or 1.  */
>      if ((GET_CODE (condition) != GE
>           && GET_CODE (condition) != NE)
>          || (XEXP (condition, 1) != const0_rtx
>              && XEXP (condition, 1) != const1_rtx))
>        return 0;
>    }
>  else
>    {
>      /* In the second and third cases, we expect the condition to
>         be of the form (reg != 0)  */
>      if (GET_CODE (condition) != NE || XEXP (condition, 1) != const0_rtx)
>        return 0;
>    }
>
> That's pre-approved (independently of the other patches) if it works.
Changed as follows.  I will commit it in a couple of days if there are no objections.

--
Roman Zhuykov
zhroma@ispras.ru

[-- Attachment #2: doloop.patch --]
[-- Type: text/x-patch, Size: 1919 bytes --]

2011-09-30  Roman Zhuykov  <zhroma@ispras.ru>
	* loop-doloop.c (doloop_condition_get): Correctly check
	the form of comparison instruction.
---
 gcc/loop-doloop.c |   28 +++++++++++++++++-----------
 1 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/gcc/loop-doloop.c b/gcc/loop-doloop.c
index a7e264f..4e83649 100644
--- a/gcc/loop-doloop.c
+++ b/gcc/loop-doloop.c
@@ -113,7 +113,6 @@ doloop_condition_get (rtx doloop_pat)
 
   if (GET_CODE (pattern) != PARALLEL)
     {
-      rtx cond;
       rtx prev_insn = prev_nondebug_insn (doloop_pat);
       rtx cmp_arg1, cmp_arg2;
       rtx cmp_orig;
@@ -152,10 +151,6 @@ doloop_condition_get (rtx doloop_pat)
 	}
       else
         inc = PATTERN (prev_insn);
-      /* We expect the condition to be of the form (reg != 0)  */
-      cond = XEXP (SET_SRC (cmp), 0);
-      if (GET_CODE (cond) != NE || XEXP (cond, 1) != const0_rtx)
-        return 0;
     }
   else
     {
@@ -193,12 +188,23 @@ doloop_condition_get (rtx doloop_pat)
   /* Extract loop termination condition.  */
   condition = XEXP (SET_SRC (cmp), 0);
 
-  /* We expect a GE or NE comparison with 0 or 1.  */
-  if ((GET_CODE (condition) != GE
-       && GET_CODE (condition) != NE)
-      || (XEXP (condition, 1) != const0_rtx
-          && XEXP (condition, 1) != const1_rtx))
-    return 0;
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      /* We expect a GE or NE comparison with 0 or 1.  */
+      if ((GET_CODE (condition) != GE
+	   && GET_CODE (condition) != NE)
+	   || (XEXP (condition, 1) != const0_rtx
+	   && XEXP (condition, 1) != const1_rtx))
+        return 0;
+    }
+  else
+    {
+      /* In the second and third cases, we expect the condition
+         to be of the form (reg != 0)  */
+      if (GET_CODE (condition) != NE
+	  || XEXP (condition, 1) != const0_rtx)
+        return 0;
+    }
 
   if ((XEXP (condition, 0) == reg)
       /* For the third case:  */  

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] [SMS] Support new loop pattern
  2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
  2011-07-24 11:06   ` Revital1 Eres
  2011-07-26  9:02   ` Richard Sandiford
@ 2011-09-30 15:54   ` Roman Zhuykov
  2011-10-12  0:48     ` Ayal Zaks
  2 siblings, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2011-09-30 15:54 UTC (permalink / raw)
  To: gcc-patches; +Cc: dm, Ayal Zaks

[-- Attachment #1: Type: text/plain, Size: 8567 bytes --]

2011/7/21  <zhroma@ispras.ru>:
> This patch should be applied only after pending patches by Revital.


Ping.  A new version is attached; it applies to current trunk without
additional patches.
This related patch also needs approval:
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01804.html

> The loop should meet the following requirements.
> First three are the same as for loop with doloop pattern:
> ...
> The next three describe the control part of new supported loops.
> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>  flag register;
> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>  is a constant, or maybe a register, which is not changed inside a loop;
> - only one instruction modifies regC inside a loop (other can use regC, but not
>  write), and it should simply adjust it by a constant: regC=regC+step, where
>  step is a constant.

> When a doloop is successfully scheduled by SMS, the number of
> iterations of the loop kernel should be decreased by the number of stages in
> the schedule minus one, while the other iterations expand into the prologue
> and epilogue.  In the newly supported loops this approach can't be used,
> because some instructions can use the count register (regC).  Instead,
> the final register value X in the compare instruction regF=COMPARE(regC,X)
> is changed to another value Y according to the stage in which this
> instruction is scheduled (Y = X - stage * step).

The main difference from the doloop case is that regC can be used by some
instructions in the loop body.
That's why we cannot simply adjust the initial value of regC, but have
to keep its value correct on each particular iteration.
So we change the comparison instruction accordingly.

An example:
int a[100];
int main()
{
  int i;
  for (i = 85; i > 12; i -= 5)
      a[i] = i * i;
  return a[15]-225;
}
ARM assembler with "-O2 -fno-auto-inc-dec":
        ldr     r0, .L5
        mov     r3, #85
        mov     r2, r0
.L2:
        mul     r1, r3, r3
        sub     r3, r3, #5
        cmp     r3, #10
        str     r1, [r2, #340]
        sub     r2, r2, #20
        bne     .L2
        ldr     r0, [r0, #60]
        sub     r0, r0, #225
        bx      lr
.L5:
        .word   a

The loop body is executed 15 times.
When compiling with SMS, the scheduler finds a schedule with ii=7,
stage_count=3 and the following times:
Stage  Time  Insn
0       5    mul     r1, r3, r3
1      10    sub     r3, r3, #5
1      11    cmp     r3, #10
1      11    str     r1, [r2, #340]
1      13    bne     .L2
2      16    sub     r2, r2, #20

To keep the new schedule correct, the loop body
should be executed 14 times, so we change the compare instruction from
regF=COMPARE(regC,X) to regF=COMPARE(regC,Y), where Y = X - stage * step.
In our example regC is r3, X is 10, step = -5, and the compare instruction
is scheduled in stage 1, so Y = 10 - 1 * (-5) = 15.
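
As a quick sanity check of the formula, here is a small standalone C sketch.
It is not part of the patch: the helper names adjusted_bound and
original_iterations are made up for illustration.  It computes Y for the
example above and counts the iterations of the original source loop.

```c
#include <assert.h>

/* Hypothetical helper, not taken from modulo-sched.c: compute the new
   comparison bound Y = X - stage * step for a compare instruction that
   is scheduled in stage STAGE.  */
static long
adjusted_bound (long x, int stage, long step)
{
  assert (stage >= 0);
  return x - (long) stage * step;
}

/* Count the iterations of the original loop from the example:
   for (i = 85; i > 12; i -= 5).  */
static int
original_iterations (void)
{
  int i, n = 0;

  for (i = 85; i > 12; i -= 5)
    n++;
  return n;
}
```

With X = 10, stage = 1 and step = -5, adjusted_bound returns 15, which is the
constant used by "cmp r3, #15" in the kernel below, and original_iterations
returns 15.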

So, after SMS it looks like:
        ldr     r0, .L5
        mov     r3, #85
        mov     r2, r0
;;prologue
        mul     r1, r3, r3      ;;from stage 0 first iteration
        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
        cmp     r3, #10
        str     r1, [r2, #340]
        mul     r1, r3, r3      ;;from stage 0 second iteration
;;body
.L2:
        sub     r3, r3, #5
        sub     r2, r2, #20
        cmp     r3, #15         ;; new value to compare with is Y=15
        str     r1, [r2, #340]
        mul     r1, r3, r3
        bne     .L2
;;epilogue
        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
        cmp     r3, #10
        str     r1, [r2, #340]
        sub     r2, r2, #20     ;;from stage 2 last iteration

        ldr     r0, [r0, #60]
        sub     r0, r0, #225
        bx      lr
.L5:
        .word   a

Real ARM assembler with SMS (after some optimizations and without dead code):
        mov     r3, #85
        ldr     r0, .L8
        mul     r1, r3, r3
        sub     r3, r3, #5
        mov     r2, r0
        str     r1, [r0, #340]
        mul     r1, r3, r3
.L2:
        sub     r3, r3, #5
        sub     r2, r2, #20
        cmp     r3, #15
        str     r1, [r2, #340]
        mul     r1, r3, r3
        bne     .L2
        str     r1, [r2, #320]
        ldr     r0, [r0, #60]
        sub     r0, r0, #225
        bx      lr
.L8:
        .word   a

>
> Testing of this approach revealed two bugs, which did not appear while SMS
> was used only for doloop loops.  Both bugs are due to the nature of the
> flag register.  On x86_64 it is clobbered by most arithmetic instructions.
> The following situation happens when SMS is enabled without register renaming
> (-fno-modulo-sched-allow-regmoves).  When the data dependency graph is built,
> there is a step where we generate anti-dependencies from the last use of a
> register to the first write of this register in the next iteration.  At this
> point we should also create such dependencies to all instructions which
> clobber the register, to prevent these clobbers from being placed before the
> last use in the new schedule.
>
> Here is a model example:
>
> loop {
> set1 regR
> use1 regR
> clobber regR
> set2 regR
> use2 regR
> }
>
> If we create only the use2->set1 anti-dependency (and no use2->clobber), the
> following wrong schedule is possible:
>
> prologue {
> set1 regR
> use1 regR
> clobber regR
> }
> kernel {
> set2 regR
> clobber regR (instruction from next iteration in terms of old loop kernel)
> use2 regR
> set1 regR (also from next iteration)
> use1 regR (also from next iteration)
> }
> epilogue {
> set2 regR
> use2 regR
> }
>
> This problem was successfully fixed by creating a vector of all clobbering
> instructions together with the first write and adding all needed dependencies.
>
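
The selection rule described above (all clobbers plus the first normal write)
can be modelled outside of GCC with a toy sketch.  Nothing here uses the real
df_ref API: struct toy_def and collect_antidep_targets are invented names,
and insns are represented by plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one definition of a register inside the loop body.  */
struct toy_def
{
  int insn_id;      /* Position of the defining insn in the loop.  */
  bool is_clobber;  /* True for a clobber, false for a normal write.  */
};

/* Collect the targets of the anti-dependencies created from the last
   use: every clobber plus the first normal write.  DEFS must be in
   program order and OUT must have room for N entries.  Returns the
   number of entries stored in OUT.  */
static size_t
collect_antidep_targets (const struct toy_def *defs, size_t n, int *out)
{
  bool first_write = true;
  size_t count = 0, i;

  assert (defs && out);
  for (i = 0; i < n; i++)
    if (defs[i].is_clobber || first_write)
      {
        out[count++] = defs[i].insn_id;
        if (!defs[i].is_clobber)
          first_write = false;
      }
  return count;
}
```

For the model loop (set1, clobber, set2 at positions 0, 2 and 3) this selects
set1 and the clobber, but not set2.
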
> The other bug happens only with -fmodulo-sched-allow-regmoves.  Here we
> eliminate some anti-dependence edges in the data dependency graph in order to
> resolve them later by adding register moves (renaming instructions).  But
> in a situation like the example above the compiler gives an ICE because it
> can't create a register move when regR is the hardware flag register.  So we
> have to know which register(s) cause the anti-dependency in order to decide
> whether we can ignore it.  I couldn't find an easy way to gather this
> information, so I created my own structures to store it and implemented my
> own hooks for the sched_analyze function.  This leads to a more complex
> interconnection between ddg.c and modulo-sched.c.
>
> One more thing to point out is the number of loop iterations.  When the
> number of iterations of a loop is not known at compile time, SMS has to
> create two loop versions (original and scheduled), and execute the scheduled
> one only when the real number of iterations is bigger than the number of
> stages.  In the doloop case the number of iterations simply equals the count
> register value before the loop, so SMS finds its constant initialization or
> makes two loop versions.  In the newly supported loops the number of
> iterations is more complex.  It can't even be calculated as
> (final_reg_value-start_reg_value)/step because of examples like
> this:
>
> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>
> This loop has 22 iterations.  So I decided to use the get_simple_loop_desc
> function, which returns a structure with loop characteristics; some of them
> help to find the number of iterations:
>
> rtx niter_expr - The number of iterations of the loop;
> bool const_iter - True if the loop iterates the constant number of times;
> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>
> But we can use these expressions only after looking through some other fields
> of returned structure:
>
> bool simple_p - True if we are able to say anything about number of iterations
> of the loop;
> rtx assumptions - Assumptions under that the rest of the information is valid;
> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
> the latch;
> rtx infinite - Condition under which the loop is infinite.
>
> I decided to allow SMS scheduling only when simple_p is true and the other
> three fields are NULL_RTX, or when simple_p is true and
> flag_unsafe_loop_optimizations is set.  One more exception is the infinite
> condition; the next separate patch is an attempt to handle it.
>
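
To illustrate the wraparound example quoted above, here is a standalone C
sketch that counts the iterations by brute force.  It is not part of the
patch: count_wrapping_iterations is a made-up name, and uint32_t is used on
the assumption that unsigned int is 32 bits wide on the target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count the iterations of
     for (x = start; x != final; x += step)
   with 32-bit wraparound.  LIMIT guards against loops that never reach
   FINAL; in that case -1 is returned.  */
static long
count_wrapping_iterations (uint32_t start, uint32_t final, uint32_t step,
                           long limit)
{
  uint32_t x = start;
  long n = 0;

  assert (limit >= 0);
  while (x != final)
    {
      if (n == limit)
        return -1;
      x += step;  /* Unsigned arithmetic wraps modulo 2^32.  */
      n++;
    }
  return n;
}
```

Called with start 0x0, final 0x6F80919A and step 0xEDCBA987, it returns 22,
whereas the naive (final - start) / step evaluates to 0 in unsigned 32-bit
arithmetic.
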
--
Roman Zhuykov
zhroma@ispras.ru

[-- Attachment #2: sms.patch --]
[-- Type: text/x-patch, Size: 41269 bytes --]

2011-09-30  Roman Zhuykov  <zhroma@ispras.ru>
	* ddg.c: New VEC.
	(create_ddg_dep_from_intra_loop_link): Use information about register
	uses and sets to determine correctly whether anti-dependency
	can be ignored.
	(add_cross_iteration_register_deps): Store information about
	all clobbers and first write to a register.  Use collected
	information to create anti-dependencies from last use.
	(build_intra_loop_deps): Add call to sms_sched_analyze_init.
	* modulo-sched.c: Include pointer-set.h.
	(old_sched_deps_info): New structure.
	(regset_pair): New type.
	(insn_map): Declare.
	(curr_insn): Ditto.
	(regset_pair_init, destroy_regset_pair, sms_start_insn,
	sms_finish_insn, sms_note_reg_set, sms_note_reg_clobber,
	sms_note_reg_use, extract_from_insn_map,
	sms_sched_analyze_init, sms_create_ddg_finish): New functions.
	(sms_sched_deps_info): Add new callbacks.
	(nondoloop_register_get): New function.
	(const_iteration_count): Rename to ...
	(search_const_init): ...this.  Add new parameter (is_const).  Always
	return register initialization rtx and set is_const to true
	only when it is constant.
	(duplicate_insns_of_cycles): Add new parameter (doloop_p).  Do not
	duplicate instructions with count_reg only when doloop_p is set.
	Update all callers.
	(generate_prolog_epilog): Add new parameters.  Correctly generate loop
	prologue for new loop pattern.
	(sms_schedule): Support new loop pattern.
	* sched-int.h (sms_sched_analyze_init, extract_from_insn_map): Export.
---
 gcc/ddg.c          |  131 +++++++---
 gcc/modulo-sched.c |  763 ++++++++++++++++++++++++++++++++++++++++++++++------
 gcc/sched-int.h    |    2 +
 3 files changed, 780 insertions(+), 116 deletions(-)

diff --git a/gcc/ddg.c b/gcc/ddg.c
index 856fa4e..b80bd2d 100644
--- a/gcc/ddg.c
+++ b/gcc/ddg.c
@@ -49,6 +49,10 @@ along with GCC; see the file COPYING3.  If not see
 /* A flag indicating that a ddg edge belongs to an SCC or not.  */
 enum edge_flag {NOT_IN_SCC = 0, IN_SCC};
 
+/* A vector of dependencies needed while processing clobbers.  */
+DEF_VEC_P(df_ref);
+DEF_VEC_ALLOC_P(df_ref,heap);
+
 /* Forward declarations.  */
 static void add_backarc_to_ddg (ddg_ptr, ddg_edge_ptr);
 static void add_backarc_to_scc (ddg_scc_ptr, ddg_edge_ptr);
@@ -178,23 +182,45 @@ create_ddg_dep_from_intra_loop_link (ddg_ptr g, ddg_node_ptr src_node,
      whose register has multiple defs in the loop.  */
   if (flag_modulo_sched_allow_regmoves && (t == ANTI_DEP && dt == REG_DEP))
     {
-      rtx set;
+      bool can_delete_dep = true;
+      unsigned regno;
+      reg_set_iterator rsi;
+      regset src_uses, dest_sets, regs;
+
+      /* Register sets from modulo scheduler structures.  */
+      src_uses = extract_from_insn_map (src_node->insn, 0);
+      dest_sets = extract_from_insn_map (dest_node->insn, 1);
+
+      if (!src_uses || !dest_sets)
+	return;
 
-      set = single_set (dest_node->insn);
-      /* TODO: Handle registers that REG_P is not true for them, i.e.
-         subregs and special registers.  */
-      if (set && REG_P (SET_DEST (set)))
+      /* Build regset intersection.  */
+      regs = ALLOC_REG_SET (&reg_obstack);
+      COPY_REG_SET (regs, src_uses);
+      AND_REG_SET (regs, dest_sets);
+
+      EXECUTE_IF_SET_IN_REG_SET (regs, 0, regno, rsi)
         {
-          int regno = REGNO (SET_DEST (set));
           df_ref first_def;
           struct df_rd_bb_info *bb_info = DF_RD_BB_INFO (g->bb);
 
           first_def = df_bb_regno_first_def_find (g->bb, regno);
           gcc_assert (first_def);
 
-          if (bitmap_bit_p (&bb_info->gen, DF_REF_ID (first_def)))
-            return;
+	  /* CC-flags and other hard registers can't be renamed.
+	     Check whether loop kernel has only one def.  */
+          if (HARD_REGISTER_NUM_P (regno)
+	      || !bitmap_bit_p (&bb_info->gen, DF_REF_ID (first_def)))
+	    {
+	      can_delete_dep = false;
+	      break;
+	    }
         }
+
+      FREE_REG_SET (regs);
+
+      if (can_delete_dep)
+	return;
     }
 
    latency = dep_cost (link);
@@ -247,16 +273,20 @@ create_ddg_dep_no_link (ddg_ptr g, ddg_node_ptr from, ddg_node_ptr to,
 static void
 add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 {
-  int regno = DF_REF_REGNO (last_def);
+  unsigned int regno = DF_REF_REGNO (last_def);
   struct df_link *r_use;
   int has_use_in_bb_p = false;
-  rtx def_insn = DF_REF_INSN (last_def);
+  rtx insn, def_insn = DF_REF_INSN (last_def);
   ddg_node_ptr last_def_node = get_node_of_insn (g, def_insn);
   ddg_node_ptr use_node;
+  df_ref *def_rec;
+  unsigned int uid;
+  static VEC(df_ref,heap) *all_defs;
 #ifdef ENABLE_CHECKING
   struct df_rd_bb_info *bb_info = DF_RD_BB_INFO (g->bb);
 #endif
   df_ref first_def = df_bb_regno_first_def_find (g->bb, regno);
+  bool first_write = true;
 
   gcc_assert (last_def_node);
   gcc_assert (first_def);
@@ -267,6 +297,31 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 			       DF_REF_ID (first_def)));
 #endif
 
+  all_defs = VEC_alloc (df_ref, heap, 0);
+
+  /* Find all defs which are clobbers and the first normal write def.  */
+  FOR_BB_INSNS (g->bb, insn)
+    {
+      if (!INSN_P (insn))
+        continue;
+      uid = INSN_UID (insn);
+      for (def_rec = DF_INSN_UID_DEFS (uid); *def_rec; def_rec++)
+        {
+          df_ref def = *def_rec;
+          if (DF_REF_REGNO (def) == regno)
+            {
+	      bool is_clobber = DF_REF_FLAGS (def) & (DF_REF_MUST_CLOBBER
+						      | DF_REF_MAY_CLOBBER);
+	      if (is_clobber || first_write)
+	        {
+		  VEC_safe_push (df_ref, heap, all_defs, def);
+		  if (!is_clobber)
+		    first_write = false;
+		}
+	    }
+        }
+    }
+
   /* Create inter-loop true dependences and anti dependences.  */
   for (r_use = DF_REF_CHAIN (last_def); r_use != NULL; r_use = r_use->next)
     {
@@ -290,25 +345,32 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
 	}
       else if (!DEBUG_INSN_P (use_insn))
 	{
+	  unsigned int i;
+	  df_ref curr_def;
 	  /* Add anti deps from last_def's uses in the current iteration
-	     to the first def in the next iteration.  We do not add ANTI
-	     dep when there is an intra-loop TRUE dep in the opposite
-	     direction, but use regmoves to fix such disregarded ANTI
-	     deps when broken.	If the first_def reaches the USE then
-	     there is such a dep.  */
-	  ddg_node_ptr first_def_node = get_node_of_insn (g,
-							  DF_REF_INSN (first_def));
-
-	  gcc_assert (first_def_node);
-
-         /* Always create the edge if the use node is a branch in
-            order to prevent the creation of reg-moves.  */
-          if (DF_REF_ID (last_def) != DF_REF_ID (first_def)
-              || !flag_modulo_sched_allow_regmoves
-	      || JUMP_P (use_node->insn))
-            create_ddg_dep_no_link (g, use_node, first_def_node, ANTI_DEP,
-                                    REG_DEP, 1);
-
+	     to the first def and all clobbers in the next iteration.
+	     We do not add ANTI dep when there is an intra-loop TRUE dep
+	     in the opposite direction, but use regmoves to fix such
+	     disregarded ANTI deps when broken.	If the curr_def reaches
+	     the USE then there is such a dep.  */
+	  FOR_EACH_VEC_ELT (df_ref, all_defs, i, curr_def)
+	    {
+	      if (DF_REF_ID (last_def) != DF_REF_ID (curr_def)
+		  /* Some hard regs (for ex. CC-flags) can't be renamed.  */
+                  || HARD_REGISTER_P (DF_REF_REG (last_def))
+		  || !flag_modulo_sched_allow_regmoves
+		  /* Always create the edge if the use node is a branch in
+		     order to prevent the creation of reg-moves.  */
+		  || (flag_modulo_sched_allow_regmoves
+		      && JUMP_P (use_node->insn)))
+		{
+	          ddg_node_ptr curr_def_node = get_node_of_insn (g,
+						DF_REF_INSN (curr_def));
+		  gcc_assert (curr_def_node);
+		  create_ddg_dep_no_link (g, use_node, curr_def_node,
+					  ANTI_DEP, REG_DEP, 1);
+	        }
+	    }
 	}
     }
   /* Create an inter-loop output dependence between LAST_DEF (which is the
@@ -318,18 +380,15 @@ add_cross_iteration_register_deps (ddg_ptr g, df_ref last_def)
      defs starting with a true dependence to a use which can be in the
      next iteration; followed by an anti dependence of that use to the
      first def (i.e. if there is a use between the two defs.)  */
-  if (!has_use_in_bb_p)
+  if (!has_use_in_bb_p && DF_REF_ID (last_def) != DF_REF_ID (first_def))
     {
-      ddg_node_ptr dest_node;
-
-      if (DF_REF_ID (last_def) == DF_REF_ID (first_def))
-	return;
-
-      dest_node = get_node_of_insn (g, DF_REF_INSN (first_def));
+      ddg_node_ptr dest_node = get_node_of_insn (g, DF_REF_INSN (first_def));
       gcc_assert (dest_node);
       create_ddg_dep_no_link (g, last_def_node, dest_node,
 			      OUTPUT_DEP, REG_DEP, 1);
     }
+
+  VEC_free (df_ref, heap, all_defs);
 }
 /* Build inter-loop dependencies, by looking at DF analysis backwards.  */
 static void
@@ -463,6 +522,8 @@ build_intra_loop_deps (ddg_ptr g)
 
   /* Build the dependence information, using the sched_analyze function.  */
   init_deps_global ();
+  /* Set sms hooks and initialize additional structures.  */
+  sms_sched_analyze_init ();
   init_deps (&tmp_deps, false);
 
   /* Do the intra-block data dependence analysis for the given block.  */
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 28be942..3fed5c2 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -48,6 +48,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "dbgcnt.h"
 #include "df.h"
+#include "pointer-set.h"
 
 #ifdef INSN_SCHEDULING
 
@@ -200,9 +201,10 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    rtx, bool, bool, rtx, HOST_WIDEST_INT,
+                                    bool, HOST_WIDEST_INT, rtx *);
 static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
+				       int, int, int, rtx, bool);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -248,7 +250,162 @@ typedef struct node_sched_params
 } *node_sched_params_ptr;
 
 \f
-/* The following three functions are copied from the current scheduler
+/* Next data structures and callbacks help to store information about
+   using registers in insn.  */
+static struct sched_deps_info_def old_sched_deps_info;
+
+/* Two regsets to store which registers the insn reads and writes.  */
+typedef struct regset_pair_def
+{
+  regset uses;
+  regset sets;
+} regset_pair;
+
+/* Allocate memory for regset_pair structure.  */
+static regset_pair*
+regset_pair_init (void)
+{
+  regset_pair *trs = (regset_pair *)xcalloc (1, sizeof (regset_pair));
+  trs->uses = ALLOC_REG_SET (&reg_obstack);
+  trs->sets = ALLOC_REG_SET (&reg_obstack);
+  return trs;
+}
+
+/* Pointer map is used to find a reg sets for insn.  */
+static struct pointer_map_t *insn_map;
+
+/* Find (or create) regset_pair for INSN in pointer_map.  */
+static regset_pair*
+regset_pair_get (rtx insn)
+{
+  void **slot = pointer_map_contains (insn_map, insn);
+  if (!slot)
+    {
+      slot = pointer_map_insert (insn_map, insn);
+      *slot = regset_pair_init ();
+    }
+  return (regset_pair*)*slot;
+}
+
+/* Callback for pointer_map_traverse to free memory used by regset_pair.  */
+static bool
+destroy_regset_pair (const void *key ATTRIBUTE_UNUSED, void **slot,
+		   void *data ATTRIBUTE_UNUSED)
+{
+  regset_pair *trs = (regset_pair*)*slot;
+  FREE_REG_SET (trs->uses);
+  FREE_REG_SET (trs->sets);
+  free (trs);
+  return true;
+}
+
+/* SMS sched_analyze hooks.  Every of them calls original hook first.  */
+static rtx curr_insn = NULL_RTX;
+
+static void
+sms_start_insn (rtx insn)
+{
+  old_sched_deps_info.start_insn (insn);
+
+  gcc_assert (insn && !curr_insn);
+  curr_insn = insn;
+  if (dump_file)
+    {
+      fprintf (dump_file, "sms analyze: start insn:\n");
+      print_rtl_single (dump_file, curr_insn);
+    }
+}
+
+static void
+sms_finish_insn (void)
+{
+  old_sched_deps_info.finish_insn ();
+
+  gcc_assert (curr_insn);
+  curr_insn = NULL_RTX;
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: finished insn\n");
+}
+
+static void
+sms_note_reg_set (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_set (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->sets, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg set %d\n", regno);
+}
+
+static void
+sms_note_reg_clobber (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_clobber (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->sets, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg clobber %d\n", regno);
+}
+
+static void
+sms_note_reg_use (int regno)
+{
+  regset_pair *trs;
+  old_sched_deps_info.note_reg_use (regno);
+
+  gcc_assert (curr_insn);
+  trs = regset_pair_get (curr_insn);
+  SET_REGNO_REG_SET (trs->uses, regno);
+
+  if (dump_file)
+    fprintf (dump_file, "sms analyze: reg use %d\n", regno);
+}
+
+/* Extract the saved data about register usage.  Bool SETS is true,
+   when we need the set of written regs.  */
+regset
+extract_from_insn_map (rtx insn, bool sets)
+{
+  void **slot = pointer_map_contains (insn_map, insn);
+  regset_pair *trs;
+  if (!slot)
+    return NULL;
+  trs = (regset_pair*)*slot;
+  return sets ? trs->sets : trs->uses;
+}
+
+/* Setup SMS hooks.  Initialize pointer_map.  */
+void
+sms_sched_analyze_init (void)
+{
+  memcpy (&old_sched_deps_info, sched_deps_info,
+	  sizeof (struct sched_deps_info_def));
+  sched_deps_info->start_insn = sms_start_insn;
+  sched_deps_info->finish_insn = sms_finish_insn;
+  sched_deps_info->note_reg_set = sms_note_reg_set;
+  sched_deps_info->note_reg_clobber = sms_note_reg_clobber;
+  sched_deps_info->note_reg_use = sms_note_reg_use;
+  insn_map = pointer_map_create ();
+}
+
+/* Free pointer_map with freeing internal structures first.  */
+static void
+sms_create_ddg_finish (void)
+{
+  pointer_map_traverse (insn_map, destroy_regset_pair, NULL);
+  pointer_map_destroy (insn_map);
+}
+
+\f
+/* The following two functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
    They are used when initializing the sched_info structure.  */
 static const char *
@@ -271,9 +428,20 @@ static struct common_sched_info_def sms_common_sched_info;
 static struct sched_deps_info_def sms_sched_deps_info =
   {
     compute_jump_reg_dependencies,
-    NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
-    NULL,
-    0, 0, 0
+    sms_start_insn,
+    sms_finish_insn,
+    NULL, /* start_lhs */
+    NULL, /* finish_lhs */
+    NULL, /* start_rhs */
+    NULL, /* finish_rhs */
+    sms_note_reg_set,
+    sms_note_reg_clobber,
+    sms_note_reg_use,
+    NULL, /* note_mem_dep */
+    NULL, /* note_dep */
+    0, /* use_cselib */
+    0, /* use_deps_list */
+    0 /* generate_spec_deps */
   };
 
 static struct haifa_sched_info sms_sched_info =
@@ -295,6 +463,7 @@ static struct haifa_sched_info sms_sched_info =
   0
 };
 
+\f
 /* Given HEAD and TAIL which are the first and last insns in a loop;
    return the register which controls the loop.  Return zero if it has
    more than one occurrence in the loop besides the control part or the
@@ -348,37 +517,164 @@ doloop_register_get (rtx head ATTRIBUTE_UNUSED, rtx tail ATTRIBUTE_UNUSED)
 #endif
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
+/* Same as previous for loop with always-the-same-step counter.  */
+static rtx
+nondoloop_register_get (rtx head, rtx tail, int cmp_side,
+			rtx *addsub_output, rtx *cmp_output)
+{
+  rtx insn, reg, flagreg, addsub, cmp, end;
+
+  /* Check jump instruction form */
+  insn = single_set (tail);
+  if (insn == NULL_RTX
+      || SET_DEST (insn) != pc_rtx
+      || GET_CODE (SET_SRC (insn)) != IF_THEN_ELSE
+      || GET_CODE (XEXP (SET_SRC (insn), 1)) != LABEL_REF
+      || XEXP (SET_SRC (insn), 2) != pc_rtx)
+    return NULL_RTX;
+
+  /* Check loop exit condition */
+  insn = XEXP (SET_SRC (insn), 0);
+  if (GET_CODE (insn) != NE || XEXP (insn, 1) != const0_rtx)
+    return NULL_RTX;
+
+  /* Flags register */
+  flagreg = XEXP (insn, 0);
+
+  /* Searching comparison instruction */
+  cmp = PREV_INSN (tail);
+  while (cmp != PREV_INSN (head))
+    {
+      if (INSN_P (cmp) && reg_set_p (flagreg, cmp))
+        break;
+      cmp = PREV_INSN (cmp);
+    }
+  if (cmp == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Check comparison */
+  insn = single_set (cmp);
+  if (insn == NULL_RTX
+      || ! rtx_equal_p (flagreg, SET_DEST (insn))
+      || GET_CODE (SET_SRC (insn)) != COMPARE)
+    return NULL_RTX;
+
+  /* Loop register */
+  gcc_assert (0 <= cmp_side && cmp_side <= 1);
+  reg = XEXP (SET_SRC (insn), cmp_side);
+  if (! REG_P (reg))
+    return NULL_RTX;
+
+  /* End value */
+  end = XEXP (SET_SRC (insn), 1 - cmp_side);
+  if (! REG_P (end) && ! CONST_INT_P (end))
+    return NULL_RTX;
+
+  /* Searching register add\sub instruction */
+  addsub = PREV_INSN (cmp);
+  while (addsub != PREV_INSN (head))
+    {
+      if (INSN_P (addsub) && reg_set_p (reg, addsub))
+        break;
+      addsub = PREV_INSN (addsub);
+    }
+  if (addsub == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Checking register change instruction */
+  insn = single_set (addsub);
+  if (insn == NULL_RTX || ! rtx_equal_p (reg, SET_DEST (insn)))
+    return NULL_RTX;
+  insn = SET_SRC (insn);
+  if ((GET_CODE (insn) != PLUS && GET_CODE (insn) != MINUS)
+      || ! rtx_equal_p (reg, XEXP (insn, 0))
+      || ! (CONST_INT_P (XEXP (insn, 1))))
+    return NULL_RTX;
+
+  /* No other REG and END (if reg) modifications allowed */
+  for (insn = head; insn != tail; insn = NEXT_INSN (insn))
+    {
+      if (REG_P(end) && reg_set_p (end, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS end register found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+	  return NULL_RTX;
+	}
+      if (insn != addsub && reg_set_p (reg, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS count_reg found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+          return NULL_RTX;
+        }
+    }
+
+  *addsub_output = addsub;
+  *cmp_output = cmp;
+  return reg;
+}
+
+/* Check if REG is set to a constant in the PRE_HEADER block.
+   If possible to find, return the rtx that sets REG.
+   If REG is set to a constant (probably not directly),
+   set IS_CONST to true and VALUE to that constant value.  */
 static rtx
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       HOST_WIDEST_INT * count)
+search_const_init (basic_block pre_header, rtx reg, bool *is_const,
+		   HOST_WIDEST_INT *value)
 {
   rtx insn;
   rtx head, tail;
 
-  if (! pre_header)
-    return NULL_RTX;
+  if (!pre_header)
+    {
+      *is_const = false;
+      return NULL_RTX;
+    }
 
   get_ebb_head_tail (pre_header, pre_header, &head, &tail);
 
   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
     if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
+	rtx_equal_p (reg, SET_DEST (single_set (insn))))
       {
-	rtx pat = single_set (insn);
+	rtx src, pat = single_set (insn);
+	src = SET_SRC (pat);
 
-	if (CONST_INT_P (SET_SRC (pat)))
+	if (CONST_INT_P (src))
 	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
+	    *is_const = true;
+	    *value = INTVAL (src);
+	  }
+	else if (REG_P (src))
+	  { /* Check if previous insn sets SRC = constant.  */
+	    pat = single_set (PREV_INSN (insn));
+	    if (pat != NULL_RTX && rtx_equal_p (src, SET_DEST (pat))
+		&& CONST_INT_P (SET_SRC (pat)))
+	      {
+		*is_const = true;
+		*value = INTVAL (SET_SRC (pat));
+	      }
+	    else
+		*is_const = false;
 	  }
+	else
+	  *is_const = false;
 
-	return NULL_RTX;
+	return insn;
       }
+    else if (reg_set_p (reg, insn))
+      break;
 
+  *is_const = false;
   return NULL_RTX;
 }
 
@@ -886,7 +1182,8 @@ clear:
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, int for_prolog, rtx count_reg,
+			   bool doloop_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -898,14 +1195,14 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int j, i_reg_moves;
 	rtx reg_move = NULL_RTX;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
+        /* In doloop case do not duplicate any insn which refers
+	   to count_reg as it belongs to the control part.
            The closing branch is scheduled as well and thus should
            be ignored.
            TODO: This should be done by analyzing the control part of
            the loop.  */
-        if (reg_mentioned_p (count_reg, u_node->insn)
-            || JUMP_P (ps_ij->node->insn))
+        if ((doloop_p && reg_mentioned_p (count_reg, u_node->insn))
+            || JUMP_P (u_node->insn))
           continue;
 
 	if (for_prolog)
@@ -957,7 +1254,10 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        rtx count_reg, bool doloop_p, bool count_init_isconst,
+			rtx fin_reg, HOST_WIDEST_INT fin_nonconst_adjust,
+			bool create_reg, HOST_WIDEST_INT reg_val,
+			rtx *created_reg)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
@@ -966,12 +1266,12 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
+  if (doloop_p && !count_init_isconst)
     {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
+      /* In doloop we generate instructions at the beginning of the prolog to
+         adjust the initial value of doloop counter by STAGE_COUNT.
+	 If loop count is constant, this adjustment is done outside this
+         function, simply correcting the source of initialization insn.  */
       rtx sub_reg = NULL_RTX;
 
       sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS,
@@ -982,8 +1282,40 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
         emit_move_insn (count_reg, sub_reg);
     }
 
+  if (!doloop_p)
+    {
+      /* In non-doloop we generate instructions at the beginning of
+         the prolog to adjust the final value (with this value loop count
+	 register is compared to check whether the loop should stop).  */
+      if (fin_nonconst_adjust != 0)
+	{
+	  /* If the final value is in a register - create another register
+	     to store a shifted value.  */
+	  rtx new_reg, reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (fin_reg));
+	  new_reg = expand_simple_binop (GET_MODE (fin_reg), MINUS, fin_reg,
+					 GEN_INT (fin_nonconst_adjust),
+					 reg, 0, OPTAB_DIRECT);
+	  gcc_assert (REG_P (new_reg));
+	  if (REGNO (new_reg) != REGNO (reg))
+	    emit_move_insn (reg, new_reg);
+	  *created_reg = new_reg;
+	}
+      else if (create_reg)
+	{
+	  /* If old final value is an immediate, and the new one can't be
+	     an immediate, we create a register to store it.  If both values
+	     are immediate the adjustment is done outside this function,
+	     just correcting the constant value in compare instruction.  */
+	  rtx reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (count_reg));
+	  emit_move_insn (reg, GEN_INT (reg_val));
+	  *created_reg = reg;
+	}
+    }
+
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg, doloop_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -995,7 +1327,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg, doloop_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1258,13 +1590,30 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
+      /* Is this a doloop?  */
+      if ((count_reg = doloop_register_get (head, tail)))
+        {
+	  if (dump_file)
+	    fprintf (dump_file, "SMS doloop\n");
+        }
+      else if ((count_reg = nondoloop_register_get (head, tail, 0,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop\n");
+	}
+      else if ((count_reg = nondoloop_register_get (head, tail, 1,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop with transposed cmp\n");
+	}
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS incompatible loop\n");
+	  continue;
+	}
 
       /* Don't handle BBs with calls or barriers or auto-increment insns 
 	 (to avoid creating invalid reg-moves for the auto-increment insns),
@@ -1319,7 +1668,7 @@ sms_schedule (void)
 	    fprintf (dump_file, "SMS create_ddg failed\n");
 	  continue;
         }
-
+      sms_create_ddg_finish ();
       g_arr[loop->num] = g;
       if (dump_file)
         fprintf (dump_file, "...OK\n");
@@ -1331,15 +1680,27 @@ sms_schedule (void)
     fprintf (dump_file, "=========================\n\n");
   }
 
+  df_clear_flags (DF_LR_RUN_DCE);
+
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (li, loop, 0)
     {
+      bool doloop_p, count_fin_isconst, count_init_isconst;
+      bool was_immediate = false;
+      bool prolog_create_reg = false;
+      int prolog_fin_nonconst_adjust = 0;
+      bool nonsimple_loop = false;
       rtx head, tail;
-      rtx count_reg, count_init;
-      int mii, rec_mii;
-      unsigned stage_count = 0;
-      HOST_WIDEST_INT loop_count = 0;
       bool opt_sc_p = false;
+      rtx count_reg, count_fin_reg, new_comp_reg = NULL_RTX;
+      rtx count_init_insn, count_fin_init_insn;
+      rtx add, cmp;
+      int mii, rec_mii, cmp_side = -1;
+      int stage_count = 0;
+      HOST_WIDEST_INT count_init_val = 0, count_fin_val = 0;
+      HOST_WIDEST_INT count_step = 0, loop_count = -1;
+      HOST_WIDEST_INT count_fin_newval = 0;
+      struct niter_desc *desc = NULL;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1377,32 +1738,159 @@ sms_schedule (void)
 	               (HOST_WIDEST_INT) profile_info->sum_max);
 	      fprintf (dump_file, "\n");
 	    }
-	  fprintf (dump_file, "SMS doloop\n");
 	  fprintf (dump_file, "SMS built-ddg %d\n", g->num_nodes);
           fprintf (dump_file, "SMS num-loads %d\n", g->num_loads);
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
 
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL_RTX;
-      if ((count_reg = doloop_register_get (head, tail)))
+      /* Extract count register and determine loop type.  */
+      add = NULL_RTX;
+      cmp = NULL_RTX;
+      if ((count_reg = doloop_register_get (head, tail))
+	  || (count_reg = nondoloop_register_get (head, tail, 0, &add, &cmp))
+	  || (count_reg = nondoloop_register_get (head, tail, 1, &add, &cmp)))
 	{
-	  basic_block pre_header;
+	  basic_block pre_header = loop_preheader_edge (loop)->src;
+
+	  doloop_p = (cmp == NULL_RTX);
+	  if (doloop_p)
+	    {
+	      /* Doloop finish parameters are always the same.  */
+	      count_step = -1;
+	      count_fin_isconst = true;
+	      count_fin_val = 0;
+	      count_fin_reg = NULL_RTX;
+	      count_fin_init_insn = NULL_RTX;
+	    }
+	  else
+	    {
+	      /* In other loop we need to determine counter step
+	         and finish parameters.  */
+	      rtx step, end;
+
+	      gcc_assert (single_set (add) && single_set (cmp));
+
+	      /* Extract the step.  */
+	      step = XEXP (SET_SRC (single_set (add)), 1);
+	      gcc_assert (CONST_INT_P (step));
+
+	      if (GET_CODE (SET_SRC (single_set (add))) == MINUS)
+	        count_step = - INTVAL (step);
+	      else if (GET_CODE (SET_SRC (single_set (add))) == PLUS)
+	        count_step = INTVAL (step);
+	      else
+		gcc_unreachable ();
+
+	      gcc_assert(count_step != 0);
+
+	      /* Check what operand of compare insn is a counter register.  */
+	      if (count_reg == XEXP (SET_SRC (single_set (cmp)), 0))
+		cmp_side = 0;
+	      else if (count_reg == XEXP (SET_SRC (single_set (cmp)), 1))
+		cmp_side = 1;
+	      else
+		gcc_unreachable ();
 
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
+	      /* Extract finish border for counter reg.  */
+	      end = XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side);
+
+	      if (CONST_INT_P (end))
+		{
+		  /* Constant finish border.  loop until (reg != const).  */
+		  count_fin_isconst = true;
+		  count_fin_val = INTVAL (end);
+		  count_fin_reg = NULL_RTX;
+		  count_fin_init_insn = NULL_RTX;
+		}
+	      else if (REG_P (end))
+		{
+		  /* Register is a border.  Loop until (reg != fin_reg).  */
+		  count_fin_reg = end;
+		  count_fin_isconst = false;
+		  /* Try to find constant initialization of fin_reg
+		   * in preheader.  */
+		  count_fin_init_insn = search_const_init (pre_header,
+							   count_fin_reg,
+							   &count_fin_isconst,
+							   &count_fin_val);
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  /* Try to find a constant initialization of count_reg in preheader.  */
+	  count_init_insn = search_const_init (pre_header,
+					       count_reg,
+					       &count_init_isconst,
+					       &count_init_val);
+	}
+      else /* Loop is incompatible now, but it was OK while analyzing!  */
+	gcc_assert (count_reg);
+
+
+      desc = get_simple_loop_desc (loop);
+      gcc_assert (desc);
+      /* nonsimple_loop means it's impossible to analyze the loop
+         or there are some assumptions to make the analysis results right
+         or there is a condition of non-infinite number of iterations.
+        We want doloops to be scheduled even if analysis shows they are
+	 nonsimple (backward compatibility).  */
+      nonsimple_loop = !desc->simple_p;
+      /* We allow scheduling loop with some assumptions or infinite condition
+	 only when unsafe_loop_optimizations flag is enabled.  */
+      if (flag_unsafe_loop_optimizations)
+	 {
+	   desc->infinite = NULL_RTX;
+	   desc->assumptions = NULL_RTX;
+	   desc->noloop_assumptions = NULL_RTX;
+	 }
+      nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
+			|| (desc->noloop_assumptions != NULL_RTX)
+			|| (desc->infinite != NULL_RTX);
+      /* Only doloops can be nonsimple_loops for SMS.  */
+      if (nonsimple_loop && !doloop_p)
+	{
+	  free_ddg (g);
+	  continue;
+	}
+      /* Manually set some description fields in non-simple doloop.  */
+      if (nonsimple_loop)
+	{
+	  gcc_assert(doloop_p);
+	  desc->const_iter = false;
+	  desc->infinite = NULL_RTX;
 	}
-      gcc_assert (count_reg);
 
-      if (dump_file && count_init)
+      if (desc->const_iter)
+	{
+	  gcc_assert (!desc->infinite);
+	  loop_count = desc->niter;
+	  if (dump_file)
+	    fprintf (dump_file, "SMS const loop iterations = "
+		     HOST_WIDEST_INT_PRINT_DEC "\n", loop_count);
+	}
+      if (count_init_isconst && count_fin_isconst)
         {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC,
-		     loop_count);
-          fprintf (dump_file, "\n");
+	  gcc_assert (doloop_p || desc->const_iter);
+	  if (doloop_p)
+	    {
+	      if (nonsimple_loop)
+		{
+	          loop_count = count_init_val;
+		  desc->const_iter = true;
+		}
+              gcc_assert (desc->const_iter && loop_count == count_init_val);
+	    }
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "SMS const-%s ",
+		       doloop_p ? "doloop" : "loop");
+	      fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC " to "
+		       HOST_WIDEST_INT_PRINT_DEC " step "
+		       HOST_WIDEST_INT_PRINT_DEC,
+		       count_init_val, count_fin_val, count_step);
+	      fprintf (dump_file, "\n");
+	    }
         }
 
       node_order = XNEWVEC (int, g->num_nodes);
@@ -1450,8 +1938,8 @@ sms_schedule (void)
       /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	 1 means that there is no interleaving between iterations thus
 	 we let the scheduling passes do the job in this case.  */
-      if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
-	  || (count_init && (loop_count <= stage_count))
+      if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
+	  || (desc->const_iter && (loop_count <= stage_count))
 	  || (flag_branch_probabilities && (trip_count <= stage_count)))
 	{
 	  if (dump_file)
@@ -1467,8 +1955,10 @@ sms_schedule (void)
       else
 	{
 	  struct undo_replace_buff_elem *reg_move_replaces;
+	  int row, cmp_stage = -1;
+	  ps_insn_ptr crr_insn;
 
-          if (!opt_sc_p)
+	  if (!opt_sc_p)
             {
 	      /* Rotate the partial schedule to have the branch in row ii-1.  */
               int amount = SCHED_TIME (g->closing_branch) - (ps->ii - 1);
@@ -1489,23 +1979,24 @@ sms_schedule (void)
 	      print_partial_schedule (ps, dump_file);
 	    }
  
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
-	  				     GEN_INT(stage_count));
-	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
-			       * REG_BR_PROB_BASE) / 100;
-
-	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
-			    true);
-	     }
+	  if (!desc->const_iter)
+	    {
+	      /* Loop versioning if the number of iterations is unknown.  */
+	      unsigned prob;
+	      rtx vers_cond;
+	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					  count_reg : desc->niter_expr,
+					  GEN_INT (stage_count));
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "\nLoop versioning condition:\n");
+		  print_rtl_single (dump_file, vers_cond);
+		}
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	      prob = (PROB_SMS_ENOUGH_ITERATIONS * REG_BR_PROB_BASE) / 100;
+	      loop_version (loop, vers_cond, &condition_bb, prob,
+			    prob, REG_BR_PROB_BASE - prob, true);
+	    }
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1520,8 +2011,116 @@ sms_schedule (void)
 	  reg_move_replaces = generate_reg_moves (ps, true);
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, g);
-	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+
+	  if (doloop_p && count_init_isconst)
+	    {
+	      /* Change counter reg initialization constant. In more complex
+	         cases this adjustment is done with adding some insns
+		 to loop prologue in generate_prolog_epilog function.  */
+	      gcc_assert (single_set (count_init_insn) != NULL_RTX);
+	      SET_SRC (single_set (count_init_insn))
+		    = GEN_INT (count_init_val - stage_count + 1);
+	      df_insn_rescan (count_init_insn);
+	    }
+
+	  if (!doloop_p)
+	    {
+	      /* Calculation of the compare insn stage in schedule.  */
+	      for (row = 0; row < ps->ii; row++)
+		for (crr_insn = ps->rows[row];
+		     crr_insn;
+		     crr_insn = crr_insn->next_in_row)
+		  {
+		    gcc_assert (0 <= SCHED_STAGE (crr_insn->node));
+		    gcc_assert (SCHED_STAGE (crr_insn->node) < stage_count);
+		    if (rtx_equal_p (crr_insn->node->insn, cmp))
+		      {
+			gcc_assert (cmp_stage == -1);
+		        cmp_stage = SCHED_STAGE (crr_insn->node);
+		      }
+		  }
+              if (dump_file)
+		fprintf (dump_file, "cmp_stage=%d\n", cmp_stage);
+	      gcc_assert (cmp_stage >= 0);
+	    }
+
+	  /* When compare insn stage is non-zero we are to shift the final
+	     counter reg value (which counter is compared to exit loop).
+	     Final value can be an immediate or can be a register, which
+	     constant initialization we find in preheader.  */
+	  was_immediate = false;
+	  if (!doloop_p && count_fin_isconst && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      /* New finish value.  */
+	      count_fin_newval = count_fin_val - count_step * cmp_stage;
+	      was_immediate = CONST_INT_P (XEXP (SET_SRC (single_set (cmp)),
+							  1 - cmp_side));
+	      if (was_immediate)
+		{
+		  /* Check whether new value also can be an immediate.
+		     For example, on ARM not all values can be encoded as
+		     an immediate, so we have to load it to a register once
+		     before the loop starts.  */
+		  rtx to = GEN_INT (count_fin_newval);
+		  prolog_create_reg = rtx_cost (to, GET_CODE (to), 0, false)
+				    > rtx_cost (GEN_INT(1), CONST_INT, 0, false);
+	        }
+	      else
+		{
+		  /* A value is already in a register and we easily change
+		     initialization instruction in preheader.  */
+		  gcc_assert (count_fin_init_insn);
+		  SET_SRC (single_set (count_fin_init_insn))
+			= GEN_INT (count_fin_newval);
+		  df_insn_rescan (count_fin_init_insn);
+		}
+	    }
+
+	  /* The adjustment of finish register value.
+	     Zero means no adjustment needed or adjustment is done
+	     without additional insn in prologue.  */
+	  if (!doloop_p && !count_fin_isconst)
+	    prolog_fin_nonconst_adjust = count_step * cmp_stage;
+
+	  /* Ready to generate prolog and epilog.  */
+	  generate_prolog_epilog (ps, loop, count_reg, doloop_p,
+			          count_init_isconst, count_fin_reg,
+				  prolog_fin_nonconst_adjust,
+				  prolog_create_reg, count_fin_newval,
+				  &new_comp_reg);
+
+	  /* And only after generating prolog and epilog it is possible
+	     to modify the compare instruction (to prevent copying wrong insn
+	     form to first and last stages).  */
+	  if (!doloop_p && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      if (was_immediate && !prolog_create_reg)
+		{
+		/* Easy case - just modify a constant.  */
+		  gcc_assert (new_comp_reg == NULL_RTX);
+		  XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			= GEN_INT (count_fin_newval);
+		}
+	      else
+		{
+		  if (count_fin_isconst && !was_immediate)
+		    /* Value is in a register and we already changed
+		       initialization instruction in preheader.  */
+		    gcc_assert (new_comp_reg == NULL_RTX);
+		  else
+		    {
+		      /* Another case - use created by generate_prolog_epilog
+		         register, which value is initialized in prologue.  */
+		      gcc_assert (new_comp_reg != NULL_RTX);
+		      XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			      = new_comp_reg;
+		    }
+		}
+	      df_insn_rescan (cmp);
+	    }
+	  else gcc_assert (new_comp_reg == NULL_RTX);
 
 	  free_undo_replace_buff (reg_move_replaces);
 	}
@@ -1532,7 +2131,9 @@ sms_schedule (void)
       free_ddg (g);
     }
 
+  df_set_flags (DF_LR_RUN_DCE);
   free (g_arr);
+  iv_analysis_done ();
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index 6797397..4a1d726 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -1232,6 +1232,8 @@ extern void sched_deps_finish (void);
 extern void haifa_note_reg_set (int);
 extern void haifa_note_reg_clobber (int);
 extern void haifa_note_reg_use (int);
+void sms_sched_analyze_init (void);
+regset extract_from_insn_map (rtx, bool);
 
 extern void maybe_extend_reg_info_p (void);
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] [SMS] Support new loop pattern
  2011-09-30 15:54   ` Roman Zhuykov
@ 2011-10-12  0:48     ` Ayal Zaks
  2011-12-07 14:36       ` Roman Zhuykov
  2011-12-07 14:42       ` Roman Zhuykov
  0 siblings, 2 replies; 30+ messages in thread
From: Ayal Zaks @ 2011-10-12  0:48 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, dm

On Fri, Sep 30, 2011 at 5:22 PM, Roman Zhuykov <zhroma@ispras.ru> wrote:
> 2011/7/21  <zhroma@ispras.ru>:
>> This patch should be applied only after pending patches by Revital.
>
>
> Ping. New version is attached, it suits current trunk without
> additional patches.

Thanks for the ping.

> Also this related patch needs approval:
> http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01804.html
>
>> The loop should meet the following requirements.
>> First three are the same as for loop with doloop pattern:
>> ...
>> The next three describe the control part of new supported loops.
>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is

you'd probably want to bump to next instruction if falling through,
e.g., pc=(regF!=0)?label:pc+4

>>  flag register;
>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>  is a constant, or maybe a register, which is not changed inside a loop;
>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>  step is a constant.
>
>> When doloop is successfully scheduled by SMS, its number of
>> iterations of loop kernel should be decreased by the number of stages in a
>> schedule minus one, while other iterations expand to prologue and epilogue.
>> In new supported loops such approach can't be used, because some
>> instructions can use count register (regC).  Instead of this,
>> the final register value X in compare instruction regF=COMPARE(regC,X)
>> is changed to another value Y respective to the stage this instruction
>> is scheduled (Y = X - stage * step).

making sure this does not underflow; i.e., that the number of
iterations is no less than stage (you've addressed this towards the
end below).

>
> The main difference from doloop case is that regC can be used by some
> instructions in loop body.
> That's why we are unable to simply adjust regC initial value, but have
> to keep its value correct on each particular iteration.
> So, we change comparison instruction accordingly.
>
> An example:
> int a[100];
> int main()
> {
>  int i;
>  for (i = 85; i > 12; i -= 5)
>      a[i] = i * i;
>  return a[15]-225;
> }
> ARM assembler with "-O2 -fno-auto-inc-dec":
>        ldr     r0, .L5
>        mov     r3, #85
>        mov     r2, r0
> .L2:
>        mul     r1, r3, r3
>        sub     r3, r3, #5
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        sub     r2, r2, #20
>        bne     .L2
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L5:
>        .word   a
>
> Loop body is executed 15 times.
> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
> and following times:
> Stage  Time       Insn
> 0          5      mul     r1, r3, r3
> 1         10     sub     r3, r3, #5
> 1         11     cmp     r3, #10
> 1         11     str     r1, [r2, #340]
> 1         13     bne     .L2
> 2         16     sub     r2, r2, #20
>

branch is not scheduled last?

> To make new schedule correct the loop body
> should be executed 14 times and we change compare instruction:

the loop itself should execute 13 times.

> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
> In our example regC is r3, X is 10, step = -5, compare instruction
> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>

right. In general, if the compare is on stage s (starting from 0), it
will be executed s times in the epilog, so it should exit the loop
upon reaching Y = X - s * step.
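
A quick way to sanity-check this arithmetic is a small standalone C sketch.
The helpers adjusted_bound and count_iterations below are hypothetical names
used only for illustration, not functions from the patch, and the sketch
assumes the counter reaches the bound exactly (otherwise the loop would not
terminate):

```c
#include <assert.h>

/* Hypothetical helper, not part of the patch: the bound the kernel's
   compare must test when the compare insn sits on stage STAGE
   (counted from 0), i.e. Y = X - stage * step.  */
static long
adjusted_bound (long x, long step, int stage)
{
  assert (stage >= 0);
  return x - (long) stage * step;
}

/* Count how often a loop of the shape
     do { reg += step; } while (reg != bound);
   executes when entered with the counter equal to REG.
   Assumes BOUND is reached exactly; otherwise this never returns.  */
static int
count_iterations (long reg, long step, long bound)
{
  int n = 0;

  assert (step != 0);
  do
    {
      reg += step;
      n++;
    }
  while (reg != bound);
  return n;
}
```

With the values from the example, adjusted_bound (10, -5, 1) yields 15.
count_iterations (85, -5, 10) gives the 15 iterations of the original loop,
and count_iterations (80, -5, 15) gives 13 for the kernel, which is entered
with r3 = 80 after the sub in the prologue.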

> So, after SMS it looks like:
>        ldr     r0, .L5
>        mov     r3, #85
>        mov     r2, r0
> ;;prologue
>        mul     r1, r3, r3      ;;from stage 0 first iteration
>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        mul     r1, r3, r3      ;;from stage 0 second iteration
> ;;body
> .L2:
>        sub     r3, r3, #5
>        sub     r2, r2, #20
>        cmp     r3, #15         ;; new value to compare with is Y=15
>        str     r1, [r2, #340]
>        mul     r1, r3, r3
>        bne     .L2
> ;;epilogue
>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>        cmp     r3, #10
>        str     r1, [r2, #340]
>        sub     r2, r2, #20     ;;from stage 2 last iteration
>
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L5:
>        .word   a
>
> Real ARM assembler with SMS (after some optimizations and without dead code):
>        mov     r3, #85
>        ldr     r0, .L8
>        mul     r1, r3, r3
>        sub     r3, r3, #5
>        mov     r2, r0
>        str     r1, [r0, #340]
>        mul     r1, r3, r3
> .L2:
>        sub     r3, r3, #5
>        sub     r2, r2, #20
>        cmp     r3, #15
>        str     r1, [r2, #340]
>        mul     r1, r3, r3
>        bne     .L2
>        str     r1, [r2, #320]
>        ldr     r0, [r0, #60]
>        sub     r0, r0, #225
>        bx      lr
> .L8:
>        .word   a
>
>>
>> Testing of this approach reveals two bugs, which do not appear while SMS was
>> used only for doloop loops.  Both these bugs happen due to the nature of the
>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
>> The following situation happens when SMS is enabled without register renaming
>> (-fno-modulo-sched-allow-regmoves).  When data dependency graph is built, there
>> is a step when we generate anti-dependencies from last register use to first
>> write of this register at the next iteration.

is a step when we generate anti-dependencies from all last uses of the
register (i.e., uses of its last def) to the first write of this register
at the next iteration.

>> At this moment we should also
>> create such dependencies to all instructions which clobber the register to
>> prevent these clobbers from being placed before the last use in the new schedule.
>>

well, we simply need to connect these last uses to either the first
write *or* the first clobber of this register at the next iteration,
according to whichever is first, no?

>> Here is a model example:
>>
>> loop {
>> set1 regR
>> use1 regR
>> clobber regR
>> set2 regR
>> use2 regR
>> }
>>
>> If we create only use2->set1 anti-dependency (and no use2->clobber) the
>> following wrong schedule is possible:
>>
>> prologue {
>> set1 regR
>> use1 regR
>> clobber regR
>> }
>> kernel {
>> set2 regR
>> clobber regR (instruction from next iteration in terms of old loop kernel)
>> use2 regR
>> set1 regR (also from next iteration)
>> use1 regR (also from next iteration)
>> }
>> epilogue {
>> set2 regR
>> use2 regR
>> }
>>

strange; according to prolog (and epilog), clobber should appear after
use1 in the kernel, no? Aren't there (intra-iteration) dependencies
preventing the clobber from skipping over use1 and/or set1?

>> This problem was successfully fixed by creating a vector of all clobbering
>> instructions together with first write and adding all needed dependencies.
>>

seems like an overkill to me; we didn't draw an edge between every
last use and every write, because writes are kept in order by having
output dependences between them. So should be the case with clobbers.

This should ideally be solved by a dedicated (separate) patch.
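
To make the suggestion concrete, here is a rough standalone sketch of the
"whichever is first" rule, with one inter-iteration anti-dep edge from each
last use to the first write or clobber. It is only a model: the reg_access
enum and both helpers are hypothetical and say nothing about how ddg.c
actually represents insns or edges.

```c
#include <assert.h>

/* Hypothetical model of how one insn touches the register of interest.  */
enum reg_access { ACC_NONE, ACC_USE, ACC_SET, ACC_CLOBBER };

/* Index of the first insn in the loop body that writes or clobbers the
   register, or -1 if there is none.  This is the single target of the
   inter-iteration anti-dependence edges.  */
static int
first_write_or_clobber (const enum reg_access *body, int n)
{
  int i;

  assert (body != 0 && n >= 0);
  for (i = 0; i < n; i++)
    if (body[i] == ACC_SET || body[i] == ACC_CLOBBER)
      return i;
  return -1;
}

/* Store in OUT the indices of the uses that follow the last write or
   clobber (the "last uses") and return how many were found.  OUT must
   have room for N entries.  */
static int
last_uses (const enum reg_access *body, int n, int *out)
{
  int i, last_def = -1, count = 0;

  assert (body != 0 && out != 0 && n >= 0);
  for (i = 0; i < n; i++)
    if (body[i] == ACC_SET || body[i] == ACC_CLOBBER)
      last_def = i;
  for (i = last_def + 1; i < n; i++)
    if (body[i] == ACC_USE)
      out[count++] = i;
  return count;
}
```

For a body of the form set, use, clobber, set, use this yields a single edge
from the final use (index 4) to the first set (index 0). The clobber in the
middle is then expected to be kept in place by the ordinary intra-iteration
dependences rather than by extra inter-iteration edges.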

>> The other bug happens only with -fmodulo-sched-allow-regmoves.  Here we
>> eliminate some anti-dependence edges in data dependency graph in order to
>> resolve them later by adding some register moves (renaming instructions).  But
>> in situation as in example above compiler gives an ICE because it can't create
>> a register move, when regR is hardware flag register.  So we have to know which
>> register(s) cause anti-dependency in order to understand whether we can ignore
>> it.  I can't find any easy way to gather this information, so I create my own
>> structures to store this info and had implemented my own hooks for
>> sched_analyze function.  This leads to more complex interconnection between
>> ddg.c and modulo-sched.c.

Well, having ddg.c's/create_ddg_dep_from_intra_loop_link() consult
"Register sets from modulo scheduler structures" to determine if an
anti-dep can be eliminated, seems awkward. One should be able to build
a ddg independent of any modulo scheduler structures.
One simple solution is to keep all anti-dep edges of registers that
are clobbered anywhere in the loop. This might be overly conservative,
but perhaps not so if "On x86_64 it is clobbered by most of arithmetic
instructions".
If an anti-dep between a use and a def of a clobbered register is to be
removed, a dependence should be created between the use and a
clobbering instruction following the def. Hope this is clear.

This too should be solved by a dedicated (separate) patch, for easier digestion.

Presumably, the ddg already has all intra-iteration edges preventing
motion of clobbering instructions within an iteration, and we are only
missing inter-iteration deps or deps surfaced by eliminating
anti-deps, right?

You might consider introducing a new type of dependence for such
edges, "Clobber", if it would help.

>>
>> One more thing to point out is number of loop iterations. When number of
>> iterations of a loop is not known at compile time, SMS has to create two loop
>> versions (original and scheduled), and execute scheduled one only when real
>> number of iterations is bigger than number of stages.  In doloop case the
>> number of iterations simply equals to the count register value before the loop.
>> So SMS finds its constant initialization or makes two loop versions.  In new
>> supported loops number of iterations value is more complex.  It even can't be
>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>> this:
>>
>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>
>> This loop has 22 iterations.  So, I decided to use get_simple_loop_desc
>> function which gives a structure with loop characteristics, some of them helps
>> to find iteration number:
>>
>> rtx niter_expr - The number of iterations of the loop;
>> bool const_iter - True if the loop iterates the constant number of times;
>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>
>> But we can use these expressions only after looking through some other fields
>> of returned structure:
>>
>> bool simple_p - True if we are able to say anything about number of iterations
>> of the loop;
>> rtx assumptions - Assumptions under that the rest of the information is valid;
>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>> the latch;
>> rtx infinite - Condition under which the loop is infinite.
>>
>> I decided to allow SMS scheduling only when simple_p is true and other three
>> fields are NULL_RTX, or when simple_p is true and
>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>> condition, and the next separate patch is an attempt to process it.
>>

ok, still need to go over this rather lengthy and orthogonal (although
it exposes the bugs above) piece.
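
For reference, the wrap-around example quoted above can be checked with a
few lines of standalone C. This is only a sketch: wrapping_trip_count and its
limit parameter are made up for illustration and have nothing to do with how
get_simple_loop_desc computes niter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of times the body of
     for (x = start; x != end; x += step) ...
   runs with 32-bit wrapping arithmetic.  LIMIT guards against loops
   that never reach END; -1 is returned once LIMIT steps were taken
   without hitting END.  */
static long
wrapping_trip_count (uint32_t start, uint32_t end, uint32_t step, long limit)
{
  uint32_t x = start;
  long n = 0;

  assert (limit >= 0);
  while (x != end)
    {
      if (n == limit)
	return -1;
      x += step;	/* Unsigned overflow wraps modulo 2^32.  */
      n++;
    }
  return n;
}
```

wrapping_trip_count (0, 0x6F80919A, 0xEDCBA987, 1000) returns 22, whereas the
naive (final - start) / step in unsigned arithmetic gives 0, which is why a
plain division cannot be used for such loops.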

Ayal.


> --
> Roman Zhuykov
> zhroma@ispras.ru
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/9] [RFC] Expand SMS functionality
  2011-09-30 15:37 ` [PATCH 0/9] [RFC] Expand SMS functionality Roman Zhuykov
@ 2011-10-17 14:34   ` Richard Sandiford
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Sandiford @ 2011-10-17 14:34 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, dm

Roman Zhuykov <zhroma@ispras.ru> writes:
> [PATCH 4/9] Move the SMS pass earlier
> http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01811.html

I don't think this is a good idea.  To get good results, SMS really
needs to run as close to the register allocator as possible, otherwise
later passes might disrupt the schedule.

Richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2011-10-12  0:48     ` Ayal Zaks
@ 2011-12-07 14:36       ` Roman Zhuykov
  2011-12-07 14:42       ` Roman Zhuykov
  1 sibling, 0 replies; 30+ messages in thread
From: Roman Zhuykov @ 2011-12-07 14:36 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches, dm

[-- Attachment #1: Type: text/plain, Size: 7641 bytes --]

2011/10/12 Ayal Zaks <ayal.zaks@gmail.com>:
>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>
> you'd probably want to bump to next instruction if falling through,
> e.g., pc=(regF!=0)?label:pc+4
>

It is considered that program counter is increased automatically on
hardware level.  Otherwise we should add something like "pc=pc+4" in
parallel to each instruction in RTL.

>>>  flag register;
>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>  step is a constant.
>>>
>>> When doloop is succesfully scheduled by SMS, its number of
>>> iterations of loop kernel should be decreased by the number of stages in a
>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>> In new supported loops such approach can't be used, because some
>>> instructions can use count register (regC).  Instead of this,
>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>> is changed to another value Y respective to the stage this instruction
>>> is scheduled (Y = X - stage * step).
>
> making sure this does not underflow; i.e., that the number of
> iterations is no less than stage (you've addressed this towards the
> end below).
>

Yes, this situation is processed correctly.

>>
>> The main difference from doloop case is that regC can be used by some
>> instructions in loop body.
>> That's why we are unable to simply adjust regC initial value, but have
>> to keep it's value correct on each particular iteration.
>> So, we change comparison instruction accordingly.
>>
>> An example:
>> int a[100];
>> int main()
>> {
>>  int i;
>>  for (i = 85; i > 12; i -= 5)
>>      a[i] = i * i;
>>  return a[15]-225;
>> }
>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> .L2:
>>        mul     r1, r3, r3
>>        sub     r3, r3, #5
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20
>>        bne     .L2
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>
>> Loop body is executed 15 times.
>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>> and following times:
>> Stage  Time       Insn
>> 0          5      mul     r1, r3, r3
>> 1         10     sub     r3, r3, #5
>> 1         11     cmp     r3, #10
>> 1         11     str     r1, [r2, #340]
>> 1         13     bne     .L2
>> 2         16     sub     r2, r2, #20
>>
>
> branch is not scheduled last?
>

Yes, the branch schedule time is smaller than the sub's one.  This means
that the "sub r2, r2, #20" instruction from original iteration number K
will be executed after "bne .L2" from original iteration number K.
But certainly bne remains the last instruction in the new loop body.
Below you can see how it looks after SMS.

>> To make new schedule correct the loop body
>> should be executed 14 times and we change compare instruction:
>
> the loop itself should execute 13 times.

with i =
85, 80, 75, 70, 65
60, 55, 50, 45, 40
35, 30, 25, 20, 15
this gives total 15 iterations (15 stores to memory).
And new loop body will be executed 13 times (one store goes to
epilogue and one - to prologue).

>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>> In our example regC is r3, X is 10, step = -5, compare instruction
>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>
>
> right. In general, if the compare is on stage s (starting from 0), it
> will be executed s times in the epilog, so it should exit the loop
> upon reaching Y = X - s * step.
>
>> So, after SMS it looks like:
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> ;;prologue
>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>> ;;body
>> .L2:
>>        sub     r3, r3, #5
>>        sub     r2, r2, #20
>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3
>>        bne     .L2
>> ;;epilogue
>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>

Here in comments I mention why insn was copied to prolog and epilog.
Only branch is not copied at all.

>>> Testing of this appoach reveals two bugs, which do not appear while SMS was
>>> used only for doloop loops.  Both these bugs happen due to the nature of the
>>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
> This should ideally be solved by a dedicated (separate) patch.
> ...
> This too should be solved by a dedicated (separate) patch, for easier digestion.

As Ayal asks, I'll continue the discussion of these two bugs in two
separate e-mails, sent as replies to this message.

>>>
>>> One more thing to point out is number of loop iterations. When number of
>>> iterations of a loop is not known at compile time, SMS has to create two loop
>>> versions (original and scheduled), and execute scheduled one only when real
>>> number of iterations is bigger than number of stages.  In doloop case the
>>> number of iterations simply equals to the count register value before the loop.
>>> So SMS finds its constant initialization or makes two loop versions.  In new
>>> supported loops number of iterations value is more complex.  It even can't be
>>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>>> this:
>>>
>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>
>>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>>> function which gives a structure with loop characteristics, some of them helps
>>> to find iteration number:
>>>
>>> rtx niter_expr - The number of iterations of the loop;
>>> bool const_iter - True if the loop iterates the constant number of times;
>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>
>>> But we can use these expressions only after looking through some other fields
>>> of returned structure:
>>>
>>> bool simple_p - True if we are able to say anything about number of iterations
>>> of the loop;
>>> rtx assumptions - Assumptions under that the rest of the information is valid;
>>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>>> the latch;
>>> rtx infinite - Condition under which the loop is infinite.
>>>
>>> I decide to allow SMS scheduling only when simple_p is true and other three
>>> fields are NULL_RTX, or when simple_p is true and
>>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>>> condition, and the next separate patch is an attempt to process it.
>>>
>
> ok, still need to go over this rather lengthy and orthogonal (although
> it exposes the bugs above) piece.
>
> Ayal.
>
>

A new version is attached; it applies to current trunk.
Without fixing both bugs mentioned above, this patch breaks bootstrap
on x86-64.  Together with the DDG fixes the patch was successfully
regtested on ARM, and "regstrapped" on x86-64 and IA64.

--
Roman Zhuykov
zhroma@ispras.ru

[-- Attachment #2: sms.patch --]
[-- Type: text/x-patch, Size: 26313 bytes --]

2011-12-07  Roman Zhuykov  <zhroma@ispras.ru>
	* modulo-sched.c (nondoloop_register_get): New function.
	(const_iteration_count): Rename to ...
	(search_const_init): ...this.  Add new parameter (is_const).  Always
	return register initialization rtx and set is_const to true
	only when it is constant.
	(duplicate_insns_of_cycles): Add new parameter (doloop_p).  Do not
	duplicate instructions with count_reg only when doloop_p is set.
	Update all callers.
	(generate_prolog_epilog): Add new parameters.  Correctly generate loop
	prologue for new loop pattern.
	(sms_schedule): Support new loop pattern.
---

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 0ea9a4d..e62aca7 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -220,7 +220,8 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    rtx, bool, bool, rtx, HOST_WIDEST_INT,
+                                    bool, HOST_WIDEST_INT, rtx *);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -255,7 +256,7 @@ typedef struct node_sched_params node_sched_params;
 DEF_VEC_O (node_sched_params);
 DEF_VEC_ALLOC_O (node_sched_params, heap);
 \f
-/* The following three functions are copied from the current scheduler
+/* The following two functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
    They are used when initializing the sched_info structure.  */
 static const char *
@@ -398,37 +399,164 @@ doloop_register_get (rtx head ATTRIBUTE_UNUSED, rtx tail ATTRIBUTE_UNUSED)
 #endif
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
+/* Same as previous for loop with always-the-same-step counter.  */
 static rtx
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       HOST_WIDEST_INT * count)
+nondoloop_register_get (rtx head, rtx tail, int cmp_side,
+			rtx *addsub_output, rtx *cmp_output)
+{
+  rtx insn, reg, flagreg, addsub, cmp, end;
+
+  /* Check jump instruction form */
+  insn = single_set (tail);
+  if (insn == NULL_RTX
+      || SET_DEST (insn) != pc_rtx
+      || GET_CODE (SET_SRC (insn)) != IF_THEN_ELSE
+      || GET_CODE (XEXP (SET_SRC (insn), 1)) != LABEL_REF
+      || XEXP (SET_SRC (insn), 2) != pc_rtx)
+    return NULL_RTX;
+
+  /* Check loop exit condition */
+  insn = XEXP (SET_SRC (insn), 0);
+  if (GET_CODE (insn) != NE || XEXP (insn, 1) != const0_rtx)
+    return NULL_RTX;
+
+  /* Flags register */
+  flagreg = XEXP (insn, 0);
+
+  /* Searching comparison instruction */
+  cmp = PREV_INSN (tail);
+  while (cmp != PREV_INSN (head))
+    {
+      if (INSN_P (cmp) && reg_set_p (flagreg, cmp))
+        break;
+      cmp = PREV_INSN (cmp);
+    }
+  if (cmp == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Check comparison */
+  insn = single_set (cmp);
+  if (insn == NULL_RTX
+      || ! rtx_equal_p (flagreg, SET_DEST (insn))
+      || GET_CODE (SET_SRC (insn)) != COMPARE)
+    return NULL_RTX;
+
+  /* Loop register */
+  gcc_assert (0 <= cmp_side && cmp_side <= 1);
+  reg = XEXP (SET_SRC (insn), cmp_side);
+  if (! REG_P (reg))
+    return NULL_RTX;
+
+  /* End value */
+  end = XEXP (SET_SRC (insn), 1 - cmp_side);
+  if (! REG_P (end) && ! CONST_INT_P (end))
+    return NULL_RTX;
+
+  /* Searching register add/sub instruction */
+  addsub = PREV_INSN (cmp);
+  while (addsub != PREV_INSN (head))
+    {
+      if (INSN_P (addsub) && reg_set_p (reg, addsub))
+        break;
+      addsub = PREV_INSN (addsub);
+    }
+  if (addsub == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Checking register change instruction */
+  insn = single_set (addsub);
+  if (insn == NULL_RTX || ! rtx_equal_p (reg, SET_DEST (insn)))
+    return NULL_RTX;
+  insn = SET_SRC (insn);
+  if ((GET_CODE (insn) != PLUS && GET_CODE (insn) != MINUS)
+      || ! rtx_equal_p (reg, XEXP (insn, 0))
+      || ! (CONST_INT_P (XEXP (insn, 1))))
+    return NULL_RTX;
+
+  /* No other REG and END (if reg) modifications allowed */
+  for (insn = head; insn != tail; insn = NEXT_INSN (insn))
+    {
+      if (REG_P(end) && reg_set_p (end, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS end register found ");
+            print_rtl_single (dump_file, end);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+	  return NULL_RTX;
+	}
+      if (insn != addsub && reg_set_p (reg, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS count_reg found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+          return NULL_RTX;
+        }
+    }
+
+  *addsub_output = addsub;
+  *cmp_output = cmp;
+  return reg;
+}
+
+/* Check if REG is set to a constant in the PRE_HEADER block.
+   If possible to find, return the rtx that sets REG.
+   If REG is set to a constant (probably not directly),
+   set IS_CONST to true and VALUE to that constant value.  */
+static rtx
+search_const_init (basic_block pre_header, rtx reg, bool *is_const,
+		   HOST_WIDEST_INT *value)
 {
   rtx insn;
   rtx head, tail;
 
-  if (! pre_header)
-    return NULL_RTX;
+  if (!pre_header)
+    {
+      *is_const = false;
+      return NULL_RTX;
+    }
 
   get_ebb_head_tail (pre_header, pre_header, &head, &tail);
 
   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
     if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
+	rtx_equal_p (reg, SET_DEST (single_set (insn))))
       {
-	rtx pat = single_set (insn);
+	rtx src, pat = single_set (insn);
+	src = SET_SRC (pat);
 
-	if (CONST_INT_P (SET_SRC (pat)))
+	if (CONST_INT_P (src))
 	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
+	    *is_const = true;
+	    *value = INTVAL (src);
+	  }
+	else if (REG_P (src))
+	  { /* Check if previous insn sets SRC = constant.  */
+	    pat = single_set (PREV_INSN (insn));
+	    if (pat != NULL_RTX && rtx_equal_p (src, SET_DEST (pat))
+		&& CONST_INT_P (SET_SRC (pat)))
+	      {
+		*is_const = true;
+		*value = INTVAL (SET_SRC (pat));
+	      }
+	    else
+		*is_const = false;
 	  }
+	else
+	  *is_const = false;
 
-	return NULL_RTX;
+	return insn;
       }
+    else if (reg_set_p (reg, insn))
+      break;
 
+  *is_const = false;
   return NULL_RTX;
 }
 
@@ -1103,7 +1231,7 @@ clear:
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, rtx count_reg)
+			   int to_stage, rtx count_reg, bool doloop_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1115,14 +1243,14 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int first_u, last_u;
 	rtx u_insn;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
+        /* In doloop case do not duplicate any insn which refers
+	   to count_reg as it belongs to the control part.
            The closing branch is scheduled as well and thus should
            be ignored.
            TODO: This should be done by analyzing the control part of
            the loop.  */
 	u_insn = ps_rtl_insn (ps, u);
-        if (reg_mentioned_p (count_reg, u_insn)
+        if ((doloop_p && reg_mentioned_p (count_reg, u_insn))
             || JUMP_P (u_insn))
           continue;
 
@@ -1142,7 +1270,10 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        rtx count_reg, bool doloop_p, bool count_init_isconst,
+			rtx fin_reg, HOST_WIDEST_INT fin_nonconst_adjust,
+			bool create_reg, HOST_WIDEST_INT reg_val,
+			rtx *created_reg)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
@@ -1151,12 +1282,12 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
+  if (doloop_p && !count_init_isconst)
     {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
+      /* In doloop we generate instructions at the beginning of the prolog to
+         adjust the initial value of doloop counter by STAGE_COUNT.
+	 If loop count is constant, this adjustment is done outside this
+         function, simply correcting the source of initialization insn.  */
       rtx sub_reg = NULL_RTX;
 
       sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS,
@@ -1167,8 +1298,40 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
         emit_move_insn (count_reg, sub_reg);
     }
 
+  if (!doloop_p)
+    {
+      /* In non-doloop we generate instructions at the beginning of
+         the prolog to adjust the final value (with this value loop count
+	 register is compared to check whether the loop should stop).  */
+      if (fin_nonconst_adjust != 0)
+	{
+	  /* If the final value is in a register - create another register
+	     to store a shifted value.  */
+	  rtx new_reg, reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (fin_reg));
+	  new_reg = expand_simple_binop (GET_MODE (fin_reg), MINUS, fin_reg,
+					 GEN_INT (fin_nonconst_adjust),
+					 reg, 0, OPTAB_DIRECT);
+	  gcc_assert (REG_P (new_reg));
+	  if (REGNO (new_reg) != REGNO (reg))
+	    emit_move_insn (reg, new_reg);
+	  *created_reg = new_reg;
+	}
+      else if (create_reg)
+	{
+	  /* If old final value is an immediate, and the new one can't be
+	     an immediate, we create a register to store it.  If both values
+	     are immediate the adjustment is done outside this function,
+	     just correcting the constant value in compare instruction.  */
+	  rtx reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (count_reg));
+	  emit_move_insn (reg, GEN_INT (reg_val));
+	  *created_reg = reg;
+	}
+    }
+
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg, doloop_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1182,7 +1345,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg, doloop_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1460,13 +1623,30 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
+      /* Is this a doloop?  */
+      if ((count_reg = doloop_register_get (head, tail)))
+        {
+	  if (dump_file)
+	    fprintf (dump_file, "SMS doloop\n");
+        }
+      else if ((count_reg = nondoloop_register_get (head, tail, 0,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop\n");
+	}
+      else if ((count_reg = nondoloop_register_get (head, tail, 1,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop with transposed cmp\n");
+	}
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS incompatible loop\n");
+	  continue;
+	}
 
       /* Don't handle BBs with calls or barriers
 	 or !single_set with the exception of instructions that include
@@ -1516,7 +1696,6 @@ sms_schedule (void)
 	    fprintf (dump_file, "SMS create_ddg failed\n");
 	  continue;
         }
-
       g_arr[loop->num] = g;
       if (dump_file)
         fprintf (dump_file, "...OK\n");
@@ -1528,14 +1707,28 @@ sms_schedule (void)
     fprintf (dump_file, "=========================\n\n");
   }
 
+  df_clear_flags (DF_LR_RUN_DCE);
+
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (li, loop, 0)
     {
+      bool doloop_p, count_fin_isconst, count_init_isconst;
+      bool was_immediate = false;
+      bool prolog_create_reg = false;
+      int prolog_fin_nonconst_adjust = 0;
+      bool nonsimple_loop = false;
       rtx head, tail;
-      rtx count_reg, count_init;
-      int mii, rec_mii, stage_count, min_cycle;
-      HOST_WIDEST_INT loop_count = 0;
+      int min_cycle;
       bool opt_sc_p;
+      rtx count_reg, count_fin_reg, new_comp_reg = NULL_RTX;
+      rtx count_init_insn, count_fin_init_insn;
+      rtx add, cmp;
+      int mii, rec_mii, cmp_side = -1, cmp_stage = -1;
+      int stage_count = 0;
+      HOST_WIDEST_INT count_init_val = 0, count_fin_val = 0;
+      HOST_WIDEST_INT count_step = 0, loop_count = -1;
+      HOST_WIDEST_INT count_fin_newval = 0;
+      struct niter_desc *desc = NULL;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1573,32 +1766,159 @@ sms_schedule (void)
 	               (HOST_WIDEST_INT) profile_info->sum_max);
 	      fprintf (dump_file, "\n");
 	    }
-	  fprintf (dump_file, "SMS doloop\n");
 	  fprintf (dump_file, "SMS built-ddg %d\n", g->num_nodes);
           fprintf (dump_file, "SMS num-loads %d\n", g->num_loads);
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
 
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL_RTX;
-      if ((count_reg = doloop_register_get (head, tail)))
+      /* Extract count register and determine loop type.  */
+      add = NULL_RTX;
+      cmp = NULL_RTX;
+      if ((count_reg = doloop_register_get (head, tail))
+	  || (count_reg = nondoloop_register_get (head, tail, 0, &add, &cmp))
+	  || (count_reg = nondoloop_register_get (head, tail, 1, &add, &cmp)))
 	{
-	  basic_block pre_header;
+	  basic_block pre_header = loop_preheader_edge (loop)->src;
+
+	  doloop_p = (cmp == NULL_RTX);
+	  if (doloop_p)
+	    {
+	      /* Doloop finish parameters are always the same.  */
+	      count_step = -1;
+	      count_fin_isconst = true;
+	      count_fin_val = 0;
+	      count_fin_reg = NULL_RTX;
+	      count_fin_init_insn = NULL_RTX;
+	    }
+	  else
+	    {
+	      /* In other loop we need to determine counter step
+	         and finish parameters.  */
+	      rtx step, end;
+
+	      gcc_assert (single_set (add) && single_set (cmp));
+
+	      /* Extract the step.  */
+	      step = XEXP (SET_SRC (single_set (add)), 1);
+	      gcc_assert (CONST_INT_P (step));
+
+	      if (GET_CODE (SET_SRC (single_set (add))) == MINUS)
+	        count_step = - INTVAL (step);
+	      else if (GET_CODE (SET_SRC (single_set (add))) == PLUS)
+	        count_step = INTVAL (step);
+	      else
+		gcc_unreachable ();
+
+	      gcc_assert(count_step != 0);
+
+	      /* Check what operand of compare insn is a counter register.  */
+	      if (count_reg == XEXP (SET_SRC (single_set (cmp)), 0))
+		cmp_side = 0;
+	      else if (count_reg == XEXP (SET_SRC (single_set (cmp)), 1))
+		cmp_side = 1;
+	      else
+		gcc_unreachable ();
+
+	      /* Extract finish border for counter reg.  */
+	      end = XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side);
 
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
+	      if (CONST_INT_P (end))
+		{
+		  /* Constant finish border.  loop until (reg != const).  */
+		  count_fin_isconst = true;
+		  count_fin_val = INTVAL (end);
+		  count_fin_reg = NULL_RTX;
+		  count_fin_init_insn = NULL_RTX;
+		}
+	      else if (REG_P (end))
+		{
+		  /* Register is a border.  Loop until (reg != fin_reg).  */
+		  count_fin_reg = end;
+		  count_fin_isconst = false;
+		  /* Try to find constant initialization of fin_reg
+		   * in preheader.  */
+		  count_fin_init_insn = search_const_init (pre_header,
+							   count_fin_reg,
+							   &count_fin_isconst,
+							   &count_fin_val);
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  /* Try to find a constant initialization of count_reg in preheader.  */
+	  count_init_insn = search_const_init (pre_header,
+					       count_reg,
+					       &count_init_isconst,
+					       &count_init_val);
+	}
+      else /* Loop is incompatible now, but it was OK while analyzing!  */
+	gcc_assert (count_reg);
+
+
+      desc = get_simple_loop_desc (loop);
+      gcc_assert (desc);
+      /* nonsimple_loop means it's impossible to analyze the loop
+         or there are some assumptions to make the analysis results right
+         or there is a condition of non-infinite number of iterations.
+         We want doloops to be scheduled even if analysis shows they are
+	 nonsimple (backward compatibility).  */
+      nonsimple_loop = !desc->simple_p;
+      /* We allow scheduling loop with some assumptions or infinite condition
+	 only when unsafe_loop_optimizations flag is enabled.  */
+      if (flag_unsafe_loop_optimizations)
+	 {
+	   desc->infinite = NULL_RTX;
+	   desc->assumptions = NULL_RTX;
+	   desc->noloop_assumptions = NULL_RTX;
+	 }
+      nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
+			|| (desc->noloop_assumptions != NULL_RTX)
+			|| (desc->infinite != NULL_RTX);
+      /* Only doloops can be nonsimple_loops for SMS.  */
+      if (nonsimple_loop && !doloop_p)
+	{
+	  free_ddg (g);
+	  continue;
+	}
+      /* Manually set some description fields in non-simple doloop.  */
+      if (nonsimple_loop)
+	{
+	  gcc_assert(doloop_p);
+	  desc->const_iter = false;
+	  desc->infinite = NULL_RTX;
 	}
-      gcc_assert (count_reg);
 
-      if (dump_file && count_init)
+      if (desc->const_iter)
+	{
+	  gcc_assert (!desc->infinite);
+	  loop_count = desc->niter;
+	  if (dump_file)
+	    fprintf (dump_file, "SMS const loop iterations = "
+		     HOST_WIDEST_INT_PRINT_DEC "\n", loop_count);
+	}
+      if (count_init_isconst && count_fin_isconst)
         {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC,
-		     loop_count);
-          fprintf (dump_file, "\n");
+	  gcc_assert (doloop_p || desc->const_iter);
+	  if (doloop_p)
+	    {
+	      if (nonsimple_loop)
+		{
+	          loop_count = count_init_val;
+		  desc->const_iter = true;
+		}
+              gcc_assert (desc->const_iter && loop_count == count_init_val);
+	    }
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "SMS const-%s ",
+		       doloop_p ? "doloop" : "loop");
+	      fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC " to "
+		       HOST_WIDEST_INT_PRINT_DEC " step "
+		       HOST_WIDEST_INT_PRINT_DEC,
+		       count_init_val, count_fin_val, count_step);
+	      fprintf (dump_file, "\n");
+	    }
         }
 
       node_order = XNEWVEC (int, g->num_nodes);
@@ -1649,7 +1969,7 @@ sms_schedule (void)
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
 	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
-	      || (count_init && (loop_count <= stage_count))
+	      || (desc->const_iter && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
 	      if (dump_file)
@@ -1709,23 +2029,24 @@ sms_schedule (void)
 	      print_partial_schedule (ps, dump_file);
 	    }
  
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
-	  				     GEN_INT(stage_count));
-	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
-			       * REG_BR_PROB_BASE) / 100;
-
-	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
-			    true);
-	     }
+	  if (!desc->const_iter)
+	    {
+	      /* Loop versioning if the number of iterations is unknown.  */
+	      unsigned prob;
+	      rtx vers_cond;
+	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					  count_reg : desc->niter_expr,
+					  GEN_INT (stage_count));
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "\nLoop versioning condition:\n");
+		  print_rtl_single (dump_file, vers_cond);
+		}
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	      prob = (PROB_SMS_ENOUGH_ITERATIONS * REG_BR_PROB_BASE) / 100;
+	      loop_version (loop, vers_cond, &condition_bb, prob,
+			    prob, REG_BR_PROB_BASE - prob, true);
+	    }
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1741,8 +2062,121 @@ sms_schedule (void)
 	  apply_reg_moves (ps);
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, ps);
-	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+
+	  if (doloop_p && count_init_isconst)
+	    {
+	      /* Change counter reg initialization constant. In more complex
+	         cases this adjustment is done with adding some insns
+		 to loop prologue in generate_prolog_epilog function.  */
+	      gcc_assert (single_set (count_init_insn) != NULL_RTX);
+	      SET_SRC (single_set (count_init_insn))
+		    = GEN_INT (count_init_val - stage_count + 1);
+	      df_insn_rescan (count_init_insn);
+	    }
+
+	  if (!doloop_p)
+	    {
+	      /* Calculation of the compare insn stage in schedule.  */
+	      ps_insn_ptr crr_insn;
+	      int row, stage;
+	      cmp_stage = -1;
+	      for (row = 0; row < ps->ii; row++)
+		for (crr_insn = ps->rows[row];
+		     crr_insn;
+		     crr_insn = crr_insn->next_in_row)
+		  {
+		    stage = SCHED_STAGE (crr_insn->id);
+		    gcc_assert (0 <= stage && stage < stage_count);
+		    if (rtx_equal_p (ps_rtl_insn (ps, crr_insn->id), cmp))
+		      {
+			gcc_assert (cmp_stage == -1);
+		        cmp_stage = stage;
+		      }
+		  }
+              if (dump_file)
+		fprintf (dump_file, "cmp_stage=%d\n", cmp_stage);
+	      gcc_assert (cmp_stage >= 0);
+	    }
+
+	  /* When the compare insn stage is non-zero we have to shift the final
+	     counter reg value (the value the counter is compared against to
+	     exit the loop).  The final value can be an immediate or a register
+	     whose constant initialization we find in the preheader.  */
+	  was_immediate = false;
+	  if (!doloop_p && count_fin_isconst && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      /* New finish value.  */
+	      count_fin_newval = count_fin_val - count_step * cmp_stage;
+	      was_immediate = CONST_INT_P (XEXP (SET_SRC (single_set (cmp)),
+							  1 - cmp_side));
+	      if (was_immediate)
+		{
+		  /* Check whether new value also can be an immediate.
+		     For example, on ARM not all values can be encoded as
+		     an immediate, so we have to load it to a register once
+		     before the loop starts.  */
+		  rtx to = GEN_INT (count_fin_newval);
+		  prolog_create_reg = rtx_cost (to, GET_CODE (to), 0, false)
+			    > rtx_cost (GEN_INT(1), CONST_INT, 0, false);
+	        }
+	      else
+		{
+		  /* A value is already in a register and we easily change
+		     initialization instruction in preheader.  */
+		  gcc_assert (count_fin_init_insn);
+		  SET_SRC (single_set (count_fin_init_insn))
+			= GEN_INT (count_fin_newval);
+		  df_insn_rescan (count_fin_init_insn);
+		}
+	    }
+
+	  /* The adjustment of finish register value.
+	     Zero means no adjustment needed or adjustment is done
+	     without additional insn in prologue.  */
+	  if (!doloop_p && !count_fin_isconst)
+	    prolog_fin_nonconst_adjust = count_step * cmp_stage;
+
+	  /* Ready to generate prolog and epilog.  */
+	  generate_prolog_epilog (ps, loop, count_reg, doloop_p,
+			          count_init_isconst, count_fin_reg,
+				  prolog_fin_nonconst_adjust,
+				  prolog_create_reg, count_fin_newval,
+				  &new_comp_reg);
+
+	  /* And only after generating prolog and epilog it is possible
+	     to modify the compare instruction (to prevent copying wrong insn
+	     form to first and last stages).  */
+	  if (!doloop_p && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      if (was_immediate && !prolog_create_reg)
+		{
+		/* Easy case - just modify a constant.  */
+		  gcc_assert (new_comp_reg == NULL_RTX);
+		  XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			= GEN_INT (count_fin_newval);
+		}
+	      else
+		{
+		  if (count_fin_isconst && !was_immediate)
+		    /* Value is in a register and we already changed
+		       initialization instruction in preheader.  */
+		    gcc_assert (new_comp_reg == NULL_RTX);
+		  else
+		    {
+		      /* Another case - use created by generate_prolog_epilog
+		         register, which value is initialized in prologue.  */
+		      gcc_assert (new_comp_reg != NULL_RTX);
+		      XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			      = new_comp_reg;
+		    }
+		}
+	      df_insn_rescan (cmp);
+	    }
+	  else
+	    gcc_assert (new_comp_reg == NULL_RTX);
+
 	  break;
 	}
 
@@ -1752,7 +2186,9 @@ sms_schedule (void)
       free_ddg (g);
     }
 
+  df_set_flags (DF_LR_RUN_DCE);
   free (g_arr);
+  iv_analysis_done ();
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2011-10-12  0:48     ` Ayal Zaks
  2011-12-07 14:36       ` Roman Zhuykov
@ 2011-12-07 14:42       ` Roman Zhuykov
  2011-12-29 15:43         ` Roman Zhuykov
  1 sibling, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2011-12-07 14:42 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches, dm

[-- Attachment #1: Type: text/plain, Size: 8061 bytes --]

Apologies for the messed up previous e-mail.

2011/10/12 Ayal Zaks <ayal.zaks@gmail.com>:
>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>
> you'd probably want to bump to next instruction if falling through,
> e.g., pc=(regF!=0)?label:pc+4
>

The program counter is assumed to be advanced automatically at the
hardware level.
Otherwise we would have to add something like "pc=pc+4" in parallel to
each instruction in RTL.

>>>  flag register;
>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>  step is a constant.
>>
>>> When doloop is successfully scheduled by SMS, its number of
>>> iterations of loop kernel should be decreased by the number of stages in a
>>> schedule minus one, while other iterations expand to prologue and epilogue.
>>> In new supported loops such approach can't be used, because some
>>> instructions can use count register (regC).  Instead of this,
>>> the final register value X in compare instruction regF=COMPARE(regC,X)
>>> is changed to another value Y respective to the stage this instruction
>>> is scheduled (Y = X - stage * step).
>
> making sure this does not underflow; i.e., that the number of
> iterations is no less than stage (you've addressed this towards the
> end below).
>

Yes, this situation is processed correctly.

>>
>> The main difference from doloop case is that regC can be used by some
>> instructions in loop body.
>> That's why we are unable to simply adjust regC initial value, but have
>> to keep its value correct on each particular iteration.
>> So, we change comparison instruction accordingly.
>>
>> An example:
>> int a[100];
>> int main()
>> {
>>  int i;
>>  for (i = 85; i > 12; i -= 5)
>>      a[i] = i * i;
>>  return a[15]-225;
>> }
>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> .L2:
>>        mul     r1, r3, r3
>>        sub     r3, r3, #5
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20
>>        bne     .L2
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>
>> Loop body is executed 15 times.
>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>> and following times:
>> Stage  Time       Insn
>> 0          5      mul     r1, r3, r3
>> 1         10     sub     r3, r3, #5
>> 1         11     cmp     r3, #10
>> 1         11     str     r1, [r2, #340]
>> 1         13     bne     .L2
>> 2         16     sub     r2, r2, #20
>>
>
> branch is not scheduled last?
>

Yes, the branch's schedule time is smaller than the sub's.
This means that the "sub r2, r2, #20" instruction from original iteration
number K will be executed after
"bne .L2" from original iteration number K.
But bne certainly remains the last instruction in the new loop body.
Below you can see how it looks after SMS.

>> To make new schedule correct the loop body
>> should be executed 14 times and we change compare instruction:
>
> the loop itself should execute 13 times.

with i =
85, 80, 75, 70, 65
60, 55, 50, 45, 40
35, 30, 25, 20, 15
this gives 15 iterations in total (15 stores to memory).
The new loop body will be executed 13 times (one store goes to the
epilogue and one to the prologue).
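
The arithmetic above can be checked with a small C sketch.  This is
not part of the patch; count_iterations and kernel_iterations are
hypothetical helper names used only for illustration, and the sketch
assumes a down-counting loop of the form shown in the example.

```c
#include <assert.h>

/* Count how many times the body of
     for (i = start; i > bound; i += step)
   runs.  Only down-counting loops (step < 0) are modelled here.  */
int
count_iterations (int start, int bound, int step)
{
  int i, n = 0;

  assert (step < 0);
  for (i = start; i > bound; i += step)
    n++;
  return n;
}

/* Number of kernel iterations left once stage_count - 1 iterations
   have been moved into the prologue and epilogue.  */
int
kernel_iterations (int total, int stage_count)
{
  assert (total >= stage_count);
  return total - (stage_count - 1);
}
```

With start = 85, bound = 12, step = -5 the first helper counts the 15
stores listed above, and with stage_count = 3 the second one gives the
13 kernel iterations.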

>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>> In our example regC is r3, X is 10, step = -5, compare instruction
>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>
>
> right. In general, if the compare is on stage s (starting from 0), it
> will be executed s times in the epilog, so it should exit the loop
> upon reaching Y = X - s * step.
>
>> So, after SMS it looks like:
>>        ldr     r0, .L5
>>        mov     r3, #85
>>        mov     r2, r0
>> ;;prologue
>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>> ;;body
>> .L2:
>>        sub     r3, r3, #5
>>        sub     r2, r2, #20
>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>        str     r1, [r2, #340]
>>        mul     r1, r3, r3
>>        bne     .L2
>> ;;epilogue
>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>        cmp     r3, #10
>>        str     r1, [r2, #340]
>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>
>>        ldr     r0, [r0, #60]
>>        sub     r0, r0, #225
>>        bx      lr
>> .L5:
>>        .word   a
>>

The comments here note why each insn was copied to the prologue or epilogue.
Only the branch is not copied at all.
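
To double-check the listing, here is a C model of the scheduled code.
It is only a sketch and not part of the patch: run_original,
run_pipelined and schedules_agree are hypothetical names, r1 and r3
mirror the ARM registers, and idx stands for the array element that
"str r1, [r2, #340]" writes (so "sub r2, r2, #20" becomes idx -= 5).

```c
#include <assert.h>
#include <string.h>

enum { N = 100 };

/* The original loop from the example.  */
void
run_original (int *a)
{
  int i;

  for (i = 85; i > 12; i -= 5)
    a[i] = i * i;
}

/* Model of the code after SMS.  Returns the number of times the
   kernel (the new loop body) is executed.  */
int
run_pipelined (int *a)
{
  int r3 = 85;		/* loop counter (regC) */
  int idx = 85;		/* element addressed by the store */
  int r1;
  int kernel_runs = 0;

  /* Prologue: stage 0 and stage 1 of the first iteration,
     stage 0 of the second one.  */
  r1 = r3 * r3;
  r3 -= 5;
  a[idx] = r1;
  r1 = r3 * r3;

  /* Kernel: compares against Y = 15 instead of X = 10.  */
  do
    {
      r3 -= 5;
      idx -= 5;
      a[idx] = r1;
      r1 = r3 * r3;
      kernel_runs++;
    }
  while (r3 != 15);

  /* Epilogue: remaining stages of the last two iterations.  */
  idx -= 5;
  r3 -= 5;
  assert (r3 == 10);	/* the original final value X is reached */
  a[idx] = r1;
  idx -= 5;

  return kernel_runs;
}

/* Nonzero if both versions leave the same contents in the array.  */
int
schedules_agree (void)
{
  int ref[N], sms[N];

  memset (ref, 0, sizeof ref);
  memset (sms, 0, sizeof sms);
  run_original (ref);
  run_pipelined (sms);
  return memcmp (ref, sms, sizeof ref) == 0;
}
```

In this model the kernel runs 13 times, and together with one store in
the prologue and one in the epilogue it performs the same 15 stores as
the original loop.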

>>> Testing of this approach reveals two bugs, which do not appear while SMS was
>>> used only for doloop loops.  Both these bugs happen due to the nature of the
>>> flag register.  On x86_64 it is clobbered by most of arithmetic instructions.
> This should ideally be solved by a dedicated (separate) patch.
> ...
> This too should be solved by a dedicated (separate) patch, for easier digestion.

As Ayal asks, I'll continue the discussion of these two bugs in two
separate e-mails, sent as replies to this message.

>>>
>>> One more thing to point out is number of loop iterations. When number of
>>> iterations of a loop is not known at compile time, SMS has to create two loop
>>> versions (original and scheduled), and execute scheduled one only when real
>>> number of iterations is bigger than number of stages.  In doloop case the
>>> number of iterations simply equals to the count register value before the loop.
>>> So SMS finds its constant initialization or makes two loop versions.  In new
>>> supported loops number of iterations value is more complex.  It even can't be
>>> calculated as (final_reg_value-start_reg_value)/step because of examples like
>>> this:
>>>
>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>
>>> This loop has 22 iterations.  So, i decided to use get_simple_loop_desc
>>> function which gives a structure with loop characteristics, some of them helps
>>> to find iteration number:
>>>
>>> rtx niter_expr - The number of iterations of the loop;
>>> bool const_iter - True if the loop iterates the constant number of times;
>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>
>>> But we can use these expressions only after looking through some other fields
>>> of returned structure:
>>>
>>> bool simple_p - True if we are able to say anything about number of iterations
>>> of the loop;
>>> rtx assumptions - Assumptions under that the rest of the information is valid;
>>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>>> the latch;
>>> rtx infinite - Condition under which the loop is infinite.
>>>
>>> I decide to allow SMS scheduling only when simple_p is true and other three
>>> fields are NULL_RTX, or when simple_p is true and
>>> flag_unsafe_loop_optimizations is set.  One more exception is infinite
>>> condition, and the next separate patch is an attempt to process it.
>>>
>
> ok, still need to go over this rather lengthy and orthogonal (although
> it exposes the bugs above) piece.
>
> Ayal.
>
>
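
For the wrap-around loop quoted above, the iteration count can be
confirmed by brute force.  The C sketch below is not part of the
patch; count_wrapping_iterations and naive_iterations are hypothetical
names, and the limit argument is only a guard against loops that never
terminate.

```c
#include <assert.h>
#include <stdint.h>

/* Count iterations of
     for (x = start; x != end; x += step)
   with 32-bit wrap-around, giving up after LIMIT iterations.  */
unsigned long
count_wrapping_iterations (uint32_t start, uint32_t end, uint32_t step,
			   unsigned long limit)
{
  unsigned long n = 0;
  uint32_t x;

  assert (step != 0);
  for (x = start; x != end && n < limit; x += step)
    n++;
  return n;
}

/* The naive formula (final - start) / step, in unsigned arithmetic.  */
uint32_t
naive_iterations (uint32_t start, uint32_t end, uint32_t step)
{
  assert (step != 0);
  return (uint32_t) (end - start) / step;
}
```

For start = 0x0, end = 0x6F80919A and step = 0xEDCBA987 the brute-force
count is 22, while the naive formula gives 0, which is why the
iteration count is taken from get_simple_loop_desc instead.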

A new version is attached; it applies to current trunk.
Without fixes for both bugs mentioned above, this patch breaks bootstrap on x86-64.

Together with the DDG fixes, the patch was successfully regtested on ARM
and "regstrapped" on x86-64 and IA64.

--
Roman Zhuykov
zhroma@ispras.ru

[-- Attachment #2: sms.patch --]
[-- Type: text/x-patch, Size: 26313 bytes --]

2011-12-07  Roman Zhuykov  <zhroma@ispras.ru>
	* modulo-sched.c (nondoloop_register_get): New function.
	(const_iteration_count): Rename to ...
	(search_const_init): ...this.  Add new parameter (is_const).  Always
	return register initialization rtx and set is_const to true
	only when it is constant.
	(duplicate_insns_of_cycles): Add new parameter (doloop_p).  Do not
	duplicate instructions with count_reg only when doloop_p is set.
	Update all callers.
	(generate_prolog_epilog): Add new parameters.  Correctly generate loop
	prologue for new loop pattern.
	(sms_schedule): Support new loop pattern.
---

diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 0ea9a4d..e62aca7 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -220,7 +220,8 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    rtx, bool, bool, rtx, HOST_WIDEST_INT,
+                                    bool, HOST_WIDEST_INT, rtx *);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -255,7 +256,7 @@ typedef struct node_sched_params node_sched_params;
 DEF_VEC_O (node_sched_params);
 DEF_VEC_ALLOC_O (node_sched_params, heap);
 \f
-/* The following three functions are copied from the current scheduler
+/* The following two functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
    They are used when initializing the sched_info structure.  */
 static const char *
@@ -398,37 +399,164 @@ doloop_register_get (rtx head ATTRIBUTE_UNUSED, rtx tail ATTRIBUTE_UNUSED)
 #endif
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
+/* Same as previous for loop with always-the-same-step counter.  */
 static rtx
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       HOST_WIDEST_INT * count)
+nondoloop_register_get (rtx head, rtx tail, int cmp_side,
+			rtx *addsub_output, rtx *cmp_output)
+{
+  rtx insn, reg, flagreg, addsub, cmp, end;
+
+  /* Check jump instruction form */
+  insn = single_set (tail);
+  if (insn == NULL_RTX
+      || SET_DEST (insn) != pc_rtx
+      || GET_CODE (SET_SRC (insn)) != IF_THEN_ELSE
+      || GET_CODE (XEXP (SET_SRC (insn), 1)) != LABEL_REF
+      || XEXP (SET_SRC (insn), 2) != pc_rtx)
+    return NULL_RTX;
+
+  /* Check loop exit condition */
+  insn = XEXP (SET_SRC (insn), 0);
+  if (GET_CODE (insn) != NE || XEXP (insn, 1) != const0_rtx)
+    return NULL_RTX;
+
+  /* Flags register */
+  flagreg = XEXP (insn, 0);
+
+  /* Searching comparison instruction */
+  cmp = PREV_INSN (tail);
+  while (cmp != PREV_INSN (head))
+    {
+      if (INSN_P (cmp) && reg_set_p (flagreg, cmp))
+        break;
+      cmp = PREV_INSN (cmp);
+    }
+  if (cmp == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Check comparison */
+  insn = single_set (cmp);
+  if (insn == NULL_RTX
+      || ! rtx_equal_p (flagreg, SET_DEST (insn))
+      || GET_CODE (SET_SRC (insn)) != COMPARE)
+    return NULL_RTX;
+
+  /* Loop register */
+  gcc_assert (0 <= cmp_side && cmp_side <= 1);
+  reg = XEXP (SET_SRC (insn), cmp_side);
+  if (! REG_P (reg))
+    return NULL_RTX;
+
+  /* End value */
+  end = XEXP (SET_SRC (insn), 1 - cmp_side);
+  if (! REG_P (end) && ! CONST_INT_P (end))
+    return NULL_RTX;
+
+  /* Searching register add/sub instruction */
+  addsub = PREV_INSN (cmp);
+  while (addsub != PREV_INSN (head))
+    {
+      if (INSN_P (addsub) && reg_set_p (reg, addsub))
+        break;
+      addsub = PREV_INSN (addsub);
+    }
+  if (addsub == PREV_INSN (head))
+    return NULL_RTX;
+
+  /* Checking register change instruction */
+  insn = single_set (addsub);
+  if (insn == NULL_RTX || ! rtx_equal_p (reg, SET_DEST (insn)))
+    return NULL_RTX;
+  insn = SET_SRC (insn);
+  if ((GET_CODE (insn) != PLUS && GET_CODE (insn) != MINUS)
+      || ! rtx_equal_p (reg, XEXP (insn, 0))
+      || ! (CONST_INT_P (XEXP (insn, 1))))
+    return NULL_RTX;
+
+  /* No other REG and END (if reg) modifications allowed */
+  for (insn = head; insn != tail; insn = NEXT_INSN (insn))
+    {
+      if (REG_P(end) && reg_set_p (end, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS end register found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+	  return NULL_RTX;
+	}
+      if (insn != addsub && reg_set_p (reg, insn))
+        {
+          if (dump_file)
+          {
+            fprintf (dump_file, "SMS count_reg found ");
+            print_rtl_single (dump_file, reg);
+            fprintf (dump_file, " outside write in insn:\n");
+            print_rtl_single (dump_file, insn);
+          }
+          return NULL_RTX;
+        }
+    }
+
+  *addsub_output = addsub;
+  *cmp_output = cmp;
+  return reg;
+}
+
+/* Check if REG is set to a constant in the PRE_HEADER block.
+   If possible to find, return the rtx that sets REG.
+   If REG is set to a constant (probably not directly),
+   set IS_CONST to true and VALUE to that constant value.  */
+static rtx
+search_const_init (basic_block pre_header, rtx reg, bool *is_const,
+		   HOST_WIDEST_INT *value)
 {
   rtx insn;
   rtx head, tail;
 
-  if (! pre_header)
-    return NULL_RTX;
+  if (!pre_header)
+    {
+      *is_const = false;
+      return NULL_RTX;
+    }
 
   get_ebb_head_tail (pre_header, pre_header, &head, &tail);
 
   for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
     if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
+	rtx_equal_p (reg, SET_DEST (single_set (insn))))
       {
-	rtx pat = single_set (insn);
+	rtx src, pat = single_set (insn);
+	src = SET_SRC (pat);
 
-	if (CONST_INT_P (SET_SRC (pat)))
+	if (CONST_INT_P (src))
 	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
+	    *is_const = true;
+	    *value = INTVAL (src);
+	  }
+	else if (REG_P (src))
+	  { /* Check if previous insn sets SRC = constant.  */
+	    pat = single_set (PREV_INSN (insn));
+	    if (pat != NULL_RTX && rtx_equal_p (src, SET_DEST (pat))
+		&& CONST_INT_P (SET_SRC (pat)))
+	      {
+		*is_const = true;
+		*value = INTVAL (SET_SRC (pat));
+	      }
+	    else
+		*is_const = false;
 	  }
+	else
+	  *is_const = false;
 
-	return NULL_RTX;
+	return insn;
       }
+    else if (reg_set_p (reg, insn))
+      break;
 
+  *is_const = false;
   return NULL_RTX;
 }
 
@@ -1103,7 +1231,7 @@ clear:
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, rtx count_reg)
+			   int to_stage, rtx count_reg, bool doloop_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1115,14 +1243,14 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int first_u, last_u;
 	rtx u_insn;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
+        /* In doloop case do not duplicate any insn which refers
+	   to count_reg as it belongs to the control part.
            The closing branch is scheduled as well and thus should
            be ignored.
            TODO: This should be done by analyzing the control part of
            the loop.  */
 	u_insn = ps_rtl_insn (ps, u);
-        if (reg_mentioned_p (count_reg, u_insn)
+        if ((doloop_p && reg_mentioned_p (count_reg, u_insn))
             || JUMP_P (u_insn))
           continue;
 
@@ -1142,7 +1270,10 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        rtx count_reg, bool doloop_p, bool count_init_isconst,
+			rtx fin_reg, HOST_WIDEST_INT fin_nonconst_adjust,
+			bool create_reg, HOST_WIDEST_INT reg_val,
+			rtx *created_reg)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
@@ -1151,12 +1282,12 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
+  if (doloop_p && !count_init_isconst)
     {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
+      /* In doloop we generate instructions at the beginning of the prolog to
+         adjust the initial value of doloop counter by STAGE_COUNT.
+	 If loop count is constant, this adjustment is done outside this
+         function, simply correcting the source of initialization insn.  */
       rtx sub_reg = NULL_RTX;
 
       sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS,
@@ -1167,8 +1298,40 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
         emit_move_insn (count_reg, sub_reg);
     }
 
+  if (!doloop_p)
+    {
+      /* In non-doloop we generate instructions at the beginning of
+         the prolog to adjust the final value (with this value loop count
+	 register is compared to check whether the loop should stop).  */
+      if (fin_nonconst_adjust != 0)
+	{
+	  /* If the final value is in a register - create another register
+	     to store a shifted value.  */
+	  rtx new_reg, reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (fin_reg));
+	  new_reg = expand_simple_binop (GET_MODE (fin_reg), MINUS, fin_reg,
+					 GEN_INT (fin_nonconst_adjust),
+					 reg, 0, OPTAB_DIRECT);
+	  gcc_assert (REG_P (new_reg));
+	  if (REGNO (new_reg) != REGNO (reg))
+	    emit_move_insn (reg, new_reg);
+	  *created_reg = new_reg;
+	}
+      else if (create_reg)
+	{
+	  /* If old final value is an immediate, and the new one can't be
+	     an immediate, we create a register to store it.  If both values
+	     are immediate the adjustment is done outside this function,
+	     just correcting the constant value in compare instruction.  */
+	  rtx reg = NULL_RTX;
+	  reg = gen_reg_rtx (GET_MODE (count_reg));
+	  emit_move_insn (reg, GEN_INT (reg_val));
+	  *created_reg = reg;
+	}
+    }
+
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg, doloop_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1182,7 +1345,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg, doloop_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1460,13 +1623,30 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
+      /* Is this a doloop?  */
+      if ((count_reg = doloop_register_get (head, tail)))
+        {
+	  if (dump_file)
+	    fprintf (dump_file, "SMS doloop\n");
+        }
+      else if ((count_reg = nondoloop_register_get (head, tail, 0,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop\n");
+	}
+      else if ((count_reg = nondoloop_register_get (head, tail, 1,
+						    &insn, &insn)))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS non-doloop with transposed cmp\n");
+	}
+      else
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS incompatible loop\n");
+	  continue;
+	}
 
       /* Don't handle BBs with calls or barriers
 	 or !single_set with the exception of instructions that include
@@ -1516,7 +1696,6 @@ sms_schedule (void)
 	    fprintf (dump_file, "SMS create_ddg failed\n");
 	  continue;
         }
-
       g_arr[loop->num] = g;
       if (dump_file)
         fprintf (dump_file, "...OK\n");
@@ -1528,14 +1707,28 @@ sms_schedule (void)
     fprintf (dump_file, "=========================\n\n");
   }
 
+  df_clear_flags (DF_LR_RUN_DCE);
+
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (li, loop, 0)
     {
+      bool doloop_p, count_fin_isconst, count_init_isconst;
+      bool was_immediate = false;
+      bool prolog_create_reg = false;
+      int prolog_fin_nonconst_adjust = 0;
+      bool nonsimple_loop = false;
       rtx head, tail;
-      rtx count_reg, count_init;
-      int mii, rec_mii, stage_count, min_cycle;
-      HOST_WIDEST_INT loop_count = 0;
+      int min_cycle;
       bool opt_sc_p;
+      rtx count_reg, count_fin_reg, new_comp_reg = NULL_RTX;
+      rtx count_init_insn, count_fin_init_insn;
+      rtx add, cmp;
+      int mii, rec_mii, cmp_side = -1, cmp_stage = -1;
+      int stage_count = 0;
+      HOST_WIDEST_INT count_init_val = 0, count_fin_val = 0;
+      HOST_WIDEST_INT count_step = 0, loop_count = -1;
+      HOST_WIDEST_INT count_fin_newval = 0;
+      struct niter_desc *desc = NULL;
 
       if (! (g = g_arr[loop->num]))
         continue;
@@ -1573,32 +1766,159 @@ sms_schedule (void)
 	               (HOST_WIDEST_INT) profile_info->sum_max);
 	      fprintf (dump_file, "\n");
 	    }
-	  fprintf (dump_file, "SMS doloop\n");
 	  fprintf (dump_file, "SMS built-ddg %d\n", g->num_nodes);
           fprintf (dump_file, "SMS num-loads %d\n", g->num_loads);
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
 
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL_RTX;
-      if ((count_reg = doloop_register_get (head, tail)))
+      /* Extract count register and determine loop type.  */
+      add = NULL_RTX;
+      cmp = NULL_RTX;
+      if ((count_reg = doloop_register_get (head, tail))
+	  || (count_reg = nondoloop_register_get (head, tail, 0, &add, &cmp))
+	  || (count_reg = nondoloop_register_get (head, tail, 1, &add, &cmp)))
 	{
-	  basic_block pre_header;
+	  basic_block pre_header = loop_preheader_edge (loop)->src;
+
+	  doloop_p = (cmp == NULL_RTX);
+	  if (doloop_p)
+	    {
+	      /* Doloop finish parameters are always the same.  */
+	      count_step = -1;
+	      count_fin_isconst = true;
+	      count_fin_val = 0;
+	      count_fin_reg = NULL_RTX;
+	      count_fin_init_insn = NULL_RTX;
+	    }
+	  else
+	    {
+	      /* In other loop we need to determine counter step
+	         and finish parameters.  */
+	      rtx step, end;
+
+	      gcc_assert (single_set (add) && single_set (cmp));
+
+	      /* Extract the step.  */
+	      step = XEXP (SET_SRC (single_set (add)), 1);
+	      gcc_assert (CONST_INT_P (step));
+
+	      if (GET_CODE (SET_SRC (single_set (add))) == MINUS)
+	        count_step = - INTVAL (step);
+	      else if (GET_CODE (SET_SRC (single_set (add))) == PLUS)
+	        count_step = INTVAL (step);
+	      else
+		gcc_unreachable ();
+
+	      gcc_assert(count_step != 0);
+
+	      /* Check what operand of compare insn is a counter register.  */
+	      if (count_reg == XEXP (SET_SRC (single_set (cmp)), 0))
+		cmp_side = 0;
+	      else if (count_reg == XEXP (SET_SRC (single_set (cmp)), 1))
+		cmp_side = 1;
+	      else
+		gcc_unreachable ();
+
+	      /* Extract finish border for counter reg.  */
+	      end = XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side);
 
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
+	      if (CONST_INT_P (end))
+		{
+		  /* Constant finish border.  loop until (reg != const).  */
+		  count_fin_isconst = true;
+		  count_fin_val = INTVAL (end);
+		  count_fin_reg = NULL_RTX;
+		  count_fin_init_insn = NULL_RTX;
+		}
+	      else if (REG_P (end))
+		{
+		  /* Register is a border.  Loop until (reg != fin_reg).  */
+		  count_fin_reg = end;
+		  count_fin_isconst = false;
+		  /* Try to find constant initialization of fin_reg
+		     in preheader.  */
+		  count_fin_init_insn = search_const_init (pre_header,
+							   count_fin_reg,
+							   &count_fin_isconst,
+							   &count_fin_val);
+		}
+	      else
+		gcc_unreachable ();
+	    }
+	  /* Try to find a constant initialization of count_reg in preheader.  */
+	  count_init_insn = search_const_init (pre_header,
+					       count_reg,
+					       &count_init_isconst,
+					       &count_init_val);
+	}
+      else /* Loop is incompatible now, but it was OK while analyzing!  */
+	gcc_assert (count_reg);
+
+
+      desc = get_simple_loop_desc (loop);
+      gcc_assert (desc);
+      /* nonsimple_loop means it's impossible to analyze the loop
+         or there are some assumptions to make the analysis results right
+         or there is a condition of non-infinite number of iterations.
+         We want doloops to be scheduled even if analysis shows they are
+	 nonsimple (backward compatibility).  */
+      nonsimple_loop = !desc->simple_p;
+      /* We allow scheduling loop with some assumptions or infinite condition
+	 only when unsafe_loop_optimizations flag is enabled.  */
+      if (flag_unsafe_loop_optimizations)
+	 {
+	   desc->infinite = NULL_RTX;
+	   desc->assumptions = NULL_RTX;
+	   desc->noloop_assumptions = NULL_RTX;
+	 }
+      nonsimple_loop = nonsimple_loop || (desc->assumptions != NULL_RTX)
+			|| (desc->noloop_assumptions != NULL_RTX)
+			|| (desc->infinite != NULL_RTX);
+      /* Only doloops can be nonsimple_loops for SMS.  */
+      if (nonsimple_loop && !doloop_p)
+	{
+	  free_ddg (g);
+	  continue;
+	}
+      /* Manually set some description fields in non-simple doloop.  */
+      if (nonsimple_loop)
+	{
+	  gcc_assert(doloop_p);
+	  desc->const_iter = false;
+	  desc->infinite = NULL_RTX;
 	}
-      gcc_assert (count_reg);
 
-      if (dump_file && count_init)
+      if (desc->const_iter)
+	{
+	  gcc_assert (!desc->infinite);
+	  loop_count = desc->niter;
+	  if (dump_file)
+	    fprintf (dump_file, "SMS const loop iterations = "
+		     HOST_WIDEST_INT_PRINT_DEC "\n", loop_count);
+	}
+      if (count_init_isconst && count_fin_isconst)
         {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC,
-		     loop_count);
-          fprintf (dump_file, "\n");
+	  gcc_assert (doloop_p || desc->const_iter);
+	  if (doloop_p)
+	    {
+	      if (nonsimple_loop)
+		{
+	          loop_count = count_init_val;
+		  desc->const_iter = true;
+		}
+              gcc_assert (desc->const_iter && loop_count == count_init_val);
+	    }
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "SMS const-%s ",
+		       doloop_p ? "doloop" : "loop");
+	      fprintf (dump_file, HOST_WIDEST_INT_PRINT_DEC " to "
+		       HOST_WIDEST_INT_PRINT_DEC " step "
+		       HOST_WIDEST_INT_PRINT_DEC,
+		       count_init_val, count_fin_val, count_step);
+	      fprintf (dump_file, "\n");
+	    }
         }
 
       node_order = XNEWVEC (int, g->num_nodes);
@@ -1649,7 +1969,7 @@ sms_schedule (void)
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
 	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
-	      || (count_init && (loop_count <= stage_count))
+	      || (desc->const_iter && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
 	      if (dump_file)
@@ -1709,23 +2029,24 @@ sms_schedule (void)
 	      print_partial_schedule (ps, dump_file);
 	    }
  
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_fmt_ee (GT, VOIDmode, count_reg,
-	  				     GEN_INT(stage_count));
-	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
-			       * REG_BR_PROB_BASE) / 100;
-
-	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
-			    true);
-	     }
+	  if (!desc->const_iter)
+	    {
+	      /* Loop versioning if the number of iterations is unknown.  */
+	      unsigned prob;
+	      rtx vers_cond;
+	      vers_cond = gen_rtx_fmt_ee (GT, VOIDmode, nonsimple_loop ?
+					  count_reg : desc->niter_expr,
+					  GEN_INT (stage_count));
+	      if (dump_file)
+		{
+		  fprintf (dump_file, "\nLoop versioning condition:\n");
+		  print_rtl_single (dump_file, vers_cond);
+		}
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	      prob = (PROB_SMS_ENOUGH_ITERATIONS * REG_BR_PROB_BASE) / 100;
+	      loop_version (loop, vers_cond, &condition_bb, prob,
+			    prob, REG_BR_PROB_BASE - prob, true);
+	    }
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1741,8 +2062,121 @@ sms_schedule (void)
 	  apply_reg_moves (ps);
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, ps);
-	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+
+	  if (doloop_p && count_init_isconst)
+	    {
+	      /* Change counter reg initialization constant. In more complex
+	         cases this adjustment is done with adding some insns
+		 to loop prologue in generate_prolog_epilog function.  */
+	      gcc_assert (single_set (count_init_insn) != NULL_RTX);
+	      SET_SRC (single_set (count_init_insn))
+		    = GEN_INT (count_init_val - stage_count + 1);
+	      df_insn_rescan (count_init_insn);
+	    }
+
+	  if (!doloop_p)
+	    {
+	      /* Calculation of the compare insn stage in schedule.  */
+	      ps_insn_ptr crr_insn;
+	      int row, stage;
+	      cmp_stage = -1;
+	      for (row = 0; row < ps->ii; row++)
+		for (crr_insn = ps->rows[row];
+		     crr_insn;
+		     crr_insn = crr_insn->next_in_row)
+		  {
+		    stage = SCHED_STAGE (crr_insn->id);
+		    gcc_assert (0 <= stage && stage < stage_count);
+		    if (rtx_equal_p (ps_rtl_insn (ps, crr_insn->id), cmp))
+		      {
+			gcc_assert (cmp_stage == -1);
+		        cmp_stage = stage;
+		      }
+		  }
+              if (dump_file)
+		fprintf (dump_file, "cmp_stage=%d\n", cmp_stage);
+	      gcc_assert (cmp_stage >= 0);
+	    }
+
+	  /* When compare insn stage is non-zero we are to shift the final
+	     counter reg value (which counter is compared to exit loop).
+	     Final value can be an immediate or can be a register, which
+	     constant initialization we find in preheader.  */
+	  was_immediate = false;
+	  if (!doloop_p && count_fin_isconst && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      /* New finish value.  */
+	      count_fin_newval = count_fin_val - count_step * cmp_stage;
+	      was_immediate = CONST_INT_P (XEXP (SET_SRC (single_set (cmp)),
+							  1 - cmp_side));
+	      if (was_immediate)
+		{
+		  /* Check whether new value also can be an immediate.
+		     For example, on ARM not all values can be encoded as
+		     an immediate, so we have to load it to a register once
+		     before the loop starts.  */
+		  rtx to = GEN_INT (count_fin_newval);
+		  prolog_create_reg = rtx_cost (to, GET_CODE (to), 0, false)
+			    > rtx_cost (GEN_INT(1), CONST_INT, 0, false);
+	        }
+	      else
+		{
+		  /* A value is already in a register and we easily change
+		     initialization instruction in preheader.  */
+		  gcc_assert (count_fin_init_insn);
+		  SET_SRC (single_set (count_fin_init_insn))
+			= GEN_INT (count_fin_newval);
+		  df_insn_rescan (count_fin_init_insn);
+		}
+	    }
+
+	  /* The adjustment of finish register value.
+	     Zero means no adjustment needed or adjustment is done
+	     without additional insn in prologue.  */
+	  if (!doloop_p && !count_fin_isconst)
+	    prolog_fin_nonconst_adjust = count_step * cmp_stage;
+
+	  /* Ready to generate prolog and epilog.  */
+	  generate_prolog_epilog (ps, loop, count_reg, doloop_p,
+			          count_init_isconst, count_fin_reg,
+				  prolog_fin_nonconst_adjust,
+				  prolog_create_reg, count_fin_newval,
+				  &new_comp_reg);
+
+	  /* And only after generating prolog and epilog it is possible
+	     to modify the compare instruction (to prevent copying wrong insn
+	     form to first and last stages).  */
+	  if (!doloop_p && cmp_stage > 0)
+	    {
+              gcc_assert (0 <= cmp_side && cmp_side <= 1);
+	      if (was_immediate && !prolog_create_reg)
+		{
+		/* Easy case - just modify a constant.  */
+		  gcc_assert (new_comp_reg == NULL_RTX);
+		  XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			= GEN_INT (count_fin_newval);
+		}
+	      else
+		{
+		  if (count_fin_isconst && !was_immediate)
+		    /* Value is in a register and we already changed
+		       initialization instruction in preheader.  */
+		    gcc_assert (new_comp_reg == NULL_RTX);
+		  else
+		    {
+		      /* Another case - use created by generate_prolog_epilog
+		         register, which value is initialized in prologue.  */
+		      gcc_assert (new_comp_reg != NULL_RTX);
+		      XEXP (SET_SRC (single_set (cmp)), 1 - cmp_side)
+			      = new_comp_reg;
+		    }
+		}
+	      df_insn_rescan (cmp);
+	    }
+	  else
+	    gcc_assert (new_comp_reg == NULL_RTX);
+
 	  break;
 	}
 
@@ -1752,7 +2186,9 @@ sms_schedule (void)
       free_ddg (g);
     }
 
+  df_set_flags (DF_LR_RUN_DCE);
   free (g_arr);
+  iv_analysis_done ();
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2011-12-07 14:42       ` Roman Zhuykov
@ 2011-12-29 15:43         ` Roman Zhuykov
  2012-02-10 12:27           ` Roman Zhuykov
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2011-12-29 15:43 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches, dm

Ping.
Ayal, could you review this patch and these two patches too?
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html

Happy holidays.

2011/12/7 Roman Zhuykov <zhroma@ispras.ru>:
> Apologies for the messed up previous e-mail.
>
> 2011/10/12 Ayal Zaks <ayal.zaks@gmail.com>:
>>>> - the last jump instruction should look like:  pc=(regF!=0)?label:pc, regF is
>>
>> you'd probably want to bump to next instruction if falling through,
>> e.g., pc=(regF!=0)?label:pc+4
>>
>
> The program counter is assumed to be incremented automatically at the
> hardware level.
> Otherwise we would have to add something like "pc=pc+4" in parallel to
> each instruction in RTL.
>
>>>>  flag register;
>>>> - the last instruction which sets regF should be: regF=COMPARE(regC,X), where X
>>>>  is a constant, or maybe a register, which is not changed inside a loop;
>>>> - only one instruction modifies regC inside a loop (other can use regC, but not
>>>>  write), and it should simply adjust it by a constant: regC=regC+step, where
>>>>  step is a constant.
>>>
>>>> When a doloop is successfully scheduled by SMS, the number of
>>>> iterations of the loop kernel should be decreased by the number of stages
>>>> in the schedule minus one, while the other iterations expand into the
>>>> prologue and epilogue.
>>>> In the newly supported loops this approach can't be used, because some
>>>> instructions can use the count register (regC).  Instead,
>>>> the final register value X in the compare instruction regF=COMPARE(regC,X)
>>>> is changed to another value Y according to the stage in which this
>>>> instruction is scheduled (Y = X - stage * step).
>>
>> making sure this does not underflow; i.e., that the number of
>> iterations is no less than stage (you've addressed this towards the
>> end below).
>>
>
> Yes, this situation is processed correctly.
>
>>>
>>> The main difference from the doloop case is that regC can be used by some
>>> instructions in the loop body.
>>> That's why we are unable to simply adjust the initial value of regC, but
>>> have to keep its value correct on each particular iteration.
>>> So, we change the comparison instruction accordingly.
>>>
>>> An example:
>>> int a[100];
>>> int main()
>>> {
>>>  int i;
>>>  for (i = 85; i > 12; i -= 5)
>>>      a[i] = i * i;
>>>  return a[15]-225;
>>> }
>>> ARM assembler with "-O2 -fno-auto-inc-dec":
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> .L2:
>>>        mul     r1, r3, r3
>>>        sub     r3, r3, #5
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20
>>>        bne     .L2
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>>> Loop body is executed 15 times.
>>> When compiling with SMS, it finds a schedule with ii=7, stage_count=3
>>> and following times:
>>> Stage  Time       Insn
>>> 0          5      mul     r1, r3, r3
>>> 1         10     sub     r3, r3, #5
>>> 1         11     cmp     r3, #10
>>> 1         11     str     r1, [r2, #340]
>>> 1         13     bne     .L2
>>> 2         16     sub     r2, r2, #20
>>>
>>
>> branch is not scheduled last?
>>
>
> Yes, the branch's schedule time is smaller than that of the second sub.
> This means that the "sub r2, r2, #20" instruction from original iteration
> number K will be executed after
> "bne .L2" from original iteration number K.
> But bne certainly remains the last instruction in the new loop body.
> Below you can see how it looks after SMS.
>
>>> To make new schedule correct the loop body
>>> should be executed 14 times and we change compare instruction:
>>
>> the loop itself should execute 13 times.
>
> with i =
> 85, 80, 75, 70, 65
> 60, 55, 50, 45, 40
> 35, 30, 25, 20, 15
> this gives total 15 iterations (15 stores to memory).
> And new loop body will be executed 13 times (one store goes to
> epilogue and one - to prologue).
>
>>> regF=COMPARE(regC,X) to regF=COMPARE(regC,Y) where Y = X - stage * step.
>>> In our example regC is r3, X is 10, step = -5, compare instruction
>>> is scheduled on stage 1, so it should be Y = 10 - 1 * (-5) = 15.
>>>
>>
>> right. In general, if the compare is on stage s (starting from 0), it
>> will be executed s times in the epilog, so it should exit the loop
>> upon reaching Y = X - s * step.
>>
>>> So, after SMS it looks like:
>>>        ldr     r0, .L5
>>>        mov     r3, #85
>>>        mov     r2, r0
>>> ;;prologue
>>>        mul     r1, r3, r3      ;;from stage 0 first iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 first iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3      ;;from stage 0 second iteration
>>> ;;body
>>> .L2:
>>>        sub     r3, r3, #5
>>>        sub     r2, r2, #20
>>>        cmp     r3, #15         ;; new value to compare with is Y=15
>>>        str     r1, [r2, #340]
>>>        mul     r1, r3, r3
>>>        bne     .L2
>>> ;;epilogue
>>>        sub     r2, r2, #20     ;;from stage 2 pre-last iteration
>>>        sub     r3, r3, #5      ;;3 insns from stage 1 last iteration
>>>        cmp     r3, #10
>>>        str     r1, [r2, #340]
>>>        sub     r2, r2, #20     ;;from stage 2 last iteration
>>>
>>>        ldr     r0, [r0, #60]
>>>        sub     r0, r0, #225
>>>        bx      lr
>>> .L5:
>>>        .word   a
>>>
>
> In the comments here I note why each insn was copied to the prologue or
> epilogue.  Only the branch is not copied at all.
>
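The arithmetic behind the example above can be checked with a minimal sketch.
This is only an illustration, not code from the patch: the helper names
shifted_final_value and kernel_iterations are hypothetical.  It computes the
shifted compare value Y = X - stage * step and the kernel trip count
niter - stage_count + 1, assuming the loop runs more times than there are
stages (otherwise SMS gives up on the loop).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value the kernel's compare insn should test
   against when the compare is scheduled in stage CMP_STAGE,
   i.e. Y = X - stage * step.  */
int64_t
shifted_final_value (int64_t x, int64_t step, int cmp_stage)
{
  return x - step * (int64_t) cmp_stage;
}

/* Hypothetical helper: how many times the scheduled kernel runs.
   STAGE_COUNT - 1 of the original NITER iterations are peeled into the
   prologue and epilogue.  The assert mirrors the requirement that the
   loop iterates more often than there are stages.  */
int64_t
kernel_iterations (int64_t niter, int stage_count)
{
  assert (stage_count >= 1 && niter > stage_count);
  return niter - stage_count + 1;
}
```

For the ARM loop above (X = 10, step = -5, compare in stage 1, 15 original
iterations, stage_count = 3) these give Y = 15 and 13 kernel iterations,
matching the numbers in the discussion.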
>>>> Testing of this approach revealed two bugs, which did not appear while SMS
>>>> was used only for doloop loops.  Both bugs are caused by the nature of the
>>>> flag register.  On x86_64 it is clobbered by most arithmetic instructions.
>> This should ideally be solved by a dedicated (separate) patch.
>> ...
>> This too should be solved by a dedicated (separate) patch, for easier digestion.
>
> As Ayal asks, I'll continue the discussion of these two bugs in two
> separate e-mails, replying to this message.
>
>>>>
>>>> One more thing to point out is the number of loop iterations.  When the
>>>> number of iterations of a loop is not known at compile time, SMS has to
>>>> create two loop versions (original and scheduled), and execute the
>>>> scheduled one only when the real number of iterations is bigger than the
>>>> number of stages.  In the doloop case the number of iterations simply
>>>> equals the count register value before the loop.
>>>> So SMS finds its constant initialization or makes two loop versions.  In
>>>> the newly supported loops the number of iterations is more complex.  It
>>>> can't even be calculated as (final_reg_value-start_reg_value)/step
>>>> because of examples like this:
>>>>
>>>> for (unsigned int x = 0x0; x != 0x6F80919A; x += 0xEDCBA987) ...;
>>>>
>>>> This loop has 22 iterations.  So, I decided to use the get_simple_loop_desc
>>>> function, which returns a structure with loop characteristics, some of
>>>> which help to find the number of iterations:
>>>>
>>>> rtx niter_expr - The number of iterations of the loop;
>>>> bool const_iter - True if the loop iterates the constant number of times;
>>>> unsigned HOST_WIDEST_INT niter - Number of iterations if constant;
>>>>
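The wraparound loop quoted above can be checked with a small stand-alone
sketch.  It is not part of the patch and does not use any GCC API: both
helper names are hypothetical.  One helper counts iterations by brute force
in 32-bit unsigned arithmetic, the other applies the naive
(final - start) / step formula to show why that formula is not usable here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count iterations of
     for (x = init; x != fin; x += step)
   in 32-bit unsigned arithmetic, where x wraps around modulo 2^32.
   Stops and returns LIMIT if the exit value is not reached within
   LIMIT iterations.  */
uint64_t
count_wrapping_iterations (uint32_t init, uint32_t fin, uint32_t step,
			   uint64_t limit)
{
  uint64_t n = 0;
  uint32_t x = init;

  assert (step != 0);
  while (x != fin && n < limit)
    {
      x += step;	/* Unsigned overflow wraps; this is well defined.  */
      n++;
    }
  return n;
}

/* Hypothetical helper: the naive formula, which ignores wraparound.  */
uint32_t
naive_iterations (uint32_t init, uint32_t fin, uint32_t step)
{
  return (fin - init) / step;
}
```

With init = 0, fin = 0x6F80919A and step = 0xEDCBA987 the brute-force count
is 22, while the naive division yields 0.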
>>>> But we can use these expressions only after looking through some other fields
>>>> of returned structure:
>>>>
>>>> bool simple_p - True if we are able to say anything about number of iterations
>>>> of the loop;
>>>> rtx assumptions - Assumptions under that the rest of the information is valid;
>>>> rtx noloop_assumptions - Assumptions under which the loop ends before reaching
>>>> the latch;
>>>> rtx infinite - Condition under which the loop is infinite.
>>>>
>>>> I decided to allow SMS scheduling only when simple_p is true and the other
>>>> three fields are NULL_RTX, or when simple_p is true and
>>>> flag_unsafe_loop_optimizations is set.  One more exception is the infinite
>>>> condition, and the next separate patch is an attempt to process it.
>>>>
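The acceptance rule described above can be written down as a small predicate.
This is a sketch under stated assumptions, not the code from the patch:
struct loop_desc_sketch is a hypothetical stand-in that keeps only the four
fields named in the text (with plain pointers in place of rtx, so the sketch
compiles on its own), and sms_may_schedule is a made-up name.  It does not
model the separate handling of the infinite condition from the follow-up
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the loop description fields discussed above.
   A null pointer plays the role of NULL_RTX.  */
struct loop_desc_sketch
{
  bool simple_p;
  const void *assumptions;
  const void *noloop_assumptions;
  const void *infinite;
};

/* Return true if the loop may be handed to SMS: simple_p must hold, and
   either all three condition fields are empty or the user asked for
   unsafe loop optimizations (UNSAFE_LOOP_OPTS).  */
bool
sms_may_schedule (const struct loop_desc_sketch *desc, bool unsafe_loop_opts)
{
  assert (desc != NULL);
  if (!desc->simple_p)
    return false;
  if (unsafe_loop_opts)
    return true;
  return (desc->assumptions == NULL
	  && desc->noloop_assumptions == NULL
	  && desc->infinite == NULL);
}
```

A loop with any non-empty assumption is rejected unless the unsafe flag is
set, and a loop that is not simple is always rejected.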
>>
>> ok, still need to go over this rather lengthy and orthogonal (although
>> it exposes the bugs above) piece.
>>
>> Ayal.
>>
>>
>
> New version is attached; it applies to current trunk.
> Without fixing both bugs mentioned above, this patch breaks bootstrap on x86-64.
>
> Together with the DDG fixes the patch was successfully regtested on ARM,
> and "regstrapped" on x86-64 and IA64.
>
> --
> Roman Zhuykov
> zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] [ARM] Remove artificial doloop_end pattern
  2011-07-21 17:09 ` [PATCH 9/9] [ARM] Remove artificial doloop_end pattern zhroma
@ 2012-01-04 16:47   ` Richard Earnshaw
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Earnshaw @ 2012-01-04 16:47 UTC (permalink / raw)
  To: zhroma; +Cc: gcc-patches, dm

On 21/07/11 17:30, zhroma@ispras.ru wrote:
> This patch eliminates the fake doloop_end pattern for the ARM platform.  The
> problem with such a pattern is that it slows down the loop when SMS doesn't
> create a good schedule.  So, I suppose the fake pattern is no longer needed
> with the new loop forms supported.
> 
> 2011-07-20  Roman Zhuykov  <zhroma@ispras.ru>
> 	* config/arm/thumb2.md (doloop_end): Delete.

I have no objections to this patch, but committing it needs to be
co-ordinated with the other SMS changes that are being discussed.
Deleting it today will, I think, cause SMS to be disabled and I don't
want that to happen.

R.

> ---
>  gcc/config/arm/thumb2.md |   51 ----------------------------------------------
>  1 files changed, 0 insertions(+), 51 deletions(-)
> 
> diff --git a/gcc/config/arm/thumb2.md b/gcc/config/arm/thumb2.md
> index 9a11012..492e765 100644
> --- a/gcc/config/arm/thumb2.md
> +++ b/gcc/config/arm/thumb2.md
> @@ -1101,54 +1101,3 @@
>    operands[2] = GEN_INT (32 - INTVAL (operands[2]));
>    ")
>  
> -;; Define the subtract-one-and-jump insns so loop.c
> -;; knows what to generate.
> -(define_expand "doloop_end"
> -  [(use (match_operand 0 "" ""))      ; loop pseudo
> -   (use (match_operand 1 "" ""))      ; iterations; zero if unknown
> -   (use (match_operand 2 "" ""))      ; max iterations
> -   (use (match_operand 3 "" ""))      ; loop level
> -   (use (match_operand 4 "" ""))]     ; label
> -  "TARGET_32BIT"
> -  "
> - {
> -   /* Currently SMS relies on the do-loop pattern to recognize loops
> -      where (1) the control part consists of all insns defining and/or
> -      using a certain 'count' register and (2) the loop count can be
> -      adjusted by modifying this register prior to the loop.
> -      ??? The possible introduction of a new block to initialize the
> -      new IV can potentially affect branch optimizations.  */
> -   if (optimize > 0 && flag_modulo_sched)
> -   {
> -     rtx s0;
> -     rtx bcomp;
> -     rtx loc_ref;
> -     rtx cc_reg;
> -     rtx insn;
> -     rtx cmp;
> -
> -     /* Only use this on innermost loops.  */
> -     if (INTVAL (operands[3]) > 1)
> -       FAIL;
> -
> -     if (GET_MODE (operands[0]) != SImode)
> -       FAIL;
> -
> -     s0 = operands [0];
> -     if (TARGET_THUMB2)
> -       insn = emit_insn (gen_thumb2_addsi3_compare0 (s0, s0, GEN_INT (-1)));
> -     else
> -       insn = emit_insn (gen_addsi3_compare0 (s0, s0, GEN_INT (-1)));
> -
> -     cmp = XVECEXP (PATTERN (insn), 0, 0);
> -     cc_reg = SET_DEST (cmp);
> -     bcomp = gen_rtx_NE (VOIDmode, cc_reg, const0_rtx);
> -     loc_ref = gen_rtx_LABEL_REF (VOIDmode, operands [4]);
> -     emit_jump_insn (gen_rtx_SET (VOIDmode, pc_rtx,
> -                                  gen_rtx_IF_THEN_ELSE (VOIDmode, bcomp,
> -                                                        loc_ref, pc_rtx)));
> -     DONE;
> -   }else
> -      FAIL;
> - }")
> -


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/9] [doloop] Correct extracting loop exit condition
  2011-09-30 15:43     ` Roman Zhuykov
@ 2012-02-10 12:00       ` Andrey Belevantsev
  2012-02-17 10:49         ` Richard Sandiford
  0 siblings, 1 reply; 30+ messages in thread
From: Andrey Belevantsev @ 2012-02-10 12:00 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: gcc-patches, dm, richard.sandiford

Hello Richard,

On 30.09.2011 19:21, Roman Zhuykov wrote:
> 2011/7/22 Richard Sandiford<richard.sandiford@linaro.org>:
>> That's pre-approved (independently of the other patches) if it works.
...
> Changed like the following. Will commit if no objections after a couple of days.

We forgot to commit this patch back in September, is it fine to commit at 
this stage?

Andrey


>
> --
> Roman Zhuykov
> zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2011-12-29 15:43         ` Roman Zhuykov
@ 2012-02-10 12:27           ` Roman Zhuykov
  2012-03-29 12:37             ` Andrey Belevantsev
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Zhuykov @ 2012-02-10 12:27 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches, dm

Ping.
Ayal, please review this patch and these three patches too:
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html

--
Roman Zhuykov
zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/9] [doloop] Correct extracting loop exit condition
  2012-02-10 12:00       ` Andrey Belevantsev
@ 2012-02-17 10:49         ` Richard Sandiford
  0 siblings, 0 replies; 30+ messages in thread
From: Richard Sandiford @ 2012-02-17 10:49 UTC (permalink / raw)
  To: Andrey Belevantsev; +Cc: Roman Zhuykov, gcc-patches, dm

Andrey Belevantsev <abel@ispras.ru> writes:
> On 30.09.2011 19:21, Roman Zhuykov wrote:
>> 2011/7/22 Richard Sandiford<richard.sandiford@linaro.org>:
>>> That's pre-approved (independently of the other patches) if it works.
> ...
>> Changed like the following. Will commit if no objections after a couple of days.
>
> We forgot to commit this patch back in September, is it fine to commit at 
> this stage?

Probably a bit late now, sorry.  (A bit like my reply.)

Richard

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2012-02-10 12:27           ` Roman Zhuykov
@ 2012-03-29 12:37             ` Andrey Belevantsev
  2012-03-30 23:21               ` Ayal Zaks
  0 siblings, 1 reply; 30+ messages in thread
From: Andrey Belevantsev @ 2012-03-29 12:37 UTC (permalink / raw)
  To: Roman Zhuykov; +Cc: Ayal Zaks, gcc-patches, dm

Hello,

I'd like to ping again those SMS patches once we're back to Stage 1.

Ayal, maybe it would remove some burden for you if you'd review the general 
SMS functionality of those patches, and we'd ask RTL folks to look at the 
pieces related to RTL pattern matching and generation?

Yours,
Andrey

On 10.02.2012 16:15, Roman Zhuykov wrote:
> Ping.
> Ayal, please review this patch and these three patches too:
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>
> --
> Roman Zhuykov
> zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2012-03-29 12:37             ` Andrey Belevantsev
@ 2012-03-30 23:21               ` Ayal Zaks
  2012-04-10  8:23                 ` Andrey Belevantsev
  0 siblings, 1 reply; 30+ messages in thread
From: Ayal Zaks @ 2012-03-30 23:21 UTC (permalink / raw)
  To: Andrey Belevantsev, Roman Zhuykov; +Cc: gcc-patches, dm

Roman, Andrey,

Sorry for the delayed response.

It would indeed be good to have SMS apply to more loop patterns, still
within the realm of *countable* loops. SMS was originally designed to
handle doloops, with a specific pattern controlling the loop, easily
identified and separable from the loop's body. The newly proposed
change to support new loop patterns is pretty invasive and sizable,
taking place entirely within modulo-sched.c. The main issue I've been
considering, is whether it would be possible instead to transform the
new loop patterns we want SMS to handle, into doloops (potentially
introducing additional induction variables to feed other uses), and
then feed the resulting loop into SMS as is? In other words, could you
fold it into doloop.c? And if so, will doing so introduce significant
overheads?

2012/3/29 Andrey Belevantsev <abel@ispras.ru>:
> Hello,
>
> I'd like to ping again those SMS patches once we're back to Stage 1.
>
> Ayal, maybe it would remove some burden for you if you'd review the general
> SMS functionality of those patches, and we'd ask RTL folks to look at the
> pieces related to RTL pattern matching and generation?
>

It definitely would ... especially in light of the above issue.
Thanks (for your patches, patience, pings..),
Ayal.



> Yours,
> Andrey
>
>
> On 10.02.2012 16:15, Roman Zhuykov wrote:
>>
>> Ping.
>> Ayal, please review this patch and these three patches too:
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>>
>> --
>> Roman Zhuykov
>> zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [SMS] Support new loop pattern
  2012-03-30 23:21               ` Ayal Zaks
@ 2012-04-10  8:23                 ` Andrey Belevantsev
  0 siblings, 0 replies; 30+ messages in thread
From: Andrey Belevantsev @ 2012-04-10  8:23 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: Roman Zhuykov, gcc-patches, dm

Hello Ayal,

First of all, thanks for your feedback.  Now to your questions:

On 31.03.2012 3:20, Ayal Zaks wrote:
> Roman, Andrey,
>
> Sorry for the delayed response.
>
> It would indeed be good to have SMS apply to more loop patterns, still
> within the realm of *countable* loops. SMS was originally designed to
> handle doloops, with a specific pattern controlling the loop, easily
> identified and separable from the loop's body. The newly proposed
> change to support new loop patterns is pretty invasive and sizable,
> taking place entirely within modulo-sched.c. The main issue I've been
> considering, is whether it would be possible instead to transform the
> new loop patterns we want SMS to handle, into doloops (potentially
> introducing additional induction variables to feed other uses), and
> then feed the resulting loop into SMS as is? In other words, could you
> fold it into doloop.c? And if so, will doing so introduce significant
> overheads?

Let me perhaps explain better.  The submission consists of one core patch
(this thread) adding the new functionality of detecting more complex loop
patterns, and three fixes to SMS found while working on the main patch
(the fixes are in the mails pinged at the very end of this message).  The
three fixes are worth committing separately anyway; they were split out of
the main patch for this purpose, so I would suggest considering them in
any case.

For the main patch, its core is as small as we could get.  It stays with
the countable loops, as we bail out early in the cases where we could get
overflow behavior or infinite loops.  We handle only the case of simple
same-step affine counters.  The main reason why we add support to SMS and
not to the doloop pass is that when we do not pipeline a loop newly
transformed to the doloop form, this loop actually slows down on the
platforms not having a true doloop pattern.  One has to undo the doloop
form and get back to the original loop form to avoid this, which seems
rather strange.  Also, the separate decrement insn that changes the
induction variable is better scheduled as well, to get a more precise
schedule.
And yes, I believe that making an extra induction variable just to have the
control part without uses in the loop core will be unnecessary overhead.

Thus, I believe that if we do want SMS to handle more complex loops, then it
is inevitable that SMS itself will be somewhat more complex.  I would
welcome your suggestions to make the patch clearer.  One way I see is
that the function for getting the condition of the new loop form can be
moved to the generic RTL loop code, given the agreement of the other RTL
maintainers.  Also, some new helpers can be introduced for handling these
specific loop forms.  But it seems that the distinction between
doloop/non-doloop loops has to stay in the code.

Yours,
Andrey
>
> 2012/3/29 Andrey Belevantsev<abel@ispras.ru>:
>> Hello,
>>
>> I'd like to ping again those SMS patches once we're back to Stage 1.
>>
>> Ayal, maybe it would remove some burden for you if you'd review the general
>> SMS functionality of those patches, and we'd ask RTL folks to look at the
>> pieces related to RTL pattern matching and generation?
>>
>
> It definitely would ... especially in light of the above issue.
> Thanks (for your patches, patience, pings..),
> Ayal.
>
>
>
>> Yours,
>> Andrey
>>
>>
>> On 10.02.2012 16:15, Roman Zhuykov wrote:
>>>
>>> Ping.
>>> Ayal, please review this patch and these three patches too:
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00505.html
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg00506.html
>>> http://gcc.gnu.org/ml/gcc-patches/2011-12/msg01800.html
>>>
>>> --
>>> Roman Zhuykov
>>> zhroma@ispras.ru

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2012-04-10  8:23 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-21 16:31 [PATCH 0/9] [RFC] Expand SMS functionality zhroma
2011-07-21 16:31 ` [PATCH 2/9] [doloop] Correct extracting loop exit condition zhroma
2011-07-22 12:22   ` Richard Sandiford
2011-09-30 15:43     ` Roman Zhuykov
2012-02-10 12:00       ` Andrey Belevantsev
2012-02-17 10:49         ` Richard Sandiford
2011-07-21 16:31 ` [PATCH 5/9] [SMS] Support new loop pattern zhroma
2011-07-24 11:06   ` Revital1 Eres
2011-07-26  9:02   ` Richard Sandiford
2011-07-27 17:36     ` Roman Zhuykov
2011-09-30 15:54   ` Roman Zhuykov
2011-10-12  0:48     ` Ayal Zaks
2011-12-07 14:36       ` Roman Zhuykov
2011-12-07 14:42       ` Roman Zhuykov
2011-12-29 15:43         ` Roman Zhuykov
2012-02-10 12:27           ` Roman Zhuykov
2012-03-29 12:37             ` Andrey Belevantsev
2012-03-30 23:21               ` Ayal Zaks
2012-04-10  8:23                 ` Andrey Belevantsev
2011-07-21 16:31 ` [PATCH 1/9] [obvious] Minor cleanup zhroma
2011-07-21 16:31 ` [PATCH 3/9] [SMS] Eliminate redundant edges zhroma
2011-07-24 10:36   ` Revital1 Eres
2011-07-21 16:31 ` [PATCH 6/9] [SMS] Support potentially infinite loop zhroma
2011-07-21 16:37 ` [PATCH 7/9] New assertion zhroma
2011-07-21 16:59 ` [PATCH 8/9] Extend simple_rhs_p zhroma
2011-07-21 17:04 ` [PATCH 4/9] Move the SMS pass earlier zhroma
2011-07-21 17:09 ` [PATCH 9/9] [ARM] Remove artificial doloop_end pattern zhroma
2012-01-04 16:47   ` Richard Earnshaw
2011-09-30 15:37 ` [PATCH 0/9] [RFC] Expand SMS functionality Roman Zhuykov
2011-10-17 14:34   ` Richard Sandiford
