public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization
@ 2016-04-28  6:06 Shiva Chen
  2016-04-28 13:26 ` Shiva Chen
  2016-04-29  9:30 ` Bernd Schmidt
  0 siblings, 2 replies; 4+ messages in thread
From: Shiva Chen @ 2016-04-28  6:06 UTC (permalink / raw)
  To: GCC Patches, Shiva Chen

[-- Attachment #1: Type: text/plain, Size: 1988 bytes --]

Hi, 

According to Richard's suggestion in
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01240.html
I try to remove the SMS dependency on doloop pass.

SMS would need to adjust kernel loop iteration count
during the transformation.

To adjust loop iteration count, SMS would need to find
count_reg which contain the loop iteration count
and then generate adjustment instruction.

Currently, SMS would find the doloop_end pattern to get
count_reg, and generate adjustment instruction according
to doloop optimization result (tranfer the loop to count
to zero with step = 1).

If can't find doloop_end pattern or the loop form not
the doloop optimization result, the SMS will skip the loop.

Doloop optimization could have benefit for some target
even if the target don't support special loop instruction.

E.g. For arm , doloop optimization could transfer
     the instructions to subs and branch which save the
     comparison instruction.

However, If the loop iteration count computation of
doloop optimization is too complicate, it would drop performance.
(PARAM_MAX_ITERATIONS_COMPUTATION_COST default value is 10
which may too high for the target not support special loop instruction)

This kind loop not suitable for doloop optimization and SMS
can't activate.

To free the SMS dependency on doloop optimization,
I try to use loop induction variable analysis to find count_reg
and generate kernel loop adjustment instruction for the loop
form without doloop optimization(increment/decrement
loop with step != 1).

Without doloop optimization, induction variable could be a
POST_INC/POST_DEC/PRE_INC/PRE_DEC in memory reference which
current implementation won't identify as loop iv. So I modify
relative code in loop-iv.c to identify this case.

With the patch, backend target could active SMS without define
doloop_end pattern.

Could anyone help me to review the patch?
Any suggestion would be very helpful.

Thanks,
Shiva


[-- Attachment #2: 0001-SMS-use-loop-induction-variable-analysis-instead-of-.patch --]
[-- Type: application/octet-stream, Size: 43501 bytes --]

From 23409fe65722c12a700983e4011b4a42309b8f24 Mon Sep 17 00:00:00 2001
From: shiva <shiva0217@gmail.com>
Date: Mon, 28 Mar 2016 15:06:16 +0800
Subject: [PATCH] SMS use loop induction variable analysis instead of depending
 on doloop optimization

---
 gcc/cfgloop.h      |   5 +-
 gcc/loop-iv.c      | 228 +++++++++++--
 gcc/modulo-sched.c | 917 +++++++++++++++++++++++++++++++++++++++++------------
 gcc/params.def     |   4 +
 4 files changed, 934 insertions(+), 220 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 54e738f..45b1acf 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if there is count_reg reference in the loop.  */
+  bool count_ref_p;
+
   /* True if the loop is part of an oacc kernels region.  */
   bool in_oacc_kernels_region;
 
@@ -347,7 +350,7 @@ struct rtx_iv
 {
   /* Its base and step (mode of base and step is supposed to be extend_mode,
      see the description above).  */
-  rtx base, step;
+  rtx base, step, end;
 
   /* The type of extend applied to it (IV_SIGN_EXTEND, IV_ZERO_EXTEND,
      or IV_UNKNOWN_EXTEND).  */
diff --git a/gcc/loop-iv.c b/gcc/loop-iv.c
index fecaf8f..b73f165 100644
--- a/gcc/loop-iv.c
+++ b/gcc/loop-iv.c
@@ -95,6 +95,8 @@ static unsigned int iv_ref_table_size = 0;
 /* Table of rtx_ivs indexed by the df_ref uid field.  */
 static struct rtx_iv ** iv_ref_table;
 
+static bool mem_write_insn_p (rtx_insn *);
+
 /* Induction variable stored at the reference.  */
 #define DF_REF_IV(REF) iv_ref_table[DF_REF_ID (REF)]
 #define DF_REF_IV_SET(REF, IV) iv_ref_table[DF_REF_ID (REF)] = (IV)
@@ -350,10 +352,6 @@ iv_get_reaching_def (rtx_insn *insn, rtx reg, df_ref *def)
 
   adef = DF_REF_CHAIN (use)->ref;
 
-  /* We do not handle setting only part of the register.  */
-  if (DF_REF_FLAGS (adef) & DF_REF_READ_WRITE)
-    return GRD_INVALID;
-
   def_insn = DF_REF_INSN (adef);
   def_bb = DF_REF_BB (adef);
   use_bb = BLOCK_FOR_INSN (insn);
@@ -622,6 +620,58 @@ iv_shift (struct rtx_iv *iv, rtx mby)
   return true;
 }
 
+static rtx
+get_rhs (rtx_insn *insn, rtx reg)
+{
+  int i;
+  rtx set, pattern;
+  rtx rhs = NULL_RTX;
+
+  pattern = PATTERN (insn);
+
+  /* Find iv increment/decrement rhs in following pattern
+
+     (parallel [
+	(set (reg:CC_NOOV 100 cc)
+	    (compare:CC_NOOV (plus:SI (reg:SI 147)
+				      (const_int -1))
+			     (const_int 0 [0])))
+	(set (reg:SI 147)
+	    (plus:SI (reg:SI 147)
+		     (const_int -1 [0xffffffffffffffff]))
+   */
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    if (SET_DEST (XVECEXP (pattern, 0, i)) == reg)
+	      {
+		pattern = XVECEXP (pattern, 0, i);
+		return SET_SRC (pattern);
+	      }
+	  }
+    }
+
+  set = single_set (insn);
+
+  if (!set)
+    return NULL_RTX;
+
+  if (mem_write_insn_p (insn))
+    rhs = SET_DEST (set);
+  else
+    {
+      rhs = find_reg_equal_equiv_note (insn);
+      if (rhs)
+	rhs = XEXP (rhs, 0);
+      else
+	rhs = SET_SRC (set);
+    }
+
+  return rhs;
+}
+
 /* The recursive part of get_biv_step.  Gets the value of the single value
    defined by DEF wrto initial value of REG inside loop, in shape described
    at get_biv_step.  */
@@ -632,22 +682,17 @@ get_biv_step_1 (df_ref def, rtx reg,
 		enum iv_extend_code *extend, machine_mode outer_mode,
 		rtx *outer_step)
 {
-  rtx set, rhs, op0 = NULL_RTX, op1 = NULL_RTX;
+  rtx rhs, op0 = NULL_RTX, op1 = NULL_RTX;
   rtx next, nextr;
-  enum rtx_code code;
+  enum rtx_code code, addr_code;
   rtx_insn *insn = DF_REF_INSN (def);
   df_ref next_def;
   enum iv_grd_result res;
 
-  set = single_set (insn);
-  if (!set)
-    return false;
+  rhs = get_rhs (insn, reg);
 
-  rhs = find_reg_equal_equiv_note (insn);
-  if (rhs)
-    rhs = XEXP (rhs, 0);
-  else
-    rhs = SET_SRC (set);
+  if (!rhs)
+    return false;
 
   code = GET_CODE (rhs);
   switch (code)
@@ -700,6 +745,19 @@ get_biv_step_1 (df_ref def, rtx reg,
       next = op0;
       break;
 
+    /* Allow READ_WRITE memory reference register
+       identify as IV.  */
+    case MEM:
+      op0 = XEXP (rhs, 0);
+      addr_code = GET_CODE (op0);
+      if (addr_code != POST_INC
+	  && addr_code != POST_DEC
+	  && addr_code != PRE_INC
+	  && addr_code != PRE_DEC)
+	return false;
+      next = XEXP (op0, 0);
+      break;
+
     default:
       return false;
     }
@@ -777,6 +835,21 @@ get_biv_step_1 (df_ref def, rtx reg,
       *extend = (code == SIGN_EXTEND) ? IV_SIGN_EXTEND : IV_ZERO_EXTEND;
       break;
 
+    /* Get inner_step by mode size for READ_WRITE
+       memory reference register.  */
+    case MEM:
+      gcc_assert (GET_CODE (op0) == POST_INC
+		  || GET_CODE (op0) == POST_DEC
+		  || GET_CODE (op0) == PRE_INC
+		  || GET_CODE (op0) == PRE_DEC);
+
+      if (GET_CODE (op0) == POST_INC
+	  || GET_CODE (op0) == PRE_INC)
+	*inner_step = GEN_INT (GET_MODE_SIZE (GET_MODE (rhs)));
+      else
+	*inner_step = GEN_INT (-GET_MODE_SIZE (GET_MODE (rhs)));
+      break;
+
     default:
       return false;
     }
@@ -936,14 +1009,13 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
   rtx op0 = NULL_RTX, op1 = NULL_RTX;
   struct rtx_iv iv0, iv1;
   enum rtx_code code = GET_CODE (rhs);
+  enum rtx_code addr_code;
   machine_mode omode = mode;
 
   iv->mode = VOIDmode;
   iv->base = NULL_RTX;
   iv->step = NULL_RTX;
 
-  gcc_assert (GET_MODE (rhs) == mode || GET_MODE (rhs) == VOIDmode);
-
   if (CONSTANT_P (rhs)
       || REG_P (rhs)
       || code == SUBREG)
@@ -995,6 +1067,24 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
 	return false;
       break;
 
+    case MEM:
+      op0 = XEXP (rhs, 0);
+      omode = GET_MODE (op0);
+      addr_code = GET_CODE (op0);
+
+      if (addr_code != POST_INC
+	  && addr_code != POST_DEC
+	  && addr_code != PRE_INC
+	  && addr_code != PRE_DEC)
+	return false;
+      break;
+    case POST_INC:
+    case POST_DEC:
+    case PRE_INC:
+    case PRE_DEC:
+      op0 = XEXP (rhs, 0);
+      break;
+
     default:
       return false;
     }
@@ -1040,6 +1130,9 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
 	return false;
       break;
 
+    case MEM:
+      break;
+
     default:
       break;
     }
@@ -1048,6 +1141,47 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
   return iv->base != NULL_RTX;
 }
 
+/* Auxiliary variable for mem_read_insn_p/mem_write_insn_p.  */
+static bool mem_ref_p;
+
+/* Auxiliary function for mem_read_insn_p.  */
+static void
+mark_mem_use (rtx *x, void *)
+{
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, *x, NONCONST)
+    if (MEM_P (*iter))
+      {
+	mem_ref_p = true;
+	break;
+      }
+}
+
+/* Returns nonzero if INSN reads from memory.  */
+static bool
+mem_read_insn_p (rtx_insn *insn)
+{
+  mem_ref_p = false;
+  note_uses (&PATTERN (insn), mark_mem_use, NULL);
+  return mem_ref_p;
+}
+
+static void
+mark_mem_store (rtx loc, const_rtx setter ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED)
+{
+  if (MEM_P (loc))
+    mem_ref_p = true;
+}
+
+/* Returns nonzero if INSN writes to memory.  */
+static bool
+mem_write_insn_p (rtx_insn *insn)
+{
+  mem_ref_p = false;
+  note_stores (PATTERN (insn), mark_mem_store, NULL);
+  return mem_ref_p;
+}
+
 /* Analyzes iv DEF and stores the result to *IV.  */
 
 static bool
@@ -1056,6 +1190,7 @@ iv_analyze_def (df_ref def, struct rtx_iv *iv)
   rtx_insn *insn = DF_REF_INSN (def);
   rtx reg = DF_REF_REG (def);
   rtx set, rhs;
+  rtx read_write_reg = NULL_RTX, body = NULL_RTX;
 
   if (dump_file)
     {
@@ -1085,15 +1220,47 @@ iv_analyze_def (df_ref def, struct rtx_iv *iv)
   if (!set)
     return false;
 
-  if (!REG_P (SET_DEST (set)))
-    return false;
+   /* To check the case REG is read write register
+      in memory reference.  */
+  if (mem_write_insn_p (insn))
+    body = SET_DEST (set);
+  else if (mem_read_insn_p (insn))
+    body = SET_SRC (set);
 
-  gcc_assert (SET_DEST (set) == reg);
-  rhs = find_reg_equal_equiv_note (insn);
-  if (rhs)
-    rhs = XEXP (rhs, 0);
-  else
-    rhs = SET_SRC (set);
+  if (body != NULL_RTX)
+    {
+      if (GET_CODE (body) == UNSPEC)
+	return false;
+
+      if (GET_CODE (body) == ZERO_EXTEND ||
+	  GET_CODE (body) == SIGN_EXTEND)
+	body = XEXP(body, 0);
+
+      if (GET_CODE (body) == MEM)
+	body = XEXP(body, 0);
+
+      /* Should we handle POST_MODIFY?  */
+       if (GET_CODE (body) == POST_INC
+	  || GET_CODE (body) == POST_DEC
+	  || GET_CODE (body) == PRE_INC
+	  || GET_CODE (body) == PRE_DEC)
+	read_write_reg = XEXP(body, 0);
+      else
+	return false;
+    }
+
+  gcc_assert (SET_DEST (set) == reg || read_write_reg == reg);
+
+  if (mem_write_insn_p (insn))
+    rhs = SET_DEST (set);
+   else
+    {
+      rhs = find_reg_equal_equiv_note (insn);
+      if (rhs)
+	rhs = XEXP (rhs, 0);
+      else
+	rhs = SET_SRC (set);
+    }
 
   iv_analyze_expr (insn, rhs, GET_MODE (reg), iv);
   record_iv (def, iv);
@@ -2354,6 +2521,19 @@ iv_number_of_iterations (struct loop *loop, rtx_insn *insn, rtx condition,
     goto fail;
 
   op0 = XEXP (condition, 0);
+
+  /* To handle the condition as follow
+      (ne (plus:SI (reg:SI 163)
+		   (const_int -1))
+	  (const_int 0 [0]))
+
+     The pattern would generate
+     by doloop optimization.  */
+  if (GET_CODE (op0) == PLUS
+      && CONST_INT_P (XEXP (op0, 1))
+      && REG_P (XEXP (op0, 0)))
+    op0 = XEXP (op0, 0);
+
   if (!iv_analyze (insn, op0, &iv0))
     goto fail;
   if (iv0.extend_mode == VOIDmode)
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 5dde66c..c1a4af3 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -43,6 +43,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "dbgcnt.h"
 #include "loop-unroll.h"
+#include "print-rtl.h"
+#include "recog.h"
 
 #ifdef INSN_SCHEDULING
 
@@ -211,7 +213,7 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx_insn *);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    struct rtx_iv);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -220,6 +222,8 @@ static int get_sched_window (partial_schedule_ptr, ddg_node_ptr,
 static bool try_scheduling_node_in_cycle (partial_schedule_ptr, int, int,
 					  sbitmap, int *, sbitmap, sbitmap);
 static void remove_node_from_ps (partial_schedule_ptr, ps_insn_ptr);
+static bool duplicate_insn_p (rtx_insn *, struct rtx_iv, bool);
+static bool cmp_insn_p (struct rtx_iv, rtx_insn *);
 
 #define NODE_ASAP(node) ((node)->aux.count)
 
@@ -241,6 +245,11 @@ typedef struct node_sched_params
      u will precede v if column (u) < column (v).  */
   int column;
 } *node_sched_params_ptr;
+
+
+/* A vector that contains the sched data for each ps_insn.  */
+static vec<node_sched_params> node_sched_param_vec;
+
 \f
 /* The following three functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
@@ -332,91 +341,386 @@ ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
     return ps_reg_move (ps, id)->num_consecutive_stages;
 }
 
-/* Given HEAD and TAIL which are the first and last insns in a loop;
-   return the register which controls the loop.  Return zero if it has
-   more than one occurrence in the loop besides the control part or the
-   do-loop pattern is not of the form we expect.  */
-static rtx
-doloop_register_get (rtx_insn *head, rtx_insn *tail)
+/* Analyze loop induction variable and store to IV.  */
+static bool
+analyze_loop_iv (rtx_insn *jump, struct rtx_iv *iv)
 {
-  rtx reg, condition;
-  rtx_insn *insn, *first_insn_not_to_check;
+  rtx condition;
+  rtx op0, op1;
+  rtx_insn *at;
+  struct rtx_iv iv0, iv1;
+  enum rtx_code cmp_code;
+
+  /* Test whether the condition is suitable.  */
+  if (!(condition = get_condition (jump, &at, false, false)))
+    return false;
+
+  op0 = XEXP (condition, 0);
+  op1 = XEXP (condition, 1);
+
+  /* handle the condition:
+  (ne (plus:SI (reg:SI 163)
+               (const_int -1 [0xffffffffffffffff]))
+       (const_int 0 [0]))
+  */
+  if (CONST_INT_P (op1) && GET_CODE (op0) == PLUS)
+    {
+      op1 = XEXP (op0, 1);
+      op0 = XEXP (op0, 0);
+
+      if (REG_P (op0) && CONST_INT_P (op1))
+        {
+	  iv->base = op0;
+	  iv->step = op1;
+	  iv->end = XEXP (condition, 1);
+	  return true;
+	}
+    }
+
+  /* following code handle the condition:
+     (ne (reg:SI 163) (reg:SI 176))
+  */
+
+  if (!iv_analyze (at, op0, &iv0))
+    return false;
 
-  if (!JUMP_P (tail))
-    return NULL_RTX;
+  if (!iv_analyze (at, op1, &iv1))
+    return false;
 
-  if (!targetm.code_for_doloop_end)
-    return NULL_RTX;
+  if (iv0.step != const0_rtx)
+    {
+      iv->base = op0;
+      iv->step = iv0.step;
+      iv->end = op1;
+    }
+  else if (iv1.step != const0_rtx)
+    {
+      iv->base = op1;
+      iv->step = iv1.step;
+      iv->end = op0;
+    }
 
-  /* TODO: Free SMS's dependence on doloop_condition_get.  */
-  condition = doloop_condition_get (tail);
-  if (! condition)
-    return NULL_RTX;
+  /* get_condition() may canonicalize the condition
+     E.g.
+	(leu (reg:SI 186 [ ivtmp.41 ])
+	     (const_int 6 [0x7]))
+     =>
+	(ltu (reg:SI 186 [ ivtmp.41 ])
+	     (const_int 7 [0x7]))
 
-  if (REG_P (XEXP (condition, 0)))
-    reg = XEXP (condition, 0);
-  else if (GET_CODE (XEXP (condition, 0)) == PLUS
-	   && REG_P (XEXP (XEXP (condition, 0), 0)))
-    reg = XEXP (XEXP (condition, 0), 0);
+     we won't change the branch which contain condition (E.g. ble)
+     Therefore, we have to change back to original constant end condition
+     and let SMS loop iteration adjustment could do the right thing.  */
+  if (iv->end)
+    {
+      if (GET_CODE (PATTERN (jump)) == PARALLEL)
+	{
+	  int j;
+
+	  for (j = XVECLEN (PATTERN (jump), 0) - 1; j >= 0; j--)
+	    {
+	      rtx body = XVECEXP (PATTERN (jump), 0, j);
+
+	      if (GET_CODE (body) == SET)
+		{
+		  if (GET_CODE (SET_SRC (body)) == IF_THEN_ELSE)
+		    {
+		      cmp_code = GET_CODE (XEXP (SET_SRC (body), 0));
+		      break;
+		    }
+		}
+	    }
+	  cmp_code = GET_CODE (condition);
+	}
+      else
+	cmp_code = GET_CODE (XEXP (SET_SRC (PATTERN (jump)), 0));
+
+      if (CONST_INT_P (iv->end)
+	  && GET_CODE (condition) !=  cmp_code)
+	{
+	  switch (GET_CODE (condition))
+	  {
+	    case LT:
+	    case LTU:
+	      iv->end = GEN_INT (INTVAL (iv->end) - 1);
+	      break;
+	    case GT:
+	    case GTU:
+	      iv->end = GEN_INT (INTVAL (iv->end) + 1);
+	      break;
+	    default:
+	      break;
+	  }
+	}
+    }
   else
-    gcc_unreachable ();
+    return false;
 
-  /* Check that the COUNT_REG has no other occurrences in the loop
-     until the decrement.  We assume the control part consists of
-     either a single (parallel) branch-on-count or a (non-parallel)
-     branch immediately preceded by a single (decrement) insn.  */
-  first_insn_not_to_check = (GET_CODE (PATTERN (tail)) == PARALLEL ? tail
-                             : prev_nondebug_insn (tail));
+  return true;
+}
 
-  for (insn = head; insn != first_insn_not_to_check; insn = NEXT_INSN (insn))
-    if (!DEBUG_INSN_P (insn) && reg_mentioned_p (reg, insn))
-      {
-        if (dump_file)
-        {
-          fprintf (dump_file, "SMS count_reg found ");
-          print_rtl_single (dump_file, reg);
-          fprintf (dump_file, " outside control in insn:\n");
-          print_rtl_single (dump_file, insn);
-        }
+/* Set LOOP->loop_ref_p to true if there is COUNT_REG
+   reference instruction(except iv increment/decrement
+   instruction) in the loop.  */
+static void
+check_count_reg_reference (struct loop* loop, rtx count_reg)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  int j;
+  rtx set;
 
-        return NULL_RTX;
-      }
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (DEBUG_INSN_P (insn))
+	    continue;
+
+	  if (!reg_mentioned_p (count_reg, insn))
+	    continue;
+
+	  if ((set = single_set (insn)))
+	    {
+	      if (SET_DEST (set) == count_reg)
+		continue;
+	      else
+		{
+		  loop->count_ref_p = true;
+		  return;
+		}
+	    }
+
+	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
+	    {
+	      for (j = XVECLEN (PATTERN (insn), 0) - 1; j >= 0; j--)
+		{
+		  rtx body = XVECEXP (PATTERN (insn), 0, j);
+
+		  if (GET_CODE (body) == SET)
+		    {
+		      if (GET_CODE (SET_SRC (body)) == COMPARE)
+			continue;
+		      if (SET_DEST (body) != count_reg
+			  && reg_mentioned_p (count_reg, body))
+			{
+			  loop->count_ref_p = true;
+			  return;
+			}
+		    }
+		}
+	    }
+	}
+    }
+  free (bbs);
+}
 
-  return reg;
+rtx
+get_count_reg (struct loop* loop, rtx_insn *tail, struct rtx_iv *iv)
+{
+  bool get_iv;
+
+  df_clear_flags (DF_LR_RUN_DCE);
+  iv_analysis_loop_init (loop);
+  get_iv = analyze_loop_iv (tail, iv);
+  df_set_flags (DF_LR_RUN_DCE);
+
+  if (!get_iv)
+     return NULL_RTX;
+
+  /* check count_reg reference in the loop
+     and set result to loop->count_ref_p.  */
+  check_count_reg_reference (loop, iv->base);
+
+  return iv->base;
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx_insn that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
-static rtx_insn *
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       int64_t * count)
+/* Return stage number of compare instruction.  */
+static int
+get_cmp_stage (ddg_ptr g, partial_schedule_ptr ps, struct rtx_iv iv)
 {
+  int i, num_nodes = g->num_nodes;
   rtx_insn *insn;
-  rtx_insn *head, *tail;
 
-  if (! pre_header)
-    return NULL;
+  for (i = 0; i < num_nodes; i++)
+    {
+      node_sched_params_ptr nsp = SCHED_PARAMS (i);
+      insn = ps_rtl_insn (ps, i);
 
-  get_ebb_head_tail (pre_header, pre_header, &head, &tail);
+      if (cmp_insn_p (iv, insn))
+	return nsp->stage;
+    }
 
-  for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
-    if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
-      {
-	rtx pat = single_set (insn);
+  return 0;
+}
 
-	if (CONST_INT_P (SET_SRC (pat)))
-	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
-	  }
+/* Do constant end condition replacement if REPLACE_P is true.
+   Check the replacement validation if REPLACE_P is false.  */
+static bool
+loop_iteration_adjust_by_replace (ddg_ptr g, partial_schedule_ptr ps,
+				  rtx_insn *tail, struct rtx_iv iv,
+				  struct niter_desc *desc, bool replace_p)
+{
+  rtx condition;
+  rtx_insn *cmp;
+  int adjust_count = 0;
+  int cmp_stage = 0;
+  rtx end_cond = iv.end;
+  rtx new_end_cond;
+  int width;
+
+  if (!(condition = get_condition (tail, &cmp, false, false)))
+    return false;
 
-	return NULL;
-      }
+  /* If the loop contain the count_reg reference instruction,
+     we must duplicate iv decrement/increment instruction
+     in prolog/epilog to correct the reference.
+     Therefore, we adjust loop iteration count by
+     cmp_insn_stage * iv.step.  */
+  cmp_stage = get_cmp_stage (g, ps, iv);
+  adjust_count = cmp_stage * -INTVAL (iv.step);
 
-  return NULL;
+  /* We don't have to do any adjustment.  */
+  if (adjust_count == 0)
+    return true;
+
+  /* Get the mode width that loop end condition should be compared.  */
+  width = GET_MODE_PRECISION (desc->mode);
+
+  /* Adjust the loop end constant by comparison mode
+     to avoid loop exit expected iteration count overflow
+     and then get wrong loop end constant.
+
+     E.g.
+	short e;
+	for (e = 0; e != 1; e = e + 5)
+   */
+  if (width < GET_MODE_PRECISION (SImode))
+    new_end_cond = GEN_INT ((INTVAL(end_cond) + adjust_count)
+			    & ((1 << width) - 1));
+  else
+    new_end_cond = GEN_INT (INTVAL(end_cond) + adjust_count);
+
+  if (replace_p)
+    validate_replace_rtx (end_cond, new_end_cond, cmp);
+  else
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "cmp_stage = %d\n", cmp_stage);
+	  fprintf (dump_file, "adjust_count = %d\n", adjust_count);
+	  fprintf (dump_file, "Try to replace ");
+	  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (end_cond));
+	  fprintf (dump_file, " to ");
+	  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (new_end_cond));
+	  fprintf (dump_file, " in insn(%d)\n", INSN_UID (cmp));
+	}
+      /* Return false if the backend not support the replacement
+	 instruction.  */
+      if (!validate_replace_rtx (end_cond, new_end_cond, cmp))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS replace end condition constant fail.\n");
+	  return false;
+	}
+      /* Replace new_end_cond back to end_cond.
+	 We don't want to modify end_cond before loop versioning.  */
+      validate_replace_rtx (new_end_cond, end_cond, cmp);
+    }
+  return true;
+}
+
+/* Generate kernel loop iteration adjustment instruction
+   and insert to prolog.  */
+static bool
+gen_loop_iteration_adjust (ddg_ptr g, partial_schedule_ptr ps,
+			   struct loop *loop, rtx_insn *tail,
+			   struct rtx_iv iv, int stage_count)
+{
+  rtx condition;
+  rtx new_reg;
+  rtx_insn *cmp;
+  machine_mode mode;
+  edge e;
+  int adjust_count = 0;
+  int cmp_stage = 0;
+  rtx count_reg = iv.base;
+  rtx end_cond = iv.end;
+
+  if (!(condition = get_condition (tail, &cmp, false, false)))
+    return false;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "count_reg is r%d\n", REGNO (count_reg));
+      fprintf (dump_file, "loop step = ");
+      fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (iv.step));
+      fprintf (dump_file, "\n");
+    }
+
+  /* If the loop contain the count_reg reference instruction,
+     we must duplicate iv decrement/increment instruction
+     in prolog/epilog to correct the reference.
+     Therefore, we adjust loop iteration count by
+     cmp_insn_stage * iv.step.  */
+  if (loop->count_ref_p)
+    {
+      cmp_stage = get_cmp_stage (g, ps, iv);
+      adjust_count = cmp_stage * -INTVAL (iv.step);
+
+      if (dump_file)
+	fprintf (dump_file, "cmp_stage = %d\n", cmp_stage);
+    }
+  /* else adjust iteration count by stage_count.  */
+  else
+    adjust_count = stage_count * -INTVAL (iv.step);
+
+  /* We don't have to do any adjustment.  */
+  if (adjust_count == 0)
+    return true;
+
+  if (dump_file)
+    fprintf (dump_file, "adjust_count = %d\n", adjust_count);
+
+  /* Handle the end_cond is constant.  */
+  if (CONST_INT_P (end_cond))
+    {
+      rtx new_end_cond;
+      /* We could modify loop count initial value to adjust loop
+         iteration count only when there is no count_reg reference
+         in the loop.  */
+      if (!loop->count_ref_p)
+	{
+	  new_end_cond = GEN_INT (INTVAL(end_cond) - adjust_count);
+	  start_sequence ();
+	  emit_insn (gen_add3_insn (count_reg, count_reg, new_end_cond));
+	  e = loop_preheader_edge (loop);
+	  split_edge_and_insert (e, get_insns ());
+	  end_sequence ();
+	}
+
+      return true;
+    }
+  /* Insert the end condition adjustment instruction before entry
+     the loop when end condition is in register.  */
+  else
+    {
+      mode = GET_MODE (count_reg);
+      new_reg = gen_reg_rtx (mode);
+
+      start_sequence ();
+      emit_insn (gen_add3_insn (new_reg, end_cond,
+			    GEN_INT (adjust_count)));
+      e = loop_preheader_edge (loop);
+      split_edge_and_insert (e, get_insns ());
+      end_sequence ();
+      validate_replace_rtx (end_cond, new_reg, cmp);
+    }
+
+  return true;
 }
 
 /* A very simple resource-based lower bound on the initiation interval.
@@ -431,10 +735,6 @@ res_MII (ddg_ptr g)
   return ((g->num_nodes - g->num_debug) / issue_rate);
 }
 
-
-/* A vector that contains the sched data for each ps_insn.  */
-static vec<node_sched_params> node_sched_param_vec;
-
 /* Allocate sched_params for each node and initialize it.  */
 static void
 set_node_sched_params (ddg_ptr g)
@@ -1093,9 +1393,97 @@ clear:
   return ok;
 }
 
+/* Return true if INSN is the loop end condition comparison instruction.  */
+static bool
+cmp_insn_p (struct rtx_iv iv, rtx_insn *insn)
+{
+  rtx pattern;
+  int i;
+  rtx op0, op1;
+
+  if (!reg_mentioned_p (iv.base, insn))
+    return false;
+
+  pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+      {
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    op0 = XVECEXP (pattern, 0, i);
+	    if (GET_CODE (SET_SRC (op0)) == COMPARE)
+	      {
+		pattern = op0;
+		break;
+	      }
+	  }
+      }
+    }
+
+  if (GET_CODE (pattern) != SET)
+    return false;
+
+  if (GET_CODE (SET_SRC (pattern)) != COMPARE)
+    return false;
+
+  op0 = XEXP (SET_SRC (pattern), 0);
+  op1 = XEXP (SET_SRC (pattern), 1);
+
+  if (op0 == iv.base && op1 == iv.end)
+    return true;
+
+  if (op1 == iv.base && op0 == iv.end)
+    return true;
+
+  return false;
+}
+
+/* Return true if we should duplicate INSN in prolog/epilog.  */
+static bool
+duplicate_insn_p (rtx_insn *insn, struct rtx_iv iv, bool count_ref_p)
+{
+  rtx pattern;
+  int i;
+  rtx count_reg = iv.base;
+
+  if (JUMP_P (insn))
+    return false;
+
+  if (!reg_mentioned_p (count_reg, insn))
+    return true;
+
+  pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    if (SET_DEST (XVECEXP (pattern, 0, i)) == count_reg)
+	      {
+		pattern = XVECEXP (pattern, 0, i);
+		break;
+	      }
+	  }
+    }
+  /* If there is no count_reg reference in the loop,
+     We don't have to duplicate IV increment/decrement instruction.  */
+  if (SET_DEST (pattern) == count_reg
+      && !count_ref_p)
+    return false;
+
+  if (GET_CODE (pattern) == SET
+      && GET_CODE (SET_SRC (pattern)) == COMPARE)
+    return false;
+
+  return true;
+}
+
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, rtx count_reg)
+			   int to_stage, struct rtx_iv iv, bool count_ref_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1107,15 +1495,8 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int first_u, last_u;
 	rtx_insn *u_insn;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
-           The closing branch is scheduled as well and thus should
-           be ignored.
-           TODO: This should be done by analyzing the control part of
-           the loop.  */
 	u_insn = ps_rtl_insn (ps, u);
-        if (reg_mentioned_p (count_reg, u_insn)
-            || JUMP_P (u_insn))
+        if (!duplicate_insn_p (u_insn, iv, count_ref_p))
           continue;
 
 	first_u = SCHED_STAGE (u);
@@ -1134,34 +1515,18 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        struct rtx_iv iv)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
   edge e;
+  bool count_ref_p = loop->count_ref_p;
 
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
-    {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
-      rtx sub_reg = NULL_RTX;
-
-      sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS, count_reg,
-				     gen_int_mode (last_stage,
-						   GET_MODE (count_reg)),
-                                     count_reg, 1, OPTAB_DIRECT);
-      gcc_assert (REG_P (sub_reg));
-      if (REGNO (sub_reg) != REGNO (count_reg))
-        emit_move_insn (count_reg, sub_reg);
-    }
-
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, iv, count_ref_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1175,7 +1540,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, iv, count_ref_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1337,6 +1702,139 @@ setup_sched_infos (void)
   current_sched_info = &sms_sched_info;
 }
 
+/* Do loop induction variable analysis for each loop.
+   Store loop iv to LOOP_IV [loop->num] and set
+   SMS_LOOP_P [loop->num] to true if the loop
+   could analyze successfully.  */
+static void
+sms_get_loop_iv (ddg_ptr *g_arr, struct rtx_iv *loop_iv, bool *sms_loop_p)
+{
+  rtx_insn *insn;
+  basic_block bb = NULL;
+  struct loop *loop;
+
+  FOR_EACH_LOOP (loop, 0)
+    {
+      rtx_insn *head, *tail;
+      rtx count_reg;
+
+      if (! (g_arr[loop->num]))
+        continue;
+
+      bb = loop->header;
+
+      get_ebb_head_tail (bb, bb, &head, &tail);
+
+      /* Get loop iteration count register.  */
+      if ( !(count_reg = get_count_reg (loop, tail, &loop_iv[loop->num])))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS get_count_reg failed\n");
+	  continue;
+	}
+
+     for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
+      {
+        if (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
+            && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
+            && !reg_mentioned_p (count_reg, insn))
+	  break;
+      }
+
+      if (insn != NEXT_INSN (tail))
+	{
+	  if (dump_file)
+	    {
+              if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
+		  && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
+                fprintf (dump_file, "SMS loop-with-not-single-set\n");
+	      dump_insn_slim (dump_file, insn);
+	    }
+
+	  continue;
+	}
+      sms_loop_p [loop->num] = true;
+    }
+
+  iv_analysis_done ();
+}
+
+/* Return the insn which SMS could not handle.  */
+static rtx_insn *
+loop_sms_insn_check (const struct loop *loop)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  rtx set;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (! INSN_P (insn))
+	    continue;
+
+	  if (CALL_P (insn))
+	    return insn;
+
+	  if (BARRIER_P (insn))
+	    return insn;
+
+	  if ((set = single_set (insn))
+	      && GET_CODE (SET_DEST (set)) == SUBREG)
+	    return insn;
+
+	  /* Skip the loop contain clobber (mem:BLK (scratch).  */
+	  if (GET_CODE (PATTERN (insn)) == CLOBBER
+	      && GET_CODE (XEXP (PATTERN (insn), 0)) == MEM)
+	    return insn;
+	}
+    }
+  free (bbs);
+
+  return NULL;
+}
+
+rtx
+entry_loop_count_reg (struct loop *loop, rtx count_reg)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  rtx set;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (DEBUG_INSN_P (insn))
+	    continue;
+
+	  if (!reg_mentioned_p (count_reg, insn))
+	    continue;
+
+	  if ((set = single_set (insn)))
+	    {
+	      if (SET_DEST (set) == count_reg
+		  && (GET_CODE (SET_SRC (set)) == PLUS
+		      || GET_CODE (SET_SRC (set)) == MINUS))
+		{
+		  count_reg = XEXP (SET_SRC (set), 0);
+		  break;
+		}
+	    }
+	}
+    }
+
+  return count_reg;
+}
+
+
 /* Probability in % that the sms-ed loop rolls enough so that optimized
    version may be entered.  Just a guess.  */
 #define PROB_SMS_ENOUGH_ITERATIONS 80
@@ -1349,8 +1847,8 @@ setup_sched_infos (void)
 static void
 sms_schedule (void)
 {
-  rtx_insn *insn;
   ddg_ptr *g_arr, g;
+  struct rtx_iv *loop_iv;
   int * node_order;
   int maxii, max_asap;
   partial_schedule_ptr ps;
@@ -1359,6 +1857,7 @@ sms_schedule (void)
   basic_block condition_bb = NULL;
   edge latch_edge;
   gcov_type trip_count = 0;
+  bool * sms_loop_p;
 
   loop_optimizer_init (LOOPS_HAVE_PREHEADERS
 		       | LOOPS_HAVE_RECORDED_EXITS);
@@ -1388,6 +1887,10 @@ sms_schedule (void)
      We use loop->num as index into this array.  */
   g_arr = XCNEWVEC (ddg_ptr, number_of_loops (cfun));
 
+  loop_iv = XCNEWVEC (struct rtx_iv, number_of_loops (cfun));
+
+  sms_loop_p = XCNEWVEC (bool, number_of_loops (cfun));
+
   if (dump_file)
   {
     fprintf (dump_file, "\n\nSMS analysis phase\n");
@@ -1399,7 +1902,7 @@ sms_schedule (void)
   FOR_EACH_LOOP (loop, 0)
     {
       rtx_insn *head, *tail;
-      rtx count_reg;
+      rtx_insn *non_sms_insn;
 
       /* For debugging.  */
       if (dbg_cnt (sms_sched_loop) == false)
@@ -1423,11 +1926,29 @@ sms_schedule (void)
         continue;
 
       if (! loop_single_full_bb_p (loop))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS not loop_single_full_bb_p\n");
-	continue;
-      }
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS not loop_single_full_bb_p\n");
+	  continue;
+	}
+
+      if (num_loop_insns (loop) < PARAM_VALUE (PARAM_SMS_LOOP_MIN_SIZE))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS loop size = %d is too small\n",
+		     num_loop_insns (loop));
+	  continue;
+	}
+
+      if ((non_sms_insn = loop_sms_insn_check (loop)))
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Contain the insn could not do SMS\n");
+	      dump_insn_slim (dump_file, non_sms_insn);
+	    }
+	  continue;
+	}
 
       bb = loop->header;
 
@@ -1465,52 +1986,6 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
-
-      /* Don't handle BBs with calls or barriers
-	 or !single_set with the exception of instructions that include
-	 count_reg---these instructions are part of the control part
-	 that do-loop recognizes.
-         ??? Should handle insns defining subregs.  */
-     for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
-      {
-         rtx set;
-
-        if (CALL_P (insn)
-            || BARRIER_P (insn)
-            || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
-                && !reg_mentioned_p (count_reg, insn))
-            || (INSN_P (insn) && (set = single_set (insn))
-                && GET_CODE (SET_DEST (set)) == SUBREG))
-        break;
-      }
-
-      if (insn != NEXT_INSN (tail))
-	{
-	  if (dump_file)
-	    {
-	      if (CALL_P (insn))
-		fprintf (dump_file, "SMS loop-with-call\n");
-	      else if (BARRIER_P (insn))
-		fprintf (dump_file, "SMS loop-with-barrier\n");
-              else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
-                fprintf (dump_file, "SMS loop-with-not-single-set\n");
-              else
-               fprintf (dump_file, "SMS loop with subreg in lhs\n");
-	      print_rtl_single (dump_file, insn);
-	    }
-
-	  continue;
-	}
-
       /* Always schedule the closing branch with the rest of the
          instructions. The branch is rotated to be in row ii-1 at the
          end of the scheduling procedure to make sure it's the last
@@ -1527,6 +2002,9 @@ sms_schedule (void)
         fprintf (dump_file, "...OK\n");
 
     }
+
+  sms_get_loop_iv (g_arr, loop_iv, sms_loop_p);
+
   if (dump_file)
   {
     fprintf (dump_file, "\nSMS transformation phase\n");
@@ -1536,16 +2014,19 @@ sms_schedule (void)
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (loop, 0)
     {
+      struct rtx_iv iv = loop_iv [loop->num];
+      rtx count_reg = iv.base;
       rtx_insn *head, *tail;
-      rtx count_reg;
-      rtx_insn *count_init;
       int mii, rec_mii, stage_count, min_cycle;
-      int64_t loop_count = 0;
       bool opt_sc_p;
+      struct niter_desc *desc;
+      bool loop_version_p = true;
 
-      if (! (g = g_arr[loop->num]))
+      if (! sms_loop_p [loop->num])
         continue;
 
+      g = g_arr[loop->num];
+
       if (dump_file)
 	{
 	  rtx_insn *insn = BB_END (loop->header);
@@ -1585,28 +2066,8 @@ sms_schedule (void)
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
-
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL;
-      if ((count_reg = doloop_register_get (head, tail)))
-	{
-	  basic_block pre_header;
-
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
-	}
       gcc_assert (count_reg);
 
-      if (dump_file && count_init)
-        {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, "%" PRId64,
-		     loop_count);
-          fprintf (dump_file, "\n");
-        }
-
       node_order = XNEWVEC (int, g->num_nodes);
 
       mii = 1; /* Need to pass some estimate of mii.  */
@@ -1651,20 +2112,41 @@ sms_schedule (void)
 	      gcc_assert (stage_count >= 1);
 	    }
 
+	  /* Get loop iteration information.  */
+	  desc = get_simple_loop_desc (loop);
+
+	  if (desc->const_iter && desc->niter != 0)
+	    {
+	      if (dump_file)
+	        {
+		  fprintf (dump_file, "SMS loop iteration count = ");
+		  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, desc->niter);
+		  fprintf (dump_file, "\nSMS stage_count = %d\n", stage_count);
+		}
+
+	      if (desc->niter < (uint64_t)stage_count)
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "SMS failed due to loop count"
+					"< stage_count.\n");
+		  break;
+		}
+	      else
+		loop_version_p = false;
+	    }
+
 	  /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
 	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
-	      || (count_init && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
 	      if (dump_file)
 		{
 		  fprintf (dump_file, "SMS failed... \n");
 		  fprintf (dump_file, "SMS sched-failed (stage-count=%d,"
-			   " loop-count=", stage_count);
-		  fprintf (dump_file, "%" PRId64, loop_count);
-		  fprintf (dump_file, ", trip-count=");
+			   , stage_count);
+		  fprintf (dump_file, " trip-count=");
 		  fprintf (dump_file, "%" PRId64, trip_count);
 		  fprintf (dump_file, ")\n");
 		}
@@ -1707,6 +2189,13 @@ sms_schedule (void)
 
 	  canon_loop (loop);
 
+	  /* Try to adjust kernel loop end conditon when end condition is
+	     constant and there is count_ref reference in the loop.
+	     Skip the loop if the relpacement fail.  */
+	  if (CONST_INT_P (iv.end) && loop->count_ref_p
+	      && !loop_iteration_adjust_by_replace (g, ps, tail, iv, desc, false))
+	    break;
+
           if (dump_file)
             {
 	      dump_insn_location (tail);
@@ -1714,25 +2203,61 @@ sms_schedule (void)
 		       ps->ii, stage_count);
 	      print_partial_schedule (ps, dump_file);
 	    }
- 
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_GT (VOIDmode, count_reg,
-					 gen_int_mode (stage_count,
-						       GET_MODE (count_reg)));
+
+          /* Do loop-versioning to avoid the loop count < stage_count
+	     execute SMS version loop.  */
+	  if (loop_version_p)
+	    {
+	      rtx comp_rtx;
+	      machine_mode mode = GET_MODE (count_reg);
+
+	      int stage_steps = stage_count * -INTVAL (iv.step);
+	      rtx loop_steps;
+
+	      /* To avoid count_reg not live before entry the loop.
+	       .L3:
+	               sub     r2, r3, #1
+	               add     r0, r0, r3
+	               mov     r3, r2
+	               cmp     r2, #0
+	               bne     .L3
+
+		 In this case count_reg is r2, but we should generate
+		 loop version condition by r3.  */
+	      count_reg = entry_loop_count_reg (loop, count_reg);
+
+	      /* loop_steps = count_reg - iv.end
+	         iv.end == 0 would be the special case
+		 and we don' have to generate MINUS.  */
+	      if (INTVAL (iv.end) == 0)
+		loop_steps = count_reg;
+	      else
+		loop_steps = gen_rtx_MINUS (mode, count_reg, iv.end);
+
+	      /* Generate loop version condition according the loop
+		 is increment or decrement.  */
+	      if (INTVAL (iv.step) > 0)
+		comp_rtx = gen_rtx_LT (VOIDmode, loop_steps,
+				       gen_int_mode (stage_steps, mode));
+	      else
+		comp_rtx = gen_rtx_GT (VOIDmode, loop_steps,
+				       gen_int_mode (stage_steps, mode));
+
 	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
 			       * REG_BR_PROB_BASE) / 100;
-
+
 	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
+			    prob, prob, REG_BR_PROB_BASE - prob,
 			    true);
-	     }
+	    }
+
+	  /* Replace constant end condition in compare instruction
+	     to adjust kernel loop iteration count.  */
+	  if (CONST_INT_P (iv.end) && loop->count_ref_p)
+	    loop_iteration_adjust_by_replace (g, ps, tail, iv, desc, true);
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	  /* Insert kernel loop iteraion adjustment instructions.  */
+	  gen_loop_iteration_adjust (g, ps, loop, tail, iv, stage_count - 1);
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1749,7 +2274,7 @@ sms_schedule (void)
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, ps);
 	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+          generate_prolog_epilog (ps, loop, iv);
 	  break;
 	}
 
@@ -1760,6 +2285,8 @@ sms_schedule (void)
     }
 
   free (g_arr);
+  free (loop_iv);
+  free (sms_loop_p);
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();
diff --git a/gcc/params.def b/gcc/params.def
index 9362c15..cc171f1 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -375,6 +375,10 @@ DEFPARAM(PARAM_SMS_LOOP_AVERAGE_COUNT_THRESHOLD,
 	 "sms-loop-average-count-threshold",
 	 "A threshold on the average loop count considered by the swing modulo scheduler.",
 	 0, 0, 0)
+DEFPARAM(PARAM_SMS_LOOP_MIN_SIZE,
+	 "sms-loop-min-size",
+	 "A threshold on the loop instruction count considered by the swing modulo scheduler",
+	 8, 0, 0)
 
 DEFPARAM(HOT_BB_COUNT_WS_PERMILLE,
 	 "hot-bb-count-ws-permille",
-- 
2.5.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* RE: [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization
  2016-04-28  6:06 [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization Shiva Chen
@ 2016-04-28 13:26 ` Shiva Chen
  2016-04-29  9:30 ` Bernd Schmidt
  1 sibling, 0 replies; 4+ messages in thread
From: Shiva Chen @ 2016-04-28 13:26 UTC (permalink / raw)
  To: GCC Patches, Shiva Chen

[-- Attachment #1: Type: text/plain, Size: 2396 bytes --]

Hi, 

I fixed some bug to pass testing on x86-64 and update the patch
as 0001-SMS-use-loop-induction-variable-analysis-v1.patch.

Thanks,
Shiva

-----Original Message-----
From: Shiva Chen 
Sent: Thursday, April 28, 2016 2:07 PM
To: GCC Patches <gcc-patches@gcc.gnu.org>; Shiva Chen <shiva0217@gmail.com>
Subject: [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization

Hi, 

According to Richard's suggestion in
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg01240.html
I try to remove the SMS dependency on doloop pass.

SMS would need to adjust kernel loop iteration count during the transformation.

To adjust loop iteration count, SMS would need to find count_reg which contain the loop iteration count and then generate adjustment instruction.

Currently, SMS would find the doloop_end pattern to get count_reg, and generate adjustment instruction according to doloop optimization result (tranfer the loop to count to zero with step = 1).

If can't find doloop_end pattern or the loop form not the doloop optimization result, the SMS will skip the loop.

Doloop optimization could have benefit for some target even if the target don't support special loop instruction.

E.g. For arm , doloop optimization could transfer
     the instructions to subs and branch which save the
     comparison instruction.

However, If the loop iteration count computation of doloop optimization is too complicate, it would drop performance.
(PARAM_MAX_ITERATIONS_COMPUTATION_COST default value is 10 which may too high for the target not support special loop instruction)

This kind loop not suitable for doloop optimization and SMS can't activate.

To free the SMS dependency on doloop optimization, I try to use loop induction variable analysis to find count_reg and generate kernel loop adjustment instruction for the loop form without doloop optimization(increment/decrement loop with step != 1).

Without doloop optimization, induction variable could be a POST_INC/POST_DEC/PRE_INC/PRE_DEC in memory reference which current implementation won't identify as loop iv. So I modify relative code in loop-iv.c to identify this case.

With the patch, backend target could active SMS without define doloop_end pattern.

Could anyone help me to review the patch?
Any suggestion would be very helpful.

Thanks,
Shiva


[-- Attachment #2: 0001-SMS-use-loop-induction-variable-analysis-v1.patch --]
[-- Type: application/octet-stream, Size: 45572 bytes --]

From d407ea086f9bc19784e145a505e0a86dc46a60a5 Mon Sep 17 00:00:00 2001
From: shiva <shiva0217@gmail.com>
Date: Mon, 28 Mar 2016 15:06:16 +0800
Subject: [PATCH] SMS use loop induction variable analysis instead of depending
 on doloop optimization

---
 gcc/cfgloop.h      |   5 +-
 gcc/loop-iv.c      | 228 +++++++++++--
 gcc/modulo-sched.c | 931 ++++++++++++++++++++++++++++++++++++++++++-----------
 gcc/params.def     |   4 +
 4 files changed, 949 insertions(+), 219 deletions(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 54e738f..45b1acf 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -191,6 +191,9 @@ struct GTY ((chain_next ("%h.next"))) loop {
   /* True if we should try harder to vectorize this loop.  */
   bool force_vectorize;
 
+  /* True if there is count_reg reference in the loop.  */
+  bool count_ref_p;
+
   /* True if the loop is part of an oacc kernels region.  */
   bool in_oacc_kernels_region;
 
@@ -347,7 +350,7 @@ struct rtx_iv
 {
   /* Its base and step (mode of base and step is supposed to be extend_mode,
      see the description above).  */
-  rtx base, step;
+  rtx base, step, end;
 
   /* The type of extend applied to it (IV_SIGN_EXTEND, IV_ZERO_EXTEND,
      or IV_UNKNOWN_EXTEND).  */
diff --git a/gcc/loop-iv.c b/gcc/loop-iv.c
index fecaf8f..b73f165 100644
--- a/gcc/loop-iv.c
+++ b/gcc/loop-iv.c
@@ -95,6 +95,8 @@ static unsigned int iv_ref_table_size = 0;
 /* Table of rtx_ivs indexed by the df_ref uid field.  */
 static struct rtx_iv ** iv_ref_table;
 
+static bool mem_write_insn_p (rtx_insn *);
+
 /* Induction variable stored at the reference.  */
 #define DF_REF_IV(REF) iv_ref_table[DF_REF_ID (REF)]
 #define DF_REF_IV_SET(REF, IV) iv_ref_table[DF_REF_ID (REF)] = (IV)
@@ -350,10 +352,6 @@ iv_get_reaching_def (rtx_insn *insn, rtx reg, df_ref *def)
 
   adef = DF_REF_CHAIN (use)->ref;
 
-  /* We do not handle setting only part of the register.  */
-  if (DF_REF_FLAGS (adef) & DF_REF_READ_WRITE)
-    return GRD_INVALID;
-
   def_insn = DF_REF_INSN (adef);
   def_bb = DF_REF_BB (adef);
   use_bb = BLOCK_FOR_INSN (insn);
@@ -622,6 +620,58 @@ iv_shift (struct rtx_iv *iv, rtx mby)
   return true;
 }
 
+static rtx
+get_rhs (rtx_insn *insn, rtx reg)
+{
+  int i;
+  rtx set, pattern;
+  rtx rhs = NULL_RTX;
+
+  pattern = PATTERN (insn);
+
+  /* Find iv increment/decrement rhs in following pattern
+
+     (parallel [
+	(set (reg:CC_NOOV 100 cc)
+	    (compare:CC_NOOV (plus:SI (reg:SI 147)
+				      (const_int -1))
+			     (const_int 0 [0])))
+	(set (reg:SI 147)
+	    (plus:SI (reg:SI 147)
+		     (const_int -1 [0xffffffffffffffff]))
+   */
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    if (SET_DEST (XVECEXP (pattern, 0, i)) == reg)
+	      {
+		pattern = XVECEXP (pattern, 0, i);
+		return SET_SRC (pattern);
+	      }
+	  }
+    }
+
+  set = single_set (insn);
+
+  if (!set)
+    return NULL_RTX;
+
+  if (mem_write_insn_p (insn))
+    rhs = SET_DEST (set);
+  else
+    {
+      rhs = find_reg_equal_equiv_note (insn);
+      if (rhs)
+	rhs = XEXP (rhs, 0);
+      else
+	rhs = SET_SRC (set);
+    }
+
+  return rhs;
+}
+
 /* The recursive part of get_biv_step.  Gets the value of the single value
    defined by DEF wrto initial value of REG inside loop, in shape described
    at get_biv_step.  */
@@ -632,22 +682,17 @@ get_biv_step_1 (df_ref def, rtx reg,
 		enum iv_extend_code *extend, machine_mode outer_mode,
 		rtx *outer_step)
 {
-  rtx set, rhs, op0 = NULL_RTX, op1 = NULL_RTX;
+  rtx rhs, op0 = NULL_RTX, op1 = NULL_RTX;
   rtx next, nextr;
-  enum rtx_code code;
+  enum rtx_code code, addr_code;
   rtx_insn *insn = DF_REF_INSN (def);
   df_ref next_def;
   enum iv_grd_result res;
 
-  set = single_set (insn);
-  if (!set)
-    return false;
+  rhs = get_rhs (insn, reg);
 
-  rhs = find_reg_equal_equiv_note (insn);
-  if (rhs)
-    rhs = XEXP (rhs, 0);
-  else
-    rhs = SET_SRC (set);
+  if (!rhs)
+    return false;
 
   code = GET_CODE (rhs);
   switch (code)
@@ -700,6 +745,19 @@ get_biv_step_1 (df_ref def, rtx reg,
       next = op0;
       break;
 
+    /* Allow READ_WRITE memory reference register
+       identify as IV.  */
+    case MEM:
+      op0 = XEXP (rhs, 0);
+      addr_code = GET_CODE (op0);
+      if (addr_code != POST_INC
+	  && addr_code != POST_DEC
+	  && addr_code != PRE_INC
+	  && addr_code != PRE_DEC)
+	return false;
+      next = XEXP (op0, 0);
+      break;
+
     default:
       return false;
     }
@@ -777,6 +835,21 @@ get_biv_step_1 (df_ref def, rtx reg,
       *extend = (code == SIGN_EXTEND) ? IV_SIGN_EXTEND : IV_ZERO_EXTEND;
       break;
 
+    /* Get inner_step by mode size for READ_WRITE
+       memory reference register.  */
+    case MEM:
+      gcc_assert (GET_CODE (op0) == POST_INC
+		  || GET_CODE (op0) == POST_DEC
+		  || GET_CODE (op0) == PRE_INC
+		  || GET_CODE (op0) == PRE_DEC);
+
+      if (GET_CODE (op0) == POST_INC
+	  || GET_CODE (op0) == PRE_INC)
+	*inner_step = GEN_INT (GET_MODE_SIZE (GET_MODE (rhs)));
+      else
+	*inner_step = GEN_INT (-GET_MODE_SIZE (GET_MODE (rhs)));
+      break;
+
     default:
       return false;
     }
@@ -936,14 +1009,13 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
   rtx op0 = NULL_RTX, op1 = NULL_RTX;
   struct rtx_iv iv0, iv1;
   enum rtx_code code = GET_CODE (rhs);
+  enum rtx_code addr_code;
   machine_mode omode = mode;
 
   iv->mode = VOIDmode;
   iv->base = NULL_RTX;
   iv->step = NULL_RTX;
 
-  gcc_assert (GET_MODE (rhs) == mode || GET_MODE (rhs) == VOIDmode);
-
   if (CONSTANT_P (rhs)
       || REG_P (rhs)
       || code == SUBREG)
@@ -995,6 +1067,24 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
 	return false;
       break;
 
+    case MEM:
+      op0 = XEXP (rhs, 0);
+      omode = GET_MODE (op0);
+      addr_code = GET_CODE (op0);
+
+      if (addr_code != POST_INC
+	  && addr_code != POST_DEC
+	  && addr_code != PRE_INC
+	  && addr_code != PRE_DEC)
+	return false;
+      break;
+    case POST_INC:
+    case POST_DEC:
+    case PRE_INC:
+    case PRE_DEC:
+      op0 = XEXP (rhs, 0);
+      break;
+
     default:
       return false;
     }
@@ -1040,6 +1130,9 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
 	return false;
       break;
 
+    case MEM:
+      break;
+
     default:
       break;
     }
@@ -1048,6 +1141,47 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
   return iv->base != NULL_RTX;
 }
 
+/* Auxiliary variable for mem_read_insn_p/mem_write_insn_p.  */
+static bool mem_ref_p;
+
+/* Auxiliary function for mem_read_insn_p.  */
+static void
+mark_mem_use (rtx *x, void *)
+{
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, *x, NONCONST)
+    if (MEM_P (*iter))
+      {
+	mem_ref_p = true;
+	break;
+      }
+}
+
+/* Returns nonzero if INSN reads from memory.  */
+static bool
+mem_read_insn_p (rtx_insn *insn)
+{
+  mem_ref_p = false;
+  note_uses (&PATTERN (insn), mark_mem_use, NULL);
+  return mem_ref_p;
+}
+
+static void
+mark_mem_store (rtx loc, const_rtx setter ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED)
+{
+  if (MEM_P (loc))
+    mem_ref_p = true;
+}
+
+/* Returns nonzero if INSN writes to memory.  */
+static bool
+mem_write_insn_p (rtx_insn *insn)
+{
+  mem_ref_p = false;
+  note_stores (PATTERN (insn), mark_mem_store, NULL);
+  return mem_ref_p;
+}
+
 /* Analyzes iv DEF and stores the result to *IV.  */
 
 static bool
@@ -1056,6 +1190,7 @@ iv_analyze_def (df_ref def, struct rtx_iv *iv)
   rtx_insn *insn = DF_REF_INSN (def);
   rtx reg = DF_REF_REG (def);
   rtx set, rhs;
+  rtx read_write_reg = NULL_RTX, body = NULL_RTX;
 
   if (dump_file)
     {
@@ -1085,15 +1220,47 @@ iv_analyze_def (df_ref def, struct rtx_iv *iv)
   if (!set)
     return false;
 
-  if (!REG_P (SET_DEST (set)))
-    return false;
+   /* To check the case REG is read write register
+      in memory reference.  */
+  if (mem_write_insn_p (insn))
+    body = SET_DEST (set);
+  else if (mem_read_insn_p (insn))
+    body = SET_SRC (set);
 
-  gcc_assert (SET_DEST (set) == reg);
-  rhs = find_reg_equal_equiv_note (insn);
-  if (rhs)
-    rhs = XEXP (rhs, 0);
-  else
-    rhs = SET_SRC (set);
+  if (body != NULL_RTX)
+    {
+      if (GET_CODE (body) == UNSPEC)
+	return false;
+
+      if (GET_CODE (body) == ZERO_EXTEND ||
+	  GET_CODE (body) == SIGN_EXTEND)
+	body = XEXP(body, 0);
+
+      if (GET_CODE (body) == MEM)
+	body = XEXP(body, 0);
+
+      /* Should we handle POST_MODIFY?  */
+       if (GET_CODE (body) == POST_INC
+	  || GET_CODE (body) == POST_DEC
+	  || GET_CODE (body) == PRE_INC
+	  || GET_CODE (body) == PRE_DEC)
+	read_write_reg = XEXP(body, 0);
+      else
+	return false;
+    }
+
+  gcc_assert (SET_DEST (set) == reg || read_write_reg == reg);
+
+  if (mem_write_insn_p (insn))
+    rhs = SET_DEST (set);
+   else
+    {
+      rhs = find_reg_equal_equiv_note (insn);
+      if (rhs)
+	rhs = XEXP (rhs, 0);
+      else
+	rhs = SET_SRC (set);
+    }
 
   iv_analyze_expr (insn, rhs, GET_MODE (reg), iv);
   record_iv (def, iv);
@@ -2354,6 +2521,19 @@ iv_number_of_iterations (struct loop *loop, rtx_insn *insn, rtx condition,
     goto fail;
 
   op0 = XEXP (condition, 0);
+
+  /* To handle the condition as follow
+      (ne (plus:SI (reg:SI 163)
+		   (const_int -1))
+	  (const_int 0 [0]))
+
+     The pattern would generate
+     by doloop optimization.  */
+  if (GET_CODE (op0) == PLUS
+      && CONST_INT_P (XEXP (op0, 1))
+      && REG_P (XEXP (op0, 0)))
+    op0 = XEXP (op0, 0);
+
   if (!iv_analyze (insn, op0, &iv0))
     goto fail;
   if (iv0.extend_mode == VOIDmode)
diff --git a/gcc/modulo-sched.c b/gcc/modulo-sched.c
index 5dde66c..7354084 100644
--- a/gcc/modulo-sched.c
+++ b/gcc/modulo-sched.c
@@ -43,6 +43,8 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "dbgcnt.h"
 #include "loop-unroll.h"
+#include "print-rtl.h"
+#include "recog.h"
 
 #ifdef INSN_SCHEDULING
 
@@ -211,7 +213,7 @@ static void set_node_sched_params (ddg_ptr);
 static partial_schedule_ptr sms_schedule_by_order (ddg_ptr, int, int, int *);
 static void permute_partial_schedule (partial_schedule_ptr, rtx_insn *);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
-                                    rtx, rtx);
+                                    struct rtx_iv);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -220,6 +222,8 @@ static int get_sched_window (partial_schedule_ptr, ddg_node_ptr,
 static bool try_scheduling_node_in_cycle (partial_schedule_ptr, int, int,
 					  sbitmap, int *, sbitmap, sbitmap);
 static void remove_node_from_ps (partial_schedule_ptr, ps_insn_ptr);
+static bool duplicate_insn_p (rtx_insn *, struct rtx_iv, bool);
+static bool cmp_insn_p (struct rtx_iv, rtx_insn *);
 
 #define NODE_ASAP(node) ((node)->aux.count)
 
@@ -241,6 +245,11 @@ typedef struct node_sched_params
      u will precede v if column (u) < column (v).  */
   int column;
 } *node_sched_params_ptr;
+
+
+/* A vector that contains the sched data for each ps_insn.  */
+static vec<node_sched_params> node_sched_param_vec;
+
 \f
 /* The following three functions are copied from the current scheduler
    code in order to use sched_analyze() for computing the dependencies.
@@ -332,91 +341,400 @@ ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
     return ps_reg_move (ps, id)->num_consecutive_stages;
 }
 
-/* Given HEAD and TAIL which are the first and last insns in a loop;
-   return the register which controls the loop.  Return zero if it has
-   more than one occurrence in the loop besides the control part or the
-   do-loop pattern is not of the form we expect.  */
-static rtx
-doloop_register_get (rtx_insn *head, rtx_insn *tail)
+/* Analyze loop induction variable and store to IV.  */
+static bool
+analyze_loop_iv (rtx_insn *jump, struct rtx_iv *iv)
 {
-  rtx reg, condition;
-  rtx_insn *insn, *first_insn_not_to_check;
+  rtx condition;
+  rtx op0, op1;
+  rtx_insn *at;
+  struct rtx_iv iv0, iv1;
+  enum rtx_code cmp_code;
+
+  /* Test whether the condition is suitable.  */
+  if (!(condition = get_condition (jump, &at, false, false)))
+    return false;
+
+  op0 = XEXP (condition, 0);
+  op1 = XEXP (condition, 1);
+
+  /* handle the condition:
+  (ne (plus:SI (reg:SI 163)
+               (const_int -1 [0xffffffffffffffff]))
+       (const_int 0 [0]))
+  */
+  if (CONST_INT_P (op1) && GET_CODE (op0) == PLUS)
+    {
+      op1 = XEXP (op0, 1);
+      op0 = XEXP (op0, 0);
+
+      if (REG_P (op0) && CONST_INT_P (op1))
+        {
+	  iv->base = op0;
+	  iv->step = op1;
+	  iv->end = XEXP (condition, 1);
+	  return true;
+	}
+    }
+
+  /* following code handle the condition:
+     (ne (reg:SI 163) (reg:SI 176))
+  */
+
+  if (!iv_analyze (at, op0, &iv0))
+    return false;
 
-  if (!JUMP_P (tail))
-    return NULL_RTX;
+  if (!iv_analyze (at, op1, &iv1))
+    return false;
 
-  if (!targetm.code_for_doloop_end)
-    return NULL_RTX;
+  if (iv0.step != const0_rtx)
+    {
+      iv->base = op0;
+      iv->step = iv0.step;
+      iv->end = op1;
+    }
+  else if (iv1.step != const0_rtx)
+    {
+      iv->base = op1;
+      iv->step = iv1.step;
+      iv->end = op0;
+    }
 
-  /* TODO: Free SMS's dependence on doloop_condition_get.  */
-  condition = doloop_condition_get (tail);
-  if (! condition)
-    return NULL_RTX;
+  /* get_condition() may canonicalize the condition
+     E.g.
+	(leu (reg:SI 186 [ ivtmp.41 ])
+	     (const_int 6 [0x7]))
+     =>
+	(ltu (reg:SI 186 [ ivtmp.41 ])
+	     (const_int 7 [0x7]))
 
-  if (REG_P (XEXP (condition, 0)))
-    reg = XEXP (condition, 0);
-  else if (GET_CODE (XEXP (condition, 0)) == PLUS
-	   && REG_P (XEXP (XEXP (condition, 0), 0)))
-    reg = XEXP (XEXP (condition, 0), 0);
+     we won't change the branch which contain condition (E.g. ble)
+     Therefore, we have to change back to original constant end condition
+     and let SMS loop iteration adjustment could do the right thing.  */
+  if (iv->end)
+    {
+      if (GET_CODE (PATTERN (jump)) == PARALLEL)
+	{
+	  int j;
+
+	  for (j = XVECLEN (PATTERN (jump), 0) - 1; j >= 0; j--)
+	    {
+	      rtx body = XVECEXP (PATTERN (jump), 0, j);
+
+	      if (GET_CODE (body) == SET)
+		{
+		  if (GET_CODE (SET_SRC (body)) == IF_THEN_ELSE)
+		    {
+		      cmp_code = GET_CODE (XEXP (SET_SRC (body), 0));
+		      break;
+		    }
+		}
+	    }
+	  cmp_code = GET_CODE (condition);
+	}
+      else
+	cmp_code = GET_CODE (XEXP (SET_SRC (PATTERN (jump)), 0));
+
+      if (CONST_INT_P (iv->end)
+	  && GET_CODE (condition) !=  cmp_code)
+	{
+	  switch (GET_CODE (condition))
+	  {
+	    case LT:
+	    case LTU:
+	      iv->end = GEN_INT (INTVAL (iv->end) - 1);
+	      break;
+	    case GT:
+	    case GTU:
+	      iv->end = GEN_INT (INTVAL (iv->end) + 1);
+	      break;
+	    default:
+	      break;
+	  }
+	}
+    }
   else
-    gcc_unreachable ();
+    return false;
 
-  /* Check that the COUNT_REG has no other occurrences in the loop
-     until the decrement.  We assume the control part consists of
-     either a single (parallel) branch-on-count or a (non-parallel)
-     branch immediately preceded by a single (decrement) insn.  */
-  first_insn_not_to_check = (GET_CODE (PATTERN (tail)) == PARALLEL ? tail
-                             : prev_nondebug_insn (tail));
+  return true;
+}
 
-  for (insn = head; insn != first_insn_not_to_check; insn = NEXT_INSN (insn))
-    if (!DEBUG_INSN_P (insn) && reg_mentioned_p (reg, insn))
-      {
-        if (dump_file)
-        {
-          fprintf (dump_file, "SMS count_reg found ");
-          print_rtl_single (dump_file, reg);
-          fprintf (dump_file, " outside control in insn:\n");
-          print_rtl_single (dump_file, insn);
-        }
+/* Set LOOP->loop_ref_p to true if there is COUNT_REG
+   reference instruction(except iv increment/decrement
+   instruction) in the loop.  */
+static void
+check_count_reg_reference (struct loop* loop, rtx count_reg)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  int j;
+  rtx set;
 
-        return NULL_RTX;
-      }
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (DEBUG_INSN_P (insn))
+	    continue;
 
-  return reg;
+	  if (!reg_mentioned_p (count_reg, insn))
+	    continue;
+
+	  if ((set = single_set (insn)))
+	    {
+	      if (SET_DEST (set) == count_reg)
+		continue;
+	      else
+		{
+		  loop->count_ref_p = true;
+		  return;
+		}
+	    }
+
+	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
+	    {
+	      for (j = XVECLEN (PATTERN (insn), 0) - 1; j >= 0; j--)
+		{
+		  rtx body = XVECEXP (PATTERN (insn), 0, j);
+
+		  if (GET_CODE (body) == SET)
+		    {
+		      if (GET_CODE (SET_SRC (body)) == COMPARE)
+			continue;
+		      if (SET_DEST (body) != count_reg
+			  && reg_mentioned_p (count_reg, body))
+			{
+			  loop->count_ref_p = true;
+			  return;
+			}
+		    }
+		}
+	    }
+	}
+    }
+  free (bbs);
 }
 
-/* Check if COUNT_REG is set to a constant in the PRE_HEADER block, so
-   that the number of iterations is a compile-time constant.  If so,
-   return the rtx_insn that sets COUNT_REG to a constant, and set COUNT to
-   this constant.  Otherwise return 0.  */
-static rtx_insn *
-const_iteration_count (rtx count_reg, basic_block pre_header,
-		       int64_t * count)
+rtx
+get_count_reg (struct loop* loop, rtx_insn *tail, struct rtx_iv *iv)
+{
+  bool get_iv;
+
+  df_clear_flags (DF_LR_RUN_DCE);
+  iv_analysis_loop_init (loop);
+  get_iv = analyze_loop_iv (tail, iv);
+  df_set_flags (DF_LR_RUN_DCE);
+
+  if (!get_iv)
+     return NULL_RTX;
+
+  /* check count_reg reference in the loop
+     and set result to loop->count_ref_p.  */
+  check_count_reg_reference (loop, iv->base);
+
+  return iv->base;
+}
+
+/* Return stage number of compare instruction.  */
+static int
+get_cmp_stage (ddg_ptr g, partial_schedule_ptr ps, struct rtx_iv iv)
 {
+  int i, num_nodes = g->num_nodes;
   rtx_insn *insn;
-  rtx_insn *head, *tail;
 
-  if (! pre_header)
-    return NULL;
+  for (i = 0; i < num_nodes; i++)
+    {
+      node_sched_params_ptr nsp = SCHED_PARAMS (i);
+      insn = ps_rtl_insn (ps, i);
 
-  get_ebb_head_tail (pre_header, pre_header, &head, &tail);
+      if (cmp_insn_p (iv, insn))
+	return nsp->stage;
+    }
 
-  for (insn = tail; insn != PREV_INSN (head); insn = PREV_INSN (insn))
-    if (NONDEBUG_INSN_P (insn) && single_set (insn) &&
-	rtx_equal_p (count_reg, SET_DEST (single_set (insn))))
-      {
-	rtx pat = single_set (insn);
+  return 0;
+}
 
-	if (CONST_INT_P (SET_SRC (pat)))
-	  {
-	    *count = INTVAL (SET_SRC (pat));
-	    return insn;
-	  }
+/* Do constant end condition replacement if REPLACE_P is true.
+   Check the replacement validation if REPLACE_P is false.  */
+static bool
+loop_iteration_adjust_by_replace (ddg_ptr g, partial_schedule_ptr ps,
+				  rtx_insn *tail, struct rtx_iv iv,
+				  struct niter_desc *desc, bool replace_p)
+{
+  rtx condition;
+  rtx_insn *cmp;
+  int adjust_count = 0;
+  int cmp_stage = 0;
+  rtx end_cond = iv.end;
+  rtx new_end_cond;
+  int width;
+
+  if (!(condition = get_condition (tail, &cmp, false, false)))
+    return false;
 
-	return NULL;
-      }
+  /* If the loop contain the count_reg reference instruction,
+     we must duplicate iv decrement/increment instruction
+     in prolog/epilog to correct the reference.
+     Therefore, we adjust loop iteration count by
+     cmp_insn_stage * iv.step.  */
+  cmp_stage = get_cmp_stage (g, ps, iv);
+  adjust_count = cmp_stage * -INTVAL (iv.step);
 
-  return NULL;
+  /* We don't have to do any adjustment.  */
+  if (adjust_count == 0)
+    return true;
+
+  /* Handle the case
+     (const:DI (plus:DI (symbol_ref:DI ("v"))
+			(const_int 240 [0xf0]))).  */
+  if (GET_CODE (end_cond) == CONST
+      && GET_CODE (XEXP (end_cond, 0)) == PLUS
+      && GET_CODE (XEXP (XEXP (end_cond, 0), 0)) == SYMBOL_REF
+      && GET_CODE (XEXP (XEXP (end_cond, 0), 1)) == CONST_INT)
+    {
+      end_cond = XEXP (XEXP (end_cond, 0), 1);
+    }
+
+  if (!CONST_INT_P (end_cond))
+    return false;
+
+  /* Get the mode width that loop end condition should be compared.  */
+  width = GET_MODE_PRECISION (desc->mode);
+
+  /* Adjust the loop end constant by comparison mode
+     to avoid loop exit expected iteration count overflow
+     and then get wrong loop end constant.
+
+     E.g.
+	short e;
+	for (e = 0; e != 1; e = e + 5)
+   */
+  if (width < GET_MODE_PRECISION (SImode))
+    new_end_cond = GEN_INT ((INTVAL(end_cond) + adjust_count)
+			    & ((1 << width) - 1));
+  else
+    new_end_cond = GEN_INT (INTVAL(end_cond) + adjust_count);
+
+  if (replace_p)
+    validate_replace_rtx (end_cond, new_end_cond, cmp);
+  else
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "cmp_stage = %d\n", cmp_stage);
+	  fprintf (dump_file, "adjust_count = %d\n", adjust_count);
+	  fprintf (dump_file, "Try to replace ");
+	  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (end_cond));
+	  fprintf (dump_file, " to ");
+	  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (new_end_cond));
+	  fprintf (dump_file, " in insn(%d)\n", INSN_UID (cmp));
+	}
+      /* Return false if the backend not support the replacement
+	 instruction.  */
+      if (!validate_replace_rtx (end_cond, new_end_cond, cmp))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS replace end condition constant fail.\n");
+	  return false;
+	}
+      /* Replace new_end_cond back to end_cond.
+	 We don't want to modify end_cond before loop versioning.  */
+      validate_replace_rtx (new_end_cond, end_cond, cmp);
+    }
+  return true;
+}
+
+/* Generate kernel loop iteration adjustment instruction
+   and insert to prolog.  */
+static bool
+gen_loop_iteration_adjust (ddg_ptr g, partial_schedule_ptr ps,
+			   struct loop *loop, rtx_insn *tail,
+			   struct rtx_iv iv, int stage_count)
+{
+  rtx condition;
+  rtx new_reg;
+  rtx_insn *cmp;
+  machine_mode mode;
+  edge e;
+  int adjust_count = 0;
+  int cmp_stage = 0;
+  rtx count_reg = iv.base;
+  rtx end_cond = iv.end;
+
+  if (!(condition = get_condition (tail, &cmp, false, false)))
+    return false;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "count_reg is r%d\n", REGNO (count_reg));
+      fprintf (dump_file, "loop step = ");
+      fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, INTVAL (iv.step));
+      fprintf (dump_file, "\n");
+    }
+
+  /* If the loop contain the count_reg reference instruction,
+     we must duplicate iv decrement/increment instruction
+     in prolog/epilog to correct the reference.
+     Therefore, we adjust loop iteration count by
+     cmp_insn_stage * iv.step.  */
+  if (loop->count_ref_p)
+    {
+      cmp_stage = get_cmp_stage (g, ps, iv);
+      adjust_count = cmp_stage * -INTVAL (iv.step);
+
+      if (dump_file)
+	fprintf (dump_file, "cmp_stage = %d\n", cmp_stage);
+    }
+  /* else adjust iteration count by stage_count.  */
+  else
+    adjust_count = stage_count * -INTVAL (iv.step);
+
+  /* We don't have to do any adjustment.  */
+  if (adjust_count == 0)
+    return true;
+
+  if (dump_file)
+    fprintf (dump_file, "adjust_count = %d\n", adjust_count);
+
+  /* Handle the end_cond is constant.  */
+  if (CONST_INT_P (end_cond))
+    {
+      rtx new_end_cond;
+      /* We could modify loop count initial value to adjust loop
+         iteration count only when there is no count_reg reference
+         in the loop.  */
+      if (!loop->count_ref_p)
+	{
+	  new_end_cond = GEN_INT (INTVAL(end_cond) - adjust_count);
+	  start_sequence ();
+	  emit_insn (gen_add3_insn (count_reg, count_reg, new_end_cond));
+	  e = loop_preheader_edge (loop);
+	  split_edge_and_insert (e, get_insns ());
+	  end_sequence ();
+	}
+
+      return true;
+    }
+  /* Insert the end condition adjustment instruction before entry
+     the loop when end condition is in register.  */
+  else
+    {
+      mode = GET_MODE (count_reg);
+      new_reg = gen_reg_rtx (mode);
+
+      start_sequence ();
+      emit_insn (gen_add3_insn (new_reg, end_cond,
+			    GEN_INT (adjust_count)));
+      e = loop_preheader_edge (loop);
+      split_edge_and_insert (e, get_insns ());
+      end_sequence ();
+      validate_replace_rtx (end_cond, new_reg, cmp);
+    }
+
+  return true;
 }
 
 /* A very simple resource-based lower bound on the initiation interval.
@@ -431,10 +749,6 @@ res_MII (ddg_ptr g)
   return ((g->num_nodes - g->num_debug) / issue_rate);
 }
 
-
-/* A vector that contains the sched data for each ps_insn.  */
-static vec<node_sched_params> node_sched_param_vec;
-
 /* Allocate sched_params for each node and initialize it.  */
 static void
 set_node_sched_params (ddg_ptr g)
@@ -1093,9 +1407,97 @@ clear:
   return ok;
 }
 
+/* Return true if INSN is the loop end condition comparison instruction.  */
+static bool
+cmp_insn_p (struct rtx_iv iv, rtx_insn *insn)
+{
+  rtx pattern;
+  int i;
+  rtx op0, op1;
+
+  if (!reg_mentioned_p (iv.base, insn))
+    return false;
+
+  pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+      {
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    op0 = XVECEXP (pattern, 0, i);
+	    if (GET_CODE (SET_SRC (op0)) == COMPARE)
+	      {
+		pattern = op0;
+		break;
+	      }
+	  }
+      }
+    }
+
+  if (GET_CODE (pattern) != SET)
+    return false;
+
+  if (GET_CODE (SET_SRC (pattern)) != COMPARE)
+    return false;
+
+  op0 = XEXP (SET_SRC (pattern), 0);
+  op1 = XEXP (SET_SRC (pattern), 1);
+
+  if (op0 == iv.base && op1 == iv.end)
+    return true;
+
+  if (op1 == iv.base && op0 == iv.end)
+    return true;
+
+  return false;
+}
+
+/* Return true if we should duplicate INSN in prolog/epilog.  */
+static bool
+duplicate_insn_p (rtx_insn *insn, struct rtx_iv iv, bool count_ref_p)
+{
+  rtx pattern;
+  int i;
+  rtx count_reg = iv.base;
+
+  if (JUMP_P (insn))
+    return false;
+
+  if (!reg_mentioned_p (count_reg, insn))
+    return true;
+
+  pattern = PATTERN (insn);
+
+  if (GET_CODE (pattern) == PARALLEL)
+    {
+      for (i = XVECLEN (pattern, 0) - 1; i >= 0; i--)
+	if (GET_CODE (XVECEXP (pattern, 0, i)) == SET)
+	  {
+	    if (SET_DEST (XVECEXP (pattern, 0, i)) == count_reg)
+	      {
+		pattern = XVECEXP (pattern, 0, i);
+		break;
+	      }
+	  }
+    }
+  /* If there is no count_reg reference in the loop,
+     We don't have to duplicate IV increment/decrement instruction.  */
+  if (SET_DEST (pattern) == count_reg
+      && !count_ref_p)
+    return false;
+
+  if (GET_CODE (pattern) == SET
+      && GET_CODE (SET_SRC (pattern)) == COMPARE)
+    return false;
+
+  return true;
+}
+
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, rtx count_reg)
+			   int to_stage, struct rtx_iv iv, bool count_ref_p)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -1107,15 +1509,8 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 	int first_u, last_u;
 	rtx_insn *u_insn;
 
-        /* Do not duplicate any insn which refers to count_reg as it
-           belongs to the control part.
-           The closing branch is scheduled as well and thus should
-           be ignored.
-           TODO: This should be done by analyzing the control part of
-           the loop.  */
 	u_insn = ps_rtl_insn (ps, u);
-        if (reg_mentioned_p (count_reg, u_insn)
-            || JUMP_P (u_insn))
+        if (!duplicate_insn_p (u_insn, iv, count_ref_p))
           continue;
 
 	first_u = SCHED_STAGE (u);
@@ -1134,34 +1529,18 @@ duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
 /* Generate the instructions (including reg_moves) for prolog & epilog.  */
 static void
 generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
-                        rtx count_reg, rtx count_init)
+                        struct rtx_iv iv)
 {
   int i;
   int last_stage = PS_STAGE_COUNT (ps) - 1;
   edge e;
+  bool count_ref_p = loop->count_ref_p;
 
   /* Generate the prolog, inserting its insns on the loop-entry edge.  */
   start_sequence ();
 
-  if (!count_init)
-    {
-      /* Generate instructions at the beginning of the prolog to
-         adjust the loop count by STAGE_COUNT.  If loop count is constant
-         (count_init), this constant is adjusted by STAGE_COUNT in
-         generate_prolog_epilog function.  */
-      rtx sub_reg = NULL_RTX;
-
-      sub_reg = expand_simple_binop (GET_MODE (count_reg), MINUS, count_reg,
-				     gen_int_mode (last_stage,
-						   GET_MODE (count_reg)),
-                                     count_reg, 1, OPTAB_DIRECT);
-      gcc_assert (REG_P (sub_reg));
-      if (REGNO (sub_reg) != REGNO (count_reg))
-        emit_move_insn (count_reg, sub_reg);
-    }
-
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, iv, count_ref_p);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1175,7 +1554,7 @@ generate_prolog_epilog (partial_schedule_ptr ps, struct loop *loop,
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, iv, count_ref_p);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1337,6 +1716,139 @@ setup_sched_infos (void)
   current_sched_info = &sms_sched_info;
 }
 
+/* Do loop induction variable analysis for each loop.
+   Store loop iv to LOOP_IV [loop->num] and set
+   SMS_LOOP_P [loop->num] to true if the loop
+   could analyze successfully.  */
+static void
+sms_get_loop_iv (ddg_ptr *g_arr, struct rtx_iv *loop_iv, bool *sms_loop_p)
+{
+  rtx_insn *insn;
+  basic_block bb = NULL;
+  struct loop *loop;
+
+  FOR_EACH_LOOP (loop, 0)
+    {
+      rtx_insn *head, *tail;
+      rtx count_reg;
+
+      if (! (g_arr[loop->num]))
+        continue;
+
+      bb = loop->header;
+
+      get_ebb_head_tail (bb, bb, &head, &tail);
+
+      /* Get loop iteration count register.  */
+      if ( !(count_reg = get_count_reg (loop, tail, &loop_iv[loop->num])))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS get_count_reg failed\n");
+	  continue;
+	}
+
+     for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
+      {
+        if (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
+            && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
+            && !reg_mentioned_p (count_reg, insn))
+	  break;
+      }
+
+      if (insn != NEXT_INSN (tail))
+	{
+	  if (dump_file)
+	    {
+              if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
+		  && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
+                fprintf (dump_file, "SMS loop-with-not-single-set\n");
+	      dump_insn_slim (dump_file, insn);
+	    }
+
+	  continue;
+	}
+      sms_loop_p [loop->num] = true;
+    }
+
+  iv_analysis_done ();
+}
+
+/* Return the insn which SMS could not handle.  */
+static rtx_insn *
+loop_sms_insn_check (const struct loop *loop)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  rtx set;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (! INSN_P (insn))
+	    continue;
+
+	  if (CALL_P (insn))
+	    return insn;
+
+	  if (BARRIER_P (insn))
+	    return insn;
+
+	  if ((set = single_set (insn))
+	      && GET_CODE (SET_DEST (set)) == SUBREG)
+	    return insn;
+
+	  /* Skip the loop contain clobber (mem:BLK (scratch).  */
+	  if (GET_CODE (PATTERN (insn)) == CLOBBER
+	      && GET_CODE (XEXP (PATTERN (insn), 0)) == MEM)
+	    return insn;
+	}
+    }
+  free (bbs);
+
+  return NULL;
+}
+
+rtx
+entry_loop_count_reg (struct loop *loop, rtx count_reg)
+{
+  basic_block *bbs, bb;
+  unsigned i;
+  rtx_insn *insn;
+  rtx set;
+
+  bbs = get_loop_body (loop);
+  for (i = 0; i < loop->num_nodes; i++)
+    {
+      bb = bbs[i];
+      FOR_BB_INSNS (bb, insn)
+	{
+	  if (DEBUG_INSN_P (insn))
+	    continue;
+
+	  if (!reg_mentioned_p (count_reg, insn))
+	    continue;
+
+	  if ((set = single_set (insn)))
+	    {
+	      if (SET_DEST (set) == count_reg
+		  && (GET_CODE (SET_SRC (set)) == PLUS
+		      || GET_CODE (SET_SRC (set)) == MINUS))
+		{
+		  count_reg = XEXP (SET_SRC (set), 0);
+		  break;
+		}
+	    }
+	}
+    }
+
+  return count_reg;
+}
+
+
 /* Probability in % that the sms-ed loop rolls enough so that optimized
    version may be entered.  Just a guess.  */
 #define PROB_SMS_ENOUGH_ITERATIONS 80
@@ -1349,8 +1861,8 @@ setup_sched_infos (void)
 static void
 sms_schedule (void)
 {
-  rtx_insn *insn;
   ddg_ptr *g_arr, g;
+  struct rtx_iv *loop_iv;
   int * node_order;
   int maxii, max_asap;
   partial_schedule_ptr ps;
@@ -1359,6 +1871,7 @@ sms_schedule (void)
   basic_block condition_bb = NULL;
   edge latch_edge;
   gcov_type trip_count = 0;
+  bool * sms_loop_p;
 
   loop_optimizer_init (LOOPS_HAVE_PREHEADERS
 		       | LOOPS_HAVE_RECORDED_EXITS);
@@ -1388,6 +1901,10 @@ sms_schedule (void)
      We use loop->num as index into this array.  */
   g_arr = XCNEWVEC (ddg_ptr, number_of_loops (cfun));
 
+  loop_iv = XCNEWVEC (struct rtx_iv, number_of_loops (cfun));
+
+  sms_loop_p = XCNEWVEC (bool, number_of_loops (cfun));
+
   if (dump_file)
   {
     fprintf (dump_file, "\n\nSMS analysis phase\n");
@@ -1399,7 +1916,7 @@ sms_schedule (void)
   FOR_EACH_LOOP (loop, 0)
     {
       rtx_insn *head, *tail;
-      rtx count_reg;
+      rtx_insn *non_sms_insn;
 
       /* For debugging.  */
       if (dbg_cnt (sms_sched_loop) == false)
@@ -1423,11 +1940,29 @@ sms_schedule (void)
         continue;
 
       if (! loop_single_full_bb_p (loop))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS not loop_single_full_bb_p\n");
-	continue;
-      }
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS not loop_single_full_bb_p\n");
+	  continue;
+	}
+
+      if (num_loop_insns (loop) < PARAM_VALUE (PARAM_SMS_LOOP_MIN_SIZE))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "SMS loop size = %d is too small\n",
+		     num_loop_insns (loop));
+	  continue;
+	}
+
+      if ((non_sms_insn = loop_sms_insn_check (loop)))
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Contain the insn could not do SMS\n");
+	      dump_insn_slim (dump_file, non_sms_insn);
+	    }
+	  continue;
+	}
 
       bb = loop->header;
 
@@ -1465,52 +2000,6 @@ sms_schedule (void)
           continue;
         }
 
-      /* Make sure this is a doloop.  */
-      if ( !(count_reg = doloop_register_get (head, tail)))
-      {
-        if (dump_file)
-          fprintf (dump_file, "SMS doloop_register_get failed\n");
-	continue;
-      }
-
-      /* Don't handle BBs with calls or barriers
-	 or !single_set with the exception of instructions that include
-	 count_reg---these instructions are part of the control part
-	 that do-loop recognizes.
-         ??? Should handle insns defining subregs.  */
-     for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
-      {
-         rtx set;
-
-        if (CALL_P (insn)
-            || BARRIER_P (insn)
-            || (NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE
-                && !reg_mentioned_p (count_reg, insn))
-            || (INSN_P (insn) && (set = single_set (insn))
-                && GET_CODE (SET_DEST (set)) == SUBREG))
-        break;
-      }
-
-      if (insn != NEXT_INSN (tail))
-	{
-	  if (dump_file)
-	    {
-	      if (CALL_P (insn))
-		fprintf (dump_file, "SMS loop-with-call\n");
-	      else if (BARRIER_P (insn))
-		fprintf (dump_file, "SMS loop-with-barrier\n");
-              else if ((NONDEBUG_INSN_P (insn) && !JUMP_P (insn)
-                && !single_set (insn) && GET_CODE (PATTERN (insn)) != USE))
-                fprintf (dump_file, "SMS loop-with-not-single-set\n");
-              else
-               fprintf (dump_file, "SMS loop with subreg in lhs\n");
-	      print_rtl_single (dump_file, insn);
-	    }
-
-	  continue;
-	}
-
       /* Always schedule the closing branch with the rest of the
          instructions. The branch is rotated to be in row ii-1 at the
          end of the scheduling procedure to make sure it's the last
@@ -1527,6 +2016,9 @@ sms_schedule (void)
         fprintf (dump_file, "...OK\n");
 
     }
+
+  sms_get_loop_iv (g_arr, loop_iv, sms_loop_p);
+
   if (dump_file)
   {
     fprintf (dump_file, "\nSMS transformation phase\n");
@@ -1536,16 +2028,19 @@ sms_schedule (void)
   /* We don't want to perform SMS on new loops - created by versioning.  */
   FOR_EACH_LOOP (loop, 0)
     {
+      struct rtx_iv iv = loop_iv [loop->num];
+      rtx count_reg = iv.base;
       rtx_insn *head, *tail;
-      rtx count_reg;
-      rtx_insn *count_init;
       int mii, rec_mii, stage_count, min_cycle;
-      int64_t loop_count = 0;
       bool opt_sc_p;
+      struct niter_desc *desc;
+      bool loop_version_p = true;
 
-      if (! (g = g_arr[loop->num]))
+      if (! sms_loop_p [loop->num])
         continue;
 
+      g = g_arr[loop->num];
+
       if (dump_file)
 	{
 	  rtx_insn *insn = BB_END (loop->header);
@@ -1585,28 +2080,8 @@ sms_schedule (void)
           fprintf (dump_file, "SMS num-stores %d\n", g->num_stores);
 	}
 
-
-      /* In case of th loop have doloop register it gets special
-	 handling.  */
-      count_init = NULL;
-      if ((count_reg = doloop_register_get (head, tail)))
-	{
-	  basic_block pre_header;
-
-	  pre_header = loop_preheader_edge (loop)->src;
-	  count_init = const_iteration_count (count_reg, pre_header,
-					      &loop_count);
-	}
       gcc_assert (count_reg);
 
-      if (dump_file && count_init)
-        {
-          fprintf (dump_file, "SMS const-doloop ");
-          fprintf (dump_file, "%" PRId64,
-		     loop_count);
-          fprintf (dump_file, "\n");
-        }
-
       node_order = XNEWVEC (int, g->num_nodes);
 
       mii = 1; /* Need to pass some estimate of mii.  */
@@ -1651,20 +2126,41 @@ sms_schedule (void)
 	      gcc_assert (stage_count >= 1);
 	    }
 
+	  /* Get loop iteration information.  */
+	  desc = get_simple_loop_desc (loop);
+
+	  if (desc->const_iter && desc->niter != 0)
+	    {
+	      if (dump_file)
+	        {
+		  fprintf (dump_file, "SMS loop iteration count = ");
+		  fprintf (dump_file, HOST_WIDE_INT_PRINT_DEC, desc->niter);
+		  fprintf (dump_file, "\nSMS stage_count = %d\n", stage_count);
+		}
+
+	      if (desc->niter < (uint64_t)stage_count)
+		{
+		  if (dump_file)
+		    fprintf (dump_file, "SMS failed due to loop count"
+					"< stage_count.\n");
+		  break;
+		}
+	      else
+		loop_version_p = false;
+	    }
+
 	  /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
 	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
-	      || (count_init && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
 	      if (dump_file)
 		{
 		  fprintf (dump_file, "SMS failed... \n");
 		  fprintf (dump_file, "SMS sched-failed (stage-count=%d,"
-			   " loop-count=", stage_count);
-		  fprintf (dump_file, "%" PRId64, loop_count);
-		  fprintf (dump_file, ", trip-count=");
+			   , stage_count);
+		  fprintf (dump_file, " trip-count=");
 		  fprintf (dump_file, "%" PRId64, trip_count);
 		  fprintf (dump_file, ")\n");
 		}
@@ -1707,6 +2203,14 @@ sms_schedule (void)
 
 	  canon_loop (loop);
 
+	  /* Try to adjust kernel loop end conditon when end condition is
+	     constant and there is count_ref reference in the loop.
+	     Skip the loop if the relpacement fail.  */
+	  if ((CONST_INT_P (iv.end) || (GET_CODE (iv.end) == CONST))
+	      && loop->count_ref_p
+	      && !loop_iteration_adjust_by_replace (g, ps, tail, iv, desc, false))
+	    break;
+
           if (dump_file)
             {
 	      dump_insn_location (tail);
@@ -1714,25 +2218,62 @@ sms_schedule (void)
 		       ps->ii, stage_count);
 	      print_partial_schedule (ps, dump_file);
 	    }
- 
-          /* case the BCT count is not known , Do loop-versioning */
-	  if (count_reg && ! count_init)
-            {
-	      rtx comp_rtx = gen_rtx_GT (VOIDmode, count_reg,
-					 gen_int_mode (stage_count,
-						       GET_MODE (count_reg)));
+
+          /* Do loop-versioning to avoid the loop count < stage_count
+	     execute SMS version loop.  */
+	  if (loop_version_p)
+	    {
+	      rtx comp_rtx;
+	      machine_mode mode = GET_MODE (count_reg);
+
+	      int stage_steps = stage_count * -INTVAL (iv.step);
+	      rtx loop_steps;
+
+	      /* To avoid count_reg not live before entry the loop.
+	       .L3:
+	               sub     r2, r3, #1
+	               add     r0, r0, r3
+	               mov     r3, r2
+	               cmp     r2, #0
+	               bne     .L3
+
+		 In this case count_reg is r2, but we should generate
+		 loop version condition by r3.  */
+	      count_reg = entry_loop_count_reg (loop, count_reg);
+
+	      /* loop_steps = count_reg - iv.end
+	         iv.end == 0 would be the special case
+		 and we don' have to generate MINUS.  */
+	      if (INTVAL (iv.end) == 0)
+		loop_steps = count_reg;
+	      else
+		loop_steps = gen_rtx_MINUS (mode, count_reg, iv.end);
+
+	      /* Generate loop version condition according the loop
+		 is increment or decrement.  */
+	      if (INTVAL (iv.step) > 0)
+		comp_rtx = gen_rtx_LT (VOIDmode, loop_steps,
+				       gen_int_mode (stage_steps, mode));
+	      else
+		comp_rtx = gen_rtx_GT (VOIDmode, loop_steps,
+				       gen_int_mode (stage_steps, mode));
+
 	      unsigned prob = (PROB_SMS_ENOUGH_ITERATIONS
 			       * REG_BR_PROB_BASE) / 100;
 
 	      loop_version (loop, comp_rtx, &condition_bb,
-	  		    prob, prob, REG_BR_PROB_BASE - prob,
+			    prob, prob, REG_BR_PROB_BASE - prob,
 			    true);
-	     }
+	    }
+
+	  /* Replace constant end condition in compare instruction
+	     to adjust kernel loop iteration count.  */
+	  if ((CONST_INT_P (iv.end) || (GET_CODE (iv.end) == CONST))
+	      && loop->count_ref_p)
+	    loop_iteration_adjust_by_replace (g, ps, tail, iv, desc, true);
 
-	  /* Set new iteration count of loop kernel.  */
-          if (count_reg && count_init)
-	    SET_SRC (single_set (count_init)) = GEN_INT (loop_count
-						     - stage_count + 1);
+	  /* Insert kernel loop iteraion adjustment instructions.  */
+	  gen_loop_iteration_adjust (g, ps, loop, tail, iv, stage_count - 1);
 
 	  /* Now apply the scheduled kernel to the RTL of the loop.  */
 	  permute_partial_schedule (ps, g->closing_branch->first_note);
@@ -1749,7 +2290,7 @@ sms_schedule (void)
 	  if (dump_file)
 	    print_node_sched_params (dump_file, g->num_nodes, ps);
 	  /* Generate prolog and epilog.  */
-          generate_prolog_epilog (ps, loop, count_reg, count_init);
+          generate_prolog_epilog (ps, loop, iv);
 	  break;
 	}
 
@@ -1760,6 +2301,8 @@ sms_schedule (void)
     }
 
   free (g_arr);
+  free (loop_iv);
+  free (sms_loop_p);
 
   /* Release scheduler data, needed until now because of DFA.  */
   haifa_sched_finish ();
diff --git a/gcc/params.def b/gcc/params.def
index 9e05401..8fa812f 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -375,6 +375,10 @@ DEFPARAM(PARAM_SMS_LOOP_AVERAGE_COUNT_THRESHOLD,
 	 "sms-loop-average-count-threshold",
 	 "A threshold on the average loop count considered by the swing modulo scheduler.",
 	 0, 0, 0)
+DEFPARAM(PARAM_SMS_LOOP_MIN_SIZE,
+	 "sms-loop-min-size",
+	 "A threshold on the loop instruction count considered by the swing modulo scheduler",
+	 8, 0, 0)
 
 DEFPARAM(HOT_BB_COUNT_WS_PERMILLE,
 	 "hot-bb-count-ws-permille",
-- 
2.5.0


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization
  2016-04-28  6:06 [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization Shiva Chen
  2016-04-28 13:26 ` Shiva Chen
@ 2016-04-29  9:30 ` Bernd Schmidt
  2016-05-05  6:12   ` Shiva Chen
  1 sibling, 1 reply; 4+ messages in thread
From: Bernd Schmidt @ 2016-04-29  9:30 UTC (permalink / raw)
  To: Shiva Chen, GCC Patches, Shiva Chen

On 04/28/2016 08:06 AM, Shiva Chen wrote:

> Could anyone help me to review the patch?
> Any suggestion would be very helpful.

You might want to split it up if there are several logically independent 
pieces. I can't quite make sense of it all, and I'm not too familiar 
with SMS anyway, so the following is not a complete review, just a 
selection of issues I observed.

There are a large number of formatting and style problems. I'll be 
pointing out some instances, but please read
    http://www.gnu.org/prep/standards/standards.html#Writing-C
and self-review your patch before resubmission.

> +static bool mem_write_insn_p (rtx_insn *);

Generally best to order your code so that you don't need forward 
declarations.

> -  /* We do not handle setting only part of the register.  */
> -  if (DF_REF_FLAGS (adef) & DF_REF_READ_WRITE)
> -    return GRD_INVALID;
> -

Why this change?

>   }
>
> +static rtx
> +get_rhs (rtx_insn *insn, rtx reg)

get_rhs might not be the most meaningful function name. We require 
documentation before every function that says what it does, and what the 
arguments mean. Please examine the surrounding code for examples.

> +
> +  /* Find iv increment/decrement rhs in following pattern
> +
> +     (parallel [
> +	(set (reg:CC_NOOV 100 cc)
> +	    (compare:CC_NOOV (plus:SI (reg:SI 147)
> +				      (const_int -1))
> +			     (const_int 0 [0])))
> +	(set (reg:SI 147)
> +	    (plus:SI (reg:SI 147)
> +		     (const_int -1 [0xffffffffffffffff]))
> +   */

Rather than quoting large RTL blocks, it would be better to explain what 
you're trying to do.

> @@ -1048,6 +1141,47 @@ iv_analyze_expr (rtx_insn *insn, rtx rhs, machine_mode mode,
>     return iv->base != NULL_RTX;
>   }
>
> +/* Auxiliary variable for mem_read_insn_p/mem_write_insn_p.  */
> +static bool mem_ref_p;

Auxiliary variable doing what? Rather than using a global, it might be 
better to use the data pointer passed through note_uses.

> +   /* To check the case REG is read write register
> +      in memory reference.  */
> +  if (mem_write_insn_p (insn))
> +    body = SET_DEST (set);
> +  else if (mem_read_insn_p (insn))
> +    body = SET_SRC (set);

This all looks a little odd; if you're looking for autoincs, why not 
just scan the entire INSN for a MEM, rather than classify it as 
mem_write or mem_read_insn?

> +      if (GET_CODE (body) == ZERO_EXTEND ||
> +	  GET_CODE (body) == SIGN_EXTEND)

Split lines before operators.

> +
> +  /* To handle the condition as follow
> +      (ne (plus:SI (reg:SI 163)
> +		   (const_int -1))
> +	  (const_int 0 [0]))
> +
> +     The pattern would generate
> +     by doloop optimization.  */

This comment confuses me more than it helps me.

> +
> +  /* handle the condition:
> +  (ne (plus:SI (reg:SI 163)
> +               (const_int -1 [0xffffffffffffffff]))
> +       (const_int 0 [0]))
> +  */

Handle how? This has comment formatting problems, which would be easier 
to make go away if you didn't just quote RTL. (If you must quote RTL, 
best to remove irrelevant bits such as :SI and [0xfffff....] and [0]), 
and replace register numbers like 163 with variables like A.

> +
> +  /* following code handle the condition:
> +     (ne (reg:SI 163) (reg:SI 176))
> +  */

Same problem.

> +	  if ((set = single_set (insn)))
 > +	    {
 > +	      if (SET_DEST (set) == count_reg)
 > +		continue;
 > +	      else
 > +		{
 > +		  loop->count_ref_p = true;
 > +		  return;
 > +		}
 > +	    }

Either
	  if ((set = single_set (insn)) != NULL_RTX)
or, better (also removing the outer declaration of set):

     rtx set = single_set (insn);
     if (set && SET_DEST (set) == count_reg)
       continue;
     loop->count_ref_p = true;
     return;

> +
> +	  if (GET_CODE (PATTERN (insn)) == PARALLEL)
> +	    {

No unnecessary braces if an if contains only a single statement.

> +  /* check count_reg reference in the loop
> +     and set result to loop->count_ref_p.  */
> +  check_count_reg_reference (loop, iv->base);

Comments should be full sentences, properly capitalized. But avoid 
comments that just describe what a called function is doing. This 
information should be part of that function's starting comment.

Skipping most of the rest of the SMS changes...


Bernd

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization
  2016-04-29  9:30 ` Bernd Schmidt
@ 2016-05-05  6:12   ` Shiva Chen
  0 siblings, 0 replies; 4+ messages in thread
From: Shiva Chen @ 2016-05-05  6:12 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: Shiva Chen, GCC Patches

Hi, Bernd

Thanks for the review.

>
> You might want to split it up if there are several logically independent pieces. I can't quite make sense of it all, and I'm not too familiar with SMS anyway, so the following is not a complete review, just a selection of issues I observed.

Ok, I have split the patch in several pieces by following link.
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg00377.html

> There are a large number of formatting and style problems. I'll be pointing out some instances, but please read
>    http://www.gnu.org/prep/standards/standards.html#Writing-C
> and self-review your patch before resubmission.

I have rewrite most the the comments.
Please correct me if i still missing something.

>> -  /* We do not handle setting only part of the register.  */
>> -  if (DF_REF_FLAGS (adef) & DF_REF_READ_WRITE)
>> -    return GRD_INVALID;
>> -
>
>
> Why this change?

The patch use loop induction variable analysis to find the register
contain loop count.
If the register is a read/write register, it would not identify as
loop iv and SMS would skip the loop.
Removing above condition and adding other code in loop-iv.c to
identify this case.

>>
>>   }
>>
>> +static rtx
>> +get_rhs (rtx_insn *insn, rtx reg)
>
>
> get_rhs might not be the most meaningful function name. We require documentation before every function that says what it does, and what the arguments mean. Please examine the surrounding code for examples.

Ok, i have  rename get_rhs to get_reg_calculate_expr and document
before the function.
>>
>>
>
> This all looks a little odd; if you're looking for autoincs, why not just scan the entire INSN for a MEM, rather than classify it as mem_write or mem_read_insn?

Yes, you're right. I have remove mem_read/write_insn_p and scan the
INSN directly.
>
>
>
> Bernd

Please reference the new submission by following link.
https://gcc.gnu.org/ml/gcc-patches/2016-05/msg00375.html

Thanks,
Shiva

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-05-05  6:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-28  6:06 [PATCH][SMS] SMS use loop induction variable analysis instead of depending on doloop optimization Shiva Chen
2016-04-28 13:26 ` Shiva Chen
2016-04-29  9:30 ` Bernd Schmidt
2016-05-05  6:12   ` Shiva Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).