public inbox for gcc-patches@gcc.gnu.org
* [0/4] Modulo scheduling with haifa-sched for C6X
@ 2011-09-13 22:03 Bernd Schmidt
From: Bernd Schmidt @ 2011-09-13 22:03 UTC
  To: GCC Patches

C6X has some rather nifty hardware support for modulo scheduling.
Consider the following loop:

.L13:
                ldh     .d1t1   *++A5[1], A7
        ||      add     .s1     -1, A0, A0
                ldh     .d2t1   *B5++[1], A8
                nop     1
        [A0]    b       .s1     .L13
                nop     2
                mpy     .m1     A8, A7, A9
                nop     1
                add     .d1     A3, A9, A3
        ;; condjump to .L13 occurs

This takes 9 cycles per iteration. Note that the back branch is not the
last insn in the block; it has five cycles' worth of delay slots.
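
In C, the loop is essentially a 16-bit dot product; a plausible
reconstruction of the source (my guess, the exact code doesn't matter):

    /* Reconstructed example, not taken from the original testcase.  */
    int dot (short *a, short *b, int n)
    {
      int sum = 0;
      int i;
      for (i = 0; i < n; i++)
        sum += a[i] * b[i];
      return sum;
    }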

A fully optimized version of this loop, making use of the loop
pipelining hardware, looks like this:

		[ load ILC register some cycles previously ]
                sploop  1
.L13:
                ldh     .d1t2   *++A5[1], B6
        ||      ldh     .d2t1   *B5++[1], A8
                nop     4
        ;; load to A8 occurs
        ;; load to B6 occurs
                mpy     .m1x    A8, B6, A9
                nop     1
        ;; multiplication occurs and stores to A9
                spkernel        7, 0
        ||      add     .s1     A3, A9, A3

The loop is contained between the sploop and spkernel instructions (with
one instruction executing in parallel with the spkernel). Instructions
are copied to a loop buffer and reexecuted from there, based on the
initiation interval (II), which is given as an argument to the sploop
instruction. In this case, the II is 1, which means we take one cycle
per loop iteration (plus prologue/epilogue costs obviously). Once the
loop buffer is full, every iteration of the loop executes in one cycle
as follows:

                ldh     .d1t2   *++A5[1], B6
        ||      ldh     .d2t1   *B5++[1], A8
        ||      mpy     .m1x    A8, B6, A9
        ||      add     .s1     A3, A9, A3

Note the unit reservations: d1, d2, t1, t2, m1, x1, and s1; no unit is
reserved twice.

This works because of the machine's exposed pipeline. It's not only
branches that have delay slots, but loads and multiplications as well;
in fact, every instruction that takes more than one cycle. The result
of a load is ready after five cycles, and in the loop above, it is
consumed immediately afterwards. No register value has a lifetime of
more than one cycle. This, together with the choice of reservations,
allows an optimal initiation interval of 1. (The sploop hardware knows
how to deal with interrupts; in general, it's not safe to expose delay
slots for anything other than branches outside a sploop/spkernel loop.)
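
To make the overlap explicit (cycle numbers derived from the sploop
body above, with II = 1): in any steady-state cycle, the parallel insns
belong to three different iterations:

                ldh  || ldh             ; loads for iteration i
        ||      mpy                     ; multiply for iteration i-5
        ||      add                     ; accumulate for iteration i-7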

I have added support for this to haifa-sched.c. I expect the question
"why not use SMS" to come up; there were a number of reasons why I felt
that code was unsuitable:

There are (or were, when I started) some glaring weaknesses in SMS,
such as giving up when the loop contains autoincrement addresses (which
is the case in the example above), and, by the looks of it, fairly poor
memory disambiguation compared to sched-ebb with cselib.

Correctness is also an issue. The ps_has_conflicts function pretends to
verify that it generates a valid schedule, but it ignores half the
target hooks, which means it would likely produce incorrect code on
C6X. We'd also have to duplicate the delayed-shadow pair mechanism that
was added to haifa-sched in order to cope with load/multiply delay slots.

Finally, SMS really runs too early: C6X has an exposed pipeline and
requires an exact schedule, including unit reservations, based on
register allocation; it won't do to use a schedule that was produced
before IRA. Using SMS after register allocation seems tricky at best,
since it wants to create new pseudos. Using a normal haifa-sched pass
later to fix up the schedule may "work" on other targets, as it
produces a valid straight-line schedule (however suboptimal for the
original goal of modulo scheduling), but it can only fail on C6X.

Making haifa-sched usable for modulo scheduling wasn't actually that
hard once I realized that I had already added support for fixed delays
between insns for the C6X delay slot support. All we really need is to
extend that to allow two different possible reasons for a fixed delay.
On top of the existing delay shadow mechanism, we can record a stage.
For a given II which we're trying, (II * stage) gives the delay between
two insns. The target code unrolls the loop to a factor that seems
reasonable and calls record_delay_slot_pair for the new insns. Almost
everything else is in place; we just need to accept the possibility
that no valid schedule can be found for a given II.
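
Roughly, a port is expected to do something like the following (a
sketch, not the exact C6X code -- see 4/4 for the real thing;
COPIES[stage][i] stands for the insn copies made while unrolling):

    /* Sketch only.  MAX_UID_ITER0 must be captured with get_max_uid ()
       before unrolling, so the scheduler can recognize iteration-0
       insns.  */
    static void
    record_modulo_pairs (int ii, int max_stages, int n_real_insns,
                         int max_uid_iter0, rtx **copies)
    {
      int stage, i;

      set_modulo_params (ii, max_stages, n_real_insns, max_uid_iter0);
      for (stage = 1; stage < max_stages; stage++)
        for (i = 0; i < n_real_insns; i++)
          /* CYCLES (the 0) is ignored when STAGES is nonzero; the
             effective delay is STAGES * II.  */
          record_delay_slot_pair (copies[0][i], copies[stage][i],
                                  0, stage);
    }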

Backtracking is helpful but must be limited to avoid exponential
blow-ups; the param I've chosen seems not to affect performance on any
of the benchmarks I've tried and solves the (rare in the first place)
compile-time problems I ran across.

There will be four patches in the series:
1/4: haifa-sched support
2/4: Make schedule_ebb callable from outside the scheduler
3/4: A hw-doloop bugfix
4/4: C6X pieces

I've tested them together on c6x-elf. They've been in our 4.5-based
branch for about half a year; we've successfully built kernels,
uClibc and applications with it. There'll be some followup patches later
to make use of the new regrename-across-loops functionality to produce
better unit reservations.


Bernd


* [1/4] Modulo scheduling support for haifa-sched
From: Bernd Schmidt @ 2011-09-13 22:13 UTC
  To: GCC Patches


This makes haifa-sched capable of acting like a modulo scheduler in
cooperation with a caller (expected to be in a port's md_reorg). As
explained in [0/4], most of the necessary code is already there in the
form of the delay slot support that was added for C6X.

The main new entry point is set_modulo_params, which informs the
scheduler of the II and the maximum number of stages for which the
caller has unrolled the loop. The caller must then call
record_delay_slot_pair to ensure the proper distances between copies of
instructions in different loop iterations.
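
For illustration, the two flavors of call (the first is taken verbatim
from the C6X patch; the second uses made-up operands):

    /* The shadow INSN must issue exactly five cycles after I1.  */
    record_delay_slot_pair (i1, insn, 5, 0);
    /* COPY is the same insn two stages later; CYCLES is ignored and
       the effective delay is 2 * II.  */
    record_delay_slot_pair (orig, copy, 0, 2);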

Once the scheduler completes a stage and all instructions from the first
iteration of the loop have been scheduled, the scheduler goes into
epilogue mode, where it only schedules insns which belong to the loop
epilogue. Once this has been successful, a valid schedule has been
found. On C6X, we'll then just pick the insns from the first loop
iteration and discard the rest; the SPLOOP hardware will automatically
execute them.
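
As a rough timeline (numbers derived from the schedule_block changes in
the patch below): if the last iteration-0 insn lands in stage s, then

    stages 0 .. s     fill the pipeline; all iteration-0 insns issue
    stages s+1 ..     epilogue mode; only insns with INSN_EXACT_TICK
                      set (shadows and stage copies) may issue
    stage > 2s+2      everything pending has drained; success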

Since the scheduler will not necessarily schedule all insns when in this
mode, there is a new function, resolve_dependencies, which just makes
sure that all the data structures are in the state expected at the end of
schedule_block.

This patch is probably best understood together with the C6X parts in 4/4.


Bernd

[-- Attachment #2: modsched1.diff --]

	* haifa-sched.c (modulo_ii, modulo_max_states, modulo_n_insns,
	modulo_insns_scheduled, modulo_iter0_max_uid, modulo_backtracks_left,
	modulo_last_stage): New static variables.
	(set_modulo_params, discard_delay_pairs_above): New functions.
	(struct delay_pair): New member stages.
	(htab_i2_traverse, htab_i1_traverse): New static functions.
	(record_delay_slot_pair): New arg stages.  All callers changed.
	Record it.
	(pair_delay): Take stages into account.
	(add_delay_dependencies): Don't do so for stage pairs.
	(struct sched_block_state): New member modulo_epilogue.
	(save_backtrack_point): Don't set SHADOW_P for stage pairs.
	(unschedule_insns_until): Decrease modulo_insns_scheduled.
	Set HARD_DEP without using or.
	(resolve_dependencies): New static function.
	(prune_ready_list): New arg modulo_epilogue_p.  All callers changed.
	If it is true, allow only insns with INSN_EXACT_TICK set.
	(schedule_block): Return bool, always true for normal scheduling,
	true or false depending on modulo scheduling success otherwise.
	Add bookkeeping for modulo scheduling, and call resolve_dependencies
	on everything left over after a modulo schedule.
	(haifa_sched_init): Remove check_cfg call.  Clear modulo_ii.
	* sched-int.h (schedule_block, record_delay_slot_pair): Adjust
	declarations.
	(set_modulo_params, discard_delay_pairs_above): Declare.
	* params.def (PARAM_MAX_MODULO_BACKTRACK_ATTEMPTS): New.
	* doc/invoke.texi (--param): Document it.

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 178779)
+++ doc/invoke.texi	(working copy)
@@ -8450,6 +8450,11 @@ before flushing the current state and st
 with few branches or calls can create excessively large lists which
 needlessly consume memory and resources.
 
+@item max-modulo-backtrack-attempts
+The maximum number of backtrack attempts the scheduler should make
+when modulo scheduling a loop.  Larger values can exponentially increase
+compile time.
+
 @item max-inline-insns-single
 Several parameters control the tree inliner used in gcc.
 This number sets the maximum number of instructions (counted in GCC's
Index: haifa-sched.c
===================================================================
--- haifa-sched.c	(revision 178779)
+++ haifa-sched.c	(working copy)
@@ -163,6 +163,31 @@ int issue_rate;
    enable a DCE pass.  */
 bool sched_no_dce;
 
+/* The current initiation interval used when modulo scheduling.  */
+static int modulo_ii;
+
+/* The maximum number of stages we are prepared to handle.  */
+static int modulo_max_stages;
+
+/* The number of insns that exist in each iteration of the loop.  We use this
+   to detect when we've scheduled all insns from the first iteration.  */
+static int modulo_n_insns;
+
+/* The current count of insns in the first iteration of the loop that have
+   already been scheduled.  */
+static int modulo_insns_scheduled;
+
+/* The maximum uid of insns from the first iteration of the loop.  */
+static int modulo_iter0_max_uid;
+
+/* The number of times we should attempt to backtrack when modulo scheduling.
+   Decreased each time we have to backtrack.  */
+static int modulo_backtracks_left;
+
+/* The stage in which the last insn from the original loop was
+   scheduled.  */
+static int modulo_last_stage;
+
 /* sched-verbose controls the amount of debugging output the
    scheduler prints.  It is controlled by -fsched-verbose=N:
    N>0 and no -DSR : the output is directed to stderr.
@@ -507,6 +532,29 @@ haifa_classify_insn (const_rtx insn)
 {
   return haifa_classify_rtx (PATTERN (insn));
 }
+\f
+/* After the scheduler initialization function has been called, this function
+   can be called to enable modulo scheduling.  II is the initiation interval
+   we should use, it affects the delays for delay_pairs that were recorded as
+   separated by a given number of stages.
+
+   MAX_STAGES provides us with a limit after which we give up scheduling;
+   the caller must have unrolled at least as many copies of the loop body
+   and recorded delay_pairs for them.
+
+   INSNS is the number of real (non-debug) insns in one iteration of
+   the loop.  MAX_UID can be used to test whether an insn belongs to
+   the first iteration of the loop; all of them have a uid lower than
+   MAX_UID.  */
+void
+set_modulo_params (int ii, int max_stages, int insns, int max_uid)
+{
+  modulo_ii = ii;
+  modulo_max_stages = max_stages;
+  modulo_n_insns = insns;
+  modulo_iter0_max_uid = max_uid;
+  modulo_backtracks_left = PARAM_VALUE (PARAM_MAX_MODULO_BACKTRACK_ATTEMPTS);
+}
 
 /* A structure to record a pair of insns where the first one is a real
    insn that has delay slots, and the second is its delayed shadow.
@@ -518,6 +566,10 @@ struct delay_pair
   struct delay_pair *next_same_i1;
   rtx i1, i2;
   int cycles;
+  /* When doing modulo scheduling, a delay_pair can also be used to
+     show that I1 and I2 are the same insn in a different stage.  If that
+     is the case, STAGES will be nonzero.  */
+  int stages;
 };
 
 /* Two hash tables to record delay_pairs, one indexed by I1 and the other
@@ -525,6 +577,62 @@ struct delay_pair
 static htab_t delay_htab;
 static htab_t delay_htab_i2;
 
+/* Called through htab_traverse.  Walk the hashtable using I2 as
+   index, and delete all elements involving an UID higher than
+   that pointed to by *DATA.  */
+static int
+htab_i2_traverse (void **slot, void *data)
+{
+  int maxuid = *(int *)data;
+  struct delay_pair *p = *(struct delay_pair **)slot;
+  if (INSN_UID (p->i2) >= maxuid || INSN_UID (p->i1) >= maxuid)
+    {
+      htab_clear_slot (delay_htab_i2, slot);
+    }
+  return 1;
+}
+
+/* Called through htab_traverse.  Walk the hashtable using I1 as
+   index, and delete all elements involving an UID higher than
+   that pointed to by *DATA.  */
+static int
+htab_i1_traverse (void **slot, void *data)
+{
+  int maxuid = *(int *)data;
+  struct delay_pair **pslot = (struct delay_pair **)slot;
+  struct delay_pair *p, *first, **pprev;
+
+  if (INSN_UID ((*pslot)->i1) >= maxuid)
+    {
+      htab_clear_slot (delay_htab, slot);
+      return 1;
+    }
+  pprev = &first;
+  for (p = *pslot; p; p = p->next_same_i1)
+    {
+      if (INSN_UID (p->i2) < maxuid)
+	{
+	  *pprev = p;
+	  pprev = &p->next_same_i1;
+	}
+    }
+  *pprev = NULL;
+  if (first == NULL)
+    htab_clear_slot (delay_htab, slot);
+  else
+    *pslot = first;
+  return 1;
+}
+
+/* Discard all delay pairs which involve an insn with an UID higher
+   than MAX_UID.  */
+void
+discard_delay_pairs_above (int max_uid)
+{
+  htab_traverse (delay_htab, htab_i1_traverse, &max_uid);
+  htab_traverse (delay_htab_i2, htab_i2_traverse, &max_uid);
+}
+
 /* Returns a hash value for X (which really is a delay_pair), based on
    hashing just I1.  */
 static hashval_t
@@ -555,18 +663,24 @@ delay_i2_eq (const void *x, const void *
   return ((const struct delay_pair *) x)->i2 == y;
 }
 
-/* This function can be called by a port just before it starts the
-   final scheduling pass.  It records the fact that an instruction
-   with delay slots has been split into two insns, I1 and I2.  The
-   first one will be scheduled normally and initiates the operation.
-   The second one is a shadow which must follow a specific number of
-   CYCLES after I1; its only purpose is to show the side effect that
-   occurs at that cycle in the RTL.  If a JUMP_INSN or a CALL_INSN has
-   been split, I1 should be a normal INSN, while I2 retains the
-   original insn type.  */
+/* This function can be called by a port just before it starts the final
+   scheduling pass.  It records the fact that an instruction with delay
+   slots has been split into two insns, I1 and I2.  The first one will be
+   scheduled normally and initiates the operation.  The second one is a
+   shadow which must follow a specific number of cycles after I1; its only
+   purpose is to show the side effect that occurs at that cycle in the RTL.
+   If a JUMP_INSN or a CALL_INSN has been split, I1 should be a normal INSN,
+   while I2 retains the original insn type.
+
+   There are two ways in which the number of cycles can be specified,
+   involving the CYCLES and STAGES arguments to this function.  If STAGES
+   is zero, we just use the value of CYCLES.  Otherwise, STAGES is a factor
+   which is multiplied by MODULO_II to give the number of cycles.  This is
+   only useful if the caller also calls set_modulo_params to enable modulo
+   scheduling.  */
 
 void
-record_delay_slot_pair (rtx i1, rtx i2, int cycles)
+record_delay_slot_pair (rtx i1, rtx i2, int cycles, int stages)
 {
   struct delay_pair *p = XNEW (struct delay_pair);
   struct delay_pair **slot;
@@ -574,6 +688,7 @@ record_delay_slot_pair (rtx i1, rtx i2,
   p->i1 = i1;
   p->i2 = i2;
   p->cycles = cycles;
+  p->stages = stages;
 
   if (!delay_htab)
     {
@@ -596,7 +711,10 @@ record_delay_slot_pair (rtx i1, rtx i2,
 static int
 pair_delay (struct delay_pair *p)
 {
-  return p->cycles;
+  if (p->stages == 0)
+    return p->cycles;
+  else
+    return p->stages * modulo_ii;
 }
 
 /* Given an insn INSN, add a dependence on its delayed shadow if it
@@ -619,6 +737,8 @@ add_delay_dependencies (rtx insn)
   if (!pair)
     return;
   add_dependence (insn, pair->i1, REG_DEP_ANTI);
+  if (pair->stages)
+    return;
 
   FOR_EACH_DEP (pair->i2, SD_LIST_BACK, sd_it, dep)
     {
@@ -626,7 +746,7 @@ add_delay_dependencies (rtx insn)
       struct delay_pair *other_pair
 	= (struct delay_pair *)htab_find_with_hash (delay_htab_i2, pro,
 						    htab_hash_pointer (pro));
-      if (!other_pair)
+      if (!other_pair || other_pair->stages)
 	continue;
       if (pair_delay (other_pair) >= pair_delay (pair))
 	{
@@ -1855,6 +1975,9 @@ struct sched_block_state
   /* True if a shadow insn has been scheduled in the current cycle, which
      means that no more normal insns can be issued.  */
   bool shadows_only_p;
+  /* True if we're winding down a modulo schedule, which means that we only
+     issue insns with INSN_EXACT_TICK set.  */
+  bool modulo_epilogue;
   /* Initialized with the machine's issue rate every cycle, and updated
      by calls to the variable_issue hook.  */
   int can_issue_more;
@@ -2227,7 +2350,7 @@ save_backtrack_point (struct delay_pair
       mark_backtrack_feeds (pair->i2, 1);
       INSN_TICK (pair->i2) = INVALID_TICK;
       INSN_EXACT_TICK (pair->i2) = clock_var + pair_delay (pair);
-      SHADOW_P (pair->i2) = true;
+      SHADOW_P (pair->i2) = pair->stages == 0;
       pair = pair->next_same_i1;
     }
 }
@@ -2253,11 +2376,14 @@ unschedule_insns_until (rtx insn)
       if (last != insn)
 	INSN_TICK (last) = INVALID_TICK;
 
+      if (modulo_ii > 0 && INSN_UID (last) < modulo_iter0_max_uid)
+	modulo_insns_scheduled--;
+
       for (sd_it = sd_iterator_start (last, SD_LIST_RES_FORW);
 	   sd_iterator_cond (&sd_it, &dep);)
 	{
 	  rtx con = DEP_CON (dep);
-	  TODO_SPEC (con) |= HARD_DEP;
+	  TODO_SPEC (con) = HARD_DEP;
 	  INSN_TICK (con) = INVALID_TICK;
 	  sd_unresolve_dep (sd_it);
 	}
@@ -2471,6 +2597,56 @@ estimate_shadow_tick (struct delay_pair
   return 0;
 }
 
+/* If INSN has no unresolved backwards dependencies, add it to the schedule and
+   recursively resolve all its forward dependencies.  */
+static void
+resolve_dependencies (rtx insn)
+{
+  sd_iterator_def sd_it;
+  dep_t dep;
+
+  /* Don't use sd_lists_empty_p; it ignores debug insns.  */
+  if (DEPS_LIST_FIRST (INSN_HARD_BACK_DEPS (insn)) != NULL
+      || DEPS_LIST_FIRST (INSN_SPEC_BACK_DEPS (insn)) != NULL)
+    return;
+
+  if (sched_verbose >= 4)
+    fprintf (sched_dump, ";;\tquickly resolving %d\n", INSN_UID (insn));
+
+  if (QUEUE_INDEX (insn) >= 0)
+    queue_remove (insn);
+
+  VEC_safe_push (rtx, heap, scheduled_insns, insn);
+
+  /* Update dependent instructions.  */
+  for (sd_it = sd_iterator_start (insn, SD_LIST_FORW);
+       sd_iterator_cond (&sd_it, &dep);)
+    {
+      rtx next = DEP_CON (dep);
+
+      if (sched_verbose >= 4)
+	fprintf (sched_dump, ";;\t\tdep %d against %d\n", INSN_UID (insn),
+		 INSN_UID (next));
+
+      /* Resolve the dependence between INSN and NEXT.
+	 sd_resolve_dep () moves current dep to another list thus
+	 advancing the iterator.  */
+      sd_resolve_dep (sd_it);
+
+      if (!IS_SPECULATION_BRANCHY_CHECK_P (insn))
+	{
+	  resolve_dependencies (next);
+	}
+      else
+	/* Check always has only one forward dependence (to the first insn in
+	   the recovery block), therefore, this will be executed only once.  */
+	{
+	  gcc_assert (sd_lists_empty_p (insn, SD_LIST_FORW));
+	}
+    }
+}
+
+
 /* Return the head and tail pointers of ebb starting at BEG and ending
    at END.  */
 void
@@ -3452,15 +3628,12 @@ commit_schedule (rtx prev_head, rtx tail
    issue an asm statement.
 
    If SHADOWS_ONLY_P is true, we eliminate all real insns and only
-   leave those for which SHADOW_P is true.
-
-   Return the number of cycles we must
-   advance to find the next ready instruction, or zero if there remain
-   insns on the ready list.  */
+   leave those for which SHADOW_P is true.  If MODULO_EPILOGUE is true,
+   we only leave insns which have an INSN_EXACT_TICK.  */
 
 static void
 prune_ready_list (state_t temp_state, bool first_cycle_insn_p,
-		  bool shadows_only_p)
+		  bool shadows_only_p, bool modulo_epilogue_p)
 {
   int i;
 
@@ -3471,6 +3644,12 @@ prune_ready_list (state_t temp_state, bo
       int cost = 0;
       const char *reason = "resource conflict";
 
+      if (modulo_epilogue_p && !DEBUG_INSN_P (insn)
+	  && INSN_EXACT_TICK (insn) == INVALID_TICK)
+	{
+	  cost = max_insn_queue_index;
+	  reason = "not an epilogue insn";
+	}
       if (shadows_only_p && !DEBUG_INSN_P (insn) && !SHADOW_P (insn))
 	{
 	  cost = 1;
@@ -3584,10 +3763,11 @@ verify_shadows (void)
    TARGET_BB, possibly bringing insns from subsequent blocks in the same
    region.  */
 
-void
+bool
 schedule_block (basic_block *target_bb)
 {
   int i;
+  bool success = modulo_ii == 0;
   struct sched_block_state ls;
   state_t temp_state = NULL;  /* It is used for multipass scheduling.  */
   int sort_p, advance, start_clock_var;
@@ -3708,6 +3888,9 @@ schedule_block (basic_block *target_bb)
   gcc_assert (VEC_length (rtx, scheduled_insns) == 0);
   sort_p = TRUE;
   must_backtrack = false;
+  modulo_insns_scheduled = 0;
+
+  ls.modulo_epilogue = false;
 
   /* Loop until all the insns in BB are scheduled.  */
   while ((*current_sched_info->schedule_more_p) ())
@@ -3737,8 +3920,41 @@ schedule_block (basic_block *target_bb)
 	}
       while (advance > 0);
 
-      if (ready.n_ready > 0)
-	prune_ready_list (temp_state, true, false);
+      if (ls.modulo_epilogue)
+	{
+	  int stage = clock_var / modulo_ii;
+	  if (stage > modulo_last_stage * 2 + 2)
+	    {
+	      if (sched_verbose >= 2)
+		fprintf (sched_dump,
+			 ";;\t\tmodulo scheduled succeeded at II %d\n",
+			 modulo_ii);
+	      success = true;
+	      goto end_schedule;
+	    }
+	}
+      else if (modulo_ii > 0)
+	{
+	  int stage = clock_var / modulo_ii;
+	  if (stage > modulo_max_stages)
+	    {
+	      if (sched_verbose >= 2)
+		fprintf (sched_dump,
+			 ";;\t\tfailing schedule due to excessive stages\n");
+	      goto end_schedule;
+	    }
+	  if (modulo_n_insns == modulo_insns_scheduled
+	      && stage > modulo_last_stage)
+	    {
+	      if (sched_verbose >= 2)
+		fprintf (sched_dump,
+			 ";;\t\tfound kernel after %d stages, II %d\n",
+			 stage, modulo_ii);
+	      ls.modulo_epilogue = true;
+	    }
+	}
+
+      prune_ready_list (temp_state, true, false, ls.modulo_epilogue);
       if (ready.n_ready == 0)
 	continue;
       if (must_backtrack)
@@ -3916,6 +4132,11 @@ schedule_block (basic_block *target_bb)
 
 	  /* DECISION is made.  */
 
+	  if (modulo_ii > 0 && INSN_UID (insn) < modulo_iter0_max_uid)
+	    {
+	      modulo_insns_scheduled++;
+	      modulo_last_stage = clock_var / modulo_ii;
+	    }
           if (TODO_SPEC (insn) & SPECULATIVE)
             generate_recovery_code (insn);
 
@@ -3968,7 +4189,8 @@ schedule_block (basic_block *target_bb)
 
 	  ls.first_cycle_insn_p = false;
 	  if (ready.n_ready > 0)
-	    prune_ready_list (temp_state, false, ls.shadows_only_p);
+	    prune_ready_list (temp_state, false, ls.shadows_only_p,
+			      ls.modulo_epilogue);
 	}
 
     do_backtrack:
@@ -3983,6 +4205,12 @@ schedule_block (basic_block *target_bb)
 		break;
 	      }
 	  }
+      if (must_backtrack && modulo_ii > 0)
+	{
+	  if (modulo_backtracks_left == 0)
+	    goto end_schedule;
+	  modulo_backtracks_left--;
+	}
       while (must_backtrack)
 	{
 	  struct haifa_saved_data *failed;
@@ -4016,7 +4244,48 @@ schedule_block (basic_block *target_bb)
 	    }
 	}
     }
+  if (ls.modulo_epilogue)
+    success = true;
  end_schedule:
+  if (modulo_ii > 0)
+    {
+      /* Once again, debug insn suckiness: they can be on the ready list
+	 even if they have unresolved dependencies.  To make our view
+	 of the world consistent, remove such "ready" insns.  */
+    restart_debug_insn_loop:
+      for (i = ready.n_ready - 1; i >= 0; i--)
+	{
+	  rtx x;
+
+	  x = ready_element (&ready, i);
+	  if (DEPS_LIST_FIRST (INSN_HARD_BACK_DEPS (x)) != NULL
+	      || DEPS_LIST_FIRST (INSN_SPEC_BACK_DEPS (x)) != NULL)
+	    {
+	      ready_remove (&ready, i);
+	      goto restart_debug_insn_loop;
+	    }
+	}
+      for (i = ready.n_ready - 1; i >= 0; i--)
+	{
+	  rtx x;
+
+	  x = ready_element (&ready, i);
+	  resolve_dependencies (x);
+	}
+      for (i = 0; i <= max_insn_queue_index; i++)
+	{
+	  rtx link;
+	  while ((link = insn_queue[i]) != NULL)
+	    {
+	      rtx x = XEXP (link, 0);
+	      insn_queue[i] = XEXP (link, 1);
+	      QUEUE_INDEX (x) = QUEUE_NOWHERE;
+	      free_INSN_LIST_node (link);
+	      resolve_dependencies (x);
+	    }
+	}
+    }
+
   /* Debug info.  */
   if (sched_verbose)
     {
@@ -4024,11 +4293,11 @@ schedule_block (basic_block *target_bb)
       debug_ready_list (&ready);
     }
 
-  if (current_sched_info->queue_must_finish_empty)
+  if (modulo_ii == 0 && current_sched_info->queue_must_finish_empty)
     /* Sanity check -- queue must be empty now.  Meaningless if region has
        multiple bbs.  */
     gcc_assert (!q_size && !ready.n_ready && !ready.n_debug);
-  else
+  else if (modulo_ii == 0)
     {
       /* We must maintain QUEUE_INDEX between blocks in region.  */
       for (i = ready.n_ready - 1; i >= 0; i--)
@@ -4056,9 +4325,16 @@ schedule_block (basic_block *target_bb)
 	  }
     }
 
-  commit_schedule (prev_head, tail, target_bb);
-  if (sched_verbose)
-    fprintf (sched_dump, ";;   total time = %d\n", clock_var);
+  if (success)
+    {
+      commit_schedule (prev_head, tail, target_bb);
+      if (sched_verbose)
+	fprintf (sched_dump, ";;   total time = %d\n", clock_var);
+    }
+  else
+    last_scheduled_insn = tail;
+
+  VEC_truncate (rtx, scheduled_insns, 0);
 
   if (!current_sched_info->queue_must_finish_empty
       || haifa_recovery_bb_recently_added_p)
@@ -4096,6 +4372,8 @@ schedule_block (basic_block *target_bb)
   current_sched_info->tail = tail;
 
   free_backtrack_queue ();
+
+  return success;
 }
 \f
 /* Set_priorities: compute priority of each insn in the block.  */
@@ -4303,16 +4581,11 @@ haifa_sched_init (void)
   sched_create_empty_bb = sched_create_empty_bb_1;
   haifa_recovery_bb_ever_added_p = false;
 
-#ifdef ENABLE_CHECKING
-  /* This is used preferably for finding bugs in check_cfg () itself.
-     We must call sched_bbs_init () before check_cfg () because check_cfg ()
-     assumes that the last insn in the last bb has a non-null successor.  */
-  check_cfg (0, 0);
-#endif
-
   nr_begin_data = nr_begin_control = nr_be_in_data = nr_be_in_control = 0;
   before_recovery = 0;
   after_recovery = 0;
+
+  modulo_ii = 0;
 }
 
 /* Finish work with the data specific to the Haifa scheduler.  */
Index: sched-int.h
===================================================================
--- sched-int.h	(revision 178779)
+++ sched-int.h	(working copy)
@@ -1257,7 +1257,7 @@ extern int dep_cost (dep_t);
 extern int set_priorities (rtx, rtx);
 
 extern void sched_setup_bb_reg_pressure_info (basic_block, rtx);
-extern void schedule_block (basic_block *);
+extern bool schedule_block (basic_block *);
 
 extern int cycle_issued_insns;
 extern int issue_rate;
@@ -1330,7 +1335,9 @@ extern int current_blocks;
 extern int target_bb;
 extern bool sched_no_dce;
 
-extern void record_delay_slot_pair (rtx, rtx, int);
+extern void set_modulo_params (int, int, int, int);
+extern void record_delay_slot_pair (rtx, rtx, int, int);
+extern void discard_delay_pairs_above (int);
 extern void free_delay_pairs (void);
 extern void add_delay_dependencies (rtx);
 extern bool sched_is_disabled_for_current_region_p (void);
Index: params.def
===================================================================
--- params.def	(revision 178779)
+++ params.def	(working copy)
@@ -165,6 +165,13 @@ DEFPARAM(PARAM_MAX_PENDING_LIST_LENGTH,
 	 "The maximum length of scheduling's pending operations list",
 	 32, 0, 0)
 
+/* This parameter limits the number of backtracking attempts when using the
+   haifa scheduler for modulo scheduling.  */
+DEFPARAM(PARAM_MAX_MODULO_BACKTRACK_ATTEMPTS,
+	 "max-modulo-backtrack-attempts",
+	 "The maximum number of backtrack attempts the scheduler should make when modulo scheduling a loop",
+	 40, 0, 0)
+
 DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
 	 "large-function-insns",
 	 "The size of function body to be considered large",
Index: config/c6x/c6x.c
===================================================================
--- config/c6x/c6x.c	(revision 178779)
+++ config/c6x/c6x.c	(working copy)
@@ -4803,7 +4978,7 @@ split_delayed_branch (rtx insn)
   i1 = emit_insn_before (pat, insn);
   PATTERN (insn) = newpat;
   INSN_CODE (insn) = -1;
-  record_delay_slot_pair (i1, insn, 5);
+  record_delay_slot_pair (i1, insn, 5, 0);
 }
 
 /* Split every insn (i.e. jumps and calls) which can have delay slots into


* [2/4] Make schedule_ebb callable from elsewhere
From: Bernd Schmidt @ 2011-09-13 22:14 UTC
  To: GCC Patches


This is just some code rearrangement to make it possible for c6x.c to
call schedule_ebbs_init, schedule_ebb and schedule_ebbs_finish.
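
The intended call sequence from a port then becomes (a sketch; HEAD and
TAIL delimit the loop body):

    schedule_ebbs_init ();
    /* Returns NULL instead of the last bb if modulo scheduling
       failed.  */
    bb = schedule_ebb (head, tail, true);
    schedule_ebbs_finish ();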


Bernd


[-- Attachment #2: modsched2.diff --]

	* sched-ebb.c (schedule_ebb): No longer static.  Remove declaration.
	New arg modulo_scheduling.  All callers changed.  Move note handling
	code here from schedule_ebbs.
	(schedule_ebbs_finish, schedule_ebbs_init): New functions, broken
	out of schedule_ebbs.
	(schedule_ebbs): Call them.  Remove note handling code moved to
	schedule_ebb.
	* sched-int.h (schedule_ebb, schedule_ebbs_init,
	schedule_ebbs_finish): Declare.

Index: sched-ebb.c
===================================================================
--- sched-ebb.c	(revision 178779)
+++ sched-ebb.c	(working copy)
@@ -66,7 +66,6 @@ static int rank (rtx, rtx);
 static int ebb_contributes_to_priority (rtx, rtx);
 static basic_block earliest_block_with_similiar_load (basic_block, rtx);
 static void add_deps_for_risky_insns (rtx, rtx);
-static basic_block schedule_ebb (rtx, rtx);
 static void debug_ebb_dependencies (rtx, rtx);
 
 static void ebb_add_remove_insn (rtx, int);
@@ -476,14 +475,35 @@ add_deps_for_risky_insns (rtx head, rtx
     }
 }
 
-/* Schedule a single extended basic block, defined by the boundaries HEAD
-   and TAIL.  */
+/* Schedule a single extended basic block, defined by the boundaries
+   HEAD and TAIL.
 
-static basic_block
-schedule_ebb (rtx head, rtx tail)
+   We change our expectations about scheduler behaviour depending on
+   whether MODULO_SCHEDULING is true.  If it is, we expect that the
+   caller has already called set_modulo_params and created delay pairs
+   as appropriate.  If the modulo schedule failed, we return
+   NULL_RTX.  */
+
+basic_block
+schedule_ebb (rtx head, rtx tail, bool modulo_scheduling)
 {
   basic_block first_bb, target_bb;
   struct deps_desc tmp_deps;
+  bool success;
+
+  /* Blah.  We should fix the rest of the code not to get confused by
+     a note or two.  */
+  while (head != tail)
+    {
+      if (NOTE_P (head) || DEBUG_INSN_P (head))
+	head = NEXT_INSN (head);
+      else if (NOTE_P (tail) || DEBUG_INSN_P (tail))
+	tail = PREV_INSN (tail);
+      else if (LABEL_P (head))
+	head = NEXT_INSN (head);
+      else
+	break;
+    }
 
   first_bb = BLOCK_FOR_INSN (head);
   last_bb = BLOCK_FOR_INSN (tail);
@@ -530,7 +550,9 @@ schedule_ebb (rtx head, rtx tail)
 
   /* Make ready list big enough to hold all the instructions from the ebb.  */
   sched_extend_ready_list (rgn_n_insns);
-  schedule_block (&target_bb);
+  success = schedule_block (&target_bb);
+  gcc_assert (success || modulo_scheduling);
+
   /* Free ready list.  */
   sched_finish_ready_list ();
 
@@ -538,7 +560,7 @@ schedule_ebb (rtx head, rtx tail)
      so we may made some of them empty.  Can't assert (b == last_bb).  */
 
   /* Sanity check: verify that all region insns were scheduled.  */
-  gcc_assert (sched_rgn_n_insns == rgn_n_insns);
+  gcc_assert (modulo_scheduling || sched_rgn_n_insns == rgn_n_insns);
 
   /* Free dependencies.  */
   sched_free_deps (current_sched_info->head, current_sched_info->tail, true);
@@ -555,29 +577,14 @@ schedule_ebb (rtx head, rtx tail)
       delete_basic_block (last_bb->next_bb);
     }
 
-  return last_bb;
+  return success ? last_bb : NULL;
 }
 
-/* The one entry point in this file.  */
-
+/* Perform initializations before running schedule_ebbs or a single
+   schedule_ebb.  */
 void
-schedule_ebbs (void)
+schedule_ebbs_init (void)
 {
-  basic_block bb;
-  int probability_cutoff;
-  rtx tail;
-
-  if (profile_info && flag_branch_probabilities)
-    probability_cutoff = PARAM_VALUE (TRACER_MIN_BRANCH_PROBABILITY_FEEDBACK);
-  else
-    probability_cutoff = PARAM_VALUE (TRACER_MIN_BRANCH_PROBABILITY);
-  probability_cutoff = REG_BR_PROB_BASE / 100 * probability_cutoff;
-
-  /* Taking care of this degenerate case makes the rest of
-     this code simpler.  */
-  if (n_basic_blocks == NUM_FIXED_BLOCKS)
-    return;
-
   /* Setup infos.  */
   {
     memcpy (&ebb_common_sched_info, &haifa_common_sched_info,
@@ -599,6 +606,43 @@ schedule_ebbs (void)
   /* Initialize DONT_CALC_DEPS and ebb-{start, end} markers.  */
   bitmap_initialize (&dont_calc_deps, 0);
   bitmap_clear (&dont_calc_deps);
+}
+
+/* Perform cleanups after scheduling using schedules_ebbs or schedule_ebb.  */
+void
+schedule_ebbs_finish (void)
+{
+  bitmap_clear (&dont_calc_deps);
+
+  /* Reposition the prologue and epilogue notes in case we moved the
+     prologue/epilogue insns.  */
+  if (reload_completed)
+    reposition_prologue_and_epilogue_notes ();
+
+  haifa_sched_finish ();
+}
+
+/* The one entry point in this file.  */
+
+void
+schedule_ebbs (void)
+{
+  basic_block bb;
+  int probability_cutoff;
+  rtx tail;
+
+  /* Taking care of this degenerate case makes the rest of
+     this code simpler.  */
+  if (n_basic_blocks == NUM_FIXED_BLOCKS)
+    return;
+
+  if (profile_info && flag_branch_probabilities)
+    probability_cutoff = PARAM_VALUE (TRACER_MIN_BRANCH_PROBABILITY_FEEDBACK);
+  else
+    probability_cutoff = PARAM_VALUE (TRACER_MIN_BRANCH_PROBABILITY);
+  probability_cutoff = REG_BR_PROB_BASE / 100 * probability_cutoff;
+
+  schedule_ebbs_init ();
 
   /* Schedule every region in the subroutine.  */
   FOR_EACH_BB (bb)
@@ -625,30 +669,9 @@ schedule_ebbs (void)
 	  bb = bb->next_bb;
 	}
 
-      /* Blah.  We should fix the rest of the code not to get confused by
-	 a note or two.  */
-      while (head != tail)
-	{
-	  if (NOTE_P (head) || DEBUG_INSN_P (head))
-	    head = NEXT_INSN (head);
-	  else if (NOTE_P (tail) || DEBUG_INSN_P (tail))
-	    tail = PREV_INSN (tail);
-	  else if (LABEL_P (head))
-	    head = NEXT_INSN (head);
-	  else
-	    break;
-	}
-
-      bb = schedule_ebb (head, tail);
+      bb = schedule_ebb (head, tail, false);
     }
-  bitmap_clear (&dont_calc_deps);
-
-  /* Reposition the prologue and epilogue notes in case we moved the
-     prologue/epilogue insns.  */
-  if (reload_completed)
-    reposition_prologue_and_epilogue_notes ();
-
-  haifa_sched_finish ();
+  schedule_ebbs_finish ();
 }
 
 /* INSN has been added to/removed from current ebb.  */
Index: sched-int.h
===================================================================
--- sched-int.h	(revision 178779)
+++ sched-int.h	(working copy)
@@ -1280,7 +1280,12 @@ extern rtx sched_emit_insn (rtx);
 extern rtx get_ready_element (int);
 extern int number_in_ready (void);
 \f
+/* Types and functions in sched-ebb.c.  */
 
+extern basic_block schedule_ebb (rtx, rtx, bool);
+extern void schedule_ebbs_init (void);
+extern void schedule_ebbs_finish (void);
+\f
 /* Types and functions in sched-rgn.c.  */
 
 /* A region is the main entity for interblock scheduling: insns


* [3/4] Fix debug_insn problem in hw-doloop
From: Bernd Schmidt @ 2011-09-13 22:20 UTC
  To: GCC Patches


This fixes an oversight that led to a compare-debug failure in the
testsuite with the other changes applied. We really don't care whether
a DEBUG_INSN uses the iteration register.

Will commit soon as obvious.


Bernd


[-- Attachment #2: modsched3.diff --]

	* hw-doloop.c (scan_loop): Compute register usage only for non-debug
	insns.

Index: hw-doloop.c
===================================================================
--- hw-doloop.c	(revision 178779)
+++ hw-doloop.c	(working copy)
@@ -123,7 +123,7 @@ scan_loop (hwloop_info loop)
 	  df_ref *def_rec;
 	  HARD_REG_SET set_this_insn;
 
-	  if (!INSN_P (insn))
+	  if (!NONDEBUG_INSN_P (insn))
 	    continue;
 
 	  if (recog_memoized (insn) < 0


* [4/4] C6X pieces for modulo scheduling with haifa-sched
From: Bernd Schmidt @ 2011-09-13 22:32 UTC
  To: GCC Patches


This is the final piece that makes use of the new haifa-sched
functionality in c6x.c. It also makes use of the hw-doloop code, which
I adapted from Blackfin a while ago.

After finding a candidate loop, the hwloop_optimize function unrolls it
to a suitable degree, then tries successive values for II in the hope of
finding a valid schedule. If that succeeds, it modifies the schedule it
got back so that only the necessary instructions remain; sploop and
spkernel are added, and BB_DISABLE_SCHEDULE is set so that the following
final scheduling pass doesn't touch the basic block.

If no valid schedule is found, the loop is restored to its original state.
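
Reduced to its essentials, the search looks like this (a sketch with
names borrowed from the real code; hwloop_optimize below interleaves
this with insn splitting, the res_mii bound and the restore logic):

    /* Sketch only; delay pairs for all unrolled copies are assumed to
       have been recorded already.  */
    static bool
    find_sploop_schedule (rtx head, rtx tail, int min_ii, int max_ii,
                          int max_parallel, int n_real_insns,
                          int max_uid_iter0)
    {
      int sp_ii;

      for (sp_ii = min_ii; sp_ii <= max_ii; sp_ii++)
        {
          set_modulo_params (sp_ii, max_parallel, n_real_insns,
                             max_uid_iter0);
          schedule_ebbs_init ();
          if (schedule_ebb (head, tail, true) != NULL)
            {
              /* Valid schedule found at this II.  */
              schedule_ebbs_finish ();
              return true;
            }
          schedule_ebbs_finish ();
          /* On failure, restore the original insn stream before
             retrying with a larger II.  */
        }
      return false;
    }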


Bernd

[-- Attachment #2: modsched4.diff --]

	* common/config/c6x/c6x-common.c (c6x_option_optimization_table):
	Enable -fmodulo-sched at -O2 and above.
	* config/c6x/c6x.md (doloop_end): New expander.
	(mvilc, sploop, spkernel, loop_end): New patterns.
	(loop_end with memory destination splitter): New.
	* config/c6x/c6x.c: Include "hw-doloop.h".
	(enum unitreqs): New.
	(unit_req_table): New typedef.
	(unit_reqs): New static variable.
	(unit_req_factor, get_unit_reqs, count_unit_reqs, merge_unit_reqs,
	res_mii, split_delayed_nonbranch, undo_split_delayed_nonbranch,
	hwloop_pattern_reg, bb_earliest_end_cycle, filter_insns_above,
	hwloop_optimize, hwloop_fail, c6x_hwloops): New static functions.
	(struct c6x_sched_context): New member last_scheduled_iter0.
	(init_sched_state): Initialize it.
	(c6x_variable_issue): Update it.
	(sploop_max_uid_iter0): New static variable.
	(c6x_sched_reorder_1): Be careful about issuing sploop.
	(c6x_reorg): Call c6x_hwlooops before the final schedule.

Index: common/config/c6x/c6x-common.c
===================================================================
--- common/config/c6x/c6x-common.c	(revision 178779)
+++ common/config/c6x/c6x-common.c	(working copy)
@@ -33,6 +33,7 @@ static const struct default_options c6x_
   {
     { OPT_LEVELS_1_PLUS, OPT_fomit_frame_pointer, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_frename_registers, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_fmodulo_sched, NULL, 1 },
     { OPT_LEVELS_ALL, OPT_freciprocal_math, NULL, 1 },
     { OPT_LEVELS_NONE, 0, NULL, 0 }
   };
Index: config/c6x/c6x.md
===================================================================
--- config/c6x/c6x.md	(revision 178779)
+++ config/c6x/c6x.md	(working copy)
@@ -1391,6 +1391,106 @@ (define_insn_and_split "eh_return"
 )
 
 ;; -------------------------------------------------------------------------
+;; Doloop
+;; -------------------------------------------------------------------------
+
+; operand 0 is the loop count pseudo register
+; operand 1 is the number of loop iterations or 0 if it is unknown
+; operand 2 is the maximum number of loop iterations
+; operand 3 is the number of levels of enclosed loops
+; operand 4 is the label to jump to at the top of the loop
+(define_expand "doloop_end"
+  [(parallel [(set (pc) (if_then_else
+			  (ne (match_operand:SI 0 "" "")
+			      (const_int 1))
+			  (label_ref (match_operand 4 "" ""))
+			  (pc)))
+	      (set (match_dup 0)
+		   (plus:SI (match_dup 0)
+			    (const_int -1)))
+	      (clobber (match_scratch:SI 5 ""))])]
+  "TARGET_INSNS_64PLUS && optimize"
+{
+  /* The loop optimizer doesn't check the predicates... */
+  if (GET_MODE (operands[0]) != SImode)
+    FAIL;
+})
+
+(define_insn "mvilc"
+  [(set (reg:SI REG_ILC)
+	(unspec [(match_operand:SI 0 "register_operand" "a,b")] UNSPEC_MVILC))]
+  "TARGET_INSNS_64PLUS"
+  "%|%.\\tmvc\\t%$\\t%0, ILC"
+  [(set_attr "predicable" "no")
+   (set_attr "cross" "y,n")
+   (set_attr "units" "s")
+   (set_attr "dest_regfile" "b")
+   (set_attr "type" "mvilc")])
+  
+(define_insn "sploop"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "i")
+		     (reg:SI REG_ILC)]
+		    UNSPECV_SPLOOP)]
+  "TARGET_INSNS_64PLUS"
+  "%|%.\\tsploop\t%0"
+  [(set_attr "predicable" "no")
+   (set_attr "type" "sploop")])
+  
+(define_insn "spkernel"
+  [(set (pc)
+	(if_then_else
+	 (ne (unspec_volatile:SI
+	      [(match_operand:SI 0 "const_int_operand" "i")
+	       (match_operand:SI 1 "const_int_operand" "i")]
+	      UNSPECV_SPKERNEL)
+	     (const_int 1))
+	 (label_ref (match_operand 2 "" ""))
+	 (pc)))]
+  "TARGET_INSNS_64PLUS"
+  "%|%.\\tspkernel\t%0, %1"
+  [(set_attr "predicable" "no")
+   (set_attr "type" "spkernel")])
+  
+(define_insn "loop_end"
+  [(set (pc)
+	(if_then_else (ne (match_operand:SI 3 "nonimmediate_operand" "0,0,0,*r")
+			  (const_int 1))
+		      (label_ref (match_operand 1 "" ""))
+		      (pc)))
+   (set (match_operand:SI 0 "nonimmediate_operand" "=AB,*r,m,m")
+	(plus:SI (match_dup 3)
+		 (const_int -1)))
+   (clobber (match_scratch:SI 2 "=X,&AB,&AB,&AB"))]
+  "TARGET_INSNS_64PLUS && optimize"
+  "#"
+  [(set_attr "type" "spkernel")])
+
+(define_split
+  [(set (pc)
+	(if_then_else (ne (match_operand:SI 3 "nonimmediate_operand" "")
+			  (const_int 1))
+		      (label_ref (match_operand 1 "" ""))
+		      (pc)))
+   (set (match_operand:SI 0 "memory_operand" "")
+	(plus:SI (match_dup 3)
+		 (const_int -1)))
+   (clobber (match_scratch 2))]
+  ""
+  [(set (match_dup 2) (plus:SI (match_dup 3) (const_int -1)))
+   (set (match_dup 0) (match_dup 2))
+   (set (pc)
+	(if_then_else (ne (match_dup 2) (const_int 0))
+		      (label_ref (match_dup 1))
+		      (pc)))]
+{
+  if (!REG_P (operands[3]))
+    {
+      emit_move_insn (operands[2], operands[3]);
+      operands[3] = operands[2];
+    }
+})
+
+;; -------------------------------------------------------------------------
 ;; Delayed-branch real jumps and shadows
 ;; -------------------------------------------------------------------------
 
Index: config/c6x/c6x.c
===================================================================
--- config/c6x/c6x.c	(revision 178779)
+++ config/c6x/c6x.c	(working copy)
@@ -50,6 +50,7 @@
 #include "sel-sched.h"
 #include "debug.h"
 #include "opts.h"
+#include "hw-doloop.h"
 
 /* Table of supported architecture variants.  */
 typedef struct
@@ -160,6 +161,27 @@ static int c6x_unit_codes[ARRAY_SIZE (c6
 
 #define RESERVATION_S1 2
 #define RESERVATION_S2 10
+
+/* An enum for the unit requirements we count in the UNIT_REQS table.  */
+enum unitreqs
+{
+  UNIT_REQ_D,
+  UNIT_REQ_L,
+  UNIT_REQ_S,
+  UNIT_REQ_M,
+  UNIT_REQ_DL,
+  UNIT_REQ_DS,
+  UNIT_REQ_LS,
+  UNIT_REQ_DLS,
+  UNIT_REQ_T,
+  UNIT_REQ_X,
+  UNIT_REQ_MAX
+};
+
+/* A table used to count unit requirements.  Used when computing minimum
+   iteration intervals.  */
+typedef int unit_req_table[2][UNIT_REQ_MAX];
+static unit_req_table unit_reqs;
 \f
 /* Register map for debugging.  */
 int const dbx_register_map[FIRST_PSEUDO_REGISTER] =
@@ -3157,6 +3179,149 @@ assign_reservations (rtx head, rtx end)
 	  }
     }
 }
+
+/* Return a factor by which to weight unit imbalances for a reservation
+   R.  */
+static int
+unit_req_factor (enum unitreqs r)
+{
+  switch (r)
+    {
+    case UNIT_REQ_D:
+    case UNIT_REQ_L:
+    case UNIT_REQ_S:
+    case UNIT_REQ_M:
+    case UNIT_REQ_X:
+    case UNIT_REQ_T:
+      return 1;
+    case UNIT_REQ_DL:
+    case UNIT_REQ_LS:
+    case UNIT_REQ_DS:
+      return 2;
+    case UNIT_REQ_DLS:
+      return 3;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Examine INSN, and store in REQ1/SIDE1 and REQ2/SIDE2 the unit
+   requirements.  Returns zero if INSN can't be handled, otherwise
+   either one or two to show how many of the two pairs are in use.
+   REQ1 is always used, it holds what is normally thought of as the
+   instruction's reservation, e.g. UNIT_REQ_DL.  REQ2 is used to either
+   describe a cross path, or for loads/stores, the T unit.  */
+static int
+get_unit_reqs (rtx insn, int *req1, int *side1, int *req2, int *side2)
+{
+  enum attr_units units;
+  enum attr_cross cross;
+  int side, req;
+
+  if (!NONDEBUG_INSN_P (insn) || recog_memoized (insn) < 0)
+    return 0;
+  units = get_attr_units (insn);
+  if (units == UNITS_UNKNOWN)
+    return 0;
+  side = get_insn_side (insn, units);
+  cross = get_attr_cross (insn);
+
+  req = (units == UNITS_D ? UNIT_REQ_D
+	 : units == UNITS_D_ADDR ? UNIT_REQ_D
+	 : units == UNITS_DL ? UNIT_REQ_DL
+	 : units == UNITS_DS ? UNIT_REQ_DS
+	 : units == UNITS_L ? UNIT_REQ_L
+	 : units == UNITS_LS ? UNIT_REQ_LS
+	 : units == UNITS_S ? UNIT_REQ_S
+	 : units == UNITS_M ? UNIT_REQ_M
+	 : units == UNITS_DLS ? UNIT_REQ_DLS
+	 : -1);
+  gcc_assert (req != -1);
+  *req1 = req;
+  *side1 = side;
+  if (units == UNITS_D_ADDR)
+    {
+      *req2 = UNIT_REQ_T;
+      *side2 = side ^ (cross == CROSS_Y ? 1 : 0);
+      return 2;
+    }
+  else if (cross == CROSS_Y)
+    {
+      *req2 = UNIT_REQ_X;
+      *side2 = side;
+      return 2;
+    }
+  return 1;
+}
+
+/* Walk the insns between and including HEAD and TAIL, and mark the
+   resource requirements in the unit_reqs table.  */
+static void
+count_unit_reqs (unit_req_table reqs, rtx head, rtx tail)
+{
+  rtx insn;
+
+  memset (reqs, 0, sizeof (unit_req_table));
+
+  for (insn = head; insn != NEXT_INSN (tail); insn = NEXT_INSN (insn))
+    {
+      int side1, side2, req1, req2;
+
+      switch (get_unit_reqs (insn, &req1, &side1, &req2, &side2))
+	{
+	case 2:
+	  reqs[side2][req2]++;
+	  /* fall through */
+	case 1:
+	  reqs[side1][req1]++;
+	  break;
+	}
+    }
+}
+
+/* Update the table REQS by merging more specific unit reservations into
+   more general ones, i.e. counting (for example) UNIT_REQ_D also in
+   UNIT_REQ_DL, DS, and DLS.  */
+static void
+merge_unit_reqs (unit_req_table reqs)
+{
+  int side;
+  for (side = 0; side < 2; side++)
+    {
+      int d = reqs[side][UNIT_REQ_D];
+      int l = reqs[side][UNIT_REQ_L];
+      int s = reqs[side][UNIT_REQ_S];
+      int dl = reqs[side][UNIT_REQ_DL];
+      int ls = reqs[side][UNIT_REQ_LS];
+      int ds = reqs[side][UNIT_REQ_DS];
+
+      reqs[side][UNIT_REQ_DL] += d;
+      reqs[side][UNIT_REQ_DL] += l;
+      reqs[side][UNIT_REQ_DS] += d;
+      reqs[side][UNIT_REQ_DS] += s;
+      reqs[side][UNIT_REQ_LS] += l;
+      reqs[side][UNIT_REQ_LS] += s;
+      reqs[side][UNIT_REQ_DLS] += ds + dl + ls + d + l + s;
+    }
+}
+
+/* Return the resource-constrained minimum iteration interval given the
+   data in the REQS table.  This must have been processed with
+   merge_unit_reqs already.  */
+static int
+res_mii (unit_req_table reqs)
+{
+  int side, req;
+  int worst = 1;
+  for (side = 0; side < 2; side++)
+    for (req = 0; req < UNIT_REQ_MAX; req++)
+      {
+	int factor = unit_req_factor (req);
+	worst = MAX ((reqs[side][req] + factor - 1) / factor, worst);
+      }
+
+  return worst;
+}
 \f
 /* Backend scheduling state.  */
 typedef struct c6x_sched_context
@@ -3189,13 +3354,15 @@ typedef struct c6x_sched_context
 
   /* The following variable value is the last issued insn.  */
   rtx last_scheduled_insn;
+  /* The last issued insn that isn't a shadow of another.  */
+  rtx last_scheduled_iter0;
 
   /* The following variable value is DFA state before issuing the
      first insn in the current clock cycle.  We do not use this member
      of the structure directly; we copy the data in and out of
      prev_cycle_state.  */
   state_t prev_cycle_state_ctx;
-  
+
   int reg_n_accesses[FIRST_PSEUDO_REGISTER];
   int reg_n_xaccesses[FIRST_PSEUDO_REGISTER];
   int reg_set_in_cycle[FIRST_PSEUDO_REGISTER];
@@ -3216,6 +3383,10 @@ static state_t prev_cycle_state;
    many accesses of the same register.  */
 static bool reg_access_stall;
 
+/* The highest insn uid after delayed insns were split, but before loop bodies
+   were copied by the modulo scheduling code.  */
+static int sploop_max_uid_iter0;
+
 /* Look up the jump cycle with index N.  For an out-of-bounds N, we return 0,
    so the caller does not specifically have to test for it.  */
 static int
@@ -3414,6 +3585,7 @@ static void
 init_sched_state (c6x_sched_context_t sc)
 {
   sc->last_scheduled_insn = NULL_RTX;
+  sc->last_scheduled_iter0 = NULL_RTX;
   sc->issued_this_cycle = 0;
   memset (sc->jump_cycles, 0, sizeof sc->jump_cycles);
   memset (sc->jump_cond, 0, sizeof sc->jump_cond);
@@ -3474,7 +3646,7 @@ c6x_clear_sched_context (void *_sc)
   c6x_sched_context_t sc = (c6x_sched_context_t) _sc;
   gcc_assert (_sc != NULL);
 
-  free (sc->prev_cycle_state_ctx); 
+  free (sc->prev_cycle_state_ctx);
 }
 
 /* Free _SC.  */
@@ -3709,7 +3881,7 @@ c6x_sched_reorder_1 (rtx *ready, int *pn
       bool is_asm = (icode < 0
 		     && (GET_CODE (PATTERN (insn)) == ASM_INPUT
 			 || asm_noperands (PATTERN (insn)) >= 0));
-      bool no_parallel = (is_asm
+      bool no_parallel = (is_asm || icode == CODE_FOR_sploop
 			  || (icode >= 0
 			      && get_attr_type (insn) == TYPE_ATOMIC));
 
@@ -3718,7 +3890,8 @@ c6x_sched_reorder_1 (rtx *ready, int *pn
 	 code always assumes at least 1 cycle, which may be wrong.  */
       if ((no_parallel
 	   && (ss.issued_this_cycle > 0 || clock_var < ss.delays_finished_at))
-	  || c6x_registers_update (insn))
+	  || c6x_registers_update (insn)
+	  || (ss.issued_this_cycle > 0 && icode == CODE_FOR_sploop))
 	{
 	  memmove (ready + 1, ready, (insnp - ready) * sizeof (rtx));
 	  *ready = insn;
@@ -3917,6 +4090,8 @@ c6x_variable_issue (FILE *dump ATTRIBUTE
 		    rtx insn, int can_issue_more ATTRIBUTE_UNUSED)
 {
   ss.last_scheduled_insn = insn;
+  if (INSN_UID (insn) < sploop_max_uid_iter0 && !JUMP_P (insn))
+    ss.last_scheduled_iter0 = insn;
   if (GET_CODE (PATTERN (insn)) != USE && GET_CODE (PATTERN (insn)) != CLOBBER)
     ss.issued_this_cycle++;
   if (insn_info)
@@ -4803,6 +4978,117 @@ split_delayed_branch (rtx insn)
   PATTERN (insn) = newpat;
   INSN_CODE (insn) = -1;
   record_delay_slot_pair (i1, insn, 5, 0);
+}
+
+/* If INSN is a multi-cycle insn that should be handled properly in
+   modulo-scheduling, split it into a real insn and a shadow.
+   Return true if we made a change.
+
+   It is valid for us to fail to split an insn; the caller has to deal
+   with the possibility.  Currently we handle loads and most mpy2 and
+   mpy4 insns.  */
+static bool
+split_delayed_nonbranch (rtx insn)
+{
+  int code = recog_memoized (insn);
+  enum attr_type type;
+  rtx i1, newpat, src, dest;
+  rtx pat = PATTERN (insn);
+  rtvec rtv;
+  int delay;
+
+  if (GET_CODE (pat) == COND_EXEC)
+    pat = COND_EXEC_CODE (pat);
+
+  if (code < 0 || GET_CODE (pat) != SET)
+    return false;
+  src = SET_SRC (pat);
+  dest = SET_DEST (pat);
+  if (!REG_P (dest))
+    return false;
+
+  type = get_attr_type (insn);
+  if (code >= 0
+      && (type == TYPE_LOAD
+	  || type == TYPE_LOADN))
+    {
+      if (!MEM_P (src)
+	  && (GET_CODE (src) != ZERO_EXTEND
+	      || !MEM_P (XEXP (src, 0))))
+	return false;
+
+      if (GET_MODE_SIZE (GET_MODE (dest)) > 4
+	  && (GET_MODE_SIZE (GET_MODE (dest)) != 8 || !TARGET_LDDW))
+	return false;
+
+      rtv = gen_rtvec (2, GEN_INT (REGNO (SET_DEST (pat))),
+		       SET_SRC (pat));
+      newpat = gen_load_shadow (SET_DEST (pat));
+      pat = gen_rtx_UNSPEC (VOIDmode, rtv, UNSPEC_REAL_LOAD);
+      delay = 4;
+    }
+  else if (code >= 0
+	   && (type == TYPE_MPY2
+	       || type == TYPE_MPY4))
+    {
+      /* We don't handle floating point multiplies yet.  */
+      if (GET_MODE (dest) == SFmode)
+	return false;
+
+      rtv = gen_rtvec (2, GEN_INT (REGNO (SET_DEST (pat))),
+		       SET_SRC (pat));
+      newpat = gen_mult_shadow (SET_DEST (pat));
+      pat = gen_rtx_UNSPEC (VOIDmode, rtv, UNSPEC_REAL_MULT);
+      delay = type == TYPE_MPY2 ? 1 : 3;
+    }
+  else
+    return false;
+
+  pat = duplicate_cond (pat, insn);
+  newpat = duplicate_cond (newpat, insn);
+  i1 = emit_insn_before (pat, insn);
+  PATTERN (insn) = newpat;
+  INSN_CODE (insn) = -1;
+  recog_memoized (insn);
+  recog_memoized (i1);
+  record_delay_slot_pair (i1, insn, delay, 0);
+  return true;
+}
+
+/* Examine if INSN is the result of splitting a load into a real load and a
+   shadow, and if so, undo the transformation.  */
+static void
+undo_split_delayed_nonbranch (rtx insn)
+{
+  int icode = recog_memoized (insn);
+  enum attr_type type;
+  rtx prev_pat, insn_pat, prev;
+
+  if (icode < 0)
+    return;
+  type = get_attr_type (insn);
+  if (type != TYPE_LOAD_SHADOW && type != TYPE_MULT_SHADOW)
+    return;
+  prev = PREV_INSN (insn);
+  prev_pat = PATTERN (prev);
+  insn_pat = PATTERN (insn);
+  if (GET_CODE (prev_pat) == COND_EXEC)
+    {
+      prev_pat = COND_EXEC_CODE (prev_pat);
+      insn_pat = COND_EXEC_CODE (insn_pat);
+    }
+
+  gcc_assert (GET_CODE (prev_pat) == UNSPEC
+	      && ((XINT (prev_pat, 1) == UNSPEC_REAL_LOAD
+		   && type == TYPE_LOAD_SHADOW)
+		  || (XINT (prev_pat, 1) == UNSPEC_REAL_MULT
+		      && type == TYPE_MULT_SHADOW)));
+  insn_pat = gen_rtx_SET (VOIDmode, SET_DEST (insn_pat),
+			  XVECEXP (prev_pat, 0, 1));
+  insn_pat = duplicate_cond (insn_pat, prev);
+  PATTERN (insn) = insn_pat;
+  INSN_CODE (insn) = -1;
+  delete_insn (prev);
 }
 
 /* Split every insn (i.e. jumps and calls) which can have delay slots into
@@ -4846,6 +5132,481 @@ conditionalize_after_sched (void)
       }
 }
 
+/* A callback for the hw-doloop pass.  This function examines INSN; if
+   it is a loop_end pattern we recognize, return the reg rtx for the
+   loop counter.  Otherwise, return NULL_RTX.  */
+
+static rtx
+hwloop_pattern_reg (rtx insn)
+{
+  rtx pat, reg;
+
+  if (!JUMP_P (insn) || recog_memoized (insn) != CODE_FOR_loop_end)
+    return NULL_RTX;
+
+  pat = PATTERN (insn);
+  reg = SET_DEST (XVECEXP (pat, 0, 1));
+  if (!REG_P (reg))
+    return NULL_RTX;
+  return reg;
+}
+
+/* Return the number of cycles taken by BB, as computed by scheduling,
+   including the latencies of all insns with delay slots.  IGNORE is
+   an insn we should ignore in the calculation, usually the final
+   branch.  */
+static int
+bb_earliest_end_cycle (basic_block bb, rtx ignore)
+{
+  int earliest = 0;
+  rtx insn;
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      int cycles, this_clock;
+
+      if (LABEL_P (insn) || NOTE_P (insn) || DEBUG_INSN_P (insn)
+	  || GET_CODE (PATTERN (insn)) == USE
+	  || GET_CODE (PATTERN (insn)) == CLOBBER
+	  || insn == ignore)
+	continue;
+
+      this_clock = insn_get_clock (insn);
+      cycles = get_attr_cycles (insn);
+
+      if (earliest < this_clock + cycles)
+	earliest = this_clock + cycles;
+    }
+  return earliest;
+}
+
+/* Examine the insns in BB and remove all that have a uid greater than
+   or equal to MAX_UID.  */
+static void
+filter_insns_above (basic_block bb, int max_uid)
+{
+  rtx insn, next;
+  bool prev_ti = false;
+  int prev_cycle = -1;
+
+  FOR_BB_INSNS_SAFE (bb, insn, next)
+    {
+      int this_cycle;
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+      if (insn == BB_END (bb))
+	return;
+      this_cycle = insn_get_clock (insn);
+      if (prev_ti && this_cycle == prev_cycle)
+	{
+	  gcc_assert (GET_MODE (insn) != TImode);
+	  PUT_MODE (insn, TImode);
+	}
+      prev_ti = false;
+      if (INSN_UID (insn) >= max_uid)
+	{
+	  if (GET_MODE (insn) == TImode)
+	    {
+	      prev_ti = true;
+	      prev_cycle = this_cycle;
+	    }
+	  delete_insn (insn);
+	}
+    }
+}
+
+/* A callback for the hw-doloop pass.  Called to optimize LOOP in a
+   machine-specific fashion; returns true if successful and false if
+   the hwloop_fail function should be called.  */
+
+static bool
+hwloop_optimize (hwloop_info loop)
+{
+  basic_block entry_bb, bb;
+  rtx seq, insn, prev, entry_after, end_packet;
+  rtx head_insn, tail_insn, new_insns, last_insn;
+  int loop_earliest, entry_earliest, entry_end_cycle;
+  int n_execute_packets;
+  edge entry_edge;
+  unsigned ix;
+  int max_uid_before, delayed_splits;
+  int i, sp_ii, min_ii, max_ii, max_parallel, n_insns, n_real_insns, stages;
+  rtx *orig_vec;
+  rtx *copies;
+  rtx **insn_copies;
+
+  if (!c6x_flag_modulo_sched || !c6x_flag_schedule_insns2
+      || !TARGET_INSNS_64PLUS)
+    return false;
+
+  if (loop->iter_reg_used || loop->depth > 1)
+    return false;
+  if (loop->has_call || loop->has_asm)
+    return false;
+
+  if (loop->head != loop->tail)
+    return false;
+
+  gcc_assert (loop->incoming_dest == loop->head);
+
+  entry_edge = NULL;
+  FOR_EACH_VEC_ELT (edge, loop->incoming, i, entry_edge)
+    if (entry_edge->flags & EDGE_FALLTHRU)
+      break;
+  if (entry_edge == NULL)
+    return false;
+
+  schedule_ebbs_init ();
+  schedule_ebb (BB_HEAD (loop->tail), loop->loop_end, true);
+  schedule_ebbs_finish ();
+
+  bb = loop->head;
+  loop_earliest = bb_earliest_end_cycle (bb, loop->loop_end) + 1;
+
+  max_uid_before = get_max_uid ();
+
+  /* Split all multi-cycle operations, such as loads.  For normal
+     scheduling, we only do this for branches, as the generated code
+     would otherwise not be interrupt-safe.  When using sploop, it is
+     safe and beneficial to split them.  If any multi-cycle operations
+     remain after splitting (because we don't handle them yet), we
+     cannot pipeline the loop.  */
+  delayed_splits = 0;
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (NONDEBUG_INSN_P (insn))
+	{
+	  recog_memoized (insn);
+	  if (split_delayed_nonbranch (insn))
+	    delayed_splits++;
+	  else if (INSN_CODE (insn) >= 0
+		   && get_attr_cycles (insn) > 1)
+	    goto undo_splits;
+	}
+    }
+
+  /* Count the number of insns as well as the number of real insns, and save
+     the original sequence of insns in case we must restore it later.  */
+  n_insns = n_real_insns = 0;
+  FOR_BB_INSNS (bb, insn)
+    {
+      n_insns++;
+      if (NONDEBUG_INSN_P (insn) && insn != loop->loop_end)
+	n_real_insns++;
+    }
+  orig_vec = XNEWVEC (rtx, n_insns);
+  n_insns = 0;
+  FOR_BB_INSNS (bb, insn)
+    orig_vec[n_insns++] = insn;
+
+  /* Count the unit reservations, and compute a minimum II from that
+     table.  */
+  count_unit_reqs (unit_reqs, loop->start_label,
+		   PREV_INSN (loop->loop_end));
+  merge_unit_reqs (unit_reqs);
+
+  min_ii = res_mii (unit_reqs);
+  max_ii = loop_earliest < 15 ? loop_earliest : 14;
+
+  /* Make copies of the loop body, up to a maximum number of stages we want
+     to handle.  */
+  max_parallel = loop_earliest / min_ii + 1;
+
+  copies = XCNEWVEC (rtx, (max_parallel + 1) * n_real_insns);
+  insn_copies = XNEWVEC (rtx *, max_parallel + 1);
+  for (i = 0; i < max_parallel + 1; i++)
+    insn_copies[i] = copies + i * n_real_insns;
+
+  head_insn = next_nonnote_nondebug_insn (loop->start_label);
+  tail_insn = prev_real_insn (BB_END (bb));
+
+  i = 0;
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn) && insn != loop->loop_end)
+      insn_copies[0][i++] = insn;
+
+  sploop_max_uid_iter0 = get_max_uid ();
+
+  /* Generate the copies of the loop body, and save them in the
+     INSN_COPIES array.  */
+  start_sequence ();
+  for (i = 0; i < max_parallel; i++)
+    {
+      int j;
+      rtx this_iter;
+
+      this_iter = duplicate_insn_chain (head_insn, tail_insn);
+      j = 0;
+      while (this_iter)
+	{
+	  rtx prev_stage_insn = insn_copies[i][j];
+	  gcc_assert (INSN_CODE (this_iter) == INSN_CODE (prev_stage_insn));
+
+	  if (INSN_CODE (this_iter) >= 0
+	      && (get_attr_type (this_iter) == TYPE_LOAD_SHADOW
+		  || get_attr_type (this_iter) == TYPE_MULT_SHADOW))
+	    {
+	      rtx prev = PREV_INSN (this_iter);
+	      record_delay_slot_pair (prev, this_iter,
+				      get_attr_cycles (prev) - 1, 0);
+	    }
+	  else
+	    record_delay_slot_pair (prev_stage_insn, this_iter, i, 1);
+
+	  insn_copies[i + 1][j] = this_iter;
+	  j++;
+	  this_iter = next_nonnote_nondebug_insn (this_iter);
+	}
+    }
+  new_insns = get_insns ();
+  last_insn = insn_copies[max_parallel][n_real_insns - 1];
+  end_sequence ();
+  emit_insn_before (new_insns, BB_END (bb));
+
+  /* Try to schedule the loop using varying initiation intervals,
+     starting with the smallest possible and incrementing it
+     on failure.  */
+  for (sp_ii = min_ii; sp_ii <= max_ii; sp_ii++)
+    {
+      basic_block tmp_bb;
+      if (dump_file)
+	fprintf (dump_file, "Trying to schedule for II %d\n", sp_ii);
+
+      df_clear_flags (DF_LR_RUN_DCE);
+
+      schedule_ebbs_init ();
+      set_modulo_params (sp_ii, max_parallel, n_real_insns,
+			 sploop_max_uid_iter0);
+      tmp_bb = schedule_ebb (BB_HEAD (bb), last_insn, true);
+      schedule_ebbs_finish ();
+
+      if (tmp_bb)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Found schedule with II %d\n", sp_ii);
+	  break;
+	}
+    }
+
+  discard_delay_pairs_above (max_uid_before);
+
+  if (sp_ii > max_ii)
+    goto restore_loop;
+
+  stages = insn_get_clock (ss.last_scheduled_iter0) / sp_ii + 1;
+
+  if (stages == 1 && sp_ii > 5)
+    goto restore_loop;
+
+  /* At this point, we know we've been successful, unless we find later that
+     there are too many execute packets for the loop buffer to hold.  */
+
+  /* Assign reservations to the instructions in the loop.  We must find
+     the stage that contains the full loop kernel, and transfer the
+     reservations of the instructions contained in it to the corresponding
+     instructions from iteration 0, which are the only ones we'll keep.  */
+  assign_reservations (BB_HEAD (bb), ss.last_scheduled_insn);
+  PREV_INSN (BB_END (bb)) = ss.last_scheduled_iter0;
+  NEXT_INSN (ss.last_scheduled_iter0) = BB_END (bb);
+  filter_insns_above (bb, sploop_max_uid_iter0);
+
+  for (i = 0; i < n_real_insns; i++)
+    {
+      rtx insn = insn_copies[0][i];
+      int uid = INSN_UID (insn);
+      int stage = insn_uid_get_clock (uid) / sp_ii;
+
+      if (stage + 1 < stages)
+	{
+	  int copy_uid;
+	  stage = stages - stage - 1;
+	  copy_uid = INSN_UID (insn_copies[stage][i]);
+	  INSN_INFO_ENTRY (uid).reservation
+	    = INSN_INFO_ENTRY (copy_uid).reservation;
+	}
+    }
+  if (stages == 1)
+    stages++;
+
+  /* Compute the number of execute packets the pipelined form of the loop will
+     require.  */
+  prev = NULL_RTX;
+  n_execute_packets = 0;
+  for (insn = loop->start_label; insn != loop->loop_end; insn = NEXT_INSN (insn))
+    {
+      if (NONDEBUG_INSN_P (insn) && GET_MODE (insn) == TImode
+	  && !shadow_p (insn))
+	{
+	  n_execute_packets++;
+	  if (prev && insn_get_clock (prev) + 1 != insn_get_clock (insn))
+	    /* We need an extra NOP instruction.  */
+	    n_execute_packets++;
+
+	  prev = insn;
+	}
+    }
+
+  end_packet = ss.last_scheduled_iter0;
+  while (!NONDEBUG_INSN_P (end_packet) || GET_MODE (end_packet) != TImode)
+    end_packet = PREV_INSN (end_packet);
+
+  /* The earliest cycle in which we can emit the SPKERNEL instruction.  */
+  loop_earliest = (stages - 1) * sp_ii;
+  if (loop_earliest > insn_get_clock (end_packet))
+    {
+      n_execute_packets++;
+      end_packet = loop->loop_end;
+    }
+  else
+    loop_earliest = insn_get_clock (end_packet);
+
+  if (n_execute_packets > 14)
+    goto restore_loop;
+
+  /* Generate the spkernel instruction, and place it at the appropriate
+     spot.  */
+  PUT_MODE (end_packet, VOIDmode);
+
+  insn = gen_spkernel (GEN_INT (stages - 1),
+		       const0_rtx, JUMP_LABEL (loop->loop_end));
+  insn = emit_jump_insn_before (insn, end_packet);
+  JUMP_LABEL (insn) = JUMP_LABEL (loop->loop_end);
+  insn_set_clock (insn, loop_earliest);
+  PUT_MODE (insn, TImode);
+  INSN_INFO_ENTRY (INSN_UID (insn)).ebb_start = false;
+  delete_insn (loop->loop_end);
+
+  /* Place the mvc and sploop instructions before the loop.  */
+  entry_bb = entry_edge->src;
+
+  start_sequence ();
+
+  insn = emit_insn (gen_mvilc (loop->iter_reg));
+  insn = emit_insn (gen_sploop (GEN_INT (sp_ii)));
+
+  seq = get_insns ();
+
+  if (!single_succ_p (entry_bb) || VEC_length (edge, loop->incoming) > 1)
+    {
+      basic_block new_bb;
+      edge e;
+      edge_iterator ei;
+
+      emit_insn_before (seq, BB_HEAD (loop->head));
+      seq = emit_label_before (gen_label_rtx (), seq);
+
+      new_bb = create_basic_block (seq, insn, entry_bb);
+      FOR_EACH_EDGE (e, ei, loop->incoming)
+	{
+	  if (!(e->flags & EDGE_FALLTHRU))
+	    redirect_edge_and_branch_force (e, new_bb);
+	  else
+	    redirect_edge_succ (e, new_bb);
+	}
+      make_edge (new_bb, loop->head, 0);
+    }
+  else
+    {
+      entry_after = BB_END (entry_bb);
+      while (DEBUG_INSN_P (entry_after)
+	     || (NOTE_P (entry_after)
+		 && NOTE_KIND (entry_after) != NOTE_INSN_BASIC_BLOCK))
+	entry_after = PREV_INSN (entry_after);
+      emit_insn_after (seq, entry_after);
+    }
+
+  end_sequence ();
+
+  /* Make sure we don't try to schedule this loop again.  */
+  for (ix = 0; VEC_iterate (basic_block, loop->blocks, ix, bb); ix++)
+    bb->flags |= BB_DISABLE_SCHEDULE;
+
+  return true;
+
+ restore_loop:
+  if (dump_file)
+    fprintf (dump_file, "Unable to pipeline loop.\n");
+
+  for (i = 1; i < n_insns; i++)
+    {
+      NEXT_INSN (orig_vec[i - 1]) = orig_vec[i];
+      PREV_INSN (orig_vec[i]) = orig_vec[i - 1];
+    }
+  PREV_INSN (orig_vec[0]) = PREV_INSN (BB_HEAD (bb));
+  NEXT_INSN (PREV_INSN (BB_HEAD (bb))) = orig_vec[0];
+  NEXT_INSN (orig_vec[n_insns - 1]) = NEXT_INSN (BB_END (bb));
+  PREV_INSN (NEXT_INSN (BB_END (bb))) = orig_vec[n_insns - 1];
+  BB_HEAD (bb) = orig_vec[0];
+  BB_END (bb) = orig_vec[n_insns - 1];
+ undo_splits:
+  free_delay_pairs ();
+  FOR_BB_INSNS (bb, insn)
+    if (NONDEBUG_INSN_P (insn))
+      undo_split_delayed_nonbranch (insn);
+  return false;
+}
+
+/* A callback for the hw-doloop pass.  Called when a loop we have discovered
+   turns out not to be optimizable; we have to split the doloop_end pattern
+   into a subtract and a test.  */
+static void
+hwloop_fail (hwloop_info loop)
+{
+  rtx insn, test, testreg;
+
+  if (dump_file)
+    fprintf (dump_file, "splitting doloop insn %d\n",
+	     INSN_UID (loop->loop_end));
+  insn = gen_addsi3 (loop->iter_reg, loop->iter_reg, constm1_rtx);
+  /* See if we can emit the add at the head of the loop rather than at the
+     end.  */
+  if (loop->head == NULL
+      || loop->iter_reg_used_outside
+      || loop->iter_reg_used
+      || TEST_HARD_REG_BIT (loop->regs_set_in_loop, REGNO (loop->iter_reg))
+      || loop->incoming_dest != loop->head
+      || EDGE_COUNT (loop->head->preds) != 2)
+    emit_insn_before (insn, loop->loop_end);
+  else
+    {
+      rtx t = loop->start_label;
+      while (!NOTE_P (t) || NOTE_KIND (t) != NOTE_INSN_BASIC_BLOCK)
+	t = NEXT_INSN (t);
+      emit_insn_after (insn, t);
+    }
+
+  testreg = SET_DEST (XVECEXP (PATTERN (loop->loop_end), 0, 2));
+  if (GET_CODE (testreg) == SCRATCH)
+    testreg = loop->iter_reg;
+  else
+    emit_insn_before (gen_movsi (testreg, loop->iter_reg), loop->loop_end);
+
+  test = gen_rtx_NE (VOIDmode, testreg, const0_rtx);
+  insn = emit_jump_insn_before (gen_cbranchsi4 (test, testreg, const0_rtx,
+						loop->start_label),
+				loop->loop_end);
+
+  JUMP_LABEL (insn) = loop->start_label;
+  LABEL_NUSES (loop->start_label)++;
+  delete_insn (loop->loop_end);
+}
+
+static struct hw_doloop_hooks c6x_doloop_hooks =
+{
+  hwloop_pattern_reg,
+  hwloop_optimize,
+  hwloop_fail
+};
+
+/* Run the hw-doloop pass to modulo-schedule hardware loops, or split the
+   doloop_end patterns where such optimizations are impossible.  */
+static void
+c6x_hwloops (void)
+{
+  if (optimize)
+    reorg_loops (true, &c6x_doloop_hooks);
+}
+
 /* Implement the TARGET_MACHINE_DEPENDENT_REORG pass.  We split call insns here
    into a sequence that loads the return register and performs the call,
    and emit the return label.
@@ -4874,10 +5635,17 @@ c6x_reorg (void)
       int sz = get_max_uid () * 3 / 2 + 1;
 
       insn_info = VEC_alloc (c6x_sched_insn_info, heap, sz);
+    }
 
-      /* Make sure the real-jump insns we create are not deleted.  */
-      sched_no_dce = true;
+  /* Make sure the real-jump insns we create are not deleted.  When modulo-
+     scheduling, situations where a reg is only stored in a loop can also
+     cause dead code when doing the initial unrolling.  */
+  sched_no_dce = true;
 
+  c6x_hwloops ();
+
+  if (c6x_flag_schedule_insns2)
+    {
       split_delayed_insns ();
       timevar_push (TV_SCHED2);
       if (do_selsched)
@@ -4888,8 +5656,8 @@ c6x_reorg (void)
       timevar_pop (TV_SCHED2);
 
       free_delay_pairs ();
-      sched_no_dce = false;
     }
+  sched_no_dce = false;
 
   call_labels = XCNEWVEC (rtx, get_max_uid () + 1);
 

* Re: [2/4] Make schedule_ebb callable from elsewhere
  2011-09-13 22:14 ` [2/4] Make schedule_ebb callable from elsewhere Bernd Schmidt
@ 2011-09-13 22:56   ` Steven Bosscher
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Bosscher @ 2011-09-13 22:56 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On Wed, Sep 14, 2011 at 12:01 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> This is just some code rearrangement to make it possible for c6x.c to
> call schedule_ebbs_init, schedule_ebb and schedule_ebbs_finish.
>
> +/* The one entry point in this file.  */
> +
> +void
> +schedule_ebbs (void)
> +{

Might want to update that comment.

Ciao!
Steven

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-13 22:03 [0/4] Modulo scheduling with haifa-sched for C6X Bernd Schmidt
                   ` (3 preceding siblings ...)
  2011-09-13 22:32 ` [4/4] C6X pieces for modulo scheduling with haifa-sched Bernd Schmidt
@ 2011-09-14 11:25 ` Richard Sandiford
  2011-09-14 12:01   ` Bernd Schmidt
  2011-09-27 13:05 ` Bernd Schmidt
  5 siblings, 1 reply; 22+ messages in thread
From: Richard Sandiford @ 2011-09-14 11:25 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds@codesourcery.com> writes:
> I have added support for this to haifa-sched.c. I expect the question
> "why not use SMS" to come up; there were a number of reasons why I felt
> that code is unsuitable:

Fully agree that SMS is unsuitable here FWIW, but...

> There are (or were, when I started) some glaring weaknesses in SMS, such
> as giving up when the loop contains autoincrement addresses (which is
> the case in the example above), and by the looks of it fairly poor
> memory disambiguation compared to sched-ebb with cselib.

...I didn't see from an admittedly quick read of the patch how you
handle memory disambiguation between iterations.  If a loop includes:

     lb $3,($4)
     sb $5,1($4)

then the two instructions can be reordered by normal ebb scheduling,
but the inter-iteration conflict is important for modulo scheduling.

Richard

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-14 11:25 ` [0/4] Modulo scheduling with haifa-sched for C6X Richard Sandiford
@ 2011-09-14 12:01   ` Bernd Schmidt
  2011-10-03  8:23     ` Richard Sandiford
  0 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-09-14 12:01 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

On 09/14/11 11:03, Richard Sandiford wrote:
> ...I didn't see from an admittedly quick read of the patch how you
> handle memory disambiguation between iterations.  If a loop includes:
> 
>      lb $3,($4)
>      sb $5,1($4)
> 
> then the two instructions can be reordered by normal ebb scheduling,
> but the inter-iteration conflict is important for modulo scheduling.

There's nothing special to handle, I think. sched-deps should see that
the ld in iteration 1 is DEP_ANTI against the sb in iteration 0
(assuming there's also an increment).


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-13 22:03 [0/4] Modulo scheduling with haifa-sched for C6X Bernd Schmidt
                   ` (4 preceding siblings ...)
  2011-09-14 11:25 ` [0/4] Modulo scheduling with haifa-sched for C6X Richard Sandiford
@ 2011-09-27 13:05 ` Bernd Schmidt
  2011-09-30  0:57   ` Vladimir Makarov
  5 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-09-27 13:05 UTC (permalink / raw)
  To: GCC Patches; +Cc: Vladimir N. Makarov

Ping:
http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00811.html


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-27 13:05 ` Bernd Schmidt
@ 2011-09-30  0:57   ` Vladimir Makarov
  2011-09-30  2:38     ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Vladimir Makarov @ 2011-09-30  0:57 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

On 09/27/2011 08:36 AM, Bernd Schmidt wrote:
> Ping:
> http://gcc.gnu.org/ml/gcc-patches/2011-09/msg00811.html
>
Bernd, sorry for the delay.

   I thought for a long time about this approach because we already have
the selective scheduler, which with some modifications could be used for
this.  The selective scheduler was implemented for Itanium, designed to
work after RA (although it can work before too), and it implements
general software pipelining (not modulo scheduling), which can handle
loops with conditionals and with a varying II (if we speak in terms of
modulo scheduling).  Selective scheduling is more complicated than the
haifa scheduler (although the haifa scheduler, with so many changes over
its existence, is approaching the complexity of the selective scheduler),
it might be hard to modify for your purposes, and it also has a tendency
to generate big code, which is probably not good for C6X, which is
oriented to embedded applications.

   On the other hand, your changes to the haifa scheduler are small, so I
concluded it might be ok (I hope the coming changes for register renaming,
which the selective scheduler already does, will not be big either).
Now we have the more complex selective scheduler with general software
pipelining and the simpler haifa scheduler with modulo scheduling.  So I
think we could look at the selective scheduler for server VLIW and
in-order pipelined processors, where code expansion is not so important,
and at the haifa scheduler with the modulo scheduler for embedded VLIW
processors.

   Still, I think the selective scheduler has more potential to generate
faster code than what you are trying to implement.

   As for the patch itself (only the scheduler parts in 1/4 and 2/4), it
is ok with me to commit this.  I did not find anything that should be
changed.

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-30  0:57   ` Vladimir Makarov
@ 2011-09-30  2:38     ` Bernd Schmidt
  0 siblings, 0 replies; 22+ messages in thread
From: Bernd Schmidt @ 2011-09-30  2:38 UTC (permalink / raw)
  To: Vladimir Makarov; +Cc: GCC Patches

On 09/29/11 23:32, Vladimir Makarov wrote:
> Bernd, sorry for the delay.

No problem.

>   I thought for a long time about this approach because we already have
> the selective scheduler, which with some modifications could be used for
> this.  The selective scheduler was implemented for Itanium, designed to
> work after RA (although it can work before too), and it implements
> general software pipelining (not modulo scheduling), which can handle
> loops with conditionals and with a varying II (if we speak in terms of
> modulo scheduling).

Yes, but that's somewhat orthogonal to what we need for C6X - the
hardware support is really designed for modulo scheduling. For more
complicated or bigger loops, using more general software pipelining
could be a win, but benchmark results seem to suggest that a large part
of the possible gains on C6X can be achieved without it.

The other problem with sel-sched is that, as you said, it's quite
complicated - the code is still rather opaque to me. Also, to use it on
C6X, the support for delay slots that now exists in haifa-sched would
have to be added to it as well. In general I'm not too keen on
supporting two different schedulers side-by-side and duplicating
features in each.

>   On the other hand, your changes to the haifa scheduler are small, so I
> concluded it might be ok (I hope the coming changes for register renaming,
> which the selective scheduler already does, will not be big either).

The register renaming changes will be localized to c6x.c (and a few
small changes in regrename.c) - their purpose is to ensure that the
instructions present in a loop are balanced across the two halves of the
machine. I don't think there's another target with similar enough
requirements to try to add some form of general support for this kind of
thing yet.

What I do have coming up are some more haifa-sched changes to make it
predicate insns when that allows them to be moved across jumps. Just
bootstrapping that on ia64 now...

> Now we have the more complex selective scheduler with general software
> pipelining and the simpler haifa scheduler with modulo scheduling.  So I
> think we could look at the selective scheduler for server VLIW and
> in-order pipelined processors, where code expansion is not so important,
> and at the haifa scheduler with the modulo scheduler for embedded VLIW
> processors.

Note that the C6X modulo-scheduling really relies on the exposed
pipeline to produce sensible code; you'd probably have to be more clever
with the unrolling to make this work reasonably on a different target.
Still, I think the haifa-sched code could in principle be extended to
also be interesting for other CPUs, maybe even ia64 - doesn't that have
rotating registers for modulo scheduling?

>   As for the patch itself (only the scheduler parts in 1/4 and 2/4), it
> is ok with me to commit this.  I did not find anything that should be
> changed.

Thanks!


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-09-14 12:01   ` Bernd Schmidt
@ 2011-10-03  8:23     ` Richard Sandiford
  2011-10-03 14:00       ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Sandiford @ 2011-10-03  8:23 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds@codesourcery.com> writes:
> On 09/14/11 11:03, Richard Sandiford wrote:
>> ...I didn't see from an admittedly quick read of the patch how you
>> handle memory disambiguation between iterations.  If a loop includes:
>> 
>>      lb $3,($4)
>>      sb $5,1($4)
>> 
>> then the two instructions can be reordered by normal ebb scheduling,
>> but the inter-iteration conflict is important for modulo scheduling.
>
> There's nothing special to handle, I think. sched-deps should see that
> the ld in iteration 1 is DEP_ANTI against the sb in iteration 0
> (assuming there's also an increment).

For the record, I don't agree that we should rely on register
dependencies to handle memory dependencies.  It's possible for MEMs in
different iterations to alias without there being a register dependence
between them.  It might not happen often in modulo-schedulable loops,
but it is possible...

I realise the patch has been approved though.  Like I say,
it's just for the record.

Richard



* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03  8:23     ` Richard Sandiford
@ 2011-10-03 14:00       ` Bernd Schmidt
  2011-10-03 14:22         ` Richard Sandiford
  0 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 14:00 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

On 10/03/11 10:23, Richard Sandiford wrote:
> Bernd Schmidt <bernds@codesourcery.com> writes:
>> On 09/14/11 11:03, Richard Sandiford wrote:
>>> ...I didn't see from an admittedly quick read of the patch how you
>>> handle memory disambiguation between iterations.  If a loop includes:
>>>
>>>      lb $3,($4)
>>>      sb $5,1($4)
>>>
>>> then the two instructions can be reordered by normal ebb scheduling,
>>> but the inter-iteration conflict is important for modulo scheduling.
>>
>> There's nothing special to handle, I think. sched-deps should see that
>> the ld in iteration 1 is DEP_ANTI against the sb in iteration 0
>> (assuming there's also an increment).
> 
> For the record, I don't agree that we should rely on register
> dependencies to handle memory dependencies.  It's possible for MEMs in
> different iterations to alias without there being a register dependence
> between them.

I don't know what you mean by "register dependence" here. sched-deps
analyzes MEMs for whether they depend on each other, but the term
"register dependence" suggests you aren't thinking about this.

If there was a problem, then rtl loop unrolling would also cause it
(since the modulo scheduling patch effectively does nothing else). Are
you sure there really is a problem?


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 14:00       ` Bernd Schmidt
@ 2011-10-03 14:22         ` Richard Sandiford
  2011-10-03 14:32           ` Bernd Schmidt
  2011-10-03 14:52           ` Bernd Schmidt
  0 siblings, 2 replies; 22+ messages in thread
From: Richard Sandiford @ 2011-10-03 14:22 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds_cb1@t-online.de> writes:
> On 10/03/11 10:23, Richard Sandiford wrote:
>> Bernd Schmidt <bernds@codesourcery.com> writes:
>>> On 09/14/11 11:03, Richard Sandiford wrote:
>>>> ...I didn't see from an admittedly quick read of the patch how you
>>>> handle memory disambiguation between iterations.  If a loop includes:
>>>>
>>>>      lb $3,($4)
>>>>      sb $5,1($4)
>>>>
>>>> then the two instructions can be reordered by normal ebb scheduling,
>>>> but the inter-iteration conflict is important for modulo scheduling.
>>>
>>> There's nothing special to handle, I think. sched-deps should see that
>>> the ld in iteration 1 is DEP_ANTI against the sb in iteration 0
>>> (assuming there's also an increment).
>> 
>> For the record, I don't agree that we should rely on register
>> dependencies to handle memory dependencies.  It's possible for MEMs in
>> different iterations to alias without there being a register dependence
>> between them.
>
> I don't know what you mean by "register dependence" here. sched-deps
> analyzes MEMs for whether they depend on each other, but the term
> "register dependence" suggests you aren't thinking about this.

Well, as you said, sched-deps uses more exact memory disambiguation
than SMS.  But that's for a reason: if we're scheduling a loop body
using haifa-sched, we only care about intra-iteration memory
dependencies.  But modulo scheduling allows movement between
iterations as well.

So my original point was that it looked like you were adding support
for inter-iteration scheduling while still using intra-iteration memory
dependencies.  I (probably wrongly, sorry) took your response to mean
that inter-iteration memory dependencies would be accompanied by some
sort of register dependency, so that doesn't matter.

> If there was a problem, then rtl loop unrolling would also cause it
> (since the modulo scheduling patch effectively does nothing else). Are
> you sure there really is a problem?

I'm not sure I follow.  Unrolling a loop {A, B, C, D} gives:

  A1
  B1
  C1
  D1
     A2
     B2
     C2
     D2
        A3
        B3
        C3
        D3

so inter-iteration dependencies aren't a problem.  Whereas I thought your
modulo instruction did:

  A1
  B1  A2
  C1  B2  A3
  D1  C2  B3
      D2  C3
          D3

so if D1 writes to memory that A2 (but not A1) _might_ load, then the
loop doesn't behave the same way.

Richard

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 14:22         ` Richard Sandiford
@ 2011-10-03 14:32           ` Bernd Schmidt
  2011-10-03 15:26             ` Richard Sandiford
  2011-10-03 14:52           ` Bernd Schmidt
  1 sibling, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 14:32 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches, richard.sandiford

On 10/03/11 16:21, Richard Sandiford wrote:
> so inter-iteration dependencies aren't a problem.  Whereas I thought your
> modulo instruction did:
> 
>   A1
>   B1  A2
>   C1  B2  A3
>   D1  C2  B3
>       D2  C3
>           D3
> 
> so if D1 writes to memory that A2 (but not A1) _might_ load, then the
> loop doesn't behave the same way.

But sched-deps will have found a dependence between D1 and A2 so the
schedule won't look like this.


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 14:22         ` Richard Sandiford
  2011-10-03 14:32           ` Bernd Schmidt
@ 2011-10-03 14:52           ` Bernd Schmidt
  1 sibling, 0 replies; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 14:52 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches, richard.sandiford

On 10/03/11 16:21, Richard Sandiford wrote:
> I'm not sure I follow.  Unrolling a loop {A, B, C, D} gives:
> 
>   A1
>   B1
>   C1
>   D1
>      A2
>      B2
>      C2
>      D2
>         A3
>         B3
>         C3
>         D3
> 
> so inter-iteration dependencies aren't a problem.

Expanding on the previous answer: yes, they are, if this basic block is
later scheduled by haifa-sched.  Modulo scheduling using the algorithm in
my patch is exactly equivalent to scheduling an unrolled loop, with
nothing more than the additional constraint that for any insn X,
  t(Xn+1) = t(Xn) + II

So, the above would be a valid schedule, and if there are no
inter-iteration dependencies, so would the following:

>   A1
>   B1  A2
>   C1  B2  A3
>   D1  C2  B3
>       D2  C3
>           D3
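
In code, that constraint amounts to the following check (a sketch with
hypothetical names, not code from the patch; t[n * n_insns + x] holds
the issue cycle of insn X in unrolled copy n):

     #include <stdbool.h>

     /* Check the modulo property over a schedule of n_stages unrolled
        copies of n_insns loop insns: each later copy of an insn must
        issue exactly II cycles after the copy before it.  */
     static bool
     modulo_constraint_holds (const int *t, int n_stages, int n_insns,
                              int ii)
     {
       int x, n;
       for (x = 0; x < n_insns; x++)
         for (n = 0; n + 1 < n_stages; n++)
           if (t[(n + 1) * n_insns + x] != t[n * n_insns + x] + ii)
             return false;
       return true;
     }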


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 14:32           ` Bernd Schmidt
@ 2011-10-03 15:26             ` Richard Sandiford
  2011-10-03 16:25               ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Sandiford @ 2011-10-03 15:26 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches

Bernd Schmidt <bernds_cb1@t-online.de> writes:
> On 10/03/11 16:21, Richard Sandiford wrote:
>> so inter-iteration dependencies aren't a problem.  Whereas I thought your
>> modulo instruction did:
>> 
>>   A1
>>   B1  A2
>>   C1  B2  A3
>>   D1  C2  B3
>>       D2  C3
>>           D3
>> 
>> so if D1 writes to memory that A2 (but not A1) _might_ load, then the
>> loop doesn't behave the same way.
>
> But sched-deps will have found a dependence between D1 and A2 so the
> schedule won't look like this.

OK, bad example, sorry.  So the fundamental assumption is that
if you have a loop:

Loop 1:
  A
  B
  C
  D

that you can unroll 4 times and schedule as:

Loop 2:
  A
  B A
  C B A
  D C B A
    D C B
      D C
        D

then 2 iterations of that loop:

  A
  B A
  C B A
  D C B A
    D C B
      D C
        D
  A
  B A
  C B A
  D C B A
    D C B
      D C
        D

are necessarily equivalent to:

Loop 3:
  A
  B A
  C B A
  D C B A
  A D C B
  B A D C
  C B A D
  D C B A
    D C B
      D C
        D

Is that right?  So if D from iteration N*4 of loop 1 doesn't alias A
from iteration N*4+1 of loop 1 (meaning loop 2 is correct) then it
follows that the D from any iteration M doesn't alias A from M+1
(meaning loop 3 is correct)?  I'm still not convinced that follows
for sufficiently clever alias analysis.

Reason for asking is that (AIUI) SMS used to use stronger memory
disambiguation, but had to pull back to something more conservative
for similar reasons.

Richard

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 15:26             ` Richard Sandiford
@ 2011-10-03 16:25               ` Bernd Schmidt
  2011-10-03 17:23                 ` Richard Sandiford
  0 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 16:25 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford

On 10/03/11 17:26, Richard Sandiford wrote:
> are necessarily equivalent to:
> 
> Loop 3:
>   A
>   B A
>   C B A
>   D C B A
>   A D C B
>   B A D C
>   C B A D
>   D C B A
>     D C B
>       D C
>         D

Sort of. The insns wouldn't rotate like this in a modulo-scheduled loop.

> Is that right?  So if D from iteration N*4 of loop 1 doesn't alias A
> from iteration N*4+1 of loop 1 (meaning loop 2 is correct) then it
> follows that the D from any iteration M doesn't alias A from M+1
> (meaning loop 3 is correct?  I'm still not convinced that follows
> for sufficiently clever alias analysis.

Is there any reason to believe that gcc does sufficiently clever alias
analysis? Again, think RTL loop unrolling, if you unroll 3 times

A1
B1
  A2
  B2
    A3
    B3

then you get identical insns An as well as Bn, and I don't believe
you'll ever get rtl alias analysis to answer differently for pairs
(B1/A2) and (B2/A3). So if there is a way for any of these pairs to
alias in any iteration, it must always detect the conflict.

Also consider that the code in sched-deps doesn't know which loop
iteration it is. Imagine a fully unrolled loop; insns Xn are identical
for a given X and all n. It doesn't matter where you put your
N-iteration window that you pass to sched-deps; since it doesn't know
about the initial conditions, it must act as if it could be from any
start iteration.
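
Schematically: if d(Xm, Yn) says whether the analysis reports a
dependence between insn X in copy m and insn Y in copy n, then, because
all the copies are identical RTL, d(Xm, Ym+k) = d(X0, Yk) for every m;
the answer can only be a function of the distance k.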

Put another way: unless I see a testcase that demonstrates otherwise, I
don't believe there is a problem. If there were a problem, it would be
with the alias analysis rather than the scheduling code.

> Reason for asking is that (AIUI) SMS used to use stronger memory
> disambiguation, but had to pull back to something more conservative
> for similar reasons.

Pointers? All I could find is a thread where rth seems to be of the same
opinion as me:

  http://gcc.gnu.org/ml/gcc/2004-09/msg01648.html


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 16:25               ` Bernd Schmidt
@ 2011-10-03 17:23                 ` Richard Sandiford
  2011-10-03 18:02                   ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Sandiford @ 2011-10-03 17:23 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford

Bernd Schmidt <bernds_cb1@t-online.de> writes:
>> Reason for asking is that (AIUI) SMS used to use stronger memory
>> disambiguation, but had to pull back to something more conservative
>> for similar reasons.
>
> Pointers? All I could find is a thread where rth seems to be of the same
> opinion as me:
>
>   http://gcc.gnu.org/ml/gcc/2004-09/msg01648.html

I was thinking of:

    http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00294.html

Richard

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 17:23                 ` Richard Sandiford
@ 2011-10-03 18:02                   ` Bernd Schmidt
  2011-10-03 18:12                     ` Richard Sandiford
  0 siblings, 1 reply; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 18:02 UTC (permalink / raw)
  To: Bernd Schmidt, GCC Patches, richard.sandiford, rdsandiford,
	Richard Guenther

On 10/03/11 19:23, Richard Sandiford wrote:
> Bernd Schmidt <bernds_cb1@t-online.de> writes:
>>> Reason for asking is that (AIUI) SMS used to use stronger memory
>>> disambiguation, but had to pull back to something more conservative
>>> for similar reasons.
>>
>> Pointers? All I could find is a thread where rth seems to be of the same
>> opinion as me:
>>
>>   http://gcc.gnu.org/ml/gcc/2004-09/msg01648.html
> 
> I was thinking of:
> 
>     http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00294.html

What I see in this thread is suggestions that people use
{true,anti,output}_dependence, which are exactly the ones used by
sched-deps. We know that using these is (or rather, must be) safe
because RTL loop unrolling followed by scheduling works. So, anything I
am missing here?


Bernd

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 18:02                   ` Bernd Schmidt
@ 2011-10-03 18:12                     ` Richard Sandiford
  2011-10-03 18:19                       ` Bernd Schmidt
  0 siblings, 1 reply; 22+ messages in thread
From: Richard Sandiford @ 2011-10-03 18:12 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: GCC Patches, richard.sandiford, Richard Guenther

Bernd Schmidt <bernds_cb1@t-online.de> writes:
> On 10/03/11 19:23, Richard Sandiford wrote:
>> Bernd Schmidt <bernds_cb1@t-online.de> writes:
>>>> Reason for asking is that (AIUI) SMS used to use stronger memory
>>>> disambiguation, but had to pull back to something more conservative
>>>> for similar reasons.
>>>
>>> Pointers? All I could find is a thread where rth seems to be of the same
>>> opinion as me:
>>>
>>>   http://gcc.gnu.org/ml/gcc/2004-09/msg01648.html
>> 
>> I was thinking of:
>> 
>>     http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00294.html
>
> What I see in this thread is suggestions that people use
> {true,anti,output}_dependence, which are exactly the ones used by
> sched-deps. We know that using these is (or rather, must be) safe
> because RTL loop unrolling followed by scheduling works.

But what I'm trying to say is that you're not just doing loop
unrolling followed by scheduling.  You're doing loop unrolling,
followed by scheduling, followed by an overlapping of the unrolled loop
iterations.  It just felt strange that the overlapping was being done
without any additional alias analysis.

But I admit it's a theoretical objection at best, and I'm certainly
not going to be able to come up with an example, so never mind me :-)

Richard

* Re: [0/4] Modulo scheduling with haifa-sched for C6X
  2011-10-03 18:12                     ` Richard Sandiford
@ 2011-10-03 18:19                       ` Bernd Schmidt
  0 siblings, 0 replies; 22+ messages in thread
From: Bernd Schmidt @ 2011-10-03 18:19 UTC (permalink / raw)
  To: GCC Patches, richard.sandiford, Richard Guenther, rdsandiford

On 10/03/11 20:12, Richard Sandiford wrote:
> But what I'm trying to say is that you're not just doing loop
> unrolling followed by scheduling.  You're doing loop unrolling,
> followed by scheduling, followed by an overlapping of the unrolled loop
> iterations.  It just felt strange that the overlapping was being done
> without any additional alias analysis.

I wouldn't say this is completely accurate either. If we overlap N
iterations of the loop, then we are analyzing and scheduling N
iterations together, so there isn't really additional overlap besides
the loop kernel we find.

The only assumption is that it does not matter whether you analyze
iterations (X .. X + N - 1) or iterations (Y .. Y + N - 1), since they
are indistinguishable at the RTL level. Hence, any schedule we find for
overlapping N iterations must be valid for all starting points.


Bernd
