Re: [4/4] Make SMS schedule register moves

public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: [4/4] Make SMS schedule register moves
       [not found] <OFE173D50E.19DC6A3D-ONC2257912.007D3E72-C2257912.007D4640@il.ibm.com>
@ 2011-09-22 17:38 ` Ayal Zaks
  2011-09-22 22:33   ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Ayal Zaks @ 2011-09-22 17:38 UTC (permalink / raw)
  To: Richard Sandiford, gcc-patches

> Richard Sandiford <richard.sandiford@linaro.org> wrote on 30/08/2011
> 03:29:26 PM:
>
> > From: Richard Sandiford <richard.sandiford@linaro.org>
> > To: gcc-patches@gcc.gnu.org
> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
> > Date: 30/08/2011 03:29 PM
> > Subject: [4/4] Make SMS schedule register moves
> >
> > This is the move-scheduling patch itself.  It should be fairly
> > self-explanatory.  Let me know if it isn't, and I'll try to improve
> > the commentary.
>
> >
> > One potentially controversial change is to the way we handle moves
> > in the prologue and epilogue.  The current code uses a conservative

 - conservative>>accurate? (Your approach seems to be more "conservative")

>
> > check to decide which stages need which moves.  This check copes
> > with values that are live before the loop, and emits some moves
> > that aren't actually needed.  The code also emits chains of moves
> > that can be trivially optimised through propagation.  We rely on
> > later patches to clean this up for us (and they do).
> >
> > So, rather than keep a rather complicated check here, I've simply
> > emitted the moves for all stages.  In principle, that will generate
> > a little more dead code than now in cases where chains of two moves
> > are needed, but that seems to be very rare (when moves are scheduled).

indeed, it's better to simplify code generation and rely on subsequent
cleanups. Not sure about generating more (dead?) code in principle;
could you elaborate, for the record?

>
> >
> > Richard
> >
> >
> > gcc/
> >    * modulo-sched.c (extend_node_sched_params): New function.
> >    (print_node_sched_params): Print the stage.
> >    (set_columns_for_row, schedule_reg_move): New functions.
> >    (set_columns_for_ps): Move up file and use set_columns_for_now.

typo: now>>row (unless you intend this to be temporary ;-)

>
> >    (schedule_reg_moves): Call extend_node_sched_params and
> >    schedule_reg_move.  Extend size of uses bitmap.  Return false
> >    if a move could not be scheduled.
> >    (apply_reg_moves): Don't emit moves here.
> >    (permute_partial_schedule): Handle register moves.
> >    (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.

Always emit all moves

>
> >    (generate_prolog_epilog): Update calls accordingly.
> >
> > Index: gcc/modulo-sched.c

As I'm going over the patch, let me make sure I understand what's going on:

..
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> +   The source of the move is provided by I_MUST_FOLLOW, which has
> +   already been scheduled.

The source y of an x=y move is defined by the previous iteration of
the next move y=z (or by the original y=... move->def). We schedule
moves one after the other bottom-up, starting from the original
move->def moving backwards in cycles. This way, the instruction
I_MUST_FOLLOW defining y is always already scheduled when we come to
schedule the dependent I_REG_MOVE x=y move.

..
> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
> +     (the instruction that defines move->old_reg).

So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
to true dependence; i.e. account for latency also). Why do moves,
except for the one closest to move->def (which is directly dependent
upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
about move->def at all? (Or have their cyclic lifetimes start and end
there?)

In general, there's a distinction between the "cycle" an instruction
is scheduled at (for the first iteration), which is an absolute
arbitrary integer, and the "row" it is placed in the PS, which is
between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
clearer to reason about cycles, especially if your window includes a
row twice.

In addition to the (true+anti) dependences involving I_REG_MOVE with
I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.

We could alternatively set these dependencies of I_REG_MOVE on
I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
presumably simpler to handle it directly here.

Right?
Ayal.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-09-22 17:38 ` [4/4] Make SMS schedule register moves Ayal Zaks
@ 2011-09-22 22:33   ` Richard Sandiford
  2011-09-23  7:49     ` Ayal Zaks
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2011-09-22 22:33 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: Richard Sandiford, gcc-patches

Thanks as always for the review.

Ayal Zaks <ayal.zaks@gmail.com> writes:
>> Richard Sandiford <richard.sandiford@linaro.org> wrote on 30/08/2011
>> 03:29:26 PM:
>>
>> > From: Richard Sandiford <richard.sandiford@linaro.org>
>> > To: gcc-patches@gcc.gnu.org
>> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
>> > Date: 30/08/2011 03:29 PM
>> > Subject: [4/4] Make SMS schedule register moves
>> >
>> > This is the move-scheduling patch itself.  It should be fairly
>> > self-explanatory.  Let me know if it isn't, and I'll try to improve
>> > the commentary.
>>
>> >
>> > One potentially controversial change is to the way we handle moves
>> > in the prologue and epilogue.  The current code uses a conservative
>
>
>  - conservative>>accurate? (Your approach seems to be more "conservative")

They're both conservative (and yeah, mine is more conservative :-)).
What I was trying to say is that the current code already generates
more moves than necessary.  If ddg node N is "R = foo", the current code
generates enough moves to handle definitions of R from both the prologue
copies of N _and_ the incoming edge (i.e. the value defined outside the
loop, such as an accumulator being initialised to zero).  But the code
doesn't check whether this initial definition actually exists.  If it
doesn't -- because we're dealing with an intra-loop dependency --
we generate one more move than necessary.

If we wanted to be accurate, we'd need to check whether there's
an incoming value or not.

The current code is also conservative in that the epilogue can end up
with sequences like:

    R1 = foo
    R2 = R1 (from a move)
    ... no set of R1, because we've done with that stage...
    R3 = ...R2...

where the last instruction could use R1 directly instead.
But none of this matters because later passes (including IRA)
fix it up nicely.

So this was all just a big excuse on my part along the lines of
"we already rely on this stuff being cleaned up, and it is,
so let's keep things simple".

>> > check to decide which stages need which moves.  This check copes
>> > with values that are live before the loop, and emits some moves
>> > that aren't actually needed.  The code also emits chains of moves
>> > that can be trivially optimised through propagation.  We rely on
>> > later patches to clean this up for us (and they do).
>> >
>> > So, rather than keep a rather complicated check here, I've simply
>> > emitted the moves for all stages.  In principle, that will generate
>> > a little more dead code than now in cases where chains of two moves
>> > are needed, but that seems to be very rare (when moves are scheduled).
>
> indeed, it's better to simplify code generation and rely on subsequent
> cleanups. Not sure about generating more (dead?) code in principle;
> could you elaborate, for the record?

Sorry, I wasn't very clear, but what I meant was: the patch generates
every move regardless of the stage, so it will in principle generate more
moves to unused registers than we do now.  (As above, we already generate
some such moves.)  Thoese insns will be dead, and will be later removed
as dead.  But in practice, the number of extra instructions seems to be
very low, and is usually zero.

>> >    * modulo-sched.c (extend_node_sched_params): New function.
>> >    (print_node_sched_params): Print the stage.
>> >    (set_columns_for_row, schedule_reg_move): New functions.
>> >    (set_columns_for_ps): Move up file and use set_columns_for_now.
>
>
> typo: now>>row (unless you intend this to be temporary ;-)

Doh! :-)

>> >    (schedule_reg_moves): Call extend_node_sched_params and
>> >    schedule_reg_move.  Extend size of uses bitmap.  Return false
>> >    if a move could not be scheduled.
>> >    (apply_reg_moves): Don't emit moves here.
>> >    (permute_partial_schedule): Handle register moves.
>> >    (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
>
>
> Always emit all moves

Oops, fixed.

>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>> +   The source of the move is provided by I_MUST_FOLLOW, which has
>> +   already been scheduled.
>
> The source y of an x=y move is defined by the previous iteration of
> the next move y=z (or by the original y=... move->def). We schedule
> moves one after the other bottom-up, starting from the original
> move->def moving backwards in cycles. This way, the instruction
> I_MUST_FOLLOW defining y is always already scheduled when we come to
> schedule the dependent I_REG_MOVE x=y move.

Right.

>> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
>> +     (the instruction that defines move->old_reg).
>
> So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
> next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
> overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
> to true dependence; i.e. account for latency also). Why do moves,
> except for the one closest to move->def (which is directly dependent
> upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
> about move->def at all? (Or have their cyclic lifetimes start and end
> there?)

Because the uses of new_reg belong to the same move->def based cycle.

E.g. say we have the following partial schedule, with [A]...[D] denoting
instructions:

  row 0:  empty
  row 1:  empty
  row 2:  [A] use R_2
  row 3:  [B] R_0 = ...
  row 4:  [C] use R_1
  row 5:  [D] use R_2

and with the following as-yet-unplaced moves:

  [M1] R_1 = R_0
  [M2] R_2 = R_1

The loop with scheduled moves must behave in the same way as it
would at the moment.  That is, it must behave "as if" the loop were:

LOOP 1:
  [A]  use R_2
  [M2] R_2 = R_1
  [M1] R_1 = R_0
  [B]  R_0 = ...
  [C]  use R_1
  [D]  use R_2

So (I think this is the uncontroversial bit): [M1] must be scheduled
"cyclically before" [B] and "cyclically after" [C], with the cycle
based at [B]:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            empty
    row 2:            [A]
    row 3 before [B]: empty

[M1] could therefore go in row 1.  This part is OK.

[M2] must then come "cyclically before" [M1] and "cyclically after" [A]
and [D].  If we based that cycle on [M1], we would have:

    row 1 after [M1]:  empty
    row 2:             [A]
    row 3:             [B]
    row 4:             [C]
    row 5:             [D]
    row 0:             empty
    row 1 before [M1]: empty

So it would seem that [M2] could go in row 0.  But that's incorrect,
it would lead to:

LOOP 2:
  [M2] R_2 = R_1
  [M1] R_1 = R_0
  [A]  use R_2
  [B]  R_0 = ...
  [C]  use R_1
  [D]  use R_2

and [A] now gets the wrong value of R_2.  In the first iteration of
loop 1, [A] uses the prologue-supplied value of R_2.  In loop 2 it
uses the prologue-supplied value of R_1.  I suppose it's effectively
moved up a stage.

If we keep the cycle at [B] when scheduling [M2], we have:

    row 3 after [B]:  empty
    row 4:            [C]
    row 5:            [D]
    row 0:            empty
    row 1:            [M1]
    row 2:            [A]
    row 3 before [B]: empty

We now have an empty scheduling window (successor [M1] comes
before predecessor [A]) and we reject the current ii.

> In general, there's a distinction between the "cycle" an instruction
> is scheduled at (for the first iteration), which is an absolute
> arbitrary integer, and the "row" it is placed in the PS, which is
> between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
> clearer to reason about cycles, especially if your window includes a
> row twice.
>
> In addition to the (true+anti) dependences involving I_REG_MOVE with
> I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.
>
> We could alternatively set these dependencies of I_REG_MOVE on
> I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
> presumably simpler to handle it directly here.
>
> Right?

Well, I think get_sched_window would need a fair bit of surgery to
handle both cases.  And IMO it'd be a bit of a false abstraction.
In this case, we always want things to be scheduled close to the
successor (I_MUST_FOLLOW), even if there are several predecessors
(the uses).  But if there are more predecessors than successors,
get_sched_window would normally prefer to schedule closer to the
predecessors.  So I think we'd still be treating them as two separate
cases to some extent.

As you say, doing it here is much simpler :-)

Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-09-22 22:33   ` Richard Sandiford
@ 2011-09-23  7:49     ` Ayal Zaks
  2011-09-28 16:16       ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Ayal Zaks @ 2011-09-23  7:49 UTC (permalink / raw)
  To: Ayal Zaks, Richard Sandiford, gcc-patches, rdsandiford

2011/9/23 Richard Sandiford <rdsandiford@googlemail.com>
>
> Thanks as always for the review.


Sure, always a pleasure.

>
> Ayal Zaks <ayal.zaks@gmail.com> writes:
> >> Richard Sandiford <richard.sandiford@linaro.org> wrote on 30/08/2011
> >> 03:29:26 PM:
> >>
> >> > From: Richard Sandiford <richard.sandiford@linaro.org>
> >> > To: gcc-patches@gcc.gnu.org
> >> > Cc: Ayal Zaks/Haifa/IBM@IBMIL
> >> > Date: 30/08/2011 03:29 PM
> >> > Subject: [4/4] Make SMS schedule register moves
> >> >
> >> > This is the move-scheduling patch itself.  It should be fairly
> >> > self-explanatory.  Let me know if it isn't, and I'll try to improve
> >> > the commentary.
> >>
> >> >
> >> > One potentially controversial change is to the way we handle moves
> >> > in the prologue and epilogue.  The current code uses a conservative
> >
> >
> >  - conservative>>accurate? (Your approach seems to be more "conservative")
>
> They're both conservative (and yeah, mine is more conservative :-)).
> What I was trying to say is that the current code already generates
> more moves than necessary.  If ddg node N is "R = foo", the current code
> generates enough moves to handle definitions of R from both the prologue
> copies of N _and_ the incoming edge (i.e. the value defined outside the
> loop, such as an accumulator being initialised to zero).  But the code
> doesn't check whether this initial definition actually exists.  If it
> doesn't -- because we're dealing with an intra-loop dependency --
> we generate one more move than necessary.
>
> If we wanted to be accurate, we'd need to check whether there's
> an incoming value or not.


ahh, right. Not only are we producing dead code, but such that is
using uninitialized values..
>
> The current code is also conservative in that the epilogue can end up
> with sequences like:
>
>    R1 = foo
>    R2 = R1 (from a move)
>    ... no set of R1, because we've done with that stage...
>    R3 = ...R2...
>
> where the last instruction could use R1 directly instead.
> But none of this matters because later passes (including IRA)
> fix it up nicely.
>
> So this was all just a big excuse on my part along the lines of
> "we already rely on this stuff being cleaned up, and it is,
> so let's keep things simple".


furthermore, it should be worthwhile to strengthen dce if it were not
to clean up tidily.
>
> >> > check to decide which stages need which moves.  This check copes
> >> > with values that are live before the loop, and emits some moves
> >> > that aren't actually needed.  The code also emits chains of moves
> >> > that can be trivially optimised through propagation.  We rely on
> >> > later patches to clean this up for us (and they do).
> >> >
> >> > So, rather than keep a rather complicated check here, I've simply
> >> > emitted the moves for all stages.  In principle, that will generate
> >> > a little more dead code than now in cases where chains of two moves
> >> > are needed, but that seems to be very rare (when moves are scheduled).
> >
> > indeed, it's better to simplify code generation and rely on subsequent
> > cleanups. Not sure about generating more (dead?) code in principle;
> > could you elaborate, for the record?
>
> Sorry, I wasn't very clear, but what I meant was: the patch generates
> every move regardless of the stage, so it will in principle generate more
> moves to unused registers than we do now.  (As above, we already generate
> some such moves.)  Thoese insns will be dead, and will be later removed
> as dead.  But in practice, the number of extra instructions seems to be
> very low, and is usually zero.


ok, that's clear. Cases which are not zero should potentially open
PR's against dce.
>
> >> >    * modulo-sched.c (extend_node_sched_params): New function.
> >> >    (print_node_sched_params): Print the stage.
> >> >    (set_columns_for_row, schedule_reg_move): New functions.
> >> >    (set_columns_for_ps): Move up file and use set_columns_for_now.
> >
> >
> > typo: now>>row (unless you intend this to be temporary ;-)
>
> Doh! :-)
>
> >> >    (schedule_reg_moves): Call extend_node_sched_params and
> >> >    schedule_reg_move.  Extend size of uses bitmap.  Return false
> >> >    if a move could not be scheduled.
> >> >    (apply_reg_moves): Don't emit moves here.
> >> >    (permute_partial_schedule): Handle register moves.
> >> >    (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
> >
> >
> > Always emit all moves
>
> Oops, fixed.
>
> >> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> >> +   The source of the move is provided by I_MUST_FOLLOW, which has
> >> +   already been scheduled.
> >
> > The source y of an x=y move is defined by the previous iteration of
> > the next move y=z (or by the original y=... move->def). We schedule
> > moves one after the other bottom-up, starting from the original
> > move->def moving backwards in cycles. This way, the instruction
> > I_MUST_FOLLOW defining y is always already scheduled when we come to
> > schedule the dependent I_REG_MOVE x=y move.
>
> Right.
>
> >> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
> >> +     (the instruction that defines move->old_reg).
> >
> > So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
> > next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
> > overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
> > to true dependence; i.e. account for latency also). Why do moves,
> > except for the one closest to move->def (which is directly dependent
> > upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
> > about move->def at all? (Or have their cyclic lifetimes start and end
> > there?)
>
> Because the uses of new_reg belong to the same move->def based cycle.


the "cycle" (overloaded term; rather "iteration" in this context) to
which the uses belong, is inferred from the "cycle" (absolute schedule
time) in which they are scheduled, regardless of move->def.
>
> E.g. say we have the following partial schedule, with [A]...[D] denoting
> instructions:
>
>  row 0:  empty
>  row 1:  empty
>  row 2:  [A] use R_2
>  row 3:  [B] R_0 = ...
>  row 4:  [C] use R_1
>  row 5:  [D] use R_2


then these 4 instructions span 18 cycles, corresponding to the
following schedule (left column; upto some multiple of ii offset):

cyc  3:[B]R_0=
cyc  4:
cyc  5:
cyc  6:
cyc  7:
cyc  8:
cyc  9:        [B]R_0=
cyc 10:[C]=R_1
cyc 11:
cyc 12:
cyc 13:
cyc 14:
cyc 15:                [B]R_0=
cyc 16:        [C]=R_1
cyc 17:[D]=R_2
---------------------------------------
cyc 18:
cyc 19:
cyc 20:[A]=R_2
cyc 21:                        [B]R_0=
cyc 22:                [C]=R_1
cyc 23:        [D]=R_2
---------------------------------------

>
> and with the following as-yet-unplaced moves:
>
>  [M1] R_1 = R_0
>  [M2] R_2 = R_1
>
> The loop with scheduled moves must behave in the same way as it
> would at the moment.  That is, it must behave "as if" the loop were:
>
> LOOP 1:
>  [A]  use R_2
>  [M2] R_2 = R_1
>  [M1] R_1 = R_0
>  [B]  R_0 = ...
>  [C]  use R_1
>  [D]  use R_2


agreed. This reschedules [A] two cycles earlier (to cycle 18; or
perhaps it should have been there originally?), places [M1] in cycle
8, and places [M2] in cycle 13:

cyc  3:[B]R_0=
cyc  4:
cyc  5:
cyc  6:
cyc  7:
cyc  8:[M1]R1=R0
cyc  9:        [B]R_0=
cyc 10:[C]=R_1
cyc 11:
cyc 12:
cyc 13:[M2]R2=R1
cyc 14:        [M1]R1=R0
cyc 15:                [B]R_0=
cyc 16:        [C]=R_1
cyc 17:[D]=R_2
---------------------------------------
cyc 18:[A]=R_2
cyc 19:        [M2]R2=R1
cyc 20:                [M1]R1=R0
cyc 21:                        [B]R_0=
cyc 22:                [C]=R_1
cyc 23:        [D]=R_2
---------------------------------------

>
> So (I think this is the uncontroversial bit): [M1] must be scheduled
> "cyclically before" [B] and "cyclically after" [C], with the cycle
> based at [B]:
>
>    row 3 after [B]:  empty
>    row 4:            [C]
>    row 5:            [D]
>    row 0:            empty
>    row 1:            empty
>    row 2:            [A]
>    row 3 before [B]: empty
>
> [M1] could therefore go in row 1.  This part is OK.


Here's how I see it:
[M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
of [B] which is scheduled at cycle 3, so must be scheduled after cycle
3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
must_precede [B]. This is identical to the logic behind the
sched_window of any instruction, based on its dependencies (as you've
updated not too long ago..), if we do not allow reg_moves (and
arguably, one should not allow reg_moves when scheduling
reg_moves...).

To address the potential erroneous scenario of Loop 2, suppose [A] is
scheduled as in the beginning in cycle 20, and that [M1] is scheduled
in cycle 7 (\in[4,9]). Then
[M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
the result of [M1], so must be scheduled after cycle 7+1 and before
cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.

Also note that if moves are assigned absolute cycles, it becomes clear
to which stage each belongs (just like any other instruction),
regulating their generation in prolog and epilog.

>
> [M2] must then come "cyclically before" [M1] and "cyclically after" [A]
> and [D].  If we based that cycle on [M1], we would have:
>
>    row 1 after [M1]:  empty
>    row 2:             [A]
>    row 3:             [B]
>    row 4:             [C]
>    row 5:             [D]
>    row 0:             empty
>    row 1 before [M1]: empty
>
> So it would seem that [M2] could go in row 0.  But that's incorrect,
> it would lead to:
>
> LOOP 2:
>  [M2] R_2 = R_1
>  [M1] R_1 = R_0
>  [A]  use R_2
>  [B]  R_0 = ...
>  [C]  use R_1
>  [D]  use R_2
>
> and [A] now gets the wrong value of R_2.  In the first iteration of
> loop 1, [A] uses the prologue-supplied value of R_2.  In loop 2 it
> uses the prologue-supplied value of R_1.  I suppose it's effectively
> moved up a stage.
>
> If we keep the cycle at [B] when scheduling [M2], we have:
>
>    row 3 after [B]:  empty
>    row 4:            [C]
>    row 5:            [D]
>    row 0:            empty
>    row 1:            [M1]
>    row 2:            [A]
>    row 3 before [B]: empty
>
> We now have an empty scheduling window (successor [M1] comes
> before predecessor [A]) and we reject the current ii.
>
> > In general, there's a distinction between the "cycle" an instruction
> > is scheduled at (for the first iteration), which is an absolute
> > arbitrary integer, and the "row" it is placed in the PS, which is
> > between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's
> > clearer to reason about cycles, especially if your window includes a
> > row twice.
> >
> > In addition to the (true+anti) dependences involving I_REG_MOVE with
> > I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds.
> >
> > We could alternatively set these dependencies of I_REG_MOVE on
> > I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's
> > presumably simpler to handle it directly here.
> >
> > Right?
>
> Well, I think get_sched_window would need a fair bit of surgery to
> handle both cases.  And IMO it'd be a bit of a false abstraction.
> In this case, we always want things to be scheduled close to the
> successor (I_MUST_FOLLOW), even if there are several predecessors
> (the uses).  But if there are more predecessors than successors,
> get_sched_window would normally prefer to schedule closer to the
> predecessors.  So I think we'd still be treating them as two separate
> cases to some extent.


well, it's a chain of dependences, so we should Swing through it in
one direction or the other; not sure it matters which.
>
> As you say, doing it here is much simpler :-)



yes, it makes sense to replicate the process here, given that the
moves are not in the ddg and their deps are simple; but the logic
should be the same.

Ayal.

>
> Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-09-23  7:49     ` Ayal Zaks
@ 2011-09-28 16:16       ` Richard Sandiford
  2011-10-10  1:10         ` Ayal Zaks
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2011-09-28 16:16 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches

Ayal Zaks <ayal.zaks@gmail.com> writes:
>> >> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
>> >> +     (the instruction that defines move->old_reg).
>> >
>> > So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
>> > next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
>> > overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
>> > to true dependence; i.e. account for latency also). Why do moves,
>> > except for the one closest to move->def (which is directly dependent
>> > upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
>> > about move->def at all? (Or have their cyclic lifetimes start and end
>> > there?)
>>
>> Because the uses of new_reg belong to the same move->def based cycle.
>
>
> the "cycle" (overloaded term; rather "iteration" in this context) to
> which the uses belong, is inferred from the "cycle" (absolute schedule
> time) in which they are scheduled, regardless of move->def.

Just to prove your point about "cycle" being an overloaded term: I wasn't
actually meaning it in the sense of "(loop) iteration".  I meant a "circular
window based on move->def".

>> So (I think this is the uncontroversial bit): [M1] must be scheduled
>> "cyclically before" [B] and "cyclically after" [C], with the cycle
>> based at [B]:
>>
>>    row 3 after [B]:  empty
>>    row 4:            [C]
>>    row 5:            [D]
>>    row 0:            empty
>>    row 1:            empty
>>    row 2:            [A]
>>    row 3 before [B]: empty
>>
>> [M1] could therefore go in row 1.  This part is OK.
>
>
> Here's how I see it:
> [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
> before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
> of [B] which is scheduled at cycle 3, so must be scheduled after cycle
> 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
> ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
> if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
> must_precede [B]. This is identical to the logic behind the
> sched_window of any instruction, based on its dependencies (as you've
> updated not too long ago..), if we do not allow reg_moves (and
> arguably, one should not allow reg_moves when scheduling
> reg_moves...).
>
> To address the potential erroneous scenario of Loop 2, suppose [A] is
> scheduled as in the beginning in cycle 20, and that [M1] is scheduled
> in cycle 7 (\in[4,9]). Then
> [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
> must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
> the result of [M1], so must be scheduled after cycle 7+1 and before
> cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.

I agree it's natural to schedule moves for intra-iteration dependencies
in the normal get_sched_window way.  But suppose we have a dependency:

   A --(T,N,1)--> B

that requires two moves M1 and M2.  If we think in terms of cycles
(in the SCHED_TIME sense), then this effectively becomes:

   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B

because it is now M1 that is fed by both the loop and the incoming edge.
But if there is a second dependency:

   A --(T,M,0)--> C

that also requires two moves, we end up with:

   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
                                        -->(T,M3,-1)--> B

and dependence distances of -1 feel a bit weird. :-)  Of course,
what we really have are two parallel dependencies:

   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B

   A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B

where M1' and M2' occupy the same position as M1 and M2 in the schedule,
but are one stage further along.  But we only schedule them once,
so if we take the cycle/SCHED_TIME route, we have to introduce
dependencies of distance -1.

Another reason why it seemed a little confusing to think in terms of
cycles (in the SCHED_TIME sense) was that, by this stage, we were
already thinking in terms of rows and columns to some extent:

	    /* If dest precedes src in the schedule of the kernel, then dest
	       will read before src writes and we can save one reg_copy.  */
	    if (SCHED_ROW (e->dest->cuid) == SCHED_ROW (e->src->cuid)
		&& SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
	      nreg_moves4e--;

However...

> Also note that if moves are assigned absolute cycles, it becomes clear
> to which stage each belongs (just like any other instruction),
> regulating their generation in prolog and epilog.

...I have to concede that it makes the stage calculation easier,
and that that tips the balance.  (Although again, a move can belong
to two stages simultanouesly.)

Anyway, here's an updated patch that uses cycle times.  I've also
dropped the code that tried to allow windows to start and end on
the same row (and therefore schedule "either side" of that row).
I thought it might help with some VLIWy DFAs, but it was always
going to be of minor benefit, and we don't try that elsewhere,
so let's avoid the complication.

Bootstrapped & regression-tested on powerpc64-linux-gnu with
-fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
Also retested on the ARM benchmarks.  OK to install?

Richard


gcc/
	* modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
	(SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
	(node_sched_params): Remove first_reg_move and nreg_moves.
	(ps_num_consecutive_stages, extend_node_sched_params): New functions.
	(update_node_sched_params): Move up file.
	(print_node_sched_params): Print the stage.  Don't dump info related
	to first_reg_move and nreg_moves.
	(set_columns_for_row): New function.
	(set_columns_for_ps): Move up file and use set_columns_for_row.
	(schedule_reg_move): New function.
	(schedule_reg_moves): Call extend_node_sched_params and
	schedule_reg_move.  Extend size of uses bitmap.  Initialize
	num_consecutive_stages.  Return false if a move could not be
	scheduled.
	(apply_reg_moves): Don't emit moves here.
	(permute_partial_schedule): Handle register moves.
	(duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
	to the same stage-count test as ddg nodes.
	(generate_prolog_epilog): Update calls accordingly.
	(sms_schedule): Allow move-scheduling to add a new first stage.

Index: gcc/modulo-sched.c
===================================================================
--- gcc/modulo-sched.c	2011-09-28 11:24:26.530300781 +0100
+++ gcc/modulo-sched.c	2011-09-28 15:06:29.439173959 +0100
@@ -153,6 +153,9 @@ struct ps_reg_move_info
   rtx old_reg;
   rtx new_reg;
 
+  /* The number of consecutive stages that the move occupies.  */
+  int num_consecutive_stages;
+
   /* An instruction that sets NEW_REG to the correct value.  The first
      move associated with DEF will have an rhs of OLD_REG; later moves
      use the result of the previous move.  */
@@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
-static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
 
 #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, x)
 #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
-#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
-#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
 #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
 #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
 #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
@@ -244,15 +243,6 @@ typedef struct node_sched_params
 {
   int time;	/* The absolute scheduling cycle.  */
 
-  /* The following field (first_reg_move) is the ps_insn id of the first
-     register-move instruction added to handle the modulo-variable-expansion
-     of the register defined by this node.  This register-move copies the
-     original register defined by the node.  */
-  int first_reg_move;
-
-  /* The number of register-move instructions added.  */
-  int nreg_moves;
-
   int row;    /* Holds time % ii.  */
   int stage;  /* Holds time / ii.  */
 
@@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps, 
   return ps->g->nodes[id].first_note;
 }
 
+/* Return the number of consecutive stages that are occupied by
+   partial schedule instruction ID in PS.  */
+static int
+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+  if (id < ps->g->num_nodes)
+    return 1;
+  else
+    return ps_reg_move (ps, id)->num_consecutive_stages;
+}
+
 /* Given HEAD and TAIL which are the first and last insns in a loop;
    return the register which controls the loop.  Return zero if it has
    more than one occurrence in the loop besides the control part or the
@@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
 			 node_sched_param_vec, g->num_nodes);
 }
 
+/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
+static void
+extend_node_sched_params (partial_schedule_ptr ps)
+{
+  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
+			 ps->g->num_nodes + VEC_length (ps_reg_move_info,
+							ps->reg_moves));
+}
+
+/* Update the sched_params (time, row and stage) for node U using the II,
+   the CYCLE of U and MIN_CYCLE.
+   We're not simply taking the following
+   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
+   because the stages may not be aligned on cycle 0.  */
+static void
+update_node_sched_params (int u, int ii, int cycle, int min_cycle)
+{
+  int sc_until_cycle_zero;
+  int stage;
+
+  SCHED_TIME (u) = cycle;
+  SCHED_ROW (u) = SMODULO (cycle, ii);
+
+  /* The calculation of stage count is done adding the number
+     of stages before cycle zero and after cycle zero.  */
+  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
+
+  if (SCHED_TIME (u) < 0)
+    {
+      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
+    }
+  else
+    {
+      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
+    }
+}
+
 static void
 print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
 {
@@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
   for (i = 0; i < num_nodes; i++)
     {
       node_sched_params_ptr nsp = SCHED_PARAMS (i);
-      int j;
 
       fprintf (file, "Node = %d; INSN = %d\n", i,
 	       INSN_UID (ps_rtl_insn (ps, i)));
       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
       fprintf (file, " time = %d:\n", nsp->time);
-      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
-      for (j = 0; j < nsp->nreg_moves; j++)
-	{
-	  ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
+      fprintf (file, " stage = %d:\n", nsp->stage);
+    }
+}
+
+/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
+static void
+set_columns_for_row (partial_schedule_ptr ps, int row)
+{
+  ps_insn_ptr cur_insn;
+  int column;
+
+  column = 0;
+  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
+    SCHED_COLUMN (cur_insn->id) = column++;
+}
+
+/* Set SCHED_COLUMN for each instruction in PS.  */
+static void
+set_columns_for_ps (partial_schedule_ptr ps)
+{
+  int row;
+
+  for (row = 0; row < ps->ii; row++)
+    set_columns_for_row (ps, row);
+}
+
+/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
+   Its predecessors and successors have already been scheduled.
+
+   The move is part of a chain that satisfies register dependencies
+   between a producing ddg node and various consuming ddg nodes.
+   If some of these dependencies cross a loop iteration (that is,
+   have a distance of 1) then DISTANCE1_USES is nonnull and contains
+   the set of uses with distance-1 dependencies.  DISTANCE1_USES
+   is null otherwise.
+
+   MUST_FOLLOW is a scratch bitmap that is big enough to hold
+   all current ps_insn ids.
+
+   Return true on success.  */
+static bool
+schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
+		   sbitmap distance1_uses, sbitmap must_follow)
+{
+  unsigned int u;
+  int this_time, this_distance, this_start, this_end, this_latency;
+  int start, end, c, ii;
+  sbitmap_iterator sbi;
+  ps_reg_move_info *move;
+  rtx this_insn;
+  ps_insn_ptr psi;
+
+  move = ps_reg_move (ps, i_reg_move);
+  ii = ps->ii;
+  if (dump_file)
+    {
+      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
+	       ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
+	       PS_MIN_CYCLE (ps));
+      print_rtl_single (dump_file, move->insn);
+      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
+      fprintf (dump_file, "=========== =========== =====\n");
+    }
+
+  start = INT_MIN;
+  end = INT_MAX;
+
+  /* For dependencies of distance 1 between a producer ddg node A
+     and consumer ddg node B, we have a chain of dependencies:
+
+        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
+
+     where Mi is the ith move.  For dependencies of distance 0 between
+     a producer ddg node A and consumer ddg node C, we have a chain of
+     dependencies:
+
+        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
+
+     where Mi' occupies the same position as Mi but occurs a stage later.
+     We can only schedule each move once, so if we have both types of
+     chain, we model the second as:
+
+        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
+
+     First handle the dependencies between the previously-scheduled
+     predecessor and the move.  */
+  this_insn = ps_rtl_insn (ps, move->def);
+  this_latency = insn_latency (this_insn, move->insn);
+  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
+  this_time = SCHED_TIME (move->def) - this_distance * ii;
+  this_start = this_time + this_latency;
+  this_end = this_time + ii;
+  if (dump_file)
+    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+	     this_start, this_end, SCHED_TIME (move->def),
+	     INSN_UID (this_insn), this_latency, this_distance,
+	     INSN_UID (move->insn));
+
+  if (start < this_start)
+    start = this_start;
+  if (end > this_end)
+    end = this_end;
+
+  /* Handle the dependencies between the move and previously-scheduled
+     successors.  */
+  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
+    {
+      this_insn = ps_rtl_insn (ps, u);
+      this_latency = insn_latency (move->insn, this_insn);
+      if (distance1_uses && !TEST_BIT (distance1_uses, u))
+	this_distance = -1;
+      else
+	this_distance = 0;
+      this_time = SCHED_TIME (u) + this_distance * ii;
+      this_start = this_time - ps->ii;
+      this_end = this_time - this_latency;
+      if (dump_file)
+	fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+		 this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
+		 this_latency, this_distance, INSN_UID (this_insn));
+
+      if (start < this_start)
+	start = this_start;
+      if (end > this_end)
+	end = this_end;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "----------- ----------- -----\n");
+      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
+    }
 
-	  fprintf (file, " reg_move = ");
-	  print_rtl_single (file, move->insn);
+  sbitmap_zero (must_follow);
+  SET_BIT (must_follow, move->def);
+
+  start = MAX (start, end - (ii - 1));
+  for (c = end; c >= start; c--)
+    {
+      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
+					 move->uses, must_follow);
+      if (psi)
+	{
+	  update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
+	  if (dump_file)
+	    fprintf (dump_file, "\nScheduled register move INSN %d at"
+		     " time %d, row %d\n\n", INSN_UID (move->insn), c,
+		     SCHED_ROW (i_reg_move));
+	  return true;
 	}
     }
+
+  if (dump_file)
+    fprintf (dump_file, "\nNo available slot\n\n");
+
+  return false;
 }
 
 /*
@@ -508,9 +694,13 @@ schedule_reg_moves (partial_schedule_ptr
       int nreg_moves = 0, i_reg_move;
       rtx prev_reg, old_reg;
       int first_move;
+      int distances[2];
+      sbitmap must_follow;
+      sbitmap distance1_uses;
 
       /* Compute the number of reg_moves needed for u, by looking at life
 	 ranges started at u (excluding self-loops).  */
+      distances[0] = distances[1] = false;
       for (e = u->out; e; e = e->next_out)
 	if (e->type == TRUE_DEP && e->dest != e->src)
 	  {
@@ -527,6 +717,11 @@ schedule_reg_moves (partial_schedule_ptr
 		&& SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
 	      nreg_moves4e--;
 
+	    if (nreg_moves4e)
+	      {
+		gcc_assert (e->distance < 2);
+		distances[e->distance] = true;
+	      }
 	    nreg_moves = MAX (nreg_moves, nreg_moves4e);
 	  }
 
@@ -537,11 +732,10 @@ schedule_reg_moves (partial_schedule_ptr
       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
 			     first_move + nreg_moves);
+      extend_node_sched_params (ps);
 
       /* Record the moves associated with this node.  */
       first_move += ps->g->num_nodes;
-      SCHED_FIRST_REG_MOVE (i) = first_move;
-      SCHED_NREG_MOVES (i) = nreg_moves;
 
       /* Generate each move.  */
       old_reg = prev_reg = SET_DEST (single_set (u->insn));
@@ -550,15 +744,18 @@ schedule_reg_moves (partial_schedule_ptr
 	  ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
 
 	  move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
-	  move->uses = sbitmap_alloc (g->num_nodes);
+	  move->uses = sbitmap_alloc (first_move + nreg_moves);
 	  move->old_reg = old_reg;
 	  move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
+	  move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
 	  move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
 	  sbitmap_zero (move->uses);
 
 	  prev_reg = move->new_reg;
 	}
 
+      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
+
       /* Every use of the register defined by node may require a different
 	 copy of this register, depending on the time the use is scheduled.
 	 Record which uses require which move results.  */
@@ -582,8 +779,21 @@ schedule_reg_moves (partial_schedule_ptr
 
 		move = ps_reg_move (ps, first_move + dest_copy - 1);
 		SET_BIT (move->uses, e->dest->cuid);
+		if (e->distance == 1)
+		  SET_BIT (distance1_uses, e->dest->cuid);
 	      }
 	  }
+
+      must_follow = sbitmap_alloc (first_move + nreg_moves);
+      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
+	if (!schedule_reg_move (ps, first_move + i_reg_move,
+				distance1_uses, must_follow))
+	  break;
+      sbitmap_free (must_follow);
+      if (distance1_uses)
+	sbitmap_free (distance1_uses);
+      if (i_reg_move < nreg_moves)
+	return false;
     }
   return true;
 }
@@ -607,39 +817,6 @@ apply_reg_moves (partial_schedule_ptr ps
 	  df_insn_rescan (ps->g->nodes[i_use].insn);
 	}
     }
-
-  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
-    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
-}
-
-/* Update the sched_params (time, row and stage) for node U using the II,
-   the CYCLE of U and MIN_CYCLE.  
-   We're not simply taking the following
-   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
-   because the stages may not be aligned on cycle 0.  */
-static void
-update_node_sched_params (int u, int ii, int cycle, int min_cycle)
-{
-  int sc_until_cycle_zero;
-  int stage;
-
-  SCHED_TIME (u) = cycle;
-  SCHED_ROW (u) = SMODULO (cycle, ii);
-
-  /* The calculation of stage count is done adding the number
-     of stages before cycle zero and after cycle zero.  */
-  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
-
-  if (SCHED_TIME (u) < 0)
-    {
-      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
-    }
-  else
-    {
-      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
-    }
 }
 
 /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
@@ -680,22 +857,6 @@ reset_sched_times (partial_schedule_ptr 
       }
 }
  
-/* Set SCHED_COLUMN of each node according to its position in PS.  */
-static void
-set_columns_for_ps (partial_schedule_ptr ps)
-{
-  int row;
-
-  for (row = 0; row < ps->ii; row++)
-    {
-      ps_insn_ptr cur_insn = ps->rows[row];
-      int column = 0;
-
-      for (; cur_insn; cur_insn = cur_insn->next_in_row)
-	SCHED_COLUMN (cur_insn->id) = column++;
-    }
-}
-
 /* Permute the insns according to their order in PS, from row 0 to
    row ii-1, and position them right before LAST.  This schedules
    the insns of the loop kernel.  */
@@ -712,8 +873,13 @@ permute_partial_schedule (partial_schedu
 	rtx insn = ps_rtl_insn (ps, ps_ij->id);
 
 	if (PREV_INSN (last) != insn)
-	  reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
-			      PREV_INSN (last));
+	  {
+	    if (ps_ij->id < ps->g->num_nodes)
+	      reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
+				  PREV_INSN (last));
+	    else
+	      add_insn_before (insn, last, NULL);
+	  }
       }
 }
 
@@ -916,7 +1082,7 @@ optimize_sc (partial_schedule_ptr ps, dd
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, rtx count_reg)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -925,7 +1091,7 @@ duplicate_insns_of_cycles (partial_sched
     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
       {
 	int u = ps_ij->id;
-	int j, i_reg_moves, i_reg_move;
+	int first_u, last_u;
 	rtx u_insn;
 
         /* Do not duplicate any insn which refers to count_reg as it
@@ -939,42 +1105,15 @@ duplicate_insns_of_cycles (partial_sched
             || JUMP_P (u_insn))
           continue;
 
-	if (for_prolog)
+	first_u = SCHED_STAGE (u);
+	last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+	if (from_stage <= last_u && to_stage >= first_u)
 	  {
-	    /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
-	       number of reg_moves starting with the second occurrence of
-	       u, which is generated if its SCHED_STAGE <= to_stage.  */
-	    i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *first* reg_move backwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
+	    if (u < ps->g->num_nodes)
+	      duplicate_insn_chain (ps_first_note (ps, u), u_insn);
+	    else
+	      emit_insn (copy_rtx (PATTERN (u_insn)));
 	  }
-	else /* It's for the epilog.  */
-	  {
-	    /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
-	       starting to decrease one stage after u no longer occurs;
-	       that is, generate all reg_moves until
-	       SCHED_STAGE (u) == from_stage - 1.  */
-	    i_reg_moves = (SCHED_NREG_MOVES (u)
-			   - (from_stage - SCHED_STAGE (u) - 1));
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *last* reg_move forwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
-	  }
-
-	for (j = 0; j < i_reg_moves; j++)
-	  {
-	    ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
-
-	    emit_insn (copy_rtx (PATTERN (move->insn)));
-	  }
-	if (SCHED_STAGE (u) >= from_stage
-	    && SCHED_STAGE (u) <= to_stage)
-	  duplicate_insn_chain (ps_first_note (ps, u), u_insn);
       }
 }
 
@@ -1008,7 +1147,7 @@ generate_prolog_epilog (partial_schedule
     }
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1020,7 +1159,7 @@ generate_prolog_epilog (partial_schedule
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1361,8 +1500,7 @@ sms_schedule (void)
     {
       rtx head, tail;
       rtx count_reg, count_init;
-      int mii, rec_mii;
-      unsigned stage_count;
+      int mii, rec_mii, stage_count, min_cycle;
       HOST_WIDEST_INT loop_count = 0;
       bool opt_sc_p;
 
@@ -1472,13 +1610,12 @@ sms_schedule (void)
 		}
 
 	      gcc_assert (stage_count >= 1);
-	      PS_STAGE_COUNT (ps) = stage_count;
 	    }
 
 	  /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
-	  if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
+	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
 	      || (count_init && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
@@ -1506,6 +1643,7 @@ sms_schedule (void)
 	  
 	  set_columns_for_ps (ps);
 
+	  min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
 	  if (!schedule_reg_moves (ps))
 	    {
 	      mii = ps->ii + 1;
@@ -1513,6 +1651,19 @@ sms_schedule (void)
 	      continue;
 	    }
 
+	  /* Moves that handle incoming values might have been added
+	     to a new first stage.  It's too late to try any kind of
+	     rotation, so just bump the stage count.  */
+	  if (PS_MIN_CYCLE (ps) < min_cycle)
+	    {
+	      reset_sched_times (ps, 0);
+	      stage_count++;
+	    }
+
+	  /* The stage count should now be correct without rotation.  */
+	  gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
+	  PS_STAGE_COUNT (ps) = stage_count;
+
 	  canon_loop (loop);
 
           if (dump_file)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-09-28 16:16       ` Richard Sandiford
@ 2011-10-10  1:10         ` Ayal Zaks
  2011-10-10 12:17           ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Ayal Zaks @ 2011-10-10  1:10 UTC (permalink / raw)
  To: Ayal Zaks, gcc-patches, richard.sandiford

On Wed, Sep 28, 2011 at 4:49 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Ayal Zaks <ayal.zaks@gmail.com> writes:
>>> >> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
>>> >> +     (the instruction that defines move->old_reg).
>>> >
>>> > So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the
>>> > next I_MUST_FOLLOW move/original-def (due to anti-dependence: it
>>> > overwrites reg), but after the previous instance of I_MUST_FOLLOW (due
>>> > to true dependence; i.e. account for latency also). Why do moves,
>>> > except for the one closest to move->def (which is directly dependent
>>> > upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry
>>> > about move->def at all? (Or have their cyclic lifetimes start and end
>>> > there?)
>>>
>>> Because the uses of new_reg belong to the same move->def based cycle.
>>
>>
>> the "cycle" (overloaded term; rather "iteration" in this context) to
>> which the uses belong, is inferred from the "cycle" (absolute schedule
>> time) in which they are scheduled, regardless of move->def.
>
> Just to prove your point about "cycle" being an overloaded term: I wasn't
> actually meaning it in the sense of "(loop) iteration".  I meant a "circular
> window based on move->def".
>

Point proven ;-)

>>> So (I think this is the uncontroversial bit): [M1] must be scheduled
>>> "cyclically before" [B] and "cyclically after" [C], with the cycle
>>> based at [B]:
>>>
>>>    row 3 after [B]:  empty
>>>    row 4:            [C]
>>>    row 5:            [D]
>>>    row 0:            empty
>>>    row 1:            empty
>>>    row 2:            [A]
>>>    row 3 before [B]: empty
>>>
>>> [M1] could therefore go in row 1.  This part is OK.
>>
>>
>> Here's how I see it:
>> [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled
>> before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result
>> of [B] which is scheduled at cycle 3, so must be scheduled after cycle
>> 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and
>> ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9];
>> if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it
>> must_precede [B]. This is identical to the logic behind the
>> sched_window of any instruction, based on its dependencies (as you've
>> updated not too long ago..), if we do not allow reg_moves (and
>> arguably, one should not allow reg_moves when scheduling
>> reg_moves...).
>>
>> To address the potential erroneous scenario of Loop 2, suppose [A] is
>> scheduled as in the beginning in cycle 20, and that [M1] is scheduled
>> in cycle 7 (\in[4,9]). Then
>> [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it
>> must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses
>> the result of [M1], so must be scheduled after cycle 7+1 and before
>> cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset.
>
> I agree it's natural to schedule moves for intra-iteration dependencies
> in the normal get_sched_window way.  But suppose we have a dependency:
>
>   A --(T,N,1)--> B
>
> that requires two moves M1 and M2.  If we think in terms of cycles
> (in the SCHED_TIME sense), then this effectively becomes:
>
>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>
> because it is now M1 that is fed by both the loop and the incoming edge.
> But if there is a second dependency:
>
>   A --(T,M,0)--> C
>
> that also requires two moves, we end up with:
>
>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>                                        -->(T,M3,-1)--> B
>
> and dependence distances of -1 feel a bit weird. :-)  Of course,
> what we really have are two parallel dependencies:
>
>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>
>   A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>
> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
> but are one stage further along.  But we only schedule them once,
> so if we take the cycle/SCHED_TIME route, we have to introduce
> dependencies of distance -1.
>

Interesting; had to digest this distance 1 business, a result of
thinking in cycles instead of rows (or conversely), and mixing
dependences with scheduling; here's my understanding, based on your
explanations:

Suppose a Use is truely dependent on a Def, where both have been
scheduled at some absolute cycles; think of them as timing the first
iteration of the loop.
Assume first that Use appears originally after Def in the original
instruction sequence of the loop (dependence distance 0). In this
case, Use requires register moves if its distance D from Def according
to the schedule is more than ii cycles long -- by the time Use is
executed, the value it needs is no longer available in the def'd
register due to intervening occurrences of Def. So in this case, the
first reg-move (among D/ii) should appear after Def, recording its
value before the next occurrence of Def overwrites it, and feeding
subsequent moves as needed before each is overwritten. Thus the
scheduling window of this first reg-move is within (Def, Def+ii).

Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
it remains scheduled before the Def, no reg-move is needed. If, otoh,
Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
between them, (D/ii + 1) reg-moves are needed in order to feed Use
with the live-in value before Def. So in this case, the first reg-move
should appear before Def (again feeding subsequent moves as needed
before each is overwritten). Thus the scheduling window of this first
reg-move is within (Def-ii, Def).

In any case, each use requires a specific number of reg-moves, which
begin either before or after the def; it is always safe (though
potentially redundant) to start reg-moving before the def -- uses that
do not need the live-in value will ignore it by feeding from a later
reg-move.

Q: if we generate reg-moves starting from before the def, would
redundant reg-moves be eliminated if no use needs access to live-in
value? If so, would that simplify their generation? (If not, it may be
interesting to understand how to improve DCE to catch it.)

The issue of assigning stages to reg-moves is mostly relevant for
prolog and epilog generation, which requires and receives special
attention -- handled very nicely by ps_num_consecutive_stages! Note
that currently a simple boolean indicator for (the exceptional case
of) double stages would suffice, instead of generalizing to arbitrary
nums of consecutive stages (see any potential use for them?).

> Another reason why it seemed a little confusing to think in terms of
> cycles (in the SCHED_TIME sense) was that, by this stage, we were
> already thinking in terms of rows and columns to some extent:
>
>            /* If dest precedes src in the schedule of the kernel, then dest
>               will read before src writes and we can save one reg_copy.  */
>            if (SCHED_ROW (e->dest->cuid) == SCHED_ROW (e->src->cuid)
>                && SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
>              nreg_moves4e--;
>

right, we have do deal with cycles that end up projected to same row;
not sure that forgetting about cycles would reduce the confusion
though...

> However...
>
>> Also note that if moves are assigned absolute cycles, it becomes clear
>> to which stage each belongs (just like any other instruction),
>> regulating their generation in prolog and epilog.
>
> ...I have to concede that it makes the stage calculation easier,
> and that that tips the balance.  (Although again, a move can belong
> to two stages simultanouesly.)
>
> Anyway, here's an updated patch that uses cycle times.  I've also
> dropped the code that tried to allow windows to start and end on
> the same row (and therefore schedule "either side" of that row).
> I thought it might help with some VLIWy DFAs, but it was always
> going to be of minor benefit, and we don't try that elsewhere,
> so let's avoid the complication.
>

ok. Such windows seem rare, as they imply zero move (or def->move)
latencies, iinm. Leaving a note behind would be good.

> Bootstrapped & regression-tested on powerpc64-linux-gnu with
> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
> Also retested on the ARM benchmarks.  OK to install?
>

Yes, OK.
Some points to consider raised above, and some comments added below.
Thanks,
Ayal.


> Richard
>
>
> gcc/
>        * modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
>        (SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
>        (node_sched_params): Remove first_reg_move and nreg_moves.
>        (ps_num_consecutive_stages, extend_node_sched_params): New functions.
>        (update_node_sched_params): Move up file.
>        (print_node_sched_params): Print the stage.  Don't dump info related
>        to first_reg_move and nreg_moves.
>        (set_columns_for_row): New function.
>        (set_columns_for_ps): Move up file and use set_columns_for_row.
>        (schedule_reg_move): New function.
>        (schedule_reg_moves): Call extend_node_sched_params and
>        schedule_reg_move.  Extend size of uses bitmap.  Initialize
>        num_consecutive_stages.  Return false if a move could not be
>        scheduled.
>        (apply_reg_moves): Don't emit moves here.
>        (permute_partial_schedule): Handle register moves.
>        (duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
>        to the same stage-count test as ddg nodes.
>        (generate_prolog_epilog): Update calls accordingly.
>        (sms_schedule): Allow move-scheduling to add a new first stage.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c  2011-09-28 11:24:26.530300781 +0100
> +++ gcc/modulo-sched.c  2011-09-28 15:06:29.439173959 +0100
> @@ -153,6 +153,9 @@ struct ps_reg_move_info
>   rtx old_reg;
>   rtx new_reg;
>
> +  /* The number of consecutive stages that the move occupies.  */
> +  int num_consecutive_stages;
> +
>   /* An instruction that sets NEW_REG to the correct value.  The first
>      move associated with DEF will have an rhs of OLD_REG; later moves
>      use the result of the previous move.  */
> @@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
>  static void permute_partial_schedule (partial_schedule_ptr, rtx);
>  static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
>                                     rtx, rtx);
> -static void duplicate_insns_of_cycles (partial_schedule_ptr,
> -                                      int, int, int, rtx);
>  static int calculate_stage_count (partial_schedule_ptr, int);
>  static void calculate_must_precede_follow (ddg_node_ptr, int, int,
>                                           int, int, sbitmap, sbitmap, sbitmap);
> @@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
>
>  #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, x)
>  #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
> -#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
> -#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
>  #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
>  #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
>  #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
> @@ -244,15 +243,6 @@ typedef struct node_sched_params
>  {
>   int time;    /* The absolute scheduling cycle.  */
>
> -  /* The following field (first_reg_move) is the ps_insn id of the first
> -     register-move instruction added to handle the modulo-variable-expansion
> -     of the register defined by this node.  This register-move copies the
> -     original register defined by the node.  */
> -  int first_reg_move;
> -
> -  /* The number of register-move instructions added.  */
> -  int nreg_moves;
> -
>   int row;    /* Holds time % ii.  */
>   int stage;  /* Holds time / ii.  */
>
> @@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps,
>   return ps->g->nodes[id].first_note;
>  }
>
> +/* Return the number of consecutive stages that are occupied by
> +   partial schedule instruction ID in PS.  */
> +static int
> +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
> +{
> +  if (id < ps->g->num_nodes)
> +    return 1;
> +  else
> +    return ps_reg_move (ps, id)->num_consecutive_stages;
> +}
> +
>  /* Given HEAD and TAIL which are the first and last insns in a loop;
>    return the register which controls the loop.  Return zero if it has
>    more than one occurrence in the loop besides the control part or the
> @@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
>                         node_sched_param_vec, g->num_nodes);
>  }
>
> +/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
> +static void
> +extend_node_sched_params (partial_schedule_ptr ps)
> +{
> +  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
> +                        ps->g->num_nodes + VEC_length (ps_reg_move_info,
> +                                                       ps->reg_moves));
> +}
> +
> +/* Update the sched_params (time, row and stage) for node U using the II,
> +   the CYCLE of U and MIN_CYCLE.
> +   We're not simply taking the following
> +   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> +   because the stages may not be aligned on cycle 0.  */
> +static void
> +update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> +{
> +  int sc_until_cycle_zero;
> +  int stage;
> +
> +  SCHED_TIME (u) = cycle;
> +  SCHED_ROW (u) = SMODULO (cycle, ii);
> +
> +  /* The calculation of stage count is done adding the number
> +     of stages before cycle zero and after cycle zero.  */
> +  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> +
> +  if (SCHED_TIME (u) < 0)
> +    {
> +      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> +    }
> +  else
> +    {
> +      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> +    }
> +}
> +
>  static void
>  print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
>  {
> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>   for (i = 0; i < num_nodes; i++)
>     {
>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
> -      int j;
>
>       fprintf (file, "Node = %d; INSN = %d\n", i,
>               INSN_UID (ps_rtl_insn (ps, i)));
>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>       fprintf (file, " time = %d:\n", nsp->time);
> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> -      for (j = 0; j < nsp->nreg_moves; j++)
> -       {
> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);

Is another way provided to print a summary of the regmoves? Or does
the info dumped while scheduling each one suffice for practical
tracing and debugging.

> +      fprintf (file, " stage = %d:\n", nsp->stage);
> +    }
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
> +static void
> +set_columns_for_row (partial_schedule_ptr ps, int row)
> +{
> +  ps_insn_ptr cur_insn;
> +  int column;
> +
> +  column = 0;
> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
> +    SCHED_COLUMN (cur_insn->id) = column++;
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in PS.  */
> +static void
> +set_columns_for_ps (partial_schedule_ptr ps)
> +{
> +  int row;
> +
> +  for (row = 0; row < ps->ii; row++)
> +    set_columns_for_row (ps, row);
> +}
> +
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> +   Its predecessors and successors have already been scheduled.

well, except for some (preceeding or succeeding?) moves that have not
yet been scheduled, right?

> +
> +   The move is part of a chain that satisfies register dependencies
> +   between a producing ddg node and various consuming ddg nodes.
> +   If some of these dependencies cross a loop iteration (that is,
> +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
> +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
> +   is null otherwise.
> +

Maybe clarify that they are upwards-exposed or live-in uses.

> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
> +   all current ps_insn ids.
> +
> +   Return true on success.  */
> +static bool
> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
> +                  sbitmap distance1_uses, sbitmap must_follow)
> +{
> +  unsigned int u;
> +  int this_time, this_distance, this_start, this_end, this_latency;
> +  int start, end, c, ii;
> +  sbitmap_iterator sbi;
> +  ps_reg_move_info *move;
> +  rtx this_insn;
> +  ps_insn_ptr psi;
> +
> +  move = ps_reg_move (ps, i_reg_move);
> +  ii = ps->ii;
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
> +              PS_MIN_CYCLE (ps));
> +      print_rtl_single (dump_file, move->insn);
> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
> +      fprintf (dump_file, "=========== =========== =====\n");
> +    }
> +
> +  start = INT_MIN;
> +  end = INT_MAX;
> +
> +  /* For dependencies of distance 1 between a producer ddg node A
> +     and consumer ddg node B, we have a chain of dependencies:
> +
> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
> +
> +     where Mi is the ith move.  For dependencies of distance 0 between
> +     a producer ddg node A and consumer ddg node C, we have a chain of
> +     dependencies:
> +
> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
> +
> +     where Mi' occupies the same position as Mi but occurs a stage later.
> +     We can only schedule each move once, so if we have both types of
> +     chain, we model the second as:
> +
> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
> +
> +     First handle the dependencies between the previously-scheduled
> +     predecessor and the move.  */
> +  this_insn = ps_rtl_insn (ps, move->def);
> +  this_latency = insn_latency (this_insn, move->insn);
> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
> +  this_start = this_time + this_latency;
> +  this_end = this_time + ii;
> +  if (dump_file)
> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +            this_start, this_end, SCHED_TIME (move->def),
> +            INSN_UID (this_insn), this_latency, this_distance,
> +            INSN_UID (move->insn));
> +
> +  if (start < this_start)

redundant; start=INT_MIN is surely < this_start.

> +    start = this_start;
> +  if (end > this_end)

redundant; end=INT_MAX is surely > this_end.

> +    end = this_end;
> +
> +  /* Handle the dependencies between the move and previously-scheduled
> +     successors.  */

(maybe assert they have indeed all been scheduled)

> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
> +    {
> +      this_insn = ps_rtl_insn (ps, u);
> +      this_latency = insn_latency (move->insn, this_insn);
> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))

shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?

> +       this_distance = -1;
> +      else
> +       this_distance = 0;
> +      this_time = SCHED_TIME (u) + this_distance * ii;
> +      this_start = this_time - ps->ii;

use ii instead of ps->ii

> +      this_end = this_time - this_latency;
> +      if (dump_file)
> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +                this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
> +                this_latency, this_distance, INSN_UID (this_insn));
> +
> +      if (start < this_start)
> +       start = this_start;
> +      if (end > this_end)
> +       end = this_end;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "----------- ----------- -----\n");
> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
> +    }
>
> -         fprintf (file, " reg_move = ");
> -         print_rtl_single (file, move->insn);
> +  sbitmap_zero (must_follow);
> +  SET_BIT (must_follow, move->def);
> +
> +  start = MAX (start, end - (ii - 1));

ok, so this excludes considering both ends of a possible ii-cycled
window. Such a window would imply zero move (or def->move) latencies
iinm, and more care in setting must_follow/must_precede.

> +  for (c = end; c >= start; c--)
> +    {
> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
> +                                        move->uses, must_follow);

ok - def must_follow the move, if both are scheduled on same row (as
we potentially start from this (end) row moving backwards), but uses
should precede the move, assuming that if move and use are on same row
the sched distance between them is ii (i.e., non-zero move->use
latency)

> +      if (psi)
> +       {
> +         update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
> +         if (dump_file)
> +           fprintf (dump_file, "\nScheduled register move INSN %d at"
> +                    " time %d, row %d\n\n", INSN_UID (move->insn), c,
> +                    SCHED_ROW (i_reg_move));
> +         return true;
>        }
>     }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "\nNo available slot\n\n");
> +
> +  return false;
>  }
>
>  /*
> @@ -508,9 +694,13 @@ schedule_reg_moves (partial_schedule_ptr
>       int nreg_moves = 0, i_reg_move;
>       rtx prev_reg, old_reg;
>       int first_move;
> +      int distances[2];
> +      sbitmap must_follow;
> +      sbitmap distance1_uses;
>
>       /* Compute the number of reg_moves needed for u, by looking at life
>         ranges started at u (excluding self-loops).  */
> +      distances[0] = distances[1] = false;
>       for (e = u->out; e; e = e->next_out)
>        if (e->type == TRUE_DEP && e->dest != e->src)
>          {
> @@ -527,6 +717,11 @@ schedule_reg_moves (partial_schedule_ptr
>                && SCHED_COLUMN (e->dest->cuid) < SCHED_COLUMN (e->src->cuid))
>              nreg_moves4e--;
>
> +           if (nreg_moves4e)
> +             {
> +               gcc_assert (e->distance < 2);
> +               distances[e->distance] = true;
> +             }
>            nreg_moves = MAX (nreg_moves, nreg_moves4e);
>          }
>
> @@ -537,11 +732,10 @@ schedule_reg_moves (partial_schedule_ptr
>       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
>       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
>                             first_move + nreg_moves);
> +      extend_node_sched_params (ps);
>
>       /* Record the moves associated with this node.  */
>       first_move += ps->g->num_nodes;
> -      SCHED_FIRST_REG_MOVE (i) = first_move;
> -      SCHED_NREG_MOVES (i) = nreg_moves;
>
>       /* Generate each move.  */
>       old_reg = prev_reg = SET_DEST (single_set (u->insn));
> @@ -550,15 +744,18 @@ schedule_reg_moves (partial_schedule_ptr
>          ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
>
>          move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
> -         move->uses = sbitmap_alloc (g->num_nodes);
> +         move->uses = sbitmap_alloc (first_move + nreg_moves);
>          move->old_reg = old_reg;
>          move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
> +         move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
>          move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
>          sbitmap_zero (move->uses);
>
>          prev_reg = move->new_reg;
>        }
>
> +      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
> +
>       /* Every use of the register defined by node may require a different
>         copy of this register, depending on the time the use is scheduled.
>         Record which uses require which move results.  */
> @@ -582,8 +779,21 @@ schedule_reg_moves (partial_schedule_ptr
>
>                move = ps_reg_move (ps, first_move + dest_copy - 1);
>                SET_BIT (move->uses, e->dest->cuid);
> +               if (e->distance == 1)
> +                 SET_BIT (distance1_uses, e->dest->cuid);
>              }
>          }
> +
> +      must_follow = sbitmap_alloc (first_move + nreg_moves);
> +      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
> +       if (!schedule_reg_move (ps, first_move + i_reg_move,
> +                               distance1_uses, must_follow))
> +         break;
> +      sbitmap_free (must_follow);
> +      if (distance1_uses)
> +       sbitmap_free (distance1_uses);
> +      if (i_reg_move < nreg_moves)
> +       return false;
>     }
>   return true;
>  }
> @@ -607,39 +817,6 @@ apply_reg_moves (partial_schedule_ptr ps
>          df_insn_rescan (ps->g->nodes[i_use].insn);
>        }
>     }
> -
> -  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
> -    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
> -}
> -
> -/* Update the sched_params (time, row and stage) for node U using the II,
> -   the CYCLE of U and MIN_CYCLE.
> -   We're not simply taking the following
> -   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> -   because the stages may not be aligned on cycle 0.  */
> -static void
> -update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> -{
> -  int sc_until_cycle_zero;
> -  int stage;
> -
> -  SCHED_TIME (u) = cycle;
> -  SCHED_ROW (u) = SMODULO (cycle, ii);
> -
> -  /* The calculation of stage count is done adding the number
> -     of stages before cycle zero and after cycle zero.  */
> -  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> -
> -  if (SCHED_TIME (u) < 0)
> -    {
> -      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> -    }
> -  else
> -    {
> -      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> -    }
>  }
>
>  /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
> @@ -680,22 +857,6 @@ reset_sched_times (partial_schedule_ptr
>       }
>  }
>
> -/* Set SCHED_COLUMN of each node according to its position in PS.  */
> -static void
> -set_columns_for_ps (partial_schedule_ptr ps)
> -{
> -  int row;
> -
> -  for (row = 0; row < ps->ii; row++)
> -    {
> -      ps_insn_ptr cur_insn = ps->rows[row];
> -      int column = 0;
> -
> -      for (; cur_insn; cur_insn = cur_insn->next_in_row)
> -       SCHED_COLUMN (cur_insn->id) = column++;
> -    }
> -}
> -
>  /* Permute the insns according to their order in PS, from row 0 to
>    row ii-1, and position them right before LAST.  This schedules
>    the insns of the loop kernel.  */
> @@ -712,8 +873,13 @@ permute_partial_schedule (partial_schedu
>        rtx insn = ps_rtl_insn (ps, ps_ij->id);
>
>        if (PREV_INSN (last) != insn)
> -         reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> -                             PREV_INSN (last));
> +         {
> +           if (ps_ij->id < ps->g->num_nodes)
> +             reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> +                                 PREV_INSN (last));
> +           else
> +             add_insn_before (insn, last, NULL);
> +         }
>       }
>  }
>
> @@ -916,7 +1082,7 @@ optimize_sc (partial_schedule_ptr ps, dd
>
>  static void
>  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> -                          int to_stage, int for_prolog, rtx count_reg)
> +                          int to_stage, rtx count_reg)
>  {
>   int row;
>   ps_insn_ptr ps_ij;
> @@ -925,7 +1091,7 @@ duplicate_insns_of_cycles (partial_sched
>     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
>       {
>        int u = ps_ij->id;
> -       int j, i_reg_moves, i_reg_move;
> +       int first_u, last_u;
>        rtx u_insn;
>
>         /* Do not duplicate any insn which refers to count_reg as it
> @@ -939,42 +1105,15 @@ duplicate_insns_of_cycles (partial_sched
>             || JUMP_P (u_insn))
>           continue;
>
> -       if (for_prolog)
> +       first_u = SCHED_STAGE (u);
> +       last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
> +       if (from_stage <= last_u && to_stage >= first_u)
>          {
> -           /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
> -              number of reg_moves starting with the second occurrence of
> -              u, which is generated if its SCHED_STAGE <= to_stage.  */
> -           i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *first* reg_move backwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
> +           if (u < ps->g->num_nodes)
> +             duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> +           else
> +             emit_insn (copy_rtx (PATTERN (u_insn)));
>          }
> -       else /* It's for the epilog.  */
> -         {
> -           /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
> -              starting to decrease one stage after u no longer occurs;
> -              that is, generate all reg_moves until
> -              SCHED_STAGE (u) == from_stage - 1.  */
> -           i_reg_moves = (SCHED_NREG_MOVES (u)
> -                          - (from_stage - SCHED_STAGE (u) - 1));
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *last* reg_move forwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
> -         }
> -
> -       for (j = 0; j < i_reg_moves; j++)
> -         {
> -           ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
> -
> -           emit_insn (copy_rtx (PATTERN (move->insn)));
> -         }
> -       if (SCHED_STAGE (u) >= from_stage
> -           && SCHED_STAGE (u) <= to_stage)
> -         duplicate_insn_chain (ps_first_note (ps, u), u_insn);
>       }
>  }
>
> @@ -1008,7 +1147,7 @@ generate_prolog_epilog (partial_schedule
>     }
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
> +    duplicate_insns_of_cycles (ps, 0, i, count_reg);
>
>   /* Put the prolog on the entry edge.  */
>   e = loop_preheader_edge (loop);
> @@ -1020,7 +1159,7 @@ generate_prolog_epilog (partial_schedule
>   start_sequence ();
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
> +    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
>
>   /* Put the epilogue on the exit edge.  */
>   gcc_assert (single_exit (loop));
> @@ -1361,8 +1500,7 @@ sms_schedule (void)
>     {
>       rtx head, tail;
>       rtx count_reg, count_init;
> -      int mii, rec_mii;
> -      unsigned stage_count;
> +      int mii, rec_mii, stage_count, min_cycle;
>       HOST_WIDEST_INT loop_count = 0;
>       bool opt_sc_p;
>
> @@ -1472,13 +1610,12 @@ sms_schedule (void)
>                }
>
>              gcc_assert (stage_count >= 1);
> -             PS_STAGE_COUNT (ps) = stage_count;
>            }
>
>          /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
>             1 means that there is no interleaving between iterations thus
>             we let the scheduling passes do the job in this case.  */
> -         if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
> +         if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
>              || (count_init && (loop_count <= stage_count))
>              || (flag_branch_probabilities && (trip_count <= stage_count)))
>            {
> @@ -1506,6 +1643,7 @@ sms_schedule (void)
>
>          set_columns_for_ps (ps);
>
> +         min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
>          if (!schedule_reg_moves (ps))
>            {
>              mii = ps->ii + 1;
> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>              continue;
>            }
>
> +         /* Moves that handle incoming values might have been added
> +            to a new first stage.  It's too late to try any kind of
> +            rotation, so just bump the stage count.  */

hmm, one could still apply the rotation optimization at this time if
desired, no?

Ayal.

> +         if (PS_MIN_CYCLE (ps) < min_cycle)
> +           {
> +             reset_sched_times (ps, 0);
> +             stage_count++;
> +           }
> +
> +         /* The stage count should now be correct without rotation.  */
> +         gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
> +         PS_STAGE_COUNT (ps) = stage_count;
> +
>          canon_loop (loop);
>
>           if (dump_file)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-10-10  1:10         ` Ayal Zaks
@ 2011-10-10 12:17           ` Richard Sandiford
  2011-10-10 15:52             ` Ayal Zaks
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2011-10-10 12:17 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches

Ayal Zaks <ayal.zaks@gmail.com> writes:
>> I agree it's natural to schedule moves for intra-iteration dependencies
>> in the normal get_sched_window way.  But suppose we have a dependency:
>>
>>   A --(T,N,1)--> B
>>
>> that requires two moves M1 and M2.  If we think in terms of cycles
>> (in the SCHED_TIME sense), then this effectively becomes:
>>
>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>
>> because it is now M1 that is fed by both the loop and the incoming edge.
>> But if there is a second dependency:
>>
>>   A --(T,M,0)--> C
>>
>> that also requires two moves, we end up with:
>>
>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>                                        -->(T,M3,-1)--> B
>>
>> and dependence distances of -1 feel a bit weird. :-)  Of course,
>> what we really have are two parallel dependencies:
>>
>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>
>>   A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>>
>> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
>> but are one stage further along.  But we only schedule them once,
>> so if we take the cycle/SCHED_TIME route, we have to introduce
>> dependencies of distance -1.
>>
>
> Interesting; had to digest this distance 1 business, a result of
> thinking in cycles instead of rows (or conversely), and mixing
> dependences with scheduling; here's my understanding, based on your
> explanations:
>
> Suppose a Use is truely dependent on a Def, where both have been
> scheduled at some absolute cycles; think of them as timing the first
> iteration of the loop.
> Assume first that Use appears originally after Def in the original
> instruction sequence of the loop (dependence distance 0). In this
> case, Use requires register moves if its distance D from Def according
> to the schedule is more than ii cycles long -- by the time Use is
> executed, the value it needs is no longer available in the def'd
> register due to intervening occurrences of Def. So in this case, the
> first reg-move (among D/ii) should appear after Def, recording its
> value before the next occurrence of Def overwrites it, and feeding
> subsequent moves as needed before each is overwritten. Thus the
> scheduling window of this first reg-move is within (Def, Def+ii).
>
> Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
> it remains scheduled before the Def, no reg-move is needed. If, otoh,
> Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
> between them, (D/ii + 1) reg-moves are needed in order to feed Use
> with the live-in value before Def. So in this case, the first reg-move
> should appear before Def (again feeding subsequent moves as needed
> before each is overwritten). Thus the scheduling window of this first
> reg-move is within (Def-ii, Def).
>
> In any case, each use requires a specific number of reg-moves, which
> begin either before or after the def; it is always safe (though
> potentially redundant) to start reg-moving before the def -- uses that
> do not need the live-in value will ignore it by feeding from a later
> reg-move.

Right.  The distance on the Def->Use dependency is effectively transferred
to the dependency between the Def and first move.

And we can potentially have both kinds of Use at the same time.
We then end up scheduling two moves, one in each window, but require
them to occupy the same row and column.  It feels more convenient to
schedule the earlier of the two moves.

> Q: if we generate reg-moves starting from before the def, would
> redundant reg-moves be eliminated if no use needs access to live-in
> value? If so, would that simplify their generation? (If not, it may be
> interesting to understand how to improve DCE to catch it.)

To be clear, the new version of the patch (unlike the old one)
doesn't emit reg-moves before the def if all uses are distance 0.
We only do it where some or all uses are distance 1.  The first move
before the def is always needed in that case.

So redudant moves are confined to the case where (a) we have more
than one move, (b) we have both distance 0 and distance 1 uses, and
(c) one of the two distance sets requires more moves than the other.
If the distance 0 dependencies require more moves than the distance
1 dependencies, we will have a redudant move in the prologue.
If the distance 1 dependencies require more moves than the distance
0 dependencies, we will have a redudant move in the epilogue.
But I believe that this is already handled by dce.

(For avoidance of doubt, the new code is more accurate than
current trunk in this respect.)

> The issue of assigning stages to reg-moves is mostly relevant for
> prolog and epilog generation, which requires and receives special
> attention -- handled very nicely by ps_num_consecutive_stages! Note
> that currently a simple boolean indicator for (the exceptional case
> of) double stages would suffice, instead of generalizing to arbitrary
> nums of consecutive stages (see any potential use for them?).

Not in the immediate term.  But I think having a boolean indicator
would be inconsistent.  If the distance field is an int (even though
we only expect distance-0 and distance-1 register dependencies)
then I think the number of stages should be too.

I did wonder originally about using a boolean, but IMO, it makes
the code less readable rather than more.  Instead of a simple
range check like:

     if (first_stage_for_insn <= last_stage_in_range
         && last_stage_for_insn >= first_stage_in_range)

we end up with the equivalent of:

     if (first_stage_for_insn <= last_stage_in_range
         && (double_stage_move_p (...)
             ? first_stage_for_insn + 1 >= first_stage_in_range
             : first_stage_for_insn >= first_stage_in_range))

with no corresponding simplification elsewhere.

>> However...
>>
>>> Also note that if moves are assigned absolute cycles, it becomes clear
>>> to which stage each belongs (just like any other instruction),
>>> regulating their generation in prolog and epilog.
>>
>> ...I have to concede that it makes the stage calculation easier,
>> and that that tips the balance.  (Although again, a move can belong
>> to two stages simultanouesly.)
>>
>> Anyway, here's an updated patch that uses cycle times.  I've also
>> dropped the code that tried to allow windows to start and end on
>> the same row (and therefore schedule "either side" of that row).
>> I thought it might help with some VLIWy DFAs, but it was always
>> going to be of minor benefit, and we don't try that elsewhere,
>> so let's avoid the complication.
>>
>
> ok. Such windows seem rare, as they imply zero move (or def->move)
> latencies, iinm. Leaving a note behind would be good.

OK.  Since this same restriction applies to the amin scheduling
window code, I suppose the natural place would be in the algorithm
description itself:

/* The SMS scheduling algorithm itself
   -----------------------------------
   Input: 'O' an ordered list of insns of a loop.
   Output: A scheduling of the loop - kernel, prolog, and epilogue.
   ...
   42. compute epilogue & prologue
   43. finish - succeeded to schedule
*/

E.g. adding something like this at the end:

   ??? The algorithm restricts the scheduling window to II cycles.
   In rare cases, it may be better to allow windows of II+1 cycles.
   The window would then start and end on the same row, but with
   different "must precede" and "must follow" requirements.

Let me know what you think and I'll add it as a follow-on patch.

>> Bootstrapped & regression-tested on powerpc64-linux-gnu with
>> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
>> Also retested on the ARM benchmarks.  OK to install?
>>
>
> Yes, OK.
> Some points to consider raised above, and some comments added below.

Thanks, applied with the changes below.

>> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>>   for (i = 0; i < num_nodes; i++)
>>     {
>>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
>> -      int j;
>>
>>       fprintf (file, "Node = %d; INSN = %d\n", i,
>>               INSN_UID (ps_rtl_insn (ps, i)));
>>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>>       fprintf (file, " time = %d:\n", nsp->time);
>> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>> -      for (j = 0; j < nsp->nreg_moves; j++)
>> -       {
>> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
>
> Is another way provided to print a summary of the regmoves? Or does
> the info dumped while scheduling each one suffice for practical
> tracing and debugging.

Yeah, the new info should be more complete.  It gives the pattern
(which is what we dump here), but also lists the producers and
consumers of each move (which we don't print here).

>> +      fprintf (file, " stage = %d:\n", nsp->stage);
>> +    }
>> +}
>> +
>> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
>> +static void
>> +set_columns_for_row (partial_schedule_ptr ps, int row)
>> +{
>> +  ps_insn_ptr cur_insn;
>> +  int column;
>> +
>> +  column = 0;
>> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
>> +    SCHED_COLUMN (cur_insn->id) = column++;
>> +}
>> +
>> +/* Set SCHED_COLUMN for each instruction in PS.  */
>> +static void
>> +set_columns_for_ps (partial_schedule_ptr ps)
>> +{
>> +  int row;
>> +
>> +  for (row = 0; row < ps->ii; row++)
>> +    set_columns_for_row (ps, row);
>> +}
>> +
>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>> +   Its predecessors and successors have already been scheduled.
>
> well, except for some (preceeding or succeeding?) moves that have not
> yet been scheduled, right?

Ah, yeah, fixed to:

/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
   Its single predecessor has already been scheduled, as has its
   ddg node successors.  (The move may have also another move as its
   successor, in which case that successor will be scheduled later.)

>> +
>> +   The move is part of a chain that satisfies register dependencies
>> +   between a producing ddg node and various consuming ddg nodes.
>> +   If some of these dependencies cross a loop iteration (that is,
>> +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
>> +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
>> +   is null otherwise.
>> +
>
> Maybe clarify that they are upwards-exposed or live-in uses.

OK, changed to:

   The move is part of a chain that satisfies register dependencies
   between a producing ddg node and various consuming ddg nodes.
   If some of these dependencies have a distance of 1 (meaning that
   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
   contains the set of uses with distance-1 dependencies.
   DISTANCE1_USES is null otherwise.

>> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
>> +   all current ps_insn ids.
>> +
>> +   Return true on success.  */
>> +static bool
>> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
>> +                  sbitmap distance1_uses, sbitmap must_follow)
>> +{
>> +  unsigned int u;
>> +  int this_time, this_distance, this_start, this_end, this_latency;
>> +  int start, end, c, ii;
>> +  sbitmap_iterator sbi;
>> +  ps_reg_move_info *move;
>> +  rtx this_insn;
>> +  ps_insn_ptr psi;
>> +
>> +  move = ps_reg_move (ps, i_reg_move);
>> +  ii = ps->ii;
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
>> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
>> +              PS_MIN_CYCLE (ps));
>> +      print_rtl_single (dump_file, move->insn);
>> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
>> +      fprintf (dump_file, "=========== =========== =====\n");
>> +    }
>> +
>> +  start = INT_MIN;
>> +  end = INT_MAX;
>> +
>> +  /* For dependencies of distance 1 between a producer ddg node A
>> +     and consumer ddg node B, we have a chain of dependencies:
>> +
>> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>> +
>> +     where Mi is the ith move.  For dependencies of distance 0 between
>> +     a producer ddg node A and consumer ddg node C, we have a chain of
>> +     dependencies:
>> +
>> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>> +
>> +     where Mi' occupies the same position as Mi but occurs a stage later.
>> +     We can only schedule each move once, so if we have both types of
>> +     chain, we model the second as:
>> +
>> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>> +
>> +     First handle the dependencies between the previously-scheduled
>> +     predecessor and the move.  */
>> +  this_insn = ps_rtl_insn (ps, move->def);
>> +  this_latency = insn_latency (this_insn, move->insn);
>> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
>> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
>> +  this_start = this_time + this_latency;
>> +  this_end = this_time + ii;
>> +  if (dump_file)
>> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>> +            this_start, this_end, SCHED_TIME (move->def),
>> +            INSN_UID (this_insn), this_latency, this_distance,
>> +            INSN_UID (move->insn));
>> +
>> +  if (start < this_start)
>
> redundant; start=INT_MIN is surely < this_start.
>
>> +    start = this_start;
>> +  if (end > this_end)
>
> redundant; end=INT_MAX is surely > this_end.

I did this deliberately because the order in which we apply the limits
isn't important.  Making them assymetrical by applying this kind of
micro-optimisation is IMO a bad idea.  The compiler will do it for us.

E.g. I originally had an extra limit to make sure that we didn't add
new stages.  If we applied the optimisation by hand, and then someone
added a new limit like that later, they'd have to be careful about
where they put it.

>> +    end = this_end;
>> +
>> +  /* Handle the dependencies between the move and previously-scheduled
>> +     successors.  */
>
> (maybe assert they have indeed all been scheduled)

Hmm, I don't think that'd be useful.  We've already read the schedule
time (and row and column) by this point, and we don't assert the same
thing elsewhere (in the code that assigns moves to uses, or in
get_sched_window itself).

>> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
>> +    {
>> +      this_insn = ps_rtl_insn (ps, u);
>> +      this_latency = insn_latency (move->insn, this_insn);
>> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))
>
> shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?
>
>> +       this_distance = -1;
>> +      else
>> +       this_distance = 0;

No, this condition corresponds to:

  /* For dependencies of distance 1 between a producer ddg node A
     and consumer ddg node B, we have a chain of dependencies:

        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B

     where Mi is the ith move.  For dependencies of distance 0 between
     a producer ddg node A and consumer ddg node C, we have a chain of
     dependencies:

        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C

     where Mi' occupies the same position as Mi but occurs a stage later.
     We can only schedule each move once, so if we have both types of
     chain, we model the second as:

        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C

     First handle the dependencies between the previously-scheduled
     predecessor and the move.  */

i.e. we want a distance of -1 when (a) the original definition has uses
with both distance 0 and distance 1 and (b) the particular use we're
looking at has distance 0.

>> +      this_time = SCHED_TIME (u) + this_distance * ii;
>> +      this_start = this_time - ps->ii;
>
> use ii instead of ps->ii

Oops, fixed.

>> +      this_end = this_time - this_latency;
>> +      if (dump_file)
>> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>> +                this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
>> +                this_latency, this_distance, INSN_UID (this_insn));
>> +
>> +      if (start < this_start)
>> +       start = this_start;
>> +      if (end > this_end)
>> +       end = this_end;
>> +    }
>> +
>> +  if (dump_file)
>> +    {
>> +      fprintf (dump_file, "----------- ----------- -----\n");
>> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
>> +    }
>>
>> -         fprintf (file, " reg_move = ");
>> -         print_rtl_single (file, move->insn);
>> +  sbitmap_zero (must_follow);
>> +  SET_BIT (must_follow, move->def);
>> +
>> +  start = MAX (start, end - (ii - 1));
>
> ok, so this excludes considering both ends of a possible ii-cycled
> window. Such a window would imply zero move (or def->move) latencies
> iinm, and more care in setting must_follow/must_precede.

Right.

>> +  for (c = end; c >= start; c--)
>> +    {
>> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
>> +                                        move->uses, must_follow);
>
> ok - def must_follow the move, if both are scheduled on same row (as
> we potentially start from this (end) row moving backwards), but uses
> should precede the move, assuming that if move and use are on same row
> the sched distance between them is ii (i.e., non-zero move->use
> latency)

Right.

>> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>>              continue;
>>            }
>>
>> +         /* Moves that handle incoming values might have been added
>> +            to a new first stage.  It's too late to try any kind of
>> +            rotation, so just bump the stage count.  */
>
> hmm, one could still apply the rotation optimization at this time if
> desired, no?

Hmm, maybe :-)  I changed the comment to:

	  /* Moves that handle incoming values might have been added
	     to a new first stage.  Bump the stage count if so.

	     ??? Perhaps we could consider rotating the schedule here
	     instead?  */

Richard


gcc/
	* modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
	(SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
	(node_sched_params): Remove first_reg_move and nreg_moves.
	(ps_num_consecutive_stages, extend_node_sched_params): New functions.
	(update_node_sched_params): Move up file.
	(print_node_sched_params): Print the stage.  Don't dump info related
	to first_reg_move and nreg_moves.
	(set_columns_for_row): New function.
	(set_columns_for_ps): Move up file and use set_columns_for_row.
	(schedule_reg_move): New function.
	(schedule_reg_moves): Call extend_node_sched_params and
	schedule_reg_move.  Extend size of uses bitmap.  Initialize
	num_consecutive_stages.  Return false if a move could not be
	scheduled.
	(apply_reg_moves): Don't emit moves here.
	(permute_partial_schedule): Handle register moves.
	(duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
	to the same stage-count test as ddg nodes.
	(generate_prolog_epilog): Update calls accordingly.
	(sms_schedule): Allow move-scheduling to add a new first stage.

Index: gcc/modulo-sched.c
===================================================================
--- gcc/modulo-sched.c	2011-10-10 11:03:23.278273801 +0100
+++ gcc/modulo-sched.c	2011-10-10 12:24:44.539932692 +0100
@@ -153,6 +153,9 @@ struct ps_reg_move_info
   rtx old_reg;
   rtx new_reg;
 
+  /* The number of consecutive stages that the move occupies.  */
+  int num_consecutive_stages;
+
   /* An instruction that sets NEW_REG to the correct value.  The first
      move associated with DEF will have an rhs of OLD_REG; later moves
      use the result of the previous move.  */
@@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
-static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
 
 #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, x)
 #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
-#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
-#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
 #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
 #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
 #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
@@ -244,15 +243,6 @@ typedef struct node_sched_params
 {
   int time;	/* The absolute scheduling cycle.  */
 
-  /* The following field (first_reg_move) is the ps_insn id of the first
-     register-move instruction added to handle the modulo-variable-expansion
-     of the register defined by this node.  This register-move copies the
-     original register defined by the node.  */
-  int first_reg_move;
-
-  /* The number of register-move instructions added.  */
-  int nreg_moves;
-
   int row;    /* Holds time % ii.  */
   int stage;  /* Holds time / ii.  */
 
@@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps, 
   return ps->g->nodes[id].first_note;
 }
 
+/* Return the number of consecutive stages that are occupied by
+   partial schedule instruction ID in PS.  */
+static int
+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+  if (id < ps->g->num_nodes)
+    return 1;
+  else
+    return ps_reg_move (ps, id)->num_consecutive_stages;
+}
+
 /* Given HEAD and TAIL which are the first and last insns in a loop;
    return the register which controls the loop.  Return zero if it has
    more than one occurrence in the loop besides the control part or the
@@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
 			 node_sched_param_vec, g->num_nodes);
 }
 
+/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
+static void
+extend_node_sched_params (partial_schedule_ptr ps)
+{
+  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
+			 ps->g->num_nodes + VEC_length (ps_reg_move_info,
+							ps->reg_moves));
+}
+
+/* Update the sched_params (time, row and stage) for node U using the II,
+   the CYCLE of U and MIN_CYCLE.
+   We're not simply taking the following
+   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
+   because the stages may not be aligned on cycle 0.  */
+static void
+update_node_sched_params (int u, int ii, int cycle, int min_cycle)
+{
+  int sc_until_cycle_zero;
+  int stage;
+
+  SCHED_TIME (u) = cycle;
+  SCHED_ROW (u) = SMODULO (cycle, ii);
+
+  /* The calculation of stage count is done adding the number
+     of stages before cycle zero and after cycle zero.  */
+  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
+
+  if (SCHED_TIME (u) < 0)
+    {
+      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
+    }
+  else
+    {
+      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
+      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
+    }
+}
+
 static void
 print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
 {
@@ -466,21 +506,169 @@ print_node_sched_params (FILE *file, int
   for (i = 0; i < num_nodes; i++)
     {
       node_sched_params_ptr nsp = SCHED_PARAMS (i);
-      int j;
 
       fprintf (file, "Node = %d; INSN = %d\n", i,
 	       INSN_UID (ps_rtl_insn (ps, i)));
       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
       fprintf (file, " time = %d:\n", nsp->time);
-      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
-      for (j = 0; j < nsp->nreg_moves; j++)
-	{
-	  ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
+      fprintf (file, " stage = %d:\n", nsp->stage);
+    }
+}
+
+/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
+static void
+set_columns_for_row (partial_schedule_ptr ps, int row)
+{
+  ps_insn_ptr cur_insn;
+  int column;
+
+  column = 0;
+  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
+    SCHED_COLUMN (cur_insn->id) = column++;
+}
+
+/* Set SCHED_COLUMN for each instruction in PS.  */
+static void
+set_columns_for_ps (partial_schedule_ptr ps)
+{
+  int row;
+
+  for (row = 0; row < ps->ii; row++)
+    set_columns_for_row (ps, row);
+}
+
+/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
+   Its single predecessor has already been scheduled, as has its
+   ddg node successors.  (The move may have also another move as its
+   successor, in which case that successor will be scheduled later.)
 
-	  fprintf (file, " reg_move = ");
-	  print_rtl_single (file, move->insn);
+   The move is part of a chain that satisfies register dependencies
+   between a producing ddg node and various consuming ddg nodes.
+   If some of these dependencies have a distance of 1 (meaning that
+   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
+   contains the set of uses with distance-1 dependencies.
+   DISTANCE1_USES is null otherwise.
+
+   MUST_FOLLOW is a scratch bitmap that is big enough to hold
+   all current ps_insn ids.
+
+   Return true on success.  */
+static bool
+schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
+		   sbitmap distance1_uses, sbitmap must_follow)
+{
+  unsigned int u;
+  int this_time, this_distance, this_start, this_end, this_latency;
+  int start, end, c, ii;
+  sbitmap_iterator sbi;
+  ps_reg_move_info *move;
+  rtx this_insn;
+  ps_insn_ptr psi;
+
+  move = ps_reg_move (ps, i_reg_move);
+  ii = ps->ii;
+  if (dump_file)
+    {
+      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
+	       ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
+	       PS_MIN_CYCLE (ps));
+      print_rtl_single (dump_file, move->insn);
+      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
+      fprintf (dump_file, "=========== =========== =====\n");
+    }
+
+  start = INT_MIN;
+  end = INT_MAX;
+
+  /* For dependencies of distance 1 between a producer ddg node A
+     and consumer ddg node B, we have a chain of dependencies:
+
+        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
+
+     where Mi is the ith move.  For dependencies of distance 0 between
+     a producer ddg node A and consumer ddg node C, we have a chain of
+     dependencies:
+
+        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
+
+     where Mi' occupies the same position as Mi but occurs a stage later.
+     We can only schedule each move once, so if we have both types of
+     chain, we model the second as:
+
+        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
+
+     First handle the dependencies between the previously-scheduled
+     predecessor and the move.  */
+  this_insn = ps_rtl_insn (ps, move->def);
+  this_latency = insn_latency (this_insn, move->insn);
+  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
+  this_time = SCHED_TIME (move->def) - this_distance * ii;
+  this_start = this_time + this_latency;
+  this_end = this_time + ii;
+  if (dump_file)
+    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+	     this_start, this_end, SCHED_TIME (move->def),
+	     INSN_UID (this_insn), this_latency, this_distance,
+	     INSN_UID (move->insn));
+
+  if (start < this_start)
+    start = this_start;
+  if (end > this_end)
+    end = this_end;
+
+  /* Handle the dependencies between the move and previously-scheduled
+     successors.  */
+  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
+    {
+      this_insn = ps_rtl_insn (ps, u);
+      this_latency = insn_latency (move->insn, this_insn);
+      if (distance1_uses && !TEST_BIT (distance1_uses, u))
+	this_distance = -1;
+      else
+	this_distance = 0;
+      this_time = SCHED_TIME (u) + this_distance * ii;
+      this_start = this_time - ii;
+      this_end = this_time - this_latency;
+      if (dump_file)
+	fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
+		 this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
+		 this_latency, this_distance, INSN_UID (this_insn));
+
+      if (start < this_start)
+	start = this_start;
+      if (end > this_end)
+	end = this_end;
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "----------- ----------- -----\n");
+      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
+    }
+
+  sbitmap_zero (must_follow);
+  SET_BIT (must_follow, move->def);
+
+  start = MAX (start, end - (ii - 1));
+  for (c = end; c >= start; c--)
+    {
+      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
+					 move->uses, must_follow);
+      if (psi)
+	{
+	  update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
+	  if (dump_file)
+	    fprintf (dump_file, "\nScheduled register move INSN %d at"
+		     " time %d, row %d\n\n", INSN_UID (move->insn), c,
+		     SCHED_ROW (i_reg_move));
+	  return true;
 	}
     }
+
+  if (dump_file)
+    fprintf (dump_file, "\nNo available slot\n\n");
+
+  return false;
 }
 
 /*
@@ -508,6 +696,9 @@ schedule_reg_moves (partial_schedule_ptr
       int nreg_moves = 0, i_reg_move;
       rtx prev_reg, old_reg;
       int first_move;
+      int distances[2];
+      sbitmap must_follow;
+      sbitmap distance1_uses;
       rtx set = single_set (u->insn);
       
       /* Skip instructions that do not set a register.  */
@@ -516,6 +707,7 @@ schedule_reg_moves (partial_schedule_ptr
  
       /* Compute the number of reg_moves needed for u, by looking at life
 	 ranges started at u (excluding self-loops).  */
+      distances[0] = distances[1] = false;
       for (e = u->out; e; e = e->next_out)
 	if (e->type == TRUE_DEP && e->dest != e->src)
 	  {
@@ -546,6 +738,11 @@ schedule_reg_moves (partial_schedule_ptr
 		gcc_assert (!autoinc_var_is_used_p (u->insn, e->dest->insn));
 	      }
 	    
+	    if (nreg_moves4e)
+	      {
+		gcc_assert (e->distance < 2);
+		distances[e->distance] = true;
+	      }
 	    nreg_moves = MAX (nreg_moves, nreg_moves4e);
 	  }
 
@@ -556,11 +753,10 @@ schedule_reg_moves (partial_schedule_ptr
       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
 			     first_move + nreg_moves);
+      extend_node_sched_params (ps);
 
       /* Record the moves associated with this node.  */
       first_move += ps->g->num_nodes;
-      SCHED_FIRST_REG_MOVE (i) = first_move;
-      SCHED_NREG_MOVES (i) = nreg_moves;
 
       /* Generate each move.  */
       old_reg = prev_reg = SET_DEST (single_set (u->insn));
@@ -569,15 +765,18 @@ schedule_reg_moves (partial_schedule_ptr
 	  ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
 
 	  move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
-	  move->uses = sbitmap_alloc (g->num_nodes);
+	  move->uses = sbitmap_alloc (first_move + nreg_moves);
 	  move->old_reg = old_reg;
 	  move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
+	  move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
 	  move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
 	  sbitmap_zero (move->uses);
 
 	  prev_reg = move->new_reg;
 	}
 
+      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
+
       /* Every use of the register defined by node may require a different
 	 copy of this register, depending on the time the use is scheduled.
 	 Record which uses require which move results.  */
@@ -601,8 +800,21 @@ schedule_reg_moves (partial_schedule_ptr
 
 		move = ps_reg_move (ps, first_move + dest_copy - 1);
 		SET_BIT (move->uses, e->dest->cuid);
+		if (e->distance == 1)
+		  SET_BIT (distance1_uses, e->dest->cuid);
 	      }
 	  }
+
+      must_follow = sbitmap_alloc (first_move + nreg_moves);
+      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
+	if (!schedule_reg_move (ps, first_move + i_reg_move,
+				distance1_uses, must_follow))
+	  break;
+      sbitmap_free (must_follow);
+      if (distance1_uses)
+	sbitmap_free (distance1_uses);
+      if (i_reg_move < nreg_moves)
+	return false;
     }
   return true;
 }
@@ -626,39 +838,6 @@ apply_reg_moves (partial_schedule_ptr ps
 	  df_insn_rescan (ps->g->nodes[i_use].insn);
 	}
     }
-
-  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
-    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
-}
-
-/* Update the sched_params (time, row and stage) for node U using the II,
-   the CYCLE of U and MIN_CYCLE.  
-   We're not simply taking the following
-   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
-   because the stages may not be aligned on cycle 0.  */
-static void
-update_node_sched_params (int u, int ii, int cycle, int min_cycle)
-{
-  int sc_until_cycle_zero;
-  int stage;
-
-  SCHED_TIME (u) = cycle;
-  SCHED_ROW (u) = SMODULO (cycle, ii);
-
-  /* The calculation of stage count is done adding the number
-     of stages before cycle zero and after cycle zero.  */
-  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
-
-  if (SCHED_TIME (u) < 0)
-    {
-      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
-    }
-  else
-    {
-      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
-      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
-    }
 }
 
 /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
@@ -699,22 +878,6 @@ reset_sched_times (partial_schedule_ptr 
       }
 }
  
-/* Set SCHED_COLUMN of each node according to its position in PS.  */
-static void
-set_columns_for_ps (partial_schedule_ptr ps)
-{
-  int row;
-
-  for (row = 0; row < ps->ii; row++)
-    {
-      ps_insn_ptr cur_insn = ps->rows[row];
-      int column = 0;
-
-      for (; cur_insn; cur_insn = cur_insn->next_in_row)
-	SCHED_COLUMN (cur_insn->id) = column++;
-    }
-}
-
 /* Permute the insns according to their order in PS, from row 0 to
    row ii-1, and position them right before LAST.  This schedules
    the insns of the loop kernel.  */
@@ -731,8 +894,13 @@ permute_partial_schedule (partial_schedu
 	rtx insn = ps_rtl_insn (ps, ps_ij->id);
 
 	if (PREV_INSN (last) != insn)
-	  reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
-			      PREV_INSN (last));
+	  {
+	    if (ps_ij->id < ps->g->num_nodes)
+	      reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
+				  PREV_INSN (last));
+	    else
+	      add_insn_before (insn, last, NULL);
+	  }
       }
 }
 
@@ -935,7 +1103,7 @@ optimize_sc (partial_schedule_ptr ps, dd
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, rtx count_reg)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -944,7 +1112,7 @@ duplicate_insns_of_cycles (partial_sched
     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
       {
 	int u = ps_ij->id;
-	int j, i_reg_moves, i_reg_move;
+	int first_u, last_u;
 	rtx u_insn;
 
         /* Do not duplicate any insn which refers to count_reg as it
@@ -958,42 +1126,15 @@ duplicate_insns_of_cycles (partial_sched
             || JUMP_P (u_insn))
           continue;
 
-	if (for_prolog)
+	first_u = SCHED_STAGE (u);
+	last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+	if (from_stage <= last_u && to_stage >= first_u)
 	  {
-	    /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
-	       number of reg_moves starting with the second occurrence of
-	       u, which is generated if its SCHED_STAGE <= to_stage.  */
-	    i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *first* reg_move backwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
-	  }
-	else /* It's for the epilog.  */
-	  {
-	    /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
-	       starting to decrease one stage after u no longer occurs;
-	       that is, generate all reg_moves until
-	       SCHED_STAGE (u) == from_stage - 1.  */
-	    i_reg_moves = (SCHED_NREG_MOVES (u)
-			   - (from_stage - SCHED_STAGE (u) - 1));
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *last* reg_move forwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
+	    if (u < ps->g->num_nodes)
+	      duplicate_insn_chain (ps_first_note (ps, u), u_insn);
+	    else
+	      emit_insn (copy_rtx (PATTERN (u_insn)));
 	  }
-
-	for (j = 0; j < i_reg_moves; j++)
-	  {
-	    ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
-
-	    emit_insn (copy_rtx (PATTERN (move->insn)));
-	  }
-	if (SCHED_STAGE (u) >= from_stage
-	    && SCHED_STAGE (u) <= to_stage)
-	  duplicate_insn_chain (ps_first_note (ps, u), u_insn);
       }
 }
 
@@ -1027,7 +1168,7 @@ generate_prolog_epilog (partial_schedule
     }
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1039,7 +1180,7 @@ generate_prolog_epilog (partial_schedule
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));
@@ -1375,8 +1516,7 @@ sms_schedule (void)
     {
       rtx head, tail;
       rtx count_reg, count_init;
-      int mii, rec_mii;
-      unsigned stage_count;
+      int mii, rec_mii, stage_count, min_cycle;
       HOST_WIDEST_INT loop_count = 0;
       bool opt_sc_p;
 
@@ -1486,13 +1626,12 @@ sms_schedule (void)
 		}
 
 	      gcc_assert (stage_count >= 1);
-	      PS_STAGE_COUNT (ps) = stage_count;
 	    }
 
 	  /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
 	     1 means that there is no interleaving between iterations thus
 	     we let the scheduling passes do the job in this case.  */
-	  if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
+	  if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
 	      || (count_init && (loop_count <= stage_count))
 	      || (flag_branch_probabilities && (trip_count <= stage_count)))
 	    {
@@ -1520,6 +1659,7 @@ sms_schedule (void)
 	  
 	  set_columns_for_ps (ps);
 
+	  min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
 	  if (!schedule_reg_moves (ps))
 	    {
 	      mii = ps->ii + 1;
@@ -1527,6 +1667,21 @@ sms_schedule (void)
 	      continue;
 	    }
 
+	  /* Moves that handle incoming values might have been added
+	     to a new first stage.  Bump the stage count if so.
+
+	     ??? Perhaps we could consider rotating the schedule here
+	     instead?  */
+	  if (PS_MIN_CYCLE (ps) < min_cycle)
+	    {
+	      reset_sched_times (ps, 0);
+	      stage_count++;
+	    }
+
+	  /* The stage count should now be correct without rotation.  */
+	  gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
+	  PS_STAGE_COUNT (ps) = stage_count;
+
 	  canon_loop (loop);
 
           if (dump_file)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-10-10 12:17           ` Richard Sandiford
@ 2011-10-10 15:52             ` Ayal Zaks
  2011-10-11  8:49               ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Ayal Zaks @ 2011-10-10 15:52 UTC (permalink / raw)
  To: Ayal Zaks, gcc-patches, richard.sandiford

On Mon, Oct 10, 2011 at 1:57 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Ayal Zaks <ayal.zaks@gmail.com> writes:
>>> I agree it's natural to schedule moves for intra-iteration dependencies
>>> in the normal get_sched_window way.  But suppose we have a dependency:
>>>
>>>   A --(T,N,1)--> B
>>>
>>> that requires two moves M1 and M2.  If we think in terms of cycles
>>> (in the SCHED_TIME sense), then this effectively becomes:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>> because it is now M1 that is fed by both the loop and the incoming edge.
>>> But if there is a second dependency:
>>>
>>>   A --(T,M,0)--> C
>>>
>>> that also requires two moves, we end up with:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>                                        -->(T,M3,-1)--> B
>>>
>>> and dependence distances of -1 feel a bit weird. :-)  Of course,
>>> what we really have are two parallel dependencies:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>>   A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>>>
>>> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
>>> but are one stage further along.  But we only schedule them once,
>>> so if we take the cycle/SCHED_TIME route, we have to introduce
>>> dependencies of distance -1.
>>>
>>
>> Interesting; had to digest this distance 1 business, a result of
>> thinking in cycles instead of rows (or conversely), and mixing
>> dependences with scheduling; here's my understanding, based on your
>> explanations:
>>
>> Suppose a Use is truely dependent on a Def, where both have been
>> scheduled at some absolute cycles; think of them as timing the first
>> iteration of the loop.
>> Assume first that Use appears originally after Def in the original
>> instruction sequence of the loop (dependence distance 0). In this
>> case, Use requires register moves if its distance D from Def according
>> to the schedule is more than ii cycles long -- by the time Use is
>> executed, the value it needs is no longer available in the def'd
>> register due to intervening occurrences of Def. So in this case, the
>> first reg-move (among D/ii) should appear after Def, recording its
>> value before the next occurrence of Def overwrites it, and feeding
>> subsequent moves as needed before each is overwritten. Thus the
>> scheduling window of this first reg-move is within (Def, Def+ii).
>>
>> Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
>> it remains scheduled before the Def, no reg-move is needed. If, otoh,
>> Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
>> between them, (D/ii + 1) reg-moves are needed in order to feed Use
>> with the live-in value before Def. So in this case, the first reg-move
>> should appear before Def (again feeding subsequent moves as needed
>> before each is overwritten). Thus the scheduling window of this first
>> reg-move is within (Def-ii, Def).
>>
>> In any case, each use requires a specific number of reg-moves, which
>> begin either before or after the def; it is always safe (though
>> potentially redundant) to start reg-moving before the def -- uses that
>> do not need the live-in value will ignore it by feeding from a later
>> reg-move.
>
> Right.  The distance on the Def->Use dependency is effectively transferred
> to the dependency between the Def and first move.
>
> And we can potentially have both kinds of Use at the same time.
> We then end up scheduling two moves, one in each window, but require
> them to occupy the same row and column.  It feels more convenient to
> schedule the earlier of the two moves.
>
>> Q: if we generate reg-moves starting from before the def, would
>> redundant reg-moves be eliminated if no use needs access to live-in
>> value? If so, would that simplify their generation? (If not, it may be
>> interesting to understand how to improve DCE to catch it.)
>
> To be clear, the new version of the patch (unlike the old one)
> doesn't emit reg-moves before the def if all uses are distance 0.
> We only do it where some or all uses are distance 1.  The first move
> before the def is always needed in that case.
>

Understood. The question was whether it would simplify things if we
always emit reg-moves before the def. And rely on DCE to hopefully
eliminate redundancies.

> So redudant moves are confined to the case where (a) we have more
> than one move, (b) we have both distance 0 and distance 1 uses, and
> (c) one of the two distance sets requires more moves than the other.
> If the distance 0 dependencies require more moves than the distance
> 1 dependencies, we will have a redudant move in the prologue.
> If the distance 1 dependencies require more moves than the distance
> 0 dependencies, we will have a redudant move in the epilogue.
> But I believe that this is already handled by dce.
>
> (For avoidance of doubt, the new code is more accurate than
> current trunk in this respect.)
>
>> The issue of assigning stages to reg-moves is mostly relevant for
>> prolog and epilog generation, which requires and receives special
>> attention -- handled very nicely by ps_num_consecutive_stages! Note
>> that currently a simple boolean indicator for (the exceptional case
>> of) double stages would suffice, instead of generalizing to arbitrary
>> nums of consecutive stages (see any potential use for them?).
>
> Not in the immediate term.  But I think having a boolean indicator
> would be inconsistent.  If the distance field is an int (even though
> we only expect distance-0 and distance-1 register dependencies)
> then I think the number of stages should be too.
>
> I did wonder originally about using a boolean, but IMO, it makes
> the code less readable rather than more.  Instead of a simple
> range check like:
>
>     if (first_stage_for_insn <= last_stage_in_range
>         && last_stage_for_insn >= first_stage_in_range)
>
> we end up with the equivalent of:
>
>     if (first_stage_for_insn <= last_stage_in_range
>         && (double_stage_move_p (...)
>             ? first_stage_for_insn + 1 >= first_stage_in_range
>             : first_stage_for_insn >= first_stage_in_range))
>
> with no corresponding simplification elsewhere.
>

Sure. But setting the range can be done by consulting an simple
indicator, rather than generalizing to arbitrary stage numbers; e.g.:

+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+  if (id >= ps->g->num_nodes && ps_reg_move (ps, id)->double_stages)
+    return 2;
+  else
+    return 1;
+}

or

-  last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+  if (...double_stages) last_u = first_u + 1;
+  else last_u = first_u;

>>> However...
>>>
>>>> Also note that if moves are assigned absolute cycles, it becomes clear
>>>> to which stage each belongs (just like any other instruction),
>>>> regulating their generation in prolog and epilog.
>>>
>>> ...I have to concede that it makes the stage calculation easier,
>>> and that that tips the balance.  (Although again, a move can belong
>>> to two stages simultanouesly.)
>>>
>>> Anyway, here's an updated patch that uses cycle times.  I've also
>>> dropped the code that tried to allow windows to start and end on
>>> the same row (and therefore schedule "either side" of that row).
>>> I thought it might help with some VLIWy DFAs, but it was always
>>> going to be of minor benefit, and we don't try that elsewhere,
>>> so let's avoid the complication.
>>>
>>
>> ok. Such windows seem rare, as they imply zero move (or def->move)
>> latencies, iinm. Leaving a note behind would be good.
>
> OK.  Since this same restriction applies to the amin scheduling
> window code, I suppose the natural place would be in the algorithm
> description itself:
>
> /* The SMS scheduling algorithm itself
>   -----------------------------------
>   Input: 'O' an ordered list of insns of a loop.
>   Output: A scheduling of the loop - kernel, prolog, and epilogue.
>   ...
>   42. compute epilogue & prologue
>   43. finish - succeeded to schedule
> */
>
> E.g. adding something like this at the end:
>
>   ??? The algorithm restricts the scheduling window to II cycles.
>   In rare cases, it may be better to allow windows of II+1 cycles.
>   The window would then start and end on the same row, but with
>   different "must precede" and "must follow" requirements.
>
> Let me know what you think and I'll add it as a follow-on patch.
>

great, thanks.

>>> Bootstrapped & regression-tested on powerpc64-linux-gnu with
>>> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
>>> Also retested on the ARM benchmarks.  OK to install?
>>>
>>
>> Yes, OK.
>> Some points to consider raised above, and some comments added below.
>
> Thanks, applied with the changes below.
>
>>> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>>>   for (i = 0; i < num_nodes; i++)
>>>     {
>>>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
>>> -      int j;
>>>
>>>       fprintf (file, "Node = %d; INSN = %d\n", i,
>>>               INSN_UID (ps_rtl_insn (ps, i)));
>>>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>>>       fprintf (file, " time = %d:\n", nsp->time);
>>> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>>> -      for (j = 0; j < nsp->nreg_moves; j++)
>>> -       {
>>> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
>>
>> Is another way provided to print a summary of the regmoves? Or does
>> the info dumped while scheduling each one suffice for practical
>> tracing and debugging.
>
> Yeah, the new info should be more complete.  It gives the pattern
> (which is what we dump here), but also lists the producers and
> consumers of each move (which we don't print here).
>
>>> +      fprintf (file, " stage = %d:\n", nsp->stage);
>>> +    }
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
>>> +static void
>>> +set_columns_for_row (partial_schedule_ptr ps, int row)
>>> +{
>>> +  ps_insn_ptr cur_insn;
>>> +  int column;
>>> +
>>> +  column = 0;
>>> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
>>> +    SCHED_COLUMN (cur_insn->id) = column++;
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in PS.  */
>>> +static void
>>> +set_columns_for_ps (partial_schedule_ptr ps)
>>> +{
>>> +  int row;
>>> +
>>> +  for (row = 0; row < ps->ii; row++)
>>> +    set_columns_for_row (ps, row);
>>> +}
>>> +
>>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>>> +   Its predecessors and successors have already been scheduled.
>>
>> well, except for some (preceeding or succeeding?) moves that have not
>> yet been scheduled, right?
>
> Ah, yeah, fixed to:
>
> /* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>   Its single predecessor has already been scheduled, as has its
>   ddg node successors.  (The move may have also another move as its
>   successor, in which case that successor will be scheduled later.)
>

perfect

>>> +
>>> +   The move is part of a chain that satisfies register dependencies
>>> +   between a producing ddg node and various consuming ddg nodes.
>>> +   If some of these dependencies cross a loop iteration (that is,
>>> +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
>>> +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
>>> +   is null otherwise.
>>> +
>>
>> Maybe clarify that they are upwards-exposed or live-in uses.
>
> OK, changed to:
>
>   The move is part of a chain that satisfies register dependencies
>   between a producing ddg node and various consuming ddg nodes.
>   If some of these dependencies have a distance of 1 (meaning that
>   the use is upward-exposoed) then DISTANCE1_USES is nonnull and

exposed (typo)

>   contains the set of uses with distance-1 dependencies.
>   DISTANCE1_USES is null otherwise.
>
>>> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
>>> +   all current ps_insn ids.
>>> +
>>> +   Return true on success.  */
>>> +static bool
>>> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
>>> +                  sbitmap distance1_uses, sbitmap must_follow)
>>> +{
>>> +  unsigned int u;
>>> +  int this_time, this_distance, this_start, this_end, this_latency;
>>> +  int start, end, c, ii;
>>> +  sbitmap_iterator sbi;
>>> +  ps_reg_move_info *move;
>>> +  rtx this_insn;
>>> +  ps_insn_ptr psi;
>>> +
>>> +  move = ps_reg_move (ps, i_reg_move);
>>> +  ii = ps->ii;
>>> +  if (dump_file)
>>> +    {
>>> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
>>> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
>>> +              PS_MIN_CYCLE (ps));
>>> +      print_rtl_single (dump_file, move->insn);
>>> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
>>> +      fprintf (dump_file, "=========== =========== =====\n");
>>> +    }
>>> +
>>> +  start = INT_MIN;
>>> +  end = INT_MAX;
>>> +
>>> +  /* For dependencies of distance 1 between a producer ddg node A
>>> +     and consumer ddg node B, we have a chain of dependencies:
>>> +
>>> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>>> +
>>> +     where Mi is the ith move.  For dependencies of distance 0 between
>>> +     a producer ddg node A and consumer ddg node C, we have a chain of
>>> +     dependencies:
>>> +
>>> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>>> +
>>> +     where Mi' occupies the same position as Mi but occurs a stage later.
>>> +     We can only schedule each move once, so if we have both types of
>>> +     chain, we model the second as:
>>> +
>>> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>>> +
>>> +     First handle the dependencies between the previously-scheduled
>>> +     predecessor and the move.  */
>>> +  this_insn = ps_rtl_insn (ps, move->def);
>>> +  this_latency = insn_latency (this_insn, move->insn);
>>> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
>>> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
>>> +  this_start = this_time + this_latency;
>>> +  this_end = this_time + ii;
>>> +  if (dump_file)
>>> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> +            this_start, this_end, SCHED_TIME (move->def),
>>> +            INSN_UID (this_insn), this_latency, this_distance,
>>> +            INSN_UID (move->insn));
>>> +
>>> +  if (start < this_start)
>>
>> redundant; start=INT_MIN is surely < this_start.
>>
>>> +    start = this_start;
>>> +  if (end > this_end)
>>
>> redundant; end=INT_MAX is surely > this_end.
>
> I did this deliberately because the order in which we apply the limits
> isn't important.  Making them assymetrical by applying this kind of
> micro-optimisation is IMO a bad idea.  The compiler will do it for us.
>
> E.g. I originally had an extra limit to make sure that we didn't add
> new stages.  If we applied the optimisation by hand, and then someone
> added a new limit like that later, they'd have to be careful about
> where they put it.
>

ok, np.

>>> +    end = this_end;
>>> +
>>> +  /* Handle the dependencies between the move and previously-scheduled
>>> +     successors.  */
>>
>> (maybe assert they have indeed all been scheduled)
>
> Hmm, I don't think that'd be useful.  We've already read the schedule
> time (and row and column) by this point, and we don't assert the same
> thing elsewhere (in the code that assigns moves to uses, or in
> get_sched_window itself).
>

(ok; only a thought, hence the parentheses)

>>> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
>>> +    {
>>> +      this_insn = ps_rtl_insn (ps, u);
>>> +      this_latency = insn_latency (move->insn, this_insn);
>>> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))
>>
>> shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?
>>
>>> +       this_distance = -1;
>>> +      else
>>> +       this_distance = 0;
>
> No, this condition corresponds to:
>
>   /* For dependencies of distance 1 between a producer ddg node A
>      and consumer ddg node B, we have a chain of dependencies:
>
>         A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>
>      where Mi is the ith move.  For dependencies of distance 0 between
>      a producer ddg node A and consumer ddg node C, we have a chain of
>      dependencies:
>
>         A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>
>      where Mi' occupies the same position as Mi but occurs a stage later.
>      We can only schedule each move once, so if we have both types of
>      chain, we model the second as:
>
>         A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>
>      First handle the dependencies between the previously-scheduled
>      predecessor and the move.  */
>
> i.e. we want a distance of -1 when (a) the original definition has uses
> with both distance 0 and distance 1 and (b) the particular use we're
> looking at has distance 0.
>

ah, right

>>> +      this_time = SCHED_TIME (u) + this_distance * ii;
>>> +      this_start = this_time - ps->ii;
>>
>> use ii instead of ps->ii
>
> Oops, fixed.
>
>>> +      this_end = this_time - this_latency;
>>> +      if (dump_file)
>>> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> +                this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
>>> +                this_latency, this_distance, INSN_UID (this_insn));
>>> +
>>> +      if (start < this_start)
>>> +       start = this_start;
>>> +      if (end > this_end)
>>> +       end = this_end;
>>> +    }
>>> +
>>> +  if (dump_file)
>>> +    {
>>> +      fprintf (dump_file, "----------- ----------- -----\n");
>>> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
>>> +    }
>>>
>>> -         fprintf (file, " reg_move = ");
>>> -         print_rtl_single (file, move->insn);
>>> +  sbitmap_zero (must_follow);
>>> +  SET_BIT (must_follow, move->def);
>>> +
>>> +  start = MAX (start, end - (ii - 1));
>>
>> ok, so this excludes considering both ends of a possible ii-cycled
>> window. Such a window would imply zero move (or def->move) latencies
>> iinm, and more care in setting must_follow/must_precede.
>
> Right.
>
>>> +  for (c = end; c >= start; c--)
>>> +    {
>>> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
>>> +                                        move->uses, must_follow);
>>
>> ok - def must_follow the move, if both are scheduled on same row (as
>> we potentially start from this (end) row moving backwards), but uses
>> should precede the move, assuming that if move and use are on same row
>> the sched distance between them is ii (i.e., non-zero move->use
>> latency)
>
> Right.
>
>>> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>>>              continue;
>>>            }
>>>
>>> +         /* Moves that handle incoming values might have been added
>>> +            to a new first stage.  It's too late to try any kind of
>>> +            rotation, so just bump the stage count.  */
>>
>> hmm, one could still apply the rotation optimization at this time if
>> desired, no?
>
> Hmm, maybe :-)  I changed the comment to:
>
>          /* Moves that handle incoming values might have been added
>             to a new first stage.  Bump the stage count if so.
>
>             ??? Perhaps we could consider rotating the schedule here
>             instead?  */
>

Great.
Thanks,
Ayal.


> Richard
>
>
> gcc/
>        * modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
>        (SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
>        (node_sched_params): Remove first_reg_move and nreg_moves.
>        (ps_num_consecutive_stages, extend_node_sched_params): New functions.
>        (update_node_sched_params): Move up file.
>        (print_node_sched_params): Print the stage.  Don't dump info related
>        to first_reg_move and nreg_moves.
>        (set_columns_for_row): New function.
>        (set_columns_for_ps): Move up file and use set_columns_for_row.
>        (schedule_reg_move): New function.
>        (schedule_reg_moves): Call extend_node_sched_params and
>        schedule_reg_move.  Extend size of uses bitmap.  Initialize
>        num_consecutive_stages.  Return false if a move could not be
>        scheduled.
>        (apply_reg_moves): Don't emit moves here.
>        (permute_partial_schedule): Handle register moves.
>        (duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
>        to the same stage-count test as ddg nodes.
>        (generate_prolog_epilog): Update calls accordingly.
>        (sms_schedule): Allow move-scheduling to add a new first stage.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c  2011-10-10 11:03:23.278273801 +0100
> +++ gcc/modulo-sched.c  2011-10-10 12:24:44.539932692 +0100
> @@ -153,6 +153,9 @@ struct ps_reg_move_info
>   rtx old_reg;
>   rtx new_reg;
>
> +  /* The number of consecutive stages that the move occupies.  */
> +  int num_consecutive_stages;
> +
>   /* An instruction that sets NEW_REG to the correct value.  The first
>      move associated with DEF will have an rhs of OLD_REG; later moves
>      use the result of the previous move.  */
> @@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
>  static void permute_partial_schedule (partial_schedule_ptr, rtx);
>  static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
>                                     rtx, rtx);
> -static void duplicate_insns_of_cycles (partial_schedule_ptr,
> -                                      int, int, int, rtx);
>  static int calculate_stage_count (partial_schedule_ptr, int);
>  static void calculate_must_precede_follow (ddg_node_ptr, int, int,
>                                           int, int, sbitmap, sbitmap, sbitmap);
> @@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
>
>  #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, x)
>  #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
> -#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
> -#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
>  #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
>  #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
>  #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
> @@ -244,15 +243,6 @@ typedef struct node_sched_params
>  {
>   int time;    /* The absolute scheduling cycle.  */
>
> -  /* The following field (first_reg_move) is the ps_insn id of the first
> -     register-move instruction added to handle the modulo-variable-expansion
> -     of the register defined by this node.  This register-move copies the
> -     original register defined by the node.  */
> -  int first_reg_move;
> -
> -  /* The number of register-move instructions added.  */
> -  int nreg_moves;
> -
>   int row;    /* Holds time % ii.  */
>   int stage;  /* Holds time / ii.  */
>
> @@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps,
>   return ps->g->nodes[id].first_note;
>  }
>
> +/* Return the number of consecutive stages that are occupied by
> +   partial schedule instruction ID in PS.  */
> +static int
> +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
> +{
> +  if (id < ps->g->num_nodes)
> +    return 1;
> +  else
> +    return ps_reg_move (ps, id)->num_consecutive_stages;
> +}
> +
>  /* Given HEAD and TAIL which are the first and last insns in a loop;
>    return the register which controls the loop.  Return zero if it has
>    more than one occurrence in the loop besides the control part or the
> @@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
>                         node_sched_param_vec, g->num_nodes);
>  }
>
> +/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
> +static void
> +extend_node_sched_params (partial_schedule_ptr ps)
> +{
> +  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
> +                        ps->g->num_nodes + VEC_length (ps_reg_move_info,
> +                                                       ps->reg_moves));
> +}
> +
> +/* Update the sched_params (time, row and stage) for node U using the II,
> +   the CYCLE of U and MIN_CYCLE.
> +   We're not simply taking the following
> +   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> +   because the stages may not be aligned on cycle 0.  */
> +static void
> +update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> +{
> +  int sc_until_cycle_zero;
> +  int stage;
> +
> +  SCHED_TIME (u) = cycle;
> +  SCHED_ROW (u) = SMODULO (cycle, ii);
> +
> +  /* The calculation of stage count is done adding the number
> +     of stages before cycle zero and after cycle zero.  */
> +  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> +
> +  if (SCHED_TIME (u) < 0)
> +    {
> +      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> +    }
> +  else
> +    {
> +      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> +    }
> +}
> +
>  static void
>  print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
>  {
> @@ -466,21 +506,169 @@ print_node_sched_params (FILE *file, int
>   for (i = 0; i < num_nodes; i++)
>     {
>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
> -      int j;
>
>       fprintf (file, "Node = %d; INSN = %d\n", i,
>               INSN_UID (ps_rtl_insn (ps, i)));
>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>       fprintf (file, " time = %d:\n", nsp->time);
> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> -      for (j = 0; j < nsp->nreg_moves; j++)
> -       {
> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
> +      fprintf (file, " stage = %d:\n", nsp->stage);
> +    }
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
> +static void
> +set_columns_for_row (partial_schedule_ptr ps, int row)
> +{
> +  ps_insn_ptr cur_insn;
> +  int column;
> +
> +  column = 0;
> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
> +    SCHED_COLUMN (cur_insn->id) = column++;
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in PS.  */
> +static void
> +set_columns_for_ps (partial_schedule_ptr ps)
> +{
> +  int row;
> +
> +  for (row = 0; row < ps->ii; row++)
> +    set_columns_for_row (ps, row);
> +}
> +
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> +   Its single predecessor has already been scheduled, as has its
> +   ddg node successors.  (The move may have also another move as its
> +   successor, in which case that successor will be scheduled later.)
>
> -         fprintf (file, " reg_move = ");
> -         print_rtl_single (file, move->insn);
> +   The move is part of a chain that satisfies register dependencies
> +   between a producing ddg node and various consuming ddg nodes.
> +   If some of these dependencies have a distance of 1 (meaning that
> +   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
> +   contains the set of uses with distance-1 dependencies.
> +   DISTANCE1_USES is null otherwise.
> +
> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
> +   all current ps_insn ids.
> +
> +   Return true on success.  */
> +static bool
> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
> +                  sbitmap distance1_uses, sbitmap must_follow)
> +{
> +  unsigned int u;
> +  int this_time, this_distance, this_start, this_end, this_latency;
> +  int start, end, c, ii;
> +  sbitmap_iterator sbi;
> +  ps_reg_move_info *move;
> +  rtx this_insn;
> +  ps_insn_ptr psi;
> +
> +  move = ps_reg_move (ps, i_reg_move);
> +  ii = ps->ii;
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
> +              PS_MIN_CYCLE (ps));
> +      print_rtl_single (dump_file, move->insn);
> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
> +      fprintf (dump_file, "=========== =========== =====\n");
> +    }
> +
> +  start = INT_MIN;
> +  end = INT_MAX;
> +
> +  /* For dependencies of distance 1 between a producer ddg node A
> +     and consumer ddg node B, we have a chain of dependencies:
> +
> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
> +
> +     where Mi is the ith move.  For dependencies of distance 0 between
> +     a producer ddg node A and consumer ddg node C, we have a chain of
> +     dependencies:
> +
> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
> +
> +     where Mi' occupies the same position as Mi but occurs a stage later.
> +     We can only schedule each move once, so if we have both types of
> +     chain, we model the second as:
> +
> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
> +
> +     First handle the dependencies between the previously-scheduled
> +     predecessor and the move.  */
> +  this_insn = ps_rtl_insn (ps, move->def);
> +  this_latency = insn_latency (this_insn, move->insn);
> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
> +  this_start = this_time + this_latency;
> +  this_end = this_time + ii;
> +  if (dump_file)
> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +            this_start, this_end, SCHED_TIME (move->def),
> +            INSN_UID (this_insn), this_latency, this_distance,
> +            INSN_UID (move->insn));
> +
> +  if (start < this_start)
> +    start = this_start;
> +  if (end > this_end)
> +    end = this_end;
> +
> +  /* Handle the dependencies between the move and previously-scheduled
> +     successors.  */
> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
> +    {
> +      this_insn = ps_rtl_insn (ps, u);
> +      this_latency = insn_latency (move->insn, this_insn);
> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))
> +       this_distance = -1;
> +      else
> +       this_distance = 0;
> +      this_time = SCHED_TIME (u) + this_distance * ii;
> +      this_start = this_time - ii;
> +      this_end = this_time - this_latency;
> +      if (dump_file)
> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +                this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
> +                this_latency, this_distance, INSN_UID (this_insn));
> +
> +      if (start < this_start)
> +       start = this_start;
> +      if (end > this_end)
> +       end = this_end;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "----------- ----------- -----\n");
> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, min)");
> +    }
> +
> +  sbitmap_zero (must_follow);
> +  SET_BIT (must_follow, move->def);
> +
> +  start = MAX (start, end - (ii - 1));
> +  for (c = end; c >= start; c--)
> +    {
> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
> +                                        move->uses, must_follow);
> +      if (psi)
> +       {
> +         update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
> +         if (dump_file)
> +           fprintf (dump_file, "\nScheduled register move INSN %d at"
> +                    " time %d, row %d\n\n", INSN_UID (move->insn), c,
> +                    SCHED_ROW (i_reg_move));
> +         return true;
>        }
>     }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "\nNo available slot\n\n");
> +
> +  return false;
>  }
>
>  /*
> @@ -508,6 +696,9 @@ schedule_reg_moves (partial_schedule_ptr
>       int nreg_moves = 0, i_reg_move;
>       rtx prev_reg, old_reg;
>       int first_move;
> +      int distances[2];
> +      sbitmap must_follow;
> +      sbitmap distance1_uses;
>       rtx set = single_set (u->insn);
>
>       /* Skip instructions that do not set a register.  */
> @@ -516,6 +707,7 @@ schedule_reg_moves (partial_schedule_ptr
>
>       /* Compute the number of reg_moves needed for u, by looking at life
>         ranges started at u (excluding self-loops).  */
> +      distances[0] = distances[1] = false;
>       for (e = u->out; e; e = e->next_out)
>        if (e->type == TRUE_DEP && e->dest != e->src)
>          {
> @@ -546,6 +738,11 @@ schedule_reg_moves (partial_schedule_ptr
>                gcc_assert (!autoinc_var_is_used_p (u->insn, e->dest->insn));
>              }
>
> +           if (nreg_moves4e)
> +             {
> +               gcc_assert (e->distance < 2);
> +               distances[e->distance] = true;
> +             }
>            nreg_moves = MAX (nreg_moves, nreg_moves4e);
>          }
>
> @@ -556,11 +753,10 @@ schedule_reg_moves (partial_schedule_ptr
>       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
>       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
>                             first_move + nreg_moves);
> +      extend_node_sched_params (ps);
>
>       /* Record the moves associated with this node.  */
>       first_move += ps->g->num_nodes;
> -      SCHED_FIRST_REG_MOVE (i) = first_move;
> -      SCHED_NREG_MOVES (i) = nreg_moves;
>
>       /* Generate each move.  */
>       old_reg = prev_reg = SET_DEST (single_set (u->insn));
> @@ -569,15 +765,18 @@ schedule_reg_moves (partial_schedule_ptr
>          ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
>
>          move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
> -         move->uses = sbitmap_alloc (g->num_nodes);
> +         move->uses = sbitmap_alloc (first_move + nreg_moves);
>          move->old_reg = old_reg;
>          move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
> +         move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
>          move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
>          sbitmap_zero (move->uses);
>
>          prev_reg = move->new_reg;
>        }
>
> +      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
> +
>       /* Every use of the register defined by node may require a different
>         copy of this register, depending on the time the use is scheduled.
>         Record which uses require which move results.  */
> @@ -601,8 +800,21 @@ schedule_reg_moves (partial_schedule_ptr
>
>                move = ps_reg_move (ps, first_move + dest_copy - 1);
>                SET_BIT (move->uses, e->dest->cuid);
> +               if (e->distance == 1)
> +                 SET_BIT (distance1_uses, e->dest->cuid);
>              }
>          }
> +
> +      must_follow = sbitmap_alloc (first_move + nreg_moves);
> +      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
> +       if (!schedule_reg_move (ps, first_move + i_reg_move,
> +                               distance1_uses, must_follow))
> +         break;
> +      sbitmap_free (must_follow);
> +      if (distance1_uses)
> +       sbitmap_free (distance1_uses);
> +      if (i_reg_move < nreg_moves)
> +       return false;
>     }
>   return true;
>  }
> @@ -626,39 +838,6 @@ apply_reg_moves (partial_schedule_ptr ps
>          df_insn_rescan (ps->g->nodes[i_use].insn);
>        }
>     }
> -
> -  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
> -    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
> -}
> -
> -/* Update the sched_params (time, row and stage) for node U using the II,
> -   the CYCLE of U and MIN_CYCLE.
> -   We're not simply taking the following
> -   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> -   because the stages may not be aligned on cycle 0.  */
> -static void
> -update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> -{
> -  int sc_until_cycle_zero;
> -  int stage;
> -
> -  SCHED_TIME (u) = cycle;
> -  SCHED_ROW (u) = SMODULO (cycle, ii);
> -
> -  /* The calculation of stage count is done adding the number
> -     of stages before cycle zero and after cycle zero.  */
> -  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> -
> -  if (SCHED_TIME (u) < 0)
> -    {
> -      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> -    }
> -  else
> -    {
> -      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> -    }
>  }
>
>  /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
> @@ -699,22 +878,6 @@ reset_sched_times (partial_schedule_ptr
>       }
>  }
>
> -/* Set SCHED_COLUMN of each node according to its position in PS.  */
> -static void
> -set_columns_for_ps (partial_schedule_ptr ps)
> -{
> -  int row;
> -
> -  for (row = 0; row < ps->ii; row++)
> -    {
> -      ps_insn_ptr cur_insn = ps->rows[row];
> -      int column = 0;
> -
> -      for (; cur_insn; cur_insn = cur_insn->next_in_row)
> -       SCHED_COLUMN (cur_insn->id) = column++;
> -    }
> -}
> -
>  /* Permute the insns according to their order in PS, from row 0 to
>    row ii-1, and position them right before LAST.  This schedules
>    the insns of the loop kernel.  */
> @@ -731,8 +894,13 @@ permute_partial_schedule (partial_schedu
>        rtx insn = ps_rtl_insn (ps, ps_ij->id);
>
>        if (PREV_INSN (last) != insn)
> -         reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> -                             PREV_INSN (last));
> +         {
> +           if (ps_ij->id < ps->g->num_nodes)
> +             reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> +                                 PREV_INSN (last));
> +           else
> +             add_insn_before (insn, last, NULL);
> +         }
>       }
>  }
>
> @@ -935,7 +1103,7 @@ optimize_sc (partial_schedule_ptr ps, dd
>
>  static void
>  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> -                          int to_stage, int for_prolog, rtx count_reg)
> +                          int to_stage, rtx count_reg)
>  {
>   int row;
>   ps_insn_ptr ps_ij;
> @@ -944,7 +1112,7 @@ duplicate_insns_of_cycles (partial_sched
>     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
>       {
>        int u = ps_ij->id;
> -       int j, i_reg_moves, i_reg_move;
> +       int first_u, last_u;
>        rtx u_insn;
>
>         /* Do not duplicate any insn which refers to count_reg as it
> @@ -958,42 +1126,15 @@ duplicate_insns_of_cycles (partial_sched
>             || JUMP_P (u_insn))
>           continue;
>
> -       if (for_prolog)
> +       first_u = SCHED_STAGE (u);
> +       last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
> +       if (from_stage <= last_u && to_stage >= first_u)
>          {
> -           /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
> -              number of reg_moves starting with the second occurrence of
> -              u, which is generated if its SCHED_STAGE <= to_stage.  */
> -           i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *first* reg_move backwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
> -         }
> -       else /* It's for the epilog.  */
> -         {
> -           /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
> -              starting to decrease one stage after u no longer occurs;
> -              that is, generate all reg_moves until
> -              SCHED_STAGE (u) == from_stage - 1.  */
> -           i_reg_moves = (SCHED_NREG_MOVES (u)
> -                          - (from_stage - SCHED_STAGE (u) - 1));
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *last* reg_move forwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
> +           if (u < ps->g->num_nodes)
> +             duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> +           else
> +             emit_insn (copy_rtx (PATTERN (u_insn)));
>          }
> -
> -       for (j = 0; j < i_reg_moves; j++)
> -         {
> -           ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
> -
> -           emit_insn (copy_rtx (PATTERN (move->insn)));
> -         }
> -       if (SCHED_STAGE (u) >= from_stage
> -           && SCHED_STAGE (u) <= to_stage)
> -         duplicate_insn_chain (ps_first_note (ps, u), u_insn);
>       }
>  }
>
> @@ -1027,7 +1168,7 @@ generate_prolog_epilog (partial_schedule
>     }
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
> +    duplicate_insns_of_cycles (ps, 0, i, count_reg);
>
>   /* Put the prolog on the entry edge.  */
>   e = loop_preheader_edge (loop);
> @@ -1039,7 +1180,7 @@ generate_prolog_epilog (partial_schedule
>   start_sequence ();
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
> +    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
>
>   /* Put the epilogue on the exit edge.  */
>   gcc_assert (single_exit (loop));
> @@ -1375,8 +1516,7 @@ sms_schedule (void)
>     {
>       rtx head, tail;
>       rtx count_reg, count_init;
> -      int mii, rec_mii;
> -      unsigned stage_count;
> +      int mii, rec_mii, stage_count, min_cycle;
>       HOST_WIDEST_INT loop_count = 0;
>       bool opt_sc_p;
>
> @@ -1486,13 +1626,12 @@ sms_schedule (void)
>                }
>
>              gcc_assert (stage_count >= 1);
> -             PS_STAGE_COUNT (ps) = stage_count;
>            }
>
>          /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
>             1 means that there is no interleaving between iterations thus
>             we let the scheduling passes do the job in this case.  */
> -         if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
> +         if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
>              || (count_init && (loop_count <= stage_count))
>              || (flag_branch_probabilities && (trip_count <= stage_count)))
>            {
> @@ -1520,6 +1659,7 @@ sms_schedule (void)
>
>          set_columns_for_ps (ps);
>
> +         min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
>          if (!schedule_reg_moves (ps))
>            {
>              mii = ps->ii + 1;
> @@ -1527,6 +1667,21 @@ sms_schedule (void)
>              continue;
>            }
>
> +         /* Moves that handle incoming values might have been added
> +            to a new first stage.  Bump the stage count if so.
> +
> +            ??? Perhaps we could consider rotating the schedule here
> +            instead?  */
> +         if (PS_MIN_CYCLE (ps) < min_cycle)
> +           {
> +             reset_sched_times (ps, 0);
> +             stage_count++;
> +           }
> +
> +         /* The stage count should now be correct without rotation.  */
> +         gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
> +         PS_STAGE_COUNT (ps) = stage_count;
> +
>          canon_loop (loop);
>
>           if (dump_file)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-10-10 15:52             ` Ayal Zaks
@ 2011-10-11  8:49               ` Richard Sandiford
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Sandiford @ 2011-10-11  8:49 UTC (permalink / raw)
  To: Ayal Zaks; +Cc: gcc-patches

Ayal Zaks <ayal.zaks@gmail.com> writes:
>>> The issue of assigning stages to reg-moves is mostly relevant for
>>> prolog and epilog generation, which requires and receives special
>>> attention -- handled very nicely by ps_num_consecutive_stages! Note
>>> that currently a simple boolean indicator for (the exceptional case
>>> of) double stages would suffice, instead of generalizing to arbitrary
>>> nums of consecutive stages (see any potential use for them?).
>>
>> Not in the immediate term.  But I think having a boolean indicator
>> would be inconsistent.  If the distance field is an int (even though
>> we only expect distance-0 and distance-1 register dependencies)
>> then I think the number of stages should be too.
>>
>> I did wonder originally about using a boolean, but IMO, it makes
>> the code less readable rather than more.  Instead of a simple
>> range check like:
>>
>>     if (first_stage_for_insn <= last_stage_in_range
>>         && last_stage_for_insn >= first_stage_in_range)
>>
>> we end up with the equivalent of:
>>
>>     if (first_stage_for_insn <= last_stage_in_range
>>         && (double_stage_move_p (...)
>>             ? first_stage_for_insn + 1 >= first_stage_in_range
>>             : first_stage_for_insn >= first_stage_in_range))
>>
>> with no corresponding simplification elsewhere.
>>
>
> Sure. But setting the range can be done by consulting an simple
> indicator, rather than generalizing to arbitrary stage numbers; e.g.:
>
> +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
> +{
> +  if (id >= ps->g->num_nodes && ps_reg_move (ps, id)->double_stages)
> +    return 2;
> +  else
> +    return 1;
> +}
>
> or
>
> -  last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
> +  if (...double_stages) last_u = first_u + 1;
> +  else last_u = first_u;

Understood.  I still prefer the posted version though.

>> E.g. adding something like this at the end:
>>
>>   ??? The algorithm restricts the scheduling window to II cycles.
>>   In rare cases, it may be better to allow windows of II+1 cycles.
>>   The window would then start and end on the same row, but with
>>   different "must precede" and "must follow" requirements.
>>
>> Let me know what you think and I'll add it as a follow-on patch.
>>
>
> great, thanks.

OK, added with the patch below.

>>>> +
>>>> +   The move is part of a chain that satisfies register dependencies
>>>> +   between a producing ddg node and various consuming ddg nodes.
>>>> +   If some of these dependencies cross a loop iteration (that is,
>>>> +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
>>>> +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
>>>> +   is null otherwise.
>>>> +
>>>
>>> Maybe clarify that they are upwards-exposed or live-in uses.
>>
>> OK, changed to:
>>
>>   The move is part of a chain that satisfies register dependencies
>>   between a producing ddg node and various consuming ddg nodes.
>>   If some of these dependencies have a distance of 1 (meaning that
>>   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
>
> exposed (typo)

Oops, also fixed below (and applied).

Richard


gcc/
	* modulo-sched.c: Fix comment typo.  Mention the possibility
	of using scheduling windows of II+1 cycles.

Index: gcc/modulo-sched.c
===================================================================
--- gcc/modulo-sched.c	2011-10-10 12:42:41.000000000 +0100
+++ gcc/modulo-sched.c	2011-10-11 09:07:08.069166743 +0100
@@ -545,7 +545,7 @@ set_columns_for_ps (partial_schedule_ptr
    The move is part of a chain that satisfies register dependencies
    between a producing ddg node and various consuming ddg nodes.
    If some of these dependencies have a distance of 1 (meaning that
-   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
+   the use is upward-exposed) then DISTANCE1_USES is nonnull and
    contains the set of uses with distance-1 dependencies.
    DISTANCE1_USES is null otherwise.
 
@@ -1810,7 +1810,11 @@ sms_schedule (void)
    41. endif
    42. compute epilogue & prologue
    43. finish - succeeded to schedule
-*/
+
+   ??? The algorithm restricts the scheduling window to II cycles.
+   In rare cases, it may be better to allow windows of II+1 cycles.
+   The window would then start and end on the same row, but with
+   different "must precede" and "must follow" requirements.  */
 
 /* A limit on the number of cycles that resource conflicts can span.  ??? Should
    be provided by DFA, and be dependent on the type of insn scheduled.  Currently

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-08-30 13:50       ` Richard Guenther
@ 2011-08-30 13:57         ` Richard Sandiford
  0 siblings, 0 replies; 13+ messages in thread
From: Richard Sandiford @ 2011-08-30 13:57 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches, zaks

Richard Guenther <richard.guenther@gmail.com> writes:
> On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> Richard Guenther <richard.guenther@gmail.com> writes:
>>> On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
>>> <richard.sandiford@linaro.org> wrote:
>>>> This is the move-scheduling patch itself.  It should be fairly
>>>> self-explanatory.  Let me know if it isn't, and I'll try to improve
>>>> the commentary.
>>>
>>> Can you add some testcases?
>>
>> I don't think it's an easy thing to test for.  E.g. with the resampling
>> loop in the covering message, there might well be a reasonable ii=27
>> schedule that doesn't need as many moves.
>
> So, does it only improve code generation or does it expose more
> loops as SMS candidates?  If the former, ok ...

Yeah, it just improves code generation.  The code I'm adding only kicks
in once we've found a "successful" SMS schedule.  At the moment, once
we've found that kind of schedule, we always emit it, regardless of how
many moves it needs.

The new code instead allows the schedule to be rejected.  So from that
point of view, it reduces the number of SMS candidates (but hopefully
only in cases where we would otherwise generate worse code).

Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-08-30 13:43     ` Richard Sandiford
@ 2011-08-30 13:50       ` Richard Guenther
  2011-08-30 13:57         ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Guenther @ 2011-08-30 13:50 UTC (permalink / raw)
  To: Richard Guenther, gcc-patches, zaks, richard.sandiford

On Tue, Aug 30, 2011 at 2:43 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Richard Guenther <richard.guenther@gmail.com> writes:
>> On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> This is the move-scheduling patch itself.  It should be fairly
>>> self-explanatory.  Let me know if it isn't, and I'll try to improve
>>> the commentary.
>>
>> Can you add some testcases?
>
> I don't think it's an easy thing to test for.  E.g. with the resampling
> loop in the covering message, there might well be a reasonable ii=27
> schedule that doesn't need as many moves.

So, does it only improve code generation or does it expose more
loops as SMS candidates?  If the former, ok ...

Richard.

> Richard
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-08-30 13:22   ` Richard Guenther
@ 2011-08-30 13:43     ` Richard Sandiford
  2011-08-30 13:50       ` Richard Guenther
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2011-08-30 13:43 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc-patches, zaks

Richard Guenther <richard.guenther@gmail.com> writes:
> On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
> <richard.sandiford@linaro.org> wrote:
>> This is the move-scheduling patch itself.  It should be fairly
>> self-explanatory.  Let me know if it isn't, and I'll try to improve
>> the commentary.
>
> Can you add some testcases?

I don't think it's an easy thing to test for.  E.g. with the resampling
loop in the covering message, there might well be a reasonable ii=27
schedule that doesn't need as many moves.

Richard

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [4/4] Make SMS schedule register moves
  2011-08-30 13:17 ` [4/4] " Richard Sandiford
@ 2011-08-30 13:22   ` Richard Guenther
  2011-08-30 13:43     ` Richard Sandiford
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Guenther @ 2011-08-30 13:22 UTC (permalink / raw)
  To: gcc-patches, zaks, richard.sandiford

On Tue, Aug 30, 2011 at 2:29 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> This is the move-scheduling patch itself.  It should be fairly
> self-explanatory.  Let me know if it isn't, and I'll try to improve
> the commentary.

Can you add some testcases?

Thanks,
Richard.

> One potentially controversial change is to the way we handle moves
> in the prologue and epilogue.  The current code uses a conservative
> check to decide which stages need which moves.  This check copes
> with values that are live before the loop, and emits some moves
> that aren't actually needed.  The code also emits chains of moves
> that can be trivially optimised through propagation.  We rely on
> later patches to clean this up for us (and they do).
>
> So, rather than keep a rather complicated check here, I've simply
> emitted the moves for all stages.  In principle, that will generate
> a little more dead code than now in cases where chains of two moves
> are needed, but that seems to be very rare (when moves are scheduled).
>
> Richard
>
>
> gcc/
>        * modulo-sched.c (extend_node_sched_params): New function.
>        (print_node_sched_params): Print the stage.
>        (set_columns_for_row, schedule_reg_move): New functions.
>        (set_columns_for_ps): Move up file and use set_columns_for_now.
>        (schedule_reg_moves): Call extend_node_sched_params and
>        schedule_reg_move.  Extend size of uses bitmap.  Return false
>        if a move could not be scheduled.
>        (apply_reg_moves): Don't emit moves here.
>        (permute_partial_schedule): Handle register moves.
>        (duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
>        (generate_prolog_epilog): Update calls accordingly.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c  2011-08-30 13:06:36.528669762 +0100
> +++ gcc/modulo-sched.c  2011-08-30 13:22:08.029124537 +0100
> @@ -220,8 +220,6 @@ static partial_schedule_ptr sms_schedule
>  static void permute_partial_schedule (partial_schedule_ptr, rtx);
>  static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
>                                     rtx, rtx);
> -static void duplicate_insns_of_cycles (partial_schedule_ptr,
> -                                      int, int, int, rtx);
>  static int calculate_stage_count (partial_schedule_ptr, int);
>  static void calculate_must_precede_follow (ddg_node_ptr, int, int,
>                                           int, int, sbitmap, sbitmap, sbitmap);
> @@ -458,6 +456,15 @@ set_node_sched_params (ddg_ptr g)
>                         node_sched_param_vec, g->num_nodes);
>  }
>
> +/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
> +static void
> +extend_node_sched_params (partial_schedule_ptr ps)
> +{
> +  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
> +                        ps->g->num_nodes + VEC_length (ps_reg_move_info,
> +                                                       ps->reg_moves));
> +}
> +
>  static void
>  print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
>  {
> @@ -474,6 +481,7 @@ print_node_sched_params (FILE *file, int
>               INSN_UID (ps_rtl_insn (ps, i)));
>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>       fprintf (file, " time = %d:\n", nsp->time);
> +      fprintf (file, " stage = %d:\n", nsp->stage);
>       fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>       for (j = 0; j < nsp->nreg_moves; j++)
>        {
> @@ -485,6 +493,168 @@ print_node_sched_params (FILE *file, int
>     }
>  }
>
> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
> +static void
> +set_columns_for_row (partial_schedule_ptr ps, int row)
> +{
> +  ps_insn_ptr cur_insn;
> +  int column;
> +
> +  column = 0;
> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
> +    SCHED_COLUMN (cur_insn->id) = column++;
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in PS.  */
> +static void
> +set_columns_for_ps (partial_schedule_ptr ps)
> +{
> +  int row;
> +
> +  for (row = 0; row < ps->ii; row++)
> +    set_columns_for_row (ps, row);
> +}
> +
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> +   The source of the move is provided by I_MUST_FOLLOW, which has
> +   already been scheduled.  MUST_FOLLOW is a scratch bitmap that
> +   is big enough to hold I_MUST_FOLLOW.
> +
> +   Return true on success.  */
> +static bool
> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
> +                  sbitmap must_follow, int i_must_follow)
> +{
> +  unsigned int u;
> +  int i, start_row, end_row, this_start, this_end, this_row, latency;
> +  int origin_row, origin_column;
> +  sbitmap_iterator sbi;
> +  ps_reg_move_info *move;
> +
> +  move = ps_reg_move (ps, i_reg_move);
> +
> +  /* The cyclic lifetime of move->new_reg starts and ends at move->def
> +     (the instruction that defines move->old_reg).  We need to decide
> +     where in that cyclic lifetime the move should go.  The position
> +     is limited by I_MUST_FOLLOW (which defines the source of the move)
> +     and the nodes in move->uses (which consume the result).
> +
> +     Specifically, the lowest possible row (relative to move->def)
> +     is the maximum of:
> +
> +       a) the row in which the result of the previous iteration's
> +         I_MUST_FOLLOW becomes available
> +       b) the first row in which every use of the previous iteration's
> +         move->new_reg has completed.
> +
> +     The highest possible row (again relative to move->def) is
> +     the minimum of:
> +
> +       c) the row in which the current iteration's I_MUST_FOLLOW
> +         executes (and thus makes the source of the move unavailable)
> +       d) the last row in which every use of this iteration's
> +         move->new_reg could be satisfied without delay.
> +
> +     Because all positions are relative to move->def, this function
> +     uses ROW - ps->ii to represent positions come after move->def.
> +     The range includes origin_row twice: "origin_row - ps->ii" is for
> +     the positions in origin_row after move->def, while "origin_row"
> +     itself is for the positions before move->def.  */
> +  origin_row = SCHED_ROW (move->def);
> +  origin_column = SCHED_COLUMN (move->def);
> +  start_row = origin_row - ps->ii;
> +  end_row = origin_row;
> +
> +  if (dump_file)
> +    fprintf (dump_file, "Scheduling register move INSN %d with cyclic lifetime"
> +            " [%d, %d]\n", INSN_UID (move->insn), start_row, end_row);
> +
> +  /* Limit the range to [a, c].  */
> +  this_end = SCHED_ROW (i_must_follow);
> +  if (this_end > origin_row
> +      || (this_end == origin_row
> +         && SCHED_COLUMN (i_must_follow) > origin_column))
> +    this_end -= ps->ii;
> +  latency = insn_latency (ps_rtl_insn (ps, i_must_follow), move->insn);
> +  this_start = this_end - ps->ii + latency;
> +
> +  if (dump_file)
> +    fprintf (dump_file, "  -- source produced by INSN %d with latency %d,"
> +            " range [%d, %d]\n", INSN_UID (ps_rtl_insn (ps, i_must_follow)),
> +            latency, this_start, this_end);
> +
> +  if (start_row < this_start)
> +    start_row = this_start;
> +  if (end_row > this_end)
> +    end_row = this_end;
> +
> +  /* Limit the range to [b, d].  */
> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
> +    {
> +      latency = insn_latency (move->insn, ps_rtl_insn (ps, u));
> +      this_start = SCHED_ROW (u);
> +      if (this_start > origin_row
> +         || (this_start == origin_row
> +             && SCHED_COLUMN (u) > origin_column))
> +       this_start -= ps->ii;
> +      this_end = this_start + ps->ii - latency;
> +
> +      if (dump_file)
> +       fprintf (dump_file, "  -- destination used by INSN %d with latency"
> +                " %d, range [%d, %d]\n", INSN_UID (ps_rtl_insn (ps, u)),
> +                latency, this_start, this_end);
> +
> +      if (start_row < this_start)
> +       start_row = this_start;
> +      if (end_row > this_end)
> +       end_row = this_end;
> +    }
> +
> +  sbitmap_zero (must_follow);
> +  SET_BIT (must_follow, i_must_follow);
> +
> +  if (dump_file)
> +    fprintf (dump_file, "  -- trying to schedule in rows [%d, %d]\n",
> +            start_row, end_row);
> +
> +  for (i = end_row; i >= start_row; i--)
> +    {
> +      ps_insn_ptr psi;
> +
> +      this_row = i < 0 ? i + ps->ii : i;
> +      /* origin_row - ps->ii represents the part of row origin_row after
> +        move->def.  In this case the move must come after both move->uses
> +        and move->def.  */
> +      if (i == origin_row - ps->ii)
> +       {
> +         gcc_checking_assert (i == start_row);
> +         gcc_checking_assert (!TEST_BIT (move->uses, move->def));
> +         SET_BIT (move->uses, move->def);
> +         psi = ps_add_node_check_conflicts (ps, i_reg_move, this_row,
> +                                            move->uses, NULL);
> +         RESET_BIT (move->uses, move->def);
> +       }
> +      else
> +       psi = ps_add_node_check_conflicts (ps, i_reg_move, this_row,
> +                                          move->uses, must_follow);
> +
> +      if (psi)
> +       {
> +         SCHED_ROW (i_reg_move) = this_row;
> +         if (dump_file)
> +           fprintf (dump_file, "  -- scheduled in row %d (%d)\n",
> +                    i, this_row);
> +         set_columns_for_row (ps, this_row);
> +         return true;
> +       }
> +    }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "  -- no available slot\n");
> +
> +  return false;
> +}
> +
>  /*
>    Breaking intra-loop register anti-dependences:
>    Each intra-loop register anti-dependence implies a cross-iteration true
> @@ -507,9 +677,10 @@ schedule_reg_moves (partial_schedule_ptr
>     {
>       ddg_node_ptr u = &g->nodes[i];
>       ddg_edge_ptr e;
> -      int nreg_moves = 0, i_reg_move;
> +      int nreg_moves = 0, i_reg_move, i_must_follow;
>       rtx prev_reg, old_reg;
>       int first_move;
> +      sbitmap must_follow;
>
>       /* Compute the number of reg_moves needed for u, by looking at life
>         ranges started at u (excluding self-loops).  */
> @@ -539,6 +710,7 @@ schedule_reg_moves (partial_schedule_ptr
>       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
>       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
>                             first_move + nreg_moves);
> +      extend_node_sched_params (ps);
>
>       /* Record the moves associated with this node.  */
>       first_move += ps->g->num_nodes;
> @@ -552,7 +724,7 @@ schedule_reg_moves (partial_schedule_ptr
>          ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
>
>          move->def = i;
> -         move->uses = sbitmap_alloc (g->num_nodes);
> +         move->uses = sbitmap_alloc (first_move + nreg_moves);
>          move->old_reg = old_reg;
>          move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
>          move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
> @@ -586,6 +758,19 @@ schedule_reg_moves (partial_schedule_ptr
>                SET_BIT (move->uses, e->dest->cuid);
>              }
>          }
> +
> +      must_follow = sbitmap_alloc (first_move + nreg_moves);
> +      i_must_follow = i;
> +      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
> +       {
> +         if (!schedule_reg_move (ps, first_move + i_reg_move,
> +                                 must_follow, i_must_follow))
> +           break;
> +         i_must_follow = first_move + i_reg_move;
> +       }
> +      sbitmap_free (must_follow);
> +      if (i_reg_move < nreg_moves)
> +       return false;
>     }
>   return true;
>  }
> @@ -609,9 +794,6 @@ apply_reg_moves (partial_schedule_ptr ps
>          df_insn_rescan (ps->g->nodes[i_use].insn);
>        }
>     }
> -
> -  FOR_EACH_VEC_ELT_REVERSE (ps_reg_move_info, ps->reg_moves, i, move)
> -    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
>  }
>
>  /* Update the sched_params (time, row and stage) for node U using the II,
> @@ -682,22 +864,6 @@ reset_sched_times (partial_schedule_ptr
>       }
>  }
>
> -/* Set SCHED_COLUMN of each node according to its position in PS.  */
> -static void
> -set_columns_for_ps (partial_schedule_ptr ps)
> -{
> -  int row;
> -
> -  for (row = 0; row < ps->ii; row++)
> -    {
> -      ps_insn_ptr cur_insn = ps->rows[row];
> -      int column = 0;
> -
> -      for (; cur_insn; cur_insn = cur_insn->next_in_row)
> -       SCHED_COLUMN (cur_insn->id) = column++;
> -    }
> -}
> -
>  /* Permute the insns according to their order in PS, from row 0 to
>    row ii-1, and position them right before LAST.  This schedules
>    the insns of the loop kernel.  */
> @@ -714,8 +880,13 @@ permute_partial_schedule (partial_schedu
>        rtx insn = ps_rtl_insn (ps, ps_ij->id);
>
>        if (PREV_INSN (last) != insn)
> -         reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> -                             PREV_INSN (last));
> +         {
> +           if (ps_ij->id < ps->g->num_nodes)
> +             reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> +                                 PREV_INSN (last));
> +           else
> +             add_insn_before (insn, last, NULL);
> +         }
>       }
>  }
>
> @@ -919,7 +1090,7 @@ optimize_sc (partial_schedule_ptr ps, dd
>
>  static void
>  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> -                          int to_stage, int for_prolog, rtx count_reg)
> +                          int to_stage, rtx count_reg)
>  {
>   int row;
>   ps_insn_ptr ps_ij;
> @@ -928,7 +1099,6 @@ duplicate_insns_of_cycles (partial_sched
>     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
>       {
>        int u = ps_ij->id;
> -       int j, i_reg_moves, i_reg_move;
>        rtx u_insn;
>
>         /* Do not duplicate any insn which refers to count_reg as it
> @@ -942,42 +1112,18 @@ duplicate_insns_of_cycles (partial_sched
>             || JUMP_P (u_insn))
>           continue;
>
> -       if (for_prolog)
> -         {
> -           /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
> -              number of reg_moves starting with the second occurrence of
> -              u, which is generated if its SCHED_STAGE <= to_stage.  */
> -           i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *first* reg_move backwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
> -         }
> -       else /* It's for the epilog.  */
> -         {
> -           /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
> -              starting to decrease one stage after u no longer occurs;
> -              that is, generate all reg_moves until
> -              SCHED_STAGE (u) == from_stage - 1.  */
> -           i_reg_moves = (SCHED_NREG_MOVES (u)
> -                          - (from_stage - SCHED_STAGE (u) - 1));
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *last* reg_move forwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
> -         }
> -
> -       for (j = 0; j < i_reg_moves; j++)
> +       if (u < ps->g->num_nodes)
>          {
> -           ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
> -
> -           emit_insn (copy_rtx (PATTERN (move->insn)));
> +           if (SCHED_STAGE (u) >= from_stage
> +               && SCHED_STAGE (u) <= to_stage)
> +             duplicate_insn_chain (ps_first_note (ps, u), u_insn);
>          }
> -       if (SCHED_STAGE (u) >= from_stage
> -           && SCHED_STAGE (u) <= to_stage)
> -         duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> +       else
> +         /* For simplicity, we generate all moves for every stage,
> +            even though some of them will be dead.  Later optimizations
> +            will remove the dead instructions and undo some of the
> +            unnecessary renaming that moves would otherwise produce.  */
> +         emit_insn (copy_rtx (PATTERN (u_insn)));
>       }
>  }
>
> @@ -1011,7 +1157,7 @@ generate_prolog_epilog (partial_schedule
>     }
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
> +    duplicate_insns_of_cycles (ps, 0, i, count_reg);
>
>   /* Put the prolog on the entry edge.  */
>   e = loop_preheader_edge (loop);
> @@ -1023,7 +1169,7 @@ generate_prolog_epilog (partial_schedule
>   start_sequence ();
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
> +    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
>
>   /* Put the epilogue on the exit edge.  */
>   gcc_assert (single_exit (loop));
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [4/4] Make SMS schedule register moves
  2011-08-30 12:45 [0/4] " Richard Sandiford
@ 2011-08-30 13:17 ` Richard Sandiford
  2011-08-30 13:22   ` Richard Guenther
  0 siblings, 1 reply; 13+ messages in thread
From: Richard Sandiford @ 2011-08-30 13:17 UTC (permalink / raw)
  To: gcc-patches; +Cc: zaks

This is the move-scheduling patch itself.  It should be fairly
self-explanatory.  Let me know if it isn't, and I'll try to improve
the commentary.

One potentially controversial change is to the way we handle moves
in the prologue and epilogue.  The current code uses a conservative
check to decide which stages need which moves.  This check copes
with values that are live before the loop, and emits some moves
that aren't actually needed.  The code also emits chains of moves
that can be trivially optimised through propagation.  We rely on
later patches to clean this up for us (and they do).

So, rather than keep a rather complicated check here, I've simply
emitted the moves for all stages.  In principle, that will generate
a little more dead code than now in cases where chains of two moves
are needed, but that seems to be very rare (when moves are scheduled).

Richard


gcc/
	* modulo-sched.c (extend_node_sched_params): New function.
	(print_node_sched_params): Print the stage.
	(set_columns_for_row, schedule_reg_move): New functions.
	(set_columns_for_ps): Move up file and use set_columns_for_now.
	(schedule_reg_moves): Call extend_node_sched_params and
	schedule_reg_move.  Extend size of uses bitmap.  Return false
	if a move could not be scheduled.
	(apply_reg_moves): Don't emit moves here.
	(permute_partial_schedule): Handle register moves.
	(duplicate_insns_of_cycles): Remove for_prolog.  Always emit moves.
	(generate_prolog_epilog): Update calls accordingly.

Index: gcc/modulo-sched.c
===================================================================
--- gcc/modulo-sched.c	2011-08-30 13:06:36.528669762 +0100
+++ gcc/modulo-sched.c	2011-08-30 13:22:08.029124537 +0100
@@ -220,8 +220,6 @@ static partial_schedule_ptr sms_schedule
 static void permute_partial_schedule (partial_schedule_ptr, rtx);
 static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
                                     rtx, rtx);
-static void duplicate_insns_of_cycles (partial_schedule_ptr,
-				       int, int, int, rtx);
 static int calculate_stage_count (partial_schedule_ptr, int);
 static void calculate_must_precede_follow (ddg_node_ptr, int, int,
 					   int, int, sbitmap, sbitmap, sbitmap);
@@ -458,6 +456,15 @@ set_node_sched_params (ddg_ptr g)
 			 node_sched_param_vec, g->num_nodes);
 }
 
+/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
+static void
+extend_node_sched_params (partial_schedule_ptr ps)
+{
+  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
+			 ps->g->num_nodes + VEC_length (ps_reg_move_info,
+							ps->reg_moves));
+}
+
 static void
 print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
 {
@@ -474,6 +481,7 @@ print_node_sched_params (FILE *file, int
 	       INSN_UID (ps_rtl_insn (ps, i)));
       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
       fprintf (file, " time = %d:\n", nsp->time);
+      fprintf (file, " stage = %d:\n", nsp->stage);
       fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
       for (j = 0; j < nsp->nreg_moves; j++)
 	{
@@ -485,6 +493,168 @@ print_node_sched_params (FILE *file, int
     }
 }
 
+/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
+static void
+set_columns_for_row (partial_schedule_ptr ps, int row)
+{
+  ps_insn_ptr cur_insn;
+  int column;
+
+  column = 0;
+  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
+    SCHED_COLUMN (cur_insn->id) = column++;
+}
+
+/* Set SCHED_COLUMN for each instruction in PS.  */
+static void
+set_columns_for_ps (partial_schedule_ptr ps)
+{
+  int row;
+
+  for (row = 0; row < ps->ii; row++)
+    set_columns_for_row (ps, row);
+}
+
+/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
+   The source of the move is provided by I_MUST_FOLLOW, which has
+   already been scheduled.  MUST_FOLLOW is a scratch bitmap that
+   is big enough to hold I_MUST_FOLLOW.
+
+   Return true on success.  */
+static bool
+schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
+		   sbitmap must_follow, int i_must_follow)
+{
+  unsigned int u;
+  int i, start_row, end_row, this_start, this_end, this_row, latency;
+  int origin_row, origin_column;
+  sbitmap_iterator sbi;
+  ps_reg_move_info *move;
+
+  move = ps_reg_move (ps, i_reg_move);
+
+  /* The cyclic lifetime of move->new_reg starts and ends at move->def
+     (the instruction that defines move->old_reg).  We need to decide
+     where in that cyclic lifetime the move should go.  The position
+     is limited by I_MUST_FOLLOW (which defines the source of the move)
+     and the nodes in move->uses (which consume the result).
+
+     Specifically, the lowest possible row (relative to move->def)
+     is the maximum of:
+
+       a) the row in which the result of the previous iteration's
+	  I_MUST_FOLLOW becomes available
+       b) the first row in which every use of the previous iteration's
+	  move->new_reg has completed.
+
+     The highest possible row (again relative to move->def) is
+     the minimum of:
+
+       c) the row in which the current iteration's I_MUST_FOLLOW
+	  executes (and thus makes the source of the move unavailable)
+       d) the last row in which every use of this iteration's
+	  move->new_reg could be satisfied without delay.
+
+     Because all positions are relative to move->def, this function
+     uses ROW - ps->ii to represent positions come after move->def.
+     The range includes origin_row twice: "origin_row - ps->ii" is for
+     the positions in origin_row after move->def, while "origin_row"
+     itself is for the positions before move->def.  */
+  origin_row = SCHED_ROW (move->def);
+  origin_column = SCHED_COLUMN (move->def);
+  start_row = origin_row - ps->ii;
+  end_row = origin_row;
+
+  if (dump_file)
+    fprintf (dump_file, "Scheduling register move INSN %d with cyclic lifetime"
+	     " [%d, %d]\n", INSN_UID (move->insn), start_row, end_row);
+
+  /* Limit the range to [a, c].  */
+  this_end = SCHED_ROW (i_must_follow);
+  if (this_end > origin_row
+      || (this_end == origin_row
+	  && SCHED_COLUMN (i_must_follow) > origin_column))
+    this_end -= ps->ii;
+  latency = insn_latency (ps_rtl_insn (ps, i_must_follow), move->insn);
+  this_start = this_end - ps->ii + latency;
+
+  if (dump_file)
+    fprintf (dump_file, "  -- source produced by INSN %d with latency %d,"
+	     " range [%d, %d]\n", INSN_UID (ps_rtl_insn (ps, i_must_follow)),
+	     latency, this_start, this_end);
+
+  if (start_row < this_start)
+    start_row = this_start;
+  if (end_row > this_end)
+    end_row = this_end;
+
+  /* Limit the range to [b, d].  */
+  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
+    {
+      latency = insn_latency (move->insn, ps_rtl_insn (ps, u));
+      this_start = SCHED_ROW (u);
+      if (this_start > origin_row
+	  || (this_start == origin_row
+	      && SCHED_COLUMN (u) > origin_column))
+	this_start -= ps->ii;
+      this_end = this_start + ps->ii - latency;
+
+      if (dump_file)
+	fprintf (dump_file, "  -- destination used by INSN %d with latency"
+		 " %d, range [%d, %d]\n", INSN_UID (ps_rtl_insn (ps, u)),
+		 latency, this_start, this_end);
+
+      if (start_row < this_start)
+	start_row = this_start;
+      if (end_row > this_end)
+	end_row = this_end;
+    }
+
+  sbitmap_zero (must_follow);
+  SET_BIT (must_follow, i_must_follow);
+
+  if (dump_file)
+    fprintf (dump_file, "  -- trying to schedule in rows [%d, %d]\n",
+	     start_row, end_row);
+
+  for (i = end_row; i >= start_row; i--)
+    {
+      ps_insn_ptr psi;
+
+      this_row = i < 0 ? i + ps->ii : i;
+      /* origin_row - ps->ii represents the part of row origin_row after
+	 move->def.  In this case the move must come after both move->uses
+	 and move->def.  */
+      if (i == origin_row - ps->ii)
+	{
+	  gcc_checking_assert (i == start_row);
+	  gcc_checking_assert (!TEST_BIT (move->uses, move->def));
+	  SET_BIT (move->uses, move->def);
+	  psi = ps_add_node_check_conflicts (ps, i_reg_move, this_row,
+					     move->uses, NULL);
+	  RESET_BIT (move->uses, move->def);
+	}
+      else
+	psi = ps_add_node_check_conflicts (ps, i_reg_move, this_row,
+					   move->uses, must_follow);
+
+      if (psi)
+	{
+	  SCHED_ROW (i_reg_move) = this_row;
+	  if (dump_file)
+	    fprintf (dump_file, "  -- scheduled in row %d (%d)\n",
+		     i, this_row);
+	  set_columns_for_row (ps, this_row);
+	  return true;
+	}
+    }
+
+  if (dump_file)
+    fprintf (dump_file, "  -- no available slot\n");
+
+  return false;
+}
+
 /*
    Breaking intra-loop register anti-dependences:
    Each intra-loop register anti-dependence implies a cross-iteration true
@@ -507,9 +677,10 @@ schedule_reg_moves (partial_schedule_ptr
     {
       ddg_node_ptr u = &g->nodes[i];
       ddg_edge_ptr e;
-      int nreg_moves = 0, i_reg_move;
+      int nreg_moves = 0, i_reg_move, i_must_follow;
       rtx prev_reg, old_reg;
       int first_move;
+      sbitmap must_follow;
 
       /* Compute the number of reg_moves needed for u, by looking at life
 	 ranges started at u (excluding self-loops).  */
@@ -539,6 +710,7 @@ schedule_reg_moves (partial_schedule_ptr
       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
 			     first_move + nreg_moves);
+      extend_node_sched_params (ps);
 
       /* Record the moves associated with this node.  */
       first_move += ps->g->num_nodes;
@@ -552,7 +724,7 @@ schedule_reg_moves (partial_schedule_ptr
 	  ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
 
 	  move->def = i;
-	  move->uses = sbitmap_alloc (g->num_nodes);
+	  move->uses = sbitmap_alloc (first_move + nreg_moves);
 	  move->old_reg = old_reg;
 	  move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
 	  move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
@@ -586,6 +758,19 @@ schedule_reg_moves (partial_schedule_ptr
 		SET_BIT (move->uses, e->dest->cuid);
 	      }
 	  }
+
+      must_follow = sbitmap_alloc (first_move + nreg_moves);
+      i_must_follow = i;
+      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
+	{
+	  if (!schedule_reg_move (ps, first_move + i_reg_move,
+				  must_follow, i_must_follow))
+	    break;
+	  i_must_follow = first_move + i_reg_move;
+	}
+      sbitmap_free (must_follow);
+      if (i_reg_move < nreg_moves)
+	return false;
     }
   return true;
 }
@@ -609,9 +794,6 @@ apply_reg_moves (partial_schedule_ptr ps
 	  df_insn_rescan (ps->g->nodes[i_use].insn);
 	}
     }
-
-  FOR_EACH_VEC_ELT_REVERSE (ps_reg_move_info, ps->reg_moves, i, move)
-    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
 }
 
 /* Update the sched_params (time, row and stage) for node U using the II,
@@ -682,22 +864,6 @@ reset_sched_times (partial_schedule_ptr 
       }
 }
  
-/* Set SCHED_COLUMN of each node according to its position in PS.  */
-static void
-set_columns_for_ps (partial_schedule_ptr ps)
-{
-  int row;
-
-  for (row = 0; row < ps->ii; row++)
-    {
-      ps_insn_ptr cur_insn = ps->rows[row];
-      int column = 0;
-
-      for (; cur_insn; cur_insn = cur_insn->next_in_row)
-	SCHED_COLUMN (cur_insn->id) = column++;
-    }
-}
-
 /* Permute the insns according to their order in PS, from row 0 to
    row ii-1, and position them right before LAST.  This schedules
    the insns of the loop kernel.  */
@@ -714,8 +880,13 @@ permute_partial_schedule (partial_schedu
 	rtx insn = ps_rtl_insn (ps, ps_ij->id);
 
 	if (PREV_INSN (last) != insn)
-	  reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
-			      PREV_INSN (last));
+	  {
+	    if (ps_ij->id < ps->g->num_nodes)
+	      reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
+				  PREV_INSN (last));
+	    else
+	      add_insn_before (insn, last, NULL);
+	  }
       }
 }
 
@@ -919,7 +1090,7 @@ optimize_sc (partial_schedule_ptr ps, dd
 
 static void
 duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
-			   int to_stage, int for_prolog, rtx count_reg)
+			   int to_stage, rtx count_reg)
 {
   int row;
   ps_insn_ptr ps_ij;
@@ -928,7 +1099,6 @@ duplicate_insns_of_cycles (partial_sched
     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
       {
 	int u = ps_ij->id;
-	int j, i_reg_moves, i_reg_move;
 	rtx u_insn;
 
         /* Do not duplicate any insn which refers to count_reg as it
@@ -942,42 +1112,18 @@ duplicate_insns_of_cycles (partial_sched
             || JUMP_P (u_insn))
           continue;
 
-	if (for_prolog)
-	  {
-	    /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
-	       number of reg_moves starting with the second occurrence of
-	       u, which is generated if its SCHED_STAGE <= to_stage.  */
-	    i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *first* reg_move backwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
-	  }
-	else /* It's for the epilog.  */
-	  {
-	    /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
-	       starting to decrease one stage after u no longer occurs;
-	       that is, generate all reg_moves until
-	       SCHED_STAGE (u) == from_stage - 1.  */
-	    i_reg_moves = (SCHED_NREG_MOVES (u)
-			   - (from_stage - SCHED_STAGE (u) - 1));
-	    i_reg_moves = MAX (i_reg_moves, 0);
-	    i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
-
-	    /* The reg_moves start from the *last* reg_move forwards.  */
-	    i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 1);
-	  }
-
-	for (j = 0; j < i_reg_moves; j++)
+	if (u < ps->g->num_nodes)
 	  {
-	    ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
-
-	    emit_insn (copy_rtx (PATTERN (move->insn)));
+	    if (SCHED_STAGE (u) >= from_stage
+		&& SCHED_STAGE (u) <= to_stage)
+	      duplicate_insn_chain (ps_first_note (ps, u), u_insn);
 	  }
-	if (SCHED_STAGE (u) >= from_stage
-	    && SCHED_STAGE (u) <= to_stage)
-	  duplicate_insn_chain (ps_first_note (ps, u), u_insn);
+	else
+	  /* For simplicity, we generate all moves for every stage,
+	     even though some of them will be dead.  Later optimizations
+	     will remove the dead instructions and undo some of the
+	     unnecessary renaming that moves would otherwise produce.  */
+	  emit_insn (copy_rtx (PATTERN (u_insn)));
       }
 }
 
@@ -1011,7 +1157,7 @@ generate_prolog_epilog (partial_schedule
     }
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
+    duplicate_insns_of_cycles (ps, 0, i, count_reg);
 
   /* Put the prolog on the entry edge.  */
   e = loop_preheader_edge (loop);
@@ -1023,7 +1169,7 @@ generate_prolog_epilog (partial_schedule
   start_sequence ();
 
   for (i = 0; i < last_stage; i++)
-    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
+    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
 
   /* Put the epilogue on the exit edge.  */
   gcc_assert (single_exit (loop));

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2011-10-11  8:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <OFE173D50E.19DC6A3D-ONC2257912.007D3E72-C2257912.007D4640@il.ibm.com>
2011-09-22 17:38 ` [4/4] Make SMS schedule register moves Ayal Zaks
2011-09-22 22:33   ` Richard Sandiford
2011-09-23  7:49     ` Ayal Zaks
2011-09-28 16:16       ` Richard Sandiford
2011-10-10  1:10         ` Ayal Zaks
2011-10-10 12:17           ` Richard Sandiford
2011-10-10 15:52             ` Ayal Zaks
2011-10-11  8:49               ` Richard Sandiford
2011-08-30 12:45 [0/4] " Richard Sandiford
2011-08-30 13:17 ` [4/4] " Richard Sandiford
2011-08-30 13:22   ` Richard Guenther
2011-08-30 13:43     ` Richard Sandiford
2011-08-30 13:50       ` Richard Guenther
2011-08-30 13:57         ` Richard Sandiford

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).