* Re: SMS in gcc4.0
@ 2005-06-02 1:25 Canqun Yang
0 siblings, 0 replies; 15+ messages in thread
From: Canqun Yang @ 2005-06-02 1:25 UTC (permalink / raw)
To: Steven Bosscher; +Cc: gcc, Mostafa Hagog, Ayal Zaks
Canqun Yang <canqun@nudt.edu.cn>:
> Steven Bosscher <stevenb@suse.de>:
>
> > On Wednesday 01 June 2005 16:43, Canqun Yang wrote:
> > > Hi, all
> > >
> > > I've taken a look at modulo-sched.c recently, and found
> > > that both new_cycles and orig_cycles are imprecise. The
> > > reason is that kernel_number_of_cycles does not take the
> > > data dependences of insns into account as the DFA
> > > scheduler does in haifa-sched.c.
> >
> > How does this affect the cycles computation?
> >
>
> An insn is ready to be scheduled only when all the insns
> it depends on have already been scheduled. In haifa-sched.c,
> there is a queue to hold the insns which are ready to be
> scheduled.
>
> To find how the data dependences affect the cycles
> computation, the simplest way is to compare the two versions
> of assembly code generated by GCC, one with '-fmodulo-sched'
> turned on, the other without. Without SMS, the code in the
> loop has many stops ';;' to separate the instructions which
> have data dependences, while with SMS the kernel code of the
> loop has more instructions but fewer stops ';;'.
>
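The ready-queue point above can be sketched in C. This is only a toy model under stated assumptions: struct insn, cycles_ignoring_deps and cycles_with_deps are hypothetical names, not code from haifa-sched.c or modulo-sched.c, and the machine is reduced to a fixed issue width. It compares a cycle count that looks at issue slots alone with one that lets an insn issue only after everything it depends on has finished.

```c
#include <assert.h>

#define MAX_INSNS 16

/* Hypothetical, simplified insn record; GCC's real structures differ.  */
struct insn
{
  int latency;            /* cycles until the result is available */
  int n_preds;            /* number of insns this one depends on */
  int preds[MAX_INSNS];   /* indices of those insns */
};

/* Cycle estimate that only counts issue slots and ignores dependences.  */
int
cycles_ignoring_deps (int n_insns, int issue_width)
{
  return (n_insns + issue_width - 1) / issue_width;
}

/* Cycle count with a ready test: an insn may issue in CYCLE only when
   every predecessor has issued and its latency has elapsed.  The
   dependence graph must be acyclic, otherwise this never terminates.  */
int
cycles_with_deps (const struct insn *insns, int n_insns, int issue_width)
{
  int issue_cycle[MAX_INSNS] = { 0 };
  int scheduled[MAX_INSNS] = { 0 };
  int n_done = 0;
  int cycle = 0;

  assert (n_insns <= MAX_INSNS && issue_width > 0);
  while (n_done < n_insns)
    {
      int issued = 0;
      for (int i = 0; i < n_insns && issued < issue_width; i++)
        {
          int ready = 1;
          if (scheduled[i])
            continue;
          for (int p = 0; p < insns[i].n_preds; p++)
            {
              int j = insns[i].preds[p];
              if (!scheduled[j] || issue_cycle[j] + insns[j].latency > cycle)
                ready = 0;
            }
          if (ready)
            {
              scheduled[i] = 1;
              issue_cycle[i] = cycle;
              issued++;
              n_done++;
            }
        }
      cycle++;
    }
  return cycle;
}
```

For a chain of three dependent single-cycle insns on a two-wide machine, the first function reports 2 cycles and the second reports 3, which is the kind of gap described above.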
> > > On IA-64, three improvements are needed to let SMS work.
> > > 1) Modify doloop_register_get or the similar function
> > > defined in doloop.c to recognize the loop count
> > > register. I have supplied a patch about this in April.
> >
> > Mustafa and I have a patch that has a similar effect, see
> > http://gcc.gnu.org/ml/gcc-patches/2005-06/msg00035.html.
> >
> > > 2) Use a more precise way to calculate the values of the
> > > two kinds of cycles, or just ignore this benefit assertion.
> >
> > Probably need to be more precise :-/
> >
> > When I manually hacked modulo-sched.c to ignore this test, I
> > did see loops getting scheduled, but I also ran into ICEs in
> > cfglayout.
>
> There are no ICEs for pi.f90, swim.f, and mgrid.f
> according to my test. But, an internal compile error
> of 'unrecognizable insn' is produced
> by 'gen_sub2_insn', which explicitly subtracts from 'ar.lc',
> when swim.f and mgrid.f are being compiled.
There are no ICEs for pi.f90 according to my test. But
ICEs of 'unrecognizable insn' are produced
by 'gen_sub2_insn', which explicitly subtracts from 'ar.lc',
when swim.f and mgrid.f are being compiled.
>
> >
> > > 3) The counted loop register 'ar.lc' of IA-64 cannot be
> > > updated directly. Another temporary register is needed
> > > to evaluate the value of the actual loop count after
> > > SMS scheduling, and to assign its value to 'ar.lc'.
> >
> > Actually, should SMS just not update the loop register in place?
> > I never figured out why it tries to produce a sub insn (using
> > gen_sub2_insn which is also wrong btw).
> >
>
> The current implementation of SMS does not use IA-64's
> epilog register (ar.ec). After SMS, the loop count is
> just used to control how many times the kernel code
> executes, and the kernel code will execute
> loop_count - (stage_count - 1) times.
> The sub insn generated by gen_sub2_insn is used to
> produce this value.
>
>
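The kernel trip count quoted above can be checked with a small C sketch. The function names are hypothetical and this is not what modulo-sched.c emits; it only works through the arithmetic, assuming a prologue that fills the pipeline one stage at a time and an epilogue that drains it the same way.

```c
#include <assert.h>

/* Number of times the kernel runs after SMS, following the formula
   loop_count - (stage_count - 1).  Returns -1 when the loop is too
   short to fill the pipeline (a case this sketch does not handle).  */
long
sms_kernel_trips (long loop_count, long stage_count)
{
  assert (stage_count >= 1);
  if (loop_count < stage_count)
    return -1;
  return loop_count - (stage_count - 1);
}

/* Total number of stage executions across prologue, kernel and
   epilogue.  It should equal loop_count * stage_count, i.e. every
   stage of every original iteration runs exactly once.  */
long
sms_total_stage_executions (long loop_count, long stage_count)
{
  long kernel = sms_kernel_trips (loop_count, stage_count);
  long total = 0;

  if (kernel < 0)
    return -1;
  /* Prologue: step s starts one more iteration and runs s stages.  */
  for (long s = 1; s < stage_count; s++)
    total += s;
  /* Kernel: each trip runs all stage_count stages.  */
  total += kernel * stage_count;
  /* Epilogue: drains the pipeline, running stage_count - 1 down to 1.  */
  for (long s = stage_count - 1; s >= 1; s--)
    total += s;
  return total;
}
```

With loop_count = 100 and stage_count = 3 the kernel runs 98 times, and the prologue and epilogue supply the remaining stage executions so that the total is 300.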
> > Gr.
> > Steven
> >
> >
>
Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <OF89DFFFBC.F31A0E99-ON43256FEA.005530A7-43256FEA.0055CD57@il.ibm.com>]
* RE: SMS in gcc4.0
@ 2005-03-31 17:06 Canqun Yang
2005-03-31 18:33 ` Steven Bosscher
2005-03-31 18:37 ` Steven Bosscher
0 siblings, 2 replies; 15+ messages in thread
From: Canqun Yang @ 2005-03-31 17:06 UTC (permalink / raw)
To: MUSTAFA, mark.davis, gp, ZAKS, gcc, gcc-patches
[-- Attachment #1: Type: text/plain, Size: 540 bytes --]
Hi, all
This patch will fix doloop_register_get defined in
modulo-sched.c, and let the PI calculation program
be successfully modulo scheduled on IA-64. On a 1GHz
Itanium-2, it takes just 3.128 seconds to execute when
compiled with "-fmodulo-sched -O3", versus
5.454 seconds without "-fmodulo-sched".
2005-03-31 Canqun Yang <canqun@nudt.edu.cn>
* modulo-sched.c (doloop_register_get): Deal
with if_then_else pattern.
Canqun Yang
Creative Compiler Research Group.
National University of Defense Technology, China.
[-- Attachment #2: pi.f90 --]
[-- Type: application/octet-stream, Size: 298 bytes --]
! Compute the value of pi by numeric integration
program pi_ex
implicit none
integer n, i
double precision h, pi
parameter (n=80000000)
h = 1.0 / n
pi = 0.0
do i=1, n
pi = pi + h * (4 / (1.0 + h*(i-0.5) * h*(i-0.5)))
end do
write(*,*) pi
end
[-- Attachment #3: modulo-sched.txt --]
[-- Type: text/plain, Size: 2328 bytes --]
*** /home/ycq/mainline/gcc/gcc/modulo-sched.c Mon Mar 21 10:49:23 2005
--- modulo-sched.c Thu Mar 31 21:11:08 2005
*************** static rtx
*** 263,269 ****
doloop_register_get (rtx insn, rtx *comp)
{
rtx pattern, cmp, inc, reg, condition;
!
if (!JUMP_P (insn))
return NULL_RTX;
pattern = PATTERN (insn);
--- 263,270 ----
doloop_register_get (rtx insn, rtx *comp)
{
rtx pattern, cmp, inc, reg, condition;
! rtx src;
!
if (!JUMP_P (insn))
return NULL_RTX;
pattern = PATTERN (insn);
*************** doloop_register_get (rtx insn, rtx *comp
*** 293,303 ****
/* Extract loop counter register. */
reg = SET_DEST (inc);
/* Check if something = (plus (reg) (const_int -1)). */
! if (GET_CODE (SET_SRC (inc)) != PLUS
! || XEXP (SET_SRC (inc), 0) != reg
! || XEXP (SET_SRC (inc), 1) != constm1_rtx)
return NULL_RTX;
/* Check for (set (pc) (if_then_else (condition)
--- 294,315 ----
/* Extract loop counter register. */
reg = SET_DEST (inc);
+ src = SET_SRC (inc);
+ /* On IA-64, the RTL pattern of SRC is just like this
+ (if_then_else:DI (ne (reg:DI 332 ar.lc)
+ (const_int 0 [0x0]))
+ (plus:DI (reg:DI 332 ar.lc)
+ (const_int -1 [0xffffffffffffffff]))
+ (reg:DI 332 ar.lc)) */
+
+ if (GET_CODE (src) == IF_THEN_ELSE)
+ src = XEXP (src, 1);
+
/* Check if something = (plus (reg) (const_int -1)). */
! if (GET_CODE (src) != PLUS
! || XEXP (src, 0) != reg
! || XEXP (src, 1) != constm1_rtx)
return NULL_RTX;
/* Check for (set (pc) (if_then_else (condition)
*************** doloop_register_get (rtx insn, rtx *comp
*** 318,324 ****
if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
|| GET_CODE (XEXP (condition, 1)) != CONST_INT). */
if (GET_CODE (condition) != NE
! || XEXP (condition, 1) != const1_rtx)
return NULL_RTX;
if (XEXP (condition, 0) == reg)
--- 330,337 ----
if ((GET_CODE (condition) != GE && GET_CODE (condition) != NE)
|| GET_CODE (XEXP (condition, 1)) != CONST_INT). */
if (GET_CODE (condition) != NE
! || (XEXP (condition, 1) != const1_rtx
! && XEXP (condition, 1) != const0_rtx))
return NULL_RTX;
if (XEXP (condition, 0) == reg)
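The pattern check added by the patch can be illustrated outside GCC with a miniature expression tree. Everything here (struct expr, the EX_ codes, is_counter_decrement) is a hypothetical stand-in for rtx, GET_CODE and XEXP, and the -1 test compares the constant's value rather than the shared constm1_rtx pointer, so treat it as a sketch of the idea only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature expression tree; GCC's real rtx is far richer.  */
enum ex_code { EX_REG, EX_CONST_INT, EX_PLUS, EX_NE, EX_IF_THEN_ELSE };

struct expr
{
  enum ex_code code;
  long value;                 /* register number or integer constant */
  const struct expr *op[3];   /* operands, unused slots are NULL */
};

/* Return nonzero if SRC decrements REG by one, either as a bare
   (plus reg -1) or wrapped as (if_then_else cond (plus reg -1) reg),
   which is the form the patch describes for IA-64's ar.lc.  */
int
is_counter_decrement (const struct expr *src, const struct expr *reg)
{
  /* Look through the wrapper to its "then" arm, like XEXP (src, 1).  */
  if (src->code == EX_IF_THEN_ELSE)
    src = src->op[1];

  return src->code == EX_PLUS
         && src->op[0] == reg
         && src->op[1]->code == EX_CONST_INT
         && src->op[1]->value == -1;
}
```

Both the bare decrement and the if_then_else form are accepted, while an unrelated expression such as the comparison itself is rejected.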
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: SMS in gcc4.0
2005-03-31 17:06 Canqun Yang
@ 2005-03-31 18:33 ` Steven Bosscher
2005-03-31 18:37 ` Steven Bosscher
1 sibling, 0 replies; 15+ messages in thread
From: Steven Bosscher @ 2005-03-31 18:33 UTC (permalink / raw)
To: Canqun Yang; +Cc: MUSTAFA, mark.davis, gp, ZAKS, gcc, gcc-patches
On Mar 31, 2005 03:56 PM, Canqun Yang <canqun@nudt.edu.cn> wrote:
> This patch will fix doloop_register_get defined in
> modulo-sched.c, and let the PI calculation program
> be successfully modulo scheduled on IA-64. On a 1GHz
> Itanium-2, it takes just 3.128 seconds to execute when
> compiled with "-fmodulo-sched -O3", versus
> 5.454 seconds without "-fmodulo-sched".
Nice! But it makes me wonder... Mustafa, why can doloop_register_get
not just accept the same doloop patterns as the ones accepted in
doloop_condition_get?
Gr.
Steven
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: SMS in gcc4.0
2005-03-31 17:06 Canqun Yang
2005-03-31 18:33 ` Steven Bosscher
@ 2005-03-31 18:37 ` Steven Bosscher
2005-03-31 18:52 ` Mostafa Hagog
1 sibling, 1 reply; 15+ messages in thread
From: Steven Bosscher @ 2005-03-31 18:37 UTC (permalink / raw)
To: Canqun Yang; +Cc: MUSTAFA, mark.davis, gp, ZAKS, gcc, gcc-patches
On Mar 31, 2005 03:56 PM, Canqun Yang <canqun@nudt.edu.cn> wrote:
> This patch will fix doloop_register_get defined in
> modulo-sched.c, and let the PI calculation program
> be successfully modulo scheduled on IA-64. On a 1GHz
> Itanium-2, it takes just 3.128 seconds to execute when
> compiled with "-fmodulo-sched -O3", versus
> 5.454 seconds without "-fmodulo-sched".
Nice! But it makes me wonder... Mustafa, why can doloop_register_get
not just accept the same doloop patterns as the ones accepted in
doloop_condition_get?
Gr.
Steven
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: SMS in gcc4.0
2005-03-31 18:37 ` Steven Bosscher
@ 2005-03-31 18:52 ` Mostafa Hagog
0 siblings, 0 replies; 15+ messages in thread
From: Mostafa Hagog @ 2005-03-31 18:52 UTC (permalink / raw)
To: Steven Bosscher; +Cc: Ayal Zaks, Canqun Yang, gcc, gcc-patches, gp, mark.davis
Steven Bosscher <stevenb@suse.de> wrote on 31/03/2005 16:55:52:
> On Mar 31, 2005 03:56 PM, Canqun Yang <canqun@nudt.edu.cn> wrote:
>
> > This patch will fix doloop_register_get defined in
> > modulo-sched.c, and let the PI calculation program
> > be successfully modulo scheduled on IA-64. On a 1GHz
> > Itanium-2, it takes just 3.128 seconds to execute when
> > compiled with "-fmodulo-sched -O3", versus
> > 5.454 seconds without "-fmodulo-sched".
>
> Nice! But it makes me wonder... Mustafa, why can doloop_register_get
> not just accept the same doloop patterns as the ones accepted in
> doloop_condition_get?
It should. It seems that there was a major change to doloop_condition_get
after we implemented SMS and created doloop_register_get; the correct fix
is to combine both and make doloop_condition_get external. I will
prepare a patch for this. Actually, there is a more fundamental change
that we can make to SMS, which is to use doloop_valid_p, which tells us that
there is a countable loop whether or not the machine supports BCTs,
but this requires changes to the way we generate the prologue/epilogue.
Mostafa.
>
> Gr.
> Steven
>
^ permalink raw reply [flat|nested] 15+ messages in thread
[parent not found: <EFA13842B4FD55469C1A837C9C849644019D731D@hdsmsx401.amr.corp.intel.com>]
* RE: SMS in gcc4.0
[not found] <EFA13842B4FD55469C1A837C9C849644019D731D@hdsmsx401.amr.corp.intel.com>
@ 2005-03-31 16:13 ` Mostafa Hagog
0 siblings, 0 replies; 15+ messages in thread
From: Mostafa Hagog @ 2005-03-31 16:13 UTC (permalink / raw)
To: Davis, Mark; +Cc: Gerald Pfeifer, Davis, Mark, gcc, Ayal Zaks
Hi Mark,
First of all I would like this discussion to be on the GCC mailing list; so
I am CCing the GCC mailing list (I hope this is OK with all the others).
"Davis, Mark" <mark.davis@intel.com> wrote on 31/03/2005 00:23:02:
> Mostafa & Gerald,
>
...
> It was mentioned that you folks had recently
> added SMS to gcc4.0, and I found the SMS paper from last year's gcc
> summit, and the description of SMS capabilities in gcc on the 4.0
> Features web site. So the obvious approach is to use SMS for Itanium as
> well as Power5 and ....
>
> 1) Is SMS in gcc currently turned on for anything other than Power5? I
> built a gcc4.0 for Itanium, and tried compiling the summation example
> from your paper (and some unrolled summation examples) using
> -O3 -fmodulo-sched
We haven't yet put effort into tuning SMS for any specific architecture
(including Power5); SMS is implemented as generally as the paper (mentioned
in http://gcc.gnu.org/news/sms.html) describes.
>
> but didn't see any difference in the .s file from not using
> -fmodulo-sched. Are there other switches to turn on or dumps to look
> at?
I would suggest starting by looking at the SMS dumps to see what it is doing
there; you can get them by adding the -dm flag to your compilation. If you
want, you can send me those dumps and I will look into them.
>
>
> I'm afraid I also was the origin of some of the "not very useful"
> comments about SMS. From my way of thinking, if SMS doesn't have alias
> information or array dependence analysis, then SMS can't pipeline loops
> storing into array elements; therefore it is not very useful as a
> pipeliner, even if the swing modulo scheduling part is excellent.
> 2) Did I miss something here?
This is true; that's why we need accurate alias info at the RTL level, and
this is one of the efforts that one should concentrate on in improving SMS.
>
> I do not know about gcc internals (which is why I'm "project-managing",
> not "implementing"), so it was interesting and disturbing to hear what
> you and Vlad had to say about the different internal representations
> relative to when the SMS phase runs:
> a) it seems to be too early to see the machine code
> b) it's too late to have alias info
>
> 3) Do you agree with this assessment?
It's not black or white. We need accurate alias info at the RTL level to be
able to software pipeline (SMS) the majority of the loops in real-world
programs; currently the RTL alias info is not accurate enough for those
loops. Having the alias info lets us eliminate memory-to-memory
dependencies and thus know that we can interleave different
iterations of the loop. The alias info is usually based on a high-level
representation of the code; the lower you are, the more information you
lose. One thing we could do is maintain this information as we go down
through the trees and RTL representations, but each pass that transforms
the code would then require additional effort to maintain the alias
information, which complicates it. That's why we want SMS to be as
early as possible. The other side of the coin is the modeling of the
machine resources (SMS is trying to solve a scheduling problem). In SMS we
use the DFA for resource modeling: we follow the resource usage of each
instruction and try to get the optimal schedule by moving
instructions among the different iterations while avoiding resource
conflicts. The problem with doing this early is that later passes can change
the resource usage of instructions when transforming the code
(splitting instructions, for example) and thus make the schedule
suboptimal. A good example of a way to handle this is the disabling of the
second scheduling pass for SMSed loops to prevent it from screwing up the
schedules generated by SMS. We can do the same for other passes and have a
cost model to decide whether or not it is beneficial to perform the
optimization inside the SMSed loops. Another problem that results from doing
SMS before register allocation is increased register pressure when SMS is
aggressive. IMO, this problem should be addressed later by using register
pressure estimation inside SMS.
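To make the alias point concrete at the source level, here is a small C sketch. It is only an analogy under stated assumptions: the function names are made up, and C99 restrict stands in for the alias information that would have to reach the DDG at the RTL level. The first loop may be called with overlapping arrays, so a store in one iteration can feed a load in the next; the second promises that this cannot happen.

```c
#include <assert.h>

/* DST and SRC may overlap.  If dst == src + 1, the value stored in
   iteration i is loaded in iteration i + 1, so the iterations cannot
   be freely interleaved without knowing more.  */
void
scale_may_alias (double *dst, const double *src, double k, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = k * src[i];
}

/* restrict promises that DST and SRC do not overlap, which removes
   the possible memory-to-memory dependence between iterations.  */
void
scale_no_alias (double *restrict dst, const double *restrict src,
                double k, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = k * src[i];
}
```

Calling scale_may_alias (a + 1, a, 2.0, 3) on an array of ones turns it into 1, 2, 4, 8, which shows the loop-carried dependence that a pipeliner must assume whenever it lacks alias information.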
> 4) Do you have any suggestions about using SMS for an in-order
> microarchitecture like Itanium which is more sensitive to the exact
> schedule than OOO microarchitectures like Power5?
Actually I would say that an in-order machine would benefit more from SMS
than an OOO machine, because theoretically OOO machines do the job of SMS
in hardware in many cases. The problem is not that IA64 is in-order, but
that IA64 and other in-order machines are highly dependent on
scheduling, SMS included. My suggestion is that we must invest in
lowering alias info to RTL and feeding this information to the DDG used by
SMS, which is implemented in ddg.c.
> 5) In the Intel compiler for Itanium, we carry the alias information
> from the high-level IL down to the machine-code level IL, and pipeline
> on the machine-code IL, before register allocation.
This is where SMS is currently positioned; it means that our problem is not
where SMS is performed but that the alias information is not getting there.
This is exactly what we were thinking all the time; the IC example reinforces
this thought.
>
> thanks,
> Mark Davis
> Intel Compiler Lab
> (formerly with DEC compiler team)
> Nashua, NH
>
Mostafa.
^ permalink raw reply [flat|nested] 15+ messages in thread