Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway

public inbox for gcc@gcc.gnu.org
 help / color / mirror / Atom feed

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
@ 2004-08-16  2:40 Richard Kenner
  2004-08-16  6:29 ` Jeffrey A Law
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Kenner @ 2004-08-16  2:40 UTC (permalink / raw)
  To: dberlin; +Cc: gcc

    > 1) let pre be pre and have the register allocator "rematerialize" the 
    > guys that have been pulled out of the loop.

    I've always tried to argue this approach, but it never seems to get 
    places with people, because it means the register allocator doing *a lot*.

Not necessarly.  After all, reload has "rematerialized" constants almost
since day one and has also rematerialized some addition operations more
recently.  It's not conceptually difficult to add some more things to
rematerialize.  The problem, of course, is that reload is a mess and adding
*anything* to it will be non-trivial.

But I still agree with Kenny that it's the best approach: I don't think
that complex interactions between passes to estimate register pressure
will work well.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  2:40 GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway Richard Kenner
@ 2004-08-16  6:29 ` Jeffrey A Law
  0 siblings, 0 replies; 18+ messages in thread
From: Jeffrey A Law @ 2004-08-16  6:29 UTC (permalink / raw)
  To: Richard Kenner; +Cc: dberlin, gcc

On Sun, 2004-08-15 at 20:28, Richard Kenner wrote:
>     > 1) let pre be pre and have the register allocator "rematerialize" the 
>     > guys that have been pulled out of the loop.
> 
>     I've always tried to argue this approach, but it never seems to get 
>     places with people, because it means the register allocator doing *a lot*.
> 
> Not necessarly.  After all, reload has "rematerialized" constants almost
> since day one and has also rematerialized some addition operations more
> recently.
True, but its ability to rematerialize is poor at best.  And the
way it's structured, particularly for chains of insns producing an
ultimate value is, err, special.


> But I still agree with Kenny that it's the best approach: I don't think
> that complex interactions between passes to estimate register pressure
> will work well.
Likewise.

jeff

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  1:23       ` Kenneth Zadeck
  2004-08-16  1:47         ` Daniel Berlin
@ 2004-08-23 20:02         ` Andrew MacLeod
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew MacLeod @ 2004-08-23 20:02 UTC (permalink / raw)
  To: Kenneth Zadeck
  Cc: Daniel Berlin, Mark Mitchell, gcc mailing list, Steven Bosscher

On Sun, 2004-08-15 at 21:08, Kenneth Zadeck wrote:
> 

> This is handled differently in different compilers.  There are basically 
> two schools of thought here:
> 
> 1) let pre be pre and have the register allocator "rematerialize" the 
> guys that have been pulled out of the loop.
> 
> 2) make each of the optimizations have a cost model that factors in the 
> way that several transformations effect "max live" i.e. the number of 
> items that will need to be kept in registers at any point in the program. 
> 
> I have always been in favor of the first school but there really are two 
> sides to this issue.  In the gcc world there are two parts of the 

Sorry, I've been away for a bit.

Not only would I promote (1), but I really despise (2) since it clutters
everything up.

When we were doing some of the early SSA implemention, we recognized
there would be a lot of pressure put on the register allocator which
would eventually need to be addressed. I do intend to take on some of
the register allocator issues shortly, I just have to finish up some
remaining SSA issues.

Of course, shortly doesnt mean anything will actually get fixed in the
near future, this is not a trivial issue, as everyone well knows! :-)

Andrew

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-19  8:17               ` Paolo Bonzini
  2004-08-19  8:38                 ` Paolo Bonzini
@ 2004-08-19 12:07                 ` Michael Matz
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Matz @ 2004-08-19 12:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kenneth Zadeck, Andrew MacLeod, gcc, Mark Mitchell, Steven Bosscher

Hi,

On Thu, 19 Aug 2004, Paolo Bonzini wrote:

> > And Mukta even made use of it and implemented some more of what's possible
> > in http://gcc.gnu.org/ml/gcc-patches/2003-12/msg01985.html .
> 
> Is this in new-ra?

Not yet, no.  It's still on my horizon though.

> Does it change new-ra's shape in SPEC (5% on gzip seems very good!)?

I don't know.


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-19  8:17               ` Paolo Bonzini
@ 2004-08-19  8:38                 ` Paolo Bonzini
  2004-08-19 12:07                 ` Michael Matz
  1 sibling, 0 replies; 18+ messages in thread
From: Paolo Bonzini @ 2004-08-19  8:38 UTC (permalink / raw)
  To: gcc; +Cc: Kenneth Zadeck, Andrew MacLeod, gcc, Mark Mitchell, Steven Bosscher

> And Mukta even made use of it and implemented some more of what's possible 
> in http://gcc.gnu.org/ml/gcc-patches/2003-12/msg01985.html .

Is this in new-ra?  Does it change new-ra's shape in SPEC (5% on gzip 
seems very good!)?

(And BTW, anybody know how is Vladimir Makarov's work going on?)

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-17 15:56             ` Michael Matz
@ 2004-08-19  8:17               ` Paolo Bonzini
  2004-08-19  8:38                 ` Paolo Bonzini
  2004-08-19 12:07                 ` Michael Matz
  0 siblings, 2 replies; 18+ messages in thread
From: Paolo Bonzini @ 2004-08-19  8:17 UTC (permalink / raw)
  To: Michael Matz
  Cc: Kenneth Zadeck, Andrew MacLeod, gcc, Mark Mitchell, Steven Bosscher

> And Mukta even made use of it and implemented some more of what's possible 
> in http://gcc.gnu.org/ml/gcc-patches/2003-12/msg01985.html .

Is this in new-ra?  Does it change new-ra's shape in SPEC (5% on gzip 
seems very good!)?

(And BTW, anybody know how is Vladimir Makarov's work going on?)

Paolo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-15 11:42 Steven Bosscher
  2004-08-15 14:55 ` Daniel Berlin
@ 2004-08-17 19:37 ` Toon Moene
  1 sibling, 0 replies; 18+ messages in thread
From: Toon Moene @ 2004-08-17 19:37 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: dberlin, gcc

Steven Bosscher wrote:

> Hi,
> 
> SPEC2000's mgrid benchmark has typical code fortran that looks
> like this:
> 
>       DO I3=2,N-1
>         DO I2=2,N-1
>           DO I1=2,N-1
>             R(I1,I2,I3) = V(I1,I2,I3)
>      >      -A(0)*( U(I1,  I2,  I3  ) )
>      >      -A(1)*( U(I1-1,I2,  I3  ) + U(I1+1,I2,  I3  )
>      >                 +  U(I1,  I2-1,I3  ) + U(I1,  I2+1,I3  )
>      >                 +  U(I1,  I2,  I3-1) + U(I1,  I2,  I3+1) )
>      >      -A(2)*( U(I1-1,I2-1,I3  ) + U(I1+1,I2-1,I3  )
>      >                 +  U(I1-1,I2+1,I3  ) + U(I1+1,I2+1,I3  )
>      >                 +  U(I1,  I2-1,I3-1) + U(I1,  I2+1,I3-1)
>      >                 +  U(I1,  I2-1,I3+1) + U(I1,  I2+1,I3+1)
>      >                 +  U(I1-1,I2,  I3-1) + U(I1-1,I2,  I3+1)
>      >                 +  U(I1+1,I2,  I3-1) + U(I1+1,I2,  I3+1) )
>      >      -A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
>      >                 +  U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
>      >                 +  U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
>      >                 +  U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )
>           END DO
>         END DO
>       END DO
> 
> where R,V,A, and U are function arguments and I[123] are loop
> counters.
> 
> When gfortran flattens the arrays it exposes a lot of redundancies
> for address arithmetic, most of which is loop invariant.  So PRE
> gets to work, moves everything it can out of the loop, and we get:

[ A huge mess. ]

Note that g77 doesn't flatten arrays and we still move a whole lot of 
the address computations out of the (inner) loop.

No, that's not the problem. The problem is that the tree-ssa loop 
optimizer doesn't have induction variable strength reduction and 
elimination (it's present on the lno branch, but not integrated in 
mainline yet).

Note the following analysis:

On ix86 variants, which have, among others (reg+offset) addressing, only 
eleven (integer) registers are necessary to hold addresses and the loop 
counter in the above inner loop:

  1: A
  2: U(n, I2  , I3  )
  3: U(n, I2+1, I3+1)
  4: U(n, I2  , I3+1)
  5: U(n, I2+1, I3  )
  6: U(n, I2-1, I3  )
  7: U(n, I2-1, I3+1)
  8: U(n, I2-1, I3-1)
  9: V(n, I2,   I3  )
10: R(n, I2,   I3  )
11: I1

The AMD64 has 16 integer registers, enough to cause no spillage on this 
loop.

Hope this helps,

-- 
Toon Moene - mailto:toon@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
A maintainer of GNU Fortran 95: http://gcc.gnu.org/fortran/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16 11:27           ` Michael Matz
@ 2004-08-17 15:56             ` Michael Matz
  2004-08-19  8:17               ` Paolo Bonzini
  0 siblings, 1 reply; 18+ messages in thread
From: Michael Matz @ 2004-08-17 15:56 UTC (permalink / raw)
  To: Daniel Berlin
  Cc: Kenneth Zadeck, Andrew MacLeod, gcc, Mark Mitchell, Steven Bosscher

Hi,

On Mon, 16 Aug 2004, Michael Matz wrote:

> I'm also a fan of this approach.  On the new-regalloc branch, there is a 
> start for infrastructure for rematerializing expressions.

And Mukta even made use of it and implemented some more of what's possible 
in http://gcc.gnu.org/ml/gcc-patches/2003-12/msg01985.html .


Ciao,
Michael.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  7:50           ` Steven Bosscher
@ 2004-08-16 12:12             ` Diego Novillo
  0 siblings, 0 replies; 18+ messages in thread
From: Diego Novillo @ 2004-08-16 12:12 UTC (permalink / raw)
  To: Steven Bosscher
  Cc: Daniel Berlin, Ken Zadeck, Andrew Macleod, gcc, Mark Mitchell

On Mon, 2004-08-16 at 03:24, Steven Bosscher wrote:

> > > I realize that I am being unsympathetic to danny's "no one wants to
> > > tackle the register allocator argument".  However, it may just be time
> > > to light a fire under Richard Henderson and have him do what he did at
> > > IBM.
> >
> > You mean Andrew Macleod :)
> 
> And for the less informed, he did... what?  :-)
> 
Implement the current register allocator for the IBM compiler.


Diego.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  1:47         ` Daniel Berlin
  2004-08-16  7:50           ` Steven Bosscher
@ 2004-08-16 11:27           ` Michael Matz
  2004-08-17 15:56             ` Michael Matz
  1 sibling, 1 reply; 18+ messages in thread
From: Michael Matz @ 2004-08-16 11:27 UTC (permalink / raw)
  To: Daniel Berlin
  Cc: Kenneth Zadeck, Andrew MacLeod, gcc, Mark Mitchell, Steven Bosscher

Hi,

On Sun, 15 Aug 2004, Daniel Berlin wrote:

> > This is handled differently in different compilers.  There are basically two
> > schools of thought here:
> > 
> > 1) let pre be pre and have the register allocator "rematerialize" the guys
> > that have been pulled out of the loop.
> > 
> 
> I've always tried to argue this approach, but it never seems to get places
> with people, because it means the register allocator doing *a lot*.

I'm also a fan of this approach.  On the new-regalloc branch, there is a 
start for infrastructure for rematerializing expressions.  Up to now 
it more or less is constrained to constant expressions, or those where the 
operands are trivially proved live everywhere, and not-spillable (like for 
instance offsets from the frame pointer).

It currently just lacks the functions to validate arbitrary expressions, 
i.e. if their operands are still valid at the place where they are to be 
rematerialized.  Also missing is choosing which expressions to 
rematerialize, if given a big set of possibilities.  There are some 
approaches in the wild to minimize work, and be sure that such 
rematerialization is not pessimizing the code again.  The most powerful 
one unfortunately requires SSA form, which we can't have in RTL (subregs), 
so we need to constrain ourself to the easy cases.  But those should 
catch most things, in particular if one handles arbitrary expressions of 
operands which are defined only once (by luck).

Ciao,
Michael.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  1:47         ` Daniel Berlin
@ 2004-08-16  7:50           ` Steven Bosscher
  2004-08-16 12:12             ` Diego Novillo
  2004-08-16 11:27           ` Michael Matz
  1 sibling, 1 reply; 18+ messages in thread
From: Steven Bosscher @ 2004-08-16  7:50 UTC (permalink / raw)
  To: Daniel Berlin, Kenneth Zadeck, Andrew MacLeod; +Cc: gcc, Mark Mitchell

On Monday 16 August 2004 03:23, Daniel Berlin wrote:
> On Aug 15, 2004, at 9:08 PM, Kenneth Zadeck wrote:
> > 1) let pre be pre and have the register allocator "rematerialize" the
> > guys that have been pulled out of the loop.
>
> I've always tried to argue this approach, but it never seems to get
> places with people, because it means the register allocator doing *a
> lot*.

The other problem with letting the register allocator handle the problem
is that it just can't do that right now.

As Dan already said, the issue we're having now with GVN-PRE would also
have occured with gcse.c's PRE if it had the same information available
to it.  But apparently it does not.  Then what reason do we have to
think that the register allocator does have the required information to
basically do the reverse?

Of course it's nicer conceptually to just rematerialize stuff to reduce
register pressure and let all other passes go about their business.  But
can we get a proper rematerializing register allocator before 3.5?
I doubt that.

BTW I did try -fnew-ra the other day.  Now I know the new register
allocator in its current form sucks, but I had hoped for at least some
improvement.  Well...

flags                          mgrid runtime
-O2                              ~810s
-O2 -fnew-ra                     ~720s
-O2 -fno-tree-pre                ~510s
-O2 -fno-tree-pre -fnew-ra       ~710s

> > I realize that I am being unsympathetic to danny's "no one wants to
> > tackle the register allocator argument".  However, it may just be time
> > to light a fire under Richard Henderson and have him do what he did at
> > IBM.
>
> You mean Andrew Macleod :)

And for the less informed, he did... what?  :-)

Gr.
Steven

--
If God loves man, why did he create those pesky mosquitos?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  1:23       ` Kenneth Zadeck
@ 2004-08-16  1:47         ` Daniel Berlin
  2004-08-16  7:50           ` Steven Bosscher
  2004-08-16 11:27           ` Michael Matz
  2004-08-23 20:02         ` Andrew MacLeod
  1 sibling, 2 replies; 18+ messages in thread
From: Daniel Berlin @ 2004-08-16  1:47 UTC (permalink / raw)
  To: Kenneth Zadeck, Andrew MacLeod; +Cc: gcc, Mark Mitchell, Steven Bosscher


On Aug 15, 2004, at 9:08 PM, Kenneth Zadeck wrote:

>
>
> Daniel Berlin wrote:
>
>>
>> On Aug 15, 2004, at 8:05 PM, Mark Mitchell wrote:
>>
>>> Daniel Berlin wrote:
>>>
>>>> Yeah.
>>>> I'll see if i can come up with some simple heuristics that help 
>>>> here, without hurting anything else too much.
>>>
>>>
>>> I wonder if it's better to restrain PRE, or let it go ahead, and 
>>> then have the register allocator move computations back into the 
>>> loop.
>>
>> It could also split the live ranges, and solve the problem that way.
>> This is what i would suggest, but realistically, it's going to be 
>> easier for us to restrain PRE in our current infrastructure.  That, 
>> and nobody ever seems to want to improve the register allocator :).
>> Though at some point, all our optimizations are going to start 
>> causing real problems if our RA can't split live ranges.
>> On the other hand, we're also going to need this type of register 
>> pressure calculation for unrolling and whatnot.
>>
>> It's also not going to restrain PRE all *that* much in most cases.
>> This loop/function seems to be a very bad case.  Other compilers i 
>> can get debug info from all estimate the register pressure at ~25 
>> integer registers, and ~5 float point registers.
>>
>> That's going to cause spilling on almost all platforms without PRE 
>> pulling anything out.
>>
>> The only real problem with restraining PRE is that it would be far 
>> better for PRE to pull it out, and something later to decide which 
>> computations to move back i. It might be the case that it makes more 
>> sense to keep some of the address arithemtic pulled out, at the cost 
>> of moving some other calculation back into the loop.  Who knows.
>>
> This is handled differently in different compilers.  There are 
> basically two schools of thought here:
>
> 1) let pre be pre and have the register allocator "rematerialize" the 
> guys that have been pulled out of the loop.
>

I've always tried to argue this approach, but it never seems to get 
places with people, because it means the register allocator doing *a 
lot*.

> 2) make each of the optimizations have a cost model that factors in 
> the way that several transformations effect "max live" i.e. the number 
> of items that will need to be kept in registers at any point in the 
> program.
> I have always been in favor of the first school but there really are 
> two sides to this issue.  In the gcc world there are two parts of the 
> debate.  The first is actually making the decision and the second is 
> to get the community to stick by it.  This is one of those places 
> where the cats really do need to march in lock step because this 
> becomes not just an issue with pre but with every optimization that 
> can move code around.
> The truth is that without any technical debate I can see plan 1 
> becoming the only one that will be workable in gcc because the 
> management problems become intractable (I am still adding up the hours 
> from the last attempt to get a single cat turned around.)  You are 
> basically going to have to get everyone to try and make their 
> transformations pressure sensitive.
>
> The technical issues are that it is easier to write all the 
> transformations so that they assume that they have infinite registers 
> and then split things up later.  Plan 2 runs into problems because the 
> first optimizations to run get to use up all of the registers and the 
> rest of the optimizations get constrained even if they would have been 
> more profitable.
> This will really become an issue as we start tuning the loop 
> transformations.  It really is easier to work in a world where you 
> have ejected everything from a loop before you decide if it is 
> profitable to do something like loop switching which will end up 
> changing all of your register pressure estimates anyway.  It is hard 
> to see how any splitting decision that danny makes now will be correct 
> after the loops are switched.
>
> I realize that I am being unsympathetic to danny's "no one wants to 
> tackle the register allocator argument".  However, it may just be time 
> to light a fire under Richard Henderson and have him do what he did at 
> IBM.

You mean Andrew Macleod :)
--Dan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  1:08     ` Daniel Berlin
@ 2004-08-16  1:23       ` Kenneth Zadeck
  2004-08-16  1:47         ` Daniel Berlin
  2004-08-23 20:02         ` Andrew MacLeod
  0 siblings, 2 replies; 18+ messages in thread
From: Kenneth Zadeck @ 2004-08-16  1:23 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Mark Mitchell, gcc, Steven Bosscher

Daniel Berlin wrote:

>
> On Aug 15, 2004, at 8:05 PM, Mark Mitchell wrote:
>
>> Daniel Berlin wrote:
>>
>>> Yeah.
>>> I'll see if i can come up with some simple heuristics that help 
>>> here, without hurting anything else too much.
>>
>>
>> I wonder if it's better to restrain PRE, or let it go ahead, and then 
>> have the register allocator move computations back into the loop.
>
> It could also split the live ranges, and solve the problem that way.
> This is what i would suggest, but realistically, it's going to be 
> easier for us to restrain PRE in our current infrastructure.  That, 
> and nobody ever seems to want to improve the register allocator :).
> Though at some point, all our optimizations are going to start causing 
> real problems if our RA can't split live ranges.
> On the other hand, we're also going to need this type of register 
> pressure calculation for unrolling and whatnot.
>
> It's also not going to restrain PRE all *that* much in most cases.
> This loop/function seems to be a very bad case.  Other compilers i can 
> get debug info from all estimate the register pressure at ~25 integer 
> registers, and ~5 float point registers.
>
> That's going to cause spilling on almost all platforms without PRE 
> pulling anything out.
>
> The only real problem with restraining PRE is that it would be far 
> better for PRE to pull it out, and something later to decide which 
> computations to move back i. It might be the case that it makes more 
> sense to keep some of the address arithemtic pulled out, at the cost 
> of moving some other calculation back into the loop.  Who knows.
>
This is handled differently in different compilers.  There are basically 
two schools of thought here:

1) let pre be pre and have the register allocator "rematerialize" the 
guys that have been pulled out of the loop.

2) make each of the optimizations have a cost model that factors in the 
way that several transformations effect "max live" i.e. the number of 
items that will need to be kept in registers at any point in the program. 

I have always been in favor of the first school but there really are two 
sides to this issue.  In the gcc world there are two parts of the 
debate.  The first is actually making the decision and the second is to 
get the community to stick by it.  This is one of those places where the 
cats really do need to march in lock step because this becomes not just 
an issue with pre but with every optimization that can move code around. 

The truth is that without any technical debate I can see plan 1 becoming 
the only one that will be workable in gcc because the management 
problems become intractable (I am still adding up the hours from the 
last attempt to get a single cat turned around.)  You are basically 
going to have to get everyone to try and make their transformations 
pressure sensitive.

The technical issues are that it is easier to write all the 
transformations so that they assume that they have infinite registers 
and then split things up later.  Plan 2 runs into problems because the 
first optimizations to run get to use up all of the registers and the 
rest of the optimizations get constrained even if they would have been 
more profitable.  

This will really become an issue as we start tuning the loop 
transformations.  It really is easier to work in a world where you have 
ejected everything from a loop before you decide if it is profitable to 
do something like loop switching which will end up changing all of your 
register pressure estimates anyway.  It is hard to see how any splitting 
decision that danny makes now will be correct after the loops are switched.

I realize that I am being unsympathetic to danny's "no one wants to 
tackle the register allocator argument".  However, it may just be time 
to light a fire under Richard Henderson and have him do what he did at IBM.

>
>>
>> I've copied Kenny here to see if he happens to know of any magic 
>> bullets for this kind of thing.
>>
>> (Kenny, the issue is that PRE is pulling so much stuff out of loops 
>> that we get register pressure, and hence bad code.)
>>
>> -- 
>> Mark Mitchell
>> CodeSourcery, LLC
>> (916) 791-8304
>> mark@codesourcery.com
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  0:14   ` Mark Mitchell
  2004-08-16  0:33     ` Andrew Pinski
@ 2004-08-16  1:08     ` Daniel Berlin
  2004-08-16  1:23       ` Kenneth Zadeck
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Berlin @ 2004-08-16  1:08 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: gcc, Kenneth Zadeck, Steven Bosscher

On Aug 15, 2004, at 8:05 PM, Mark Mitchell wrote:

> Daniel Berlin wrote:
>
>> Yeah.
>> I'll see if i can come up with some simple heuristics that help here, 
>> without hurting anything else too much.
>
> I wonder if it's better to restrain PRE, or let it go ahead, and then 
> have the register allocator move computations back into the loop.
It could also split the live ranges, and solve the problem that way.
This is what i would suggest, but realistically, it's going to be 
easier for us to restrain PRE in our current infrastructure.  That, and 
nobody ever seems to want to improve the register allocator :).
Though at some point, all our optimizations are going to start causing 
real problems if our RA can't split live ranges.
On the other hand, we're also going to need this type of register 
pressure calculation for unrolling and whatnot.

It's also not going to restrain PRE all *that* much in most cases.
This loop/function seems to be a very bad case.  Other compilers i can 
get debug info from all estimate the register pressure at ~25 integer 
registers, and ~5 float point registers.

That's going to cause spilling on almost all platforms without PRE 
pulling anything out.

The only real problem with restraining PRE is that it would be far 
better for PRE to pull it out, and something later to decide which 
computations to move back i. It might be the case that it makes more 
sense to keep some of the address arithemtic pulled out, at the cost of 
moving some other calculation back into the loop.  Who knows.

>
> I've copied Kenny here to see if he happens to know of any magic 
> bullets for this kind of thing.
>
> (Kenny, the issue is that PRE is pulling so much stuff out of loops 
> that we get register pressure, and hence bad code.)
>
> -- 
> Mark Mitchell
> CodeSourcery, LLC
> (916) 791-8304
> mark@codesourcery.com
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-16  0:14   ` Mark Mitchell
@ 2004-08-16  0:33     ` Andrew Pinski
  2004-08-16  1:08     ` Daniel Berlin
  1 sibling, 0 replies; 18+ messages in thread
From: Andrew Pinski @ 2004-08-16  0:33 UTC (permalink / raw)
  To: Mark Mitchell; +Cc: Daniel Berlin, Steven Bosscher, gcc, Kenneth Zadeck


On Aug 15, 2004, at 5:05 PM, Mark Mitchell wrote:

> Daniel Berlin wrote:
>
>> Yeah.
>> I'll see if i can come up with some simple heuristics that help here, 
>> without hurting anything else too much.
>
> I wonder if it's better to restrain PRE, or let it go ahead, and then 
> have the register allocator move computations back into the loop.
>
> I've copied Kenny here to see if he happens to know of any magic 
> bullets for this kind of thing.
>
> (Kenny, the issue is that PRE is pulling so much stuff out of loops 
> that we get register pressure, and hence bad code.)

Just so I got my say and not get Danny into too much trouble but I will 
note that
NAG F95 with gcc 3.3 as the compiler gets around the same score as 
gfortran does.
This is on a dual 2.5GHz G5.


-- Pinski

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-15 14:55 ` Daniel Berlin
@ 2004-08-16  0:14   ` Mark Mitchell
  2004-08-16  0:33     ` Andrew Pinski
  2004-08-16  1:08     ` Daniel Berlin
  0 siblings, 2 replies; 18+ messages in thread
From: Mark Mitchell @ 2004-08-16  0:14 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Steven Bosscher, gcc, Kenneth Zadeck

Daniel Berlin wrote:

> Yeah.
> I'll see if i can come up with some simple heuristics that help here, 
> without hurting anything else too much.

I wonder if it's better to restrain PRE, or let it go ahead, and then 
have the register allocator move computations back into the loop.

I've copied Kenny here to see if he happens to know of any magic bullets 
for this kind of thing.

(Kenny, the issue is that PRE is pulling so much stuff out of loops that 
we get register pressure, and hence bad code.)

-- 
Mark Mitchell
CodeSourcery, LLC
(916) 791-8304
mark@codesourcery.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
  2004-08-15 11:42 Steven Bosscher
@ 2004-08-15 14:55 ` Daniel Berlin
  2004-08-16  0:14   ` Mark Mitchell
  2004-08-17 19:37 ` Toon Moene
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Berlin @ 2004-08-15 14:55 UTC (permalink / raw)
  To: Steven Bosscher; +Cc: gcc


On Aug 15, 2004, at 7:21 AM, Steven Bosscher wrote:

> Hi,
>
> SPEC2000's mgrid benchmark has typical code fortran that looks
> like this:
>
...
> When gfortran flattens the arrays it exposes a lot of redundancies
> for address arithmetic, most of which is loop invariant.  So PRE
> gets to work, moves everything it can out of the loop, and we get:
>
...
Ha. GCC overoptimized something.

> Needless to say that on almost any architecture this will cause so
> much spilling that PRE severely pessimizes the generated code.  And
> indeed, on AMD64, mgrid compiled with -fno-tree-pre is almost twice
> as fast as mgrid with PRE enabled.
> I guess this is convincing evidence that we need to teach tree PRE
> about register pressure, just like tree loop invariant code motion.
Yeah.
I'll see if i can come up with some simple heuristics that help here, 
without hurting anything else too much.

>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway
@ 2004-08-15 11:42 Steven Bosscher
  2004-08-15 14:55 ` Daniel Berlin
  2004-08-17 19:37 ` Toon Moene
  0 siblings, 2 replies; 18+ messages in thread
From: Steven Bosscher @ 2004-08-15 11:42 UTC (permalink / raw)
  To: dberlin; +Cc: gcc

Hi,

SPEC2000's mgrid benchmark has typical code fortran that looks
like this:

      DO I3=2,N-1
        DO I2=2,N-1
          DO I1=2,N-1
            R(I1,I2,I3) = V(I1,I2,I3)
     >      -A(0)*( U(I1,  I2,  I3  ) )
     >      -A(1)*( U(I1-1,I2,  I3  ) + U(I1+1,I2,  I3  )
     >                 +  U(I1,  I2-1,I3  ) + U(I1,  I2+1,I3  )
     >                 +  U(I1,  I2,  I3-1) + U(I1,  I2,  I3+1) )
     >      -A(2)*( U(I1-1,I2-1,I3  ) + U(I1+1,I2-1,I3  )
     >                 +  U(I1-1,I2+1,I3  ) + U(I1+1,I2+1,I3  )
     >                 +  U(I1,  I2-1,I3-1) + U(I1,  I2+1,I3-1)
     >                 +  U(I1,  I2-1,I3+1) + U(I1,  I2+1,I3+1)
     >                 +  U(I1-1,I2,  I3-1) + U(I1-1,I2,  I3+1)
     >                 +  U(I1+1,I2,  I3-1) + U(I1+1,I2,  I3+1) )
     >      -A(3)*( U(I1-1,I2-1,I3-1) + U(I1+1,I2-1,I3-1)
     >                 +  U(I1-1,I2+1,I3-1) + U(I1+1,I2+1,I3-1)
     >                 +  U(I1-1,I2-1,I3+1) + U(I1+1,I2-1,I3+1)
     >                 +  U(I1-1,I2+1,I3+1) + U(I1+1,I2+1,I3+1) )
          END DO
        END DO
      END DO

where R,V,A, and U are function arguments and I[123] are loop
counters.

When gfortran flattens the arrays it exposes a lot of redundancies
for address arithmetic, most of which is loop invariant.  So PRE
gets to work, moves everything it can out of the loop, and we get:

  # BLOCK 14
  # PRED: 0 [79.0%]  (false,exec)
<L28>:;
  pretmp.202<D919>_117 = (int8<D8>)2;
  pretmp.204<D921>_475 = stride.10<D458>_28 * pretmp.202<D919>_117;
  # SUCC: 1 [100.0%]  (fallthru)

  # BLOCK 1
  # PRED: 17 [100.0%]  (fallthru) 14 [100.0%]  (fallthru)
  # prephitmp.205<D922>_466 = PHI <T.95<D560>_214(17), pretmp.204<D921>_475(14)>;
  # prephitmp.203<D920>_122 = PHI <T.94<D559>_213(17), pretmp.202<D919>_117(14)>;
  # TMT.201<D892>_425 = PHI <TMT.201<D892>_502(14), TMT.201<D892>_532(17)>;
  # TMT.200<D891>_357 = PHI <TMT.200<D891>_503(14), TMT.200<D891>_531(17)>;
  # i3<D445>_262 = PHI <2(14), T.93<D558>_212(17)>;
  # count.18<D466>_261 = PHI <count.18<D466>_60(14), count.18<D466>_118(17)>;
<L2>:;
  pretmp.206<D923>_465 = pretmp.202<D919>_117;
  pretmp.208<D925>_463 = stride.8<D456>_24 * pretmp.206<D923>_465;
  pretmp.210<D927>_461 = (int8<D8>)i3<D445>_262;
  pretmp.212<D929>_453 = stride.10<D458>_28 * pretmp.210<D927>_461;
  pretmp.214<D931>_451 = pretmp.208<D925>_463 + pretmp.212<D929>_453;
  pretmp.216<D933>_449 = i3<D445>_262 - 1;
  pretmp.218<D935>_447 = (int8<D8>)pretmp.216<D933>_449;
  pretmp.220<D937>_441 = stride.10<D458>_28 * pretmp.218<D935>_447;
  pretmp.222<D939>_411 = pretmp.208<D925>_463 + pretmp.220<D937>_441;
  pretmp.224<D941>_409 = i3<D445>_262 + 1;
  pretmp.226<D943>_403 = (int8<D8>)pretmp.224<D941>_409;
  pretmp.228<D945>_401 = stride.10<D458>_28 * pretmp.226<D943>_403;
  pretmp.230<D947>_399 = pretmp.208<D925>_463 + pretmp.228<D945>_401;
  # SUCC: 2 [100.0%]  (fallthru,exec)

  # BLOCK 2
  # PRED: 16 [100.0%]  (fallthru) 1 [100.0%]  (fallthru,exec)
  # prephitmp.231<D948>_398 = PHI <T.134<D599>_313(16), pretmp.230<D947>_399(1)>;
  # prephitmp.229<D946>_400 = PHI <T.95<D560>_214(16), pretmp.228<D945>_401(1)>;
  # prephitmp.227<D944>_402 = PHI <T.94<D559>_213(16), pretmp.226<D943>_403(1)>;
  # prephitmp.225<D942>_408 = PHI <T.93<D558>_212(16), pretmp.224<D941>_409(1)>;
  # prephitmp.223<D940>_410 = PHI <T.124<D589>_289(16), pretmp.222<D939>_411(1)>;
  # prephitmp.221<D938>_440 = PHI <T.87<D552>_203(16), pretmp.220<D937>_441(1)>;
  # prephitmp.219<D936>_442 = PHI <T.86<D551>_202(16), pretmp.218<D935>_447(1)>;
  # prephitmp.217<D934>_448 = PHI <T.85<D550>_201(16), pretmp.216<D933>_449(1)>;
  # prephitmp.215<D932>_450 = PHI <T.80<D545>_193(16), pretmp.214<D931>_451(1)>;
  # prephitmp.213<D930>_452 = PHI <T.37<D502>_127(16), pretmp.212<D929>_453(1)>;
  # prephitmp.211<D928>_454 = PHI <T.36<D501>_126(16), pretmp.210<D927>_461(1)>;
  # prephitmp.209<D926>_462 = PHI <T.79<D544>_190(16), pretmp.208<D925>_463(1)>;
  # prephitmp.207<D924>_464 = PHI <T.78<D543>_189(16), pretmp.206<D923>_465(1)>;
  # TMT.201<D892>_426 = PHI <TMT.201<D892>_425(1), TMT.201<D892>_532(16)>;
  # TMT.200<D891>_358 = PHI <TMT.200<D891>_357(1), TMT.200<D891>_531(16)>;
  # i2<D447>_242 = PHI <2(1), T.77<D542>_188(16)>;
  # count.18<D466>_241 = PHI <count.18<D466>_60(1), count.18<D466>_123(16)>;
<L4>:;
  pretmp.232<D949>_397 = (int8<D8>)i2<D447>_242;
  pretmp.234<D951>_382 = stride.8<D456>_24 * pretmp.232<D949>_397;
  pretmp.236<D953>_377 = pretmp.210<D927>_461;
  pretmp.238<D955>_375 = prephitmp.205<D922>_466;
  pretmp.240<D957>_373 = pretmp.234<D951>_382 + pretmp.238<D955>_375;
  pretmp.242<D959>_371 = pretmp.206<D923>_465;
  pretmp.244<D961>_347 = pretmp.240<D957>_373 + pretmp.242<D959>_371;
  pretmp.246<D963>_345 = offset.11<D459>_34 + pretmp.244<D961>_347;
  pretmp.248<D965>_343 = i2<D447>_242 - 1;
  pretmp.250<D967>_337 = (int8<D8>)pretmp.248<D965>_343;
  pretmp.252<D969>_335 = stride.8<D456>_24 * pretmp.250<D967>_337;
  pretmp.254<D971>_333 = pretmp.238<D955>_375 + pretmp.252<D969>_335;
  pretmp.256<D973>_314 = pretmp.242<D959>_371 + pretmp.254<D971>_333;
  pretmp.258<D975>_311 = offset.11<D459>_34 + pretmp.256<D973>_314;
  pretmp.260<D977>_309 = i2<D447>_242 + 1;
  pretmp.262<D979>_307 = (int8<D8>)pretmp.260<D977>_309;
  pretmp.264<D981>_300 = stride.8<D456>_24 * pretmp.262<D979>_307;
  pretmp.266<D983>_278 = pretmp.238<D955>_375 + pretmp.264<D981>_300;
  pretmp.268<D985>_275 = pretmp.242<D959>_371 + pretmp.266<D983>_278;
  pretmp.270<D987>_273 = offset.11<D459>_34 + pretmp.268<D985>_275;
  pretmp.272<D989>_271 = pretmp.216<D933>_449;
  pretmp.274<D991>_265 = pretmp.218<D935>_447;
  pretmp.276<D993>_263 = pretmp.220<D937>_441;
  pretmp.278<D995>_259 = pretmp.234<D951>_382 + pretmp.276<D993>_263;
  pretmp.280<D997>_253 = pretmp.242<D959>_371 + pretmp.278<D995>_259;
  pretmp.282<D999>_251 = offset.11<D459>_34 + pretmp.280<D997>_253;
  pretmp.284<D1001>_249 = pretmp.224<D941>_409;
  pretmp.286<D1003>_247 = pretmp.226<D943>_403;
  pretmp.288<D1005>_239 = pretmp.228<D945>_401;
  pretmp.290<D1007>_237 = pretmp.234<D951>_382 + pretmp.288<D1005>_239;
  pretmp.292<D1009>_235 = pretmp.242<D959>_371 + pretmp.290<D1007>_237;
  pretmp.294<D1011>_230 = offset.11<D459>_34 + pretmp.292<D1009>_235;
  pretmp.296<D1013>_228 = pretmp.252<D969>_335 + pretmp.276<D993>_263;
  pretmp.298<D1015>_211 = pretmp.242<D959>_371 + pretmp.296<D1013>_228;
  pretmp.300<D1017>_205 = offset.11<D459>_34 + pretmp.298<D1015>_211;
  pretmp.302<D1019>_500 = pretmp.264<D981>_300 + pretmp.276<D993>_263;
  pretmp.304<D1021>_498 = pretmp.242<D959>_371 + pretmp.302<D1019>_500;
  pretmp.306<D1023>_496 = offset.11<D459>_34 + pretmp.304<D1021>_498;
  pretmp.308<D1025>_494 = pretmp.252<D969>_335 + pretmp.288<D1005>_239;
  pretmp.310<D1027>_492 = pretmp.242<D959>_371 + pretmp.308<D1025>_494;
  pretmp.312<D1029>_490 = offset.11<D459>_34 + pretmp.310<D1027>_492;
  pretmp.314<D1031>_488 = pretmp.264<D981>_300 + pretmp.288<D1005>_239;
  pretmp.316<D1033>_486 = pretmp.242<D959>_371 + pretmp.314<D1031>_488;
  pretmp.318<D1035>_484 = offset.11<D459>_34 + pretmp.316<D1033>_486;
  # SUCC: 3 [100.0%]  (fallthru,exec)

  # BLOCK 3
  # PRED: 15 [100.0%]  (fallthru) 2 [100.0%]  (fallthru,exec)

[ Notice how all this stuff is hoisted even out of the
  outermost loop! ]

Needless to say that on almost any architecture this will cause so
much spilling that PRE severely pessimizes the generated code.  And
indeed, on AMD64, mgrid compiled with -fno-tree-pre is almost twice
as fast as mgrid with PRE enabled.

I guess this is convincing evidence that we need to teach tree PRE
about register pressure, just like tree loop invariant code motion.

Gr.
Steven


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2004-08-23 19:00 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-16  2:40 GVN-PRE sucks -- well, for SPECCFP2000 mgrid anyway Richard Kenner
2004-08-16  6:29 ` Jeffrey A Law
  -- strict thread matches above, loose matches on Subject: below --
2004-08-15 11:42 Steven Bosscher
2004-08-15 14:55 ` Daniel Berlin
2004-08-16  0:14   ` Mark Mitchell
2004-08-16  0:33     ` Andrew Pinski
2004-08-16  1:08     ` Daniel Berlin
2004-08-16  1:23       ` Kenneth Zadeck
2004-08-16  1:47         ` Daniel Berlin
2004-08-16  7:50           ` Steven Bosscher
2004-08-16 12:12             ` Diego Novillo
2004-08-16 11:27           ` Michael Matz
2004-08-17 15:56             ` Michael Matz
2004-08-19  8:17               ` Paolo Bonzini
2004-08-19  8:38                 ` Paolo Bonzini
2004-08-19 12:07                 ` Michael Matz
2004-08-23 20:02         ` Andrew MacLeod
2004-08-17 19:37 ` Toon Moene

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).