public inbox for gcc@gcc.gnu.org
* Memory footprint/compile time explosion caused by the tree inliner
@ 2003-04-17 22:59 Eric Botcazou
  2003-04-18  0:52 ` Mike Stump
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Botcazou @ 2003-04-17 22:59 UTC (permalink / raw)
  To: gcc

Hi,

PR optimization/10160 reports a huge compile time regression (4h30min vs a 
few minutes) on the 3.3 branch wrt the 3.2 branch for LyX/QT on Sparc. The 
time is mostly (85%) spent in the scheduler but the regression is caused by 
the tree inliner.

Here are some numbers at -O2:

                               peak    time
GCC 3.2.3pre                   63MB   1min38s
GCC 3.3pre
 max-inline-insns-single=150   89MB   1min18s
 max-inline-insns-single=175   89MB   1min36s
 max-inline-insns-single=200   89MB   2min10s
 max-inline-insns-single=205   90MB   2min38s
 max-inline-insns-single=210   90MB   3min37s
 max-inline-insns-single=215  157MB   5min35s
 max-inline-insns-single=225  185MB   8min19s
 max-inline-insns-single=235 >256MB   ??

The default is max-inline-insns-single=300.


Interestingly, max-inline-insns doesn't play any role:

 max-inline-insns=300 >256MB   ??
 max-inline-insns=200 >256MB   ??
 max-inline-insns=100 >256MB   ??

The default is max-inline-insns=600.


So, once we have reached a threshold on the number of insns for a single 
function, a bunch of little functions (constructors in this case) get 
inlined, totally out of control.

For example, one constructor grows from 128 stmts to 15437 stmts (154370 
insns according to the fixed conversion factor), that is more than 100 
times. Given that the scheduler is at least O(n^2), the game is over.


The problem is the new heuristics of the tree inliner:

      int sum_insns = (id ? id->inlined_stmts : 0) * INSNS_PER_STMT
		     + currfn_insns;
      /* In the extreme case that we have exceeded the recursive inlining
         limit by a huge factor (128), we just say no. Should not happen
         in real life.  */
      if (sum_insns > MAX_INLINE_INSNS * 128)
	 inlinable = 0;
      /* If we did not hit the extreme limit, we use a linear function
         with slope -1/MAX_INLINE_SLOPE to exceedingly decrease the
         allowable size. We always allow a size of MIN_INLINE_INSNS
         though.  */
      else if ((sum_insns > MAX_INLINE_INSNS)
	       && (currfn_insns > MIN_INLINE_INSNS))
	{
	  int max_curr = MAX_INLINE_INSNS_SINGLE
			- (sum_insns - MAX_INLINE_INSNS) / MAX_INLINE_SLOPE;
	  if (currfn_insns > max_curr)
	    inlinable = 0;
	}

The constructors are small, so we always have

	currfn_insns <= MIN_INLINE_INSNS.

So the limit is not max-inline-insns but max-inline-insns * 128, that is 
76800 insns (7680 stmts) per inlining group!
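
To make that concrete, here is a minimal standalone sketch of the check 
(not the actual GCC code): 300 and 600 are the defaults quoted above, and 
the values of 130 for MIN_INLINE_INSNS and 32 for MAX_INLINE_SLOPE are 
assumed defaults.

      #define INSNS_PER_STMT           10
      #define MAX_INLINE_INSNS_SINGLE  300   /* max-inline-insns-single */
      #define MAX_INLINE_INSNS         600   /* max-inline-insns */
      #define MIN_INLINE_INSNS         130   /* min-inline-insns (assumed) */
      #define MAX_INLINE_SLOPE         32    /* max-inline-slope (assumed) */

      /* Decide whether a callee estimated at currfn_insns insns may still
         be inlined after inlined_stmts stmts have already been inlined
         into the current group.  */
      static int
      still_inlinable_p (int inlined_stmts, int currfn_insns)
      {
        int sum_insns = inlined_stmts * INSNS_PER_STMT + currfn_insns;

        /* Hard cap, only reached after 600 * 128 = 76800 insns.  */
        if (sum_insns > MAX_INLINE_INSNS * 128)
          return 0;

        /* Sliding limit, never taken for a small callee because
           currfn_insns <= MIN_INLINE_INSNS.  */
        if (sum_insns > MAX_INLINE_INSNS && currfn_insns > MIN_INLINE_INSNS)
          {
            int max_curr = MAX_INLINE_INSNS_SINGLE
                           - (sum_insns - MAX_INLINE_INSNS) / MAX_INLINE_SLOPE;
            if (currfn_insns > max_curr)
              return 0;
          }

        /* A 100-insn constructor always ends up here, however much has
           been inlined already, until the 76800-insn cap is hit.  */
        return 1;
      }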

I did some testing on this arbitrary factor of 128:

       peak    time
 16   >256 MB  ??
  8   >256 MB  ??
  4   >256 MB  ??
  3   >256 MB  ??
  2    110 MB  5min20s
  1.5  82  MB  2min37s


It clearly appears that the tree inliner lacks global control on a 
per-function basis. I think we should implement a cutoff, absolute or 
relative, so as to avoid this kind of explosion in compile time and memory 
footprint.

-- 
Eric Botcazou


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-17 22:59 Memory footprint/compile time explosion caused by the tree inliner Eric Botcazou
@ 2003-04-18  0:52 ` Mike Stump
  2003-04-18  1:14   ` Dale Johannesen
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Mike Stump @ 2003-04-18  0:52 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: gcc

On Thursday, April 17, 2003, at 03:15 PM, Eric Botcazou wrote:
> PR optimization/10160 reports a huge compile time regression (4h30min 
> vs a
> few minutes) on the 3.3 branch wrt the 3.2 branch for LyX/QT on Sparc.

We fix this, then break it, then fix it, then break it.  :-(  We are 
doing something wrong.

Let me explain: in C++, there are no non-inlined functions.  In C++, 
there are no functions bigger than 5 lines.  Now, choose your inlining 
strategy.

Hint: inlining all functions under 6 lines doesn't work.  Hint: inlining 
all functions that are marked inline doesn't work.

Possible solutions: don't allow more than 15x growth in the caller; don't 
allow the caller to grow past 1,000 stmts.
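
A sketch of what such caller-side caps could look like (illustrative only; 
the names are made up, and the constants are just the 15x / 1,000-stmt 
numbers above):

      #define CALLER_GROWTH_FACTOR  15
      #define CALLER_STMT_LIMIT     1000

      /* Hypothetical check run before each inlining decision: refuse to
         inline once the caller has grown 15x, or past 1000 stmts.  */
      static int
      caller_may_grow_p (int caller_original_stmts,
                         int caller_current_stmts, int callee_stmts)
      {
        int after = caller_current_stmts + callee_stmts;

        if (after > caller_original_stmts * CALLER_GROWTH_FACTOR)
          return 0;   /* relative cap: 15x growth */
        if (after > CALLER_STMT_LIMIT)
          return 0;   /* absolute cap: 1,000 stmts */
        return 1;
      }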

>  The
> time is mostly (85%) spent in the scheduler but the regression is 
> caused by
> the tree inliner.

It would be nice if the scheduler works with very large functions.  It 
is ok to miss some of the edges, as long as the creamy center is nice.

> For example, one constructor grows from 128 stmts to 15437 stmts 
> (154370
> insns according to the fixed conversion factor), that is more than 100
> times. Given that the scheduler is at least O(n^2), the game is over.
>
>
> The problem is the new heuristics of the tree inliner:
>
>       int sum_insns = (id ? id->inlined_stmts : 0) * INSNS_PER_STMT
> 		     + currfn_insns;
>       /* In the extreme case that we have exceeded the recursive 
> inlining
>          limit by a huge factor (128), we just say no. Should not 
> happen
>          in real life.  */

This comment is wrong.  The comment should read: this happens all the 
time in C++.

>       if (sum_insns > MAX_INLINE_INSNS * 128)

Suggestion:

	if (sum_insns > MAX_INLINE_INSNS * 128
	    || sum_insns > 15 * (currfn_insn - (id ? id->inlined_stmts : 0)))

Better, make the 15 a parameter.  Try values 2-20 on real C++ code, 
benchmark the benefit vs compile time, and set the value near the knee 
of the curve.  I think a relative cap is better than an absolute cap: 
for large-bodied functions (aka C), the limit will be higher, and for 
small-bodied functions, the limit will be lower.

:-(

Thanks for tracking this down; wanna try testing out the code above, 
find a reasonable constant and submit it?


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18  0:52 ` Mike Stump
@ 2003-04-18  1:14   ` Dale Johannesen
  2003-04-18  6:50     ` Daniel Berlin
  2003-04-18  3:07   ` Steven Bosscher
  2003-04-18 12:46   ` Eric Botcazou
  2 siblings, 1 reply; 9+ messages in thread
From: Dale Johannesen @ 2003-04-18  1:14 UTC (permalink / raw)
  To: Mike Stump; +Cc: Dale Johannesen, Eric Botcazou, gcc

On Thursday, April 17, 2003, at 04:38  PM, Mike Stump wrote:
> On Thursday, April 17, 2003, at 03:15 PM, Eric Botcazou wrote:
>>  The
>> time is mostly (85%) spent in the scheduler but the regression is 
>> caused by
>> the tree inliner.
>
> It would be nice if the scheduler works with very large functions.  It 
> is ok to miss some of the edges, as long as the creamy center is nice.

Last time I looked, the scheduler bottleneck was free_deps().  People have
obviously already worked on speeding it up and I didn't see anything else
promising; good luck.


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18  0:52 ` Mike Stump
  2003-04-18  1:14   ` Dale Johannesen
@ 2003-04-18  3:07   ` Steven Bosscher
  2003-04-18 12:46   ` Eric Botcazou
  2 siblings, 0 replies; 9+ messages in thread
From: Steven Bosscher @ 2003-04-18  3:07 UTC (permalink / raw)
  To: Mike Stump; +Cc: Eric Botcazou, gcc

On Fri 18-04-2003, at 01:38, Mike Stump wrote:
> On Thursday, April 17, 2003, at 03:15 PM, Eric Botcazou wrote:
> > PR optimization/10160 reports a huge compile time regression (4h30min 
> > vs a few minutes) on the 3.3 branch wrt the 3.2 branch for LyX/QT on Sparc.
> 
> We fix this, then break it, then fix it, then break it.  :-(  We are 
> doing something wrong.

Assuming GCC hackers are capable people, it would indicate that the tree
inliner is not very robust or well understood  ;-)

> Let me explain: in C++, there are no non-inlined functions.  In C++, 
> there are no functions bigger than 5 lines.  Now, choose your inlining 
> strategy.

Read: "Now, choose your inlining strategy for C++".  How would your
suggestions work out for C and other languages?

> Hint: inlining all functions under 6 lines doesn't work.  Hint: inlining 
> all functions that are marked inline doesn't work.
> 
> Possible solutions: don't allow more than 15x growth in the caller; don't 
> allow the caller to grow past 1,000 stmts.

Well, what do you define as a single statement?

If you look at the tree inliner, you'll see it's trying to predict the
size of the (maybe) inline function in insns.  We now count insns via
INSNS_PER_STMT, but, for example, a SCOPE_STMT is also counted as a
STMT, so for each scope (i.e. SCOPE_STMT + DECL_STMT) you get 20 insns,
which is not a very good guess, I think.  Add a COMPOUND_STMT, and you
have 30 fictitious insns in the tree inliner for the minimal function.

(If you don't believe me: Look at "int foo (int a) { a = 1; }" and see
that it counts as 40 insns: 30 for the function and 10 for the
EXPR_STMT.)

Especially for your small functions of a few lines, this hurts.  Just an
empty function body accounts for 30 insns, and then you have, say, 10 stmts.
That gives you (10*10 + 30) == 130 == PARAM_MIN_INLINE_INSNS, so the function
is just on the limit of being too big for inlining when a lot of inlining has
been done already.  Add one more statement and the function is no longer an
inline candidate after repeated inlining.
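
(Spelling out that last step: an 11-stmt body counts as 11*10 + 30 == 140
insns, which is above MIN_INLINE_INSNS == 130, so the function becomes
subject to the sliding limit and is eventually rejected.)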

This may explain why Richard Guenther says it helps for POOMA
performance to increase PARAM_MIN_INLINE_INSNS.  Bumping it gave him
more inlined functions and better performance with only a very small
compile time increase.
(see http://gcc.gnu.org/ml/gcc/2003-04/msg00795.html).


> > For example, one constructor grows from 128 stmts to 15437 stmts 
> > (154370
> > insns according to the fixed conversion factor), that is more than 100
> > times. Given that the scheduler is at least O(n^2), the game is over.
> >
> >
> > The problem is the new heuristics of the tree inliner:
> >
> >       int sum_insns = (id ? id->inlined_stmts : 0) * INSNS_PER_STMT
> > 		     + currfn_insns;
> >       /* In the extreme case that we have exceeded the recursive  inlining
> >          limit by a huge factor (128), we just say no. Should not happen
> >          in real life.  */
> 
> This comment is wrong.  The comment should read: this happens all the 
> time in C++.

128 times the recursive inlining limit is a really large number of
insns.  So I don't think this is really the problem.

In the end it depends on where you _start_ inlining from (call graph
based or just at random like we do now).  Maybe we should allow
-finline-functions only  with -funit-at-a-time in the future (*)... or
at least see how things look when unit-at-a-time for C++ is available.


> Thanks for tracking this down; wanna try testing out the code above, 
> find a reasonable constant and submit it?

You make it sound like this analysis is a Great Discovery, but in fact
these issues with inline parameters have been discussed quite a few
times on gcc-bugs and in at least three high priority PRs for 3.3. 
Remember my mail from yesterday about inline parameters, to which only
Richard Guenther gave serious replies??

Greetz
Steven



*) I've also played a bit with the compile time regression PRs and the
tree-ssa branch, and for example PR 1687 does not occur there because the
tree of the inline candidate is much simpler, so walk_tree does not
consume so much time.



* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18  1:14   ` Dale Johannesen
@ 2003-04-18  6:50     ` Daniel Berlin
  0 siblings, 0 replies; 9+ messages in thread
From: Daniel Berlin @ 2003-04-18  6:50 UTC (permalink / raw)
  To: Dale Johannesen; +Cc: Mike Stump, Eric Botcazou, gcc



On Thu, 17 Apr 2003, Dale Johannesen wrote:

> On Thursday, April 17, 2003, at 04:38  PM, Mike Stump wrote:
> > On Thursday, April 17, 2003, at 03:15 PM, Eric Botcazou wrote:
> >>  The
> >> time is mostly (85%) spent in the scheduler but the regression is
> >> caused by
> >> the tree inliner.
> >
> > It would be nice if the scheduler works with very large functions.  It
> > is ok to miss some of the edges, as long as the creamy center is nice.
>
> Last time I looked, the scheduler bottleneck was free_deps().  People
> have
> obviously already worked on speeding it up and I didn't see anything
> else
> promising; good luck.

Err, if it's really free_deps, then it's a matter of changing data
structures a bit.
If you used a special INSN_LIST/EXPR_LIST that was
1. GC allocated,
2. not marked by the GC through being in an INSN (i.e. it needs some other
   root to be traversed and marked), and
3. reachable through a root added in the scheduler that is null'ed out by
   free_deps,
then you would be able to simply collect every so often in the scheduler,
and free_deps would just be

	root = NULL;

Or something like this.
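
To make the shape of that concrete, here is a toy sketch of the pattern
(nothing GCC-specific; gc_alloc stands in for whatever GC allocator would
be used, and all the names are made up):

      extern void *gc_alloc (unsigned long);   /* hypothetical GC allocator */

      struct dep_node
      {
        void *insn;
        struct dep_node *next;
      };

      /* The only root the collector traverses for these nodes; the nodes
         are deliberately not reachable from the insns themselves.  */
      static struct dep_node *sched_deps_root;

      static struct dep_node *
      add_dep (void *insn)
      {
        struct dep_node *n = gc_alloc (sizeof *n);
        n->insn = insn;
        n->next = sched_deps_root;
        sched_deps_root = n;
        return n;
      }

      static void
      free_deps (void)
      {
        /* Nothing to walk: dropping the root makes every node garbage,
           to be reclaimed at the next collection point in the scheduler.  */
        sched_deps_root = 0;
      }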

Anything similar (i.e. arena allocating/obstack allocating) requires not
using the standard INSN_LIST/EXPR_LIST as well.
At least, I'm assuming other places use INSN_LIST/EXPR_LIST and expect
them to live across passes, or else it's safe to just modify
INSN_LIST/EXPR_LIST themselves.



* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18  0:52 ` Mike Stump
  2003-04-18  1:14   ` Dale Johannesen
  2003-04-18  3:07   ` Steven Bosscher
@ 2003-04-18 12:46   ` Eric Botcazou
  2003-04-18 16:48     ` Geert Bosch
  2 siblings, 1 reply; 9+ messages in thread
From: Eric Botcazou @ 2003-04-18 12:46 UTC (permalink / raw)
  To: Mike Stump; +Cc: gcc

> Possible solutions: don't allow more than 15x growth in the caller; don't
> allow the caller to grow past 1,000 stmts.

Yes, I think a global cap is really needed.

> It would be nice if the scheduler works with very large functions.  It
> is ok to miss some of the edges, as long as the creamy center is nice.

According to Vlad, there is no hope, the scheduler is at least O(n^2).

> Suggestion:
>
> 	if (sum_insns > MAX_INLINE_INSNS * 128
>
> 	    || sum_insns > 15 * (currfn_insn - (id ? id->inlined_stmts : 0)))

Hmm... I think this formula doesn't make sense, since currfn_insn is the insn 
count of the function to be inlined, not of the function into which we are 
inlining. The whole heuristic is purely local; there is no global control 
at all.

> Better, make the 15 a parameter.  Try values 2-20 on real C++ code,
> benchmark the benefit vs compile time, and set the value near the knee
> of the curve.  I think a relative cap is better than an absolute cap:
> for large-bodied functions (aka C), the limit will be higher, and for
> small-bodied functions, the limit will be lower.

Yes, a relative cap is probably better.

> Thanks for tracking this down; wanna try testing out the code above,
> find a reasonable constant and submit it?

I'll investigate the problem but I'll be on vacation for the next 10 days 
with no computer at hand.

-- 
Eric Botcazou


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18 12:46   ` Eric Botcazou
@ 2003-04-18 16:48     ` Geert Bosch
  2003-04-18 18:06       ` Joe Buck
  0 siblings, 1 reply; 9+ messages in thread
From: Geert Bosch @ 2003-04-18 16:48 UTC (permalink / raw)
  To: Eric Botcazou; +Cc: Mike Stump, gcc


On Friday, Apr 18, 2003, at 07:20 America/New_York, Eric Botcazou wrote:
>> It would be nice if the scheduler works with very large functions.  It
>> is ok to miss some of the edges, as long as the creamy center is nice.
>
> According to Vlad, there is no hope, the scheduler is at least O(n^2).

Why can't we just insert arbitrary scheduling barriers in very large
functions and schedule the pieces independently? We could relate the
maximum size of a piece to expected execution frequency, so that we
work harder to optimize loops than other code.

It would seem that scheduling is only quadratic if at any cycle you
consider all ready instructions as candidates. If you limited
candidate instructions to a (preferably sliding) fixed-size window,
scheduling would become linear, as it should be.
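
(In complexity terms: with at most W candidate instructions examined per
pick, scheduling n instructions costs O(n*W) comparisons, which is linear
in n for a fixed window size W, instead of O(n^2) when every ready
instruction is a candidate at every cycle.)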

Think about it: wouldn't it be quite unacceptable if the scheduling
performed by out-of-order processors were anything other than
linear in the number of instructions of the executed code? :-)
We definitely need to make GCC's scheduling algorithm execute
in linear time, with the constant related to optimization level.

   -Geert


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18 16:48     ` Geert Bosch
@ 2003-04-18 18:06       ` Joe Buck
  2003-04-18 21:36         ` Toon Moene
  0 siblings, 1 reply; 9+ messages in thread
From: Joe Buck @ 2003-04-18 18:06 UTC (permalink / raw)
  To: Geert Bosch; +Cc: Eric Botcazou, Mike Stump, gcc


> > According to Vlad, there is no hope, the scheduler is at least O(n^2).

On Fri, Apr 18, 2003 at 10:49:51AM -0400, Geert Bosch wrote:
> Why can't we just insert arbitrary scheduling barriers in very large
> functions and schedule the pieces independently? We could relate the
> maximum size of a piece to expected execution frequency, so that we
> work harder to optimize loops than other code.

We pretty much have to do something like that; there are a lot of
processor simulator codes that have huge functions that are essentially a
straight line.  Toon, you're the expert, but I understand that this is
really common in scientific Fortran codes.

> Think about it, wouldn't it be quite unacceptable if the scheduling
> performed by out-of-order processors would be anything else than
> linear in the number of instructions of the executed code? :-)

Yes, it would be unacceptable.  Quadratic up to some function size cutoff
and linear after that would be OK.


* Re: Memory footprint/compile time explosion caused by the tree inliner
  2003-04-18 18:06       ` Joe Buck
@ 2003-04-18 21:36         ` Toon Moene
  0 siblings, 0 replies; 9+ messages in thread
From: Toon Moene @ 2003-04-18 21:36 UTC (permalink / raw)
  To: Joe Buck; +Cc: Geert Bosch, Eric Botcazou, Mike Stump, gcc

Joe Buck wrote:

>>>According to Vlad, there is no hope, the scheduler is at least O(n^2).

> On Fri, Apr 18, 2003 at 10:49:51AM -0400, Geert Bosch wrote:
> 
>>Why can't we just insert arbitrary scheduling barriers in very large
>>functions and schedule the pieces independently? We could relate the
>>maximum size of a piece to expected execution frequency, so that we
>>work harder to optimize loops than other code.

> We pretty much have to do something like that; there are a lot of
> processor simulator codes that have huge functions that are essentially a
> straight line.  Toon, you're the expert, but I understand that this is
> really common in scientific Fortran codes.

Well, certainly there are Fortran codes that have 1000s of lines per 
subroutine (our own code is an example of that coding style).

The question is, however, whether lines-of-code-per-function is the 
problem - does the scheduler really consider a complete subroutine for 
optimization?  Fortran codes (if they're not automatically generated) 
are normally full of loops, which partition the code into scheduling 
regions naturally, I'd assume.

-- 
Toon Moene - mailto:toon@moene.indiv.nluug.nl - phoneto: +31 346 214290
Saturnushof 14, 3738 XG  Maartensdijk, The Netherlands
Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html
GNU Fortran 95: http://gcc-g95.sourceforge.net/ (under construction)

