public inbox for gcc-patches@gcc.gnu.org
 help / color / mirror / Atom feed
* Re: [PATCH] Add capability to run several iterations of early optimizations
@ 2011-10-27 22:47 Matt
  2011-10-28 10:01 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Matt @ 2011-10-27 22:47 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Maxim Kuvyrkov, GCC Patches

>>> Then you'd have to analyze the compile-time impact of the IPA
>>> splitting on its own when not iterating.  Then you should look
>>> at what optimizations were actually performed
>>> that led to the improvement (I can see some indirect inlining
>>> happening, but everything else would be a bug in present
>>> optimizers in the early pipeline - they are all designed to be
>>> roughly independent of each other and _not_ expose new
>>> opportunities by iteration).  Thus - testcases?
>>
>> The initial motivation for the patch was to enable more indirect
>> inlining and devirtualization opportunities.

> Hm.

It is the proprietary codebase of my employer that these optimizations 
were developed for. Multiple iterations specifically help propagate the 
concrete type information from functions that implement the 
Abstract Factory design pattern, allowing for cleaner runtime dynamic 
dispatch. I can verify that in said codebase (and in the reduced, 
non-proprietary examples Maxim provided earlier in the year) it works 
quite effectively.

Many of the devirt examples focus on a pure top-down approach like this:
class I { virtual void f() = 0; };
class K : public I { virtual void f() {} };
class L: public I { virtual void f() {} };
void g(I& i) { i.f(); }
int main(void) { L l; g(l); return 0; }

While that strategy isn't unheard of, it implies a link-time substitution 
to inject new/different sub-classes of the parameterized interface. 
Besides limiting extensibility by requiring a rebuild/relink, it also 
presupposes that two different implementations would be mutually exclusive 
for that module. That is often not the case, hence the factory pattern 
expressed in the other examples Maxim provided.
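
To make that concrete, here is a minimal factory-style sketch (purely 
illustrative -- not one of Maxim's actual reduced testcases) of the shape 
I'm describing:

class I { public: virtual ~I() {} virtual void f() = 0; };
class K : public I { public: virtual void f() {} };
class L : public I { public: virtual void f() {} };
/* Both implementations coexist in the module; the concrete type is
   chosen inside the factory, not at the call site. */
I* make_impl() { return new K; }
void g(I* i) { i->f(); }
int main(void)
{
  I* i = make_impl();
  /* The i->f() call in g() only devirtualizes once the concrete type
     returned by make_impl() has been propagated into its callers --
     which is where an extra iteration of the early passes helps. */
  g(i);
  delete i;
  return 0;
}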

>> Since then I found the patch to be helpful in searching for 
>> optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 
>> 20% with 2 additional iterations of early optimizations [*].  Given that 
>> applying more optimizations should, theoretically, not decrease 
>> performance, there is likely a very real bug or deficiency behind that.

> It is likely early SRA that messes up, or maybe convert switch.  Early
> passes should be really restricted to always profitable cleanups.

> Your experiment looks useful to track down these bugs, but in general
> I don't think we want to expose iterating early passes.

In the other, more top-down examples of devirt I mention above, I agree 
with you. Once the CFG is ordered and the analyses happen, things should 
be propagated forward without issue. In the case of factory functions, my 
understanding and experience on this real-world codebase is that multiple 
passes are required: first, to "bubble up" the concrete type info coming 
out of the factory function (depending on how many layers there are, this 
may take a couple of iterations); second, to then forward propagate that 
concrete type information for the pointer.

There was a surprising side-effect when I started experimenting with this 
ipa-passes feature. In a module that contains ~100KLOC, I implemented 
mega-compilation (a poor-man's LTO). At two passes, the module got larger, 
which I expected. This minor growth continued with each additional pass, 
until, at about 7 passes, it decreased by over 10%. I set up a script 
to run overnight to incrementally try passes and record the module size, 
and the "sweet spot" ended up being 54 passes in terms of size. I took the 
three smallest binaries and did a full performance regression run at the 
system level, and the smallest binary's inclusion resulted in an ~6% 
performance improvement (measured as overall network I/O throughput) while 
using less CPU on a Transmeta Crusoe-based appliance. (This is a web 
proxy, with about 500KLOC of other code that was not compiled in this new 
way.)

The idea of multiple passes resulting in a smaller binary and higher 
performance was like a dream. I reproduced a similar pattern on open 
source projects, namely scummvm (on which I was able to use proper LTO)*. 
That is, smaller binaries resulted as well as decreased CPU usage. On some 
projects, this could possibly be correlated with micro-level benchmarks 
such as reduced branch mispredictions and L1 cache misses as reported by 
callgrind.

While it's possible/probable that some of the performance improvements I 
saw by increasing ipa-passes were ultimately missed-optimization bugs that 
should be fixed, I'd be very surprised if *all* of those improvements were 
due to such bugs. As such, I would still like to see this exposed. I would 
be happy to file bugs and help test any instances where it looks like an 
optimization should have been achieved within a single ipa-pass.


Thanks for helping to get this feature (and the other devirt-related 
pieces) into 4.7 -- it's been a huge boon to improving our C++ designs 
without sacrificing performance.


* Note that scummvm's "sweet spot" number of iterations was 
different. That being said, the default of three iterations to make the 
typical use of the Factory pattern devirtualize correctly still resulted in 
improved performance over a single pass -- just not necessarily a smaller 
binary.



--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-27 22:47 [PATCH] Add capability to run several iterations of early optimizations Matt
@ 2011-10-28 10:01 ` Richard Guenther
  2011-10-28 22:30   ` Matt
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-10-28 10:01 UTC (permalink / raw)
  To: Matt; +Cc: Maxim Kuvyrkov, GCC Patches

On Thu, Oct 27, 2011 at 11:53 PM, Matt <matt@use.net> wrote:
>>>> Then you'd have to analyze the compile-time impact of the IPA
>>>> splitting on its own when not iterating.  Then you should look
>>>> at what optimizations were actually performed
>>>> that led to the improvement (I can see some indirect inlining
>>>> happening, but everything else would be a bug in present
>>>> optimizers in the early pipeline - they are all designed to be
>>>> roughly independent of each other and _not_ expose new
>>>> opportunities by iteration).  Thus - testcases?
>>>
>>> The initial motivation for the patch was to enable more indirect
>>> inlining and devirtualization opportunities.
>
>> Hm.
>
> It is the proprietary codebase of my employer that these optimizations were
> developed for. Multiple iterations specifically help propagate the concrete
> type information from functions that implement the Abstract Factory design
> pattern, allowing for cleaner runtime dynamic dispatch. I can verify that in
> said codebase (and in the reduced, non-proprietary examples Maxim provided
> earlier in the year) it works quite effectively.
>
> Many of the devirt examples focus on a pure top-down approach like this:
> class I { virtual void f() = 0; };
> class K : public I { virtual void f() {} };
> class L: public I { virtual void f() {} };
> void g(I& i) { i.f(); }
> int main(void) { L l; g(l); return 0; }
>
> While that strategy isn't unheard of, it implies a link-time substitution to
> inject new/different sub-classes of the parameterized interface. Besides
> limiting extensibility by requiring a rebuild/relink, it also presupposes
> that two different implementations would be mutually exclusive for that
> module. That is often not the case, hence the factory pattern expressed in
> the other examples Maxim provided.
>
>>> Since then I found the patch to be helpful in searching for
>>> optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 20%
>>> with 2 additional iterations of early optimizations [*].  Given that
>>> applying more optimizations should, theoretically, not decrease performance,
>>> there is likely a very real bug or deficiency behind that.
>
>> It is likely early SRA that messes up, or maybe convert switch.  Early
>> passes should be really restricted to always profitable cleanups.
>
>> Your experiment looks useful to track down these bugs, but in general
>> I don't think we want to expose iterating early passes.
>
> In the other, more top-down examples of devirt I mention above, I agree
> with you. Once the CFG is ordered and the analyses happen, things should be
> propagated forward without issue. In the case of factory functions, my
> understanding and experience on this real-world codebase is that multiple
> passes are required: first, to "bubble up" the concrete type info coming out
> of the factory function (depending on how many layers there are, this may
> take a couple of iterations); second, to then forward propagate that
> concrete type information for the pointer.
>
> There was a surprising side-effect when I started experimenting with this
> ipa-passes feature. In a module that contains ~100KLOC, I implemented
> mega-compilation (a poor-man's LTO). At two passes, the module got larger,
> which I expected. This minor growth continued with each additional pass,
> until, at about 7 passes, it decreased by over 10%. I set up a script to
> run overnight to incrementally try passes and record the module size, and
> the "sweet spot" ended up being 54 passes in terms of size. I took the three
> smallest binaries and did a full performance regression run at the system
> level, and the smallest binary's inclusion resulted in an ~6% performance
> improvement (measured as overall network I/O throughput) while using less
> CPU on a Transmeta Crusoe-based appliance. (This is a web proxy, with about
> 500KLOC of other code that was not compiled in this new way.)
>
> The idea of multiple passes resulting in a smaller binary and higher
> performance was like a dream. I reproduced a similar pattern on open source
> projects, namely scummvm (on which I was able to use proper LTO)*. That is,
> smaller binaries resulted as well as decreased CPU usage. On some projects,
> this could possibly be correlated with micro-level benchmarks such as
> reduced branch mispredictions and L1 cache misses as reported by callgrind.
>
> While it's possible/probable that some of the performance improvements I saw
> by increasing ipa-passes were ultimately missed-optimization bugs that
> should be fixed, I'd be very surprised if *all* of those improvements were
> due to such bugs. As such, I would still like to see this exposed. I would be
> happy to file bugs and help test any instances where it looks like an
> optimization should have been achieved within a single ipa-pass.

I discussed the idea of iterating early optimizations shortly with Honza.
I was trying to step back a bit and look at what we try to do right now,
which is, optimize functions in topological order (thus, try to make sure
all callees are already early optimized when optimizing callers).  That
of course is difficult when there are cycles in the cgraph (for which we
basically start at a random place), especially when there would be
may-edges (which we do not have) for indirect/virtual calls, as they
basically make the whole cgraph cyclic.  So my idea was to make
cycle processing more explicit in early optimizations, and, whenever
we discover a new direct cgraph edge make sure we optimize the
callee, and whenever we optimized a callee queue all callers for
re-optimization.  You of course have to limit the number of times you
want to process a function, otherwise for a cycle, you'd optimize
indefinitely.  We already do the inlining itself repeatedly (via
--param early-inliner-max-iterations), though that only iterates
the inlining itself, allowing for "deep" inlining and some cases
of inlining of indirect calls if the inliner substituted a function
pointer parameter into a call of the inlined function.  Moving that
iteration over to iterating over the optimizations could make sense.
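
To illustrate the kind of worklist scheme I mean (a rough sketch only --
the types and names below are hypothetical, not the actual cgraph
interfaces):

#include <queue>
#include <vector>

/* Hypothetical stand-in for a cgraph node.  */
struct func
{
  std::vector<func *> callers;   /* direct callers known so far */
  int processed;                 /* times early opts already ran on it */
  func () : processed (0) {}
  /* Run the early local passes; return true if something changed that
     callers might care about, e.g. a new direct edge was discovered.  */
  bool run_early_local_passes () { return false; /* placeholder */ }
};

static void
iterate_early_opts (const std::vector<func *> &topo_order, int limit)
{
  std::queue<func *> work;
  for (size_t i = 0; i < topo_order.size (); i++)
    work.push (topo_order[i]);           /* callees first, as today */
  while (!work.empty ())
    {
      func *f = work.front ();
      work.pop ();
      if (f->processed++ >= limit)       /* cap re-processing so cycles terminate */
        continue;
      if (f->run_early_local_passes ())
        for (size_t i = 0; i < f->callers.size (); i++)
          work.push (f->callers[i]);     /* queue callers for re-optimization */
    }
}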

Thus, I'd really like to at least make iterating depend on some
profitability analysis, even if it is only based on cgraph analysis
such as 'we discovered a new direct edge'.

Richard.

>
> Thanks for helping to get this feature (and the other devirt-related pieces)
> into 4.7 -- it's been a huge boon to improving our C++ designs without
> sacrificing performance.
>
>
> * Note that scummvm's "sweet spot" number of iterations was different.
> That being said, the default of three iterations to make the typical use of
> the Factory pattern devirtualize correctly still resulted in improved
> performance over a single pass -- just not necessarily a smaller binary.
>
>
>
> --
> tangled strands of DNA explain the way that I behave.
> http://www.clock.org/~matt
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-28 10:01 ` Richard Guenther
@ 2011-10-28 22:30   ` Matt
  0 siblings, 0 replies; 15+ messages in thread
From: Matt @ 2011-10-28 22:30 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Maxim Kuvyrkov, GCC Patches

On Fri, 28 Oct 2011, Richard Guenther wrote:

> I discussed the idea of iterating early optimizations shortly with Honza.
> I was trying to step back a bit and look at what we try to do right now,
> which is, optimize functions in topological order (thus, try to make sure
> all callees are already early optimized when optimizing callers).  That
> of course is difficult when there are cycles in the cgraph (for which we
> basically start at a random place), especially when there would be
> may-edges (which we do not have) for indirect/virtual calls, as they
> basically make the whole cgraph cyclic.  So my idea was to make
> cycle processing more explicit in early optimizations, and, whenever
> we discover a new direct cgraph edge make sure we optimize the
> callee, and whenever we optimized a callee queue all callers for
> re-optimization.  You of course have to limit the number of times you
> want to process a function, otherwise for a cycle, you'd optimize
> indefinitely.  We already do the inlining itself repeatedly (via
> --param early-inliner-max-iterations), though that only iterates
> the inlining itself, allowing for "deep" inlining and some cases
> of inlining of indirect calls if the inliner substituted a function
> pointer parameter into a call of the inlined function.  Moving that
> iteration over to iterating over the optimizations could make sense.

Inverting the patch's current approach makes sense to me, and may be 
preferable from a usability perspective. That is, since each iteration 
increases compile time, the user is indirectly specifying how long they're 
willing to give the compiler to get the best code. That being said, I'd be 
curious what the "maximum" number of iterations would be on something like 
Firefox/WebKit when compiled with LTO. On the proprietary codebase we 
tested on, there were still things being discovered at 120+ iterations 
(~100KLOC, compiled with mega-compilation aka the poor-man's LTO).

This could actually speed things up quite a bit if it means that we would 
only re-run the early passes on the functions whose call-chain had some 
optimization applied. That is, the parts of the code being 
re-analyzed/optimized would narrow with each iteration, potentially 
reducing the overhead of multiple passes in the first place.


> Thus, I'd really like to at least make iterating depend on some
> profitability analysis, even if it is only based on cgraph analysis
> such as 'we discovered a new direct edge'.

That makes sense to me, but then again I'm not implementing it ;) FWIW, I 
implemented a similar rule in a static analysis product I wrote a few 
years ago -- if no new information was discovered on a given bottom-up 
pass, move on to the next analysis.

Thanks again for taking the time to work through this!


--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-11-08  7:23                 ` Maxim Kuvyrkov
@ 2011-11-08 11:18                   ` Richard Guenther
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Guenther @ 2011-11-08 11:18 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Matt, GCC Patches

On Tue, Nov 8, 2011 at 7:34 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> On 2/11/2011, at 10:18 AM, Richard Guenther wrote:
>>
>>>>>  Thus, I don't think we want to
>>>>> merge this in its current form or in this stage1.
>>>>
>>>> What is the benefit of pushing this to a later release?  If anything,
>>>> merging the support for iterative optimizations now will allow us to
>>>> consider adding the wonderful smartness to it later.  In the meantime,
>>>> substituting that smartness with a knob is still a great alternative.
>>
>> The benefit?  The benefit is to not have another magic knob in there
>> that doesn't make too much sense and papers over real conceptual/algorithmic
>> issues.  Brute-force iterating is a hack, not a solution. (sorry)
>>
>>> I agree (of course). Having the knob will be very useful for testing and
>>> determining the acceptance criteria for the later "smartness". While
>>> terminating early would be a nice optimization, the feature is still
>>> intrinsically useful and deployable without it. In addition, when using LTO
>>> on nearly all the projects/modules I tested on, 3+ passes were always
>>> productive.
>>
>> If that is true then I'd really like to see testcases.  Because I am sure you
>> are just papering over (maybe even easy to fix) issues by the brute-force
>> iterating approach.  We also do not have a switch to run every pass twice
>> in succession, just because that would be as stupid as this.
>>
>>> To be fair, when not using LTO, beyond 2-3 passes did not often
>>> produce improvements unless individual compilation units were enormous.
>>>
>>> There was also the question of if some of the improvements seen with
>>> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
>>> etc. If the knob is available, I'm happy to continue testing on the same
>>> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
>>> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
>>> improvements that perhaps should be caught/made in a single pass.
>>>
>>> It's worth noting again that while this is a useful feature in and of itself
>>> (especially when combined with LTO), it's *extremely* useful when coupled
>>> with the de-virtualization improvements submitted in other threads. The
>>> examples submitted for inclusion in the test suite aren't academic -- they
>>> are reductions of real-world performance issues from a mature (and shipping)
>>> C++-based networking product. Any C++ codebase that employs physical
>>> separation in their designs via Factory patterns, Interface Segregation,
>>> and/or Dependency Inversion will likely see improvements. To me, these
>>> enhancements combine to form one of the biggest leaps I've seen in C++ code
>>> optimization -- code that can be clean, OO, *and* fast.
>>
>> But iterating the whole early optimization pipeline is not a sensible approach
>> of attacking these.
>
> Richard,
>
> You are making valid points.  Ideally, we would have a compiler with an optimal pass pipeline that would not generate any better or worse code with second and subsequent iterations.  I disagree, however, that the patch discussed in this thread is a hack.  It is a feature.  It allows developers and power-users to experiment with the pass pipeline and see what comes out of it.
>
> The benchmarks for the 2nd version of the patch have finally finished [*].  They show that the additional iterations do make a difference of 2% for a number of benchmarks.  There are also trends like "additional iterations improve -m64 and hurt -m32" -- 254.gap, 256.bzip2, 300.twolf, 171.swim, 173.applu -- and "additional iterations improve -O2 and hurt -O3" -- 179.art, 400.perlbench.  There are also enough benchmarks that show straight improvement or straight regression.  How are we going to investigate and fix these missed optimization opportunities and regressions without the feature that we are discussing here?
>
> We already provide certain features to enable users to tinker with the optimization pipeline:
> 1. -O0/-O1/-O2/-O3 options for entry-level users,
> 2. -f<pass>/-fno-<pass> for advanced users,
> 3. --param <obscure>=MAGIC for developers and compiler-savvy users.
>
> Iterative optimizations should be no different.  "Yes", without this feature the world would be a simpler place for us to live in, as we would not have to know and care about regressions and missed opportunities in GCC's optimizations.  But, really, that would've been a boring world :-).
>
> I feel that dropping this feature will cause the problems to stay unnoticed and never get fixed.

How?  You are not even _analyzing_ why you are getting benefits, or
filing bugs.  Did you experiment with reducing the set of passes you
iterate?  Thus, which passes really do help when iterating?  I've
asked you that before - and showed proper analysis that could be
done to detect the indirect inlining benefit that you might see (surely
not on SPEC 2000 though) - and said that yes, it _would_ make sense
to do something about this specific case and yes, iterating for a selected
set of functions is a valid solution for the indirect inlining case.
It is never a valid solution, IMHO, when no inlining happened between
the iterations.

You are basically saying, well - let's iterate optimizers some time
and hope they will catch something.  Sure they will.  Show the important
pieces and analyze why they need the iteration.

You probably should realize that with each of these "power-user" features
the GCC testing matrix becomes larger.  If I sold commercial support
for GCC, I'd make all but -O* unsupported.

Thanks,
Richard.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-11-01 21:33               ` Richard Guenther
@ 2011-11-08  7:23                 ` Maxim Kuvyrkov
  2011-11-08 11:18                   ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-11-08  7:23 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Matt, GCC Patches

On 2/11/2011, at 10:18 AM, Richard Guenther wrote:
> 
>>>>  Thus, I don't think we want to
>>>> merge this in its current form or in this stage1.
>>> 
>>> What is the benefit of pushing this to a later release?  If anything,
>>> merging the support for iterative optimizations now will allow us to
>>> consider adding the wonderful smartness to it later.  In the meantime,
>>> substituting that smartness with a knob is still a great alternative.
> 
> The benefit?  The benefit is to not have another magic knob in there
> that doesn't make too much sense and papers over real conceptual/algorithmic
> issues.  Brute-force iterating is a hack, not a solution. (sorry)
> 
>> I agree (of course). Having the knob will be very useful for testing and
>> determining the acceptance criteria for the later "smartness". While
>> terminating early would be a nice optimization, the feature is still
>> intrinsically useful and deployable without it. In addition, when using LTO
>> on nearly all the projects/modules I tested on, 3+ passes were always
>> productive.
> 
> If that is true then I'd really like to see testcases.  Because I am sure you
> are just papering over (maybe even easy to fix) issues by the brute-force
> iterating approach.  We also do not have a switch to run every pass twice
> in succession, just because that would be as stupid as this.
> 
>> To be fair, when not using LTO, beyond 2-3 passes did not often
>> produce improvements unless individual compilation units were enormous.
>> 
>> There was also the question of if some of the improvements seen with
>> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
>> etc. If the knob is available, I'm happy to continue testing on the same
>> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
>> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
>> improvements that perhaps should be caught/made in a single pass.
>> 
>> It's worth noting again that while this is a useful feature in and of itself
>> (especially when combined with LTO), it's *extremely* useful when coupled
>> with the de-virtualization improvements submitted in other threads. The
>> examples submitted for inclusion in the test suite aren't academic -- they
>> are reductions of real-world performance issues from a mature (and shipping)
>> C++-based networking product. Any C++ codebase that employs physical
>> separation in their designs via Factory patterns, Interface Segregation,
>> and/or Dependency Inversion will likely see improvements. To me, these
>> enhancements combine to form one of the biggest leaps I've seen in C++ code
>> optimization -- code that can be clean, OO, *and* fast.
> 
> But iterating the whole early optimization pipeline is not a sensible approach
> of attacking these.

Richard,

You are making valid points.  Ideally, we would have a compiler with an optimal pass pipeline that would not generate any better or worse code with second and subsequent iterations.  I disagree, however, that the patch discussed in this thread is a hack.  It is a feature.  It allows developers and power-users to experiment with the pass pipeline and see what comes out of it.

The benchmarks for the 2nd version of the patch have finally finished [*].  They show that the additional iterations do make a difference of 2% for a number of benchmarks.  There are also trends like "additional iterations improve -m64 and hurt -m32" -- 254.gap, 256.bzip2, 300.twolf, 171.swim, 173.applu -- and "additional iterations improve -O2 and hurt -O3" -- 179.art, 400.perlbench.  There are also enough benchmarks that show straight improvement or straight regression.  How are we going to investigate and fix these missed optimization opportunities and regressions without the feature that we are discussing here?

We already provide certain features to enable users to tinker with the optimization pipeline:
1. -O0/-O1/-O2/-O3 options for entry-level users,
2. -f<pass>/-fno-<pass> for advanced users,
3. --param <obscure>=MAGIC for developers and compiler-savvy users.

Iterative optimizations should be no different.  "Yes", without this feature the world would be a simpler place for us to live in, as we would not have to know and care about regressions and missed opportunities in GCC's optimizations.  But, really, that would've been a boring world :-).

I feel that dropping this feature will cause the problems to stay unnoticed and never get fixed.

[*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-29  0:10             ` Matt
  2011-11-01 20:48               ` Martin Jambor
@ 2011-11-01 21:33               ` Richard Guenther
  2011-11-08  7:23                 ` Maxim Kuvyrkov
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-11-01 21:33 UTC (permalink / raw)
  To: Matt; +Cc: Maxim Kuvyrkov, GCC Patches

On Sat, Oct 29, 2011 at 1:06 AM, Matt <matt@use.net> wrote:
> On Sat, 29 Oct 2011, Maxim Kuvyrkov wrote:
>
>>> I like this variant a lot better than the last one - still it lacks any
>>> analysis-based justification for iteration (see my reply to Matt on
>>> what I discussed with Honza).
>>
>> Yes, having a way to tell whether a function has significantly changed
>> would be awesome.  My approach here would be to make inline_parameters
>> output feedback on how much the size/time metrics have changed for a
>> function since the previous run.  If the change is above X%, then queue
>> the function's callers for more optimizations.  Similarly, Martin's
>> rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue new
>> direct callees and the current function for another iteration if new direct
>> edges were resolved.
>
> Figuring out the heuristic will need decent testing on a few projects to
> figure out what the "sweet spot" is (smallest binary for time/passes spent)
> for that given codebase. With a few data points, a reasonable stab at the
> metrics you mention can be had that would not terminate the iterations
> before the known optimal number of passes. Without those data points, it
> seems like making sure the metrics allow those "sweet spots" to be attained
> will be difficult.

Well, sure - the same like with inlining heuristics.

>>>  Thus, I don't think we want to
>>> merge this in its current form or in this stage1.
>>
>> What is the benefit of pushing this to a later release?  If anything,
>> merging the support for iterative optimizations now will allow us to
>> consider adding the wonderful smartness to it later.  In the meantime,
>> substituting that smartness with a knob is still a great alternative.

The benefit?  The benefit is to not have another magic knob in there
that doesn't make too much sense and papers over real conceptual/algorithmic
issues.  Brute-force iterating is a hack, not a solution. (sorry)

> I agree (of course). Having the knob will be very useful for testing and
> determining the acceptance criteria for the later "smartness". While
> terminating early would be a nice optimization, the feature is still
> intrinsically useful and deployable without it. In addition, when using LTO
> on nearly all the projects/modules I tested on, 3+ passes were always
> productive.

If that is true then I'd really like to see testcases.  Because I am sure you
are just papering over (maybe even easy to fix) issues by the brute-force
iterating approach.  We also do not have a switch to run every pass twice
in succession, just because that would be as stupid as this.

> To be fair, when not using LTO, beyond 2-3 passes did not often
> produce improvements unless individual compilation units were enormous.
>
> There was also the question of if some of the improvements seen with
> multiple passes were indicative of deficiencies in early inlining, CFG, SRA,
> etc. If the knob is available, I'm happy to continue testing on the same
> projects I've filed recent LTO/graphite bugs against (glib, zlib, openssl,
> scummvm, binutils, etc) and write a report on what I observe as "suspicious"
> improvements that perhaps should be caught/made in a single pass.
>
> It's worth noting again that while this is a useful feature in and of itself
> (especially when combined with LTO), it's *extremely* useful when coupled
> with the de-virtualization improvements submitted in other threads. The
> examples submitted for inclusion in the test suite aren't academic -- they
> are reductions of real-world performance issues from a mature (and shipping)
> C++-based networking product. Any C++ codebase that employs physical
> separation in their designs via Factory patterns, Interface Segregation,
> and/or Dependency Inversion will likely see improvements. To me, these
> enhancements combine to form one of the biggest leaps I've seen in C++ code
> optimization -- code that can be clean, OO, *and* fast.

But iterating the whole early optimization pipeline is not a sensible approach
of attacking these.

Richard.

> Richard: If there's any additional testing or information I can reasonably
> provide to help get this in for this stage1, let me know.
>
> Thanks!
>
>
> --
> tangled strands of DNA explain the way that I behave.
> http://www.clock.org/~matt
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-29  0:10             ` Matt
@ 2011-11-01 20:48               ` Martin Jambor
  2011-11-01 21:33               ` Richard Guenther
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Jambor @ 2011-11-01 20:48 UTC (permalink / raw)
  To: Matt; +Cc: Maxim Kuvyrkov, Richard Guenther, GCC Patches

Hi,

On Fri, Oct 28, 2011 at 04:06:20PM -0700, Matt wrote:
>
...

> 
> I agree (of course). Having the knob will be very useful for testing
> and determining the acceptance criteria for the later "smartness".
> While terminating early would be a nice optimization, the feature is
> still intrinsically useful and deployable without it. In addition,
> when using LTO on nearly all the projects/modules I tested on, 3+
> passes were always productive. To be fair, when not using LTO,
> beyond 2-3 passes did not often produce improvements unless
> individual compilation units were enormous.

I'm quite surprised you get extra benefit with LTO since early
optimizations are exactly the part of the middle-end which should produce
the same results, LTO or not.  So the only way I can imagine this can
happen is that inlining analysis somehow gets a much better input and
then can make much bigger use of it.  If this is because of extra
early inlining, we might try to catch these situations when
making IPA inlining decisions, which would work regardless of any
iteration-number cut-off.  If it is because of something else, it's
probably better to (at least try to) tweak the passes and/or inlining
analysis to understand each other straight away.

> 
> There was also the question of if some of the improvements seen with
> multiple passes were indicative of deficiencies in early inlining,
> CFG, SRA, 

SRA, because it is not flow-sensitive in any way, unfortunately
sometimes produces useless statements which then need to be cleaned up
by forwprop (and possibly dse and others).  We've already talked with
Richi about this and agreed the early one should be dumbed down a
little to produce much less of these.  I'm afraid I won't be able to
submit a patch doing that during this stage 1, though.

> etc. If the knob is available, I'm happy to continue
> testing on the same projects I've filed recent LTO/graphite bugs
> against (glib, zlib, openssl, scummvm, binutils, etc) and write a
> report on what I observe as "suspicious" improvements that perhaps
> should be caught/made in a single pass.
> 
> It's worth noting again that while this is a useful feature in and
> of itself (especially when combined with LTO), it's *extremely*
> useful when coupled with the de-virtualization improvements
> submitted in other threads. The examples submitted for inclusion in
> the test suite aren't academic -- they are reductions of real-world
> performance issues from a mature (and shipping) C++-based networking
> product. Any C++ codebase that employs physical separation in their
> designs via Factory patterns, Interface Segregation, and/or
> Dependency Inversion will likely see improvements. To me, these
> enahncements combine to form one of the biggest leaps I've seen in
> C++ code optimization -- code that can be clean, OO, *and* fast.

Well, while I'd understand that whenever there is a new direct call
graph edge, trying early inlining again might help or save some work
for the full inlining, I think that we should try to enhance
the current IPA infrastructure rather than grow another one from the
early optimizations, especially if we aim at LTO - iterating early
optimizations will not help reduce abstraction if it is spread across
a number of compilation units.

Martin

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-28 23:07           ` Maxim Kuvyrkov
@ 2011-10-29  0:10             ` Matt
  2011-11-01 20:48               ` Martin Jambor
  2011-11-01 21:33               ` Richard Guenther
  0 siblings, 2 replies; 15+ messages in thread
From: Matt @ 2011-10-29  0:10 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: Richard Guenther, GCC Patches

On Sat, 29 Oct 2011, Maxim Kuvyrkov wrote:

>> I like this variant a lot better than the last one - still it lacks any
>> analysis-based justification for iteration (see my reply to Matt on
>> what I discussed with Honza).
>
>> Yes, having a way to tell whether a function has significantly changed 
>> would be awesome.  My approach here would be to make inline_parameters 
>> output feedback on how much the size/time metrics have changed for a 
>> function since the previous run.  If the change is above X%, then queue 
>> the function's callers for more optimizations.  Similarly, Martin's 
>> rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue 
>> new direct callees and the current function for another iteration if new 
>> direct edges were resolved.

Figuring out the heuristic will need decent testing on a few projects to 
figure out what the "sweet spot" is (smallest binary for time/passes 
spent) for that given codebase. With a few data points, a reasonable stab 
at the metrics you mention can be had that would not terminate the 
iterations before the known optimal number of passes. Without those data 
points, it seems like making sure the metrics allow those "sweet spots" to 
be attained will be difficult.

>>  Thus, I don't think we want to
>> merge this in its current form or in this stage1.
>
> What is the benefit of pushing this to a later release?  If anything, 
> merging the support for iterative optimizations now will allow us to 
> consider adding the wonderful smartness to it later.  In the meantime, 
> substituting that smartness with a knob is still a great alternative.

I agree (of course). Having the knob will be very useful for testing and 
determining the acceptance criteria for the later "smartness". While 
terminating early would be a nice optimization, the feature is still 
intrinsically useful and deployable without it. In addition, when using 
LTO on nearly all the projects/modules I tested on, 3+ passes were 
always productive. To be fair, when not using LTO, beyond 2-3 passes did 
not often produce improvements unless individual compilation units were 
enormous.

There was also the question of if some of the improvements seen with 
multiple passes were indicative of deficiencies in early inlining, CFG, 
SRA, etc. If the knob is available, I'm happy to continue testing on the 
same projects I've filed recent LTO/graphite bugs against (glib, zlib, 
openssl, scummvm, binutils, etc) and write a report on what I observe as 
"suspicious" improvements that perhaps should be caught/made in a single 
pass.

It's worth noting again that while this is a useful feature in and of 
itself (especially when combined with LTO), it's *extremely* useful when 
coupled with the de-virtualization improvements submitted in other 
threads. The examples submitted for inclusion in the test suite aren't 
academic -- they are reductions of real-world performance issues from a 
mature (and shipping) C++-based networking product. Any C++ codebase that 
employs physical separation in their designs via Factory patterns, 
Interface Segregation, and/or Dependency Inversion will likely see 
improvements. To me, these enhancements combine to form one of the biggest 
leaps I've seen in C++ code optimization -- code that can be clean, OO, 
*and* fast.

Richard: If there's any additional testing or information I can reasonably 
provide to help get this in for this stage1, let me know.

Thanks!


--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-28 11:12         ` Richard Guenther
@ 2011-10-28 23:07           ` Maxim Kuvyrkov
  2011-10-29  0:10             ` Matt
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-10-28 23:07 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches, Matt

On 28/10/2011, at 11:43 PM, Richard Guenther wrote:

> On Fri, Oct 28, 2011 at 1:05 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>> 
>> OK for trunk?
> 
> diff --git a/gcc/cgraph.c b/gcc/cgraph.c
> index f056d3d..4738b28 100644
> --- a/gcc/cgraph.c
> +++ b/gcc/cgraph.c
> @@ -2416,7 +2416,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
>            tree_lowering_passes (fndecl);
>            bitmap_obstack_initialize (NULL);
>            if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
> -             execute_pass_list (pass_early_local_passes.pass.sub);
> +             execute_early_local_passes_for_current_function ();
>            bitmap_obstack_release (NULL);
>            pop_cfun ();
>            current_function_decl = NULL;
> @@ -2441,7 +2441,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
>        gimple_register_cfg_hooks ();
>        bitmap_obstack_initialize (NULL);
>        if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
> -         execute_pass_list (pass_early_local_passes.pass.sub);
> +         execute_early_local_passes_for_current_function ();
> 
> I think these should only execute the lowering pieces of early local passes,
> let me see if that's properly split ...
> 
> @@ -255,7 +255,7 @@ cgraph_process_new_functions (void)
>              /* When not optimizing, be sure we run early local passes anyway
>                 to expand OMP.  */
>              || !optimize)
> -           execute_pass_list (pass_early_local_passes.pass.sub);
> +           execute_early_local_passes_for_current_function ();
> 
> similar.

In the case of tree_lowering_passes, I agree.  But with the calls to early_local_passes from cgraph_add_new_function/cgraph_process_new_functions we ought to run the full list of optimizations to make the bodies of new functions ready for inlining into existing functions.  Note that cgraph_add_new_function() explicitly calls tree_lowering_passes and then invokes early_local_passes.

What do you think?  If we want to pursue this, it should be in a separate patch anyway.

> 
> About all this -suffix stuff, I'd like to have the iterations simply re-use
> the existing dump-files, thus statically sub-divide pass_early_local_passes
> like

I considered this when I was implementing and debugging iterative support, and it is better to keep dumps for main and iterative parts separate.  These are two different IPA passes, and their sub-passes should use separate dumps.

Given that the set of sub-passes of early_local_passes_main is substantially different from the set of sub-passes of early_local_passes_iter (the former has a bunch of init/expand/lower passes in it), mixing the dumps will be mighty confusing.  Without knowing the exact ordering of passes one will not be able to construct a chronological picture of optimizations.  Separating dumps of sub-passes of different IPA passes simplifies understanding of optimization ordering.

> 
> NEXT_PASS (pass_early_local_lowering_passes);
>  {
>      NEXT_PASS (pass_fixup_cfg);
>      NEXT_PASS (pass_init_datastructures);
>      NEXT_PASS (pass_expand_omp);
> 
>      NEXT_PASS (pass_referenced_vars);
>      NEXT_PASS (pass_build_ssa);
>      NEXT_PASS (pass_lower_vector);
>      NEXT_PASS (pass_early_warn_uninitialized);
>  }
> /* The following you maybe iterate.  */
> NEXT_PASS (pass_early_local_optimizations);
>  {
>      NEXT_PASS (pass_rebuild_cgraph_edges);
>      NEXT_PASS (pass_inline_parameters);
>      NEXT_PASS (pass_early_inline);
>      NEXT_PASS (pass_all_early_optimizations);
>        {
>          struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
>          NEXT_PASS (pass_remove_cgraph_callee_edges);
>          NEXT_PASS (pass_rename_ssa_copies);
>          NEXT_PASS (pass_ccp);
>          NEXT_PASS (pass_forwprop);
>          /* pass_build_ealias is a dummy pass that ensures that we
>             execute TODO_rebuild_alias at this point.  Re-building
>             alias information also rewrites no longer addressed
>             locals into SSA form if possible.  */
>          NEXT_PASS (pass_build_ealias);
>          NEXT_PASS (pass_sra_early);
>          NEXT_PASS (pass_fre);
>          NEXT_PASS (pass_copy_prop);
>          NEXT_PASS (pass_merge_phi);
>          NEXT_PASS (pass_cd_dce);
>          NEXT_PASS (pass_early_ipa_sra);
>          NEXT_PASS (pass_tail_recursion);
>          NEXT_PASS (pass_convert_switch);
>          NEXT_PASS (pass_cleanup_eh);
>          NEXT_PASS (pass_profile);
>          NEXT_PASS (pass_local_pure_const);
>       }
>   }
> NEXT_PASS (pass_late_early_local_optimizations)
>  {
>          /* Split functions creates parts that are not run through
>             early optimizations again.  It is thus good idea to do this
>             late.  */
>          NEXT_PASS (pass_split_functions);
>      NEXT_PASS (pass_release_ssa_names);
>      NEXT_PASS (pass_rebuild_cgraph_edges);
>      NEXT_PASS (pass_inline_parameters);
>  }

This simple division into 3 stages will not work.  On one hand, one needs to add fixup/cleanup passes at the beginning and end of every SIMPLE_IPA pass; without those cgraph_verify() gets upset very easily.  On the other hand, when iterative optimizations are not enabled, and there is no need to decouple main and late parts, these additional fixup/cleanup passes would be unnecessary overhead.

Also, in the above division you effectively make 3 separate SIMPLE_IPA passes, and I don't know what the effect of that will be.

The pass formation and scheduling in the patch I posted may not look as pretty as what you describe above, but it is reasonably clean and easy to maintain.  One needs to change only passes in the main part, and those changes will be automatically picked up in iterative and late parts.

> 
> +++ b/gcc/toplev.c
> @@ -1228,7 +1228,6 @@ general_init (const char *argv0)
>   /* This must be done after global_init_params but before argument
>      processing.  */
>   init_ggc_heuristics();
> -  init_optimization_passes ();
>   statistics_early_init ();
>   finish_params ();
> }
> @@ -1989,6 +1988,8 @@ toplev_main (int argc, char **argv)
>                  save_decoded_options, save_decoded_options_count,
>                  UNKNOWN_LOCATION, global_dc);
> 
> +  init_optimization_passes ();
> +
>   handle_common_deferred_options ();
> 
>   init_local_tick ();
> 
> any reason for this?

Yes, this moves init_optimization_passes() after processing of --param options.  This is needed to be able to use PARAM_VALUE in passes initialization.

> 
> +  /* Don't recurse or wonder on to the next pass when running
> +     execute_ipa_pass_list below.  */
> +  current->execute = NULL;
> +  current->next = NULL;
> 
> that looks awkward ;)  Shouldn't instead
> 
> +    execute_ipa_pass_list (current);
> 
> not do what it does?  Thus, split that into two pieces, one that we
> can iterate w/o the fiddling with current->?

The choice is between the above and duplicating the post-processing logic of execute_ipa_pass_list (i.e., the call to cgraph_process_new_functions) in the iteration loop.  I agree that the above is a tad awkward, but it is cleaner than copy-pasting the final part of execute_ipa_pass_list.

> 
> I like this variant a lot better than the last one - still it lacks any
> analysis-based justification for iteration (see my reply to Matt on
> what I discussed with Honza).

Yes, having a way to tell whether a function has significantly changed would be awesome.  My approach here would be to make inline_parameters output feedback on how much the size/time metrics have changed for a function since the previous run.  If the change is above X%, then queue the function's callers for more optimizations.  Similarly, Martin's rebuild_cgraph_edges_and_devirt (when that goes into trunk) could queue new direct callees and the current function for another iteration if new direct edges were resolved.
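
To sketch what I mean (hypothetical names only -- this is not the actual inline_parameters interface):

/* Illustrative pseudo-summary; field names are made up.  */
struct fn_metrics
{
  double size;
  double time;
};

/* Return true if the size/time metrics moved by more than THRESHOLD_PCT
   (the "X%" above) since the previous run, i.e. the function's callers
   are worth queueing for another iteration.  */
static bool
metrics_changed_significantly (const fn_metrics &prev, const fn_metrics &cur,
                               double threshold_pct)
{
  double size_delta = prev.size != 0
    ? 100.0 * (cur.size - prev.size) / prev.size : 0.0;
  double time_delta = prev.time != 0
    ? 100.0 * (cur.time - prev.time) / prev.time : 0.0;
  if (size_delta < 0) size_delta = -size_delta;
  if (time_delta < 0) time_delta = -time_delta;
  return size_delta > threshold_pct || time_delta > threshold_pct;
}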

But such analysis is way out of scope of this patch, which adds infrastructure for iterative optimizations to be possible and reliable.

>  Thus, I don't think we want to
> merge this in its current form or in this stage1.

What is the benefit of pushing this to a later release?  If anything, merging the support for iterative optimizations now will allow us to consider adding the wonderful smartness to it later.  In the meantime, substituting that smartness with a knob is still a great alternative.

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-27 23:29       ` Maxim Kuvyrkov
@ 2011-10-28 11:12         ` Richard Guenther
  2011-10-28 23:07           ` Maxim Kuvyrkov
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-10-28 11:12 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches, Matt

On Fri, Oct 28, 2011 at 1:05 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> Richard,
>
> Just as Matt posted his findings about the effect of iterating early optimizations, I've got the new patch ready.  This patch is essentially a complete rewrite and addresses the comments you made.
>
> On 18/10/2011, at 9:56 PM, Richard Guenther wrote:
>
>>>>
>>>> If we'd want to iterate early optimizations we'd want to do it by iterating
>>>> an IPA pass so that we benefit from more precise size estimates
>>>> when trying to inline a function the second time.
>>>
>>> Could you elaborate on this a bit?  Early optimizations are gimple passes, so I'm missing your point here.
>>
>> pass_early_local_passes is an IPA pass, you want to iterate
>> fn1, fn2, fn1, fn2, ..., not fn1, fn1 ..., fn2, fn2 ... precisely for better
>> inlining.  Thus you need to split pass_early_local_passes into pieces
>> so you can iterate one of the IPA pieces.
>
> Early_local_passes are now split into _main, _iter and _late parts.  To avoid changing the default case, _late part is merged into _main when no iterative optimizations are requested.
>
>>
>>>> Also statically
>>>> scheduling the passes will mess up dump files and you have no
>>>> chance of say, noticing that nothing changed for function f and its
>>>> callees in iteration N and thus you can skip processing them in
>>>> iteration N + 1.
>>>
>>> Yes, these are the shortcomings.  The dump file name changes can be fixed, e.g., by adding a suffix to the passes on iterations after the first one.  The analysis to avoid unnecessary iterations is a more complex problem.
>
> To avoid changing the dump file names the patch appends "_iter" suffix to the dumps of iterative passes.
>
>>
>> Sure.  I analyzed early passes by manually duplicating them and
>> test that they do nothing for tramp3d, which they pretty much all did
>> at some point.
>>
>>>>
>>>> So, at least you should split the pass_early_local_passes IPA pass
>>>> into three, you'd iterate over the 2nd (definitely not over pass_split_functions
>>>> though), the third would be pass_profile and pass_split_functions only.
>>>> And you'd iterate from the place the 2nd IPA pass is executed, not
>>>> by scheduling them N times.
>>>
>>> OK, I will look into this.
>
> Done.
>
>>>
>>>>
>>>> Then you'd have to analyze the compile-time impact of the IPA
>>>> splitting on its own when not iterating.
>
> I decided to avoid this and keep the pass pipeline effectively the same when not running iterative optimizations.  This is achieved by scheduling pass_early_optimizations_late in different places in the pipeline depending on whether iterative optimizations are enabled or not.
>
> The patch bootstraps and passes regtest on i686-pc-linux-gnu {-m32/-m64} with 3 iterations enabled by default.  The only failures are 5 scan-dump tests that are due to more functions being inlined than expected.  With iterative optimizations disabled there is no change.
>
> I've kicked off SPEC2000/SPEC2006 benchmark runs to see the performance effect of the patch, and those will be posted in the same Google Docs spreadsheet in several days.
>
> OK for trunk?

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index f056d3d..4738b28 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2416,7 +2416,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
            tree_lowering_passes (fndecl);
            bitmap_obstack_initialize (NULL);
            if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-             execute_pass_list (pass_early_local_passes.pass.sub);
+             execute_early_local_passes_for_current_function ();
            bitmap_obstack_release (NULL);
            pop_cfun ();
            current_function_decl = NULL;
@@ -2441,7 +2441,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
        gimple_register_cfg_hooks ();
        bitmap_obstack_initialize (NULL);
        if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-         execute_pass_list (pass_early_local_passes.pass.sub);
+         execute_early_local_passes_for_current_function ();

I think these should only execute the lowering pieces of early local passes,
let me see if that's properly split ...

@@ -255,7 +255,7 @@ cgraph_process_new_functions (void)
              /* When not optimizing, be sure we run early local passes anyway
                 to expand OMP.  */
              || !optimize)
-           execute_pass_list (pass_early_local_passes.pass.sub);
+           execute_early_local_passes_for_current_function ();

similar.

About all this -suffix stuff, I'd like to have the iterations simply re-use
the existing dump-files, thus statically sub-divide pass_early_local_passes
like

NEXT_PASS (pass_early_local_lowering_passes);
  {
      NEXT_PASS (pass_fixup_cfg);
      NEXT_PASS (pass_init_datastructures);
      NEXT_PASS (pass_expand_omp);

      NEXT_PASS (pass_referenced_vars);
      NEXT_PASS (pass_build_ssa);
      NEXT_PASS (pass_lower_vector);
      NEXT_PASS (pass_early_warn_uninitialized);
  }
/* The following you maybe iterate.  */
NEXT_PASS (pass_early_local_optimizations);
  {
      NEXT_PASS (pass_rebuild_cgraph_edges);
      NEXT_PASS (pass_inline_parameters);
      NEXT_PASS (pass_early_inline);
      NEXT_PASS (pass_all_early_optimizations);
        {
          struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
          NEXT_PASS (pass_remove_cgraph_callee_edges);
          NEXT_PASS (pass_rename_ssa_copies);
          NEXT_PASS (pass_ccp);
          NEXT_PASS (pass_forwprop);
          /* pass_build_ealias is a dummy pass that ensures that we
             execute TODO_rebuild_alias at this point.  Re-building
             alias information also rewrites no longer addressed
             locals into SSA form if possible.  */
          NEXT_PASS (pass_build_ealias);
          NEXT_PASS (pass_sra_early);
          NEXT_PASS (pass_fre);
          NEXT_PASS (pass_copy_prop);
          NEXT_PASS (pass_merge_phi);
          NEXT_PASS (pass_cd_dce);
          NEXT_PASS (pass_early_ipa_sra);
          NEXT_PASS (pass_tail_recursion);
          NEXT_PASS (pass_convert_switch);
          NEXT_PASS (pass_cleanup_eh);
          NEXT_PASS (pass_profile);
          NEXT_PASS (pass_local_pure_const);
       }
   }
NEXT_PASS (pass_late_early_local_optimizations)
  {
          /* Split functions creates parts that are not run through
             early optimizations again.  It is thus good idea to do this
             late.  */
          NEXT_PASS (pass_split_functions);
      NEXT_PASS (pass_release_ssa_names);
      NEXT_PASS (pass_rebuild_cgraph_edges);
      NEXT_PASS (pass_inline_parameters);
  }

+++ b/gcc/toplev.c
@@ -1228,7 +1228,6 @@ general_init (const char *argv0)
   /* This must be done after global_init_params but before argument
      processing.  */
   init_ggc_heuristics();
-  init_optimization_passes ();
   statistics_early_init ();
   finish_params ();
 }
@@ -1989,6 +1988,8 @@ toplev_main (int argc, char **argv)
                  save_decoded_options, save_decoded_options_count,
                  UNKNOWN_LOCATION, global_dc);

+  init_optimization_passes ();
+
   handle_common_deferred_options ();

   init_local_tick ();

any reason for this?

+  /* Don't recurse or wonder on to the next pass when running
+     execute_ipa_pass_list below.  */
+  current->execute = NULL;
+  current->next = NULL;

that looks awkward ;)  Shouldn't instead

+    execute_ipa_pass_list (current);

not do what it does?  Thus, split that into two pieces, one that we
can iterate w/o the fiddling with current->?

I like this variant a lot better than the last one - still it lacks any
analysis-based justification for iteration (see my reply to Matt on
what I discussed with Honza).  Thus, I don't think we want to
merge this in its current form or in this stage1.

I'd also like to know _which_ early passes are worth iterating
(and whether iterating is only worth it if the iterated early inlining
inlined something).  I guess that goes back to the point that iterating
should be driven by the early inliner and/or by whether new direct
calls were discovered in the previous iteration.
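
Roughly, the driver loop could then look like this (the predicate name
is invented purely to show the idea, no such hook exists today):

  for (i = 1; i < PARAM_VALUE (PARAM_EIPA_ITERATIONS); ++i)
    {
      execute_ipa_pass_list (pass_early_iter_body.pass.sub);
      /* Stop once the iterated early inliner no longer discovers
         new direct calls.  */
      if (!early_inline_found_new_direct_calls_p ())
        break;
    }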

Richard.


> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-18  9:09     ` Richard Guenther
@ 2011-10-27 23:29       ` Maxim Kuvyrkov
  2011-10-28 11:12         ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-10-27 23:29 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches, Matt

[-- Attachment #1: Type: text/plain, Size: 3129 bytes --]

Richard,

Just as Matt posted his findings about the effect of iterating early optimizations, I've got the new patch ready.  This patch is essentially a complete rewrite and addresses the comments you made.

On 18/10/2011, at 9:56 PM, Richard Guenther wrote:

>>> 
>>> If we'd want to iterate early optimizations we'd want to do it by iterating
>>> an IPA pass so that we benefit from more precise size estimates
>>> when trying to inline a function the second time.
>> 
>> Could you elaborate on this a bit?  Early optimizations are gimple passes, so I'm missing your point here.
> 
> pass_early_local_passes is an IPA pass, you want to iterate
> fn1, fn2, fn1, fn2, ..., not fn1, fn1 ..., fn2, fn2 ... precisely for better
> inlining.  Thus you need to split pass_early_local_passes into pieces
> so you can iterate one of the IPA pieces.

Early_local_passes are now split into _main, _iter and _late parts.  To avoid changing the default case, the _late part is merged into _main when no iterative optimizations are requested.

> 
>>> Also statically
>>> scheduling the passes will mess up dump files and you have no
>>> chance of say, noticing that nothing changed for function f and its
>>> callees in iteration N and thus you can skip processing them in
>>> iteration N + 1.
>> 
>> Yes, these are the shortcomings.  The dump file name changes can be fixed, e.g., by adding a suffix to the passes on iterations after the first one.  The analysis to avoid unnecessary iterations is a more complex problem.

To avoid changing the dump file names, the patch appends an "_iter" suffix to the dumps of the iterated passes.
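
For example, the dumps then come out along these lines (pass numbers
are placeholders, not taken from an actual run):

  foo.c.NNNt.ccp        first instance, under early_local_cleanups_main
  foo.c.NNNt.ccp_iter   repeated instances, under early_local_cleanups_iter

rather than the numbered renaming (ccp1, ccp2, ...) that duplicated
passes otherwise get.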

> 
> Sure.  I analyzed early passes by manually duplicating them and
> testing that they do nothing for tramp3d, which they pretty much all did
> at some point.
> 
>>> 
>>> So, at least you should split the pass_early_local_passes IPA pass
>>> into three, you'd iterate over the 2nd (definitely not over pass_split_functions
>>> though), the third would be pass_profile and pass_split_functions only.
>>> And you'd iterate from the place the 2nd IPA pass is executed, not
>>> by scheduling them N times.
>> 
>> OK, I will look into this.

Done.

>> 
>>> 
>>> Then you'd have to analyze the compile-time impact of the IPA
>>> splitting on its own when not iterating.  

I decided to avoid this and keep the pass pipeline effectively the same when not running iterative optimizations.  This is achieved by scheduling pass_early_optimizations_late in different places in the pipeline depending on whether iterative optimizations are enabled or not.
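
In other words, the effective ordering of the simple IPA passes is
roughly the following (individual sub-passes omitted):

  eipa-iterations == 1 (default):
    early_local_cleanups_main   lowering + early optimizations + the late part

  eipa-iterations == N, N > 1:
    early_local_cleanups_main   lowering + early optimizations (first time)
    early_local_cleanups_iter   early optimizations, repeated N-1 more times
    early_local_cleanups_late   profile, split_functions, release_ssa_names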

The patch bootstraps and passes regtest on i686-pc-linux-gnu {-m32/-m64} with 3 iterations enabled by default.  The only failures are 5 scan-dump tests that are due to more functions being inlined than expected.  With iterative optimizations disabled there is no change.

I've kicked off SPEC2000/SPEC2006 benchmark runs to see the performance effect of the patch, and those will be posted in the same Google Docs spreadsheet in several days.

OK for trunk?

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics


[-- Attachment #2: fsf-gcc-iter-eipa-2.ChangeLog --]
[-- Type: application/octet-stream, Size: 1403 bytes --]

2011-10-28  Maxim Kuvyrkov  <maxim@codesourcery.com>

	Add scheduling of several iterations of early IPA passes.

	* Makefile.in (tree-optimize.o): Add dependency on PARAMS_H.
	* cgraph.c (cgraph_add_new_functions): Update.
	* cgraphunit.c (cgraph_process_new_functions): Update.
	* doc/invoke.texi (eipa-iterations): Document new parameter.
	* params.def (PARAM_EIPA_ITERATIONS): Define.
	* passes.c (make_pass_instance): Add new argument.  Handle sub-passes.
	(next_pass_1): Add new argument.
	(init_optimization_passes): Split pass_early_local_passes into
	pass_early_local_passes_main, pass_early_local_passes_iter and
	pass_early_local_passes_late.
	Split pass_all_early_optimizations into
	pass_early_optimizations_main and pass_early_optimizations_late.
	Schedule iterative optimizations if requested.
	* toplev.c (general_init, toplev_main): Move init_optimization_passes
	after processing of arguments.
	* tree-optimize.c (params.h): New include.
	(pass_early_local_passes): Split into pass_early_local_passes_main,
	pass_early_local_passes_iter and pass_early_local_passes_late.
	(execute_early_local_passes_iter): New function.
	(execute_early_local_passes_for_current_function): New wrapper for
	running early_local_passes.
	(pass_all_early_optimizations): Split into
	pass_early_optimizations_main and pass_early_optimizations_late.
	(tree_lowering_passes): Update.
	* tree-pass.h: Update.

[-- Attachment #3: fsf-gcc-iter-eipa-2.patch --]
[-- Type: application/octet-stream, Size: 17961 bytes --]

commit 659f197e185606ae5d4dea904e18f4e392a52f68
Author: Maxim Kuvyrkov <maxim@codesourcery.com>
Date:   Sat Oct 1 01:09:50 2011 -0700

    Add iterative optimization passes

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 3608904..ac1839d 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -2657,7 +2657,7 @@ tree-optimize.o : tree-optimize.c $(TREE_FLOW_H) $(CONFIG_H) $(SYSTEM_H) \
    coretypes.h $(TREE_DUMP_H) toplev.h $(DIAGNOSTIC_CORE_H) $(FUNCTION_H) langhooks.h \
    $(FLAGS_H) $(CGRAPH_H) $(PLUGIN_H) \
    $(TREE_INLINE_H) tree-mudflap.h $(GGC_H) graph.h $(CGRAPH_H) \
-   $(TREE_PASS_H) $(CFGLOOP_H) $(EXCEPT_H) $(REGSET_H)
+   $(TREE_PASS_H) $(CFGLOOP_H) $(EXCEPT_H) $(REGSET_H) $(PARAMS_H)
 
 gimplify.o : gimplify.c $(CONFIG_H) $(SYSTEM_H) $(TREE_H) $(GIMPLE_H) \
    $(DIAGNOSTIC_H) $(GIMPLE_H) $(TREE_INLINE_H) langhooks.h \
diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index f056d3d..4738b28 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2416,7 +2416,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
 	    tree_lowering_passes (fndecl);
 	    bitmap_obstack_initialize (NULL);
 	    if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-	      execute_pass_list (pass_early_local_passes.pass.sub);
+	      execute_early_local_passes_for_current_function ();
 	    bitmap_obstack_release (NULL);
 	    pop_cfun ();
 	    current_function_decl = NULL;
@@ -2441,7 +2441,7 @@ cgraph_add_new_function (tree fndecl, bool lowered)
 	gimple_register_cfg_hooks ();
 	bitmap_obstack_initialize (NULL);
 	if (!gimple_in_ssa_p (DECL_STRUCT_FUNCTION (fndecl)))
-	  execute_pass_list (pass_early_local_passes.pass.sub);
+	  execute_early_local_passes_for_current_function ();
 	bitmap_obstack_release (NULL);
 	tree_rest_of_compilation (fndecl);
 	pop_cfun ();
diff --git a/gcc/cgraphunit.c b/gcc/cgraphunit.c
index 25d7561..9a0e06d 100644
--- a/gcc/cgraphunit.c
+++ b/gcc/cgraphunit.c
@@ -255,7 +255,7 @@ cgraph_process_new_functions (void)
 	      /* When not optimizing, be sure we run early local passes anyway
 		 to expand OMP.  */
 	      || !optimize)
-	    execute_pass_list (pass_early_local_passes.pass.sub);
+	    execute_early_local_passes_for_current_function ();
 	  else
 	    compute_inline_parameters (node, true);
 	  free_dominance_info (CDI_POST_DOMINATORS);
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 50e875a..d5cc035 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9088,6 +9088,12 @@ the parameter is reserved exclusively for debug insns created by
 @option{-fvar-tracking-assignments}, but debug insns may get
 (non-overlapping) uids above it if the reserved range is exhausted.
 
+@item eipa-iterations
+The pass scheduler will execute @option{eipa-iterations} iterations of
+early optimization passes before running interprocedural analysis.
+Running several iterations of optimization passes allows the compiler
+to provide thoroughly optimized code to the interprocedural analysis.
+
 @item ipa-sra-ptr-growth-factor
 IPA-SRA will replace a pointer to an aggregate with one or more new
 parameters only when their cumulative size is less or equal to
diff --git a/gcc/params.def b/gcc/params.def
index b160530..2baad91 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -861,6 +861,11 @@ DEFPARAM (PARAM_MIN_NONDEBUG_INSN_UID,
 	  "The minimum UID to be used for a nondebug insn",
 	  0, 1, 0)
 
+DEFPARAM (PARAM_EIPA_ITERATIONS,
+	  "eipa-iterations",
+	  "Number of iterations of early optimizations before IPA analysis",
+	  3, 1, 0)
+
 DEFPARAM (PARAM_IPA_SRA_PTR_GROWTH_FACTOR,
 	  "ipa-sra-ptr-growth-factor",
 	  "Maximum allowed growth of size of new parameters ipa-sra replaces "
diff --git a/gcc/passes.c b/gcc/passes.c
index 887007f..c3caf3f 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -898,10 +898,13 @@ is_pass_explicitly_enabled_or_disabled (struct opt_pass *pass,
 }
 
 /* Look at the static_pass_number and duplicate the pass
-   if it is already added to a list. */
+   if it is already added to a list.
+   If SUFFIX is non-NULL, append it to the name of the new pass instead
+   of an increasing number.  */
 
 static struct opt_pass *
-make_pass_instance (struct opt_pass *pass, bool track_duplicates)
+make_pass_instance (struct opt_pass *pass, bool track_duplicates,
+		    const char *suffix)
 {
   /* A nonzero static_pass_number indicates that the
      pass is already in the list.  */
@@ -924,15 +927,40 @@ make_pass_instance (struct opt_pass *pass, bool track_duplicates)
       else
         gcc_unreachable ();
 
+      if (pass->sub)
+	/* Duplicate sub-passes.  */
+	{
+	  struct opt_pass *sub = pass->sub;
+	  struct opt_pass **new_sub_ptr = &new_pass->sub;
+
+	  do
+	    {
+	      *new_sub_ptr = make_pass_instance (sub, track_duplicates, suffix);
+	      new_sub_ptr = &(*new_sub_ptr)->next;
+	      sub = sub->next;
+	    } while (sub);
+	}
+
       new_pass->next = NULL;
 
       new_pass->todo_flags_start &= ~TODO_mark_first_instance;
 
+      if (suffix)
+	/* Make pass name unique by appending SUFFIX to it.  */
+	{
+	  /* Make sure we don't append suffix twice.  */
+	  {
+	    int name_len = strlen (pass->name), suffix_len = strlen (suffix);
+	    gcc_assert (strcmp (suffix, &pass->name[name_len - suffix_len]));
+	  }
+	  new_pass->static_pass_number = -1;
+	  new_pass->name = concat (pass->name, suffix, NULL);
+	}
       /* Indicate to register_dump_files that this pass has duplicates,
          and so it should rename the dump file.  The first instance will
          be -1, and be number of duplicates = -static_pass_number - 1.
          Subsequent instances will be > 0 and just the duplicate number.  */
-      if ((pass->name && pass->name[0] != '*') || track_duplicates)
+      else if ((pass->name && pass->name[0] != '*') || track_duplicates)
         {
           pass->static_pass_number -= 1;
           new_pass->static_pass_number = -pass->static_pass_number;
@@ -953,12 +981,12 @@ make_pass_instance (struct opt_pass *pass, bool track_duplicates)
    in the list.  */
 
 static struct opt_pass **
-next_pass_1 (struct opt_pass **list, struct opt_pass *pass)
+next_pass_1 (struct opt_pass **list, struct opt_pass *pass, const char *suffix)
 {
   /* Every pass should have a name so that plugins can refer to them.  */
   gcc_assert (pass->name != NULL);
 
-  *list = make_pass_instance (pass, false);
+  *list = make_pass_instance (pass, false, suffix);
 
   return &(*list)->next;
 }
@@ -1008,7 +1036,7 @@ position_pass (struct register_pass_info *new_pass_info,
           struct opt_pass *new_pass;
           struct pass_list_node *new_pass_node;
 
-	  new_pass = make_pass_instance (new_pass_info->pass, true);
+	  new_pass = make_pass_instance (new_pass_info->pass, true, NULL);
 
           /* Insert the new pass instance based on the positioning op.  */
           switch (new_pass_info->pos_op)
@@ -1164,8 +1192,9 @@ void
 init_optimization_passes (void)
 {
   struct opt_pass **p;
+  const char *suffix = NULL;
 
-#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &((PASS).pass)))
+#define NEXT_PASS(PASS)  (p = next_pass_1 (p, &((PASS).pass), suffix))
 
  /* All passes needed to lower the function into shape optimizers can
     operate on.  These passes are always run first on the function, but
@@ -1188,9 +1217,9 @@ init_optimization_passes (void)
   p = &all_small_ipa_passes;
   NEXT_PASS (pass_ipa_free_lang_data);
   NEXT_PASS (pass_ipa_function_and_variable_visibility);
-  NEXT_PASS (pass_early_local_passes);
+  NEXT_PASS (pass_early_local_passes_main);
     {
-      struct opt_pass **p = &pass_early_local_passes.pass.sub;
+      struct opt_pass **p = &pass_early_local_passes_main.pass.sub;
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_init_datastructures);
       NEXT_PASS (pass_expand_omp);
@@ -1202,9 +1231,9 @@ init_optimization_passes (void)
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
       NEXT_PASS (pass_early_inline);
-      NEXT_PASS (pass_all_early_optimizations);
-	{
-	  struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
+      NEXT_PASS (pass_early_optimizations_main);
+        {
+	  struct opt_pass **p = &pass_early_optimizations_main.pass.sub;
 	  NEXT_PASS (pass_remove_cgraph_callee_edges);
 	  NEXT_PASS (pass_rename_ssa_copies);
 	  NEXT_PASS (pass_ccp);
@@ -1222,18 +1251,61 @@ init_optimization_passes (void)
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);
-          NEXT_PASS (pass_cleanup_eh);
-          NEXT_PASS (pass_profile);
-          NEXT_PASS (pass_local_pure_const);
+	  NEXT_PASS (pass_cleanup_eh);
+	  NEXT_PASS (pass_local_pure_const);
+	}
+      /* Define pass_early_optimizations_late; we run these optimizations
+	 strictly once per function.
+	 When not running iterative optimizations, schedule these
+	 optimizations together with pass_early_local_passes_main.
+	 Otherwise, run them after iterative optimizations.  */
+	{
+	  struct opt_pass **p = &pass_early_optimizations_late.pass.sub;
+	  NEXT_PASS (pass_profile);
 	  /* Split functions creates parts that are not run through
 	     early optimizations again.  It is thus good idea to do this
 	     late.  */
-          NEXT_PASS (pass_split_functions);
+	  NEXT_PASS (pass_split_functions);
+	  NEXT_PASS (pass_release_ssa_names);
 	}
-      NEXT_PASS (pass_release_ssa_names);
+      if (PARAM_VALUE (PARAM_EIPA_ITERATIONS) == 1)
+	NEXT_PASS (pass_early_optimizations_late);
       NEXT_PASS (pass_rebuild_cgraph_edges);
       NEXT_PASS (pass_inline_parameters);
     }
+
+  if (PARAM_VALUE (PARAM_EIPA_ITERATIONS) > 1)
+    {
+      /* Prepare and run several iterations of early_optimizations.  */
+      NEXT_PASS (pass_early_local_passes_iter);
+        {
+	  struct opt_pass **p = &pass_early_local_passes_iter.pass.sub;
+	  const char *suffix = "_iter";
+	  /* Fixup CFG as some of callees might have had their attributes
+	     (nothrow, pure, etc.) changed by pass_local_pure_const.
+	     Pass_fixup_cfg should always be followed by
+	     rebuild_cgraph_edges to remove outdated call-graph edges.  */
+	  NEXT_PASS (pass_fixup_cfg);
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  /* Run early inlining.  */
+	  NEXT_PASS (pass_inline_parameters);
+	  NEXT_PASS (pass_early_inline);
+	  /* Run the bulk of early optimizations.  */
+	  NEXT_PASS (pass_early_optimizations_main);
+	  /* Clean up and prepare to be inlined to some other function.  */
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  NEXT_PASS (pass_inline_parameters);
+	}
+      NEXT_PASS (pass_early_local_passes_late);
+        {
+	  struct opt_pass **p = &pass_early_local_passes_late.pass.sub;
+	  NEXT_PASS (pass_fixup_cfg);
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  NEXT_PASS (pass_early_optimizations_late);
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  NEXT_PASS (pass_inline_parameters);
+	}
+    }
   NEXT_PASS (pass_ipa_tree_profile);
     {
       struct opt_pass **p = &pass_ipa_tree_profile.pass.sub;
diff --git a/gcc/toplev.c b/gcc/toplev.c
index 86eed5d..4186157 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1228,7 +1228,6 @@ general_init (const char *argv0)
   /* This must be done after global_init_params but before argument
      processing.  */
   init_ggc_heuristics();
-  init_optimization_passes ();
   statistics_early_init ();
   finish_params ();
 }
@@ -1989,6 +1988,8 @@ toplev_main (int argc, char **argv)
 		  save_decoded_options, save_decoded_options_count,
 		  UNKNOWN_LOCATION, global_dc);
 
+  init_optimization_passes ();
+
   handle_common_deferred_options ();
 
   init_local_tick ();
diff --git a/gcc/tree-optimize.c b/gcc/tree-optimize.c
index d7978b9..8516070 100644
--- a/gcc/tree-optimize.c
+++ b/gcc/tree-optimize.c
@@ -47,6 +47,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "except.h"
 #include "plugin.h"
 #include "regset.h"	/* FIXME: For reg_obstack.  */
+#include "params.h"
 
 /* Gate: execute, or not, all of the non-trivial optimizations.  */
 
@@ -101,11 +102,11 @@ execute_all_early_local_passes (void)
   return 0;
 }
 
-struct simple_ipa_opt_pass pass_early_local_passes =
+struct simple_ipa_opt_pass pass_early_local_passes_main =
 {
  {
   SIMPLE_IPA_PASS,
-  "early_local_cleanups",		/* name */
+  "early_local_cleanups_main",		/* name */
   gate_all_early_local_passes,		/* gate */
   execute_all_early_local_passes,	/* execute */
   NULL,					/* sub */
@@ -120,6 +121,86 @@ struct simple_ipa_opt_pass pass_early_local_passes =
  }
 };
 
+static unsigned int
+execute_early_local_passes_iter (void)
+{
+  struct opt_pass *current = current_pass;
+  struct opt_pass *next = current_pass->next;
+  int i;
+
+  /* Don't recurse or wander on to the next pass when running
+     execute_ipa_pass_list below.  */
+  current->execute = NULL;
+  current->next = NULL;
+
+  /* Run PARAM_EIPA_ITERATIONS-1 iterations of early optimizations.
+     We run these passes the first time as part of
+     pass_early_local_passes_main.  */
+  gcc_assert (PARAM_VALUE (PARAM_EIPA_ITERATIONS) > 1);
+  for (i = 1; i < PARAM_VALUE (PARAM_EIPA_ITERATIONS); ++i)
+    execute_ipa_pass_list (current);
+
+  /* Restore.  */
+  current->next = next;
+  current->execute = execute_early_local_passes_iter;
+  current_pass = current;
+
+  /* Tell execute_early_local_passes_for_current_function that it is OK to
+     run late optimizations now.  This also avoids running an extra iteration
+     of optimizations by execute_ipa_pass_list machinery.  */
+  current->sub = NULL;
+
+  return 0;
+}
+
+struct simple_ipa_opt_pass pass_early_local_passes_iter =
+{
+ {
+  SIMPLE_IPA_PASS,
+  "early_local_cleanups_iter",		/* name */
+  gate_all_early_local_passes,		/* gate */
+  execute_early_local_passes_iter,	/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_EARLY_LOCAL,			/* tv_id */
+  0,					/* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_remove_functions	 		/* todo_flags_finish */
+ }
+};
+
+struct simple_ipa_opt_pass pass_early_local_passes_late =
+{
+ {
+  SIMPLE_IPA_PASS,
+  "early_local_cleanups_late",		/* name */
+  gate_all_early_local_passes,		/* gate */
+  NULL,					/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_EARLY_LOCAL,			/* tv_id */
+  0,					/* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_remove_functions	 		/* todo_flags_finish */
+ }
+};
+
+void
+execute_early_local_passes_for_current_function (void)
+{
+  execute_pass_list (pass_early_local_passes_main.pass.sub);
+  /* Run early_local_passes_late if we are done with iterative optimizations
+     or not running them (iterative optimizations) at all.  */
+  if (pass_early_local_passes_iter.pass.sub == NULL)
+    execute_pass_list (pass_early_local_passes_late.pass.sub);
+}
+
 /* Gate: execute, or not, all of the non-trivial optimizations.  */
 
 static bool
@@ -130,11 +211,30 @@ gate_all_early_optimizations (void)
 	  && !seen_error ());
 }
 
-struct gimple_opt_pass pass_all_early_optimizations =
+struct gimple_opt_pass pass_early_optimizations_main =
+{
+ {
+  GIMPLE_PASS,
+  "early_optimizations_main",		/* name */
+  gate_all_early_optimizations,		/* gate */
+  NULL,					/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_NONE,				/* tv_id */
+  0,					/* properties_required */
+  0,					/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  0					/* todo_flags_finish */
+ }
+};
+
+struct gimple_opt_pass pass_early_optimizations_late =
 {
  {
   GIMPLE_PASS,
-  "early_optimizations",		/* name */
+  "early_optimizations_late",		/* name */
   gate_all_early_optimizations,		/* gate */
   NULL,					/* execute */
   NULL,					/* sub */
@@ -384,7 +484,7 @@ tree_lowering_passes (tree fn)
   bitmap_obstack_initialize (NULL);
   execute_pass_list (all_lowering_passes);
   if (optimize && cgraph_global_info_ready)
-    execute_pass_list (pass_early_local_passes.pass.sub);
+    execute_early_local_passes_for_current_function ();
   free_dominance_info (CDI_POST_DOMINATORS);
   free_dominance_info (CDI_DOMINATORS);
   compact_blocks ();
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index df1e24c..c1f44b9 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -455,7 +455,9 @@ extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
 extern struct simple_ipa_opt_pass pass_ipa_function_and_variable_visibility;
 extern struct simple_ipa_opt_pass pass_ipa_tree_profile;
 
-extern struct simple_ipa_opt_pass pass_early_local_passes;
+extern struct simple_ipa_opt_pass pass_early_local_passes_main;
+extern struct simple_ipa_opt_pass pass_early_local_passes_iter;
+extern struct simple_ipa_opt_pass pass_early_local_passes_late;
 
 extern struct ipa_opt_pass_d pass_ipa_whole_program_visibility;
 extern struct ipa_opt_pass_d pass_ipa_lto_gimple_out;
@@ -575,7 +577,8 @@ extern struct rtl_opt_pass pass_rtl_seqabstr;
 extern struct gimple_opt_pass pass_release_ssa_names;
 extern struct gimple_opt_pass pass_early_inline;
 extern struct gimple_opt_pass pass_inline_parameters;
-extern struct gimple_opt_pass pass_all_early_optimizations;
+extern struct gimple_opt_pass pass_early_optimizations_main;
+extern struct gimple_opt_pass pass_early_optimizations_late;
 extern struct gimple_opt_pass pass_update_address_taken;
 extern struct gimple_opt_pass pass_convert_switch;
 
@@ -638,6 +641,8 @@ extern void register_pass (struct register_pass_info *);
    directly in jump threading, and avoid peeling them next time.  */
 extern bool first_pass_instance;
 
+extern void execute_early_local_passes_for_current_function (void);
+
 /* Declare for plugins.  */
 extern void do_per_function_toporder (void (*) (void *), void *);
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-18  3:00   ` Maxim Kuvyrkov
@ 2011-10-18  9:09     ` Richard Guenther
  2011-10-27 23:29       ` Maxim Kuvyrkov
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-10-18  9:09 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches

On Tue, Oct 18, 2011 at 1:45 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> On 13/10/2011, at 12:58 AM, Richard Guenther wrote:
>
>> On Wed, Oct 12, 2011 at 8:50 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>>> The following patch adds a new knob to make GCC perform several iterations of early optimizations and inlining.
>>>
>>> This is for dont-care-about-compile-time-optimize-all-you-can scenarios.  Performing several iterations of optimizations does significantly improve code speed on a certain proprietary source base.  Some hand-tuning of the parameter value is required to get optimum performance.  Another good use for this option is the search for, and ad-hoc analysis of, cases where GCC misses optimization opportunities.
>>>
>>> With the default setting of '1', nothing is changed from the current status quo.
>>>
>>> The patch was bootstrapped and regtested with 3 iterations set by default on i686-linux-gnu.  The only failures in the regression testsuite were due to latent bugs in the handling of EH information, which are being discussed in a different thread.
>>>
>>> Performance impact on the standard benchmarks is not conclusive: there are improvements in SPEC2000 of up to 4% and regressions down to -2%, see [*].  The SPEC2006 benchmarks will take another day or two to complete and I will update the spreadsheet then.  The benchmarks were run on a Core2 system for all combinations of {-m32/-m64}{-O2/-O3}.
>>>
>>> The effect on compilation time is fairly predictable: about a 10% compile-time increase with 3 iterations.
>>>
>>> OK for trunk?
>>
>> I don't think this is a good idea, especially in the form you implemented it.
>>
>> If we'd want to iterate early optimizations we'd want to do it by iterating
>> an IPA pass so that we benefit from more precise size estimates
>> when trying to inline a function the second time.
>
> Could you elaborate on this a bit?  Early optimizations are gimple passes, so I'm missing your point here.

pass_early_local_passes is an IPA pass, you want to iterate
fn1, fn2, fn1, fn2, ..., not fn1, fn1 ..., fn2, fn2 ... precisely for better
inlining.  Thus you need to split pass_early_local_passes into pieces
so you can iterate one of the IPA pieces.
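
A contrived example of why the interleaved order matters for the size
estimates (purely illustrative, not from a real test case):

static int
fn2 (int x)
{
  int i, acc = 0;
  /* This loop folds to a constant once the early optimizations have
     run over fn2, shrinking fn2 below the early inliner's size
     threshold.  */
  for (i = 0; i < 4; i++)
    acc += i;
  return x + acc;
}

int
fn1 (int x)
{
  /* Whether this call is early-inlined depends on the size estimate
     for fn2 at the moment fn1 is processed, hence fn1, fn2, fn1, fn2
     rather than fn1, fn1, ..., fn2, fn2.  */
  return fn2 (x) * 2;
}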

>> Also statically
>> scheduling the passes will mess up dump files and you have no
>> chance of say, noticing that nothing changed for function f and its
>> callees in iteration N and thus you can skip processing them in
>> iteration N + 1.
>
> Yes, these are the shortcomings.  The dump file name changes can be fixed, e.g., by adding a suffix to the passes on iterations after the first one.  The analysis to avoid unnecessary iterations is a more complex problem.

Sure.  I analyzed early passes by manually duplicating them and
testing that they do nothing for tramp3d, which they pretty much all did
at some point.

>>
>> So, at least you should split the pass_early_local_passes IPA pass
>> into three, you'd iterate over the 2nd (definitely not over pass_split_functions
>> though), the third would be pass_profile and pass_split_functions only.
>> And you'd iterate from the place the 2nd IPA pass is executed, not
>> by scheduling them N times.
>
> OK, I will look into this.
>
>>
>> Then you'd have to analyze the compile-time impact of the IPA
>> splitting on its own when not iterating.  Then you should look
>> at what actually was the optimizations that were performed
>> that lead to the improvement (I can see some indirect inlining
>> happening, but everything else would be a bug in present
>> optimizers in the early pipeline - they are all designed to be
>> roughly independent on each other and _not_ expose new
>> opportunities by iteration).  Thus - testcases?
>
> The initial motivation for the patch was to enable more indirect inlining and devirtualization opportunities.

Hm.

> Since then I found the patch to be helpful in searching for optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 20% with 2 additional iterations of early optimizations [*].  Given that applying more optimizations should, theoretically, not decrease performance, there is likely a very real bug or deficiency behind that.

It is likely early SRA that messes up, or maybe convert switch.  Early
passes should be really restricted to always profitable cleanups.

Your experiment looks useful to track down these bugs, but in general
I don't think we want to expose iterating early passes.

Richard.

> Thank you,
>
> [*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE&hl=en_US
>
> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
>
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-12 12:15 ` Richard Guenther
@ 2011-10-18  3:00   ` Maxim Kuvyrkov
  2011-10-18  9:09     ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-10-18  3:00 UTC (permalink / raw)
  To: Richard Guenther; +Cc: GCC Patches

On 13/10/2011, at 12:58 AM, Richard Guenther wrote:

> On Wed, Oct 12, 2011 at 8:50 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>> The following patch adds a new knob to make GCC perform several iterations of early optimizations and inlining.
>>
>> This is for dont-care-about-compile-time-optimize-all-you-can scenarios.  Performing several iterations of optimizations does significantly improve code speed on a certain proprietary source base.  Some hand-tuning of the parameter value is required to get optimum performance.  Another good use for this option is the search for, and ad-hoc analysis of, cases where GCC misses optimization opportunities.
>> 
>> With the default setting of '1', nothing is changed from the current status quo.
>> 
>> The patch was bootstrapped and regtested with 3 iterations set by default on i686-linux-gnu.  The only failures in the regression testsuite were due to latent bugs in the handling of EH information, which are being discussed in a different thread.
>>
>> Performance impact on the standard benchmarks is not conclusive: there are improvements in SPEC2000 of up to 4% and regressions down to -2%, see [*].  The SPEC2006 benchmarks will take another day or two to complete and I will update the spreadsheet then.  The benchmarks were run on a Core2 system for all combinations of {-m32/-m64}{-O2/-O3}.
>>
>> The effect on compilation time is fairly predictable: about a 10% compile-time increase with 3 iterations.
>> 
>> OK for trunk?
> 
> I don't think this is a good idea, especially in the form you implemented it.
> 
> If we'd want to iterate early optimizations we'd want to do it by iterating
> an IPA pass so that we benefit from more precise size estimates
> when trying to inline a function the second time.  

Could you elaborate on this a bit?  Early optimizations are gimple passes, so I'm missing your point here.

> Also statically
> scheduling the passes will mess up dump files and you have no
> chance of say, noticing that nothing changed for function f and its
> callees in iteration N and thus you can skip processing them in
> iteration N + 1.

Yes, these are the shortcomings.  The dump file name changes can be fixed, e.g., by adding a suffix to the passes on iterations after the first one.  The analysis to avoid unnecessary iterations is a more complex problem.

> 
> So, at least you should split the pass_early_local_passes IPA pass
> into three, you'd iterate over the 2nd (definitely not over pass_split_functions
> though), the third would be pass_profile and pass_split_functions only.
> And you'd iterate from the place the 2nd IPA pass is executed, not
> by scheduling them N times.

OK, I will look into this.

> 
> Then you'd have to analyze the compile-time impact of the IPA
> splitting on its own when not iterating.  Then you should look
> at what actually was the optimizations that were performed
> that lead to the improvement (I can see some indirect inlining
> happening, but everything else would be a bug in present
> optimizers in the early pipeline - they are all designed to be
> roughly independent on each other and _not_ expose new
> opportunities by iteration).  Thus - testcases?

The initial motivation for the patch was to enable more indirect inlining and devirtualization opportunities. Since then I found the patch to be helpful in searching for optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp drops 20% with 2 additional iterations of early optimizations [*].  Given that applying more optimizations should, theoretically, not decrease performance, there is likely a very real bug or deficiency behind that.

Thank you,

[*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE&hl=en_US

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] Add capability to run several iterations of early optimizations
  2011-10-12  8:14 Maxim Kuvyrkov
@ 2011-10-12 12:15 ` Richard Guenther
  2011-10-18  3:00   ` Maxim Kuvyrkov
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Guenther @ 2011-10-12 12:15 UTC (permalink / raw)
  To: Maxim Kuvyrkov; +Cc: GCC Patches

On Wed, Oct 12, 2011 at 8:50 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> The following patch adds a new knob to make GCC perform several iterations of early optimizations and inlining.
>
> This is for dont-care-about-compile-time-optimize-all-you-can scenarios.  Performing several iterations of optimizations does significantly improve code speed on a certain proprietary source base.  Some hand-tuning of the parameter value is required to get optimum performance.  Another good use for this option is the search for, and ad-hoc analysis of, cases where GCC misses optimization opportunities.
>
> With the default setting of '1', nothing is changed from the current status quo.
>
> The patch was bootstrapped and regtested with 3 iterations set by default on i686-linux-gnu.  The only failures in the regression testsuite were due to latent bugs in the handling of EH information, which are being discussed in a different thread.
>
> Performance impact on the standard benchmarks is not conclusive: there are improvements in SPEC2000 of up to 4% and regressions down to -2%, see [*].  The SPEC2006 benchmarks will take another day or two to complete and I will update the spreadsheet then.  The benchmarks were run on a Core2 system for all combinations of {-m32/-m64}{-O2/-O3}.
>
> The effect on compilation time is fairly predictable: about a 10% compile-time increase with 3 iterations.
>
> OK for trunk?

I don't think this is a good idea, especially in the form you implemented it.

If we'd want to iterate early optimizations we'd want to do it by iterating
an IPA pass so that we benefit from more precise size estimates
when trying to inline a function the second time.  Also statically
scheduling the passes will mess up dump files and you have no
chance of say, noticing that nothing changed for function f and its
callees in iteration N and thus you can skip processing them in
iteration N + 1.

So, at least you should split the pass_early_local_passes IPA pass
into three, you'd iterate over the 2nd (definitely not over pass_split_functions
though), the third would be pass_profile and pass_split_functions only.
And you'd iterate from the place the 2nd IPA pass is executed, not
by scheduling them N times.

Then you'd have to analyze the compile-time impact of the IPA
splitting on its own when not iterating.  Then you should look
at what actually was the optimizations that were performed
that lead to the improvement (I can see some indirect inlining
happening, but everything else would be a bug in present
optimizers in the early pipeline - they are all designed to be
roughly independent on each other and _not_ expose new
opportunities by iteration).  Thus - testcases?

Thanks,
Richard.

> [*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE&hl=en_US
>
> Thank you,
>
> --
> Maxim Kuvyrkov
> CodeSourcery / Mentor Graphics
>
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] Add capability to run several iterations of early optimizations
@ 2011-10-12  8:14 Maxim Kuvyrkov
  2011-10-12 12:15 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-10-12  8:14 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1483 bytes --]

The following patch adds a new knob to make GCC perform several iterations of early optimizations and inlining.

This is for dont-care-about-compile-time-optimize-all-you-can scenarios.  Performing several iterations of optimizations does significantly improve code speed on a certain proprietary source base.  Some hand-tuning of the parameter value is required to get optimum performance.  Another good use for this option is the search for, and ad-hoc analysis of, cases where GCC misses optimization opportunities.
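
For reference, the number of iterations is chosen through the new
parameter, e.g. something like:

  gcc -O2 --param eipa-iterations=3 foo.c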

With the default setting of '1', nothing is changed from the current status quo.

The patch was bootstrapped and regtested with 3 iterations set by default on i686-linux-gnu.  The only failures in the regression testsuite were due to latent bugs in the handling of EH information, which are being discussed in a different thread.

Performance impact on the standard benchmarks is not conclusive: there are improvements in SPEC2000 of up to 4% and regressions down to -2%, see [*].  The SPEC2006 benchmarks will take another day or two to complete and I will update the spreadsheet then.  The benchmarks were run on a Core2 system for all combinations of {-m32/-m64}{-O2/-O3}.

The effect on compilation time is fairly predictable: about a 10% compile-time increase with 3 iterations.

OK for trunk?

[*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE&hl=en_US

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics



[-- Attachment #2: fsf-gcc-iter-eipa.ChangeLog --]
[-- Type: application/octet-stream, Size: 429 bytes --]

2011-10-11  Maxim Kuvyrkov  <maxim@codesourcery.com>

	Add scheduling of several iterations of early IPA passes.

	* doc/invoke.texi (eipa-iterations): Document new parameter.
	* params.def (PARAM_EIPA_ITERATIONS): Define.
	* passes.c (init_optimization_passes): Schedule several iterations
	of pass_all_early_optimizations.
	* toplev.c (general_init, toplev_main): Move init_optimization_passes
	after processing of arguments.

[-- Attachment #3: fsf-gcc-iter-eipa.patch --]
[-- Type: application/octet-stream, Size: 4741 bytes --]

commit a54b6cb0160d5413499e08a87da3a35eb6e29652
Author: Maxim Kuvyrkov <maxim@codesourcery.com>
Date:   Sat Oct 1 01:09:50 2011 -0700

    Add iterative optimization passes

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c92c6b2..1ebaf35 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9060,6 +9060,12 @@ the parameter is reserved exclusively for debug insns created by
 @option{-fvar-tracking-assignments}, but debug insns may get
 (non-overlapping) uids above it if the reserved range is exhausted.
 
+@item eipa-iterations
+The pass scheduler will execute @option{eipa-iterations} iterations of
+early optimization passes before running interprocedural analysis.
+Running several iterations of optimization passes allows the compiler
+to provide thoroughly optimized code to the interprocedural analysis.
+
 @item ipa-sra-ptr-growth-factor
 IPA-SRA will replace a pointer to an aggregate with one or more new
 parameters only when their cumulative size is less or equal to
diff --git a/gcc/params.def b/gcc/params.def
index 5e49c48..9609919 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -861,6 +861,11 @@ DEFPARAM (PARAM_MIN_NONDEBUG_INSN_UID,
 	  "The minimum UID to be used for a nondebug insn",
 	  0, 1, 0)
 
+DEFPARAM (PARAM_EIPA_ITERATIONS,
+	  "eipa-iterations",
+	  "Number of iterations of early optimizations before IPA analysis",
+	  1, 1, 0)
+
 DEFPARAM (PARAM_IPA_SRA_PTR_GROWTH_FACTOR,
 	  "ipa-sra-ptr-growth-factor",
 	  "Maximum allowed growth of size of new parameters ipa-sra replaces "
diff --git a/gcc/passes.c b/gcc/passes.c
index 887007f..dcc3297 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -1191,6 +1191,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_early_local_passes);
     {
       struct opt_pass **p = &pass_early_local_passes.pass.sub;
+      int i;
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_init_datastructures);
       NEXT_PASS (pass_expand_omp);
@@ -1199,12 +1200,29 @@ init_optimization_passes (void)
       NEXT_PASS (pass_build_ssa);
       NEXT_PASS (pass_lower_vector);
       NEXT_PASS (pass_early_warn_uninitialized);
-      NEXT_PASS (pass_rebuild_cgraph_edges);
-      NEXT_PASS (pass_inline_parameters);
-      NEXT_PASS (pass_early_inline);
-      NEXT_PASS (pass_all_early_optimizations);
+
+      /* Schedule PARAM_EIPA_ITERATIONS iterations of early optimizations.
+	 We run these passes at least once as pass_all_early_optimizations.  */
+      gcc_assert (PARAM_VALUE (PARAM_EIPA_ITERATIONS) >= 1);
+      for (i = 0; i < PARAM_VALUE (PARAM_EIPA_ITERATIONS); ++i)
 	{
-	  struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
+	  struct opt_pass **q;
+
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  NEXT_PASS (pass_inline_parameters);
+	  NEXT_PASS (pass_early_inline);
+
+	  {
+	    struct opt_pass **r;
+	    r = p;
+	    NEXT_PASS (pass_all_early_optimizations);
+	    /* (*R) is the pointer to the newly allocated instance of
+	       pass_all_early_optimizations.  */
+	    q = &(*r)->next;
+	    /* Push to sub-pass.  */
+	    p = &(*r)->sub;
+	  }
+
 	  NEXT_PASS (pass_remove_cgraph_callee_edges);
 	  NEXT_PASS (pass_rename_ssa_copies);
 	  NEXT_PASS (pass_ccp);
@@ -1222,13 +1240,19 @@ init_optimization_passes (void)
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);
-          NEXT_PASS (pass_cleanup_eh);
-          NEXT_PASS (pass_profile);
-          NEXT_PASS (pass_local_pure_const);
+	  NEXT_PASS (pass_cleanup_eh);
+	  if (i == PARAM_VALUE (PARAM_EIPA_ITERATIONS) - 1)
+	    /* Schedule pass_profile only for the last iteration.
+	       This pass assumes that it is run only once.  */
+	    NEXT_PASS (pass_profile);
+	  NEXT_PASS (pass_local_pure_const);
 	  /* Split functions creates parts that are not run through
 	     early optimizations again.  It is thus good idea to do this
 	     late.  */
-          NEXT_PASS (pass_split_functions);
+	  NEXT_PASS (pass_split_functions);
+
+	  /* Pop from sub-pass.  */
+	  p = q;
 	}
       NEXT_PASS (pass_release_ssa_names);
       NEXT_PASS (pass_rebuild_cgraph_edges);
diff --git a/gcc/toplev.c b/gcc/toplev.c
index ab6b5a4..a0ad17d 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1224,7 +1224,6 @@ general_init (const char *argv0)
   /* This must be done after global_init_params but before argument
      processing.  */
   init_ggc_heuristics();
-  init_optimization_passes ();
   statistics_early_init ();
   finish_params ();
 }
@@ -1984,6 +1983,8 @@ toplev_main (int argc, char **argv)
 		  save_decoded_options, save_decoded_options_count,
 		  UNKNOWN_LOCATION, global_dc);
 
+  init_optimization_passes ();
+
   handle_common_deferred_options ();
 
   init_local_tick ();

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2011-11-08 11:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-27 22:47 [PATCH] Add capability to run several iterations of early optimizations Matt
2011-10-28 10:01 ` Richard Guenther
2011-10-28 22:30   ` Matt
  -- strict thread matches above, loose matches on Subject: below --
2011-10-12  8:14 Maxim Kuvyrkov
2011-10-12 12:15 ` Richard Guenther
2011-10-18  3:00   ` Maxim Kuvyrkov
2011-10-18  9:09     ` Richard Guenther
2011-10-27 23:29       ` Maxim Kuvyrkov
2011-10-28 11:12         ` Richard Guenther
2011-10-28 23:07           ` Maxim Kuvyrkov
2011-10-29  0:10             ` Matt
2011-11-01 20:48               ` Martin Jambor
2011-11-01 21:33               ` Richard Guenther
2011-11-08  7:23                 ` Maxim Kuvyrkov
2011-11-08 11:18                   ` Richard Guenther

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).