Re: [PATCH] Add capability to run several iterations of early optimizations


From: Matt <matt@use.net>
To: Richard Guenther <richard.guenther@gmail.com>
Cc: Maxim Kuvyrkov <maxim@codesourcery.com>,
	    GCC Patches <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] Add capability to run several iterations of early optimizations
Date: Thu, 27 Oct 2011 22:47:00 -0000
Message-ID: <Pine.NEB.4.64.1110271414340.6823@cesium.clock.org>

>>> Then you'd have to analyze the compile-time impact of the IPA
>>> splitting on its own when not iterating.  Then you should look
>>> at what actually was the optimizations that were performed
>>> that lead to the improvement (I can see some indirect inlining
>>> happening, but everything else would be a bug in present
>>> optimizers in the early pipeline - they are all designed to be
>>> roughly independent on each other and _not_ expose new
>>> opportunities by iteration).  Thus - testcases?
>>
>> The initial motivation for the patch was to enable more indirect
>> inlining and devirtualization opportunities.

> Hm.

These optimizations were developed for my employer's proprietary 
codebase. Multiple iterations specifically help propagate the 
concrete type information from functions that implement the 
Abstract Factory design pattern, allowing for cleaner runtime dynamic 
dispatch. I can verify that in said codebase (and in the reduced, 
non-proprietary examples Maxim provided earlier in the year) it works 
quite effectively.

Many of the devirt examples focus on a pure top-down approach like this:
class I { public: virtual void f() = 0; };
class K : public I { public: virtual void f() {} };
class L : public I { public: virtual void f() {} };
void g(I& i) { i.f(); }
int main(void) { L l; g(l); return 0; }

While that strategy isn't unheard of, it implies a link-time substitution 
to inject new/different sub-classes of the parameterized interface. 
Besides limiting extensibility by requiring a rebuild/relink, it also 
presupposes that two different implementations would be mutually exclusive 
for that module. That is often not the case, hence the factory pattern 
expressed in the other examples Maxim provided.

>> Since then I found the patch to be helpful in searching for
>> optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp
>> drops 20% with 2 additional iterations of early optimizations [*].
>> Given that applying more optimizations should, theoretically, not
>> decrease performance, there is likely a very real bug or deficiency
>> behind that.

> It is likely early SRA that messes up, or maybe convert switch.  Early
> passes should be really restricted to always profitable cleanups.

> Your experiment looks useful to track down these bugs, but in general
> I don't think we want to expose iterating early passes.

For the more top-down examples of devirt I mention above, I agree 
with you. Once the CFG is ordered and the analyses happen, things should 
be propagated forward without issue. In the case of factory functions, my 
understanding and experience on this real-world codebase is that multiple 
passes are required. First, to "bubble up" the concrete type info coming 
out of the factory function. Depending on how many layers there are, this 
may require a couple of passes. Second, to then forward propagate that 
concrete type information for the pointer.
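
The factory-style case described above can be sketched roughly as follows. 
All names here (Codec, FastCodec, make_fast_codec, make_default_codec, run) 
are hypothetical and not taken from the codebase under discussion. This is 
only a minimal illustration assuming two layers of factory functions; it 
does not by itself show how many iterations any particular GCC version 
needs.

```cpp
#include <cassert>
#include <memory>

// Hypothetical interface and implementation, for illustration only.
struct Codec {
    virtual ~Codec() = default;
    virtual int decode(int x) const = 0;
};

struct FastCodec : Codec {
    int decode(int x) const override { return x + 1; }
};

// Innermost factory: the only place where the concrete type is visible.
std::unique_ptr<Codec> make_fast_codec() {
    return std::unique_ptr<Codec>(new FastCodec());
}

// Outer factory layer: callers see only the abstract Codec.
std::unique_ptr<Codec> make_default_codec() {
    return make_fast_codec();
}

// The virtual call below can become a direct call to FastCodec::decode only
// once both factory layers are inlined here (the concrete type "bubbles up")
// and that type is then propagated forward to the call through the pointer.
int run(int x) {
    std::unique_ptr<Codec> codec = make_default_codec();
    assert(codec != nullptr);
    return codec->decode(x);
}
```

Each extra factory layer adds one more inlining step before the concrete 
type is visible at the call site, which is the "depending on how many 
layers" part of the argument.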

There was a surprising side-effect when I started experimenting with this 
ipa-passes feature. In a module that contains ~100KLOC, I implemented 
mega-compilation (a poor-man's LTO). At two passes, the module got larger, 
which I expected. This minor growth continued with each additional pass 
until about 7 passes, when the size decreased by over 10%. I set up a 
script to run overnight, incrementally trying more passes and recording 
the module size, and the "sweet spot" for size ended up being 54 passes. 
I took the three smallest binaries and did a full performance regression 
at the system level, and including the smallest binary resulted in an ~6% 
performance improvement (measured as overall network I/O throughput) while 
using less CPU on a Transmeta Crusoe-based appliance. (This is a web 
proxy, with about 500KLOC of other code that was not compiled in this new 
way.)
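
The selection step of that experiment (pick the few smallest builds out of 
many recorded sizes) could look something like the sketch below. The 
Sample type and the smallest_builds function are hypothetical names 
introduced for illustration; the actual script is not shown in this 
thread, and the sketch assumes the (pass count, size) pairs have already 
been recorded.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One recorded measurement: (number of passes used, binary size in bytes).
using Sample = std::pair<int, std::uintmax_t>;

// Return up to `n` samples with the smallest sizes, smallest first.
std::vector<Sample> smallest_builds(std::vector<Sample> samples, std::size_t n) {
    n = std::min(n, samples.size());
    // Only the first `n` positions need to end up sorted.
    std::partial_sort(samples.begin(),
                      samples.begin() + static_cast<std::ptrdiff_t>(n),
                      samples.end(),
                      [](const Sample& a, const Sample& b) {
                          return a.second < b.second;
                      });
    samples.resize(n);
    return samples;
}
```

Calling smallest_builds(samples, 3) would give the three candidates to 
carry forward into the full system-level performance run.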

The idea of multiple passes resulting in a smaller binary and higher 
performance was like a dream. I reproduced a similar pattern on open 
source projects, namely scummvm (on which I was able to use proper LTO)*. 
That is, the binaries got smaller and CPU usage decreased. On some 
projects, this could possibly be correlated with micro-level benchmarks 
such as fewer branch mispredictions and L1 cache misses as reported by 
callgrind.

While it's possible/probable that some of the performance improvements I 
saw by increasing ipa-passes were ultimately missed-optimization bugs that 
should be fixed, I'd be very surprised if *all* of those improvements were 
such bugs. As such, I would still like to see this exposed. I would be 
happy to file bugs and help test any instances where it looks like an 
optimization should have been achieved within a single ipa-pass.

Thanks for helping to get this feature (and the other devirt-related 
pieces) into 4.7 -- it's been a huge boon to improving our C++ designs 
without sacrificing performance.

* Note that scummvm's "sweet spot" number of iterations was 
different. That being said, the default of three iterations, which makes 
the typical use of the Factory pattern devirtualize correctly, still 
resulted in improved performance over a single pass -- just not 
necessarily a smaller binary.

--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt
