From: Richard Guenther
To: Matt
Cc: Maxim Kuvyrkov, GCC Patches
Subject: Re: [PATCH] Add capability to run several iterations of early optimizations
Date: Fri, 28 Oct 2011 10:01:00 -0000

On Thu, Oct 27, 2011 at 11:53 PM, Matt wrote:
>>>> Then you'd have to analyze the compile-time impact of the IPA
>>>> splitting on its own when not iterating. Then you should look
>>>> at what optimizations actually were performed that
>>>> led to the improvement (I can see some indirect inlining
>>>> happening, but everything else would be a bug in present
>>>> optimizers in the early pipeline - they are all designed to be
>>>> roughly independent of each other and _not_ expose new
>>>> opportunities by iteration). Thus - testcases?
>>>
>>> The initial motivation for the patch was to enable more indirect
>>> inlining and devirtualization opportunities.
>>
>> Hm.
>
> It is the proprietary codebase of my employer that these optimizations were
> developed for. Multiple iterations specifically help propagate the concrete
> type information from functions that implement the Abstract Factory design
> pattern, allowing for cleaner runtime dynamic dispatch. I can verify that in
> said codebase (and in the reduced, non-proprietary examples Maxim provided
> earlier in the year) it works quite effectively.
>
> Many of the devirt examples focus on a pure top-down approach like this:
> class I { public: virtual void f() = 0; };
> class K : public I { virtual void f() {} };
> class L : public I { virtual void f() {} };
> void g(I& i) { i.f(); }
> int main(void) { L l; g(l); return 0; }
>
> While that strategy isn't unheard of, it implies a link-time substitution to
> inject new/different sub-classes of the parameterized interface. Besides
> limiting extensibility by requiring a rebuild/relink, it also presupposes
> that two different implementations would be mutually exclusive for that
> module. That is often not the case, hence the factory pattern expressed in
> the other examples Maxim provided.
>
>>> Since then I found the patch to be helpful in searching for
>>> optimization opportunities and bugs. E.g., SPEC2006's 471.omnetpp drops 20%
>>> with 2 additional iterations of early optimizations [*].
>>> Given that applying more optimizations should, theoretically, not decrease
>>> performance, there is likely a very real bug or deficiency behind that.
>
>> It is likely early SRA that messes up, or maybe convert switch. Early
>> passes should be really restricted to always profitable cleanups.
>
>> Your experiment looks useful to track down these bugs, but in general
>> I don't think we want to expose iterating early passes.
>
> In the more top-down examples of devirt I mention above, I agree
> with you. Once the CFG is ordered and the analyses happen, things should be
> propagated forward without issue. In the case of factory functions, my
> understanding and experience on this real-world codebase is that multiple
> passes are required. First, to "bubble up" the concrete type info coming out
> of the factory function; depending on how many layers there are, it may
> require a couple. Second, to then forward propagate that concrete type
> information for the pointer.
>
> There was a surprising side-effect when I started experimenting with this
> ipa-passes feature. In a module that contains ~100KLOC, I implemented
> mega-compilation (a poor man's LTO). At two passes, the module got larger,
> which I expected. This minor growth continued with each additional pass,
> until, at about 7 passes, it decreased by over 10%. I set up a script to
> run overnight to incrementally try pass counts and record the module size,
> and the "sweet spot" ended up being 54 passes as far as size goes. I took the
> three smallest binaries and did a full performance regression at the system
> level, and the smallest binary's inclusion resulted in an ~6% performance
> improvement (measured as overall network I/O throughput) while using less
> CPU on a Transmeta Crusoe-based appliance. (This is a web proxy, with about
> 500KLOC of other code that was not compiled in this new way.)
>
> The idea of multiple passes resulting in a smaller binary and higher
> performance was like a dream. I reproduced a similar pattern on open source
> projects, namely scummvm (on which I was able to use proper LTO)*. That is,
> smaller binaries resulted, as well as decreased CPU usage. On some projects,
> this could possibly be correlated with micro-level benchmarks such as
> reduced branch prediction and L1 cache misses as reported by callgrind.
>
> While it's possible/probable that some of the performance improvements I saw
> by increasing ipa-passes were ultimately missed-optimization bugs that
> should be fixed, I'd be very surprised if that were true of *all* of those
> improvements. As such, I would still like to see this exposed. I would be
> happy to file bugs and help test any instances where it looks like an
> optimization should have been achieved within a single ipa-pass.

I discussed the idea of iterating early optimizations briefly with Honza.
I was trying to step back a bit and look at what we try to do right now,
which is to optimize functions in topological order (thus, try to make sure
all callees are already early-optimized when optimizing callers). That of
course is difficult when there are cycles in the cgraph (for which we
basically start at a random place), especially when there would be
may-edges (which we do not have) for indirect/virtual calls, as they
basically make the whole cgraph cyclic.
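To make the factory case concrete, here is a minimal sketch (hypothetical
code, not taken from the patch or from either codebase):

class I { public: virtual void f() = 0; virtual ~I() {} };
class K : public I { public: virtual void f() {} };

I* make() { return new K; }   // factory: the concrete type is known only here

void h()
{
  I* p = make();   // once make() is early-inlined, p is known to be a K
  p->f();          // a further early pass can then devirtualize this to K::f()
  delete p;
}

Until make() has been inlined, the call p->f() is, as far as the cgraph is
concerned, an unresolved indirect call - exactly the situation described
above.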
So my idea was to make cycle processing more explicit in early
optimizations and, whenever we discover a new direct cgraph edge, make
sure we optimize the callee, and whenever we have optimized a callee,
queue all of its callers for re-optimization. You of course have to limit
the number of times you want to process a function, otherwise for a cycle
you'd optimize indefinitely. We already do the inlining itself repeatedly
(via --param early-inliner-max-iterations), though that only iterates the
inlining itself, allowing for "deep" inlining and some cases of inlining
of indirect calls if the inliner substituted a function pointer parameter
into a call of the inlined function. Moving that iteration over to
iterating over the optimizations could make sense. Thus, I'd really like
to at least make iterating depend on some profitability analysis, even if
it is only based on cgraph analysis such as "we discovered a new direct
edge".

Richard.

> Thanks for helping to get this feature (and the other devirt-related pieces)
> into 4.7 -- it's been a huge boon to improving our C++ designs without
> sacrificing performance.
>
> * Note that scummvm's "sweet spot" number of iterations was different.
> That being said, the default of three iterations to make the typical use of
> the Factory pattern devirtualize correctly still resulted in improved
> performance over a single pass -- just not necessarily a smaller binary.
>
> --
> tangled strands of DNA explain the way that I behave.
> http://www.clock.org/~matt
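A rough sketch of the iteration scheme Richard describes above - hypothetical
types and helpers, not the actual GCC cgraph data structures or pass manager -
showing how re-queueing callers when a new direct edge is discovered, together
with a per-function bound, keeps the work finite even on a cyclic cgraph:

#include <queue>
#include <vector>

// Illustrative structures only; not the GCC cgraph API.
struct Function {
  std::vector<Function*> callers;   // currently known direct callers
  int times_processed;
  Function() : times_processed(0) {}
};

// Stand-in for running the early-optimization pipeline on fn.  In reality
// it would return true when new direct call edges were discovered (e.g. an
// indirect call became direct after inlining) - the profitability signal.
bool early_optimize(Function* fn) { (void)fn; return false; }

void early_optimize_all(const std::vector<Function*>& topological_order)
{
  const int max_iterations = 3;        // bound so that cycles terminate
  std::queue<Function*> work;
  for (size_t i = 0; i < topological_order.size(); ++i)
    work.push(topological_order[i]);

  while (!work.empty()) {
    Function* fn = work.front();
    work.pop();
    if (fn->times_processed++ >= max_iterations)
      continue;
    // Re-queue callers only when the cheap cgraph-based check fires:
    // "we discovered a new direct edge".
    if (early_optimize(fn))
      for (size_t i = 0; i < fn->callers.size(); ++i)
        work.push(fn->callers[i]);
  }
}

The per-function bound plays the same role for the whole early pipeline that
--param early-inliner-max-iterations currently plays for early inlining alone:
it guarantees termination when the cgraph is cyclic.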