public inbox for gcc-patches@gcc.gnu.org
* [PATCH] Add capability to run several iterations of early optimizations
@ 2011-10-12  8:14 Maxim Kuvyrkov
  2011-10-12 12:15 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Maxim Kuvyrkov @ 2011-10-12  8:14 UTC (permalink / raw)
  To: GCC Patches

[-- Attachment #1: Type: text/plain, Size: 1483 bytes --]

The following patch adds a new knob to make GCC perform several iterations of early optimizations and inlining.

This is for don't-care-about-compile-time, optimize-all-you-can scenarios.  Performing several iterations of optimizations significantly improves code speed on a certain proprietary source base.  Some hand-tuning of the parameter value is required to get optimum performance.  Another good use for this option is searching for and ad-hoc analysis of cases where GCC misses optimization opportunities.

With the default setting of '1', nothing changes from the status quo.
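For reference, the knob is exercised through GCC's standard --param
mechanism (the parameter name comes from the params.def hunk below; the
invocation is a sketch and only works with this patch applied):

```shell
# Run three iterations of early optimizations and inlining before
# interprocedural analysis; the default of 1 keeps current behavior.
# Requires a gcc built with this patch.
gcc -O3 --param eipa-iterations=3 -c foo.c
```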

The patch was bootstrapped and regtested with 3 iterations set by default on i686-linux-gnu.  The only failures in regression testsuite were due to latent bugs in handling of EH information, which are being discussed in a different thread.

Performance impact on the standard benchmarks is not conclusive: there are improvements in SPEC2000 of up to 4% and regressions down to -2%; see [*].  SPEC2006 benchmarks will take another day or two to complete and I will update the spreadsheet then.  The benchmarks were run on a Core2 system for all combinations of {-m32/-m64}{-O2/-O3}.

Effect on compilation time is fairly predictable, about 10% compile time increase with 3 iterations.

OK for trunk?

[*] https://docs.google.com/spreadsheet/ccc?key=0AvK0Y-Pgj7bNdFBQMEJ6d3laeFdvdk9lQ1p0LUFkVFE&hl=en_US

Thank you,

--
Maxim Kuvyrkov
CodeSourcery / Mentor Graphics



[-- Attachment #2: fsf-gcc-iter-eipa.ChangeLog --]
[-- Type: application/octet-stream, Size: 429 bytes --]

2011-10-11  Maxim Kuvyrkov  <maxim@codesourcery.com>

	Add scheduling of several iterations of early IPA passes.

	* doc/invoke.texi (eipa-iterations): Document new parameter.
	* params.def (PARAM_EIPA_ITERATIONS): Define.
	* passes.c (init_optimization_passes): Schedule several iterations
	of pass_all_early_optimizations.
	* toplev.c (general_init, toplev_main): Move init_optimization_passes
	after processing of arguments.

[-- Attachment #3: fsf-gcc-iter-eipa.patch --]
[-- Type: application/octet-stream, Size: 4741 bytes --]

commit a54b6cb0160d5413499e08a87da3a35eb6e29652
Author: Maxim Kuvyrkov <maxim@codesourcery.com>
Date:   Sat Oct 1 01:09:50 2011 -0700

    Add iterative optimization passes

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index c92c6b2..1ebaf35 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -9060,6 +9060,12 @@ the parameter is reserved exclusively for debug insns created by
 @option{-fvar-tracking-assignments}, but debug insns may get
 (non-overlapping) uids above it if the reserved range is exhausted.
 
+@item eipa-iterations
+The pass scheduler will execute @option{eipa-iterations} iterations of
+early optimization passes before running interprocedural analysis.
+Running several iterations of optimization passes allows the compiler
+to provide thoroughly optimized code to the interprocedural analysis.
+
 @item ipa-sra-ptr-growth-factor
 IPA-SRA will replace a pointer to an aggregate with one or more new
 parameters only when their cumulative size is less or equal to
diff --git a/gcc/params.def b/gcc/params.def
index 5e49c48..9609919 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -861,6 +861,11 @@ DEFPARAM (PARAM_MIN_NONDEBUG_INSN_UID,
 	  "The minimum UID to be used for a nondebug insn",
 	  0, 1, 0)
 
+DEFPARAM (PARAM_EIPA_ITERATIONS,
+	  "eipa-iterations",
+	  "Number of iterations of early optimizations before IPA analysis",
+	  1, 1, 0)
+
 DEFPARAM (PARAM_IPA_SRA_PTR_GROWTH_FACTOR,
 	  "ipa-sra-ptr-growth-factor",
 	  "Maximum allowed growth of size of new parameters ipa-sra replaces "
diff --git a/gcc/passes.c b/gcc/passes.c
index 887007f..dcc3297 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -1191,6 +1191,7 @@ init_optimization_passes (void)
   NEXT_PASS (pass_early_local_passes);
     {
       struct opt_pass **p = &pass_early_local_passes.pass.sub;
+      int i;
       NEXT_PASS (pass_fixup_cfg);
       NEXT_PASS (pass_init_datastructures);
       NEXT_PASS (pass_expand_omp);
@@ -1199,12 +1200,29 @@ init_optimization_passes (void)
       NEXT_PASS (pass_build_ssa);
       NEXT_PASS (pass_lower_vector);
       NEXT_PASS (pass_early_warn_uninitialized);
-      NEXT_PASS (pass_rebuild_cgraph_edges);
-      NEXT_PASS (pass_inline_parameters);
-      NEXT_PASS (pass_early_inline);
-      NEXT_PASS (pass_all_early_optimizations);
+
+      /* Schedule PARAM_EIPA_ITERATIONS iterations of early optimizations.
+	 We run these passes at least once as pass_all_early_optimizations.  */
+      gcc_assert (PARAM_VALUE (PARAM_EIPA_ITERATIONS) >= 1);
+      for (i = 0; i < PARAM_VALUE (PARAM_EIPA_ITERATIONS); ++i)
 	{
-	  struct opt_pass **p = &pass_all_early_optimizations.pass.sub;
+	  struct opt_pass **q;
+
+	  NEXT_PASS (pass_rebuild_cgraph_edges);
+	  NEXT_PASS (pass_inline_parameters);
+	  NEXT_PASS (pass_early_inline);
+
+	  {
+	    struct opt_pass **r;
+	    r = p;
+	    NEXT_PASS (pass_all_early_optimizations);
+	    /* (*R) is the pointer to the newly allocated instance of
+	       pass_all_early_optimizations.  */
+	    q = &(*r)->next;
+	    /* Push to sub-pass.  */
+	    p = &(*r)->sub;
+	  }
+
 	  NEXT_PASS (pass_remove_cgraph_callee_edges);
 	  NEXT_PASS (pass_rename_ssa_copies);
 	  NEXT_PASS (pass_ccp);
@@ -1222,13 +1240,19 @@ init_optimization_passes (void)
 	  NEXT_PASS (pass_early_ipa_sra);
 	  NEXT_PASS (pass_tail_recursion);
 	  NEXT_PASS (pass_convert_switch);
-          NEXT_PASS (pass_cleanup_eh);
-          NEXT_PASS (pass_profile);
-          NEXT_PASS (pass_local_pure_const);
+	  NEXT_PASS (pass_cleanup_eh);
+	  if (i == PARAM_VALUE (PARAM_EIPA_ITERATIONS) - 1)
+	    /* Schedule pass_profile only for the last iteration.
+	       This pass assumes that it is run only once.  */
+	    NEXT_PASS (pass_profile);
+	  NEXT_PASS (pass_local_pure_const);
 	  /* Split functions creates parts that are not run through
 	     early optimizations again.  It is thus good idea to do this
 	     late.  */
-          NEXT_PASS (pass_split_functions);
+	  NEXT_PASS (pass_split_functions);
+
+	  /* Pop from sub-pass.  */
+	  p = q;
 	}
       NEXT_PASS (pass_release_ssa_names);
       NEXT_PASS (pass_rebuild_cgraph_edges);
diff --git a/gcc/toplev.c b/gcc/toplev.c
index ab6b5a4..a0ad17d 100644
--- a/gcc/toplev.c
+++ b/gcc/toplev.c
@@ -1224,7 +1224,6 @@ general_init (const char *argv0)
   /* This must be done after global_init_params but before argument
      processing.  */
   init_ggc_heuristics();
-  init_optimization_passes ();
   statistics_early_init ();
   finish_params ();
 }
@@ -1984,6 +1983,8 @@ toplev_main (int argc, char **argv)
 		  save_decoded_options, save_decoded_options_count,
 		  UNKNOWN_LOCATION, global_dc);
 
+  init_optimization_passes ();
+
   handle_common_deferred_options ();
 
   init_local_tick ();

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH] Add capability to run several iterations of early optimizations
@ 2011-10-27 22:47 Matt
  2011-10-28 10:01 ` Richard Guenther
  0 siblings, 1 reply; 15+ messages in thread
From: Matt @ 2011-10-27 22:47 UTC (permalink / raw)
  To: Richard Guenther; +Cc: Maxim Kuvyrkov, GCC Patches

>>> Then you'd have to analyze the compile-time impact of the IPA
>>> splitting on its own when not iterating.  Then you should look
>>> at what actually was the optimizations that were performed
>>> that lead to the improvement (I can see some indirect inlining
>>> happening, but everything else would be a bug in present
>>> optimizers in the early pipeline - they are all designed to be
>>> roughly independent on each other and _not_ expose new
>>> opportunities by iteration).  Thus - testcases?
>>
>> The initial motivation for the patch was to enable more indirect
>> inlining and devirtualization opportunities.

> Hm.

It is the proprietary codebase of my employer that these optimizations 
were developed for. Multiple iterations specifically help propagate the 
concrete type information from functions that implement the Abstract 
Factory design pattern, allowing for cleaner runtime dynamic dispatch. I 
can verify that in said codebase (and in the reduced, non-proprietary 
examples Maxim provided earlier in the year) it works quite effectively.

Many of the devirt examples focus on a pure top-down approach like this:
class I { virtual void f() = 0; };
class K : public I { virtual void f() {} };
class L : public I { virtual void f() {} };
void g(I& i) { i.f(); }
int main(void) { L l; g(l); return 0; }

While that strategy isn't unheard of, it implies a link-time substitution 
to inject new/different sub-classes of the parameterized interface. 
Besides limiting extensibility by requiring a rebuild/relink, it also 
presupposes that two different implementations would be mutually exclusive 
for that module. That is often not the case, hence the factory pattern 
expressed in the other examples Maxim provided.

>> Since then I found the patch to be helpful in searching for
>> optimization opportunities and bugs.  E.g., SPEC2006's 471.omnetpp
>> drops 20% with 2 additional iterations of early optimizations [*].
>> Given that applying more optimizations should, theoretically, not
>> decrease performance, there is likely a very real bug or deficiency
>> behind that.

> It is likely early SRA that messes up, or maybe convert switch.  Early
> passes should be really restricted to always profitable cleanups.

> Your experiment looks useful to track down these bugs, but in general
> I don't think we want to expose iterating early passes.

In these other, more top-down examples of devirt I mention above, I agree 
with you. Once the CFG is ordered and the analyses happen, things should 
be propagated forward without issue. In the case of factory functions, my 
understanding and experience on this real-world codebase is that multiple 
passes are required: first, to "bubble up" the concrete type info coming 
out of the factory function (depending on how many layers, it may require 
a couple), and second, to forward-propagate that concrete type 
information for the pointer.

There was a surprising side effect when I started experimenting with this 
ipa-passes feature. In a module that contains ~100KLOC, I implemented 
mega-compilation (a poor man's LTO). At two passes, the module got 
larger, which I expected. This minor growth continued with each 
additional pass until, at about 7 passes, the size decreased by over 10%. 
I set up a script to run overnight to incrementally try pass counts and 
record the module size, and the "sweet spot" ended up being 54 passes as 
far as size goes. I took the three smallest binaries and did a full 
performance regression at the system level, and the smallest binary's 
inclusion resulted in an ~6% performance improvement (measured as overall 
network I/O throughput) while using less CPU on a Transmeta Crusoe-based 
appliance. (This is a web proxy, with about 500KLOC of other code that 
was not compiled in this new way.)

The idea of multiple passes resulting in a smaller binary and higher 
performance was like a dream. I reproduced a similar pattern on open 
source projects, namely scummvm (on which I was able to use proper 
LTO)*. That is, the result was smaller binaries as well as decreased CPU 
usage. On some projects, this could possibly be correlated with 
micro-level improvements such as reduced branch mispredictions and L1 
cache misses as reported by callgrind.

While it's possible/probable that some of the performance improvements I 
saw by increasing ipa-passes were ultimately missed-optimization bugs 
that should be fixed, I'd be very surprised if *all* of them were. As 
such, I would still like to see this exposed. I would be happy to file 
bugs and help test any instances where it looks like an optimization 
should have been achieved within a single ipa-pass.


Thanks for helping to get this feature (and the other devirt-related 
pieces) into 4.7 -- it's been a huge boon to improving our C++ designs 
without sacrificing performance.


* Note that scummvm's "sweet spot" number of iterations was different. 
That being said, the default of three iterations to make the typical use 
of the Factory pattern devirtualize correctly still resulted in improved 
performance over a single pass -- just not necessarily a smaller binary.



--
tangled strands of DNA explain the way that I behave.
http://www.clock.org/~matt


end of thread, other threads:[~2011-11-08 11:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-10-12  8:14 [PATCH] Add capability to run several iterations of early optimizations Maxim Kuvyrkov
2011-10-12 12:15 ` Richard Guenther
2011-10-18  3:00   ` Maxim Kuvyrkov
2011-10-18  9:09     ` Richard Guenther
2011-10-27 23:29       ` Maxim Kuvyrkov
2011-10-28 11:12         ` Richard Guenther
2011-10-28 23:07           ` Maxim Kuvyrkov
2011-10-29  0:10             ` Matt
2011-11-01 20:48               ` Martin Jambor
2011-11-01 21:33               ` Richard Guenther
2011-11-08  7:23                 ` Maxim Kuvyrkov
2011-11-08 11:18                   ` Richard Guenther
2011-10-27 22:47 Matt
2011-10-28 10:01 ` Richard Guenther
2011-10-28 22:30   ` Matt
