public inbox for gcc@gcc.gnu.org
* [RFC] Remove -freorder-blocks-and-partition
@ 2011-07-19 21:43 Richard Henderson
  2011-07-19 21:56 ` Bernd Schmidt
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Richard Henderson @ 2011-07-19 21:43 UTC (permalink / raw)
  To: gcc

There are a number of problems with this code that affect
its ability to work with any non-x86-like target, that is,
anyone that doesn't define at least HAS_LONG_UNCOND_BRANCH
and possibly HAS_LONG_COND_BRANCH.

We begin, quite sensibly, with pass_partition_blocks which
performs a number of transformations upon the code that,
while the actual code could be better factored, is quite
easy to follow.  Depending on the features of the target,
fallthrus are turned into unconditional jumps, conditional
jumps are split into branch around branch, unconditional
jumps are turned into indirect jumps.
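
To make the target dependence concrete, here is a toy model of that
decision.  It is only a sketch: the enums, the struct and
crossing_fixup() are hypothetical names invented for illustration and
do not correspond to GCC's data structures.  It assumes the two target
macros simply gate the conditional and unconditional cases, and it
models a single step only (the jump added for a fallthru would itself
be subject to the unconditional-jump rule).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not GCC code.  */
enum edge_kind { EDGE_FALLTHRU, EDGE_COND_JUMP, EDGE_UNCOND_JUMP };

enum fixup {
  FIXUP_NONE,                 /* the target can reach the other section directly */
  FIXUP_ADD_UNCOND_JUMP,      /* fallthru becomes an explicit jump */
  FIXUP_BRANCH_AROUND_BRANCH, /* short conditional branch over a long jump */
  FIXUP_INDIRECT_JUMP         /* load the address into a register, jump through it */
};

/* Stand-ins for HAS_LONG_COND_BRANCH and HAS_LONG_UNCOND_BRANCH.  */
struct target_caps {
  bool has_long_cond_branch;
  bool has_long_uncond_branch;
};

/* Which rewrite does a section-crossing edge of kind KIND need?  */
enum fixup crossing_fixup(enum edge_kind kind, struct target_caps caps)
{
  switch (kind) {
  case EDGE_FALLTHRU:
    /* Control cannot fall through from one section into another.  */
    return FIXUP_ADD_UNCOND_JUMP;
  case EDGE_COND_JUMP:
    return caps.has_long_cond_branch ? FIXUP_NONE : FIXUP_BRANCH_AROUND_BRANCH;
  case EDGE_UNCOND_JUMP:
    return caps.has_long_uncond_branch ? FIXUP_NONE : FIXUP_INDIRECT_JUMP;
  }
  assert(!"unknown edge kind");
  return FIXUP_NONE;
}
```

On a target with both capabilities only fallthru edges need work; with
neither, every kind of crossing edge has to be rewritten.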

There are nice bits of commentary that say why things are
implemented this way, including exposing the indirect jumps
to the register allocator.

But after pass_partition_blocks, we run into trouble.  There
are no less than 4 other passes that add *new* crossing jumps
without doing *any* of the subsequent fixups for less capable
targets: pass_outof_cfg_layout_mode, pass_reorder_blocks,
pass_sched2 (ia64 only? it's in code in haifa that looks like
speculative load fixups), and pass_convert_to_eh_region_ranges.

The worst part is that test coverage for this feature is
extremely poor.  It's very difficult to tell if any cleanup
in this area is likely to introduce more bugs than it fixes.

After 3 days fighting with this code, I had a bit of a 
cathartic whine on IRC.  I got two votes to just rip the 
whole thing out.

Andrew Pinski points out that the feature could probably be
equivalently implemented via outlining and function calls
(I assume well back at the gimple level).  At which point we
no longer have cross-segment jump_insns at the rtl level,
which seems like a Really Big Win to me at this point.
Not that I'm volunteering to actually do the work to implement
any such scheme.

Thoughts?


r~


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 21:43 [RFC] Remove -freorder-blocks-and-partition Richard Henderson
@ 2011-07-19 21:56 ` Bernd Schmidt
  2011-07-19 22:09   ` Richard Henderson
  2011-07-19 22:28 ` Joern Rennecke
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Bernd Schmidt @ 2011-07-19 21:56 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

On 07/19/11 23:33, Richard Henderson wrote:
> But after pass_partition_blocks, we run into trouble.  There
> are no less than 4 other passes that add *new* crossing jumps
> without doing *any* of the subsequent fixups for less capable
> targets: pass_outof_cfg_layout_mode, pass_reorder_blocks,
> pass_sched2 (ia64 only? it's in code in haifa that looks like
> speculative load fixups), and pass_convert_to_eh_region_ranges.

Is it possible to leave it in for targets to call during their reorg
pass, which I'm assuming is late enough? Really the entire pass pipeline
after reload needs rethinking.

I'm not necessarily opposed to removing it though. I also ran into the
lack of test coverage.


Bernd


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 21:56 ` Bernd Schmidt
@ 2011-07-19 22:09   ` Richard Henderson
  2011-07-19 23:18     ` Joern Rennecke
  0 siblings, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2011-07-19 22:09 UTC (permalink / raw)
  To: Bernd Schmidt; +Cc: gcc

On 07/19/2011 02:42 PM, Bernd Schmidt wrote:
> On 07/19/11 23:33, Richard Henderson wrote:
>> But after pass_partition_blocks, we run into trouble.  There
>> are no less than 4 other passes that add *new* crossing jumps
>> without doing *any* of the subsequent fixups for less capable
>> targets: pass_outof_cfg_layout_mode, pass_reorder_blocks,
>> pass_sched2 (ia64 only? it's in code in haifa that looks like
>> speculative load fixups), and pass_convert_to_eh_region_ranges.

Unrelated to your response, I had intended to make the point that
three of the errant passes above run after register allocation.
Which of course does not work on any target that does not support
long branches.

> 
> Is it possible to leave it in for targets to call during their reorg
> pass, which I'm assuming is late enough? Really the entire pass pipeline
> after reload needs rethinking.

I'm not sure how one would apply this during a reorg pass.

Presumably your target does have at least long unconditional branches,
since otherwise one runs into a register allocation problem.  If in
addition you've long conditional branches as well as no
exception_receiver pattern, then it seems like you could completely do
away with the early pass_partition_blocks.

Given the pre-conditions above, we could delay the assignment of 
blocks to partitions until quite late.  Although you probably would
never leave it any later than pass_reorder_blocks.


r~


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 21:43 [RFC] Remove -freorder-blocks-and-partition Richard Henderson
  2011-07-19 21:56 ` Bernd Schmidt
@ 2011-07-19 22:28 ` Joern Rennecke
  2011-07-19 22:44   ` Richard Henderson
  2011-07-25  7:40 ` Xinliang David Li
  2011-08-03 20:50 ` Jan Hubicka
  3 siblings, 1 reply; 21+ messages in thread
From: Joern Rennecke @ 2011-07-19 22:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

Quoting Richard Henderson <rth@redhat.com>:

> Andrew Pinski points out that the feature could probably be
> equivalently implemented via outlining and function calls
> (I assume well back at the gimple level).

Function calls would mean that you'd have to deal with
call-clobbered registers - any working set size savings from outlining
could easily be drowned by worsening register allocation or insertion of
caller-save instructions.  And you can't easily set multiple values in
specific registers and stack slots.  Unless you want to add fancy custom
ABIs for the outlined functions.
And then there is the issue that functions tend to have a single
return address.  You might have a complex piece of error handling code
that makes a decision where it comes back into the hot code.  With a
function, you would need yet another return value, and then a tablejump
depending on that value.
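
A small sketch of the second point, in C.  The function names and the
resume_point enum are made up for illustration; noinline and cold are
existing GCC function attributes.  The cold code has to tell the caller
where to resume, and the caller dispatches on that value, which is the
extra return value plus multi-way branch described above.

```c
#include <assert.h>

/* Hypothetical: where the hot code should continue after the cold path.  */
enum resume_point { RESUME_RETRY, RESUME_SKIP, RESUME_FAIL };

/* Outlined cold code.  It cannot jump back into the middle of the
   caller, so it reports the resume point as a return value and passes
   any repaired state back through a pointer.  */
__attribute__((noinline, cold))
static enum resume_point handle_bad_input(int value, int *fixed)
{
  if (value < 0 && value > -100) {
    *fixed = -value;          /* repairable: retry with the magnitude */
    return RESUME_RETRY;
  }
  if (value == 0)
    return RESUME_SKIP;
  return RESUME_FAIL;
}

/* Hot path: doubles positive inputs; everything else goes cold.  */
int process(int value)
{
  int fixed = 0;

  if (value > 0)
    return value * 2;

  /* The multi-way dispatch that a plain cross-section jump would not need.  */
  switch (handle_bad_input(value, &fixed)) {
  case RESUME_RETRY:
    return fixed * 2;
  case RESUME_SKIP:
    return 0;
  case RESUME_FAIL:
    return -1;
  }
  assert(!"unreachable resume point");
  return -1;
}
```

With partitioning, each of the three outcomes could instead be a direct
jump from the cold section back to a different label in the hot code.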

> At which point we
> no longer have cross-segment jump_insns at the rtl level,
> which seems like a Really Big Win to me at this point.

I suppose the basic problem is that these jumps are so easily mistaken for
ordinary jump_insns.  If they were more obviously different, like a tablejump,
we'd leave them alone by default.  We don't do jump threading through
non-simplified tablejumps, either.
What would you think about putting the destination section in the
instruction pattern?

Of course, changing the rtl representation doesn't fix the problems with
passes like cfglayout.
These might indeed be better off with a different model for a jump into
a cold section; e.g. it could be thought of as an instruction that sets
a vector of registers and memory locations depending on another such vector,
and then does (optionally) a multi-way jump.  Indeed a bit like a call  
instruction, but with more potential side effects that we want, and less
that we don't want.

> Not that I'm volunteering to actually do the work to implement
> any such scheme.

Same here...  just some thoughts.


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 22:28 ` Joern Rennecke
@ 2011-07-19 22:44   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2011-07-19 22:44 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: gcc

On 07/19/2011 03:24 PM, Joern Rennecke wrote:
>> Andrew Pinski points out that the feature could probably be
>> equivalently implemented via outlining and function calls
>> (I assume well back at the gimple level).
> 
> Function calls would mean that you'd have to deal with
> call-clobbered registers - any working set size savings from outlining
> could easily be drowned by worsening register allocation or insertion of
> caller-save instructions.  And you can't easily set multiple values in
> specific registers and stack slots.  Unless you want to add fancy custom
> ABIs for the outlined functions.
> And then there is the issue that functions tend to have a single
> return address.  You might have a complex piece of error handling code
> that makes a decision where it comes back into the hot code.  With a
> function, you would need yet another return value, and then a tablejump
> depending on that value.

All true.

Although I'll note that while doing profiledbootstrap with 
partitioning enabled, a very large fraction of what gets pushed
to the cold section seems to be noreturn paths that lead to abort.
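
For illustration, the typical shape of such code is sketched below.
fatal() and checked_div() are hypothetical names; noreturn and cold are
existing GCC function attributes.  The call to fatal() is the kind of
path that would presumably be classified as cold.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Error exit that never returns to its caller.  */
__attribute__((noreturn, cold))
static void fatal(const char *msg)
{
  assert(msg != NULL);
  fprintf(stderr, "fatal: %s\n", msg);
  abort();
}

/* Hot code with one unlikely, non-returning branch.  */
int checked_div(int num, int den)
{
  if (den == 0)
    fatal("division by zero");  /* never returns; a candidate for the cold section */
  return num / den;
}
```

Since control never comes back from fatal(), moving that block away
from the hot code needs no jump back into it.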

> I suppose the basic problem is that these jumps are so easily
> mistaken for ordinary jump_insns.  If they were more obviously
> different, like a tablejump, we'd leave them alone by default.  We
> don't do jump threading through non-simplified tablejumps, either. 
> What would you think about putting the destination section in the 
> instruction pattern?

Having some obviously distinguishable mechanism might be an
interesting solution.

> Of course, changing the rtl representation doesn't fix the problems with
> passes like cfglayout.

Yeah, I'm not sure what to do with cfglayout.  We definitely do
not want it considering insns from both segments at the same time.

I wonder how well we'd get away with marking crossing edges as
abnormal?  Or something else that automatically prevents anyone
from trying to split such an edge.



r~


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 22:09   ` Richard Henderson
@ 2011-07-19 23:18     ` Joern Rennecke
  0 siblings, 0 replies; 21+ messages in thread
From: Joern Rennecke @ 2011-07-19 23:18 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Bernd Schmidt, gcc

Quoting Richard Henderson <rth@redhat.com>:

> Presumably your target does have at least long unconditional branches,
> since otherwise one runs into a register allocation problem.  If in
> addition you've long conditional branches as well as no
> exception_receiver pattern, then it seems like you could completely do
> away with the early pass_partition_blocks.

Well, targets with delay slots can use register scavenging, and in the
rare case that no free register is found, save one register on the stack,
load it with the target address, jump, and restore the register from the
stack in the delay slot of the jump.
Not as good as having the long jump exposed earlier, but at least the compilation
succeeds.  Well, unless something else gets out of range due to unanticipated
size increases that can't be fixed up that easily, e.g. a vector of  
case labels.


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 21:43 [RFC] Remove -freorder-blocks-and-partition Richard Henderson
  2011-07-19 21:56 ` Bernd Schmidt
  2011-07-19 22:28 ` Joern Rennecke
@ 2011-07-25  7:40 ` Xinliang David Li
  2011-07-25 11:05   ` Paolo Bonzini
  2011-08-03 20:50 ` Jan Hubicka
  3 siblings, 1 reply; 21+ messages in thread
From: Xinliang David Li @ 2011-07-25  7:40 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

FYI  the performance impact of this option with SPEC06 (built with
google_46 compiler and measured on a core2 box).  The base line number
is FDO, and ref number is FDO + reorder_with_partitioning.

xalancbmk improves > 3.5%
perlbench improves > 1.5%
dealII and bzip2 degrades about 1.4%.

Note the partitioning scheme is not tuned at all -- there is not even
a tunable parameter to play with.

David




On Tue, Jul 19, 2011 at 2:33 PM, Richard Henderson <rth@redhat.com> wrote:
> There are a number of problems with this code that affect
> its ability to work with any non-x86-like target, that is,
> anyone that doesn't define at least HAS_LONG_UNCOND_BRANCH
> and possibly HAS_LONG_COND_BRANCH.
>
> We begin, quite sensibly, with pass_partition_blocks which
> performs a number of transformations upon the code that,
> while the actual code could be better factored, is quite
> easy to follow.  Depending on the features of the target,
> fallthrus are turned into unconditional jumps, conditional
> jumps are split into branch around branch, unconditional
> jumps are turned into indirect jumps.
>
> There's nice bits of commentary that say why things are
> implemented this way, including exposing the indirect jumps
> to the register allocator.
>
> But after pass_partition_blocks, we run into trouble.  There
> are no less than 4 other passes that add *new* crossing jumps
> without doing *any* of the subsequent fixups for less capable
> targets: pass_outof_cfg_layout_mode, pass_reorder_blocks,
> pass_sched2 (ia64 only? it's in code in haifa that looks like
> speculative load fixups), and pass_convert_to_eh_region_ranges.
>
> The worst part is that test coverage for this feature is
> extremely poor.  It's very difficult to tell if any cleanup
> in this area is likely to introduce more bugs than it fixes.
>
> After 3 days fighting with this code, I had a bit of a
> cathartic whine on IRC.  I got two votes to just rip the
> whole thing out.
>
> Andrew Pinski points out that the feature could probably be
> equivalently implemented via outlining and function calls
> (I assume well back at the gimple level).  At which point we
> no longer have cross-segment jump_insns at the rtl level,
> which seems like a Really Big Win to me at this point.
> Not that I'm volunteering to actually do the work to implement
> any such scheme.
>
> Thoughts?
>
>
> r~
>


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-25  7:40 ` Xinliang David Li
@ 2011-07-25 11:05   ` Paolo Bonzini
  2011-07-25 18:40     ` Xinliang David Li
                       ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Paolo Bonzini @ 2011-07-25 11:05 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Richard Henderson, gcc

On 07/25/2011 06:42 AM, Xinliang David Li wrote:
> FYI  the performance impact of this option with SPEC06 (built with
> google_46 compiler and measured on a core2 box).  The base line number
> is FDO, and ref number is FDO + reorder_with_partitioning.
>
> xalancbmk improves > 3.5%
> perlbench improves > 1.5%
> dealII and bzip2 degrades about 1.4%.
>
> Note the partitioning scheme is not tuned at all -- there is not even
> a tunable parameter to play with.

Did you check what is pushed down to the cold section in these cases?

Paolo


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-25 11:05   ` Paolo Bonzini
@ 2011-07-25 18:40     ` Xinliang David Li
  2011-07-26  1:29     ` Xinliang David Li
  2011-08-03 20:56     ` Jan Hubicka
  2 siblings, 0 replies; 21+ messages in thread
From: Xinliang David Li @ 2011-07-25 18:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Henderson, gcc

On Mon, Jul 25, 2011 at 3:23 AM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 07/25/2011 06:42 AM, Xinliang David Li wrote:
>>
>> FYI  the performance impact of this option with SPEC06 (built with
>> google_46 compiler and measured on a core2 box).  The base line number
>> is FDO, and ref number is FDO + reorder_with_partitioning.
>>
>> xalancbmk improves > 3.5%
>> perlbench improves > 1.5%
>> dealII and bzip2 degrades about 1.4%.
>>
>> Note the partitioning scheme is not tuned at all -- there is not even
>> a tunable parameter to play with.
>
> Did you check what is pushed down to the cold section in these cases?

I have not done any analysis on them.

David
>
> Paolo
>


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-25 11:05   ` Paolo Bonzini
  2011-07-25 18:40     ` Xinliang David Li
@ 2011-07-26  1:29     ` Xinliang David Li
  2011-07-26  2:33       ` Joern Rennecke
  2011-08-03 21:06       ` Jan Hubicka
  2011-08-03 20:56     ` Jan Hubicka
  2 siblings, 2 replies; 21+ messages in thread
From: Xinliang David Li @ 2011-07-26  1:29 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Richard Henderson, gcc

In xalancbmk, with the partition option, most of the object files have
nonzero size cold sections generated. The text size of the binary is
increased to 3572728 bytes from 3466790 bytes.  Profiling the program
using the training input shows the following differences. With
partitioning, number of executed branch instructions slightly
increases, but itlb misses and icache load misses are significantly
lower compared with the binary without partitioning.


David

With partition:
-----------------
    53654937239  branches
      306751458  L1-icache-load-misses
        8146112  iTLB-load-misses

Without partition:
---------------------
    52348639025  branches
      454417666  L1-icache-load-misses
       14470953  iTLB-load-misses
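
For reference, the relative changes implied by these figures can be
worked out with a small helper.  pct_change() is a hypothetical name
used only for this sketch; the inputs in the comment are the numbers
quoted in this message.

```c
#include <assert.h>

/* Relative change from BEFORE to AFTER, in percent.  */
double pct_change(double before, double after)
{
  assert(before != 0.0);
  return (after - before) / before * 100.0;
}

/* Applied to the numbers above (without partition -> with partition):
     text size                  3466790 -> 3572728       roughly  +3.1%
     branches               52348639025 -> 53654937239   roughly  +2.5%
     L1-icache-load-misses    454417666 -> 306751458     roughly -32.5%
     iTLB-load-misses          14470953 -> 8146112       roughly -43.7%  */
```

So about 3% more text and 2.5% more executed branches come with roughly
a third fewer icache misses and well over 40% fewer iTLB misses.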


On Mon, Jul 25, 2011 at 3:23 AM, Paolo Bonzini <bonzini@gnu.org> wrote:
> On 07/25/2011 06:42 AM, Xinliang David Li wrote:
>>
>> FYI  the performance impact of this option with SPEC06 (built with
>> google_46 compiler and measured on a core2 box).  The base line number
>> is FDO, and ref number is FDO + reorder_with_partitioning.
>>
>> xalancbmk improves > 3.5%
>> perlbench improves > 1.5%
>> dealII and bzip2 degrades about 1.4%.
>>
>> Note the partitioning scheme is not tuned at all -- there is not even
>> a tunable parameter to play with.
>
> Did you check what is pushed down to the cold section in these cases?
>
> Paolo
>


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-26  1:29     ` Xinliang David Li
@ 2011-07-26  2:33       ` Joern Rennecke
  2011-07-27  6:47         ` Xinliang David Li
  2011-08-03 21:06       ` Jan Hubicka
  1 sibling, 1 reply; 21+ messages in thread
From: Joern Rennecke @ 2011-07-26  2:33 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Paolo Bonzini, Richard Henderson, gcc

Quoting Xinliang David Li <davidxl@google.com>:

> In xalancbmk, with the partition option, most of object files have
> nonzero size cold sections generated. The text size of the binary is
> increased to 3572728 bytes from 3466790 bytes.  Profiling the program
> using the training input shows the following differences. With
> partitioning, number of executed branch instructions slightly
> increases, but itlb misses and icache load misses are significantly
> lower compared with the binary without partitioning.

It is nice to have confirmation that for this benchmark, the optimization
causes a speedup because it works as intended, however...

>>> dealII and bzip2 degrades about 1.4%.

... I think the question was more directed at what causes the
performance degradation for these two benchmarks.

If we could retain most of the speedups when the optimization works well
but avoid most of the slowdown in the benchmarks that are currently hurt,
we could improve the overall SPEC06 score.  And hopefully, this would
also be beneficial to other code.


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-26  2:33       ` Joern Rennecke
@ 2011-07-27  6:47         ` Xinliang David Li
  2011-07-27  8:12           ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Xinliang David Li @ 2011-07-27  6:47 UTC (permalink / raw)
  To: Joern Rennecke; +Cc: Paolo Bonzini, Richard Henderson, gcc

On Mon, Jul 25, 2011 at 6:30 PM, Joern Rennecke <amylaar@spamcop.net> wrote:
> Quoting Xinliang David Li <davidxl@google.com>:
>
>> In xalancbmk, with the partition option, most of object files have
>> nonzero size cold sections generated. The text size of the binary is
>> increased to 3572728 bytes from 3466790 bytes.  Profiling the program
>> using the training input shows the following differences. With
>> partitioning, number of executed branch instructions slightly
>> increases, but itlb misses and icache load misses are significantly
>> lower compared with the binary without partitioning.
>
> It is nice to have confirmation that for this benchmark, the optimization
> causes a speedup because it works as intended, however...
>
>>>> dealII and bzip2 degrades about 1.4%.
>
> ... I think the question was more directed at what causes the
> performance degradation for these two benchmarks.
>
> If we could retain most of the speedups when the optimization works well
> but avoid most of the slowdown in the benchmarks that are currently hurt,
> we could improve the overall SPEC06 score.  And hopefully, this would
> also be beneficial to other code.

Agree.  There are certainly problems in the partition pass, as for
bzip2 the icache misses actually go up with partition, which is not
expected. It needs further analysis.

David

>


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-27  6:47         ` Xinliang David Li
@ 2011-07-27  8:12           ` Paolo Bonzini
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2011-07-27  8:12 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Joern Rennecke, Richard Henderson, gcc, Laszlo Ersek

On 07/27/2011 06:51 AM, Xinliang David Li wrote:
> >  If we could retain most of the speedups when the optimization works well
> >  but avoid most of the slowdown in the benchmarks that are currently hurt,
> >  we could improve the overall SPEC06 score.  And hopefully, this would
> >  also be beneficial to other code.
>
> Agree.  There are certainly problems in the partition pass, as for
> bzip2 the icache misses actually go up with partition, which is not
> expected. It needs further analysis.

It's probably too aggressive.  Icache misses go up because a) the 
overall size of the executable grows; b) cold parts are probably not 
cold enough in the case of bzip2.

Paolo


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-19 21:43 [RFC] Remove -freorder-blocks-and-partition Richard Henderson
                   ` (2 preceding siblings ...)
  2011-07-25  7:40 ` Xinliang David Li
@ 2011-08-03 20:50 ` Jan Hubicka
  3 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2011-08-03 20:50 UTC (permalink / raw)
  To: Richard Henderson; +Cc: gcc

Hi,
> The worst part is that test coverage for this feature is
> extremely poor.  It's very difficult to tell if any cleanup
> in this area is likely to introduce more bugs than it fixes.
> 
> After 3 days fighting with this code, I had a bit of a 
> cathartic whine on IRC.  I got two votes to just rip the 
> whole thing out.

I am also not a fan of the code, given that I had several encounters with it and
was bitten by it quite badly, too.

With ipa-split I implemented part of what is needed for outlining cold
regions of functions into separate functions.  This however is different from
partitioning - i.e. the code sequence for getting into the offlined part is
longer since you need to actually pass stuff in function arguments and it is
hard to jump back and forth between hot and cold regions.
Expecting the partitioning to be fully replaced by gimple-level offlining is
thus not realistic.

So function partitioning still makes sense to me as an optimization and in fact
I was hoping to get it into a shape where it can be enabled with -fprofile-use by
default and thus also tested by profiledbootstrap.  It did not happen as I am
busy with IPA/LTO tasks at the moment.

So I am unsure what we really want to do.  Removing the feature seems a pity,
but at the same time the code really needs a revamp. Since you apparently spent
the most time on this issue, I won't object to your decision to rip out the code.

Honza
> 
> Andrew Pinski points out that the feature could probably be
> equivalently implemented via outlining and function calls
> (I assume well back at the gimple level).  At which point we
> no longer have cross-segment jump_insns at the rtl level,
> which seems like a Really Big Win to me at this point.
> Not that I'm volunteering to actually do the work to implement
> any such scheme.
> 
> Thoughts?
> 
> 
> r~


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-25 11:05   ` Paolo Bonzini
  2011-07-25 18:40     ` Xinliang David Li
  2011-07-26  1:29     ` Xinliang David Li
@ 2011-08-03 20:56     ` Jan Hubicka
  2 siblings, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2011-08-03 20:56 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Xinliang David Li, Richard Henderson, gcc

> On 07/25/2011 06:42 AM, Xinliang David Li wrote:
>> FYI  the performance impact of this option with SPEC06 (built with
>> google_46 compiler and measured on a core2 box).  The base line number
>> is FDO, and ref number is FDO + reorder_with_partitioning.
>>
>> xalancbmk improves > 3.5%
>> perlbench improves > 1.5%
>> dealII and bzip2 degrades about 1.4%.
>>
>> Note the partitioning scheme is not tuned at all -- there is not even
>> a tunable parameter to play with.
>

I looked at the bzip2 slowdown years ago and back then it was a code layout issue:
i.e. adding nops at the place where code was offlined actually restored the performance.
It was a couple of years back and thus definitely on a different CPU than what David uses.
Bzip2 has tight internal loops sorting the strings, so layout issues are nonetheless
quite a likely explanation.

Honza


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-07-26  1:29     ` Xinliang David Li
  2011-07-26  2:33       ` Joern Rennecke
@ 2011-08-03 21:06       ` Jan Hubicka
  2011-08-03 21:46         ` Xinliang David Li
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2011-08-03 21:06 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Paolo Bonzini, Richard Henderson, gcc

> In xalancbmk, with the partition option, most of object files have
> nonzero size cold sections generated. The text size of the binary is
> increased to 3572728 bytes from 3466790 bytes.  Profiling the program
> using the training input shows the following differences. With
> partitioning, number of executed branch instructions slightly
> increases, but itlb misses and icache load misses are significantly
> lower compared with the binary without partitioning.
> 
> 
> David
> 
> With partition:
> -----------------
>    53654937239  branches
>       306751458  L1-icache-load-misses
>         8146112  iTLB-load-misses

Note that I was also planning for some time to introduce a notion of provably cold
stuff into our branch prediction heuristics, i.e. code leading to aborts, EH, etc.
that can then be offlined even w/o profile feedback and could perhaps help
large apps.
(Also, the whole pass should be more effective with larger testcases; SPEC2k6 is slowly
becoming a small one.)

Honza


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-08-03 21:06       ` Jan Hubicka
@ 2011-08-03 21:46         ` Xinliang David Li
  2011-08-04 13:32           ` Jan Hubicka
  2011-08-04 13:39           ` Jan Hubicka
  0 siblings, 2 replies; 21+ messages in thread
From: Xinliang David Li @ 2011-08-03 21:46 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Paolo Bonzini, Richard Henderson, gcc

On Wed, Aug 3, 2011 at 2:06 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> In xalancbmk, with the partition option, most of object files have
>> nonzero size cold sections generated. The text size of the binary is
>> increased to 3572728 bytes from 3466790 bytes.  Profiling the program
>> using the training input shows the following differences. With
>> partitioning, number of executed branch instructions slightly
>> increases, but itlb misses and icache load misses are significantly
>> lower compared with the binary without partitioning.
>>
>>
>> David
>>
>> With partition:
>> -----------------
>>    53654937239  branches
>>       306751458  L1-icache-load-misses
>>         8146112  iTLB-load-misses
>
> Note that I was also planning for some time to introduce notion of provably cold
> stuff into our branch prediction heuristics. I.e. code leading to aborts, eh etc

The noreturn attribute is looked at by the static profile estimation pass. Is
the attribute (definitely not returning) properly propagated to the
callers (wrappers of exit, etc)?

David

> that can be then offlined even w/o profile feedback and could perhaps help
> to large apps.
> (also the whole pass should be more effective with larger testcases, SPEC2k6 is slowly
> becoming a small one)
>
> Honza
>


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-08-03 21:46         ` Xinliang David Li
@ 2011-08-04 13:32           ` Jan Hubicka
  2011-08-04 13:39           ` Jan Hubicka
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Hubicka @ 2011-08-04 13:32 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: Jan Hubicka, Paolo Bonzini, Richard Henderson, gcc

> On Wed, Aug 3, 2011 at 2:06 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> In xalancbmk, with the partition option, most of object files have
> >> nonzero size cold sections generated. The text size of the binary is
> >> increased to 3572728 bytes from 3466790 bytes.  Profiling the program
> >> using the training input shows the following differences. With
> >> partitioning, number of executed branch instructions slightly
> >> increases, but itlb misses and icache load misses are significantly
> >> lower compared with the binary without partitioning.
> >>
> >>
> >> David
> >>
> >> With partition:
> >> -----------------
> >>    53654937239  branches
> >>       306751458  L1-icache-load-misses
> >>         8146112  iTLB-load-misses
> >
> > Note that I was also planning for some time to introduce notion of provably cold
> > stuff into our branch prediction heuristics. I.e. code leading to aborts, eh etc
> 
> no-return attribute is looked at by static profile estimation pass. Is
> the attribute (definitely not returning) properly propagated to the
> callers (wrappers of exit, etc)?

It is, at local pure const and IPA pure const. The catch with IPA pure const is
that the profile is computed at that time and it is not updated afterwards, so when
discovered late it does not affect static profile estimates (yet); it only
improves the CFG and codegen.

Honza
> 
> David
> 
> > that can be then offlined even w/o profile feedback and could perhaps help
> > to large apps.
> > (also the whole pass should be more effective with larger testcases, SPEC2k6 is slowly
> > becoming a small one)
> >
> > Honza
> >


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-08-03 21:46         ` Xinliang David Li
  2011-08-04 13:32           ` Jan Hubicka
@ 2011-08-04 13:39           ` Jan Hubicka
  2011-08-04 16:03             ` Taras Glek
  1 sibling, 1 reply; 21+ messages in thread
From: Jan Hubicka @ 2011-08-04 13:39 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Paolo Bonzini, Richard Henderson, gcc, tglek

Also on the original topic, I know that Mozilla folks experimented with this
switch (and I do expect it should make a noticeable reduction in the hot section
footprint that is important for them). They are not using it at the moment
because of problems with their bug reporting tool not being able to handle unwind
info for split functions.

Taras, do you have any data if the partitioning helps?
Honza


* Re: [RFC] Remove -freorder-blocks-and-partition
  2011-08-04 13:39           ` Jan Hubicka
@ 2011-08-04 16:03             ` Taras Glek
  0 siblings, 0 replies; 21+ messages in thread
From: Taras Glek @ 2011-08-04 16:03 UTC (permalink / raw)
  To: Jan Hubicka; +Cc: Xinliang David Li, Paolo Bonzini, Richard Henderson, gcc

On 08/04/2011 06:39 AM, Jan Hubicka wrote:
> Also on the original topic, I know that Mozilla folks experimented with this
> switch (and I do expect it should make a noticeable reduction in the hot section
> footprint that is important for them). They are not using it at the moment
> because of problems with their bug reporting tool not being able to do unwind
> info for split functions.
>
> Taras, do you have any data if the partitioning helps?
> Honza
Currently it does not help. I believe it has potential since we have a 
lot of large methods where only a small part of them gets run.

Taras


* Re: [RFC] Remove -freorder-blocks-and-partition
@ 2011-07-19 22:43 Steven Bosscher
  0 siblings, 0 replies; 21+ messages in thread
From: Steven Bosscher @ 2011-07-19 22:43 UTC (permalink / raw)
  To: Richard Henderson, GCC Mailing List

re. http://gcc.gnu.org/ml/gcc/2011-07/msg00349.html

Richard Henderson wrote:
> After 3 days fighting with this code, I had a bit of a
> cathartic whine on IRC.  I got two votes to just rip the
> whole thing out.

Add one more vote.

> Andrew Pinski points out that the feature could probably be
> equivalently implemented via outlining and function calls
> (I assume well back at the gimple level).

I guess the ipa-split pass could easily be modified to do this, it'd
just need a few new heuristics.

Ciao!
Steven

