FDO and LTO on ARM

public inbox for gcc@gcc.gnu.org

* FDO and LTO on ARM
@ 2011-08-04 14:05 Mike Hommey
  2011-08-04 15:16 ` Richard Guenther
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-04 14:05 UTC (permalink / raw)
  To: gcc; +Cc: tglek, dougkwan, jingyu, carrot, davidxl

Hi,

We (Mozilla) are trying to get the best out of the ARM toolchain for our
Android build. I recently built an Android Native-code Development Kit
with GCC 4.6.1 and binutils 2.21.53, instead of the GCC 4.4.3 and binutils
2.19 that come with the default NDK.

LTO doesn't work at all: I'm getting an ICE that looks like the one from
bug 41159.

FDO, however, works, but sadly the resulting build is not only considerably
bigger, it is also slower on some tests (the SunSpider JavaScript
benchmark). While we have seen improvements on other tests (most
notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
is, without FDO), FDO doesn't seem to bring anything to the table. It
even seems to cause a performance regression.

Note that we do our normal builds with -Os and use -O3 for FDO. As for
architecture-specific flags, we use -march=armv7-a -mthumb -mfloat-abi=softfp
-mfpu=vfp. I attempted a -O2 build in the past with GCC 4.4, but it
was both bigger and slower than the -Os builds.

So, it pretty much looks like current aggressive optimizations hit
current hardware limitations and are slower than builds optimized for
size.

Have there been significant changes to the ARM backend that would justify
trying again with current GCC HEAD? Should I maybe try the Linaro GCC
branch instead? Are there things we can do to help get better ARM
performance?

Cheers,

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-04 14:05 FDO and LTO on ARM Mike Hommey
@ 2011-08-04 15:16 ` Richard Guenther
  2011-08-04 16:38   ` Mike Hommey
  2011-08-04 17:02 ` Xinliang David Li
  2011-08-11 14:22 ` Mike Hommey
  2 siblings, 1 reply; 32+ messages in thread
From: Richard Guenther @ 2011-08-04 15:16 UTC (permalink / raw)
  To: Mike Hommey; +Cc: gcc, tglek, dougkwan, jingyu, carrot, davidxl

On Thu, Aug 4, 2011 at 4:05 PM, Mike Hommey <mhommey@mozilla.com> wrote:
> Hi,
>
> We (Mozilla) are trying to get the best of the ARM toolchain for our
> Android build. I recently built an Android Native-code Development Kit
> with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> 2.19 that come with the default NDK.
>
> LTO doesn't work at all, I'm getting an ICE that looks like the one from
> bug 41159.
>
> FDO however, works, but sadly, the resulting build is not only quite
> bigger, it's also slower on some tests (the Sunspider javascript
> benchmark). While we have seen improvements on other tests (most
> notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
> is, without FDO), FDO doesn't seem to bring anything on the table. It
> even seems to bring performance regression.
>
> Note that we do our normal builds with -Os and use -O3 for FDO. As for
> architecture specific flags, we use -marmv7-a -mthumb -mfloat-abi=softfp
> -mfpu=vfp. I've attempted a -O2 build in the past with GCC 4.4 but it
> was both bigger and slower than the -Os builds.
>
> So, it pretty much looks like current aggressive optimizations hit
> current hardware limitations and are slower than builds optimized for
> size.
>
> Has there been significant changes to the ARM backend that would justify
> that I try some more with current GCC HEAD? Should I maybe try some more
> with the linaro GCC branch? Are there things we can do to help getting
> better ARM performance?

-fprofile-use enables quite a few optimizations that are off even at -O3:
-funroll-loops, -fpeel-loops, -ftracer and -funswitch-loops.
All of those increase code size (hopefully only for hot pieces of code,
though).

Did you try using FDO with -Os?  FDO should optimize the hot parts of the
code much like -O3 while leaving the other pieces optimized for size.
Using FDO with -O3 gives you the opposite: cold portions optimized
for size while the rest is optimized for speed.
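
For readers unfamiliar with the workflow under discussion, here is a minimal
sketch of the three FDO build phases, written as a small Python helper that
only assembles command lines. The helper name fdo_build_steps, the file
names, and the bare "gcc" driver name are illustrative assumptions (a real
Android build would go through the NDK cross-compiler and its own build
system); -fprofile-generate and -fprofile-use are the GCC options referred
to above.

```python
def fdo_build_steps(sources, output, opt_level="-O3", compiler="gcc"):
    """Return the three command lines of a basic FDO cycle.

    All names here are hypothetical; only the two -fprofile-* flags
    come from GCC itself.
    """
    common = [compiler, opt_level]
    return [
        # 1. Instrumented build: the binary writes .gcda profile files on exit.
        common + ["-fprofile-generate"] + sources + ["-o", output],
        # 2. Training run: exercise the code paths that matter.
        ["./" + output],
        # 3. Final build: recompile using the collected profile.
        common + ["-fprofile-use"] + sources + ["-o", output],
    ]


for step in fdo_build_steps(["hot.c", "cold.c"], "app", opt_level="-Os"):
    print(" ".join(step))
```

Passing opt_level="-Os" or opt_level="-O3" yields the two FDO variants
compared in this thread.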

Richard.

> Cheers,
>
> Mike
>

* Re: FDO and LTO on ARM
  2011-08-04 15:16 ` Richard Guenther
@ 2011-08-04 16:38   ` Mike Hommey
  2011-08-04 18:42     ` Jan Hubicka
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Hommey @ 2011-08-04 16:38 UTC (permalink / raw)
  To: Richard Guenther; +Cc: gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

On Thu, Aug 04, 2011 at 05:16:25PM +0200, Richard Guenther wrote:
> -fprofile-use enables quite some optimizations that are even off for -O3
> which are -funroll-loops and -fpeel-loops, -ftracer and -funswitch-loops.
> Those will all be increasing code-size (hopefully only for hot code pieces
> though).
> 
> Did you try using FDO with -Os?  FDO should make hot code parts
> optimized similar to -O3 but leave other pieces optimized for size.
> Using FDO with -O3 gives you the opposite, cold portions optimized
> for size while the rest is optimized for speed.

Yes, I initially did, and the results were in fact quite similar (that
is, a regression on SunSpider; I hadn't checked V8 at that point).

However, I don't see the difference between the two cases you describe,
unless the hot and cold parts combined do not make up the whole program.
Experimentation shows that FDO -Os and FDO -O3 are different, though.

Jan Hubicka also commented on my blog where I mentioned these FDO
issues. Since I think it makes more sense to have the discussion
happening in one place only (and this list is more appropriate), I'll
put answers to his questions below:

> I never actually looked on PDO on ARM, but the slowdowns should not
> really happen. Would be possible to analyse this bit further (i.e.
> figure out what slows down) so we could fix it for 4.7?

The Android packages are linked from my blog, and they can be unzipped.
For the convenience of list readers, here is a link to the files:
http://people.mozilla.org/~mhommey/pgo/
The .nightly.apk file is a build from our build bots (standard NDK, GCC
4.4). The .gcc4.6.apk file is a GCC 4.6 -Os build. The pgo.apk file is a
GCC 4.6 -O3 FDO build.
The corresponding source code is:
http://hg.mozilla.org/mozilla-central/file/1dddaeb1366b
Please tell me if you need more data, like object files, unstripped
libraries, etc.

> Also do you get any warnings on profile mismatches? Perhaps something
> is wrong to the degree that the relevant part of profile gets
> misapplied.

I don't get any warnings about profile mismatches. I only get a "few"
warnings about missing gcda files, but that's expected.

Cheers,

Mike

* Re: FDO and LTO on ARM
  2011-08-04 14:05 FDO and LTO on ARM Mike Hommey
  2011-08-04 15:16 ` Richard Guenther
@ 2011-08-04 17:02 ` Xinliang David Li
  2011-08-04 17:16   ` Denis Chertykov
  2011-08-04 18:51   ` Jan Hubicka
  2011-08-11 14:22 ` Mike Hommey
  2 siblings, 2 replies; 32+ messages in thread
From: Xinliang David Li @ 2011-08-04 17:02 UTC (permalink / raw)
  To: Mike Hommey; +Cc: gcc, tglek, dougkwan, jingyu, Carrot Wei, Mark Heffernan

+Mark who has done size optimization tuning with FDO.

On Thu, Aug 4, 2011 at 7:05 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> Hi,
>
> We (Mozilla) are trying to get the best of the ARM toolchain for our
> Android build. I recently built an Android Native-code Development Kit
> with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> 2.19 that come with the default NDK.
>
> LTO doesn't work at all, I'm getting an ICE that looks like the one from
> bug 41159.
>
> FDO however, works, but sadly, the resulting build is not only quite
> bigger,

Is this true for both GCC 4.6 and 4.4? There is a bug in 4.6 that
prevents cold functions from being optimized for size with FDO. The bug
was fixed in trunk recently.

> it's also slower on some tests (the Sunspider javascript
> benchmark). While we have seen improvements on other tests (most
> notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
> is, without FDO), FDO doesn't seem to bring anything on the table. It
> even seems to bring performance regression.

ARM-specific performance tuning (with FDO) seems to be needed.  More
parameters (e.g., inliner-related ones) may need to be made
target-dependent.
>
> Note that we do our normal builds with -Os and use -O3 for FDO. As for
> architecture specific flags, we use -marmv7-a -mthumb -mfloat-abi=softfp
> -mfpu=vfp. I've attempted a -O2 build in the past with GCC 4.4 but it
> was both bigger and slower than the -Os builds.
>
> So, it pretty much looks like current aggressive optimizations hit
> current hardware limitations and are slower than builds optimized for
> size.

Yes, this is very likely. Hardware profiling will be very useful to
help identify the root cause.

>
> Has there been significant changes to the ARM backend that would justify
> that I try some more with current GCC HEAD? Should I maybe try some more
> with the linaro GCC branch? Are there things we can do to help getting
> better ARM performance?

It does not hurt to try it :)

Thanks,

David

>
> Cheers,
>
> Mike
>

* Re: FDO and LTO on ARM
  2011-08-04 17:02 ` Xinliang David Li
@ 2011-08-04 17:16   ` Denis Chertykov
  2011-08-04 18:51   ` Jan Hubicka
  1 sibling, 0 replies; 32+ messages in thread
From: Denis Chertykov @ 2011-08-04 17:16 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Mike Hommey, gcc, tglek, dougkwan, jingyu, Carrot Wei, Mark Heffernan

2011/8/4 Xinliang David Li <davidxl@google.com>:
> +Mark who has done size optimization tuning with FDO.
>
> On Thu, Aug 4, 2011 at 7:05 AM, Mike Hommey <mhommey@mozilla.com> wrote:
>> Hi,
>>
>> We (Mozilla) are trying to get the best of the ARM toolchain for our
>> Android build. I recently built an Android Native-code Development Kit
>> with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
>> 2.19 that come with the default NDK.
>>
>> LTO doesn't work at all, I'm getting an ICE that looks like the one from
>> bug 41159.
>>
>> FDO however, works, but sadly, the resulting build is not only quite
>> bigger,
>
> Is this true for both 4.6 and 4.4 gcc? There is a bug in 4.6 that
> prevents cold functions from be optimized for size with FDO. The bug
> was fixed in trunk recently.
>
>> it's also slower on some tests (the Sunspider javascript
>> benchmark). While we have seen improvements on other tests (most
>> notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
>> is, without FDO), FDO doesn't seem to bring anything on the table. It
>> even seems to bring performance regression.
>
> ARM specific performance tuning (with FDO) seems needed.  More
> parameters (e.g, in inliner related) may need to be made target
> dependent.
>>
>> Note that we do our normal builds with -Os and use -O3 for FDO. As for
>> architecture specific flags, we use -marmv7-a -mthumb -mfloat-abi=softfp
>> -mfpu=vfp. I've attempted a -O2 build in the past with GCC 4.4 but it
>> was both bigger and slower than the -Os builds.
>>
>> So, it pretty much looks like current aggressive optimizations hit
>> current hardware limitations and are slower than builds optimized for
>> size.
>
> Yes, this is very likely. Hardware profiling will be very useful to
> help identify the root cause.
>
>>
>> Has there been significant changes to the ARM backend that would justify
>> that I try some more with current GCC HEAD? Should I maybe try some more
>> with the linaro GCC branch? Are there things we can do to help getting
>> better ARM performance?
>
> It does not hurt to try it :)
>

Linaro android toolchain benchmarks:
https://wiki.linaro.org/Platform/Android/AndroidToolchainBenchmarking/2011-07

Denis.

* Re: FDO and LTO on ARM
  2011-08-04 16:38   ` Mike Hommey
@ 2011-08-04 18:42     ` Jan Hubicka
  2011-08-05  7:32       ` Richard Guenther
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-04 18:42 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Richard Guenther, gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

>> Did you try using FDO with -Os?  FDO should make hot code parts
>> optimized similar to -O3 but leave other pieces optimized for size.
>> Using FDO with -O3 gives you the opposite, cold portions optimized
>> for size while the rest is optimized for speed.

FDO with -Os still optimizes for size, even in hot parts.  So to get
reasonable speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os
in cold portions of the program.

Still, -Os+FDO should be somewhat faster than -Os alone, so a slowdown
is a bug.  It is not very thoroughly tested, since it is not really used
in practice.

>> Also do you get any warnings on profile mismatches? Perhaps something
>> is wrong to the degree that the relevant part of profile gets
>> misapplied.
>
> I don't get any warning on profile mismatches. I only get a "few"
> missing gcda files warning, but that's expected.

Perhaps you could compile one of the less trivial files that you are sure
is covered by the training run and send me the -fdump-tree-all-blocks
-fdump-ipa-all dumps of the compilation, so I can double-check that the
profile seems sane. This could be a good start to rule out something
stupid.

Honza
>
> Cheers,
>
> Mike
>


* Re: FDO and LTO on ARM
  2011-08-04 17:02 ` Xinliang David Li
  2011-08-04 17:16   ` Denis Chertykov
@ 2011-08-04 18:51   ` Jan Hubicka
  2011-08-08 12:21     ` Mike Hommey
  1 sibling, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-04 18:51 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Mike Hommey, gcc, tglek, dougkwan, jingyu, Carrot Wei, Mark Heffernan

> +Mark who has done size optimization tuning with FDO.
> 
> On Thu, Aug 4, 2011 at 7:05 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> > Hi,
> >
> > We (Mozilla) are trying to get the best of the ARM toolchain for our
> > Android build. I recently built an Android Native-code Development Kit
> > with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> > 2.19 that come with the default NDK.
> >
> > LTO doesn't work at all, I'm getting an ICE that looks like the one from
> > bug 41159.
> >
> > FDO however, works, but sadly, the resulting build is not only quite
> > bigger,
> 
> Is this true for both 4.6 and 4.4 gcc? There is a bug in 4.6 that
> prevents cold functions from be optimized for size with FDO. The bug
> was fixed in trunk recently.

You can also backport the patch to the 4.6 tree. If the bug exists there,
consider the patch preapproved.

With FDO, -O2 and -O3 are not really that significantly different (i.e. -O2
gets all the extra inlining, but it does not get vectorization, which is
probably not a big deal for you). -Os is, however, a different story.
> >
> > Has there been significant changes to the ARM backend that would justify
> > that I try some more with current GCC HEAD? Should I maybe try some more
> > with the linaro GCC branch? Are there things we can do to help getting
> > better ARM performance?
> 
> It does not hurt to try it :)

One thing that has really changed is the inliner heuristics.  I would be very
happy to have some feedback on this early, since we plan to re-tune it
(LTO changes many things and there are also Fortran benchmarks that show a lot
of problems. Mozilla may chime in and make my life even harder with, hopefully,
some positive results on it).

As discussed earlier, I think it would be a very good idea to start tracking
the performance of Mozilla built with mainline GCC, like we track other
benchmarks. We don't really monitor anything of this size and thus we are
quite likely to find new interesting issues by doing so.

Honza
> 
> Thanks,
> 
> David
> 
> >
> > Cheers,
> >
> > Mike
> >

* Re: FDO and LTO on ARM
  2011-08-04 18:42     ` Jan Hubicka
@ 2011-08-05  7:32       ` Richard Guenther
  2011-08-05 14:40         ` Jan Hubicka
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Richard Guenther @ 2011-08-05  7:32 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>> optimized similar to -O3 but leave other pieces optimized for size.
>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>> for size while the rest is optimized for speed.
>
> FDO with -Os still optimize for size, even in hot parts.

I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
BBs/functions be optimized for speed even at -Os?  Hm, I see predict.c indeed
always returns false for optimize_size :(

I thought only the parts that are neither cold nor hot were optimized
according to optimize_size.

>  So to get resonale
> speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os in cold
> portions of program.

Well, but unless your training coverage is 100% all parts with no coverage
get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
isn't even close to 100%.  Thus I think recommending -O3 for FDO is
usually a bad idea.

So - did you try FDO with -O2? ;)

> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
> bug.  It is not very thoroughly since it is not really used in practice.
>
>>> Also do you get any warnings on profile mismatches? Perhaps something
>>> is wrong to the degree that the relevant part of profile gets
>>> misapplied.
>>
>> I don't get any warning on profile mismatches. I only get a "few"
>> missing gcda files warning, but that's expected.
>
> Perhaps you could compile one of less trivial files you are sure that are
> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all dumps
> of the compilation so I can double check the profile seems sane. This could
> be good start to rule out something stupid.
>
> Honza
>>
>> Cheers,
>>
>> Mike
>>
>
>
>

* Re: FDO and LTO on ARM
  2011-08-05  7:32       ` Richard Guenther
@ 2011-08-05 14:40         ` Jan Hubicka
  2011-08-05 17:39           ` Xinliang David Li
  2011-08-05 17:50         ` Xinliang David Li
  2011-08-08  9:26         ` Mike Hommey
  2 siblings, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-05 14:40 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

On Fri 05 Aug 2011 09:32:05 AM CEST, Richard Guenther
<richard.guenther@gmail.com> wrote:

> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>> for size while the rest is optimized for speed.
>>
>> FDO with -Os still optimize for size, even in hot parts.
>
> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
> BBs/functions
> be optimized for speed even at -Os?  Hm, I see predict.c indeed returns
> always false for optimize_size :(

It was the outcome of a discussion held some time ago.  I think it was Mark
promoting the point that users optimize for size when they use -Os, period.

I thought we had just the parts that are neither cold nor hot optimized
according to optimize_size. I originally wanted the hot attribute to
override -Os, so that well-annotated sources (e.g. the kernel) could
compile with -Os by default, explicitly declare the hot parts hot,
and get them compiled appropriately.

With profile feedback, however, the current logic is binary, i.e.
blocks are either hot, because their count is bigger than the threshold,
or cold. We don't really have an "I don't really know" state there.  In
some cases it would make sense, i.e. there are optimizations that we
want to do only in the hottest parts of the code, but we don't have any
logic for that.

My plan is to extend ipa-profile to do better hot/cold partitioning
first: at the moment we decide based on a fixed fraction of the maximal
count in the program. This is unnecessarily conservative for programs with
not terribly flat profiles.  At the IPA level we could collect a histogram
of instruction counts (i.e. figure out how much time we spend on
instructions executed N times) and then figure out where the
threshold is so that 99% of executed instructions belong to the hot region.
This should give noticeably smaller binaries.
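
To make the two partitioning schemes concrete, here is a rough Python
sketch. It is only a model of the idea described above, not GCC code: the
function names, the sample histogram, and the fraction value of 10000 are
all made up for illustration. The histogram maps an execution count N to
the number of static instructions that were executed N times.

```python
def fixed_fraction_threshold(histogram, fraction):
    """Current scheme (as described above): a fixed fraction of the max count."""
    return max(histogram) // fraction


def coverage_threshold(histogram, percent=99):
    """Planned scheme: smallest set of hottest counts covering `percent`
    of all dynamically executed instructions."""
    total = sum(count * n for count, n in histogram.items())
    running = 0
    for count in sorted(histogram, reverse=True):
        running += count * histogram[count]
        if running * 100 >= total * percent:
            return count
    return 0


def hot_instructions(histogram, threshold):
    """Number of static instructions whose count reaches the threshold."""
    return sum(n for count, n in histogram.items() if count >= threshold)


# Hypothetical, strongly peaked profile: {execution count: static instructions}
profile = {1_000_000: 50, 500: 200, 150: 1000, 1: 20000}

for name, threshold in [
    ("fixed fraction", fixed_fraction_threshold(profile, 10000)),
    ("99% coverage", coverage_threshold(profile)),
]:
    print(name, threshold, hot_instructions(profile, threshold))
```

On this made-up profile the fixed-fraction rule treats 1250 static
instructions as hot, while the coverage rule needs only the 50 hottest
ones, which is the kind of size saving hoped for.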
>
> I thought we had just the neither cold or hot parts optimized according
> to optimize_size.


>
>>  So to get resonale
>> speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os in cold
>> portions of program.
>
> Well, but unless your training coverage is 100% all parts with no coverage
> get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> usually a bad idea.

Code with no coverage is cold in our model (as is code executed once
or so) and is thus optimized as with -Os even at -O3+FDO. This is a bit
aggressive on the optimizing-for-size side. We might consider changing
this policy, but so far I haven't seen any complaints about this...
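
As a summary of the behaviour described in this message, here is a small
Python model. It is a simplification for discussion, not what predict.c
literally does; the function name, its parameters, and the threshold value
are hypothetical.

```python
def optimized_for_size(opt_level, fdo, exec_count, hot_threshold):
    """Model of whether a piece of code ends up optimized for size.

    opt_level     -- "-Os", "-O2" or "-O3"
    fdo           -- True if a profile is used (-fprofile-use)
    exec_count    -- profiled execution count (0 means no coverage)
    hot_threshold -- count at or above which code is considered hot
    """
    if opt_level == "-Os":
        # -Os means size everywhere, even in hot code.
        return True
    if not fdo:
        # Without a profile, -O2/-O3 optimize everything for speed.
        return False
    # With a profile the decision is binary: anything below the
    # threshold, including code never executed, is treated as cold.
    return exec_count < hot_threshold


for level in ("-Os", "-O3"):
    for count in (0, 1, 5000):
        print(level, "FDO", count, optimized_for_size(level, True, count, 100))
```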

Honza
>
> So - did you try FDO with -O2? ;)
>
>> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
>> bug.  It is not very thoroughly since it is not really used in practice.
>>
>>>> Also do you get any warnings on profile mismatches? Perhaps something
>>>> is wrong to the degree that the relevant part of profile gets
>>>> misapplied.
>>>
>>> I don't get any warning on profile mismatches. I only get a "few"
>>> missing gcda files warning, but that's expected.
>>
>> Perhaps you could compile one of less trivial files you are sure that are
>> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all dumps
>> of the compilation so I can double check the profile seems sane. This could
>> be good start to rule out something stupid.
>>
>> Honza
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>
>>
>>
>


* Re: FDO and LTO on ARM
  2011-08-05 14:40         ` Jan Hubicka
@ 2011-08-05 17:39           ` Xinliang David Li
  0 siblings, 0 replies; 32+ messages in thread
From: Xinliang David Li @ 2011-08-05 17:39 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Guenther, Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, jh

On Fri, Aug 5, 2011 at 7:40 AM, Jan Hubicka <jh@suse.de> wrote:
> Am Fri 05 Aug 2011 09:32:05 AM CEST schrieb Richard Guenther
> <richard.guenther@gmail.com>:
>
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>>>
>>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimize for size, even in hot parts.
>>
>> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
>> BBs/functions
>> be optimized for speed even at -Os?  Hm, I see predict.c indeed returns
>> always false for optimize_size :(
>
> It was outcome of discussion held some time ago.  I think it was Mark
> promoting point that users opitmize for size when they use -Os period.
>
> I thought we had just the neither cold or hot parts optimized according
> to optimize_size. I originally wanted to have attribute HOT to overwrite
> -Os, so the well annotateed sources (i.e. kernel) could compile with -Os by
> default and explicitely declare the hot parts hot and get them compiled
> appropriately.
>
> With profile feedback however the current logic is binary - i.e. blocks are
> either hot since their count is bigger than the threshold or cold. We don't
> really have "I don't really know" state there.  In some cases it would make
> sense - i.e. there are optimizations that we want to do only in the hottest
> parts of code, but we don't have any logic for that.

For the profile summary at the function/cgraph_node level, there are three
states: hot, unlikely, and normal.  At the BB/EDGE level there are three
states too, but the implementation turns them into two states (by querying
only 'maybe_hot_bb'): hot and not hot, instead of 'hot', 'neither hot
nor cold', and 'cold'.


David
>
> My plan is to extend ipa-profile to do better hot/cold partitioning first:
> at the moment we decide on fixed fraction of maximal count in the program.
> This is unnecesarily conservative for programs with not terribly flat
> profiles.  At IPA level we could collect histogram of counts of instructions
> (i.e. figure out how much time we spend on instructions executed N times)
> and then figure out where is the threshold so 99% of executed instructions
> belongs to hot region. This should give noticeably smaller binaries.
>>
>> I thought we had just the neither cold or hot parts optimized according
>> to optimize_size.
>
>
>>
>>>  So to get resonale
>>> speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os in cold
>>> portions of program.
>>
>> Well, but unless your training coverage is 100% all parts with no coverage
>> get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
>> isn't even close to 100%.  Thus I think recommending -O3 for FDO is
>> usually a bad idea.
>
> Code with no coverage is cold in our model (as is code executed once or so)
> and thus optimized for -Os even at -O3+FDO. This is bit aggressive on
> optimizing for size side. We might consider changing this policy, but so far
> I didn't see any complains on this...
>
> Honza
>>
>> So - did you try FDO with -O2? ;)
>>
>>> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
>>> bug.  It is not very thoroughly since it is not really used in practice.
>>>
>>>>> Also do you get any warnings on profile mismatches? Perhaps something
>>>>> is wrong to the degree that the relevant part of profile gets
>>>>> misapplied.
>>>>
>>>> I don't get any warning on profile mismatches. I only get a "few"
>>>> missing gcda files warning, but that's expected.
>>>
>>> Perhaps you could compile one of less trivial files you are sure that are
>>> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all
>>> dumps
>>> of the compilation so I can double check the profile seems sane. This
>>> could
>>> be good start to rule out something stupid.
>>>
>>> Honza
>>>>
>>>> Cheers,
>>>>
>>>> Mike
>>>>
>>>
>>>
>>>
>>
>
>
>

* Re: FDO and LTO on ARM
  2011-08-05  7:32       ` Richard Guenther
  2011-08-05 14:40         ` Jan Hubicka
@ 2011-08-05 17:50         ` Xinliang David Li
  2011-08-08  9:26         ` Mike Hommey
  2 siblings, 0 replies; 32+ messages in thread
From: Xinliang David Li @ 2011-08-05 17:50 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Jan Hubicka, Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, jh

On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>> for size while the rest is optimized for speed.
>>
>> FDO with -Os still optimize for size, even in hot parts.
>
> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
> BBs/functions
> be optimized for speed even at -Os?  Hm, I see predict.c indeed returns
> always false for optimize_size :(

That is a function-level query. At the BB/EDGE level, the condition is refined:

The BB (or instruction expansion) will be optimized for size if the BB
is not 'hot'. This logic is probably not ideal. It means that,
without specifying -Os, only the hot BBs are optimized for speed. The
more correct behavior would be 'without -Os, only cold BBs are optimized
for size', i.e., the lukewarm BBs are also optimized for speed. This would
match the function-level logic.
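
The difference between the two policies can be written down as a tiny
Python model. This is a sketch of the reasoning above, not the actual
predicates in GCC; the enum and function names are invented for the
example.

```python
from enum import Enum


class Heat(Enum):
    COLD = 0
    NORMAL = 1   # "lukewarm": neither hot nor cold
    HOT = 2


def size_current(optimize_size, heat):
    """Behaviour as described above: everything that is not hot gets size."""
    return optimize_size or heat is not Heat.HOT


def size_proposed(optimize_size, heat):
    """Suggested behaviour: without -Os, only cold blocks get size."""
    return optimize_size or heat is Heat.COLD


# Without -Os, the two policies disagree only on lukewarm blocks.
for heat in Heat:
    print(heat.name, size_current(False, heat), size_proposed(False, heat))
```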

David


>
> I thought we had just the neither cold or hot parts optimized according
> to optimize_size.
>
>>  So to get resonale
>> speedups you need -O3+FDO.  -O3+FDO effectively defaults to -Os in cold
>> portions of program.
>
> Well, but unless your training coverage is 100% all parts with no coverage
> get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> usually a bad idea.
>
> So - did you try FDO with -O2? ;)
>
>> Still -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
>> bug.  It is not very thoroughly since it is not really used in practice.
>>
>>>> Also do you get any warnings on profile mismatches? Perhaps something
>>>> is wrong to the degree that the relevant part of profile gets
>>>> misapplied.
>>>
>>> I don't get any warning on profile mismatches. I only get a "few"
>>> missing gcda files warning, but that's expected.
>>
>> Perhaps you could compile one of less trivial files you are sure that are
>> covered by train run and send me -fdump-tree-all-blocks -fdump-ipa-all dumps
>> of the compilation so I can double check the profile seems sane. This could
>> be good start to rule out something stupid.
>>
>> Honza
>>>
>>> Cheers,
>>>
>>> Mike
>>>
>>
>>
>>
>

* Re: FDO and LTO on ARM
  2011-08-05  7:32       ` Richard Guenther
  2011-08-05 14:40         ` Jan Hubicka
  2011-08-05 17:50         ` Xinliang David Li
@ 2011-08-08  9:26         ` Mike Hommey
  2011-08-08 12:16           ` Mike Hommey
  2 siblings, 1 reply; 32+ messages in thread
From: Mike Hommey @ 2011-08-08  9:26 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Jan Hubicka, gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

On Fri, Aug 05, 2011 at 09:32:05AM +0200, Richard Guenther wrote:
> Well, but unless your training coverage is 100% all parts with no coverage
> get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> usually a bad idea.

Experience shows that this isn't true.

> So - did you try FDO with -O2? ;)

That leads to roughly the same results as -O3+FDO. The apk is also the same
size, as are the individual libraries.

Mike

* Re: FDO and LTO on ARM
  2011-08-08  9:26         ` Mike Hommey
@ 2011-08-08 12:16           ` Mike Hommey
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-08 12:16 UTC (permalink / raw)
  To: Richard Guenther
  Cc: Jan Hubicka, gcc, tglek, dougkwan, jingyu, carrot, davidxl, jh

On Mon, Aug 08, 2011 at 11:25:56AM +0200, Mike Hommey wrote:
> On Fri, Aug 05, 2011 at 09:32:05AM +0200, Richard Guenther wrote:
> > Well, but unless your training coverage is 100% all parts with no coverage
> > get optimized with -O3 instead of -Os.  And I bet coverage for mozilla
> > isn't even close to 100%.  Thus I think recommending -O3 for FDO is
> > usually a bad idea.
> 
> Experience shows that this isn't true.
> 
> > So - did you try FDO with -O2? ;)
> 
> Leads to roughly the same as -O3+FDO. The apk is also the same size, as
> well as individual libraries.

Actually, I have to revise my result: I had forgotten to change the
optimization level for the JavaScript engine. The apk and library sizes
are still similar, but the performance is actually worse than with both
-O3+PGO and -Os.

Mike

* Re: FDO and LTO on ARM
  2011-08-04 18:51   ` Jan Hubicka
@ 2011-08-08 12:21     ` Mike Hommey
  2011-08-08 16:25       ` Jonathan Wakely
  2011-08-09 18:11       ` Mike Hommey
  0 siblings, 2 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-08 12:21 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, gcc, tglek, dougkwan, jingyu, Carrot Wei,
	Mark Heffernan

On Thu, Aug 04, 2011 at 08:51:41PM +0200, Jan Hubicka wrote:
> > +Mark who has done size optimization tuning with FDO.
> > 
> > On Thu, Aug 4, 2011 at 7:05 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> > > Hi,
> > >
> > > We (Mozilla) are trying to get the best of the ARM toolchain for our
> > > Android build. I recently built an Android Native-code Development Kit
> > > with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> > > 2.19 that come with the default NDK.
> > >
> > > LTO doesn't work at all, I'm getting an ICE that looks like the one from
> > > bug 41159.
> > >
> > > FDO however, works, but sadly, the resulting build is not only quite
> > > bigger,
> > 
> > Is this true for both 4.6 and 4.4 gcc? There is a bug in 4.6 that
> > prevents cold functions from being optimized for size with FDO. The bug
> > was fixed in trunk recently.
> 
> You can also backport the patch to the 4.6 tree. If the bug exists there, consider
> the patch preapproved.

Do you have a link to the bug or the patch?

> With FDO, -O2 and -O3 are not really that significantly different (i.e. -O2
> gets all the extra inlining, but it does not get vectorization, which is probably
> not a big deal for you). -Os is however a different story.
> > >
> > > Has there been significant changes to the ARM backend that would justify
> > > that I try some more with current GCC HEAD? Should I maybe try some more
> > > with the linaro GCC branch? Are there things we can do to help getting
> > > better ARM performance?
> > 
> > It does not hurt to try it :)
> 
> One thing that has really changed is the inliner heuristics.  I would be very happy
> to have some feedback on this early, since we plan to do re-tuning of it
> (LTO changes many things and there are also Fortran benchmarks that show a lot
> of problems. Mozilla may chime in and make my life even harder with hopefully
> some positive results on it).
> 
> As discussed earlier, I think it would be a very good idea to start tracking
> performance of Mozilla built with mainline GCC like we track other benchmarks.
> We don't really monitor anything of this size and thus we are quite likely
> to find new interesting issues by doing so.

I unfortunately hit several problems with gcc 4.7 (latest snapshot).
One is bug 50022 that I filed today.

Another is the following error in stlport headers:
  error: invalid use of incomplete type 'std::string {aka struct
  std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'

I also tried GNU libstdc++ instead of stlport but I hit some other
errors that boil down to the following:
  error: 'std::wstring' has not been declared

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 12:21     ` Mike Hommey
@ 2011-08-08 16:25       ` Jonathan Wakely
  2011-08-09 12:57         ` Mike Hommey
  2011-08-09 13:21         ` Mike Hommey
  2011-08-09 18:11       ` Mike Hommey
  1 sibling, 2 replies; 32+ messages in thread
From: Jonathan Wakely @ 2011-08-08 16:25 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Jan Hubicka, Xinliang David Li, gcc, tglek, dougkwan, jingyu,
	Carrot Wei, Mark Heffernan

On 8 August 2011 13:20, Mike Hommey wrote:
>
> I unfortunately hit several problems with gcc 4.7 (latest snapshot).
> One is bug 50022 that I filed today.
>
> Another is the following error in stlport headers:
>  error: invalid use of incomplete type 'std::string {aka struct
>  std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'
>
> I also tried GNU libstdc++ instead of stlport but I hit some other
> errors that boil down to the following:
>  error: 'std::wstring' has not been declared

They both look as though they could be caused by something as simple
as failing to include <string> rather than a problem in GCC.  Could
you send me more context for the errors (offlist if you prefer)?  I'll
see if it's something we've changed in libstdc++, though given that
STlport fails too it seems unlikely.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 16:25       ` Jonathan Wakely
@ 2011-08-09 12:57         ` Mike Hommey
  2011-08-09 13:04           ` Jonathan Wakely
  2011-08-09 23:35           ` Fu, Chao-Ying
  2011-08-09 13:21         ` Mike Hommey
  1 sibling, 2 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-09 12:57 UTC (permalink / raw)
  To: Jonathan Wakely
  Cc: Jan Hubicka, Xinliang David Li, gcc, tglek, dougkwan, jingyu,
	Carrot Wei, Mark Heffernan

On Mon, Aug 08, 2011 at 05:25:23PM +0100, Jonathan Wakely wrote:
> On 8 August 2011 13:20, Mike Hommey wrote:
> >
> > I unfortunately hit several problems with gcc 4.7 (latest snapshot).
> > One is bug 50022 that I filed today.
> >
> > Another is the following error in stlport headers:
> >  error: invalid use of incomplete type 'std::string {aka struct
> >  std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'
> >
> > I also tried GNU libstdc++ instead of stlport but I hit some other
> > errors that boil down to the following:
> >  error: 'std::wstring' has not been declared
> 
> They both look as though they could be caused by something as simple
> as failing to include <string> rather than a problem in GCC.  Could
> you send me more context for the errors (offlist if you prefer)?  I'll
> see if it's something we've changed in libstdc++, though given that
> STlport fails too it seems unlikely.

I identified the libstdc++ failure as a problem when building gcc:

configure:16321:  /tmp/build-ndk/gcc-4.7.0/./gcc/xgcc -shared-libgcc -B/tmp/build-ndk/gcc-4.7.0/./gcc -nostdinc++ -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/src -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/src/.libs -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/bin/ -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/lib/ -isystem /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/include -isystem /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/sys-include    -c -frtti -fexceptions -O2 -Os -g -DTARGET_POSIX_IO -fno-short-enums  conftest.cpp >&5
conftest.cpp:35:18: error: 'INT_MIN' was not declared in this scope
conftest.cpp:36:18: error: 'INT_MAX' was not declared in this scope
(snip)
configure:16345: checking for enabled wchar_t specializations
configure:16347: result: no

Thus _GLIBCXX_USE_WCHAR_T is not defined, and as such, the typedef
for wstring isn't either.
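
For code that has to build against a libstdc++ configured this way, one
option is to guard wide-string use on the same macro. This is only a
minimal sketch: it assumes the libstdc++ macros __GLIBCXX__ and
_GLIBCXX_USE_WCHAR_T are the only signal needed, and HAVE_WIDE_STRING
and greeting_length() are hypothetical names made up for illustration.

```cpp
#include <cstddef>
#include <string>

// In the configuration described above, libstdc++ leaves out the wstring
// typedef when _GLIBCXX_USE_WCHAR_T is undefined. Other standard libraries
// are assumed (not verified here) to provide std::wstring.
#if defined(__GLIBCXX__) && !defined(_GLIBCXX_USE_WCHAR_T)
#define HAVE_WIDE_STRING 0
#else
#define HAVE_WIDE_STRING 1
#endif

// Hypothetical helper: uses std::wstring when it is available and falls
// back to std::string otherwise, so the caller builds either way.
std::size_t greeting_length() {
#if HAVE_WIDE_STRING
    std::wstring s(L"hello");
#else
    std::string s("hello");
#endif
    return s.size();
}
```

The guard has to come after a standard header has been included, since
that is what makes the library's configuration macros visible.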

I'll retry stlport and see if it's not something similar.

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-09 12:57         ` Mike Hommey
@ 2011-08-09 13:04           ` Jonathan Wakely
  2011-08-09 23:35           ` Fu, Chao-Ying
  1 sibling, 0 replies; 32+ messages in thread
From: Jonathan Wakely @ 2011-08-09 13:04 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Jan Hubicka, Xinliang David Li, gcc, tglek, dougkwan, jingyu,
	Carrot Wei, Mark Heffernan

On 9 August 2011 13:57, Mike Hommey wrote:
> On Mon, Aug 08, 2011 at 05:25:23PM +0100, Jonathan Wakely wrote:
>> On 8 August 2011 13:20, Mike Hommey wrote:
>> >
>> > I unfortunately hit several problems with gcc 4.7 (latest snapshot).
>> > One is bug 50022 that I filed today.
>> >
>> > Another is the following error in stlport headers:
>> >  error: invalid use of incomplete type 'std::string {aka struct
>> >  std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'
>> >
>> > I also tried GNU libstdc++ instead of stlport but I hit some other
>> > errors that boil down to the following:
>> >  error: 'std::wstring' has not been declared
>>
>> They both look as though they could be caused by something as simple
>> as failing to include <string> rather than a problem in GCC.  Could
>> you send me more context for the errors (offlist if you prefer)?  I'll
>> see if it's something we've changed in libstdc++, though given that
>> STlport fails too it seems unlikely.
>
> I identified the libstdc++ failure as a problem when building gcc:
>
> configure:16321:  /tmp/build-ndk/gcc-4.7.0/./gcc/xgcc -shared-libgcc -B/tmp/build-ndk/gcc-4.7.0/./gcc -nostdinc++ -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/src -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/src/.libs -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/bin/ -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/lib/ -isystem /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/include -isystem /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/arm-linux-androideabi/sys-include    -c -frtti -fexceptions -O2 -Os -g -DTARGET_POSIX_IO -fno-short-enums  conftest.cpp >&5
> conftest.cpp:35:18: error: 'INT_MIN' was not declared in this scope
> conftest.cpp:36:18: error: 'INT_MAX' was not declared in this scope
> (snip)
> configure:16345: checking for enabled wchar_t specializations
> configure:16347: result: no
>
> Thus _GLIBCXX_USE_WCHAR_T is not defined, and as such, the typedef
> for wstring isn't either.

Ah ok - that happens when the C library doesn't provide all the
required wchar_t functions, fwprintf, mbrtowc etc.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 16:25       ` Jonathan Wakely
  2011-08-09 12:57         ` Mike Hommey
@ 2011-08-09 13:21         ` Mike Hommey
  2011-08-09 17:40           ` Marc Glisse
  1 sibling, 1 reply; 32+ messages in thread
From: Mike Hommey @ 2011-08-09 13:21 UTC (permalink / raw)
  To: Jonathan Wakely
  Cc: Jan Hubicka, Xinliang David Li, gcc, tglek, dougkwan, jingyu,
	Carrot Wei, Mark Heffernan

[-- Attachment #1: Type: text/plain, Size: 686 bytes --]

On Mon, Aug 08, 2011 at 05:25:23PM +0100, Jonathan Wakely wrote:
> On 8 August 2011 13:20, Mike Hommey wrote:
> >
> > I unfortunately hit several problems with gcc 4.7 (latest snapshot).
> > One is bug 50022 that I filed today.
> >
> > Another is the following error in stlport headers:
> >  error: invalid use of incomplete type 'std::string {aka struct
> >  std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'

This one only happens when using the -std=gnu++0x flag.

The attached preprocessed file builds fine without -std=gnu++0x, and
fails with -std=gnu++0x. Note the same original file didn't fail with
the same stlport and -std=gnu++0x with gcc 4.6.

Mike

[-- Attachment #2: debug_util.i.xz --]
[-- Type: application/octet-stream, Size: 120412 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-09 13:21         ` Mike Hommey
@ 2011-08-09 17:40           ` Marc Glisse
  0 siblings, 0 replies; 32+ messages in thread
From: Marc Glisse @ 2011-08-09 17:40 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Jonathan Wakely, Jan Hubicka, Xinliang David Li, gcc, tglek,
	dougkwan, jingyu, Carrot Wei, Mark Heffernan

On Tue, 9 Aug 2011, Mike Hommey wrote:

> This one only happens when using the -std=gnu++0x flag.
>
> The attached preprocessed file builds fine without -std=gnu++0x, and
> fails with -std=gnu++0x. Note the same original file didn't fail with
> the same stlport and -std=gnu++0x with gcc 4.6.

Shorter:

class string;
void f(const string&);
string x();
struct locale {
         string y() const;
};
template <class> void g(const locale& l) {
         f(x()); // OK
         f(l.y()); // FAIL in C++0X
}

g++ apparently instantiates more eagerly in C++0X than in C++03.
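
If the full definition of the returned type can be made visible before
the template, the call no longer involves an incomplete type in either
mode. A reduced sketch under that assumption follows; `text` and
`locale_like` are hypothetical stand-ins for `string` and `locale`, and
the function bodies are invented only so that the example links and can
be run.

```cpp
#include <cstddef>

// Hypothetical stand-in for 'string'; the point is that it is complete
// before the template below is defined.
class text {
public:
    explicit text(std::size_t n) : len_(n) {}
    std::size_t size() const { return len_; }
private:
    std::size_t len_;
};

std::size_t f(const text& t) { return t.size(); }
text x() { return text(3); }

// Hypothetical stand-in for 'locale'.
struct locale_like {
    text y() const { return text(4); }
};

template <class> std::size_t g(const locale_like& l) {
    // Both calls are non-dependent; with 'text' complete they are
    // accepted whether or not the compiler checks them at definition time.
    return f(x()) + f(l.y());
}
```

Whether that is practical for stlport depends on its header layout,
which this sketch does not try to model.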

-- 
Marc Glisse

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 12:21     ` Mike Hommey
  2011-08-08 16:25       ` Jonathan Wakely
@ 2011-08-09 18:11       ` Mike Hommey
  1 sibling, 0 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-09 18:11 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, gcc, tglek, dougkwan, jingyu, Carrot Wei,
	Mark Heffernan

On Mon, Aug 08, 2011 at 02:20:37PM +0200, Mike Hommey wrote:
> I unfortunately hit several problems with gcc 4.7 (latest snapshot).
> One is bug 50022 that I filed today.
> 
> Another is the following error in stlport headers:
>   error: invalid use of incomplete type 'std::string {aka struct
>   std::basic_string<char, std::char_traits<char>, std::allocator<char> >}'
> 
> I also tried GNU libstdc++ instead of stlport but I hit some other
> errors that boil down to the following:
>   error: 'std::wstring' has not been declared

Once I get past these, linkage of libxul.so fails with:
/root/android-ndk-r6/toolchains/arm-linux-androideabi-4.6.99.20110806/prebuilt/linux-x86/bin/../lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/bin/ld:
../../dist/lib/libjs_static.a(jsbool.o)(.text+0x1e): unresolvable
R_ARM_THM_CALL relocation against symbol `__gcov_indirect_call_profiler'
/root/android-ndk-r6/toolchains/arm-linux-androideabi-4.6.99.20110806/prebuilt/linux-x86/bin/../lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/bin/ld:
final link failed: Nonrepresentable section on output

I need to check what's going on.

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: FDO and LTO on ARM
  2011-08-09 12:57         ` Mike Hommey
  2011-08-09 13:04           ` Jonathan Wakely
@ 2011-08-09 23:35           ` Fu, Chao-Ying
  1 sibling, 0 replies; 32+ messages in thread
From: Fu, Chao-Ying @ 2011-08-09 23:35 UTC (permalink / raw)
  To: Mike Hommey; +Cc: gcc

> 
> I identified the libstdc++ failure as a problem when building gcc:
> 
> configure:16321:  /tmp/build-ndk/gcc-4.7.0/./gcc/xgcc 
> -shared-libgcc -B/tmp/build-ndk/gcc-4.7.0/./gcc -nostdinc++ 
> -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/
> src 
> -L/tmp/build-ndk/gcc-4.7.0/arm-linux-androideabi/libstdc++-v3/
> src/.libs 
> -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/p
> rebuilt/linux-x86/arm-linux-androideabi/bin/ 
> -B/tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/p
> rebuilt/linux-x86/arm-linux-androideabi/lib/ -isystem 
> /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/pre
> built/linux-x86/arm-linux-androideabi/include -isystem 
> /tmp/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/pre
> built/linux-x86/arm-linux-androideabi/sys-include    -c 
> -frtti -fexceptions -O2 -Os -g -DTARGET_POSIX_IO 
> -fno-short-enums  conftest.cpp >&5
> conftest.cpp:35:18: error: 'INT_MIN' was not declared in this scope
> conftest.cpp:36:18: error: 'INT_MAX' was not declared in this scope
> (snip)
> configure:16345: checking for enabled wchar_t specializations
> configure:16347: result: no
> 

  I hit the same issue some time ago.  This is a bug in the NDK platform android-9 "wchar.h".
To fix it, just add #include <limits.h> to wchar.h.
Ex:
# git diff wchar.h
diff --git a/ndk/platforms/android-9/include/wchar.h b/ndk/platforms/android-9/i
index 9b744a5..fb8714c 100644
--- a/ndk/platforms/android-9/include/wchar.h
+++ b/ndk/platforms/android-9/include/wchar.h
@@ -38,6 +38,7 @@
 #include <stdarg.h>
 #include <time.h>
 #include <malloc.h>
+#include <limits.h>

 #include <stddef.h>


Regards,
Chao-ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-04 14:05 FDO and LTO on ARM Mike Hommey
  2011-08-04 15:16 ` Richard Guenther
  2011-08-04 17:02 ` Xinliang David Li
@ 2011-08-11 14:22 ` Mike Hommey
  2011-08-11 16:27   ` Xinliang David Li
  2 siblings, 1 reply; 32+ messages in thread
From: Mike Hommey @ 2011-08-11 14:22 UTC (permalink / raw)
  To: gcc; +Cc: tglek, dougkwan, jingyu, carrot, davidxl

On Thu, Aug 04, 2011 at 04:05:25PM +0200, Mike Hommey wrote:
> Hi,
> 
> We (Mozilla) are trying to get the best of the ARM toolchain for our
> Android build. I recently built an Android Native-code Development Kit
> with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
> 2.19 that come with the default NDK.
> 
> LTO doesn't work at all, I'm getting an ICE that looks like the one from
> bug 41159.
> 
> FDO however, works, but sadly, the resulting build is not only quite
> bigger, it's also slower on some tests (the Sunspider javascript
> benchmark). While we have seen improvements on other tests (most
> notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
> is, without FDO), FDO doesn't seem to bring anything on the table. It
> even seems to bring performance regression.

Maybe I have an idea as to why FDO doesn't work so well. Does the
instrumentation code support running several times in parallel (as in,
several processes with the instrumented code running concurrently)?

Cheers,

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-11 14:22 ` Mike Hommey
@ 2011-08-11 16:27   ` Xinliang David Li
  2011-08-17 15:35     ` Mike Hommey
  0 siblings, 1 reply; 32+ messages in thread
From: Xinliang David Li @ 2011-08-11 16:27 UTC (permalink / raw)
  To: Mike Hommey; +Cc: gcc, tglek, dougkwan, jingyu, carrot

On Thu, Aug 11, 2011 at 7:21 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> On Thu, Aug 04, 2011 at 04:05:25PM +0200, Mike Hommey wrote:
>> Hi,
>>
>> We (Mozilla) are trying to get the best of the ARM toolchain for our
>> Android build. I recently built an Android Native-code Development Kit
>> with GCC 4.6.1 and binutils 2.21.53, instead of GCC 4.4.3 and binutils
>> 2.19 that come with the default NDK.
>>
>> LTO doesn't work at all, I'm getting an ICE that looks like the one from
>> bug 41159.
>>
>> FDO however, works, but sadly, the resulting build is not only quite
>> bigger, it's also slower on some tests (the Sunspider javascript
>> benchmark). While we have seen improvements on other tests (most
>> notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
>> is, without FDO), FDO doesn't seem to bring anything on the table. It
>> even seems to bring performance regression.
>
> Maybe I have an idea as to why FDO doesn't work so well. Does the
> instrumentation code support running several times in parallel (as in,
> several processes with the instrumented code running concurrently)?

gcc supports profile merging from multiple runs, but there is no
synchronization on profile file updates. That only becomes a problem
when the concurrently running processes exit at the same time (and so
dump profile data to the same gcda file at the same time). Similarly,
there are data races in profile counter updates for multi-threaded
programs, but in practice this is not a big issue (the
-fprofile-correction option needs to be turned on in profile-use).
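
The exit-time problem can be pictured with a toy model. The sketch
below is not libgcov code: ProfileFile, ExitingProcess and both merge
functions are hypothetical names, and a single integer stands in for a
counter in a gcda file. It only shows why an unsynchronized
read-add-write loses counts when two processes overlap.

```cpp
#include <cstdint>

// Toy stand-in for a single counter stored in a .gcda file.
struct ProfileFile {
    std::int64_t counter;
};

// Toy stand-in for a process merging its counts at exit: it reads the
// on-disk value, adds its own counts, and writes the sum back.
struct ExitingProcess {
    std::int64_t local_count;
    std::int64_t seen_on_disk;

    explicit ExitingProcess(std::int64_t local)
        : local_count(local), seen_on_disk(0) {}
    void read_file(const ProfileFile& f) { seen_on_disk = f.counter; }
    void write_file(ProfileFile& f) const { f.counter = seen_on_disk + local_count; }
};

// Processes exit one after the other: both contributions survive.
std::int64_t merge_one_after_another(std::int64_t on_disk,
                                     std::int64_t a, std::int64_t b) {
    ProfileFile file = {on_disk};
    ExitingProcess pa(a), pb(b);
    pa.read_file(file);
    pa.write_file(file);
    pb.read_file(file);
    pb.write_file(file);
    return file.counter;
}

// Processes exit together: both read before either writes, so the first
// writer's contribution is overwritten by the second.
std::int64_t merge_at_same_time(std::int64_t on_disk,
                                std::int64_t a, std::int64_t b) {
    ProfileFile file = {on_disk};
    ExitingProcess pa(a), pb(b);
    pa.read_file(file);
    pb.read_file(file);
    pa.write_file(file);
    pb.write_file(file);
    return file.counter;
}
```

With 10 counts already on disk and processes contributing 5 and 7, the
serialized merge ends at 22 while the overlapping one ends at 17.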

David

>
> Cheers,
>
> Mike
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-11 16:27   ` Xinliang David Li
@ 2011-08-17 15:35     ` Mike Hommey
  2011-08-17 17:22       ` Xinliang David Li
  0 siblings, 1 reply; 32+ messages in thread
From: Mike Hommey @ 2011-08-17 15:35 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: gcc, tglek, dougkwan, jingyu, carrot

On Thu, Aug 11, 2011 at 09:27:23AM -0700, Xinliang David Li wrote:
> > Maybe I have an idea as to why FDO doesn't work so well. Does the
> > instrumentation code support running several times in parallel (as in,
> > several processes with the instrumented code running concurrently)?
> 
> gcc supports profile merging from multiple runs -- but there is no
> synchronization on profile file update, but it will become a problem
> only when all the concurrently running processes are exiting at the
> same time (and dumping profile data to the same gcda file at the same
> time). Similarly, there are data races in profile counter update for
> multi-threaded programs, in practice it is not a big issue
> (-fprofile-correction option needs to be turned on in profile-use).

Are there known quirks with threads and dlopen() ?

To avoid the problem mentioned above, I configured Firefox to use only
one process. But in that setup, I only get partial instrumentation data.
For example, I get instrumentation data for cairo, but not for our
layout engine. I do however get layout engine instrumentation data with
the same build if I enable multiple processes (but then I hit the
problem that multiple processes updating the same profile data is not
properly supported). The main difference between the main and the child
process is that the main process starts as a dalvik instance, dlopen()s
native code, which itself dlopen()s all our libraries. The child process
is just an executable that is linked against all these libraries.

Cheers,

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-17 15:35     ` Mike Hommey
@ 2011-08-17 17:22       ` Xinliang David Li
  2011-08-17 17:33         ` Mike Hommey
  0 siblings, 1 reply; 32+ messages in thread
From: Xinliang David Li @ 2011-08-17 17:22 UTC (permalink / raw)
  To: Mike Hommey; +Cc: gcc, tglek, dougkwan, jingyu, carrot

On Wed, Aug 17, 2011 at 8:35 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> On Thu, Aug 11, 2011 at 09:27:23AM -0700, Xinliang David Li wrote:
>> > Maybe I have an idea as to why FDO doesn't work so well. Does the
>> > instrumentation code support running several times in parallel (as in,
>> > several processes with the instrumented code running concurrently)?
>>
>> gcc supports profile merging from multiple runs -- but there is no
>> synchronization on profile file update, but it will become a problem
>> only when all the concurrently running processes are exiting at the
>> same time (and dumping profile data to the same gcda file at the same
>> time). Similarly, there are data races in profile counter update for
>> multi-threaded programs, in practice it is not a big issue
>> (-fprofile-correction option needs to be turned on in profile-use).
>
> Are there known quirks with threads and dlopen() ?

Threads cause data races in counter updates, but the problem is minor
in practice.

dlopen should also be fine. libgcov is linked in statically into each
shared library, so dlopen/dlclose won't mess up other shared lib or
main executable's module list.

>
> To avoid the problem mentioned above, I configured Firefox to use only
> one process. But in that setup, I only get partial instrumentation data.
> For example, I get instrumentation data for cairo, but not for our
> layout engine. I do however get layout engine instrumentation data with
> the same build if I enable multiple processes (but then I hit the
> problem that multiple processes updating the same profile data is not
> properly supported).

Are they exiting at the same time? If not, it should be fine.

> The main difference between the main and the child
> process is that the main process starts as a dalvik instance, dlopen()s
> native code, which itself dlopen()s all our libraries. The child process
> is just an executable that is linked against all these libraries.
>

If the child process's lifetime is properly controlled, it looks like it
can be made to work.

David



> Cheers,
>
> Mike
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-17 17:22       ` Xinliang David Li
@ 2011-08-17 17:33         ` Mike Hommey
  0 siblings, 0 replies; 32+ messages in thread
From: Mike Hommey @ 2011-08-17 17:33 UTC (permalink / raw)
  To: Xinliang David Li; +Cc: gcc, tglek, dougkwan, jingyu, carrot

On Wed, Aug 17, 2011 at 10:22:16AM -0700, Xinliang David Li wrote:
> On Wed, Aug 17, 2011 at 8:35 AM, Mike Hommey <mhommey@mozilla.com> wrote:
> > On Thu, Aug 11, 2011 at 09:27:23AM -0700, Xinliang David Li wrote:
> >> > Maybe I have an idea as to why FDO doesn't work so well. Does the
> >> > instrumentation code support running several times in parallel (as in,
> >> > several processes with the instrumented code running concurrently)?
> >>
> >> gcc supports profile merging from multiple runs -- but there is no
> >> synchronization on profile file update, but it will become a problem
> >> only when all the concurrently running processes are exiting at the
> >> same time (and dumping profile data to the same gcda file at the same
> >> time). Similarly, there are data races in profile counter update for
> >> multi-threaded programs, in practice it is not a big issue
> >> (-fprofile-correction option needs to be turned on in profile-use).
> >
> > Are there known quirks with threads and dlopen() ?
> 
> threads causes data races in counter update, but the problem is minor
> in practice.
> 
> dlopen should also be fine. libgcov is linked in statically into each
> shared library, so dlopen/dlclose won't mess up other shared lib or
> main executable's module list.
> 
> >
> > To avoid the problem mentioned above, I configured Firefox to use only
> > one process. But in that setup, I only get partial instrumentation data.
> > For example, I get instrumentation data for cairo, but not for our
> > layout engine. I do however get layout engine instrumentation data with
> > the same build if I enable multiple processes (but then I hit the
> > problem that multiple processes updating the same profile data is not
> > properly supported).
> 
> Are they exiting at the same time? If not, it should be fine.

They are, and communicate with each other.

Mike

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 20:36       ` Jan Hubicka
@ 2011-08-11 13:51         ` Richard Earnshaw
  0 siblings, 0 replies; 32+ messages in thread
From: Richard Earnshaw @ 2011-08-11 13:51 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Xinliang David Li, Jan Hubicka, Richard Guenther, Mike Hommey,
	gcc, tglek, dougkwan, jingyu, carrot, jh

On 08/08/11 21:35, Jan Hubicka wrote:
>> On Fri, Aug 5, 2011 at 3:24 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>>>
>>>>> In a way I like the current scheme since it is simple and extending it
>>>>> should IMO have some good reason. We could refine -Os behaviour without
>>>>> changing current predicates to optimize for speed in
>>>>> a) functions declared as "hot" by user and BBs in them that are not proved
>>>>> cold.
>>>>> b) based on profile feedback - i.e. we could have two thresholds, BBs with
>>>>> very large counts will be probably hot, BBs in between will be maybe
>>>>> hot/normal and BBs with low counts will be cold.
>>>>> This would probably motivate introduction of probably_hot predicate that
>>>>> summarize the above.
>>>>
>>>> Introducing a new 'probably_hot' will be very confusing -- unless you
>>>> also rename 'maybe_hot', but this leads to finer grained control:
>>>> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
>>>> three state partition (not counting exec_once) seems ok, but
>>>
>>> OK, I also prefer to have fewer stages than more ;)
>>>>
>>>> 1) the unlikely state does not have controllable parameter
>>>
>>> Well, it is defined as something that is not likely to be executed, so the requirement
>>> on count to be less than 1/(number_of_test_runs*2) is very natural and don't seem
>>> to need to be tuned.
>>
>> Ok, so it is defined to be different from 'rarely-executed' case.
>> However rarely-executed seems more general and can perhaps be used in
>> place of unlikely case. If there are situation that applies only to
>> 'unlikely', they can be split apart.
> 
> So you think of having hot (as for optimize for speed), cold (as for optimize
> for size) and rarely executed (as for optimize very heavily for size)?
> (as a replacement of current hot=speed/cold=size scheme)
> 
> It may not be completely crazy - i.e. at least kernel people tends to call
> for something that is like -Os but not doing extreme tradeoffs (like not
> expanding simple division by constant sequences or doing similar things that
> hurts performance a lot and usually save just small amount of code).
> 
> I however wonder how large portions of program can be safely characterized as
> rarely executed that are not unlikely. I.e. my intuition would be that it is
> relatively small portion of program since code tends to be either dead or used
> reasonably often.
> 
> BTW The original motivation for "unlikely" was the function splitting pass, so
> the functions put into unlikely section are having good chance to be never
> touched in program execution and thus never mapped in.
> 
> It is in fact the only place it seems to be used in till today...
> 

Slightly on a tangent, but I think there would even be a case for -O1s,
-O2s and -O3s, with -Os==-O2s.  On this scale -O1s would be similar to
-O1 in optimizations, but avoiding some code-expanding situations
(examples might include loop head duplication); -O2s would largely be
the same as today, except that very expensive code removal options would
not be applied; and -O3s would be aggressive size-based optimizations,
even at the expense of significant performance.

Once such a division is well defined, making LTO use the specified
categories should be easier.

R.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: FDO and LTO on ARM
  2011-08-08 20:03     ` Xinliang David Li
@ 2011-08-08 20:36       ` Jan Hubicka
  2011-08-11 13:51         ` Richard Earnshaw
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-08 20:36 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Jan Hubicka, Richard Guenther, Mike Hommey, gcc,
	tglek, dougkwan, jingyu, carrot, jh

> On Fri, Aug 5, 2011 at 3:24 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> >
> >> > In a way I like the current scheme since it is simple and extending it
> >> > should IMO have some good reason. We could refine -Os behaviour without
> >> > changing current predicates to optimize for speed in
> >> > a) functions declared as "hot" by user and BBs in them that are not proved
> >> > cold.
> >> > b) based on profile feedback - i.e. we could have two thresholds, BBs with
> >> > very large counts will be probably hot, BBs in between will be maybe
> >> > hot/normal and BBs with low counts will be cold.
> >> > This would probably motivate introduction of probably_hot predicate that
> >> > summarize the above.
> >>
> >> Introducing a new 'probably_hot' will be very confusing -- unless you
> >> also rename 'maybe_hot', but this leads to finer grained control:
> >> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
> >> three state partition (not counting exec_once) seems ok, but
> >
> > OK, I also prefer to have fewer stages than more ;)
> >>
> >> 1) the unlikely state does not have controllable parameter
> >
> > Well, it is defined as something that is not likely to be executed, so the requirement
> > on count to be less than 1/(number_of_test_runs*2) is very natural and don't seem
> > to need to be tuned.
> 
> Ok, so it is defined to be different from 'rarely-executed' case.
> However rarely-executed seems more general and can perhaps be used in
> place of unlikely case. If there are situation that applies only to
> 'unlikely', they can be split apart.

So you are thinking of having hot (optimize for speed), cold (optimize
for size) and rarely executed (optimize very heavily for size), as a
replacement for the current hot=speed/cold=size scheme?
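
A three-way split like that could be pictured roughly as below. This
is only an illustration of the idea under discussion, not GCC's actual
predicates: the enum names, classify(), goal_for() and both threshold
values are hypothetical.

```cpp
#include <cstdint>

// Hypothetical categories for a basic block, following the three-way
// split discussed above.
enum Temperature { HOT, COLD, RARELY_EXECUTED };

// Hypothetical optimization goals attached to each category.
enum Goal { OPTIMIZE_FOR_SPEED, OPTIMIZE_FOR_SIZE, OPTIMIZE_HEAVILY_FOR_SIZE };

// Made-up thresholds on a basic block's profile count.
const std::int64_t kHotThreshold = 1000;
const std::int64_t kRareThreshold = 10;

// Classify a block from its profile count using the two thresholds.
Temperature classify(std::int64_t count) {
    if (count >= kHotThreshold) return HOT;
    if (count < kRareThreshold) return RARELY_EXECUTED;
    return COLD;
}

// Map each category to the goal it would be optimized for.
Goal goal_for(Temperature t) {
    switch (t) {
    case HOT:
        return OPTIMIZE_FOR_SPEED;
    case COLD:
        return OPTIMIZE_FOR_SIZE;
    default:
        return OPTIMIZE_HEAVILY_FOR_SIZE;
    }
}
```

How the thresholds would be chosen, and whether they should scale with
the number of training runs, is exactly the open question here.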

It may not be completely crazy - i.e. at least kernel people tend to call
for something that is like -Os but does not make extreme tradeoffs (like not
expanding simple division-by-constant sequences, or doing similar things that
hurt performance a lot and usually save just a small amount of code).

I however wonder how large a portion of a program can be safely characterized as
rarely executed but not unlikely. I.e. my intuition would be that it is a
relatively small portion of the program, since code tends to be either dead or used
reasonably often.

BTW the original motivation for "unlikely" was the function splitting pass, so
the functions put into the unlikely section have a good chance of never being
touched during program execution and thus never mapped in.

It is in fact the only place it seems to be used to this day...

Honza
> 
> David


* Re: FDO and LTO on ARM
  2011-08-05 22:25   ` Jan Hubicka
@ 2011-08-08 20:03     ` Xinliang David Li
  2011-08-08 20:36       ` Jan Hubicka
  0 siblings, 1 reply; 32+ messages in thread
From: Xinliang David Li @ 2011-08-08 20:03 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Jan Hubicka, Richard Guenther, Mike Hommey, gcc, tglek, dougkwan,
	jingyu, carrot, jh

On Fri, Aug 5, 2011 at 3:24 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >
>> > In a way I like the current scheme since it is simple and extending it
>> > should IMO have some good reason. We could refine -Os behaviour without
>> > changing current predicates to optimize for speed in
>> > a) functions declared as "hot" by user and BBs in them that are not proved
>> > cold.
>> > b) based on profile feedback - i.e. we could have two thresholds, BBs with
>> > very large counts will be probably hot, BBs in between will be maybe
>> > hot/normal and BBs with low counts will be cold.
>> > This would probably motivate introduction of a probably_hot predicate that
>> > summarizes the above.
>>
>> Introducing a new 'probably_hot' will be very confusing -- unless you
>> also rename 'maybe_hot', but this leads to finer grained control:
>> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
>> three state partition (not counting exec_once) seems ok, but
>
> OK, I also prefer to have fewer stages than more ;)
>>
>> 1) the unlikely state does not have controllable parameter
>
> Well, it is defined as something that is not likely to be executed, so the requirement
> on count to be less than 1/(number_of_test_runs*2) is very natural and doesn't seem
> to need to be tuned.

Ok, so it is defined to be different from 'rarely-executed' case.
However rarely-executed seems more general and can perhaps be used in
place of unlikely case. If there are situations that apply only to
'unlikely', they can be split apart.

David


* Re: FDO and LTO on ARM
  2011-08-05 21:34 ` Xinliang David Li
@ 2011-08-05 22:25   ` Jan Hubicka
  2011-08-08 20:03     ` Xinliang David Li
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-05 22:25 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Jan Hubicka, Richard Guenther, Mike Hommey, gcc, tglek, dougkwan,
	jingyu, carrot, jh

> >
> > In a way I like the current scheme since it is simple and extending it
> > should IMO have some good reason. We could refine -Os behaviour without
> > changing current predicates to optimize for speed in
> > a) functions declared as "hot" by user and BBs in them that are not proved
> > cold.
> > b) based on profile feedback - i.e. we could have two thresholds, BBs with
> > very large counts will be probably hot, BBs in between will be maybe
> > hot/normal and BBs with low counts will be cold.
> > This would probably motivate introduction of a probably_hot predicate that
> > summarizes the above.
> 
> Introducing a new 'probably_hot' will be very confusing -- unless you
> also rename 'maybe_hot', but this leads to finer grained control:
> very_hot, hot, normal, cold, unlikely which can be hard to use.  The
> three state partition (not counting exec_once) seems ok, but

OK, I also prefer to have fewer stages than more ;)
> 
> 1) the unlikely state does not have controllable parameter

Well, it is defined as something that is not likely to be executed, so the requirement
on count to be less than 1/(number_of_test_runs*2) is very natural and doesn't seem
to need to be tuned.

> 2) hot_bb_count_fraction parameter which is used to determine
> maybe_hotness is shared for all FDO related passes. It is much more
> flexible (in terms of tuning) to allow each pass (such as inlining) to
> define its  own thresholds.

Some people call for fewer parameters, others for more; it is always a
matter of compromise.  So before forking the notion of hotness for individual
passes we would need some good reasoning on why this is very important.
> >
> > If we want to refine things, we could also re-consider how we want to behave
> > to BBs with 0 coverage. I.e. if we want to
> >  a) consider them "normal" and let the presence of -Os/-O123 decide
> > whether they are size/speed optimized,
> >  b) consider them "cold" since they are not executed at all,
> >  c) consider them "cold" in functions that are otherwise covered by the test
> > run and "normal" in case the function is not covered at all (i.e. training X
> > server on particular set of hardware may not convince GCC to optimize for
> > size all the other drivers not covered by the train run).
> >
> > We currently implement B and it sort of works well since users usually train
> > for what matters for them and are happy to see binaries smaller.
> 
> Yes -- we assume user will do his best to find representative training
> data to avoid bad optimizations, so b) should be fine.

I also think so. One notable exception, however, is hardware drivers, where it is
inherently hard to test all possible combinations in common use.  I guess one should
avoid FDO-compiling those for this reason.

Honza


* Re: FDO and LTO on ARM
  2011-08-05 19:49 Jan Hubicka
@ 2011-08-05 21:34 ` Xinliang David Li
  2011-08-05 22:25   ` Jan Hubicka
  0 siblings, 1 reply; 32+ messages in thread
From: Xinliang David Li @ 2011-08-05 21:34 UTC (permalink / raw)
  To: Jan Hubicka
  Cc: Richard Guenther, Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, jh

>
> In a way I like the current scheme since it is simple and extending it
> should IMO have some good reason. We could refine -Os behaviour without
> changing current predicates to optimize for speed in
> a) functions declared as "hot" by user and BBs in them that are not proved
> cold.
> b) based on profile feedback - i.e. we could have two thresholds, BBs with
> very large counts will be probably hot, BBs in between will be maybe
> hot/normal and BBs with low counts will be cold.
> This would probably motivate introduction of a probably_hot predicate that
> summarizes the above.

Introducing a new 'probably_hot' will be very confusing -- unless you
also rename 'maybe_hot', but this leads to finer grained control:
very_hot, hot, normal, cold, unlikely which can be hard to use.  The
three state partition (not counting exec_once) seems ok, but

1) the unlikely state does not have controllable parameter
2) hot_bb_count_fraction parameter which is used to determine
maybe_hotness is shared for all FDO related passes. It is much more
flexible (in terms of tuning) to allow each pass (such as inlining) to
define its  own thresholds.


>
> If we want to refine things, we could also re-consider how we want to behave
> to BBs with 0 coverage. I.e. if we want to
>  a) consider them "normal" and let the presence of -Os/-O123 decide
> whether they are size/speed optimized,
>  b) consider them "cold" since they are not executed at all,
>  c) consider them "cold" in functions that are otherwise covered by the test
> run and "normal" in case the function is not covered at all (i.e. training X
> server on particular set of hardware may not convince GCC to optimize for
> size all the other drivers not covered by the train run).
>
> We currently implement B and it sort of works well since users usually train
> for what matters for them and are happy to see binaries smaller.

Yes -- we assume user will do his best to find representative training
data to avoid bad optimizations, so b) should be fine.

David


>
> What I don't like about the a&c is a bit of inconsistency with small counts.
>  I.e. count 1 will imply optimizing for size, but roundoff error to 0 will
> cause it to be optimized for speed, which is weird.
> Of course also flipping the default here would cause significant growth of
> FDO binaries and users are already unhappy that FDO binaries are too large.
>
> Honza
>
>


* Re: FDO and LTO on ARM
@ 2011-08-05 19:49 Jan Hubicka
  2011-08-05 21:34 ` Xinliang David Li
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Hubicka @ 2011-08-05 19:49 UTC (permalink / raw)
  To: Xinliang David Li
  Cc: Richard Guenther, Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, jh

On Fri 05 Aug 2011 07:49:49 PM CEST, Xinliang David Li
<davidxl@google.com> wrote:

> On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimizes for size, even in hot parts.
>>
>> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
>> BBs/functions
>> be optimized for speed even at -Os?  Hm, I see predict.c indeed returns
>> always false for optimize_size :(
>
> That is function level query. At the BB/EDGE level, the condition is refined:

Well we summarize function profile to:
  1) hot
  2) normal
  3) executed once
  4) unlikely

We summarize BB profile to:
  1) maybe_hot
  2) probably_cold (equivalent to !maybe_hot)
  3) probably_never_executed

Except for "executed once", which is a special function-level state fed by
discovery of main() and static ctors/dtors, there is a 1-1 correspondence
between BB and function predicates.  With profile feedback a function is hot
if it contains a BB that is maybe_hot (with feedback it is also probably
hot), normal if it contains a BB that is !probably_never_executed, and
unlikely if all BBs are probably_never_executed.  So with profile feedback
the function profile summaries are no more refined than the BB ones.
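
A minimal sketch of the mapping described above, written in Python rather
than GCC's own C.  The names BasicBlock and summarize_function are
hypothetical and this is not the code in predict.c; "executed once" is left
out because it does not come from BB counts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BasicBlock:
    """Hypothetical stand-in for a BB carrying the two profile predicates."""
    maybe_hot: bool
    probably_never_executed: bool


def summarize_function(blocks: List[BasicBlock]) -> str:
    """Derive the function-level summary from per-BB predicates.

    hot      - at least one BB is maybe_hot
    normal   - no hot BB, but at least one BB is not probably_never_executed
    unlikely - every BB is probably_never_executed
    """
    if any(bb.maybe_hot for bb in blocks):
        return "hot"
    if any(not bb.probably_never_executed for bb in blocks):
        return "normal"
    return "unlikely"
```

Since each function state is computed purely from the BB predicates, the
function summary cannot carry more information than the BB one.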

Without profile feedback things are messier, and the names of the BB
settings were more or less invented based on what a static profile estimate
can tell you.  Lacking a function-level profile estimate, we generally
consider functions "normal" unless told otherwise in a few special cases.
We also never autodetect probably_never_executed, even though it would
make a lot of sense to do so for EH/paths to exit.  As I mentioned, I
think we should start doing so.

Finally optimize_size comes into play; it is independent of the
summaries above, and it is why I added the optimize_XXX_for_size/speed
predicates.  By default -Os implies optimizing everything for size, and
-O123 optimizes for speed everything that is maybe_hot (i.e. not quite
reliably proven otherwise) and for size the rest.
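
As a rough illustration of how the two inputs combine (a Python sketch with
illustrative function names, assuming that -O123 optimizes maybe_hot blocks
for speed and everything else for size):

```python
def optimize_bb_for_size(optimize_size: bool, maybe_hot: bool) -> bool:
    """True if the block should be optimized for size.

    optimize_size models -Os being in effect; maybe_hot is the BB predicate.
    Under -Os everything is size-optimized; otherwise only blocks that are
    not maybe_hot are.
    """
    return optimize_size or not maybe_hot


def optimize_bb_for_speed(optimize_size: bool, maybe_hot: bool) -> bool:
    """The speed predicate is simply the negation of the size predicate."""
    return not optimize_bb_for_size(optimize_size, maybe_hot)
```

With this shape, profile feedback can only change the decision at -O123;
under -Os the maybe_hot input is ignored entirely.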

In a way I like the current scheme since it is simple, and extending it
should IMO have some good reason.  We could refine -Os behaviour
without changing current predicates to optimize for speed in
a) functions declared as "hot" by the user and BBs in them that are not
proved cold.
b) based on profile feedback - i.e. we could have two thresholds: BBs
with very large counts will be probably hot, BBs in between will be
maybe hot/normal and BBs with low counts will be cold.
This would probably motivate introduction of a probably_hot predicate
that summarizes the above.
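
The two-threshold idea in b) could look roughly like this (a Python sketch;
classify_bb and both threshold parameters are hypothetical, and how the
boundary values are treated is my assumption, not something decided here):

```python
def classify_bb(count: int, hot_threshold: int, cold_threshold: int) -> str:
    """Classify a BB by its profile count using two thresholds.

    count >= hot_threshold         -> "probably_hot"
    cold_threshold < count < hot   -> "maybe_hot"
    count <= cold_threshold        -> "cold"
    """
    if cold_threshold > hot_threshold:
        raise ValueError("cold_threshold must not exceed hot_threshold")
    if count >= hot_threshold:
        return "probably_hot"
    if count > cold_threshold:
        return "maybe_hot"
    return "cold"
```

Under -Os only the "probably_hot" class would then be optimized for speed,
which is what makes a separate probably_hot predicate necessary.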

If we want to refine things, we could also reconsider how we want to
treat BBs with 0 coverage.  I.e. if we want to
  a) consider them "normal" and let the presence of -Os/-O123 decide
whether they are size/speed optimized,
  b) consider them "cold" since they are not executed at all,
  c) consider them "cold" in functions that are otherwise covered by
the test run and "normal" in case the function is not covered at all
(i.e. training X server on a particular set of hardware may not convince
GCC to optimize for size all the other drivers not covered by the
train run).

We currently implement b) and it sort of works well, since users usually
train for what matters to them and are happy to see smaller binaries.
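
To make the three options concrete, here is a small Python sketch; the
function name and the single-letter policy argument are hypothetical and
only mirror the a)/b)/c) labels above.

```python
def classify_zero_count_bb(policy: str, function_covered: bool) -> str:
    """Decide how a BB with a profile count of 0 is treated.

    policy           - "a", "b" or "c", matching the options listed above
    function_covered - True if some other BB of the same function was
                       executed during the training run
    """
    if policy == "a":
        # Leave the decision to -Os / -O123.
        return "normal"
    if policy == "b":
        # Never executed, so always cold (the behaviour described as current).
        return "cold"
    if policy == "c":
        # Cold only when the rest of the function was actually exercised.
        return "cold" if function_covered else "normal"
    raise ValueError(f"unknown policy: {policy!r}")
```

Only c) distinguishes an untrained driver (function not covered at all)
from a never-taken path inside a function that was trained.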

What I don't like about a) and c) is a bit of inconsistency with small
counts.  I.e. count 1 will imply optimizing for size, but a roundoff
error to 0 will cause it to be optimized for speed, which is weird.
Of course flipping the default here would also cause significant growth
of FDO binaries, and users are already unhappy that FDO binaries are
too large.

Honza


end of thread, other threads:[~2011-08-17 17:33 UTC | newest]

Thread overview: 32+ messages
2011-08-04 14:05 FDO and LTO on ARM Mike Hommey
2011-08-04 15:16 ` Richard Guenther
2011-08-04 16:38   ` Mike Hommey
2011-08-04 18:42     ` Jan Hubicka
2011-08-05  7:32       ` Richard Guenther
2011-08-05 14:40         ` Jan Hubicka
2011-08-05 17:39           ` Xinliang David Li
2011-08-05 17:50         ` Xinliang David Li
2011-08-08  9:26         ` Mike Hommey
2011-08-08 12:16           ` Mike Hommey
2011-08-04 17:02 ` Xinliang David Li
2011-08-04 17:16   ` Denis Chertykov
2011-08-04 18:51   ` Jan Hubicka
2011-08-08 12:21     ` Mike Hommey
2011-08-08 16:25       ` Jonathan Wakely
2011-08-09 12:57         ` Mike Hommey
2011-08-09 13:04           ` Jonathan Wakely
2011-08-09 23:35           ` Fu, Chao-Ying
2011-08-09 13:21         ` Mike Hommey
2011-08-09 17:40           ` Marc Glisse
2011-08-09 18:11       ` Mike Hommey
2011-08-11 14:22 ` Mike Hommey
2011-08-11 16:27   ` Xinliang David Li
2011-08-17 15:35     ` Mike Hommey
2011-08-17 17:22       ` Xinliang David Li
2011-08-17 17:33         ` Mike Hommey
2011-08-05 19:49 Jan Hubicka
2011-08-05 21:34 ` Xinliang David Li
2011-08-05 22:25   ` Jan Hubicka
2011-08-08 20:03     ` Xinliang David Li
2011-08-08 20:36       ` Jan Hubicka
2011-08-11 13:51         ` Richard Earnshaw
