public inbox for gcc@gcc.gnu.org
* FDO and LTO on ARM
From: Mike Hommey @ 2011-08-04 14:05 UTC
  To: gcc; +Cc: tglek, dougkwan, jingyu, carrot, davidxl

Hi,

We (Mozilla) are trying to get the best out of the ARM toolchain for our
Android build. I recently built an Android Native-code Development Kit
with GCC 4.6.1 and binutils 2.21.53, instead of the GCC 4.4.3 and
binutils 2.19 that come with the default NDK.

LTO doesn't work at all, I'm getting an ICE that looks like the one from
bug 41159.

FDO, however, works, but sadly the resulting build is not only quite a
bit bigger, it's also slower on some tests (the Sunspider JavaScript
benchmark). While we have seen improvements on other tests (most
notably, the V8 benchmark is much faster) by switching to GCC 4.6 (that
is, without FDO), FDO doesn't seem to bring anything to the table. It
even seems to cause performance regressions.

Note that we do our normal builds with -Os and use -O3 for FDO. As for
architecture-specific flags, we use -march=armv7-a -mthumb
-mfloat-abi=softfp -mfpu=vfp. I attempted a -O2 build in the past with
GCC 4.4, but it was both bigger and slower than the -Os builds.

So, it pretty much looks like current aggressive optimizations hit
current hardware limitations and are slower than builds optimized for
size.

Have there been significant changes to the ARM backend that would
justify my trying again with current GCC HEAD? Should I maybe try again
with the Linaro GCC branch? Are there things we can do to help get
better ARM performance?

Cheers,

Mike

* Re: FDO and LTO on ARM
From: Jan Hubicka @ 2011-08-05 19:49 UTC
  To: Xinliang David Li
  Cc: Richard Guenther, Mike Hommey, gcc, tglek, dougkwan, jingyu, carrot, jh

On Fri 05 Aug 2011 07:49:49 PM CEST, Xinliang David Li
<davidxl@google.com> wrote:

> On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimizes for size, even in hot parts.
>>
>> I don't think so.  Or at least that would be a bug.  Shouldn't 'hot'
>> BBs/functions be optimized for speed even at -Os?  Hm, I see predict.c
>> indeed always returns false for optimize_size :(
>
> That is a function-level query. At the BB/EDGE level, the condition is refined:

Well, we summarize the function profile as:
  1) hot
  2) normal
  3) executed once
  4) unlikely

We summarize the BB profile as:
  1) maybe_hot
  2) probably_cold (equivalent to !maybe_hot)
  3) probably_never_executed

Except for "executed once", which is a special case for functions
derived from the discovery of main() and static ctors/dtors, there is a
1-1 correspondence between the BB and function predicates.  With profile
feedback, a function is hot if it contains a BB that is maybe_hot (with
feedback it is also probably hot), it is normal if it contains a BB that
is !probably_never_executed, and unlikely if all BBs are
probably_never_executed.  So with profile feedback the function profile
summaries are no more refined than the BB ones.
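
The rule above for deriving a function summary from its BBs under
profile feedback could be sketched roughly as follows.  This is only an
illustration, not the code in predict.c: the struct, the enum and
classify_function() are hypothetical names, and the "executed once"
class is left out because it does not come from BB summaries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-BB summary using the predicates named in the message. */
struct bb_summary {
    bool maybe_hot;                /* false means probably_cold */
    bool probably_never_executed;
};

/* Function classes, minus "executed once" (not derived from BBs). */
enum func_class { FUNC_UNLIKELY, FUNC_NORMAL, FUNC_HOT };

/* Hot if any BB is maybe_hot, normal if any BB is not
   probably_never_executed, unlikely otherwise. */
enum func_class classify_function(const struct bb_summary *bbs, size_t n)
{
    assert(bbs != NULL || n == 0);

    enum func_class result = FUNC_UNLIKELY;
    for (size_t i = 0; i < n; i++) {
        if (bbs[i].maybe_hot)
            return FUNC_HOT;       /* one hot BB is enough */
        if (!bbs[i].probably_never_executed)
            result = FUNC_NORMAL;  /* keep scanning for a hot BB */
    }
    return result;
}
```

Note that a function with no BBs at all comes out as "unlikely" in this
sketch, which is a choice of the illustration rather than something the
message states.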

Without profile feedback things are messier, and the names of the BB
settings were more or less invented based on what the static profile
estimate can tell you.  Lacking a function-level profile estimate, we
generally consider functions "normal" unless told otherwise in a few
special cases.
We also never autodetect probably_never_executed even though it would
make a lot of sense to do so for EH/paths to exit.  As I mentioned, I
think we should start doing so.

Finally, optimize_size comes into play; it is independent of the
summaries above, and it is why I added the optimize_XXX_for_size/speed
predicates.  By default -Os implies optimizing everything for size, and
-O123 optimizes for speed everything that is maybe_hot (i.e. not quite
reliably proven otherwise).
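
As a rough sketch, reading the paragraph above as "-Os means size
everywhere; -O1/-O2/-O3 mean speed for maybe_hot BBs and size for the
rest", the size/speed decision for one BB might look like this.  The
enum and both function names are made up for illustration; they are not
the actual optimize_XXX_for_size/speed predicates.

```c
#include <assert.h>
#include <stdbool.h>

/* BB summaries as listed in the message; the enum itself is hypothetical. */
enum bb_class {
    BB_MAYBE_HOT,
    BB_PROBABLY_COLD,              /* equivalent to !maybe_hot */
    BB_PROBABLY_NEVER_EXECUTED
};

/* optimize_size models a -Os build; false models -O1/-O2/-O3. */
bool bb_optimized_for_size(bool optimize_size, enum bb_class cls)
{
    assert(cls <= BB_PROBABLY_NEVER_EXECUTED);

    if (optimize_size)
        return true;               /* -Os: size everywhere, hot or not */
    return cls != BB_MAYBE_HOT;    /* -O123: only maybe_hot BBs get speed */
}

bool bb_optimized_for_speed(bool optimize_size, enum bb_class cls)
{
    return !bb_optimized_for_size(optimize_size, cls);
}
```

With this reading, bb_optimized_for_speed(true, BB_MAYBE_HOT) is false,
which is the "FDO with -Os still optimizes for size, even in hot parts"
behaviour discussed earlier in the thread.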

In a way I like the current scheme since it is simple, and extending it
should IMO have some good reason.  We could refine the -Os behaviour
without changing the current predicates, to optimize for speed in
a) functions declared as "hot" by the user, and the BBs in them that
are not proved cold.
b) based on profile feedback - i.e. we could have two thresholds: BBs
with very large counts will be probably hot, BBs in between will be
maybe hot/normal, and BBs with low counts will be cold.
This would probably motivate the introduction of a probably_hot
predicate that summarizes the above.
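
Option b) could be sketched like this.  It is a proposal in the message,
not existing behaviour, so everything here is hypothetical: the type
names, classify_count(), and in particular the threshold values, which
the message does not specify.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way split proposed in option b). */
enum bb_heat { BB_COLD, BB_MAYBE_HOT, BB_PROBABLY_HOT };

/* Hypothetical pair of thresholds on the BB execution count. */
struct heat_thresholds {
    uint64_t cold_below;           /* counts below this are cold */
    uint64_t hot_at_or_above;      /* counts at or above this are probably hot */
};

enum bb_heat classify_count(uint64_t count, struct heat_thresholds t)
{
    assert(t.cold_below <= t.hot_at_or_above);

    if (count >= t.hot_at_or_above)
        return BB_PROBABLY_HOT;
    if (count < t.cold_below)
        return BB_COLD;
    return BB_MAYBE_HOT;           /* in between: maybe hot/normal */
}
```

A refined -Os would then optimize for speed only where classify_count()
returns BB_PROBABLY_HOT, leaving the in-between and cold BBs optimized
for size.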

If we want to refine things, we could also reconsider how we want to
treat BBs with 0 coverage.  I.e. whether we want to
  a) consider them "normal" and let the presence of -Os/-O123
decide whether they are size/speed optimized,
  b) consider them "cold" since they are not executed at all,
  c) consider them "cold" in functions that are otherwise covered by
the test run and "normal" in case the function is not covered at all
(i.e. training X server on a particular set of hardware may not convince
GCC to optimize for size all the other drivers not covered by the
train run).
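
The three alternatives for a BB whose profile count is 0 can be put side
by side in a small sketch.  The enum and function names are made up for
illustration, and "normal" is modelled as "follow -Os/-O123", as in
option a).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policies for BBs with zero coverage. */
enum zero_policy {
    ZERO_NORMAL,                   /* option a) */
    ZERO_COLD,                     /* option b) */
    ZERO_COLD_IF_FUNC_COVERED      /* option c) */
};

/* Returns true if a zero-count BB would be optimized for size.
   optimize_size models -Os; function_covered says whether the training
   run executed any part of the enclosing function. */
bool zero_bb_optimized_for_size(enum zero_policy policy,
                                bool optimize_size,
                                bool function_covered)
{
    switch (policy) {
    case ZERO_NORMAL:
        return optimize_size;      /* the -Os/-O123 choice decides */
    case ZERO_COLD:
        return true;               /* never executed, so always size */
    case ZERO_COLD_IF_FUNC_COVERED:
        return function_covered ? true : optimize_size;
    }
    assert(!"unknown zero_policy");
    return true;
}
```

At -O3, an X server driver that the training run never touched would be
optimized for size under b) but would keep speed optimization under a)
and c).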

We currently implement b) and it sort of works well, since users usually
train for what matters to them and are happy to see smaller binaries.

What I don't like about a) and c) is a bit of inconsistency with small
counts.  I.e. a count of 1 will imply optimizing for size, but a
roundoff error to 0 will cause it to be optimized for speed, which is
weird.
Of course, flipping the default here would also cause significant growth
of FDO binaries, and users are already unhappy that FDO binaries are
too large.

Honza

