From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gcc-return-169723-listarch-gcc=gcc.gnu.org@gcc.gnu.org>
Received: (qmail 16348 invoked by alias); 5 Aug 2011 19:49:02 -0000
Received: (qmail 16337 invoked by uid 22791); 5 Aug 2011 19:49:01 -0000
X-SWARE-Spam-Status: No, hits=-2.9 required=5.0	tests=AWL,BAYES_00,MIME_QP_LONG_LINE,RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Received: from cantor2.suse.de (HELO mx2.suse.de) (195.135.220.15)    by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Fri, 05 Aug 2011 19:48:46 +0000
Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.221.2])	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx2.suse.de (Postfix) with ESMTP id DCA058C5DF;	Fri,  5 Aug 2011 21:48:44 +0200 (CEST)
Message-ID: <20110805214844.slo177f4bosowcko@imap.suse.de>
Date: Fri, 05 Aug 2011 19:49:00 -0000
From: Jan Hubicka <jh@suse.de>
To: Xinliang David Li <davidxl@google.com>
Cc: Richard Guenther <richard.guenther@gmail.com>,	Mike Hommey <mhommey@mozilla.com>, gcc@gcc.gnu.org,	tglek@mozilla.com, dougkwan@google.com, jingyu@google.com,	carrot@google.com, jh@suse.cz
Subject: Re: FDO and LTO on ARM
MIME-Version: 1.0
Content-Type: text/plain;	charset=UTF-8;	DelSp="Yes";	format="flowed"
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
User-Agent: Internet Messaging Program (IMP) H3 (4.1.5)
Mailing-List: contact gcc-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc.gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc/>
List-Post: <mailto:gcc@gcc.gnu.org>
List-Help: <http://gcc.gnu.org/ml/>
Sender: gcc-owner@gcc.gnu.org
X-SW-Source: 2011-08/txt/msg00130.txt.bz2

On Fri 05 Aug 2011 07:49:49 PM CEST, Xinliang David Li
<davidxl@google.com> wrote:

> On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <jh@suse.de> wrote:
>>>>> Did you try using FDO with -Os? FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimize for size, even in hot parts.
>>
>> I don't think so. Or at least that would be a bug. Shouldn't 'hot'
>> BBs/functions be optimized for speed even at -Os? Hm, I see predict.c
>> indeed returns always false for optimize_size :(
>
> That is function level query. At the BB/EDGE level, the condition is
> refined:

Well, we summarize the function profile into:
  1) hot
  2) normal
  3) executed once
  4) unlikely

We summarize the BB profile into:
  1) maybe_hot
  2) probably_cold (equivalent to !maybe_hot)
  3) probably_never_executed

Except for "executed once", which is a special function-level class fed
by discovery of main() and static ctors/dtors, there is a 1-1
correspondence between the BB and function predicates.  With profile
feedback, a function is hot if it contains a BB that is maybe_hot (with
feedback such a BB is also probably hot), it is normal if it contains a
BB that is !probably_never_executed, and it is unlikely if all of its
BBs are probably_never_executed.  So with profile feedback the function
profile summaries are no more refined than the BB ones.
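
A minimal sketch of the profile-feedback mapping described above.  The
enum and function names are hypothetical, chosen for illustration; this
is not the code in predict.c.  The three BB classes are modelled as
mutually exclusive here, and "executed once" is left out because it
does not come from BB profiles.

```c
#include <stddef.h>

/* Hypothetical per-BB summary.  BB_PROBABLY_COLD here means "not
   maybe_hot, but executed at least sometimes".  */
enum bb_class { BB_MAYBE_HOT, BB_PROBABLY_COLD, BB_PROBABLY_NEVER_EXECUTED };

/* Hypothetical per-function summary, without "executed once".  */
enum func_class { FUNC_HOT, FUNC_NORMAL, FUNC_UNLIKELY };

/* Derive the function class from the classes of its BBs:
   hot if any BB is maybe_hot, normal if any BB is executed at all,
   unlikely if every BB is probably_never_executed.  */
enum func_class
summarize_function (const enum bb_class *bbs, size_t n)
{
  enum func_class result = FUNC_UNLIKELY;

  for (size_t i = 0; i < n; i++)
    {
      if (bbs[i] == BB_MAYBE_HOT)
        return FUNC_HOT;
      if (bbs[i] != BB_PROBABLY_NEVER_EXECUTED)
        result = FUNC_NORMAL;
    }
  return result;
}
```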

Without profile feedback things are messier, and the names of the BB
settings were more or less invented based on what a static profile
estimate can tell you.  Lacking a function-level profile estimate, we
generally consider functions "normal" unless told otherwise in a few
special cases.
We also never autodetect probably_never_executed, even though it would
make a lot of sense to do so for EH/paths to exit.  As I mentioned, I
think we should start doing so.

Finally, optimize_size comes into play.  It is independent of the
summaries above, and it is why I added the optimize_XXX_for_size/speed
predicates.  By default -Os implies optimizing everything for size, and
-O123 optimizes for speed everything that is maybe_hot (i.e. not quite
reliably proven otherwise).
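
As a rough illustration of how optimize_size combines with the BB
summary, here is a sketch assuming that under -O1/-O2/-O3 only blocks
that are not maybe_hot get optimized for size.  The names opt_mode,
bb_for_size_p and bb_for_speed_p are hypothetical and only stand in
for the optimize_XXX_for_size/speed family.

```c
#include <stdbool.h>

/* Hypothetical stand-in for "-Os" versus "-O1/-O2/-O3".  */
enum opt_mode { OPT_SIZE, OPT_SPEED };

/* A BB is optimized for size when -Os is given, or when the
   profile does not consider it maybe_hot.  */
bool
bb_for_size_p (enum opt_mode mode, bool maybe_hot)
{
  return mode == OPT_SIZE || !maybe_hot;
}

/* The speed predicate is simply the negation.  */
bool
bb_for_speed_p (enum opt_mode mode, bool maybe_hot)
{
  return !bb_for_size_p (mode, maybe_hot);
}
```

With this shape, -Os overrides any hotness information, which is the
behaviour discussed in the quoted text.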

In a way I like the current scheme since it is simple, and extending it
should IMO have some good reason.  We could refine the -Os behaviour
without changing the current predicates, to optimize for speed in
a) functions declared as "hot" by the user, and the BBs in them that
are not proved cold.
b) based on profile feedback - i.e. we could have two thresholds: BBs
with very large counts will be probably hot, BBs in between will be
maybe hot/normal, and BBs with low counts will be cold.
This would probably motivate the introduction of a probably_hot
predicate that summarizes the above.
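
The two-threshold idea in b) could look roughly like the sketch below.
Everything in it is an assumption made for illustration: the names,
the thresholds cold_max and hot_min, and the choice that the refined
-Os would optimize only probably-hot blocks for speed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical three-way hotness derived from an execution count.  */
enum bb_hotness { BB_COLD, BB_MAYBE_HOT, BB_PROBABLY_HOT };

/* Classify a count using two hypothetical thresholds, with
   cold_max < hot_min assumed by the caller.  */
enum bb_hotness
classify_count (uint64_t count, uint64_t cold_max, uint64_t hot_min)
{
  if (count >= hot_min)
    return BB_PROBABLY_HOT;
  if (count <= cold_max)
    return BB_COLD;
  return BB_MAYBE_HOT;
}

/* Assumed refinement: with -Os only probably-hot BBs are optimized
   for speed; otherwise everything that is not cold is.  */
bool
refined_for_speed_p (bool optimize_size, enum bb_hotness hotness)
{
  return optimize_size ? hotness == BB_PROBABLY_HOT : hotness != BB_COLD;
}
```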

If we want to refine things, we could also reconsider how we want to
treat BBs with 0 coverage, i.e. whether we want to
  a) consider them "normal" and let the presence of -Os/-O123
decide whether they are size/speed optimized,
  b) consider them "cold" since they are not executed at all,
  c) consider them "cold" in functions that are otherwise covered by
the test run and "normal" in case the function is not covered at all
(i.e. training an X server on a particular set of hardware may not
convince GCC to optimize for size all the other drivers not covered by
the train run).
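
The three options for zero-count BBs can be written down as a small
policy switch.  This is only a sketch: the policy names, the cold_max
threshold, and the function_covered flag are hypothetical and do not
correspond to existing GCC code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policies matching options a), b) and c) above.  */
enum zero_policy { ZERO_IS_NORMAL, ZERO_IS_COLD, ZERO_IS_COLD_IF_COVERED };

/* Decide whether a BB is treated as cold.  Nonzero counts are compared
   against a hypothetical threshold; zero counts follow the policy.  */
bool
bb_cold_p (enum zero_policy policy, uint64_t count, uint64_t cold_max,
           bool function_covered)
{
  if (count > 0)
    return count <= cold_max;

  switch (policy)
    {
    case ZERO_IS_NORMAL:
      return false;
    case ZERO_IS_COLD:
      return true;
    case ZERO_IS_COLD_IF_COVERED:
      return function_covered;
    }
  return true;
}
```

Under ZERO_IS_NORMAL a block with count 1 comes out cold while a block
with count 0 does not, so a count that rounds down to 0 changes the
answer.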

We currently implement b) and it sort of works well, since users usually
train for what matters to them and are happy to see smaller binaries.

What I don't like about a) and c) is a bit of inconsistency with small
counts.  I.e. a count of 1 will imply optimizing for size, but a
roundoff error to 0 will cause the BB to be optimized for speed, which
is weird.
Of course, flipping the default here would also cause significant
growth of FDO binaries, and users are already unhappy that FDO binaries
are too large.

Honza