From: Jan Hubicka
To: Xinliang David Li
Cc: Richard Guenther, Mike Hommey, gcc@gcc.gnu.org, tglek@mozilla.com,
    dougkwan@google.com, jingyu@google.com, carrot@google.com, jh@suse.cz
Subject: Re: FDO and LTO on ARM
Date: Fri, 05 Aug 2011 19:49:00 -0000
Message-ID: <20110805214844.slo177f4bosowcko@imap.suse.de>

On Fri 05 Aug 2011 07:49:49 PM CEST, Xinliang David Li wrote:

> On Fri, Aug 5, 2011 at 12:32 AM, Richard Guenther wrote:
>> On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka wrote:
>>>>> Did you try using FDO with -Os?  FDO should make hot code parts
>>>>> optimized similar to -O3 but leave other pieces optimized for size.
>>>>> Using FDO with -O3 gives you the opposite, cold portions optimized
>>>>> for size while the rest is optimized for speed.
>>>
>>> FDO with -Os still optimize for size, even in hot parts.
>>
>> I don't think so.
>> Or at least that would be a bug.  Shouldn't 'hot' BBs/functions
>> be optimized for speed even at -Os?  Hm, I see predict.c indeed returns
>> always false for optimize_size :(
>
> That is function level query. At the BB/EDGE level, the condition is
> refined:

Well, we summarize the function profile to:
 1) hot
 2) normal
 3) executed once
 4) unlikely
We summarize the BB profile to:
 1) maybe_hot
 2) probably_cold (equivalent to !maybe_hot)
 3) probably_never_executed

Except for "executed once", which is a special function-level category
fed by the discovery of main() and static ctors/dtors, there is a 1-1
correspondence between the BB and function predicates.  With profile
feedback a function is hot if it contains a BB that is maybe_hot (with
feedback such a BB is also probably hot), it is normal if it contains a
BB that is !probably_never_executed, and it is unlikely if all of its
BBs are probably_never_executed.  So with profile feedback the function
profile summaries are no more refined than the BB ones.

Without profile feedback things are messier, and the names of the BB
settings were more or less invented based on what the static profile
estimate can tell you.  Lacking a function-level profile estimate, we
generally consider functions "normal" unless told otherwise in a few
special cases.  We also never autodetect probably_never_executed, even
though it would make a lot of sense to do so for EH/paths to exit.  As
I mentioned, I think we should start doing so.

Finally, optimize_size comes into play.  It is independent of the
summaries above, and it is why I added the
optimize_XXX_for_size/speed predicates.  By default -Os implies
optimizing everything for size, and -O123 optimizes for speed
everything that is maybe_hot (i.e. not quite reliably proven
otherwise).

In a way I like the current scheme since it is simple, and extending it
should IMO have some good reason.
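As an illustration only, the profile-feedback summaries and the
size/speed decision described above could be sketched in C as follows.
All names here (bb_class, fn_class, summarize_function,
bb_optimized_for_size) are hypothetical and are not the identifiers
used in predict.c. The "executed once" category is left out because it
does not come from BB counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical BB summary: the three BB classes listed above. */
enum bb_class {
    BB_MAYBE_HOT,
    BB_PROBABLY_COLD,            /* i.e. !maybe_hot */
    BB_PROBABLY_NEVER_EXECUTED
};

/* Hypothetical function summary ("executed once" omitted). */
enum fn_class { FN_HOT, FN_NORMAL, FN_UNLIKELY };

/* With profile feedback: hot if any BB is maybe_hot, normal if any BB
   is !probably_never_executed, unlikely if all BBs are
   probably_never_executed. */
enum fn_class summarize_function(const enum bb_class *bbs, size_t n)
{
    bool any_executed = false;

    assert(bbs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (bbs[i] == BB_MAYBE_HOT)
            return FN_HOT;
        if (bbs[i] != BB_PROBABLY_NEVER_EXECUTED)
            any_executed = true;
    }
    return any_executed ? FN_NORMAL : FN_UNLIKELY;
}

/* Size/speed decision: -Os means size everywhere; otherwise only BBs
   that are not maybe_hot are optimized for size. */
bool bb_optimized_for_size(bool optimize_size, enum bb_class c)
{
    return optimize_size || c != BB_MAYBE_HOT;
}
```

In this sketch a function with one maybe_hot BB is summarized as hot,
yet bb_optimized_for_size(true, BB_MAYBE_HOT) is still true, which is
the -Os behaviour discussed in this thread.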
We could refine -Os behaviour without changing the current predicates,
to optimize for speed in
 a) functions declared as "hot" by the user, and BBs in them that are
    not proved cold.
 b) based on profile feedback - i.e. we could have two thresholds: BBs
    with very large counts will be probably hot, BBs in between will be
    maybe hot/normal and BBs with low counts will be cold.
This would probably motivate the introduction of a probably_hot
predicate that summarizes the above.

If we want to refine things, we could also reconsider how we want to
treat BBs with 0 coverage.  I.e. whether we want to
 a) consider them "normal" and let the presence of -Os/-O123 decide
    whether they are size/speed optimized,
 b) consider them "cold" since they are not executed at all,
 c) consider them "cold" in functions that are otherwise covered by
    the test run and "normal" in case the function is not covered at
    all (i.e. so that training an X server on a particular set of
    hardware does not make GCC optimize for size all the other drivers
    not covered by the train run).

We currently implement b) and it sort of works well, since users
usually train for what matters to them and are happy to see smaller
binaries.

What I don't like about a) and c) is a bit of inconsistency with small
counts.  I.e. a count of 1 will imply optimizing for size, but a
roundoff error to 0 will cause the BB to be optimized for speed, which
is weird.
Of course, flipping the default here would also cause significant
growth of FDO binaries, and users are already unhappy that FDO binaries
are too large.

Honza
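
For illustration, the two-threshold idea and the three options for
zero-count BBs could be sketched in C roughly as below. This is a
sketch under assumptions, not GCC code: the type and function names,
the thresholds struct and any threshold values passed to it are made
up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical three-way BB classification from profile counts. */
enum bb_temp { BB_COLD, BB_NORMAL, BB_PROBABLY_HOT };

/* Hypothetical thresholds: counts <= cold_max are cold,
   counts >= hot_min are probably hot, the rest are normal. */
struct thresholds {
    uint64_t cold_max;
    uint64_t hot_min;
};

/* The three options a), b), c) for BBs with zero coverage. */
enum zero_policy {
    ZERO_IS_NORMAL,           /* a) */
    ZERO_IS_COLD,             /* b) */
    ZERO_COLD_IF_FN_COVERED   /* c) */
};

enum bb_temp classify_bb(uint64_t count, bool fn_covered,
                         struct thresholds t, enum zero_policy p)
{
    assert(t.cold_max < t.hot_min);

    if (count == 0) {
        switch (p) {
        case ZERO_IS_NORMAL:
            return BB_NORMAL;
        case ZERO_IS_COLD:
            return BB_COLD;
        case ZERO_COLD_IF_FN_COVERED:
            return fn_covered ? BB_COLD : BB_NORMAL;
        }
    }
    if (count >= t.hot_min)
        return BB_PROBABLY_HOT;
    if (count <= t.cold_max)
        return BB_COLD;
    return BB_NORMAL;
}
```

With any cold_max of at least 1, option a) classifies a count of 1 as
cold but a count of 0 as normal, which is the small-count inconsistency
mentioned above. Option b) avoids it because both counts end up cold.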