From: Richard Biener
Date: Sat, 26 May 2018 06:10:00 -0000
Subject: Re: PR80155: Code hoisting and register pressure
To: Jeff Law, "Bin.Cheng", Prathamesh Kulkarni
CC: GCC Development, Thomas Preudhomme

On May 25, 2018 9:25:51 PM GMT+02:00, Jeff Law wrote:
>On 05/25/2018 11:54 AM, Richard Biener wrote:
>> On May 25, 2018 6:57:13 PM GMT+02:00, Jeff Law wrote:
>>> On 05/25/2018 03:49 AM, Bin.Cheng wrote:
>>>> On Fri, May 25, 2018 at 10:23 AM, Prathamesh Kulkarni wrote:
>>>>> On 23 May 2018 at 18:37, Jeff Law wrote:
>>>>>> On 05/23/2018 03:20 AM, Prathamesh Kulkarni wrote:
>>>>>>> On 23 May 2018 at 13:58, Richard Biener wrote:
>>>>>>>> On Wed, 23 May 2018, Prathamesh Kulkarni wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I am trying to work on PR80155, which exposes a problem with code
>>>>>>>>> hoisting and register pressure on a leading embedded benchmark for
>>>>>>>>> ARM cortex-m7, where code hoisting causes an extra register spill.
>>>>>>>>>
>>>>>>>>> I have attached two test-cases which (hopefully) are representative
>>>>>>>>> of the original test-case.
>>>>>>>>> The first one (trans_dfa.c) is bigger and somewhat similar to the
>>>>>>>>> original test-case, and trans_dfa_2.c is a hand-reduced version of
>>>>>>>>> trans_dfa.c. There are two spills caused with trans_dfa.c and one
>>>>>>>>> spill with trans_dfa_2.c, due to its smaller number of cases.
>>>>>>>>> The test-cases in the PR are probably not relevant.
>>>>>>>>>
>>>>>>>>> Initially I thought the spill was happening because of "too many
>>>>>>>>> hoistings" taking place in the original test-case, thus increasing
>>>>>>>>> the register pressure, but it seems the spill is possibly caused
>>>>>>>>> because an expression gets hoisted out of a block that is on a
>>>>>>>>> loop exit.
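[The attached test-cases are not reproduced in the archive. As a rough
orientation only, a hand-written hypothetical sketch of the kind of code
being discussed, a scanning loop in which two branches both perform the
read-modify-write *tab = *tab + 1, might look like the following; the names
and control flow are invented and this is not the actual trans_dfa_2.c:]

    /* Hypothetical sketch in the spirit of trans_dfa_2.c; the real
       attachments live on the PR.  Both the 'A' and 'B' arms perform
       the read-modify-write *tab = *tab + 1, so PRE's code hoisting
       can common the load and the add into the dominating block,
       separating them from the stores.  */
    int
    trans (const char *s, int *tab)
    {
      int state = 0;
      while (*s)
        switch (*s++)
          {
          case 'A':
            *tab = *tab + 1;   /* RMW, hoisting candidate */
            state = 1;
            break;
          case 'B':
            *tab = *tab + 1;   /* same RMW on the other path */
            state = 2;
            break;
          default:
            state = 0;
            break;
          }
      return state;
    }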
>>>>>>>>>
>>>>>>>>> For example, the following hoistings take place with trans_dfa_2.c:
>>>>>>>>>
>>>>>>>>> (1) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {mem_ref<0B>,tab_20(D)}@.MEM_45 (0005)
>>>>>>>>>
>>>>>>>>> (2) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {plus_expr,_4,1} (0006)
>>>>>>>>>
>>>>>>>>> (3) Inserting expression in block 4 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> (4) Inserting expression in block 3 for code hoisting:
>>>>>>>>> {pointer_plus_expr,s_33,1} (0023)
>>>>>>>>>
>>>>>>>>> The issue seems to be the hoisting of (*tab + 1), which consists of
>>>>>>>>> the first two hoistings into block 4 from blocks 5 and 9 and which
>>>>>>>>> causes the extra spill. I verified that by disabling hoisting into
>>>>>>>>> block 4, which resulted in no extra spills.
>>>>>>>>>
>>>>>>>>> I wonder if that's because the expression (*tab + 1) is getting
>>>>>>>>> hoisted from blocks 5 and 9, which are on the loop exit? So the
>>>>>>>>> expression that was previously computed in a block on the loop exit
>>>>>>>>> gets hoisted outside that block, which possibly makes the allocator
>>>>>>>>> more defensive? Similarly, disabling hoisting of expressions which
>>>>>>>>> appeared in blocks on the loop exit in the original test-case
>>>>>>>>> prevented the extra spill. The other hoistings didn't seem to matter.
>>>>>>>>
>>>>>>>> I think that's simply coincidence. The only thing that makes a block
>>>>>>>> that also exits from the loop special is that an expression could be
>>>>>>>> sunk out of the loop, and hoisting (commoning with another path)
>>>>>>>> could prevent that. But that isn't what is happening here, and it
>>>>>>>> would be a pass ordering issue, as the sinking pass runs only after
>>>>>>>> hoisting (no idea why exactly, but I guess there are cases where we
>>>>>>>> want to prefer CSE over sinking). So you could try whether
>>>>>>>> re-ordering PRE and sinking helps your testcase.
>>>>>>> Thanks for the suggestions. Placing the sink pass before PRE works
>>>>>>> for both these test-cases! Sadly it still causes the spill for the
>>>>>>> benchmark :-(
>>>>>>> I will try to create a better approximation of the original test-case.
>>>>>>>>
>>>>>>>> What I do see is a missed opportunity to merge the successors of
>>>>>>>> BB 4. After PRE we have
>>>>>>>>
>>>>>>>> <bb 4> [local count: 159303558]:
>>>>>>>> pretmp_123 = *tab_37(D);
>>>>>>>> _87 = pretmp_123 + 1;
>>>>>>>> if (c_36 == 65)
>>>>>>>>   goto <bb 5>; [34.00%]
>>>>>>>> else
>>>>>>>>   goto <bb 8>; [66.00%]
>>>>>>>>
>>>>>>>> <bb 5> [local count: 54163210]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _96 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_96 != 0)
>>>>>>>>   goto ; [89.00%]
>>>>>>>> else
>>>>>>>>   goto ; [11.00%]
>>>>>>>>
>>>>>>>> <bb 8> [local count: 105140348]:
>>>>>>>> *tab_37(D) = _87;
>>>>>>>> _56 = MEM[(char *)s_57 + 1B];
>>>>>>>> if (_56 != 0)
>>>>>>>>   goto ; [89.00%]
>>>>>>>> else
>>>>>>>>   goto ; [11.00%]
>>>>>>>>
>>>>>>>> here at least the stores and loads can be hoisted. Note this may
>>>>>>>> also point at the real issue of the code hoisting, which is tearing
>>>>>>>> apart the RMW operation?
>>>>>>> Indeed, this possibility seems much more likely than the block being
>>>>>>> on the loop exit.
>>>>>>> I will try to "hardcode" the load/store hoists into block 4 for this
>>>>>>> specific test-case to check if that prevents the spill.
>>>>>> Even if it prevents the spill in this case, it's likely a good thing
>>>>>> to do. The statements prior to the conditional in bb5 and bb8 should
>>>>>> be hoisted, leaving bb5 and bb8 with just their conditionals.
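[In C terms, the situation after PRE and the merged form Jeff describes
could look roughly like the following. This is a hand-written sketch
under assumptions, not compiler output; step_a, step_b and stop are
placeholder functions standing in for whatever bb5 and bb8 branch to:]

    extern void step_a (void), step_b (void), stop (void);

    /* Roughly the shape after PRE: the load and add are commoned into
       bb 4, but each successor still carries the store and the reload
       of s[1], so the hoisted value must stay live across the branch
       and the RMW is torn apart.  */
    void
    after_pre (int c, int *tab, const char *s)
    {
      int t = *tab + 1;              /* pretmp_123 / _87, hoisted */
      if (c == 65)
        {
          *tab = t;                  /* bb 5: store left behind */
          if (s[1] != 0) step_a (); else stop ();
        }
      else
        {
          *tab = t;                  /* bb 8: identical store left behind */
          if (s[1] != 0) step_b (); else stop ();
        }
    }

    /* With the common statements hoisted as well, bb 5 and bb 8 keep
       only their conditionals and the RMW stays intact.  */
    void
    merged (int c, int *tab, const char *s)
    {
      *tab = *tab + 1;               /* complete RMW in bb 4 */
      char n = s[1];                 /* single hoisted load */
      if (c == 65)
        {
          if (n != 0) step_a (); else stop ();
        }
      else
        {
          if (n != 0) step_b (); else stop ();
        }
    }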
>>>>> Hi,
>>>>> It seems disabling forwprop somehow works: it causes no extra spills
>>>>> on the original test-case.
>>>>>
>>>>> For instance,
>>>>> Hoisting without forwprop:
>>>>>
>>>>> bb 3:
>>>>> _1 = tab_1(D) + 8
>>>>> pretmp_268 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_268 + 1;
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> *_1 = _2
>>>>>
>>>>> bb 5:
>>>>> *_1 = _2
>>>>>
>>>>> Hoisting with forwprop:
>>>>>
>>>>> bb 3:
>>>>> pretmp_164 = MEM[tab_1(D) + 8B];
>>>>> _2 = pretmp_164 + 1
>>>>> goto <bb 4> or <bb 5>
>>>>>
>>>>> bb 4:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> bb 5:
>>>>> MEM[tab_1(D) + 8] = _2;
>>>>>
>>>>> Although in both cases we aren't hoisting stores, the issue with
>>>>> forwprop for this case seems to be the folding of
>>>>> *_1 = _2
>>>>> into
>>>>> MEM[tab_1(D) + 8] = _2 ?
>>>>
>>>> This isn't an issue, right? IIUC, tab_1(D) is used all over the loop,
>>>> thus propagating _1 using (tab_1(D) + 8) actually removes one live
>>>> range.
>>>>
>>>>> Disabling the folding to mem_ref[base + offset] in forwprop "works" in
>>>>> the sense that it creates the same set of hoistings as without
>>>>> forwprop; however, it still results in additional spills (albeit of
>>>>> different registers).
>>>>>
>>>>> That's because forwprop seems to be increasing the live range of
>>>>> prephitmp_217 by substituting _221 + 1 with prephitmp_217 + 2
>>>>> (_221 is defined as prephitmp_217 + 1).
>>>> Hmm, it's hard to discuss private benchmarks; I am not sure in which
>>>> dump I should find the prephitmp_221/prephitmp_217 stuff.
>>>>
>>>>> On the other hand, Bin pointed out to me in private that forwprop also
>>>>> helps to restrict register pressure by propagating "tab + const_int"
>>>>> for the same test-case.
>>>>>
>>>>> So I am not really sure if there's an easier fix than having
>>>>> heuristics for estimating register pressure at the TREE level? I would
>>>>> be
>>>> An easy fix, maybe not. OTOH, I am more convinced that passes like
>>>> forwprop/sink/hoisting can be improved by taking live ranges into
>>>> consideration, specifically to direct such passes when moving code
>>>> around different basic blocks, because inter-block register pressure is
>>>> hard to resolve afterwards.
>>>>
>>>> As suggested by Jeff and Richi, I guess the first step would be doing
>>>> experiments, collecting more benchmark data for reordering sink before
>>>> PRE? It enables code sinking as well as decreasing register pressure in
>>>> the original reduced cases, IIRC.
>>> We might even consider re-evaluating Bernd's work on what is effectively
>>> a gimple scheduler to minimize register pressure.
>>
>> Sure. The main issue I see here is the interaction with TER, which we
>> unfortunately still rely on. Enough GIMPLE instruction selection might
>> help to get rid of the remaining pieces...
>I really wonder how bad it would be to walk over expr.c and change the
>expanders to be able to walk SSA_NAME_DEF_STMT to potentially get at the
>more complex statements rather than relying on TER.

But that's what they do... TER computes when this is valid and does not
break due to coalescing.

>That's really all TER is supposed to be doing anyway.

Yes. Last year I posted patches to apply the scheduling TER does on top
of the above, but it was difficult to dissect from the above... Maybe we
need to try harder here...

Richard.

>Jeff
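[As a footnote to the prephitmp live-range point above: a minimal
hypothetical C sketch, with invented names and not taken from the private
benchmark, of how substituting _221 + 1 with prephitmp_217 + 2 can extend
a live range:]

    extern void use (int);

    /* Before forwprop: t2 is computed from t1, so p's last use is at
       t1's definition and p can die there.  */
    int
    before (int p)
    {
      int t1 = p + 1;   /* _221 = prephitmp_217 + 1 */
      use (t1);
      return t1 + 1;    /* t2 = _221 + 1 */
    }

    /* After forwprop: t1 + 1 is rewritten to p + 2, so p must now stay
       live across the call to use(), the analogue of prephitmp_217's
       live range growing.  */
    int
    after (int p)
    {
      int t1 = p + 1;
      use (t1);
      return p + 2;     /* t2 = prephitmp_217 + 2 */
    }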