From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 16128 invoked by alias); 18 Dec 2014 17:00:12 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Received: (qmail 16070 invoked by uid 89); 18 Dec 2014 17:00:08 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL,BAYES_00,SPF_PASS autolearn=ham version=3.3.2 X-HELO: service87.mimecast.com Received: from service87.mimecast.com (HELO service87.mimecast.com) (91.220.42.44) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Thu, 18 Dec 2014 17:00:06 +0000 Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by service87.mimecast.com; Thu, 18 Dec 2014 17:00:03 +0000 Received: from [10.1.205.157] ([10.1.255.212]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Thu, 18 Dec 2014 17:00:01 +0000 Message-ID: <54930811.1020003@arm.com> Date: Thu, 18 Dec 2014 17:08:00 -0000 From: Jiong Wang User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.1.1 MIME-Version: 1.0 To: Richard Biener CC: "gcc-patches@gcc.gnu.org" Subject: Re: [PATCH] PR 62173, re-shuffle insns for RTL loop invariant hoisting References: <54803EBE.2060607@arm.com> <5480B6D6.2020201@arm.com> <548EFE0D.1070808@arm.com> <548EFE55.6090901@arm.com> In-Reply-To: X-MC-Unique: 114121817000305501 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-IsSubscribed: yes X-SW-Source: 2014-12/txt/msg01552.txt.bz2 On 17/12/14 15:54, Richard Biener wrote: > On Mon, Dec 15, 2014 at 4:29 PM, Jiong Wang wrote: >> On 15/12/14 15:28, Jiong Wang wrote: >>> On 04/12/14 19:32, Jiong Wang wrote: >>>> On 04/12/14 11:07, Richard Biener wrote: >>>> >>>>> On Thu, Dec 4, 2014 at 12:07 PM, Richard Biener >>>>> wrote: >>>>>> On Thu, Dec 4, 2014 at 12:00 PM, Jiong Wang wro= te: >>>>>> >>>>>> >>>>>> which means re-associate the constant imm with the virtual frame >>>>>> pointer. >>>>>> >>>>>> transform >>>>>> >>>>>> RA <- fixed_reg + RC >>>>>> RD <- MEM (RA + const_offset) >>>>>> >>>>>> into: >>>>>> >>>>>> RA <- fixed_reg + const_offset >>>>>> RD <- MEM (RA + RC) >>>>>> >>>>>> then RA <- fixed_reg + const_offset is actually loop invariant, so t= he >>>>>> later >>>>>> RTL GCSE PRE pass could catch it and do the hoisting, and thus >>>>>> ameliorate >>>>>> what tree >>>>>> level ivopts could not sort out. >>>>>> There is a LIM pass after gimple ivopts - if the invariantness is >>>>>> already >>>>>> visible there why not handle it there similar to the special-cases in >>>>>> rewrite_bittest and rewrite_reciprocal? >>>> maybe, needs further check. >>>> >>>>>> And of course similar tricks could be applied on the RTL level to >>>>>> RTL invariant motion? >>>> Thanks. I just checked the code, yes, loop invariant motion pass >>>> is the natural place to integrate such multi-insns invariant analysis >>>> trick. >>>> >>>> those code could be integrated into loop-invariant.c cleanly, but >>>> then I found although the invariant detected successfully but it's not >>>> moved >>>> out of loop because of cost issue, and looks like the patch committed = to >>>> fix PR33928 >>>> is trying to prevent such cheap address be hoisted to reduce register >>>> pressure. >>>> >>>> >>>> 805 /* ??? Try to determine cheapness of address computatio= n. >>>> Unfortunately >>>> 806 the address cost is only a relative measure, we can't >>>> really compare >>>> 807 it with any absolute number, but only with other add= ress >>>> costs. >>>> 808 But here we don't have any other addresses, so compa= re >>>> with a magic >>>> 809 number anyway. It has to be large enough to not reg= ress >>>> PR33928 >>>> 810 (by avoiding to move reg+8,reg+16,reg+24 invariants), >>>> but small >>>> 811 enough to not regress 410.bwaves either (by still mo= ving >>>> reg+reg >>>> 812 invariants). >>>> 813 See >>>> http://gcc.gnu.org/ml/gcc-patches/2009-10/msg01210.html . */ >>>> 814 inv->cheap_address =3D address_cost (SET_SRC (set), >>>> word_mode, >>>> 815 ADDR_SPACE_GENERIC, >>>> speed) < 3; >>>> >>>> >>>> I think that maybe necessary for x86 which is short of register, while >>>> for RISC, >>>> it may not be that necessary, especially the whole register pressure is >>>> not big. >>>> >>>> currently, what I could think of is for this transformation below, we >>>> should >>>> increase the costs: >>>> A >>>> =3D=3D >>>> RA <- virtual_stack_var + RC >>>> RD <- MEM (RA + const_offset) >>>> >>>> into: >>>> >>>> B >>>> =3D=3D >>>> RA <- virtual_stack_var + const_offset <--B >>>> RD <- MEM (RA + RC) >>>> >>>> because the cost is not that cheap, if there is not re-assocation of >>>> virtual_stack_var >>>> with const_offset, then lra elimination will create another instruction >>>> to hold the >>>> elimination result, so format A will actually be >>>> >>>> RT <- real_stack_pointer + elimination_offset >>>> RA <- RT + RC >>>> RD <- MEM (RA + const_offset) >>>> >>>> so, the re-assocation and later hoisting of invariant B could actually >>>> save two instructions >>>> in the loop, this is why there are 15% perf gap for bzip2 under some >>>> situation. >>> updated patch. >>> >>> moved this instruction shuffling trick to rtl loop invariant pass. >>> as described above, this patch tries to transform A to B, so that >>> after the transformation: >>> >>> * RA <- virtual_stack_var + const_offset could be hoisted out of = the >>> loop >>> * easy the work of lra elimination as virtual_stack_var is associa= ted >>> with const_offset >>> that the elimination offset could be combined with const_offset >>> automatically. >>> >>> current rtl loop invariant pass treat "reg <- reg + off" as cheap addre= ss, >>> while although >>> "reg <- virtual_stack_var + offset" fall into the same format, but it's >>> not that cheap as >>> we could also save one lra elimination instruction. so this patch will >>> mark >>> "reg <- virtual_stack_var + offset" transformed from A to be expensive,= so >>> that it could be >>> hoisted later. >>> >>> after patch, pr62173 is fixed on powerpc64, while *still not on aarch64= *. >>> because there are one >>> glitch in aarch64_legitimize_address which cause unnecessary complex >>> instructions sequences >>> generated when legitimize some addresses. and if we fix that, we will g= et >>> cheaper address for >>> those cases which is generally good, and the cheaper address will cause >>> tree level IVOPT do >>> more IVOPT optimization which is generally good also, but from the spec= k2k >>> result, there >>> are actually around 1% code size regression on two cases, the reason is >>> for target support >>> post-index address, doing IVOPT may not always be the best choice becau= se >>> we lost the advantage >>> of using post-index addressing. >>> >>> on aarch64, for the following testcase, the ivopted version is complexer >>> than not ivopted version. >>> >>> while (oargc--) *(nargv++) =3D *(oargv++); >>> so, I sent the generic fix here only, as it's an independent patch,= and >>> could benefit other targets >>> like powerpc64 although the issue on aarch64 is still not resolved befo= re >>> we figure out how to let >>> the correct fix on aarch64_legitimize_address do not cause code size >>> regression on benchmark caused >>> by sub-optimal tree level IVOPT. >>> >>> and the testcase is marked to run on powerpc64 only at current time. >>> >>> bootstrap OK on x86-64, no regression. >>> bootstrap OK on AArch64, and from speck2k compile dump, there do have a >>> few more RTL loop >>> invariants get hoisted. >>> >>> ok for trunk? >>> >>> gcc/ >>> PR62173 >>> loop-invariant.c.c (expensive_addr): New hash_table. >>> (need_expensive_addr_check_p): New bool. >>> (find_exits): Rename to "find_exists_and_reshuffle. >>> Support re-shuffle instructions for better loop invariant hoisting. >>> (create_new_invariant): Mark address as expensive if it's generate= d by >>> re-shuffle. >>> (init_inv_motion_data): Initialize expensive_addr and >>> need_expensive_addr_check_p. >>> (free_inv_motion_data): Release expensive_addr. >>> >>> gcc/testssuite/ >>> PR62173 >>> gcc.dg/pr62173.c: New test. >> >> sorry, patch uploaded. > + do > + { > ... > + } while ((next_insn =3D next_real_insn (next_insn)) > + && body[i] =3D=3D BLOCK_FOR_INSN (next_insn)); > > ick. I realize we don't have SSA form on RTL but doesn't DF provide > at least some help in looking up definition statements for pseudos? > In fact we want to restrict the transform to single-use pseudos, thus > hopefully it can at least tell us that... (maybe not and this is what > LOG_LINKS are for in combine...?) At least loop-invariant alreadly > computes df_chain with DF_UD_CHAIN which seems exactly what > is needed (apart from maybe getting at single-use info). thanks very much for these inspiring questions. yes, we want to restrict the transformation on single-use pseudo only, and it's better the transformation could re-use existed info and helper function to avoid increase compile time. but I haven't found anything I can reuse at the stage the transformation happen. the info similar as LOG_LINKS is what I want, but maybe simpler. I'd study the code about build LOG_LINKS, and try to see if we can do some factor out. thanks. Regards, Jiong > > The meat of this should be factored out to a separate function I guess. > > Otherwise my RTL fu is too weak to review this properly. > > Thanks, > Richard. > >>>>> Oh, and the patch misses a testcase. >>>> It's at the start of the email, will include in the patch. >>>> >>>> void bar(int i) { >>>> char A[10]; >>>> int d =3D 0; >>>> while (i > 0) >>>> A[d++] =3D i--; >>>> >>>> while (d > 0) >>>> foo(A[d--]); >>>> } >>>> >>>>>> Thanks, >>>>>> Richard. >>>>>> >>>>>>> and this patch only tries to re-shuffle instructions within single >>>>>>> basic >>>>>>> block which >>>>>>> is a inner loop which is perf critical. >>>>>>> >>>>>>> I am reusing the loop info in fwprop because there is loop info and >>>>>>> it's run >>>>>>> before >>>>>>> GCSE. >>>>>>> >>>>>>> verified on aarch64 and mips64, the array base address hoisted out = of >>>>>>> loop. >>>>>>> >>>>>>> bootstrap ok on x86-64 and aarch64. >>>>>>> >>>>>>> comments? >>>>>>> >>>>>>> thanks. >>>>>>> >>>>>>> gcc/ >>>>>>> PR62173 >>>>>>> fwprop.c (prepare_for_gcse_pre): New function. >>>>>>> (fwprop_done): Call it. >>>> >>>