From: Segher Boessenkool
To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: Re: Add a new combine pass
Date: Tue, 19 Nov 2019 21:13:00 -0000
Message-ID: <20191119203932.GC16031@gate.crashing.org>
References: <20191118235659.GO16031@gate.crashing.org>

On Tue, Nov 19, 2019 at 11:33:13AM +0000, Richard Sandiford wrote:
> Segher Boessenkool writes:
> > On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
> >> While working on SVE, I've noticed several cases in which we fail
> >> to combine instructions because the combined form would need to be
> >> placed earlier in the instruction stream than the last of the
> >> instructions being combined.  This includes one very important
> >> case in the handling of the first fault register (FFR).
> >
> > Do you have an example of that?
>
> It's difficult to share realistic examples at this stage since this
> isn't really the right forum for making them public for the first time.

Oh I'm very sorry.  In the future, just say "Future" and I know what
you mean :-)

> /* Make sure that the value that is to be substituted for the register
>    does not use any registers whose values alter in between.  However,
>    If the insns are adjacent, a use can't cross a set even though we
>    think it might (this can happen for a sequence of insns each setting
>    the same destination; last_set of that register might point to
>    a NOTE).  If INSN has a REG_EQUIV note, the register is always
>    equivalent to the memory so the substitution is valid even if there
>    are intervening stores.  Also, don't move a volatile asm or
>    UNSPEC_VOLATILE across any other insns.  */
>       || (! all_adjacent
>           && (((!MEM_P (src)
>                 || ! find_reg_note (insn, REG_EQUIV, src))
>                && modified_between_p (src, insn, i3))
>               || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>               || GET_CODE (src) == UNSPEC_VOLATILE))

So this would work if you had pseudos here, instead of the hard reg?
Because it is a hard reg it is the same number in both places, making
it hard to move.

> > How are dependencies represented in your new pass?  If it just does
> > walks over the insn stream for everything, you get quadratic complexity
> > if you move insns backwards.  We have that in combine already, mostly
> > from modified_between_p, but that is limited because of how LOG_LINKS
> > work, and we have been doing this for so long and there are no problems
> > found with it, so it must work in practice.  But I am worried about it
> > when moving insns back an unlimited distance.
>
> It builds def-use chains, but using a constant limit on the number of
> explicitly-recorded uses.  All other uses go in a numerical live range
> from which they (conservatively) never escape.  The def-use chains
> represent memory as a single entity, a bit like in gimple.

Ah.  So that range thing ensures correctness.  Why don't you use DF for
the DU chains?
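
Just so I am sure we are talking about the same thing, this is the rough
shape I imagine from that description (the structs, field names, and the
limit of four below are all invented by me, none of this is from your
patch):

    /* A def record that keeps at most a small, fixed number of uses
       explicitly; all further uses are only summarised by a conservative
       [first, last] range of instruction points that they are assumed
       never to escape.  Memory would be one more "register" tracked the
       same way.  */

    #define MAX_RECORDED_USES 4

    struct use_info
    {
      rtx_insn *insn;           /* The insn containing the use.  */
    };

    struct def_info
    {
      rtx_insn *insn;                   /* The defining insn.  */
      unsigned int n_recorded_uses;     /* Uses recorded explicitly.  */
      struct use_info uses[MAX_RECORDED_USES];
      /* Once n_recorded_uses reaches MAX_RECORDED_USES, later uses are
         only reflected here, so the def can never be moved past
         last_use_point.  */
      unsigned int first_use_point;
      unsigned int last_use_point;
    };

Is that about right?
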
> >> - it tries using REG_EQUAL notes for the final instruction.
> >
> > And that.
>
> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
> with i3's REG_EQUAL note and combining into that.  Does combine do that?
> I couldn't see it, and in:
>
>   https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>
> you seemed to reject the idea of allowing it.

Yes, I still do.  Do you have an example where it helps?

> >> - it can parallelise two independent instructions that both read from
> >>   the same register or both read from memory.
> >
> > That only if somehow there is a link between the two (so essentially
> > never).  The only combinations tried by combine are those via LOG_LINKs,
> > which are between a SET and the first corresponding use.  This is a key
> > factor that makes it kind of linear (instead of exponential) complexity.
>
> Tracking limited def-use chains is what makes this last bit easy.
> We can just try parallelising two instructions from the (bounded) list
> of uses.  And for this case there's not any garbage rtl involved, since
> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> all in the recog call (which would obviously mount up if we went
> overboard).

*All* examples above and below are just this.  If you disable everything
else, what do the statistics look like then?
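
For concreteness, I picture that attempt as roughly the shape below (the
function name and details are mine, not from your patch; a real version
would at least have to worry about clobbers and notes as well):

    /* See whether two independent SETs, e.g. two uses of the same
       register, recog as a single two-element PARALLEL.  The caller
       allocates SCRATCH_PAR once, for example as
         gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)),
       and reuses it for every attempt, so a failed match leaves no new
       rtl behind.  */

    static bool
    parallel_would_recog_p (rtx_insn *i1, rtx_insn *i2, rtx scratch_par)
    {
      XVECEXP (scratch_par, 0, 0) = PATTERN (i1);
      XVECEXP (scratch_par, 0, 1) = PATTERN (i2);
      int num_clobbers = 0;
      return recog (scratch_par, i2, &num_clobbers) >= 0;
    }

That would match your point that the per-attempt cost is essentially
just the recog call.
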
> > One thing I want to do is some mini-combine after every split, probably
> > only with the insns new from the split.  But we have no cfglayout mode
> > anymore then, and only hard regs (except in the first split pass, which
> > is just a little later than your new pass).
>
> Yeah, sounds like it could be useful.  I guess there'd need to be
> an extra condition on the combination that the new insn can't be
> immediately split.

It would run *after* split.  Not interleaved with it.

> > And amount of garbage produced?
>
> If -ftime-report stats are accurate, then the total amount of
> memory allocated is:
>
>   run-combine=2 (normal combine): 1793 kB
>   run-combine=4 (new pass only):    98 kB
>   run-combine=6 (both passes):    1871 kB  (new pass accounts for 78 kB)
>
> But again that's not a fair comparison when the main combine pass does more.

The way combine does SUBST is pretty fundamental to how it works (it can
be ripped out, and probably we'll have to at some point, but that will be
very invasive).  Originally all this temporary RTL was on obstacks and
reaping it was cheap, but everything is GCed now (fixing the bugs was not
cheap :-) ).

If you look at even really bad cases, combine is still only a few percent
of total, so it isn't too bad.

> I did try hard to keep the amount of garbage rtl down though.  This is
> why I added validate_simplify_replace_rtx rather than trying to make
> do with existing routines.  It should only create new rtl if the
> simplification routines did something useful.  (Of course, that's mostly
> true of combine as well, but things like the make_compound_operation/
> expand_compound_operation wrangler can create expressions that are never
> actually useful.)

Don't mention those, thanks :-)

> >> To get a feel for the effect on multiple targets, I did my usual
> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> using -O2 -ftree-vectorize:
> >
> > One problem with this is that these are very short functions on average.
>
> There are some long ones too :-)

Yes, but this isn't a good stand-in for representative programs.

> > What is the kind of changes you see for other targets?
>
> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
> in favour of other flag-setting instructions and making more use of
> post-increments.  Not sure the last one is actually a win, but the
> target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:
>
> @@ -48,9 +48,8 @@
>         blr
>         .align 4
>  .L19:
> -       cmpdi 0,10,0
> +       mr. 9,10
>         mr 3,8
> -       mr 9,10
>         bne 0,.L9
>         b .L3
>         .align 4

Okay, so this is combining two uses of r10 into one insn.  This isn't
necessarily a good idea: the combined insn cannot be moved as much as one
of its components could, which can also immediately prevent further
combinations.  But doing this after combine, as you do, is probably
beneficial.

> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:

This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
because this is cheaper on p7 and p8).

> gcc.c-torture/execute/20081218-1.c is an example where we make more use
> of post-increment:
>
>  .L9:
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
>         bdnz .L9

Pre-increment (we only *have* pre-modify memory accesses).

> /* Mimic combine's behavior by not combining moves from allocatable hard
>    registers (e.g. when copying parameters or function return values).  */
> if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>   return false;
>
> Although if that could have accounted for the difference, it sounds like
> we're leaving a lot on the table by doing this :-)

It actually helps (and quite a bit).  But if your test cases are mainly
tiny functions, anything can happen.  Still, since you see this across
all targets, it must be doing something good :-)

So I'd love to see statistics for *only* combining two uses of the same
thing; this is something combine cannot do, and arguably *shouldn't* do!


Segher