From: Segher Boessenkool
To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: Re: Add a new combine pass
Date: Tue, 19 Nov 2019 21:13:00 -0000
Message-ID: <20191119203932.GC16031@gate.crashing.org>
References: <20191118235659.GO16031@gate.crashing.org>

On Tue, Nov 19, 2019 at 11:33:13AM +0000, Richard Sandiford wrote:
> Segher Boessenkool writes:
> > On Sun, Nov 17, 2019 at 11:35:26PM +0000, Richard Sandiford wrote:
> >> While working on SVE, I've noticed several cases in which we fail
> >> to combine instructions because the combined form would need to be
> >> placed earlier in the instruction stream than the last of the
> >> instructions being combined.  This includes one very important
> >> case in the handling of the first fault register (FFR).
> >
> > Do you have an example of that?
>
> It's difficult to share realistic examples at this stage since this
> isn't really the right forum for making them public for the first time.

Oh I'm very sorry.  In the future, just say "Future" and I know what
you mean :-)

> /* Make sure that the value that is to be substituted for the register
>    does not use any registers whose values alter in between.  However,
>    If the insns are adjacent, a use can't cross a set even though we
>    think it might (this can happen for a sequence of insns each setting
>    the same destination; last_set of that register might point to
>    a NOTE).  If INSN has a REG_EQUIV note, the register is always
>    equivalent to the memory so the substitution is valid even if there
>    are intervening stores.  Also, don't move a volatile asm or
>    UNSPEC_VOLATILE across any other insns.  */
>       || (! all_adjacent
>           && (((!MEM_P (src)
>                 || ! find_reg_note (insn, REG_EQUIV, src))
>                && modified_between_p (src, insn, i3))
>               || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>               || GET_CODE (src) == UNSPEC_VOLATILE))

So this would work if you had pseudos here, instead of the hard reg?
Because it is a hard reg it is the same number in both places, making
it hard to move.

> > How are dependencies represented in your new pass?  If it just does
> > walks over the insn stream for everything, you get quadratic complexity
> > if you move insns backwards.  We have that in combine already, mostly
> > from modified_between_p, but that is limited because of how LOG_LINKS
> > work, and we have been doing this for so long and there are no problems
> > found with it, so it must work in practice.  But I am worried about it
> > when moving insns back an unlimited distance.
>
> It builds def-use chains, but using a constant limit on the number of
> explicitly-recorded uses.  All other uses go in a numerical live range
> from which they (conservatively) never escape.  The def-use chains
> represent memory as a single entity, a bit like in gimple.

Ah.  So that range thing ensures correctness.  Why don't you use DF for
the DU chains?
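
Just so I am sure we are talking about the same thing, this is the rough
shape I imagine from that description (the structs, field names, and the
limit of four below are all invented by me, none of this is from your
patch):

    /* A def record that keeps at most a small, fixed number of uses
       explicitly; all further uses are only summarised by a conservative
       [first, last] range of instruction points that they are assumed
       never to escape.  Memory would be one more "register" tracked the
       same way.  */

    #define MAX_RECORDED_USES 4

    struct use_info
    {
      rtx_insn *insn;           /* The insn containing the use.  */
    };

    struct def_info
    {
      rtx_insn *insn;                   /* The defining insn.  */
      unsigned int n_recorded_uses;     /* Uses recorded explicitly.  */
      struct use_info uses[MAX_RECORDED_USES];
      /* Once n_recorded_uses reaches MAX_RECORDED_USES, later uses are
         only reflected here, so the def can never be moved past
         last_use_point.  */
      unsigned int first_use_point;
      unsigned int last_use_point;
    };

Is that about right?
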
> >> - it tries using REG_EQUAL notes for the final instruction.
> >
> > And that.
>
> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
> with i3's REG_EQUAL note and combining into that.  Does combine do that?
> I couldn't see it, and in:
>
>   https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>
> you seemed to reject the idea of allowing it.

Yes, I still do.  Do you have an example where it helps?

> >> - it can parallelise two independent instructions that both read from
> >>   the same register or both read from memory.
> >
> > That only if somehow there is a link between the two (so essentially
> > never).  The only combinations tried by combine are those via LOG_LINKs,
> > which are between a SET and the first corresponding use.  This is a key
> > factor that makes it kind of linear (instead of exponential) complexity.
>
> Tracking limited def-use chains is what makes this last bit easy.
> We can just try parallelising two instructions from the (bounded) list
> of uses.  And for this case there's not any garbage rtl involved, since
> we reuse the same PARALLEL rtx between attempts.  The cost is basically
> all in the recog call (which would obviously mount up if we went
> overboard).

*All* examples above and below are just this.  If you disable everything
else, what do the statistics look like then?
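
For concreteness, I picture that attempt as roughly the shape below (the
function name and details are mine, not from your patch; a real version
would at least have to worry about clobbers and notes as well):

    /* See whether two independent SETs, e.g. two uses of the same
       register, recog as a single two-element PARALLEL.  The caller
       allocates SCRATCH_PAR once, for example as
         gen_rtx_PARALLEL (VOIDmode, rtvec_alloc (2)),
       and reuses it for every attempt, so a failed match leaves no new
       rtl behind.  */

    static bool
    parallel_would_recog_p (rtx_insn *i1, rtx_insn *i2, rtx scratch_par)
    {
      XVECEXP (scratch_par, 0, 0) = PATTERN (i1);
      XVECEXP (scratch_par, 0, 1) = PATTERN (i2);
      int num_clobbers = 0;
      return recog (scratch_par, i2, &num_clobbers) >= 0;
    }

That would match your point that the per-attempt cost is essentially
just the recog call.
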
> > One thing I want to do is some mini-combine after every split, probably
> > only with the insns new from the split.  But we have no cfglayout mode
> > anymore then, and only hard regs (except in the first split pass, which
> > is just a little later than your new pass).
>
> Yeah, sounds like it could be useful.  I guess there'd need to be
> an extra condition on the combination that the new insn can't be
> immediately split.

It would run *after* split.  Not interleaved with it.

> > And amount of garbage produced?
>
> If -ftime-report stats are accurate, then the total amount of
> memory allocated is:
>
>   run-combine=2 (normal combine): 1793 kB
>   run-combine=4 (new pass only):    98 kB
>   run-combine=6 (both passes):    1871 kB  (new pass accounts for 78 kB)
>
> But again that's not a fair comparison when the main combine pass does more.

The way combine does SUBST is pretty fundamental to how it works (it can
be ripped out, and probably we'll have to at some point, but that will be
very invasive).  Originally all this temporary RTL was on obstacks and
reaping it was cheap, but everything is GCed now (fixing the bugs was not
cheap :-) ).

If you look at even really bad cases, combine is still only a few percent
of total, so it isn't too bad.

> I did try hard to keep the amount of garbage rtl down though.  This is
> why I added validate_simplify_replace_rtx rather than trying to make
> do with existing routines.  It should only create new rtl if the
> simplification routines did something useful.  (Of course, that's mostly
> true of combine as well, but things like the make_compound_operation/
> expand_compound_operation wrangler can create expressions that are never
> actually useful.)

Don't mention those, thanks :-)

> >> To get a feel for the effect on multiple targets, I did my usual
> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
> >> using -O2 -ftree-vectorize:
> >
> > One problem with this is that these are very short functions on average.
>
> There are some long ones too :-)

Yes, but this isn't a good stand-in for representative programs.

> > What is the kind of changes you see for other targets?
>
> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
> in favour of other flag-setting instructions and making more use of
> post-increments.  Not sure the last one is actually a win, but the
> target costs say it's OK :-).  E.g. from gcc.c-torture/execute/pr78675.c:
>
> @@ -48,9 +48,8 @@
>         blr
>         .align 4
>  .L19:
> -       cmpdi 0,10,0
> +       mr. 9,10
>         mr 3,8
> -       mr 9,10
>         bne 0,.L9
>         b .L3
>         .align 4

Okay, so this is combining two uses of r10 into one insn.  This isn't
necessarily a good idea: the combined insn cannot be moved as much as one
of its components could, which can also immediately prevent further
combinations.  But doing this after combine, as you do, is probably
beneficial.

> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:

This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
because this is cheaper on p7 and p8).

> gcc.c-torture/execute/20081218-1.c is an example where we make more use
> of post-increment:
>
>  .L9:
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
> -       lbz 10,1(9)
> -       addi 9,9,1
> +       lbzu 10,1(9)
>         cmpwi 0,10,38
>         bne 0,.L8
>         bdnz .L9

Pre-increment (we only *have* pre-modify memory accesses).

> /* Mimic combine's behavior by not combining moves from allocatable hard
>    registers (e.g. when copying parameters or function return values).  */
> if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>   return false;
>
> Although if that could have accounted for the difference, it sounds like
> we're leaving a lot on the table by doing this :-)

It actually helps (and quite a bit).  But if your test cases are mainly
tiny functions, anything can happen.  Still, since you see this across
all targets, it must be doing something good :-)

So I'd love to see statistics for *only* combining two uses of the same
thing; this is something combine cannot do, and arguably *shouldn't* do!


Segher