From: Richard Sandiford <richard.sandiford@arm.com>
To: Segher Boessenkool <segher@kernel.crashing.org>
Cc: gcc-patches@gcc.gnu.org
Subject: Re: Add a new combine pass
Date: Wed, 20 Nov 2019 18:29:00 -0000 [thread overview]
Message-ID: <mptsgmih0jx.fsf@arm.com> (raw)
In-Reply-To: <20191119203932.GC16031@gate.crashing.org> (Segher Boessenkool's message of "Tue, 19 Nov 2019 14:39:32 -0600")
Segher Boessenkool <segher@kernel.crashing.org> writes:
>>       /* Make sure that the value that is to be substituted for the register
>> 	 does not use any registers whose values alter in between.  However,
>> 	 If the insns are adjacent, a use can't cross a set even though we
>> 	 think it might (this can happen for a sequence of insns each setting
>> 	 the same destination; last_set of that register might point to
>> 	 a NOTE).  If INSN has a REG_EQUIV note, the register is always
>> 	 equivalent to the memory so the substitution is valid even if there
>> 	 are intervening stores.  Also, don't move a volatile asm or
>> 	 UNSPEC_VOLATILE across any other insns.  */
>>       || (! all_adjacent
>> 	  && (((!MEM_P (src)
>> 		|| ! find_reg_note (insn, REG_EQUIV, src))
>> 	       && modified_between_p (src, insn, i3))
>> 	      || (GET_CODE (src) == ASM_OPERANDS && MEM_VOLATILE_P (src))
>> 	      || GET_CODE (src) == UNSPEC_VOLATILE))
>
> So this would work if you had pseudos here, instead of the hard reg?
> Because it is a hard reg it is the same number in both places, making it
> hard to move.
Yeah, probably. But the hard reg is a critical part of this.
Going back to the example:
(set (reg/v:VNx16BI 102 [ ok ])
     (reg:VNx16BI 85 ffrt))
(set (reg:VNx16BI 85 ffrt)
     (unspec:VNx16BI [(reg:VNx16BI 85 ffrt)] UNSPEC_UPDATE_FFRT))
(set (reg:CC_NZC 66 cc)
     (unspec:CC_NZC
       [(reg:VNx16BI 106) repeated x2
        (const_int 1 [0x1])
        (reg/v:VNx16BI 102 [ ok ])] UNSPEC_PTEST))
FFR is the real first-fault register. FFRT is actually a fake register
whose only purpose is to describe the dependencies (in rtl) between writes
to the FFR, reads from the FFR and first-faulting loads. The whole scheme
depends on having only one fixed FFRT register.
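To illustrate why a single fixed register serializes everything that touches it, here is a toy model (not GCC code; all names are hypothetical) in which registers are bits in a mask and every FFR-related insn both reads and writes the one FFRT bit. Two insns can be reordered only if neither writes anything the other touches, so the move above can never cross the FFRT update:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of RTL-level dependencies: each register is one bit.
   FFRT is a single fixed bit that every FFR-related insn both reads
   and writes, which is what serializes them.  */
#define FFRT (UINT32_C (1) << 0)

struct toy_insn
{
  uint32_t reads;   /* registers read by the insn */
  uint32_t writes;  /* registers written by the insn */
};

/* True if A and B have no dependency, i.e. they could be reordered.  */
static bool
independent_p (struct toy_insn a, struct toy_insn b)
{
  return (a.writes & (b.reads | b.writes)) == 0
	 && (b.writes & a.reads) == 0;
}
```

Because the FFRT update reads and writes the same fixed register, `independent_p` is false both between the "ok = ffrt" move and the update and between any two updates, so nothing that touches FFRT can be moved past anything else that does.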
>> > How are dependencies represented in your new pass? If it just does
>> > walks over the insn stream for everything, you get quadratic complexity
>> > if you move insns backwards. We have that in combine already, mostly
>> > from modified_between_p, but that is limited because of how LOG_LINKS
>> > work, and we have been doing this for so long and there are no problems
>> > found with it, so it must work in practice. But I am worried about it
>> > when moving insns back an unlimited distance.
>>
>> It builds def-use chains, but using a constant limit on the number of
>> explicitly-recorded uses. All other uses go in a numerical live range
>> from which they (conservatively) never escape. The def-use chains
>> represent memory as a single entity, a bit like in gimple.
>
> Ah. So that range thing ensures correctness.
Yeah.
> Why don't you use DF for the DU chains?
The problem with DF_DU_CHAIN is that it's quadratic in the worst case.
fwprop.c gets around that by using the MD problem and having its own
dominator walker to calculate limited def-use chains:
  /* We use the multiple definitions problem to compute our restricted
     use-def chains.  */
So taking that approach here would still require some amount of
roll-your-own. Other reasons are:
* Even what fwprop does is more elaborate than we need for now.
* We need to handle memory too, and it's nice to be able to handle
it in the same way as registers.
* Updating a full, ordered def-use chain after a move is a linear-time
operation, so whatever happens, we'd need to apply some kind of limit
on the number of uses we maintain, with something like that integer
point range for the rest.
* Once we've analysed the insn and built its def-use chains, we don't
look at the df_refs again until we update the chains after a successful
combination. So it should be more efficient to maintain a small array
of insn_info_rec pointers alongside the numerical range, rather than
walk and pollute chains of df_refs and then link back the insn uids
to the pass-local info.
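The bounded-chain idea described above can be sketched as follows. This is only an illustration of the data structure, not the patch's actual code, and all names (`def_info`, `record_use`, the limit of 4) are hypothetical: each definition records at most a fixed number of uses explicitly, while *every* use, recorded or not, widens a conservative range of instruction points from which it never escapes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limit on explicitly-recorded uses per definition.  */
#define MAX_RECORDED_USES 4

struct def_info
{
  int uses[MAX_RECORDED_USES];	/* insn points of explicitly-recorded uses */
  int num_uses;			/* number of valid entries in USES */
  int first_use;		/* conservative range covering *all* uses; */
  int last_use;			/* first_use is -1 while there are no uses */
};

static void
init_def (struct def_info *def)
{
  def->num_uses = 0;
  def->first_use = -1;
  def->last_use = -1;
}

/* Note a use of DEF at insn point POINT.  Beyond the limit, the use is
   only reflected in the conservative range.  */
static void
record_use (struct def_info *def, int point)
{
  if (def->num_uses < MAX_RECORDED_USES)
    def->uses[def->num_uses++] = point;
  if (def->first_use < 0 || point < def->first_use)
    def->first_use = point;
  if (point > def->last_use)
    def->last_use = point;
}

/* True if DEF provably has no uses in [LO, HI]; safe even for the
   uses that were never recorded explicitly.  */
static bool
no_uses_in_range_p (const struct def_info *def, int lo, int hi)
{
  return def->first_use < 0 || def->last_use < lo || def->first_use > hi;
}
```

Updating this after a move is O(1) per definition, which is the point: correctness never depends on the explicit list being complete, only on the range being conservative.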
>> >> - it tries using REG_EQUAL notes for the final instruction.
>> >
>> > And that.
>>
>> I meant REG_EQUAL notes on i3, i.e. it tries replacing the src of i3
>> with i3's REG_EQUAL note and combining into that. Does combine do that?
>> I couldn't see it, and in:
>>
>> https://gcc.gnu.org/ml/gcc/2019-06/msg00148.html
>>
>> you seemed to reject the idea of allowing it.
>
> Yes, I still do. Do you have an example where it helps?
I'll run another set of tests for that.
>> >> - it can parallelise two independent instructions that both read from
>> >> the same register or both read from memory.
>> >
>> > That only if somehow there is a link between the two (so essentially
>> > never). The only combinations tried by combine are those via LOG_LINKs,
>> > which are between a SET and the first corresponding use. This is a key
>> > factor that makes it kind of linear (instead of exponential) complexity.
>>
>> Tracking limited def-use chains is what makes this last bit easy.
>> We can just try parallelising two instructions from the (bounded) list
>> of uses. And for this case there's not any garbage rtl involved, since
>> we reuse the same PARALLEL rtx between attempts. The cost is basically
>> all in the recog call (which would obviously mount up if we went
>> overboard).
>
> *All* examples above and below are just this.
Yeah, the powerpc and s390x examples were. The motivating FFR example
above isn't though: it's a def-use combination in parallel with the
existing definition.
> If you disable everything else, what do the statistics look like then?
I had no idea how this would turn out -- which is a good sign that it was
worth doing -- but: results below.
>> > One thing I want to do is some mini-combine after every split, probably
>> > only with the insns new from the split. But we have no cfglayout mode
>> > anymore then, and only hard regs (except in the first split pass, which
>> > is just a little later than your new pass).
>>
>> Yeah, sounds like it could be useful. I guess there'd need to be
>> an extra condition on the combination that the new insn can't be
>> immediately split.
>
> It would run *after* split. Not interleaved with it.
Yeah. But what I meant was: a lot of insns that are split after
reload are combined for RA purposes and the split form is really
the preferred form (especially for scheduling). So if we have
a combine pass *after* split, I think it should avoid using any
combination that matches a split.
>> > And amount of garbage produced?
>>
>> If -ftime-report stats are accurate, then the total amount of
>> memory allocated is:
>>
>> run-combine=2 (normal combine): 1793 kB
>> run-combine=4 (new pass only): 98 kB
>> run-combine=6 (both passes): 1871 kB (new pass accounts for 78 kB)
>>
>> But again that's not a fair comparison when the main combine pass does more.
>
> The way combine does SUBST is pretty fundamental to how it works (it can
> be ripped out, and probably we'll have to at some point, but that will be
> very invasive). Originally all this temporary RTL was on obstacks and
> reaping it was cheap, but everything is GCed now (fixing the bugs was not
> cheap :-) )
Yeah, I remember :-)
> If you look at even really bad cases, combine is still only a few
> percent of total, so it isn't too bad.
>
>> I did try hard to keep the amount of garbage rtl down though. This is
>> why I added validate_simplify_replace_rtx rather than trying to make
>> do with existing routines. It should only create new rtl if the
>> simplification routines did something useful. (Of course, that's mostly
>> true of combine as well, but things like the make_compound_operation/
>> expand_compound_operation wrangler can create expressions that are never
>> actually useful.)
>
> Don't mention those, thanks :-)
>
>> >> To get a feel for the effect on multiple targets, I did my usual
>> >> bogo-comparison of number of lines of asm for gcc.c-torture, gcc.dg
>> >> and g++.dg, this time comparing run-combine=2 and run-combine=6
>> >> using -O2 -ftree-vectorize:
>> >
>> > One problem with this is that these are very short functions on average.
>>
>> There are some long ones too :-)
>
> Yes, but this isn't a good stand-in for representative programs.
Right. And number of lines of asm isn't a good stand-in for anything much.
Like I say, the whole thing is just to get a feel, on tests that are readily
to hand and are easy to compile without a full toolchain.
>> > What is the kind of changes you see for other targets?
>>
>> On powerpc64le-linux-gnu it mostly comes from eliminating comparisons
>> in favour of other flag-setting instructions and making more use of
>> post-increments. Not sure the last one is actually a win, but the
>> target costs say it's OK :-). E.g. from gcc.c-torture/execute/pr78675.c:
>>
>> @@ -48,9 +48,8 @@
>>          blr
>>          .align 4
>>  .L19:
>> -        cmpdi 0,10,0
>> +        mr. 9,10
>>          mr 3,8
>> -        mr 9,10
>>          bne 0,.L9
>>          b .L3
>>          .align 4
>
> Okay, so this is combining two uses of r10 into one insn.
>
> This isn't necessarily a good idea: the combined insn cannot be moved as
> much as one of its components could, which can also immediately prevent
> further combinations.
>
> But doing this after combine, as you do, is probably beneficial.
>
>> and a slightly more interesting example in gcc.c-torture/execute/loop-6.c:
>
> This is the same thing (we do andi. a,b,0xff instead of rlwinm. a,b,0,0xff
> because this is cheaper on p7 and p8).
>
>> gcc.c-torture/execute/20081218-1.c is an example where we make more use
>> of post-increment:
>>
>>  .L9:
>> -        lbz 10,1(9)
>> -        addi 9,9,1
>> +        lbzu 10,1(9)
>>          cmpwi 0,10,38
>>          bne 0,.L8
>> -        lbz 10,1(9)
>> -        addi 9,9,1
>> +        lbzu 10,1(9)
>>          cmpwi 0,10,38
>>          bne 0,.L8
>>          bdnz .L9
>
> Pre-increment (we only *have* pre-modify memory accesses).
Oops, yes.
>>   /* Mimic combine's behavior by not combining moves from allocatable hard
>>      registers (e.g. when copying parameters or function return values).  */
>>   if (REG_P (src) && HARD_REGISTER_P (src) && !fixed_regs[REGNO (src)])
>>     return false;
>>
>> Although if that could have accounted for the difference, it sounds like
>> we're leaving a lot on the table by doing this :-)
>
> It actually helps (and quite a bit). But if your test cases are mainly
> tiny functions, anything can happen. But since you see this across all
> targets, it must be doing something good :-)
>
>
> So I'd love to see statistics for *only* combining two uses of the same
> thing, this is something combine cannot do, and arguably *shouldn't* do!
OK, here are two sets of results. The first is for:
(A) --param run-combine=2 (current combine only)
(B) --param run-combine=6 (both passes), use-use combinations only
Target                Tests    Delta    Best  Worst  Median
======                =====    =====    ====  =====  ======
aarch64-linux-gnu       158     3060     -72    520      -1
aarch64_be-linux-gnu    111       24     -57    324      -1
alpha-linux-gnu           3        3       1      1       1
amdgcn-amdhsa            18       71     -17     26       1
arc-elf                 310    -4414   -1516    356       1
arm-linux-gnueabi        28      -50     -13      3      -1
arm-linux-gnueabihf      28      -50     -13      3      -1
avr-elf                  26      308      -1     36      12
bfin-elf                  6        8      -1      3       1
bpf-elf                  10       21      -1      6       1
c6x-elf                   7        9      -6      6       1
cr16-elf                 13      102       1     27       2
cris-elf                 35    -1001    -700      3      -2
csky-elf                  9       28       1      6       2
epiphany-elf             29      -29      -2      1      -1
fr30-elf                 12       17      -1      5       1
frv-linux-gnu             1       -2      -2     -2      -2
ft32-elf                 10       22      -1      5       2
h8300-elf                29       56     -22     14       2
hppa64-hp-hpux11.23       9       17      -1      4       2
i686-apple-darwin        10      -33     -20     12      -2
i686-pc-linux-gnu        41      243     -12     33       3
ia64-linux-gnu           28      -32     -29     39      -4
iq2000-elf                6        8       1      2       1
lm32-elf                 10       12      -3      5       1
m32r-elf                  3        2      -2      2       2
m68k-linux-gnu           19       27      -2      5       2
mcore-elf                14       23     -10      6       2
microblaze-elf            5        5       1      1       1
mipsel-linux-gnu          9       12      -5      6       2
mipsisa64-linux-gnu       7        1      -3      1       1
mmix                      6        6      -2      4       1
mn10300-elf              20       15      -4      5       1
moxie-rtems               8       11      -2      3       1
msp430-elf                8       24       1      6       2
nds32le-elf              91     -188     -24    136      -1
nios2-linux-gnu           2        6       1      5       1
nvptx-none              396      756       1     16       1
or1k-elf                  8       20       1      4       2
pdp11                    65      149     -10     45       2
powerpc-ibm-aix7.0     1039     1114    -366   2124      -1
powerpc64-linux-gnu     854     2753    -274   3094      -2
powerpc64le-linux-gnu   648     -551    -340    208      -1
pru-elf                   5        5      -2      3       1
riscv32-elf               7        6      -2      5       1
riscv64-elf               2        5       2      3       2
rl78-elf                 80     -648     -98     13      -4
rx-elf                   16        2      -4      5      -1
s390-linux-gnu           60     -174     -39     14      -1
s390x-linux-gnu         152     -781    -159     14      -1
sh-linux-gnu             13        5     -15      7       1
sparc-linux-gnu          29        7      -3     11      -1
sparc64-linux-gnu        51        1      -8     15      -1
tilepro-linux-gnu       119     -567    -164     15      -2
v850-elf                  4        4      -1      3       1
vax-netbsdelf            10       13      -4      5       1
visium-elf                4        0      -5      3       1
x86_64-darwin             7      -12      -9      4      -2
x86_64-linux-gnu          6      -11      -6      4      -2
xstormy16-elf            10       13       1      2       1
xtensa-elf                6        8      -1      2       2
which definitely shows up some outliers I need to look at. The second
set is for:
(B) --param run-combine=6 (both passes), use-use combinations only
(C) --param run-combine=6 (both passes), no restrictions
Target                Tests    Delta    Best  Worst  Median
======                =====    =====    ====  =====  ======
aarch64-linux-gnu       272    -3844    -585     18      -1
aarch64_be-linux-gnu    190    -3336    -370     18      -1
alpha-linux-gnu         401    -2735    -370     22      -2
amdgcn-amdhsa           188     1867    -484   1259      -1
arc-elf                 257    -1498    -650     54      -1
arm-linux-gnueabi       168    -1117    -612    680      -1
arm-linux-gnueabihf     168    -1117    -612    680      -1
avr-elf                1341  -111401  -13824    680     -10
bfin-elf               1346   -18950   -8461    465      -2
bpf-elf                  63     -496     -60      3      -2
c6x-elf                 179   -10527  -10084     41      -2
cr16-elf               1616   -51479  -10657     42     -13
cris-elf                113     -533     -84      4      -2
csky-elf                129    -3399    -474      1      -2
epiphany-elf            151     -375    -149     84      -1
fr30-elf                155    -1773    -756    289      -2
frv-linux-gnu           808   -13332   -2074     67      -1
ft32-elf                276    -1688    -111     -1      -2
h8300-elf               527   -11522   -1747     68      -3
hppa64-hp-hpux11.23     179     -865    -142     34      -1
i686-apple-darwin       335    -1266     -56     44      -1
i686-pc-linux-gnu       222    -2216    -556     32      -1
ia64-linux-gnu          122    -4793   -1134     40      -5
iq2000-elf              171    -1341     -61      3      -2
lm32-elf                187    -1814    -316     47      -2
m32r-elf                 70     -597     -98     11      -2
m68k-linux-gnu          197    -2375    -332    148      -2
mcore-elf               125    -1236    -146      7      -1
microblaze-elf          442    -4498   -2094     32      -2
mipsel-linux-gnu        125    -2050    -222     60      -2
mipsisa64-linux-gnu     107    -2015    -130     14      -2
mmix                    103     -239     -26      4      -1
mn10300-elf             215    -1039    -234     80      -1
moxie-rtems             149     -754     -79      4      -2
msp430-elf              180     -600     -63     19      -1
nds32le-elf             183     -287     -37     32      -1
nios2-linux-gnu          81     -329     -66      4      -1
nvptx-none              200    -1882    -208     -2      -2
or1k-elf                 57     -317     -25      2      -1
pdp11                   207    -1441    -182     83      -2
powerpc-ibm-aix7.0      400    -4145    -271     14      -2
powerpc64-linux-gnu     375    -2062    -160    117      -2
powerpc64le-linux-gnu   491    -4169    -700    156      -2
pru-elf                  47    -7020   -6921      6      -1
riscv32-elf              59    -1379    -139      7      -2
riscv64-elf              89    -1562    -264      7      -1
rl78-elf                289   -16157   -1665     42      -6
rx-elf                   82     -195     -53      8      -1
s390-linux-gnu          128    -2108   -1485     63      -1
s390x-linux-gnu         112      418     -32    522      -1
sh-linux-gnu            218     -410    -108     68      -1
sparc-linux-gnu         141     -866     -99     18      -1
sparc64-linux-gnu       129     -792    -102      3      -2
tilepro-linux-gnu       953    -4331    -297    332      -2
v850-elf                 50     -412     -53      2      -3
vax-netbsdelf           254    -3328    -400      4      -2
visium-elf              100     -693    -138     16      -1
x86_64-darwin           345    -2134    -490     72      -1
x86_64-linux-gnu        307     -843    -288    210      -1
xstormy16-elf           218     -788    -156     59      -1
xtensa-elf              195    -1426    -322     36       1
So the main benefit does seem to come from the def-use part.
Here are some powerpc64le-linux-gnu examples of (B)->(C):
gcc.c-torture/execute/20171008-1.c:
@@ -79,8 +79,7 @@
         stdu 1,-32(1)
 .LCFI5:
         bl foo
-        rlwinm 3,3,0,0xff
-        cmpwi 0,3,0
+        andi. 9,3,0xff
         bne 0,.L13
         addi 1,1,32
gcc.c-torture/execute/pr28982a.c:
@@ -427,15 +427,13 @@
         stxvd2x 0,7,6
         .align 4
 .L9:
-        xxlor 12,32,32
+        xvcvsxwsp 0,32
         vadduwm 0,0,1
         addi 10,9,16
-        xvcvsxwsp 0,12
-        xxlor 12,32,32
+        stxvd2x 0,0,9
+        xvcvsxwsp 0,32
+        addi 9,9,32
         vadduwm 0,0,1
-        stxvd2x 0,0,9
-        xvcvsxwsp 0,12
-        addi 9,9,32
         stxvd2x 0,0,10
         bdnz .L9
         li 3,4
(Disclaimer: I have no idea if that's correct.)
gcc.c-torture/execute/pr65215-3.c:
@@ -56,11 +56,10 @@
         srdi 10,3,32
         srdi 9,3,56
         slwi 6,10,24
-        srwi 7,10,8
+        rlwinm 7,10,24,16,23
         or 9,9,6
-        rlwinm 7,7,0,16,23
+        rlwinm 10,10,8,8,15
         or 9,9,7
-        rlwinm 10,10,8,8,15
         or 9,9,10
         cmpw 0,9,8
         bne 0,.L4
Just to emphasise though: I'm not proposing that we switch this on for
all targets yet. It would be opt-in until the pass is more mature.
But that FFR case is really important for the situation it handles.
Thanks,
Richard