From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Alexander Monakov <amonakov@ispras.ru>,
GCC Development <gcc@gcc.gnu.org>
Subject: Re: LTO slows down calculix by more than 10% on aarch64
Date: Tue, 22 Sep 2020 21:54:36 +0530 [thread overview]
Message-ID: <CAAgBjMmQ6StnHV5ugHaW9mMaW7ajs40K4GR2QU3eC0fZrG0p5Q@mail.gmail.com> (raw)
In-Reply-To: <CAFiYyc0Dk+TXS0Z-MmCK7LAspFeVBq3ecZNuQV+AQtgW2-DF=Q@mail.gmail.com>
On Tue, 22 Sep 2020 at 16:36, Richard Biener <richard.guenther@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Tue, 22 Sep 2020 at 12:56, Richard Biener <richard.guenther@gmail.com> wrote:
> > >
> > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > <prathamesh.kulkarni@linaro.org> wrote:
> > > > > >
> > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <amonakov@ispras.ru> wrote:
> > > > > > >
> > > > > > > > I obtained perf stat results for the following benchmark runs:
> > > > > > > >
> > > > > > > > -O2:
> > > > > > > >
> > > > > > > > 7856832.692380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 3758 context-switches # 0.000 K/sec
> > > > > > > > 40 cpu-migrations # 0.000 K/sec
> > > > > > > > 40847 page-faults # 0.005 K/sec
> > > > > > > > 7856782413676 cycles # 1.000 GHz
> > > > > > > > 6034510093417 instructions # 0.77 insn per cycle
> > > > > > > > 363937274287 branches # 46.321 M/sec
> > > > > > > > 48557110132 branch-misses # 13.34% of all branches
> > > > > > >
> > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> > > > > > > enough for this kind of code)
> > > > > > >
> > > > > > > > -O2 with orthonl inlined:
> > > > > > > >
> > > > > > > > 8319643.114380 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 4285 context-switches # 0.001 K/sec
> > > > > > > > 28 cpu-migrations # 0.000 K/sec
> > > > > > > > 40843 page-faults # 0.005 K/sec
> > > > > > > > 8319591038295 cycles # 1.000 GHz
> > > > > > > > 6276338800377 instructions # 0.75 insn per cycle
> > > > > > > > 467400726106 branches # 56.180 M/sec
> > > > > > > > 45986364011 branch-misses # 9.84% of all branches
> > > > > > >
> > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably implying
> > > > > > > that extra instructions are appearing in this loop nest, but not in the innermost
> > > > > > > loop. As a reminder for others, the innermost loop has only 3 iterations.
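
[A quick arithmetic check of these deltas, recomputed from the two perf stat runs quoted above (illustrative Python; the counter values are copied verbatim from the runs):]

```python
# Counter totals from the "-O2" and "-O2 with orthonl inlined" runs above.
base = {"cycles": 7856782413676, "instructions": 6034510093417,
        "branches": 363937274287}
inlined = {"cycles": 8319591038295, "instructions": 6276338800377,
           "branches": 467400726106}

for name in ("branches", "instructions", "cycles"):
    delta = inlined[name] - base[name]
    print(f"{name}: +{delta / 1e9:.0f}e9")
# branches: +103e9, instructions: +242e9, cycles: +463e9
```

The exact deltas come out to roughly +103e9 branches, +242e9 instructions and +463e9 cycles, consistent with the rounded figures cited in the analysis.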
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > >
> > > > > > > > 8207331.088040 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 2266 context-switches # 0.000 K/sec
> > > > > > > > 32 cpu-migrations # 0.000 K/sec
> > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > 8207292032467 cycles # 1.000 GHz
> > > > > > > > 6035724436440 instructions # 0.74 insn per cycle
> > > > > > > > 364415440156 branches # 44.401 M/sec
> > > > > > > > 53138327276 branch-misses # 14.58% of all branches
> > > > > > >
> > > > > > > This seems to match baseline in terms of instruction count, but without PRE
> > > > > > > the loop nest may be carrying some dependencies over memory. I would simply
> > > > > > > check the assembly for the entire 6-level loop nest in question, I hope it's
> > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > >
> > > > > > > > 7797265.206850 task-clock (msec) # 1.000 CPUs utilized
> > > > > > > > 3139 context-switches # 0.000 K/sec
> > > > > > > > 20 cpu-migrations # 0.000 K/sec
> > > > > > > > 40846 page-faults # 0.005 K/sec
> > > > > > > > 7797221351467 cycles # 1.000 GHz
> > > > > > > > 6187348757324 instructions # 0.79 insn per cycle
> > > > > > > > 461840800061 branches # 59.231 M/sec
> > > > > > > > 26920311761 branch-misses # 5.83% of all branches
> > > > > > >
> > > > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle count.
> > > > > > > I don't think the former fully covers the latter (there's also a 90e9 reduction
> > > > > > > in insn count).
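
[The same arithmetic check for this comparison, with counter values copied from the "inlined" and "hoisting disabled" perf stat runs above:]

```python
# Totals for "-O2 with orthonl inlined" vs. "hoisting disabled".
inlined = {"cycles": 8319591038295, "instructions": 6276338800377,
           "branch_misses": 45986364011}
no_hoist = {"cycles": 7797221351467, "instructions": 6187348757324,
            "branch_misses": 26920311761}

for name in ("branch_misses", "instructions", "cycles"):
    delta = inlined[name] - no_hoist[name]
    print(f"{name}: -{delta / 1e9:.0f}e9 with hoisting disabled")
# branch_misses: -19e9, instructions: -89e9, cycles: -522e9
```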
> > > > > > >
> > > > > > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > > > > > consider what the profile for the entire loop nest looks like (it's 6 loops deep,
> > > > > > > each iterating only 3 times).
> > > > > > >
> > > > > > > > Perf profiles for
> > > > > > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > > 3196866 |1f04: ldur d1, [x1, #-248]
> > > > > > > > 216348301800│ add w0, w0, #0x1
> > > > > > > > 985098 | add x2, x2, #0x18
> > > > > > > > 216215999206│ add x1, x1, #0x48
> > > > > > > > 215630376504│ fmul d1, d5, d1
> > > > > > > > 863829148015│ fmul d1, d1, d6
> > > > > > > > 864228353526│ fmul d0, d1, d0
> > > > > > > > 864568163014│ fmadd d2, d0, d16, d2
> > > > > > > > │ cmp w0, #0x4
> > > > > > > > 216125427594│ ↓ b.eq 1f34
> > > > > > > > 15010377│ ldur d0, [x2, #-8]
> > > > > > > > 143753737468│ ↑ b 1f04
> > > > > > > >
> > > > > > > > -O2 with inlined orthonl:
> > > > > > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > > > > > >
> > > > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > > > > 144055883055│ add w0, w0, #0x1
> > > > > > > > 72262104254│ add x2, x2, #0x18
> > > > > > > > 143991169721│ add x1, x1, #0x48
> > > > > > > > 288648917780│ fmul d15, d17, d15
> > > > > > > > 864665644756│ fmul d15, d15, d18
> > > > > > > > 863868426387│ fmul d14, d15, d14
> > > > > > > > 865228159813│ fmadd d16, d14, d31, d16
> > > > > > > > 245967│ cmp w0, #0x4
> > > > > > > > 215396760545│ ↓ b.eq 1f28
> > > > > > > > 704732365│ ldur d14, [x2, #-8]
> > > > > > > > 143775979620│ ↑ b 1ef8
> > > > > > >
> > > > > > > This indicates that the loop only covers about 46-48% of overall time.
> > > > > > >
> > > > > > > High count on the initial ldur instruction could be explained if the loop
> > > > > > > is not entered by "fallthru" from the preceding block, or if its backedge
> > > > > > > is mispredicted. Sampling mispredictions should be possible with perf record,
> > > > > > > and you may be able to check if loop entry is fallthrough by inspecting
> > > > > > > assembly.
> > > > > > >
> > > > > > > It may also be possible to check if code alignment matters, by compiling with
> > > > > > > -falign-loops=32.
> > > > > > Hi,
> > > > > > Thanks a lot for the detailed feedback, and I am sorry for the late response.
> > > > > >
> > > > > > The hoisting region is:
> > > > > >
> > > > > >   if (mattyp.eq.1) then
> > > > > >     <4 loops>
> > > > > >   elseif (mattyp.eq.2) then
> > > > > >     orthonl inlined into basic block;
> > > > > >     loads w[0] .. w[8]
> > > > > >   else
> > > > > >     <6 loops>  // loads anisox
> > > > > >
> > > > > > followed by basic block:
> > > > > >
> > > > > > senergy=
> > > > > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > > > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > > > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > > > > s(ii1,jj1)=s(ii1,jj1)+senergy
> > > > > > s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
> > > > > > s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy
> > > > > >
> > > > > > Hoisting hoists the loads w[0] .. w[8] from the orthonl and senergy blocks
> > > > > > into block 181, which is:
> > > > > > if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
> > > > > >
> > > > > > which is then further hoisted to block 173:
> > > > > > if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
> > > > > >
> > > > > > From block 181, we have two paths towards senergy block (bb 194):
> > > > > > bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
> > > > > > AND
> > > > > > bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
> > > > > > which has a path length of around 18 blocks.
> > > > > > (bb 194 post-dominates bb 181 and bb 173).
> > > > > >
> > > > > > Disabling only load hoisting within blocks 173 and 181
> > > > > > (by simply avoiding insertion of pre_expr if pre_expr->kind == REFERENCE)
> > > > > > avoids hoisting of the 'w' array and brings back most of the performance,
> > > > > > which verifies that it is the hoisting of the 'w' array (w[0] ... w[8])
> > > > > > that is causing the slowdown.
> > > > > >
> > > > > > I obtained perf profiles for full hoisting, and for disabled hoisting of the
> > > > > > 'w' array for the 6 loops, and the most drastic difference was
> > > > > > for the ldur instruction:
> > > > > >
> > > > > > With full hoisting:
> > > > > > 359871503840│ 1ef8: ldur d15, [x1, #-248]
> > > > > >
> > > > > > Without full hoisting:
> > > > > > 3441224 │1edc: ldur d1, [x1, #-248]
> > > > > >
> > > > > > (The loop entry seems to be fall-through in both cases. I have attached
> > > > > > profiles for both cases.)
> > > > > >
> > > > > > IIUC, the instruction seems to be loading the first element from the anisox
> > > > > > array, which makes me wonder if the issue is a data-cache miss in the slower
> > > > > > version. I ran perf script on perf data for L1-dcache-load-misses with
> > > > > > period = 1 million, and it reported two cache misses on the ldur instruction
> > > > > > in the full-hoisting case, while it reported zero for the
> > > > > > disabled-load-hoisting case.
> > > > > > So I wonder if the slowdown happens because hoisting of the 'w' array
> > > > > > results in eviction of anisox, causing a cache miss inside the inner loop
> > > > > > and making the load slower?
> > > > > >
> > > > > > Hoisting also seems to reduce the overall number of cache misses, though.
> > > > > > For the case with hoisting of the 'w' array disabled, there were a total of
> > > > > > 463 cache misses, while with full hoisting there were 357 cache misses
> > > > > > (with period = 1 million).
> > > > > > Does that happen because hoisting probably reduces cache misses along
> > > > > > the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
> > > > > Hi,
> > > > > In general, I feel that for this case (or the PR80155 case) the issues come
> > > > > with long-range hoisting inside a large CFG: we don't have an accurate
> > > > > way to model target resources (register pressure in the PR80155 case, or
> > > > > possibly cache pressure in this case?) at the tree level, and we end up
> > > > > with register spills or cache misses inside loops, which may offset the
> > > > > benefit of hoisting. As previously discussed, the right way to go is a
> > > > > live-range splitting pass at the GIMPLE -> RTL border, which could also
> > > > > help with other code-movement optimizations (or with source code that has
> > > > > variables with long live ranges).
> > > > >
> > > > > I was wondering, though, as a cheap workaround: would it make sense to
> > > > > check whether we are hoisting across a "large" region of nested loops, and
> > > > > avoid it in that case, since hoisting may exert resource pressure inside
> > > > > the loop region? (Especially in cases where the hoisted expressions were
> > > > > not originally AVAIL in any of the loop blocks, so the loop region
> > > > > doesn't benefit from hoisting.)
> > > > >
> > > > > For instance:
> > > > > FOR_EACH_EDGE (e, ei, block->succs)
> > > > >   {
> > > > >     /* Avoid hoisting across more than 3 nested loops.  */
> > > > >     if (e->dest is a loop pre-header or loop header
> > > > >         && nesting depth of the loop > 3)
> > > > >       return false;
> > > > >   }
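
[A self-contained toy model of this check (illustrative Python, not GCC code — `Block`, `hoisting_allowed`, and the field names are simplified stand-ins for GCC's basic-block and loop structures; the depth cutoff of 3 is the arbitrary value from the sketch above):]

```python
from dataclasses import dataclass

@dataclass
class Block:
    # Simplified stand-in for a GCC basic block: its loop nesting depth,
    # plus whether it is a loop header or a loop pre-header.
    loop_depth: int
    is_loop_header: bool = False
    is_loop_preheader: bool = False

def hoisting_allowed(succs, max_depth=3):
    """Mimic the proposed heuristic: refuse to hoist into a block if any
    successor edge enters a loop nested deeper than max_depth."""
    for dest in succs:
        if ((dest.is_loop_header or dest.is_loop_preheader)
                and dest.loop_depth > max_depth):
            return False
    return True

# A successor that is the pre-header of a 6-deep loop nest (as in the
# calculix case) blocks hoisting; shallow successors do not.
deep = Block(loop_depth=6, is_loop_preheader=True)
shallow = Block(loop_depth=2, is_loop_header=True)
plain = Block(loop_depth=0)
print(hoisting_allowed([plain, deep]))     # False
print(hoisting_allowed([plain, shallow]))  # True
```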
> > > > >
> > > > > I think this would work for resolving the calculix issue, because there
> > > > > it hoists across one region of 6 loops and another of 4 loops (didn't
> > > > > test yet).
> > > > > It's not bulletproof, in that it will miss cases where the loop header
> > > > > (or pre-header) isn't a successor of the candidate block (checking for
> > > > > that might get expensive, though?). I will test it on the gcc suite and
> > > > > SPEC for any regressions.
> > > > > Does this sound like a reasonable heuristic ?
Hi,
> > > > The attached patch implements the above heuristic.
> > > > Bootstrapped + tested on x86_64-linux-gnu with no regressions.
> > > > And it brings back most of the performance for calculix, on par with -O2
> > > > (without inlining orthonl).
> > > > I verified that with the patch there is no cache miss happening on the
> > > > load insn inside the loop
> > > > (with perf report -e L1-dcache-load-misses/period=1000000/).
> > > >
> > > > I am in the process of benchmarking the patch on aarch64 for SPEC speed
> > > > and will report numbers in a couple of days. (If required, we could
> > > > parametrize the number of nested loops, hardcoded (arbitrarily) to 3 in
> > > > this patch, and set it in the backend so as not to affect other targets.)
> > >
> > > I don't think this patch captures the case in a sensible way - it will simply
> > > never hoist computations out of loop header blocks with depth > 3 which
> > > is certainly not what you want. Also the pre-header check is odd - we're
> > > looking for computations in successors of BLOCK but clearly a computation
> > > in a pre-header is not at the same loop level as one in the header itself.
> > Well, my intent was to check whether we are hoisting across a region
> > that has multiple nested loops, and in that case avoid hoisting expressions
> > that do not belong to any of the loop blocks, because that may increase
> > resource pressure inside the loops. For instance, in the calculix issue we
> > hoist the 'w' array from the post-dom, and neither loop region has any
> > uses of 'w'. I agree that checking just for loop level is too coarse.
> > The check for the pre-header was essentially meant to see whether we are
> > hoisting across a loop region, not necessarily from within the loops.
>
> But it will fail to hoist *p in
>
> if (x)
> {
> a = *p;
> }
> else
> {
> b = *p;
> }
>
> <make distance large>
> pdom:
> c = *p;
>
> so it isn't what matters either. What happens at the immediate post-dominator
> isn't directly relevant - what matters would be if the pdom is the one making
> the value antic on one of the outgoing edges. In that case we're also going
> to PRE *p into the arm not computing *p (albeit in a later position). But
> that property is impossible to compute from the sets themselves (not even mentioning
> the arbitrary CFG that can be in between the block and its pdom, or the weird
> pdoms we compute for regions not having a path to exit, like infinite loops).
>
> You could eventually look at the pdom predecessors and if *p is not AVAIL_OUT
> in each of them we _might_ have the situation you want to protect against.
> But as said PRE insertion will likely have made sure it _is_ AVAIL_OUT in each
> of them ...
Hi Richard,
Thanks for the suggestions. Right, the issue here seems to be that the
post-dom block is making the expressions ANTIC. Before doing insert, could
we simply copy the AVAIL_OUT set of each block into another set, say
ORIG_AVAIL_OUT, as a guard against PRE eventually inserting expressions
into the pred blocks of the pdom and making them available?
And during hoisting, we could check whether each expr that is in
ANTIC_IN (pdom) is also in ORIG_AVAIL_OUT of each pred of the pdom,
if the distance is "large".
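
[A toy model of that guard (illustrative Python over plain sets, not the actual tree-ssa-pre.c data structures; the block names "pred1"/"pred2" and the large_distance flag are hypothetical):]

```python
# Per-block AVAIL_OUT sets, keyed by block name.  The snapshot is taken
# *before* PRE insertion can add expressions to them.
avail_out = {
    "pred1": {"a", "b"},          # *p not originally available here ...
    "pred2": {"a", "b", "*p"},    # ... but it is here.
}
orig_avail_out = {bb: set(exprs) for bb, exprs in avail_out.items()}

# PRE insertion later makes *p available out of pred1 as well.
avail_out["pred1"].add("*p")

antic_in_pdom = {"*p"}  # the pdom makes *p anticipated

def hoist_expr_ok(expr, pdom_preds, large_distance):
    """Refuse a long-range hoist of EXPR when it was not *originally*
    AVAIL_OUT of every predecessor of the post-dominator."""
    if not large_distance:
        return True
    return all(expr in orig_avail_out[p] for p in pdom_preds)

print(hoist_expr_ok("*p", ["pred1", "pred2"], large_distance=True))  # False
```

Against the ORIG_AVAIL_OUT snapshot the long-range hoist is rejected, even though PRE insertion has since made *p AVAIL_OUT in every predecessor.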
Thanks,
Prathamesh
>
> > >
> > > Note the difficulty to capture "distance" is that the distance is simply not
> > > available at this point - it is the anticipated values from the successors
> > > that do _not_ compute the value itself that are the issue. To capture
> > > "distance" you'd need to somehow "age" anticipated value when
> > > propagating them upwards during compute_antic (where it then
> > > would also apply to PRE in general).
> > Yes, indeed. As a hack, would it make sense to avoid inserting an
> > expression in the block if it's ANTIC in the post-dom block, as a
> > trade-off between extending the live range and hoisting, when the
> > "distance" between the block and the post-dom is "too far"? In general,
> > as you point out, we'd need to compute distance info for successor blocks
> > during compute_antic, but special-casing the post-dom should be easy
> > enough during do_hoist_insertion, and hoisting an expr that is ANTIC in
> > the post-dom could potentially be "long range" if the region is large.
> > It's still a coarse heuristic, though. I tried it in the attached patch.
> >
> > Thanks,
> > Prathamesh
> >
> >
> > >
> > > As with all other heuristics the only place one could do hackish attempts
> > > with at least some reasoning is the elimination phase where
> > > we make use of the (hoist) insertions - of course for hoisting we already
> > > know we'll get the "close" use in one of the successors so I fear even
> > > there it will be impossible to do something sensible.
> > >
> > > Richard.
> > >
> > > > Thanks,
> > > > Prathamesh
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Prathamesh
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > >
> > > > > > > Alexander