From: Richard Biener <richard.guenther@gmail.com>
To: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
Cc: GCC Development <gcc@gcc.gnu.org>
Subject: Re: LTO slows down calculix by more than 10% on aarch64
Date: Fri, 28 Aug 2020 13:57:14 +0200
Message-ID: <CAFiYyc3od-kDzxvSh5kLGK4rVoYCMcv8H6=1XtDditoSy-VvxQ@mail.gmail.com>
In-Reply-To: <CAAgBjMmM2k0hded7mkVybsC_wx5ZBVTbLvi6DeswTJpZkNJoBQ@mail.gmail.com>
On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 26 Aug 2020 at 16:50, Richard Biener <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > <gcc@gcc.gnu.org> wrote:
> > >
> > > Hi,
> > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > bit, and it seems the regression comes from inlining of orthonl into
> > > e_c3d. Disabling that brings back the performance. However, inlining
> > > orthonl into e_c3d increases its size from 3187 to 3837, i.e. by 650
> > > or around 20% (16.9% of the resulting size), which isn't too large.
> > >
> > > I have attached two test-cases: e_c3d.f, which has orthonl manually
> > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > which contains the unmodified function.
> > > (gauss.f is included by e_c3d.f). To reproduce, just passing -O2 is
> > > sufficient.
> > >
> > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > which are then hoisted to block 173, in particular hoistings of w(1,
> > > 1) ... w(3, 3), which weren't possible without inlining. The hoistings
> > > happen because both the basic block that computes orthonl at line 672
> > > and the following block at line 1035 in e_c3d.f use w(1, 1) ... w(3, 3):
> > >
> > > senergy=
> > > & (s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > & +s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > & +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > >
> > > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > > performance. However, I am not able to understand why (if at all) these
> > > hoistings of w(1, 1) ... w(3, 3) are causing the slowdown. Looking at
> > > the assembly, the hot code path reported by perf in e_c3d shows the
> > > following code-gen diff.
> > > For inlined version:
> > > .L122:
> > > ldr d15, [x1, -248]
> > > add w0, w0, 1
> > > add x2, x2, 24
> > > add x1, x1, 72
> > > fmul d15, d17, d15
> > > fmul d15, d15, d18
> > > fmul d14, d15, d14
> > > fmadd d16, d14, d31, d16
> > > cmp w0, 4
> > > beq .L121
> > > ldr d14, [x2, -8]
> > > b .L122
> > >
> > > and for non-inlined version:
> > > .L118:
> > > ldr d0, [x1, -248]
> > > add w0, w0, 1
> > > ldr d2, [x2, -8]
> > > add x1, x1, 72
> > > add x2, x2, 24
> > > fmul d0, d3, d0
> > > fmul d0, d0, d5
> > > fmul d0, d0, d2
> > > fmadd d1, d4, d0, d1
> > > cmp w0, 4
> > > bne .L118
> >
> > I wonder if you have profiles. The inlined version has a
> > non-empty latch block (looks like some PRE is happening
> > there?). Possibly your uarch does not like the close
> > branches (does your assembly show the layout as it is)?
> Hi Richard,
> I have uploaded profiles obtained by perf here:
> -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
>
> For the above loop, it shows the following:
> -O2:
> 0.01 │ f1c: ldur d0, [x1, #-248]
> 3.53 │ add w0, w0, #0x1
> │ ldur d2, [x2, #-8]
> 3.54 │ add x1, x1, #0x48
> │ add x2, x2, #0x18
> 5.89 │ fmul d0, d3, d0
> 14.12 │ fmul d0, d0, d5
> 14.14 │ fmul d0, d0, d2
> 14.13 │ fmadd d1, d4, d0, d1
> 0.00 │ cmp w0, #0x4
> 3.52 │ ↑ b.ne f1c
>
> -O2 -flto:
> 5.47 │1124: ldur d15, [x1, #-248]
> 2.19 │ add w0, w0, #0x1
> 1.10 │ add x2, x2, #0x18
> 2.18 │ add x1, x1, #0x48
> 4.37 │ fmul d15, d17, d15
> 13.13 │ fmul d15, d15, d18
> 13.13 │ fmul d14, d15, d14
> 13.14 │ fmadd d16, d14, d31, d16
> │ cmp w0, #0x4
> 3.28 │ ↓ b.eq 1154
> 0.00 │ ldur d14, [x2, #-8]
> 2.19 │ ↑ b 1124
>
> IIUC, the biggest relative difference comes from load [x1, #-248]
> which in LTO's case takes 5.47% of overall samples:
> 5.47 │1124: ldur d15, [x1, #-248]
> while in the case of -O2, it's just 0.01%:
> 0.01 │ f1c: ldur d0, [x1, #-248]
>
> I wonder if that's (one of) the main factor(s) behind the slowdown, or
> whether it's not too relevant?
This looks more like the cost of the branch, since branch costs are usually
attributed to the branch target rather than to the branch itself. You could
try re-ordering the code so that the loop entry jumps around the latch, which
can then fall through, to see if that makes a difference.
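A minimal, untested sketch of such a layout, hand-derived from the inlined
loop quoted above. The labels .Lbody and .Llatch are hypothetical; the
registers and the exit label .L121 are taken from the quoted assembly, and
the sketch assumes the exit block can be placed right after the loop:

b .Lbody                  // loop entry jumps around the latch
.Llatch:
ldr d14, [x2, -8]         // former latch block, now falls through into the body
.Lbody:
ldr d15, [x1, -248]
add w0, w0, 1
add x2, x2, 24
add x1, x1, 72
fmul d15, d17, d15
fmul d15, d15, d18
fmul d14, d15, d14
fmadd d16, d14, d31, d16
cmp w0, 4
bne .Llatch               // one taken backward branch per iteration
.L121:                    // loop exit, reached by falling through

This executes the same instructions in the same order as the loop above, but
each non-final iteration ends in a single conditional backward branch instead
of a not-taken beq followed by an unconditional b.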
Richard.
> Thanks,
> Prathamesh
> >
> > > which corresponds to the following loop in line 1014.
> > > do n1=1,3
> > > s(iii1,jjj1)=s(iii1,jjj1)
> > > & +anisox(m1,k1,n1,l1)
> > > & *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
> > > & *weight
> > >
> > > I am not sure why hoisting would have any direct effect on this loop,
> > > except perhaps that hoisting allocated more registers and led to
> > > increased register pressure. Perhaps that's why it's using
> > > higher-numbered registers for code-gen in the inlined version? However,
> > > disabling hoisting in blocks 173 and 181 also leads to 6 extra spills
> > > overall (found by grepping for str to sp), so
> > > hoisting is also helping here? I am not sure how to proceed further,
> > > and would be grateful for suggestions.
> > >
> > > Thanks,
> > > Prathamesh