From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: Di Zhao OS <dizhao@os.amperecomputing.com>,
Richard Biener <richard.guenther@gmail.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: [PING][PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
Date: Mon, 23 Oct 2023 03:49:11 +0000 [thread overview]
Message-ID: <SN6PR01MB424015065A111451934F52FEE8D8A@SN6PR01MB4240.prod.exchangelabs.com> (raw)
In-Reply-To: <SN6PR01MB4240C36D35F02379B33C258DE8CFA@SN6PR01MB4240.prod.exchangelabs.com>
Hello and Ping,
Thanks,
Di
> -----Original Message-----
> From: Di Zhao OS <dizhao@os.amperecomputing.com>
> Sent: Monday, October 9, 2023 12:40 AM
> To: Richard Biener <richard.guenther@gmail.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
>
> Attached is a new version of the patch.
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Friday, October 6, 2023 5:33 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > This is a new version of the patch on "nested FMA".
> > > Sorry for updating this after so long, I've been studying and
> > > writing micro cases to sort out the cause of the regression.
> >
> > Sorry for taking so long to reply.
> >
> > > First, following previous discussion:
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > >
> > > 1. From testing more altered cases, I don't think the
> > > problem is that reassociation works locally. In that:
> > >
> > > 1) On the example with multiplications:
> > >
> > > tmp1 = a + c * c + d * d + x * y;
> > > tmp2 = x * tmp1;
> > > result += (a + c + d + tmp2);
> > >
> > > Given "result" rewritten by width=2, the performance is
> > > worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > remove the multiplications from the example (and make "tmp1"
> > > not singe used), and still rewrite "result" by width=2, then
> > > rewriting "tmp1" with width=2 is better. (Make sense because
> > > the tree's depth at "result" is still smaller if we rewrite
> > > "tmp1".)
> > >
> > > 2) I tried to modify the assembly code of the example without
> > > FMA, so the width of "result" is 4. On Ampere1 there's no
> > > obvious improvement. So although this is an interesting
> > > problem, it doesn't seem like the cause of the regression.
> >
> > OK, I see.
> >
> > > 2. From assembly code of the case with FMA, one problem is
> > > that, rewriting "tmp1" to parallel didn't decrease the
> > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > increased code size, so the overhead is increased.
> > >
> > > a) When "tmp1" is not re-written to parallel:
> > > fmadd d31, d2, d2, d30
> > > fmadd d31, d3, d3, d31
> > > fmadd d31, d4, d5, d31 //"tmp1"
> > > fmadd d31, d31, d4, d3
> > >
> > > b) When "tmp1" is re-written to parallel:
> > > fmul d31, d4, d5
> > > fmadd d27, d2, d2, d30
> > > fmadd d31, d3, d3, d31
> > > fadd d31, d31, d27 //"tmp1"
> > > fmadd d31, d31, d4, d3
> > >
> > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > For version b), there are also 3 dependent instructions in the
> > > longer path: the 1st, 3rd and 4th.
> >
> > Yes, it doesn't really change anything. The patch has
> >
> > + /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > + uarchs can execute results like:
> > +
> > + _1 = a * b;
> > + _2 = .FMA (c, d, _1);
> > + acc_1 = acc_0 + _2;
> > +
> > + in parallel, while turning it into
> > +
> > + _1 = .FMA(a, b, acc_0);
> > + acc_1 = .FMA(c, d, _1);
> > +
> > + hinders that, because then the first FMA depends on the result
> > of preceding
> > + iteration. */
> >
> > I can't see what can be run in parallel for the first case. The .FMA
> > depends on the multiplication a * b. Iff the uarch somehow decomposes
> > .FMA into multiply + add then the c * d multiply could run in parallel
> > with the a * b multiply which _might_ be able to hide some of the
> > latency of the full .FMA. Like on x86 Zen FMA has a latency of 4
> > cycles but a multiply only 3. But I never got confirmation from any
> > of the CPU designers that .FMAs are issued when the multiply
> > operands are ready and the add operand can be forwarded.
> >
> > I also wonder why the multiplications of the two-FMA sequence
> > then cannot be executed at the same time? So I have some doubt
> > of the theory above.
>
> The parallel execution for the code snippet above was the other
> issue (previously discussed here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> Sorry it's a bit confusing to include that here, but these 2 fixes
> needs to be combined to avoid new regressions. Since considering
> FMA in get_reassociation_width produces more results of width=1,
> so there would be more loop depending FMA chains.
>
> > Iff this really is the reason for the sequence to execute with lower
> > overall latency and we want to attack this on GIMPLE then I think
> > we need a target hook telling us this fact (I also wonder if such
> > behavior can be modeled in the scheduler pipeline description at all?)
> >
> > > So it seems to me the current get_reassociation_width algorithm
> > > isn't optimal in the presence of FMA. So I modified the patch to
> > > improve get_reassociation_width, rather than check for code
> > > patterns. (Although there could be some other complicated
> > > factors so the regression is more obvious when there's "nested
> > > FMA". But with this patch that should be avoided or reduced.)
> > >
> > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > collecting data on other CPUs, I'd like to know how do you
> > > think of this.
> > >
> > > About changes in the patch:
> > >
> > > 1. When the op list forms a complete FMA chain, try to search
> > > for a smaller width considering the benefit of using FMA. With
> > > a smaller width, the increment of code size is smaller when
> > > breaking the chain.
> >
> > But this is all highly target specific (code size even more so).
> >
> > How I understand your approach to fixing the issue leads me to
> > the suggestion to prioritize parallel rewriting, thus alter rank_ops_for_fma,
> > taking the reassoc width into account (the computed width should be
> > unchanged from rank_ops_for_fma) instead of "fixing up" the parallel
> > rewriting of FMAs (well, they are not yet formed of course).
> > get_reassociation_width has 'get_required_cycles', the above theory
> > could be verified with a very simple toy pipeline model. We'd have
> > to ask the target for the reassoc width for MULT_EXPRs as well (or maybe
> > even FMA_EXPRs).
> >
> > Taking the width of FMAs into account when computing the reassoc width
> > might be another way to attack this.
>
> Previously I tried to solve this generally, on the assumption that
> FMA (smaller code size) is preferred. Now I agree it's difficult
> since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> be different. 2) From my test result on different machines we
> have, it seems simply adding the cycles together is not a good way
> to estimate the latency of consecutive FMA.
>
> I think an easier way to fix this is to add a parameter to suggest
> the length of complete FMA chain to keep. (It can be set by target
> specific tuning then.) And we can break longer FMA chains for
> better parallelism. Attached is the new implementation. With
> max-fma-chain-len=8, there's about 7% improvement in spec2017
> 508.namd_r on ampere1, and the overall improvement on fprate is
> about 1%.
>
> Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> others, I left it before get_reassociation_width so the number of
> MULT_EXPRs can be used.
>
> >
> > > 2. To avoid regressions, included the other patch
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > on this tracker again. This is because more FMA will be kept
> > > with 1., so we need to rule out the loop dependent
> > > FMA chains when param_avoid_fma_max_bits is set.
> >
> > Sorry again for taking so long to reply.
> >
> > I'll note we have an odd case on x86 Zen2(?) as well which we don't really
> > understand from a CPU behavior perspective.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > > PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > > * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > New function to check whether ranking the ops results in
> > > better parallelism.
> > > (get_reassociation_width): Add new parameters. Search for
> > > smaller width considering the benefit of FMA.
> > > (rank_ops_for_fma): Change return value to be number of
> > > MULT_EXPRs.
> > > (reassociate_bb): For 3 ops, refine the condition to call
> > > swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > > * gcc.dg/pr110279.c: New test.
>
> Thanks,
> Di Zhao
>
> ----
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
> * doc/invoke.texi: Description of param_max_fma_chain_len.
> * params.opt: New parameter param_max_fma_chain_len.
> * tree-ssa-reassoc.cc (get_reassociation_width):
> Support param_max_fma_chain_len; check for loop dependent
> FMAs.
> (rank_ops_for_fma): Return the number of MULT_EXPRs.
> (reassociate_bb): For 3 ops, refine the condition to call
> swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/pr110279-1.c: New test.
> * gcc.dg/pr110279-2.c: New test.
> * gcc.dg/pr110279-3.c: New test.
next prev parent reply other threads:[~2023-10-23 3:49 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-14 12:43 [PATCH " Di Zhao OS
2023-10-06 9:33 ` Richard Biener
2023-10-08 16:39 ` Di Zhao OS
2023-10-23 3:49 ` Di Zhao OS [this message]
2023-10-31 13:47 ` Richard Biener
2023-11-09 17:53 ` Di Zhao OS
2023-11-21 13:01 ` Richard Biener
2023-11-29 14:35 ` Di Zhao OS
2023-12-11 11:01 ` Richard Biener
2023-12-13 8:14 ` Di Zhao OS
2023-12-13 9:00 ` Richard Biener
2023-12-14 20:55 ` Di Zhao OS
2023-12-15 7:23 ` Richard Biener
2023-12-15 9:46 ` Thomas Schwinge
2023-12-17 12:30 ` Di Zhao OS
2023-12-22 15:05 ` Di Zhao OS
2023-12-22 15:39 ` Richard Biener
2023-12-27 9:35 ` Di Zhao OS
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=SN6PR01MB424015065A111451934F52FEE8D8A@SN6PR01MB4240.prod.exchangelabs.com \
--to=dizhao@os.amperecomputing.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).