From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: RE: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
Date: Wed, 13 Dec 2023 08:14:28 +0000 [thread overview]
Message-ID: <SN6PR01MB42405D4A9923012D17B40D5CE88DA@SN6PR01MB4240.prod.exchangelabs.com> (raw)
In-Reply-To: <CAFiYyc3USQz_a-TzFSJUixTk2fPmCDX7QSVP0MtOtGyEyw72Ag@mail.gmail.com>
[-- Attachment #1: Type: text/plain, Size: 27831 bytes --]
Hello Richard,
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Monday, December 11, 2023 7:01 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> get_reassociation_width
>
> On Wed, Nov 29, 2023 at 3:36 PM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, November 21, 2023 9:01 PM
> > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > Cc: gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > get_reassociation_width
> > >
> > > On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS <dizhao@os.amperecomputing.com>
> > > wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > get_reassociation_width
> > > > >
> > > > > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS
> <dizhao@os.amperecomputing.com>
> > > > > wrote:
> > > > > >
> > > > > > Attached is a new version of the patch.
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > > > > get_reassociation_width
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS
> > > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > > >
> > > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > > Sorry for updating this after so long, I've been studying and
> > > > > > > > writing micro cases to sort out the cause of the regression.
> > > > > > >
> > > > > > > Sorry for taking so long to reply.
> > > > > > >
> > > > > > > > First, following previous discussion:
> > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > September/629080.html)
> > > > > > > >
> > > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > > problem is that reassociation works locally. In that:
> > > > > > > >
> > > > > > > > 1) On the example with multiplications:
> > > > > > > >
> > > > > > > > tmp1 = a + c * c + d * d + x * y;
> > > > > > > > tmp2 = x * tmp1;
> > > > > > > > result += (a + c + d + tmp2);
> > > > > > > >
> > > > > > > > Given "result" rewritten by width=2, the performance is
> > > > > > > > worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > > > > > remove the multiplications from the example (and make "tmp1"
> > > > > > > > not singe used), and still rewrite "result" by width=2, then
> > > > > > > > rewriting "tmp1" with width=2 is better. (Make sense because
> > > > > > > > the tree's depth at "result" is still smaller if we rewrite
> > > > > > > > "tmp1".)
> > > > > > > >
> > > > > > > > 2) I tried to modify the assembly code of the example without
> > > > > > > > FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > > > > obvious improvement. So although this is an interesting
> > > > > > > > problem, it doesn't seem like the cause of the regression.
> > > > > > >
> > > > > > > OK, I see.
> > > > > > >
> > > > > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > > increased code size, so the overhead is increased.
> > > > > > > >
> > > > > > > > a) When "tmp1" is not re-written to parallel:
> > > > > > > > fmadd d31, d2, d2, d30
> > > > > > > > fmadd d31, d3, d3, d31
> > > > > > > > fmadd d31, d4, d5, d31 //"tmp1"
> > > > > > > > fmadd d31, d31, d4, d3
> > > > > > > >
> > > > > > > > b) When "tmp1" is re-written to parallel:
> > > > > > > > fmul d31, d4, d5
> > > > > > > > fmadd d27, d2, d2, d30
> > > > > > > > fmadd d31, d3, d3, d31
> > > > > > > > fadd d31, d31, d27 //"tmp1"
> > > > > > > > fmadd d31, d31, d4, d3
> > > > > > > >
> > > > > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > > > > For version b), there are also 3 dependent instructions in the
> > > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > > >
> > > > > > > Yes, it doesn't really change anything. The patch has
> > > > > > >
> > > > > > > + /* If there's code like "acc = a * b + c * d + acc" in a tight
> loop,
> > > > > some
> > > > > > > + uarchs can execute results like:
> > > > > > > +
> > > > > > > + _1 = a * b;
> > > > > > > + _2 = .FMA (c, d, _1);
> > > > > > > + acc_1 = acc_0 + _2;
> > > > > > > +
> > > > > > > + in parallel, while turning it into
> > > > > > > +
> > > > > > > + _1 = .FMA(a, b, acc_0);
> > > > > > > + acc_1 = .FMA(c, d, _1);
> > > > > > > +
> > > > > > > + hinders that, because then the first FMA depends on the
> result
> > > > > > > of preceding
> > > > > > > + iteration. */
> > > > > > >
> > > > > > > I can't see what can be run in parallel for the first case.
> The .FMA
> > > > > > > depends on the multiplication a * b. Iff the uarch somehow
> decomposes
> > > > > > > .FMA into multiply + add then the c * d multiply could run in
> parallel
> > > > > > > with the a * b multiply which _might_ be able to hide some of the
> > > > > > > latency of the full .FMA. Like on x86 Zen FMA has a latency of 4
> > > > > > > cycles but a multiply only 3. But I never got confirmation from
> any
> > > > > > > of the CPU designers that .FMAs are issued when the multiply
> > > > > > > operands are ready and the add operand can be forwarded.
> > > > > > >
> > > > > > > I also wonder why the multiplications of the two-FMA sequence
> > > > > > > then cannot be executed at the same time? So I have some doubt
> > > > > > > of the theory above.
> > > > > >
> > > > > > The parallel execution for the code snippet above was the other
> > > > > > issue (previously discussed here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > > > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > > > > needs to be combined to avoid new regressions. Since considering
> > > > > > FMA in get_reassociation_width produces more results of width=1,
> > > > > > so there would be more loop depending FMA chains.
> > > > > >
> > > > > > > Iff this really is the reason for the sequence to execute with
> lower
> > > > > > > overall latency and we want to attack this on GIMPLE then I think
> > > > > > > we need a target hook telling us this fact (I also wonder if such
> > > > > > > behavior can be modeled in the scheduler pipeline description at
> all?)
> > > > > > >
> > > > > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > > > > improve get_reassociation_width, rather than check for code
> > > > > > > > patterns. (Although there could be some other complicated
> > > > > > > > factors so the regression is more obvious when there's "nested
> > > > > > > > FMA". But with this patch that should be avoided or reduced.)
> > > > > > > >
> > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement on
> > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > > > > collecting data on other CPUs, I'd like to know how do you
> > > > > > > > think of this.
> > > > > > > >
> > > > > > > > About changes in the patch:
> > > > > > > >
> > > > > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > > > > for a smaller width considering the benefit of using FMA. With
> > > > > > > > a smaller width, the increment of code size is smaller when
> > > > > > > > breaking the chain.
> > > > > > >
> > > > > > > But this is all highly target specific (code size even more so).
> > > > > > >
> > > > > > > How I understand your approach to fixing the issue leads me to
> > > > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > > rank_ops_for_fma,
> > > > > > > taking the reassoc width into account (the computed width should
> be
> > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the
> parallel
> > > > > > > rewriting of FMAs (well, they are not yet formed of course).
> > > > > > > get_reassociation_width has 'get_required_cycles', the above
> theory
> > > > > > > could be verified with a very simple toy pipeline model. We'd
> have
> > > > > > > to ask the target for the reassoc width for MULT_EXPRs as well (or
> > > maybe
> > > > > > > even FMA_EXPRs).
> > > > > > >
> > > > > > > Taking the width of FMAs into account when computing the reassoc
> width
> > > > > > > might be another way to attack this.
> > > > > >
> > > > > > Previously I tried to solve this generally, on the assumption that
> > > > > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > > > > be different. 2) From my test result on different machines we
> > > > > > have, it seems simply adding the cycles together is not a good way
> > > > > > to estimate the latency of consecutive FMA.
> > > > > >
> > > > > > I think an easier way to fix this is to add a parameter to suggest
> > > > > > the length of complete FMA chain to keep. (It can be set by target
> > > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > > better parallelism. Attached is the new implementation. With
> > > > > > max-fma-chain-len=8, there's about 7% improvement in spec2017
> > > > > > 508.namd_r on ampere1, and the overall improvement on fprate is
> > > > > > about 1%.
> > > > > >
> > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > > > > others, I left it before get_reassociation_width so the number of
> > > > > > MULT_EXPRs can be used.
> > > > >
> > > > > Sorry again for the delay in replying.
> > > > >
> > > > > + /* Check if keeping complete FMA chains is preferred. */
> > > > > + if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > > > > + {
> > > > > + /* num_fma_chain + (num_fma_chain - 1) >= num_plus . */
> > > > > + int num_others = ops_num - mult_num;
> > > > > + int num_fma_chain = CEIL (num_others + 1, 2);
> > > > > +
> > > > > + if (num_fma_chain < width
> > > > > + && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > > > > + width = num_fma_chain;
> > > > > + }
> > > > >
> > > > > so here 'mult_num' serves as a heuristical value how many
> > > > > FMAs we could build. If that were close to ops_num - 1 then
> > > > > we'd have a chain of FMAs. Not sure how you get at
> > > > > num_others / 2 here. Maybe we need to elaborate on what an
> > > > > FMA chain is? I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > > > > where each (b,c) pair is really just one operand in the ops array,
> > > > > one of the 'mult's. Thus a FMA chain is _not_
> > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > > >
> > > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > > The previous one's result as the third operator, i.e.
> > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > > will end up using the same accumulate register.
> > > >
> > > > When num_others=2 or 3, there can be 2 complete chains, e.g.
> > > > FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > > or
> > > > FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from.
> > > >
> > > > >
> > > > > Forming an FMA chain effectively reduces the reassociation width
> > > > > of the participating multiplies. If we were not to form FMAs all
> > > > > the multiplies could execute in parallel.
> > > > >
> > > > > So what does the above do, in terms of adjusting the reassociation
> > > > > width for the _adds_, and what's the ripple-down effect on later
> > > > > FMA forming?
> > > > >
> > > >
> > > > The above code calculates the number of such FMA chains in the op
> > > > list. And if the length of each chain doesn't exceed
> > > > param_max_fma_chain_len, then width is set to the number of chains,
> > > > so we won't break them (because rewrite_expr_tree_parallel handles
> > > > this well).
> > > >
> > > > > The change still feels like whack-a-mole playing rather than
> understanding
> > > > > the fundamental issue on the targets.
> > > >
> > > > I think the complexity is in how the instructions are piped.
> > > > Some Arm CPUs such as Neoverse V2 supports "late-forwarding":
> > > > "FP multiply-accumulate pipelines support late-forwarding of
> > > > accumulate operands from similar μOPs, allowing a typical
> > > > sequence of multiply-accumulate μOPs to issue one every N
> > > > cycles". ("N" is smaller than the latency of a single FMA
> > > > instruction.) So keeping such FMA chains can utilize such
> > > > feature and uses less FP units. I guess the case is similar on
> > > > some late X86 CPUs.
> > > >
> > > > If we try to compute the minimum circles of each option, I think
> > > > at least we'll need to know whether the target has similar
> > > > feature, and the latency of each uop. While using an
> > > > experiential length of beneficial FMA chain could be a shortcut.
> > > > (Maybe allowing different lengths for different data widths is
> > > > better.)
> > >
> > > Hm. So even when we can late-forward in an FMA chain
> > > increasing the width should typically be still better?
> > >
> > > _1 = FMA (_2 * _3 + _4);
> > > _5 = FMA (_6 * _7 + _1);
> > >
> > > say with late-forwarding we can hide the latency of the _6 * _7
> > > multiply and the overall latency of the two FMAs above become
> > > lat (FMA) + lat (ADD) in the ideal case. Alternatively we do
> > >
> > > _1 = FMA (_2 * _ 3 + _4);
> > > _8 = _6 * _ 7;
> > > _5 = _1 + _8;
> > >
> > > where if the FMA and the multiply can execute in parallel
> > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > > But when we only have a single pipeline capable of
> > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL) + 1)
> > > + lat (ADD), it depends on luck whether the FMA or the MUL is
> > > issued first there.
> > >
> > > So if late-forward works really well and the add part of the FMA
> > > has very low latency compared to the multiplication part having
> > > a smaller reassoc width should pay off here and we might be
> > > able to simply control this via the existing target hook?
> > >
> > > I'm not aware of x86 CPUs having late-forwarding capabilities
> > > but usually the latency of multiplication and FMA is very similar
> > > and one can issue two FMAs and possibly more ADDs in parallel.
> > >
> > > As said I think this detail (late-forward) should maybe reflected
> > > into get_required_cycles, possibly guided by a different
> > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> > >
> >
> > To my understanding, the question is whether the target fully
> > pipelines FMA instructions, so the MULT part can start first if
> > its operands are ready. While targetm.sched.reassociation_width
> > reflects the number of pipes for some operation, so it can guide
> > get_required_cycles for a sequence of identical operations
> > (e.g. A * B * C * D or A + B + C + D). Since the problem in
> > this case is not the number of pipes for FMA, I think another
> > indicator maybe better.
> >
> > (Currently the fma_reassoc_width for AArch64 is to control
> > whether reassociation on FADD is OK. This workaround doesn't
> > work well on some cases, for example it turns down reassociation
> > even when there's no FMA at all. So I think we'd better not
> > follow the schema.)
> >
> > Attached is a new version of the patch with a flag to indicate
> > whether FMA is fully pipelined, and: 1) lat (MUL) >= lat (ADD);
> > 2) symmetric units are used or FMUL/FADD/FMA. Otherwise the
> > patch may not be beneficial.
> >
> > It tries to calculate the latencies including MULT_EXPRs. Since
> > the code is different with the current code (the quick-search
> > part), I haven't included it inside get_required_cycles.
>
> +; If the flag 'fully-pipelined-fma' is set, reassociation takes into account
> +; the benifit of parallelizing FMA's multiply part and addition part.
> +ffully-pipelined-fma
> +Common Var(flag_fully_pipelined_fma)
> +Assume the target fully pipelines FMA instruction, and symmetric units are
> used
> +for FMUL/FADD/FMA.
>
> please use a --param for now, I think targets might want to set this based
> on active core tuning.
>
> +/* Given that the target fully pipelines FMA instructions, return latency of
> + MULT_EXPRs that can't be hided by FMA. WIDTH is the number of pipes. */
> +
>
> return the latency .. can't be hidden by the FMA
>
> For documentation purposes it should be stated that mult_num <= ops_num
>
> + /* If the target fully pipelines FMA instruction, the multiply part can
> start
>
> instructions
>
> + first if its operands are ready. Assuming symmetric pipes are used for
>
> s/first/already/
>
> + FMUL/FADD/FMA, then for a sequence of FMA like:
> +
> + _8 = .FMA (_2, _3, _1);
> + _9 = .FMA (_5, _4, _8);
> + _10 = .FMA (_7, _6, _9);
> +
> + , if width=1, the latency is latency(MULT) + latency(ADD)*3.
> + While with width=2:
> +
> + _8 = _4 * _5;
> + _9 = .FMA (_2, _3, _1);
> + _10 = .FMA (_6, _7, _8);
> + _11 = _9 + _10;
> +
> + , it is latency(MULT)*2 + latency(ADD)*2. Assuming latency(MULT) <=
> + latency(ADD), the previous one is preferred.
>
> latency (MULT) >= latency (ADD)?
>
> ".. the first variant is preferred."
>
> +
> + Find out if we can get a smaller width considering FMA. */
Corrected these errors. Thank you for the corrections.
> + /* When flag_fully_pipelined_fma is set, assumes symmetric pipes are
> used
> + for FMUL/FADD/FMA. */
> + int lat_mul = get_mult_latency_consider_fma (ops_num, mult_num, width);
>
> what does "symmetric pipes" actually mean? For x86 Zen3 we have
> two FMA pipes (that can also do ADD and MUL) and one pipe that can
> do ADD. Is that then non-symmetric because we can issue more adds
> in parallel than FMA/MUL?
"symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
set, so the widths are uniform, and the calculations in this patch can
apply. I think this can be relaxed for scenarios like Zen3, by searching
for a smaller width only using the pipes for FMUL/FMA. But if the pipes
for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
pipes for FADD, then the minimum circle might be incorrect.
Changed the descriptions and code in get_reassociation_width a bit to
include the case like Zen3:
+ /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
+ same units that can also do FADD. For other scenarios, such as when
+ FMUL and FADD are using distinct units, the following code may not
+ apply. */
+ int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
+ gcc_checking_assert (width_mult <= width);
+
+ /* Latency of MULT_EXPRs. */
+ int lat_mul
+ = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
> Otherwise this looks OK now.
>
> Thanks,
> Richard.
>
> > > > > + /* If there's loop dependent FMA result, return width=2 to avoid it.
> > > This
> > > > > is
> > > > > + better than skipping these FMA candidates in widening_mul. */
> > > > >
> > > > > better than skipping, but you don't touch it there? I suppose width
> == 2
> > > > > will bypass the skipping, right? This heuristic only comes in when
> the
> > > above
> > > > > change made width == 1, since otherwise we have an earlier
> > > > >
> > > > > if (width == 1)
> > > > > return width;
> > > > >
> > > > > which als guarantees width == 2 was allowed by the hook/param, right?
> > > >
> > > > Yes, that's right.
> > > >
> > > > >
> > > > > + if (width == 1 && mult_num
> > > > > + && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > > > > + param_avoid_fma_max_bits))
> > > > > + {
> > > > > + /* Look for cross backedge dependency:
> > > > > + 1. LHS is a phi argument in the same basic block it is defined.
> > > > > + 2. And the result of the phi node is used in OPS. */
> > > > > + basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > > + gimple_stmt_iterator gsi;
> > > > > + for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next
> (&gsi))
> > > > > + {
> > > > > + gphi *phi = dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > > + for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > > > > + {
> > > > > + tree op = PHI_ARG_DEF (phi, i);
> > > > > + if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src ==
> bb))
> > > > > + continue;
> > > > >
> > > > > I think it's easier to iterate over the immediate uses of LHS like
> > > > >
> > > > > FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > > > if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> > > > > {
> > > > > if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > > (use_p))->src != bb)
> > > > > continue;
> > > > > ...
> > > > > }
> > > > >
> > > > > otherwise I think _this_ part of the patch looks reasonable.
> > > > >
> > > > > As you say heuristically they might go together but I think we should
> > > split
> > > > > the
> > > > > patch - the cross-loop part can probably stand independently. Can you
> > > adjust
> > > > > and re-post?
> > > >
> > > > Attached is the separated part for cross-loop FMA. Thank you for the
> > > correction.
> > >
> > > That cross-loop FMA patch is OK.
> >
> > Committed this part at 746344dd.
> >
> > Thanks,
> > Di
> >
> > >
> > > Thanks,
> > > Richard.
> > >
> > > > >
> > > > > As for the first part I still don't understand very well and am still
> > > hoping
> > > > > we
> > > > > can get away without yet another knob to tune.
> > > > >
> > > > > Richard.
> > > > >
> > > > > > >
> > > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > September/629203.html)
> > > > > > > > on this tracker again. This is because more FMA will be kept
> > > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > > >
> > > > > > > Sorry again for taking so long to reply.
> > > > > > >
> > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which we
> don't
> > > really
> > > > > > > understand from a CPU behavior perspective.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Richard.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Di Zhao
> > > > > > > >
> > > > > > > > ----
> > > > > > > >
> > > > > > > > PR tree-optimization/110279
> > > > > > > >
> > > > > > > > gcc/ChangeLog:
> > > > > > > >
> > > > > > > > * tree-ssa-reassoc.cc
> (rank_ops_for_better_parallelism_p):
> > > > > > > > New function to check whether ranking the ops results in
> > > > > > > > better parallelism.
> > > > > > > > (get_reassociation_width): Add new parameters. Search
> for
> > > > > > > > smaller width considering the benefit of FMA.
> > > > > > > > (rank_ops_for_fma): Change return value to be number of
> > > > > > > > MULT_EXPRs.
> > > > > > > > (reassociate_bb): For 3 ops, refine the condition to
> call
> > > > > > > > swap_ops_for_binary_stmt.
> > > > > > > >
> > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > >
> > > > > > > > * gcc.dg/pr110279.c: New test.
> > > > > >
> > > > > > Thanks,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > > * doc/invoke.texi: Description of param_max_fma_chain_len.
> > > > > > * params.opt: New parameter param_max_fma_chain_len.
> > > > > > * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > > > Support param_max_fma_chain_len; check for loop dependent
> > > > > > FMAs.
> > > > > > (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > > > (reassociate_bb): For 3 ops, refine the condition to call
> > > > > > swap_ops_for_binary_stmt.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > > * gcc.dg/pr110279-1.c: New test.
> > > > > > * gcc.dg/pr110279-2.c: New test.
> > > > > > * gcc.dg/pr110279-3.c: New test.
> > > >
> > > > ---
> > > >
> > > > PR tree-optimization/110279
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * tree-ssa-reassoc.cc (get_reassociation_width): check
> > > > for loop dependent FMAs.
> > > > (reassociate_bb): For 3 ops, refine the condition to call
> > > > swap_ops_for_binary_stmt.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > * gcc.dg/pr110279-1.c: New test.
> > ---
> >
> > PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> > * common.opt: New flag fully-pipelined-fma.
> > * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> > Return latency of MULT_EXPRs that can't be hided by FMA.
> > (get_reassociation_width): Search for smaller widths
> > considering the benefit of fully pipelined FMA.
> > (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > (reassociate_bb): Pass the number of MULT_EXPRs to
> > get_reassociation_width; avoid calling
> > get_reassociation_width twice.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/pr110279-2.c: New test.
Thanks,
Di
---
PR tree-optimization/110279
gcc/ChangeLog:
* doc/invoke.texi: New parameter fully-pipelined-fma.
* params.opt: New parameter fully-pipelined-fma.
* tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
the latency of MULT_EXPRs that can't be hidden by the FMAs.
(get_reassociation_width): Search for a smaller width
considering the benefit of fully pipelined FMA.
(rank_ops_for_fma): Return the number of MULT_EXPRs.
(reassociate_bb): Pass the number of MULT_EXPRs to
get_reassociation_width; avoid calling
get_reassociation_width twice.
gcc/testsuite/ChangeLog:
* gcc.dg/pr110279-2.c: New test.
[-- Attachment #2: 0001-Consider-fully-pipelined-FMA-in-get_reassociation_wi.patch --]
[-- Type: application/octet-stream, Size: 11556 bytes --]
From 5bf21ade03704e965e4d07c2603b9542dc4c0ac4 Mon Sep 17 00:00:00 2001
From: "Di Zhao" <dizhao@os.amperecomputing.com>
Date: Tue, 12 Dec 2023 23:06:14 +0800
Subject: [PATCH] Consider fully pipelined FMA in get_reassociation_width
---
gcc/doc/invoke.texi | 6 ++
gcc/params.opt | 7 ++
gcc/testsuite/gcc.dg/pr110279-2.c | 41 ++++++++
gcc/tree-ssa-reassoc.cc | 150 ++++++++++++++++++++++++------
4 files changed, 177 insertions(+), 27 deletions(-)
create mode 100644 gcc/testsuite/gcc.dg/pr110279-2.c
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index d395a6a747e..f9ec93a8106 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16159,6 +16159,12 @@ Enable loop vectorization of floating point inductions.
@item avoid-fma-max-bits
Maximum number of bits for which we avoid creating FMAs.
+@item fully-pipelined-fma
+Whether the target fully pipelines FMA instructions. If non-zero,
+reassociation considers the benefit of parallelizing FMA's multiplication
+part and addition part, assuming FMUL and FMA use the same units that can
+also do FADD.
+
@item sms-loop-average-count-threshold
A threshold on the average loop count considered by the swing modulo scheduler.
diff --git a/gcc/params.opt b/gcc/params.opt
index e3a93ac7435..ab12b528540 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -146,6 +146,13 @@ Maximum number of outgoing edges in a switch before EVRP will not process it.
Common Joined UInteger Var(param_fsm_scale_path_stmts) Init(2) IntegerRange(1, 10) Param Optimization
Scale factor to apply to the number of statements in a threading path crossing a loop backedge when comparing to max-jump-thread-duplication-stmts.
+-param=fully-pipelined-fma=
+Common Joined UInteger Var(param_fully_pipelined_fma) Init(0) IntegerRange(0, 1) Param Optimization
+Whether the target fully pipelines FMA instructions. If non-zero,
+reassociation considers the benefit of parallelizing FMA's multiplication
+part and addition part, assuming FMUL and FMA use the same units that can
+also do FADD.
+
-param=gcse-after-reload-critical-fraction=
Common Joined UInteger Var(param_gcse_after_reload_critical_fraction) Init(10) Param Optimization
The threshold ratio of critical edges execution count that permit performing redundancy elimination after reload.
diff --git a/gcc/testsuite/gcc.dg/pr110279-2.c b/gcc/testsuite/gcc.dg/pr110279-2.c
new file mode 100644
index 00000000000..0304a77aa66
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr110279-2.c
@@ -0,0 +1,41 @@
+/* PR tree-optimization/110279 */
+/* { dg-do compile } */
+/* { dg-options "-Ofast --param tree-reassoc-width=4 --param fully-pipelined-fma=1 -fdump-tree-reassoc2-details -fdump-tree-optimized" } */
+/* { dg-additional-options "-march=armv8.2-a" { target aarch64-*-* } } */
+
+#define LOOP_COUNT 800000000
+typedef double data_e;
+
+#include <stdio.h>
+
+__attribute_noinline__ data_e
+foo (data_e in)
+{
+ data_e a1, a2, a3, a4;
+ data_e tmp, result = 0;
+ a1 = in + 0.1;
+ a2 = in * 0.1;
+ a3 = in + 0.01;
+ a4 = in * 0.59;
+
+ data_e result2 = 0;
+
+ for (int ic = 0; ic < LOOP_COUNT; ic++)
+ {
+ /* Test that a complete FMA chain with length=4 is not broken. */
+ tmp = a1 + a2 * a2 + a3 * a3 + a4 * a4 ;
+ result += tmp - ic;
+ result2 = result2 / 2 - tmp;
+
+ a1 += 0.91;
+ a2 += 0.1;
+ a3 -= 0.01;
+ a4 -= 0.89;
+
+ }
+
+ return result + result2;
+}
+
+/* { dg-final { scan-tree-dump-not "was chosen for reassociation" "reassoc2"} } */
+/* { dg-final { scan-tree-dump-times {\.FMA } 3 "optimized"} } */
\ No newline at end of file
diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
index ce97fc9a8b8..d45898ea1d5 100644
--- a/gcc/tree-ssa-reassoc.cc
+++ b/gcc/tree-ssa-reassoc.cc
@@ -5425,13 +5425,35 @@ get_required_cycles (int ops_num, int cpu_width)
return res;
}
+/* Given that the target fully pipelines FMA instructions, return the latency
+ of MULT_EXPRs that can't be hidden by the FMAs. WIDTH is the number of
+ pipes. */
+
+static inline int
+get_mult_latency_consider_fma (int ops_num, int mult_num, int width)
+{
+ gcc_checking_assert (mult_num && mult_num <= ops_num);
+
+ /* For each partition, if mult_num == ops_num, there's latency(MULT)*2.
+ e.g:
+
+ A * B + C * D
+ =>
+ _1 = A * B;
+ _2 = .FMA (C, D, _1);
+
+ Otherwise there's latency(MULT)*1 in the first FMA. */
+ return CEIL (ops_num, width) == CEIL (mult_num, width) ? 2 : 1;
+}
+
/* Returns an optimal number of registers to use for computation of
given statements.
- LHS is the result ssa name of OPS. */
+ LHS is the result ssa name of OPS. MULT_NUM is number of sub-expressions
+ that are MULT_EXPRs, when OPS are PLUS_EXPRs or MINUS_EXPRs. */
static int
-get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
+get_reassociation_width (vec<operand_entry *> *ops, int mult_num, tree lhs,
enum tree_code opc, machine_mode mode)
{
int param_width = param_tree_reassoc_width;
@@ -5457,16 +5479,68 @@ get_reassociation_width (vec<operand_entry *> *ops, tree lhs,
so we can perform a binary search for the minimal width that still
results in the optimal cycle count. */
width_min = 1;
- while (width > width_min)
+
+ /* If the target fully pipelines FMA instruction, the multiply part can start
+ already if its operands are ready. Assuming symmetric pipes are used for
+ FMUL/FADD/FMA, then for a sequence of FMA like:
+
+ _8 = .FMA (_2, _3, _1);
+ _9 = .FMA (_5, _4, _8);
+ _10 = .FMA (_7, _6, _9);
+
+ , if width=1, the latency is latency(MULT) + latency(ADD)*3.
+ While with width=2:
+
+ _8 = _4 * _5;
+ _9 = .FMA (_2, _3, _1);
+ _10 = .FMA (_6, _7, _8);
+ _11 = _9 + _10;
+
+ , it is latency(MULT)*2 + latency(ADD)*2. Assuming latency(MULT) >=
+ latency(ADD), the first variant is preferred.
+
+ Find out if we can get a smaller width considering FMA. */
+ if (width > 1 && mult_num && param_fully_pipelined_fma)
{
- int width_mid = (width + width_min) / 2;
+ /* When param_fully_pipelined_fma is set, assume FMUL and FMA use the
+ same units that can also do FADD. For other scenarios, such as when
+ FMUL and FADD are using separated units, the following code may not
+ apply. */
+ int width_mult = targetm.sched.reassociation_width (MULT_EXPR, mode);
+ gcc_checking_assert (width_mult <= width);
+
+ /* Latency of MULT_EXPRs. */
+ int lat_mul
+ = get_mult_latency_consider_fma (ops_num, mult_num, width_mult);
+
+ /* Quick search might not apply. So start from 1. */
+ for (int i = 1; i < width_mult; i++)
+ {
+ int lat_mul_new
+ = get_mult_latency_consider_fma (ops_num, mult_num, i);
+ int lat_add_new = get_required_cycles (ops_num, i);
- if (get_required_cycles (ops_num, width_mid) == cycles_best)
- width = width_mid;
- else if (width_min < width_mid)
- width_min = width_mid;
- else
- break;
+ /* Assume latency(MULT) >= latency(ADD). */
+ if (lat_mul - lat_mul_new >= lat_add_new - cycles_best)
+ {
+ width = i;
+ break;
+ }
+ }
+ }
+ else
+ {
+ while (width > width_min)
+ {
+ int width_mid = (width + width_min) / 2;
+
+ if (get_required_cycles (ops_num, width_mid) == cycles_best)
+ width = width_mid;
+ else if (width_min < width_mid)
+ width_min = width_mid;
+ else
+ break;
+ }
}
/* If there's loop dependent FMA result, return width=2 to avoid it. This is
@@ -6836,8 +6910,10 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
Rearrange ops to -> e + a * b + c * d generates:
_4 = .FMA (c_7(D), d_8(D), _3);
- _11 = .FMA (a_5(D), b_6(D), _4); */
-static bool
+ _11 = .FMA (a_5(D), b_6(D), _4);
+
+ Return the number of MULT_EXPRs in the chain. */
+static int
rank_ops_for_fma (vec<operand_entry *> *ops)
{
operand_entry *oe;
@@ -6851,9 +6927,26 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (TREE_CODE (oe->op) == SSA_NAME)
{
gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
- if (is_gimple_assign (def_stmt)
- && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
- ops_mult.safe_push (oe);
+ if (is_gimple_assign (def_stmt))
+ {
+ if (gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
+ ops_mult.safe_push (oe);
+ /* A negate on the multiplication leads to FNMA. */
+ else if (gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
+ && TREE_CODE (gimple_assign_rhs1 (def_stmt)) == SSA_NAME)
+ {
+ gimple *neg_def_stmt
+ = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (def_stmt));
+ if (is_gimple_assign (neg_def_stmt)
+ && gimple_bb (neg_def_stmt) == gimple_bb (def_stmt)
+ && gimple_assign_rhs_code (neg_def_stmt) == MULT_EXPR)
+ ops_mult.safe_push (oe);
+ else
+ ops_others.safe_push (oe);
+ }
+ else
+ ops_others.safe_push (oe);
+ }
else
ops_others.safe_push (oe);
}
@@ -6869,7 +6962,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
Putting ops that not def from mult in front can generate more FMAs.
2. If all ops are defined with mult, we don't need to rearrange them. */
- if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
+ unsigned mult_num = ops_mult.length ();
+ if (mult_num >= 2 && mult_num != ops_length)
{
/* Put no-mult ops and mult ops alternately at the end of the
queue, which is conducive to generating more FMA and reducing the
@@ -6885,9 +6979,8 @@ rank_ops_for_fma (vec<operand_entry *> *ops)
if (opindex > 0)
opindex--;
}
- return true;
}
- return false;
+ return mult_num;
}
/* Reassociate expressions in basic block BB and its post-dominator as
children.
@@ -7052,8 +7145,8 @@ reassociate_bb (basic_block bb)
{
machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
int ops_num = ops.length ();
- int width;
- bool has_fma = false;
+ int width = 0;
+ int mult_num = 0;
/* For binary bit operations, if there are at least 3
operands and the last operand in OPS is a constant,
@@ -7076,16 +7169,17 @@ reassociate_bb (basic_block bb)
opt_type)
&& (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
{
- has_fma = rank_ops_for_fma (&ops);
+ mult_num = rank_ops_for_fma (&ops);
}
/* Only rewrite the expression tree to parallel in the
last reassoc pass to avoid useless work back-and-forth
with initial linearization. */
+ bool has_fma = mult_num >= 2 && mult_num != ops_num;
if (!reassoc_insert_powi_p
&& ops.length () > 3
- && (width
- = get_reassociation_width (&ops, lhs, rhs_code, mode))
+ && (width = get_reassociation_width (&ops, mult_num, lhs,
+ rhs_code, mode))
> 1)
{
if (dump_file && (dump_flags & TDF_DETAILS))
@@ -7106,10 +7200,12 @@ reassociate_bb (basic_block bb)
if (len >= 3
&& (!has_fma
/* width > 1 means ranking ops results in better
- parallelism. */
- || get_reassociation_width (&ops, lhs, rhs_code,
- mode)
- > 1))
+ parallelism. Check current value to avoid
+ calling get_reassociation_width again. */
+ || (width != 1
+ && get_reassociation_width (
+ &ops, mult_num, lhs, rhs_code, mode)
+ > 1)))
swap_ops_for_binary_stmt (ops, len - 3);
new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
--
2.25.1
next prev parent reply other threads:[~2023-12-13 8:14 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-09-14 12:43 Di Zhao OS
2023-10-06 9:33 ` Richard Biener
2023-10-08 16:39 ` Di Zhao OS
2023-10-23 3:49 ` [PING][PATCH " Di Zhao OS
2023-10-31 13:47 ` [PATCH " Richard Biener
2023-11-09 17:53 ` Di Zhao OS
2023-11-21 13:01 ` Richard Biener
2023-11-29 14:35 ` Di Zhao OS
2023-12-11 11:01 ` Richard Biener
2023-12-13 8:14 ` Di Zhao OS [this message]
2023-12-13 9:00 ` Richard Biener
2023-12-14 20:55 ` Di Zhao OS
2023-12-15 7:23 ` Richard Biener
2023-12-15 9:46 ` Thomas Schwinge
2023-12-17 12:30 ` Di Zhao OS
2023-12-22 15:05 ` Di Zhao OS
2023-12-22 15:39 ` Richard Biener
2023-12-27 9:35 ` Di Zhao OS
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=SN6PR01MB42405D4A9923012D17B40D5CE88DA@SN6PR01MB4240.prod.exchangelabs.com \
--to=dizhao@os.amperecomputing.com \
--cc=gcc-patches@gcc.gnu.org \
--cc=richard.guenther@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).