From: Richard Biener
Date: Tue, 21 Nov 2023 14:01:24 +0100
Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
To: Di Zhao OS
Cc: gcc-patches@gcc.gnu.org

On Thu, Nov 9, 2023 at 6:53 PM Di Zhao OS wrote:
>
> > -----Original Message-----
> > From: Richard Biener
> > Sent: Tuesday, October 31, 2023 9:48 PM
> > To: Di Zhao OS
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS wrote:
> > >
> > > Attached is a new version of the patch.
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener
> > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > To: Di Zhao OS
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > get_reassociation_width
> > > >
> > > > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS wrote:
> > > > >
> > > > > This is a new version of the patch on "nested FMA".
> > > > > Sorry for updating this after so long; I've been studying and
> > > > > writing micro cases to sort out the cause of the regression.
> > > >
> > > > Sorry for taking so long to reply.
> > > >
> > > > > First, following the previous discussion
> > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html):
> > > > >
> > > > > 1. From testing more altered cases, I don't think the
> > > > > problem is that reassociation works locally. That is:
> > > > >
> > > > > 1) On the example with multiplications:
> > > > >
> > > > >   tmp1 = a + c * c + d * d + x * y;
> > > > >   tmp2 = x * tmp1;
> > > > >   result += (a + c + d + tmp2);
> > > > >
> > > > > Given "result" rewritten by width=2, the performance is
> > > > > worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > > > remove the multiplications from the example (and make "tmp1"
> > > > > not single-used), and still rewrite "result" by width=2, then
> > > > > rewriting "tmp1" with width=2 is better. (This makes sense
> > > > > because the tree's depth at "result" is still smaller if we
> > > > > rewrite "tmp1".)
> > > > >
> > > > > 2) I tried to modify the assembly code of the example without
> > > > > FMA, so the width of "result" is 4. On Ampere1 there's no
> > > > > obvious improvement. So although this is an interesting
> > > > > problem, it doesn't seem like the cause of the regression.
> > > >
> > > > OK, I see.
> > > >
> > > > > 2. From assembly code of the case with FMA, one problem is
> > > > > that rewriting "tmp1" to parallel didn't decrease the
> > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > increased code size, so the overhead is increased.
> > > > >
> > > > > a) When "tmp1" is not re-written to parallel:
> > > > >   fmadd d31, d2, d2, d30
> > > > >   fmadd d31, d3, d3, d31
> > > > >   fmadd d31, d4, d5, d31  //"tmp1"
> > > > >   fmadd d31, d31, d4, d3
> > > > >
> > > > > b) When "tmp1" is re-written to parallel:
> > > > >   fmul  d31, d4, d5
> > > > >   fmadd d27, d2, d2, d30
> > > > >   fmadd d31, d3, d3, d31
> > > > >   fadd  d31, d31, d27     //"tmp1"
> > > > >   fmadd d31, d31, d4, d3
> > > > >
> > > > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > > > For version b), there are also 3 dependent instructions in the
> > > > > longer path: the 1st, 3rd and 4th.
> > > >
> > > > Yes, it doesn't really change anything. The patch has
> > > >
> > > > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > > > +     uarchs can execute results like:
> > > > +
> > > > +       _1 = a * b;
> > > > +       _2 = .FMA (c, d, _1);
> > > > +       acc_1 = acc_0 + _2;
> > > > +
> > > > +     in parallel, while turning it into
> > > > +
> > > > +       _1 = .FMA (a, b, acc_0);
> > > > +       acc_1 = .FMA (c, d, _1);
> > > > +
> > > > +     hinders that, because then the first FMA depends on the result
> > > > +     of the preceding iteration.  */
> > > >
> > > > I can't see what can be run in parallel for the first case. The .FMA
> > > > depends on the multiplication a * b. Iff the uarch somehow decomposes
> > > > .FMA into multiply + add then the c * d multiply could run in parallel
> > > > with the a * b multiply which _might_ be able to hide some of the
> > > > latency of the full .FMA. Like on x86 Zen FMA has a latency of 4
> > > > cycles but a multiply only 3. But I never got confirmation from any
> > > > of the CPU designers that .FMAs are issued when the multiply
> > > > operands are ready and the add operand can be forwarded.
> > > >
> > > > I also wonder why the multiplications of the two-FMA sequence
> > > > then cannot be executed at the same time? So I have some doubt
> > > > of the theory above.
> > >
> > > The parallel execution for the code snippet above was the other
> > > issue (previously discussed here:
> > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> > > Sorry it's a bit confusing to include that here, but these 2 fixes
> > > need to be combined to avoid new regressions. Since considering
> > > FMA in get_reassociation_width produces more results of width=1,
> > > there would be more loop-dependent FMA chains.
> > >
> > > > Iff this really is the reason for the sequence to execute with lower
> > > > overall latency and we want to attack this on GIMPLE then I think
> > > > we need a target hook telling us this fact (I also wonder if such
> > > > behavior can be modeled in the scheduler pipeline description at all?)
> > > >
> > > > > So it seems to me the current get_reassociation_width algorithm
> > > > > isn't optimal in the presence of FMA. So I modified the patch to
> > > > > improve get_reassociation_width, rather than check for code
> > > > > patterns. (Although there could be some other complicated
> > > > > factors so the regression is more obvious when there's "nested
> > > > > FMA". But with this patch that should be avoided or reduced.)
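As a side note, the cycle counts Di Zhao compares for versions a) and b) above can be reproduced with a unit-latency dependence sketch. This is only an illustration: assuming every instruction takes one cycle (real FMUL/FMADD/FADD latencies differ per uarch), an instruction completes one cycle after its latest input is ready, and the function names are made up for this sketch.

```c
#include <assert.h>

static int max2 (int a, int b) { return a > b ? a : b; }

/* Version a): four dependent fmadds accumulating into d31.  */
static int
version_a_critical_path (void)
{
  int f1 = 1;         /* fmadd d31, d2, d2, d30 */
  int f2 = f1 + 1;    /* fmadd d31, d3, d3, d31 */
  int f3 = f2 + 1;    /* fmadd d31, d4, d5, d31  -- "tmp1" */
  return f3 + 1;      /* fmadd d31, d31, d4, d3 */
}

/* Version b): "tmp1" rewritten to parallel form.  */
static int
version_b_critical_path (void)
{
  int m  = 1;                  /* fmul  d31, d4, d5 */
  int f1 = 1;                  /* fmadd d27, d2, d2, d30 (independent) */
  int f2 = m + 1;              /* fmadd d31, d3, d3, d31 (needs fmul) */
  int a  = max2 (f2, f1) + 1;  /* fadd  d31, d31, d27     -- "tmp1" */
  return a + 1;                /* fmadd d31, d31, d4, d3 */
}
```

Under this toy model both versions have a critical path of 4, yet version b) needs five instructions instead of four — matching the observation that the parallel rewrite didn't decrease the minimum cycles but did increase code size.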
> > > > >
> > > > > With this patch the 508.namd_r 1-copy run has 7% improvement on
> > > > > Ampere1; on Intel Xeon there's about 3%. While I'm still
> > > > > collecting data on other CPUs, I'd like to know what you
> > > > > think of this.
> > > > >
> > > > > About changes in the patch:
> > > > >
> > > > > 1. When the op list forms a complete FMA chain, try to search
> > > > > for a smaller width considering the benefit of using FMA. With
> > > > > a smaller width, the increment of code size is smaller when
> > > > > breaking the chain.
> > > >
> > > > But this is all highly target specific (code size even more so).
> > > >
> > > > How I understand your approach to fixing the issue leads me to
> > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > rank_ops_for_fma, taking the reassoc width into account (the
> > > > computed width should be unchanged from rank_ops_for_fma) instead
> > > > of "fixing up" the parallel rewriting of FMAs (well, they are not
> > > > yet formed of course). get_reassociation_width has
> > > > 'get_required_cycles'; the above theory could be verified with a
> > > > very simple toy pipeline model. We'd have to ask the target for
> > > > the reassoc width for MULT_EXPRs as well (or maybe even FMA_EXPRs).
> > > >
> > > > Taking the width of FMAs into account when computing the reassoc
> > > > width might be another way to attack this.
> > >
> > > Previously I tried to solve this generally, on the assumption that
> > > FMA (smaller code size) is preferred. Now I agree it's difficult
> > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> > > be different. 2) From my test results on the different machines we
> > > have, it seems simply adding the cycles together is not a good way
> > > to estimate the latency of consecutive FMAs.
> > >
> > > I think an easier way to fix this is to add a parameter to suggest
> > > the length of complete FMA chain to keep.
> > > (It can be set by target-specific tuning then.) And we can break
> > > longer FMA chains for better parallelism. Attached is the new
> > > implementation. With max-fma-chain-len=8, there's about 7%
> > > improvement in SPEC2017 508.namd_r on Ampere1, and the overall
> > > improvement on fprate is about 1%.
> > >
> > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> > > others, I left it before get_reassociation_width so the number of
> > > MULT_EXPRs can be used.
> >
> > Sorry again for the delay in replying.
> >
> > +  /* Check if keeping complete FMA chains is preferred.  */
> > +  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
> > +    {
> > +      /* num_fma_chain + (num_fma_chain - 1) >= num_plus .  */
> > +      int num_others = ops_num - mult_num;
> > +      int num_fma_chain = CEIL (num_others + 1, 2);
> > +
> > +      if (num_fma_chain < width
> > +	  && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
> > +	width = num_fma_chain;
> > +    }
> >
> > so here 'mult_num' serves as a heuristic value for how many
> > FMAs we could build. If that were close to ops_num - 1 then
> > we'd have a chain of FMAs. Not sure how you get at
> > num_others / 2 here. Maybe we need to elaborate on what an
> > FMA chain is? I thought it is FMA (FMA (FMA (..., b, c), d, e), f, g)
> > where each (b,c) pair is really just one operand in the ops array,
> > one of the 'mult's. Thus an FMA chain is _not_
> > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
>
> The "FMA chain" here refers to consecutive FMAs, each taking
> the previous one's result as the third operand, i.e.
> ... FMA (e, f, FMA (c, d, FMA (a, b, r))) ... . So the original op
> list looks like "r + a * b + c * d + e * f + ...". These FMAs
> will end up using the same accumulate register.
>
> When num_others = 2 or 3, there can be 2 complete chains, e.g.
>   FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> or
>   FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> And so on; that's where the "CEIL (num_others + 1, 2)" comes from.
>
> >
> > Forming an FMA chain effectively reduces the reassociation width
> > of the participating multiplies. If we were not to form FMAs all
> > the multiplies could execute in parallel.
> >
> > So what does the above do, in terms of adjusting the reassociation
> > width for the _adds_, and what's the ripple-down effect on later
> > FMA forming?
>
> The above code calculates the number of such FMA chains in the op
> list. If the length of each chain doesn't exceed
> param_max_fma_chain_len, then width is set to the number of chains,
> so we won't break them (because rewrite_expr_tree_parallel handles
> this well).
>
> > The change still feels like whack-a-mole playing rather than
> > understanding the fundamental issue on the targets.
>
> I think the complexity is in how the instructions are piped.
> Some Arm CPUs such as Neoverse V2 support "late-forwarding":
> "FP multiply-accumulate pipelines support late-forwarding of
> accumulate operands from similar μOPs, allowing a typical
> sequence of multiply-accumulate μOPs to issue one every N
> cycles". ("N" is smaller than the latency of a single FMA
> instruction.) So keeping such FMA chains can exploit this
> feature and use fewer FP units. I guess the case is similar on
> some recent x86 CPUs.
>
> If we try to compute the minimum cycles of each option, I think
> at least we'll need to know whether the target has a similar
> feature, and the latency of each uop. Using an empirically chosen
> length of beneficial FMA chain could be a shortcut.
> (Maybe allowing different lengths for different data widths is
> better.)

Hm. So even when we can late-forward in an FMA chain,
increasing the width should typically still be better?
  _1 = FMA (_2 * _3 + _4);
  _5 = FMA (_6 * _7 + _1);

say with late-forwarding we can hide the latency of the _6 * _7
multiply and the overall latency of the two FMAs above becomes
lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do

  _1 = FMA (_2 * _3 + _4);
  _8 = _6 * _7;
  _5 = _1 + _8;

where if the FMA and the multiply can execute in parallel (we have
two FMA pipes) the latency would be lat (FMA) + lat (ADD).  But when
we only have a single pipeline capable of FMA or multiplies then
it is at least MIN (lat (FMA) + 1, lat (MUL) + 1) + lat (ADD), and it
depends on luck whether the FMA or the MUL is issued first there.

So if late-forwarding works really well and the add part of the FMA
has very low latency compared to the multiplication part, having a
smaller reassoc width should pay off here and we might be able to
simply control this via the existing target hook?

I'm not aware of x86 CPUs having late-forwarding capabilities, but
usually the latency of multiplication and FMA is very similar and
one can issue two FMAs and possibly more ADDs in parallel.

As said, I think this detail (late-forwarding) should maybe be
reflected in get_required_cycles, possibly guided by a different
targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?

> >
> > +  /* If there's a loop-dependent FMA result, return width=2 to avoid it.
> > +     This is better than skipping these FMA candidates in widening_mul.  */
> >
> > better than skipping, but you don't touch it there? I suppose width == 2
> > will bypass the skipping, right? This heuristic only comes in when the above
> > change made width == 1, since otherwise we have an earlier
> >
> >   if (width == 1)
> >     return width;
> >
> > which also guarantees width == 2 was allowed by the hook/param, right?
>
> Yes, that's right.
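For reference, the width computation quoted earlier in this thread can be exercised in isolation. The sketch below only re-expresses that hunk: the function name, parameter passing, and the CEIL macro are stand-ins for the GCC internals, not the actual tree-ssa-reassoc.cc interfaces.

```c
#include <assert.h>

/* Mirrors GCC's CEIL macro: round-up integer division.  */
#define CEIL(x, y) (((x) + (y) - 1) / (y))

/* Hypothetical standalone version of the "keep complete FMA chains"
   width adjustment discussed above.  */
static int
fma_chain_width (int width, int ops_num, int mult_num,
                 int max_fma_chain_len)
{
  if (width > 1 && mult_num >= 2 && max_fma_chain_len)
    {
      /* Non-mult operands separate the chains:
         num_fma_chain + (num_fma_chain - 1) >= num_others.  */
      int num_others = ops_num - mult_num;
      int num_fma_chain = CEIL (num_others + 1, 2);

      if (num_fma_chain < width
          && CEIL (mult_num, num_fma_chain) <= max_fma_chain_len)
        width = num_fma_chain;
    }
  return width;
}
```

For example, 8 ops of which 6 are mults gives num_others = 2, hence two complete chains of three FMAs each; if the target allowed width 4, the width drops to 2. When the per-chain length would exceed max_fma_chain_len, the width is left untouched.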
> > +  if (width == 1 && mult_num
> > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
> > +		   param_avoid_fma_max_bits))
> > +    {
> > +      /* Look for cross backedge dependency:
> > +	 1. LHS is a phi argument in the same basic block it is defined.
> > +	 2. And the result of the phi node is used in OPS.  */
> > +      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > +      gimple_stmt_iterator gsi;
> > +      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> > +	{
> > +	  gphi *phi = dyn_cast <gphi *> (gsi_stmt (gsi));
> > +	  for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
> > +	    {
> > +	      tree op = PHI_ARG_DEF (phi, i);
> > +	      if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
> > +		continue;
> >
> > I think it's easier to iterate over the immediate uses of LHS like
> >
> >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> >     if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
> >       {
> >         if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> >             (use_p))->src != bb)
> >           continue;
> >         ...
> >       }
> >
> > otherwise I think _this_ part of the patch looks reasonable.
> >
> > As you say heuristically they might go together but I think we should
> > split the patch - the cross-loop part can probably stand independently.
> > Can you adjust and re-post?
>
> Attached is the separated part for cross-loop FMA. Thank you for the
> correction.

That cross-loop FMA patch is OK.

Thanks,
Richard.

> >
> > As for the first part I still don't understand very well and am still
> > hoping we can get away without yet another knob to tune.
> >
> > Richard.
> >
> > > > > 2. To avoid regressions, included the other patch
> > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > > > on this tracker again. This is because more FMA will be kept
> > > > > with 1., so we need to rule out the loop-dependent
> > > > > FMA chains when param_avoid_fma_max_bits is set.
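For illustration, a hypothetical reduced loop of the kind the cross-backedge check targets (this is not one of the actual pr110279-*.c testcases, just a sketch of the "acc = a * b + c * d + acc" pattern quoted earlier):

```c
#include <assert.h>

/* If both products are fused, each iteration becomes
   acc = .FMA (c, d, .FMA (a, b, acc)), so the first FMA of every
   iteration waits on the accumulator from the previous one —
   exactly the loop-dependent FMA chain being ruled out.  */
double
dot2 (const double *a, const double *b,
      const double *c, const double *d, int n)
{
  double acc = 0.0;
  for (int i = 0; i < n; i++)
    acc += a[i] * b[i] + c[i] * d[i];
  return acc;
}
```

With width=2, one product per iteration stays an independent multiply, so only the final accumulate depends on the backedge value.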
> > > >
> > > > Sorry again for taking so long to reply.
> > > >
> > > > I'll note we have an odd case on x86 Zen2(?) as well which we don't
> > > > really understand from a CPU behavior perspective.
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Di Zhao
> > > > >
> > > > > ----
> > > > >
> > > > > PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > > > >         New function to check whether ranking the ops results in
> > > > >         better parallelism.
> > > > >         (get_reassociation_width): Add new parameters. Search for
> > > > >         smaller width considering the benefit of FMA.
> > > > >         (rank_ops_for_fma): Change return value to be number of
> > > > >         MULT_EXPRs.
> > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > >         swap_ops_for_binary_stmt.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279.c: New test.
> > >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > > PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * doc/invoke.texi: Description of param_max_fma_chain_len.
> > >         * params.opt: New parameter param_max_fma_chain_len.
> > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > >         Support param_max_fma_chain_len; check for loop-dependent
> > >         FMAs.
> > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-1.c: New test.
> > >         * gcc.dg/pr110279-2.c: New test.
> > >         * gcc.dg/pr110279-3.c: New test.
>
> ---
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-reassoc.cc (get_reassociation_width): Check
>         for loop-dependent FMAs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.