From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=YcZS=HY=gmail.com=richard.guenther@sourceware.org>
Received: from mail-lf1-x132.google.com (mail-lf1-x132.google.com [IPv6:2a00:1450:4864:20::132])
	by sourceware.org (Postfix) with ESMTPS id 194C2385B53C
	for <gcc-patches@gcc.gnu.org>; Wed, 13 Dec 2023 09:01:59 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 194C2385B53C
Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 194C2385B53C
Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2a00:1450:4864:20::132
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1702458123; cv=none;
	b=o14cxSDFab2OscXfSkbwMjQi6mKc6KRodDVsh1L+bI0QcjETF3IWEbzxq56jsSInqbrhQdGUxc1xRkkdAzxw+/9q7SJtOniOjCHKUgx/gbufByua3zpScOW/OU+xh891rj162UjsQ1QCw5PzovoaMHGCyBLD4gs3nRBm2tjZWi4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
	t=1702458123; c=relaxed/simple;
	bh=YF4kmaFbxcd/VEXnYACOUmU9YxBvUl05SCzR3zjwENQ=;
	h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=d2ZDYv4Ew1pijyDUxeyft5bhGVgHCyPuJaNR541jnL9E6QePEXBgdxsOUljD2sMg9b8cIaGYfGF/RtfxI4pDXLYb24mXUzrCmFnriZSQ+yf5xythZ7lyOc3udVEzq1FBExHxWB8ors4b3uDeUkgRhSXF/8mUNPVMRqMbrki922g=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-lf1-x132.google.com with SMTP id 2adb3069b0e04-50bf2d9b3fdso8826070e87.3
        for <gcc-patches@gcc.gnu.org>; Wed, 13 Dec 2023 01:01:59 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1702458117; x=1703062917; darn=gcc.gnu.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=9n6AyvTRwUuTZOhZnj7w9tD0DvYjTADbjeraObQOYIw=;
        b=HkiuMATu6C9vq5ot3sqerVuJ69mFImTVw8CdbG0PCq19utCbsLShfF1TQKn+Zip3IQ
         9AXyKoQFwYk2maVMqVcCJ1f36LNULNX8ON8Ue9D0kEptOLdtdnM5xnB9ur36jhVS/e2Y
         ueqBuB7yJ3TK4CY0sYEBhtemAGkU0TnXcjrpUjDRmSq35X+a2WHRwaeZAcNIGMAVaz6A
         gqPZ92JQeY6qJUV1eMpb1OaBHBVD6ESdIbXDEYlpYMO1LitsEqIzcKmmHUVTI0vEkmA8
         seuYO5z9TCOiHnDLCb3xDkNvDz8QMeCODlaHY14MPtjrqdmPNApPrdRgxGO5QDI2fVow
         qi+g==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1702458117; x=1703062917;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=9n6AyvTRwUuTZOhZnj7w9tD0DvYjTADbjeraObQOYIw=;
        b=OJfogcboZu2mx/d35lx60HJ5Liut4DuQ7mNgd15KAMcqfDedmVBxrda8bMu8N4N8qt
         2oD33vQ4v35makLvIvEmxvXlZiN1as6WNg3M7daIKhWUXuutc/vh9jeHNJQTdjg2MKvs
         qZA6A9t3sPMMJgyTI5cdQOwCIuVSLI7Tn106vzeJtPcoBd6t17dHcm4V5rub8Z4hXK9/
         fXTOVQ2AjSU8XaRsLqjGU0B/0sQPNv/nSg5Z2tfp3kbbkWppqn1dlbH0LZLDwL02PWiD
         +k0Do7T6uDtanfawMoU5ijylx++61T8JQLge60Gk/+e9RJ3slChJMYomJMYJE2hSRj5S
         SpEA==
X-Gm-Message-State: AOJu0YwTiaMsrpnzt/qX486EIPfWIPCRDVWbhzRivXJfG16mkpBcHsu1
	l7kXpxp6UwMQIs4gxSGFcGBPAf5ivwo/AEPc3uuZAaKb
X-Google-Smtp-Source: AGHT+IELd/nEu9E2HvxiHR7PZh3duwUGX47y7w3njMXsPAJJ0AkdyZYXt/jBvddVuaNLgEMLW5j75sXBQ544Xxxzczc=
X-Received: by 2002:ac2:51ad:0:b0:50b:e967:ddf0 with SMTP id
 f13-20020ac251ad000000b0050be967ddf0mr3352297lfk.30.1702458117041; Wed, 13
 Dec 2023 01:01:57 -0800 (PST)
MIME-Version: 1.0
References: <SN6PR01MB4240D8EE68CF46689F9C243AE8F7A@SN6PR01MB4240.prod.exchangelabs.com>
 <CAFiYyc02ooZrRicHc1rDtjKuc2pHn_DwfBW441F-yTWE5UG-Ew@mail.gmail.com>
 <SN6PR01MB4240C36D35F02379B33C258DE8CFA@SN6PR01MB4240.prod.exchangelabs.com>
 <CAFiYyc1T-vDH_eMWhfbDRML1ULFTxAdLzEP1W74Z9cCZh2kAOg@mail.gmail.com>
 <SN6PR01MB424014CD57AEE913E6067F8DE8AFA@SN6PR01MB4240.prod.exchangelabs.com>
 <CAFiYyc12pWSwqH3r3Wx2zjHe2MrVwwwevDn3wU+FX1tor4S5WQ@mail.gmail.com>
 <SN6PR01MB42407ECAB354244C7CC0BE58E883A@SN6PR01MB4240.prod.exchangelabs.com>
 <CAFiYyc3USQz_a-TzFSJUixTk2fPmCDX7QSVP0MtOtGyEyw72Ag@mail.gmail.com> <SN6PR01MB42405D4A9923012D17B40D5CE88DA@SN6PR01MB4240.prod.exchangelabs.com>
In-Reply-To: <SN6PR01MB42405D4A9923012D17B40D5CE88DA@SN6PR01MB4240.prod.exchangelabs.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Wed, 13 Dec 2023 10:00:39 +0100
Message-ID: <CAFiYyc2Gg2_VGoCiE_r76RtHcEq13uES1X5Tid6DJ3a715bivg@mail.gmail.com>
Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
To: Di Zhao OS <dizhao@os.amperecomputing.com>
Cc: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Spam-Status: No, score=-1.8 required=5.0 tests=BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,KAM_SHORT,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,TXREP,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org
List-Id: <gcc-patches.gcc.gnu.org>

On Wed, Dec 13, 2023 at 9:14=E2=80=AFAM Di Zhao OS
<dizhao@os.amperecomputing.com> wrote:
>
> Hello Richard,
>
> > -----Original Message-----
> > From: Richard Biener <richard.guenther@gmail.com>
> > Sent: Monday, December 11, 2023 7:01 PM
> > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Wed, Nov 29, 2023 at 3:36=E2=80=AFPM Di Zhao OS
> > <dizhao@os.amperecomputing.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > Sent: Tuesday, November 21, 2023 9:01 PM
> > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > Cc: gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > > > get_reassociation_width
> > > >
> > > > On Thu, Nov 9, 2023 at 6:53=E2=80=AFPM Di Zhao OS <dizhao@os.ampere=
computing.com>
> > > > wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > Sent: Tuesday, October 31, 2023 9:48 PM
> > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA=
 in
> > > > > > get_reassociation_width
> > > > > >
> > > > > > On Sun, Oct 8, 2023 at 6:40=E2=80=AFPM Di Zhao OS
> > <dizhao@os.amperecomputing.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Attached is a new version of the patch.
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Richard Biener <richard.guenther@gmail.com>
> > > > > > > > Sent: Friday, October 6, 2023 5:33 PM
> > > > > > > > To: Di Zhao OS <dizhao@os.amperecomputing.com>
> > > > > > > > Cc: gcc-patches@gcc.gnu.org
> > > > > > > > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider=
 FMA in
> > > > > > > > get_reassociation_width
> > > > > > > >
> > > > > > > > On Thu, Sep 14, 2023 at 2:43=E2=80=AFPM Di Zhao OS
> > > > > > > > <dizhao@os.amperecomputing.com> wrote:
> > > > > > > > >
> > > > > > > > > This is a new version of the patch on "nested FMA".
> > > > > > > > > Sorry for updating this after so long, I've been studying=
 and
> > > > > > > > > writing micro cases to sort out the cause of the regressi=
on.
> > > > > > > >
> > > > > > > > Sorry for taking so long to reply.
> > > > > > > >
> > > > > > > > > First, following previous discussion:
> > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > > September/629080.html)
> > > > > > > > >
> > > > > > > > > 1. From testing more altered cases, I don't think the
> > > > > > > > > problem is that reassociation works locally. In that:
> > > > > > > > >
> > > > > > > > >   1) On the example with multiplications:
> > > > > > > > >
> > > > > > > > >         tmp1 =3D a + c * c + d * d + x * y;
> > > > > > > > >         tmp2 =3D x * tmp1;
> > > > > > > > >         result +=3D (a + c + d + tmp2);
> > > > > > > > >
> > > > > > > > >   Given "result" rewritten by width=3D2, the performance =
is
> > > > > > > > >   worse if we rewrite "tmp1" with width=3D2. In contrast,=
 if we
> > > > > > > > >   remove the multiplications from the example (and make "=
tmp1"
> > > > > > > > >   not singe used), and still rewrite "result" by width=3D=
2, then
> > > > > > > > >   rewriting "tmp1" with width=3D2 is better. (Make sense =
because
> > > > > > > > >   the tree's depth at "result" is still smaller if we rew=
rite
> > > > > > > > >   "tmp1".)
> > > > > > > > >
> > > > > > > > >   2) I tried to modify the assembly code of the example w=
ithout
> > > > > > > > >   FMA, so the width of "result" is 4. On Ampere1 there's =
no
> > > > > > > > >   obvious improvement. So although this is an interesting
> > > > > > > > >   problem, it doesn't seem like the cause of the regressi=
on.
> > > > > > > >
> > > > > > > > OK, I see.
> > > > > > > >
> > > > > > > > > 2. From assembly code of the case with FMA, one problem i=
s
> > > > > > > > > that, rewriting "tmp1" to parallel didn't decrease the
> > > > > > > > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > > > > > > > increased code size, so the overhead is increased.
> > > > > > > > >
> > > > > > > > >    a) When "tmp1" is not re-written to parallel:
> > > > > > > > >         fmadd d31, d2, d2, d30
> > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > >         fmadd d31, d4, d5, d31  //"tmp1"
> > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > >
> > > > > > > > >    b) When "tmp1" is re-written to parallel:
> > > > > > > > >         fmul  d31, d4, d5
> > > > > > > > >         fmadd d27, d2, d2, d30
> > > > > > > > >         fmadd d31, d3, d3, d31
> > > > > > > > >         fadd  d31, d31, d27     //"tmp1"
> > > > > > > > >         fmadd d31, d31, d4, d3
> > > > > > > > >
> > > > > > > > > For version a), there are 3 dependent FMAs to calculate "=
tmp1".
> > > > > > > > > For version b), there are also 3 dependent instructions i=
n the
> > > > > > > > > longer path: the 1st, 3rd and 4th.
> > > > > > > >
> > > > > > > > Yes, it doesn't really change anything.  The patch has
> > > > > > > >
> > > > > > > > +  /* If there's code like "acc =3D a * b + c * d + acc" in=
 a tight
> > loop,
> > > > > > some
> > > > > > > > +     uarchs can execute results like:
> > > > > > > > +
> > > > > > > > +       _1 =3D a * b;
> > > > > > > > +       _2 =3D .FMA (c, d, _1);
> > > > > > > > +       acc_1 =3D acc_0 + _2;
> > > > > > > > +
> > > > > > > > +     in parallel, while turning it into
> > > > > > > > +
> > > > > > > > +       _1 =3D .FMA(a, b, acc_0);
> > > > > > > > +       acc_1 =3D .FMA(c, d, _1);
> > > > > > > > +
> > > > > > > > +     hinders that, because then the first FMA depends on t=
he
> > result
> > > > > > > > of preceding
> > > > > > > > +     iteration.  */
> > > > > > > >
> > > > > > > > I can't see what can be run in parallel for the first case.
> > The .FMA
> > > > > > > > depends on the multiplication a * b.  Iff the uarch somehow
> > decomposes
> > > > > > > > .FMA into multiply + add then the c * d multiply could run =
in
> > parallel
> > > > > > > > with the a * b multiply which _might_ be able to hide some =
of the
> > > > > > > > latency of the full .FMA.  Like on x86 Zen FMA has a latenc=
y of 4
> > > > > > > > cycles but a multiply only 3.  But I never got confirmation=
 from
> > any
> > > > > > > > of the CPU designers that .FMAs are issued when the multipl=
y
> > > > > > > > operands are ready and the add operand can be forwarded.
> > > > > > > >
> > > > > > > > I also wonder why the multiplications of the two-FMA sequen=
ce
> > > > > > > > then cannot be executed at the same time?  So I have some d=
oubt
> > > > > > > > of the theory above.
> > > > > > >
> > > > > > > The parallel execution for the code snippet above was the oth=
er
> > > > > > > issue (previously discussed here:
> > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.=
html).
> > > > > > > Sorry it's a bit confusing to include that here, but these 2 =
fixes
> > > > > > > needs to be combined to avoid new regressions. Since consider=
ing
> > > > > > > FMA in get_reassociation_width produces more results of width=
=3D1,
> > > > > > > so there would be more loop depending FMA chains.
> > > > > > >
> > > > > > > > Iff this really is the reason for the sequence to execute w=
ith
> > lower
> > > > > > > > overall latency and we want to attack this on GIMPLE then I=
 think
> > > > > > > > we need a target hook telling us this fact (I also wonder i=
f such
> > > > > > > > behavior can be modeled in the scheduler pipeline descripti=
on at
> > all?)
> > > > > > > >
> > > > > > > > > So it seems to me the current get_reassociation_width alg=
orithm
> > > > > > > > > isn't optimal in the presence of FMA. So I modified the p=
atch to
> > > > > > > > > improve get_reassociation_width, rather than check for co=
de
> > > > > > > > > patterns. (Although there could be some other complicated
> > > > > > > > > factors so the regression is more obvious when there's "n=
ested
> > > > > > > > > FMA". But with this patch that should be avoided or reduc=
ed.)
> > > > > > > > >
> > > > > > > > > With this patch 508.namd_r 1-copy run has 7% improvement =
on
> > > > > > > > > Ampere1, on Intel Xeon there's about 3%. While I'm still
> > > > > > > > > collecting data on other CPUs, I'd like to know how do yo=
u
> > > > > > > > > think of this.
> > > > > > > > >
> > > > > > > > > About changes in the patch:
> > > > > > > > >
> > > > > > > > > 1. When the op list forms a complete FMA chain, try to se=
arch
> > > > > > > > > for a smaller width considering the benefit of using FMA.=
 With
> > > > > > > > > a smaller width, the increment of code size is smaller wh=
en
> > > > > > > > > breaking the chain.
> > > > > > > >
> > > > > > > > But this is all highly target specific (code size even more=
 so).
> > > > > > > >
> > > > > > > > How I understand your approach to fixing the issue leads me=
 to
> > > > > > > > the suggestion to prioritize parallel rewriting, thus alter
> > > > > > rank_ops_for_fma,
> > > > > > > > taking the reassoc width into account (the computed width s=
hould
> > be
> > > > > > > > unchanged from rank_ops_for_fma) instead of "fixing up" the
> > parallel
> > > > > > > > rewriting of FMAs (well, they are not yet formed of course)=
.
> > > > > > > > get_reassociation_width has 'get_required_cycles', the abov=
e
> > theory
> > > > > > > > could be verified with a very simple toy pipeline model.  W=
e'd
> > have
> > > > > > > > to ask the target for the reassoc width for MULT_EXPRs as w=
ell (or
> > > > maybe
> > > > > > > > even FMA_EXPRs).
> > > > > > > >
> > > > > > > > Taking the width of FMAs into account when computing the re=
assoc
> > width
> > > > > > > > might be another way to attack this.
> > > > > > >
> > > > > > > Previously I tried to solve this generally, on the assumption=
 that
> > > > > > > FMA (smaller code size) is preferred. Now I agree it's diffic=
ult
> > > > > > > since: 1) As you mentioned, the latency of FMA, FMUL and FADD=
 can
> > > > > > > be different. 2) From my test result on different machines we
> > > > > > > have, it seems simply adding the cycles together is not a goo=
d way
> > > > > > > to estimate the latency of consecutive FMA.
> > > > > > >
> > > > > > > I think an easier way to fix this is to add a parameter to su=
ggest
> > > > > > > the length of complete FMA chain to keep. (It can be set by t=
arget
> > > > > > > specific tuning then.) And we can break longer FMA chains for
> > > > > > > better parallelism. Attached is the new implementation. With
> > > > > > > max-fma-chain-len=3D8, there's about 7% improvement in spec20=
17
> > > > > > > 508.namd_r on ampere1, and the overall improvement on fprate =
is
> > > > > > > about 1%.
> > > > > > >
> > > > > > > Since there's code in rank_ops_for_fma to identify MULT_EXPRs=
 from
> > > > > > > others, I left it before get_reassociation_width so the numbe=
r of
> > > > > > > MULT_EXPRs can be used.
> > > > > >
> > > > > > Sorry again for the delay in replying.
> > > > > >
> > > > > > +  /* Check if keeping complete FMA chains is preferred.  */
> > > > > > +  if (width > 1 && mult_num >=3D 2 && param_max_fma_chain_len)
> > > > > > +    {
> > > > > > +      /* num_fma_chain + (num_fma_chain - 1) >=3D num_plus .  =
*/
> > > > > > +      int num_others =3D ops_num - mult_num;
> > > > > > +      int num_fma_chain =3D CEIL (num_others + 1, 2);
> > > > > > +
> > > > > > +      if (num_fma_chain < width
> > > > > > +         && CEIL (mult_num, num_fma_chain) <=3D param_max_fma_=
chain_len)
> > > > > > +       width =3D num_fma_chain;
> > > > > > +    }
> > > > > >
> > > > > > so here 'mult_num' serves as a heuristical value how many
> > > > > > FMAs we could build.  If that were close to ops_num - 1 then
> > > > > > we'd have a chain of FMAs.  Not sure how you get at
> > > > > > num_others / 2 here.  Maybe we need to elaborate on what an
> > > > > > FMA chain is?  I thought it is FMA (FMA (FMA (..., b, c), d, e)=
, f, g)
> > > > > > where each (b,c) pair is really just one operand in the ops arr=
ay,
> > > > > > one of the 'mult's.  Thus a FMA chain is _not_
> > > > > > FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ..., right?
> > > > >
> > > > > The "FMA chain" here refers to consecutive FMAs, each taking
> > > > > The previous one's result as the third operator, i.e.
> > > > > ... FMA(e, f, FMA(c, d, FMA (a, b, r)))... . So original op
> > > > > list looks like "r + a * b + c * d + e * f + ...". These FMAs
> > > > > will end up using the same accumulate register.
> > > > >
> > > > > When num_others=3D2 or 3, there can be 2 complete chains, e.g.
> > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h)
> > > > > or
> > > > >         FMA (d, e, FMA (a, b, c)) + FMA (f, g, h) + i .
> > > > > And so on, that's where the "CEIL (num_others + 1, 2)" comes from=
.
> > > > >
> > > > > >
> > > > > > Forming an FMA chain effectively reduces the reassociation widt=
h
> > > > > > of the participating multiplies.  If we were not to form FMAs a=
ll
> > > > > > the multiplies could execute in parallel.
> > > > > >
> > > > > > So what does the above do, in terms of adjusting the reassociat=
ion
> > > > > > width for the _adds_, and what's the ripple-down effect on late=
r
> > > > > > FMA forming?
> > > > > >
> > > > >
> > > > > The above code calculates the number of such FMA chains in the op
> > > > > list. And if the length of each chain doesn't exceed
> > > > > param_max_fma_chain_len, then width is set to the number of chain=
s,
> > > > > so we won't break them (because rewrite_expr_tree_parallel handle=
s
> > > > > this well).
> > > > >
> > > > > > The change still feels like whack-a-mole playing rather than
> > understanding
> > > > > > the fundamental issue on the targets.
> > > > >
> > > > > I think the complexity is in how the instructions are piped.
> > > > > Some Arm CPUs such as Neoverse V2 supports "late-forwarding":
> > > > > "FP multiply-accumulate pipelines support late-forwarding of
> > > > > accumulate operands from similar =CE=BCOPs, allowing a typical
> > > > > sequence of multiply-accumulate =CE=BCOPs to issue one every N
> > > > > cycles". ("N" is smaller than the latency of a single FMA
> > > > > instruction.) So keeping such FMA chains can utilize such
> > > > > feature and uses less FP units. I guess the case is similar on
> > > > > some late X86 CPUs.
> > > > >
> > > > > If we try to compute the minimum circles of each option, I think
> > > > > at least we'll need to know whether the target has similar
> > > > > feature, and the latency of each uop. While using an
> > > > > experiential length of beneficial FMA chain could be a shortcut.
> > > > > (Maybe allowing different lengths for different data widths is
> > > > > better.)
> > > >
> > > > Hm.  So even when we can late-forward in an FMA chain
> > > > increasing the width should typically be still better?
> > > >
> > > > _1 =3D FMA (_2 * _3 + _4);
> > > > _5 =3D FMA (_6 * _7 + _1);
> > > >
> > > > say with late-forwarding we can hide the latency of the _6 * _7
> > > > multiply and the overall latency of the two FMAs above become
> > > > lat (FMA) + lat (ADD) in the ideal case.  Alternatively we do
> > > >
> > > > _1 =3D FMA (_2 * _ 3 + _4);
> > > > _8 =3D _6 * _ 7;
> > > > _5 =3D _1 + _8;
> > > >
> > > > where if the FMA and the multiply can execute in parallel
> > > > (we have two FMA pipes) the latency would be lat (FMA) + lat (ADD).
> > > > But when we only have a single pipeline capable of
> > > > FMA or multiplies then it is at least MIN (lat (FMA) + 1, lat (MUL)=
 + 1)
> > > > + lat (ADD), it depends on luck whether the FMA or the MUL is
> > > > issued first there.
> > > >
> > > > So if late-forward works really well and the add part of the FMA
> > > > has very low latency compared to the multiplication part having
> > > > a smaller reassoc width should pay off here and we might be
> > > > able to simply control this via the existing target hook?
> > > >
> > > > I'm not aware of x86 CPUs having late-forwarding capabilities
> > > > but usually the latency of multiplication and FMA is very similar
> > > > and one can issue two FMAs and possibly more ADDs in parallel.
> > > >
> > > > As said I think this detail (late-forward) should maybe reflected
> > > > into get_required_cycles, possibly guided by a different
> > > > targetm.sched.reassociation_width for MULT_EXPR vs PLUS_EXPR?
> > > >
> > >
> > > To my understanding, the question is whether the target fully
> > > pipelines FMA instructions, so the MULT part can start first if
> > > its operands are ready. While targetm.sched.reassociation_width
> > > reflects the number of pipes for some operation, so it can guide
> > > get_required_cycles for a sequence of identical operations
> > > (e.g. A * B * C * D or A + B + C + D). Since the problem in
> > > this case is not the number of pipes for FMA, I think another
> > > indicator maybe better.
> > >
> > > (Currently the fma_reassoc_width for AArch64 is to control
> > > whether reassociation on FADD is OK. This workaround doesn't
> > > work well on some cases, for example it turns down reassociation
> > > even when there's no FMA at all. So I think we'd better not
> > > follow the schema.)
> > >
> > > Attached is a new version of the patch with a flag to indicate
> > > whether FMA is fully pipelined, and: 1) lat (MUL) >=3D lat (ADD);
> > > 2) symmetric units are used or FMUL/FADD/FMA. Otherwise the
> > > patch may not be beneficial.
> > >
> > > It tries to calculate the latencies including MULT_EXPRs. Since
> > > the code is different with the current code (the quick-search
> > > part), I haven't included it inside get_required_cycles.
> >
> > +; If the flag 'fully-pipelined-fma' is set, reassociation takes into a=
ccount
> > +; the benifit of parallelizing FMA's multiply part and addition part.
> > +ffully-pipelined-fma
> > +Common Var(flag_fully_pipelined_fma)
> > +Assume the target fully pipelines FMA instruction, and symmetric units=
 are
> > used
> > +for FMUL/FADD/FMA.
> >
> > please use a --param for now, I think targets might want to set this ba=
sed
> > on active core tuning.
> >
> > +/* Given that the target fully pipelines FMA instructions, return late=
ncy of
> > +   MULT_EXPRs that can't be hided by FMA.  WIDTH is the number of pipe=
s.  */
> > +
> >
> > return the latency .. can't be hidden by the FMA
> >
> > For documentation purposes it should be stated that mult_num <=3D ops_n=
um
> >
> > +  /* If the target fully pipelines FMA instruction, the multiply part =
can
> > start
> >
> > instructions
> >
> > +     first if its operands are ready.  Assuming symmetric pipes are us=
ed for
> >
> > s/first/already/
> >
> > +     FMUL/FADD/FMA, then for a sequence of FMA like:
> > +
> > +       _8 =3D .FMA (_2, _3, _1);
> > +       _9 =3D .FMA (_5, _4, _8);
> > +       _10 =3D .FMA (_7, _6, _9);
> > +
> > +     , if width=3D1, the latency is latency(MULT) + latency(ADD)*3.
> > +     While with width=3D2:
> > +
> > +       _8 =3D _4 * _5;
> > +       _9 =3D .FMA (_2, _3, _1);
> > +       _10 =3D .FMA (_6, _7, _8);
> > +       _11 =3D _9 + _10;
> > +
> > +     , it is latency(MULT)*2 + latency(ADD)*2.  Assuming latency(MULT)=
 <=3D
> > +     latency(ADD), the previous one is preferred.
> >
> > latency (MULT) >=3D latency (ADD)?
> >
> > ".. the first variant is preferred."
> >
> > +
> > +     Find out if we can get a smaller width considering FMA.  */
>
> Corrected these errors. Thank you for the corrections.
>
> > +      /* When flag_fully_pipelined_fma is set, assumes symmetric pipes=
 are
> > used
> > +        for FMUL/FADD/FMA.  */
> > +      int lat_mul =3D get_mult_latency_consider_fma (ops_num, mult_num=
, width);
> >
> > what does "symmetric pipes" actually mean?  For x86 Zen3 we have
> > two FMA pipes (that can also do ADD and MUL) and one pipe that can
> > do ADD.  Is that then non-symmetric because we can issue more adds
> > in parallel than FMA/MUL?

btw, I double-checked and Zen3/4 have two pipes for FMUL/FMA and two
separate pipes for FADD, the FMUL/FMA pipes cannot do FADD.  FADD
has a latency of 3 cycles while FMUL/FMA has a latency of 4 cycles.

I'd say Zen is then not "symmetric" as in your definition?  I do wonder
what part of the pipeline characteristic could be derived from the
reassoc_width target hook (maybe the number of pipes but not whether
they are shared with another op).  In theory the scheduling description
could offer the info (if correct and precise enough), but I don't think the=
re's
a good way to query this details.

> "symmetric pipes" was to indicate that FADD/FMA/FMUL use the same unit
> set, so the widths are uniform, and the calculations in this patch can
> apply. I think this can be relaxed for scenarios like Zen3, by searching
> for a smaller width only using the pipes for FMUL/FMA. But if the pipes
> for FMUL and FADD are separated, for example 1 for FMA/FMUL and 2 other
> pipes for FADD, then the minimum circle might be incorrect.
>
> Changed the descriptions and code in get_reassociation_width a bit to
> include the case like Zen3:
>
> +      /* When param_fully_pipelined_fma is set, assume FMUL and FMA use =
the
> +        same units that can also do FADD.  For other scenarios, such as =
when
> +        FMUL and FADD are using distinct units, the following code may n=
ot
> +        apply.  */
> +      int width_mult =3D targetm.sched.reassociation_width (MULT_EXPR, m=
ode);
> +      gcc_checking_assert (width_mult <=3D width);
> +
> +      /* Latency of MULT_EXPRs.  */
> +      int lat_mul
> +       =3D get_mult_latency_consider_fma (ops_num, mult_num, width_mult)=
;

The updated patch is OK.

Thanks for your patience.

Thanks,
Richard.

> > Otherwise this looks OK now.
> >
> > Thanks,
> > Richard.
> >
> > > > > > +  /* If there's loop dependent FMA result, return width=3D2 to=
 avoid it.
> > > > This
> > > > > > is
> > > > > > +     better than skipping these FMA candidates in widening_mul=
.  */
> > > > > >
> > > > > > better than skipping, but you don't touch it there?  I suppose =
width
> > =3D=3D 2
> > > > > > will bypass the skipping, right?  This heuristic only comes in =
when
> > the
> > > > above
> > > > > > change made width =3D=3D 1, since otherwise we have an earlier
> > > > > >
> > > > > >   if (width =3D=3D 1)
> > > > > >     return width;
> > > > > >
> > > > > > which als guarantees width =3D=3D 2 was allowed by the hook/par=
am, right?
> > > > >
> > > > > Yes, that's right.
> > > > >
> > > > > >
> > > > > > +  if (width =3D=3D 1 && mult_num
> > > > > > +      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (l=
hs))),
> > > > > > +                  param_avoid_fma_max_bits))
> > > > > > +    {
> > > > > > +      /* Look for cross backedge dependency:
> > > > > > +       1. LHS is a phi argument in the same basic block it is =
defined.
> > > > > > +       2. And the result of the phi node is used in OPS.  */
> > > > > > +      basic_block bb =3D gimple_bb (SSA_NAME_DEF_STMT (lhs));
> > > > > > +      gimple_stmt_iterator gsi;
> > > > > > +      for (gsi =3D gsi_start_phis (bb); !gsi_end_p (gsi); gsi_=
next
> > (&gsi))
> > > > > > +       {
> > > > > > +         gphi *phi =3D dyn_cast<gphi *> (gsi_stmt (gsi));
> > > > > > +         for (unsigned i =3D 0; i < gimple_phi_num_args (phi);=
 ++i)
> > > > > > +           {
> > > > > > +             tree op =3D PHI_ARG_DEF (phi, i);
> > > > > > +             if (!(op =3D=3D lhs && gimple_phi_arg_edge (phi, =
i)->src =3D=3D
> > bb))
> > > > > > +               continue;
> > > > > >
> > > > > > I think it's easier to iterate over the immediate uses of LHS l=
ike
> > > > > >
> > > > > >   FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> > > > > >      if (gphi *phi =3D dyn_cast <gphi *> (USE_STMT (use_p)))
> > > > > >        {
> > > > > >           if (gimple_phi_arg_edge (phi, phi_arg_index_from_use
> > > > > > (use_p))->src !=3D bb)
> > > > > >             continue;
> > > > > > ...
> > > > > >        }
> > > > > >
> > > > > > otherwise I think _this_ part of the patch looks reasonable.
> > > > > >
> > > > > > As you say heuristically they might go together but I think we =
should
> > > > split
> > > > > > the
> > > > > > patch - the cross-loop part can probably stand independently.  =
Can you
> > > > adjust
> > > > > > and re-post?
> > > > >
> > > > > Attached is the separated part for cross-loop FMA. Thank you for =
the
> > > > correction.
> > > >
> > > > That cross-loop FMA patch is OK.
> > >
> > > Committed this part at 746344dd.
> > >
> > > Thanks,
> > > Di
> > >
> > > >
> > > > Thanks,
> > > > Richard.
> > > >
> > > > > >
> > > > > > As for the first part I still don't understand very well and am=
 still
> > > > hoping
> > > > > > we
> > > > > > can get away without yet another knob to tune.
> > > > > >
> > > > > > Richard.
> > > > > >
> > > > > > > >
> > > > > > > > > 2. To avoid regressions, included the other patch
> > > > > > > > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-
> > > > September/629203.html)
> > > > > > > > > on this tracker again. This is because more FMA will be k=
ept
> > > > > > > > > with 1., so we need to rule out the loop dependent
> > > > > > > > > FMA chains when param_avoid_fma_max_bits is set.
> > > > > > > >
> > > > > > > > Sorry again for taking so long to reply.
> > > > > > > >
> > > > > > > > I'll note we have an odd case on x86 Zen2(?) as well which =
we
> > don't
> > > > really
> > > > > > > > understand from a CPU behavior perspective.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Richard.
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Di Zhao
> > > > > > > > >
> > > > > > > > > ----
> > > > > > > > >
> > > > > > > > >         PR tree-optimization/110279
> > > > > > > > >
> > > > > > > > > gcc/ChangeLog:
> > > > > > > > >
> > > > > > > > >         * tree-ssa-reassoc.cc
> > (rank_ops_for_better_parallelism_p):
> > > > > > > > >         New function to check whether ranking the ops res=
ults in
> > > > > > > > >         better parallelism.
> > > > > > > > >         (get_reassociation_width): Add new parameters. Se=
arch
> > for
> > > > > > > > >         smaller width considering the benefit of FMA.
> > > > > > > > >         (rank_ops_for_fma): Change return value to be num=
ber of
> > > > > > > > >         MULT_EXPRs.
> > > > > > > > >         (reassociate_bb): For 3 ops, refine the condition=
 to
> > call
> > > > > > > > >         swap_ops_for_binary_stmt.
> > > > > > > > >
> > > > > > > > > gcc/testsuite/ChangeLog:
> > > > > > > > >
> > > > > > > > >         * gcc.dg/pr110279.c: New test.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Di Zhao
> > > > > > >
> > > > > > > ----
> > > > > > >
> > > > > > >         PR tree-optimization/110279
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         * doc/invoke.texi: Description of param_max_fma_chain=
_len.
> > > > > > >         * params.opt: New parameter param_max_fma_chain_len.
> > > > > > >         * tree-ssa-reassoc.cc (get_reassociation_width):
> > > > > > >         Support param_max_fma_chain_len; check for loop depen=
dent
> > > > > > >         FMAs.
> > > > > > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > > > > > >         (reassociate_bb): For 3 ops, refine the condition to =
call
> > > > > > >         swap_ops_for_binary_stmt.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         * gcc.dg/pr110279-1.c: New test.
> > > > > > >         * gcc.dg/pr110279-2.c: New test.
> > > > > > >         * gcc.dg/pr110279-3.c: New test.
> > > > >
> > > > > ---
> > > > >
> > > > >         PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * tree-ssa-reassoc.cc (get_reassociation_width): check
> > > > >         for loop dependent FMAs.
> > > > >         (reassociate_bb): For 3 ops, refine the condition to call
> > > > >         swap_ops_for_binary_stmt.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.dg/pr110279-1.c: New test.
> > > ---
> > >
> > >         PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * common.opt: New flag fully-pipelined-fma.
> > >         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma):
> > >         Return latency of MULT_EXPRs that can't be hided by FMA.
> > >         (get_reassociation_width): Search for smaller widths
> > >         considering the benefit of fully pipelined FMA.
> > >         (rank_ops_for_fma): Return the number of MULT_EXPRs.
> > >         (reassociate_bb): Pass the number of MULT_EXPRs to
> > >         get_reassociation_width; avoid calling
> > >         get_reassociation_width twice.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279-2.c: New test.
>
> Thanks,
> Di
>
> ---
>
>         PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * doc/invoke.texi: New parameter fully-pipelined-fma.
>         * params.opt: New parameter fully-pipelined-fma.
>         * tree-ssa-reassoc.cc (get_mult_latency_consider_fma): Return
>         the latency of MULT_EXPRs that can't be hidden by the FMAs.
>         (get_reassociation_width): Search for a smaller width
>         considering the benefit of fully pipelined FMA.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): Pass the number of MULT_EXPRs to
>         get_reassociation_width; avoid calling
>         get_reassociation_width twice.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-2.c: New test.