From: Richard Biener
Date: Tue, 31 Oct 2023 14:47:57 +0100
Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
To: Di Zhao OS
Cc: "gcc-patches@gcc.gnu.org"

On Sun, Oct 8, 2023 at 6:40 PM Di Zhao OS wrote:
>
> Attached is a new version of the patch.
>
> > -----Original Message-----
> > From: Richard Biener
> > Sent: Friday, October 6, 2023 5:33 PM
> > To: Di Zhao OS
> > Cc: gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in
> > get_reassociation_width
> >
> > On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS wrote:
> > >
> > > This is a new version of the patch on "nested FMA".
> > > Sorry for updating this after so long; I've been studying and
> > > writing micro cases to sort out the cause of the regression.
> >
> > Sorry for taking so long to reply.
> >
> > > First, following the previous discussion:
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
> > >
> > > 1. From testing more altered cases, I don't think the
> > > problem is that reassociation works locally. In that:
> > >
> > > 1) On the example with multiplications:
> > >
> > >     tmp1 = a + c * c + d * d + x * y;
> > >     tmp2 = x * tmp1;
> > >     result += (a + c + d + tmp2);
> > >
> > > Given "result" rewritten by width=2, the performance is
> > > worse if we rewrite "tmp1" with width=2. In contrast, if we
> > > remove the multiplications from the example (and make "tmp1"
> > > not single-used), and still rewrite "result" by width=2, then
> > > rewriting "tmp1" with width=2 is better. (This makes sense
> > > because the tree's depth at "result" is still smaller if we
> > > rewrite "tmp1".)
> > >
> > > 2) I tried to modify the assembly code of the example without
> > > FMA, so the width of "result" is 4. On Ampere1 there's no
> > > obvious improvement. So although this is an interesting
> > > problem, it doesn't seem to be the cause of the regression.
> >
> > OK, I see.
> >
> > > 2. From the assembly code of the case with FMA, one problem is
> > > that rewriting "tmp1" to parallel didn't decrease the
> > > minimum CPU cycles (taking MULT_EXPRs into account), but
> > > increased code size, so the overhead is increased.
> > >
> > >    a) When "tmp1" is not re-written to parallel:
> > >         fmadd d31, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fmadd d31, d4, d5, d31  //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > >    b) When "tmp1" is re-written to parallel:
> > >         fmul  d31, d4, d5
> > >         fmadd d27, d2, d2, d30
> > >         fmadd d31, d3, d3, d31
> > >         fadd  d31, d31, d27     //"tmp1"
> > >         fmadd d31, d31, d4, d3
> > >
> > > For version a), there are 3 dependent FMAs to calculate "tmp1".
> > > For version b), there are also 3 dependent instructions on the
> > > longer path: the 1st, 3rd and 4th.
> >
> > Yes, it doesn't really change anything.  The patch has
> >
> > +  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
> > +     uarchs can execute results like:
> > +
> > +       _1 = a * b;
> > +       _2 = .FMA (c, d, _1);
> > +       acc_1 = acc_0 + _2;
> > +
> > +     in parallel, while turning it into
> > +
> > +       _1 = .FMA(a, b, acc_0);
> > +       acc_1 = .FMA(c, d, _1);
> > +
> > +     hinders that, because then the first FMA depends on the result
> > +     of the preceding iteration.  */
> >
> > I can't see what can be run in parallel for the first case.
> > The .FMA depends on the multiplication a * b.  Iff the uarch somehow
> > decomposes .FMA into multiply + add then the c * d multiply could run
> > in parallel with the a * b multiply, which _might_ be able to hide
> > some of the latency of the full .FMA.  Like on x86 Zen FMA has a
> > latency of 4 cycles but a multiply only 3.  But I never got
> > confirmation from any of the CPU designers that .FMAs are issued when
> > the multiply operands are ready and the add operand can be forwarded.
> >
> > I also wonder why the multiplications of the two-FMA sequence
> > then cannot be executed at the same time?  So I have some doubt
> > about the theory above.
>
> The parallel execution for the code snippet above was the other
> issue (previously discussed here:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/628960.html).
> Sorry it's a bit confusing to include that here, but these two fixes
> need to be combined to avoid new regressions. Since considering
> FMA in get_reassociation_width produces more results of width=1,
> there would be more loop-dependent FMA chains.
>
> > Iff this really is the reason for the sequence to execute with lower
> > overall latency and we want to attack this on GIMPLE then I think
> > we need a target hook telling us this fact (I also wonder if such
> > behavior can be modeled in the scheduler pipeline description at all?)
> >
> > > So it seems to me the current get_reassociation_width algorithm
> > > isn't optimal in the presence of FMA. So I modified the patch to
> > > improve get_reassociation_width, rather than check for code
> > > patterns. (Although there could be some other complicated
> > > factors so the regression is more obvious when there's "nested
> > > FMA". But with this patch that should be avoided or reduced.)
> > >
> > > With this patch the 508.namd_r 1-copy run has a 7% improvement on
> > > Ampere1; on Intel Xeon there's about 3%. While I'm still
> > > collecting data on other CPUs, I'd like to know what you
> > > think of this.
> > >
> > > About the changes in the patch:
> > >
> > > 1. When the op list forms a complete FMA chain, try to search
> > > for a smaller width considering the benefit of using FMA. With
> > > a smaller width, the increase in code size is smaller when
> > > breaking the chain.
> >
> > But this is all highly target specific (code size even more so).
> >
> > How I understand your approach to fixing the issue leads me to
> > the suggestion to prioritize parallel rewriting, thus alter
> > rank_ops_for_fma, taking the reassoc width into account (the computed
> > width should be unchanged from rank_ops_for_fma) instead of "fixing
> > up" the parallel rewriting of FMAs (well, they are not yet formed of
> > course).  get_reassociation_width has 'get_required_cycles'; the
> > above theory could be verified with a very simple toy pipeline model.
> > We'd have to ask the target for the reassoc width for MULT_EXPRs as
> > well (or maybe even FMA_EXPRs).
> >
> > Taking the width of FMAs into account when computing the reassoc width
> > might be another way to attack this.
>
> Previously I tried to solve this generally, on the assumption that
> FMA (smaller code size) is preferred. Now I agree it's difficult
> since: 1) As you mentioned, the latency of FMA, FMUL and FADD can
> be different. 2) From my test results on the different machines we
> have, it seems simply adding the cycles together is not a good way
> to estimate the latency of consecutive FMAs.
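
For concreteness, here is a minimal sketch of the kind of toy pipeline
model suggested above - a standalone program with made-up placeholder
latencies (loosely inspired by the Zen numbers quoted earlier), not the
actual get_required_cycles logic:

  /* Hedged sketch: latency of summing N products with WIDTH parallel
     chains.  Each chain is a leading fmul followed by serial fmadds;
     the WIDTH partial sums are then merged by a pairwise fadd tree.
     All latencies are hypothetical placeholders.  */
  #include <math.h>
  #include <stdio.h>

  #define MUL_LAT 3  /* hypothetical fmul latency */
  #define FMA_LAT 4  /* hypothetical fmadd latency */
  #define ADD_LAT 3  /* hypothetical fadd latency */

  static int
  toy_cycles (int n, int width)
  {
    int per_chain = (n + width - 1) / width;          /* CEIL (n, width) */
    int chain = MUL_LAT + (per_chain - 1) * FMA_LAT;  /* serial chain */
    int merge = ADD_LAT * (int) ceil (log2 (width));  /* merge partials */
    return chain + merge;
  }

  int
  main (void)
  {
    for (int width = 1; width <= 4; width++)
      printf ("n=8 width=%d -> %d cycles\n", width, toy_cycles (8, width));
    return 0;
  }

A purely additive model like this always favors wider rewrites (31, 18,
17, 13 cycles for widths 1 through 4 at n=8), which is exactly the kind
of estimate the measurements above suggest is too naive for consecutive
FMAs.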
>
> I think an easier way to fix this is to add a parameter to suggest
> the length of complete FMA chain to keep. (It can be set by
> target-specific tuning then.) And we can break longer FMA chains for
> better parallelism. Attached is the new implementation. With
> max-fma-chain-len=8, there's about 7% improvement in spec2017
> 508.namd_r on ampere1, and the overall improvement on fprate is
> about 1%.
>
> Since there's code in rank_ops_for_fma to identify MULT_EXPRs from
> others, I left it before get_reassociation_width so the number of
> MULT_EXPRs can be used.

Sorry again for the delay in replying.

+  /* Check if keeping complete FMA chains is preferred.  */
+  if (width > 1 && mult_num >= 2 && param_max_fma_chain_len)
+    {
+      /* num_fma_chain + (num_fma_chain - 1) >= num_plus.  */
+      int num_others = ops_num - mult_num;
+      int num_fma_chain = CEIL (num_others + 1, 2);
+
+      if (num_fma_chain < width
+          && CEIL (mult_num, num_fma_chain) <= param_max_fma_chain_len)
+        width = num_fma_chain;
+    }

so here 'mult_num' serves as a heuristic value for how many FMAs we
could build.  If that were close to ops_num - 1 then we'd have a chain
of FMAs.  Not sure how you get at num_others / 2 here.

Maybe we need to elaborate on what an FMA chain is?  I thought it is

  FMA (FMA (FMA (..., b, c), d, e), f, g)

where each (b, c) pair is really just one operand in the ops array,
one of the 'mult's.  Thus an FMA chain is _not_

  FMA (a, b, c) + FMA (d, e, f) + FMA (...) + ...,

right?  Forming an FMA chain effectively reduces the reassociation
width of the participating multiplies.  If we were not to form FMAs
all the multiplies could execute in parallel.

So what does the above do, in terms of adjusting the reassociation
width for the _adds_, and what's the ripple-down effect on later FMA
forming?  The change still feels like whack-a-mole playing rather than
understanding the fundamental issue on the targets.

+  /* If there's a loop-dependent FMA result, return width=2 to avoid it.
+     This is better than skipping these FMA candidates in widening_mul.  */

better than skipping, but you don't touch it there?  I suppose
width == 2 will bypass the skipping, right?  This heuristic only comes
in when the above change made width == 1, since otherwise we have an
earlier

  if (width == 1)
    return width;

which also guarantees width == 2 was allowed by the hook/param, right?

+  if (width == 1 && mult_num
+      && maybe_le (tree_to_poly_int64 (TYPE_SIZE (TREE_TYPE (lhs))),
+                   param_avoid_fma_max_bits))
+    {
+      /* Look for cross backedge dependency:
+         1. LHS is a phi argument in the same basic block it is defined.
+         2. And the result of the phi node is used in OPS.  */
+      basic_block bb = gimple_bb (SSA_NAME_DEF_STMT (lhs));
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_phis (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+          gphi *phi = dyn_cast <gphi *> (gsi_stmt (gsi));
+          for (unsigned i = 0; i < gimple_phi_num_args (phi); ++i)
+            {
+              tree op = PHI_ARG_DEF (phi, i);
+              if (!(op == lhs && gimple_phi_arg_edge (phi, i)->src == bb))
+                continue;

I think it's easier to iterate over the immediate uses of LHS like

  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
    if (gphi *phi = dyn_cast <gphi *> (USE_STMT (use_p)))
      {
        if (gimple_phi_arg_edge (phi,
                                 phi_arg_index_from_use (use_p))->src != bb)
          continue;
        ...
      }

otherwise I think _this_ part of the patch looks reasonable.

As you say heuristically they might go together, but I think we should
split the patch - the cross-loop part can probably stand independently.
Can you adjust and re-post?
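
To make the cross-loop part concrete, here is a source-level sketch of
the loop-carried pattern it targets, following the
"acc = a * b + c * d + acc" comment quoted further up (function and
variable names are illustrative, not from the patch):

  double
  dot4 (const double *a, const double *b, const double *c,
        const double *d, int n)
  {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
      /* Forming both FMAs, acc = .FMA (c[i], d[i], .FMA (a[i], b[i], acc)),
         puts two FMA latencies on the loop-carried dependency through acc.
         Keeping _1 = a[i] * b[i]; _2 = .FMA (c[i], d[i], _1);
         acc = acc + _2; leaves only the final add on the carried path.  */
      acc += a[i] * b[i] + c[i] * d[i];
    return acc;
  }

In this shape the reassociated LHS feeds the phi for acc along the
backedge, which is what the phi walk (or the FOR_EACH_IMM_USE_FAST
variant) is meant to detect.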
As for the first part I still don't understand it very well and am
still hoping we can get away without yet another knob to tune.

Richard.

> > > 2. To avoid regressions, I included the other patch
> > > (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> > > on this tracker again. This is because more FMAs will be kept
> > > with 1., so we need to rule out the loop-dependent
> > > FMA chains when param_avoid_fma_max_bits is set.
> >
> > Sorry again for taking so long to reply.
> >
> > I'll note we have an odd case on x86 Zen2(?) as well which we don't
> > really understand from a CPU behavior perspective.
> >
> > Thanks,
> > Richard.
> >
> > > Thanks,
> > > Di Zhao
> > >
> > > ----
> > >
> > > PR tree-optimization/110279
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
> > >         New function to check whether ranking the ops results in
> > >         better parallelism.
> > >         (get_reassociation_width): Add new parameters.  Search for
> > >         smaller width considering the benefit of FMA.
> > >         (rank_ops_for_fma): Change return value to be number of
> > >         MULT_EXPRs.
> > >         (reassociate_bb): For 3 ops, refine the condition to call
> > >         swap_ops_for_binary_stmt.
> > >
> > > gcc/testsuite/ChangeLog:
> > >
> > >         * gcc.dg/pr110279.c: New test.
>
> Thanks,
> Di Zhao
>
> ----
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * doc/invoke.texi: Describe param_max_fma_chain_len.
>         * params.opt: New parameter param_max_fma_chain_len.
>         * tree-ssa-reassoc.cc (get_reassociation_width):
>         Support param_max_fma_chain_len; check for loop-dependent
>         FMAs.
>         (rank_ops_for_fma): Return the number of MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279-1.c: New test.
>         * gcc.dg/pr110279-2.c: New test.
>         * gcc.dg/pr110279-3.c: New test.