public inbox for gcc-patches@gcc.gnu.org
From: Di Zhao OS <dizhao@os.amperecomputing.com>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Jeff Law <jeffreyalaw@gmail.com>, Martin Jambor <mjambor@suse.cz>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Subject: RE: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
Date: Tue, 29 Aug 2023 08:58:56 +0000	[thread overview]
Message-ID: <SN6PR01MB424040011515E9352E082E7AE8E7A@SN6PR01MB4240.prod.exchangelabs.com> (raw)
In-Reply-To: <CAFiYyc2bFqTLuf+zX2tu78PK6A3ZdT_B4_mXT2iQ3a-hg5RDsw@mail.gmail.com>

Hi,

> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, August 29, 2023 4:09 PM
> To: Di Zhao OS <dizhao@os.amperecomputing.com>
> Cc: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>; gcc-
> patches@gcc.gnu.org
> Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> reduce cross backedge FMA
> 
> On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS
> <dizhao@os.amperecomputing.com> wrote:
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: Richard Biener <richard.guenther@gmail.com>
> > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > To: Jeff Law <jeffreyalaw@gmail.com>; Martin Jambor <mjambor@suse.cz>
> > > Cc: Di Zhao OS <dizhao@os.amperecomputing.com>; gcc-patches@gcc.gnu.org
> > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc
> to
> > > reduce cross backedge FMA
> > >
> > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
> > > <gcc-patches@gcc.gnu.org> wrote:
> > > >
> > > >
> > > >
> > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > >
> > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > Swap the operands so that more FMAs can be generated.
> > > > > After widening_mul the result looks like:
> > > > >
> > > > >     _1 = .FMA(a, b, acc_0);
> > > > >     acc_1 = .FMA(c, d, _1);
> > > > >
> > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > widening_mul's result was like:
> > > > >
> > > > >     _1 = a * b;
> > > > >     _2 = .FMA (c, d, _1);
> > > > >     acc_1 = acc_0 + _2;
> > >
> > > How can we execute the multiply and the FMA in parallel?  They
> > > depend on each other.  Or is it the uarch can handle dependence
> > > on the add operand but only when it is with a multiplication and
> > > not a FMA in some better ways?  (I'd doubt so much complexity)
> > >
> > > Can you explain in more detail how the uarch executes one vs. the
> > > other case?

Here's my understanding after consulting our hardware team. For the
second case, the uarch of some out-of-order processors can calculate
"_2" of several iterations at the same time, since there's no dependency
among them. For the first case, in contrast, the next iteration has to
wait for the current one to finish so that "acc_0"'s value is known. I
assume this is also the case on some i386 processors, since I saw that
the patch "Deferring FMA transformations in tight loops" also changed
the corresponding files.

> > >
> > > > > If the code fragment is in a loop, some architecture can execute
> > > > > the latter in parallel, so the performance can be much faster than
> > > > > the former. For the small testcase, the performance gap is over
> > > > > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > > > > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > > > > much as possible. (If we are rewriting the op list into parallel,
> > > > > no special treatment is needed, since the last statement after
> > > > > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> > > > >
> > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > state of FMA candidates.
> > > > >
> > > > > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > > > > run on ampere1. The compile options are:
> > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > >
> > > > > Best regards,
> > > > > Di Zhao
> > > > >
> > > > > ----
> > > > >
> > > > >          PR tree-optimization/110279
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > >          describe the state of FMA candidates for an op list.
> > > > >          (rewrite_expr_tree_parallel): Changed boolean
> > > > >          parameter to enum type.
> > > > >          (result_feeds_back_from_phi_p): New function to check
> > > > >          for cross backedge dependency.
> > > > >          (rank_ops_for_fma): Return enum fma_state. Added new
> > > > >          parameter.
> > > > >          (reassociate_bb): If there's backedge dependency in an
> > > > >          op list, swap the operands before rewrite_expr_tree.
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >          * gcc.dg/pr110279.c: New test.
> > > > Not a review, but more of a question -- isn't this transformation's
> > > > profitability uarch sensitive.  ie, just because it's bad for a set of
> > > > aarch64 uarches, doesn't mean it's bad everywhere.
> > > >
> > > > And in general we shy away from trying to adjust gimple code based on
> > > > uarch preferences.
> > > >
> > > > It seems the right place to do this is gimple->rtl expansion.
> > >
> > > Another comment is that FMA forming has this deferring code which I
> > > think deals exactly with this kind of thing?  CCing Martin who did this
> > > work based on AMD uarchs also not wanting cross-loop dependences
> > > on FMAs (or so).  In particular I see
> > >
> > >   if (fma_state.m_deferring_p
> > >       && fma_state.m_initial_phi)
> > >     {
> > >       gcc_checking_assert (fma_state.m_last_result);
> > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > >                                                  &m_last_result_set))
> > >         cancel_fma_deferring (&fma_state);
> > >
> > > and I think code to avoid FMAs in other/related cases should be here
> > > as well, like avoid forming back-to-back FMAs.
> >
> > The changes in this patch is controlled by "param_avoid_fma_max_bits", so
> > I think it should only affect architectures with similar behavior. (The
> > parameter was added in a previous patch "Deferring FMA transformations
> > in tight loops", which seems to be dealing with the same issue.)
> 
> That's what I said.  Is the pipeline behavior properly modeled by the
> scheduler description?  Your patch seems to not only affect loops
> but the FMA forming case already present does - is the case you are
> after not in a tight loop?  In principle with a loop with enough scheduling
> freedom this should be a non-issue - unless the pipeline model isn't
> accurate and we put the FMAs too close together?

The current deferring FMA scheme discards all the FMA candidates in an
FMA chain that has a loop dependency. So for this case too few (zero)
FMAs are generated, while before the "Handle FMA friendly..." patch, or
after this patch, we can still have one FMA. I also tried changing the
deferring scheme so that only the last FMA is discarded, but that still
can't solve the regression. I think that's because the deferring scheme
lives in the later widening_mul pass, which only decides whether to form
FMAs but doesn't manipulate the tree for better data parallelism. So it
seems better to optimize for data parallelism (by swapping the operands)
in reassoc when we know there can be FMAs with a loop dependency, and
leave the current deferring scheme as a last resort?
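At the source level, the proposed swap amounts to reordering the op list
so the loop-carried value becomes the outermost addend. A hedged sketch
(hypothetical function names; whether reassoc emits exactly these shapes
depends on the patch):

```c
/* Before swap: acc is an inner operand, so widening_mul forms two
   chained FMAs ending in  acc = .FMA(c, d, _1).  */
double before_swap(double a, double b, double c, double d, double acc)
{
    return (acc + a * b) + c * d;
}

/* After swap: acc is added last, so the final statement stays a
   PLUS_EXPR and at most one FMA feeds it.  */
double after_swap(double a, double b, double c, double d, double acc)
{
    return (a * b + c * d) + acc;
}
```

With -Ofast the reassociation is allowed anyway; the swap only chooses
which association widening_mul later sees.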

> 
> Richard.
> 
> > > Richard.
> > >
> > > > Jeff
> >


Thanks,
Di Zhao

Thread overview: 11+ messages
2023-08-28  8:17 Di Zhao OS
2023-08-28 23:22 ` Jeff Law
2023-08-29  7:41   ` Richard Biener
2023-08-29  7:49     ` Di Zhao OS
2023-08-29  8:09       ` Richard Biener
2023-08-29  8:58         ` Di Zhao OS [this message]
2023-08-29 11:11           ` Richard Biener
2023-08-30  9:33             ` Di Zhao OS
2023-08-31 12:22               ` Richard Biener
2023-09-04 10:33                 ` Di Zhao OS
2023-08-29 12:34     ` Jeff Law
