From: Richard Biener
Date: Tue, 29 Aug 2023 13:11:10 +0200
Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to reduce cross backedge FMA
To: Di Zhao OS
Cc: Jeff Law, Martin Jambor, gcc-patches@gcc.gnu.org

On Tue, Aug 29, 2023 at 10:59 AM Di Zhao OS wrote:
>
> Hi,
>
> > -----Original Message-----
> > From: Richard Biener
> > Sent: Tuesday, August 29, 2023 4:09 PM
> > To: Di Zhao OS
> > Cc: Jeff Law; Martin Jambor; gcc-patches@gcc.gnu.org
> > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > reduce cross backedge FMA
> >
> > On Tue, Aug 29, 2023 at 9:49 AM Di Zhao OS wrote:
> > >
> > > Hi,
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener
> > > > Sent: Tuesday, August 29, 2023 3:41 PM
> > > > To: Jeff Law; Martin Jambor
> > > > Cc: Di Zhao OS; gcc-patches@gcc.gnu.org
> > > > Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
> > > > reduce cross backedge FMA
> > > >
> > > > On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches wrote:
> > > > >
> > > > > On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > > > > > This patch tries to fix the 2% regression in 510.parest_r on
> > > > > > ampere1 in the tracker. (Previous discussion is here:
> > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> > > > > >
> > > > > > 1. Add testcases for the problem. For an op list in the form of
> > > > > > "acc = a * b + c * d + acc", currently reassociation doesn't
> > > > > > swap the operands so that more FMAs can be generated.
> > > > > > After widening_mul the result looks like:
> > > > > >
> > > > > >     _1 = .FMA(a, b, acc_0);
> > > > > >     acc_1 = .FMA(c, d, _1);
> > > > > >
> > > > > > While previously (before the "Handle FMA friendly..." patch),
> > > > > > widening_mul's result was like:
> > > > > >
> > > > > >     _1 = a * b;
> > > > > >     _2 = .FMA (c, d, _1);
> > > > > >     acc_1 = acc_0 + _2;
> > > >
> > > > How can we execute the multiply and the FMA in parallel?  They
> > > > depend on each other.  Or is it that the uarch can handle a
> > > > dependence on the add operand in some better way, but only when it
> > > > comes from a multiplication and not an FMA?  (I'd doubt so much
> > > > complexity)
> > > >
> > > > Can you explain in more detail how the uarch executes one vs. the
> > > > other case?
>
> Here's my understanding, after consulting our hardware team. For the
> second case, the uarch of some out-of-order processors can calculate
> "_2" of several iterations at the same time, since there's no dependency
> among different iterations. For the first case, however, the next
> iteration has to wait for the current iteration to finish, so that
> "acc_0"'s value is known. I assume this is also the case for some i386
> processors, since I saw the patch "Deferring FMA transformations in
> tight loops" also changed the corresponding files.

That should be true for all kinds of operations, no?  Thus it means
reassoc should in general associate cross-iteration accumulation
last?
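To make the dependence shapes concrete, here is a minimal C sketch of the accumulation pattern under discussion (hypothetical code, not taken from the patch or its testcase):

```c
#include <stddef.h>

/* The op list "acc = a*b + c*d + acc" from the patch description.
   If both additions become FMAs, iteration i+1's first FMA consumes
   acc from iteration i, serializing the whole chain.  With
   mul + FMA + plain add, the multiply and FMA of the next iteration
   are independent of the previous iteration's final add, so an
   out-of-order core can overlap iterations.  */
double dot4(const double *a, const double *b,
            const double *c, const double *d, size_t n)
{
  double acc = 0.0;
  for (size_t i = 0; i < n; i++)
    acc = a[i] * b[i] + c[i] * d[i] + acc;
  return acc;
}
```

Compiled with -Ofast on an FMA-capable target, this is the shape where the operand order chosen by reassoc decides whether the loop-carried add turns into a second chained FMA.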
Historically we associated those first because that's how the
vectorizer liked to see them, but I think that's no longer necessary.
It should be achievable by properly biasing the operand during rank
computation (don't we already do that?).

> > > > > > If the code fragment is in a loop, some architectures can
> > > > > > execute the latter in parallel, so the performance can be much
> > > > > > faster than the former. For the small testcase, the performance
> > > > > > gap is over 10% on both ampere1 and neoverse-n1. So the point
> > > > > > here is to avoid turning the last statement into an FMA, and to
> > > > > > keep it a PLUS_EXPR as much as possible. (If we are rewriting
> > > > > > the op list into parallel form, no special treatment is needed,
> > > > > > since the last statement after rewrite_expr_tree_parallel will
> > > > > > be a PLUS_EXPR anyway.)
> > > > > >
> > > > > > 2. Function result_feeds_back_from_phi_p is to check for cross
> > > > > > backedge dependency. Added new enum fma_state to describe the
> > > > > > state of FMA candidates for an op list.
> > > > > >
> > > > > > With this patch, there's a 3% improvement in the 510.parest_r
> > > > > > 1-copy run on ampere1. The compile options are:
> > > > > > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> > > > > >
> > > > > > Best regards,
> > > > > > Di Zhao
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > PR tree-optimization/110279
> > > > > >
> > > > > > gcc/ChangeLog:
> > > > > >
> > > > > >         * tree-ssa-reassoc.cc (enum fma_state): New enum to
> > > > > >         describe the state of FMA candidates for an op list.
> > > > > >         (rewrite_expr_tree_parallel): Changed boolean
> > > > > >         parameter to enum type.
> > > > > >         (result_feeds_back_from_phi_p): New function to check
> > > > > >         for cross backedge dependency.
> > > > > >         (rank_ops_for_fma): Return enum fma_state. Added new
> > > > > >         parameter.
> > > > > >         (reassociate_bb): If there's a backedge dependency in
> > > > > >         an op list, swap the operands before rewrite_expr_tree.
> > > > > >
> > > > > > gcc/testsuite/ChangeLog:
> > > > > >
> > > > > >         * gcc.dg/pr110279.c: New test.
> > > > >
> > > > > Not a review, but more of a question -- isn't this
> > > > > transformation's profitability uarch-sensitive?  I.e., just
> > > > > because it's bad for a set of aarch64 uarches doesn't mean it's
> > > > > bad everywhere.
> > > > >
> > > > > And in general we shy away from trying to adjust gimple code
> > > > > based on uarch preferences.
> > > > >
> > > > > It seems the right place to do this is gimple->rtl expansion.
> > > >
> > > > Another comment is that FMA forming has this deferring code which I
> > > > think deals exactly with this kind of thing?  CCing Martin who did
> > > > this work based on AMD uarchs also not wanting cross-loop
> > > > dependences on FMAs (or so).  In particular I see
> > > >
> > > >   if (fma_state.m_deferring_p
> > > >       && fma_state.m_initial_phi)
> > > >     {
> > > >       gcc_checking_assert (fma_state.m_last_result);
> > > >       if (!last_fma_candidate_feeds_initial_phi (&fma_state,
> > > >                                                  &m_last_result_set))
> > > >         cancel_fma_deferring (&fma_state);
> > > >
> > > > and I think code to avoid FMAs in other/related cases should be
> > > > here as well, like avoiding forming back-to-back FMAs.
> > >
> > > The changes in this patch are controlled by
> > > "param_avoid_fma_max_bits", so I think it should only affect
> > > architectures with similar behavior. (The parameter was added in a
> > > previous patch, "Deferring FMA transformations in tight loops",
> > > which seems to be dealing with the same issue.)
> >
> > That's what I said.  Is the pipeline behavior properly modeled by the
> > scheduler description?  Your patch seems to not only affect loops,
> > but the FMA forming case already present does - is the case you are
> > after not in a tight loop?
> > In principle, with a loop with enough scheduling freedom this should
> > be a non-issue - unless the pipeline model isn't accurate and we put
> > the FMAs too close together?
>
> The current deferring FMA scheme discards all the FMA candidates in the
> FMA chain that has a loop dependency. So for this case there will be
> too few (zero) FMAs generated; while before the "Handle FMA friendly..."
> patch, or after this patch, we can still have 1 FMA. I also tried
> changing the deferring FMA scheme so that only the last FMA is
> discarded, but it still can't solve the regression. I think that's
> because this deferring scheme is in the later pass widening_mul, which
> only decides whether to form FMAs but doesn't manipulate the tree for
> better data parallelism. So it seems better to optimize for data
> parallelism (swapping the operands) in reassoc, if we know there can be
> FMAs with a loop dependency, and to leave the current deferring FMA
> scheme as the last resort?
>
> > Richard.
> >
> > > Richard.
> > >
> > > > > Jeff
>
> Thanks,
> Di Zhao
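The rank-biasing idea Richard raises can be illustrated with a toy model. The struct, names, and ranks below are invented for illustration; GCC's real rank machinery lives in tree-ssa-reassoc.cc and is considerably more involved:

```c
#include <stdlib.h>

/* Toy model of biasing ranks so that a loop-carried (cross-backedge)
   operand is associated last.  Sorting by rank models reassoc's
   ordering of an op list; giving the PHI-fed operand the highest rank
   pushes it to the end of the chain, so the final statement stays a
   plain add instead of becoming a second chained FMA.  */
struct op { const char *name; int rank; };

static int cmp_rank(const void *a, const void *b)
{
  return ((const struct op *)a)->rank - ((const struct op *)b)->rank;
}

/* Bias the operand at index phi_fed above every other rank, then
   sort the list so it lands last.  */
static void rank_ops(struct op *ops, size_t n, size_t phi_fed)
{
  int max = 0;
  for (size_t i = 0; i < n; i++)
    if (ops[i].rank > max)
      max = ops[i].rank;
  ops[phi_fed].rank = max + 1;   /* cross-backedge operand wins */
  qsort(ops, n, sizeof *ops, cmp_rank);
}
```

For an op list {a*b, c*d, acc} where acc is fed by the loop PHI, sorting places acc last, mirroring the patch's goal of keeping the accumulation as the final PLUS_EXPR.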