From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <SN6PR01MB4240A22F29D390F5B96FD057E8E0A@SN6PR01MB4240.prod.exchangelabs.com>
 <4c3c9a1c-e182-30a9-342d-525adfb8cffd@gmail.com>
In-Reply-To: <4c3c9a1c-e182-30a9-342d-525adfb8cffd@gmail.com>
From: Richard Biener <richard.guenther@gmail.com>
Date: Tue, 29 Aug 2023 09:41:15 +0200
Message-ID: <CAFiYyc0bPqj_wiZSYPL-a3uHV+wg-=faGQJZPkgFLRV53x8P6A@mail.gmail.com>
Subject: Re: [PATCH] [tree-optimization/110279] swap operands in reassoc to
 reduce cross backedge FMA
To: Jeff Law <jeffreyalaw@gmail.com>, Martin Jambor <mjambor@suse.cz>
Cc: Di Zhao OS <dizhao@os.amperecomputing.com>, 
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
List-Id: <gcc-patches.gcc.gnu.org>

On Tue, Aug 29, 2023 at 1:23 AM Jeff Law via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
>
>
> On 8/28/23 02:17, Di Zhao OS via Gcc-patches wrote:
> > This patch tries to fix the 2% regression in 510.parest_r on
> > ampere1 in the tracker. (Previous discussion is here:
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-July/624893.html)
> >
> > 1. Add testcases for the problem. For an op list in the form of
> > "acc = a * b + c * d + acc", currently reassociation doesn't
> > swap the operands so that more FMAs can be generated.
> > After widening_mul the result looks like:
> >
> >     _1 = .FMA(a, b, acc_0);
> >     acc_1 = .FMA(c, d, _1);
> >
> > While previously (before the "Handle FMA friendly..." patch),
> > widening_mul's result was like:
> >
> >     _1 = a * b;
> >     _2 = .FMA (c, d, _1);
> >     acc_1 = acc_0 + _2;

How can we execute the multiply and the FMA in parallel?  They
depend on each other.  Or is it that the uarch handles a dependence
on the add operand better when the producer is a multiplication
rather than an FMA?  (I doubt there is that much complexity.)

Can you explain in more detail how the uarch executes one vs. the
other case?

> > If the code fragment is in a loop, some architectures can execute
> > the latter in parallel, so it can be much faster than
> > the former. For the small testcase, the performance gap is over
> > 10% on both ampere1 and neoverse-n1. So the point here is to avoid
> > turning the last statement into FMA, and keep it a PLUS_EXPR as
> > much as possible. (If we are rewriting the op list into parallel,
> > no special treatment is needed, since the last statement after
> > rewrite_expr_tree_parallel will be PLUS_EXPR anyway.)
> >
> > 2. The new function result_feeds_back_from_phi_p checks for a
> > cross-backedge dependency. A new enum fma_state describes the
> > state of FMA candidates.
> >
> > With this patch, there's a 3% improvement in 510.parest_r 1-copy
> > run on ampere1. The compile options are:
> > "-Ofast -mcpu=ampere1 -flto --param avoid-fma-max-bits=512".
> >
> > Best regards,
> > Di Zhao
> >
> > ----
> >
> >          PR tree-optimization/110279
> >
> > gcc/ChangeLog:
> >
> >          * tree-ssa-reassoc.cc (enum fma_state): New enum to
> >          describe the state of FMA candidates for an op list.
> >          (rewrite_expr_tree_parallel): Changed boolean
> >          parameter to enum type.
> >          (result_feeds_back_from_phi_p): New function to check
> >          for cross backedge dependency.
> >          (rank_ops_for_fma): Return enum fma_state. Added new
> >          parameter.
> >          (reassociate_bb): If there's backedge dependency in an
> >          op list, swap the operands before rewrite_expr_tree.
> >
> > gcc/testsuite/ChangeLog:
> >
> >          * gcc.dg/pr110279.c: New test.
> Not a review, but more of a question -- isn't this transformation's
> profitability uarch sensitive?  I.e., just because it's bad for a set of
> aarch64 uarches doesn't mean it's bad everywhere.
>
> And in general we shy away from trying to adjust gimple code based on
> uarch preferences.
>
> It seems the right place to do this is gimple->rtl expansion.

Another comment: FMA forming already has deferring code which I
think deals with exactly this kind of thing.  CCing Martin, who did this
work because AMD uarchs also do not want cross-loop dependences
on FMAs (or so).  In particular I see

  if (fma_state.m_deferring_p
      && fma_state.m_initial_phi)
    {
      gcc_checking_assert (fma_state.m_last_result);
      if (!last_fma_candidate_feeds_initial_phi (&fma_state,
                                                 &m_last_result_set))
        cancel_fma_deferring (&fma_state);

and I think code to avoid FMAs in other, related cases should live here
as well, for example to avoid forming back-to-back FMAs.

Richard.

> Jeff