From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Biener
Date: Fri, 6 Oct 2023 11:33:06 +0200
Subject: Re: [PATCH v4] [tree-optimization/110279] Consider FMA in get_reassociation_width
To: Di Zhao OS
Cc: "gcc-patches@gcc.gnu.org"
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Thu, Sep 14, 2023 at 2:43 PM Di Zhao OS wrote:
>
> This is a new version of the patch on "nested FMA".
> Sorry for updating this after so long; I've been studying and
> writing micro cases to sort out the cause of the regression.

Sorry for taking so long to reply.

> First, following previous discussion:
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629080.html)
>
> 1. From testing more altered cases, I don't think the
> problem is that reassociation works locally.  In that:
>
> 1) On the example with multiplications:
>
>       tmp1 = a + c * c + d * d + x * y;
>       tmp2 = x * tmp1;
>       result += (a + c + d + tmp2);
>
> Given that "result" is rewritten with width=2, the performance
> is worse if we rewrite "tmp1" with width=2.  In contrast, if we
> remove the multiplications from the example (and make "tmp1"
> not single-use), and still rewrite "result" with width=2, then
> rewriting "tmp1" with width=2 is better.  (This makes sense
> because the tree's depth at "result" is still smaller if we
> rewrite "tmp1".)
>
> 2) I tried to modify the assembly code of the example without
> FMA, so that the width of "result" is 4.  On Ampere1 there's no
> obvious improvement.  So although this is an interesting
> problem, it doesn't seem to be the cause of the regression.

OK, I see.

> 2. From the assembly code of the case with FMA, one problem is
> that rewriting "tmp1" to parallel didn't decrease the minimum
> CPU cycles (taking MULT_EXPRs into account), but increased code
> size, so the overhead is increased.
>
> a) When "tmp1" is not rewritten to parallel:
>
>       fmadd d31, d2, d2, d30
>       fmadd d31, d3, d3, d31
>       fmadd d31, d4, d5, d31  // "tmp1"
>       fmadd d31, d31, d4, d3
>
> b) When "tmp1" is rewritten to parallel:
>
>       fmul  d31, d4, d5
>       fmadd d27, d2, d2, d30
>       fmadd d31, d3, d3, d31
>       fadd  d31, d31, d27     // "tmp1"
>       fmadd d31, d31, d4, d3
>
> For version a), there are 3 dependent FMAs to calculate "tmp1".
> For version b), there are also 3 dependent instructions on the
> longer path: the 1st, 3rd and 4th.

Yes, it doesn't really change anything.  The patch has

+  /* If there's code like "acc = a * b + c * d + acc" in a tight loop, some
+     uarchs can execute results like:
+
+       _1 = a * b;
+       _2 = .FMA (c, d, _1);
+       acc_1 = acc_0 + _2;
+
+     in parallel, while turning it into
+
+       _1 = .FMA (a, b, acc_0);
+       acc_1 = .FMA (c, d, _1);
+
+     hinders that, because then the first FMA depends on the result
+     of preceding iteration.  */

I can't see what can be run in parallel for the first case.  The
.FMA depends on the multiplication a * b.  Iff the uarch somehow
decomposes .FMA into multiply + add, then the c * d multiply could
run in parallel with the a * b multiply, which _might_ be able to
hide some of the latency of the full .FMA.  On x86 Zen, for
example, FMA has a latency of 4 cycles but a multiply only 3.  But
I never got confirmation from any of the CPU designers that .FMAs
are issued when the multiply operands are ready and the add
operand can be forwarded.

I also wonder why the multiplications of the two-FMA sequence then
cannot be executed at the same time, so I have some doubt about
the theory above.

Iff this really is the reason for the sequence to execute with
lower overall latency and we want to attack this on GIMPLE, then I
think we need a target hook telling us this fact (I also wonder
whether such behavior can be modeled in the scheduler pipeline
description at all?).
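For what it's worth, the loop-carried critical paths of the two
forms are easy to compare with a back-of-the-envelope model.  The
sketch below uses made-up latencies (FMA 4, multiply 3, add 2 --
loosely the Zen numbers above, not any real pipeline description)
and simply *assumes* the non-loop-carried multiply and FMA can
overlap across iterations, which is exactly the unconfirmed part:

  /* Toy latency model for the two sequences in the patch comment.
     All latencies are hypothetical; "form 1" is mul + .FMA + add,
     "form 2" is the two chained .FMAs.  */
  #include <stdio.h>

  enum { MUL_LAT = 3, ADD_LAT = 2, FMA_LAT = 4 };

  int
  main (void)
  {
    int iters = 100;

    /* Form 1: _1 = a * b; _2 = .FMA (c, d, _1); acc_1 = acc_0 + _2;
       The multiply and the FMA do not depend on acc, so if they can
       overlap across iterations, only the final add is loop-carried
       and the mul + FMA latency is paid once as lead-in.  */
    int form1 = MUL_LAT + FMA_LAT + iters * ADD_LAT;

    /* Form 2: _1 = .FMA (a, b, acc_0); acc_1 = .FMA (c, d, _1);
       acc threads through both FMAs, so the whole 2 * FMA_LAT
       chain is loop-carried.  */
    int form2 = iters * 2 * FMA_LAT;

    printf ("form 1 (mul+fma+add): %d cycles / %d iterations\n",
            form1, iters);
    printf ("form 2 (fma+fma):     %d cycles / %d iterations\n",
            form2, iters);
    return 0;
  }

Under that (unverified) overlap assumption form 1 wins by a wide
margin; without it, form 1 pays the full mul + FMA + add chain per
iteration and the two-FMA form is even marginally better.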
> So it seems to me the current get_reassociation_width algorithm
> isn't optimal in the presence of FMA.  So I modified the patch
> to improve get_reassociation_width, rather than check for code
> patterns.  (Although there could be other complicated factors
> that make the regression more obvious when there's "nested FMA",
> with this patch that should be avoided or reduced.)
>
> With this patch the 508.namd_r 1-copy run has a 7% improvement
> on Ampere1, and on Intel Xeon there's about 3%.  While I'm still
> collecting data on other CPUs, I'd like to know what you think
> of this.
>
> About changes in the patch:
>
> 1. When the op list forms a complete FMA chain, try to search
> for a smaller width, considering the benefit of using FMA.  With
> a smaller width, the increase in code size is smaller when
> breaking the chain.

But this is all highly target specific (code size even more so).

The way I understand your approach to fixing the issue leads me to
suggest prioritizing the parallel rewriting instead: alter
rank_ops_for_fma, taking the reassoc width into account (the
computed width should be unchanged from rank_ops_for_fma), rather
than "fixing up" the parallel rewriting of FMAs (well, they are
not yet formed at that point, of course).

get_reassociation_width has 'get_required_cycles'; the above
theory could be verified with a very simple toy pipeline model.
We'd have to ask the target for the reassoc width for MULT_EXPRs
as well (or maybe even FMA_EXPRs).

Taking the width of FMAs into account when computing the reassoc
width might be another way to attack this.
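Concretely, something along these lines is what I'd call a toy
model -- entirely made-up latencies again, optimistic ILP across
the partial-sum chains, and purely illustrative (this is not the
actual get_required_cycles logic):

  /* Toy "required cycles" estimate for rewriting a chain of n
     fused multiply-adds at reassociation width w.  Hypothetical
     numbers throughout.  */
  #include <stdio.h>

  enum { MUL_LAT = 3, ADD_LAT = 2, FMA_LAT = 4 };

  static int
  toy_required_cycles (int n, int w)
  {
    int per_chain = (n + w - 1) / w;  /* ceil (n / w) ops per partial sum */
    /* Each partial-sum chain starts with a plain multiply (FMA
       fusion is lost there) and continues with dependent FMAs;
       the w chains are assumed to run fully in parallel.  */
    int chain = MUL_LAT + (per_chain - 1) * FMA_LAT;
    /* Combine the w partial sums with a serial tail of adds.  */
    return chain + (w - 1) * ADD_LAT;
  }

  int
  main (void)
  {
    for (int w = 1; w <= 8; w *= 2)
      printf ("n=8 multiply-adds, width %d -> %d cycles\n",
              w, toy_required_cycles (8, w));
    return 0;
  }

With these numbers the latency bottoms out at width 4 and gets
worse again at width 8, while every increase in width adds unfused
multiplies and a longer final reduction -- which is the cycles
vs. code size trade-off your smaller-width search is about.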
> 2. To avoid regressions, included the other patch
> (https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629203.html)
> on this tracker again.  This is because more FMAs will be kept
> with 1., so we need to rule out the loop-dependent FMA chains
> when param_avoid_fma_max_bits is set.

Sorry again for taking so long to reply.

I'll note we have an odd case on x86 Zen2(?) as well which we
don't really understand from a CPU behavior perspective.

Thanks,
Richard.

> Thanks,
> Di Zhao
>
> ----
>
> PR tree-optimization/110279
>
> gcc/ChangeLog:
>
>         * tree-ssa-reassoc.cc (rank_ops_for_better_parallelism_p):
>         New function to check whether ranking the ops results in
>         better parallelism.
>         (get_reassociation_width): Add new parameters.  Search for
>         smaller width considering the benefit of FMA.
>         (rank_ops_for_fma): Change return value to be number of
>         MULT_EXPRs.
>         (reassociate_bb): For 3 ops, refine the condition to call
>         swap_ops_for_binary_stmt.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/pr110279.c: New test.