From: Richard Biener
Date: Tue, 12 Dec 2023 16:01:39 +0100
Subject: Re: Disable FMADD in chains for Zen4 and generic
To: Jan Hubicka
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com

On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for
> generic (for x86-64-v3) and zen4.  I tested this on Zen4 and a Xeon
> Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication
> micro-benchmark (attached) and on spec2k17, where the difference was
> within noise for Core.
>
> On Core the micro-benchmark runs as follows:
>
> With FMA:
>
>      578,500,241      cycles:u          #    3.645 GHz             ( +-  0.12% )
>      753,318,477      instructions:u    #    1.30  insn per cycle  ( +-  0.00% )
>      125,417,701      branches:u        #  790.227 M/sec           ( +-  0.00% )
>         0.159146 +- 0.000363 seconds time elapsed                  ( +-  0.23% )
>
> No FMA:
>
>      577,573,960      cycles:u          #    3.514 GHz             ( +-  0.15% )
>      878,318,479      instructions:u    #    1.52  insn per cycle  ( +-  0.00% )
>      125,417,702      branches:u        #  763.035 M/sec           ( +-  0.00% )
>         0.164734 +- 0.000321 seconds time elapsed                  ( +-  0.19% )
>
> So the cycle count is unchanged and the discrete multiply+add takes
> the same time as FMA.
>
> While on Zen:
>
> With FMA:
>      484875179      cycles:u          #    3.599 GHz               ( +-  0.05% )  (82.11%)
>      752031517      instructions:u    #    1.55  insn per cycle
>      125106525      branches:u        #  928.712 M/sec             ( +-  0.03% )  (85.09%)
>         128356      branch-misses:u   #    0.10% of all branches   ( +-  0.06% )  (83.58%)
>
> No FMA:
>      375875209      cycles:u          #    3.592 GHz               ( +-  0.08% )  (80.74%)
>      875725341      instructions:u    #    2.33  insn per cycle
>      124903825      branches:u        #    1.194 G/sec             ( +-  0.04% )  (84.59%)
>       0.105203 +- 0.000188 seconds time elapsed                    ( +-  0.18% )
>
> The difference is that Core understands that fmadd does not need all
> three parameters to start the computation, while Zen cores don't.

This came up in a separate thread as well, when doing reassoc of a
chain with multiple dependent FMAs.

I can't understand how this uarch detail can affect performance when,
as in the testcase, the longest input latency is on the multiplication
from a memory load.  Do we actually understand _why_ the FMAs are
slower here?  Do we know that Cores can start the multiplication part
when the add operand isn't ready yet?  I'm curious how you set up a
micro-benchmark to measure this.

There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMAs
per cycle.  So in theory we can do at most 2 FMAs per cycle, but with
latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
able to squeeze out a little bit more throughput when there are many
FADD/FMUL ops to execute?  That works independently of whether FMAs
get a head start on the multiplication, as you'd still be bottlenecked
on the 2-wide issue for FMA?

On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
a latency of four.  So you should get worse results there (looking at
the numbers above you do get slightly worse results); probably the
higher number of uops is hidden by the latency.

> Since this seems a noticeable win on Zen and no loss on Core, it
> seems like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.

complaint!

Richard.
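For concreteness, here is a minimal sketch of the kind of latency-bound
kernel that "how did you measure this" question is after -- a
hypothetical example, not the benchmark attached below.  Only the
accumulator is loop-carried; the loads and the multiply are independent
of it, so comparing a build that allows contraction (e.g.
-ffp-contract=fast) with one built with -ffp-contract=off isolates
whether fusing lengthens the serial chain.  The 4-cycle FMA / 3-cycle
FADD numbers in the comment are the Zen3/4 latencies quoted above.

/* Hypothetical latency-chain kernel, not part of the original mail.
   acc is the only loop-carried value: the loads and the multiply do
   not depend on it, so if the core can start the multiply before acc
   is ready, fusing into an FMA should not lengthen the chain; if it
   cannot, the chain grows from the ~3-cycle FADD to the ~4-cycle FMA
   latency quoted above for Zen3/4.  */
#include <stdio.h>
#include <time.h>

#define ITERS 200000000L

int main(void)
{
  float va[16], vb[16], acc = 0.0f;
  int i;

  for (i = 0; i < 16; i++)
    {
      va[i] = 1.0f + i * 0.001f;
      vb[i] = 1.0f - i * 0.001f;
    }

  clock_t s = clock();
  for (long k = 0; k < ITERS; k++)
    acc += va[k & 15] * vb[k & 15];   /* serial dependence on acc only */
  clock_t e = clock();

  printf("acc=%f, chain took %10d clocks\n", acc, (int)(e - s));
  return 0;
}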
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          for(k=0; k<SIZE; k++)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s = clock();
>    mult();
>    e = clock();
>    printf(" mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
> }
>
>       * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
>       X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
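To spell out the dependence structure those tunings target, here is a
reduced, hypothetical form of the hot statement in mult() above.  It is
not part of the patch, and the instruction names and latencies in the
comments are assumptions taken from the discussion above rather than
the output of an actual build.

/* Reduced form of the c[i][j] += a[i][k] * b[k][j] reduction in mult()
   (hypothetical illustration only).  */
float dot(const float *x, const float *y, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    /* Fused (e.g. vfmadd231ss): each iteration's FMA consumes the
       previous acc, so with the 4-cycle Zen3/4 FMA latency the chain
       allows roughly one iteration every 4 cycles.
       Split (vmulss + vaddss): the multiply does not depend on acc and
       can issue ahead of it; only the ~3-cycle add stays loop-carried,
       which matches the roughly 4:3 ratio of the Zen4 cycle counts
       above (~485M with FMA vs. ~376M without).  */
    acc += x[k] * y[k];
  return acc;
}

Whether the compiler actually keeps the multiply and the add separate
here is decided by the FMA-formation heuristics these tuning bits feed,
so the sketch is only meant to illustrate why the split can win on Zen.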