Date: Tue, 12 Dec 2023 20:08:23 +0300 (MSK)
From: Alexander Monakov
To: Richard Biener
cc: Jan Hubicka, gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic
On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka wrote:
> >
> > Hi,
> > this patch disables use of FMA in the matrix multiplication loop for
> > generic (for x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold
> > 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication
> > microbenchmark (attached) and spec2k17, where the difference was within
> > noise for Core.
> >
> > On Core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >      578,500,241      cycles:u         #  3.645 GHz             ( +- 0.12% )
> >      753,318,477      instructions:u   #  1.30  insn per cycle  ( +- 0.00% )
> >      125,417,701      branches:u       #  790.227 M/sec         ( +- 0.00% )
> >         0.159146 +- 0.000363 seconds time elapsed               ( +- 0.23% )
> >
> > No FMA:
> >
> >      577,573,960      cycles:u         #  3.514 GHz             ( +- 0.15% )
> >      878,318,479      instructions:u   #  1.52  insn per cycle  ( +- 0.00% )
> >      125,417,702      branches:u       #  763.035 M/sec         ( +- 0.00% )
> >         0.164734 +- 0.000321 seconds time elapsed               ( +- 0.19% )
> >
> > So the cycle count is unchanged, and discrete multiply+add takes the
> > same time as FMA.
> >
> > While on zen:
> >
> > With FMA:
> >      484,875,179      cycles:u         #  3.599 GHz             ( +- 0.05% )  (82.11%)
> >      752,031,517      instructions:u   #  1.55  insn per cycle
> >      125,106,525      branches:u       #  928.712 M/sec         ( +- 0.03% )  (85.09%)
> >          128,356      branch-misses:u  #  0.10% of all branches ( +- 0.06% )  (83.58%)
> >
> > No FMA:
> >      375,875,209      cycles:u         #  3.592 GHz             ( +- 0.08% )  (80.74%)
> >      875,725,341      instructions:u   #  2.33  insn per cycle
> >      124,903,825      branches:u       #  1.194 G/sec           ( +- 0.04% )  (84.59%)
> >         0.105203 +- 0.000188 seconds time elapsed              ( +- 0.18% )
> >
> > The difference is that Cores understand that fmadd does not need all
> > three parameters to start computation, while Zen cores don't.
>
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when, as
> in the testcase, the longest input latency is on the multiplication from
> a memory load.

The latency from the memory operand doesn't matter since it's not a part of
the critical path.  The memory uop of the FMA starts executing as soon as
the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple: on Zen 4, FMA has latency 4 while add has latency 3, and you
can see it clearly in the quoted numbers: zen-with-fma runs at slightly
below 4 cycles per branch, zen-without-fma at exactly 3 cycles per branch.
Please refer to uops.info for latency data:

https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet? I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, none of the x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
apparently being the sole exception (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle. So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL
> ops to execute? That works independently of whether FMAs have a
> head-start on multiplication, as you'd still be bottle-necked on the
> 2-wide issue for FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other, so
the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four. So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems a noticeable win on Zen and not a loss on Core, it
> > seems like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!

Thanks for raising this; hopefully my explanation clears it up.

Alexander