From: Alexander Monakov <amonakov@ispras.ru>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Jan Hubicka <hubicka@ucw.cz>,
gcc-patches@gcc.gnu.org, hongtao.liu@intel.com,
hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic
Date: Tue, 12 Dec 2023 20:08:23 +0300 (MSK)
Message-ID: <b07dd364-3cef-f299-160c-388355dbab9c@ispras.ru>
In-Reply-To: <CAFiYyc2PNALM4j3mTzHM9raqgCh-Zk3RZk2_68jQbnOraG02sw@mail.gmail.com>
On Tue, 12 Dec 2023, Richard Biener wrote:
> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hi,
> > this patch disables use of FMA in matrix multiplication loop for generic (for
> > x86-64-v3) and zen4. I tested this on zen4 and Xeon Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> > 578,500,241 cycles:u # 3.645 GHz ( +- 0.12% )
> > 753,318,477 instructions:u # 1.30 insn per cycle ( +- 0.00% )
> > 125,417,701 branches:u # 790.227 M/sec ( +- 0.00% )
> > 0.159146 +- 0.000363 seconds time elapsed ( +- 0.23% )
> >
> >
> > No FMA:
> >
> > 577,573,960 cycles:u # 3.514 GHz ( +- 0.15% )
> > 878,318,479 instructions:u # 1.52 insn per cycle ( +- 0.00% )
> > 125,417,702 branches:u # 763.035 M/sec ( +- 0.00% )
> > 0.164734 +- 0.000321 seconds time elapsed ( +- 0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> > 484875179 cycles:u # 3.599 GHz ( +- 0.05% ) (82.11%)
> > 752031517 instructions:u # 1.55 insn per cycle
> > 125106525 branches:u # 928.712 M/sec ( +- 0.03% ) (85.09%)
> > 128356 branch-misses:u # 0.10% of all branches ( +- 0.06% ) (83.58%)
> >
> > No FMA:
> > 375875209 cycles:u # 3.592 GHz ( +- 0.08% ) (80.74%)
> > 875725341 instructions:u # 2.33 insn per cycle
> > 124903825 branches:u # 1.194 G/sec ( +- 0.04% ) (84.59%)
> > 0.105203 +- 0.000188 seconds time elapsed ( +- 0.18% )
> >
> > The difference is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
>
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
> I can't understand how this uarch detail can affect performance when as in
> the testcase the longest input latency is on the multiplication from a
> memory load.
The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.
> Do we actually understand _why_ the FMAs are slower here?
It's simple: on Zen4 FMA has latency 4 while add has latency 3, and you
can clearly see it in the quoted numbers: zen-with-fma has slightly below 4
cycles per branch, zen-without-fma has almost exactly 3 cycles per branch.
Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html
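
The cycles-per-branch figures can be rechecked from the quoted perf
counters. A minimal sketch, assuming one counted branch corresponds to
roughly one inner-loop iteration; the function names are made up for
illustration:

```c
/* Cycles spent per retired branch, computed from two perf counters. */
static double cycles_per_branch(double cycles, double branches)
{
    return cycles / branches;
}

/* Counter values copied from the quoted Zen runs. */
static double zen_with_fma(void)
{
    return cycles_per_branch(484875179.0, 125106525.0); /* about 3.88 */
}

static double zen_without_fma(void)
{
    return cycles_per_branch(375875209.0, 124903825.0); /* about 3.01 */
}
```

The first ratio sits just under the FMA latency of 4 and the second
just above the add latency of 3.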
> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet? I'm curious how you set up a micro benchmark to
> measure this.
Unlike some of the Arm cores, none of the x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
being the sole exception, apparently (see 6/10/10 latencies in the
aforementioned uops.info FMA page).
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle. So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL ops
> to execute? That works independent on whether FMAs have a head-start on
> multiplication as you'd still be bottle-necked on the 2-wide issue for
> FMA?
It doesn't matter here since all FMAs/FMULs are dependent on each other
so the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.
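
To spell out the dependency-chain argument, here is a rough model of an
accumulation of the form acc += a[i] * b[i]. It is only a sketch: the
struct and function names are hypothetical, and it ignores throughput
limits and the outer-loop restarts mentioned above.

```c
/* Hypothetical per-core latency table (in cycles); not a GCC structure. */
struct fp_latency {
    int fadd;
    int fmul;
    int fma;
};

/* Cycles that one iteration of acc += a[i] * b[i] adds to the
   loop-carried dependency chain through acc. */
static int chain_cycles_per_iter(struct fp_latency lat, int use_fma)
{
    /* Fused: the FMA cannot start until acc is ready, so its full
       latency is on the chain.
       Unfused: the multiply does not depend on acc and can run ahead,
       so only the add is on the chain. */
    return use_fma ? lat.fma : lat.fadd;
}
```

With the Zen4 latencies from this thread (add 3, multiply 3, FMA 4) the
model gives 4 cycles per iteration with FMA and 3 without.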
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four. So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.
A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency
exceeds FMUL latency (all Zens and Broadwell).
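
As a sketch of that rule (this is not GCC's actual tuning code, and the
struct and function names are hypothetical):

```c
/* Hypothetical description of a core's FP latencies, in cycles. */
struct core_latency {
    const char *name;
    int fmul;
    int fma;
};

/* Suggested rule: avoid FMA chains when FMA is slower than FMUL. */
static int should_avoid_fma_chains(const struct core_latency *core)
{
    return core->fma > core->fmul;
}
```

With the latencies quoted in this thread, Zen4 (FMUL 3, FMA 4) would
get the tuning enabled and Icelake (FMUL 4, FMA 4) would not.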
> > Since this seems noticeable win on zen and not loss on Core it seems like good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
Thanks for raising this, hopefully my explanation clears it up.
Alexander