From: Alexander Monakov <amonakov@ispras.ru>
To: Richard Biener <richard.guenther@gmail.com>
Cc: Jan Hubicka <hubicka@ucw.cz>,
	gcc-patches@gcc.gnu.org,  hongtao.liu@intel.com,
	hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic
Date: Tue, 12 Dec 2023 20:08:23 +0300 (MSK)
Message-ID: <b07dd364-3cef-f299-160c-388355dbab9c@ispras.ru>
In-Reply-To: <CAFiYyc2PNALM4j3mTzHM9raqgCh-Zk3RZk2_68jQbnOraG02sw@mail.gmail.com>

On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubicka@ucw.cz> wrote:
> >
> > Hi,
> > this patch disables use of FMA in the matrix multiplication loop for generic
> > (for x86-64-v3) and zen4.  I tested this on zen4 and on a Xeon Gold 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication microbenchmark
> > (attached) and spec2k17 where the difference was within noise for Core.
> >
> > On core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >        578,500,241      cycles:u                         #    3.645 GHz                         ( +-  0.12% )
> >        753,318,477      instructions:u                   #    1.30  insn per cycle              ( +-  0.00% )
> >        125,417,701      branches:u                       #  790.227 M/sec                       ( +-  0.00% )
> >           0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
> >
> >
> > No FMA:
> >
> >        577,573,960      cycles:u                         #    3.514 GHz                         ( +-  0.15% )
> >        878,318,479      instructions:u                   #    1.52  insn per cycle              ( +-  0.00% )
> >        125,417,702      branches:u                       #  763.035 M/sec                       ( +-  0.00% )
> >           0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
> >
> > So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.
> >
> > While on zen:
> >
> >
> > With FMA:
> >          484875179      cycles:u                         #    3.599 GHz                      ( +-  0.05% )  (82.11%)
> >          752031517      instructions:u                   #    1.55  insn per cycle
> >          125106525      branches:u                       #  928.712 M/sec                    ( +-  0.03% )  (85.09%)
> >             128356      branch-misses:u                  #    0.10% of all branches          ( +-  0.06% )  (83.58%)
> >
> > No FMA:
> >          375875209      cycles:u                         #    3.592 GHz                      ( +-  0.08% )  (80.74%)
> >          875725341      instructions:u                   #    2.33  insn per cycle
> >          124903825      branches:u                       #    1.194 G/sec                    ( +-  0.04% )  (84.59%)
> >           0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
> >
> > The difference is that Core understands that fmadd does not need all three
> > operands to start computation, while Zen cores don't.
> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.

> I can't understand how this uarch detail can affect performance when, as in
> the testcase, the longest input latency is on the multiplication from a
> memory load.

The latency from the memory operand doesn't matter since it's not a part
of the critical path. The memory uop of the FMA starts executing as soon
as the address is ready.
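
To make the dependence structure concrete, the inner loop being measured
boils down to a reduction of this shape (a schematic sketch, not the
attached benchmark source):

  /* When the compiler forms an FMA here, the loop-carried dependence
     runs through the FMA (4-cycle latency on Zen4); without FMA only
     the adds are loop-carried (3-cycle latency), and the multiplies,
     like the loads, stay off the critical path.  */
  float
  dot (const float *a, const float *b, int n)
  {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
      acc += a[i] * b[i];   /* contracted: acc = fma (a[i], b[i], acc) */
    return acc;
  }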

> Do we actually understand _why_ the FMAs are slower here?

It's simple: on Zen4 FMA has latency 4 while add has latency 3, and you can
clearly see it in the quoted numbers: Zen with FMA runs at slightly below 4
cycles per branch, Zen without FMA at almost exactly 3 cycles per branch.
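
(Concretely, dividing the quoted cycle counts by the branch counts:
484875179 / 125106525 ~= 3.88 cycles per iteration with FMA, and
375875209 / 124903825 ~= 3.01 cycles per iteration without.)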

Please refer to uops.info for latency data:
https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, no x86 core can consume the addend of an FMA
on a later cycle than the multiplicands, with the Alder Lake E-cores
apparently being the sole exception (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).
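
In case you want to measure it directly: a minimal sketch along the lines
of what uops.info does, comparing the latency of a chain carried through
the addend against a chain carried through a multiplicand (a toy example,
not the attached benchmark; the iteration count and the rdtsc timing are
arbitrary choices):

  /* gcc -O2 -mfma fma-lat.c; if the addend were consumed on a later
     cycle than the multiplicands, the first chain would show a lower
     per-FMA latency than the second.  */
  #include <stdio.h>
  #include <x86intrin.h>                /* __rdtsc */

  #define N 100000000L

  static double
  chain_through_addend (double a, double b, double acc)
  {
    for (long i = 0; i < N; i++)
      acc = __builtin_fma (a, b, acc);  /* carried dep via the addend */
    return acc;
  }

  static double
  chain_through_multiplicand (double x, double b, double c)
  {
    for (long i = 0; i < N; i++)
      x = __builtin_fma (x, b, c);      /* carried dep via a multiplicand */
    return x;
  }

  volatile double sink;

  int
  main (void)
  {
    unsigned long long t0 = __rdtsc ();
    sink = chain_through_addend (1.000001, 0.999999, 0.0);
    unsigned long long t1 = __rdtsc ();
    sink = chain_through_multiplicand (1.0, 0.999999, 1.0);
    unsigned long long t2 = __rdtsc ();
    /* TSC ticks are reference cycles, so compare the two figures
       against each other rather than reading them as core cycles.  */
    printf ("addend chain:       %.2f ticks/FMA\n", (double) (t1 - t0) / N);
    printf ("multiplicand chain: %.2f ticks/FMA\n", (double) (t2 - t1) / N);
    return 0;
  }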

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMAs per
> cycle.  So in theory we can do at most 2 FMAs per cycle, but with
> latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able
> to squeeze out a little bit more throughput when there are many FADD/FMUL
> ops to execute?  That holds independently of whether FMAs have a head start
> on the multiplication, as you'd still be bottlenecked on the 2-wide issue
> for FMA?

It doesn't matter here since all the FMAs/FMULs are dependent on each other,
so the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four.  So you should get worse results there (looking at the
> numbers above, you do get worse results, slightly so); probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency 
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems to be a noticeable win on Zen and not a loss on Core, it
> > seems like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> 
> complaint!

Thanks for raising this, hopefully my explanation clears it up.

Alexander
