Date: Tue, 12 Dec 2023 17:48:18 +0100
From: Jan Hubicka
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic

> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when,
> as in the testcase, the longest input latency is on the multiplication
> from a memory load.  Do we actually understand _why_ the FMAs are
> slower here?

This is my understanding: the loop is well predictable and the memory
address calculations and loads can happen in parallel, so the main
dependency chain is updating the accumulator computing c[i][j].  FMADD
is 4 cycles on Zen4, while ADD is 3, so the loop with FMADD cannot run
any faster than one iteration per 4 cycles, while with ADD it can do
one iteration per 3.  That roughly matches the speedup we see:
484875179*3/4 = 363656384, while the measured speed is 375875209 cycles.
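As a minimal sketch of the two forms of the accumulator update (not
part of any benchmark in this thread; the dot_fma/dot_mul_add helpers
and the use of __builtin_fmaf are only illustrative), assuming the
Zen4 latencies above (FMA = 4 cycles, FADD/FMUL = 3 cycles):

static float
dot_fma (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    /* acc feeds the next FMA, so the loop-carried latency is the full
       4-cycle FMA latency: at best one iteration per 4 cycles.  */
    acc = __builtin_fmaf (a[k], b[k], acc);
  return acc;
}

static float
dot_mul_add (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    {
      float t = a[k] * b[k];   /* independent of acc, overlaps across iterations */
      acc = acc + t;           /* only the 3-cycle add is loop-carried */
    }
  return acc;
}

(To actually get the second variant the separate mul and add have to
survive contraction, e.g. with -ffp-contract=off or inline asm as in
the benchmark below; this accumulation pattern is what the
avoid_fma_chains tunings are aimed at.)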
The benchmark is quite short and I run it 100 times in perf to collect
the data, so the overhead probably accounts for the smaller than
expected difference.

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark
> to measure this.

Here is a cycle counting benchmark:

#include <stdio.h>
int
main ()
{
  float o = 0;
  for (int i = 0; i < 1000000000; i++)
    {
#ifdef ACCUMULATE
      float p1 = o;
      float p2 = 0;
#else
      float p1 = 0;
      float p2 = o;
#endif
      float p3 = 0;
#ifdef FMA
      asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
      float t;
      asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
      asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
    }
  printf ("%f\n", o);
  return 0;
}

It performs FMAs in sequence, all with zeros.  If you define ACCUMULATE
you get the pattern from the matrix multiplication.  On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,001,011,489      cycles:u            #    4.837 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     3,000,335,064      cycles:u            #    4.835 GHz    (83.08%)

So 4 cycles for the FMA loop and 3 cycles for the separate mul and add.
The muls execute in parallel with the adds in the second case.

If the dependency chain goes through the multiplied parameter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,000,118,069      cycles:u            #    4.836 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:
     6,001,947,341      cycles:u            #    4.838 GHz    (83.32%)

FMA is the same (it is still one FMA instruction per iteration), while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     5,001,515,473      cycles:u            #    3.796 GHz
jh@aster:~> gcc -O3 l.c -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     4,000,977,739      cycles:u            #    3.819 GHz
jh@aster:~> gcc -O3 l.c -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
     5,350,523,047      cycles:u            #    3.814 GHz
jh@aster:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:u
    10,251,994,240      cycles:u            #    3.852 GHz

So FMA seems to be 5 cycles if we accumulate, and a bit more (beyond
noise) if we do the long chain.  I think some cores have a bigger
difference between these two numbers.  I am a bit surprised by the last
number of 10 cycles; I would expect 8.

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA
> per cycle.  So in theory we can at most do 2 FMA per cycle but with
> latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
> able to squeeze out a little bit more throughput when there are many
> FADD/FMUL ops to execute?  That works independent on whether FMAs have
> a head-start on multiplication as you'd still be bottle-necked on the
> 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in an accumulation type loop, so it is latency 4 versus
latency 3 per accumulation.  Indeed, in other loops fmadd is a win.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
> a latency of four.  So you should get worse results there (looking at
> the numbers above you do get worse results, slightly so), probably the
> higher number of uops is hidden by the latency.

I think the slower non-FMA on Core was just noise (it shows in the
overall time but not in the cycle counts).  I changed the benchmark to
run the multiplication 100 times.
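The exact driver change is not shown here; a minimal version (the rep
loop and its placement are assumptions, reusing s, e and mult() from
the source quoted at the end of this mail) would be:

  /* Assumed modification: repeat the mult() kernel 100 times so the
     measured cycles dominate the init()/printf() overhead.  */
  s = clock ();
  for (int rep = 0; rep < 100; rep++)
    mult ();
  e = clock ();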
On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   15146405 clocks

 Performance counter stats for './a.out':

         15,149.62 msec task-clock:u          #    1.000 CPUs utilized
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               948      page-faults:u         #   62.576 /sec
    55,803,919,561      cycles:u              #    3.684 GHz
    87,615,590,411      instructions:u        #    1.57  insn per cycle
    12,512,896,307      branches:u            #  825.955 M/sec
        12,605,403      branch-misses:u       #    0.10% of all branches

      15.150064855 seconds time elapsed

      15.146817000 seconds user
       0.003333000 seconds sys

jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
 mult took   15308879 clocks

 Performance counter stats for './a.out':

         15,312.27 msec task-clock:u          #    1.000 CPUs utilized
                 1      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               948      page-faults:u         #   61.911 /sec
    59,449,535,152      cycles:u              #    3.882 GHz
    75,115,590,460      instructions:u        #    1.26  insn per cycle
    12,512,896,356      branches:u            #  817.181 M/sec
        12,605,235      branch-misses:u       #    0.10% of all branches

      15.312776274 seconds time elapsed

      15.309462000 seconds user
       0.003333000 seconds sys

The difference seems close to noise.  If I am counting right, with
100*1000*1000*1000 multiplications I would expect
5*100*1000*1000*1000/8 = 62500000000 cycles overall.  Perhaps since the
chain is independent for every 125 multiplications it runs a bit faster.

jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   10046353 clocks

 Performance counter stats for './a.out':

          10051.47 msec task-clock:u                #    0.999 CPUs utilized
                 0      context-switches:u          #    0.000 /sec
                 0      cpu-migrations:u            #    0.000 /sec
               940      page-faults:u               #   93.519 /sec
       36983540385      cycles:u                    #    3.679 GHz                     (83.34%)
           3535506      stalled-cycles-frontend:u   #    0.01% frontend cycles idle    (83.33%)
          12252917      stalled-cycles-backend:u    #    0.03% backend cycles idle     (83.34%)
       87650235892      instructions:u              #    2.37  insn per cycle
                                                    #    0.00  stalled cycles per insn (83.34%)
       12504689935      branches:u                  #    1.244 G/sec                   (83.33%)
          12606975      branch-misses:u             #    0.10% of all branches         (83.32%)

      10.059089949 seconds time elapsed

      10.048218000 seconds user
       0.003998000 seconds sys

jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
 mult took   13147631 clocks

 Performance counter stats for './a.out':

          13152.81 msec task-clock:u                #    0.999 CPUs utilized
                 0      context-switches:u          #    0.000 /sec
                 0      cpu-migrations:u            #    0.000 /sec
               940      page-faults:u               #   71.468 /sec
       48394201333      cycles:u                    #    3.679 GHz                     (83.32%)
           4251637      stalled-cycles-frontend:u   #    0.01% frontend cycles idle    (83.32%)
          13664772      stalled-cycles-backend:u    #    0.03% backend cycles idle     (83.34%)
       75101376364      instructions:u              #    1.55  insn per cycle
                                                    #    0.00  stalled cycles per insn (83.35%)
       12510705466      branches:u                  #  951.182 M/sec                   (83.34%)
          12612898      branch-misses:u             #    0.10% of all branches         (83.33%)

      13.162186067 seconds time elapsed

      13.153354000 seconds user
       0.000000000 seconds sys

So here I would expect 3*100*1000*1000*1000/8 = 37500000000 cycles for
the first and 4*100*1000*1000*1000/8 = 50000000000 cycles for the
second.  So again a small over-estimate, apparently due to parallelism
between the vector multiplications, but overall it seems to match what
I would expect to see.

Honza

>
> > Since this seems a noticeable win on Zen and not a loss on Core it
> > seems like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
>
> Richard.
>
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          for(k=0; k<SIZE; k++)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf(" mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> > }
> >
> > 	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> > 	X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -	  | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain. */