Date: Tue, 12 Dec 2023 17:48:18 +0100
From: Jan Hubicka
To: Richard Biener
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic

> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when,
> as in the testcase, the longest input latency is on the multiplication
> from a memory load.  Do we actually understand _why_ the FMAs are
> slower here?

This is my understanding: the loop is well predictable and the memory
address calculations and loads can happen in parallel, so the main
dependency chain is updating the accumulator computing c[i][j].  FMADD
is 4 cycles on Zen4, while ADD is 3, so the loop with FMADD cannot run
any faster than one iteration per 4 cycles, while with ADD it can do
one iteration per 3.  That roughly matches the speedup we see:
484875179*3/4 = 363656384, while the measured speed is 375875209 cycles.
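As a minimal sketch of the two forms of the accumulator update (not
part of any benchmark in this thread; the dot_fma/dot_mul_add helpers
and the use of __builtin_fmaf are only illustrative), assuming the
Zen4 latencies above (FMA = 4 cycles, FADD/FMUL = 3 cycles):

static float
dot_fma (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    /* acc feeds the next FMA, so the loop-carried latency is the full
       4-cycle FMA latency: at best one iteration per 4 cycles.  */
    acc = __builtin_fmaf (a[k], b[k], acc);
  return acc;
}

static float
dot_mul_add (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    {
      float t = a[k] * b[k];   /* independent of acc, overlaps across iterations */
      acc = acc + t;           /* only the 3-cycle add is loop-carried */
    }
  return acc;
}

(To actually get the second variant the separate mul and add have to
survive contraction, e.g. with -ffp-contract=off or inline asm as in
the benchmark below; this accumulation pattern is what the
avoid_fma_chains tunings are aimed at.)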
The benchmark is quite short and I run it 100 times in perf to collect
the data, so the overhead probably accounts for the smaller than
expected difference.

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark
> to measure this.

Here is a cycle counting benchmark:

#include <stdio.h>
int
main ()
{
  float o = 0;
  for (int i = 0; i < 1000000000; i++)
    {
#ifdef ACCUMULATE
      float p1 = o;
      float p2 = 0;
#else
      float p1 = 0;
      float p2 = o;
#endif
      float p3 = 0;
#ifdef FMA
      asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
      float t;
      asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
      asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
    }
  printf ("%f\n", o);
  return 0;
}

It performs FMAs in sequence, all with zeros.  If you define ACCUMULATE
you get the pattern from the matrix multiplication.  On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,001,011,489      cycles:u            #    4.837 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     3,000,335,064      cycles:u            #    4.835 GHz    (83.08%)

So 4 cycles for the FMA loop and 3 cycles for the separate mul and add.
The muls execute in parallel with the adds in the second case.

If the dependency chain goes through the multiplied parameter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,000,118,069      cycles:u            #    4.836 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:
     6,001,947,341      cycles:u            #    4.838 GHz    (83.32%)

FMA is the same (it is still one FMA instruction per iteration), while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     5,001,515,473      cycles:u            #    3.796 GHz
jh@aster:~> gcc -O3 l.c -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     4,000,977,739      cycles:u            #    3.819 GHz
jh@aster:~> gcc -O3 l.c -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
     5,350,523,047      cycles:u            #    3.814 GHz
jh@aster:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:u
    10,251,994,240      cycles:u            #    3.852 GHz

So FMA seems to be 5 cycles if we accumulate, and a bit more (beyond
noise) if we do the long chain.  I think some cores have a bigger
difference between these two numbers.  I am a bit surprised by the last
number of 10 cycles; I would expect 8.

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA
> per cycle.  So in theory we can at most do 2 FMA per cycle but with
> latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
> able to squeeze out a little bit more throughput when there are many
> FADD/FMUL ops to execute?  That works independent on whether FMAs have
> a head-start on multiplication as you'd still be bottle-necked on the
> 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in an accumulation type loop, so it is latency 4 versus
latency 3 per accumulation.  Indeed, in other loops fmadd is a win.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
> a latency of four.  So you should get worse results there (looking at
> the numbers above you do get worse results, slightly so), probably the
> higher number of uops is hidden by the latency.

I think the slower non-FMA on Core was just noise (it shows in the
overall time but not in the cycle counts).  I changed the benchmark to
run the multiplication 100 times.
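The exact driver change is not shown here; a minimal version (the rep
loop and its placement are assumptions, reusing s, e and mult() from
the source quoted at the end of this mail) would be:

  /* Assumed modification: repeat the mult() kernel 100 times so the
     measured cycles dominate the init()/printf() overhead.  */
  s = clock ();
  for (int rep = 0; rep < 100; rep++)
    mult ();
  e = clock ();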
On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   15146405 clocks

 Performance counter stats for './a.out':

         15,149.62 msec task-clock:u          #    1.000 CPUs utilized
                 0      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               948      page-faults:u         #   62.576 /sec
    55,803,919,561      cycles:u              #    3.684 GHz
    87,615,590,411      instructions:u        #    1.57  insn per cycle
    12,512,896,307      branches:u            #  825.955 M/sec
        12,605,403      branch-misses:u       #    0.10% of all branches

      15.150064855 seconds time elapsed

      15.146817000 seconds user
       0.003333000 seconds sys

jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
 mult took   15308879 clocks

 Performance counter stats for './a.out':

         15,312.27 msec task-clock:u          #    1.000 CPUs utilized
                 1      context-switches:u    #    0.000 /sec
                 0      cpu-migrations:u      #    0.000 /sec
               948      page-faults:u         #   61.911 /sec
    59,449,535,152      cycles:u              #    3.882 GHz
    75,115,590,460      instructions:u        #    1.26  insn per cycle
    12,512,896,356      branches:u            #  817.181 M/sec
        12,605,235      branch-misses:u       #    0.10% of all branches

      15.312776274 seconds time elapsed

      15.309462000 seconds user
       0.003333000 seconds sys

The difference seems close to noise.  If I am counting right, with
100*1000*1000*1000 multiplications I would expect
5*100*1000*1000*1000/8 = 62500000000 cycles overall.  Perhaps since the
chain is independent for every 125 multiplications it runs a bit faster.

jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   10046353 clocks

 Performance counter stats for './a.out':

          10051.47 msec task-clock:u                #    0.999 CPUs utilized
                 0      context-switches:u          #    0.000 /sec
                 0      cpu-migrations:u            #    0.000 /sec
               940      page-faults:u               #   93.519 /sec
       36983540385      cycles:u                    #    3.679 GHz                     (83.34%)
           3535506      stalled-cycles-frontend:u   #    0.01% frontend cycles idle    (83.33%)
          12252917      stalled-cycles-backend:u    #    0.03% backend cycles idle     (83.34%)
       87650235892      instructions:u              #    2.37  insn per cycle
                                                    #    0.00  stalled cycles per insn (83.34%)
       12504689935      branches:u                  #    1.244 G/sec                   (83.33%)
          12606975      branch-misses:u             #    0.10% of all branches         (83.32%)

      10.059089949 seconds time elapsed

      10.048218000 seconds user
       0.003998000 seconds sys

jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
 mult took   13147631 clocks

 Performance counter stats for './a.out':

          13152.81 msec task-clock:u                #    0.999 CPUs utilized
                 0      context-switches:u          #    0.000 /sec
                 0      cpu-migrations:u            #    0.000 /sec
               940      page-faults:u               #   71.468 /sec
       48394201333      cycles:u                    #    3.679 GHz                     (83.32%)
           4251637      stalled-cycles-frontend:u   #    0.01% frontend cycles idle    (83.32%)
          13664772      stalled-cycles-backend:u    #    0.03% backend cycles idle     (83.34%)
       75101376364      instructions:u              #    1.55  insn per cycle
                                                    #    0.00  stalled cycles per insn (83.35%)
       12510705466      branches:u                  #  951.182 M/sec                   (83.34%)
          12612898      branch-misses:u             #    0.10% of all branches         (83.33%)

      13.162186067 seconds time elapsed

      13.153354000 seconds user
       0.000000000 seconds sys

So here I would expect 3*100*1000*1000*1000/8 = 37500000000 cycles for
the first and 4*100*1000*1000*1000/8 = 50000000000 cycles for the
second.  So again a small over-estimate, apparently due to parallelism
between the vector multiplications, but overall it seems to match what
I would expect to see.

Honza

>
> > Since this seems a noticeable win on Zen and not a loss on Core it
> > seems like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
>
> Richard.
>
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; i++)
> >    {
> >       for(j=0; j<SIZE; j++)
> >       {
> >          for(k=0; k<SIZE; k++)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf(" mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> > }
> >
> > 	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> > 	X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -	  | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain. */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain. */