Date: Tue, 12 Dec 2023 20:08:23 +0300 (MSK)
From: Alexander Monakov
To: Richard Biener
cc: Jan Hubicka, gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com
Subject: Re: Disable FMADD in chains for Zen4 and generic
On Tue, 12 Dec 2023, Richard Biener wrote:

> On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka wrote:
> >
> > Hi,
> > this patch disables use of FMA in the matrix multiplication loop for
> > generic (for x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold
> > 6212U.
> >
> > For Intel this is neutral both on the matrix multiplication
> > microbenchmark (attached) and spec2k17, where the difference was within
> > noise for Core.
> >
> > On Core the micro-benchmark runs as follows:
> >
> > With FMA:
> >
> >      578,500,241      cycles:u         #  3.645 GHz             ( +- 0.12% )
> >      753,318,477      instructions:u   #  1.30  insn per cycle  ( +- 0.00% )
> >      125,417,701      branches:u       #  790.227 M/sec         ( +- 0.00% )
> >         0.159146 +- 0.000363 seconds time elapsed               ( +- 0.23% )
> >
> > No FMA:
> >
> >      577,573,960      cycles:u         #  3.514 GHz             ( +- 0.15% )
> >      878,318,479      instructions:u   #  1.52  insn per cycle  ( +- 0.00% )
> >      125,417,702      branches:u       #  763.035 M/sec         ( +- 0.00% )
> >         0.164734 +- 0.000321 seconds time elapsed               ( +- 0.19% )
> >
> > So the cycle count is unchanged, and discrete multiply+add takes the
> > same time as FMA.
> >
> > While on zen:
> >
> > With FMA:
> >      484,875,179      cycles:u         #  3.599 GHz             ( +- 0.05% )  (82.11%)
> >      752,031,517      instructions:u   #  1.55  insn per cycle
> >      125,106,525      branches:u       #  928.712 M/sec         ( +- 0.03% )  (85.09%)
> >          128,356      branch-misses:u  #  0.10% of all branches ( +- 0.06% )  (83.58%)
> >
> > No FMA:
> >      375,875,209      cycles:u         #  3.592 GHz             ( +- 0.08% )  (80.74%)
> >      875,725,341      instructions:u   #  2.33  insn per cycle
> >      124,903,825      branches:u       #  1.194 G/sec           ( +- 0.04% )  (84.59%)
> >         0.105203 +- 0.000188 seconds time elapsed              ( +- 0.18% )
> >
> > The difference is that Cores understand that fmadd does not need all
> > three parameters to start computation, while Zen cores don't.
>
> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when, as
> in the testcase, the longest input latency is on the multiplication from
> a memory load.

The latency from the memory operand doesn't matter since it's not a part of
the critical path.  The memory uop of the FMA starts executing as soon as
the address is ready.

> Do we actually understand _why_ the FMAs are slower here?

It's simple: on Zen 4, FMA has latency 4 while add has latency 3, and you
can see it clearly in the quoted numbers: zen-with-fma runs at slightly
below 4 cycles per branch, zen-without-fma at exactly 3 cycles per branch.
Please refer to uops.info for latency data:

https://uops.info/html-instr/VMULPS_YMM_YMM_YMM.html
https://uops.info/html-instr/VFMADD231PS_YMM_YMM_YMM.html

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet? I'm curious how you set up a micro benchmark to
> measure this.

Unlike some of the Arm cores, none of the x86 cores can consume the addend
of an FMA on a later cycle than the multiplicands, with Alder Lake-E
apparently being the sole exception (see the 6/10/10 latencies on the
aforementioned uops.info FMA page).
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per
> cycle. So in theory we can at most do 2 FMA per cycle but with latency
> (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able to
> squeeze out a little bit more throughput when there are many FADD/FMUL
> ops to execute? That works independently of whether FMAs have a
> head-start on multiplication, as you'd still be bottle-necked on the
> 2-wide issue for FMA?

It doesn't matter here since all FMAs/FMULs are dependent on each other, so
the processor can start a new FMA only every 4th (or 3rd) cycle, except
when starting a new iteration of the outer loop.

> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
> latency of four. So you should get worse results there (looking at the
> numbers above you do get worse results, slightly so), probably the higher
> number of uops is hidden by the latency.

A simple solution would be to enable AVOID_FMA_CHAINS when FMA latency
exceeds FMUL latency (all Zens and Broadwell).

> > Since this seems a noticeable win on Zen and not a loss on Core, it
> > seems like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!

Thanks for raising this; hopefully my explanation clears it up.

Alexander