From: Richard Biener
Date: Tue, 12 Dec 2023 16:01:39 +0100
Subject: Re: Disable FMADD in chains for Zen4 and generic
To: Jan Hubicka
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com

On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for
> generic (for x86-64-v3) and zen4.  I tested this on Zen4 and a Xeon
> Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication
> micro-benchmark (attached) and on spec2k17, where the difference was
> within noise for Core.
>
> On Core the micro-benchmark runs as follows:
>
> With FMA:
>
>      578,500,241      cycles:u          #    3.645 GHz             ( +-  0.12% )
>      753,318,477      instructions:u    #    1.30  insn per cycle  ( +-  0.00% )
>      125,417,701      branches:u        #  790.227 M/sec           ( +-  0.00% )
>         0.159146 +- 0.000363 seconds time elapsed                  ( +-  0.23% )
>
> No FMA:
>
>      577,573,960      cycles:u          #    3.514 GHz             ( +-  0.15% )
>      878,318,479      instructions:u    #    1.52  insn per cycle  ( +-  0.00% )
>      125,417,702      branches:u        #  763.035 M/sec           ( +-  0.00% )
>         0.164734 +- 0.000321 seconds time elapsed                  ( +-  0.19% )
>
> So the cycle count is unchanged and the discrete multiply+add takes
> the same time as FMA.
>
> While on Zen:
>
> With FMA:
>      484875179      cycles:u          #    3.599 GHz               ( +-  0.05% )  (82.11%)
>      752031517      instructions:u    #    1.55  insn per cycle
>      125106525      branches:u        #  928.712 M/sec             ( +-  0.03% )  (85.09%)
>         128356      branch-misses:u   #    0.10% of all branches   ( +-  0.06% )  (83.58%)
>
> No FMA:
>      375875209      cycles:u          #    3.592 GHz               ( +-  0.08% )  (80.74%)
>      875725341      instructions:u    #    2.33  insn per cycle
>      124903825      branches:u        #    1.194 G/sec             ( +-  0.04% )  (84.59%)
>       0.105203 +- 0.000188 seconds time elapsed                    ( +-  0.18% )
>
> The difference is that Core understands that fmadd does not need all
> three parameters to start the computation, while Zen cores don't.

This came up in a separate thread as well, when doing reassoc of a
chain with multiple dependent FMAs.

I can't understand how this uarch detail can affect performance when,
as in the testcase, the longest input latency is on the multiplication
from a memory load.  Do we actually understand _why_ the FMAs are
slower here?  Do we know that Cores can start the multiplication part
when the add operand isn't ready yet?  I'm curious how you set up a
micro-benchmark to measure this.

There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMAs
per cycle.  So in theory we can do at most 2 FMAs per cycle, but with
latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
able to squeeze out a little bit more throughput when there are many
FADD/FMUL ops to execute?  That works independently of whether FMAs
get a head start on the multiplication, as you'd still be bottlenecked
on the 2-wide issue for FMA?

On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
a latency of four.  So you should get worse results there (looking at
the numbers above you do get slightly worse results); probably the
higher number of uops is hidden by the latency.

> Since this seems a noticeable win on Zen and no loss on Core, it
> seems like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.

complaint!

Richard.
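For concreteness, here is a minimal sketch of the kind of latency-bound
kernel that "how did you measure this" question is after -- a
hypothetical example, not the benchmark attached below.  Only the
accumulator is loop-carried; the loads and the multiply are independent
of it, so comparing a build that allows contraction (e.g.
-ffp-contract=fast) with one built with -ffp-contract=off isolates
whether fusing lengthens the serial chain.  The 4-cycle FMA / 3-cycle
FADD numbers in the comment are the Zen3/4 latencies quoted above.

/* Hypothetical latency-chain kernel, not part of the original mail.
   acc is the only loop-carried value: the loads and the multiply do
   not depend on it, so if the core can start the multiply before acc
   is ready, fusing into an FMA should not lengthen the chain; if it
   cannot, the chain grows from the ~3-cycle FADD to the ~4-cycle FMA
   latency quoted above for Zen3/4.  */
#include <stdio.h>
#include <time.h>

#define ITERS 200000000L

int main(void)
{
  float va[16], vb[16], acc = 0.0f;
  int i;

  for (i = 0; i < 16; i++)
    {
      va[i] = 1.0f + i * 0.001f;
      vb[i] = 1.0f - i * 0.001f;
    }

  clock_t s = clock();
  for (long k = 0; k < ITERS; k++)
    acc += va[k & 15] * vb[k & 15];   /* serial dependence on acc only */
  clock_t e = clock();

  printf("acc=%f, chain took %10d clocks\n", acc, (int)(e - s));
  return 0;
}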
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          for(k=0; k<SIZE; k++)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s = clock();
>    mult();
>    e = clock();
>    printf(" mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
> }
>
>       * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
>       X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
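To spell out the dependence structure those tunings target, here is a
reduced, hypothetical form of the hot statement in mult() above.  It is
not part of the patch, and the instruction names and latencies in the
comments are assumptions taken from the discussion above rather than
the output of an actual build.

/* Reduced form of the c[i][j] += a[i][k] * b[k][j] reduction in mult()
   (hypothetical illustration only).  */
float dot(const float *x, const float *y, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    /* Fused (e.g. vfmadd231ss): each iteration's FMA consumes the
       previous acc, so with the 4-cycle Zen3/4 FMA latency the chain
       allows roughly one iteration every 4 cycles.
       Split (vmulss + vaddss): the multiply does not depend on acc and
       can issue ahead of it; only the ~3-cycle add stays loop-carried,
       which matches the roughly 4:3 ratio of the Zen4 cycle counts
       above (~485M with FMA vs. ~376M without).  */
    acc += x[k] * y[k];
  return acc;
}

Whether the compiler actually keeps the multiply and the add separate
here is decided by the FMA-formation heuristics these tuning bits feed,
so the sketch is only meant to illustrate why the split can win on Zen.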