From: Hongtao Liu
Date: Wed, 13 Dec 2023 07:56:49 +0800
Subject: Re: Disable FMADD in chains for Zen4 and generic
To: Jan Hubicka
Cc: gcc-patches@gcc.gnu.org, hongtao.liu@intel.com, hongjiu.lu@intel.com, "Zhang, Annita"

On Tue, Dec 12, 2023 at 10:38 PM Jan Hubicka wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for generic
> (for x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and on spec2k17, where the difference was within noise for Core.
>
> On Core the microbenchmark runs as follows:
>
> With FMA:
>
>   578,500,241  cycles:u        # 3.645 GHz               ( +- 0.12% )
>   753,318,477  instructions:u  # 1.30 insn per cycle     ( +- 0.00% )
>   125,417,701  branches:u      # 790.227 M/sec           ( +- 0.00% )
>   0.159146 +- 0.000363 seconds time elapsed  ( +- 0.23% )
>
>
> No FMA:
>
>   577,573,960  cycles:u        # 3.514 GHz               ( +- 0.15% )
>   878,318,479  instructions:u  # 1.52 insn per cycle     ( +- 0.00% )
>   125,417,702  branches:u      # 763.035 M/sec           ( +- 0.00% )
>   0.164734 +- 0.000321 seconds time elapsed  ( +- 0.19% )
>
> So the cycle count is unchanged: the discrete multiply+add takes the same
> time as FMA.
>
> While on Zen:
>
>
> With FMA:
>   484875179  cycles:u         # 3.599 GHz               ( +- 0.05% )  (82.11%)
>   752031517  instructions:u   # 1.55 insn per cycle
>   125106525  branches:u       # 928.712 M/sec           ( +- 0.03% )  (85.09%)
>      128356  branch-misses:u  # 0.10% of all branches   ( +- 0.06% )  (83.58%)
>
> No FMA:
>   375875209  cycles:u         # 3.592 GHz               ( +- 0.08% )  (80.74%)
>   875725341  instructions:u   # 2.33 insn per cycle
>   124903825  branches:u       # 1.194 G/sec             ( +- 0.04% )  (84.59%)
>   0.105203 +- 0.000188 seconds time elapsed  ( +- 0.18% )
>
> The difference is that Core understands that fmadd does not need all
> three operands to start the computation, while Zen cores don't.
>
> Since this seems to be a noticeable win on Zen and not a loss on Core, it
> seems like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.
The generic part LGTM. (It's exactly what we proposed in [1].)

[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html

>
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; i++)
>    {
>       for(j=0; j<SIZE; j++)
>       {
>          for(k=0; k<SIZE; k++)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s=clock();
>    mult();
>    e=clock();
>    printf(" mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
>
> }
>
>     * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
>     X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */

-- 
BR,
Hongtao