From: "Cui, Lili"
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com, linkw@linux.ibm.com
Subject: RE: [PATCH] Handle FMA friendly in reassoc pass
Date: Mon, 29 May 2023 07:50:24 +0000
In-Reply-To: <20230524233005.3284950-1-lili.cui@intel.com>
References: <20230524233005.3284950-1-lili.cui@intel.com>
I will rebase and commit this patch, thanks!

Lili.

> -----Original Message-----
> From: Cui, Lili
> Sent: Thursday, May 25, 2023 7:30 AM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guenther@gmail.com; linkw@linux.ibm.com; Cui, Lili
> Subject: [PATCH] Handle FMA friendly in reassoc pass
>
> From: Lili Cui
>
> Make some changes in the reassoc pass to make it more friendly to the fma
> pass that runs later.  Using FMA instead of mult + add reduces register
> pressure and instructions retired.
>
> There are mainly two changes:
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
>    conducive to generating more FMAs and reduces the loss of FMA opportunities
>    when breaking the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
>    chains according to the given correlation width, keeping as many FMA
>    chances as possible.
>
> With the patch applied:
>
> On ICX:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 0.60% for single copy.
> 507.cactuBSSN_r: Improved by 1.10% for single copy.
> 519.lbm_r      : Improved by 2.21% for single copy.
> No measurable changes for other benchmarks.
>
> On aarch64:
> 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> 503.bwaves_r   : Improved by 6.00% for single-copy.
> No measurable changes for other benchmarks.
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e)
> {
>   return *e + a * b + c * d;
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmulss      %xmm3, %xmm2, %xmm2
>         vfmadd132ss %xmm1, %xmm2, %xmm0
>         vaddss      (%rdi), %xmm0, %xmm0
>         ret
>
> With this patch GCC generates:
>         vfmadd213ss (%rdi), %xmm1, %xmm0
>         vfmadd231ss %xmm2, %xmm3, %xmm0
>         ret
>
> TEST2:
>
> for (int i = 0; i < N; i++)
>   {
>     a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i] * o[i] + p[i];
>   }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmovapd     e(%rax), %ymm4
>         vmulpd      d(%rax), %ymm4, %ymm3
>         addq        $32, %rax
>         vmovapd     c-32(%rax), %ymm5
>         vmovapd     j-32(%rax), %ymm6
>         vmulpd      h-32(%rax), %ymm6, %ymm2
>         vmovapd     a-32(%rax), %ymm6
>         vaddpd      p-32(%rax), %ymm6, %ymm0
>         vmovapd     g-32(%rax), %ymm7
>         vfmadd231pd b-32(%rax), %ymm5, %ymm3
>         vmovapd     o-32(%rax), %ymm4
>         vmulpd      m-32(%rax), %ymm4, %ymm1
>         vmovapd     l-32(%rax), %ymm5
>         vfmadd231pd f-32(%rax), %ymm7, %ymm2
>         vfmadd231pd k-32(%rax), %ymm5, %ymm1
>         vaddpd      %ymm3, %ymm0, %ymm0
>         vaddpd      %ymm2, %ymm0, %ymm0
>         vaddpd      %ymm1, %ymm0, %ymm0
>         vmovapd     %ymm0, a-32(%rax)
>         cmpq        $8192, %rax
>         jne         .L4
>         vzeroupper
>         ret
>
> With this patch applied, GCC breaks the chain with width = 2 and generates
> 6 FMAs:
>
>         vmovapd     a(%rax), %ymm2
>         vmovapd     c(%rax), %ymm0
>         addq        $32, %rax
>         vmovapd     e-32(%rax), %ymm1
>         vmovapd     p-32(%rax), %ymm5
>         vmovapd     g-32(%rax), %ymm3
>         vmovapd     j-32(%rax), %ymm6
>         vmovapd     l-32(%rax), %ymm4
>         vmovapd     o-32(%rax), %ymm7
>         vfmadd132pd b-32(%rax), %ymm2, %ymm0
>         vfmadd132pd d-32(%rax), %ymm5, %ymm1
>         vfmadd231pd f-32(%rax), %ymm3, %ymm0
>         vfmadd231pd h-32(%rax), %ymm6, %ymm1
>         vfmadd231pd k-32(%rax), %ymm4, %ymm0
>         vfmadd231pd m-32(%rax), %ymm7, %ymm1
>         vaddpd      %ymm1, %ymm0, %ymm0
>         vmovapd     %ymm0, a-32(%rax)
>         cmpq        $8192, %rax
>         jne         .L2
>         vzeroupper
>         ret
>
> gcc/ChangeLog:
>
>         PR gcc/98350
>         * tree-ssa-reassoc.cc
>         (rewrite_expr_tree_parallel): Rewrite this function.
>         (rank_ops_for_fma): New.
>         (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
>         PR gcc/98350
>         * gcc.dg/pr98350-1.c: New test.
>         * gcc.dg/pr98350-2.c: Ditto.
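[Editor's note] The reordering described in change 1 is easiest to see outside
of GCC.  Below is a small standalone C++ sketch of the interleaving idea only;
it is not the patch's code, and op_entry, is_mult and rank_for_fma_sketch are
invented names for this illustration.  Operands defined by a multiplication are
kept together, and the remaining operands are inserted from the back, so that
no-mult and mult operands alternate at the tail of the queue the rewrite later
consumes back to front.

  #include <iostream>
  #include <string>
  #include <vector>

  // One summand of a long addition chain; is_mult marks operands that are
  // themselves defined by a multiplication (FMA candidates).
  struct op_entry
  {
    std::string name;
    bool is_mult;
  };

  // Keep the mult-defined operands together, then insert the remaining
  // operands from the back so no-mult and mult ops alternate at the end.
  static std::vector<op_entry>
  rank_for_fma_sketch (const std::vector<op_entry> &ops)
  {
    std::vector<op_entry> mults, others;
    for (const op_entry &oe : ops)
      (oe.is_mult ? mults : others).push_back (oe);

    // Reorder only when there are at least two mult operands and at least
    // one operand that is not a mult, as in the description above.
    if (mults.size () < 2 || others.empty ())
      return ops;

    std::vector<op_entry> out = mults;
    size_t pos = out.size ();
    while (!others.empty ())
      {
        out.insert (out.begin () + pos, others.back ());
        others.pop_back ();
        if (pos > 0)
          pos--;
      }
    return out;
  }

  int
  main ()
  {
    // The e + a*b + c*d example from the cover letter.
    std::vector<op_entry> ops = { { "a*b", true }, { "c*d", true },
                                  { "e", false } };
    std::vector<op_entry> ranked = rank_for_fma_sketch (ops);

    // The rewrite consumes the queue from the back, so print it that way:
    // the no-mult operand comes first and both multiplies can become FMAs.
    for (auto it = ranked.rbegin (); it != ranked.rend (); ++it)
      std::cout << it->name << ' ';
    std::cout << '\n';   // prints: e c*d a*b
  }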
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 ++++
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc          | 256 +++++++++++++++++++++----------
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..6bcf78a19ab
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +    {
> +      a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> +    }
> +}
> +/* { dg-final { scan-tree-dump-times { = \.FMA \(} 6 "widening_mul" } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..333d34f026a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
> +
> +/* Test that the compiler rearranges the ops to generate more FMAs.  */
> +
> +float
> +foo1 (float a, float b, float c, float d, float *e)
> +{
> +  return *e + a * b + c * d ;
> +}
> +/* { dg-final { scan-tree-dump-times { = \.FMA \(} 2 "widening_mul" } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..611fb9b1c99 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-reassoc.h"
>  #include "tree-ssa-math-opts.h"
>  #include "gimple-range.h"
> +#include "internal-fn.h"
>
>  /* This is a simple global reassociation pass.  It is, in part, based
>     on the LLVM pass of the same name (They do some things more/less
> @@ -5468,14 +5469,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
>    return width;
>  }
>
> -/* Recursively rewrite our linearized statements so that the operators
> -   match those in OPS[OPINDEX], putting the computation in rank
> -   order and trying to allow operations to be executed in
> -   parallel.  */
> +/* Rewrite statements with dependency chain with regard to the chance to
> +   generate FMA.
> +   For the chain with FMA: Try to keep the fma opportunity as much as possible.
> +   For the chain without FMA: Put the computation in rank order and try
> +   to allow operations to be executed in parallel.
> +   E.g.
> +   e + f + g + a * b + c * d;
> +
> +   ssa1 = e + f;
> +   ssa2 = g + a * b;
> +   ssa3 = ssa1 + c * d;
> +   ssa4 = ssa2 + ssa3;
> +
> +   This reassociation approach preserves the chance of fma generation as much
> +   as possible.  */
>  static void
> -rewrite_expr_tree_parallel (gassign *stmt, int width,
> -                            const vec<operand_entry *> &ops)
> +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> +                            const vec<operand_entry *> &ops)
>  {
>    enum tree_code opcode = gimple_assign_rhs_code (stmt);
>    int op_num = ops.length ();
> @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    int stmt_num = op_num - 1;
>    gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
>    int op_index = op_num - 1;
> -  int stmt_index = 0;
> -  int ready_stmts_end = 0;
> -  int i = 0;
> -  gimple *stmt1 = NULL, *stmt2 = NULL;
> +  int width_count = width;
> +  int i = 0, j = 0;
> +  tree tmp_op[2], op1;
> +  operand_entry *oe;
> +  gimple *stmt1 = NULL;
>    tree last_rhs1 = gimple_assign_rhs1 (stmt);
>
>    /* We start expression rewriting from the top statements.
> @@ -5496,91 +5508,87 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    for (i = stmt_num - 2; i >= 0; i--)
>      stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
>
> -  for (i = 0; i < stmt_num; i++)
> +  /* Build parallel dependency chain according to width.  */
> +  for (i = 0; i < width; i++)
>      {
> -      tree op1, op2;
> -
> -      /* Determine whether we should use results of
> -         already handled statements or not.  */
> -      if (ready_stmts_end == 0
> -          && (i - stmt_index >= width || op_index < 1))
> -        ready_stmts_end = i;
> -
> -      /* Now we choose operands for the next statement.  Non zero
> -         value in ready_stmts_end means here that we should use
> -         the result of already generated statements as new operand.  */
> -      if (ready_stmts_end > 0)
> -        {
> -          op1 = gimple_assign_lhs (stmts[stmt_index++]);
> -          if (ready_stmts_end > stmt_index)
> -            op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -          else if (op_index >= 0)
> -            {
> -              operand_entry *oe = ops[op_index--];
> -              stmt2 = oe->stmt_to_insert;
> -              op2 = oe->op;
> -            }
> -          else
> -            {
> -              gcc_assert (stmt_index < i);
> -              op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -            }
> +      /* If the chain has FMA, we do not swap two operands.  */
> +      if (op_index > 1 && !has_fma)
> +        swap_ops_for_binary_stmt (ops, op_index - 2);
>
> -          if (stmt_index >= ready_stmts_end)
> -            ready_stmts_end = 0;
> -        }
> -      else
> +      for (j = 0; j < 2; j++)
>          {
> -          if (op_index > 1)
> -            swap_ops_for_binary_stmt (ops, op_index - 2);
> -          operand_entry *oe2 = ops[op_index--];
> -          operand_entry *oe1 = ops[op_index--];
> -          op2 = oe2->op;
> -          stmt2 = oe2->stmt_to_insert;
> -          op1 = oe1->op;
> -          stmt1 = oe1->stmt_to_insert;
> +          gcc_assert (op_index >= 0);
> +          oe = ops[op_index--];
> +          tmp_op[j] = oe->op;
> +          /* If the stmt that defines operand has to be inserted, insert it
> +             before the use.  */
> +          stmt1 = oe->stmt_to_insert;
> +          if (stmt1)
> +            insert_stmt_before_use (stmts[i], stmt1);
> +          stmt1 = NULL;
>          }
> -
> -      /* If we emit the last statement then we should put
> -         operands into the last statement.  It will also
> -         break the loop.  */
> -      if (op_index < 0 && stmt_index == i)
> -        i = stmt_num - 1;
> +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                    tmp_op[1],
> +                                    tmp_op[0],
> +                                    opcode);
> +      gimple_set_visited (stmts[i], true);
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
> -          fprintf (dump_file, "Transforming ");
> +          fprintf (dump_file, " into ");
>            print_gimple_stmt (dump_file, stmts[i], 0);
>          }
> +    }
>
> -      /* If the stmt that defines operand has to be inserted, insert it
> -         before the use.  */
> -      if (stmt1)
> -        insert_stmt_before_use (stmts[i], stmt1);
> -      if (stmt2)
> -        insert_stmt_before_use (stmts[i], stmt2);
> -      stmt1 = stmt2 = NULL;
> -
> -      /* We keep original statement only for the last one.  All
> -         others are recreated.  */
> -      if (i == stmt_num - 1)
> +  for (i = width; i < stmt_num; i++)
> +    {
> +      /* We keep original statement only for the last one.  All others are
> +         recreated.  */
> +      if ( op_index < 0)
>          {
> -          gimple_assign_set_rhs1 (stmts[i], op1);
> -          gimple_assign_set_rhs2 (stmts[i], op2);
> -          update_stmt (stmts[i]);
> +          if (width_count == 2)
> +            {
> +
> +              /* We keep original statement only for the last one.  All
> +                 others are recreated.  */
> +              gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> +              gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> +              update_stmt (stmts[i]);
> +            }
> +          else
> +            {
> +
> +              stmts[i] =
> +                build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                   gimple_assign_lhs (stmts[i-width_count]),
> +                                   gimple_assign_lhs (stmts[i-width_count+1]),
> +                                   opcode);
> +              gimple_set_visited (stmts[i], true);
> +              width_count--;
> +            }
>          }
>        else
>          {
> -          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), op1, op2, opcode);
> +          /* Attach the rest of the ops to the parallel dependency chain.  */
> +          oe = ops[op_index--];
> +          op1 = oe->op;
> +          stmt1 = oe->stmt_to_insert;
> +          if (stmt1)
> +            insert_stmt_before_use (stmts[i], stmt1);
> +          stmt1 = NULL;
> +          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                        gimple_assign_lhs (stmts[i-width]),
> +                                        op1,
> +                                        opcode);
>            gimple_set_visited (stmts[i], true);
>          }
> +
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
>            fprintf (dump_file, " into ");
>            print_gimple_stmt (dump_file, stmts[i], 0);
>          }
>      }
> -
>    remove_visited_stmt_chain (last_rhs1);
>  }
>
> @@ -6649,6 +6657,73 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
>      }
>  }
>
> +/* Rearranging ops may produce more FMAs when the chain may have more than 2 FMAs.
> +   Put no-mult ops and mult ops alternately at the end of the queue, which is
> +   conducive to generating more FMA and reducing the loss of FMA when breaking
> +   the chain.
> +   E.g.
> +   a * b + c * d + e generates:
> +
> +   _4  = c_9(D) * d_10(D);
> +   _12 = .FMA (a_7(D), b_8(D), _4);
> +   _11 = e_6(D) + _12;
> +
> +   Rearrange ops to -> e + a * b + c * d generates:
> +
> +   _4  = .FMA (c_7(D), d_8(D), _3);
> +   _11 = .FMA (a_5(D), b_6(D), _4);  */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> +  operand_entry *oe;
> +  unsigned int i;
> +  unsigned int ops_length = ops->length ();
> +  auto_vec<operand_entry *> ops_mult;
> +  auto_vec<operand_entry *> ops_others;
> +
> +  FOR_EACH_VEC_ELT (*ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) == SSA_NAME)
> +        {
> +          gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> +          if (is_gimple_assign (def_stmt)
> +              && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> +            ops_mult.safe_push (oe);
> +          else
> +            ops_others.safe_push (oe);
> +        }
> +      else
> +        ops_others.safe_push (oe);
> +    }
> +  /* 1. When ops_mult.length == 2, like the following case,
> +
> +     a * b + c * d + e.
> +
> +     we need to rearrange the ops.
> +
> +     Putting ops that are not defined by mult in front can generate more FMAs.
> +
> +     2. If all ops are defined with mult, we don't need to rearrange them.  */
> +  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
> +    {
> +      /* Put no-mult ops and mult ops alternately at the end of the
> +         queue, which is conducive to generating more FMA and reducing the
> +         loss of FMA when breaking the chain.  */
> +      ops->truncate (0);
> +      ops->splice (ops_mult);
> +      int j, opindex = ops->length ();
> +      int others_length = ops_others.length ();
> +      for (j = 0; j < others_length; j++)
> +        {
> +          oe = ops_others.pop ();
> +          ops->quick_insert (opindex, oe);
> +          if (opindex > 0)
> +            opindex--;
> +        }
> +      return true;
> +    }
> +  return false;
> +}
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.
>
> @@ -6813,6 +6888,7 @@ reassociate_bb (basic_block bb)
>               machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>               int ops_num = ops.length ();
>               int width;
> +             bool has_fma = false;
>
>               /* For binary bit operations, if there are at least 3
>                  operands and the last operand in OPS is a constant,
> @@ -6821,11 +6897,23 @@ reassociate_bb (basic_block bb)
>                  often match a canonical bit test when we get to RTL.  */
>               if (ops.length () > 2
>                   && (rhs_code == BIT_AND_EXPR
> -                     || rhs_code == BIT_IOR_EXPR
> -                     || rhs_code == BIT_XOR_EXPR)
> +                     || rhs_code == BIT_IOR_EXPR
> +                     || rhs_code == BIT_XOR_EXPR)
>                   && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>                 std::swap (*ops[0], *ops[ops_num - 1]);
>
> +             optimization_type opt_type = bb_optimization_type (bb);
> +
> +             /* If the target supports FMA, rank_ops_for_fma will detect if
> +                the chain has fmas and rearrange the ops if so.  */
> +             if (direct_internal_fn_supported_p (IFN_FMA,
> +                                                 TREE_TYPE (lhs),
> +                                                 opt_type)
> +                 && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> +               {
> +                 has_fma = rank_ops_for_fma (&ops);
> +               }
> +
>               /* Only rewrite the expression tree to parallel in the
>                  last reassoc pass to avoid useless work back-and-forth
>                  with initial linearization.  */
> @@ -6839,22 +6927,24 @@ reassociate_bb (basic_block bb)
>                                          "Width = %d was chosen for reassociation\n",
>                                          width);
>                       rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> -                                                 width, ops);
> +                                                 width,
> +                                                 has_fma,
> +                                                 ops);
>                     }
>                   else
> -                   {
> -                     /* When there are three operands left, we want
> -                        to make sure the ones that get the double
> -                        binary op are chosen wisely.  */
> -                     int len = ops.length ();
> -                     if (len >= 3)
> +                   {
> +                     /* When there are three operands left, we want
> +                        to make sure the ones that get the double
> +                        binary op are chosen wisely.  */
> +                     int len = ops.length ();
> +                     if (len >= 3 && !has_fma)
>                         swap_ops_for_binary_stmt (ops, len - 3);
>
>                       new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>                                                    powi_result != NULL
>                                                    || negate_result,
>                                                    len != orig_len);
> -                   }
> +                   }
>
>                   /* If we combined some repeated factors into a
>                      __builtin_powi call, multiply that result by the
> --
> 2.25.1
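[Editor's note] For readers who want the shape of the new
rewrite_expr_tree_parallel without the GIMPLE plumbing, here is a hedged
standalone C++ sketch under simplifying assumptions: parallel_sum_sketch and
its arguments are invented for this illustration, it assumes at least
2 * width operands, and it folds the finished chains with a plain
left-to-right sum instead of the statement-reuse logic in the patch.  It only
mirrors the association order: seed `width` independent chains from the tail
of the operand list, feed the remaining operands into those chains
round-robin, then combine the chains.

  #include <cstdio>
  #include <vector>

  // Mirror only the association order of the rewritten
  // rewrite_expr_tree_parallel.  Assumes ops.size () >= 2 * width, width >= 1.
  static double
  parallel_sum_sketch (const std::vector<double> &ops, int width)
  {
    int op_index = (int) ops.size () - 1;
    std::vector<double> chain (width);

    // Step 1: each chain starts as the sum of two operands taken from the
    // back of the queue (where rank_ops_for_fma put the interleaved ops).
    for (int i = 0; i < width; i++)
      {
        chain[i] = ops[op_index] + ops[op_index - 1];
        op_index -= 2;
      }

    // Step 2: attach the remaining operands to the chains round-robin,
    // which is what keeps each chain short and FMA-friendly.
    for (int i = 0; op_index >= 0; i = (i + 1) % width)
      chain[i] += ops[op_index--];

    // Step 3: fold the parallel chains into the final result (the real
    // code reuses the original statement for the last combination).
    double result = chain[0];
    for (int i = 1; i < width; i++)
      result += chain[i];
    return result;
  }

  int
  main ()
  {
    // Seven summands and width = 2, the shape of the a[i] chain in TEST2.
    std::vector<double> ops = { 1, 2, 3, 4, 5, 6, 7 };
    std::printf ("%g\n", parallel_sum_sketch (ops, 2));   // prints 28
  }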