From: "Cui, Lili"
To: Di Zhao OS, "gcc-patches@gcc.gnu.org"
CC: "richard.guenther@gmail.com", "linkw@linux.ibm.com"
Subject: RE: [PATCH] Handle FMA friendly in reassoc pass
Date: Wed, 7 Jun 2023 15:38:15 +0000
References: <20230524233005.3284950-1-lili.cui@intel.com>

Hi Di,

The compile options I use are: "-march=native -Ofast -funroll-loops -flto".

I re-ran 503, 507, and 527 on two neoverse-n1 machines and found that one
machine fluctuated greatly; its score was only 70% of the other machine's.
I also could not reproduce the gain on the stable machine. As for the 527
regression, I cannot reproduce it either; the data there looks stable.

Regards,
Lili.
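
For anyone reproducing this, here is a minimal, self-contained kernel
(illustrative file and function names, not taken from the SPEC sources) with
the shape of summation chain the patch targets. Building it with the options
above, or with -fdump-tree-widening_mul as the patch's test cases do, lets you
count the .FMA calls in the widening_mul dump:

  /* fma-chain.c -- illustrative only; compile e.g. with
       gcc -Ofast -funroll-loops -march=native -fdump-tree-widening_mul -c fma-chain.c
     then grep the resulting *.widening_mul dump for ".FMA".  */
  #define N 1024
  double a[N], b[N], c[N], d[N], e[N], p[N];

  void
  accumulate (void)
  {
    /* A sum chain mixing multiply terms and a plain addend.  */
    for (int i = 0; i < N; i++)
      a[i] += b[i] * c[i] + d[i] * e[i] + p[i];
  }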
> -----Original Message-----
> From: Di Zhao OS
> Sent: Wednesday, June 7, 2023 11:48 AM
> To: Cui, Lili; gcc-patches@gcc.gnu.org
> Cc: richard.guenther@gmail.com; linkw@linux.ibm.com
> Subject: RE: [PATCH] Handle FMA friendly in reassoc pass
>
> Hello Lili Cui,
>
> Since I'm also trying to improve this lately, I've tested your patch on
> several aarch64 machines we have, including neoverse-n1 and ampere1
> architectures. However, I haven't reproduced the 6.00% improvement of the
> 503.bwaves_r single-copy run you mentioned. Could you share more
> information about the aarch64 CPU and compile options you tested? The
> option I'm using is "-Ofast", with or without "--param avoid-fma-max-bits=512".
>
> Additionally, we found some spec2017 cases with regressions, including 4%
> on 527.cam4_r (neoverse-n1).
>
> > -----Original Message-----
> > From: Gcc-patches <gcc-patches-bounces+dizhao=os.amperecomputing.com@gcc.gnu.org>
> > On Behalf Of Cui, Lili via Gcc-patches
> > Sent: Thursday, May 25, 2023 7:30 AM
> > To: gcc-patches@gcc.gnu.org
> > Cc: richard.guenther@gmail.com; linkw@linux.ibm.com; Lili Cui
> > Subject: [PATCH] Handle FMA friendly in reassoc pass
> >
> > From: Lili Cui
> >
> > Make some changes in the reassoc pass to make it more friendly to the
> > FMA pass later. Using FMA instead of mult + add reduces register
> > pressure and instructions retired.
> >
> > There are mainly two changes:
> > 1. Put no-mult ops and mult ops alternately at the end of the queue,
> > which is conducive to generating more FMAs and reducing the loss of
> > FMA when breaking the chain.
> > 2. Rewrite the rewrite_expr_tree_parallel function to try to build
> > parallel chains according to the given correlation width, keeping the
> > FMA chance as much as possible.
> >
> > With the patch applied:
> >
> > On ICX:
> > 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> > 503.bwaves_r   : Improved by 0.60% for single copy.
> > 507.cactuBSSN_r: Improved by 1.10% for single copy.
> > 519.lbm_r      : Improved by 2.21% for single copy.
> > No measurable changes for other benchmarks.
> >
> > On aarch64:
> > 507.cactuBSSN_r: Improved by 1.7% for multi-copy.
> > 503.bwaves_r   : Improved by 6.00% for single copy.
> > No measurable changes for other benchmarks.
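
As a rough standalone sketch of the "no-mult ops and mult ops alternately at
the end of the queue" reordering described above (operand names and the
string-based "is it a mult?" test are illustrative assumptions; the real
logic is rank_ops_for_fma in the patch below):

  #include <stdio.h>
  #include <string.h>

  int
  main (void)
  {
    /* Operand queue for  a*b + c*d + f*g + h + k  (names made up).  */
    const char *ops[] = { "a*b", "c*d", "f*g", "h", "k" };
    int n = 5;
    const char *mults[8], *others[8], *out[16];
    int nm = 0, no = 0;

    /* Split the queue: operands defined by a multiplication vs. the rest.  */
    for (int i = 0; i < n; i++)
      {
        if (strchr (ops[i], '*'))
          mults[nm++] = ops[i];
        else
          others[no++] = ops[i];
      }

    /* Mult ops first, then insert the remaining ops one position earlier
       each time, starting at the end of the queue.  */
    int len = nm, pos = nm;
    for (int i = 0; i < nm; i++)
      out[i] = mults[i];
    for (int i = no - 1; i >= 0; i--)
      {
        for (int j = len; j > pos; j--)
          out[j] = out[j - 1];
        out[pos] = others[i];
        len++;
        if (pos > 0)
          pos--;
      }

    /* Prints:  a*b + c*d + h + f*g + k
       i.e. multiplies and plain addends alternate toward the end of the
       queue, the part that is rewritten first when the chain is broken,
       so each sub-chain keeps an FMA opportunity.  */
    for (int i = 0; i < len; i++)
      printf ("%s%s", out[i], i + 1 < len ? " + " : "\n");
    return 0;
  }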
> >
> > TEST1:
> >
> > float
> > foo (float a, float b, float c, float d, float *e)
> > {
> >   return *e + a * b + c * d ;
> > }
> >
> > For "-Ofast -mfpmath=sse -mfma" GCC generates:
> >         vmulss  %xmm3, %xmm2, %xmm2
> >         vfmadd132ss  %xmm1, %xmm2, %xmm0
> >         vaddss  (%rdi), %xmm0, %xmm0
> >         ret
> >
> > With this patch GCC generates:
> >         vfmadd213ss  (%rdi), %xmm1, %xmm0
> >         vfmadd231ss  %xmm2, %xmm3, %xmm0
> >         ret
> >
> > TEST2:
> >
> > for (int i = 0; i < N; i++)
> >   {
> >     a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i]
> >             + k[i] * l[i] + m[i]* o[i] + p[i];
> >   }
> >
> > For "-Ofast -mfpmath=sse -mfma" GCC generates:
> >         vmovapd e(%rax), %ymm4
> >         vmulpd  d(%rax), %ymm4, %ymm3
> >         addq    $32, %rax
> >         vmovapd c-32(%rax), %ymm5
> >         vmovapd j-32(%rax), %ymm6
> >         vmulpd  h-32(%rax), %ymm6, %ymm2
> >         vmovapd a-32(%rax), %ymm6
> >         vaddpd  p-32(%rax), %ymm6, %ymm0
> >         vmovapd g-32(%rax), %ymm7
> >         vfmadd231pd  b-32(%rax), %ymm5, %ymm3
> >         vmovapd o-32(%rax), %ymm4
> >         vmulpd  m-32(%rax), %ymm4, %ymm1
> >         vmovapd l-32(%rax), %ymm5
> >         vfmadd231pd  f-32(%rax), %ymm7, %ymm2
> >         vfmadd231pd  k-32(%rax), %ymm5, %ymm1
> >         vaddpd  %ymm3, %ymm0, %ymm0
> >         vaddpd  %ymm2, %ymm0, %ymm0
> >         vaddpd  %ymm1, %ymm0, %ymm0
> >         vmovapd %ymm0, a-32(%rax)
> >         cmpq    $8192, %rax
> >         jne     .L4
> >         vzeroupper
> >         ret
> >
> > with this patch applied GCC breaks the chain with width = 2 and
> > generates 6 fma:
> >
> >         vmovapd a(%rax), %ymm2
> >         vmovapd c(%rax), %ymm0
> >         addq    $32, %rax
> >         vmovapd e-32(%rax), %ymm1
> >         vmovapd p-32(%rax), %ymm5
> >         vmovapd g-32(%rax), %ymm3
> >         vmovapd j-32(%rax), %ymm6
> >         vmovapd l-32(%rax), %ymm4
> >         vmovapd o-32(%rax), %ymm7
> >         vfmadd132pd  b-32(%rax), %ymm2, %ymm0
> >         vfmadd132pd  d-32(%rax), %ymm5, %ymm1
> >         vfmadd231pd  f-32(%rax), %ymm3, %ymm0
> >         vfmadd231pd  h-32(%rax), %ymm6, %ymm1
> >         vfmadd231pd  k-32(%rax), %ymm4, %ymm0
> >         vfmadd231pd  m-32(%rax), %ymm7, %ymm1
> >         vaddpd  %ymm1, %ymm0, %ymm0
> >         vmovapd %ymm0, a-32(%rax)
> >         cmpq    $8192, %rax
> >         jne     .L2
> >         vzeroupper
> >         ret
> >
> > gcc/ChangeLog:
> >
> >         PR gcc/98350
> >         * tree-ssa-reassoc.cc
> >         (rewrite_expr_tree_parallel): Rewrite this function.
> >         (rank_ops_for_fma): New.
> >         (reassociate_bb): Handle new function.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         PR gcc/98350
> >         * gcc.dg/pr98350-1.c: New test.
> >         * gcc.dg/pr98350-2.c: Ditto.
> > ---
> >  gcc/testsuite/gcc.dg/pr98350-1.c |  31 ++++
> >  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
> >  gcc/tree-ssa-reassoc.cc          | 256 +++++++++++++++++++++----------
> >  3 files changed, 215 insertions(+), 83 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
> >  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
> >
> > diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> > new file mode 100644
> > index 00000000000..6bcf78a19ab
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> > @@ -0,0 +1,31 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
> > +
> > +/* Test that the compiler properly optimizes multiply and add
> > +   to generate more FMA instructions.  */
> > +#define N 1024
> > +double a[N];
> > +double b[N];
> > +double c[N];
> > +double d[N];
> > +double e[N];
> > +double f[N];
> > +double g[N];
> > +double h[N];
> > +double j[N];
> > +double k[N];
> > +double l[N];
> > +double m[N];
> > +double o[N];
> > +double p[N];
> > +
> > +
> > +void
> > +foo (void)
> > +{
> > +  for (int i = 0; i < N; i++)
> > +  {
> > +    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> > +  }
> > +}
> > +/* { dg-final { scan-tree-dump-times { = \.FMA \(} 6 "widening_mul" } } */
> > diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> > new file mode 100644
> > index 00000000000..333d34f026a
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -fdump-tree-widening_mul" } */
> > +
> > +/* Test that the compiler rearrange the ops to generate more FMA.  */
> > +
> > +float
> > +foo1 (float a, float b, float c, float d, float *e)
> > +{
> > +  return *e + a * b + c * d ;
> > +}
> > +/* { dg-final { scan-tree-dump-times { = \.FMA \(} 2 "widening_mul" } } */
> > diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> > index 067a3f07f7e..611fb9b1c99 100644
> > --- a/gcc/tree-ssa-reassoc.cc
> > +++ b/gcc/tree-ssa-reassoc.cc
> > @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
> >  #include "tree-ssa-reassoc.h"
> >  #include "tree-ssa-math-opts.h"
> >  #include "gimple-range.h"
> > +#include "internal-fn.h"
> >
> >  /* This is a simple global reassociation pass.  It is, in part, based
> >     on the LLVM pass of the same name (They do some things more/less
> > @@ -5468,14 +5469,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
> >    return width;
> >  }
> >
> > -/* Recursively rewrite our linearized statements so that the operators
> > -   match those in OPS[OPINDEX], putting the computation in rank
> > -   order and trying to allow operations to be executed in
> > -   parallel.  */
> > +/* Rewrite statements with dependency chain with regard the chance to generate
> > +   FMA.
> > +   For the chain with FMA: Try to keep fma opportunity as much as possible.
> > +   For the chain without FMA: Putting the computation in rank order and trying
> > +   to allow operations to be executed in parallel.
> > +   E.g.
> > +   e + f + g + a * b + c * d;
> > +
> > +   ssa1 = e + f;
> > +   ssa2 = g + a * b;
> > +   ssa3 = ssa1 + c * d;
> > +   ssa4 = ssa2 + ssa3;
> > +
> > +   This reassociation approach preserves the chance of fma generation as much
> > +   as possible.  */
> >  static void
> > -rewrite_expr_tree_parallel (gassign *stmt, int width,
> > -                            const vec<operand_entry *> &ops)
> > +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> > +                            const vec<operand_entry *> &ops)
> >  {
> >    enum tree_code opcode = gimple_assign_rhs_code (stmt);
> >    int op_num = ops.length ();
> > @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
> >    int stmt_num = op_num - 1;
> >    gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
> >    int op_index = op_num - 1;
> > -  int stmt_index = 0;
> > -  int ready_stmts_end = 0;
> > -  int i = 0;
> > -  gimple *stmt1 = NULL, *stmt2 = NULL;
> > +  int width_count = width;
> > +  int i = 0, j = 0;
> > +  tree tmp_op[2], op1;
> > +  operand_entry *oe;
> > +  gimple *stmt1 = NULL;
> >    tree last_rhs1 = gimple_assign_rhs1 (stmt);
> >
> >    /* We start expression rewriting from the top statements.
> > @@ -5496,91 +5508,87 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
> >    for (i = stmt_num - 2; i >= 0; i--)
> >      stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
> >
> > -  for (i = 0; i < stmt_num; i++)
> > +  /* Build parallel dependency chain according to width.  */
> > +  for (i = 0; i < width; i++)
> >      {
> > -      tree op1, op2;
> > -
> > -      /* Determine whether we should use results of
> > -         already handled statements or not.  */
> > -      if (ready_stmts_end == 0
> > -          && (i - stmt_index >= width || op_index < 1))
> > -        ready_stmts_end = i;
> > -
> > -      /* Now we choose operands for the next statement.  Non zero
> > -         value in ready_stmts_end means here that we should use
> > -         the result of already generated statements as new operand.  */
> > -      if (ready_stmts_end > 0)
> > -        {
> > -          op1 = gimple_assign_lhs (stmts[stmt_index++]);
> > -          if (ready_stmts_end > stmt_index)
> > -            op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > -          else if (op_index >= 0)
> > -            {
> > -              operand_entry *oe = ops[op_index--];
> > -              stmt2 = oe->stmt_to_insert;
> > -              op2 = oe->op;
> > -            }
> > -          else
> > -            {
> > -              gcc_assert (stmt_index < i);
> > -              op2 = gimple_assign_lhs (stmts[stmt_index++]);
> > -            }
> > +      /* If the chain has FAM, we do not swap two operands.  */
> > +      if (op_index > 1 && !has_fma)
> > +        swap_ops_for_binary_stmt (ops, op_index - 2);
> >
> > -          if (stmt_index >= ready_stmts_end)
> > -            ready_stmts_end = 0;
> > -        }
> > -      else
> > +      for (j = 0; j < 2; j++)
> >          {
> > -          if (op_index > 1)
> > -            swap_ops_for_binary_stmt (ops, op_index - 2);
> > -          operand_entry *oe2 = ops[op_index--];
> > -          operand_entry *oe1 = ops[op_index--];
> > -          op2 = oe2->op;
> > -          stmt2 = oe2->stmt_to_insert;
> > -          op1 = oe1->op;
> > -          stmt1 = oe1->stmt_to_insert;
> > +          gcc_assert (op_index >= 0);
> > +          oe = ops[op_index--];
> > +          tmp_op[j] = oe->op;
> > +          /* If the stmt that defines operand has to be inserted, insert it
> > +             before the use.  */
> > +          stmt1 = oe->stmt_to_insert;
> > +          if (stmt1)
> > +            insert_stmt_before_use (stmts[i], stmt1);
> > +          stmt1 = NULL;
> >          }
> > -
> > -      /* If we emit the last statement then we should put
> > -         operands into the last statement.  It will also
> > -         break the loop.  */
> > -      if (op_index < 0 && stmt_index == i)
> > -        i = stmt_num - 1;
> > +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> > +                                    tmp_op[1],
> > +                                    tmp_op[0],
> > +                                    opcode);
> > +      gimple_set_visited (stmts[i], true);
> >
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >          {
> > -          fprintf (dump_file, "Transforming ");
> > +          fprintf (dump_file, " into ");
> >            print_gimple_stmt (dump_file, stmts[i], 0);
> >          }
> > +    }
> >
> > -      /* If the stmt that defines operand has to be inserted, insert it
> > -         before the use.  */
> > -      if (stmt1)
> > -        insert_stmt_before_use (stmts[i], stmt1);
> > -      if (stmt2)
> > -        insert_stmt_before_use (stmts[i], stmt2);
> > -      stmt1 = stmt2 = NULL;
> > -
> > -      /* We keep original statement only for the last one.  All
> > -         others are recreated.  */
> > -      if (i == stmt_num - 1)
> > +  for (i = width; i < stmt_num; i++)
> > +    {
> > +      /* We keep original statement only for the last one.  All others are
> > +         recreated.  */
> > +      if ( op_index < 0)
> >          {
> > -          gimple_assign_set_rhs1 (stmts[i], op1);
> > -          gimple_assign_set_rhs2 (stmts[i], op2);
> > -          update_stmt (stmts[i]);
> > +          if (width_count == 2)
> > +            {
> > +
> > +              /* We keep original statement only for the last one.  All
> > +                 others are recreated.  */
> > +              gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> > +              gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> > +              update_stmt (stmts[i]);
> > +            }
> > +          else
> > +            {
> > +
> > +              stmts[i] =
> > +                build_and_add_sum (TREE_TYPE (last_rhs1),
> > +                                   gimple_assign_lhs (stmts[i-width_count]),
> > +                                   gimple_assign_lhs (stmts[i-width_count+1]),
> > +                                   opcode);
> > +              gimple_set_visited (stmts[i], true);
> > +              width_count--;
> > +            }
> >          }
> >        else
> >          {
> > -          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), op1, op2, opcode);
> > +          /* Attach the rest of the ops to the parallel dependency chain.  */
> > +          oe = ops[op_index--];
> > +          op1 = oe->op;
> > +          stmt1 = oe->stmt_to_insert;
> > +          if (stmt1)
> > +            insert_stmt_before_use (stmts[i], stmt1);
> > +          stmt1 = NULL;
> > +          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> > +                                        gimple_assign_lhs (stmts[i-width]),
> > +                                        op1,
> > +                                        opcode);
> >            gimple_set_visited (stmts[i], true);
> >          }
> > +
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> >          {
> >            fprintf (dump_file, " into ");
> >            print_gimple_stmt (dump_file, stmts[i], 0);
> >          }
> >      }
> > -
> >    remove_visited_stmt_chain (last_rhs1);
> >  }
> >
> > @@ -6649,6 +6657,73 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
> >      }
> >  }
> >
> > +/* Rearrange ops may have more FMA when the chain may has more than 2 FMAs.
> > +   Put no-mult ops and mult ops alternately at the end of the queue, which is
> > +   conducive to generating more FMA and reducing the loss of FMA when breaking
> > +   the chain.
> > +   E.g.
> > +   a * b + c * d + e generates:
> > +
> > +   _4  = c_9(D) * d_10(D);
> > +   _12 = .FMA (a_7(D), b_8(D), _4);
> > +   _11 = e_6(D) + _12;
> > +
> > +   Rearrange ops to -> e + a * b + c * d generates:
> > +
> > +   _4  = .FMA (c_7(D), d_8(D), _3);
> > +   _11 = .FMA (a_5(D), b_6(D), _4);
> > + */
> > +static bool
> > +rank_ops_for_fma (vec<operand_entry *> *ops)
> > +{
> > +  operand_entry *oe;
> > +  unsigned int i;
> > +  unsigned int ops_length = ops->length ();
> > +  auto_vec<operand_entry *> ops_mult;
> > +  auto_vec<operand_entry *> ops_others;
> > +
> > +  FOR_EACH_VEC_ELT (*ops, i, oe)
> > +    {
> > +      if (TREE_CODE (oe->op) == SSA_NAME)
> > +        {
> > +          gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> > +          if (is_gimple_assign (def_stmt)
> > +              && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> > +            ops_mult.safe_push (oe);
> > +          else
> > +            ops_others.safe_push (oe);
> > +        }
> > +      else
> > +        ops_others.safe_push (oe);
> > +    }
> > +  /* 1. When ops_mult.length == 2, like the following case,
> > +
> > +     a * b + c * d + e.
> > +
> > +     we need to rearrange the ops.
> > +
> > +     Putting ops that not def from mult in front can generate more FMAs.
> > +
> > +     2. If all ops are defined with mult, we don't need to rearrange them.  */
> > +  if (ops_mult.length () >= 2 && ops_mult.length () != ops_length)
> > +    {
> > +      /* Put no-mult ops and mult ops alternately at the end of the
> > +         queue, which is conducive to generating more FMA and reducing the
> > +         loss of FMA when breaking the chain.  */
> > +      ops->truncate (0);
> > +      ops->splice (ops_mult);
> > +      int j, opindex = ops->length ();
> > +      int others_length = ops_others.length ();
> > +      for (j = 0; j < others_length; j++)
> > +        {
> > +          oe = ops_others.pop ();
> > +          ops->quick_insert (opindex, oe);
> > +          if (opindex > 0)
> > +            opindex--;
> > +        }
> > +      return true;
> > +    }
> > +  return false;
> > +}
> >  /* Reassociate expressions in basic block BB and its post-dominator as
> >     children.
> >
> > @@ -6813,6 +6888,7 @@ reassociate_bb (basic_block bb)
> >            machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> >            int ops_num = ops.length ();
> >            int width;
> > +          bool has_fma = false;
> >
> >            /* For binary bit operations, if there are at least 3
> >               operands and the last operand in OPS is a constant,
> > @@ -6821,11 +6897,23 @@ reassociate_bb (basic_block bb)
> >               often match a canonical bit test when we get to RTL.  */
> >            if (ops.length () > 2
> >                && (rhs_code == BIT_AND_EXPR
> > -                  || rhs_code == BIT_IOR_EXPR
> > -                  || rhs_code == BIT_XOR_EXPR)
> > +                  || rhs_code == BIT_IOR_EXPR
> > +                  || rhs_code == BIT_XOR_EXPR)
> >                && TREE_CODE (ops.last ()->op) == INTEGER_CST)
> >              std::swap (*ops[0], *ops[ops_num - 1]);
> >
> > +          optimization_type opt_type = bb_optimization_type (bb);
> > +
> > +          /* If the target support FMA, rank_ops_for_fma will detect if
> > +             the chain has fmas and rearrange the ops if so.  */
> > +          if (direct_internal_fn_supported_p (IFN_FMA,
> > +                                              TREE_TYPE (lhs),
> > +                                              opt_type)
> > +              && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> > +            {
> > +              has_fma = rank_ops_for_fma (&ops);
> > +            }
> > +
> >            /* Only rewrite the expression tree to parallel in the
> >               last reassoc pass to avoid useless work back-and-forth
> >               with initial linearization.  */
> > @@ -6839,22 +6927,24 @@ reassociate_bb (basic_block bb)
> >                                   "Width = %d was chosen for reassociation\n",
> >                                   width);
> >                rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> > -                                          width, ops);
> > +                                          width,
> > +                                          has_fma,
> > +                                          ops);
> >              }
> >            else
> > -            {
> > -              /* When there are three operands left, we want
> > -                 to make sure the ones that get the double
> > -                 binary op are chosen wisely.  */
> > -              int len = ops.length ();
> > -              if (len >= 3)
> > +            {
> > +              /* When there are three operands left, we want
> > +                 to make sure the ones that get the double
> > +                 binary op are chosen wisely.  */
> > +              int len = ops.length ();
> > +              if (len >= 3 && !has_fma)
> >                  swap_ops_for_binary_stmt (ops, len - 3);
> >
> >                new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
> >                                             powi_result != NULL
> >                                             || negate_result,
> >                                             len != orig_len);
> > -            }
> > +            }
> >
> >            /* If we combined some repeated factors into a
> >               __builtin_powi call, multiply that result by the
> > --
> > 2.25.1