From: "Cui, Lili"
To: "gcc-patches@gcc.gnu.org"
CC: "richard.guenther@gmail.com"
Subject: RE: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass
Date: Thu, 18 May 2023 16:56:38 +0000
References: <20230517130222.2534562-1-lili.cui@intel.com>
In-Reply-To: <20230517130222.2534562-1-lili.cui@intel.com>

Attach CPU2017 3 run results:

On ICX:
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 0.60% for single copy.
507.cactuBSSN_r: Improved by 1.10% for single copy.
519.lbm_r      : Improved by 2.21% for single copy.
no measurable changes for other benchmarks.

On aarch64:
507.cactuBSSN_r: Improved by 1.7% for multi-copy.
503.bwaves_r   : Improved by 6.00% for single-copy.
no measurable changes for other benchmarks.

> -----Original Message-----
> From: Cui, Lili
> Sent: Wednesday, May 17, 2023 9:02 PM
> To: gcc-patches@gcc.gnu.org
> Cc: richard.guenther@gmail.com; Cui, Lili
> Subject: [PATCH] PR gcc/98350:Handle FMA friendly in reassoc pass
>
> From: Lili Cui
>
> Make some changes in the reassoc pass to make it friendlier to the later
> FMA pass.
> Using FMA instead of mult + add reduces register pressure and the number
> of instructions retired.
>
> There are mainly two changes:
> 1. Put no-mult ops and mult ops alternately at the end of the queue, which is
> conducive to generating more fma and reducing the loss of FMA when
> breaking the chain.
> 2. Rewrite the rewrite_expr_tree_parallel function to try to build parallel
> chains according to the given reassociation width, keeping as many FMA
> opportunities as possible.
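
For readers following the thread, change 1 can be modelled outside of GCC. The sketch below is only an illustration of the reordering described above and is not the GCC code: `Operand`, `from_mult` and `interleave_for_fma` are names made up for this note, and plain strings stand in for the operand_entry objects that rank_ops_for_fma in the patch works on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an operand: its text and whether it is
// defined by a multiplication.
struct Operand
{
  std::string text;
  bool from_mult;
};

// Model of the reordering: multiplications first, then the remaining
// operands inserted from the back so that both kinds alternate at the
// end of the queue.
std::vector<std::string>
interleave_for_fma (const std::vector<Operand> &ops)
{
  std::vector<std::string> mult, others, original;
  for (const Operand &op : ops)
    {
      original.push_back (op.text);
      (op.from_mult ? mult : others).push_back (op.text);
    }

  // Fewer than two multiplications, or nothing but multiplications:
  // the order is left alone.
  if (mult.size () < 2 || others.empty ())
    return original;

  std::vector<std::string> result = mult;
  auto pos = static_cast<std::ptrdiff_t> (result.size ());
  while (!others.empty ())
    {
      // Take the last non-mult operand and place it at the current slot,
      // then move the slot one position towards the front.
      result.insert (result.begin () + pos, others.back ());
      others.pop_back ();
      if (pos > 0)
        pos--;
    }

  assert (result.size () == ops.size ());
  return result;
}
```

With three multiplications m0, m1, m2 and two other operands o0, o1 the model returns m0, m1, o0, m2, o1. rewrite_expr_tree_parallel in the patch reads operands from the back of the vector, so the tail of the queue is where the alternation matters.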
>
> TEST1:
>
> float
> foo (float a, float b, float c, float d, float *e) {
>    return *e + a * b + c * d ;
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmulss  %xmm3, %xmm2, %xmm2
>         vfmadd132ss     %xmm1, %xmm2, %xmm0
>         vaddss  (%rdi), %xmm0, %xmm0
>         ret
>
> With this patch GCC generates:
>         vfmadd213ss   (%rdi), %xmm1, %xmm0
>         vfmadd231ss   %xmm2, %xmm3, %xmm0
>         ret
>
> TEST2:
>
> for (int i = 0; i < N; i++)
> {
>   a[i] += b[i]* c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> }
>
> For "-Ofast -mfpmath=sse -mfma" GCC generates:
>         vmovapd e(%rax), %ymm4
>         vmulpd  d(%rax), %ymm4, %ymm3
>         addq    $32, %rax
>         vmovapd c-32(%rax), %ymm5
>         vmovapd j-32(%rax), %ymm6
>         vmulpd  h-32(%rax), %ymm6, %ymm2
>         vmovapd a-32(%rax), %ymm6
>         vaddpd  p-32(%rax), %ymm6, %ymm0
>         vmovapd g-32(%rax), %ymm7
>         vfmadd231pd     b-32(%rax), %ymm5, %ymm3
>         vmovapd o-32(%rax), %ymm4
>         vmulpd  m-32(%rax), %ymm4, %ymm1
>         vmovapd l-32(%rax), %ymm5
>         vfmadd231pd     f-32(%rax), %ymm7, %ymm2
>         vfmadd231pd     k-32(%rax), %ymm5, %ymm1
>         vaddpd  %ymm3, %ymm0, %ymm0
>         vaddpd  %ymm2, %ymm0, %ymm0
>         vaddpd  %ymm1, %ymm0, %ymm0
>         vmovapd %ymm0, a-32(%rax)
>         cmpq    $8192, %rax
>         jne     .L4
>         vzeroupper
>         ret
>
> With this patch applied GCC breaks the chain with width = 2 and generates 6
> fma:
>
>         vmovapd a(%rax), %ymm2
>         vmovapd c(%rax), %ymm0
>         addq    $32, %rax
>         vmovapd e-32(%rax), %ymm1
>         vmovapd p-32(%rax), %ymm5
>         vmovapd g-32(%rax), %ymm3
>         vmovapd j-32(%rax), %ymm6
>         vmovapd l-32(%rax), %ymm4
>         vmovapd o-32(%rax), %ymm7
>         vfmadd132pd     b-32(%rax), %ymm2, %ymm0
>         vfmadd132pd     d-32(%rax), %ymm5, %ymm1
>         vfmadd231pd     f-32(%rax), %ymm3, %ymm0
>         vfmadd231pd     h-32(%rax), %ymm6, %ymm1
>         vfmadd231pd     k-32(%rax), %ymm4, %ymm0
>         vfmadd231pd     m-32(%rax), %ymm7, %ymm1
>         vaddpd  %ymm1, %ymm0, %ymm0
>         vmovapd %ymm0, a-32(%rax)
>         cmpq    $8192, %rax
>         jne     .L2
>         vzeroupper
>         ret
>
> gcc/ChangeLog:
>
>         PR gcc/98350
>         * tree-ssa-reassoc.cc
>         (rewrite_expr_tree_parallel): Rewrite this function.
>         (rank_ops_for_fma): New.
>         (reassociate_bb): Handle new function.
>
> gcc/testsuite/ChangeLog:
>
>         PR gcc/98350
>         * gcc.dg/pr98350-1.c: New test.
>         * gcc.dg/pr98350-2.c: Ditto.
> ---
>  gcc/testsuite/gcc.dg/pr98350-1.c |  31 ++++
>  gcc/testsuite/gcc.dg/pr98350-2.c |  11 ++
>  gcc/tree-ssa-reassoc.cc          | 256 +++++++++++++++++++++----------
>  3 files changed, 215 insertions(+), 83 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-1.c
>  create mode 100644 gcc/testsuite/gcc.dg/pr98350-2.c
>
> diff --git a/gcc/testsuite/gcc.dg/pr98350-1.c b/gcc/testsuite/gcc.dg/pr98350-1.c
> new file mode 100644
> index 00000000000..185511c5e0a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-1.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
> +
> +/* Test that the compiler properly optimizes multiply and add
> +   to generate more FMA instructions.  */
> +#define N 1024
> +double a[N];
> +double b[N];
> +double c[N];
> +double d[N];
> +double e[N];
> +double f[N];
> +double g[N];
> +double h[N];
> +double j[N];
> +double k[N];
> +double l[N];
> +double m[N];
> +double o[N];
> +double p[N];
> +
> +
> +void
> +foo (void)
> +{
> +  for (int i = 0; i < N; i++)
> +  {
> +    a[i] += b[i] * c[i] + d[i] * e[i] + f[i] * g[i] + h[i] * j[i] + k[i] * l[i] + m[i]* o[i] + p[i];
> +  }
> +}
> +/* { dg-final { scan-assembler-times "vfm" 6 } } */
> diff --git a/gcc/testsuite/gcc.dg/pr98350-2.c b/gcc/testsuite/gcc.dg/pr98350-2.c
> new file mode 100644
> index 00000000000..b35d88aead9
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/pr98350-2.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mfpmath=sse -mfma -Wno-attributes " } */
> +
> +/* Test that the compiler rearranges the ops to generate more FMA.  */
> +
> +float
> +foo1 (float a, float b, float c, float d, float *e)
> +{
> +  return *e + a * b + c * d ;
> +}
> +/* { dg-final { scan-assembler-times "vfm" 2 } } */
> diff --git a/gcc/tree-ssa-reassoc.cc b/gcc/tree-ssa-reassoc.cc
> index 067a3f07f7e..52c8aab6033 100644
> --- a/gcc/tree-ssa-reassoc.cc
> +++ b/gcc/tree-ssa-reassoc.cc
> @@ -54,6 +54,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-reassoc.h"
>  #include "tree-ssa-math-opts.h"
>  #include "gimple-range.h"
> +#include "internal-fn.h"
>
>  /* This is a simple global reassociation pass.  It is, in part, based
>     on the LLVM pass of the same name (They do some things more/less
> @@ -5468,14 +5469,24 @@ get_reassociation_width (int ops_num, enum tree_code opc,
>    return width;
>  }
>
> -/* Recursively rewrite our linearized statements so that the operators
> -   match those in OPS[OPINDEX], putting the computation in rank
> -   order and trying to allow operations to be executed in
> -   parallel.  */
> +/* Rewrite statements with dependency chain with regard to the chance to
> +   generate FMA.
> +   For the chain with FMA: Try to keep fma opportunity as much as possible.
> +   For the chain without FMA: Putting the computation in rank order and trying
> +   to allow operations to be executed in parallel.
> +   E.g.
> +   e + f + g + a * b + c * d;
>
> +   ssa1 = e + f;
> +   ssa2 = g + a * b;
> +   ssa3 = ssa1 + c * d;
> +   ssa4 = ssa2 + ssa3;
> +
> +   This reassociation approach preserves the chance of fma generation as much
> +   as possible.  */
>  static void
> -rewrite_expr_tree_parallel (gassign *stmt, int width,
> -                            const vec<operand_entry *> &ops)
> +rewrite_expr_tree_parallel (gassign *stmt, int width, bool has_fma,
> +                            const vec<operand_entry *> &ops)
>  {
>    enum tree_code opcode = gimple_assign_rhs_code (stmt);
>    int op_num = ops.length ();
> @@ -5483,10 +5494,11 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    int stmt_num = op_num - 1;
>    gimple **stmts = XALLOCAVEC (gimple *, stmt_num);
>    int op_index = op_num - 1;
> -  int stmt_index = 0;
> -  int ready_stmts_end = 0;
> -  int i = 0;
> -  gimple *stmt1 = NULL, *stmt2 = NULL;
> +  int width_count = width;
> +  int i = 0, j = 0;
> +  tree tmp_op[2], op1;
> +  operand_entry *oe;
> +  gimple *stmt1 = NULL;
>    tree last_rhs1 = gimple_assign_rhs1 (stmt);
>
>    /* We start expression rewriting from the top statements.
> @@ -5496,91 +5508,84 @@ rewrite_expr_tree_parallel (gassign *stmt, int width,
>    for (i = stmt_num - 2; i >= 0; i--)
>      stmts[i] = SSA_NAME_DEF_STMT (gimple_assign_rhs1 (stmts[i+1]));
>
> -  for (i = 0; i < stmt_num; i++)
> +  /* Build parallel dependency chain according to width.  */
> +  for (i = 0; i < width; i++)
>      {
> -      tree op1, op2;
> -
> -      /* Determine whether we should use results of
> -         already handled statements or not.  */
> -      if (ready_stmts_end == 0
> -          && (i - stmt_index >= width || op_index < 1))
> -        ready_stmts_end = i;
> -
> -      /* Now we choose operands for the next statement.  Non zero
> -         value in ready_stmts_end means here that we should use
> -         the result of already generated statements as new operand.  */
> -      if (ready_stmts_end > 0)
> -        {
> -          op1 = gimple_assign_lhs (stmts[stmt_index++]);
> -          if (ready_stmts_end > stmt_index)
> -            op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -          else if (op_index >= 0)
> -            {
> -              operand_entry *oe = ops[op_index--];
> -              stmt2 = oe->stmt_to_insert;
> -              op2 = oe->op;
> -            }
> -          else
> -            {
> -              gcc_assert (stmt_index < i);
> -              op2 = gimple_assign_lhs (stmts[stmt_index++]);
> -            }
> +      /*   */
> +      if (op_index > 1 && !has_fma)
> +        swap_ops_for_binary_stmt (ops, op_index - 2);
>
> -          if (stmt_index >= ready_stmts_end)
> -            ready_stmts_end = 0;
> -        }
> -      else
> +      for (j = 0; j < 2; j++)
>          {
> -          if (op_index > 1)
> -            swap_ops_for_binary_stmt (ops, op_index - 2);
> -          operand_entry *oe2 = ops[op_index--];
> -          operand_entry *oe1 = ops[op_index--];
> -          op2 = oe2->op;
> -          stmt2 = oe2->stmt_to_insert;
> -          op1 = oe1->op;
> -          stmt1 = oe1->stmt_to_insert;
> +          gcc_assert (op_index >= 0);
> +          oe = ops[op_index--];
> +          tmp_op[j] = oe->op;
> +          /* If the stmt that defines operand has to be inserted, insert it
> +             before the use.  */
> +          stmt1 = oe->stmt_to_insert;
> +          if (stmt1)
> +            insert_stmt_before_use (stmts[i], stmt1);
> +          stmt1 = NULL;
>          }
> -
> -      /* If we emit the last statement then we should put
> -         operands into the last statement.  It will also
> -         break the loop.  */
> -      if (op_index < 0 && stmt_index == i)
> -        i = stmt_num - 1;
> +      stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), tmp_op[1], tmp_op[0], opcode);
> +      gimple_set_visited (stmts[i], true);
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
> -          fprintf (dump_file, "Transforming ");
> +          fprintf (dump_file, " into ");
>            print_gimple_stmt (dump_file, stmts[i], 0);
>          }
> +    }
>
> -      /* If the stmt that defines operand has to be inserted, insert it
> -         before the use.  */
> -      if (stmt1)
> -        insert_stmt_before_use (stmts[i], stmt1);
> -      if (stmt2)
> -        insert_stmt_before_use (stmts[i], stmt2);
> -      stmt1 = stmt2 = NULL;
> -
> -      /* We keep original statement only for the last one.  All
> -         others are recreated.  */
> -      if (i == stmt_num - 1)
> +  for (i = width; i < stmt_num; i++)
> +    {
> +      /* We keep original statement only for the last one.  All others are
> +         recreated.  */
> +      if ( op_index < 0)
>          {
> -          gimple_assign_set_rhs1 (stmts[i], op1);
> -          gimple_assign_set_rhs2 (stmts[i], op2);
> -          update_stmt (stmts[i]);
> +          if (width_count == 2)
> +            {
> +
> +              /* We keep original statement only for the last one.  All
> +                 others are recreated.  */
> +              gimple_assign_set_rhs1 (stmts[i], gimple_assign_lhs (stmts[i-1]));
> +              gimple_assign_set_rhs2 (stmts[i], gimple_assign_lhs (stmts[i-2]));
> +              update_stmt (stmts[i]);
> +            }
> +          else
> +            {
> +
> +              stmts[i] =
> +                build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                   gimple_assign_lhs (stmts[i-width_count]),
> +                                   gimple_assign_lhs (stmts[i-width_count+1]),
> +                                   opcode);
> +              gimple_set_visited (stmts[i], true);
> +              width_count--;
> +            }
>          }
>        else
>          {
> -          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1), op1, op2, opcode);
> +          /* Attach the rest of the ops to the parallel dependency chain.  */
> +          oe = ops[op_index--];
> +          op1 = oe->op;
> +          stmt1 = oe->stmt_to_insert;
> +          if (stmt1)
> +            insert_stmt_before_use (stmts[i], stmt1);
> +          stmt1 = NULL;
> +          stmts[i] = build_and_add_sum (TREE_TYPE (last_rhs1),
> +                                        gimple_assign_lhs (stmts[i-width]),
> +                                        op1,
> +                                        opcode);
>            gimple_set_visited (stmts[i], true);
>          }
> +
>        if (dump_file && (dump_flags & TDF_DETAILS))
>          {
>            fprintf (dump_file, " into ");
>            print_gimple_stmt (dump_file, stmts[i], 0);
>          }
>      }
> -
>    remove_visited_stmt_chain (last_rhs1);
>  }
>
> @@ -6649,6 +6654,76 @@ transform_stmt_to_multiply (gimple_stmt_iterator *gsi, gimple *stmt,
>      }
>  }
>
> +/* Rearrange ops to generate more FMA when the chain may have more than 2 fmas.
> +   Put no-mult ops and mult ops alternately at the end of the queue, which is
> +   conducive to generating more fma and reducing the loss of FMA when breaking
> +   the chain.
> +   E.g.
> +   a * b + c * d + e generates:
> +
> +   _4  = c_9(D) * d_10(D);
> +   _12 = .FMA (a_7(D), b_8(D), _4);
> +   _11 = e_6(D) + _12;
> +
> +   Rearrange ops to -> e + a * b + c * d generates:
> +
> +   _4  = .FMA (c_7(D), d_8(D), _3);
> +   _11 = .FMA (a_5(D), b_6(D), _4);
> + */
> +static bool
> +rank_ops_for_fma (vec<operand_entry *> *ops)
> +{
> +  operand_entry *oe;
> +  unsigned int i;
> +  unsigned int ops_length = ops->length ();
> +  auto_vec<operand_entry *> ops_mult;
> +  auto_vec<operand_entry *> ops_others;
> +
> +  FOR_EACH_VEC_ELT (*ops, i, oe)
> +    {
> +      if (TREE_CODE (oe->op) == SSA_NAME)
> +        {
> +          gimple *def_stmt = SSA_NAME_DEF_STMT (oe->op);
> +          if (is_gimple_assign (def_stmt)
> +              && gimple_assign_rhs_code (def_stmt) == MULT_EXPR)
> +            ops_mult.safe_push (oe);
> +          else
> +            ops_others.safe_push (oe);
> +        }
> +      else
> +        ops_others.safe_push (oe);
> +    }
> +  /* When ops_mult.length == 2, like the following case,
> +
> +     a * b + c * d + e.
> +
> +     we need to rearrange the ops.
> +
> +     Putting ops that are not defined by mult in front can generate more fmas.  */
> +  if (ops_mult.length () >= 2)
> +    {
> +      /* If all ops are defined with mult, we don't need to rearrange them.  */
> +      if (ops_mult.length () != ops_length)
> +        {
> +          /* Put no-mult ops and mult ops alternately at the end of the
> +             queue, which is conducive to generating more fma and reducing the
> +             loss of FMA when breaking the chain.  */
> +          ops->truncate (0);
> +          ops->splice (ops_mult);
> +          int j, opindex = ops->length ();
> +          int others_length = ops_others.length();
> +          for (j = 0; j < others_length; j++)
> +            {
> +              oe = ops_others.pop ();
> +              ops->safe_insert (opindex, oe);
> +              if (opindex > 0)
> +                opindex--;
> +            }
> +        }
> +      return true;
> +    }
> +  return false;
> +}
>  /* Reassociate expressions in basic block BB and its post-dominator as
>     children.
>
> @@ -6813,6 +6888,7 @@ reassociate_bb (basic_block bb)
>                    machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
>                    int ops_num = ops.length ();
>                    int width;
> +                  bool has_fma = false;
>
>                    /* For binary bit operations, if there are at least 3
>                       operands and the last operand in OPS is a constant,
> @@ -6821,11 +6897,23 @@ reassociate_bb (basic_block bb)
>                       often match a canonical bit test when we get to RTL.  */
>                    if (ops.length () > 2
>                        && (rhs_code == BIT_AND_EXPR
> -                          || rhs_code == BIT_IOR_EXPR
> -                          || rhs_code == BIT_XOR_EXPR)
> +                          || rhs_code == BIT_IOR_EXPR
> +                          || rhs_code == BIT_XOR_EXPR)
>                        && TREE_CODE (ops.last ()->op) == INTEGER_CST)
>                      std::swap (*ops[0], *ops[ops_num - 1]);
>
> +                  optimization_type opt_type = bb_optimization_type (bb);
> +
> +                  /* If the target supports FMA, rank_ops_for_fma will detect if
> +                     the chain has fmas and rearrange the ops if so.  */
> +                  if (direct_internal_fn_supported_p (IFN_FMA,
> +                                                      TREE_TYPE (lhs),
> +                                                      opt_type)
> +                      && (rhs_code == PLUS_EXPR || rhs_code == MINUS_EXPR))
> +                    {
> +                      has_fma = rank_ops_for_fma(&ops);
> +                    }
> +
>                    /* Only rewrite the expression tree to parallel in the
>                       last reassoc pass to avoid useless work back-and-forth
>                       with initial linearization.  */
> @@ -6839,22 +6927,24 @@ reassociate_bb (basic_block bb)
>                                   "Width = %d was chosen for reassociation\n",
>                                   width);
>                        rewrite_expr_tree_parallel (as_a <gassign *> (stmt),
> -                                                  width, ops);
> +                                                  width,
> +                                                  has_fma,
> +                                                  ops);
>                      }
>                    else
> -                    {
> -                      /* When there are three operands left, we want
> -                         to make sure the ones that get the double
> -                         binary op are chosen wisely.  */
> -                      int len = ops.length ();
> -                      if (len >= 3)
> +                    {
> +                      /* When there are three operands left, we want
> +                         to make sure the ones that get the double
> +                         binary op are chosen wisely.  */
> +                      int len = ops.length ();
> +                      if (len >= 3 && !has_fma)
>                          swap_ops_for_binary_stmt (ops, len - 3);
>
>                        new_lhs = rewrite_expr_tree (stmt, rhs_code, 0, ops,
>                                                     powi_result != NULL
>                                                     || negate_result,
>                                                     len != orig_len);
> -                    }
> +                    }
>
>                    /* If we combined some repeated factors into a
>                       __builtin_powi call, multiply that result by the
> --
> 2.25.1
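
The statement order produced by the rewritten rewrite_expr_tree_parallel can also be traced with a small standalone model. This is a rough sketch under several assumptions, not the GCC implementation: operands are plain strings, `build_parallel_chains` and the `s<N>` result names are invented for this note, the swap_ops_for_binary_stmt call and the stmt_to_insert handling are left out, and the final statement is simply emitted rather than updated in place. It also assumes at least 2 * width operands, which is what the gcc_assert in the seeding loop requires.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Model of the chain construction: returns the statements in the order
// they are built, each rendered as "sN = lhs + rhs".
std::vector<std::string>
build_parallel_chains (const std::vector<std::string> &ops, int width)
{
  int op_num = static_cast<int> (ops.size ());
  int stmt_num = op_num - 1;
  // Assumption of this sketch: enough operands to seed every chain.
  assert (width >= 1 && op_num >= 2 * width);

  std::vector<std::string> stmts;
  int op_index = op_num - 1;
  int width_count = width;
  auto name = [] (int i) { return "s" + std::to_string (i); };

  // Seed one chain per unit of width, taking two operands from the back.
  for (int i = 0; i < width; i++)
    {
      std::string rhs2 = ops[op_index--];
      std::string rhs1 = ops[op_index--];
      stmts.push_back (name (i) + " = " + rhs1 + " + " + rhs2);
    }

  for (int i = width; i < stmt_num; i++)
    {
      if (op_index >= 0)
        // Attach the next operand to the chain seeded `width` steps earlier.
        stmts.push_back (name (i) + " = " + name (i - width) + " + "
                         + ops[op_index--]);
      else if (width_count == 2)
        // Last statement: join the two remaining chain results.
        stmts.push_back (name (i) + " = " + name (i - 1) + " + "
                         + name (i - 2));
      else
        {
          // More than two chains remain: merge two of them first.
          stmts.push_back (name (i) + " = " + name (i - width_count) + " + "
                           + name (i - width_count + 1));
          width_count--;
        }
    }
  return stmts;
}
```

Called with the operands c*d, g, a*b, e, f (an order picked here so that the result mirrors the example in the patch comment) and width 2, the model produces s0 = e + f, s1 = g + a*b, s2 = s0 + c*d, s3 = s2 + s1, which has the same shape as ssa1 to ssa4 in that comment.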