From: Hao Liu OS
To: Richard Sandiford
CC: "GCC-patches@gcc.gnu.org"
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Wed, 26 Jul 2023 02:01:34 +0000

> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.

Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the gimple of "gcc.target/aarch64/sve/cost_model_13.c":

  :
  # res_18 = PHI
  # i_20 = PHI
  _1 = (long unsigned int) i_20;
  _2 = _1 * 2;
  _3 = x_14(D) + _2;
  _4 = *_3;
  _5 = (unsigned short) _4;
  res.0_6 = (unsigned short) res_18;
  _7 = _5 + res.0_6;                      <-- The current stmt_info
  res_15 = (short int) _7;
  i_16 = i_20 + 1;
  if (n_11(D) > i_16)
    goto ;
  else
    goto ;

  :
  goto ;

It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI "?

The status here is:
  STMT_VINFO_REDUC_IDX (stmt_info): 1
  STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
  STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0

Thanks,
Hao

________________________________________
From: Richard Sandiford
Sent: Tuesday, July 25, 2023 17:44
To: Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hao Liu OS writes:
> Hi,
>
> Thanks for the suggestion.
> I tested it and found a gcc_assert failure:
>     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
>
> It is caused by empty STMT_VINFO_REDUC_DEF.

When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.

Thanks,
Richard

So, I added an extra check before checking single_defuse_cycle.  The updated patch is below.  Is it OK for trunk?
>
> ---
>
> The new costs should only count reduction latency by multiplying count for
> single_defuse_cycle.  For other situations, this will increase the reduction
> latency a lot and miss vectorization opportunities.
>
> Tested on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>     PR target/110625
>     * config/aarch64/aarch64.cc (count_ops): Only '* count' for
>     single_defuse_cycle while counting reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>     * gcc.target/aarch64/pr110625_1.c: New testcase.
>     * gcc.target/aarch64/pr110625_2.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
>  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
>  3 files changed, 69 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..478a4e00110 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>          = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -         that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      if (STMT_VINFO_REDUC_DEF (stmt_info)
> +          && STMT_VINFO_FORCE_SINGLE_CYCLE (
> +            info_for_reduction (m_vinfo, stmt_info)))
> +        /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
> +           and then accumulate that, but at the moment the loop-carried
> +           dependency includes all copies.  */
> +        ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      else
> +        ops->reduction_latency = MAX (ops->reduction_latency, base);
>      }
>
>    /* Assume that multiply-adds will become a single operation.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> new file mode 100644
> index 00000000000..0965cac33a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
> +
> +/* Do not increase the vector body cost due to the incorrect reduction latency
> +    Original vector body cost = 51
> +    Scalar issue estimate:
> +      ...
> +      reduction latency = 2
> +      estimated min cycles per iteration = 2.000000
> +      estimated cycles per vector iteration (for VF 2) = 4.000000
> +    Vector issue estimate:
> +      ...
> +      reduction latency = 8      <-- Too large
> +      estimated min cycles per iteration = 8.000000
> +    Increasing body cost to 102 because scalar code would issue more quickly
> +      ...
> +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
> +    missed:  not vectorized: vectorization not profitable.  */
> +
> +typedef struct
> +{
> +  unsigned short m1, m2, m3, m4;
> +} the_struct_t;
> +typedef struct
> +{
> +  double m1, m2, m3, m4, m5;
> +} the_struct2_t;
> +
> +double
> +bar (the_struct2_t *);
> +
> +double
> +foo (double *k, unsigned int n, the_struct_t *the_struct)
> +{
> +  unsigned int u;
> +  the_struct2_t result;
> +  for (u = 0; u < n; u++, k--)
> +    {
> +      result.m1 += (*k) * the_struct[u].m1;
> +      result.m2 += (*k) * the_struct[u].m2;
> +      result.m3 += (*k) * the_struct[u].m3;
> +      result.m4 += (*k) * the_struct[u].m4;
> +    }
> +  return bar (&result);
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> new file mode 100644
> index 00000000000..7a84aa8355e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
> +
> +/* The reduction latency should be multiplied by the count for
> +   single_defuse_cycle.  */
> +
> +long
> +f (long res, short *ptr1, short *ptr2, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res += (long) ptr1[i] << ptr2[i];
> +  return res;
> +}
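
For anyone skimming the diff, the rule the count_ops hunk introduces can be sketched in isolation.  This is only a standalone C++ model, not GCC code: ReductionFlags, updated_reduction_latency, check_examples and the numbers used are hypothetical, made up for illustration.  The two booleans merely stand in for the STMT_VINFO_REDUC_DEF and STMT_VINFO_FORCE_SINGLE_CYCLE checks in the patch.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the two conditions the patch tests.
struct ReductionFlags
{
  bool has_reduc_def;       // models STMT_VINFO_REDUC_DEF (stmt_info) being non-null
  bool force_single_cycle;  // models STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info)
};

// Model of the patched update of ops->reduction_latency: the base latency
// is multiplied by COUNT only for a single def-use cycle; otherwise the
// base latency is used unscaled.
unsigned int
updated_reduction_latency (unsigned int current, unsigned int base,
                           unsigned int count, const ReductionFlags &flags)
{
  bool single_defuse_cycle = flags.has_reduc_def && flags.force_single_cycle;
  unsigned int latency = single_defuse_cycle ? base * count : base;
  return std::max (current, latency);
}

// Illustrative values only; they are not taken from a real vect dump.
void
check_examples ()
{
  assert (updated_reduction_latency (0, 2, 4, {true, true}) == 8);
  assert (updated_reduction_latency (0, 2, 4, {true, false}) == 2);
  assert (updated_reduction_latency (0, 2, 4, {false, false}) == 2);
  assert (updated_reduction_latency (5, 2, 4, {true, false}) == 5);
}
```

In this model the has_reduc_def check comes first, mirroring the patch, where STMT_VINFO_REDUC_DEF is tested before info_for_reduction is called so that the assert reported above is not hit.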