From: Hao Liu OS
To: Richard Sandiford
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Tue, 25 Jul 2023 09:10:31 +0000

Hi,

Thanks for the suggestion. I tested it and found a gcc_assert failure:
  gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)

It is caused by an empty STMT_VINFO_REDUC_DEF. So, I added an extra check before
checking single_defuse_cycle. The updated patch is below. Is it OK for trunk?

---

The new costs should only count reduction latency by multiplying count for
single_defuse_cycle.
For other situations, this increases the reduction latency a lot and causes
missed vectorization opportunities.

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (count_ops): Only '* count' for
	single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625_1.c: New testcase.
	* gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 | 13 ++++--
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..478a4e00110 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-	 that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      if (STMT_VINFO_REDUC_DEF (stmt_info)
+	  && STMT_VINFO_FORCE_SINGLE_CYCLE (
+	       info_for_reduction (m_vinfo, stmt_info)))
+	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+	   and then accumulate that, but at the moment the loop-carried
+	   dependency includes all copies.  */
+	ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      else
+	ops->reduction_latency = MAX (ops->reduction_latency, base);
     }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+      missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+      missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 00000000000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
-- 
2.34.1


________________________________________
From: Richard Sandiford
Sent: Monday, July 24, 2023 19:10
To: Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hao Liu OS writes:
> This only affects the new costs in aarch64 backend. Currently, the reduction
> latency of vector body is too large as it is multiplied by stmt count. As the
> scalar reduction latency is small, the new costs model may think "scalar code
> would issue more quickly" and increase the vector body cost a lot, which will
> miss vectorization opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>	PR target/110625
>	* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
>	for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.target/aarch64/pr110625.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc               |  5 +--
>  gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
>  2 files changed, 47 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
> 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -	 that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);

The ??? is referring to the single_defuse_cycle code in
vectorizable_reduction.  E.g.
consider:

  long
  f (long res, short *ptr1, short *ptr2, int n) {
    for (int i = 0; i < n; ++i)
      res += (long) ptr1[i] << ptr2[i];
    return res;
  }

compiled at -O3.  The main loop is:

        movi    v25.4s, 0                 // init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d    // v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d    // v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d    // v30 -> v30
        add     v25.2d, v31.2d, v30.2d    // v30 -> v25
        cmp     x4, x5
        bne     .L4

Here count is 4 and the latency is 4 additions (8 cycles).  So as a
worst case, the current cost is correct.

To remove the count in all cases, we would need to

(1) avoid single def-use cycles or

(2) reassociate the reduction as a tree, (4->2, 2->1, 1+acc->acc)

But looking again, it seems we do have the information to distinguish
the cases.  We can do something like:

  stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
  if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
    /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
       and then accumulate that, but at the moment the loop-carried
       dependency includes all copies.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
    ops->reduction_latency = MAX (ops->reduction_latency, base);

(completely untested).

Thanks,
Richard
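
For illustration of point (2) above, here is a minimal scalar sketch of the
two reduction shapes being compared.  The function names are hypothetical and
the code is not part of the patch; the two variants compute the same result
only when reassociation is permitted (e.g. under -Ofast).

  /* Single def-use cycle: every copy adds into the same accumulator, so
     the loop-carried dependency is COUNT additions long per vector
     iteration (the case count_ops charges base * count for).  */
  static double
  reduce_chain (const double *copies, unsigned int count, double acc)
  {
    for (unsigned int i = 0; i < count; i++)
      acc += copies[i];
    return acc;
  }

  /* Tree reassociation for count == 4 (4->2, 2->1, 1+acc->acc): the
     copies are combined pairwise and only the final addition involves
     the accumulator, so the loop-carried latency is one addition.  */
  static double
  reduce_tree4 (const double copies[4], double acc)
  {
    double t0 = copies[0] + copies[1];  /* independent of acc */
    double t1 = copies[2] + copies[3];  /* independent of acc */
    return acc + (t0 + t1);             /* only this feeds the next iteration */
  }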