From: Hao Liu OS
To: Richard Sandiford
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
Date: Tue, 25 Jul 2023 09:10:31 +0000

Hi,

Thanks for the suggestion. I tested it and found a gcc_assert failure:
  gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)

It is caused by an empty STMT_VINFO_REDUC_DEF. So, I added an extra check before
checking single_defuse_cycle. The updated patch is below. Is it OK for trunk?

---

The new costs should only count reduction latency by multiplying count for
single_defuse_cycle.
For other situations, this increases the reduction latency a lot and causes
missed vectorization opportunities.

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (count_ops): Only '* count' for
	single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625_1.c: New testcase.
	* gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 | 13 ++++--
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..478a4e00110 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-	 that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      if (STMT_VINFO_REDUC_DEF (stmt_info)
+	  && STMT_VINFO_FORCE_SINGLE_CYCLE (
+	       info_for_reduction (m_vinfo, stmt_info)))
+	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+	   and then accumulate that, but at the moment the loop-carried
+	   dependency includes all copies.  */
+	ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      else
+	ops->reduction_latency = MAX (ops->reduction_latency, base);
     }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+      missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+      missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 00000000000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
-- 
2.34.1


________________________________________
From: Richard Sandiford
Sent: Monday, July 24, 2023 19:10
To: Hao Liu OS
Cc: GCC-patches@gcc.gnu.org
Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Hao Liu OS writes:
> This only affects the new costs in aarch64 backend. Currently, the reduction
> latency of vector body is too large as it is multiplied by stmt count. As the
> scalar reduction latency is small, the new costs model may think "scalar code
> would issue more quickly" and increase the vector body cost a lot, which will
> miss vectorization opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
>	PR target/110625
>	* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
>	for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
>	* gcc.target/aarch64/pr110625.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc               |  5 +--
>  gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
>  2 files changed, 47 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
> 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -	 that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);

The ??? is referring to the single_defuse_cycle code in
vectorizable_reduction.  E.g.
consider:

  long
  f (long res, short *ptr1, short *ptr2, int n) {
    for (int i = 0; i < n; ++i)
      res += (long) ptr1[i] << ptr2[i];
    return res;
  }

compiled at -O3.  The main loop is:

        movi    v25.4s, 0                 // init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d    // v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d    // v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d    // v30 -> v30
        add     v25.2d, v31.2d, v30.2d    // v30 -> v25
        cmp     x4, x5
        bne     .L4

Here count is 4 and the latency is 4 additions (8 cycles).  So as a
worst case, the current cost is correct.

To remove the count in all cases, we would need to

(1) avoid single def-use cycles or

(2) reassociate the reduction as a tree, (4->2, 2->1, 1+acc->acc)

But looking again, it seems we do have the information to distinguish
the cases.  We can do something like:

  stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
  if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
    /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
       and then accumulate that, but at the moment the loop-carried
       dependency includes all copies.  */
    ops->reduction_latency = MAX (ops->reduction_latency, base * count);
  else
    ops->reduction_latency = MAX (ops->reduction_latency, base);

(completely untested).

Thanks,
Richard
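
For illustration of point (2) above, here is a minimal scalar sketch of the
two reduction shapes being compared.  The function names are hypothetical and
the code is not part of the patch; the two variants compute the same result
only when reassociation is permitted (e.g. under -Ofast).

  /* Single def-use cycle: every copy adds into the same accumulator, so
     the loop-carried dependency is COUNT additions long per vector
     iteration (the case count_ops charges base * count for).  */
  static double
  reduce_chain (const double *copies, unsigned int count, double acc)
  {
    for (unsigned int i = 0; i < count; i++)
      acc += copies[i];
    return acc;
  }

  /* Tree reassociation for count == 4 (4->2, 2->1, 1+acc->acc): the
     copies are combined pairwise and only the final addition involves
     the accumulator, so the loop-carried latency is one addition.  */
  static double
  reduce_tree4 (const double copies[4], double acc)
  {
    double t0 = copies[0] + copies[1];  /* independent of acc */
    double t1 = copies[2] + copies[3];  /* independent of acc */
    return acc + (t0 + t1);             /* only this feeds the next iteration */
  }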